1

I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code I'm using is below, it basically parses the data into a format I can work with. The dataset has 440MB with 10M rows. When I run it with runhaskell, it uses up all the ram in a short time (~3-4gb) and then crashes. If I compile it with -O2 and run it, it goes to 100% cpu and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 cpu with 4GB ram and an SSD, so there should be plenty of resources. How can I improve the performance of this?

{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)

filename :: FilePath
filename = "sample.txt"

readTextMaybe = readMaybe . unpack

data Classification = Classification
    { id :: Int, description :: Text
    } deriving (Show)

data Point = Point
    { x :: Int, y :: Int, z :: Int, classification :: Classification
    } deriving (Show)

type PointCloud = [Point]

maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
    where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
          toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
          toMaybePoint _ _ _ _ _                                   = Nothing
          parse :: [Text] -> Maybe Point
          parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
          parse _                     = Nothing

readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines

main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)
5
  • Start by not using List, convert that List into a Vector. Also, test memory usage with binaries you know you have optimization turned on for.
    – bitemyapp
    Commented Feb 28, 2015 at 10:53
  • @bitemyapp as I've said above, I've tried with -O2 and while it doesn't fill up the memory, it's still super slow. Commented Feb 28, 2015 at 10:54
  • Convert the List into a Vector.
    – bitemyapp
    Commented Feb 28, 2015 at 10:59
  • 2
    List should be Vector. Point should be { x :: !Int,, y :: !Int, ... }. Read performance techniques in RWH ch25 for the basic techniques you should start with for appropriate data type representations. Commented Feb 28, 2015 at 11:04
  • 2
    Use strict Text and either pipes or conduit for streaming. If you've not used either before then try both and use whichever you prefer. My guess is you might have an easier time getting started with conduit but that's only a guess.
    – GarethR
    Commented Feb 28, 2015 at 11:28

1 Answer 1

3

The reason this uses all your memory when compiled without optimization is most likely because sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:

sum' :: Num n => [n] -> n
sum' = foldl' (+) 0

The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).

Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.

Not the answer you're looking for? Browse other questions tagged or ask your own question.