
I am comparing the performance of counting how many lines a file contains.

I first did it using the wc command-line tool:

$ time wc -l bigFile.csv
1673820 bigFile.csv

real    0m0.157s
user    0m0.124s
sys     0m0.062s

and then in a clean image of the latest Pharo Core Smalltalk (1.3):

| file lineCount |
Smalltalk garbageCollect.
(Duration milliSeconds: [
    file := FileStream readOnlyFileNamed: 'bigFile.csv'.
    lineCount := 0.
    [ file atEnd ] whileFalse: [
        file nextLine.
        lineCount := lineCount + 1 ].
    file close.
    lineCount ] timeToRun) asSeconds.
15

How can I speed up the Smalltalk code to be faster than, or at least closer to, the wc performance?

2 Answers

[ (PipeableOSProcess waitForCommand: 'wc -l /path/to/bigfile2.csv') output ] timeToRun.

The above reports ~207 milliseconds, while time reported:

real    0m0.160s
user    0m0.131s
sys     0m0.029s

I'm kidding, but also serious. No need to reinvent the wheel. FFI, OSProcess, Zinc, etc. provide ample opportunity to utilize things like UNIX utilities that have been battle-tested over decades.
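If you go that route, here is a minimal sketch of pulling the count back out of wc's output as a number (assuming OSProcess is loaded; the parsing is ad hoc):

| output |
output := (PipeableOSProcess waitForCommand: 'wc -l bigFile.csv') output.
"wc prints something like '  1673820 bigFile.csv'; the first
 whitespace-separated token is the count"
output substrings first asNumber.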

If your question was really more about Smalltalk itself, a start would be:

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | | count |
        count := 0.
        file binary.
        "count LF bytes (ASCII 10) directly instead of materializing lines"
        file contents do: [ :c | c = 10 ifTrue: [ count := count + 1 ] ].
        count ]
] timeToRun.

That will get you down to 2.5 seconds:

  • making the stream binary saved ~10 seconds
  • readOnlyFileNamed:do: saved ~1 second
  • finding the line endings manually instead of using #nextLine saved ~4 seconds

A cleaner, but roughly half a second slower, option would be:

file contents occurrencesOf: 10.

Of course, if better performance is needed, and you don't want to use FFI/OSProcess, you would then write a plugin.

  • I bet the largest savings in your code do not come from making the file binary, but from reading the whole file into memory before processing using "contents". Reading the file in reasonably sized chunks should fare about the same (see the chunked sketch after these comments).
    – codefrau
    Commented Nov 8, 2011 at 9:51
  • I double-checked... #binary actually saved 10 seconds vs. calling #asciiValue or comparing to "Character lf" (even if cached in a temp). #contents saved 3.5 seconds vs. a manual loop with #next.
    Commented Nov 8, 2011 at 15:19
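
Following up on codefrau's suggestion, a chunked variant might look like this (the 64 KB buffer size is an arbitrary choice, and this is a sketch, not a benchmarked implementation):

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | | count chunk |
        count := 0.
        file binary.
        [ file atEnd ] whileFalse: [
            "read up to 64 KB at a time instead of the whole file at once"
            chunk := file next: 65536.
            chunk do: [ :byte | byte = 10 ifTrue: [ count := count + 1 ] ] ].
        count ]
] timeToRun.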

If you can afford reading the whole file into memory, then the simplest code is:

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | file contents lineCount ]
] timeToRun.

This will handle the whole zoo of line endings: LF (Linux), CR (old Mac), and CR-LF (you name it). The code from Sean only handles LF, at approximately the same cost. I'd say a factor of 10 for Smalltalk vs. C is expected for such basic operations, so I doubt you'd get much more efficiency without adding your own primitives.
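
For illustration, here is a rough sketch of the kind of scan a terminator-aware counter has to do. This is not the actual #lineCount source, just an approximation that counts line endings in a small sample string:

| s count i ch |
s := 'line1', (String with: Character cr),
     'line2', (String with: Character cr with: Character lf),
     'line3', (String with: Character lf).
count := 0.
i := 1.
[ i <= s size ] whileTrue: [
    ch := s at: i.
    ch = Character cr
        ifTrue: [
            count := count + 1.
            "a CR immediately followed by LF counts as one CR-LF ending"
            (i < s size and: [ (s at: i + 1) = Character lf ])
                ifTrue: [ i := i + 1 ] ]
        ifFalse: [ ch = Character lf ifTrue: [ count := count + 1 ] ].
    i := i + 1 ].
count "=> 3"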
