Frequency of millions of data values in a CSV column

Question

I have a long list of numbers (a single column with 5 million rows), and the values are not all unique. I want to find the thousand values that occur most frequently in the list. Any ideas on how I could do this easily? I could use Excel, a Python script, or other means.

Python could do this quickly, but please share a sample of the data (I assume it is a CSV file). – Anton vBR, Commented Jul 14, 2018 at 22:05
Read each line. Use a dict to count occurrences. Sort by count. – Tom Zych, Commented Jul 14, 2018 at 22:09

rsaxvc · Accepted Answer · 2018-07-15 20:47:51Z

6

In Bash:

sort filename | uniq -c | sort -nr
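
For the "top thousand" part of the question, appending `| head -n 1000` to the pipeline above keeps only the first thousand lines of the sorted output. On systems without these tools, roughly the same counting can be sketched in Python. The function name `top_lines` and the sample data are illustrative and not part of the original answer, and ties may come out in a different order than with `sort`.

```python
from collections import Counter


def top_lines(lines, n=1000):
    """Count identical lines and return the n most frequent as (line, count) pairs."""
    # Strip the line ending so "7\n" and "7" count as the same value.
    counts = Counter(line.rstrip("\r\n") for line in lines)
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample; for a real file, pass an open file object instead,
    # e.g. top_lines(open("filename"), n=1000).
    sample = ["7\n", "3\n", "7\n", "5\n", "3\n", "7\n"]
    for value, count in top_lines(sample, n=2):
        print(count, value)
```

Because a file object yields one line at a time, only the distinct values and their counts are held in memory, not all 5 million rows.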

edited Jul 15, 2018 at 20:47

answered Jul 14, 2018 at 22:17


1

Or just sort < filename | etc.
– Tom Zych
Commented Jul 15, 2018 at 16:22


jpp · Accepted Answer · 2018-07-14 22:19:00Z

2

Here's one way with Python using csv.reader and collections.Counter:

import csv
from collections import Counter
from itertools import chain
from io import StringIO

mystr = StringIO("""1
2
3
3
1
1""")

# replace mystr with open('file.csv', 'r', newline='') to read a real file
with mystr as fin:
    # define lazy reader object over the opened file
    reader = csv.reader(fin)
    # flatten rows, convert each cell to int, feed to Counter object
    c = Counter(map(int, chain.from_iterable(reader)))

# calculate 2 most common items, return number and counts
print(c.most_common(2))

[(1, 3), (3, 2)]

answered Jul 14, 2018 at 22:19


Thanks for the explicit example. I'm a little confused though: where are you inputting the CSV file that's to be sorted/counted?
– user2047228
Commented Jul 18, 2018 at 0:56
Thanks again, and apologies for the oversight. I used the above code and encountered this error. Any idea what it means? Is it saying that there's a data element that is not an integer? ValueError: invalid literal for int() with base 10: '\x1a'
– user2047228
Commented Jul 20, 2018 at 2:01
Argh. Sorry for all the messages, but I figured out that the error occurred because Python can't parse the EOF character as an integer, so I just deleted that row. Now the problem is that the numbers the script writes to the output CSV file are all in scientific notation. How can I make the script write each integer in standard notation instead? For example, the first row of output I am currently getting is of the form: 2.202384070000000000e+08,3.700000000000000000e+01. Thanks again.
– user2047228
Commented Jul 20, 2018 at 2:21
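
A minimal sketch addressing the two problems raised in these comments, under stated assumptions. Cells that are not plain integers (such as the '\x1a' end-of-file marker) are skipped instead of being deleted by hand, and the results are written with `csv.writer`, which writes Python ints in ordinary notation. The names `count_column` and `write_counts`, the header row, and the choice to read only the first column are assumptions made for illustration. The quoted output resembles the default float format of `numpy.savetxt` (`'%.18e'`); if that is how the file was written, which is only a guess, passing `fmt='%d'` would be another option.

```python
import csv
import io
from collections import Counter


def count_column(rows, n=1000):
    """Return the n most common integers found in the first column of rows."""
    counts = Counter()
    for row in rows:
        if not row:
            continue  # csv.reader yields an empty list for a blank line
        try:
            counts[int(row[0].strip())] += 1
        except ValueError:
            # Not an integer (a header, or a stray '\x1a' marker): skip it.
            continue
    return counts.most_common(n)


def write_counts(pairs, out_path):
    """Write (value, count) pairs to out_path as plain integers."""
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["value", "count"])  # header name is an assumption
        writer.writerows(pairs)


if __name__ == "__main__":
    # Hypothetical sample ending in the EOF marker from the comment above.
    sample = io.StringIO("220238407\n220238407\n5\n\x1a\n")
    print(count_column(csv.reader(sample), n=2))
```

For a real file, `csv.reader(open('file.csv', newline=''))` would take the place of the in-memory sample.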


rsaxvc · Accepted Answer · 2018-07-20 01:28:26Z

2

Tom's approach in Python:

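import operator
import sys

# map each distinct line to the number of times it occurs
d = dict()

for filename in sys.argv[1:]:
    # read line by line instead of loading the whole file at once
    with open(filename, 'r') as file:
        for line in file:
            line = line.rstrip('\n')
            if line not in d:
                d[line] = 1
            else:
                d[line] += 1

print("Item,Count")
# sorted ascending by count, so the most frequent items print last;
# pass reverse=True to sorted() to list them first
for line in sorted(d.items(), key=operator.itemgetter(1)):
    print(line[0] + "," + str(line[1]))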

Usage:

python linesorter.py filename1.txt filename2.txt filename_...

edited Jul 20, 2018 at 1:28

answered Jul 14, 2018 at 22:20


Thanks so much. I tried this but received an error. I placed the script in a file called linesorter.py in the directory that has the CSV file, then opened CMD in Windows and ran this line: C:Users\username\Desktop\hs-2> cat test.csv | python linesorter.py which gave me this error: 'cat' is not recognized as an internal or external command, operable program or batch file. Any advice? Thanks, much appreciated...
– user2047228
Commented Jul 18, 2018 at 0:59
By the way, I don't know if this is the correct way to do it, but for the character between test.csv and python in the command, I used the character that's SHIFT+Backslash.
– user2047228
Commented Jul 18, 2018 at 1:03
@user2047228, you entered everything right; you're just on Windows, so you don't have cat installed by default. I updated the program so it'll open your files directly.
– rsaxvc
Commented Jul 20, 2018 at 1:29
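
As a sketch of how one script could accept both styles of input discussed in these comments, the standard-library `fileinput` module reads the files named on the command line and falls back to standard input when none are given. The helper names `count_lines` and `main` are illustrative and not from the answer. In Windows CMD, `type test.csv | python linesorter.py` plays the role that `cat` does on Unix.

```python
import fileinput
import sys
from collections import Counter


def count_lines(lines):
    """Count how often each line occurs, ignoring the line ending."""
    return Counter(line.rstrip("\r\n") for line in lines)


def main():
    # fileinput.input() iterates over the files named in sys.argv[1:],
    # or over standard input when no file names are given.
    with fileinput.input() as lines:
        counts = count_lines(lines)
    print("Item,Count")
    # most_common() with no argument lists every item, most frequent first.
    for item, count in counts.most_common():
        print(f"{item},{count}")


if __name__ == "__main__":
    # Run only when there is something to read: file arguments or piped input.
    if len(sys.argv) > 1 or not sys.stdin.isatty():
        main()
    else:
        print("usage: python linesorter.py FILE [FILE ...] (or pipe data on stdin)")
```

Unlike the script above, this variant prints the most frequent items first.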
