0

I have a long list of numbers (a single column that has 5 million rows), that are not all unique from one another. I want to see which thousand of them are the most frequent occurrences in the list. Any ideas on how I could do this easily? I could use excel or a python script or other means too.

2
  • Python could do this quickly but share us a sample of the data (I assume it is a CSV-file)
    – Anton vBR
    Commented Jul 14, 2018 at 22:05
  • 2
    Read each line. Use a dict to count occurrences. Sort by count.
    – Tom Zych
    Commented Jul 14, 2018 at 22:09

3 Answers 3

6

In Bash:

sort filename | uniq -c | sort -nr
1
  • 1
    Or just sort < filename | etc.
    – Tom Zych
    Commented Jul 15, 2018 at 16:22
2

Here's one way with Python using csv.reader and collections.Counter:

import csv
from collections import Counter
from itertools import chain
from io import StringIO

mystr = StringIO("""1
2
3
3
1
1""")

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    # define lazy reader object
    reader = csv.reader(mystr)
    # flatten, convert to int, feed to Counter object
    c = Counter(map(int, chain.from_iterable(reader)))

# calculate 2 most common items, return number and counts
print(c.most_common(2))

[(1, 3), (3, 2)]
3
  • thanks for the explicit example. i'm a little confused though; where are inputting the csv file that's to be sorted/counted? Commented Jul 18, 2018 at 0:56
  • Thanks again and apologies for the oversight. I used the above code and encountered this error. Any idea what it means? Is it saying that there's a data element that is not an integer?: ValueError: invalid literal for int() with base 10: '\x1a' Commented Jul 20, 2018 at 2:01
  • argh. sorry for all the messages but i figured out that that error was because python can't read the EOF character so i just deleted that row. but now i am getting the problem that the numbers that the script is writing to the output CSV file are all in scientific notation. how can i make it so that the script doesn't do that and instead writes the integer and just the integer in standard notation to file? like fir example the first row of output i am currently getting is of the form: 2.202384070000000000e+08,3.700000000000000000e+01. Thanks again. Commented Jul 20, 2018 at 2:21
2

Tom's approach in Python:

d = dict()

import sys
for filename in sys.argv[1:]:
    file = open(filename, 'r')
    for line in file.read().splitlines():
        if line not in d:
            d[line] = 1
        else:
            d[line] += 1
    file.close()

import operator
print "Item,Count"
for line in sorted(d.items(), key=operator.itemgetter(1)):
    print line[0] + "," + str( line[1] )

Usage:

python linesorter.py filename1.txt filename2.txt filename_...
3
  • thanks so much. i tried this but received an error. what i did is place the script into a file called linesorter.py in my directory that has the csv file. then i opened CMD in windows and tried this line: C:Users\username\Desktop\hs-2> cat test.csv | python linesorter.py Which gave me this error: 'cat' is not recognized as an internal or external command, operable program or batch file. And advice? Thanks and much appreciated... Commented Jul 18, 2018 at 0:59
  • by the way -- i don't know if this is the correct way to do it but for the character between test.csv and python in the command, i used the character that's SHIFT+Backslash. Commented Jul 18, 2018 at 1:03
  • @user2047228, you entered everything right, you're just on windows so you don't have cat installed by default. I updated the program so it'll open your files directly.
    – rsaxvc
    Commented Jul 20, 2018 at 1:29

Not the answer you're looking for? Browse other questions tagged or ask your own question.