I want to count the A's, T's, C's, G's, N's and "-" characters in a file, or every letter if needed. Is there a quick Unix command to do this?

    Counting bases in DNA strands?
    – Indrek
    Commented Oct 10, 2012 at 11:32
    I love this question, so many different approaches and tools used to solve the same problem.
    – Journeyman Geek
    Commented Oct 10, 2012 at 11:42
    Heh, this is borderline code-golf
    – Earlz
    Commented Oct 10, 2012 at 13:37
    [System.IO.File]::ReadAllText("C:\yourfile.txt").ToCharArray() | Group-Object $_ | Sort Count -Descending
    Commented Oct 10, 2012 at 14:53
    Ok I think I found the pure PS way: Get-Content "C:\eula.3082.txt" | % { $_.ToCharArray() } | Group-Object | Sort Count -Descending
    Commented Oct 10, 2012 at 16:33

18 Answers


If you want some real speed:

echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -xc -; ./a.out < file; rm a.out;

This is an incredibly fast pseudo-one-liner.

A simple test shows that on my Core i7 CPU 870 @ 2.93GHz it counts at just over 600MB/s:

$ du -h bigdna 
1.1G    bigdna

time ./a.out < bigdna 
t: 178977308
a: 178958411
c: 178958823
g: 178947772
n: 178959673
-: 178939837

real    0m1.718s
user    0m1.539s
sys     0m0.171s

Unlike solutions involving sorting, this one runs in constant (4K) memory, which is very useful if your file is far larger than your RAM.
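The constant-memory idea does not depend on C. As a rough sketch in Python (the name `count_bytes` and the 4096-byte chunk size are illustrative choices, not taken from the one-liner above), the same technique reads fixed-size chunks and bumps one slot per byte value in a 256-entry table:

```python
import io

def count_bytes(stream, chunk_size=4096):
    """Count occurrences of every byte value using a fixed-size table."""
    table = [0] * 256                      # one counter per possible byte value
    while True:
        chunk = stream.read(chunk_size)    # at most chunk_size bytes in memory
        if not chunk:                      # empty bytes object means end of input
            break
        for byte in chunk:                 # iterating over bytes yields ints 0..255
            table[byte] += 1
    return table

# Demo on an in-memory stream; a real run would pass open(path, "rb") instead.
table = count_bytes(io.BytesIO(b"gattaca-\n"))
for letter in b"tacgn-":
    print(f"{chr(letter)}: {table[letter]}")
```

Expect this to be much slower than the compiled version; it only illustrates why memory use stays flat no matter how big the input is.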

And, of course with a little bit of elbow grease, we can shave off 0.7 seconds:

echo 'int cache[256],x,n,buf[4096],*bp,*ep;char letters[]="tacgn-"; int main(){while((n=read(0,buf,sizeof buf))>0){ep=buf+n/4;for(bp=buf;bp<ep;bp++){cache[(*bp)&0xff]++;cache[(*bp>>8)&0xff]++;cache[(*bp>>16)&0xff]++;cache[(*bp>>24)&0xff]++;}for(x=n&~3;x<n;x++)cache[((unsigned char*)buf)[x]]++;}for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -O2 -xc -; ./a.out < file; rm a.out;

This nets just over 1.1GB/s, finishing in:

real    0m0.943s
user    0m0.798s
sys     0m0.134s

For comparison, I tested some of the other solutions on this page which seemed to have some kind of speed promise.

The sed/awk solution made a valiant effort, but died after 30 seconds. With such a simple regex, I expect this to be a bug in sed (GNU sed version 4.2.1):

$ time sed 's/./&\n/g' bigdna | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}' 
sed: couldn't re-allocate memory

real    0m31.326s
user    0m21.696s
sys     0m2.111s

The perl method seemed promising as well, but I gave up after running it for 7 minutes:

time perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c' < bigdna 

real    7m44.161s
user    4m53.941s
sys     2m35.593s
    +1 For a sane solution when it's lots of data, and not just a handful of bytes. The files are in the disk cache though, aren't they?
    – Daniel Beck
    Commented Oct 10, 2012 at 18:24
    The neat thing is that it has a complexity of O(N) in processing and O(1) in memory. The pipes usually have O(N log N) in processing (or even O(N^2)) and O(N) in memory.
    Commented Oct 10, 2012 at 19:54
    You are stretching the definition of "command line" quite a bit, though.
    – gerrit
    Commented Oct 10, 2012 at 20:42
    Epic bending of the question's requirements -I approve ;p. superuser.com/a/486037/10165 <- someone ran benchmarks, and this is the fastest option.
    – Journeyman Geek
    Commented Oct 11, 2012 at 0:34
    +1 I appreciate me some good use of C in the right places.
    Commented Oct 11, 2012 at 7:06

grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c

Will do the trick as a one liner. A little explanation is needed though.

grep -o foo.text -e A -e T -e C -e G -e N -e - searches the file foo.text for the letters A, T, C, G and N and the character -, with one -e option for each character you want to search for. The -o option makes it print each match on a line of its own.

sort sorts the lines. This sets the stage for the next tool.

uniq -c counts consecutive duplicate lines. Since we now have a sorted list of characters, we get a neat count of each character we grepped out in the first step.

If foo.text contained the string GATTACA-, this is what I'd get from this set of commands:

[geek@atremis ~]$ grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
      1 -
      3 A
      1 C
      1 G
      2 T
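To see why the sort step matters, here is a small Python sketch of the same three stages. The helper names `grep_o` and `uniq_c` are made up for this illustration; they only mimic the behaviour described above and are not how the real tools are implemented.

```python
import re
from itertools import groupby

def grep_o(text, chars="ATCGN-"):
    """Return every matching character in input order (like grep -o)."""
    return re.findall("[" + re.escape(chars) + "]", text)

def uniq_c(lines):
    """Collapse runs of identical adjacent items into (count, item) pairs."""
    return [(len(list(run)), item) for item, run in groupby(lines)]

matches = grep_o("GATTACA-")
print(uniq_c(matches))           # unsorted: A shows up as three separate runs
print(uniq_c(sorted(matches)))   # sorted first: one entry per character
```

Without the sort, uniq only merges neighbours, so a character that appears in several places is reported several times.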
    Bloody unix magic! :D
    – Pitto
    Commented Oct 10, 2012 at 14:30
    if there is only CTAG- characters in your files, the regexp itself becomes pointless, right ? grep -o . | sort | uniq -c would work equally well, afaik.
    – sylvainulg
    Commented Oct 10, 2012 at 14:55
    +1 I've been using grep for 25 years and didn't know about -o.
    – LarsH
    Commented Oct 10, 2012 at 19:28
    @JourneymanGeek: The problem with this is that it generates a lot of data that is then forwarded to sort. It would be cheaper to let a program parse each character. See Dave's answer for a O(1) instead O(N) memory complexity answer.
    Commented Oct 10, 2012 at 19:52
    @Pitto Native Windows builds of coreutils are widely available - just ask Google or somesuch
    – OrangeDog
    Commented Oct 10, 2012 at 20:08

Try this one, inspired by @Journeyman's answer.

grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c

The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:

A|T|C|G|N|-

This means "match A or T or C or G or N or -". The manual describes various regular expression syntax you can use.

Now we have output that looks something like this:

$ grep -o -E 'A|T|C|G|N|-' foo.txt 

Our last step is to merge and count all the similar lines, which can simply be accomplished with a sort | uniq -c, as in @Journeyman's answer. The sort gives us output like this:

$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort

Which, when piped through uniq -c, finally resembles what we want:

$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
      2 -
      3 A
      1 C
      1 G
      4 N
      1 T

Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There are lots of different things you can count with only slight modifications to this approach.
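For what it's worth, the grand total from the addendum can be sketched in Python with the same alternation pattern. The function name `total_matches` is just an illustrative choice:

```python
import re

# Same disjunctive pattern as in the grep command.
PATTERN = re.compile(r"A|T|C|G|N|-")

def total_matches(text):
    """Count every match of the pattern, like grep -o ... | wc -l."""
    return sum(1 for _ in PATTERN.finditer(text))

print(total_matches("GATTACA-\nsome other text\n"))
```

Each match corresponds to one output line of grep -o, so counting matches gives the same figure as counting lines.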

  I really need to delve into the rabbitholes that are coreutils and regex. This is somewhat more elegant than mine for it ;p
    – Journeyman Geek
    Commented Oct 10, 2012 at 14:36
    @JourneymanGeek: Learning regex is well worth the trouble, since it's useful for so many things. Just understand its limitations, and don't abuse the power by attempting to do things outside the scope of regexes' capabilities, like trying to parse XHTML.
    – crazy2be
    Commented Oct 10, 2012 at 15:17
    grep -o '[ATCGN-]' could be a bit more readable here.
    – sylvainulg
    Commented Oct 10, 2012 at 15:45

One liner counting all letters using Python:

$ python -c "import collections, pprint; pprint.pprint(dict(collections.Counter(open('FILENAME_HERE', 'r').read())))"

...producing a YAML friendly output like this:

{'\n': 202,
 ' ': 2153,
 '!': 4,
 '"': 62,
 '#': 12,
 '%': 9,
 "'": 10,
 '(': 84,
 ')': 84,
 '*': 1,
 ',': 39,
 '-': 5,
 '.': 121,
 '/': 12,
 '0': 5,
 '1': 7,
 '2': 1,
 '3': 1,
 ':': 65,
 ';': 3,
 '<': 1,
 '=': 41,
 '>': 12,
 '@': 6,
 'A': 3,
 'B': 2,
 'C': 1,
 'D': 3,
 'E': 25}

It's interesting to see how, most of the time, Python can easily beat even bash in terms of clarity of code.
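If only the DNA characters matter, the same Counter approach can be narrowed down. This is a sketch with an assumed helper name (`count_chars`), not part of the one-liner above:

```python
from collections import Counter

def count_chars(text, wanted=None):
    """Count all characters, or only those listed in `wanted`."""
    counts = Counter(text)
    if wanted is None:
        return dict(counts)
    # Counter returns 0 for characters that never occurred.
    return {ch: counts[ch] for ch in wanted}

print(count_chars("GATTACA-\n", wanted="ATCGN-"))
```

Characters that are absent from the input still appear in the result with a count of 0, which is handy when comparing several files.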


Similar to Guru's awk method:

perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c'

After using UNIX for a couple of years, you get very proficient at linking together a number of small operations to accomplish various filtering and counting tasks. Everyone has their own style: some like awk and sed, some like cut and tr. Here's the way I would do it:

To process a particular filename:

 od -a FILENAME_HERE | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c

or as a filter:

 od -a | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c

It works like this:

  1. od -a separates the file into ASCII characters.
  2. cut -b 9- eliminates the offset prefix that od prints at the start of each line.
  3. tr " " \\n converts the spaces between characters to newlines so there's one character per line.
  4. egrep -v "^$" gets rid of all the extra blank lines this creates.
  5. sort gathers instances of each character together.
  6. uniq -c counts the number of repeats of each line.

I fed it "Hello, world!" followed by a newline and got this:

  1 ,
  1 !
  1 d
  1 e
  1 H
  3 l
  1 nl
  2 o
  1 r
  1 sp
  1 w
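The "nl" and "sp" entries come from od -a printing names for whitespace characters. A rough Python sketch of the same counting, with a deliberately partial name table (`NAMES` and `count_named` are illustrative names, and only three of od's names are covered):

```python
from collections import Counter

# Partial mapping in the spirit of od -a; od knows many more names.
NAMES = {" ": "sp", "\n": "nl", "\t": "ht"}

def count_named(text):
    """Count characters, reporting whitespace under od-style names."""
    return Counter(NAMES.get(ch, ch) for ch in text)

for name, count in sorted(count_named("Hello, world!\n").items()):
    print(count, name)
```

The ordering of the printed lines may differ from the shell output, since sort there follows the locale's collation rules.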

The sed part being based on @Guru’s answer, here’s another approach using uniq, similar to David Schwartz’ solution.

$ cat foo
$ sed 's/\(.\)/\1\n/g' foo | sort | uniq -c
1 a
1 b
1 d
1 f
2 i
1 l
1 n
2 o
1 s
1 u
2 x
    Use [[:alpha:]] rather than . in sed to only match characters and not newlines.
    – Claudius
    Commented Oct 10, 2012 at 11:54
    [[:alpha:]] will fail if you're also trying to match stuff like -, which was mentioned in the question
    – Izkata
    Commented Oct 10, 2012 at 14:58
  Correct. It might be nicer to add a second expression to sed to first filter out everything else and then explicitly match on the desired characters: sed -e 's/[^ATCGN-]//g' -e 's/\([ATCGN-]\)/\1\n/g' foo | sort | uniq -c. However, I don't know how to get rid of the newlines there :\
    – Claudius
    Commented Oct 10, 2012 at 15:09

You can combine grep and wc to do this:

grep -o 'character' file.txt | wc -w

grep searches the given file(s) for the specified text, and the -o option tells it to print only the actual matches (i.e. the characters you were looking for), rather than the default, which is to print each line in which the search text was found.

wc prints the byte, word and line counts for each file, or in this case, the output of the grep command. The -w option tells it to count words, with each word being an occurrence of your search character. Of course, the -l option (which counts lines) would work as well, since grep prints each occurrence of your search character on a separate line.

To do this for a number of characters at once, put the characters in an array and loop over it:

chars=(A T C G N -)
for c in "${chars[@]}"; do echo -n "$c  " && grep -o -e "$c" file.txt | wc -w; done

Example: for a file containing the string TGC-GTCCNATGCGNNTCACANN-, the output would be:

A  3
T  4
C  6
G  4
N  5
-  2

For more information, see man grep and man wc.

The downside of this approach, as user Journeyman Geek notes below in a comment, is that grep has to be run once for each character. Depending on how large your files are, this can incur a noticeable performance hit. On the other hand, when done this way it's a bit easier to quickly see which characters are being searched for, and to add/remove them, as they're on a separate line from the rest of the code.
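The trade-off can be sketched in Python. Both helpers below are hypothetical names for illustration: the first scans the text once per character, like the loop over grep, and the second scans it once in total.

```python
def count_per_char(text, chars):
    """One full scan of the text for each character (like one grep per char)."""
    return {c: text.count(c) for c in chars}

def count_single_pass(text, chars):
    """A single scan of the text that updates all counters at once."""
    counts = dict.fromkeys(chars, 0)
    for ch in text:
        if ch in counts:
            counts[ch] += 1
    return counts

sample = "TGC-GTCCNATGCGNNTCACANN-"
chars = ["A", "T", "C", "G", "N", "-"]
print(count_per_char(sample, chars))
print(count_single_pass(sample, chars))
```

Both give the same counts; the difference is only how many times the input has to be read, which starts to matter with large files or long character lists.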

    they'd need to repeat it per character they want... I'd add. I could swear there's a more elegant solution but it needs more poking ;p
    – Journeyman Geek
    Commented Oct 10, 2012 at 11:27
  @JourneymanGeek Good point. One approach that springs to mind is putting the characters in an array and looping through it. I've updated my post.
    – Indrek
    Commented Oct 10, 2012 at 11:55
  too complex IMO. Just use grep -e a -e t and so on. If you put it in an array and loop through it, wouldn't you have to run through the grep cycle once per character?
    – Journeyman Geek
    Commented Oct 10, 2012 at 11:58
  @JourneymanGeek You're probably right. uniq -c also seems like a better way of getting nicely formatted output. I'm no *nix guru, the above is just what I managed to put together from my limited knowledge and some man pages :)
    – Indrek
    Commented Oct 10, 2012 at 12:04
  So did I ;p, and one of my assignments last term involved sorting through about 5000 address book entries, and uniq made it a LOT easier.
    – Journeyman Geek
    Commented Oct 10, 2012 at 12:06

Using the sequence lines from 22hgp10a.txt, the timing difference between grep and awk on my system makes awk the way to go...

[Edit]: After having seen Dave's compiled solution, forget awk too, as his completed in ~0.1 seconds on this file for full case-sensitive counting.

# A nice large sample file.
wget http://gutenberg.readingroo.ms/etext02/22hgp10a.txt

# Omit the regular text up to the start `>chr22` indicator.
sed -i -e '1,/^>chr22/d' 22hgp10a.txt

sudo test # Just get sudo setup to not ask for password...

# ghostdog74 answered a question <linked below> about character frequency which
# gave me all case sensitive [ACGNTacgnt] counts in ~10 seconds.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' 22hgp10a.txt

# The grep version given by Journeyman Geek took a whopping 3:41.47 minutes
# and yielded the case sensitive [ACGNT] counts.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
grep -o 22hgp10a.txt -e A -e T -e C -e G -e N -e -|sort|uniq -c

The case insensitive version of ghostdog's completed in ~ 14 seconds.

The sed is explained in the accepted answer to this question.
The benchmarking is as in the accepted answer to this question.
The accepted answer by ghostdog74 was to this question.

    – Dave
    Commented Oct 10, 2012 at 19:25

I think any decent implementation avoids sort. But because it's also a bad idea to read everything 4 times, I think one could somehow generate a stream that goes through 4 filters, one for each character, where each character is filtered out in turn and the stream lengths are calculated along the way.

time cat /dev/random | tr -d -C 'AGCTN\-' | head -c16M >dna.txt
real    0m5.797s
user    0m6.816s
sys     0m1.371s

$ time tr -d -C 'AGCTN\-' <dna.txt | tee >(wc -c >tmp0.txt) | tr -d 'A' | 
tee >(wc -c >tmp1.txt) | tr -d 'G' | tee >(wc -c >tmp2.txt) | tr -d 'C' | 
tee >(wc -c >tmp3.txt) | tr -d 'T' | tee >(wc -c >tmp4.txt) | tr -d 'N' | 
tee >(wc -c >tmp5.txt) | tr -d '\-' | wc -c >tmp6.txt && cat tmp[0-6].txt

real    0m0.742s
user    0m0.883s
sys     0m0.866s


The cumulative sums are then in tmp[0-6].txt, so this is still a work in progress.
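Turning those cumulative sums into per-character counts is just a matter of subtracting neighbouring lengths. A sketch in Python, where `simulate_lengths` stands in for the tee/wc chain and both function names are made up for this example:

```python
def simulate_lengths(text, order="AGCTN-"):
    """Mimic the pipeline: record the stream length, then drop one character at a time."""
    kept = [ch for ch in text if ch in order]      # like tr -d -C 'AGCTN\-'
    lengths = [len(kept)]                          # tmp0
    for ch in order:
        kept = [k for k in kept if k != ch]        # like tr -d for this character
        lengths.append(len(kept))                  # tmp1 .. tmp6
    return lengths

def counts_from_lengths(lengths, order="AGCTN-"):
    """Each count is the drop in length caused by deleting that character."""
    return {ch: before - after
            for ch, before, after in zip(order, lengths, lengths[1:])}

lengths = simulate_lengths("GATTACA-NN")
print(lengths)
print(counts_from_lengths(lengths))
```

The order string has to match the order in which the tr -d stages remove characters, otherwise the differences get attributed to the wrong letters.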

There are merely 13 pipes in this approach, which converts to less than 1 MB of memory.
Of course my favourite solution is:

time cat >f.c && gcc -O6 f.c && ./a.out
# then type your favourite c-program
real    0m42.130s
  This is a very nice use of tr.
    – adavid
    Commented Oct 12, 2012 at 10:12
  But... you just need one tee with multiple process substitutions. Trimming down one char at a time may help with speed on low-end systems but you need to compute the end result, and if you have one cpu per tr you probably won't gain anything. My version: tr -dC AGCTN- <dna.txt |tee >(tr -dC A |wc -c |sed s/^/A:/) >(tr -dC G |wc -c |sed s/^/G:/) >(tr -dC C |wc -c |sed s/^/C:/) >(tr -dC T |wc -c |sed s/^/T:/) >(tr -dC N |wc -c |sed s/^/N:/) > >(tr -dC - |wc -c |sed s/^/-:/)
    Commented Sep 17, 2021 at 7:17

I didn't know about uniq or grep -o, but since my comments on @JourneymanGeek's and @crazy2be's answers had such support, maybe I should turn them into an answer of their own:

If you know there are only "good" characters (those you want to count) in your file, you can go for

grep . -o YourFile | sort | uniq -c

If only some characters must be counted and others not (e.g. separators):

grep -o '[ACTGN-]' YourFile | sort | uniq -c

The first one uses the regular expression wildcard ., which matches any single character. The second one uses a set of accepted characters, in no specific order, except that - must come last (A-C is interpreted as "any character between A and C"). Quotes are required in that case so that your shell does not try to expand the brackets as a glob matching single-character file names (and possibly produce a "no match" error if there are none).
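The effect of where the - sits can be checked quickly in Python, whose re module treats a trailing hyphen in a character set the same way. The wrapper name `find_all` is just for this sketch:

```python
import re

def find_all(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)

# Hyphen last: it is a literal character, so "-" matches and "B" does not.
print(find_all(r"[ACTGN-]", "GAT-B"))

# Hyphen between A and C: it forms the range A..C, so "B" matches and "-" does not.
print(find_all(r"[A-CTGN]", "GAT-B"))
```

The second pattern silently counts B and skips the dashes, which is an easy mistake to miss on real data.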

Note that sort also has a -u (--unique) flag so that it only reports things once, but no companion flag to count duplicates, so uniq is indeed mandatory.

  - doesn't have to come last if you escape it with a backslash: '[A\-CTGN]' should work just fine.
    – Indrek
    Commented Oct 11, 2012 at 12:04

A silly one:

tr -cd ATCGN- | iconv -f ascii -t ucs2 | tr '\0' '\n' | sort | uniq -c
  • tr to delete (-d) all characters but (-c) ATCGN-
  • iconv to convert to ucs2 (UTF-16 limited to 2 bytes), which adds a 0 byte after every byte
  • another tr to translate those NUL characters to NL, so that every character is now on its own line
  • sort | uniq -c to count each unique line

That's an alternative to the non-standard (GNU) -o grep option.
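The padding trick is easy to see in Python. This sketch assumes little-endian output with no byte order mark (what iconv actually emits for ucs2 can vary by platform), and the function name `one_per_line` is invented for the example:

```python
def one_per_line(text, wanted="ATCGN-"):
    """Put each wanted character on its own line via 2-byte encoding."""
    kept = "".join(ch for ch in text if ch in wanted)   # like tr -cd ATCGN-
    wide = kept.encode("utf-16-le")        # every ASCII byte is followed by a NUL byte
    lines = wide.replace(b"\x00", b"\n")   # like tr '\0' '\n'
    return lines.decode("ascii").splitlines()

print("GA".encode("utf-16-le"))
print(one_per_line("GATTACA-\n"))
```

With big-endian output the NUL byte would come before each character instead, so the first line would be empty.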

  Could you give a brief explanation of the commands and logic here?
    Commented Oct 10, 2012 at 23:09

time $( { tr -cd ACGTD- < dna.txt | dd | tr -d A | dd | tr -d C | dd | tr -d G |
dd | tr -d T | dd | tr -d D | dd | tr -d - | dd >/dev/null; } 2>tmp ) &&
grep byte < tmp | sort -r -g | awk '{ if ((s-$0)>=0) { print s-$0} s=$0 }'

The output format is not the best...

real    0m0.176s
user    0m0.200s
sys     0m0.160s

Theory of operation:

  • $( { command | command } 2> tmp ) redirects the stderr of the stream to a temporary file.
  • dd outputs stdin to stdout and outputs the number of bytes passed to stderr
  • tr -d filters out one character at a time
  • grep and sort filters the output of dd to descending order
  • awk calculates the difference
  • sort is used only in post-processing stage to handle the uncertainty of exit order of instances of dd

Speed seems to be 60MBps +

  Improvements: get rid of tmp? use 'paste' to print the letter involved?
    Commented Oct 11, 2012 at 9:09

Sample file:

$ cat file


$ sed 's/./&\n/g' file | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
u 2
i 3
x 3
l 1
n 2
a 1
  -1 for lack of clarity, and for posting a one-liner without explanation. AFAIK, this could be a fork bomb
    – PPC
    Commented Oct 10, 2012 at 20:52

Combining a few others

chars='ATCGN-'   # the characters to count; keep - last inside the brackets
grep -o -i "[$chars]" foo|sort | uniq -c

Add | sort -nr to see the results in order of frequency.
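One thing to keep in mind with -i: grep -o prints the matched text unchanged, so upper and lower case still end up as separate lines for uniq. A Python sketch of the combined pipeline, with an assumed helper name `count_ignore_case` and an optional `fold` switch that is my own addition:

```python
import re
from collections import Counter

def count_ignore_case(text, chars="ATCGN-", fold=False):
    """Match case-insensitively and return (char, count) pairs, most frequent first."""
    pattern = re.compile("[" + re.escape(chars) + "]", re.IGNORECASE)
    matches = pattern.findall(text)        # matched text keeps its original case
    if fold:
        matches = [m.upper() for m in matches]   # merge 'a' into 'A', and so on
    return Counter(matches).most_common()  # ordered like sort -nr

print(count_ignore_case("aAaT-x"))
print(count_ignore_case("aAaT-x", fold=True))
```

In the shell, folding case could be done by piping through tr a-z A-Z before sort.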


Short answer:

If circumstances permit, compare file sizes of low character sets to one with no characters to get an offset and just count bytes.

Ah, but the tangled details:

Those are all ASCII characters, one byte each. Files of course have extra metadata prepended for a variety of stuff used by the OS and the app that created them. In most cases I would expect this to take up the same amount of space regardless of the metadata, but I would try to maintain identical circumstances when you first test the approach, and then verify that you have a constant offset before not worrying about it. The other gotcha is that line breaks typically involve two ASCII whitespace characters, and any tabs or spaces would be one each. If you can be certain these will be present and there's no way to know how many beforehand, I'd stop reading now.

It might seem like a lot of constraints but if you can easily establish them, this strikes me as the easiest/best performing approach if you have a ton of these to look at (which seems likely if that's DNA). Checking a ton of files for length and subtracting a constant would be gobs faster than running grep (or similar) on every one.


  • These are simple unbroken strings in pure text files
  • They are in identical file types created by the same vanilla non-formatting text-editor like Scite (pasting is okay as long as you check for spaces/returns) or some basic program somebody wrote

And Two Things That Might Not Matter But I Would Test With First

  • The file names are of equal length
  • The files are in the same directory

Try Finding The Offset By Doing the Following:

Compare an empty file to one with a few easily-human-counted characters to one with a few more characters. If subtracting the empty file from both of the other two files gives you byte counts that match character count, you're done. Check file lengths and subtract that empty amount. If you want to try to figure out multi-line files, most editors attach two special one-byte characters for line breaks since one tends to be ignored by Microsoft but you'd have to at least grep for white-space chars in which case you might as well do it all with grep.


Haskell way:

import Data.Ord
import Data.List
import Control.Arrow

main :: IO ()
main = interact $
  show . sortBy (comparing fst) . map (length &&& head) . group . sort

it works like this:

112123123412345
=> sort
111112222333445
=> group
11111 2222 333 44 5
=> map (length &&& head)
(5,'1') (4,'2') (3,'3') (2,'4') (1,'5')
=> sortBy (comparing fst)
(1,'5') (2,'4') (3,'3') (4,'2') (5,'1')
=> one can add some pretty-printing here

compiling and using:

$ ghc -O2 q.hs
[1 of 1] Compiling Main             ( q.hs, q.o )
Linking q ...
$ echo 112123123412345 | ./q
$ cat path/to/file | ./q

not good for huge files maybe.


Quick perl hack:

perl -nle 'while(/[ATCGN-]/g){$a{$&}+=1};END{for(keys(%a)){print "$_:$a{$_}"}}'
  • -n: Iterate over input lines but don't print anything for them
  • -l: Strip or add line breaks automatically
  • while: iterate over all occurrences of your requested symbols in the current line
  • END: At the end, print results
  • %a: Hash where the values are stored

Characters which don't occur at all won't be included in the result.
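The same hash-based counting, and the reason absent characters go missing, can be sketched in Python. The names `count_matches` and `with_zeros` are hypothetical; the second helper shows one way to report zeros if you need them.

```python
import re

def count_matches(lines, chars="ATCGN-"):
    """Count matching characters line by line into a dict (like the %a hash)."""
    pattern = re.compile("[" + re.escape(chars) + "]")
    counts = {}
    for line in lines:
        for match in pattern.finditer(line):
            key = match.group()
            counts[key] = counts.get(key, 0) + 1   # key is created on first sight
    return counts

def with_zeros(counts, chars="ATCGN-"):
    """Fill in a 0 for every requested character that never occurred."""
    return {ch: counts.get(ch, 0) for ch in chars}

counts = count_matches(["GATTACA", "TTAA"])
print(counts)
print(with_zeros(counts))
```

A key only exists once its character has been seen, which is why the plain result has no entry for N or - here.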

