command line utility to print statistics of numbers in linux

Question

I often find myself with a file that has one number per line. I end up importing it in excel to view things like median, standard deviation and so forth.

Is there a command line utility in linux to do the same? I usually need to find the average, median, min, max and std deviation.

This is probably relevant: stackoverflow.com/questions/214363/…. — Oliver Charlesworth, Commented Mar 20, 2012 at 15:34
Voting to close as tool rec. stats.stackexchange.com/questions/24934/… || serverfault.com/questions/548322/… — Ciro Santilli OurBigBook.com, Commented Oct 12, 2015 at 11:22
People coming for this question might also be interested in jp, a CLI utility for making plots. — Matt Parker, Commented Jun 15, 2018 at 13:49

jasonleonhard · Accepted Answer · 2019-09-10 00:56:53Z

65

This is a breeze with R. For a file nums.txt with one number per line (the output below is consistent with the integers 1 through 10), use this:

R -q -e "x <- read.csv('nums.txt', header = F); summary(x); sd(x[ , 1])"

To get this:

       V1       
 Min.   : 1.00  
 1st Qu.: 3.25  
 Median : 5.50  
 Mean   : 5.50  
 3rd Qu.: 7.75  
 Max.   :10.00  
[1] 3.02765

The -q flag squelches R's startup licensing and help output
The -e flag tells R you'll be passing an expression from the terminal
x is a data.frame - a table, basically. It's a structure that accommodates multiple vectors/columns of data, which is a little peculiar if you're just reading in a single vector. This has an impact on which functions you can use.
Some functions, like summary(), naturally accommodate data.frames. If x had multiple fields, summary() would provide the above descriptive stats for each.
But sd() can only take one vector at a time, which is why I index x for that command (x[ , 1] returns the first column of x). You could use apply(x, MARGIN = 2, FUN = sd) to get the SDs for all columns.
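For readers without R installed, roughly the same summary can be sketched with Python's standard statistics module. This is an illustration rather than part of the original answer: the summarize helper is a hypothetical name, and the sample values 1 through 10 are an assumption chosen because they are consistent with the output shown above. The "inclusive" quantile method reproduces R's quartiles for this data.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, max and the sample SD of a list of numbers."""
    # method="inclusive" interpolates between data points the way the
    # quartiles in the R output above are computed for this sample.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1), like R's sd()
    }


# Assumed sample data: the numbers 1 through 10.
stats = summarize([float(i) for i in range(1, 11)])
for name, value in stats.items():
    print(f"{name:>6}: {value:.5f}")
```

To read a real file instead, the list could be built with something like [float(line) for line in open("nums.txt")], where nums.txt is the file name used in the R command.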

edited Sep 10, 2019 at 0:56

jasonleonhard


answered Mar 22, 2012 at 16:25

Matt Parker


Thanks. I have started using R since and I think it is a great tool to understand data
– MK.
Commented May 20, 2012 at 9:19
4

To save the extra search for how to get R on Ubuntu: sudo apt-get install r-base
– E-rich
Commented Jun 19, 2015 at 15:54
1

R might be nice to use more in the future, but I'm guessing it's a huge library install, so I had thoughts about installing st from below. Not related to that comment, but my brew install R took just about an hour on a MacBook Pro Mid 2015 10.12.5 2.5GHz i7 16GB with Chrome, Atom, and other apps open. Most of it was spent building some gcc jit patch with the Xcode CLT O_o, but now I am happily using parts of this answer :)
– Pysis
Commented Oct 6, 2017 at 18:53
Here's a blog post on general data-wrangling with bash tools that might be of interest to people who found this question.
– Matt Parker
Commented Mar 20, 2018 at 16:12


user2747481 · Accepted Answer · 2013-09-24 14:54:59Z

49

Using "st" (https://github.com/nferraz/st)

$ st numbers.txt
N    min   max   sum   mean  stddev
10   1     10    55    5.5   3.02765

Or:

$ st numbers.txt --transpose
N      10
min    1
max    10
sum    55
mean   5.5
stddev 3.02765

(DISCLAIMER: I wrote this tool :))

edited Sep 24, 2013 at 14:54

answered Sep 4, 2013 at 15:23

user2747481


1

Any info about installation for newbies?
– NeDark
Commented Feb 9, 2015 at 6:31
3

If you're using homebrew installing this is as simple as brew install st.
– Jason Axelson
Commented Feb 1, 2017 at 2:36
3

Beware that st may also refer to simple terminal.
– Skippy le Grand Gourou
Commented Feb 6, 2019 at 10:22


Benjamin W. · Accepted Answer · 2021-06-13 04:01:56Z

For the average, median & standard deviation you can use awk. This will generally be faster than R solutions. For instance, the following prints the average:

awk '{a+=$1} END{print a/NR}' myfile

(NR is an awk variable holding the number of records read; $1 is the first whitespace-separated field of the line. $0, the whole line, would also work here but is less robust, although for the computation awk would probably just use the leading number anyway. END means that the following commands run after the whole file has been processed. One could also initialize a to 0 in a BEGIN{a=0} block.)

Here is a simple awk script which provides more detailed statistics (it takes a CSV file as input; otherwise change FS):

#!/usr/bin/awk -f

BEGIN {
    FS=",";
}
{
   a += $1;
   b[++i] = $1;
}
END {
    m = a/NR; # mean
    for (i in b)
    {
        d += (b[i]-m)^2;
        e += (b[i]-m)^3;
        f += (b[i]-m)^4;
    }
    va = d/NR; # variance
    sd = sqrt(va); # standard deviation
    sk = (e/NR)/sd^3; # skewness
    ku = (f/NR)/sd^4-3; # standardized kurtosis
    print "N,sum,mean,variance,std,SEM,skewness,kurtosis"
    print NR "," a "," m "," va "," sd "," sd/sqrt(NR) "," sk "," ku
}

It is straightforward to add min/max to this script, but it is just as easy to pipe through sort & head/tail:

sort -n myfile | head -n1
sort -n myfile | tail -n1
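The moment formulas in the awk script above can be cross-checked with a short Python sketch. It mirrors the script's choices (population variance dividing by N, and kurtosis reported as excess over the normal value of 3); the moments function name and the sample data 1 through 10 are assumptions made for illustration only.

```python
import math


def moments(values):
    """Population mean, variance, SD, SEM, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Sums of the 2nd, 3rd and 4th powers of the deviations from the mean,
    # matching the accumulators d, e and f in the awk script.
    d = sum((x - mean) ** 2 for x in values)
    e = sum((x - mean) ** 3 for x in values)
    f = sum((x - mean) ** 4 for x in values)
    variance = d / n                      # population variance (divides by N)
    sd = math.sqrt(variance)
    return {
        "n": n,
        "sum": sum(values),
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "sem": sd / math.sqrt(n),         # standard error of the mean
        "skewness": (e / n) / sd ** 3,
        "kurtosis": (f / n) / sd ** 4 - 3,  # excess kurtosis
    }


# Assumed sample data: the numbers 1 through 10.
result = moments([float(i) for i in range(1, 11)])
print(result)
```

Note that the variance here divides by N, so the standard deviation comes out smaller than the sample standard deviation that tools such as R's sd() report for the same data.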

einpoklum · Accepted Answer · 2020-09-29 16:32:00Z

Yet another tool that can calculate statistics and show the distribution in ASCII mode is ministat. It's a tool from FreeBSD, but it is also packaged for popular Linux distributions like Debian/Ubuntu. Or you can simply download and build it from source; it only requires a C compiler and the C standard library.

Usage example:

$ cat test.log 
Handled 1000000 packets.Time elapsed: 7.575278
Handled 1000000 packets.Time elapsed: 7.569267
Handled 1000000 packets.Time elapsed: 7.540344
Handled 1000000 packets.Time elapsed: 7.547680
Handled 1000000 packets.Time elapsed: 7.692373
Handled 1000000 packets.Time elapsed: 7.390200
Handled 1000000 packets.Time elapsed: 7.391308
Handled 1000000 packets.Time elapsed: 7.388075

$ cat test.log| awk '{print $5}' | ministat -w 74
x <stdin>
+--------------------------------------------------------------------------+
| x                                                                        |
|xx                                   xx    x x                           x|
|   |__________________________A_______M_________________|                 |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   8      7.388075      7.692373       7.54768     7.5118156    0.11126122

bua · Accepted Answer · 2012-03-20 16:10:56Z

21

Yep, it's called perl
and here is concise one-liner:

perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp;$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort {$a<=>$b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'

Example

$ cat tt
1
3
4
5
6.5
7.
2
3
4

And the command

cat tt | perl -e 'use List::Util qw(max min sum); @a=();while(<>){chomp;$sqsum+=$_*$_; push(@a,$_)}; $n=@a;$s=sum(@a);$avg=$s/$n;$m=max(@a);$mm=min(@a);$std=sqrt($sqsum/$n-($s/$n)*($s/$n));$mid=int @a/2;@srtd=sort {$a<=>$b} @a;if(@a%2){$med=$srtd[$mid];}else{$med=($srtd[$mid-1]+$srtd[$mid])/2;};print "records:$n\nsum:$s\navg:$avg\nstd:$std\nmed:$med\nmax:$m\nmin:$mm\n";'
records:9
sum:35.5
avg:3.94444444444444
std:1.86256162380447
med:4
max:7.
min:1

edited Mar 20, 2012 at 16:10

answered Mar 20, 2012 at 15:43

bua


23

I'm sure that works, but doing it all one line makes my eyes bleed. Why not create a script, rather than that atrocity?
– Oliver Charlesworth
Commented Mar 20, 2012 at 18:15
20

I'm pretty sure that "functional languages" != "write everything on one line as tersely as possible".
– Oliver Charlesworth
Commented Mar 20, 2012 at 21:48
13

Just because you can do something one line doesn't mean you should.
– Mike Monkiewicz
Commented Feb 22, 2013 at 21:50
10

Oneliners are definitely not for reading. But they are great to just copy-paste into my putty, and get the stats on some numbers grepped out from apache log... So bua roxx
– Vajk Hermecz
Commented Feb 23, 2014 at 14:52
3

The command line (at least bash) supports multi-line strings, too. Just use a line break inside the literal.
– Paŭlo Ebermann
Commented Nov 15, 2016 at 16:57


olejorgenb · Accepted Answer · 2016-08-17 13:34:01Z

18

Yet another tool: https://www.gnu.org/software/datamash/

# Example: calculate the sum and mean of values 1 to 10:
$ seq 10 | datamash sum 1 mean 1
55 5.5

Might be more commonly packaged (the first tool I found prepackaged for nix at least)

answered Aug 17, 2016 at 13:34

olejorgenb



ghoti · Accepted Answer · 2019-03-21 10:35:20Z

18

Mean:

awk '{sum += $1} END {print "mean = " sum/NR}' filename

Median:

gawk '

    # Return the median of the c values stored in v (asort() is gawk-specific).
    function median(c,v,    j) { 
       asort(v,j) 
       if (c % 2) return j[(c+1)/2]
       else return (j[c/2+1]+j[c/2])/2.0
    }

    # Collect every value; the median needs the whole data set.
    { 
       count++
       values[count]=$1
    } 

    END { 
       print  "median = " median(count,values)
    }
    ' filename

Mode:

awk '{c[$1]++} END {for (i in c) {if (c[i]>maxcount) {maxcount=c[i]; mode=i}} print "mode = " mode}' filename

If several values tie for the highest count, only one of them is printed (whichever awk's array traversal visits first), but you see how it works...
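As a cross-check for the median and mode one-liners, here is a minimal Python sketch using the standard statistics module. The sample list is an assumption made up for illustration; multimode() returns every value that ties for the highest count, which sidesteps the question of which tied value to report.

```python
import statistics

# Assumed sample data with an even number of values and a single mode.
sample = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9]

# With an even count, the median is the mean of the two middle values.
median = statistics.median(sample)

# multimode() returns a list of all values sharing the highest frequency.
modes = statistics.multimode(sample)

# A data set with a tie: both 1 and 2 occur twice.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print("median =", median)
print("mode(s) =", modes)
print("tied mode(s) =", tied_modes)
```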

Standard Deviation:

awk '{sum+=$1; sumsq+=$1*$1} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)^2)}' filename
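The one-pass formula above yields the population standard deviation (dividing by N), whereas several tools elsewhere in this thread report the sample standard deviation (dividing by N - 1). The following Python sketch shows both from the same running sums; stdev_from_sums is a hypothetical helper name and the data 1 through 10 is an assumed example.

```python
import math


def stdev_from_sums(values, sample=False):
    """Standard deviation from running sums, as in the awk one-liner.

    sample=False divides by N (population SD, what the awk formula gives);
    sample=True divides by N - 1 (sample SD).
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:          # single pass, like awk reading line by line
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    # Sum of squared deviations, rewritten in terms of the running sums.
    ss = total_sq - n * mean * mean
    return math.sqrt(ss / (n - 1 if sample else n))


data = [float(i) for i in range(1, 11)]   # assumed sample: 1 through 10
population_sd = stdev_from_sums(data)
sample_sd = stdev_from_sums(data, sample=True)
print(f"population SD = {population_sd:.4f}, sample SD = {sample_sd:.4f}")
```

The sums-of-squares shortcut can lose precision when the values are large relative to their spread, so for such data a two-pass computation may be the safer choice.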

edited Mar 21, 2019 at 10:35

answered Mar 20, 2012 at 15:49

ghoti


2

Nicely done - and using a tool that is on every linux distro.
– rbellamy
Commented Sep 23, 2016 at 6:37
1

@rbellamy - thanks! And not just Linux - I administer FreeBSD systems, where the distinction between awk and gawk is important (since Plain Old Awk on BSD doesn't include asort()).
– ghoti
Commented Sep 23, 2016 at 13:57
This is great.. I didn't have to install any additional tools!
– janechii
Commented Mar 17, 2022 at 23:56


Javier · Accepted Answer · 2013-03-27 00:02:30Z

Just in case, there's datastat, a simple program for Linux computing simple statistics from the command-line. For example,

cat file.dat | datastat

will output the average value across all rows for each column of file.dat. If you need to know the standard deviation, min, max, you can add the --dev, --min and --max options, respectively.

datastat has the possibility to aggregate rows based on the value of one or more "key" columns. For example,

cat file.dat | datastat -k 1

will produce, for each different value found on the first column (the "key"), the average of all other column values as aggregated among all rows with the same value on the key. You can use more columns as key fields (e.g., -k 1-3, -k 2,4 etc...).

It's written in C++, runs fast and with small memory occupation, and can be piped nicely with other tools such as cut, grep, sed, sort, awk etc.

Stylistic minor cringe for the useless use of cat
– tripleee
Commented Jan 9, 2018 at 4:31
Missing an authorship disclaimer, I think.
– Benjamin W.
Commented Jun 13, 2021 at 4:05

Matt Parker · Accepted Answer · 2014-05-06 15:52:40Z

data_hacks is a Python command-line utility for basic statistics.

The first example from that page produces the desired results:

$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
    1.0000 -     1.9000 [     1]: *
    1.9000 -     2.8000 [     5]: *****
    2.8000 -     3.7000 [     8]: ********
    3.7000 -     4.6000 [     3]: ***
    4.6000 -     5.5000 [     4]: ****
    5.5000 -     6.4000 [     2]: **
    6.4000 -     7.3000 [     3]: ***
    7.3000 -     8.2000 [     1]: *
    8.2000 -     9.1000 [     1]: *
    9.1000 -    10.0000 [     1]: *

Community · Accepted Answer · 2020-06-20 09:12:55Z

You might also consider using clistats. It is a highly configurable command line interface tool to compute statistics for a stream of delimited input numbers.

I/O options

Input data can be from a file, standard input, or a pipe
Output can be written to a file, standard output, or a pipe
Output uses headers that start with "#" to enable piping to gnuplot

Parsing options

Signal, end-of-file, or blank line based detection to stop processing
Comment and delimiter character can be set
Columns can be filtered out from processing
Rows can be filtered out from processing based on numeric constraint
Rows can be filtered out from processing based on string constraint
Initial header rows can be skipped
Fixed number of rows can be processed
Duplicate delimiters can be ignored
Rows can be reshaped into columns
Strictly enforce that only rows of the same size are processed
A row containing column titles can be used to title output statistics

Statistics options

Summary statistics (Count, Minimum, Mean, Maximum, Standard deviation)
Covariance
Correlation
Least squares offset
Least squares slope
Histogram
Raw data after filtering

NOTE: I'm the author.

Benjamin W. · Accepted Answer · 2021-06-13 04:05:58Z

I found myself wanting to do this in a shell pipeline, and getting all the right arguments for R took a while. Here's what I came up with:

seq 10 | R --slave -e 'x <- scan(file="stdin",quiet=TRUE); summary(x)'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00

The --slave option "Make(s) R run as quietly as possible...It implies --quiet and --no-save." The -e option tells R to treat the following string as R code. The first statement reads from standard in, and stores what's read in the variable called "x". The quiet=TRUE option to the scan function suppresses the writing of a line saying how many items were read. The second statement applies the summary function to x, which produces the output.

Tom · Accepted Answer · 2013-09-30 20:37:29Z

3

There is also simple-r, which can do almost everything that R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To calculate basic descriptive statistics, one would have to type one of:

r summary file.txt
r summary - < file.txt
cat file.txt | r summary -

For each of average, median, min, max and std deviation, the code would be:

seq 1 100 | r mean - 
seq 1 100 | r median -
seq 1 100 | r min -
seq 1 100 | r max -
seq 1 100 | r sd -

Doesn't get any simple-R!

answered Sep 30, 2013 at 20:37

Tom


Funny, it's a Perl wrapper to R. R is not a programming language! >:-)
– Ciro Santilli OurBigBook.com
Commented Oct 12, 2015 at 9:49


unhammer · Accepted Answer · 2018-04-24 12:28:43Z

3

Using xsv:

$ echo '3 1 4 1 5 9 2 6 5 3 5 9' |tr ' ' '\n' > numbers-one-per-line.csv

$ xsv stats -n < numbers-one-per-line.csv 
field,type,sum,min,max,min_length,max_length,mean,stddev
0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863

# mode/median/cardinality not shown by default since it requires storing full file in memory:
$ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7

answered Apr 24, 2018 at 12:28

unhammer


1

Installing this with brew revealed lots of dependencies. Pretty "heavy" for this functionality.
– Alex Moore-Niemi
Commented Nov 27, 2019 at 21:20
So don't use brew? github.com/BurntSushi/xsv/releases has precompiled binaries for macos, so there should be no reason to install the full rust toolchain or whatever it is brew does.
– unhammer
Commented Nov 28, 2019 at 8:53


Benjamin W. · Accepted Answer · 2021-06-13 04:06:28Z

#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and 
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# [email protected]

use strict;
use warnings;

use List::Util qw< min max >;

#
my $number_rx = qr{

  # leading sign, positive or negative
    (?: [+-] ? )

  # mantissa
    (?= [0123456789.] )
    (?: 
        # "N" or "N." or "N.N"
        (?:
            (?: [0123456789] +     )
            (?:
                (?: [.] )
                (?: [0123456789] * )
            ) ?
      |
        # ".N", no leading digits
            (?:
                (?: [.] )
                (?: [0123456789] + )
            ) 
        )
    )

  # exponent
    (?:
        (?: [Ee] )
        (?:
            (?: [+-] ? )
            (?: [0123456789] + )
        )
        |
    )
}x;

my $n = 0;
my $sum = 0;
my @values = ();

my %seen = ();

while (<>) {
    while (/($number_rx)/g) {
        $n++;
        my $num = 0 + $1;  # 0+ is so numbers in alternate form count as same
        $sum += $num;
        push @values, $num;
        $seen{$num}++;
    } 
} 

die "no values" if $n == 0;

my $mean = $sum / $n;

my $sqsum = 0;
for (@values) {
    $sqsum += ( $_ ** 2 );
} 
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
my $stdev = sqrt($sqsum);

my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

my $mode = @modes == 1 
            ? $modes[0] 
            : "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;

my $median;
my @sorted = sort { $a <=> $b } @values;  # the median needs sorted data
my $mid = int @sorted/2;
if (@sorted % 2) {
    $median = $sorted[ $mid ];
} else {
    $median = ($sorted[$mid-1] + $sorted[$mid])/2;
} 

my $min = min @values;
my $max = max @values;

printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n", 
    $mode, $median, $mean, $stdev;
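The number-matching idea in the script above (optional sign, a mantissa with or without leading digits, optional exponent) can be sketched compactly in Python. This is an illustration under assumptions, not a port of the script: NUMBER_RX and extract_numbers are hypothetical names, and the sample sentence is made up.

```python
import re

# Optional sign, then "N", "N." or "N.N" or ".N", then an optional exponent.
NUMBER_RX = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def extract_numbers(text):
    """Return every real number found in free text, converted to float."""
    return [float(token) for token in NUMBER_RX.findall(text)]


# Numbers may be mixed with other text and written in several forms.
found = extract_numbers("avogadro=6.02e23, offset -3.5 and .25 or 7.")
print(found)
```

The resulting list can then be passed to whatever statistics routine is preferred, for example the functions in Python's statistics module.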

JonDeg · Accepted Answer · 2020-07-11 07:31:32Z

2

Another tool: tsv-summarize, from eBay's tsv utilities. Min, max, mean, median, standard deviation are all supported. Intended for large data sets. Example:

$ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
1    10    5.5    3.0276503541

Disclaimer: I'm the author.

edited Jul 11, 2020 at 7:31

answered Jan 20, 2018 at 18:36

JonDeg



Harry Mangalam · Accepted Answer · 2020-10-21 21:39:43Z

There is also stats, a Perl utility I wrote myself (bundled with 'scut') that does just that. Fed a stream of numbers on STDIN, it tries to reject non-numbers and emits the following:

$ ls -lR | scut -f=4 | stats
Sum       3.10271e+07
Number    452
Mean      68643.9
Median    4469.5
Mode      4096
NModes    6
Min       2
Max       1.01171e+07
Range     1.01171e+07
Variance  3.03828e+11
Std_Dev   551206
SEM       25926.6
95% Conf  17827.9 to 119460
          (for a normal distribution - see skew)
Skew      15.4631
          (skew = 0 for a symmetric dist)
Std_Skew  134.212
Kurtosis  258.477
          (K=3 for a normal dist)

It can also do a number of transforms on the input stream and emit only the unadorned value if you ask it; e.g. 'stats --mean' will return the mean as an unlabelled float.

Olav Kvittem · Accepted Answer · 2023-09-29 08:57:37Z

Not enough solutions ?-) I would like to toss in gnuplot's stats command. Gnuplot is a remarkably fast data analytics tool: plotting, regression...

seq 10 | gnuplot -e "stats '-' u 1"

* FILE: 
  Records:           10
  Out of range:       0
  Invalid:            0
  Header records:     0
  Blank:              0
  Data Blocks:        1

* COLUMN: 
  Mean:               5.5000
  Std Dev:            2.8723
  Sample StdDev:      3.0277
  Skewness:           0.0000
  Kurtosis:           1.7758
  Avg Dev:            2.5000
  Sum:               55.0000
  Sum Sq.:          385.0000

  Mean Err.:          0.9083
  Std Dev Err.:       0.6423
  Skewness Err.:      0.7746
  Kurtosis Err.:      1.5492

  Minimum:            1.0000 [ 0]
  Maximum:           10.0000 [ 9]
  Quartile:           3.0000 
  Median:             5.5000 
  Quartile:           8.0000

HerCerM · Accepted Answer · 2023-06-06 02:21:47Z

The chosen answer uses R. Using the same tool, I find a script nicer to work with (than a one-liner) as it can be modified more comfortably to add any specific stats, or format the output differently.

Given a file data.txt with one number per line (the output below is consistent with the integers 1 through 10), and having this basic-stats script in $PATH:

#!/usr/bin/env Rscript

# Build a numeric vector.
x <- as.numeric(readLines("stdin"))

# Custom basic statistics.
basic_stats <- data.frame(
    N = length(x), min = min(x), mean = mean(x), median = median(x), stddev = sd(x),
    percentile_95 = quantile(x, c(.95)), percentile_99 = quantile(x, c(.99)),
    max = max(x))

# Print output.
print(round(basic_stats, 3), row.names = FALSE, right = FALSE)

Execute basic-stats < data.txt to print to stdout the following:

 N  min mean median stddev percentile_95 percentile_99 max
 10 1   5.5  5.5    3.028  9.55          9.91          10

The formatting can look a bit nicer by replacing the last 2 lines of the script with the following:

# Print output. Tabular formatting is done by the `column` command.
temp_file <- tempfile("basic_stats_", fileext = ".csv")
write.csv(round(basic_stats, 3), file = temp_file, row.names = FALSE, quote = FALSE)
system(paste("column -s, -t", temp_file))
. <- file.remove(temp_file)

This is the output now, with 2 spaces between columns (instead of 1 space):

N   min  mean  median  stddev  percentile_95  percentile_99  max
10  1    5.5   5.5     3.028   9.55           9.91           10
