I have a text file that is a couple of GBs in size. I am trying to shuffle this text file as part of a pipeline.

For example, these are some of the commands I have tried, but they are not efficient; in fact, the pipe does not seem to produce any output until the whole file has been read. Maybe I am wrong about that.

shuf HUGETEXTFILE.txt | some command

cat HUGETEXTFILE.txt | sort -R | some command
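
As far as I can tell, that delay is unavoidable for a true shuffle: the last input line may need to come out first, so shuf and sort -R must read the whole input before emitting anything. The closest thing to a streaming shuffle I can imagine is an approximate one that only mixes lines within a bounded window, for example with awk (a sketch; the window size W=100000 is an arbitrary placeholder):

# Approximate streaming shuffle: buffer up to W lines; once the buffer is
# full, emit a randomly chosen buffered line for each new line read, and
# drain the buffer at end of input. Lines far apart in the file mostly keep
# their relative order, so this is not a uniform shuffle.
awk -v W=100000 'BEGIN { srand() }
  {
    buf[n++] = $0
    if (n >= W) { i = int(rand() * n); print buf[i]; buf[i] = buf[--n] }
  }
  END {
    while (n > 0) { i = int(rand() * n); print buf[i]; buf[i] = buf[--n] }
  }' HUGETEXTFILE.txt | some command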

I also tried to use

split -n 1/numberofchunks HUGETEXTFILE.txt | sort -R | some command

But the pipe ends when the first chunk finishes, because split -n 1/numberofchunks writes only the first of the chunks to standard output.
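
Looping over the chunk indices keeps the stream going, but each chunk is then only shuffled internally, so this is not a full shuffle of the file (a sketch; numberofchunks=10 is a placeholder):

# Shuffle chunk by chunk; only mixes lines within each chunk.
numberofchunks=10
for i in $(seq 1 "$numberofchunks"); do
    split -n "$i/$numberofchunks" HUGETEXTFILE.txt | shuf
done | some command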

I am trying to find an efficient way to shuffle a text file inside a pipe, because I do not want to write hundreds of temporary files every time I need a new shuffle or random distribution.

Thanks.

  • Have you tried shuf --input-range=$LO-$HI? Instead of split ... you can give shuf the range of line numbers... (see the sketch after these comments)
    – Hastur
    Commented Jul 25, 2014 at 22:31
  • Well, I am trying to shuffle the whole file at once if possible. This just sounds like it would shuffle a range from the input file.
    – yarun can
    Commented Jul 25, 2014 at 22:45
  • Also, that option just generates a bunch of random numbers; I am not sure that is what I need. Could you elaborate, please?
    – yarun can
    Commented Jul 25, 2014 at 22:48
  • Have you tried using shuf with the --output option, then using cat outfile.txt | some command? I know you said you didn't want to write hundreds of files, but this is only one, and its name can be reused, so you should only ever have one.
    – Tyson
    Commented Jul 25, 2014 at 23:03
  • You do know that there simply is no "efficient" way to shuffle a multi-GB text file (i.e. one that doesn't fit in RAM) - shuffling is an intrinsically expensive operation. Commented Jul 25, 2014 at 23:04
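
To elaborate on the shuf --input-range idea from the first comment: shuf -i 1-N prints the numbers 1 through N in random order, so it can serve as a shuffled list of line numbers to extract from the file. A rough sketch of that idea (illustrative only: sed re-reads the file for every line, so this is quadratic and far too slow for a multi-GB file):

# Print the file's line numbers in random order, then fetch each line.
# WARNING: one sed pass per line; do not use this on big files.
n=$(wc -l < HUGETEXTFILE.txt)
shuf -i 1-"$n" | while IFS= read -r k; do
    sed -n "${k}p;${k}q" HUGETEXTFILE.txt
done | some command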

1 Answer


You can try this approach:

cat bigfile.txt |
  while IFS= read -r line; do
    printf '%s\n' "$line"    # pass each line through verbatim
  done |
  shuf | grep "sample"       # shuffle the whole stream, then filter

Here, IFS= keeps read from trimming leading and trailing whitespace, and -r stops it from interpreting backslash escapes, so every line passes through the loop unchanged before the whole stream is shuffled and filtered.
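
Note that shuf in the pipeline above still has to read its whole input before it prints anything. For a file that does not fit in RAM, one standard trick is decorate-sort-undecorate: prefix each line with a random key, let sort(1) do a disk-backed external merge sort on that key, then strip the key. A sketch (output still only begins once the sort has finished, but memory use stays bounded because sort spills to temporary files):

# Full shuffle in bounded memory: random key + external sort + strip the key.
awk 'BEGIN { srand() } { printf "%.12f\t%s\n", rand(), $0 }' HUGETEXTFILE.txt |
  sort -k1,1n |
  cut -f2- | some command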
