I was sorting a large file (101 MB; about 700 MB after unzipping) with the sort command on a server that has 48 GB of memory, and it was the only heavy job running at the time. Nevertheless, I noticed that sort created lots of temporary files. Does that mean it was running out of RAM?

Or does sort always create files? Can I speed up the sorting process by passing, via the -T option, a directory on a filesystem mounted in RAM? I tried that, but I didn't notice a significant speed-up, and I'm wondering whether I constructed the test wrong or am simply misunderstanding what's going on.
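Sketched concretely (with /dev/shm standing in for the RAM-mounted folder; any tmpfs directory would do, and the output name is just illustrative), what I tried was something like:

zcat file0.nq.gz | sort -T /dev/shm > sorted.nq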

For the baseline test, this is the command I issued:

zcat file0.nq.gz | sort

Within about 20 seconds, I have the following files in /tmp:

nuoritoveri@nubis:/tmp[127]$ ls
sortecuGwN  sorteKeowj  sortGn7dCr  sortkdk5Ws  sortNb9Khh  sortPGTQ6b  sortQearCg  sortvBB5eS  sortZW2mWj
sort1UsQla  sortEGauDb  sortFMn7bW  sortiUDJYd  sortlaGUgo  sortpEmGb5  sortPQUNQx  sortqlb7jh  sortxcjjuM
sortaVKeEN  sortejgptJ  sortgAJJ9l  sortJRq2GB  sortmQf888  sortpFfWdy  sortpv9kO8  sortT52TVQ  sortxq8r80

The files disappear when the command finishes. I also checked what happens when I don't pipe, but instead sort the unzipped file directly:

sort file0.nq

The files appear in /tmp as well, but not at such a fast rate (perhaps because sort has to read the file by itself).

  • No; it does not mean you were lacking memory. It means your system was using the page file/swap.
    – Ramhound
    Commented Jun 19, 2014 at 13:17
  • I've never heard of sort creating files. Could you give an example of the file (path) that was created?
    – mtak
    Commented Jun 19, 2014 at 14:11
  • Thanks for the comments; I updated the question. @Ramhound, why would sort use swap if it has spare memory? Commented Jun 19, 2014 at 15:52
  • @nuoritoveri - you're asking the wrong question. Why do you have the swap on the HDD to begin with, instead of putting it in memory?
    – Ramhound
    Commented Jun 19, 2014 at 16:06
  • @Ramhound /tmp is not swap. If the computer were using swap, it wouldn't create files in /tmp; it would just use the swap partition (assuming there is a swap partition).
    – Lawrence
    Commented Jun 20, 2014 at 1:08

1 Answer

In general, "keep using memory until you run out" is a poor strategy: you may cause problems for other users; you may end up using swap, which looks like memory but has far worse performance characteristics; or (because Linux overcommits memory by default) you may simply end up getting killed by the OOM killer.

When sorting large volumes of data, a common strategy is a "batch merge" (often called an external merge sort): you split the data into batches, sort each batch in memory, and write it out to a temporary file. A merge process then reads the batches back and merges them together. If the data set is very large, there may be multiple layers of merging. A sketch of the same idea follows below.
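Expressed with coreutils themselves (the file and chunk names here are just illustrative), the batch merge looks like this:

split -l 1000000 file0.nq chunk.                 # split input into batches of 1M lines
for f in chunk.*; do sort "$f" -o "$f"; done     # sort each batch in memory, in place
sort -m chunk.* > sorted.nq                      # merge the already-sorted batches
rm chunk.*                                       # clean up the temporary batches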

I had a quick dig through the code for sort at https://sources.debian.org/src/coreutils/8.30-3/src/sort.c/

It looks like the buffer size sort decides to use depends on a number of factors, including ulimit values, free memory, the -S parameter (if specified), and the size of the input files.
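If you want to see the limits sort will consult on a given system, the standard tools show them (these commands only inspect state; they don't change anything):

ulimit -v    # per-process virtual memory limit
free -m      # free and available physical memory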

It seems that when the input size is unknown (e.g. input from a pipe), there is no particular memory pressure, and no sort size is specified, sort uses a buffer size derived from "INPUT_FILE_SIZE_GUESS", which according to the comments works out to a buffer of about 17 megabytes (note that the buffer doesn't store just the raw line text, so it may not fit 17 megabytes of input).
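So when piping, you can override that roughly 17 MB guess explicitly with -S; a sketch, with an example value rather than a recommendation:

zcat file0.nq.gz | sort -S 2G > sorted.nq    # -S sets the in-memory buffer size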
