I have a 150GB XML file that I would like to shorten (i.e. truncate) to about 1GB. Is there a simple (bash or similar) command I can use, or do I have to go the programmatic route? (Editing it in vi or emacs is a nightmare even on big-iron systems.)

(I am not particularly concerned about the loss of information. I just want a shorter file so I can test a piece of software on it without waiting many hours for the answer.)

    Do you mean you want to truncate the file, or do you want to remove information from throughout the file?
    – AFH
    Commented Jan 5, 2018 at 15:52
    Found this on SO; stackoverflow.com/a/15934078/2800918.
    – CAB
    Commented Jan 5, 2018 at 16:02
    Since this is an XML file, which I assume contains a sequence with a great number of elements, you could also use an XML transformation language such as XQuery to keep only a certain number of these elements, which would have the advantage of producing valid XML.
    – Aaron
    Commented Jan 5, 2018 at 17:31
    Does the file still need to be valid XML when done?
    – Joe
    Commented Jan 5, 2018 at 21:22
    No, I just patched it up so it was.
    Commented Jan 5, 2018 at 23:09

Assuming you want to truncate and extract the first 1 GB of the 150 GB file:

With head:

head -c 1G infile > outfile

Note that the G suffix (1024^3 bytes) can be replaced with GB (1000^3 bytes) to count in powers of 1000 instead of 1024.
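
To see the suffix difference without touching 150 GB, here is a scaled-down sketch that uses K and KB in place of G and GB. It assumes GNU head (other implementations may accept only a plain byte count), and the scratch file names are made up for the demo:

```shell
# Scaled-down sketch: K/KB stand in for G/GB so this finishes instantly.
# Assumes GNU head; file names are hypothetical scratch files.
workdir=$(mktemp -d)
cd "$workdir"
head -c 4096 /dev/zero > infile      # 4096-byte scratch input
head -c 1K  infile > out_binary      # K  means 1024 bytes
head -c 1KB infile > out_decimal     # KB means 1000 bytes
wc -c out_binary out_decimal         # show both sizes side by side
```

The same relationship holds for G (1024^3) and GB (1000^3) on the real file.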

Or with dd:

dd if=infile of=outfile bs=1M count=1024

Or as in Wumpus Q. Wumbley's answer, dd can truncate in place.

    That will likely not result in a readable XML file when done.
    – Joe
    Commented Jan 5, 2018 at 21:21
    @Joe - OP did not request a readable file (nor did they say it could be unreadable). They did say that they did not care about loss of information. I would expect a new question from OP about how to fix said file.
    – KevinDTimm
    Commented Jan 5, 2018 at 22:17
    I know enough XML to fix it, I wrote the DTD for the format!
    Commented Jan 5, 2018 at 23:12
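
If the shortened file does have to parse again, one rough way to patch it up is to drop the last (probably cut-off) line and re-close the root element. This is only a sketch under strong assumptions: one record per line, and a root element called records with record children. Those names are invented for the demo and are not the OP's format:

```shell
# Sketch only: assumes one record per line; the element names
# <records>/<record> are hypothetical, not taken from the question.
workdir=$(mktemp -d)
cd "$workdir"
{
  printf '<records>\n'
  for i in 1 2 3 4 5; do printf '<record id="%s"/>\n' "$i"; done
  printf '</records>\n'
} > infile.xml
head -c 40 infile.xml > cut.xml        # the cut lands in the middle of record 2
sed '$d' cut.xml > outfile.xml         # drop the last, probably incomplete, line
printf '</records>\n' >> outfile.xml   # re-close the root element
cat outfile.xml
```

If the cut happens to land exactly on a line boundary, this throws away one complete record, which should not matter for test data.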

To truncate a file to 1 gigabyte, use the truncate command:

truncate -s 1G file.xml

The result of truncation will likely not be a valid XML file but I gather that you understand that.

See the documentation for the GNU or the BSD version of truncate, depending on your system.
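
Since truncate works in place, it may be worth trying on a scratch file first. A minimal sketch, assuming GNU truncate (the K suffix and the '<' prefix are GNU features) and made-up file names:

```shell
# Scratch-file sketch, assuming GNU truncate; file names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"
head -c 8192 /dev/zero > file.xml
truncate -s 1K file.xml          # shrink in place to 1024 bytes
wc -c < file.xml

head -c 100 /dev/zero > small.xml
truncate -s '<1K' small.xml      # '<' means "at most": a smaller file is left alone
wc -c < small.xml
```

A plain -s SIZE also extends a file that is shorter than SIZE, so the '<' form is the safer one when you only ever want to shrink.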


Where possible, I'd use the truncate command as in John1024's answer. It's not a standard unix command, though, so you might some day find yourself unable to use it. In that case, dd can do an in-place truncation too.

dd's default behavior is to truncate the output file at the point where the copying ends, so you just give it a 0-length input file and tell it to start writing at the desired truncation point:

dd if=/dev/null of=filename bs=1048576 seek=1024

(This is not the same as the copy-and-truncate dd in multithr3at3d's answer.)

Note that I used 1048576 and 1024 because 1048576*1024 is the desired size. I avoided bs=1m because this is a "portability" answer, and classic dd only knows suffixes k, b, and w.
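
The same command can be checked on a small scratch file before running it on the real one. A scaled-down sketch, with bs=1024 seek=4 standing in for the numbers above:

```shell
# Scaled-down sketch of the in-place dd truncation; the file is a scratch file.
workdir=$(mktemp -d)
cd "$workdir"
head -c 10000 /dev/zero > filename
# Nothing is copied from /dev/null; dd just cuts the output at bs*seek bytes.
dd if=/dev/null of=filename bs=1024 seek=4 2>/dev/null
wc -c < filename                 # 1024 * 4 = 4096 bytes remain
```

Only plain numbers are used here, in keeping with the portability point.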

    For the general solution, you should probably note that the bs number multiplied by the seek number is the number of bytes to keep. Any two numbers that satisfy that constraint should work; e.g., bs=1073741824 seek=1 or bs=1 seek=1073741824. Or, since bs defaults to 512, seek=2097152 alone should also work. And you can use notation like 1M, 1K, 1G and 2M.
    Commented Jan 5, 2018 at 20:10

I'm not entirely sure what you are asking. Do you just want to get rid of the other 149GB or are you trying to compress 150GB into 1 GB? Regardless, this may be a useful method to accomplish this.

The split command can split any file into multiple pieces. See man split. You can specify the size of the file chunks you want to split it into with the -b option. For instance:

$ split -b 1GB myfile.xml

Without any other options this should create several files in the current directory starting with the letter x. If you want to adjust the names of the split files refer to the man page.

To re-assemble the file, use cat x* > re-assembled.xml. (A bare cat * would also pick up the original file and anything else in the directory.)


[kent_x86.py@c7 split-test]$ ls -l opendocman*
-rw-rw-r--.  1 kent_x86.py kent_x86.py 2082602 Mar 31  2017 opendocman-1.3.5.tar.gz

[kent_x86.py@c7 split-test]$ split -b 100K opendocman-1.3.5.tar.gz 
[kent_x86.py@c7 split-test]$ ls
opendocman-1.3.5.tar.gz  xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj  xak  xal  xam  xan  xao  xap  xaq  xar  xas  xat  xau
[kent_x86.py@c7 split-test]$ ll
total 4072
-rw-rw-r--. 1 kent_x86.py kent_x86.py 2082602 Jan  5 11:06 opendocman-1.3.5.tar.gz
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xaa
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xab
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xac
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xad
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xae
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xaf
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xag
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xah
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xai
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xaj
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xak
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xal
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xam
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xan
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xao
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xap
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xaq
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xar
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xas
-rw-rw-r--. 1 kent_x86.py kent_x86.py  102400 Jan  5 11:06 xat
-rw-rw-r--. 1 kent_x86.py kent_x86.py   34602 Jan  5 11:06 xau
[kent_x86.py@c7 split-test]$ cat xa* > opendoc-reassembled.tar.gz
[kent_x86.py@c7 split-test]$ ls -l opendoc-reassembled*
-rw-rw-r--. 1 kent_x86.py kent_x86.py 2082602 Jan  5 11:07 opendoc-reassembled.tar.gz

You can use the split command.

split -C 1G <filename>

For more details take a look at this stackoverflow answer
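
The difference from -b is that -C keeps whole lines together, so each piece ends at a line break instead of mid-line. A small sketch, assuming GNU split (other implementations may not have -C) and using invented file names and prefixes:

```shell
# Sketch assuming GNU split; sample.xml and the line_/byte_ prefixes are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"
for i in 1 2 3 4 5; do printf '<r id="%s"/>\n' "$i"; done > sample.xml   # 5 lines of 12 bytes
split -C 30 sample.xml line_     # at most 30 bytes of complete lines per piece
split -b 30 sample.xml byte_     # exactly 30 bytes per piece, may cut a line in half
wc -c line_aa byte_aa
```

For the original question you would keep only the first piece (xaa with the default prefix) and delete the rest.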


In the end I just used sed to extract an arbitrary number of lines:

sed -n 1,1000000p infile.xml > outfile.xml

    Putting aside whether this answers the question or not, this will scan the entire file, I believe, so it's much more efficient to use sed 1000000q (and a bit more compact, visually speaking).
    – B Layer
    Commented Feb 9, 2018 at 14:48
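
A quick way to convince yourself that the two sed forms give the same output, sketched on a small stand-in file (generated with seq, so it is not real XML; the file names are made up):

```shell
# Sketch on a small stand-in file; names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100000 > infile.xml                    # 100000 numbered lines
sed -n 1,1000p infile.xml > out_range.xml    # prints lines 1-1000, then keeps reading to the end
sed 1000q infile.xml > out_quit.xml          # prints lines 1-1000, then quits
cmp out_range.xml out_quit.xml && echo "identical"
```

The output is the same either way; the q form simply stops reading early, which is what matters on a 150 GB input.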

