My use case is that I need to parse text from Wikipedia articles. There is a dump available at https://dumps.wikimedia.org/enwiki/20221001/ that contains the files I want. Essentially, the articles are broken up into pairs of compressed files: an XML document containing a subset of Wikipedia articles, and a text file containing an index/metadata for that XML document. Typically, the XML documents run around 200 MB compressed, and the index files run under 1 MB compressed.
For example, here's a pair of files on the dump page referenced above:
enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2 251.7 MB
enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2 221 KB
Using WinZip (trial version), I am able to extract the index text files. However, when I try to extract the XML file from the articles archive, WinZip reports that the file is corrupt and offers to save what it was able to extract. Regardless of which compressed XML file I try, it always recovers the same amount -- approximately 3 KB.
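One detail that may matter: the file names say "multistream", which suggests each archive is a series of independently compressed bz2 streams concatenated together. A tool that stops at the first end-of-stream marker would recover only a tiny prefix, which matches the symptom of always getting ~3 KB. Here is a minimal sketch (using Python's standard `bz2` module, not WinZip) that reproduces that behavior with a small concatenated file:

```python
import bz2

# Simulate a "multistream" bz2 archive: two independently
# compressed streams concatenated back to back.
part1 = bz2.compress(b"<page>first article</page>\n")
part2 = bz2.compress(b"<page>second article</page>\n")
multistream = part1 + part2

# A decoder that stops at the first end-of-stream marker only
# recovers the first chunk -- analogous to getting a tiny
# truncated file out of a 200 MB archive.
decomp = bz2.BZ2Decompressor()
first_only = decomp.decompress(multistream)

# bz2.decompress (Python 3.3+) continues across stream
# boundaries and recovers the full content.
everything = bz2.decompress(multistream)

print(first_only.decode())
print(everything.decode())
```

This is just a sketch to illustrate the suspected failure mode; I have not confirmed that this is what WinZip is doing internally.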
I thought the problem might be the file size, so as a test I compressed a 4 GB file of my own and extracted it successfully, which seems to rule that out.
I'm not sure where to go from here.