
My use case is that I need to parse text from Wikipedia articles. There is a dump available at https://dumps.wikimedia.org/enwiki/20221001/ that contains the files I want. Essentially, the articles are broken up into several pairs of compressed files: an XML document containing a subset of Wikipedia articles, and a text file containing index/metadata entries for that XML document. Typically, the XML documents run about 200 MB compressed, and the text files run less than 1 MB compressed.

For example, here's a pair of files on the dump page referenced above:

enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2 251.7 MB

enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2 221 KB
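
For context, this is roughly how I expect to read the pair of files above once I can get at them (a minimal Python sketch using the standard-library bz2 module; my understanding is that each index line is colon-separated offset:page_id:title, but that layout is an assumption on my part):

    import bz2

    # The index file is small: each line should look like "offset:page_id:title",
    # where offset is the byte position of the bz2 stream inside the big XML
    # archive that contains that page.
    with bz2.open(
        "enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2",
        mode="rt", encoding="utf-8",
    ) as index:
        first = next(index)
        offset, page_id, title = first.rstrip("\n").split(":", 2)
        print(offset, page_id, title)

    # The big XML archive is a series of concatenated bz2 streams; bz2.open()
    # reads them back-to-back transparently (Python 3.3+), so the file can be
    # streamed without fully extracting it first.
    with bz2.open(
        "enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2",
        mode="rt", encoding="utf-8",
    ) as xml:
        for _ in range(5):
            print(xml.readline().rstrip())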

Using WinZip (trial version), I am able to extract the text files. However, when I try to extract the XML document from the articles file, WinZip says the file is corrupt and offers to save what it was able to extract. Regardless of which compressed XML file I try, it always saves roughly the same amount: approximately 3 KB.

I thought the problem might be the file size, so I compressed a 4 GB file of my own and extracted it, and that worked.

I'm not sure where to go with this.

  • Try downloading the file again. If the same problem occurs, try unzipping it with another program. Products I like are Bandizip and 7-Zip.
    – harrymc
    Commented Oct 17, 2022 at 9:35
  • Thank you very much! I downloaded 7zip and it worked! If you post your comment as an answer, I'll accept it.
    – Len White
    Commented Oct 17, 2022 at 14:01

1 Answer


Try downloading the file again.

If the same problem occurs, try unzipping it with another program.

Example products: 7-Zip and Bandizip.
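
As a side check, you can also confirm whether the download itself is intact before blaming the archiver: Python's standard-library bz2 module understands the concatenated multistream format, so a rough sketch like the following (filename taken from the question) will either decompress the whole file or fail with an error if the data really is corrupt or truncated:

    import bz2

    # Stream-decompress the archive without writing it out. A corrupt or
    # truncated download raises an error (OSError/EOFError) partway through,
    # instead of silently stopping after a few KB.
    total = 0
    with bz2.open("enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2", "rb") as f:
        while chunk := f.read(1 << 20):  # read 1 MiB at a time
            total += len(chunk)
    print(f"decompressed {total:,} bytes with no errors")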

  • 7zip worked, thanks!
    – Len White
    Commented Oct 17, 2022 at 15:00
