
My use case is that I need to parse text from Wikipedia articles. There is a dump available at https://dumps.wikimedia.org/enwiki/20221001/ that contains the files I want. Essentially, the articles are broken up into several pairs of compressed files: an XML document containing a subset of Wikipedia articles, and a text file containing index/metadata entries for that XML document. Typically, the XML documents run about 200 MB compressed, and the text files run less than 1 MB compressed.

For example, here's a pair of files on the dump page referenced above:

enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2 251.7 MB

enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2 221 KB
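
For context, this is roughly how I expect to read the pair of files above once I can get at them (a minimal Python sketch using the standard-library bz2 module; my understanding is that each index line is colon-separated offset:page_id:title, but that layout is an assumption on my part):

    import bz2

    # The index file is small: each line should look like "offset:page_id:title",
    # where offset is the byte position of the bz2 stream inside the big XML
    # archive that contains that page.
    with bz2.open(
        "enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2",
        mode="rt", encoding="utf-8",
    ) as index:
        first = next(index)
        offset, page_id, title = first.rstrip("\n").split(":", 2)
        print(offset, page_id, title)

    # The big XML archive is a series of concatenated bz2 streams; bz2.open()
    # reads them back-to-back transparently (Python 3.3+), so the file can be
    # streamed without fully extracting it first.
    with bz2.open(
        "enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2",
        mode="rt", encoding="utf-8",
    ) as xml:
        for _ in range(5):
            print(xml.readline().rstrip())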

Using WinZip (trial version), I am able to extract the text files. However, when I try to extract the XML document from the articles file, WinZip says the file is corrupt and offers to save what it was able to extract. Regardless of which compressed XML file I try, it always saves roughly the same amount: approximately 3 KB.

I thought the problem might be the file size, so I compressed a 4 GB file of my own and extracted it, and that worked.

I'm not sure where to go with this.

  • Try downloading the file again. If the same problem occurs, try unzipping it with another program. Products I like are Bandizip and 7-Zip.
    – harrymc
    Commented Oct 17, 2022 at 9:35
  • Thank you very much! I downloaded 7zip and it worked! If you post your comment as an answer, I'll accept it.
    – Len White
    Commented Oct 17, 2022 at 14:01

1 Answer


Try downloading the file again.

If the same problem occurs, try unzipping it with another program.

Example products: 7-Zip and Bandizip.
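
As a side check, you can also confirm whether the download itself is intact before blaming the archiver: Python's standard-library bz2 module understands the concatenated multistream format, so a rough sketch like the following (filename taken from the question) will either decompress the whole file or fail with an error if the data really is corrupt or truncated:

    import bz2

    # Stream-decompress the archive without writing it out. A corrupt or
    # truncated download raises an error (OSError/EOFError) partway through,
    # instead of silently stopping after a few KB.
    total = 0
    with bz2.open("enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2", "rb") as f:
        while chunk := f.read(1 << 20):  # read 1 MiB at a time
            total += len(chunk)
    print(f"decompressed {total:,} bytes with no errors")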

  • 7zip worked, thanks!
    – Len White
    Commented Oct 17, 2022 at 15:00
