61

I have around 100000 rows of data. If I store this data in a text file format, it takes more space than if I store it in an Excel file format. Why is that?

3 Answers 3

118

The xlsx format used by modern Excel is actually a compressed format. It's a ZIP archive that contains text (XML) files in a certain structure.

If you compress your plain text file with a similar ZIP compression tool, you should achieve similar file sizes.

Additionally, as mentioned by Bradley Uffner and Morgen in the comments, Excel will deduplicate identical strings and only store one copy of them. I'm not sure about the exact gains of such a method, and it will depend on your data set, but simple zip compression will probably get you most of the way there.1


9.1.3 Physical Packages

Each Office Open XML document is implemented as a ZIP archive.

ECMA-376-1:2016


1 My guess is that this deduplication is most effective when you have multiple worksheets, since zip compression applies independently to each file in an archive and only over limited sections of the data at a time - by storing all strings together in a single file, there should be some benefit to the later compression. More practically, if your plain text format is in a single file anyway then there'll probably be little difference.

10
  • 2
    This is great! It gives very good insight on how Excel files are handled. Thanks!
    – Dominique
    Commented Jun 1, 2017 at 9:34
  • 3
    @Dominique If you would like to know more, you can play around with one by unzipping it (e.g. with 7zip, or by renaming it so it ends in .zip). The spec is also available, but it makes for rather dry reading.
    – Bob
    Commented Jun 1, 2017 at 9:40
  • 3
    I think excel also uses a string dictionary, where bits of text that are identical between cells can be reused while only being stored once. Commented Jun 1, 2017 at 13:47
  • 1
    Based on the API exposed by the Apache ooxml library, a shared strings dictionary is at least probable.
    – Morgen
    Commented Jun 1, 2017 at 15:28
  • 2
    Shared Strings were a feature of Excel even back when it wrote XLS (no x) files, which were a record-oriented format called BIFF within an OLE container. If you compare old XLS specs with XLSX, you'll see XLSX is just a asciified xmlified zipped version of BIFF. So bascially MS didn't invent the shared string feature for XLSX, where it doesn't make much sense due to compression; they just went the easiest route to turn XLS into XLSX. Commented Jun 1, 2017 at 19:39
3

The answer given is correct, it is due to Excel storing your data as xml. It is also due to this, that sorting your data efficiently will also reduce the file size. Test it yourself - say you have data like

A            B                              C
John         Smith-Johnson-Williamson       12345
Sally        Smith-Johnson-Williamson       67890
John         Williams                       34567

If you sort by C (a column with all or almost all unique values) only, then the identical values of B will not be adjacent. In Excel's xml it looks like this:

<12345><John><Smith-Johnson-Williamson>
<34567><John><Williams>
<67890><Sally><Smith-Johnson-Williamson>

If you sort by B (a column with common values), then the identical values are adjacent. In Excel's xml it looks like this:

<Smith-Johnson-Williamson><John><12345>
  <Sally><67890>
<Williams><John><34567>

Because that long string is identical and adjacent, Excel knows it can pack them together, similar to when people write lists, and to repeat part of the above line, they type quotes rather than re-write the same thing. I did not find any evidence of a shared-string dictionary in my investigation - just this indentation in place of the repeated field value.

I had mailing lists of 250,000 customers across only 11 states, and on each record there was a field that was one of two strings identifying the offer they get. Our employees were for some reason accustomed to looking people up by their street address as spoken, so it was sorted on the street number column, then the street name, city, etc... when I resorted by the offer first, then state, zip code, city, street name, street number, and finally address-line-2, the file size was reduced incredibly. I examined the unpacked xml on the file sorted each way to see what was going on, and the above is what I deduced. If there are multiple fields with more than say 5 characters, but the values are of a limited set (say, ticket dispositions like 'resolved','rejected','approved',etc), then give some thought to sorting and see if it helps.

1
  • Hello CompanionCube, This is extremely awesome knowledge. !!! Thank You So Much.
    – user734178
    Commented Jun 10, 2017 at 6:39
-3

If you have a number like 3.14159265359, you need 13 bytes to store this in a text file; if you store this number as a float, you need only 4 bytes.

13
  • 19
    While this might've been the case with xls (BIFF), I don't think xlsx does this. xlsx stores all data in XML files, which do not perform binary encoding of numerals - they're converted to and stored as text strings. I'm still trying to hunt down the relevant section of the specification, but my empirical testing (i.e. extracting a saved document) shows that 1.123 is literally stored as a 5 character string in XML.
    – Bob
    Commented Jun 1, 2017 at 7:36
  • 14
    ECMA-376-1:2016 § 18.3.1.4 c (Cell) shows that cells store their data in v tags, as defined in ECMA-376-1:2016 § 18.3.1.96 v (Cell Value) where they say The possible values for this element are defined by the ST_Xstring simple type (§22.9.2.19). — they provide examples such as <v>28086.3541666667</v> where the data is clearly stored as a string.
    – Bob
    Commented Jun 1, 2017 at 7:40
  • 4
    @gerrit Unfortunately Excel does interpret numbers as floats within the program, leading to interesting problems like phone numbers dropping the last few digits. It just doesn't save them to disk as floats (anymore?).
    – Bob
    Commented Jun 2, 2017 at 0:46
  • 2
    @Bob this is also the case for xlsb, which is highly recommended for huge sheets
    – phuclv
    Commented Jun 2, 2017 at 12:16
  • 2
    I don't know why this answer gets so many downvotes. The OP didn't specify which Excel format so technically this is also a valid answer
    – phuclv
    Commented Jun 2, 2017 at 13:48

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .