36

As a web developer I have very little understanding of binary data.

If I take the sentence "Hello world.", convert it to binary, and store it as binary in an SQL database, it seems like the 1s and 0s would take up more space than the letters. It seems to me like using letters would sort of be like using compression, where one symbol stands for multiple.

But is that really how it works?

Does storing plain text data take up less space than storing the equivalent message in binary?

6
  • 129
    You do not know the absolute minimum that every developer must know about character encoding. Fortunately the founder of this site wrote you an article. Read it before you program again. joelonsoftware.com/2003/10/08/… Commented May 26, 2017 at 22:01
  • 16
    @EricLippert A great read and I'm better off as a result thank you.
    – john doe
    Commented May 27, 2017 at 1:35
  • 4
    I recommend also utf8everywhere.org Commented May 27, 2017 at 3:04
  • 2
    Being a web developer isn't an excuse not to know how character encoding and binary data work. You really need to brush up on your skills...
    – T. Sar
    Commented May 29, 2017 at 11:46
  • 2
    Ho ho, easy. He had a different entry point alright? When I started messing with computers there was no way around bits and bytes and ASCII so you picked that up. Today you can start making web sites straight away using powerful tools. If you can browse you can click together a website. And that is not necessarily bad, automation finally paid off. I don't think his first problem is with encoding though, he does not have the right idea about bits and bytes yet so I would start with those. Commented Aug 21, 2020 at 16:55

5 Answers

138

Plaintext is binary.

When you write an H to a hard drive, the write head doesn't carve two vertical lines and a horizontal line into the platter; it magnetically encodes the bits 01001000¹ into the platter.

From there, it should be obvious that storing plain text data takes up exactly the same amount of space as storing binary data.
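
As a quick check, here is a minimal Python sketch that prints the bits behind the letter H. It assumes UTF-8, and the variable names are only illustrative.

```python
# Encode the letter H as UTF-8 and show the bits that would be stored.
text = "H"
raw = text.encode("utf-8")                       # a bytes object of length 1
bits = " ".join(f"{byte:08b}" for byte in raw)   # each byte as 8 binary digits

print(len(raw))   # 1
print(bits)       # 01001000
```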

But plaintext is just one² particular binary format

Plaintext can be reversibly transformed into other binary formats. One common transformation is compression which usually results in a more compact representation, meaning fewer bits used to represent the same information.
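
A small sketch of such a reversible transformation, using Python's standard `zlib` module. The sample text is made up and deliberately repetitive, which is why it shrinks so well; very short inputs can come out larger because of the format's overhead.

```python
import zlib

# Highly repetitive sample text, chosen only for illustration.
original = ("Hello world. " * 100).encode("utf-8")
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # the compressed form is much smaller here
print(restored == original)            # True: nothing was lost
```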

Depending on what you're using the plaintext to represent, you may be able to use different binary formats to represent the same information. This may use more space, it may use less.

For example, the numbers 5 and 1234567 could be represented in plaintext using digit characters, resulting in these bit sequences on disk³:

00110101 00000000
00110001 00110010 00110011 00110100 00110101 00110110 00110111 00000000

Alternatively, you could use 32-bit two's complement:

00000000 00000000 00000000 00000101
00000000 00010010 11010110 10000111

This is a less compact representation of 5, but a more compact representation of 1234567.
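
The same comparison can be reproduced with a short Python sketch. The helper names `as_text` and `as_int32` are hypothetical, and big-endian byte order is assumed to match the bit sequences shown above.

```python
import struct

def as_text(n: int) -> bytes:
    """Digit characters followed by a null terminator."""
    return str(n).encode("ascii") + b"\x00"

def as_int32(n: int) -> bytes:
    """32-bit two's complement, most significant byte first."""
    return struct.pack(">i", n)

for n in (5, 1234567):
    print(n, len(as_text(n)), len(as_int32(n)))
# 5 2 4
# 1234567 8 4
```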

And there are literally infinitely many other representations, with varying levels of compactness and flexibility, although in practice far fewer are actually used.


1 Assuming UTF-8. The exact sequence of bits for a character depends on which specific encoding you're using.

2 Or really, several formats, given the various encodings.

3 If you're wondering what those eight zeros on the ends are, well, you need some way of knowing how long the data is. The options basically boil down to a marker (I used this, via a null byte), space dedicated to storing the length (Pascal used a byte to store the length of a string), or a fixed size (used in the subsequent two's complement example).
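
To make the three options in footnote 3 concrete, here is a rough Python sketch. The sample message and variable names are illustrative only.

```python
import struct

message = "Hello".encode("ascii")

# 1. Marker: append a null byte that signals the end of the data.
null_terminated = message + b"\x00"

# 2. Length prefix: one byte holding the length (so at most 255 bytes of data).
length_prefixed = bytes([len(message)]) + message

# 3. Fixed size: always 4 bytes, so neither a marker nor a length is needed.
fixed_size = struct.pack(">i", 5)

print(null_terminated)   # b'Hello\x00'
print(length_prefixed)   # b'\x05Hello'
print(fixed_size)        # b'\x00\x00\x00\x05'
```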

10
  • 6
    One slight difference is the representation of End-of-line, which in Unix/binary takes one byte (LF) while in Windows/text takes two bytes (CR-LF). Commented May 26, 2017 at 17:35
  • 102
    +1 for "the write head doesn't carve two vertical lines and a horizontal line into the platter." Commented May 26, 2017 at 18:09
  • 2
    @BaardKopperud There is/was LightScribe, but that wasn't really meant for computer reading, though perhaps something like Google Goggles could read some LightScribe labels. But doing that on the actual data storage side would be pretty interesting. Reminds me of songs that have fancy graphics when run through an oscilloscope.
    – 8bittree
    Commented May 26, 2017 at 21:23
  • 2
    @TulainsCórdova Though actually, Turing machines operate on an arbitrary alphabet, so they in theory could write letters onto the tape. It just so happens we've settled on using a two-symbol alphabet.
    – gardenhead
    Commented May 27, 2017 at 4:14
  • 2
    @8bittree oh, I agree completely! I was trying to extend the humorous metaphor that if HDDs could, in fact, carve glyphs onto the platter, that SSD would have to somehow employ the same "carve" API to the controller, but would do something different entirely. Commented Aug 21, 2020 at 15:15
20

I find this a great fun thing to think about. Binary is not 1s and 0s in the way you talk about it.

Imagine there is a quantity, I can tell you what quantity it is in many different ways:

  • Nine in English
  • Neuf in French
  • 9 in Arabic numerals
  • IX in Roman numerals
  • 1001 in Binary with Arabic numerals
  • on off off on in Binary with on/off
  • high low low high in Binary represented with voltages or levers or water levels or electric charge ... or English words 'high' and 'low'

They all represent the same thing. The point here is that binary is not 1s and 0s; that's only one way to represent a value.
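
A few of the spellings from the list above can be produced from one and the same quantity in Python. This is only a sketch; the on/off rendering is my own illustration of the idea.

```python
quantity = 9

binary_digits = format(quantity, "b")
on_off = " ".join("on" if bit == "1" else "off" for bit in binary_digits)

print(quantity)        # 9             (Arabic numerals)
print(binary_digits)   # 1001          (binary, written with Arabic digits)
print(on_off)          # on off off on (binary, written as on/off)

# Parsing the different spellings gives back the same quantity.
print(int("1001", 2) == int("9", 10) == quantity)   # True
```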

When you talk of converting a H into binary, you probably imagine seeing 10101010 on screen - but that's not "binary", that's one digit for each binary bit.

Yes, if you converted H to "binary" as people normally talk about it, and then represented that in Arabic digits and then stored it, it would take more space in the same way that converting H to aitch takes more space.

But if binary is just one way of representing a quantity, then by that logic you could say: "If I converted H to binary and represented it as high low high low high low high low, it would take 35 characters! That's even more than 10101010!" But these two are both 'binary', so how is one bigger than the other?

The other side of this is to wonder how H is stored by a computer, and to see that H is itself just a way of representing a quantity - the same quantity 72, 01001000, or seventy two or ASCII character code H. Which is 8bittree's answer that plain text is binary, but this is me trying to show what that means.

So you get a bit pattern in a computer 01001000 and what does it mean? Anything - could be talked about as a number, as a part of a zip file, as a character, depends what the intent of the person who created it was. If you know it's supposed to be plain text, then it came from a character encoding H -> 01001000 and you look it up the other way in the character encoding table - ASCII, UTF-8, shift-jis, etc. and find the right font character and out comes a H or whatever. Or out comes the wrong character if you use a different encoding lookup than the person who created it used. This is @Eric Lippert's link.
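
Here is a small Python sketch of both points: one bit pattern read in two ways, and the wrong characters appearing when the reader uses a different encoding than the writer. The choice of "é" and Latin-1 is just an example.

```python
raw = bytes([0b01001000])

as_number = raw[0]                 # read the byte as a number
as_letter = raw.decode("ascii")    # read the same byte as ASCII text
print(as_number)   # 72
print(as_letter)   # H

# Written as UTF-8, then read back with a different encoding table.
written = "é".encode("utf-8")          # two bytes: 0xC3 0xA9
misread = written.decode("latin-1")
print(misread)     # Ã©
```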

But as I write this, and as you think about it, H is one byte and 01001000 is 8 bytes, yes that's more space. And yes it's (a representation of) binary. But it's at a higher level of abstraction than the computer is using - binary displayed in ASCII characters, where each character is represented behind the scenes with a binary bit pattern, each as big as the H alone.
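
The difference in size can be checked directly. In this Python sketch (names are illustrative), the string of digit characters occupies eight bytes because each digit is itself stored as a full character.

```python
letter = "H"
digits = format(ord(letter), "08b")      # the eight-character string "01001000"

letter_bytes = letter.encode("ascii")
digit_bytes = digits.encode("ascii")

print(len(letter_bytes))   # 1
print(len(digit_bytes))    # 8
# Each digit character has its own bit pattern: '0' is 0x30 and '1' is 0x31.
print(" ".join(f"{b:02x}" for b in digit_bytes))   # 30 31 30 30 31 30 30 30
```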

14

"Does storing plain text data take up less space than storing the equivalent message in binary?"

No, never.

Your computer already stores the plain text data in the equivalent binary representation. Storing something as plain text versus binary just signals how the computer should interpret that identical binary stream.

"It seems to me like using letters would sort of be like using compression, where one symbol stands for multiple."

That is kinda true. One character will represent more than one bit. The problem is that they're different sized things. It only takes one bit to store a 1 or a 0, but 8 bits (or more) to store a plain text character. You don't gain anything by using characters.

If anything, you can compress things the other way. After all, 8 bits is 256 different possible values, yet plain text usually is limited to letters, numbers, and a few punctuation characters. It doesn't need as many bits as it takes.
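
As a rough illustration of compressing "the other way", the sketch below packs text into 5 bits per character. It assumes a made-up 32-symbol alphabet (lowercase letters plus a few punctuation marks), and the `pack` and `unpack` helpers are hypothetical; the character count has to be stored separately.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,!?'"   # 32 symbols, so 5 bits each

def pack(text: str) -> bytes:
    """Pack text over ALPHABET into 5 bits per character."""
    value = 0
    for ch in text:
        value = (value << 5) | ALPHABET.index(ch)
    bit_count = 5 * len(text)
    return value.to_bytes((bit_count + 7) // 8, "big")

def unpack(data: bytes, length: int) -> str:
    """Reverse pack(); the number of characters must be known."""
    value = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[value & 0b11111])   # lowest 5 bits = last character
        value >>= 5
    return "".join(reversed(chars))

packed = pack("hello world.")
print(len("hello world.".encode("ascii")), len(packed))   # 12 8
print(unpack(packed, 12))                                 # hello world.
```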

3
  • 3
    Well, maybe sometimes :-) Two possible cases I can think of. 1) You have a short text string which you compress. The compressed file contains some metadata, which makes the compressed file larger than the original string. 2) You have some floating point values, say 1.2. Storing as text would be 3 bytes (4 with a terminator), while storing a binary double would take 8 bytes.
    – jamesqf
    Commented May 26, 2017 at 20:44
  • 5
    The answer really depends on what you mean by 'binary.' For example, UTF-32 takes up four times as much space as ASCII, so if by 'plain text' you meant ASCII, and by 'binary' you meant UTF-32, plain text would take less space than binary. But you can reverse the definitions and get the opposite result. Commented May 26, 2017 at 21:13
  • 1
    @DavidConrad Well, that just skirts on the "there's no such thing as plain text". The closest thing you have is a binary file with no metadata/headers identifying the type and guessing "must be text encoded as XXX!". There has been a time when "plain text file" meant something reasonable, in a limited context, but it doesn't really anymore. The best you can get is "all the data in the file is encoded as text" in contrast with "some/all parts of the data are not encoded as text".
    – Luaan
    Commented May 29, 2017 at 7:40
1

Well, this is a challenge... Let me try and build it up for you. You have heard of bits (ones and zeros). First, try not to think of those as symbols or characters or even digits; think of them as states. Mapped to a technical context, that means switches. A computer is a bunch of switches.

You cannot express much with single binary states (yes/no, on/off, that's it) so you need to combine them in groups. Like you need words in natural language to say something meaningful, you cannot say much with just the language's basic building blocks, the letters.

So in computers the words are bytes: groups of 8 bits. Why 8? There are technical and economic reasons for this that we do not need to get into now.

You can tell there are 2^8 = 256 different combinations of 8 bits. That is plenty for a basic western alphabet. Now all you need is a convention to map letters to those bytes. An early such convention is ASCII. This defines only half of the space by the way (it uses just 7 of the 8 available bits) but that does not matter, what matters is we have a way to express characters as bytes.

A convention like ASCII is called an encoding.

What you see on the screen (the characters) are views. Pretty representations of letters. Try to keep thinking switches. The computer ultimately stores your text as switch states, be it in volatile memory or on disk or whatever medium. A storage medium is an addressable array of switches each of which can only be on or off.

Now, size. The size (the number of bytes needed) will depend on the encoding used.
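
For instance, the sentence from the question takes a different number of bytes in different encodings. A minimal Python sketch follows; the little-endian variants are chosen so that no byte order mark is added to the count.

```python
sentence = "Hello world."

sizes = {
    encoding: len(sentence.encode(encoding))
    for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}

for encoding, size in sizes.items():
    print(encoding, size)
# ascii 12
# utf-8 12
# utf-16-le 24
# utf-32-le 48
```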

You loosely use terms like binary format and plain text. Let's address those.

You speak of "binary formats". I am not sure what you mean by that. When I hear people say "it's a binary format" they typically mean they can't read it, when they open the file in a text editor they see just gibberish. The reading program does not know about the encoding. To "store something in binary" just means encode it in a way not obvious to the user but it could be anything.

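Plain text, in the sense used here, means ASCII-encoded text: every byte represents one character, nothing more, nothing less. (Other text encodings, such as UTF-8, may use several bytes per character.)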

We talked about text. Another important type of data is numbers. I do not mean the characters 0123456789, I mean the kind of numbers you can add up. Those are encoded differently from text, precisely because of the need to do calculations with them. The computer does not do calculations with the characters 0123456789.

Anyway, the important thing to realize is that everything that is stored on a computer is stored as bytes. What the individual bytes represent depends on the application. They can be (parts of) numbers, (parts of) characters or custom data like gender, color or a date. The term "binary" does not mean a lot in this context (ultimately everything in a computer is binary). It typically means "encoded in some unspecified way".

So, can binary be smaller than plain text? Yes! It can be bigger too; it all depends on your encoding. Suppose you have some chat bot that knows a number of sentences, say 10,000. If you assign each sentence a number, you only need two bytes to identify each one (two bytes can distinguish 65,536 values), regardless of how long each sentence is. So the "binary" version of a chat log of two bots talking to each other could be very small indeed.
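
A sketch of that chat log idea in Python. The sentence table and the `encode_log` / `decode_log` helpers are invented for illustration, and both sides are assumed to share the same table.

```python
import struct

# Hypothetical shared sentence table; a real bot might know thousands.
sentences = ["Hello.", "How are you today?", "Fine, thanks. And you?"]

def encode_log(indices):
    """Store each sentence as a 2-byte unsigned index."""
    return b"".join(struct.pack(">H", i) for i in indices)

def decode_log(data):
    """Look each 2-byte index up in the shared table."""
    return [sentences[i] for (i,) in struct.iter_unpack(">H", data)]

log = [0, 1, 2, 1]
binary_log = encode_log(log)

print(len(binary_log))         # 8 bytes for four sentences
print(decode_log(binary_log))
```

The saving only exists because the table itself is stored elsewhere; the log is meaningless without it.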

I hope this all makes sense.

-2

The 'binary representation' is just a (one) popular/accepted method/way of encoding/storing (text) data. With the current ASCII/Unicode encoding, data stored as plain text takes up more space than stored as what's commonly referred to as a binary file.

A (plain) text file is a binary file; it's stored in a computer as a sequence of 0s and 1s (binary means two values). Most people use "binary files" to mean files that contain more than just printable characters (letters, numbers and symbols), e.g. mp3 or DVD files.

The sequences (of 0 and 1) by themselves don't have meaning; they are interpreted by different software or people to mean something eg. 41 could mean the letter A in hex, or 65 could mean letter A in decimal. To some people, the letter A forms part of the alphabet, to musicians the letter A forms a musical note and is represented by sound of some frequency, instead of writing some shape on a paper.

A green leaf, for example is a result of a genome sequence, which is a sequence of chromosomes, which in turn is a result of R/DNA sequences, which are sequences of nucleotides, which are sequences of nucleobases (G/C/A/T/U enclosed in sugar & phosphate, like delimiters to separate words in a sentence or null bytes to separate files on a disk), which are sequences of organic/chemical elements/molecules eg. carbon, nitrogen, oxygen, hydrogen, which are sequences of (different numbers) of protons, electrons, ions (qubits) which are the subject of a new type of computer called Quantum Computer.

Text data in English is made of the 26 letters of the alphabet (a base 26 information system), 10 digits (base 10 or decimal system) and symbols. Each of them has an 8-bit representation in a computer (base 256 encoding). A higher-base encoding takes up less space to represent the same data, e.g. A (one digit) in hex (base 16 encoding) takes up less space than 10 (decimal/base 10 system); both represent the same value, ergo binary files (base 256 encoding) take up less space than text files containing the same information.
