I'm trying to understand the full story of how text ends up on the screen. To keep things simple, I'm sticking with single-byte encodings (no Unicode).

On my disk there is a sequence of bytes, each with a value between 0 and 255. I can then tell my computer programs which character encoding they should use to display these bytes. I could use ISO-8859-1, where, for example, the byte with value 0xA4 is a circle with four dots (¤). Or I could switch to ISO-8859-15; then my byte with value 0xA4 is defined to be the Euro symbol (€).
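For example, Python's built-in codecs make the two tables visible (this only shows which abstract character each table assigns to the byte, not how anything gets drawn):

# The same byte means two different things under two declared encodings.
raw = b"\xa4"

print(raw.decode("iso-8859-1"))    # ¤  (currency sign)
print(raw.decode("iso-8859-15"))   # €  (euro sign)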

This is all still simple to understand. But in parallel to changing the character encoding, I can also change the font to define the exact shape of a symbol. Now, a font is meant to work with all character encodings, so a font should contain both symbols: ¤ and €.

So the steps to get text onto my screen are presumably the following (a rough sketch of the first two steps follows the list):

  1. Read the byte sequence serially
  2. Use the numeric value of the current byte to look it up in the character encoding table
  3. Use [something] to look up the exact shape of the symbol found in step 2 in the font file
  4. Draw the symbol as defined in the font file
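In code, steps 1 and 2 might look roughly like this, with a tiny hypothetical lookup table standing in for the full encoding table; step 3 is exactly the part I don't understand:

# Hypothetical, partial ISO-8859-15 table: byte value -> character.
ISO_8859_15 = {0x41: "A", 0xA4: "€", 0xBC: "Œ"}

def show(byte_sequence, table):
    for byte in byte_sequence:   # step 1: read the bytes serially
        char = table[byte]       # step 2: look up the byte in the encoding table
        # step 3: ??? -- how does `char` select a shape in the font file?
        # step 4: draw that shape
        print(hex(byte), "->", char)

show(b"\x41\xa4", ISO_8859_15)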

In step 3, what is this "something" that is used to map the character encoding to the font? Do font files depend on the character encoding? So, does a font have some built-in "double switch" mechanism that works like this (pseudocode)

get_symbol(code, encoding) {
  switch (code) {
    case 0xA4: switch (encoding) {
      case 'ISO-8859-1' : return '¤';
      case 'ISO-8859-15': return '€';
    }
  }
}

?

What are the details of how to get from a given byte sequence and a given character encoding to the actual symbol in the font? How is this mapped so that the correct symbol is always shown?

2 Answers

Font files are designed for a particular encoding. The program using a given font has to assume that a value n in that encoding is displayed by rendering glyph number n from the font.
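A minimal sketch of what that amounts to, with a hypothetical glyph array standing in for the real file format:

# Hypothetical legacy-style font: a flat array of glyph shapes, indexed
# directly by the code value of the encoding the font was designed for.
class SingleByteFont:
    def __init__(self, glyphs):
        self.glyphs = glyphs        # glyphs[n] is the shape drawn for code n

    def glyph_for(self, code):
        return self.glyphs[code]    # value n -> glyph number n, nothing more

# An ISO-8859-15 font simply has the € shape stored at position 0xA4,
# while an ISO-8859-1 font has the ¤ shape there; the program has to
# pick a font that matches the encoding of the text.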

Font files need not have glyphs for all possible values of a given character encoding (for Unicode it is rare for a font to cover the whole range), nor need they start with the first value from the encoding (usually the control characters are omitted). There are different file-format schemes for specifying the starting point, ending point and omitted glyphs which are used to keep font file-sizes manageable.
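A sketch of such a scheme, with hypothetical field names (real formats such as BDF/PCF or the old Windows font resources each define their own layout):

# Hypothetical compact font table: only codes first_char .. first_char +
# len(glyphs) - 1 are stored, and a default glyph stands in for anything
# missing or out of range.
class CompactFont:
    def __init__(self, first_char, glyphs, default_glyph):
        self.first_char = first_char
        self.glyphs = glyphs
        self.default_glyph = default_glyph

    def glyph_for(self, code):
        index = code - self.first_char
        if 0 <= index < len(self.glyphs) and self.glyphs[index] is not None:
            return self.glyphs[index]
        return self.default_glyph   # omitted or out-of-range code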

From the example given, the OP is likely using the X Window system. There is more than one file-format used, with corresponding different ways they are accessed. The principal ones are XLFD (older) and fontconfig (newer). With other systems (Microsoft Windows), other APIs are used (the LOGFONT structure is a good starting point). OSX is another example, with its own API (CoreText).
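As an illustration, the encoding is visible directly in an XLFD font name: the last two fields are the charset registry and encoding (whether a given variant is actually installed depends on the system's font packages):

# The classic fixed font, requested in two different encodings; only the
# last two XLFD fields (charset registry and encoding) differ.
xlfd_latin1  = "-misc-fixed-medium-r-normal--13-120-75-75-c-70-iso8859-1"
xlfd_latin15 = "-misc-fixed-medium-r-normal--13-120-75-75-c-70-iso8859-15"

registry, encoding = xlfd_latin15.split("-")[-2:]
print(registry, encoding)   # iso8859 15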

Those of course are for graphical interfaces. Fonts are more widely applicable than that: Linux and the BSDs, for instance, allow one to specify different console fonts, which in addition to the encoding run into limitations on the number of glyphs that are usable.

The app drawing the text specifies a font in the text drawing APIs it's using, or if it doesn't specify, a system default font is used.

Unicode-based text drawing systems often have a font substitution algorithm to find a font that contains a certain glyph if the specified font doesn't have the glyph requested. But pre-Unicode systems generally just fail to draw a glyph or draw a "missing glyph" glyph. Even Unicode-based systems sometimes draw a "missing glyph" symbol.
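For the Unicode-based case, a sketch of that kind of fallback, using hypothetical font objects with a character-to-glyph map (real APIs such as fontconfig, DirectWrite, or Core Text implement their own substitution rules):

# Hypothetical lookup: try the requested font, then each substitute font,
# then fall back to the "missing glyph" (conventionally glyph 0, ".notdef").
def pick_glyph(char, requested_font, fallback_fonts):
    for font in [requested_font, *fallback_fonts]:
        glyph = font.cmap.get(ord(char))    # character -> glyph index, if present
        if glyph is not None:
            return font, glyph
    return requested_font, 0                # draw the missing-glyph box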
