21

In the source code of the 1972 Pascal compiler (a very large OCR-ed PDF), there are declarations of variables and record fields of type ALFA, which are "packed arrays" of 10 characters. Characters were 6 bit wide on the CDC 6600, and 10 of them would fit in a 60-bit word, implementing a form of small string optimization: they could be passed by value in registers, hashed and compared efficiently, etc., which conferred a substantial speedup on computers with word-oriented architectures, and a similar technique could be beneficial even today.

The 10 character array type named ALFA survives even today, although without the benefit of the optimization.

In the high-level programming language context, a data type is defined by its representation and the set of allowed operations. This makes the ALFA type distinct, and not a mere shortcut for a small array of char, as its set of allowed operations is wider than the set of allowed operations for any other array type: for example, Pascal ALFA values can be compared for equality and ordering (page 17 of the PDF).

The novelty here is adding to a programming language a data type which combines characteristics of an array (individual characters of a variable of the ALFA type could be accessed by index if an implementation fully supported packed arrays), and of a scalar: variables of type ALFA could be passed as parameters to subroutines and returned from subroutines, whereas passing arrays by value or returning them was not allowed, and, at least in some implementations, they could be used in case statement expressions.

Contemporary implementations of small-string optimization hide the actual "small string" object inside a variant class. As Pascal had variant records, it was fully equipped.

Was Niklaus Wirth the first to invent the "small string" type, in the sense of making it a full-fledged predefined type in a programming language rather than an ad-hoc programming trick, or a pre-existing programming language had a similar data type?

20
  • 3
    Not sure what the question is targeting here, as holding characters in a machine word comes natural already way before there is any PASCAL or other compiler. 6 characters for a 36 bit 7090, 8 in a 48 bit CDC 1604, 10 in a 60 bit CDC 6600 or 4 (8 bit) in a 32 bit IBM /360 (The main reason why most identifiers in the IBM world are 4 or 8 characters in length). Similar with smaller systems like 3 chars in 18 bit PDP. In fact, doing single char operations (or variable length) was rather cumbersome in many of these early systems.
    – Raffzahn
    Commented Jan 23, 2022 at 23:04
  • 3
    Indeed. TOPS-10: filenames one machine word, command names 1 machine word, which was 6 chars.. RSX-11D/M filenames 2 machine words (3 chars was a bit limiting even on a mini) + 1 machine word extension. VAX/VMS. command names significant to 1 machine word, 4 chars.
    – dave
    Commented Jan 23, 2022 at 23:27
  • 1
    @Raffzahn Two phrases highlighted in bold should provide a clear answer what the question is targeting.
    – Leo B.
    Commented Jan 24, 2022 at 1:31
  • 5
    Small string optimization would be to not have a separate string type for small strings, but let the compiler optimize for small strings that fit in some minimal required storage. Even standard c++ string libraries do that nowadays, which must mean that the “trick” is very old indeed. ;-)
    – WimC
    Commented Jan 24, 2022 at 6:15
  • 1
    This is the same as FORTRAN's Hollerith constants, which were in use before the 1966 ANSI standard.
    – scruss
    Commented Jan 24, 2022 at 12:10

2 Answers 2

17

TL;DR:

ALFA is a predefined machine specific type (like all Pascal types), that happens to be 10 on CDC 6600, but may have any other value on other machines. ALFA has been added to allow machine specific character handling - the same way the keyword packed was added for the 6600 for this purpose. Without this storage would not only be rather wasteful, but also any interfacing to OS or I/O be rather painful.

Wirth notes in section 6.1.1 Standard Scalar Types of his 1971 The Programming Language Pascal:

alfa the values are sequences of n characters, where n is an implementation dependent parameter. [...] Alfa quantities may be regarded as a packed representation of short character arrays.

(Emphasis added)

Section 14 Pascal 6000 (p.48) describes implementation specific amendments for the CDC 6000 series. Among implementation specific definition of integer, real and char, ALFA is noted as

the number of character packed into an alfa value is n=10


The Long Read:

There seems to be a tiny misalignment between the understanding of word oriented architecture vs. word addressing, as well as how a language is derived or that Pascal is a machine independent language.

Pascal is not machine independent

Pascal is by definition machine/implementation dependent. To cite section 6.1.1 and section 14 of The Programming Language Pascal:

  • integer
    • 6.1.1 : the values are integers within a range depending on the particular implementation.
    • 14 : is defined as type integer is -2^48+1 .. 2^48-1.
  • real
    • 6.1.1 : the values are a subset of real numbers depending on the particular implementation.
    • 14 : is defined according to the CDC 6000 floating point format.
  • char
    • 6.1.1 : the values are a set of characters depending on the particular implementation.
    • 14 : is defined by the CDC 6000 display code character set.

Section 10.1.3 defines in addition pack and unpack as standard procedures for character transfer into and from ALPHA arrays. Both are of course implementation specific with implied machine specific length values.

Pascal is not a greenfield development

It's always helpful to keep in mind that languages are usually not a greenfield development, but meant to run on real world machines with real world usage. This is especially true, as Wirth explicitly wanted to create an every day language suitable for engineering as well, all the way down to OS development. Supporting machine specific details (like packing) is essential.

Word architecture vs. Word addressing

Today's machines are usually word oriented but byte addressing. That means they work great with words, but can access each byte directly. A feature essentially made canon by the IBM /360 (*1). A machine with 32 bit word architecture, but memory addressing in bytes. Here character strings can be accessed by direct addressing.

The CDC 6600 is not only word oriented, but also word addressing (*2). This means the smallest addressable memory element is a (60 bit) word. To access smaller units, like characters, they have to be (un)packed.

No programmer of a 6600 would store a single character per word, maybe except for intermediate states. Strings were always stored as characters 10 per word.

Character handling was (usually) based on CDC Display Code, a custom 6 bit code, which was already used by the predecessor 48 bit machines (CDC 1604 and 3000 series with 8 characters per memory word. There is no way to address a single byte (or character). To access a single character it must be extracted by shifting (LXi/AXi) and masking (MXi). See Section 3.14 Character Manipulation on page 93..101 of Assembly Language Programming for the Control Data 6000 and Cyber series.

Other than the question implied, using 'packed' character arrays isn't an 'optimization', but the natural way for CDC machines. The idea to use a word per character (resulting in 90% wasted storage) would have been absolute alien to any programmer back then - much like it still is today. All CDC programming languages, Assembler, Fortran or COBOL used this, as they did so already before the 6600.

Adding ALFA as a dedicated type to Pascal as well came natural - same way as in CDC COBOL on 6600 or prior machines.

The concept of character, like offered by Pascal, is not native to this machine type. Character handling thus is ugly and comes with lots of inelegant code. Alfa offers a machine specific extension to a simplify handling of the basic character storage unit and allows easy interfacing to other components as well as I/O.

For practical use and the CDC 6000 ALFA is a shortcut for array [1..10] of char, as that's the most common usage of a character array on a CDC 6000. The very same way as STRING is a shortcut for array [1..80] of char. Quite handy when processing punch cards, isn't it?

Having ALFA is simply the way to define a machine specific storage solution in an independent way and postponing relevant decision to the implementation. While this makes program using ALFA not-portable, it does encapsulate everything around machine specific arrays in a way easy detected. There is in fact a nice article by Tannenbaum comparing Pascal to Algol 68 where he points out that pascal tries to support common handling for such formats.

Finally having ALFA in later implementations stand as well for 10 characters can be seen as compatibility service to ease porting older sources to newer machines. Same way as 'packing' is supported by modern compilers a straight copy operations.

Bottom Line: There is no dedicated 'invention', but acknowledging the architecture on source level, making programs fit and easy to read.


*1 - Others did as well, but it was the /360 that established today's view of bytes and words.

*2 - Notable here, the 6600 not only came before the /360 took off, but it's way of character handling was even more 'ancient' as it was inherited from it's predecessors.

Also, let's be honest the CDC was never made for business application - heck, many even employed other computers for the 'dirty' part :))

17
  • 1
    Mentioning Assembler in a response to a question about types does not make sense. If CDC Fortran had a similar type, how did it look? Could values of that type be compared with .LT.?
    – Leo B.
    Commented Jan 24, 2022 at 9:36
  • 1
    @LeoB. All types are machine types and assembler does handle them like any other language.
    – Raffzahn
    Commented Jan 24, 2022 at 14:48
  • 5
    In context, the meaning of "type" should be understood within the type system of a programming language. Your answer beats around the bush, not answering the question as asked, but simply dumping everything you know related to familiar keywords in the question.
    – Leo B.
    Commented Jan 24, 2022 at 16:47
  • 2
    @LeoB. ??? Wow, that's a real hard pull. The full title is mentioned before, and as usual not repeated. Likewise the apposite definition ALFA is not repeated - as both has been in detail (including description what these sections are about) have already been given. Serious, reading an answer only half ways just to construct some claim is so cheap , I had no idea, one could get that low with constructed claims, just to avoid accepting that you're looking for something that isn't there?
    – Raffzahn
    Commented Jan 24, 2022 at 20:36
  • 3
    @Raffzahn In the future, if your first reaction to a question is "Not sure what the question is targeting here", please refrain from attempting to answer.
    – Leo B.
    Commented Jan 25, 2022 at 0:19
9

Before Pascal, there was Algol 68, which includes the data types bits and bytes, fixed-length array-like entities holding bool values or char values respectively, and which were expected to be implemented 'efficiently', i.e., probably as one machine word.

(Naturally, long bits and long bytes were possible, at the implementor's discretion)

Wirth was, of course, one of the original contributors to the design of Algol 68, before he (and others) resigned in objection to the direction the language was taking. Thus the direction of information flow is not clear, even though Pascal post-dates Algol 68. Wirth's earlier Algol W had a BITS type but not an explicit "machine word of characters" type.


Before all this, FORTRAN IV on the IBM 7090/7094 under IBSYS did not actually have a string data type, but a FORMAT of 'An' used on a READ would cause the next n characters to be read into a REAL or INTEGER variable (one machine word), padding or truncating to exactly 6 BCD characters. I think about all you could do with the data would be to eventually print it out again.

I think the FORTRAN case does not really answer the question, not being related to array types, but it does demonstrate that packed characters did not need a whole lot of invention on word-oriented machines.

2
  • That's true about the FORTRAN A format, but I'm asking specifically about the data type.
    – Leo B.
    Commented Jan 24, 2022 at 1:33
  • 2
    I know. I added it as a sort of footnote, while recognizing it wasn't the answer.
    – dave
    Commented Jan 24, 2022 at 1:46

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .