15

I came across a post that states that Unix uses null-terminated strings, ASCIZ, because it was a feature of the PDP-7. This triggered my reading on the CIS instructions in the PDP-11, but these were pascal-style strings with the "n" register holding the length.

I looked over the PDP-7 and -4 material and I can't find anything like the CIS instructions. This is not surprising given the era. But I can't find anything that seems to suggest ASCIZ was directly supported. There's the various I/O instructions, but they all load three chars into/out of a register and I don't see anything that means a length couldn't be used. The example code doesn't show anything interesting either.

So was null-terminated strings on the PDP-7/4 something in the hardware or simply a programming style? I guess if you load multiple characters at once you need to set any leftover /3 to null?

5
  • 3
    Didn't the PDP-4 have SZA - Skip if AC=0? Likewise the PDP-8. That's something to prefer zero terminated strings - at least as long as every character is handled in AC. Same way as any architecture that sets flags according to the last load does prefer zero terminated strings by saving the need for a compare instruction.
    – Raffzahn
    Commented Jul 17, 2022 at 2:48
  • 3
    I'd guess 'programming style'. TOPS-10 frequently used ASCIZ strings (where it wasn't using SIXBIT) , and as far as I know, there was no particular hardware support beyond testing whether an accumulator was zero. To my mind, once you've decided not to use counted strings, there's only two sensible choices for terminators - all zeroes or all ones, with a slight bias to the former, as in @Raffzahn's comment.
    – dave
    Commented Jul 17, 2022 at 3:31
  • 2
    Null-termination is definitely a programming style and not a hardware thing. I've no question about it. But I only have a limited view from living that period of time, too. So it's not definitive. But I never saw a hardware situation that predated a software one.
    – jonk
    Commented Jul 17, 2022 at 9:46
  • 2
    To @another-dave's comment about termination: another perfectly reasonable alternative, iff you know your string is at least one character long, and sticks to 7-bit ASCII, is to set the high bit of the last character. That's how BASIC tokens are terminated in the ZX Spectrum ROM, for example. Commented Jul 17, 2022 at 13:29
  • 3
    @OmarL - Sure, but that requires storing your 7-bit character in 8 bits. Not the case for TOPS-10: the usual choice, hardware supported, was to store five characters in each 36-bit word. In the PDP-7, as per the OP's question, there would be three 6-bit characters per 18-bit word., so no flag bit available there.
    – dave
    Commented Jul 17, 2022 at 13:57

2 Answers 2

10

You are correct. There is no special support on the PDP-1/4/7 for ASCII null-terminated strings. Or strings of any kind, really. It's up to the programmer to decide how they want to represent strings. There are multiple ways to store text on these machines. You might find this other question "How did the PDP-8 handle strings?" relevant. (The PDP-8 has a similar if further-simplified architecture to the PDP-7.)

It was common to store 6-bit characters (often a subset of ASCII without lowercase or control characters since that can be converted with a simple bitmask and arithmetic) packed multiple to a word. This is more compact for storing the string, but requires more complicated routines to extract the nth character. For string processing, it's often easiest to handle one character at a time. In that case, one just tests if the accumulator is zero. (For storage of large amounts of text, this usually is too wasteful, with only 6/7/8 bits in each word being actually used.)

I have written most of a forever-unfinished toy Forth for the PDP-8. I chose to use ASCII (since the IO devices use ASCII), and I chose null-terminated strings for string operations (since I believe in repeating the errors of history). Error messages and other long strings are stored in packed ASCII. So "HELLO" is stored as three words: HE LL O[NULL]. When the program goes to print this, it runs a routine for the 0th character of the string at this pointer, and the 1st, etc. It then handles those characters as plain ASCII, and will exit the printing routine when the accumulator is 0. There's also a routine to translate a NULL-terminated string into its packed representation. This sort of packing/unpacking was a common design pattern on word-oriented machines, whether using Pascal or C-style or some other form of string representation. (The original Pascal compiler for the CDC 6000 packed ten characters per 60-bit word.)

10

a post that states that Unix uses null-terminated strings, ASCIZ, because it was a feature of the PDP-7.

Yes, ASCIZ strings are especially supported by the PDP-7 due existence of the SZA instruction which delivers a zero cost test for zero (pun intended), thus outperforms any other way string termination by memory footprint as well as execution time.

I looked over the PDP-7 and -4 material and I can't find anything like the CIS instructions. This is not surprising given the era. But I can't find anything that seems to suggest ASCIZ was directly supported.

It doesn't need some high level CIS type to benefit from certain data types. Preference may come from way more mundane items, like havng a test for zero implied or at low cost.

There's the various I/O instructions, but they all load three chars into/out of a register

Don't forget all character I/O, most notably teleprinter and paper tape. Like TLS or RSA. They operate on single characters from or to the bits of AC.

Equally important, string handling is more often about touching characters within a string than the string at whole. As soon as a string is handled character by character character based conditions - like end of string - become an inherent advantage.

Last but not least, strings are not blocks.

I don't see anything that means a length couldn't be used.

It would be a bad machine if it would't allow the use of length terminated strings. As so often it's not about what can or can not be done, but what is more efficient or at least less cumbersome.

So was null-terminated strings on the PDP-7/4 something in the hardware or simply a programming style?

Well, any Turing Complete computer can do stings any way imaginable. Assuming the PFP-7 is one, might mean it comes down to programming style, after all, the machine does what it's told. Still, since these are real computers in the real world with constrains like memory size or speed, some methods may perform better on specific computers than others. In case of the PDP-7 it's Zero Termination (ASCIZ) due the fact that

  • its memory access is slow
  • it has only one general purpose register (AC)
  • character handling has to go thru this register
  • it got any SZA instruction

SZA(Skip if Zero Accumulator) tests AC against a value of zero and branches (skips). Since AC has to be loaded with each character to be handled, SZA can be used to test termination. Unlike a compare (SAD x), SZA does not need to access memory for a value to compare against, as zero is an implied. It's simply faster than any other method.


To get a fair comparision, let's look at the some ways to handle strings:

Length Terminated Strings

Length terminated strings carry (usually in front) a length field holding the number of addressable units (characters or words) used to store the string. Nowadays often called Pascal-Strings.

Length termination got important advantages:

  • Blocks and strings are the same.
  • Its content does not need to be processed
  • It can contain any possible symbol (code value).

This is payed for with some requirements:

  • A length must be present
  • Iteration (Handling) requires
    • A pointer to access the actual element and either
      • a counter to keep track or remaining characters, or
      • an end pointer (address)
    • Comparison against counter or end pointer

These requirements are no issue when it comes to processors with a lot of registers and/or fast memory access. Handling 3 values during iteration (position, termination and actual character) gets cumbersome on a CPU with only a single versatile register, or slow memory access like the PDP-4/7 series. With it's single AC, the PDP-4 needs to hold at least pointer/counter in memory, resulting in slow execution due frequent memory access for

  • Indirect access
  • Pointer increment
  • Pointer load for comparison
  • Comparison

Character Terminated Strings

Character (or marker) termination (*1) in turn does not record the length of a string in an additional field, but reserves a special character value to mark the end of that string. This can be any value. Of course it's helpful to use one that does not occur often :)

A common example for character termination are DOS strings, terminated by a dollar sign ($).

Disadvantages of character termination are:

  • Not all characters can be used
  • Content needs to be processed
  • Character need to be loaded
  • Characters have to be compared against termination value

On the plus side, there is

  • No need for a counter
  • No need for an end pointer
  • Content usually has to be loaded anyway for processing

Using character terminated strings is advantages on a PDP-7 as no counter/end pointer is needed, saving several costly memory operations - whcih are replaced by a simple compare of the character in transit against the termination value.

While marker terminated is a generic idea, there are some common used sub classes: Zero terminated and Flag terminated.

Zero Terminated

Zero Termination is a basic application of character termination but using special properties of the value zero - for example that many CPUs either mark any loaded zero with a direct testable zero flag (prominet examples 6800 or 6502. Or have a branch instruction that issueing an implied test for zero.

And that's exactly the case for a PDP-7. SZS tests AC for zero and allows to branch accordingly without the need for any additional compare instruction, and especially without the need to access memory.

Flag Terminated

The use of Flag Termination reduce the usable code set further (*2) to reserve half of the character values for a flag marker - e.g. use only 7 bits of a byte for character values, while the 8th is used as flag. While working much like basic character terminated strings, it holds a few advantages:

  • No extra byte needed for termination
  • Some CPU may use an implied test

Especially the later is true for all CPU that set a condition flag according to the value of a character handles, allowing an implied test, much like with described with zero termination.

Flag termination was somewhat popular after the 8 bit byte became canonical, while the majority of character handling still used 7 bit code - thus enabling the 8th bit as flag wthout much cost.


Bottom Line: ASCIZ is preferred handling for all CPUs that offer an implied test for Zero - the PDP-7 being one of them.


Davisbak made a great point by citing Mr. Donald Knuth:

Special machine instructions that test against zero are why Knuth points out in various places that he prefers to loop downwards towards zero. (One such place is in the middle of his excellent article Structured Programming with go to statements (1974) - see first paragraph pg 268.) (Also check out the quotations he used at the beginning!)

Such tests may seem so obvious to today's readers, that they have a hard tim to praise the advantage machines have in this case over others who don't - which includes today's most prevalent architecture: x86 (*3).


*1 - Here is also a subclss of word terminated strings, but it follows essentially the same rules as character terminated

*2 - Or extended, depending on POV.

*3 - Or not, as x86 it does in fact have a free test for zero in form of the JCXZ instruction, except it's restricted to work with only one register (CX), which at the same time is not as versatile as the main accumulator (AX).

16
  • A minor drawback of Length Terminated Strings is that you have a maximum string length, depending on the size of the length field you chose. Commented Jul 17, 2022 at 20:53
  • 1
    Special machine instructions that test against zero are why Knuth points out in various places that he prefers to loop downwards towards zero. (One such place is in the middle of his excellent article Structured Programming with go to statements (1974) - see first paragraph pg 268.) (Also check out the quotations he used at the beginning!)
    – davidbak
    Commented Jul 17, 2022 at 21:08
  • @JohnDallman - that really bit you in some of the original Pascal implementations which used a single byte for string length. The ones that went to two bytes for string length were a lot easier! In those days, considering machine resources, 65K byte strings were ... pretty darn large! And thus mostly sufficient.
    – davidbak
    Commented Jul 17, 2022 at 21:10
  • 2
    Things were so much simpler then: One machine instruction >= 1 cycles, and shorter instructions saved space (and were important if you had short relative jump instructions and could use them because your code was tiny). None of this overlapping execution, multiple ALUs, out-of-order, branch-prediction, microop caching, speculative stuff. Then, that instruction you saved by having a (short) combined test&jump over two instructions or one longer instruction paid off for sure! These days: not so much ... (except microcontrollers I suppose).
    – davidbak
    Commented Jul 17, 2022 at 22:49
  • 1
    Counting towards zero in the reason I liked the AOBJN instruction so much. Pdp-6 and pdp-10. Commented Jul 18, 2022 at 19:46

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .