a post that states that Unix uses null-terminated strings, ASCIZ, because it was a feature of the PDP-7.
Yes, ASCIZ strings are especially well supported by the PDP-7 due to the existence of the SZA instruction, which delivers a zero-cost test for zero (pun intended) and thus outperforms any other way of terminating strings in memory footprint as well as execution time.
I looked over the PDP-7 and -4 material and I can't find anything like the CIS instructions. This is not surprising given the era. But I can't find anything that seems to suggest ASCIZ was directly supported.
It doesn't take high-level CIS-type instructions for a machine to favour certain data types. The preference may come from far more mundane features, such as having an implied or low-cost test for zero.
There's the various I/O instructions, but they all load three chars into/out of a register
Don't forget all character I/O, most notably teleprinter and paper tape, with instructions like TLS or RSA. They operate on single characters from or to the bits of AC.
Equally important, string handling is more often about touching characters within a string than the string as a whole. As soon as a string is handled character by character, character-based conditions, like end of string, become an inherent advantage.
Last but not least, strings are not blocks.
I don't see anything that means a length couldn't be used.
It would be a bad machine if it didn't allow the use of length-terminated strings. As so often, it's not about what can or cannot be done, but what is more efficient or at least less cumbersome.
So was null-terminated strings on the PDP-7/4 something in the hardware or simply a programming style?
Well, any Turing-complete computer can do strings any way imaginable. Assuming the PDP-7 is one, it might come down to programming style; after all, the machine does what it's told. Still, since these are real computers in the real world with constraints like memory size or speed, some methods may perform better on specific computers than others. In the case of the PDP-7 it's zero termination (ASCIZ), due to the fact that
- its memory access is slow
- it has only one general purpose register (AC)
- character handling has to go thru this register
- it has an SZA instruction
SZA (Skip if Zero Accumulator) tests AC against a value of zero and branches (skips). Since AC has to be loaded with each character to be handled, SZA can be used to test termination. Unlike a compare (SAD x), SZA does not need to access memory for a value to compare against, as zero is implied. It's simply faster than any other method.
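To make that difference concrete, here is a minimal Python sketch of a toy single-accumulator machine. It is not a PDP-7 emulator: the class `ToyAccumulatorMachine`, its method names, and the choice to count only operand memory references (ignoring instruction fetches and pointer bookkeeping, which would be the same in both cases) are assumptions made purely for illustration. `skip_if_zero` is loosely modeled on SZA, `skip_if_different` on SAD x.

```python
class ToyAccumulatorMachine:
    """Hypothetical one-register machine that counts operand memory references."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.ac = 0              # the single accumulator
        self.operand_refs = 0    # operand fetches only, not instruction fetches

    def load(self, addr):
        """Load AC from memory: one operand reference."""
        self.operand_refs += 1
        self.ac = self.memory[addr]

    def skip_if_zero(self):
        """Implied test of AC against zero: no operand reference."""
        return self.ac == 0

    def skip_if_different(self, addr):
        """Compare AC with a word in memory: one operand reference."""
        self.operand_refs += 1
        return self.ac != self.memory[addr]


def scan_zero_terminated(machine, start):
    """Return the length of a zero-terminated string using the implied test."""
    addr = start
    while True:
        machine.load(addr)
        if machine.skip_if_zero():
            return addr - start
        addr += 1


def scan_marker_terminated(machine, start, marker_addr):
    """Return the length of a string ended by the marker stored at marker_addr."""
    addr = start
    while True:
        machine.load(addr)
        if not machine.skip_if_different(marker_addr):
            return addr - start
        addr += 1


# "HI" followed by a zero terminator.
zero_machine = ToyAccumulatorMachine([72, 73, 0])
zero_length = scan_zero_terminated(zero_machine, 0)

# "HI$", with the constant "$" (36) kept in the word at address 3.
marker_machine = ToyAccumulatorMachine([72, 73, 36, 36])
marker_length = scan_marker_terminated(marker_machine, 0, 3)
```

Under this toy model both scans find a length of 2, but the zero-terminated scan makes 3 operand references while the compare-based scan makes 6, since every compare has to fetch the marker from memory.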
To get a fair comparison, let's look at some ways to handle strings:
Length Terminated Strings
Length-terminated strings carry (usually in front) a length field holding the number of addressable units (characters or words) used to store the string. Nowadays they are often called Pascal strings.
Length termination has important advantages:
- Blocks and strings are the same.
- Its content does not need to be processed
- It can contain any possible symbol (code value).
This is paid for with some requirements:
- A length must be present
- Iteration (handling) requires
  - a pointer to access the actual element, and either
    - a counter to keep track of remaining characters, or
    - an end pointer (address)
  - a comparison against the counter or end pointer
These requirements are no issue when it comes to processors with a lot of registers and/or fast memory access. Handling 3 values during iteration (position, termination and actual character) gets cumbersome on a CPU with only a single versatile register, or slow memory access, like the PDP-4/7 series. With its single AC, the PDP-4 needs to hold at least the pointer/counter in memory, resulting in slow execution due to frequent memory access for
- Indirect access
- Pointer increment
- Pointer load for comparison
- Comparison
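The bookkeeping listed above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: memory is modeled as a plain list of integers, the length word sits directly in front of the characters, and the function name `iterate_length_prefixed` is made up for this example. Note the three values in play on every step: pointer, counter and the character itself.

```python
def iterate_length_prefixed(memory, start):
    """Yield the characters of a length-prefixed string stored at `start`.

    Assumed layout: memory[start] holds the length, followed by that many
    character codes.
    """
    remaining = memory[start]    # counter: characters still to handle
    pointer = start + 1          # position of the next character
    while remaining != 0:        # termination test works on the counter
        yield memory[pointer]    # the actual character
        pointer += 1
        remaining -= 1


# Length 3, then the codes for "A", NUL, "B"; the trailing 99 is unrelated data.
# A zero inside the content is no problem, since the content is never inspected.
memory = [3, 65, 0, 66, 99]
characters = list(iterate_length_prefixed(memory, 0))
```

On a machine with plenty of registers `pointer` and `remaining` would live in registers; on a single-accumulator machine each of them is a memory word that has to be fetched and stored on every round.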
Character Terminated Strings
Character (or marker) termination (*1) in turn does not record the length of a string in an additional field, but reserves a special character value to mark the end of that string. This can be any value. Of course it's helpful to use one that does not occur often :)
A common example for character termination are DOS strings, terminated by a dollar sign ($).
Disadvantages of character termination are:
- Not all characters can be used
- Content needs to be processed
- Characters need to be loaded
- Characters have to be compared against termination value
On the plus side, there is
- No need for a counter
- No need for an end pointer
- Content usually has to be loaded anyway for processing
Using character-terminated strings is advantageous on a PDP-7 as no counter/end pointer is needed, saving several costly memory operations, which are replaced by a simple compare of the character in transit against the termination value.
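As a rough illustration of marker termination, here is a small Python sketch using the DOS-style dollar sign as terminator. The function name `read_marker_terminated` and the list-of-integers memory model are assumptions for this example only.

```python
def read_marker_terminated(memory, start, marker):
    """Collect character codes from `start` up to, but not including, `marker`."""
    result = []
    pointer = start                      # only a pointer, no counter or end address
    while memory[pointer] != marker:     # each character is loaded and compared
        result.append(memory[pointer])
        pointer += 1
    return result


DOLLAR = ord("$")
memory = [ord(c) for c in "Price: 5$ each$"]
text = "".join(chr(code) for code in read_marker_terminated(memory, 0, DOLLAR))
```

Here `text` ends up as `Price: 5`. The loop needs no length at all, but it also shows the price: the first dollar sign ends the string, so the marker value can never be part of the content.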
While marker termination is a generic idea, there are some commonly used subclasses: zero terminated and flag terminated.
Zero Terminated
Zero termination is a basic application of character termination that exploits special properties of the value zero. For example, many CPUs either mark any loaded zero with a directly testable zero flag (prominent examples are the 6800 and 6502) or have a branch instruction that performs an implied test for zero.
And that's exactly the case for the PDP-7. SZA tests AC for zero and allows branching accordingly without the need for any additional compare instruction, and especially without the need to access memory.
Flag Terminated
The use of flag termination reduces the usable code set further (*2) by reserving half of the character values for a flag marker, e.g. using only 7 bits of a byte for character values while the 8th is used as the flag. While working much like basic character-terminated strings, it holds a few advantages:
- No extra byte needed for termination
- Some CPUs may use an implied test
The latter is especially true for all CPUs that set a condition flag according to the value of the character being handled, allowing an implied test, much like the one described for zero termination.
Flag termination was somewhat popular after the 8-bit byte became canonical, while the majority of character handling still used 7-bit code, thus making the 8th bit available as a flag without much cost.
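A minimal Python sketch of flag termination with 7-bit characters in 8-bit bytes might look like this. The function names are invented for the example, and the rule that the flag sits on the last character (so an empty string cannot be represented) is a property of this particular sketch, not a claim about any specific historical system.

```python
def encode_flag_terminated(text):
    """Encode 7-bit text, setting bit 7 on the last character as the end flag."""
    if not text:
        raise ValueError("this scheme cannot represent an empty string")
    codes = [ord(c) for c in text]
    if any(code > 0x7F for code in codes):
        raise ValueError("only 7-bit character codes are allowed")
    codes[-1] |= 0x80                    # mark the final character
    return bytes(codes)


def decode_flag_terminated(memory, start):
    """Read characters from `start` until one carries the end flag."""
    result = []
    pointer = start
    while True:
        byte = memory[pointer]
        result.append(chr(byte & 0x7F))  # strip the flag to get the character
        if byte & 0x80:                  # flag set: this was the last character
            return "".join(result)
        pointer += 1


# Two strings packed back to back, with no terminator bytes in between.
packed = encode_flag_terminated("HI") + encode_flag_terminated("OK")
first = decode_flag_terminated(packed, 0)
second = decode_flag_terminated(packed, 2)
```

The two strings occupy exactly four bytes. On a CPU that sets a sign or negative flag on load, the `byte & 0x80` test comes for free, which is the implied test mentioned above.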
Bottom Line: ASCIZ is the preferred handling for all CPUs that offer an implied test for zero, the PDP-7 being one of them.
Davisbak made a great point by citing Mr. Donald Knuth:
Special machine instructions that test against zero are why Knuth points out in various places that he prefers to loop downwards towards zero. (One such place is in the middle of his excellent article Structured Programming with go to statements (1974) - see first paragraph pg 268.) (Also check out the quotations he used at the beginning!)
Such tests may seem so obvious to today's readers that they have a hard time appreciating the advantage machines with such a test have over those without, which includes today's most prevalent architecture: x86 (*3).
*1 - There is also a subclass of word-terminated strings, but it follows essentially the same rules as character-terminated ones.
*2 - Or extended, depending on POV.
*3 - Or not, as x86 does in fact have a free test for zero in the form of the JCXZ instruction, except it's restricted to work with only one register (CX), which at the same time is not as versatile as the main accumulator (AX).