Are there any articles elucidating the history of the POPCOUNT instruction?

Question

Figuring out how many bits in a group of bits are set to 1, known as computing "population count", Hamming weight, or "bit summation", among others, has various applications.

It is also fairly cheap to implement in hardware to be executed as fast as a fixed point addition if not faster.

Likely for these reasons, CDC 6600 (1964) had CXi (47) described as Count the number of 1's in Xk to Xi specifically included in its quite short list of instructions, although IBM STRETCH (1961) had that capability first as a by-product of all logical operations.

Then, in the list of processors with the "popcount" instruction there are large gaps. Since CDCs and Crays, and until very late 1990s, only SPARC (designed in mid-1980s) had the POPC instruction.

Not a single IBM processor (not counting the aforementioned IBM STRETCH where there was no need for a separate instruction), not a single DEC processor (the instruction first appeared in an Alpha CPU when it was already Compaq's), not a single Motorola processor, not a single Intel or AMD processor until late 2000s, ...

As a result, innumerably many a software engineer had to implement the functionality of POPCOUNT by hand with various degrees of efficiency (even GCC introduced __builtin_popcount only in mid-2000s), and to endure silly interview questions about its efficient implementations.

What was going on? The wiki page doesn't discuss that aspect, but the many-decades long omission is so glaring that there had to be some musings on the subject by people "in the know".

I'm not asking people to speculate here; I'm asking for references to existing material about the lack of POPCOUNT in most of the CPUs of the last quarter of the XX century.

C++ compiler writers have been free to optimize std::bitset::count() using popcnt or equivalent since bitset was invented (1994?). When they actually started doing so requires more research. See: stackoverflow.com/a/45189920/966071 — snips-n-snails, Commented Sep 10, 2017 at 2:25
Probably because ADD and SUB has more applications?. I can remember only a few occasions of wanting such an instruction in the last 20 years. — tofro, Commented Sep 10, 2017 at 8:09
@tofro - I'm going to guess that you don't spend much time writing operating systems. This faciliity is extremely useful for memory management. If you have a word that contains one bit for whether each of a group of memory pages is currently allocated or not, POPCOUNT will tell you whether it's useful to examine that word in order to attempt to allocate a block of a given size. Since this is a very important operation, I'd say it's worth including in an ISA for this reason only. — Jules, Commented Sep 10, 2017 at 11:25
@Jules "any bit set" or "no bit set" are interesting checks. Bit count is nothing I commonly needed. If there were a need for such an operator, I'd guess any of the high-level languages would have come up with a concept for a hamming-weight operator during the last 30 years, at least. They haven't (I'm not talking about builtin_popcount here - That's not in the high-level language). Note I'm not saying it's useless - It's just not as important as other instructions. And it's also, at least for low bit count CPUs, extremely easy and fast to implement using a LUT. — tofro, Commented Sep 10, 2017 at 12:25
Anyone who thinks that modern operating systems don't use bitmaps to track information is sorely mistaken: bitmap.h — Jonathon Reinhart, Commented Apr 3, 2018 at 12:20

Ken Shirriff · Accepted Answer · 2019-01-05 00:08:03Z

14

The closest thing I've found to an article explaining the history of the POPCOUNT instruction is a cryptography forum discussion. That discussion claims that population count was added to machines at the NSA's request for cryptanalysis. See also The NSA Instruction which discusses various systems with POPCOUNT.

The book "Hacker's Delight" claims that "According to computer folklore, the population count function is important to the National Security Agency" (page 96).

Serveral sources state that IBM STRETCH supported population count as a by-product of all logical operations. This is technically true but misleading - STRETCH implemented a population count register (All Ones Count / AOC) that would count the ones resulting from a logical operation. So while it didn't have an explicit population count instruction, it had hardware to compute the population count into a register. (See page 10 of the IBM 7030 manual.)

IBM built HARVEST as an extension of STRETCH for the NSA. This machine added a streaming bit count instruction (SCSI) (see HARVEST system manual page H2.40) among other features.

To answer the original question, there are lots of rumors that POPCOUNT appeared in instruction sets when the NSA wanted it, but unsurprisingly there is no authoritative article on this.

answered Jan 5, 2019 at 0:08

Ken Shirriff

1,13311 silver badges10 bronze badges

That doesn't seem like an answer. The question is, why there was no POPCOUNT instruction in most of the CPUs despite its usefullness, whether NSA wants it or not, and ease of implementation.
– Leo B.
Commented Jan 5, 2019 at 6:41
3

@LeoB - its usefulness does not seem to be a settled conclusion, and that's presumably why it has not shown up in "most CPUs". All that there seems to exist is the folklore that says that cryptanalysis finds it useful, but no substantiation of that. Anecdotes prove nothing, of course, but in a long programming career generally of a system-software implementation nature, including a fair amount of allocation-tracking bitmappery, I don't recall ever once needing to count bits beyond 'all zero or not all zero'.
– dave
Commented Jan 5, 2019 at 13:37
@another-dave Many kinds of discrete optimization algorithms may benefit from the POPCOUNT instruction, from chess to electronic design.
– Leo B.
Commented Jan 5, 2019 at 17:05
3

Just having some potential users of an instruction isn't enough to warrant adding it to a processor. For example, MIPS had the rule that a proposed new instruction had to provide a 1% performance gain over a range of applications or it would be rejected.
– Ken Shirriff
Commented Jan 5, 2019 at 17:43
5

During the 1980s and 1990s we were aware of the population count and leading-zero count instructions in the Seymour Cray designs - but, within Cray Research, never had confirmation as to WHY they were in the instruction set, and a separate functional unit in the hardware. We assumed NSA but that wasn't something that could be confirmed or denied.
– Edward Barnard
Commented Mar 27, 2019 at 0:37

| Show 1 more comment

tofro · Accepted Answer · 2017-09-10 20:39:38Z

10

I would guess the answer is very simple and obvious:

CPU makers don't just implement what they want, what "looks nice in a spec" or "what's easily done". They are driven by the market and implement what programmers need and ask for. In case a certain instruction would make a clear distinction in the market or create a unique selling point for a new CPU, they would certainly have implemented it. There is and has definitely been enough peer pressure in the market to justify.

But apparently it wasn't and they didn't.

edited Sep 10, 2017 at 20:39

answered Sep 10, 2017 at 18:49

tofro

35.6k4 gold badges91 silver badges175 bronze badges

2

What, then, did the programmers need it for in the 60s, then stopped for several decades, and now need again?
– Leo B.
Commented Sep 10, 2017 at 20:13
2

Cryptography definitely is an area today. The sixties, I don't know-maybe it turned out it wasn't needed at all and thus abandoned
– tofro
Commented Sep 10, 2017 at 20:27
3

@LeoB.: popcnt is very useful with SIMD. e.g. to count matches, do a packed-compare into a bitmask (e.g. x86 pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0), and popcnt the bitmask. In a SIMD loop that processes a variable amount of data each iteration, pointer += popcnt(something) is often useful to step through an input or output array. e.g. see my answer on filtering an array into a new array ("left packing") using AVX2 + BMI2
– Peter Cordes
Commented Dec 15, 2017 at 4:24

Add a comment |

Raffzahn · Accepted Answer · 2020-02-04 16:28:35Z

I'm not asking people to speculate here; I'm asking for references to existing material about the lack of POPCOUNT in most of the CPUs of the last quarter of the XX century.

It's always a bit hard to find documentation about why things have not been done. But maybe a few points to think about

Machines with a Popcount were already a minority early on.
Popcount only makes sense for a narrow class of problems.
Having a Popcount instruction makes only sense if the problem to be solved already needs the upper end of power available.
At the absolute upper end, there were always machines having it (Cray).
Every new class (mini, micro) started out as a most simple implementation, only providing the bare minimum.
The vast majority of applications (for those) do not need Popcount.
Minis never really reached the upper end of performance, making such specialized instructions worthwhile.
It wasn't until the late 1980s that massive parallel microprocessor systems were considered for high performance computing.
While some, like SPARC, defined it, only their upper end implementations really had it.
Likewise Intel implemented it for the (~2000) Itanium, dedicated to high end computing, while letting another decade pass before adding it to x86.
It wasn't until that time (late 2000s) that generic mainstream microprocessor architecture (x86) was really considered viable for high end computing.
By the 2000s microprocessors were quite mature (read bloated), making the addition of additional instructions quite cheap, thus allowing the inclusion of rarely needed ones as well.
Even today, the majority of CPUs used (and designed) do live quite well without.

On a more basic, but equally important level:

Popcount is a fairly simple integer instruction
Popcount can (on most CPU) easy be synthesised by a few, fast instructions.
Performance gain is thus rather small
It's way more efficient to first add instructions with more potential gain

Bottom line: No real need, and easy to implement using basic instructions at the same time

At that point it's also worth to remember, that many things look very different in hindsight.

To avoid lengthy arguments about how complicated a Popcount implementation is, let's look at a most primitive 6502 (*1). Here two basic instruction of 7 cycles total are needed per byte whose bits are to be counted.

        CLC
        LDY  FACC
        LDA  PCTAB,Y
        LDY  FACC+1
        ADC  PCTAB,Y
        ...

PCTAB   DB   0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,...

On an 8086, it can easy be done with string instruction handling even lengthy sequences of up to 65535 bit close to full bus speed:

        MOV   AH,0
        MOV   DX,0
CNTLP:
        LODSB
        XLAT  PCTAB
        ADD   DX,AX
        LOOP

On a /360, the basic counting can be done with a single TR instruction, while 'reuse' of packed decimal instructions and another TR will speed up the adding substantial.

*1 - Ofc, only in sense of complex instructions - otherwise it always will be the second best CPU, right after /360, ever.

Thank you for digging that up; to summarize, the early "upper end" computers were also word-based, thus an efficient implementation would be elusive; whereas for computers with byte-addressable data it can easily be done with simple table lookups in a tight loop, until the steadily growing disparity between the CPU speed and the memory latency makes the potential performance gain worth implementing POPCOUNT as an atomic instruction. — Leo B., Commented Feb 4, 2020 at 16:11
@LeoB. Yes, you're certainly right about the word based nature of the very early machines. I should have added that as well. Performance wise lookup and a tight loop is still a good choice, as either will be held in on chip cache, thus keeping it mostly independent of memory latency - except for the data to be counted, but that would hot a memory based Popcount as well, wouldn't it? — Raffzahn, Commented Feb 4, 2020 at 16:23
True, but even with L1 cache latencies, for a 64-bit value a loop would take tens of clock cycles; in particular because at least one branch out of 8 iterations of the loop would be mispredicted. — Leo B., Commented Feb 4, 2020 at 16:38
@LeoB. Oh, no doubt. A specific instruction is (or at least should) always be faster. Especially if no additional data is needed (that is beside the primary one it works on). Just the gain isn't as big as it seams (also the number of clock cycles depends much on how well the loop can be turned into parallel micro-ops), so it comes again down to how prevalent the usage of the within an application is. If it's the one in a core loop, then it'll make a huge difference, but if it's one of several hundret, it might not show up at all. But I guess at that point we're knee deep in opinion swamp. — Raffzahn, Commented Feb 4, 2020 at 16:47

dave · Accepted Answer · 2020-02-04 03:40:37Z

6

This paper by Simon Lavington on Manchester University computers says that the Ferranti Mark I, the commercialization of the Manchester Mark I, incorporated a "sideways add" instruction at the request of Manchester mathematicians for "their number-theory problems".

answered Feb 4, 2020 at 3:40

dave

36.3k3 gold badges82 silver badges167 bronze badges

Add a comment |

Stack Exchange Network

Are there any articles elucidating the history of the POPCOUNT instruction?

4 Answers 4

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
history
instruction-set
.

Linked

Hot Network Questions

Are there any articles elucidating the history of the POPCOUNT instruction?

4 Answers 4

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged historyinstruction-set.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
history
instruction-set
.