8-bit bytes
Much of this grows out of the adoption of the 8-bit byte. That became popular with the introduction of the IBM 360 family of computers in 1964. An explanation of the choice was offered that year in the IBM Journal of Research and Development:
Character size, 6 vs 4/8: In character size, the fundamental problem is that decimal digits require 4 bits, the alphanumeric characters require 6 bits. Three obvious alternatives were considered - 6 bits for all, with 2 bits wasted on numeric data; 4 bits for digits, 8 for alphanumeric, with 2 bits wasted on alphanumeric; and 4 bits for digits, 6 for alphanumeric, which would require adoption of a 12-bit module as the minimum addressable element. The 7-bit character, which incorporated a binary recoding of decimal digit pairs, was also briefly examined.
The 4/6 approach was rejected because (a) it was desired to have the versatility and power of manipulating character streams and addressing individual characters, even in models where decimal arithmetic is not used, (b) limiting the alphabetic character to 6 bits seemed short-sighted, and (c) the engineering complexities of this approach might well cost more than the wasted bits in the character.
The straight-6 approach, used in the IBM 702-7080 and 1401-7010 families, as well as in other manufacturers' systems, had the advantages of familiar usage, existing I/O equipment, simple specification field structure, and of commensurability with a 48-bit floating-point word and a 24-bit instruction field.
The 4/8 approach, used in the IBM 650-7074 family and elsewhere, had greater coding efficiency, spare bits in the alphabetic set (allowing the set to grow), and commensurability with a 32/64-bit floating-point word and a 16-bit instruction field. Most important of these factors was coding efficiency, which arises from the fact that the use of numeric data in business records is more than twice as frequent as alphanumeric. This efficiency implies, for a given hardware investment, better use of core storage, faster tapes, and more capacious disks.
Overall, an 8-bit byte allowed a reasonably large character set, by the standards of the time, and also allowed two BCD digits per byte.
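As an illustration (mine, not from the original paper), packing two binary-coded-decimal digits into one 8-bit byte is just a matter of putting one digit in each 4-bit half:

```python
def pack_bcd(high_digit, low_digit):
    """Pack two decimal digits (0-9) into one byte, one per 4-bit nibble."""
    assert 0 <= high_digit <= 9 and 0 <= low_digit <= 9
    return (high_digit << 4) | low_digit

def unpack_bcd(byte):
    """Recover the two decimal digits from a packed BCD byte."""
    return byte >> 4, byte & 0x0F

# The number 42 fits in a single byte as packed BCD.
b = pack_bcd(4, 2)
print(hex(b))          # 0x42 - the hex digits mirror the decimal digits
print(unpack_bcd(b))   # (4, 2)
```

A pleasant property, visible above, is that a packed BCD byte printed in hexadecimal reads as the decimal number it stores.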
The move to byte addressing
The priority in the earliest computer designs was to process numbers as rapidly as possible. A number was typically stored in a machine word, and the desired numerical range determined the size of the word. Instructions were normally a single word, and there was often a single address as part of each instruction. The size of the address field in instructions determined the memory size. The IBM 704/709 is an example; it had a maximum of 4096 words of 36 bits, with six characters per word, each of 6 bits. Addresses were 12 bits.
As the range of uses for computers expanded, handling text data became more and more important. Doing that in a word-addressed machine is cumbersome, at best. A byte-addressed machine allows you to access individual characters easily, but demands a larger address field. At the same time, magnetic core memory allowed building much larger memories than vacuum tubes, electrostatic storage or delay lines.
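A sketch of why text handling is cumbersome on a word-addressed machine: to read character n, a program must work out which word it lives in and then shift and mask. This assumes, as on the 704/709, 36-bit words holding six 6-bit characters, packed from the top of the word down:

```python
WORD_BITS = 36
CHAR_BITS = 6
CHARS_PER_WORD = WORD_BITS // CHAR_BITS  # 6

def get_char(memory, n):
    """Fetch character n from a list of 36-bit words.

    Characters are packed from the high end of each word, so character 0
    occupies the top 6 bits of word 0.
    """
    word = memory[n // CHARS_PER_WORD]
    shift = WORD_BITS - CHAR_BITS * (n % CHARS_PER_WORD + 1)
    return (word >> shift) & ((1 << CHAR_BITS) - 1)

# One word holding the six character codes 1..6 in its six 6-bit fields.
word = 0
for c in [1, 2, 3, 4, 5, 6]:
    word = (word << 6) | c
print(get_char([word], 0), get_char([word], 5))  # 1 6
```

On a byte-addressed machine the whole function collapses to `memory[n]`; that is the convenience being bought with the larger address field.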
These developments essentially forced computers to have larger address spaces, and ended the practice of having an address in each instruction.
Larger Data Items
It obviously makes things simpler to have a whole number of bytes per data item. Simplicity at this level is extremely worthwhile, because it's always been important to make a computer run as fast as possible within a limited budget of electronics parts (tubes early on, transistors since then). So two bytes (16 bits) becomes an obvious size.
For larger sizes, there are two factors that show up in the electronics design:
Counting things
Implementing instructions often requires counting through the bytes (or bits) of data items. Using powers of two makes the electronics of those counters simpler. To count through 4 bytes, you need a two-bit counter, which can hold values from 0 to 3. Counting through three bytes still needs a two-bit counter, but one of its values is meaningless and has to be treated as a special case in hardware.
Sending data over a serial line requires counting through the bits of each item, which is another benefit of 8-bit bytes. A 3-bit counter will handle them, without any need for special cases.
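A software sketch of the counter argument: a 2-bit counter wraps round after 4 all by itself, but counting to 3 needs an explicit reset, which is the software analogue of the hardware special case.

```python
def step_mod4(counter):
    """A 2-bit counter: adding 1 and keeping 2 bits wraps 3 -> 0 for free."""
    return (counter + 1) & 0b11

def step_mod3(counter):
    """Counting through 3 states still needs 2 bits, but the value 3 is
    unused and must be detected and handled explicitly."""
    counter += 1
    if counter == 3:   # special case: the unused state forces a reset
        counter = 0
    return counter

seq4, seq3, c4, c3 = [], [], 0, 0
for _ in range(8):
    seq4.append(c4); c4 = step_mod4(c4)
    seq3.append(c3); c3 = step_mod3(c3)
print(seq4)  # [0, 1, 2, 3, 0, 1, 2, 3]
print(seq3)  # [0, 1, 2, 0, 1, 2, 0, 1]
```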
The IBM 360 picked 32-bit addresses (although it only allowed 24-bit memory addresses for its first decade), and once that was established, it was far easier to compete with IBM using 8-bit bytes and 32-bit addresses than if you wanted to do something different.
Memory fetches and data alignment
Fetching data from memory is simpler if data items are "aligned". This means that their addresses are a multiple of their size. So for a byte-addressed machine, like the IBM 360, a single byte can be at any address. A two-byte (16-bit) item is "aligned" if it is at an even-numbered address. A four-byte (32-bit) item is aligned if its address is a multiple of 4.
Many computer designs of the 1960s through 1990s had memories that could fetch 4 bytes in one operation, starting from an address that was a multiple of 4. If your data items are aligned, then you're guaranteed to be able to fetch any two- or four-byte item in a single read from memory. If they are not aligned, you sometimes need two fetches. That requires more complexity in the memory access system, to recognise that the operation is misaligned and generate the extra fetch. That complexity, and the extra fetch, slow things down.
Items bigger than four bytes will need two fetches, but life is simpler if your larger items are eight bytes, and aligned on 8-byte boundaries. Then you always need exactly two fetches. If you have 8-byte items that are not aligned, then you need three fetches.
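The fetch counts above can be worked out directly: assuming a memory that reads aligned 4-byte units, the number of fetches for an item is the number of distinct units its bytes span.

```python
FETCH_SIZE = 4  # bytes per aligned memory read

def fetches_needed(address, size):
    """Number of aligned FETCH_SIZE-byte reads needed to cover `size`
    bytes starting at `address`."""
    first_unit = address // FETCH_SIZE
    last_unit = (address + size - 1) // FETCH_SIZE
    return last_unit - first_unit + 1

print(fetches_needed(4, 4))   # 1  (aligned 4-byte item)
print(fetches_needed(6, 4))   # 2  (misaligned 4-byte item)
print(fetches_needed(8, 8))   # 2  (aligned 8-byte item)
print(fetches_needed(9, 8))   # 3  (misaligned 8-byte item)
```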
In modern fast systems, fetches are always of complete cache lines, usually 32 or 64 bytes. These are always aligned, and aligned data items that fit inside them always arrive complete.
Quite a few computer designs regard a misaligned fetch as a program bug, and kill programs that execute one. x86-based systems don't do that, but have to pay the complexity price. They do run faster with aligned data, so that is normally used even though it is not compulsory.
24-bit systems
I've used a 24-bit system, an ICL 1900 mainframe. It used 6-bit bytes, four per 24-bit word. Those 6-bit bytes limited it to UPPERCASE text, and 24-bit pointers limited it to 16MB of RAM, which is tiny by today's standards.
A more modern 24-bit system with 8-bit bytes would still be limited to 16MB of easily addressable memory, and would be paying the costs of counters with unwanted states, and memory items that were either misaligned, or wasted a byte of memory for every 24-bit integer. A 32-bit system would be more capable, and can be built very cheaply in today's technology.
Lessons of history
There have been a couple of influential computer systems that had 32-bit integers and pointers, but used 24-bit addressing. They're the Motorola 68000 and the IBM 360. In both cases, only the lowest 24 bits of an address were used, but addresses stored in memory occupied 32 bits.
As those systems were limited to 16MB of RAM, programmers stored other data in the spare 8 bits. And when 16MB of RAM clearly wasn't enough and the designs were expanded to 32-bit addressing, that data stored in spare bits became a serious problem, if it was treated as part of the address.
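What that pointer tagging looked like, sketched in Python: with only 24 address bits live, the top byte of a 32-bit pointer seems free for flags, and it keeps working right up until real addresses start using it.

```python
ADDR_MASK_24 = 0x00FFFFFF

def tag_pointer(addr, flags):
    """Stash 8 bits of flags in the top byte of a 32-bit pointer -
    safe only while addresses fit in 24 bits."""
    return (flags << 24) | (addr & ADDR_MASK_24)

def address_on_24bit_machine(ptr):
    """A 24-bit machine ignores the top byte of every address, so
    tagged pointers dereference correctly by accident."""
    return ptr & ADDR_MASK_24

p = tag_pointer(0x00123456, flags=0x80)
print(hex(address_on_24bit_machine(p)))  # 0x123456 - still works

# On a 32-bit successor the whole pointer is the address, so the same
# tagged value now points somewhere entirely different:
print(hex(p))                            # 0x80123456
```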
On the 68000 family, existing programs had to be changed to stop using those no-longer-spare bits. In the wider computer industry this was most noticeable for Macintosh software in the late 1980s, when programs were updated for 68020 compatibility, but the same thing happened on the Amiga, and presumably on other 68000-based systems.
On the successors of the IBM 360, 24-bit address programs could still be run, as could programs using larger addresses. But only 31 of the potential 32 address bits could be used; an address bit had been sacrificed to let the hardware tell the difference between the two kinds of code.
Post-32-bit designs
Everyone who designed a general-purpose architecture with addressing larger than 32 bits knew of the 360 and the 68000, and how much pain 24-bit addressing had caused. Nobody who was serious tried to design a segmented architecture like real-mode x86 for going beyond 32-bit addressing. Everyone used flat address spaces. There are only a few vaguely sane choices for address size.
40-bit addressing is complicated. The electronics have unused values in counting through bits and bytes. If memory fetches are 32 bits, then 40-bit pointers always require two fetches; if memory fetches are 40 bits, then some of your 16-bit and 32-bit fetches require two fetch operations, and some of your 64-bit fetches need three. You can reduce that by widening your 32-bit quantities to 40 bits, and 64-bit quantities to 80 bits, but that isn't a great idea - see below.
40-bit addressing also won't last very long. It only allows addressing 1024GB, and as of 2023, that would already be a problem for some markets. Expanding 40-bit addressing to a bigger address space would cause another round of disruption, as software was updated to make use of it, and would likely destroy backwards compatibility to 40-bit. It also gives you another round of alignment complexity if you took the 40- and 80-bit option.
48- or 56-bit addressing are about as complex as 40-bit, and while they probably would last rather longer, by the time you've gone this far, you might as well go all the way.
64-bit is simpler to build than 40-, 48- or 56-bit. It will last longer. Its register size matches standard floating-point data sizes. It seems logical.
The first general-purpose post-32-bit microprocessors released were the MIPS R4000 in 1991, the Kendall Square Research KSR-1, also in 1991, and the DEC Alpha in 1992.
I don't know many details of the MIPS project, but the R4000 was a 64-bit extension of their 32-bit R2000 and R3000 microprocessors. SGI bought MIPS when MIPS got into financial difficulties in 1991-92, to ensure the supply of processors for their workstation products.
The KSR-1 was a supercomputer with at least eight 64-bit microprocessors of their own design. It was not successful.
The DEC project had the most effect, because DEC was a major computer company at the time. The effort had started in 1988, initially aiming to keep the 32-bit VAX architecture relevant in the long term. The designers rapidly realised that this was impractical, and designed a new architecture, intended to last at least 25 years. They therefore went for 64-bit addressing, to make sure that they didn't run out of address space.
Releasing a competitor to the Alpha or MIPS which wasn't 64-bit would obviously have faced a marketing problem, in the form of "why isn't it 64-bit?" questions. So 64-bit became the consensus. The much newer RISC-V architecture makes some provision for 128-bit addressing, although this has not yet been designed.
An important detail: no current 64-bit processor can actually have 64 bits' worth of memory connected to it. None of them have enough address lines. This does not matter: future implementations can be given more address lines. Programmers have to be discouraged from using the "spare" address bits, but that is practical to do, and operating systems can be designed to reject such usage.
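A sketch of how an operating system can police those spare bits. The 48 here is an assumption for illustration: current hardware commonly implements 48 virtual address bits, and user-space addresses typically have the unimplemented high bits zero, so any address smuggling data up there can be rejected before it is used.

```python
IMPLEMENTED_BITS = 48  # assumed number of implemented virtual address bits

def is_valid_user_address(addr):
    """Reject any user-space address that stores data in the
    unimplemented high bits (a simplified canonical-form check)."""
    return addr == addr & ((1 << IMPLEMENTED_BITS) - 1)

print(is_valid_user_address(0x0000_7FFF_DEAD_BEEF))  # True
print(is_valid_user_address(0xAB00_7FFF_DEAD_BEEF))  # False - tag in high bits
```

Because such checks were in place from the start, widening the implemented address range later only requires changing the constant, not rewriting programs, which is exactly the disruption the 68000 and 360 successors could not avoid.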
64-bit ARMv9-A has optional features to improve security that use some of the "spare" high bits, but they are optional, intended for use in mobile devices which don't need peta- and exabyte memories at present.