As a computer architect who has worked on processors including non-microcoded RISCs and microcoded CISCs in industry, I make more distinctions than just hardware/software:
- Inside the instruction set, or NOT
- Inside the processor, as opposed to being implemented by some completely separate device
How implemented? A spectrum:
- dedicated combinatoric hardware circuits implementing full operation
- hardware state machines controlling circuits implementing partial operation
- "horizontal microcode" - although I prefer to say microcode that actually controls timing
- "generic microcode" - sometimes called "vertical microcode",
- PALcode (like DEC Alpha)
- trap and emulate
- "pure software" using ordinary instructions
In this implementation spectrum, software is everything below the line drawn between microcode and PALcode, although from my point of view a processor's generic microcode is almost the same as software.
But from my point of view, there is a very big distinction between the "real hardware" combinatoric circuits and hardware state machines, on the one hand, and generic microcode on the other. Horizontal or explicitly timing-controlled microcode is the fuzzy middle ground.
Inside the instruction set or not is almost completely orthogonal, except for that last item of "pure" software.
As others have pointed out, inside or outside the processor is almost equally orthogonal.
---+ Inside the instruction set
As you (OP) say, an operation like multiply or divide may be provided by the instruction set or not.
If not provided by the instruction set, e.g. on one of the early RISC processors that provided no multiply instruction, then multiply is definitely "implemented in software by ordinary instructions".
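For concreteness, here is a minimal sketch in C of what such a pure-software multiply looks like: the classic shift-and-add loop that a runtime library on a multiply-less RISC might provide (the function name is mine, purely illustrative).

```c
#include <stdint.h>

/* "Multiply implemented in software by ordinary instructions":
   shift-and-add, one multiplier bit per iteration. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)          /* if the low multiplier bit is set,       */
            product += a;   /* add the (shifted) multiplicand          */
        a <<= 1;            /* shift multiplicand left                 */
        b >>= 1;            /* consume one multiplier bit              */
    }
    return product;         /* low 32 bits of the product              */
}
```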
If the processor has a multiply step or divide step instruction, I would probably say "lacks full hardware multiply or divide", or is "implemented in software with partial hardware support".
If the full operation is provided by the instruction set: ...
... If implemented by trap and emulate, I would classify that as a software implementation - but T&E is important enough that I might say "implemented by trapping and emulating in software".
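To make the T&E shape concrete, here is a hedged sketch: an illegal-instruction handler that recognizes a multiply opcode and emulates it. The trap-frame layout, field extractors, and opcode value are hypothetical, not any real architecture's.

```c
#include <stdint.h>

struct trap_frame {
    uint32_t regs[32];   /* saved general-purpose registers (hypothetical) */
    uint32_t pc;         /* faulting program counter                       */
};

/* hypothetical instruction encoding */
#define OPCODE(insn)  ((insn) >> 26)
#define RD(insn)      (((insn) >> 21) & 31)
#define RS1(insn)     (((insn) >> 16) & 31)
#define RS2(insn)     (((insn) >> 11) & 31)
#define OP_MUL        0x19   /* hypothetical multiply opcode */

void illegal_insn_trap(struct trap_frame *tf)
{
    uint32_t insn = *(uint32_t *)(uintptr_t)tf->pc;  /* fetch faulting insn */
    if (OPCODE(insn) == OP_MUL) {
        tf->regs[RD(insn)] = tf->regs[RS1(insn)] * tf->regs[RS2(insn)];
        tf->pc += 4;         /* resume past the emulated instruction */
    } else {
        /* genuinely illegal: signal the process, panic, etc. */
    }
}
```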
---+ "implemented in hardware"
On to "implemented in hardware":
---++ Combinatoric Logic and/or Hardware State Machines
Some operations, like 32-bit multiply, might be implemented as a single combinatoric logic circuit, e.g. a 32x32=32 or 32x32=64 array multiplier, possibly pipelined over several cycles; or by a smaller circuit, e.g. a 32x8=32 or 40-bit slice, taking several passes or iterations.
If a single combinatoric circuit, I would say "implemented in hardware".
If several passes over a partial multiplier array slice, or similarly for a divider, I tend to make distinctions based on how the iterations or loops are controlled.
If the looping is controlled by a dedicated hardware state machine, then I would say "implemented in [a] hardware [state machine]."
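A C model (emphatically not RTL) may help picture this: the loop below stands in for the hardware state machine, sequencing four passes over a hypothetical 32x8 multiplier slice.

```c
#include <stdint.h>

/* Model of "hardware state machine controlling a partial multiplier slice":
   each "cycle", a 32x8 slice multiplies the full multiplicand by 8 bits of
   the multiplier; the loop sequences 4 passes and accumulates shifted
   partial products. */
uint64_t mul32_via_8bit_slice(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int pass = 0; pass < 4; pass++) {                        /* the "state machine" */
        uint64_t partial = (uint64_t)a * ((b >> (8 * pass)) & 0xff); /* the 32x8 slice  */
        acc += partial << (8 * pass);                             /* shift and accumulate */
    }
    return acc;                                                   /* full 32x32=64 product */
}
```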
---+ Microcode
If you have significant combinatoric logic hardware, but the looping or control or sequencing of that hardware is controlled by microcode - in particular by generic microcode, e.g. the vertical or VLIW microcode of many machines, which could actually be used for general-purpose programming if it were exposed to the user - then I will usually make the distinction "partially or substantially implemented in hardware, with microcode control", or "microcode with hardware acceleration".
But some microcode implementations don't have big chunks of combinatoric logic to do the work. They might implement the operation just using ordinary microcode loads and stores and arithmetic instructions, and so on - almost exactly as it might have been implemented in a pure software implementation.
From my point of view as a computer architect, pure or generic microcode is almost the same as software.
Note that modern x86 processors have both generic microcode and dedicated hardware state machines: dedicated hardware state machines for things like integer divide, floating-point divide, and floating-point (inverse) square root; generic microcode often for transcendentals like sine and logarithm, although usually with some dedicated hardware such as lookup tables; and dedicated non-iterative hardware for things like multiply.
---+ What is "generic" microcode?
What makes microcode "generic" or not? Well, some processors have microcode that could almost be exposed to the user as an ordinary instruction set.
E.g. it might be RISC-like micro-ops (uops or rops), although frequently failing the classic definition of RISC by having micro-instruction sizes like 140 bits, or more than the classic RISC 2 or 3 input / 1 output registers. Often VLIW.
Microcode may also differ from software, i.e. micro-instructions may differ from the macro instruction set, by having access to registers that neither the user nor the operating system can access. Or ordinary registers may be extended: e.g. the user-visible 32- or 64-bit architectural registers may actually be 40 or 72 bits wide, e.g. so that condition codes are computed by the same hardware circuits as the main register bits, to be separated later.
From my point of view, what makes microcode "generic" or not is whether it has very tight control of timing: generic micro-instructions are not really aware of timing, and flow through the machine under the control of an out-of-order or in-order instruction scheduler. Some textbooks make the distinction between vertical microcode and horizontal microcode, and some of the original VLIW instruction sets can be imagined as exposing the parallel operations of horizontal microcode. I find it more useful to distinguish microcode that is timing-aware from microcode that is not.
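A hedged sketch of that contrast, with illustrative field names and widths rather than any real machine's:

```c
#include <stdint.h>

/* A "generic"/vertical micro-op looks like a small RISC instruction; the
   scheduler, not the microcode, worries about timing. May name physical
   registers wider or more numerous than the architectural ones. */
struct generic_uop {
    uint8_t opcode;
    uint8_t dst, src1, src2;
};

/* A horizontal microword is a wide bundle of per-cycle control signals:
   timing-aware by construction. Fields here are purely illustrative. */
struct horizontal_uword {
    unsigned alu_op     : 5;    /* drives the ALU this cycle          */
    unsigned regfile_we : 1;    /* write-enable, asserted this cycle  */
    unsigned bus_select : 3;    /* which source drives the bus        */
    unsigned next_addr  : 12;   /* micro-sequencer next address       */
    /* ... a real machine might have ~140 bits of such signals ...    */
};
```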
I started using this generic or vertical style of microcode when I started designing out-of-order processors: the uops (micro-ops) were the things that flowed through the out-of-order dataflow engine. But you can use the same approach for in-order pipelines as well.
"Generic micro" may be mostly compatible between processor generations and families. Whereas timing dependent micro will obviously have to change if the pipeline depth or interlock change.
---+ PALcode
Why have I spent so much time talking about the difference between hardware and generic microcode? In part because there is another possibility, intermediate between microcode and higher-level software implementations, whether trap and emulate or ordinary software. RISC processors sometimes implement the operations that a so-called CISC machine like the Intel x86 would implement in microcode as what the DEC Alpha called PALcode: essentially the same sort of RISC instructions as are visible to ordinary user code and operating system code, but running in a special mode "underneath" the operating system. Again, PALcode may have access to resources, like special registers, that ordinary user or operating system code does not have. The main distinction between PALcode and classic microcode is that the classic micro instruction set is usually very much different from the ordinary user and operating system instruction sets, whereas for PALcode they are much more similar.
One almost always calls PALcode "software".
Actually, I am told that the Intel i960 RISC processor's microcode was very much like Alpha PALcode - although they called it microcode.
---+ Where Implemented: Inside or Outside
@Arvo, in his answer, mentions another common definition of "in hardware" «In computer context "in hardware" [usually] means offloading some operations off from CPU to some other device».
I completely agree, except for "usually" - but that is undoubtedly a function of what community of people you are working with.
He goes on to say «which itself can use internal software to accomplish its task.» This is important: the external device may itself have hardware combinatoric circuits or state machines, or the operations may be completely implemented in software on the other device.
If implemented in software on the external device, then, if the host processor and the external device processor are comparable, the speed of the operation is probably comparable. Whereas in the common case where the external device processor is significantly slower than the host processor, the overall operation may take longer.
I.e. sometimes such "offload engines" result in operations taking longer than they would if executed on the host processor. But even if it takes longer, overall performance might improve because the host processor and the external device processor are running in parallel. Or perhaps the operation could instead be run in host-processor software on an idle processor in a multiprocessor.
One of my friends has long been an advocate of creating accelerator devices like this. But at MIPS, for example, I frequently had to remind people that "We [MIPS] are the 'hardware' accelerator." By that time MIPS had pretty much failed in the general-purpose computer market, and was mostly being used to control such external devices. Arguably many RISC-V companies are in that space.
Note that such external devices used to implement an operation might be accessed from pure software, from trap and emulate software, from PALcode or microcode, possibly even from hardware, e.g. state machines. Yes: you might have hardware in the CPU sending an operation to an external device that implements it in software on that external device's processor.
Or... External "co-processors" might be used to implement instructions defined in the instruction set. In my experience a "co-processor" usually has a connection that is more intimate with the main processor than normal I/O devices do. A co-processor instruction might send something out across a special co-processor bus. Whereas a non-co-processor accelerator might use ordinary memory mapped I/O, just like a disk drive controlled by the operating system.
Note: implementation as an ordinary I/O device in my experience nearly always means that ordinary users cannot directly access the device. They typically have to do a system call to access it, and/or put a request on a queue that the hardware will pick up later. Or both. In most modern systems ordinary user software cannot receive device interrupts, and such offload paths are often challenged to take page faults. What this overall means is that there is a certain minimal cost that affects the granularity of operations you can send to the external device. It doesn't have to be this way: you can define an I/O device architecture that can be used from user mode. I have. It's just not very common.
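A sketch of that "request on a queue" access pattern; all of the names and the descriptor layout are hypothetical, and the stub stands in for a real system call:

```c
#include <stdint.h>
#include <stdio.h>

struct offload_req {        /* hypothetical request descriptor */
    uint64_t src;           /* the OS would pin/translate these */
    uint64_t dst;
    uint32_t len;
    uint32_t opcode;
};

/* stand-in for the real system call: enqueue the request for the device */
static int offload_submit(const struct offload_req *req)
{
    printf("submit op %u, %u bytes\n", req->opcode, req->len);
    return 0;   /* a real call would return a completion handle */
}

int copy_via_engine(uint64_t src, uint64_t dst, uint32_t len)
{
    struct offload_req r = { src, dst, len, 1 /* hypothetical COPY op */ };
    /* the system-call / queue round trip is the fixed cost that sets the
       minimum useful granularity discussed above */
    return offload_submit(&r);
}
```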
---+ Why use instructions rather than RISC primitives?
Mitch Alsup, a prominent computer architect, often said that there were three reasons to put operations inside the instruction set or in hardware rather than in software RISC instructions:
- performance
- security
- atomicity
I sometimes add a fourth criterion: executable machine code compatibility. Whether across generations, or between low-end and high-end implementations at the same time, or simply for consistency between different vendors. E.g. I was told that one of the big motivations for Motorola adding floating-point instructions to the 68000 family was that the software implementations were so widely different that it was giving Motorola a bad name.
Sometimes the RISC-like primitives that one might use to implement an operation can only be used in certain specific instruction sequences. If the primitives were available to ordinary user programs or even to operating system software then there might be security problems. Yes: even exposing something to the operating system may cause security problems, e.g. if a guest operating system is running in a virtual machine environment. Or programs might suffer atomicity issues, whether multiprocessor/parallel-programming atomicity or atomicity with respect to interrupts. Security or atomicity issues can often be addressed by either microcode or non-microcoded hardware implementations. Performance issues are often mostly addressed by hardware state machines "underneath" generic microcode.
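A tiny illustration of the atomicity point, in C:

```c
#include <stdint.h>

volatile uint32_t flags;   /* shared with an interrupt handler or another core */

void set_flag_bit_unsafely(uint32_t bit)
{
    uint32_t t = flags;    /* load   */
    t |= 1u << bit;        /* modify */
    flags = t;             /* store: an interrupt or another processor updating
                              flags between the load and the store is silently
                              lost; a single atomic OR-to-memory instruction
                              (or a microcoded/hardware implementation) closes
                              that window */
}
```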
---+ Instruction set or not
So, it can be seen that the question of whether an operation is implemented as part of the instruction set is almost orthogonal to whether the operation is implemented inside the processor or outside the processor, or whether the operation is implemented by dedicated hardware combinatoric logic, hardware state machines, microcode, trap and emulate, or pure software. Or even the more abstract intermediate languages that are compiled to the actual machine code for a particular processor (most commonly on GPUs).
Why then put something inside the instruction set?
Well, if it's inside the instruction set you have all of the above implementation possibilities: any given executable binary containing machine code can be run on implementations with various levels of hardware cost and performance, if compatibility matters to you.
But even the most trivial trap and emulate software implementation of an operation costs something for the processor manufacturer, if only testing.
And if your processor is a black box, and if trap and emulate definitely does not deliver acceptable performance, then a company may not really have the option of doing anything in the instruction set. Or a company that is a customer of such a processor may be willing to build external hardware that performs faster than an equivalent operation inside the processor, especially if the internal implementation inside the host processor is not very high-performance.
Note: so-called intermediate languages, which are not directly executed but which are translated at program load time to the actual physical instruction set, have a little bit more flexibility here. They can provide compatibility with a little bit less implementation cost inside the processor. Although somebody has to validate the intermediate language implementation, which is often harder to do than validating hardware or microcode, because the attack surface, in security terminology, is larger.
---+ Examples
As noted elsewhere, multiplication operations were often implemented in pure software, then in software with multiply-step instructions, then in hardware state machines around array multiplier slices, and now for the most part are pure hardware.
Similarly, divide operations have gone through a similar evolution from pure software through divide-step instructions to hardware state machines. I am not aware of anybody building a 100% combinatoric logic implementation of divide, certainly not for 32 or 64 bits. And how many people need to do divides of 8-bit values (that are not already using logarithmic representations, where divide corresponds to integer subtract)?
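For reference, the pure-software end of that evolution is the classic restoring (shift-and-subtract) divide, one quotient bit per iteration - the same recurrence that a divide-step instruction or a hardware state machine sequences. A minimal sketch (unsigned; the caller must ensure the divisor is nonzero):

```c
#include <stdint.h>

uint32_t soft_div32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
        if (r >= d) {                   /* trial subtract                   */
            r -= d;
            q |= 1u << i;               /* record one quotient bit          */
        }
    }
    *rem = r;
    return q;
}
```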
Similarly, IIRC some papers mentioned software implementations of floating-point as a motivation for RISC processors. But nowadays most high-end processors do floating-point add, subtract, and multiply in combinatoric albeit pipelined logic, and floating-point divide with hardware state machines controlling sequencing of the hardware multiplier array, or something like divide-step hardware. Similarly square root, or inverse square root, which can often use almost the same hardware as divide.
Some machines, however, expose a divide-init or 1/x (reciprocal) instruction to software, which can then convert a sequence of divides by the same denominator - x1/y, x2/y, ... - into a sequence of multiplications by 1/y, especially if exact IEEE values are not required.
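A sketch of that transformation as a compiler or programmer might apply it (illustrative code; note the results are not bit-exact IEEE quotients, which is exactly the caveat above):

```c
/* n divides by the same denominator become one divide (or a
   divide-init/reciprocal op) plus n multiplies. */
void scale_by_inverse(float *x, int n, float y)
{
    float inv_y = 1.0f / y;      /* one divide                        */
    for (int i = 0; i < n; i++)
        x[i] *= inv_y;           /* n multiplies instead of n divides */
}
```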
Floating-point NaN or denorm handling is another example where the division between combinatoric hardware and sequencing via state machines or microcode or software trap-and-emulate can be fuzzy. Classic CPUs such as Intel x86, or classic RISCs like PowerPC or RISC-V, often let NaNs or denorms be handled by software or microcode or state machines, since denorms are not that common... but GPUs, if they support NaNs and do not simply flush denorms to 0, are actually more likely than CPUs to implement these things in (pipelined) hardware, since GPUs historically are not very good at handling traps or exceptions. Rather than a single thread of execution trapping, as on a CPU, in a GPU 16 or 32 or 64 SIMT threads of execution might be stopped for a trap and emulation - which is a big motivation for avoiding the trap by implementing whatever hardware is necessary.
IMHO one of the classic examples of these trade-offs is block memory operations, such as memcpy or bcopy. They span all of the axes above: hardware vs. software, combinatoric vs. sequenced by hardware state machines or microcode or trap and emulate, and internal vs. external hardware:
Optimizing memcpy in pure software is a rite of passage for any so-called performance programmer: taking advantage of loop unrolling, cache-bypass instructions, etc., etc. In particular, taking advantage of knowledge of the use case: e.g. if you know that your code is only going to be used to memcpy 64MB memory buffers, completely aligned, you can make a lot of optimizations - optimizations that may not be available to an implementation of memcpy in the instruction set, such as x86 REP MOVSx.
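As a sketch of how much a known use case buys you: if every copy really is a large, fully aligned buffer, all of the alignment and size triage disappears (hypothetical function; undefined behavior if the assumptions are violated):

```c
#include <stdint.h>
#include <stddef.h>

/* Copies exactly 64MB; assumes both pointers are 8-byte aligned. */
void memcpy_64mb_aligned(void *dst, const void *src)
{
    uint64_t       *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;
    for (size_t i = 0; i < (64u << 20) / 8; i += 4) {
        d[i + 0] = s[i + 0];    /* 4-way unrolled wide-word loop: */
        d[i + 1] = s[i + 1];    /* no size check, no alignment    */
        d[i + 2] = s[i + 2];    /* fix-up, no tail handling       */
        d[i + 3] = s[i + 3];
    }
}
```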
I consider that one of the biggest mistakes I made as a computer architect was implementing Intel P6 REP MOVSx in fairly generic microcode. Oh, sure, we took advantage of 64-bit loads and stores at a time when the integer instruction set only had 32-bit loads and stores, and of slightly special memory access uops that reduced unnecessary memory traffic by 33% for large aligned block copies, plus a few gimmicky micro-instructions to start things off.
But IMHO it would've been much better to use a hardware state machine to do that REP MOVSx memcpy. There was way too much microcode overhead looking at the alignment of the source and destination memory buffers, and their sizes, before actually getting into the loops that did the memory copy. One thing that hardware combinatoric logic can do much better than software is irregular IF conditions or multiway branching: software or microcode typically can only do binary two-way branches or 2^N-way computed jumps, whereas a combinatoric logic circuit can evaluate minterms over an extremely variable number of bits in one pass.
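In C, about the closest software can get is packing the condition bits into one index and making a single computed jump; a sketch, with illustrative predicates and cutoffs:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Pack the "minterm" inputs into one index, evaluated in one pass, then
   dispatch once - the software analogue of hardware's multiway branch. */
void dispatch_copy(void *d, const void *s, size_t n)
{
    unsigned idx = ((((uintptr_t)d & 7) == 0) << 2)   /* dst 8-byte aligned? */
                 | ((((uintptr_t)s & 7) == 0) << 1)   /* src 8-byte aligned? */
                 |  (n >= 4096);                      /* big buffer?         */
    switch (idx) {          /* compilers turn dense switches into jump tables */
    case 7:  /* both aligned, big: a wide unrolled loop would go here  */
    case 6:  /* both aligned, small: a short loop would go here        */
    default: /* misaligned cases, each with its own specialized loop;  */
        memcpy(d, s, n);    /* one fallback stands in for all variants */
    }
}
```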
Also... software or microcode branching logic "rots", usually faster than hardware: it may have been designed to evaluate conditions N cycles in front of where they are needed, and it may be doing M-way loop unrolling, but the timing - the values of N and M - may change dramatically on different implementations. The same thing applies to hardware control, except that the hardware control can often take just one or a few cycles, rather than 5 or 15 cycles to go through a chain of if statements or multiway branches. In software, you nearly always have to optimize the branching logic at the start of memcpy to favor big buffers or small buffers or ..., whereas in hardware you can branch directly to any of the above, as long as all of the bits you need to look at fit. At the very least, if you are doing such a software or microcode implementation of memcpy, you should create an auto-tuning framework that regenerates the code for every different processor microarchitecture.
How about external hardware for memcpy? Sure: as mentioned above, an external DMA engine operating on physical addresses can be hard to access from ordinary user code, so you either pay operating system overhead or work around it some other way. Or you might want the engine to operate on virtual addresses... but then it has to be integrated with your operating system's virtual memory code. And if the external DMA engine is far enough out, it may have no choice except to evict the data from the processor cache. But cache bypass is often not the best thing to do.
Yada yada yada
---+ CONCLUSION
My point is that there is a spectrum, from pure software (user code or OS code), through trap and emulate, PALcode, generic microcode, and non-generic microcode that is very tightly bound to the microarchitecture, to hardware state machines and hardware combinatoric circuits. Different people may draw the line separating software and hardware at different places in this spectrum.