so how is that being done exactly?
There is one part of the pipeline that's allowed to both read from and write to the render target: the blending unit at the final stage. This is specialized hardware that's only configurable in a fixed-function fashion, so it can read the colour to blend with, blend, and write the result atomically, in a correctly ordered sequence with all other fragments in flight.
Naturally, this serialized access can be a bottleneck, which is part of the reason why transparent blending is more expensive, and why fill rate is often our limiting factor in GPU rendering. There's a limit to how fast we can force all our writes through even this specialized and very fast hardware unit.
A similar thing happens with the depth value: depending on your depth testing and writing settings, the stored depth is compared and possibly overwritten in a single operation.
Because this unit is so performance-critical, in shader models to date it has stayed on a fixed-function model, unavailable for you to customize programmatically beyond turning depth testing and writes on/off, or changing the comparison used. That keeps the circuitry simple and fast, with exactly predictable execution time for scheduling. All of that gets more complicated if we inject customized programs into this stage.
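To make the "read, blend, write" step concrete, here is a minimal CPU-side sketch in Python. It is only an illustration of the idea, not how any real GPU or API is implemented: `BlendState`, `factor`, and `blend_unit` are hypothetical names, and the render target is just a dictionary. The point is that the only things you get to choose are a few enumerated factors, and that fragments are applied one at a time in submission order.

```python
from dataclasses import dataclass


@dataclass
class BlendState:
    # Hypothetical fixed-function settings: you pick factors from a
    # fixed menu, you do not supply arbitrary code.
    src_factor: str = "src_alpha"
    dst_factor: str = "one_minus_src_alpha"


def factor(name, src_alpha):
    # Look up the numeric value of an enumerated blend factor.
    table = {
        "zero": 0.0,
        "one": 1.0,
        "src_alpha": src_alpha,
        "one_minus_src_alpha": 1.0 - src_alpha,
    }
    return table[name]


def blend_unit(target, pixel, src_rgb, src_alpha, state):
    # Read, blend, write as one indivisible step for this pixel.
    dst_rgb = target[pixel]
    fs = factor(state.src_factor, src_alpha)
    fd = factor(state.dst_factor, src_alpha)
    target[pixel] = tuple(fs * s + fd * d for s, d in zip(src_rgb, dst_rgb))


# One black pixel, then two half-transparent fragments in submission order.
target = {(0, 0): (0.0, 0.0, 0.0)}
fragments = [((1.0, 0.0, 0.0), 0.5), ((0.0, 0.0, 1.0), 0.5)]
for rgb, alpha in fragments:
    blend_unit(target, (0, 0), rgb, alpha, BlendState())
```

Swapping the order of the two fragments gives a different final colour, which is why the hardware has to keep these read-modify-write steps correctly ordered.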
You may object that the depth buffer can also be read earlier in the pipeline, for "early z rejection" to skip executing the fragment shader for fragments that are going to fail the depth test later anyway. This optimization can use a conservative depth estimate, though, one that possibly does not yet take into account writes that are still in flight. If a few fragments leak through because the read was out of date, the depth test at the end still ensures we get the correct result. If the rendering configuration does not allow such a conservative z test (i.e. a write in flight might make an occluded fragment no longer occluded), then early z rejection is disabled.
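Here is a small Python sketch of why an out-of-date early test is safe under a "Less" comparison. Everything in it is an assumption made for illustration: `draw`, `stale_depth`, and the dictionary buffers are hypothetical, and real hardware works on tiles and hierarchical depth rather than a simple snapshot. Since stored depths only ever decrease under "Less", a stale value can only be too permissive, never too strict.

```python
def depth_passes(frag_z, stored_z):
    # "Less" comparison: nearer fragments win.
    return frag_z < stored_z


def draw(fragments, stale_depth, depth, colour):
    shaded = 0
    for pixel, z, rgb in fragments:
        # Early test against a possibly out-of-date copy of the depth buffer.
        if not depth_passes(z, stale_depth[pixel]):
            continue  # safely skipped: it would fail the late test too
        shaded += 1  # the fragment shader would run here
        # Late test against the authoritative depth buffer.
        if depth_passes(z, depth[pixel]):
            depth[pixel] = z
            colour[pixel] = rgb
    return shaded


depth = {0: 0.8}
colour = {0: "background"}
stale_depth = dict(depth)  # snapshot taken before this draw's writes land
fragments = [(0, 0.3, "near"), (0, 0.6, "mid"), (0, 0.9, "far")]
shaded = draw(fragments, stale_depth, depth, colour)
```

The "far" fragment is rejected early, the "mid" fragment leaks through the stale early test and gets shaded for nothing, and the late test still discards it, so the final image is correct.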
could I instead choose to write the depth of furthest fragments (e.g. "Greater" depth comparison), and draw color as closest as usual.
No. There are not really two separate writes happening here, just one. Either the depth test passes, and both colour and depth are written (if colour and depth writes are enabled, respectively), or the depth test fails, and neither is written.
So while you can change the direction of the depth test to less-than or greater-than, the same test is always used for the whole fragment, both colour and depth components.
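As a rough sketch of that rule (hypothetical names, plain dictionaries standing in for the buffers), a single comparison gates both writes:

```python
import operator

# The comparison is chosen from a fixed menu, once, for the whole fragment.
COMPARE = {"Less": operator.lt, "Greater": operator.gt}


def output_merger(depth, colour, pixel, frag_z, frag_rgb,
                  compare="Less", depth_write=True, colour_write=True):
    # One test decides the fate of both the colour and the depth write.
    if not COMPARE[compare](frag_z, depth[pixel]):
        return False  # test failed: neither buffer is touched
    if depth_write:
        depth[pixel] = frag_z
    if colour_write:
        colour[pixel] = frag_rgb
    return True


# With "Greater", clear depth to 0.0 and draw a near and a far fragment.
depth = {0: 0.0}
colour = {0: "background"}
output_merger(depth, colour, 0, 0.7, "far", compare="Greater")
rejected = not output_merger(depth, colour, 0, 0.2, "near", compare="Greater")
```

Switching to "Greater" leaves you with the farthest depth and the farthest colour together. There is no setting in this model that keeps the far depth while keeping the near colour.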
You can control what depth to output in the fragment shader by writing a value to the special DEPTH semantic (SV_Depth in Direct3D 10 and later), but doing so disables that early z rejection optimization. The hardware pipeline can no longer determine whether the fragment will be occluded until the fragment shader has been run, so it has to run the full shader even for fragments which will never be seen.
You could achieve something similar to this with multiple passes, rendering your content once with a "Less" test and saving the resulting depths to one buffer, then rendering your content a second time with a "Greater" test, and reading the buffer you previously saved to do whatever computations you want that rely on knowing both the greatest and least depth values at a point.
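A minimal Python sketch of that two-pass idea follows. The names (`depth_pass`, `thickness`) and the choice to compute a per-pixel thickness are just assumptions for the example; on a GPU the first pass would render to a depth texture that the second pass samples.

```python
import operator


def depth_pass(fragments, compare, clear_value):
    # Render depth only, keeping whichever value wins the comparison.
    buf = {}
    for pixel, z in fragments:
        if compare(z, buf.get(pixel, clear_value)):
            buf[pixel] = z
    return buf


fragments = [
    ((0, 0), 0.5), ((0, 0), 0.25), ((0, 0), 0.75),
    ((1, 0), 0.5),
]

# Pass 1: "Less" test, depth cleared to 1.0, result saved to its own buffer.
nearest = depth_pass(fragments, operator.lt, 1.0)

# Pass 2: "Greater" test, depth cleared to 0.0, then read the saved buffer.
farthest = depth_pass(fragments, operator.gt, 0.0)
thickness = {pixel: farthest[pixel] - nearest[pixel] for pixel in farthest}
```

Each pass on its own still follows the one-test-per-fragment rule. The extra information comes from keeping the first pass's result around as an input to the second.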
You may also want to look into the technique of "depth peeling", which is used in some order-independent transparency approaches.
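Very roughly, depth peeling repeats the normal nearest-wins pass several times, and in each pass discards any fragment that is not strictly behind the layer captured by the previous pass. The sketch below is a simplified CPU illustration with hypothetical names (`peel_layers`, dictionary buffers), not a production implementation:

```python
def peel_layers(fragments, max_layers):
    layers = []
    prev = {}  # per-pixel depth of the layer peeled by the previous pass
    for _ in range(max_layers):
        depth, colour = {}, {}
        for pixel, z, rgb in fragments:
            # Discard anything at or in front of the last peeled layer
            # (this is the extra test a shader would do against the
            # previous pass's depth texture).
            if z <= prev.get(pixel, float("-inf")):
                continue
            # Ordinary "Less" depth test among the remaining fragments.
            if z < depth.get(pixel, float("inf")):
                depth[pixel] = z
                colour[pixel] = rgb
        if not depth:
            break  # nothing left to peel
        layers.append(colour)
        prev = {**prev, **depth}  # keep depths for pixels that ran out
    return layers


fragments = [
    (0, 0.6, "blue"), (0, 0.2, "red"), (0, 0.4, "green"),
    (1, 0.5, "grey"),
]
layers = peel_layers(fragments, max_layers=4)
```

The result is a list of layers ordered front to back per pixel, regardless of the order the fragments were submitted in, which is what makes it useful for order-independent transparency.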