
Consider the following code, with two adjacent mutex-protected sections that access shared memory:

#include <mutex>

std::mutex mutex;
int x;

void func() {
    {
        std::lock_guard lock{mutex};
        x++;
    }
    {
        std::lock_guard lock{mutex};
        x++;
    }
}

Without the mutexes, the compiler could obviously coalesce the increments, like so:

void func() {
    x += 2;
}

This got me thinking about what might prevent that optimisation. We know that even if x is std::atomic<int>, the optimisation is still legal (though perhaps not performed in practice).
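To make the atomic case concrete, here is a minimal sketch (the names are my own, for illustration): two adjacent relaxed increments that a conforming compiler could, under the as-if rule, coalesce into a single fetch_add(2), because no thread is guaranteed to observe the intermediate value.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Two adjacent relaxed increments. A conforming compiler could, in
// principle, coalesce these into a single fetch_add(2): every execution
// of the coalesced version is also a valid execution of this one, since
// no other thread is *guaranteed* to observe the intermediate value.
void bump_twice() {
    counter.fetch_add(1, std::memory_order_relaxed);
    counter.fetch_add(1, std::memory_order_relaxed);
}
```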

But what about with mutexes? Is it legal for the compiler to transform func into this?

void func() {
    std::lock_guard lock{mutex};
    x += 2;
}

It's clear that the writes to x must be visible side effects to other threads due to acquire-release semantics, but I'm not sure whether that forbids the optimisation. Perhaps one can argue under the as-if rule that you couldn't tell whether the optimisation was done, and therefore that it is legal, since the difference between "unlock then lock again" and "keep holding the lock" depends only on thread-scheduling timing.
On the other hand, it seems plausible that this optimisation could slippery-slope into every program containing a mutex being optimised into: lock mutex, run the entire program, unlock mutex. Such behaviour would obviously be unhelpful for concurrent applications, so it might reasonably be forbidden.

What aspect of the C++ memory model, and wording in the Standard, allows or forbids this optimisation?

(Currently, Clang and GCC do not perform this optimisation, as can be seen here. However, this is not proof that the optimisation is or should necessarily be illegal.)

  •
    I suspect there might be some wording about mutex (un)locking being a visible external effect akin to I/O, e.g. [intro.progress]/1 lists both.
    – yeputons
    Commented Jun 22 at 13:03
  • "... The lock() operation on a Mutex is also an Acquire operation" and memory_order_acquire "...no reads or writes in the current thread can be reordered before this load...." Commented Jun 22 at 13:25
  • @yeputons If you are talking about synchronizes-with, happens-before, and how they affect visible side effects, then I'm not convinced. My reading is that IF thread B reads value V then everything before thread A writes V must be committed and visible. It seems plausible that there is no restriction on what value thread B MUST read. (I think this is a similar situation to coalescing std::atomic accesses, which I linked.)
    – MC ΔT
    Commented Jun 22 at 13:25
  • @RichardCritten I think this is true of std::atomic too, which we think can be coalesced: stackoverflow.com/q/45960387/7064452
    – MC ΔT
    Commented Jun 22 at 13:30
  • Interesting question! MSVC doesn't optimize it into one lock either btw.
    – Ted Lyngmo
    Commented Jun 22 at 13:59

1 Answer


Yes, the transformation is allowed, for the reasons you yourself gave.

For any given input, no matter how the rest of the code looks, the set of observable behaviors permitted with the transformed function will always be a subset of the observable behaviors permitted with the original function.

There is no requirement that unlocking the mutex causes another thread blocked on acquiring it to wake up immediately and acquire the lock atomically with the unlock. Therefore, it is perfectly fine to schedule the unlocking thread so that it immediately reacquires the lock, regardless of the actions of any other thread.

Whether the lock was released is not observable behavior. Observable behavior consists only of accesses of volatile objects, the final state of files written, and interactive I/O in an implementation-defined manner. See [intro.abstract]/6.

So the release/reacquire can be skipped without affecting the observable behavior under this scheduling scheme, which is a permissible scheduling scheme and therefore defines one of the permitted choices of observable behavior. The compiler only has to ensure that some one of the permitted observable behaviors is realized. See [intro.abstract]/5.
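The subset argument can be demonstrated with a hypothetical observer thread (this sketch and its names are mine, not part of the question): in the original program an observer that snapshots x under the lock may see 0, 1, or 2; after the coalescing transformation it can only see 0 or 2, a strict subset of the original outcomes, so no observer can prove the transformation happened.

```cpp
#include <mutex>
#include <thread>

std::mutex mtx;
int x = 0;

// The original shape: two adjacent critical sections.
void writer() {
    { std::lock_guard lock{mtx}; x++; }
    { std::lock_guard lock{mtx}; x++; }
}

// Snapshots x under the lock. In the original program this may return
// 0, 1, or 2 depending on scheduling; after coalescing, only 0 or 2 --
// both of which were already possible outcomes before the transformation.
int observe() {
    std::lock_guard lock{mtx};
    return x;
}
```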

On the other hand, it seems plausible that this optimisation could slippery-slope into every program containing a mutex being optimised into: lock mutex, run the entire program, unlock mutex. Such behaviour would obviously be unhelpful for concurrent applications, so it might reasonably be forbidden.

There is [intro.progress]/8, which recommends that the thread executing main and the threads created with std::thread/std::jthread provide the defined concurrent forward-progress guarantees, meaning that such a thread should, as long as it has not terminated, eventually make progress ([intro.progress]/6).

For the purpose of making progress, per [intro.progress]/4, a blocking library call is considered to continuously check the condition that causes it to block, with each such check making progress. The other execution steps that constitute progress are accesses through volatile glvalues and the completion of calls to library I/O functions, of synchronization operations, and of atomic operations.

However, I don't think that this even forbids the transformation of

while(true)
{
    std::lock_guard lock{mutex};
    //A
    x++; // (assume x is unsigned here)
}

into

std::lock_guard lock{mutex};
while(true)
{
    x++;
}

Even if another thread attempts to lock the mutex, the following scheduling behavior would not violate this forward-progress guarantee: repeatedly execute the loop until //A, then switch to the other thread to "make progress" by checking the blocking condition, then switch back and repeat.

Forward progress and scheduling of threads are largely left to the implementation as a quality-of-implementation decision. See also N3209, which discusses this from when multi-threaded execution was added in C++11.

I do not expect that any compiler will make any attempt at such transformations at the code level.

However, even common OS schedulers provide no strict guarantee that the scheduling scheme I described above won't happen. Generally it just becomes probabilistically unlikely over longer periods, because threads are preempted at more or less stochastically noisy intervals. If the scheduling does happen as described, then even without the compiler's transformation the program behaves as if the transformation had been made.

I could imagine other environments where the described scheduling behavior may persist indefinitely. Suppose, for example, that mutex locking is implemented as a simple spin lock on a uniprocessor system, and that the scheduler is perfectly fair and deterministic in the number of instructions it lets each thread execute in any time slice. In such an environment you may be unlucky: the locking thread may always be able to reacquire the lock within the same time slice in which it released it.
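For concreteness, here is a minimal test-and-set spin lock (an illustrative sketch of my own, not how any particular std::mutex is actually implemented). Note that nothing in it wakes or prioritizes a waiter on unlock: a thread that unlocks and immediately relocks will win the lock again before a spinning waiter on a uniprocessor ever gets scheduled to observe the unlocked state.

```cpp
#include <atomic>

// Minimal test-and-set spin lock (sketch). unlock() merely clears the
// flag; it does not hand the lock to a waiter. A thread that calls
// unlock() and then lock() back-to-back will typically reacquire the
// lock before any spinning waiter runs.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait until the flag is clear
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```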

Or multithreading may be completely cooperative, in which case there may be an issue if the mutex implementation does not yield after an unlock.
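On such a cooperative system, one mitigation (a sketch of my own, not anything the Standard requires) is for the unlock path to yield to the scheduler, giving a blocked thread a chance to run before the unlocking thread can reacquire:

```cpp
#include <mutex>
#include <thread>

std::mutex mtx_;

// Hypothetical helper: release the lock and immediately yield, hinting
// the scheduler to run a thread blocked on mtx_ before this thread can
// reacquire it. Without the yield, a cooperative thread that unlocks and
// relocks in a loop could starve every waiter indefinitely.
void unlock_and_yield() {
    mtx_.unlock();
    std::this_thread::yield();
}
```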

I think the standard can't make any stronger guarantees than it does, in part because it would make supporting such environments impossible.

  •
    I am not sure this is correct. AFAIK scopes should never be optimized away. (Something to do with the memory model and memory ordering; I have no time to look that up right now) Commented Jun 22 at 16:02
  • 4
    @PepijnKramer I do not see any reason why that would be the case. Scopes don't matter, as long as the observable behavior isn't affected and it is impossible to observe whether a scope has been "optimized away", whatever that is supposed to mean exactly. Commented Jun 22 at 16:33
  • 2
    @PepijnKramer In fact, it would be bad if a compiler wouldn't optimize across multiple scopes. For example if I have int i = 0; { i = 1; } { i = 2; } I expect an optimizing compiler to only emit the store of 2 to i. Commented Jun 22 at 16:35
  • Locks are permitted to be (and in practice are) unfair. Releasing a lock and immediately reaquiring it will succeed if nobody can sneak in. (And on a uniprocessor system, nobody will sneak in if the thread still has quantum.) So the system behaves in practice as if the optimization had occurred. Commented Jun 22 at 16:37
    I do not believe that, in practice, any compiler can optimize away a mutex's lock/unlock. It is less due to what the standard specifies mutexes to do and more due to what, in practice, compilers have to emit to lock/unlock a mutex - the code is far too complex.
    – ALX23z
    Commented Jun 22 at 17:25
