First of all, is this actually true? My feeling is that reads will always be faster than writes, and this guy here runs some experiments to "prove" it. He doesn't explain why, he just mentions "caching issues" (and his experiments don't seem to account for prefetching).
But I don't understand why that would be. If it matters, let's assume we're talking about the Nehalem architecture (like the i7), which has private L1 and L2 caches per core and then a shared inclusive L3 cache.
Probably this is because I don't correctly understand how reads and writes work, so I'll write out my understanding. Please tell me if anything is wrong.
If I read some memory, the following steps should happen (assume every level misses):
1. Check if already in L1 cache, miss
2. Check if in L2 cache, miss
3. Check if in L3 cache, miss
4. Fetch from memory into (L1?) cache
I'm not sure about the last step. Does the data percolate down through the caches, meaning that on a miss memory is read into L3, then L2, then L1, and only then read from there? Or can the read "bypass" all the caches, with the caching happening in parallel for later accesses? (reading = check all caches + fetch from RAM into cache + read from the cache?)
Then a write:
1. All caches have to be checked (read) in this case too
2. If there's a hit, write there, and since Nehalem has write-through caches, write to memory immediately and in parallel
3. If all caches miss, is the write done directly to memory?
Again, I'm not sure about the last step. Can a write be done "bypassing" all the caches, or does writing always involve reading the line into the cache first, modifying the cached copy, and letting the write-through hardware actually write to the memory location in RAM? (writing = check all caches + fetch from RAM into cache + write to the cache, with the write to RAM happening in parallel ==> writing is almost a superset of reading?)