how does store operation in memory performance work?

Question

I am using the textbook Randal E. Bryant, David R. O’Hallaron - Computer Systems. A Programmer’s Perspective [3rd ed.] (2016, Pearson), and there is a section I don't understand very well.

C code:

void write_read(long *src, long *dst, long n)
{
 long cnt = n;
 long val = 0;

 while (cnt) {
  *dst = val;
  val = (*src)+1;
  cnt--;
 }
}

Inner loop of write_read:

#src in %rdi, dst in %rsi, val in %rax
 .L3: 
    movq %rax, (%rsi)  # Write val to dst
    movq (%rdi), %rax  # t = *src
    addq $1, %rax      # val = t+1
    subq $1, %rdx      # cnt--
    jne .L3            # If != 0, goto loop

Given this code, the textbook gives this diagram to describe the program flow

This is the explanation given, for those who don't have access to the TB:

Figure 5.35 shows a data-flow representation of this loop code. The instruction movq %rax,(%rsi) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance. This motivates the separate functional units for these operations in the reference machine.

In addition to the data dependencies between the operations caused by the writing and reading of registers, the arcs on the right of the operators denote a set of implicit dependencies for these operations. In particular, the address computation of the s_addr operation must clearly precede the s_data operation.

In addition, the load operation generated by decoding the instruction movq (%rdi), %rax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.

a) What I am not really clear about is why, after the line movq %rax,(%rsi), a load needs to be done once s_data is called. I'm assuming that when s_data is called, the value of %rax is stored in the location that the address in %rsi points to? Does this mean that after every s_data there needs to be a load call?

b) It doesn't really show in the diagram, but from what I understand of the explanation given in the book, the line movq (%rdi), %rax requires its own set of s_addr and s_data? So is it accurate to say that all movq calls require an s_addr and s_data call, followed by a check of whether the addresses match, before calling load?

Quite confused over these parts, would appreciate if someone can explain how the s_addr and s_data calls work with load and when it is required to have these functions, thank you!!

I don't get the sample code at all -- `rsi` and `rdi` are not incremented, so the code simply copies the same byte time and time again. Shouldn't those `movq` instructions be `lodsq` and `stosq`? — TonyK, Oct 31 '21 at 12:57
Don't think it's `lodsq` or `stosq`, in this exercise we're given `movq` to work with :) and we have to 'break it down' to s_addr/s_data as shown in the diagram... @TonyK — Megan Darcy, Oct 31 '21 at 13:02
And now that you've added the `C` code, exactly the same applies. I know it's only an example, but it doesn't have to be so pointless! (Also, the two snippets behave _very_ differently if `cnt` is zero.) — TonyK, Oct 31 '21 at 13:02
@TonyK: It looks like an artificial example to talk about [memory disambiguation](http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/) (store forwarding or not). It doesn't *need* to be a loop, but making it a loop makes it possible to time it for multiple iterations. Fun fact: modern CPUs *dynamically predict* whether a load is from the same address as an earlier store, if there are stores whose address isn't known yet (`s_addr` uop not executed yet, regardless of `s_data`). https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake has some SKL experiments — Peter Cordes, Oct 31 '21 at 19:43
@TonyK: I assume the asm `do{}while()` style loop is *just* the loop, omitting a check for `cnt == 0` to skip over the whole loop [that a compiler would have put around it](https://stackoverflow.com/questions/47783926/why-are-loops-always-compiled-into-do-while-style-tail-jump). Or else it's from a build where `cnt` was a compile-time constant and thus known non-zero. Since the question isn't about that, it's not a big deal, I don't think the question needs more clutter. — Peter Cordes, Oct 31 '21 at 19:46
@PeterCordes: Still, you must admit that the example code is silly. — TonyK, Oct 31 '21 at 19:58
@TonyK: Yes, absolutely, it's a microbenchmark for exploring CPU behaviour, not something you'd run in a real program. And a smarter compiler might check for aliasing and non-zero `n`, and then remove the loop with either `*src + 1` or `*src + n` or something, since only the last store is visible. GCC and clang don't (https://godbolt.org/z/K7jvb4Pz5), probably because normal code doesn't have loops like this that are worth looking for in the first place. — Peter Cordes, Oct 31 '21 at 19:59

score 3 · Accepted Answer · answered Oct 31 '21 at 19:41

The operations in the blue boxes are micro-operations (also called uops or micro-instructions) emitted by the decoders of the pipeline. They are part of the program being executed. The movq (%rdi), %rax instruction is decoded into the load uop. A uop is the unit of execution in the pipeline. Uops aren't called, they're executed.

According to the hypothetical processor design discussed in the book, a simple store instruction like movq %rax, (%rsi) is decoded into two uops, called s_addr and s_data. This happens in real x86 processors as well. One reason why a macro instruction may be decoded into more than one uop is that the format of a uop doesn't allow it to hold all the information given in the instruction, such as when the instruction has too many operands or represents a complex task. Another reason is to increase instruction-level parallelism. The address of the store and the data of the store could become available in different cycles. If the address is available but the data isn't, the s_addr uop can be dispatched to the load-store unit so that the addresses of later load uops can be compared against the address of the store early, without having to wait for the data of the store. The process of determining whether a later load depends on an earlier store is called memory disambiguation. If the load movq (%rdi), %rax doesn't overlap with the earlier store movq %rax, (%rsi), then it can be executed immediately, irrespective of whether the value in %rax is ready or not.
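To make the two cases concrete, here is a minimal sketch that runs the function from the question once with src and dst pointing to different words and once with both pointing to the same word. The helper run_case and the values {-10, 17} and n = 3 are assumptions chosen for illustration; they are not part of the question's code.

```c
#include <assert.h>

/* Same function as in the question. */
void write_read(long *src, long *dst, long n)
{
    long cnt = n;
    long val = 0;

    while (cnt) {
        *dst = val;         /* store: decoded into s_addr + s_data */
        val = (*src) + 1;   /* load: may or may not overlap the store above */
        cnt--;
    }
}

/* Hypothetical helper: runs one case on a fresh array {-10, 17} and
 * copies the final array contents into out. */
void run_case(int aliased, long out[2])
{
    long a[2] = {-10, 17};

    assert(out != 0);
    if (aliased)
        write_read(&a[0], &a[0], 3);   /* load reads the word the store just wrote */
    else
        write_read(&a[0], &a[1], 3);   /* load and store touch different words */

    out[0] = a[0];
    out[1] = a[1];
}
```

In the non-aliased case the array ends up as {-10, -9}: every load reads the unchanged a[0], so no load ever has to wait for an s_data. In the aliased case it ends up as {2, 17}: each load needs the value written by the store just before it, so the load, the add, and the next s_data form a chain from one iteration to the next.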

When the s_data uop is executed, the value in %rax is stored in the data field of the store buffer entry allocated for the store. Storing the value in the target memory location happens later, after all earlier instructions have completed execution, so that program order is maintained.

The book says that "the address computation of the s_addr operation must clearly precede the s_data operation" probably because, according to the book, the s_addr uop has to first create an entry in the store buffer before the data can be stored in it. This may be fine for the hypothetical design, but it's an unnecessary dependency since allocation can be done before execution. Resource allocation and reclamation aren't discussed in the book anyway.
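The interplay between s_addr, s_data, and a later load can be sketched as a toy software model of a store buffer. This is only an illustration, not how any real CPU is built: all names here (store_buffer, sb_alloc, exec_s_addr, exec_s_data, exec_load, sb_commit_all) are made up. It assumes the entry is allocated before either uop executes, which differs from the book's design where s_addr creates the entry, and it assumes a load simply waits when an older store's address is still unknown.

```c
#include <assert.h>
#include <stdbool.h>

#define SB_ENTRIES 8

/* One pending store. Field names are hypothetical. */
typedef struct {
    bool addr_valid;   /* set when the s_addr uop executes */
    bool data_valid;   /* set when the s_data uop executes */
    long *addr;
    long data;
} sb_entry;

typedef struct {
    sb_entry e[SB_ENTRIES];
    int count;         /* entries are allocated in program order */
} store_buffer;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } load_status;

/* Reserve an entry for a store before its uops execute. */
int sb_alloc(store_buffer *sb)
{
    assert(sb->count < SB_ENTRIES);
    int i = sb->count++;
    sb->e[i] = (sb_entry){0};
    return i;
}

/* s_addr and s_data fill their fields independently, in either order. */
void exec_s_addr(store_buffer *sb, int i, long *addr)
{
    sb->e[i].addr = addr;
    sb->e[i].addr_valid = true;
}

void exec_s_data(store_buffer *sb, int i, long data)
{
    sb->e[i].data = data;
    sb->e[i].data_valid = true;
}

/* A load checks pending stores from youngest to oldest. */
load_status exec_load(const store_buffer *sb, const long *addr, long *out)
{
    for (int i = sb->count - 1; i >= 0; i--) {
        const sb_entry *s = &sb->e[i];
        if (!s->addr_valid)
            return LOAD_MUST_WAIT;     /* address unknown: wait (simplification) */
        if (s->addr == addr) {
            if (!s->data_valid)
                return LOAD_MUST_WAIT; /* addresses match, data not there yet */
            *out = s->data;            /* take the value from the store buffer */
            return LOAD_FORWARDED;
        }
    }
    *out = *addr;                      /* no matching store: read memory */
    return LOAD_FROM_MEMORY;
}

/* Later, stores are written to memory in program order. */
void sb_commit_all(store_buffer *sb)
{
    for (int i = 0; i < sb->count; i++) {
        assert(sb->e[i].addr_valid && sb->e[i].data_valid);
        *sb->e[i].addr = sb->e[i].data;
    }
    sb->count = 0;
}
```

In this model, a load from a different address returns LOAD_FROM_MEMORY as soon as exec_s_addr has run, even if exec_s_data has not. A load from the same address returns LOAD_MUST_WAIT until exec_s_data runs and LOAD_FORWARDED afterwards, which corresponds to the conditional dashed arc in the book's figure. Memory itself only changes in sb_commit_all.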

A simple load instruction is decoded into a single load uop. There is no reason to split the load into multiple uops.

Well spotted that the book claims `s_data` can't execute first. Indeed, in real Intel CPUs (which do decode stores into 2 separate uops like this) a store-buffer entry is allocated during issue/rename/alloc as the uops of a store instruction are copied into the out-of-order back-end. AMD handles stores as a single uop, but I think has separate queues for separate execution units so I think the uop just gets copied to both queues, which is pretty much the same as Intel's micro-fused store-addr + store-data uops. — Peter Cordes, Oct 31 '21 at 19:52
