I am using the textbook Computer Systems: A Programmer's Perspective, 3rd ed., by Randal E. Bryant and David R. O'Hallaron (Pearson, 2016), and there is a section I don't understand very well.
C code:
void write_read(long *src, long *dst, long n)
{
    long cnt = n;
    long val = 0;
    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}
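To see what the function does when the addresses differ and when they match, here is a minimal sketch. The helper run_case, the array a, and its starting values are my own assumptions for illustration, and write_read is repeated only so the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Same function as in the question, repeated so this compiles standalone. */
void write_read(long *src, long *dst, long n)
{
    long cnt = n;
    long val = 0;
    while (cnt) {
        *dst = val;
        val = (*src) + 1;
        cnt--;
    }
}

/* Hypothetical helper: runs write_read for 3 iterations on a fresh
   two-element array and copies the final contents into out.
   same_address != 0: src and dst both point to a[0].
   same_address == 0: src points to a[0], dst points to a[1]. */
void run_case(int same_address, long out[2])
{
    long a[2] = { -10, 17 };   /* assumed starting values */

    assert(out != NULL);
    write_read(&a[0], same_address ? &a[0] : &a[1], 3);
    out[0] = a[0];
    out[1] = a[1];
}
```

With these starting values, run_case(0, out) leaves out as {-10, -9}, while run_case(1, out) leaves it as {2, 17}. Only in the same-address case does each load read the value written by the store just before it.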
Inner loop of write_read:
# src in %rdi, dst in %rsi, val in %rax
.L3:
    movq %rax, (%rsi)    # Write val to dst
    movq (%rdi), %rax    # t = *src
    addq $1, %rax        # val = t+1
    subq $1, %rdx        # cnt--
    jne .L3              # If != 0, goto loop
Given this code, the textbook gives a diagram (Figure 5.35) to describe the program flow.
This is the explanation given, for those who don't have access to the textbook:
Figure 5.35 shows a data-flow representation of this loop code. The instruction movq %rax,(%rsi) is translated into two operations: The s_addr instruction computes the address for the store operation, creates an entry in the store buffer, and sets the address field for that entry. The s_data operation sets the data field for the entry. As we will see, the fact that these two computations are performed independently can be important to program performance. This motivates the separate functional units for these operations in the reference machine.
In addition to the data dependencies between the operations caused by the writing and reading of registers, the arcs on the right of the operators denote a set of implicit dependencies for these operations. In particular, the address computation of the s_addr operation must clearly precede the s_data operation.
In addition, the load operation generated by decoding the instruction movq (%rdi), %rax must check the addresses of any pending store operations, creating a data dependency between it and the s_addr operation. The figure shows a dashed arc between the s_data and load operations. This dependency is conditional: if the two addresses match, the load operation must wait until the s_data has deposited its result into the store buffer, but if the two addresses differ, the two operations can proceed independently.
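My current reading of that passage, written as a toy C model. This is only a sketch of the dependency described in the quote, not how the hardware is built. The struct sb_entry, its fields, and the three functions are names I made up, and taking the value from the store buffer on an address match is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a single store-buffer entry (hypothetical layout). */
struct sb_entry {
    const long *addr;   /* address field, set by s_addr */
    long data;          /* data field, set by s_data */
    int addr_valid;     /* nonzero once s_addr has run */
    int data_valid;     /* nonzero once s_data has run */
};

/* s_addr: create the entry and set its address field. */
void s_addr(struct sb_entry *e, const long *addr)
{
    e->addr = addr;
    e->addr_valid = 1;
    e->data_valid = 0;
}

/* s_data: set the data field; s_addr must already have run. */
void s_data(struct sb_entry *e, long val)
{
    assert(e->addr_valid);
    e->data = val;
    e->data_valid = 1;
}

/* load: check the pending store before reading.
   Returns 1 and writes *out if the load can complete now,
   or 0 if it has to wait for s_data. */
int load(const struct sb_entry *e, const long *addr, long *out)
{
    if (e->addr_valid && e->addr == addr) {
        if (!e->data_valid)
            return 0;       /* addresses match, data not there yet: wait */
        *out = e->data;     /* addresses match: take the pending value */
        return 1;
    }
    *out = *addr;           /* addresses differ: read memory independently */
    return 1;
}
```

In this model the load always depends on s_addr, because it needs the address to compare against, but it depends on s_data only when the two addresses match. That would correspond to the dashed arc in the figure.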
a) What I am not clear about is why, after the line movq %rax,(%rsi), a load needs to be done once s_data has been called. I'm assuming that when s_data is called, the value of %rax is stored in the memory location that %rsi points to? Does this mean that every s_data needs to be followed by a load call?
b) It isn't shown in the diagram, but from what I understand of the book's explanation, does the line movq (%rdi), %rax require its own s_addr and s_data? Is it accurate to say that every movq requires an s_addr and an s_data call, followed by a check of whether the addresses match, before load is called?
I'm quite confused about these parts and would appreciate it if someone could explain how the s_addr and s_data calls work with load, and when they are required. Thank you!