lock addl $0, (%esp) is a substitute for mfence, not lfence.
(lock add is generally faster on modern CPUs, especially on Intel Skylake with updated microcode, where mfence additionally acts like lfence, blocking out-of-order execution even of instructions that only operate on registers. That's why GCC recently switched to using a dummy lock add instead of mfence when it needs a full barrier.)
The use-case is when you need to block StoreLoad reordering (the only kind that x86's strong memory model allows), but you don't need an atomic RMW operation on a shared variable. https://preshing.com/20120515/memory-reordering-caught-in-the-act/
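To make the use-case concrete, here is a minimal C++ sketch in the spirit of that Preshing example (the thread functions and the r1/r2 result variables are illustrative, not taken from the article):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_seq_cst);   // plain store + full barrier (or xchg)
        r1 = y.load(std::memory_order_seq_cst);  // must not happen before the store above
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }

    // With seq_cst, r1 == 0 && r2 == 0 is impossible.  With weaker orderings
    // (even release stores / acquire loads), StoreLoad reordering allows it.

Neither thread does an atomic RMW on a shared variable; each one just needs its own store to stay ahead of its own later load, which is exactly what the full barrier provides.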
e.g. assuming aligned std::atomic<int> a, b, where the default memory_order is seq_cst:

    movl  $1, a          # a = 1;    Atomic for aligned a
    # barrier needed here between seq_cst store and later loads
    movl  b, %eax        # tmp = b;  Atomic for aligned b
Your options are:
Do a sequential-consistency store with xchg, e.g. mov $1, %eax / xchg %eax, a so you don't need a separate barrier; it's part of the store. I think this is the most efficient option on most modern hardware; C++11 compilers other than gcc use xchg for seq_cst stores. (See Why does a std::atomic store with sequential consistency use XCHG? re: performance and correctness.)
Use mfence as a barrier. (gcc used mov + mfence for seq_cst stores, but recently switched to xchg for performance.)
Use lock addl $0, (%esp) as a barrier. Any locked instruction is a full barrier, but this one has no effect on register or memory contents except FLAGS. See Does lock xchg have the same behavior as mfence?
(Or use some other location, but the stack is almost always private and hot in L1d, so it's a good candidate. Later reloads of whatever was using that space can't start until after the atomic RMW anyway, because it's a full barrier.)
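If you want that kind of stand-alone full barrier from C++ instead of hand-written asm, std::atomic_thread_fence(std::memory_order_seq_cst) is the portable way to ask for one; whether the compiler emits mfence or a dummy locked instruction is its choice (and tuning-dependent). A minimal sketch, reusing a and b from the example above:

    #include <atomic>

    std::atomic<int> a{0}, b{0};

    int store_then_load() {
        a.store(1, std::memory_order_relaxed);                // movl $1, a
        std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence or a dummy locked op
        return b.load(std::memory_order_relaxed);             // movl b, %eax
    }

(A seq_cst store to a would of course do the same job without a separate fence, as in option 1.)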
You can only use xchg as a barrier by folding it into a store, because it unconditionally overwrites the memory location with the register value; unlike lock addl $0, it isn't a no-op on whatever location you point it at.
When possible, using xchg for a seq-cst store is probably best, even though it also reads from the shared location. mfence is slower than expected on recent Intel CPUs (Are loads and stores the only instructions that gets reordered?): it also blocks out-of-order execution of independent non-memory instructions, the same way lfence does.
It might even be worth using lock addl $0, (%esp)/(%rsp) instead of mfence even when mfence is available, but I haven't experimented with the downsides. Using -64(%rsp) or something might make it less likely to lengthen a data dependency on something hot (a local or a return address), but that can make tools like valgrind unhappy.
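In GNU C/C++, that trick can be written as inline asm along these lines (just a sketch: it assumes x86-64 and a GCC/Clang-compatible compiler, and the function name is made up):

    // Stand-alone full barrier using a dummy locked RMW on the stack instead of mfence.
    static inline void full_barrier(void) {
        asm volatile("lock addl $0, -64(%%rsp)"   // adds 0: memory unchanged, only FLAGS written
                     ::: "memory", "cc");         // "memory" also makes it a compiler barrier
    }

The "memory" clobber is what stops the compiler itself from reordering memory operations across it, independent of what the CPU does.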
lfence is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads.
lfence serializes out-of-order execution but doesn't drain the store buffer, so it can't stop StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions).
The real-world use-cases for lfence are blocking out-of-order execution of rdtsc when timing very short blocks of code, and Spectre mitigation by blocking speculation through a conditional or indirect branch.
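For the rdtsc case, the usual shape is something like this (a sketch using the GCC/Clang <x86intrin.h> intrinsics; the function name and the measurement methodology are simplified for illustration):

    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc, _mm_lfence

    // Time a short piece of work in TSC ticks.  The lfences keep rdtsc from being
    // reordered (by out-of-order execution) with the code being measured.
    static inline std::uint64_t time_block(void (*work)()) {
        _mm_lfence();
        std::uint64_t start = __rdtsc();
        _mm_lfence();                    // don't let 'work' start before rdtsc has executed

        work();

        _mm_lfence();                    // wait for 'work' to finish executing before reading the TSC again
        std::uint64_t stop = __rdtsc();
        return stop - start;
    }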
See also When should I use _mm_sfence _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer) for more about why lfence isn't useful and when to use each of the barrier instructions (or, in my answer, the corresponding C++ intrinsics when you're programming in C++ instead of asm).