Exercise: Memory Consistency, Cache Coherency and Locks

In the tutorial, all solutions will be presented by students. Please be prepared for all questions as the exercise will focus on discussion, not on understanding the question and gathering the knowledge.

Memory Consistency

Sequential Consistency
In a system with sequential consistency each processor always executes memory operations in the order specified by its program (program order). The order in which the individual memory operations of each processor become visible to the other processors on the shared interconnect (e.g., the bus) is called visibility order.

Three processors (P1, P2 and P3) in a shared-memory system execute the following code (initially A = B = 0).

P1 P2 P3

a1: A := 1 a2: u := A a3: v := B

b2: B := 1 b3: w := A

Here a1 denotes the first operation of processor P1, a2 denotes the first operation of P2 and b2 denotes the second operation of P2, etc.

The outcome of the execution, denoted by the tuple (u,v,w), may vary depending on the order in which the individual operations of each processor become globally visible. Some outcomes may not be possible on a sequentially consistent system. Complete the following table with the possible results for (u,v,w). For each row describe if the result is sequentially consistent and if so, specify a visibility order that produces the result. An example is given in the second row.

u v w sequential consistent visibility order

0 0 0

0 0 1 ja a2,a3,a1,b3,b2

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1
Relaxed Consistency: Peterson-Algorithm
A well-known algorithm for mutual exclusion is Peterson's algorithm (shown in pseudo-code below). Explain why Peterson's algorithm does not break on machines with a store buffer where reads are not permitted to bypass writes to the same memory location and why it does break if reads are permitted to bypass writes to the same memory location on systems with store forwarding (e.g., SPARC TSO).
// global variables and initial values
bool flag0 = false, flag1 = false;
int turn = 0;

// Process on CPU0
flag0 = true;
turn = 1;
while (turn == 1 && flag1);
// critical section
flag0 = false;// Process on CPU1
flag1 = true;
turn = 0;
while (turn == 0 && flag0);
// critical section
flag1 = false;
Fence Instructions
Machines with relaxed memory consistency typically provide programmers with fence instructions to tighten the ordering of memory instructions. Insert MFENCE (memory fence) instructions in Dekker's and Peterson's algorithms to ensure their correct behavior on a multi-processor system that implements a store buffer with store forwarding. Use as few fence instructions as necessary.

P1	P2	P3
a1: A := 1	a2: u := A	a3: v := B
	b2: B := 1	b3: w := A

u	v	w	sequential consistent	visibility order
0	0	0
0	0	1	ja	a2,a3,a1,b3,b2
0	1	0
0	1	1
1	0	0
1	0	1
1	1	0
1	1	1

Cache Coherency

Multiprocessor systems with caches use a coherency protocol, which ensures that writes by one processor eventually become visible to all other processors and that no two processors write to the same memory location simultaneously. One invalidation-based protocol discussed in the lecture is the MESI protocol.

Two processors (P1 and P2) and uniform memory are connected to a shared bus, which implements the MESI cache coherency protocol. In memory there exists a data structure with the following layout: A (8 bytes), B (24 bytes), C (8 bytes). The cacheline size is 32 bytes, so that A and B reside in cacheline X, whereas C resides in cacheline Y.

Processor P1 executes the following code:

// Processor P1
read(A);
read(B);
// do something with A
read(B);
write(A);// Processor P2
read(B);
read(C);
// do something with C
read(B);
write(C);

Cache Coherency in Action

The following table lists the memory operations of the individual processors as they appear on the shared bus. The first column lists the processor and the second column specifies the memory operation being carried out. The next two columns list the MESI state of the cachelines X and Y in each of the processors. The last two columns describe if the operation caused a cacheline transfer to/from memory or to/from another cache. Initially all cachelines are invalid (I).

CPU	Operation	P1	P2
		I I	I I
1	read(A)
1	read(B)
2	read(B)
2	read(C)
1	read(B)
1	write(A)
2	read(B)
2	write(C)

False Sharing
Because B is contained in cacheline X, false sharing occurs with A. Discuss how to remedy this problem and explain the impact on memory and cache transfers.

Cache Coherency and Locks

The cache-coherency protocol is sometimes crucial for the scaling of a particular lock implementation. The reason is, that some lock implementations produce too many bus messages and thereby slow down the execution of the processor.

Test-and-Set Locks

The Test-and-Set locks presented within the lecture do not scale well since they perform exclusive write operations to the lock variable even if the lock is currently taken by a different thread.

A possible implementation of a simple Test-and-Set locks can be done as follows:

class TSLock {
   private:
    unsigned int _lock

   public:
    TSLock() : _lock(0) {}

    void lock() {
        while (test_and_set(_lock) == 1) {}
    }

    void unlock() {
        _lock = 0;
    }
};
Check the scalability problem of the Test-and-Set locks by completing the below-mentioned table.

The first column of the table states which CPU is executing an instruction at the moment. The second column contains the executed operation. The state of the lock variable should be written in column three. The remaining four columns are supposed to contain the state of the cache line containing the lock variable on the corresponding core according to the MESI cache coherency protocol.

CPU	Operation	Lock	P1	P2	P3	P4
		0	I	I	I	I
1	lock (T&S)	1	M	I	I	I
2	wait (T&S)	1	I	M	I	I
4	wait (T&S)	1	I	I	I	M
2	wait (T&S)
3	wait (T&S)
4	wait (T&S)
3	wait (T&S)
1	unlock
3	lock (T&S)
2	wait (T&S)
4	wait (T&S)
3	unlock
2	lock (T&S)
4	wait (T&S)
4	wait (T&S)
2	unlock

Test-Test-and-Set Locks

Within the lecture, an extension of the simple Test-and-Set lock was presented. This new lock implementation contains an additional Test-operation before the actual taking of the lock is performed.

A possible implementation of a Test-Test-and-Set lock could look as follows:

class TTSLock {
   private:
    unsigned int _lock

   public:
    TTSLock() : _lock(0) {}

    void lock() {
        while (true) {
            while (_lock == 1) {}

            if (test_and_set(_lock) == 0)
                break;
         }
    }

    void unlock() {
        _lock = 0;
    }
};
Repeat the previous exercise and use a Test-Test-and-Set lock instead of only a Test-and-Set Lock. Keep in mind that the Test-operation of the lock is only a read access and not a write access.

CPU	Operation	Lock	P1	P2	P3	P4
		0	I	I	I	I
1	test	0	E	I	I	I
2	test	0	S	S	I	I
1	lock (T&S)	1	M	I	I	I
2	lock (T&S)
3	wait
4	wait
2	wait
3	wait
4	wait
1	unlock
3	test
3	lock (T&S)
2	wait
4	wait
3	unlock
2	test
4	test
2	lock (T&S)
4	lock (T&S)
4	wait
4	wait
2	unlock

False Sharing and Locks
Earlier within this exercise session there was one exercise about false sharing. Discuss the consequences if two independent lock variables l1 and l2 share the same cache line.

Extra Tasks

The following tasks will not be discussed during the lecture but are there for more advanced and interested students to broaden their knowledge.

Lock free
Implement double-address compare-and-swap with the help of compare-and-swap.
Scalable Reader-Writer-Lock
Fair reader-writer locks protect a critical section such that multiple readers can enter the critical section concurrently, provided no writer is attempting to enter or holding the lock. Thereby at most one writer may be in the critical section at any point in time, provided no readers are in the critical section.
- Reader-Writer-Lock Implementation
  Implement a reader writer lock using the atomic read-modify-write instructions presented in the lecture. Make sure the lock is fair, that is, both readers and writers get the lock after a bounded time.
- Fair Fast Scalable Reader-Writer Lock
  In their paper "A Fair Fast Scalable Reader-Writer Lock", Krieger et al. present a scalable reader-writer lock implementation based on the MCS lock discussed in the lecture.
  - Search for typos in the readerUnlock function and correct them. Justify why there is a bug. If you do not spot any errors, justify why the function is correct.
  - Describe how the readerUnlock function works.
  - The reader-writer lock implementation is fair, though inherently unfair spin-lock implementations are used in the readerUnlock function. Why?

Materials

Orran Krieger, Michael Stumm, Ron Unrau, Jonathan Hanna: A Fair Fast Scalable Reader-Writer Lock, International Conference on Parallel Processing (ICPP), 1993
Mellor-Crummey Scott: Algorithms for scalable synchronization on shared memory multiprocessors, ACM Transactions on Computer Systems, Volume 9, 1991