Communication
Summary
Barrier Synchronization
All threads wait at a point until all threads reach that point
Centralized Barrier
AKA counting barrier
- Counter is initialized to number of threads
- Atomically decrement counter
- Can’t progress until counter is 0
- If last thread, reset to N
Need to make sure all threads stop spinning before you reset counter
int count = N;
decrement(count); // atomic
if (count == 0) {
count = N;
}
else {
while (count > 0); // Wait for all threads to decrement
while (count != N); // Wait for counter to reset
}
Sense Reversing Barrier
Like counting barrier, but one less spin per critical section
Share count
plus sense
- sense is false for one side of the barrier, true for the other
spin on sense reversal
Tree Barrier
A tree of sense reversal barriers
Once one barrier unlocks, you decrement parent barrier and wait for that to unlock
More complexity, but lowers contention on any one specific lock
MCS Tree Barrier (4-ary arrival)
4-ary tree to mark when all processes have reached a barrier
Two data structures associated with every parent
- HaveChildren (HC)
- Only has meaning when node is a parent
- Array of children nodes
- ChiledNotReady (CN)
- Indicates if each child has arrived at the barrier
Position in the HC vector is statically determined
MCS Tree Barrier (binary wakeup)
binary tree
root node wakes up two children
- Each child wakes up their children
Tournament Barrier
Bracket of 1v1 where winner of each round is fixed
- Always left-most or top-most, depending on orientation
- Winner waits until loser informs them that they lost
- This means winner is spinning on static, local memory
Winner progesses to next round
Everybody waits until tourney is over
Works even if system is NOT a shared memory machines
- Works on distributed system through message passing
Dissemination Barrier
N need not be a power of 2
For each round(k),
- O(N) communication events
Once a processor has sent and received a message, it knows the round is done as fas as its concerned.
rounds
Communication complexity is O(Nlog(N))
For N=5, 3 rounds
Round 0
Round 1
Round 2
After 3 rounds, all are “awake”
No hierarchy
Performance Evaluation
-
Spin
- Spin on T+S
- Spin on Read
- Spin with delay
- Ticket lock
- Array queue lock
- List queue lock
-
Barrier
- Counter
- Tree
- MCS Tree
- Tournament
- Dissemination
-
Parallel Architectures
- Cache Coherent Symmetric Multi Processor (CC SMP)
- Cache Coherent Non-Uniform Memory Access (CC NUMA)
- Non-Cache Coherent Non-Uniform Memory Access (NCC NUMA)
- Message Passing Clusters
Which spin algorithm are fastest on which architectures? What about barrier?