Managing Cache Coherence

in Multiprocessor Computer Systems

 

 

 

 

 

 

 

prepared by:

 

 

 

 

Brian S. Mitchell

School Of Engineering

Department of Electrical Engineering

 

 

Introduction

 

Parallel computer architectures have become very popular in applications requiring cost-effective, high speed computing environments. One such implementation involves using multiple processors that are connected to memory and each other by means of a high-speed interconnection network (Figure 1).

 

 

Figure 1: Parallel Computer Architecture

 

 

 

 

In this multi-processor configuration, each processor must use the interconnection network for all memory accesses, and processor to processor communication messages. Program partitioning methods can be utilized to limit the amount of processor to processor messages, but memory access requirements will almost always bottleneck the interconnection bus. One solution to this problem involves establishing a local cache for each processor (Figure 2). This configuration enables processors to locally cache and work on copies of main memory blocks; thereby, greatly reducing the number of memory accesses that the processor must perform during program execution. When a processor determines that it no longer needs a block in cache (many algorithms exist for making this determination) it will flush it from the local cache and copy it back to the main system memory. One of the key challenges in this architecture involves synchronizing and controlling processors that have the same memory block copied in their local caches. In order to maintain system integrity, if one processor performs an update to a memory block that resides in it’s local cache, that change must either be propagated to all processors that have cached the same main memory block or trigger the invalidation of all remote copies of the block. These additional activities are required in order to maintain a coherent view of the system memory with respect each processor and their local cache.

 

This paper will survey algorithms and protocols that were created to solve the cache coherence problem described above. In addition to the simple bus-based environments which are depicted in figures 1 and 2, approaches that provide cache coherence for more complex parallel architectures will be discussed.

Figure 2: MIMD Architecture with Local Cache Memories

 

 

 

 

Hardware Cache Coherence

 

Many modern day commercial multi-processor computers utilize hardware based consistency algorithms to provide a coherent view of the systems shared memory to software. The following sections discuss bus-based "snoopy" protocols and directory management schemes which are popular approaches for implementing high performance, hardware based cache coherence protocols.

 

Bus Based "Snoopy" Coherence Policies

Bus based cache coherence solutions are the most widely implemented protocols in modern day commercial multi-processor computers. These algorithms work very well for computers where all processors and memory are connected to a single bus. Switched bus parallel architectures that have complex topologies and multi-cast communication protocols can not effectively implement bus-based coherence algorithms because of the complexity and overhead of performing system-wide broadcasts. As we will see in subsequent sections, the ability to efficiently broadcast and respond to messages, or consistency commands, is a fundamental requirement of all bus based coherence policies. Because of this requirement, bus based coherence protocols are only implemented in computers with a relatively small number of processors. Massively parallel machines can not effectively be built with a single bus, so other mechanisms that employ directory or network based protocols should be used.

With bus based cache coherence, individual processors eavesdrop, or "snoop" on all of the other processors memory activities over a shared bus and respond according to a predetermined coherence policy. These policies, which can be implemented as finite state machines, detail a sequence of events that occur when a local processor interacts with shared memory blocks via read and write instructions. These interactions will always result in either a read hit, read miss, write hit, or write miss occurring with respect to a local cache. In addition to handling cache hit and miss events, local processors must be able to generate, broadcast and respond to special signals, or consistency commands, that are used to send special instructions to remote caches for the purposes of implementing the coherence protocol.

Most bus based cache coherence protocols utilize either write-invalidate, or write-update techniques to maintain consistency [GUPT90, LI89]. The write-invalidate technique works by invalidating all remote copies of a block prior to performing a local update. A significant feature of this approach is that once all remote copies of a block are invalidated, subsequent updates can be immediately performed (without any delay penalty) on the local block since valid remote copies of the block no longer exist. Write-update techniques keep all copies of a block consistent by broadcasting local block updates to all remote caches. Because copies of the cached blocks are maintained across the entire system (that is, not invalidated), read hit percentages are greatly improved. The shortcoming of write-update protocols is that for some applications the bus will bottleneck due to the plethora of update messages that must be broadcast on every write instruction.

From an architectural standpoint, memory design of a multiprocessor is very similar to that of a uniprocessor. Shared memory is divided into a number of equally sized blocks. All interactions between cache and shared memory are in units of these blocks. Similarities also exist in replacement algorithms that determine what blocks to flush and/or to move when available cache space becomes constrained. Although, unlike uniprocessor architectures, multiprocessors additionally tag blocks with state information that is required for the administration of the cache coherence protocol. As this paper will show, management of the state information governs the correct operation of hardware based coherence protocols.

The four bus-based cache coherence algorithms discussed in the remainder of this section will be presented with respect to how they handle cache hit and miss situations along with, when they generate, and how they respond to consistency signals. In addition to the textual descriptions, the figures in this section will use the following terminology and conventions:

The "Basic" Write-Update Cache Coherence Protocol. This protocol described by [BRIG88] is the simplest cache coherence protocol discussed in this section and is typically only implemented in hardware utilizing write-through cache memory systems. This algorithm requires two states - valid and invalid. The invalid state signifies that the desired shared memory block is either not in the local cache memory, or that the copy residing in the local cache is old and might be inconsistent with the current copy. The valid state indicates that the block in the local cache is consistent with all other cached copies in the system which might possibly include shared memory.

The behavior of the basic write-update protocol is shown below with respect to the events that occur when a processor accesses a block of data that is expected to be in the local cache.

Figure 3: Basic Write-Update Protocol

Read Hit. The desired cache block is already in the local cache and can be accessed without any delay for reading purposes. The state of the local cache block remains valid.

Read Miss. If a remote cache has a copy of the desired block, it will intercept the read request (from the bus) and provide a copy of the block to the requesting cache. If no remote caches intercepts the request message, then there are no duplicate copies of the block in the system so a copy provided to the requester from shared memory. The state of the local copy is set to valid.

Write Hit. The write to the local cache block can be carried out without any delay and the state of the local block remains valid. The write is also broadcast over the bus enabling all remote copies of the block to update themselves in order to maintain memory coherence. This update might also be propagated to shared memory if write-through cache architectures are used. All copies of the block remain consistent and maintain state valid.

Write Miss. If a write miss occurs, the desired block is first loaded into the local cache by following the read miss policy. Once the block resides in the local cache, it is updated in adherence to the write hit policy. The end result will be that the local cache copy is in state valid and is consistent with all other copies of the block. All copies maintain the valid state.

The Firefly protocol is an alternative write-update protocol and is discussed in [STEN90]. Firefly improves the basic write-update protocol by adding an additional state. The new state, valid-exclusive, is used to indicate that the block residing in the local cache is the only cached copy in the system. This additional information enables the processor to write to the local block without having to broadcast the update to any other remote caches. Other then support for the new state, Firefly is identical to the basic write-update protocol which was discussed above.

The "Basic" Write-Invalidate Cache Coherence Protocol. This strategy [WANG90, BRIG88] implements a basic block invalidation cache coherence protocol. Every cached block main memory block is tagged by one of three possible states - invalid, read-only, and read-write. The invalid state is assigned when the desired memory block is not in the local cache, or the copy in the local cache is inconsistent with the physical memory block. A cache block is the read-only state when it has been read from the main memory, or has provided by another cache, and has not been modified. A block in the read-only state may be distributed among many caches so long as it is not updated and remains consistent with the shared memory copy. A block in the read/write state has been modified by the local processor and is the only valid copy in the system. In addition to the three states described above, this algorithm requires an invalidate cache consistency signal which is sent by a local cache to invalidate all remote copies of a block.

The operation of the basic write-invalidate protocol can be shown by analyzing how the system responds to the following memory reference scenarios:

Read Hit. This condition occurs when a processor attempts to read from a memory block that already resides in the local cache. This operation can be performed without any delay or overhead because the locally cached copy is consistent with the other memory copies.

Read Miss. This condition occurs when a processor attempts to read from a memory block that does not currently reside in the local cache. When this situation arises one of two possible events occurs. The first possibility occurs if a remote cache has the desired memory block in the read/write state. Under these circumstances, the remote cache will first write the block back to shared memory, and then shared memory will supply the block to the requesting cache. Otherwise, the block comes directly from shared memory. Each cache with a copy of the block will set the block state to read-only.

Write Hit. This condition occurs when a processor attempts to update a block that already resides in the local cache. If the local cache state is read-only then an invalidation signal broadcast over the bus. Any remote cache with a copy of the block must change the state of their copy invalid. After broadcast of the invalidation signal, the local state is changed to read/write. Once in state read/write (or if the block was already in read/write state), the update can be performed without delay.

Write Miss. This condition occurs when a processor attempts to write to a shared memory block that is not currently located in the local cache. Under this condition the processor must first load the block into it’s local cache. This operation is performed just like the read-miss condition described above. At this point the local cache has a copy of the memory block in the read-only state (other caches might also have a local read-only copy). The next step involves invalidating all of the other cache copies by broadcasting an invalidate signal over the bus. This results in remote caches setting their local copy block state to invalid. It is now safe for the local cache to change the block state to read/write and perform the write operation.

Figure 4: Basic Write-Invalidate Protocol

The basic write-invalidate cache coherence protocol is among one of the simplest hardware based snoopy algorithms to implement. Cache blocks can be in only one of three states and the invalidate signal is the only consistency command that needs to be implemented.

The "Write Once" (Goodman) Cache Coherence Protocol. This protocol was originally proposed in 1983 by James Goodman and described by [WANG90, HWAN93, STEN90]. The algorithm, which is very similar to the "Basic" protocols described above, incorporates the advantageous features of the write-update and write-invalidate protocols. This is achieved by using a write-update policy on the first write, and a write-invalidate policy on all subsequent writes. For most algorithms, bus traffic will be greatly reduced due to the initial write-update transaction.

The "Write Once" protocol is slightly more complicated then the "Basic" protocols and therefore requires some additional states and consistency commands. The four cache states are: valid, invalid, reserved and dirty. The valid state indicates that the cache block has been read from shared memory and has not been modified. The invalid state is used to express that the block is either not present in the local cache or the copy in the cache is stale and inconsistent with shared memory. The reserved state is assigned to a local cache block that has been updated only once since it has been read from shared memory. Furthermore, the cache copy is consistent with the shared memory copy which is the only other copy in the system (it has been written-through). The dirty state indicates that the local cache has been written to more than once and is the only consistent copy in the system. In addition to the cache block states, the write once protocol requires two consistency commands - read-invalidate and write-invalidate. The read-invalidate cache consistency signal is used to read a consistent copy of a block and invalidate all other copies. The write-invalidate cache consistency signal is used to invalidate all non-local copies of a block.

The state transitions for the "Write Once" protocol are shown below:

Read Hit. The desired cache block is already in the local cache and can be accessed without any delay for reading purposes.

Read Miss. Under this condition if no dirty copies exist in any of the other remote caches, the shared memory copy is consistent and provides a copy of the block to the local cache. If a dirty copy exists, then the remote cache with the dirty block will provide a copy directly to the requesting cache, by intercepting the read request off of the bus (and bypassing shared memory). Both the requesting cache and the remote cache change the state of the their copies to valid, and the shared memory copy is updated. All copies of the block are now consistent.

Write Hit. If the current cache block state is reserved or dirty the write can be carried out without any delay. The new state for the block is dirty. If the current state is valid, a write-invalidate signal is broadcast to the other caches. Any cache with a valid copy will set the block state to invalid. Because this is the first write to the cache block, a copy is written-through to shared memory and the local copy state is set to reserved.

Write Miss. If a write miss occurs the requested block must either be provided by shared memory or by another cache. The activity starts by the requester broadcasting a read-invalidate consistency command to all caches. If another cache has the block in state dirty, the remote cache will update memory, forward a copy of the block to the requesting cache and invalidate itself. If there was no remote cache with a dirty block, then the shared memory is consistent and will provide the requester with the desired block. Because this process started by the requesting cache broadcasting a read-invalidate signal, all remote copies are now invalid. The local cache is the exclusive owner of the block, the write can be safely performed, and the new state of the block is dirty.

Figure 5: Write-Once Protocol

Figure 5 was adapted from Advanced Computer Architecture, Kai Hwang, 1993.

The Berkeley Cache Coherence Protocol. The Berkeley protocol is referenced in much of the cache coherence research because of its high performance characteristics. This protocol, like the write-once, requires four cache states: invalid, unowned, exclusive, and non-exclusively owned. The invalid state is used to express that the block is either not present in the local cache or the copy in the cache is stale and inconsistent with shared memory. The unowned state indicates that a block residing in the local cache is consistent with all other locally cached copies and shared memory. The exclusive state indicates that the local cache copy is unique, and can be updated without delay, making it the only valid copy in the system. A local cache with a block in the exclusive state is additionally responsible for intercepting and responding to requests sent over the bus for a copy of the block. In the non-exclusively owned state, a local cache block is owned and may be updated after informing other the other caches. This protocol also requires that a cache have the ability to broadcast a invalidate consistency signal which invalidates all non-local copies of the block.

The Berkeley protocol is based on managing ownership of memory blocks. Each memory block will always have a unique owner. The owner is either the cache which has the memory block in state exclusive or non-exclusively owned, or if the block is not owned by any cache, then shared memory is the owner. This is important because in the case of a read or a write miss, the requesting cache receives a copy from the cache owner (which is always changing). As mentioned before, the block owner performs this task by seemlessly intercepting and responding to block requests that are sent over the system bus.

Figure 6: Berkeley Protocol

The state transitions for the Berkeley coherence protocol are shown below:

Read Hit. The desired cache block is already in the local cache and can be accessed without any delay for reading purposes.

Read Miss. If a read miss occurs, the desired block is always provided by the blocks’ owner. The local block state is set to unowned. No other state changes are performed on other remote copies of the block.

Write Hit. If the current cache block state is exclusive the write can be carried out without any delay. The local block state will remain exclusive. Otherwise, a invalidate consistency command will be broadcast to all of the other remote block copies which will result in a state change to invalid. The local block sets it’s state to non-exclusively owned and performs the update.

Write Miss. If a write miss occurs the requested cache block is provided by the block owner as in the read miss situation. Each cache with a copy of the block invalidates its copy by setting the state to invalid. The requesting cache sets it’s copy to state exclusive and performs the update.

The following table compares the protocols analyzed in the above section.

Table 1: Comparison of Bus-Based Cache Coherence Protocols

Directory Based Coherence

The bus based cache coherence protocols discussed in the previous section require the ability to monitor other caches activities and broadcast consistency commands. This dependance on brodcasting will effectively restrict the number of processors that can be incorporated in to a multiprocessor due to the limited bandwidth of a single bus. In addition, many modern day multi-processors are being built with complex switched bus topologies, which as mentioned above, can not implement bus based coherence strategies. This remainder of this section will discuss an alternative approach to implementing hardware based cache coherence solutions.

Directory based coherence protocols work by tracking the location of all memory blocks that have copies residing in remote caches. This information is used to send consistency commands to only those caches that have a copy of the block; thereby, eliminating the need for global broadcasts.

The first cache coherence directory protocol was introduced by Tang (1976). This approach used a directory that centrally stored information on the cached locations of all main memory blocks (Figure 7). The directory information was then used to coordinate and control the read and write activities of remote caches. One major shortcoming of this approach is it’s lack of scalability. As the number of processors increases, the central directory size grows to the point of where it becomes a bottleneck.

Figure 7: Central Directory Cache Coherence Architecture

Recognizing that a central directory solution to the cache coherence problem would not be adequate for supporting future architectures, Censier and Feautrier (1978) proposed an alternative directory organization (Figure 8). Their approach split the central directory into a number of smaller, or distributed, directories. Individual directories would only be responsible for tracking the status of individual memory modules (a subset of main memory). The resulting architecture dramatically improves system salability by load balancing the management of directory entries across the entire main memory address space.

Distributed directory mechanisms fall into three primary categories - full map directories, limited map directories and chained directories. The following sections will discuss each of these variations in detail and present example algorithms that illustrate features of each classification.

Full Map Directories

Full map directories store enough information associated with each global memory block so that every processor in the system can simultaneously contain any block of data. Figure 9 illustrates a data structure capable of supporting a full map directory cache coherence protocol. The clean_dirty_flag is used to determine if a remote cache containing a copy of the data block has modify (write)permission. The presence_indicator_array contains 1 bit for each cache in the system. This is a convenient representation for tracking the caches that contain copies of the global data block.

Figure 8: Distributed Directory Cache Coherence Architecture

In addition to the directory, a cache must maintain state information associated with local blocks. A simple cache block data structure is also shown in figure 9. The write_enable_flag is used to lock and unlock a local block for write operations. The valid_invalid_indicator tracks the validity of the local block. It is the responsibility of the cache coherence protocol to keep both the data and state information consistent.

Figure 9: Full Map Data Structure

structure directory_block
{
bit clean_dirty_flag;
bit presence_indicator[NUMBER_OF_PROCESSORS];
byte data[CACHE_BLOCK_SIZE];
}

structure cache_block
{
bit valid_invalid_indicator;
bit write_enable_flag;
byte data[CACHE_BLOCK_SIZE];
}

Figure 10 shows multiprocessor computer (4 processor) which implements a full map directory cache coherence protocol. For simplicity, only one directory entry and one cache block are shown (a real system might have millions of directory blocks, and thousands of cache blocks). After running some programs the state of the machine is as follows. Cache 1 and 3 have read only copies of the data block. This is known because the clean/dirty indicator in the directory is not set and valid pointers exist in the presence_indicator_array. This state enables processors 1 and 3 to continue to reference the local data block, for read only access, without incurring any delay penalties. Cache 2 does not have a copy of the block in its local cache. Cache 4 has a copy present, but it is invalid, indicating that it is stale and might be inconsistent with the current copy. This condition is illustrated by the valid/invalid in the local block data structure for cache 4. Thus, from the main directory blocks perspective, only cache 1 and cache 3 contain valid local copies of the data block. .

Figure 10: Example of a Full Map Directory Architecture (4 Processor)

The remainder of this section will focus on analyzing two full-map cache coherence protocols. These protocols will be presented from the perspective of how they handles read miss and write hit events. Read hit events do not cause any cache consistency problems because local processors can reference locally cached data without any delay. A write miss event is handled by executing the read miss and write hit policies in sequence.

The Censier and Feautier Protocol. This protocol[STEN90], which was first introduced in 1988, was the first published distributed directory algorithm for multiprocessor cache coherence. In fact, many of the other published distributed directory solutions only add performance and/or resource enhancements to the basic Censier and Feautier protocol. A pictorial view of the read miss and write hit operations of this protocol are shown in figure 11.

Figure 11: The Censier and Feautier Protocol

Figure 11 depicts the cache coherence actions taken by the Censier and Feautier protocol. The lighter line shows a read miss from cache 2, given cache 1 has a dirty copy of the block. The darker line shows a write hit from cache 1, given that cache 2 has a copy of the block. This figure was adopted from Per Stenström, IEEE Computer, June 1990.

The Censier and Feautier protocol operates as follows:

Read Miss. If a read miss occurs, the cache references the directory entry for the desired block. If the clean_dirty_flag is set (Figure 9), a remote cache is the block owner, and the data block might have been updated by the owning cache. Under this circumstance, the directory redirects the original read request to the cache that is the block owner by using the presence_indicator_array. Once the owning cache that receives the redirected read request, it first clears the write_enable_flag bit, and sends the updated data block back to the directory. The directory updates main memory, updates the presence_indicator_array and returns a copy to the original requesting cache. At this point all copies throughout the system are consistent and may be accessed for read only operations. This situation is shown in figure 11 by the lighter line.

If the clean_dirty_flag is not set, no remote cache has update permission on the data block. Therefore, shared memory has a consistent copy of the data block and it can safely return a copy to the requester after updating the presence_indicator_array.. The requesting cache ensures that the write_enable_flag bit is cleared upon reception of the data block.

Write Hit. If a write hit occurs, the local cache checks the write_enable_flag. If the bit is set, indicating write permission, then the cache owns the block and can conduct the write without delay. If the write_enable_flag is not set, then the local cache notifies the memory directory of it’s intent to update the cache block. Upon reception of the write request signal, the directory traverses the presence_indicator_array and invalidates all remote copies of the block. The directory then sets the clean_dirty_flag, and confirms the write request. When the local cache receives confirmation of the write request signal, it sets the write_enable_flag, and performs the update. This sequence is shown by the dark line in figure 11.

Table 2: Bit Overhead for Directory Storage

Scheme

Overhead (No of Bits)

Tang

CB

Censier

M(B+N)

Stenström

C(B+N) + M*log2N

 

Table 2: A comparison of the overhead of storage of 3 directory based cache coherence protocols. In the above table M is the number of memory blocks, C is the number of cache lines, N is the number of caches, and B is the number of bits needed to hold state information for each block. Adopted from Stenström, IEEE Computer, June 1990.

The Stenström Protocol. Per Stenström recognized that a significant shortcoming in the Censier approach was that the number of bits necessary to store directory information was proportional to the size of shared memory [STEN90]. This is problematic because most systems can scale their computing capacity by increasing memory size. Stenström’s answer to this problem was to relocate the presense_indicator_array from the directory and move it to the cache. By doing this, the overhead associated with the directory protocol is now proportional to the size of the cache instead of the size of shared memory. This is desirable in most systems where cache size is much smaller then shared memory size. Table 2 shows the overhead associated with directory storage for the Tang (central directory), Censier, and Stenström Schemes.

Like many other cache coherence algorithms the Stenström policy must maintain block ownership. In the Censier approach, the block owner was either shared memory, or the cache that has write permission. Conversely, ownership is very dynamic (and inherently more complex) in the Stenström approach because local caches are dynamically changing ownership and managing the distribution of the presence_indicator_array. In fact, the addition of the M*log2N term in table 2, is directly associated with the directory knowing the current block owner.

Figure 12: Stenström Data Structure

structure stenstrom_directory_block
{
bit cache_owner[Log2(NUM_CACHES)];
byte data[CACHE_BLOCK_SIZE];
}

structure stenstrom_cache_block
{
bit valid_invalid_indicator;
bit write_enable_flag;
bit block_owner;
bit presence_indicator_array[NUMBER_OF_PROCESSORS];
byte data[CACHE_BLOCK_SIZE];
}

Figure 12 shows the basic data structures designed to handle the Stenström protocol. The cache_owner field is used to point to the cache that is the current data block owner. The data field is the physical data block. The block_owner flag indicates if the local cache is the owner of the data block. The valid_invalid_flag, data, write_enable_flag and the presence_indicator_array fields perform the same function as in the Censier protocol.

Figure 13: Stenström Protocol

Figure 13 depicts the cache coherence actions taken by the Stenström protocol. The lighter line shows a read miss from cache 2, given cache 1 has a dirty copy of the block. The darker line shows a write hit from cache 1, given that cache 2 has a copy of the block. Adopted from Stenström, IEEE Computer, June 1990.

The Stenström protocol, depicted in Figure 13 operates as follows:

Read Miss. If a read miss occurs, the cache references the directory entry for the desired block. If the cache_owner field indicates that the data block is unowned, the memory copy is the only valid copy in the system. In this situation the block will be provided to the requesting cache by shared memory. The requesting cache now becomes the block owner so the directory updates the cache_owner field, and the cache initializes the presense_indicator_array and sets the block_owner flag (figure 12).

If the cache_owner field is set, the block is owned by a remote cache and the data in the directory block might be inconsistent with the valid copy. Therefore the directory forwards the read request to the cache that is registered as the block owner. Once the read request is received by the block owner, the owning cache checks the write_enable_flag field. If it is set, the owning cache will first clear the field which prohibits further local write operations. At this time the owning cache will update the presence_indicator_array and send a copy of the block directly to the requesting cache. The requesting cache ensures that the write_enable_flag bit is cleared upon reception of the data block. At the end of processing the read miss condition, all cached data copies are consistent and read only. This operation is illustrated in figure 13 by the thin line.

Write Hit. If a write hit occurs, the local cache checks the write_enable_flag. If the bit is set, indicating write permission, then the cache owns the block and can conduct the write without delay. If the write_enable_flag is not set, then the cache references block_owner flag. If this field is set, then cache is the block owner, so it may traverse the presence_indicator_array and send invalidation signals to all other caches that have copies of the block. Once all remote copies are invalidated, the local cache sets the write_enable_flag, and performs the write operation.

If the block_owner flag is not set, the local cache notifies the memory directory of it’s intent to update a local block. Upon reception of the write request signal, the directory redirects the signal to the cache owner by using the cache_owner field in the directory data structure. Upon reception of the write request signal, the owning cache clears the block_owner flag and transmits the presence_indicator_array directly to the requesting cache. The requesting cache is now the block owner so it sets the block_owner flag. Now that the requesting cache owns the data block, it can follow the policy previously described for performing a write operation by the block owning cache.

Even thought the Stenström protocol optimized the amount of overhead associated with a full directory cache coherence protocol, it still would not be adequate for handling architectures with a large number of processors. The overhead associated with any full map protocol is directly proportional to the number of processors in the system. This is because the presence_indicator_array must be large enough to point to all processors in the system. With massively parallel multiprocessors on the horizon, other alternatives need to be explored. The next two sections discuss limited-map directories and chained directories, which are two alternative directory organizations which exhibit good scalability characteristics.

Limited Map Directories

Limited map directory protocols are extensions of full map protocols and exhibit improved scalability characteristics by solving the directory size problem. Limited directory organizations limit the number of simultaneously cached copies of a particular block to a constant factor [HWAN93]. Therefore the presence_indicator_array will not have an entry for each cache in the system, but will instead be restricted to a predefined size.

One downfall of these algorithms is that they must handle the added complexity of providing a copy of a block to a requesting cache even if the directory presence indicator array is full. This condition is handled by first selecting and then prematurely invalidating a remote cache that has a copy of the block. The result will be a freed up a slot in the presence_indicator_array which then in turn can be reused for the requesting cache. The act of invalidating a remote cache under these circumstances is called eviction, and should be done according to a predefined replacement policy such as LRU or FIFO. Figure 14 shows an example of cache 2 getting evicted for cache 3, after cache 3 made a read request for data block X.

Figure 14: Limited Directory Protocol - Cache 2 block getting evicted

Chained Directories

Every class of cache coherence protocol discussed so far had inherent limitations that prohibited them from being used in massively parallel computer environments. Bus based "snoopy" protocols achieved coherence by eavesdropping on the bus which is inefficient (or impossible) in a large multiprocessor computer. Full map directory protocols work well in complex environments, but they incur overhead proportional to the memory size, cache size and the number of processors. Limited directory algorithms augment the scalability of full map directory approaches, but it is at the expense of fixing the number of concurrently cached blocks to a constant factor. Another approach is needed. One that is both architecture independent and dynamically scalable. This is the goal of chained directory coherence algorithms.

The distributed directory protocols discussed in previous sections used a static presence_indicator_array to point to remote caches that contained copies of the data block. Chained directory algorithms do not use a static array, but instead implement a linked list of directory pointers that start at the shared memory block and link to successive caches that contain cached copies of the block. These sharing lists are effectively unbounded in length and are dynamically created, pruned and destroyed in accordance to the coherence algorithm.

Figure 15: Chained Directory Sample Environment

Figure 15 shows two sharing lists originating from main memory. In list the gray list, cache 1 is the head and cache 5 is the tail. In the black list, cache 3 is the head and cache 4 is the tail.

Figure 15 shows a typical chained directory organization. In the illustration two shared memory blocks have formed sharing lists. The first block, denoted by the gray boxes, links cache 1, cache 2, and cache 5. The second shared memory block, denoted by the black box, links cache 3 and cache 4. As mentioned before, each member is the sharing list is responsible for passing consistency information down the chain and for invalidating itself. Invalidation might either be self imposed (voluntary) or involuntary mandated by the head of the sharing list. Because it is difficult to remove (e.g. invalidate) an element from a linked list unless a back pointer is maintained, most chained protocols implement doubly linked lists. Having access to both the next and previous elements in the sharing list, greatly reduces the complexity of pruning the list.

Unlike the other protocols discussed in this paper, chained directory algorithms are best explained by introducing the sharing list structures, defining the various tag and state values, and describing the operations that happen to the sharing lists during program execution. The remainder of this section will describe the Scalable Coherent Interface (SCI) protocol that is being developed by the IEEE (standard P1596). This chained directory interface is designed to handle advanced computer architectures with up to 65535 processors[JAME90].

SCI (Scalable Coherent Interface), IEEE Standard - P1596

The data structures utilized by the SCI protocol are shown in figure 16. Each memory block contains a 2 bit memory state, mstate, a 16 bit forw_id field, and a 64 byte data block. A cache block entry contains a cache state, a forward pointer to the next block in the list, a reverse pointer to the previous element in the list, a memory id field, and the 64 byte data block.

Figure 16: SCI Data Structures

Sharing List States. The SCI sharing list states are encoded into the mstate field on the memory block and the cstate field on the cache block. In normal operation, the mstate can either be home or cached. The home value indicates that no sharing list exists and that the forw_id pointer is NULL. The cached state indicates that a sharing list exists and the forw_id field points to the head of the list.

The cache state field, cstate, combines information about the location of the entry in the sharing list, and the entry’s caching properties. The allowable location values are head, mid, or tail for multiple entry sharing lists and only for a single entry list. Caching property values can be clean (same as memory), dirty (possibly different from memory), valid (same as head copy), exclusive (only copy in system) or stale (possibly different from head copy). The above cstate combinations indicate stable states, other temporary states might be assigned to cstate while the protocol is executing state transitions.

The exclusive and stale states are not necessary and are only included in SCI for performance enhancement. Sharing list entries with exclusive state can update the data freely because it is the only valid copy in the system. The stale state is included for producer/consumer data sharing and for barrier synchronization operations.

Because not all combinations of states are valid. Table 3 documents the allowable stable SCI states. The values shown in the cache block status field is a concatenation of a location value and a caching property value. For example, "mid|valid" means the cache block is in the middle of the list and the data value is the same as the data contained in the head of the list.

Table 3: Stable SCI States

Sharing List Management. The SCI protocol enforces coherence by managing sharing lists and performing state transitions. This section will discuss creating a sharing list, adding elements to the list, deleting elements from the list, and purging the list.

In order to simplify sharing list updates, and distribute the administrative workload from the memory controller to the caches, the head element will take responsibility for managing the sharing lists. Additionally, new nodes are always inserted at the head position of the list which further distributes the overhead load among all other caches in the system. Besides the added responsibility for maintaining the sharing list, the head node can perform a set of operations that are restricted from the mid and tail nodes. These special capabilities will be discussed in subsequent sections.

Sharing List Creation. Initially memory is at the home state which indicates that no sharing list exists for the data block. The first step in creating the a sharing list is for the cache block to change its cstate from the invalid to the pending state. Next, the cache generates a read_cached transaction in order to obtain a copy of the data block. This transaction updates the memory directory state from home to cached, the forw_id field is set, and the local cache block’s cstate is changed from pending to only|clean.

The only|clean state can then (if necessary) be immediately changed to only|dirty without invocation of any additional transactions. The only|dirty state allows the local cache to write to the block.

Figure 16: SCI Sharing List Creation

Sharing List Additions. If a requester cache attempts to read a block from main memory by issuing a read_cached transaction (as described above) and the directory mstate field indicates that the block is cached, then a sharing list already exists, and the head of the list has the most current copy of the data. Under this condition the requester does not receive the data, but is instead returned an indirect pointer to the head of the sharing list. The requesting cache then uses this pointer to insert itself into the sharing list and becomes the new head of the list. Figure 17 illustrates this procedure.

Figure 17: SCI Sharing List Addition

Figure 17 shows a new element being added to the sharing list. The new element is denoted by the double line box. The indicators "Block A", "Block B" and "Block C" are used in table 4 to identify the nodes in the list.

In addition to the physical act of insertion into the list, the new element must be assigned an initial state, and the old head must have it’s state updated. The determination of these state values is a function of the head element state prior to insertion of the new element. Table 4 shows how to determine the state of the new head (Block B in figure 17), and the old head (which is now either a mid or tail element. - Block C in figure 17) given the state of the head element prior to performing the insertion (Block A in figure 17).

Table 4: SCI State Transitions After Node Insert

Sharing List Deletions. Any entry in the sharing list may delete itself from the list. This operation is typically done when the processor determines that it no longer needs the block in the cache. There are three cases that need to be examined in order to delete an element from a sharing list.

Case one involves blocks that are in a head or mid list location. These blocks have both valid forw_id, and back_id pointers. In order to perform a delete, the previous elements forw_id field must assume the pointer of the forw_id field of the element being deleted, and the next elements back_id pointer must assume the back_id pointer of the element being deleted. These two pointer adjustments link around the element being deleted which results in a logical removal from the sharing list. Figure 18 illustrates the procedure of removing the middle element from the sharing list - FP indicates the forw_id pointer and BP indicates the back_id pointer in the cache blocks.

Figure 18: Deletion of an Element from a Sharing List

Case two involves removing an element that is in the tale state. This element is at the end of the list so it can be pruned by setting the previous elements forw_id pointer to NULL.

Case three involves the removal of an element in the only state. Because this state indicates that it is the only element in the list, it is simply removed from the list by changing the directory mstate field to home. Additionally, if the cstate is only|dirty then the cached data may be different from the directory data. Under these circumstances the data from the cache must be written back to the directory data block.

All of the delete operations described above lock the sharing list prior to making pointer adjustments by switching the cstate fields in the effected elements to from a stable state to a temporary lock state. These locks will serialize operations through the sharing while the locked state is enabled. If locking was not done, then concurrent delete operations could corrupt the sharing list pointers. After the delete procedure has executed, the cstate values in the effected caches are recalculated and set to stable state values.

Sharing List Purges. The head of the sharing list has the authority to purge other caches from the list. This operation is necessary to get a cstate value of only which enables data to be locally modified. Other members in the list having mid or tail positions have no such privilege, they must first delete themselves from the list and then reenter the list as the new sharing list head. This is possible because, as stated before, all elements inserted in to the list are added in the head position.

The procedure for purging is very simple. First the head element changes it’s state to head|busy which prevents other caches from adding to the list during the purge operation. The head element then uses the forw_id pointer to delete the second element in the list by linking around it. The former third list member now becomes the new second list member. This operation is continued until all list nodes are removed from the list. The head element now changes it’s state to only|clean. This state, as mentioned in the sharing list creation discussion, can freely transition to only|dirty which enables the local cache to update the data block.

SCI Protocol Operation. Now that all of the details of list management have been defined, the operation of the SCI protocol with respect to how a local cache handles read hit, read miss, write hit and write miss operations can be presented.

Read Hit. Like all of the coherence protocols discussed in this paper, a read hit operation allows the local cache to read the data in the cache block without any delay or overhead.

Read Miss. A read miss operation utilizing SCI invokes a read_cached transaction which is described in the previous section. If the directory block indicates a home state, then a sharing list is crated otherwise the local cache is added to the head of the list. Once the cache has been added to the sharing list, the data value can be immediately read.

Write Hit. If a write hit occurs, the local cache checks the cstate. If the cstate indicates only|dirty, then the write occurs without any delay. If the cstate is only|clean then the state is changed to only|dirty and the write to the local block is performed. Otherwise a sharing list exists and it must be purged prior to the write being performed. This is handled as follows. If the cstate indicates a mid or tail list position, then the local cache deletes itself from the sharing list and re-adds in the home position. Once in the home position the cache performs a purge_list operation (described above) which results in a cstate of only|clean. The local cache then is allowed to change the state to only|dirty and perform the write.

Write Miss. This operation is performed by executing the read miss policy which results in the local cache getting a copy in the home or only state. The write hit policy is then applied to the local block in order to complete the write operation.

The above section on SCI is no way intended to completely describe all of the functionality of the protocol. SCI has many other interesting capabilities and optimizations which makes it a very powerful chained directory cache coherence protocol. For additional information on SCI performance and fault-tolerance enhancements, see [JAME90].

Other Cache Coherence Approaches

While the hardware cache coherence methods discussed in the previous section have been successfully applied in current day multiprocessor computer systems, they suffer from inherent limitations which might impact their ability to effectively support future generation hardware and software. Other approaches which show promise include cache coherent network architectures, and software directed cache coherence.

Networks have become very popular over the past 10 years because connecting redundant computers over a high speed network provides a lot of processing power, and is much less expensive then a custom multiprocessor. In fact, the IBM now sells machines with supercomputer specifications which are nothing more then a collection of stock RS/6000 processors connected by a dedicated ultra-high speed fiber optic network. In this configuration each processor is given local memory which acts like a cache and has access to massive amounts of shared memory by means of the fiber optic network. Memory coherence services are provided by the network communication protocol.

Cheong and Veldenbaum [CHEO90] make a good case for using software to direct the cache management of multiprocessors. They contend that many hardware coherence approaches perform well for some software algorithms, but poorly for others. Their approach utilizes compiler technology to partition a programs data access into cacheable and non-cacheable sections. This information is then used to independently control processor caching behavior. By pre-establishing the caching activities of a parallel program, the need for interprocessor communication for the purposes of coherence administration is eliminated.

 

Conclusion

As shown in this paper, managing memory coherence, which is only one of the challenges in designing a multiprocessor, can be very complex. This coupled with growth rates that double computing capacity every four years ensures that new memory coherence techniques will have to be pioneered in order to keep pace with the technology.

 

References

[BRIGG88] F. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", IEEE Computer, February 1988, pp. 9-21.

[CHEO90] H. Cheong and A. Veldenbaum, "Software-directed Cache Management in Multiprocessors", Cache and Interconnect Architectures in Multiprocessors, 1990, pp. 259-276.

[GOOS90] H. Goosen and D. Cheriton, "Predicting the Performance of Shared Memory Caches", Cache and Interconnect Architectures in Multiprocessors, 1990, pp. 153-164.

[GUPT90] A. Gupta and W. Weber, "Analysis of Cache Invalidation Patterns in Shared-Memory Multiprocessors", Cache and Interconnect Architectures in Multiprocessors, 1990, pp. 83-107.

[HWAN93] K. Hwang, Advanced Computer Architecture, McGraw Hill, New York 1993, pp. 348-368.

[JAME90] David V. James, "SCI (Scalable Coherent Interface) Cache Coherence", Cache and Interconnect Architectures in Multiprocessors, 1990, pp. 189-208.

[LI89] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems", ACM Transactions on Computer Systems, November 1989, pp. 321-357.

[STEN90] P. Stenström, "A Survey of Cache Coherence Schemes for Multiprocessors", IEEE Computer, June 1990, pp. 12-24.

[WANG90] J. Wang and M. Dubois, "Memory-Access Penalties in Write-Invalidate Cache Coherence Protocols", Cache and Interconnect Architectures in Multiprocessors, 1990, pp. 109-130.


Return to my home page