Understanding Memory Hierarchy Simulation in SST

Jan 16, 2022

Notes: In this article, we make a strict distinction between the terms “initialization” and “construction”. “Initialization” refers to the operations performed in the function init(), which is called after object construction, and which mainly performs runtime configuration or parameter exchange between components. “Construction”, on the other hand, refers to operations performed within object constructors, which happen before initialization. The term “initialize”, however, is used interchangeably, and the exact meaning can be deduced from the context (e.g., if we are talking about “initializing a data member” within a section on object construction, then it is obvious that “initialize” there means construction operations).

Memory Hierarchy

Memory Event Objects

class MemEventBase and class MemEvent are the two classes used throughout the memory hierarchy to model message exchange between memory components, and to carry commands as well as responses. They are defined in files memEventBase.h and memEvent.h, respectively. The related constants and macros, such as memory commands, coherence states, and command classes, are defined separately in memTypes.h. Both classes are derived from the base class class Event, meaning that they can be sent over link objects just like any other event objects. During initialization, in which runtime configuration (such as topology learning) and parameter exchange take place, another type of event object is passed around, of type class MemEventInit or its derived classes. These objects carry initialization-related information, and are only exchanged during the initialization stage.

MemEventBase

class MemEventBase defines the low-level, link-layer properties of the message, such as the source and destination names of the current hop (which, surprisingly, are of std::string type), src_ and dst_, the original requestor’s name (also an std::string), rqstr_, the memory command, cmd_, and the globally unique event ID, eventID_. If the message is a response to a previous request, then the ID of the matching request event is stored in responseToID_, such that the requestor, as well as all components on the path, can identify the matching request message when receiving a response.

All memory hierarchy event objects are assigned globally unique IDs by calling generateUniqueId(), which is defined in the base class, class Event. The ID is assigned in member function setDefaults(), which is called by the constructor, meaning that all memory hierarchy event objects carry a valid ID. Interestingly, class Event objects themselves do not define the unique ID, but leave the generation and storage of the ID to the derived class.

The class MemEventBase object also carries cache flags in the flags_ field, which uses the flag constants defined in the same source file, beginning with F_. These flags indicate special semantics of the message, such as locked accesses, non-cacheable accesses, etc., and are only used in specific scenarios.

The lifetime of a memory event object spans from the sending of the message to the successful processing of the message. If a new message is to be generated from an old one, the new message is created based on the contents of the old, after which the old one is freed.
This way, the memory hierarchy defines the ownership rule for memory event objects: these objects are created by the sender of the message via the C++ new expression, while the receiving end takes ownership of the message once it is delivered, and is responsible for destroying the message when it is successfully processed (destruction can be delayed if processing indicates a failure and demands a reattempt later). In other words, each memory event object only carries information for one hop, from the source to the destination (which correspond to the src_ and dst_ fields).

Member function setResponse() of class MemEventBase provides a handy shortcut for setting the fields of a response message, given an existing event object as the request, whose command must be of a request type in order for this function to work properly (this prerequisite is not checked, though). The function simply sets the new message’s next-hop destination to the given message’s source, and sets the source to the destination, meaning that the response message is intended to be sent backwards to the exact sender of the given message. The responseToID_ field is also set to the globally unique ID of the request event, such that the receiving end can match the response with the request. The makeResponse() function simply does slightly more than setResponse(), by allocating a new event object as an exact clone of the current one (passing *this into the copy constructor), and then passing the current object as the first argument to setResponse(). The newly created object is a response event to the current event on which the member function is called.

MemEvent

class MemEvent is a direct derivation of class MemEventBase, and an indirect child class of class Event. It inherits all data fields carrying link-level information, and in addition, it defines high-level operation details, such as the byte-granularity address of the request, addr_, the block-aligned address, baseAddr_, whether the message contains dirty data, dirty_, whether the message is a block eviction, isEvict_, and the payload of either a request or a response, payload_ (which is merely a vector of bytes, std::vector<uint8_t>, typedef’ed as dataVec). The object also maintains memory transaction-level control variables, such as the number of retries, retries_, whether the event is blocked by another pending event on the same address, blocked_, whether the event is currently being handled by the controller, inProgress_, and so on.

class MemEvent also defines the makeResponse() interface. The argument-less version overrides the method of class MemEventBase, and it creates a new class MemEvent object that is the response to the current one on which the method is called. Other versions of makeResponse() are also available, which create new event objects using an explicit cmd argument as the command of the new object, or with an unused state argument. The rest of the source file just defines getters and setters for data members, and method functions for testing various flags.

MemEventInit and Its Derived Classes

class MemEventInit is only used during initialization. The class defines its own set of commands of type enum class InitCommand, and stores the command in data member initCmd_. The initialization command can only be one of four values: Region, Data, Coherence, and Endpoint.
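Before moving on to the initialization events, here is a minimal sketch of the setResponse()/makeResponse() pattern described earlier, using simplified stand-in types (in the real classes, a fresh eventID_ is also assigned in the constructor via setDefaults(), which the sketch omits):

```cpp
#include <cstdint>
#include <string>

enum class Command { GetS, GetSResp };

// Simplified stand-in for MemEventBase: only the link-layer fields
// involved in request/response matching are modeled here.
struct MemEventSketch {
    std::string src_, dst_, rqstr_;   // current hop plus original requestor
    Command     cmd_ = Command::GetS;
    uint64_t    eventID_ = 0;         // globally unique in the real class
    uint64_t    responseToID_ = 0;    // ID of the matching request, if any

    // Turn this object (a fresh copy of the request) into its response:
    // swap src/dst so the message travels back one hop, and remember the
    // request's ID so the receiver can match the pair.
    void setResponse(const MemEventSketch* req) {
        dst_          = req->src_;
        src_          = req->dst_;
        rqstr_        = req->rqstr_;
        cmd_          = Command::GetSResp;  // really CommandResponse[cmd]
        responseToID_ = req->eventID_;
    }

    MemEventSketch* makeResponse() const {
        MemEventSketch* resp = new MemEventSketch(*this);  // clone *this
        resp->setResponse(this);
        return resp;
    }
};
```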
class MemEventInit inherits from class MemEventBase, and it can be sent over the memory hierarchy just like other event objects, although this only happens during initialization, and the messages are received by polling rather than by function callbacks. The base class’s command is always set to NULLCMD.

Three classes inherit from class MemEventInit, namely class MemEventInitCoherence, class MemEventInitEndpoint, and class MemEventInitRegion. As the names suggest, they carry the commands Coherence, Endpoint, and Region, respectively, as well as the information related to these matters. We postpone the discussion of each specific event type until the sections where they are used in context.

Constants, Macros, and Flags

File memTypes.h defines the full set of memory commands, coherence states, and coherence transitions used in the memory hierarchy. The macro X_CMDS defines a table of memory commands, with columns for the response command, command class, command route, write-back flag, and the type of the command. Each row of the table is written with a macro X. Depending on the exact definition of X, different columns of the X_CMDS table can be extracted. For example, enum class Command is a C++ enum class whose body is the list of memory commands. To define it, macro X is temporarily defined as #define X(a,b,c,d,e,f,g) a, (note the trailing comma), so that each row of X_CMDS expands to its first argument followed by a comma, i.e., the command itself; expanding X_CMDS after this definition of X yields the enum body.

Other constant arrays are defined in exactly the same way, by extracting different columns from the X_CMDS table. CommandResponse defines the response command of each command, and it extracts the second column of the X_CMDS table as the body of the array. If a command is not of the request type, and hence has no corresponding response command, the response command will be NULLCMD. This array, as well as all arrays covered below, is indexed by the command’s enum class integer value. BasicCommandClassArr and CommandClassArr define basic command classes and command classes, which extract the third and the fourth columns of the X_CMDS table. Similarly, CommandRouteByAddress, CommandWriteback, and MemEventTypeArr define the command route, the command write-back flag, and the memory event type, which are extracted from the remaining table columns.

Coherence states are defined as a table by the macro STATE_TYPES, with the first column being the state, and the second column being the next state. The enum type State stores a list of all coherence states, which is defined by extracting the first column of the STATE_TYPES table.

The Hierarchy Interface

The memory hierarchy in SST does not directly interact with the CPU. Instead, the hierarchy is exposed to the CPU by an interface class, which selectively exposes a limited set of commands, while encapsulating the implementation details of the hierarchy within the interface class. Memory requests are generated by the CPU in the form of class SimpleMem::Request objects, and then given to the interface class (which is instantiated as a subcomponent of the CPU, as we will see later) via direct function calls. The interface class then translates the memory requests into the memory event objects used internally by the hierarchy, and keeps a mapping between requests and memory event requests/responses. When a response is received from the hierarchy, the interface class finds the matching request object, and generates a response object, before returning it to the CPU via a function callback.
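To see the X-macro pattern described above in action, here is a compact, self-contained sketch in the same style, using a two-column table (the real X_CMDS table has seven columns):

```cpp
#include <cstdio>

// A three-row excerpt in the style of X_CMDS: each X(...) row lists a
// command and its response command (further columns omitted here).
#define X_CMDS_SKETCH \
    X(GetS,     GetSResp) \
    X(GetSResp, NULLCMD)  \
    X(NULLCMD,  NULLCMD)

// Extract column 1: the commands themselves, as the enum body.
#define X(a, b) a,
enum class Command { X_CMDS_SKETCH };
#undef X

// Extract column 2: the response command for each command, indexed by
// the command's enum integer value.
#define X(a, b) Command::b,
static const Command CommandResponse[] = { X_CMDS_SKETCH };
#undef X

int main() {
    bool ok = CommandResponse[static_cast<int>(Command::GetS)]
              == Command::GetSResp;
    std::printf("GetS -> GetSResp: %s\n", ok ? "yes" : "no");
}
```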
SimpleMem

class SimpleMem is the base class of the memory interface, and it resides in the core part of SST, in file simpleMem.h. It exposes an interface to the memory hierarchy via class Request objects, and two simple interface functions: sendRequest() and recvResponse() (the latter is only used for the polling-based interface, though, while we only cover the event-based interface). class SimpleMem and all its derived classes are derived from class SubComponent, which cannot be instantiated independently, but must be loaded into a subcomponent slot of the containing component via the configuration script.

class Request is a nested class defined within class SimpleMem, and it is the object that CPUs use to interact with the memory hierarchy. The object contains high-level memory operation information, such as the command, the requested addresses (multiple addresses are supported for, e.g., scatter-gather access patterns), the request size (only one uniform size), the data payload, and some flags (defined as macros beginning with F_). Each request object also has an id field, which stores a globally unique ID generated by atomically incrementing the ID counter, main_id, at object construction time. The main_id counter is defined as a static member of class SimpleMem, and is hence globally consistent. Just as with class MemEventBase, CPUs and interface objects use the global ID to keep track of outstanding memory operations, and to match responses with requests.

Note that the commands used by class Request differ from the internal commands of the hierarchy defined in memTypes.h. The interface-level commands are designed for the relatively high-level operations seen from the perspective of CPUs, and will be translated to and from the internal commands by the implementation of the interface class, for requests and responses, respectively.

The lifetime of request objects, contrary to the lifetime of memory event objects, spans the duration of the memory operation. Request objects are allocated by the CPU that initiates the memory operation (e.g., in commitReadEvent(), commitWriteEvent(), and commitFlushEvent() of class ArielCore), and are kept in an internal pending operation map of the interface, where they are matched against incoming completion notices from the hierarchy. When an operation completes, the original request object is retrieved, updated to reflect the completion status of the operation, and sent back to the CPU. The CPU is responsible for the final destruction of the request after processing it (for example, in ArielCore::handleEvent()).

The Interface Object

The implementation of the interface object in SST’s memory hierarchy is defined in files memHierarchyInterface.h/cc, as class MemHierarchyInterface. The class is derived from class SimpleMem, and hence can be loaded as a subcomponent into the component slot. In the Python configuration file, class MemHierarchyInterface is instantiated by calling setSubComponent() on the CPU component object, with the first argument (slot name) being memory, and the second argument being memHierarchy.memInterface.
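As an aside, the main_id scheme described above reduces to a static atomic counter; a minimal sketch with simplified types:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the global request ID scheme: a static, atomically
// incremented counter hands out one unique id per Request object.
class RequestSketch {
public:
    RequestSketch() : id(main_id.fetch_add(1)) {}
    const uint64_t id;  // fixed for the lifetime of the request
private:
    // static member => a single counter shared by all requests,
    // which is what makes the IDs globally consistent
    static std::atomic<uint64_t> main_id;
};
std::atomic<uint64_t> RequestSketch::main_id{0};
```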
Communicating with CPU and the Hierarchy

class MemHierarchyInterface communicates with the containing CPU via direct function calls, since it is instantiated as a subcomponent, and thus no link is required at the CPU side. At the cache-facing side, the interface object defines a port named port, which needs to be connected explicitly to the highest level of the hierarchy using the connect() method of the Python link object. Memory requests generated by the CPU are given to the interface object by calling its member function sendRequest() with the class Request object as the argument. Responses and hierarchy-initiated event objects are received by the callback function registered with the link object, i.e., handleIncoming(). Once an event is received, it is transformed into a class Request object, and then delivered to the CPU via a registered callback (note that this callback is registered with the hierarchy interface by the CPU, not with the link object), recvHandler_, which is provided as a constructor argument, handler.

The CPU may also choose not to register a callback with the interface object, in which case the handler argument to the interface constructor is set to NULL. The constructor will then register the receiving link as a polling link, meaning that messages received on the link are buffered in a local queue, and must be retrieved explicitly by the CPU via the interface’s method function recvResponse(). We focus on the callback-based mechanism in this article.

The High-Level Workflow

When a request is generated by the CPU, it is sent to the hierarchy by calling member function sendRequest(). The function first transforms the request into a class MemEvent object by calling createMemEvent(), and then keeps track of it by storing the unique ID of the newly created memory event in the internal map, requests_, with the value being the original request object. When a response or a hierarchy-initiated memory event is received on the link of the interface object, the callback function handleIncoming() is invoked. The handler function first calls processIncoming() to transform the event into a request object. If the event should be delivered to the CPU (there are events that will not be delivered to the CPU), a new request object is created, which carries the information of the response. The response class Request object is eventually delivered to the CPU by calling the CPU-registered callback function, recvHandler_, with the object as the only argument.

Transforming Requests to MemEvents

What createMemEvent() does is quite straightforward, and does not require much explanation. It first translates the request command into the memory hierarchy’s command, e.g., reads become GetS, and writes become GetX (but reads with the locked flag set become GetSX). Then it allocates a new class MemEvent object from the heap, and initializes the object with all the information necessary to perform the operation. Note that the src_ field of the memory event object is set to the name of the interface object itself, which is obtained by a getName() call into class BaseComponent, and both dst_ and rqstr_ are set to the rqstr_ field of the interface object. This field is initialized in function init(), and its value is the identity of the other end of the receiving link. This suggests that the rqstr_ field of the interface object stores the string name of the next hop, which is typically the L1 cache or the TLB object.
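The translation rules just described fit in a few lines. A sketch, where the enumerator names follow the text and the locked argument stands in for the locked-access request flag:

```cpp
// Interface-level (CPU-facing) commands vs. internal hierarchy commands.
enum class SimpleMemCmd { Read, Write, FlushLine };
enum class Command      { GetS, GetX, GetSX, FlushLine };

// Sketch of the command translation performed by createMemEvent():
// reads become GetS (GetSX when the locked flag is set), writes become
// GetX; the flush mapping is an assumption based on the response
// command named later in the text.
Command translate(SimpleMemCmd cmd, bool locked) {
    switch (cmd) {
        case SimpleMemCmd::Read:      return locked ? Command::GetSX
                                                    : Command::GetS;
        case SimpleMemCmd::Write:     return Command::GetX;
        case SimpleMemCmd::FlushLine: return Command::FlushLine;
    }
    return Command::GetS;  // unreachable; silences compiler warnings
}
```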
The memory hierarchy also supports custom commands, which carry a command of CustomCmd in the request object. Custom commands are transformed into a memory event object by calling createCustomEvent(), and the resulting memory event object carries a command of CustomReq. We do not cover custom commands in this article, as they are pretty straightforward to implement.

Transforming MemEvents to Requests

Function processIncoming() first attempts to match the memory event object with its request, by obtaining the ID of the request memory event using getResponseToID(), and then querying the map requests_ that tracks pending memory operations (the map uses the request memory event’s ID as the key, and maps to the matching class Request object). If the query fails, meaning that the message was initiated by the hierarchy itself and no corresponding request object exists, but the command is Inv (external invalidation), then a new request object is created with the command being Inv, and returned to the caller. Otherwise, if the query succeeds, meaning that the memory event is a response to a previous request, the original request object of type class Request is fetched from the pending request map requests_, and updated to reflect the completion of the request by calling updateRequest(). Finally, the request object is returned to the caller, and then delivered back to the CPU. The mapping entry is removed from the internal map if the request object is retrieved.

Function updateRequest() updates the command of the class Request object according to the command of the response memory event object. Read operations receive GetSResp, which is translated to ReadResp. Writes receive GetXResp, which is translated to WriteResp. A CPU-initiated flush receives Command::FlushLineResp on completion, which is translated to SimpleMem::Request::FlushLineResp.

Cache: Array, Replacement, and Hash

The next big topic is the cache, which is the central component of SST’s memory hierarchy. SST decouples the organization, operation, and coherence aspects of a cache, and makes them three largely independent parts. In this section, we focus on cache organization, which includes the tag array, the replacement algorithm, and hash functions.

The Cache Tag Array

Tag Array Operations

The cache tag array is defined as class CacheArray, in file cacheArray.h. The cache array class does not specify the exact contents of the tag. Instead, the class has one template argument T for specifying the type that holds per-block tag contents. The tag array object is initialized with cache layout information, such as the number of ways (associativity), the cache size (in number of blocks), and the block size. The number of sets is derived from the above information. Tag objects are stored in data member lines_, which is simply a vector of type T objects. Blocks (in fact, tags) are uniquely identified by their indexes into the lines_ vector. The hash function for mapping addresses to set indexes, and the replacement algorithm that selects a victim block to evict on insertion, are also given as constructor arguments. The implementations of these two are likewise decoupled from the tag array, and they can be loaded as subcomponents into slots hash and replacement (of class Cache, not class CacheArray), respectively, using the Python configuration file. Note that both hash and replacement are just regular data members of class CacheArray; they are supposed to be initialized and loaded into the subcomponent slots by the containing class, and then passed to class CacheArray as constructor arguments.
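The overall shape of the array, as described above, is roughly the following. This is a simplified sketch: the real class receives the hash and replacement subcomponents as constructor arguments, whereas the sketch computes a trivial set index inline:

```cpp
#include <cstdint>
#include <vector>

using Addr = uint64_t;

// Sketch of the tag array layout: per-block tags of type T live in
// lines_, identified by index; the set count is derived from the
// block count and the associativity.
template <typename T>
class CacheArraySketch {
public:
    CacheArraySketch(uint64_t numLines, uint64_t assoc, uint64_t lineSize)
        : lines_(numLines), associativity_(assoc), lineSize_(lineSize),
          numSets_(numLines / assoc) {}  // sets derived, not configured

    // The real array applies the loaded hash subcomponent to the
    // address first; here the mapping is a plain modulo.
    uint64_t setIndex(Addr addr) const {
        return (addr / lineSize_) % numSets_;
    }

private:
    std::vector<T> lines_;  // tags, uniquely identified by vector index
    uint64_t associativity_, lineSize_, numSets_;
};
```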
The tag array object also explicitly maintains replacement information for each tag entry in its data member rInfo. rInfo is a mapping structure, with the set number as the key, and std::vector<ReplacementInfo*> objects as the values. Value objects of rInfo represent per-set replacement information, and each element of the vector corresponds to a tag entry within the set. The vector’s element data type, class ReplacementInfo, is an opaque type exposed by the replacement algorithm module, and stores per-entry replacement information, which we discuss later. Tag entries of type T are also supposed to implement a method, getReplacementInfo(), which, when invoked, returns the class ReplacementInfo object of that tag entry.

The tag array also implements several standard routines for lookup and replacement. Method lookup() just looks up a particular set based on the provided address, and returns a type T pointer as the result if the lookup hits, or NULL if it misses. It also takes an argument indicating whether to update the LRU information of the hit entry, which is performed by calling update() on the replacement manager with the class ReplacementInfo object of the entry. Method findReplacementCandidate() returns the tag entry object to replace within the set, but does not actually replace it. This function is implemented by merely calling the replacement manager’s findBestCandidate() with the set’s class ReplacementInfo vector, which is maintained in rInfo as discussed above. Method replace() just invalidates an existing entry plus its replacement info (by calling replaced() on the replacement manager), and sets a new address for the entry. Method deallocate() simply invalidates an entry and its replacement information.

Note that the cache tag array is not initialized in the cache constructor. Instead, it is initialized by the coherence controller, which selects a tag type (template argument T) from those defined in file lineTypes.h.

Cache Tag Types

File lineTypes.h defines several tag entry types that can be passed to class CacheArray as the template argument. The file defines three base classes, from which derived classes are created to represent different tag contents.

Base class class DirectoryLine defines tag entries that only contain coherence states, but not data. It is supposed to be used for a directory cache’s tag entries. The base class contains the block address for which coherence state is maintained, the state of the block at the current level, a list of sharers in the form of a vector of std::string objects, and the identity of the exclusive owner in the form of an std::string. As mentioned earlier, the base class also carries replacement information, in a data member of type class CoherenceReplacementInfo.

class DirectoryLine also carries a data member, lastSendTimestamp_, that is not used for functional simulation, but to emulate the timing of tag accesses. This data member remembers the time (in local object time) until which the tag is busy, and new accesses to the tag can only be made after this time point. The data member implements a miniature queuing simulation: if the lastSendTimestamp_ value is T1, and an access to the block is made at time T2, then if T1 > T2, the access at time T2 can only be performed at T1, resulting in the tag actually being accessed at T1, and the value of T1 is updated to T1 + t, where t is the access latency. If, on the other hand, T2 > T1, then when the access at time T2 happens, the tag entry is not under any prior operation, meaning that it can be accessed immediately at time T2. The value is updated to T2 + t, which is the time at which the entry will be free for access again. Note that this queuing simulation only handles in-order requests, i.e., the access times T2 must form a non-decreasing sequence. As we will see later in the coherence controller section, this property is trivially guaranteed by the cache controller processing memory events in a tick-by-tick manner.
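In code, the entire mechanism reduces to a max() and an addition. A sketch of the update rule just described:

```cpp
#include <algorithm>
#include <cstdint>

// The lastSendTimestamp_ bookkeeping as arithmetic: an access arriving
// at time t2 on a tag that is busy until t1 is serviced at max(t1, t2),
// and the tag then stays busy for another accessLatency cycles.
// Correct only for non-decreasing t2 (in-order accesses), as the text
// notes.
uint64_t accessTag(uint64_t& lastSendTimestamp, uint64_t t2,
                   uint64_t accessLatency) {
    uint64_t serviced = std::max(lastSendTimestamp, t2);
    lastSendTimestamp = serviced + accessLatency;  // busy-until time
    return serviced;  // when the access actually happens
}
```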
This is also a place where SST does not strictly follow the principles of DES: theoretically speaking, a component simulated by DES should only remember the state at the current cycle, and a state change at a future cycle should be implemented by enqueuing an event that updates the state at that future cycle. The cache hierarchy designers abandoned this idea, due to the overhead it takes to enqueue a trivial state change event (i.e., changing the state of the block from busy to free), and just use a variable to remember a future state change.

class DirectoryLine also implements several getter and setter functions to query and update the coherence information. Specifically, callers can add or remove sharers, query whether a particular cache is a sharer, or whether there is any sharer at all. In addition, the owner can be set or removed, and the coherence state can be changed to one of the four states. The timestamp, although not part of the functional coherence state, can also be set or obtained by calling the member functions.

class DataLine implements a tag entry that contains only data, which is stored as a vector of bytes in data member data_, plus a pointer to a class DirectoryLine object, in data member tag_. Note that the data line object does not contain a coherence state field, and a call to method getState() will either return I, or call into its corresponding tag object’s getState().

class CacheLine implements a tag entry that contains both data, as a vector of bytes, and a coherence state variable, as data member state_. There is neither sharer nor owner information stored, suggesting that this class can only be used in private caches. In addition, this base class does not have replacement information, and method getReplacementInfo() is pure virtual, making the class an abstract class that cannot be instantiated either.

class L1CacheLine is derived from class CacheLine, and it adds some additional information to support more complicated operations, such as LL/SC, locked/atomic instructions, and so on. It also adds a class ReplacementInfo object and a getReplacementInfo() method. Since LL/SC and locked instructions are only seen by the L1 cache, this type of tag is most likely used for L1 caches.

class SharedCacheLine is also derived from class CacheLine, and it adds sharer and owner information just like that in class DirectoryLine. It also contains a class CoherenceReplacementInfo object for replacement. Since there is no LL/SC and locked/atomic instruction support, we can conclude that this type of tag is used for lower-level shared caches.

class PrivateCacheLine, like the previous two, also derives from class CacheLine. As the name suggests, it is used in private, non-L1 caches. There is no LL/SC and locked/atomic instruction support.
The only additional information it has is two boolean flags that reflect the status of the block in the current cache: owned, and shared. It also contains a class CoherenceReplacementInfo object. The owned and shared flags are kept consistent with the two flags within the class CoherenceReplacementInfo object, the meaning of which is explained below.

Replacement Manager

Replacement Information

The memory hierarchy provides several flavors of the replacement manager, which are contained in file replacementManager.h. This file also defines two extra classes: class ReplacementInfo and class CoherenceReplacementInfo, which must be included in tag array entries, and be obtainable by calling getReplacementInfo() on those entries (i.e., on the type T argument of class CacheArray). Both classes store the information necessary to make a replacement decision within a set.

class ReplacementInfo stores the index of the corresponding tag entry in data field index (this duplicates the index of the tag entry, but is necessary, since the replacement manager has no knowledge of the tag entry itself), and the coherence state of the tag in data field state. The state need not precisely reflect the actual coherence protocol, and can just be a rough approximation, such as I/non-I. class CoherenceReplacementInfo inherits from class ReplacementInfo, and it stores more detailed information about the coherence state, i.e., an owned boolean flag indicating whether the block is owned by the current cache, and a shared flag indicating whether upper-level caches may also have a copy of the same block. Replacement decisions can hence prioritize certain lines over others based on their detailed coherence status.

Replacement Algorithm

All replacement algorithms inherit from the base class, class ReplacementPolicy, which itself is derived from class SubComponent, and therefore must be instantiated into a slot of its containing component in the Python configuration file. The class is pure abstract, and only serves the purpose of specifying the exposed interface of a replacement algorithm: update(), replaced(), getBestCandidate(), and findBestCandidate(). As we have seen above, the cache tag array calls these methods to implement block replacement and invalidation.

class LRU is an implementation of the ideal LRU replacement algorithm. The class constructor takes the total number of blocks in the cache, and the number of ways (which is not used, though), and initializes an internal vector, array, which stores per-tag LRU counters. Note that class LRU does not rely on the per-tag class ReplacementInfo objects to store the LRU counter, as the counter is not defined in that class. The class also maintains a global LRU counter, which is incremented every time the counter of a tag is updated, with the previous value of the global counter assigned to the tag’s counter, as can be seen in update().

The findBestCandidate() method of class LRU takes a vector of class ReplacementInfo objects, which correspond to the tag entries of the set on which the replacement is being performed, and searches for an entry in the I state, or, if no entry is in the I state, the entry with the minimum LRU counter value tracked by array. The index of the chosen entry is stored in data member bestCandidate, and is also returned by the method. Method getBestCandidate() simply returns the bestCandidate value.
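A sketch of this logic, with the LRU counters kept in the policy's own array, and the per-entry info reduced to an index and an I/non-I state (the sketch assumes a non-empty set):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class State { I, NonI };

struct ReplacementInfoSketch {  // per-tag info handed to the policy
    unsigned index;             // position of the tag in the array
    State    state;
};

// Sketch of class LRU: timestamps live in the policy's own 'array',
// indexed by tag index, not inside ReplacementInfo.
class LRUSketch {
public:
    explicit LRUSketch(std::size_t lines) : array(lines, 0), timestamp(1) {}

    void update(const ReplacementInfoSketch* ri) {
        array[ri->index] = timestamp++;  // previous global counter value
    }

    unsigned findBestCandidate(const std::vector<ReplacementInfoSketch*>& set) {
        bestCandidate = set[0]->index;
        uint64_t best = array[set[0]->index];
        for (auto* ri : set) {
            if (ri->state == State::I) {        // invalid entry wins outright
                bestCandidate = ri->index;
                break;
            }
            if (array[ri->index] < best) {      // otherwise: oldest timestamp
                best = array[ri->index];
                bestCandidate = ri->index;
            }
        }
        return bestCandidate;
    }

    unsigned getBestCandidate() const { return bestCandidate; }

private:
    std::vector<uint64_t> array;  // per-tag LRU counters
    uint64_t timestamp;           // global counter
    unsigned bestCandidate = 0;
};
```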
Note that the comment block on top of findBestCandidate() in the source code describes the function incorrectly (up to version 11.1.0, which is the reference version this article assumes), and it seems that the developers copy-pasted it from that of class LRUOpt. A mistake on their part!

class LRUOpt implements an optimized version of LRU, which not only takes LRU counter values into consideration, but also gives priority to entries in particular states. To elaborate: the algorithm aims at avoiding the eviction of blocks that are likely to be used again and/or would incur larger overhead if evicted. The heuristic is as follows. First, invalid tag entries are always the most preferred victims, just as in plain LRU. Second, blocks that are shared by upper-level caches are protected relative to blocks that are not, as evicting a shared block incurs extra invalidations to the sharers in order to maintain inclusiveness. Next, blocks that are owned by the current cache are also protected, since ownership typically indicates an intention to write and/or a dirty block, and should not be given up. Tag entries that satisfy none of the above criteria are, after invalid entries, the preferred victims.

To establish this ordering between tag entries, class LRUOpt defines a helper class, struct Rank, that enables comparisons between tag entries using the criteria above. In method findBestCandidate(), each element of rInfo is used to construct a struct Rank object, by copying the state, owned, and shared members from the class CoherenceReplacementInfo object into struct Rank’s fields of the same names, and copying the entry’s timestamp value from array into struct Rank’s timestamp member. Then a loop similar to the one in class LRU finds the smallest struct Rank object, and its corresponding tag entry. struct Rank objects are compared according to the prioritization rule: for any two entries under comparison, an I-state entry is always smaller than a non-I-state entry (in the current source code, this check is performed outside of struct Rank, but it could also be done within); a non-shared entry is smaller than a shared one; and a non-owned entry is smaller than an owned one. At last, if the ordering still cannot be determined, the LRU timestamp is used for final arbitration, which never results in a tie.

Hash Functions

Hash functions are another class of subcomponent used by class CacheArray, and they are defined in file hash.h. The pure abstract base class, class HashFunction, exposes a single method, hash(), which, given an ID and an address, returns a hash value. In practice, class CacheArray always calls hash() with the ID being zero, and the address being the address of the block. Three hash function implementations are provided, all of which inherit from class HashFunction. class NoHashFunction does not perform any hashing, and simply returns the input value unchanged. class LinearHashFunction performs a linear transformation of the input value x using the formula 1103515245 * x + 12345, and it ignores the ID argument. class XorHashFunction XORs every byte of the input value with the byte above it, except the topmost byte, which is kept unchanged. It ignores the ID argument as well.
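All three hash functions are small enough to sketch directly from the descriptions above (a faithful-in-spirit sketch; the real classes implement the hash() virtual interface):

```cpp
#include <cstdint>

// NoHashFunction: the identity mapping.
uint64_t noHash(uint64_t x) { return x; }

// LinearHashFunction: the linear transformation named in the text.
uint64_t linearHash(uint64_t x) { return 1103515245ULL * x + 12345ULL; }

// XorHashFunction: each byte XORed with the byte above it; the topmost
// byte is kept unchanged.
uint64_t xorHash(uint64_t x) {
    uint64_t out = 0;
    for (int i = 0; i < 7; ++i) {
        uint64_t b  = (x >> (8 * i)) & 0xFF;
        uint64_t hi = (x >> (8 * (i + 1))) & 0xFF;
        out |= (b ^ hi) << (8 * i);
    }
    out |= x & 0xFF00000000000000ULL;  // highest byte passes through
    return out;
}
```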
Cache: Construction and Controller Operation

This section focuses on the cache object’s construction, and on the high-level event processing operations that do not involve coherence (which we simply treat as a black box). We mainly cover the trunk of the cache object’s operation logic, and largely ignore minor details such as debugging, statistics, prefetching, and so on (they are pretty straightforward once the cache operations are made clear).

Memory Links

Cache objects are not always connected by point-to-point regular links of type class Link. Instead, they are connected by class MemLink objects or their derivations, which may provide routing capabilities based on address ranges and interleaving. In this section, we mainly focus on class MemLink itself, which still provides the abstraction of a point-to-point communication link, and is simply a wrapping layer over class Link.

Memory Link Operations

class MemLink is defined in files memLink.h/cc, and is derived from class MemLinkBase, defined in file memLinkBase.h. class MemLinkBase is derived from class SubComponent, meaning that it must be loaded into a containing component’s slot. class MemLinkBase contains a callback function as data member, recvHandler, which is called in method recvNotify(). The callback function must be set after the construction of the object (or its derivations) by calling setRecvHandler(), with a caller-provided standard functor object for class Event. The object also contains a struct EndpointInfo data member, but class MemLink does not leverage the endpoint, so we defer that discussion for now.

class MemLink wraps a link object as its data member, link. The link is initialized and registered with SST’s global link map in class MemLink’s constructor, with the port name given in argument params, and the receiving callback function being recvNotify(), which, as we have seen above, is a member function of class MemLinkBase. class MemLink also exposes a send() method, which does exactly what send() on a link object does; in fact, the wrapper function just calls send() on the link member. The net effect of using class MemLink is that, after construction and setting the external receiving handler, a send() method call on the memory link object sends an event object through the link, and a received message first invokes the internal method recvNotify(), which in turn calls recvHandler with the received event object, passing the object to the external handler. We conclude, therefore, that a class MemLink object alone is merely a wrapped link object with identical semantics.

End Point Information

Each memory link object derived from class MemLinkBase has a data member, info, of type struct EndpointInfo, which stores the endpoint information of the memory link. The struct EndpointInfo object contains a string object, name, which is set to the identity of the component that the memory link object belongs to, by calling getParentComponentName() in the base class constructor (derived classes may modify it, but class MemLink does not). The addr and id fields are the routing address for on-chip routing, and the context-specific ID of the link. For class MemLink, both fields are set to zero, and are not used. The region field is of type class MemRegion, defined in file memTypes.h, and it describes a memory region in the form of a start address, an end address, an interleave size, and an interleave step. Region information for a memory link can be specified via the component, but for a point-to-point class MemLink, the region is just set to [0, -1UL] without any interleaving, covering the entire address space.
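A hedged sketch of what such a region test plausibly looks like, given the fields just described (start, end, interleave size, interleave step); the exact containment logic of class MemRegion in memTypes.h may differ:

```cpp
#include <cstdint>

using Addr = uint64_t;

// A region is a [start, end] range, optionally interleaved: chunks of
// interleaveSize bytes owned by this region repeat every
// interleaveStep bytes (step assumed non-zero whenever size is).
struct MemRegionSketch {
    Addr start, end;
    Addr interleaveSize = 0, interleaveStep = 0;

    bool contains(Addr addr) const {
        if (addr < start || addr > end) return false;
        if (interleaveSize == 0) return true;  // no interleaving
        return ((addr - start) % interleaveStep) < interleaveSize;
    }
};

// The default point-to-point MemLink region covers everything:
// MemRegionSketch whole{0, static_cast<Addr>(-1)};
```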
class MemLink also maintains information about the other end of the memory link, i.e., its immediate neighbor, as well as all endpoints reachable from that neighbor. The former is stored in data member remotes, the latter in data member endpoints. Both members are of type std::set, and can be updated by calling addRemote() and addEndpoint(), respectively, which usually happens at initialization time. In addition, data member remoteNames stores the set of string names of the struct EndpointInfo objects in remotes. This structure is used to quickly decide whether an immediate neighbor is reachable via the current link, in method isReachable(). Note that since class MemLink is a direct point-to-point link, there can only be a single remote endpoint, and hence the sets remotes and remoteNames are always of size one.

class MemLink exposes a few interface functions for querying the remote and endpoint information. Method findTargetDestination() returns the name (which is used as a global identifier) of the component whose address region contains the given address. It can be regarded as a simple routing function which, given the address, selects the next-hop destination from all registered destinations tracked by remotes. Method getTargetDestination() is identical to findTargetDestination(), except that, if the next-hop destination cannot be found because no region contains the given address, it prints an error message and a list of known destinations to assist debugging.

Memory Link Initialization

During initialization, class MemLink objects that are connected together exchange their identities, and register each other in their remotes structures. This happens in the init() function, which is called during first-stage initialization. Recall that first-stage initialization happens globally, with the simulator main function iteratively calling init() on all components (in the case of link objects, their init() is called by the containing component’s init()), with the argument (phase, in class MemLink’s init() function) being the current iteration number.

In the first iteration (where phase is zero), the class MemLink object creates an initialization message of type class MemEventInitRegion, and sends it over the link by calling sendInitData() on the link object. Note that the send and receive functions during the initialization stage are polling-based, meaning that the receiving end must explicitly call recvInitData() in order to read messages. As we have seen in earlier sections, the class MemEventInitRegion type is defined in memEventBase.h, and is derived from class MemEventInit, which itself is derived from class MemEventBase. The initialization message carries the name and region values of the current info object, with the event command being InitCommand::Region (set implicitly in the event object’s constructor). This happens in the first iteration at both ends of the class MemLink connection.

Then, in later iterations, the init() function simply listens on the link for initialization messages from the other end. This is implemented as a while() loop that calls recvInitData() repeatedly until the receive queue is empty. For each event, the command is inspected. If the command is Region, then the link object itself processes the event, by creating a struct EndpointInfo object from the information contained in the message, and inserting the object into its own data member remotes. The event object is not used anywhere else, as it is destroyed right in the loop.
If the command is Endpoint, then the message is cast to type class MemEventInitEndpoint (also defined in memEventBase.h), and the endpoint information carried by the message is added to data member endpoints using addEndpoint(), meaning that this message contains a list of objects that are reachable from the other end of the memory link. One thing that differs from the handling of the Region command is that the Endpoint command event is not destroyed in the loop. Instead, it is added to the data member initReceiveQ, such that the message can also be read by the containing component during initialization. The memory link object provides an interface for this purpose, namely recvInitData(), for the containing component to read event objects that have been added to initReceiveQ. Correspondingly, the containing component may also send initialization messages via the memory link by calling sendInitData() on the link object, the implementation of which simply wraps sendInitData() of class Link. Event objects with unknown commands are also just inserted into initReceiveQ, such that the containing component may read them in its init() function.

Cache Object Construction

High-Level Construction Workflow

Contrary to most components in SST, the cache, due to its code size and complexity, is divided into several source files that do not follow the naming convention, making the source tree hard to navigate for beginners. File cacheFactory.cc contains the cache object constructor. Files cacheController.h/cc contain the class definition and the implementation of event processing. Certain method names are rather misleading, such as createCacheArray(), which only computes the cache array parameters, and does not have anything to do with the actual creation of cache arrays. In fact, class CacheArray is not even a data member of the cache. Instead, it is contained in the coherence controller, and is initialized along with the coherence protocol (i.e., different coherence protocol options may initialize cache arrays differently).

The cache object’s constructor is defined in file cacheFactory.cc. It first reads the cache block size and the number of banks, and stores them in data members lineSize_ and banked_, respectively. Data member bankStatus_, a vector of booleans tracking whether a particular bank has been active in the current event processing cycle, is also initialized by reserving banked_ elements. The constructor then calls createCacheArray() to compute the parameters of the cache tag array, specifically, the number of blocks in the cache. As a result, the function inserts the key lines, with a string representing the number of blocks, into the param object for future use.

The cache is registered with the global clock in method createClock(), which is called after createCacheArray(). The method registers the cache with a tick frequency read from parameter key cache_frequency, with the handler being method clockTick(). Next, the constructor reads the number of requests that can be processed per cycle into data member maxRequestsPerCycle_. At last, the constructor initializes the cache’s links with other components by calling configureLinks(), initializes the coherence controller (and the tag array) by calling createCoherenceManager(), and registers statistics using registerStatistics().
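Of these steps, createCacheArray() deserves a quick illustration, since its name oversells what it does. A sketch of the computation, with the Params object modeled as a plain string map (the key lines comes from the text; the function arguments are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of the work inside createCacheArray(): derive the number of
// blocks and stash it back into the parameter object under key
// "lines"; the actual array is built much later, by the coherence
// controller.
void computeArrayParams(std::map<std::string, std::string>& params,
                        uint64_t cacheSizeBytes, uint64_t lineSizeBytes) {
    uint64_t lines = cacheSizeBytes / lineSizeBytes;  // e.g. 32 KiB / 64 B = 512
    params["lines"] = std::to_string(lines);
}
```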
One thing to note is that the coherence controller also keeps a copy of the cache’s memory link objects, meaning that the coherence controller can implement its own methods for sending messages. Receiving messages, however, must always go through the cache object, as the receiving handler is bound to the cache’s method.

Creating Links

The links between the cache object being initialized and other components of the hierarchy are configured during construction by calling configureLinks(). This function is also the biggest function in file cacheFactory.cc, as the cache object supports several different flavors of configuration for connecting to other components in the hierarchy. In this section, we only cover the simplest configuration, namely, connecting the cache object to an upper (closer to the CPU) and a lower (further from the CPU) component via point-to-point links, without any packet routing. In this configuration, the cache object only has two ports: the high network port, high_network_0, which connects to the higher-level component, and low_network_0, which connects to the lower-level component. More complicated topologies, such as those with an on-chip network, can be realized by connecting the cache to other components via more advanced memory link objects.

At a high level, function configureLinks() infers the cache’s link configuration by detecting which ports are connected in the configuration, and then sets up the memory link objects based on the result. According to the comment and the function body, six port combinations are considered valid (the comment block above the function header specifies five valid combinations, while the function body additionally supports loading user-defined memory link objects as subcomponents into slots cpulink and memlink explicitly in the configuration file), and we discuss the simplest among them: high_network_0 plus low_network_0, the configuration in which the cache object connects to a higher- and a lower-level component, respectively, via point-to-point class MemLink objects.

First, configureLinks() checks whether the cache component’s slot cpulink or memlink (or both) is loaded with a subcomponent derived from class MemLinkBase. If so, those will be used as the links. Otherwise, the function checks the connectivity of the following four ports: high_network_0, cache, directory, and low_network_0. The combination of connected ports is also checked for validity. If both high_network_0 and low_network_0 are connected, the function just initializes two class MemLink objects into subcomponent slots cpulink and memlink, by calling loadAnonymousSubComponent(). Note that in this case, neither subcomponent needs to be explicitly specified in the configuration file. The parameter objects given to class MemLink’s constructors are memlink and cpulink (unwise naming, of course), defined right above the if block. The parameters specify key port as high_network_0 and low_network_0, respectively, such that the class Link object wrapped within each memory link can be configured correctly. Recall that, in SST, the topology of components is inferred from the Python configuration file first, and then class Link objects are initialized based on the topology and inserted into per-component link maps, after which components are initialized. So at component construction time, the link objects have already been created, and can be retrieved using the port names specified in the Python configuration file.
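A skeletal sketch of the detection logic, reduced to the one combination covered here; the boolean arguments stand in for the component's slot-occupancy and port-connectivity queries, and the other valid combinations are elided:

```cpp
#include <stdexcept>

enum class LinkConfig { ExplicitSubcomponents, HighLow /* , ... */ };

// Sketch of configureLinks()'s decision: user-loaded MemLinkBase
// subcomponents win; otherwise the connected-port combination selects
// how the anonymous MemLinks are created.
LinkConfig detectLinkConfig(bool cpulinkLoaded, bool memlinkLoaded,
                            bool highConnected, bool lowConnected) {
    if (cpulinkLoaded || memlinkLoaded)
        return LinkConfig::ExplicitSubcomponents;  // from the config file
    if (highConnected && lowConnected)
        return LinkConfig::HighLow;  // two anonymous MemLinks, on ports
                                     // high_network_0 / low_network_0
    // ...checks for the remaining valid combinations elided...
    throw std::runtime_error("invalid port combination");
}
```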
Region information (which is likely not important here) for the two memory links is also given by the parameter objects, which we do not discuss. The last step of link construction is to bind the memory links (and the wrapped link objects) to a handler function for receiving memory events. The handler function is class Cache’s member function handleEvent(), which will be called whenever either of the two links receives a message. The other port combinations are initialized in the same function as the one we have discussed above. For simplicity, we skip the rest of the function, as the other combinations more or less follow the same logic.

Cache Initialization

Cache initialization is performed in init(). This function handles the case where linkUp_ and linkDown_ are identical, and the case where they are distinct objects, differently. In this section, we only discuss the latter, as it is the more general case. The former is almost identical to the latter, except that only one link is used for sending and receiving instead of two.

On each iteration, the link objects’ init() is called, with the iteration number passed as the argument. As we have discussed earlier, the link objects will learn their immediate neighbors, as well as the overall topology of the memory hierarchy. If the iteration number is zero, the cache also sends, in both directions, an initialization message on behalf of the coherence controller, by calling getInitCoherenceEvent() on the controller to obtain the message, and then calling sendInitData() on the memory link objects.

On all iterations, the initialization routine keeps receiving messages on the link objects by calling recvInitData() repeatedly (i.e., polling) until there are no more messages in the current iteration. We use event processing on linkUp_ as an example. For each initialization event object received, the function first inspects its command. If the command is NULLCMD, meaning that it is an initialization event object carrying a separate initialization command, the latter is obtained by calling getInitCmd(). Coherence type messages are processed at the current cache, by calling processInitCoherenceEvent() on the coherence controller, and are not forwarded, since the current cache also sends coherence messages to its neighbors; in other words, coherence type messages only travel one hop. Endpoint type messages, however, are duplicated and sent in the other direction (to linkDown_) by calling sendInitData(). Messages with a non-NULLCMD command are forwarded in the other direction after being duplicated, with the destination set properly according to the address carried in the message. In all cases, the originally received message is destroyed, meaning that an initialization message’s lifetime, like that of regular messages, is also only one hop.

The event processing on linkDown_ is similar, but messages with non-NULLCMD commands will not be forwarded up, because otherwise, it would cause an infinite loop, as the up links would also forward them back. Endpoint type messages are still forwarded, but the destination is not set to the next-hop neighbor; it remains the same as in the original message.
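The forwarding rules for linkUp_ can be summarized in a short sketch; all types and helpers below are stand-ins, and the real code also fixes up destinations before forwarding:

```cpp
enum class Cmd { NULLCMD, Other };
enum class InitCmd { Region, Data, Coherence, Endpoint };

struct MemEventInitSketch {
    Cmd cmd = Cmd::NULLCMD;
    InitCmd initCmd = InitCmd::Region;
    MemEventInitSketch* clone() const { return new MemEventInitSketch(*this); }
};

struct MemLinkSketch {
    void sendInitData(MemEventInitSketch*) { /* hands off to the link */ }
};

struct CoherenceCtrlSketch {
    void processInitCoherenceEvent(MemEventInitSketch*) {}
};

// Sketch of the rules applied to init messages arriving on linkUp_.
void handleInitFromUp(MemEventInitSketch* ev, MemLinkSketch* linkDown,
                      CoherenceCtrlSketch* coherence) {
    if (ev->cmd == Cmd::NULLCMD) {
        if (ev->initCmd == InitCmd::Coherence)
            coherence->processInitCoherenceEvent(ev);  // consumed: one hop only
        else if (ev->initCmd == InitCmd::Endpoint)
            linkDown->sendInitData(ev->clone());       // duplicated downward
    } else {
        // Carries a real command: forward it downward, with the
        // destination set according to the address (elided here).
        linkDown->sendInitData(ev->clone());
    }
    delete ev;  // init messages also live for exactly one hop
}
```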
Cache Operations

The cache class definition and method implementations are in files cacheController.h/cc. The callback function for receiving incoming events from both links is handleEvent(), whose logic simply adds the event object to the new event buffer, eventBuffer_, after calling recordIncomingRequest() on the coherence controller. The latter call is solely for statistical purposes, and does not involve any operational details.

Cache operations are implemented in method clockTick(), which is registered as the clock tick function during construction. At the beginning of clockTick(), the cache drains the coherence controller’s outgoing queues for both directions, by calling sendOutgoingEvents() on the coherence controller. Recall that the coherence controller keeps a copy of the cache’s memory link objects, and can hence send its own messages without going through the cache. Also, this method does not always fully drain the queues, since the number of outgoing events per cycle is capped by the maximum bandwidth of the coherence controller (which we discuss later in the sections on coherence controllers). The function then clears the state from the previous cycle for access arbitration. Access arbitration can be modeled at per-address or per-bank level, using data member addrsThisCycle_ or bankStatus_, respectively.

Next, the cache handles retried events. An event is retried if it was rejected by the coherence controller or failed access arbitration, in which case the event was added to the retry buffer, retryBuffer_. The cache processes entries in the buffer one by one, until the buffer is empty, or maxRequestsPerCycle_ events have been processed. The latter condition essentially implements a processing bandwidth limit. maxRequestsPerCycle_ can also be configured as -1, in which case there is no bandwidth limit. Each entry in the retry buffer is processed by calling processEvent(). If the entry is accepted, the function returns true, and the entry is removed from the retry buffer. Otherwise, it just stays there, and will be revisited in the next tick.

After processing the retry buffer, the cache proceeds to the event buffer, eventBuffer_, which stores newly received events from both links that have not yet been rejected. The processing logic is identical to that of the retry buffer, except that the second argument to processEvent() is false. The prefetch buffer, prefetchBuffer_, is also processed after the event buffer. We do not cover prefetching here, and will also skip it in the rest of the section.

Note that the second argument to processEvent() indicates whether the event is from the MSHR, or from the event buffer. The difference between the two is that requests that are either internally generated (e.g., evictions and write backs), or cause cache misses, are allocated one or more slots in the Miss Status Handling Register (MSHR) file. These requests are handled with the flag set to true, indicating that the MSHR slot needs to be recycled if the request is successfully processed. On the other hand, if an event can be fulfilled immediately, or the MSHR allocation itself is rejected (due to the MSHR being full), it will not be inserted into the MSHR, and the flag will be set to false. Detailed discussion of the MSHR is postponed to a later section.
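Putting the above together, clockTick() has roughly the following skeleton (buffers modeled as std::list, the controller and processEvent() stubbed out, the prefetch buffer omitted; whether the per-cycle budget is shared between the two buffers is an assumption of the sketch):

```cpp
#include <cstdint>
#include <list>
#include <set>
#include <vector>

struct Event;  // opaque stand-in for MemEventBase

struct CacheTickSketch {
    std::list<Event*> eventBuffer_, retryBuffer_;
    std::set<uint64_t> addrsThisCycle_;
    std::vector<bool> bankStatus_;
    int maxRequestsPerCycle_ = -1;  // -1 means no bandwidth limit

    bool processEvent(Event*, bool /*fromMSHR*/) { return true; }  // stub
    void sendOutgoingEvents() {}                                   // stub
    std::list<Event*> controllerRetries() { return {}; }           // stub

    bool clockTick() {
        sendOutgoingEvents();  // drain controller queues (may be partial)
        addrsThisCycle_.clear();                       // reset arbitration
        bankStatus_.assign(bankStatus_.size(), false); // state per cycle

        int processed = 0;
        auto drain = [&](std::list<Event*>& buf, bool fromMSHR) {
            for (auto it = buf.begin(); it != buf.end();) {
                if (maxRequestsPerCycle_ != -1 &&
                    processed >= maxRequestsPerCycle_) return;
                if (processEvent(*it, fromMSHR)) { it = buf.erase(it); ++processed; }
                else ++it;  // rejected: stays, retried on a later tick
            }
        };
        drain(retryBuffer_, true);   // retries first
        drain(eventBuffer_, false);  // then newly received events

        auto more = controllerRetries();  // accepted-but-deferred events
        retryBuffer_.splice(retryBuffer_.end(), more);
        return false;  // stay registered with the clock
    }
};
```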
An event in eventBuffer_ can be in one of three states after being handled by processEvent(). First, if the event is successfully processed, and no more action is needed in the future, processEvent() returns true, and the event is removed from the buffer and then deallocated. This is most likely a cache hit, which neither generates internal events, nor requires miss handling. Second, the event can be successfully processed, but generate internal events, or incur a cache miss. Such an event needs to be allocated an entry in the MSHR, possibly together with all the internal events it generates. In this case, processEvent() still returns true, meaning that the event can be removed from the buffer (because the handling of the event itself was successful), but the coherence controller will later put the generated events into the retry buffer, such that these internal events are also handled. In addition, the event object will not be deallocated until the response message for the cache miss it incurred is received. Lastly, the event can be rejected by the bank arbitrator (see below) or the MSHR. In this case, processEvent() returns false, and the event object remains in eventBuffer_. These events will be reattempted in the following cycles, until the handling eventually succeeds.

After processing the buffers, the cache appends the retry buffer of the coherence controller (obtained by calling getRetryBuffer() on the controller object) to the cache’s own retry buffer. These events are the ones that were accepted by the coherence controller, but could not be handled immediately, and for which MSHR entries were allocated. Consequently, all events in the retry buffer are handled by processEvent() with the second argument set to true.

Event Processing

Method processEvent() implements the event processing logic for each individual event. At the beginning, it checks whether the event represents a non-cacheable operation. If so, it is handled separately, and the operation always succeeds. Next, for cacheable operations, which are the normal case, the cache arbitrates the access by calling arbitrateAccess(). The function decides whether the request can be processed in the current cycle using the history of previous accesses in the same cycle. There are two ways in which accesses can be arbitrated. If the boolean flag banked_ is false, meaning that banked cache access is not modeled, then access is arbitrated with data member addrsThisCycle_, which tracks the set of addresses that have been accessed in the current cycle. If the address of the current event has already been accessed in the same cycle, the request is rejected. On the other hand, if banking is modeled, then bankStatus_ is used to track the access status of each bank. If a bank has been accessed previously in the same cycle, the corresponding bit flag is set to true, which causes a later event on the same bank to be rejected. The bank index of a given address is computed by calling getBank() on the coherence controller, which forwards the call to the function with the same name in class CacheArray. The index computing function simply returns the address modulo the number of banks, meaning that addresses are mapped to banks in an interleaved manner. (Note: I think this is incorrect, because set-associative caches allow a block to be stored in any of the ways of the set the address maps to. The bank index should at least be a function of the set number, e.g., the set number modulo the number of banks.)
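A sketch of the two arbitration modes, directly mirroring the description above (getBank() assumes a non-zero bank count, and is only consulted when banking is modeled):

```cpp
#include <cstdint>
#include <set>
#include <vector>

using Addr = uint64_t;

struct ArbiterSketch {
    bool banked_ = false;
    std::set<Addr> addrsThisCycle_;  // per-address mode
    std::vector<bool> bankStatus_;   // per-bank mode

    // Interleaved mapping, as the text describes (and questions).
    uint64_t getBank(Addr addr) const { return addr % bankStatus_.size(); }

    // True if the access may proceed in the current cycle.
    bool arbitrateAccess(Addr addr) const {
        if (!banked_) return addrsThisCycle_.count(addr) == 0;
        return !bankStatus_[getBank(addr)];
    }

    // Called after an event is successfully processed.
    void updateAccessStatus(Addr addr) {
        if (!banked_) addrsThisCycle_.insert(addr);
        else bankStatus_[getBank(addr)] = true;
    }
};
```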
After access arbitration, the function just uses a big switch statement to call into the coherence controller methods based on the event command. Note that both request type and response type commands are processed in this switch block, since the event handler is registered as the callback for memory links in both directions. The return value of these methods reflects whether the event is successfully processed, or requires a retry. In the former case, the arbitration information is updated by calling updateAccessStatus(). In the latter case, the event is kept by the coherence controller in its own retry buffer, which, as we have seen above, will be moved to the cache's retry buffer at the end of the event handling cycle.

Cache: Coherence Controllers

The next major part of the cache is the coherence controller. SST's memory hierarchy implements a number of different protocols derived from MESI, either to give users more flexibility in choosing the base protocol (e.g., MSI or MESI), or to adapt to different scenarios (e.g., inclusive or non-inclusive, private or shared, L1 or non-L1). Each flavor of the coherence protocol is implemented as an individual class, and which class is instantiated as the coherence controller of the cache depends on both user configuration and the role of the cache in the memory hierarchy. All coherence controllers are derived from the same base class, class CoherenceController, defined in file coherenceController.h/cc under the coherencemgr directory. Each type of coherence controller is defined in its own header and cpp files, with the file names being self-descriptive.

Coherence Controller Construction

class CoherenceController is initialized during cache construction, in function createCoherenceManager() (file cacheFactory.cc). The function first reads the access latency of the cache data, specified with parameter key access_latency_cycles, and the latency of the cache tag, specified with key tag_access_latency_cycles. The tag access latency is optional, though, and if not specified, it defaults to the data access latency. Then, the protocol is read using the key coherence_protocol, the value of which can be mesi, msi, or none. The boolean flag L1 is also read with the key L1, to indicate whether the cache is an L1 cache or not (L1 caches require special treatment during construction). The cache type is read with key cache_type, which can be of value inclusive, noninclusive, or noninclusive_with_directory. The function also ensures that L1 caches are always configured to be inclusive, and that non-coherent caches are non-inclusive (although the latter case is rare, since most caches need some sort of coherence). The MSHR object is then created by calling createMSHR(). We postpone the discussion of the MSHR to a later section, and only focus on the coherence controller here. The function then proceeds to prepare the parameters, stored in coherenceParams, for initializing the coherence controller. We do not cover the meaning of all parameters or switches here, and only introduce them as they are encountered in later discussions.

The coherence controller is then loaded into the cache object's subcomponent slot coherence as an anonymous subcomponent. The exact type of the coherence controller to load, however, depends on the combination of the protocol, the inclusiveness, and the L1 flag. For coherent, non-L1 caches, if the cache is inclusive, the controller class that will be loaded is class MESIInclusive, defined in file MESI_Inclusive.h/cc.
If it is not inclusive, and has no directory (i.e., a non-shared cache, likely a private L2), then the class is class MESIPrivNoninclusive, defined in file MESI_Private_Noninclusive.h/cc. If it is not inclusive, and has a directory (i.e., a shared cache, likely an LLC or a shared L2), then the class is class MESISharNoninclusive, defined in file MESI_Shared_Noninclusive.h/cc. Note that inclusive, non-L1 caches do not get to choose whether a directory is needed, and always use class MESIInclusive, which has inline directory entries. For coherent L1 caches, inclusiveness does not matter (they must be inclusive), and the coherence manager is of type class MESIL1, in file MESI_L1.h/cc. Non-coherent caches use class Incoherent objects (file Incoherent.h/cc) and class IncoherentL1 objects (file Incoherent_L1.h/cc), respectively, for non-L1 and L1 caches. Inclusion does not matter for non-coherent caches, since they are intrinsically non-inclusive. Their implementations are trivial, and we do not cover them in this article.

After loading the coherence controller, the function reads prefetching-related parameters, before it sets the up link and down link of the coherence controller with the linkUp_ and linkDown_ objects of the cache by calling setLinks(). In other words, the coherence controller also keeps a copy of the memory link objects of the cache, such that it is also capable of sending memory event objects to other components in the hierarchy. In addition, the MSHR object of the cache is also passed to the coherence controller by calling setMSHR().

Construction Performed In Derived Classes

The coherence controller base class constructor leaves the class CacheArray object uninitialized, and its initialization should be completed by the derived class constructors. The main reason is that class CacheArray requires a protocol-specific template argument T as the contents of the tag. Such information is not known until derived class construction time.

Replacement Manager and Hash Function

The base class, however, provides two method functions that can be called by child classes to initialize the replacement manager and the hash function, which are both needed by class CacheArray. Function createReplacementPolicy() initializes a replacement manager object as indicated by the parameter key replacement_policy. It supports a few possible values: lru for Least Recently Used (LRU), lfu for Least Frequently Used (LFU), mru for Most Recently Used (MRU), random for random replacement, and nmru for Not Most Recently Used (NMRU). For non-L1 caches, the -opt version of the corresponding replacement policy is used, while for L1 caches, the non-opt version is used. The replacement manager is eventually loaded, as a subcomponent, into the replacement slot of the coherence controller. Function createHashFunction() initializes the hash function for the tag array, and loads it as a subcomponent into the hash slot of the controller. Both this function and the previous one are called by derived classes in their constructors, when they initialize the tag array (recall that class CacheArray takes a replacement manager and a hash function object as construction arguments).
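As an illustration, a derived controller constructor might tie these helpers together roughly as follows. This is a hypothetical sketch: the parameter keys and the argument lists of createReplacementPolicy(), createHashFunction(), and the CacheArray constructor are illustrative, not the actual SST signatures.

```cpp
// Hypothetical sketch of a derived coherence controller constructor.
MESIL1::MESIL1(ComponentId_t id, Params & params, Params & ownerParams, bool prefetch)
        : CoherenceController(id, params, ownerParams, prefetch) {
    uint64_t lines = params.find<uint64_t>("lines");         // illustrative keys
    uint64_t assoc = params.find<uint64_t>("associativity");
    // Base class helpers load the subcomponents into the 'replacement' and
    // 'hash' slots, as described above.
    ReplacementPolicy * rpl = createReplacementPolicy(lines, assoc, params, true /* L1 */);
    HashFunction *     ht  = createHashFunction(params);
    // The tag array is instantiated with the protocol-specific line type T.
    cacheArray_ = new CacheArray<L1CacheLine>(dbg_, lines, assoc, lineSize_, rpl, ht);
}
```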
The Cache Tag Array

Each derived class of class CoherenceController selects its own tag array type. class MESIInclusive uses CacheArray<SharedCacheLine>, which contains sharer and owner information within the tag entry, i.e., the cache directory is inline. This is consistent with the fact that the class MESIInclusive controller is used for inclusive caches (both shared and private), in which case all addresses that are cached in the upper levels must also be present at the current level, and the directory can just be conveniently implemented in the tag entry. class MESIPrivNoninclusive uses CacheArray<PrivateCacheLine>, which only keeps track of very limited information about a block, i.e., owned and shared, indicating whether the block is owned at the current level, or potentially present at a higher level. For private caches, since there can only be one upper level cache, one bit for tracking the block status is sufficient. class MESISharNoninclusive needs a more complicated mechanism to track block state both at the current level and at higher levels. This is because the cache is non-inclusive, and therefore blocks that are present in the higher levels may not always be present at the current level. To track blocks in both conditions, the controller initializes two tag arrays: one directory tag array for tracking blocks in the upper levels and the current cache, which is data member dirArray_, of type CacheArray<DirectoryLine>, and one data tag array for tracking blocks only in the current cache (i.e., just the regular tag array), which is data member dataArray_, of type CacheArray<DataLine>. Recall that class DataLine also keeps a reference to a directory entry, which stores the coherence information of the block. Lastly, class MESIL1 uses CacheArray<L1CacheLine>, which carries only the most essential information without coherence states, plus several extra data members for supporting LL/SC and locked/atomic instructions. These bits are only present in the L1 tag array, because these high-level operations are only performed in the L1. Non-coherent caches use their own CacheArray instantiations, respectively, for non-L1 and L1 caches. The reason is straightforward: they do not need to keep coherence information, but L1 caches still need the extra flags to support certain instructions.

Coherence Controller Initialization

The coherence controller class does not have an init() function, and its initialization is performed by the init() function of the containing cache controller. As discussed earlier, in the first iteration of init(), the cache controller will generate an object of type class MemEventInitCoherence, by calling getInitCoherenceEvent() on the coherence controller, and this object will then be sent to the neighboring caches. On receiving a coherence type message, processInitCoherenceEvent() will be called to process it. class MemEventInitCoherence carries the name of the originating cache, stored in base class data member src, the endpoint type of type enum class Endpoint (type_, defined in file util.h), the data block size (lineSize_), and a few boolean flags describing the operational properties of the controller, such as whether the protocol is inclusive (inclusive_), whether the sender always acknowledges the write backs it receives (sendWBAck_), whether write back ACKs are expected (recvWBAck_), and whether the controller tracks the presence of addresses in other caches (tracksPresence_). According to the source code comments, the last boolean flag affects whether clean evictions are needed.
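Condensed into code, the message carries roughly the following fields (a sketch of class MemEventInitCoherence; constructors, accessors, and serialization are omitted):

```cpp
// Condensed sketch of class MemEventInitCoherence, as described above.
class MemEventInitCoherence : public MemEventInit {
    Endpoint type_;           // endpoint type from util.h (e.g., CPU, Cache, Memory)
    bool     inclusive_;      // whether this cache's protocol is inclusive
    bool     sendWBAck_;      // always acknowledges write backs it receives
    bool     recvWBAck_;      // expects ACKs for write backs it sends
    bool     tracksPresence_; // tracks which caches hold an address
                              // (affects whether clean evictions are needed)
    uint64_t lineSize_;       // data block size in bytes
};
```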
Method getInitCoherenceEvent() is implemented in the derived classes of the cache controller, and it simply creates a class MemEventInitCoherence object and initializes it by passing construction arguments (we do not discuss how each controller class selects its own arguments). processInitCoherenceEvent(), by contrast, is implemented in the base class, and it takes two arguments. The first argument, event, is the event object received from the link, and the second argument, source, indicates whether the message was sent by the controller above or below (true for above, and false for below). Note that the naming is a little confusing. My best guess is that the method was originally designed for the scenario where linkUp_ and linkDown_ are identical, in which case the argument just indicates whether the message was sent by the controller itself from one of the two links and received by the other.

Method processInitCoherenceEvent() sets the current controller's data members and boolean flags according to the received message. Messages sent from above will cause an update of the local sendWritebackAck_ flag, and if the message is from a CPU, then the name of the CPU will also be added into data member cpus. Messages from below will cause updates of lastLevel_ (if the component below is not a memory, then lastLevel_ will be false), silentEvictClean_ (if the component below uses a bitmap to track block sharers, then even clean evictions should generate an event to the lower level), writebackCleanBlocks_ (if the lower level is not inclusive, then clean blocks need to be written back explicitly), and recvWritebackAck_ (if the lower level will send ACKs for write backs, then this one will also expect to receive them).

The Coherence Controller Base Class

The base class of the coherence controller, class CoherenceController, defines a few handy functions and types that are useful for all derived classes regardless of their types. These functions mainly implement low-level event sorting and dispatching. Derived classes do not override these methods, and just call them as helper functions.

Response Objects

class CoherenceController has an enclosed data type, class Response, which is a wrapper class for class MemEvent objects generated by the coherence protocol. class Response contains a pointer to the event object (using the base class type class MemEventBase, though, since only link-level information is needed), and a timestamp value, deliveryTime, in the component's local cycle domain, which represents the time the response message is generated. Recall that a response message can be put into the sending queue before the simulated cycle in which it is actually generated, due to the fact that tag access delay is not modeled using DES, but with a miniature queueing model, with a per-tag timestamp variable remembering the local time when the block becomes free (see Section "Cache Tag Types" for more details). As we have discussed earlier, this violates the principle of DES, since the event object "time travels" to an earlier cycle, at which time it would not have been in the sending queue. To overcome this, class Response objects use deliveryTime to remember the simulated cycle in which the event is generated. The sending functions should hence ignore response objects whose deliveryTime is greater than the current simulated cycle. The last member of class Response objects is size, which stores the number of bytes that the memory event object is supposed to be.
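In code, the wrapper is roughly the following (condensed sketch, directly following the description above):

```cpp
// Condensed sketch of the Response wrapper nested in class CoherenceController.
struct Response {
    MemEventBase * event;        // event to send; link-level info suffices
    uint64_t       deliveryTime; // local cycle in which the event is
                                 // (logically) generated; the send loop must
                                 // not send it before this cycle
    uint64_t       size;         // simulated message size in bytes, used for
                                 // bandwidth accounting when sending
};
```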
This size value is not the physical storage the object takes in the simulator. Rather, it is the simulated size of the message, which is used for bandwidth auditing when sending the message. Also note that class Response has nothing to do with class Request in the interface object. The former is only used as a wrapper layer within the controller class, and external objects will not find it useful by any means.

class CoherenceController has two internal queues for response objects, namely outgoingEventQueueDown_ and outgoingEventQueueUp_, one for each direction, as suggested by the names. These two queues are of type std::list<Response> (note that the source file imports the entire namespace std, which is not good practice), and they maintain the response objects sorted by deliveryTime. Objects with the same deliveryTime are ordered by insertion order, meaning that event objects generated by the coherence controller will be sent in the same order as they are generated.

Queue Maintenance Helper Functions

Methods addToOutgoingQueue() and addToOutgoingQueueUp() insert a response object into outgoingEventQueueDown_ and outgoingEventQueueUp_, respectively. Each method iterates over the list from the end to the beginning using a reverse iterator, until the deliveryTime of the response to be inserted is greater than or equal to that of the current response object in the list, or until a memory event object with the same block address (obtained via getRoutingAddress()) is found. In the former case, either there is no response object at deliveryTime yet, and the one being inserted is the first, or response objects already exist at deliveryTime, and the insertion happens at the end of all response objects at deliveryTime, i.e., the order of objects with the same delivery time is consistent with the order in which they are inserted. In the latter case, the response object is essentially delayed by a later response object on the same block address. This guarantees that response messages on the same address are not reordered. (Note: I doubt whether this check is necessary, since the coherence controller handles events on a tick-by-tick basis. If two response objects on the same address are to be inserted into the queue, then the first one must have a smaller delivery time, since they must use the same tag entry to derive the delivery time, and these two response objects will be serialized on that tag entry. This, however, does not hold if some responses are generated without being serialized on the tag entry.)
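For concreteness, the insertion logic can be sketched as follows (reconstructed from the description above; only the down-direction queue is shown, and the up-direction twin is symmetric):

```cpp
// Sketch of addToOutgoingQueue(): keep the list sorted by deliveryTime,
// preserve insertion order among equal times, and never let a message
// overtake an earlier message to the same block address.
void CoherenceController::addToOutgoingQueue(Response & resp) {
    std::list<Response>::reverse_iterator rit;
    for (rit = outgoingEventQueueDown_.rbegin();
         rit != outgoingEventQueueDown_.rend(); ++rit) {
        if (resp.deliveryTime >= rit->deliveryTime) break;   // time order
        if (resp.event->getRoutingAddress() ==
            rit->event->getRoutingAddress()) break;          // same-block order
    }
    // rit.base() points right after the element found by the reverse scan.
    outgoingEventQueueDown_.insert(rit.base(), resp);
}
```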
Message Helper Functions, Part I

Methods forwardByAddress() and forwardByDestination() are helper functions that assist in creating new event objects and sending them to another component via the memory links. Method forwardByAddress() obtains the base address of the event object to be sent via getRoutingAddress(), and tests linkDown_ and linkUp_ in turn to find the outlet, by calling findTargetDestination() on the link object. The function gives priority to linkDown_ over linkUp_, meaning that if the address falls into the range of both links, it will be sent down. This is exactly what happens if point-to-point memory links are used, i.e., all event objects sent with forwardByAddress() will be sent downwards (for more complicated topologies, this is not guaranteed). A heavily used pattern with this function is to create a new event and copy the address from an earlier request, or to duplicate an earlier request, and then call forwardByAddress() on the new event. This is likely the case where an existing request is forwarded to a lower level, because it could not be satisfied at the current level, or where the current cache generates a request on its own and sends it down. The source and destination fields of event objects sent this way still need to be valid, though, as these two fields might be needed when creating the response event object at the lower level. Method forwardByDestination(), on the other hand, tests both links by calling isReachable() to see if the destination (a std::string object) of the event object is an immediate neighbor. A heavily used pattern with this function is to call makeResponse() on an event, which reverses the input event's source and destination, and then send the new event using forwardByDestination(). This way, the event will be sent upwards, most likely as the response message to an earlier request.

Message Helper Functions, Part II

There are also message forwarding helper functions built on top of the two discussed in Part I. Method forwardMessage(), given an existing event, duplicates that event by copy construction. An optional data payload is set, if one is given. The delivery time of the event object, which is also the return value, is computed based on the given base time argument baseTime (although in the L1 MESI controller, this value is always zero, meaning that it starts from the current simulated tick) and the current simulated tick timestamp_. If the event object is cacheable, the access latency being modeled is the tag latency, tagLatency_; otherwise, it is the MSHR latency, mshrLatency_. The newly created object is eventually sent to the lower level by calling forwardByAddress(), before the delivery time is returned.

Method sendNACK() just creates a NACK event object from a given event object by calling makeNACKResponse(), which produces essentially a response event object with the command being NACK. The delivery time is the current simulated tick plus the tag access latency (i.e., modeling a tag access). The new event object, with its destination and source being the source and destination of the input object, is returned to the sender of the input object by calling forwardByDestination().

Method resendEvent() does not create any new event object, but just implements exponential back off for an existing object, and schedules its sending in a future cycle. This also violates the principle of DES, just like how access latency is modeled. The back off value is computed based on how many reattempts have been made. The delivery time of the event is then the current tick, plus the tag latency, plus the back off cycles. The event is sent with forwardByDestination().
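A sketch of resendEvent() under the description above; the exact back off formula, its cap, and the retry accessors (getRetries(), incrementRetries()) are assumptions, not SST's actual code:

```cpp
// Illustrative sketch of resendEvent(): exponential back off keyed on the
// number of reattempts already made.
uint64_t CoherenceController::resendEvent(MemEvent * event, bool towardsCPU) {
    uint64_t backoff = (uint64_t)1 << std::min<uint32_t>(event->getRetries(), 10);
    event->incrementRetries();
    uint64_t deliveryTime = timestamp_ + tagLatency_ + backoff;
    // The event already carries valid src/dst fields, so it can be resent
    // to the same neighbor without recomputing the route.
    forwardByDestination(event, deliveryTime);
    return deliveryTime;
}
```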
Method sendResponseUp() is very similar to forwardMessage(), but it creates a response event from an existing one, and sends it up by calling forwardByDestination() (since the source and destination are just the inverse of those in the existing object, which is assumed to be a request received from above), rather than down. The delivery time models the MSHR latency, if the input event object is from the MSHR (argument replay set to true), or the tag latency, if it is a freshly received event object being handled for the first time (argument replay set to false). The delivery time is also returned. This function also has several overloaded versions, which mainly differ from each other in what command the response object carries, or in whether the dirty bit is passed as an argument.

Message Helper Functions, Part III

Method sendOutgoingEvents() drains the outgoing event queues, until the queues become empty, or until a byte limit per cycle is reached, whichever comes first. At the very beginning of the function, the current simulated cycle is incremented by one. Recall that this function is called at the very beginning of the cache object's clock callback, meaning that it is the de facto tick function for coherence controllers. The function then drains both queues, outgoingEventQueueDown_ and outgoingEventQueueUp_, with local variable bytesLeft tracking the number of remaining bytes that can be sent in the current tick. Note that this function allows partial messages to be sent, i.e., an event object can be sent over a few cycles. The actual sending is performed by calling send() on the corresponding memory link object, and then popping the front object from the queue. The function returns a boolean to indicate whether both queues are empty.
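A condensed sketch of the drain loop follows. Only one queue is shown; the per-cycle byte budget maxBytesPerCycle_ is an assumed name, and the partial-send bookkeeping, which lets a single event span several cycles, is simplified into a plain stop:

```cpp
// Condensed sketch of sendOutgoingEvents() (down queue only; simplified).
bool CoherenceController::sendOutgoingEvents() {
    timestamp_++;                             // de facto tick for the controller
    uint64_t bytesLeft = maxBytesPerCycle_;   // assumed name
    while (!outgoingEventQueueDown_.empty()) {
        Response & resp = outgoingEventQueueDown_.front();
        if (resp.deliveryTime > timestamp_) break;  // not generated yet
        if (resp.size > bytesLeft) break;           // out of bandwidth this cycle
        bytesLeft -= resp.size;
        linkDown_->send(resp.event);
        outgoingEventQueueDown_.pop_front();
    }
    return outgoingEventQueueDown_.empty() && outgoingEventQueueUp_.empty();
}
```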
Function Stubs

The base class also defines stub functions for each type of coherence message it may receive, named handleX, where X is the event type. These stub functions, once called, print an error message and terminate the simulation, in order to indicate that the functionality for event type X is not implemented. Derived classes should override at least a subset of these functions and implement the coherence handling logic.

Miss Status Handling Registers (MSHR)

The Miss Status Handling Register (MSHR) is a critical component of the coherence controller that serves three distinct roles in request handling. First, requests that cause cache misses cannot be handled locally, and the request must be forwarded to a remote cache (e.g., a lower level cache for CPU data requests). During this period, the original request will be inserted into the MSHR, which blocks all future requests on the same address, i.e., requests on the same address are serialized by the MSHR. The request in the MSHR will be removed when the response is received, and later requests on the same address can then proceed. Second, requests that originate from the current cache, such as block evictions and write backs, will use the MSHR as an event buffer. These requests are inserted into the MSHR, possibly serialized with existing requests on the same address, and expect to be serviced just like a normal request. Lastly, the MSHR is also essential for implementing atomic instructions. It allows a request to temporarily fail on a locked cache block just because the block is under the protection of an atomic instruction, even if the tag array lookup indicates a cache hit. In this case, the failing request will also be inserted into the MSHR, and retried in a later cycle.

Since the MSHR buffers requests that could not be completed in the earlier cycle in which they were processed, when a request from the MSHR completes, the next request on the same address will be removed from the MSHR, and added into the retry buffer, retryBuffer_, of the coherence controller. As discussed in earlier sections, these requests will be copied to the cache controller's retry buffer at the end of the cycle, and will then be reattempted in the next processing cycle. The MSHR is implemented by class MSHR, in file mshr.h/cc. Note that the class is neither a component nor a subcomponent, since it is always instantiated directly rather than loaded through the subcomponent API.

MSHR Construction

The MSHR is constructed along with the cache, in the cache object's member function createMSHR(), defined in file cacheFactory.cc. The function reads two MSHR parameters: mshr_num_entries, which is the number of entries available for storing external requests (internally generated requests are not counted against this limit, as we will see later), and mshr_latency_cycles, which is the latency of accessing the MSHR. The latter will be used, instead of the cache tag latency, to compute the cache block access time, if the request comes from the MSHR rather than from the cache's event buffer. The MSHR latency is an optional parameter. If not given, a default value is used, which is one cycle for L1 caches, and is otherwise computed from a lookup table based on the cache access latency. Later in the cache constructor, both the MSHR object and its latency are passed to the coherence controller, by calling setMSHR() and by setting the key mshr_latency_cycles in the parameter object, respectively.

MSHR Data Structure

The main class, class MSHR, is implemented as a mapping structure that translates addresses to lists of waiting events. The class contains a map object, mshr_, of type MSHRBlock, which is defined as std::map<Addr, MSHRRegister>, a data member size_ for tracking the current number of external event objects in the MSHR, and maxSize_, which is the capacity of the MSHR, initialized during construction. We ignore the other data members, as they are either prefetching- or debugging-related. The mapped type of mshr_ is struct MSHRRegister, a per-address structure that contains the list of request objects being buffered, entries, a data member pendingRetries, which tracks the number of requests for the address that are in the coherence controller's retry buffer and may be processed in the current or future cycles, and other data members, such as acksNeeded, dataBuffer, and dataDirty. We will cover the usage of these data members in detail when we first encounter them in the following discussion.

All MSHR entries are of type class MSHREntry, which carries a type field of type enum class MSHREntryType. The type field can take one of three values: Event, Evict, and Writeback. Among them, the Event type entry is for external events that occupy MSHR space, while both Evict and Writeback type entries are for internally generated eviction and write back requests, respectively, and they do not occupy MSHR space, in order to avoid protocol deadlock (i.e., external requests can be blocked by internal requests and fail to be allocated an MSHR entry, but not vice versa). Other data members of class MSHREntry also play their respective roles in different parts of the coherence protocol. One of the most commonly seen data members is inProgress, which indicates that the request is already being handled and is just waiting for a response; it is only defined for Event type entries. The flag is set when a data or flush request has been issued to the lower level cache, but the response has not been received yet. The coherence controller checks this flag when trying to schedule a waiting event in the MSHR, and if the flag is set, indicating that the current head MSHR entry for a given address has already been scheduled, the coherence controller will not schedule it twice. Each entry type has its own constructor, and hence the constructor is overloaded.
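Condensed, the data structures described above look roughly like the following (simplified sketch; accessor methods and the prefetch- and debug-related members are omitted):

```cpp
#include <cstdint>
#include <list>
#include <map>
#include <vector>

// Condensed sketch of the MSHR data structures described above.
enum class MSHREntryType { Event, Evict, Writeback };

class MSHREntry {
    MSHREntryType   type;
    MemEventBase *  event;      // valid for Event entries
    bool            inProgress; // Event entries: first half already issued
    bool            needEvict;  // Event entries: initialized from stallEvict
    std::list<Addr> evictPtrs;  // Evict entries: the "new addresses"
    bool            downgrade;  // Writeback entries: downgrade vs. invalidate
public:
    MSHREntry(MemEventBase * ev, bool stallEvict); // Event
    MSHREntry(Addr newAddr);                       // Evict
    MSHREntry(bool downgr);                        // Writeback
};

struct MSHRRegister {
    std::list<MSHREntry> entries;        // waiting requests; front is active
    uint32_t             pendingRetries; // entries currently in the retry buffer
    bool                 acksNeeded;
    std::vector<uint8_t> dataBuffer;
    bool                 dataDirty;
};

class MSHR {
    std::map<Addr, MSHRRegister> mshr_;  // the MSHRBlock typedef
    int size_;                           // external (Event) entries held
    int maxSize_;                        // capacity, set at construction
};
```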
Event type entries are constructed with the event object and a boolean flag, stallEvict, which initializes the needEvict field. Evict type entries are initialized with an address, which is the new address that will replace the old address (the address key used to query mshr_) after the eviction. The new address value is needed for cascaded evictions, i.e., when one eviction blocks another and both are in the MSHR, in which case the second eviction needs to evict the new address stored in the first eviction's entry, rather than the old address stored in the second eviction's entry. Also note that the data member for storing these new pointers, evictPtrs, is a std::list<Addr>, meaning that multiple addresses can potentially be the new address for an eviction. Write back entries are constructed with a boolean flag argument, downgr, which initializes data member downgrade, to indicate whether the write back is a downgrade from an exclusive to a shared state, or whether it also invalidates the block.

MSHR Operations

Method insertEvent() inserts an Event type entry into the MSHR for a given address. Besides the address and the event object, it also takes three arguments: pos, fwdRequest, and stallEvict. Argument pos indicates the position in the list of entries where the insertion should happen. The most common values are -1 and 0, where -1 means appending at the end, and 0 means inserting at the beginning. Most events are appended to the MSHR, but high priority events, such as external invalidations (from the lower level), take precedence over all existing event objects, and are inserted at the very beginning of the list. Argument fwdRequest indicates whether the event is an explicit invalidation from the lower level. If this flag is set, then two slots, instead of one, must be available, since the invalidation in this case will be waiting for another request that can unblock it, and that request may further allocate an MSHR entry. If no extra slot were reserved, the invalidation and the second request would deadlock. Argument stallEvict is just forwarded to the MSHR entry object constructor, and is not otherwise used by the method. The method first checks whether the MSHR is full (recall that if fwdRequest is set, two slots are needed). If it is full, no allocation happens, and the function returns -1. Otherwise, size_ is incremented by one. The function then creates a new entry, and possibly the register object for the address, if it does not yet exist, and adds the new entry at the requested position of the list in the register object.

Methods insertEviction() and insertWriteback() work similarly, by creating an entry of the respective type and inserting it into the register's list. The differences are that, first, neither function checks the size, for the reason discussed earlier; and second, insertWriteback() inserts at the beginning of the list, meaning that write back requests have higher priority than every other request waiting in the MSHR. The above functions are the most frequently used ones in the coherence controller. We will cover other, less used functions of class MSHR when we first encounter them in later discussions.
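A simplified sketch of insertEvent(), reflecting the size check and the two-slot reservation described above (the return convention other than the -1 failure case is an assumption):

```cpp
// Simplified sketch of MSHR::insertEvent(); returns the entry's position in
// the register's list, or -1 if the MSHR is full.
int MSHR::insertEvent(Addr addr, MemEventBase * ev, int pos,
                      bool fwdRequest, bool stallEvict) {
    int needed = fwdRequest ? 2 : 1;        // invalidations reserve an extra slot
    if (maxSize_ > 0 && size_ + needed > maxSize_)
        return -1;                          // full: no allocation happens
    size_++;                                // only one entry is actually added
    MSHRRegister & reg = mshr_[addr];       // creates the register if absent
    if (pos < 0 || (size_t)pos >= reg.entries.size()) {
        reg.entries.emplace_back(ev, stallEvict);   // append (pos == -1)
        return (int)reg.entries.size() - 1;
    }
    reg.entries.emplace(std::next(reg.entries.begin(), pos), ev, stallEvict);
    return pos;
}
```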
Coherence Protocol: MESI L1

We next delve into the realm of the coherence protocol of the MESI L1 type cache. As discussed earlier, this type of cache is always inclusive, uses class L1CacheLine type cache tags, and does not have a directory of any kind. This type of cache also features a simpler coherence protocol compared with the more general types of inclusive and non-inclusive caches, since there is no coherence to maintain for upper levels. The MESI L1 coherence protocol is implemented by class MESIL1, in file MESI_L1.h/cc.

The General Coherence Handling Flow

Before discussing the different classes of coherence operations in detail, we first summarize the general rules of coherence handling as follows. CPU-initiated requests can be served from either the receive buffer of the cache controller, or from the retry buffer of the coherence controller. Requests that cannot be handled immediately will be inserted into the MSHR, and, when certain conditions are met (e.g., when the messages they expect are received), they can be sent to the retry buffer and attempted again in the next cycle. The per-address MSHR register field pendingRetries tracks the number of retries on the address at the current moment. Note that there is always a race condition between requests served from the retry buffer (i.e., from the MSHR) and those served from the receive buffer. In the latter case, the requests are not allocated any MSHR entry, and they can be considered as logically ordered before those currently in the MSHR, if they perform operations that can be completed immediately, most likely cache hits. These race conditions are harmless, though, since cache hits do not change the addresses stored in the tag array.

Method argument inMSHR, which is common to most of the coherence handling methods and many helper functions, indicates whether the request is served from the MSHR, or from the receive buffer. Many handlers or branches, especially those implementing operations that cannot be completed immediately, assume the invariant that the inMSHR flag is true and that the event to be handled is at the front of the MSHR register, before any real coherence action is taken. The invariant is typically enforced by checking whether inMSHR is already true. If not, an MSHR entry will be allocated by calling allocateMSHR(). The processing only continues if the return value of allocateMSHR() indicates that the allocation was successful and that the entry is at the front. Otherwise, the handler just returns true, such that the event is removed from the cache controller's buffer, and it will only be retried once it reaches the front entry of the MSHR register.

A data request can be completed in one of three manners: (1) The data request can be immediately completed in the current cycle. The request may or may not have been waiting in the MSHR in this case. The handler will call cleanUpAfterRequest() to attempt to schedule the next request in the MSHR for a retry. This case usually happens on a cache hit. (2) The data request cannot be immediately completed in the current cycle, due to conditions such as cache misses, downgrades, invalidations, etc., and the request is inserted into the MSHR, if not already there. In the meantime, the request initiates internal requests to eliminate the condition that blocks it from being completed. When the responses to the internal requests are received, the request completes automatically, and the response handler calls cleanUpAfterResponse() to attempt to schedule the next request in the MSHR for a retry. (3) Same as (2), except that after the responses have been handled, the same request is retried by calling retry(). It will then fall into either case (1) or case (2).
This last case is common if the response handler is unsure of the type of the request it is dealing with, and would rather delegate the state transition and post-response processing to the request handler. Note that this is merely a programming artifact to simplify coding, and it does not reflect the way real hardware handles coherence requests.

In particular, data requests that miss the current level will be inserted into the MSHR, and then handled in a non-atomic, split-transaction manner when they reach the front. In the first half (the request path), if an eviction is needed, the controller inserts an eviction entry in the old address's MSHR register. When that eviction request reaches the front of the MSHR register, and is successfully handled, the original data request is retried by calling retry(). If the miss persists on the retry, or if an eviction is not needed, the data request handler forwards the data request to the lower level. The coherence state of the block also transits to a transient state to indicate a pending transaction. In the second half (the response path), the response is received and processed by the response handler. The response handler transits the state back to a stable state, and the request is completed in the response handler by calling cleanUpAfterResponse().

Certain requests may already be in progress, which is checked by calling getInProgress(), when their preceding requests complete or when a response is received and handled. This indicates that the request has completed the first half of the split transaction, and is waiting for the responses, and hence need not be retried on completion of the preceding request. In this case, these requests will not be retried by cleanUpAfterRequest() and cleanUpAfterResponse(). Correspondingly, when a request finishes its first half of execution, the handler needs to call setInProgress() to set the in-progress flag. Note that the reason some requests can already be handled while not sitting at the front entry of the MSHR register is that some requests can "cut the line" and be inserted at the front of the MSHR. When such requests complete, it is necessary to check whether the next request is already in progress.

If a request can be handled immediately, it will just complete in the same cycle as it is handled, without involving the MSHR. Otherwise, the coherence controller inserts it into the MSHR, and it will be handled in the future, after all requests that precede it are drained (as discussed earlier, the waiting request is scheduled for a retry by one of the preceding requests or the corresponding responses). Insertion into the MSHR takes place by calling allocateMSHR(), which is defined in the base class controller. There are three outcomes of MSHR allocation: (1) The allocation succeeds, and the allocated entry is the front entry of the register. The function returns OK. (2) The allocation succeeds, but the entry is not the front entry. The function returns Stall. (3) The allocation fails, due to the MSHR being full. The function returns Reject. The caller of allocateMSHR() must check the return value and act accordingly. In case (1), the request must be handled immediately, which executes the first half of the split transaction. The event will be removed from the cache controller's buffer when the handler returns true. When the second half of the split transaction completes on receiving the response event, the response event handler will eventually remove the request from the MSHR.
In case (2), if the request does not work on transient states, then the request must not be handled immediately, since it is not at the front entry of the MSHR. The caller should just return true to the cache controller, which indicates that the request can be removed from the cache controller's buffer. The request will be scheduled for a retry (in fact, its first attempt) when the preceding request completes. On the contrary, if the request works on transient states, because it can resolve race conditions (e.g., a race between a downgrade and an upper level flush), then the request may still be handled to resolve the race condition first, and then, in a later cycle, be retried from the MSHR to perform its own task. In case (3), for L1 caches, the request must not be handled immediately, since it failed to acquire an MSHR entry. The caller should return false to the cache controller, which indicates that the request must remain in the cache controller's buffer, and be retried by the cache controller in the next cycle. For non-L1 caches, if case (3) happens, the caller still returns true, and sends a NACK to the upper level cache, indicating that the upper level cache is responsible for a retry in the future.

External downgrades and invalidations can also be received, and they will race with CPU-initiated requests. The coherence protocol always orders external events before concurrent CPU-initiated requests, with only a few exceptions, such as when the cache block is locked, or (for non-L1 caches) when the external event would cause the second half of a split transaction to fail.

Each MSHR register has a pending retry counter, which tracks the number of queued requests that will be retried in the current cycle. This counter is incremented when a new request is inserted into the coherence controller's retry buffer, and decremented when a request handler executes with the inMSHR flag set. Certain operations (most notably, evictions) require this counter to be zero in order to proceed.

Despite the serialization provided by MSHR registers, evictions will always race with all requests on the old address. This is because evictions are initiated by requests on the new address (i.e., the address whose access incurs a cache miss), while the eviction itself operates on the old address (i.e., the address to be replaced). In addition, although evictions can only be initiated after the original request has been inserted into the MSHR, the eviction can be attempted on the old address without first being inserted into the MSHR register of the old address, essentially making it of higher priority than all existing entries on the old address. As a result, the eviction path handling is extremely cumbersome, with lots of condition checks, the sole purpose of which is to ensure that evictions do not race with existing entries in the old address's MSHR register in a harmful way.

In the following text, we discuss the three major classes of operations, namely CPU-initiated data requests, CPU-initiated flush requests, and external requests (i.e., downgrades and invalidations), in separate sections. Helper functions are covered when they are encountered for the first time.
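To make the general flow concrete, the common shape of a MESI L1 request handler is sketched below. The handler name is hypothetical and the allocateMSHR() arguments are simplified; individual handlers differ in their actual coherence actions:

```cpp
// Illustrative shape of a MESI L1 request handler (not an actual SST method).
bool MESIL1::handleSomeRequest(MemEvent * event, bool inMSHR) {
    // Enforce the invariant: the event must be in the MSHR and at the front
    // of its register before real coherence actions are taken.
    MemEventStatus status = MemEventStatus::OK;
    if (!inMSHR)
        status = allocateMSHR(event, false);

    switch (status) {
    case MemEventStatus::OK:      // front entry: execute the first half now
        /* ... hit handling, or start of the split transaction ... */
        return true;              // remove from the cache controller's buffer
    case MemEventStatus::Stall:   // in the MSHR but not at the front: wait;
        return true;              // a preceding request will retry it later
    default:                      // Reject: MSHR full
        return false;             // L1: stay in the buffer (non-L1 would NACK)
    }
}
```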
CPU-Initiated Data Requests

High-Level Overview

The CPU may initiate three types of data requests: GetS, which obtains a block in the shared state for reads; GetX, which obtains a block in the exclusive state for writes; and GetSX, which is equivalent to GetX, except that the block is locked in the L1 cache until a future GetX with an atomic flag hits it. GetSX is essential for implementing atomic read-modify-write instructions.

From a high-level perspective, these requests are handled either as a single transaction, in the case of cache hits, or as a few separate atomic transactions, if the access misses. The coherence controller first performs a lookup on the tag array. If the tag entry of the requested address exists, and it is in a state that does not need further coherence actions (e.g., the E/M state on GetX), then the access is a hit, and the response is sent back in the same cycle the request is processed. Otherwise, the access incurs a cache miss, and the coherence controller inserts the request into the MSHR (if this fails, the request is rejected by the coherence controller, and the cache controller just keeps it, as discussed earlier). Meanwhile, a new request is created based on the current request, and the new request is sent to the lower level to acquire the block. The rest of the actions depend on whether the address tag already exists in the current cache. If it does, meaning that the coherence action is taken to upgrade the permission of the block (which only happens for GetX and GetSX), then no further action is taken. If, however, the address tag does not exist, then two more transactions are started: an eviction and a write back. The coherence controller first attempts to evict the block in the same cycle. If this is not achievable, due to the block being locked, or pending retries on the address to be evicted, the controller creates an eviction request using the existing block's address, and inserts it into the MSHR. In addition, the controller also creates a request to write back the contents of the evicted block, and inserts it into the MSHR as well. The write back request may be omitted for clean blocks, if the lower level cache does not explicitly request clean write backs (this is negotiated during initialization, as discussed for the cache controller's init() function).

The coherence state of the cache block also transits to a transient (unstable, intermediate) state to indicate the ongoing transaction of block fetching or upgrading. I state blocks transit to IS or IM, depending on the type of the request. S state blocks transit to SM, if an upgrade is required. E and M state blocks never miss, and the request is handled trivially in one cycle (although E state blocks become M on GetX requests). The second half of the transaction begins when the response event for an earlier cache miss is received. The coherence controller matches the response event with the outstanding MSHR entry, removes the entry, and then transits the block state to a stable state. Extra actions may also be taken, such as locking the cache block, if the request is GetSX, or setting the LL/SC atomic flag. The transaction concludes by creating and sending a response message up that indicates the completion of the access.
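The request-path state transitions just described can be summarized as the following sketch (GetSX follows the GetX path):

```cpp
// Request-path transient-state transitions (sketch; GetSX behaves as GetX).
switch (line->getState()) {
case I:                       // block absent: fetch from the lower level
    line->setState(cmd == Command::GetS ? IS : IM);
    break;
case S:                       // present, but GetX/GetSX needs an upgrade
    line->setState(SM);
    break;
case E:
case M:                       // never miss; E silently becomes M on a write
    break;
}
```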
handleGetS(), Hit Path

We use function handleGetS() to walk through the above sequence as an example. This function is called in the cache controller's tick handler to process a memory event object of type GetS. The inMSHR argument is set to false, if the event object is from the cache controller's event buffer, or true, if the object is from its retry buffer (which itself is just copied from the coherence controller's retry buffer at the end of every tick). At the beginning of handleGetS(), a tag lookup is performed on the tag array by calling lookup() on cacheArray_, which either returns a valid pointer to the line, if there is an address match, or returns NULL. Note that this lookup operation will update the replacement information, as its second argument is true. The line state is stored in local variable state, and a switch statement then decides the next action. If the line is in state S, E, or M, indicating a cache hit, the request is fulfilled in the current cycle, and the response message is sent by calling sendResponseUp(), followed by cleanUpAfterRequest(). Note that the line's timestamp is also updated by calling setTimestamp() with the return value of sendResponseUp(). The function also calls removePendingRetry() to decrement the MSHR register's pendingRetries field, if the event is from the cache controller's retry buffer, as an indication that one fewer request needs to be processed for the address in the current cycle.

Method sendResponseUp() just takes the request event, creates a new response event by calling makeResponse() (which also sets the source and destination of the response event by inverting the two in the request event), and simulates the access latency using the input argument time, which is the last cycle in which the cache block is busy, and the current timestamp, timestamp_. We have already covered access latency simulation in earlier sections, and do not repeat it here. The access latency value being used depends on whether the request is from the MSHR (i.e., the cache controller's retry buffer), in which case it is data member mshrLatency_, or from the cache controller's event buffer, in which case it is data member tagLatency_. The response event is eventually inserted into the send queue by calling forwardByDestination(), which is defined in the base class.

Method cleanUpAfterRequest() is called after a request event has been completed. This function is called under the invariant that the request must be the first in the MSHR's entry list of the requested address. The function first removes the front (i.e., current) entry from the MSHR register of the given address by calling removeFront(), if the request is from the MSHR (indicated by argument inMSHR). The function then moves the next request on the same address that was blocked by the completed request into the retry buffer, from which it will be copied to the cache controller's own retry buffer and processed in the next cycle. To achieve this, the function first inspects the type of the next entry in the MSHR. If the entry is of Event type, and the event object is not already in progress (checked by calling getInProgress()), then the event object is scheduled to be processed in the next cycle by inserting it into retryBuffer_. addPendingRetry() is also called to increment the pendingRetries counter in the MSHR register, indicating that some CPU-generated event will be processed for the address in the next cycle. Note that in the current version of SST (version 11.1.0), there is likely a bug in cleanUpAfterRequest(), where the curly braces are missing for the if statement if (!mshr_->getInProgress(addr)).
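Reconstructed from the description (not the verbatim source; entry->getEvent() is an illustrative accessor), the suspect code and its likely intended form are:

```cpp
// Suspected bug: without braces, only the first statement is guarded.
if (!mshr_->getInProgress(addr))
    retryBuffer_.push_back(entry->getEvent());
    mshr_->addPendingRetry(addr);   // executes unconditionally

// Likely intended form:
if (!mshr_->getInProgress(addr)) {
    retryBuffer_.push_back(entry->getEvent());
    mshr_->addPendingRetry(addr);
}
```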
As the sketch shows, the statement mshr_->addPendingRetry(addr);, which is supposed to be executed only when the if condition holds, is always executed regardless of the condition. If the next MSHR entry is of Evict type, the function will instead generate one eviction request for every "new address" stored in the entry (the list of new addresses is obtained by calling getEvictPointers()). Note that internally generated eviction requests carry a command of NULLCMD, and that they also carry the old address (i.e., the address to be evicted), addr. Also note that Writeback entries in the MSHR register are not retried in cleanUpAfterRequest(), even if they are at the front of the queue. The reason is that the coherence controller always sends the write back request to the lower level in the same cycle as the entry is added to the MSHR, and hence these entries are just waiting for write back responses (i.e., they are always in progress), while blocking all future requests. The write back entries are removed when the write back responses are received, in which scenario cleanUpAfterResponse() is called to retry the next request in the MSHR register.

handleGetS(), Miss Path (Eviction Deferred)

If the state of the tag is I, meaning that the tag address is not found, then a cache miss has occurred, and the request must be completed with a few more transactions. The controller will first attempt to evict a block from the cache, by calling processCacheMiss(). This function may perform the following three operations (some are optional): (1) adding the request into the MSHR, if it is not already there; (2) evicting the tag if necessary, or scheduling an eviction request in the MSHR; (3) sending a write back request to the lower level for the eviction. The function may return MemEventStatus::OK, MemEventStatus::Stall, or MemEventStatus::Reject, which are enum class values defined in file memTypes.h. In the first case, everything succeeds, and the CPU-generated request can be processed. In the middle case, the request cannot be handled immediately, but we know that it has already been added to the MSHR, so the request can be removed from the cache controller's buffer, and it will be retried later. In the last case, the request does not even get into the MSHR, and it must be retained in the cache controller's buffer. We postpone the discussion of the eviction path to a later section, and only focus on miss handling here.

If the return value of processCacheMiss() is OK, the processing proceeds by performing a second lookup on the tag array. The second lookup is necessary, since a replacement may have already been made. Also note that this lookup does not change the replacement information, as its purpose is merely to find the tag entry rather than to simulate an access. The controller creates a new request object and sends it to the lower level cache by calling forwardMessage() (which is defined in the base class), and transits the state of the block to IS, which is equivalent to S except that the data is missing (so any data-related request on an IS block should be stalled until the response arrives). Finally, the timestamp of the block is updated to the send time, and the MSHR entry of the request is marked as in progress, such that it will not be scheduled twice. If the return value of processCacheMiss() is not OK, the request will not be further processed. The return value of handleGetS() also depends on the return value of processCacheMiss(), i.e., on whether the request has been successfully added to the MSHR.
If it has, then handleGetS() returns true, and the request will be removed from the cache controller's buffer. Otherwise, the method returns false, and the request remains in the buffer, to be processed again in the next cycle. Note that it is possible that the request is successfully inserted into the MSHR, but the eviction fails, which causes processCacheMiss() to return Stall, and handleGetS() to return true regardless, removing the request from the cache controller's buffer. At first sight, this seems to block overall progress, since the request will never be re-executed. On further inspection, however, it is revealed that when this happens, there must already be earlier requests in the MSHR. When these requests complete, they will insert the current request back into the retry buffer. This is also the reason why function handleGetS() always calls removePendingRetry(). If the request hits a transient state, it will be inserted into the MSHR by calling allocateMSHR(), if not already there, since CPU-generated requests always operate on stable states. The older request that is responsible for causing the transient state will put this request into the retry buffer when it completes.

handleGetS(), Response Path, Part I

When a cache miss occurs, the request object is inserted into the MSHR, and we know that it must also be the first entry of the MSHR register (because otherwise, the block would be in a transient state, and the request would not be processed). All processing on the address is blocked (excluding invalidations, which are considered as ordered before the access operation) until the response event from the lower level cache arrives, at which point handleGetSResp() is invoked. Method handleGetSResp() first performs a lookup on the tag array, acquires the cache tag, and sets the block state to the stable state, S. We ignore the mundane parts of the function dealing with data and memory flags, as they are pretty straightforward. The method then calls sendResponseUp() to send a response message to the higher level component (for an L1 cache, the CPU). Note that the replay argument is set to true to indicate that the request being responded to is from the MSHR. Finally, the method calls cleanUpAfterResponse() to remove the front entry from the MSHR, and to schedule the following entry on the same address.

Method cleanUpAfterResponse() behaves similarly to its buddy, cleanUpAfterRequest(). It removes both the front entry of the MSHR register on the response event's address, and the request event object in that entry, if the entry is of Event type. Note that the entry can be of other types as well, since the write back and flush response handlers also call this function, in which case there is no associated request object in the front entry. The method then adds the next waiting entry in the MSHR register to the retry buffer, in a way similar to that of cleanUpAfterRequest(). The only difference is that cleanUpAfterResponse() assumes that the next entry in the MSHR register will never be a Writeback entry, since write back entries have higher priority than CPU-generated requests as well as evictions, which are the types of requests that call cleanUpAfterResponse() on completion. Additionally, write backs on the same address will not stack on each other, since the block state transits to I after the write back entry is inserted into the MSHR. This implies that write back entries will never be queued in the MSHR behind these request entries.
In the case of cleanUpAfterRequest(), the reason that a write back request may sit behind the front request is that external invalidation requests have even higher priority than write backs, i.e., when an invalidation is received, and it cannot be handled immediately, its event entry will always be inserted as the front entry of the MSHR register. After this request is handled, cleanUpAfterRequest() will be called to remove its entry from the MSHR, in which case the next entry can genuinely be a write back.

handleGetS(), Response Path, Part II

It is also possible that a GetS request receives a GetXResp as the response message, which invokes handleGetXResp() for event handling. This may happen if one or more lower level caches are non-inclusive, and, on the GetS request, they just decide to give up the ownership of a dirty block. The dirty block will be delivered to the requesting cache via GetXResp, and the handler will see the transient state IS. The relevant logic of handleGetXResp() is no different from that in handleGetSResp(). In the switch statement, transient state IS is handled by transiting to M, if the response data is dirty as indicated by the lower level, or to E or S, depending on whether the E state is enabled for the protocol.

handleGetS(), Eviction Path, Part I

We have left out the eviction path of handleGetS() in the previous sections, and we cover it now. The eviction path starts in function processCacheMiss(). The function first allocates an MSHR entry for the request, if it is not already in the MSHR. If the allocation fails, due to the MSHR being full, the function returns Reject without any further action, and the request will be reattempted in the next cycle. If the request is already in the MSHR, the function also checks whether the request is, in fact, the front event of the MSHR register. The eviction path will only be executed if the request is at the front. This check is necessary, since it is possible that the event was the front entry of the MSHR register, and hence was added to the retry buffer in the previous cycle, but an invalidation was received in the current cycle, which has a higher priority, and was inserted as the front request. In this case, the CPU-generated request should give up the cycle, and let the invalidation be handled first, by returning Stall to the caller.

If all checks pass, or the MSHR entry is allocated successfully, the value of the local variable status will be OK. The function then checks whether an eviction is needed by checking whether the argument line is NULL. If true, this implies that the tag lookup did not find the address, and hence an existing block should be evicted. Otherwise, the address is already in the cache, and the miss is caused by an upgrade. In the former case, allocateLine() is called to perform the eviction and write back, while in the latter case, no eviction is needed, as the block is already present. Method allocateLine() first calls handleEviction() to attempt an eviction in the current cycle. If the eviction is feasible, handleEviction() returns true, which means that the block has already been evicted in the current cycle. In this case, the tag is updated with the requested address by calling replace(), and the line object with the new address is returned. If the eviction is infeasible, due to ongoing actions on the old address, handleEviction() returns false.
Note that the eviction entry is inserted on the old address, with the entry object carrying both the old and the new address, while the event entry representing the request is inserted on the new address. Also note that, when the eviction fails, allocateLine() returns NULL, which causes its caller, processCacheMiss(), to return Stall. This, in turn, causes the request to be removed from the cache controller's buffer. The request will eventually be retried when the eviction on the old address succeeds.

Method handleEviction() attempts to evict the block chosen by the replacement manager, if feasible. Note that the second argument is both an input and an output argument. The function first finds the replacement block, if one is needed (line being NULL), by calling findReplacementCandidate() on the tag array. It then checks whether the line is locked; if it is, an atomic instruction is currently half-way through, and the eviction must not proceed until the block is unlocked, so false is returned. Otherwise, the function checks whether the address has pending retries, by calling getPendingRetries() on the block address at the beginning of every case block. If so, these retries are logically ordered before the eviction, and the eviction cannot proceed either, which causes the control flow to fall through to the default case, and false is returned. If the eviction is performed successfully, the state of the line transitions to I. There is no transient state for evictions, meaning that evictions happen atomically.

Note that there is also likely a coding bug here, in that the function fails to return false directly when the getPendingRetries() check fails. The bug, although not affecting correctness, causes the getPendingRetries() check to be evaluated multiple times, until the control flow falls through to default. A better scheme would be to simply add an else to the if statement that checks pending retries, and return false there.

If there is no pending retry on the old address, the block is not locked, and the block is in a stable state, the eviction can proceed in the current cycle, by sending a write back request to the lower level with the PutS, PutE, or PutM command, for states S, E, and M, respectively. Transient states indicate ongoing transactions on the address, in which case the eviction should also be delayed (the control flow hits the default case). The actual request object is sent with sendWriteback(), which is a simple function that creates a new event object, computes the delivery time, initializes the related fields, sends the request downwards by calling forwardByAddress(), and finally updates the block's timestamp. After sending the write back event object, if the cache expects acknowledgements for write backs (which is configured during init()), a write back entry is also inserted into the MSHR for the old address by calling insertWriteback(). Note that the write back entry is always inserted as the front entry of the old address's register. The state of the block is reset to I to reflect the fact that the block has been evicted.
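The following condensed sketch captures the state-to-command mapping just described; the types and the sendWritebackStub()/insertWritebackStub() helpers are hypothetical stand-ins for the real sendWriteback() and insertWriteback():

```cpp
#include <cstdint>

// Hypothetical, simplified coherence states and commands.
enum class State { I, S, E, M, IS, IM, SM };
enum class Cmd   { PutS, PutE, PutM };

// Stand-ins for sendWriteback() and insertWriteback().
void sendWritebackStub(Cmd cmd, uint64_t addr, bool dirty) {}
void insertWritebackStub(uint64_t addr) {}

// Condensed sketch of the stable-state arm of handleEviction():
// each stable state maps to one Put* command; pending retries and
// transient states defer the eviction (the default case).
bool evictBlock(State& st, uint64_t addr, int pendingRetries,
                bool expectWritebackAck) {
    if (pendingRetries > 0)
        return false;             // retries are ordered before us
    switch (st) {
    case State::S: sendWritebackStub(Cmd::PutS, addr, false); break;
    case State::E: sendWritebackStub(Cmd::PutE, addr, false); break;
    case State::M: sendWritebackStub(Cmd::PutM, addr, true);  break;
    default:       return false;  // transient state: defer eviction
    }
    if (expectWritebackAck)          // configured during init()
        insertWritebackStub(addr);   // becomes the front MSHR entry
    st = State::I;                   // evictions complete atomically
    return true;
}
```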
handleGetS(), Write Back Response Path

If write back ACKs are expected, the response event of type AckPut is processed by the cache controller by calling handleAckPut(). This function is extremely simple: it does nothing except call cleanUpAfterResponse(), which, as discussed earlier, removes the front entry of the MSHR register (the write back entry) and inserts the event in the following entry into the retry buffer.

handleGetS(), Eviction Path, Part II

When handleEviction() fails, due to ongoing operations on the old address, an eviction entry is inserted into the MSHR register of the old address. The eviction entry is translated into eviction request objects of type class MemEvent, with command NULLCMD, in methods cleanUpAfterRequest() and cleanUpAfterResponse(), before being inserted into the retry buffer. On seeing a NULLCMD type event object, the cache controller calls handleNULLCMD() to perform the eviction. The function first obtains the line object with the old address by performing a tag lookup. It then reuses handleEviction() to perform the eviction in the current cycle. If the eviction succeeds, it calls deallocate() on the line object to clear the line's state and mark it as unused. Since the eviction has been performed, the originating CPU-generated request on the new address, which is obtained by calling getBaseAddr() on the event object, is then added to the retry buffer, and the new address's pending retries counter is incremented by one. Recall that if the eviction MSHR entry has multiple new addresses, then multiple eviction request objects will be generated. Each successful handleEviction() among these requests removes its new address value from the eviction MSHR entry (which remains the front entry while these evictions are processed) by calling removeEvictPointer(). The last eviction request that successfully completes calls retry() on the old address, to schedule the next entry in the old address's MSHR register, if any.

If handleEviction() fails while the eviction entry is the front entry of the MSHR register, then either the block is still locked, or the address to be evicted is no longer in the cache. The latter is possible if multiple eviction requests on the same old address are being processed, in which case the earlier lookup() will return NULL, causing handleEviction() to pick a new replacement block with a different address. The function only handles the latter case, though, by comparing the line's address with the old address. If they mismatch, the address of the current line is used as the old address to insert another eviction entry into the MSHR. The removeEvictPointer() and retry() calls at the end of this branch serve the same purposes as those in the other branch.

As for the former case, in which the block to be evicted is still locked, the handler just lets the eviction request objects be destroyed quietly at the end of the function without causing any effect. The eviction entry, though, still remains the front entry of the MSHR, with all new addresses in the entry object. When the second request of the locked instruction is handled by handleGetX(), since the address is guaranteed to be locked in the cache, the instruction will never retry, never miss the cache, and hence always succeeds on the first attempt, never requiring an MSHR entry. After the GetX completes (it will hit an M state line), cleanUpAfterRequest() is called, which, in this case, will see the eviction entry, and the eviction requests will be retried.
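To illustrate the multi-pointer bookkeeping, here is a small sketch of the success path described above; the entry layout and helper names are hypothetical simplifications of the real MSHR interface:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical layout of an eviction MSHR entry: one old (victim)
// address, and one new address per outstanding eviction event.
struct EvictEntry {
    uint64_t oldAddr;
    std::vector<uint64_t> newAddrs;
};

// Stand-ins for the retry-buffer insertion and retry() scheduling.
void addToRetryBuffer(uint64_t addr) { /* wake request on addr */ }
void retryOldAddress(uint64_t addr)  { /* schedule next MSHR entry */ }

// Sketch of the handleNULLCMD() success path: wake the originating
// request, drop this eviction pointer, and let the last completed
// eviction reschedule the old address.
void onEvictionDone(EvictEntry& entry, uint64_t newAddr) {
    addToRetryBuffer(newAddr);   // originating request can now run
    // Stands in for removeEvictPointer(): drop this new address.
    std::vector<uint64_t>& v = entry.newAddrs;
    v.erase(std::remove(v.begin(), v.end(), newAddr), v.end());
    // The last eviction to complete calls retry() on the old address
    // so the next entry in its MSHR register (if any) makes progress.
    if (v.empty())
        retryOldAddress(entry.oldAddr);
}
```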
handleGetX()

The logic of handleGetX() is very similar to that of handleGetS(), with a few exceptions. handleGetX() will not forward the request to the lower level if the current level is known to be the last level cache, as the last level cache is assumed to be shared by all caches, and hence exclusive permission can be granted locally. The last level cache is tracked by the flag lastLevel_, which is set during init() if the down link connects to a memory end point.

The access incurs cache misses on both I and S state blocks, causing the block to transition to the transient states IM and SM, respectively. E and M states are cache hits, and an E state block transitions to M. In the case of a cache miss, the request sent down the hierarchy is of type GetX. Correspondingly, the response message received is GetXResp, which is handled by method handleGetXResp(). On receiving the response message, IM and SM state blocks transition back to the stable state M. In addition, if the GetX request is the second half of an atomic read-modify-write instruction, receiving the response causes the lock counter to be decremented (by calling decLock()), indicating the completion of the locked instruction.

Note: I am not quite sure whether the last part about locked instructions is necessary, because locked cache blocks are acquired in M state from the beginning, and will never be evicted or downgraded until the second GetX releases the lock. This way, it is impossible for a locked GetX to cause a cache miss, and hence handleGetXResp() should never see a locked GetX as the request type.

GetX requests with the F_LOCKED flag set are regarded as the second half of read-modify-write atomic instructions. When such a request is processed (it should always hit on the first attempt), the flag is checked, and if it is set, the lock counter of the block is decremented by calling decLock().

handleGetSX()