The massive growth in the production and consumption of data, particularly unstructured data such as images, digitized speech, and video, has driven an enormous increase in the use of accelerators. The growing trend toward heterogeneous computing in the data center means that, increasingly, different processors and co-processors must work together efficiently while sharing memory and using caches for data sharing. Sharing memory through caches brings a formidable technical challenge known as coherency, which is addressed by Compute Express Link (CXL).
WHAT IS CXL?
CXL is a technology that enables high-bandwidth, low-latency connectivity between the host processor and devices such as accelerators, memory buffers, and smart I/O devices. CXL is based on the PCI Express® (PCIe®) 5.0 physical layer infrastructure; that is, it uses PCIe electricals and standard PCIe form factors for the add-in card. Leveraging the PCIe 5.0 infrastructure makes it easy for devices and platforms to adopt CXL without redesigning and revalidating the PHY, the channel, any channel-extension devices such as retimers, or the upper layers of PCIe, including the software stack. CXL is designed to address growing high-performance computational workloads by supporting heterogeneous processing and memory systems, enabling coherency and memory semantics for applications in Artificial Intelligence, Machine Learning, communication systems, and High-Performance Computing.
CXL supports dynamic multiplexing between a rich set of protocols that includes I/O (CXL.io, based on PCIe), caching (CXL.cache), and memory (CXL.memory) semantics. CXL.io protocol is used for functions such as device discovery, configuration, initialization, I/O virtualization, and direct memory access (DMA) using non-coherent load-store, producer-consumer semantics. CXL.cache enables a device to cache data from the host memory, employing a simple request and response protocol. The host processor manages the coherency of data cached at the device utilizing snoop messages. CXL.memory allows a host processor to access the memory attached to a CXL device. CXL.memory transactions are simple memory load and store transactions that run downstream from the host processor. CXL maintains a unified, coherent memory space between the CPU (host processor) and any memory on the attached CXL device, allowing both the CPU and device to share resources for higher performance and reduced software stack complexity.
Figure 1 - CXL Protocol Stack
WHY CXL?: CXL VS CCIX
There are several host-to-device and device-to-device high-speed cache-coherent interconnect standards, such as Gen-Z, OpenCAPI (Open Coherent Accelerator Processor Interface), and CCIX (Cache Coherent Interconnect for Accelerators). Different companies developed these interfaces to target heterogeneous computing and coherency challenges; clearly, different groups have been working to solve similar problems.
Cache Coherent Interconnect for Accelerators (CCIX), is an industry-standard specification to enable coherent interconnect technologies between general-purpose processors and acceleration devices for efficient heterogeneous computing. CCIX was created in 2016 by a consortium that included AMD, Arm, Huawei, IBM, Mellanox, Qualcomm, and Xilinx.
Compute Express Link (CXL) is an open standard interconnection for high-speed central processing unit (CPU)-to-device and CPU-to-memory, designed to accelerate next-generation data center performance. The CXL specification's founding promoter members included: Alibaba Group, Cisco Systems, Dell EMC, Facebook, Google, Hewlett Packard Enterprise (HPE), Huawei, Intel, and Microsoft.
Both CXL and CCIX target the same problem. The major difference between them is that CXL is a master-slave architecture in which the CPU is in charge and the other devices are subservient, while CCIX allows peer-to-peer connections with no CPU involvement.
Some shakeout or convergence is needed to move things forward. The Compute Express Link and Gen-Z Consortiums have already announced the execution of a memorandum of understanding (MoU) describing a mutual collaboration plan between the two organizations.
WHY IS CACHE COHERENCY REQUIRED?
For higher performance in a multiprocessor system, each processor usually has its own cache. Cache coherence refers to keeping the data in these caches consistent.
Since each core has its own cache, the copy of the data in that cache may not always be the most up-to-date version. For example, imagine a dual-core processor where each core brings a block of memory into its private cache, and then one core writes a value to a specific location. When the second core attempts to read that value from its cache, it will not see the most recent version unless its cache entry is invalidated. A coherence policy is therefore needed to update or invalidate the entry in the second core's cache; otherwise, stale data leads to invalid results.
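The stale-read hazard just described can be sketched in a few lines of Python. The dictionaries standing in for each core's private cache are of course a gross simplification, used only to make the hazard concrete:

```python
# Toy model: two cores with private caches and no coherence protocol.
memory = {0x100: 1}

# Both cores pull address 0x100 into their private caches.
cache_a = {0x100: memory[0x100]}
cache_b = {0x100: memory[0x100]}

# Core A writes a new value; only its own cache (and memory) see it.
cache_a[0x100] = 42
memory[0x100] = 42

# Core B still reads the stale copy from its private cache.
stale = cache_b[0x100]  # still 1, not 42

# A coherence protocol would have invalidated B's entry, forcing a re-fetch.
del cache_b[0x100]                                 # the "invalidate" a snoop performs
fresh = cache_b.setdefault(0x100, memory[0x100])   # re-fetch from memory
```

Without the invalidation step, core B computes with the value 1 long after core A wrote 42, which is exactly the incoherence the protocols below exist to prevent.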
There are various cache coherence protocols for multiprocessor systems. One of the most common is MESI, an invalidation-based protocol named after the four states a cache block can be in:
- Modified: The cache block is dirty with respect to the shared levels of the memory hierarchy. The core that owns the cache with the Modified data can make further changes at will.
- Exclusive: The cache block is clean with respect to the shared levels of the memory hierarchy. If the owning core wants to write to the block, it can change the state to Modified without consulting any other cores.
- Shared: The cache block is clean with respect to the shared levels of the memory hierarchy and is read-only. If a core wants to read a block in the Shared state, it may do so; however, if it wishes to write, the block must first be transitioned to the Exclusive state, invalidating all other copies.
- Invalid: The cache block does not contain valid data.
State transitions are controlled by memory accesses and bus-snooping activity. When several caches share specific data and a processor modifies the shared value, the change must be propagated to all the other caches that hold a copy of the data. This notification of data change is done by bus snooping: if a transaction modifying a shared cache block appears on the bus, all snoopers check whether their caches hold a copy of that block. If they do, the cache block must be invalidated or flushed to ensure cache coherency.
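As an illustration only, the invalidation behavior described above can be modeled as a minimal MESI state machine in Python. The event names here (read, write, BusRd, BusRdX) are the informal textbook ones, not CXL opcodes, and the table covers only the transitions discussed in the text:

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

# (current state, event) -> next state, for one cache block.
# Local events: "read"/"write" by the owning core.
# Snooped events: "BusRd" (remote read), "BusRdX" (remote write intent).
MESI = {
    (State.INVALID, "read"):    State.SHARED,    # fill; Shared is the safe assumption
    (State.INVALID, "write"):   State.MODIFIED,  # read-for-ownership, then modify
    (State.SHARED, "write"):    State.MODIFIED,  # upgrade: other copies invalidated
    (State.EXCLUSIVE, "write"): State.MODIFIED,  # silent upgrade, no bus traffic
    (State.MODIFIED, "BusRd"):  State.SHARED,    # supply data, downgrade to Shared
    (State.EXCLUSIVE, "BusRd"): State.SHARED,
    (State.MODIFIED, "BusRdX"): State.INVALID,   # remote writer: flush and invalidate
    (State.EXCLUSIVE, "BusRdX"): State.INVALID,
    (State.SHARED, "BusRdX"):   State.INVALID,
}

def step(state, event):
    """Return the next MESI state; unlisted pairs leave the state unchanged."""
    return MESI.get((state, event), state)
```

For example, a remote write intent (BusRdX) snooped against a Modified line drives it to Invalid, which is the flush-and-invalidate behavior described above.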
Figure 2 is the state-transition diagram for this protocol and shows how the cache states transition on receiving commands from the local and remote processor.
Figure 2 - MESI Transitions
VERIFICATION GOALS TO ADDRESS CACHE COHERENCY CHALLENGES
Coherency management is required, and high-risk, because multiple copies of the same data exist in different caches throughout the system. Since data in each cache can be modified locally, the risk of using invalid data is high. It is therefore essential to provide a mechanism that manages when and how changes can be made. Cache coherent systems are high-risk design elements: they are challenging to design and even more challenging to verify. In the end, you need a way to confidently sign off that your system is cache coherent; this is a key verification challenge.
Another challenge in verifying a CXL cache-based design is that the CXL specification defines a vast range of request types, response types, and possible cache-state combinations. Every combination and permutation must be verified thoroughly. Although the specification defines the logical behavior of activity on the bus, the sequencing and timing of shared cache lines must also be verified accurately.
Verifying a multi-core cache-coherent system requires the capabilities described below.
- Verification Plan with Stimulus Generation - A thorough verification plan is required for a complex environment, one that designs can rely on for their verification requirements. It off-loads the user from needing to know the protocol details in order to create legal (or illegal) transactions.
That plan must then be turned into stimulus generation that achieves its intentions. A VIP whose purpose is to mimic core/device behavior must create stimulus that accounts for the protocol rules, cache line states, and any design-specific constraints when generating transactions.
- Cache Checking - Another verification goal is to catch any illegal activity on the bus and ensure that each device complies with the specification. Checking must also be done at the cache level to ensure that communication with the cache is compliant with the CXL specification.
- Debug Mechanism - Once suspicious activity is caught, one needs an efficient, low-effort way to get directly to the root cause of the issue. Less debug time ultimately means a shorter turnaround time for any system.
- Coverage Completeness - Coverage helps ensure the completeness of the verification plan and verification space, relieving the verification team of the burden of manually enumerating the thousands of necessary scenarios. It dramatically reduces and focuses the test-writing effort down to only filling the coverage holes.
HOW QUESTA VERIFICATION IP HELPS ADDRESS THE ABOVE VERIFICATION CHALLENGES
Intelligent modeling allows the QVIP to mimic either host or device behavior when the DUT is at the other end of the CXL interconnect. QVIP can also be hooked up as a passive component to monitor the bus, providing verification capabilities such as checking, coverage, and logging.
As can be seen from Figure 3, CXL QVIP can act as a host or a device, or can be hooked up as a passive device (on the bus or attached to the CXL component) for analysis purposes.
Figure 3 - CXL QVIP Environment
1) Verification Plan with Stimulus Generation
The complexity of verifying CXL-based designs requires using QVIP to model the variety of CXL hosts and devices in the system, offloading the user from needing to know the protocol details to create legal (or illegal) transactions.
QVIP provides a comprehensive verification plan covering all the complex and simple scenarios required to verify a cache coherent system. QVIP's role is to mimic any CXL-compliant component, which helps create stimulus that takes into account the protocol rules, cache line states, and any design-specific constraints when generating transactions.
Figure 4 - Detailed Verification Plan with QVIP
Questa VIP comes with built-in sequences that users can employ to create their own scenarios or use directly to achieve verification plan completeness. These sequences require the user to execute transactions such as D2H/H2D requests and H2D/D2H responses as their scenario requires.
Figure 5 - D2H Sequence Flow example
Usually, users want to define or generate a scenario at a higher level of abstraction: instead of thinking about the various D2H requests (Read, Write, Eviction), they only need to perform Load and Store operations on a cache line.
QVIP therefore provides this abstraction through high-level sequences that ultimately break down into lower-level sequences of D2H/H2D requests, depending on the cache line state. It does so while taking the cache line's bias state into account.
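A rough Python sketch of how such an abstraction can decompose high-level operations; the request names (RdShared, RdOwn) come from CXL's D2H request vocabulary, but this API shape is purely illustrative and is not the actual QVIP interface:

```python
def load(line_state, addr):
    """Break a high-level load into the D2H traffic it implies (simplified)."""
    ops = []
    if line_state == "Invalid":
        ops.append(("RdShared", addr))   # fetch a shared copy from the host first
    ops.append(("read", addr))           # then read the local cache line
    return ops

def store(line_state, addr):
    """Break a high-level store into the D2H traffic it implies (simplified)."""
    ops = []
    if line_state in ("Invalid", "Shared"):
        ops.append(("RdOwn", addr))      # gain exclusive ownership first
    ops.append(("write", addr))          # modify the line (now Modified)
    return ops
```

The point of the abstraction is visible in the conditionals: a store to a Modified or Exclusive line needs no bus request at all, while a store to an Invalid or Shared line must first issue RdOwn, and the user never has to spell out that difference.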
Figure 6 - Load/Store APIs Usage
Suppose the user wants to perform a cacheable write using the store API. QVIP first checks the cache line state; say the line is Invalid for the provided address. The device QVIP then automatically executes the lower-level transactions: gaining exclusive ownership of the line using an RdOwn D2H request, modifying the cache line, and evicting the cache line to memory. On the other hand, if QVIP is configured as the host, then upon receiving the RdOwn D2H request it automatically invalidates the cache line in all caches using H2D snoop commands.
Figure 7 - Store Operation if QVIP Is a Device
As shown in Figure 7, with the device QVIP a user only needs to execute a store operation, without worrying about the lower-level transactions. The API itself, based on the cache line state, executes the lower-level transactions and updates the cache internally. Similarly, if QVIP is the host, it automatically executes snoop requests based on the D2H requests and the cache line state.
AUTOMATIC RESPONDER AND DCOH ENGINE
QVIP also provides an automatic responder that responds to D2H/H2D requests automatically, issuing the appropriate H2D/D2H response based on the cache line state and the request received. The user therefore does not need to manage the environment's responses.
The DCOH engine automatically completes the device's request if the cache line address lies in device-attached memory and the line is in device bias. Otherwise, it forwards the appropriate D2H request to the host.
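The DCOH routing decision just described can be sketched as follows; the function shape and the "device"/"host" bias strings are a guess from the text for illustration, not QVIP internals:

```python
def dcoh_route(addr, device_memory_range, bias):
    """Decide whether a device request completes locally or goes to the host.

    device_memory_range: (lo, hi) span of device-attached memory (hypothetical).
    bias: per-line bias state, either "device" or "host" (hypothetical model).
    """
    lo, hi = device_memory_range
    if lo <= addr < hi and bias == "device":
        # Line lives in device-attached memory and is device-biased:
        # the DCOH completes the request without involving the host.
        return "complete-locally"
    # Host-biased line, or address outside device memory: the appropriate
    # D2H request must be forwarded to the host.
    return "forward-to-host"
```

Note that both conditions must hold for local completion: a host-biased line is forwarded even when its address sits in device-attached memory, because the host may hold a cached copy.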
2) Cache Checking using QVIP
In a multi-core, multi-device environment, it is essential to verify each host and each device individually to ensure that they comply with the specification. QVIP host and device agents with their checker components enabled help achieve this.
Assertions: Each agent has its own mirror cache model, which mimics the cache at the other end and updates its local state by monitoring the bus. The cache model changes its cache line states per the MESI protocol, based on the transactions observed.
For instance, the QVIP checker fires an assertion if any illegal activity or transaction occurs on the bus.
Figure 8 - Assertion Message
As shown in Figure 8, an assertion message provides full information about the violation, with a proper message format, error tagging, and the information required to debug the violation.
Cache predictor: QVIP provides a separate cache component that helps maintain full-system coherency by performing data-integrity checks and verifying cache correctness, predicting which transaction should be executed on the bus given the current state of the line. This predictor mirrors the cache model and holds all the data about cache line states and their related metadata.
Before executing any cache transaction on the bus, the user can query this predictor about the type of command that can be executed on any cache line. The predictor returns the list of commands that may be executed on a given cache line address based on its state.
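The predictor query described above amounts to a lookup from line state to the legal D2H requests. The request names below (RdShared, RdOwn, RdAny, CleanEvict, DirtyEvict) are real CXL.cache D2H opcodes, but this mapping is a deliberate simplification for illustration, not the specification's full legality table:

```python
# Hypothetical, simplified mapping from MESI line state to the D2H
# requests a device may sensibly issue for that line.
LEGAL_D2H = {
    "Invalid":   ["RdShared", "RdOwn", "RdAny"],  # must fetch before use
    "Shared":    ["RdOwn", "CleanEvict"],         # upgrade, or drop the clean copy
    "Exclusive": ["CleanEvict"],                  # clean line may simply be dropped
    "Modified":  ["DirtyEvict"],                  # dirty data must be written back
}

def legal_commands(line_state):
    """Return the D2H requests the predictor would allow for this line state."""
    return LEGAL_D2H.get(line_state, [])
```

A device sequence can consult such a table before driving the bus, which is exactly how the predictor lets the stimulus know, ahead of time, which request will be accepted.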
Figure 9 - QVIP Cache Predictor
As shown in Figure 9, the cache predictor can communicate with devices and suggest which D2H request should be executed, based on the cache line state and cache line biasing. This offloads the DCOH engine and reduces latency, as the device already knows which request to execute before the DCOH processes it.
The cache predictor takes input from the bus and the host, and flags an error if the correct set of D2H and H2D requests has not been executed on the bus.
3) Debug Mechanism
Whenever a cache system ends up in an unpredictable condition or some illegal activity occurs, the time taken to debug the behavior is crucial in the verification cycle.
QVIP comes with the following debug facilities, which help the user determine whether the behavior is correct and ultimately shorten debug turnaround, reducing overall verification cycle time.
Loggers: QVIP provides a cache-enabled logger that records all bus activity on the CXL interconnect. This cache logger can be used to debug the traffic at a particular time from both directions, i.e., from both host and device, letting the user quickly jump to the desired timestamp and observe the traffic there.
Figure 10 - Logger Snapshot
As shown in Figure 10, logger instances from both device and host provide the necessary information required to debug and verify the behavior at any timestamp.
QVIP also provides debug messages that, when enabled, print traffic and transaction information to the transcript or shell as a first step of debugging, without the need to open the loggers.
Figure 11 - Debug Message
As seen in Figure 11, all debug information about a cache transaction can be obtained directly on the shell by using debug messages.
As previously mentioned, the vast verification space associated with CXL cache designs presents a key verification challenge. Defining all the complex scenarios requires significant investment. Yet, this is insufficient. The CXL verification solution must also enable you to measure and ensure the verification space's completeness.
CXL QVIP defines all the coverage points required to attain verification productivity, relieving the verification team of the burden of creating the thousands of necessary scenarios. It dramatically reduces and focuses the test-writing effort down to only filling the coverage holes. What's needed is an executable verification plan that hierarchically correlates to the CXL specification sections; it must also provide an easy way to differentiate between high- and low-importance coverage items.
Figure 12 - Coverage Map Example
For verifying cache-coherent systems, a verification plan is essential but not sufficient. Creating and managing thousands of individual test cases is not feasible within realistic schedule constraints, so there is a clear need for a wide range of pre-defined stimuli to ensure you can achieve high coverage against your compliance goals.
The major challenges of CXL cache verification are:
- Verification plan with stimulus: a variety of stimulus is required to achieve verification completeness goals
- Checks: complete protocol checking, including cache coherency checking, is required at every stage of verification
- Debug mechanism: an efficient debug mechanism is required to reduce verification cycle time
- Coverage: a thorough coverage map is required for verification completeness
Questa VIP addresses all of these verification challenges: it provides a verification plan backed by a vast pre-defined stimulus library, along with loggers, cache predictors, and debug messages. Incorporating all of the QVIP components described above into the environment helps achieve 100% verification productivity and design quality.
Contact your local Siemens EDA representative to find out more about our Questa Verification IP solutions for CXL 1.1, CXL 2.0, PCIe5, our upcoming CXL 3.0 and PCIe6 support, and other protocols in the extensive QVIP portfolio.
ABOUT THE AUTHORS
Nikhil Jain is a lead member of the Consulting Staff on the Questa Verification IP team at Siemens EDA, specializing in the development of CXL, as well as Memory and Ethernet verification IP. He received his B. Tech. degree in Electronics and Communication from GGSIPU University, Delhi in 2007.
Gaurav Manocha is a member of the Consulting Staff on the Questa Verification IP team at Siemens EDA, specializing in the development of PCIe, NVMe, and CXL verification IP. He received his B. Tech. degree in Electronics and Communication from The NorthCap University (NCU), formerly ITM University, Gurgaon in 2013.