FPGA PROTOTYPE RUNNING—NOW WHAT?
Well done, team: we've managed to get hundreds of millions of gates of FPGA-hostile RTL running at 10 MHz, split across a dozen FPGAs. Now what? The first SoC silicon arrives in a few months, so let's get going with integrating our software with the hardware and testing the heck out of it. For that, we'll need to really understand what's going on inside all those FPGAs.
Ah, there's the rub.
Our conversations with many prototypers, confirmed by numerous user surveys, tell us that debug has emerged as just about the biggest challenge for prototypers today. In fact, debug is really a series of challenges.
DEBUGGING A PROTOTYPE IS A MULTI-LAYERED PROBLEM
Assuming you trust your FPGA hardware, the first challenge is to make sure you haven't introduced new bugs during the process of converting and partitioning the SoC into the FPGA hardware. Any unintended functional inconsistency between the SoC design and its FPGA prototype means that we don't even get to first base.
The recommended approach is to bring the design up piecemeal, and debug each new piece in its turn. Some prototypes can be driven via a signal-level interface from the RTL simulation testbench to the top-level ports on the design. Then a relevant simulation testbench can be applied as each function is added to the prototype in turn. When the whole design passes this initial check, we can be confident that the FPGA prototype is a valid cycle-accurate representation of our RTL.
We can now get on with using the prototype in earnest, and for this we will need to gain visibility inside the FPGAs using embedded instrumentation.
EMBEDDED INSTRUMENTATION FOR ALL REASONS
We can identify five tasks during prototype bring-up and usage for which internal visibility is essential:
- The configuration and integrity of the bare FPGA hardware
- The piecemeal test of blocks of RTL during bring-up
- The long-duration testing of the RTL
- The in-circuit validation of software, including hardware-software co-debug
- Final system integration
Ideally we want a common debug approach for all these steps. The Visualizer™ Debug Environment from Mentor Graphics®, in combination with Certus™ Silicon Debug, provides exactly that. Visualizer is a key component of the Mentor Graphics Enterprise Verification Platform™ (EVP), so it may already be familiar to the prototype team, who may have used it in Questa® for simulation and Veloce® for emulation debug earlier in the SoC project. You can read more about Visualizer and EVP in back issues of Verification Horizons, but for now, let's focus on how Certus provides the trace data that Visualizer needs in order to trace bugs to their root cause.
FROM TRIGGER TO ROOT CAUSE CAN BE A LONG JOURNEY
The hardest-to-find bugs may become apparent in the prototype only many thousands or even millions of clock cycles after their root causes have occurred. These are exactly the kinds of bugs that tend to emerge only at the prototype stage. The more trace data our embedded instrumentation can capture, the sooner we can find the root cause.
The root cause may also be in a location far removed from the observed bug, often in a different clock domain, and maybe even a different FPGA on the prototype. A number of different trigger conditions might also be required in order to expose the bug's erroneous behavior. These challenges give us a shopping list of features we need in our prototype's debugger...
- Wide trace capture across multiple FPGAs
- Extremely long trace depth
- Multiple concurrent and flexible triggers
The latest version of the Certus debugger delivers these features, providing real benefit to prototypers and other FPGA users.
The first benefit of Certus is its ability to choose among tens of thousands of internal FPGA signals from a single instrument, called a capture station. This is made possible by the unique observation network shown along with a capture station in Figure 1. An observation network is an optimally efficient non-blocking switching network that funnels the chosen RTL signals to the capture station. The capture station itself contains logic for tracing, compressing and storing activity on the selected RTL signals. The trace width per station can be any power of two from 16 to 1024 signals, but a typical choice is 256 signals, which leaves room for multiple medium-sized capture stations in an FPGA at typical utilization.
Figure 1: The Certus observation network allows runtime tracing from a wide selection of RTL signals
Each station is synchronous with one of the FPGA's clock domains and has the ability to trace signals from any of those in its observation network inputs, selectable at runtime. This freedom to switch between signals at runtime is a huge advantage for prototypers in the lab.
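To make the idea of runtime selection concrete, here is a minimal Python model of a capture station fed by an observation network. It is only a sketch of the concept described above: the class `ObservationNetwork`, its methods and the signal names are hypothetical and are not part of any Certus interface.

```python
class ObservationNetwork:
    """Toy model: many signals are connected at instrumentation time,
    and a subset is chosen for tracing at runtime."""

    # Powers of two from 16 to 1024, as described in the text.
    VALID_WIDTHS = (16, 32, 64, 128, 256, 512, 1024)

    def __init__(self, connected_signals, trace_width=256):
        if trace_width not in self.VALID_WIDTHS:
            raise ValueError(f"trace width must be one of {self.VALID_WIDTHS}")
        self.connected = set(connected_signals)
        self.trace_width = trace_width
        self.selected = []

    def select(self, names):
        """Choose which connected signals to trace (no re-instrumentation modelled)."""
        names = list(names)
        unknown = [n for n in names if n not in self.connected]
        if unknown:
            raise KeyError(f"not connected at instrumentation time: {unknown}")
        if len(names) > self.trace_width:
            raise ValueError("selection exceeds the station's trace width")
        self.selected = names

    def sample(self, values):
        """Return the traced subset of one cycle's values (dict of name -> value)."""
        return tuple(values[n] for n in self.selected)


# Hypothetical usage: 1000 connected signals, a 16-wide station.
net = ObservationNetwork([f"sig{i}" for i in range(1000)], trace_width=16)
cycle_values = {f"sig{i}": i % 2 for i in range(1000)}
net.select(["sig3", "sig999"])
first = net.sample(cycle_values)
net.select(["sig4"])          # switch signals later, in the lab
second = net.sample(cycle_values)
```

The point of the model is that `select` can be called again at any time with any connected signal, whereas a signal that was never connected raises an error and would need a fresh instrumentation pass.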
Typically, prototypers connect the observation network to as many signals as may be even remotely interesting to trace, because all of these will be selectable later in the lab. Traditional debug tools force the user to guess in advance, way back at the start of the tool flow, which signals will need to be traced. Murphy's Law says that once in the lab, the one critical signal that the user needs to trace will not have been selected in advance, so they must return to the beginning and re-instrument. This tiresome loop is much, much less likely to happen with Certus because we can select from tens of thousands of available signals at runtime.
As we can see in Figure 2, multiple capture stations can be instantiated in any given FPGA, or in any given clock domain, in order to extend the capture width even further. Users of traditional embedded debug tools may have been reluctant to add multiple instruments to the same clock domain, since the captured data could not be correlated between the stations.
Figure 2: Multiple capture stations provide very wide cross-domain tracing
However, Certus uses its on-FPGA router block to provide inter-station communication, in order to calibrate and synchronize the trace data from each station, and to transfer everything to an external host for assembly into a common, time-aligned database for analysis. We can see in Figure 2 that transfer is usually done via high-speed JTAG, but in the special case of proFPGA boards, we can achieve even greater speed by using ProDesign's MMI-64 communications bus.
As FPGA prototypes almost always have to implement multi-clock SoC designs, it is important that Certus can in this way overcome the limitations of traditional single-clock, single-FPGA debuggers.
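As an illustration of what assembling a common, time-aligned database involves, the sketch below merges per-station trace records into a single ordered stream. It assumes that each station reports samples against its own local time and that a per-station offset to a common timebase is already known from calibration. The function name, the record layout and the offsets are hypothetical and do not describe how Certus stores its data.

```python
import heapq


def merge_station_traces(traces, offsets):
    """Merge per-station traces into one list ordered by common time.

    traces:  {station: [(local_time_ns, value), ...]}, each sorted by time
    offsets: {station: ns to add to local time to reach the common timebase}
             (assumed to be known from a calibration step)
    """
    def aligned(name):
        for local_time, value in traces[name]:
            yield (local_time + offsets[name], name, value)

    # heapq.merge lazily interleaves the already-sorted per-station streams.
    return list(heapq.merge(*(aligned(name) for name in traces)))


# Hypothetical data: station B started 3 ns after station A.
traces = {"A": [(0, 1), (10, 2)], "B": [(0, 7), (4, 8)]}
offsets = {"A": 0, "B": 3}
merged = merge_station_traces(traces, offsets)
# merged == [(0, "A", 1), (3, "B", 7), (7, "B", 8), (10, "A", 2)]
```

Without the offsets, the two samples that both carry local time 0 would appear simultaneous, which is exactly the correlation problem described above for uncoordinated instruments.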
EXTENDING TRACE DEPTH
Typically, embedded FPGA debuggers employ unused internal block RAM to store trace data. However, block RAM is a limited resource in any FPGA, so the maximum possible trace depth may be too short. If so, then users are forced to "walk" successively refined triggers back towards the root cause event, capturing fresh data at each step. Traditionally, this walking can take several days, since each trigger might well require a re-instrumentation back at the RTL. It's preferable to extend the trace depth, either by finding more RAM or by making better use of the RAM that we have.
Certus makes better use of block RAM by employing data compression on the trace data, increasing trace depth by as much as 1000x over uncompressed approaches. The actual compression achieved is data-dependent. As an example, storing all read and write traffic on an AXI bus to all memory locations in an ARM® A7 design may yield compression of over 1000x. At other times, on the same bus, the compression might be as low as 3x; nevertheless, that is still hugely significant.
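To see why the achieved compression depends so heavily on the data, consider the small sketch below. The article does not say which compression scheme Certus uses, so simple run-length encoding is used here purely as a stand-in, and the function names and sample data are hypothetical.

```python
def run_length_encode(samples):
    """Collapse consecutive identical samples into [value, repeat_count] pairs."""
    runs = []
    for sample in samples:
        if runs and runs[-1][0] == sample:
            runs[-1][1] += 1
        else:
            runs.append([sample, 1])
    return runs


def compression_ratio(samples):
    """Raw sample count divided by stored run count.

    This ignores the storage cost of the repeat counters, so it is an
    optimistic figure intended only to show the trend.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return len(samples) / len(run_length_encode(samples))


idle_bus = [0] * 3000           # bus holds one value: a single long run
busy_bus = [0, 1, 2] * 1000     # value changes every cycle: no runs to exploit

idle_ratio = compression_ratio(idle_bus)   # 3000.0
busy_ratio = compression_ratio(busy_bus)   # 1.0
```

A mostly idle bus collapses into very few stored entries, while a bus that changes on every cycle barely compresses at all, which mirrors the spread between best and worst cases quoted above.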
An even better and easier way to extend trace depth is to store the compressed trace data in dedicated external memory, as provided on any good prototyping hardware. Using external memory allows trace data to be captured for thousands of times longer than the upper limit imposed by the available block RAM.
Mentor Graphics has been developing debug technology that uses external memory, enabling the capture of seconds of real-time data at nanosecond resolution. The capture length depends on the compression, trace width and design clock speed, but even in a worst-case example with no compression, thousands of signals can be captured for hundreds of milliseconds. What does that mean in real life? Consider capturing the entire length of an OS boot sequence in order to trace back why a certain driver did not initialize correctly. Alternatively, how about conditionally capturing every occurrence of a particularly rare packet header over many days of testing a network SoC prototype? This is an extraordinary leap forward in FPGA prototype observability.
We call this breakthrough in trace depth streaming, and it will be made available initially to users of DDR3 and Virtex®-7 or DDR4 and UltraScale®, both on proFPGA hardware, with other memory configurations to follow. Figure 3 shows how, in streaming mode, the instrumentation logic is extended to include two important Xilinx® IP blocks: a MIG block created from Xilinx’s Memory Interface Generator and an AXI Network-Interconnect (NIC).
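The relationship between capture length, compression, trace width and clock speed can be put into a back-of-envelope calculation. The sketch below assumes one bit per traced signal per clock cycle and ignores any storage overhead. The 1 GiB memory size and the 2048-signal width are illustrative assumptions, not figures from any particular board or from the tool.

```python
def capture_seconds(memory_bytes, trace_width_bits, clock_hz, compression=1.0):
    """Estimate how long trace data can be captured before memory fills.

    Assumes one bit per signal per clock cycle; `compression` is the factor
    by which the stored data shrinks (1.0 means no compression).
    """
    stored_bits_per_second = trace_width_bits * clock_hz / compression
    return memory_bytes * 8 / stored_bits_per_second


GIB = 2**30

# Hypothetical worst case: 2048 signals at 10 MHz, no compression, 1 GiB.
worst_case = capture_seconds(1 * GIB, 2048, 10e6)                   # about 0.42 s

# Same setup with an assumed 10x compression.
compressed = capture_seconds(1 * GIB, 2048, 10e6, compression=10)   # about 4.2 s
```

Under these assumptions, even the uncompressed case lands in the hundreds of milliseconds, and any compression at all pushes the capture into seconds.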
Figure 3: Mentor Graphics Streaming technology uses Xilinx AXI Network-on-Chip and Memory Interface Generator
Using this configuration, we have already proven that streaming on Virtex-7 and DDR3 hardware will achieve a throughput of over 10 Gbps.
This raw speed allows the streaming technology to keep up with the design running in the FPGAs by extracting trace data at a rate which is typically 15x faster than the prototype’s highest system clock rate. Each capture station’s internal memory acts as a FIFO buffer to provide further data elasticity, and data compression brings further benefit.
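A rough bandwidth budget shows what "keeping up" means in practice. The sketch below uses the 10 Gbps throughput quoted above and the 10 MHz prototype clock mentioned at the start of this article. It assumes one bit per traced signal per cycle and ignores protocol overhead and FIFO buffering, and the function names are hypothetical.

```python
def max_uncompressed_width(throughput_bps, clock_hz):
    """Widest trace (in signals) that can be streamed every cycle with no compression."""
    return int(throughput_bps // clock_hz)


def can_stream(trace_width_bits, clock_hz, throughput_bps, compression=1.0):
    """True if the (compressed) trace data rate fits within the streaming throughput."""
    return trace_width_bits * clock_hz / compression <= throughput_bps


THROUGHPUT_BPS = 10e9   # "over 10 Gbps", from the text
CLOCK_HZ = 10e6         # 10 MHz prototype clock, from the introduction

budget = max_uncompressed_width(THROUGHPUT_BPS, CLOCK_HZ)             # 1000 signals
fits_500 = can_stream(500, CLOCK_HZ, THROUGHPUT_BPS)                  # True
fits_2048 = can_stream(2048, CLOCK_HZ, THROUGHPUT_BPS)                # False
fits_2048_3x = can_stream(2048, CLOCK_HZ, THROUGHPUT_BPS, compression=3)  # True
```

Under these assumptions, roughly 1000 signals can be streamed on every cycle with no compression at all, and even the modest 3x compression mentioned earlier would bring a 2048-signal trace within budget.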
It is even possible to send streaming data directly to the host via a fast connection, such as ProDesign’s MMI-64 bus. In a recent application, streaming was able to capture all read and write activity to system RAM (a trace width of approximately 500 signals) on an ARM® A7 prototype during its entire three-minute boot-up of Linux.
As another boon to prototypers, each capture station can be configured independently to store its trace data in local block RAM or in external RAM, or to stream directly to the host. Then at runtime, each streaming-capable capture station can be configured to either stream data or to store data in its local block RAM.
CALLING ALL STATIONS; THIS IS AN ALL FPGA ALERT!
The last of our three identified major requirements for prototype debug is trigger flexibility and concurrency across all FPGAs. We have noted that trace data in the various capture stations are kept in sync by inter-station communication and the router, but let's see how communications are extended across multiple FPGAs to enable complex whole-system triggering.
The latest release of Certus uses enhanced communications links between the routers, called event channels. These event channels can be seen operating between FPGAs in Figure 4.
Figure 4: Inter-FPGA communication using Certus event channels
First, let's recap on Certus triggering. A capture station can be set up at runtime to capture data continuously until a certain condition on its inputs is met. This is the trigger condition, which can be any logical combination of the observed signals. Think about that: any input to the observation network can be used as either trigger or trace (or both), so we avoid the long iterations which plague other debug tools that force the user to guess the inputs to the cone of trigger logic too early in the flow.
We can set up different trigger conditions in each capture station. If any one of them occurs, a trigger event is propagated over the event channels so that the ongoing trace capture is stopped effectively simultaneously in all stations, and the data in the Visualizer display will be correctly aligned in time.
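The capture-until-trigger behaviour can be modelled in a few lines. In the sketch below, a fixed-depth buffer keeps only the most recent samples and capture stops on the first sample that satisfies a trigger predicate. The function name, the signal names `valid` and `addr`, and the sample data are hypothetical, and the model ignores compression and any post-trigger capture.

```python
from collections import deque


def capture_until_trigger(samples, trigger, depth):
    """Keep the most recent `depth` samples; stop at the first one matching `trigger`.

    Returns the buffered (cycle, sample) pairs ending at the trigger cycle,
    or None if the trigger condition never occurs.
    """
    buffer = deque(maxlen=depth)   # oldest entries fall out automatically
    for cycle, sample in enumerate(samples):
        buffer.append((cycle, sample))
        if trigger(sample):
            return list(buffer)
    return None


# Hypothetical signals: the same inputs serve as both trace and trigger.
stimulus = [{"valid": c % 2, "addr": c * 0x10} for c in range(8)]

# Trigger condition: a logical combination of the observed signals.
window = capture_until_trigger(
    stimulus,
    trigger=lambda s: s["valid"] == 1 and s["addr"] >= 0x40,
    depth=3,
)
captured_cycles = [cycle for cycle, _ in window]   # [3, 4, 5]
```

Because the predicate is an ordinary function of the sampled values, changing the trigger is just a matter of passing a different predicate, which loosely mirrors setting up a new trigger at runtime rather than rebuilding the instrumentation.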
The event channel can be used not only for triggers but also to alert all other capture stations that an event has occurred. This allows more complex triggers and conditional tracing to be employed. For example, one capture station might look for an initial condition, which becomes an event, then other stations can be set up to look for that event before enabling their own trigger conditions. Don't forget, these stations might be in different FPGAs, running on different clock domains, observing different signals. This is useful for FPGA prototyping because contiguous blocks in the SoC design are often partitioned across multiple FPGAs, but still need to be debugged as one function.
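The two-stage example above can be sketched as a small simulation. One station raises an event when it sees an initial condition, and a second station, which could be in another FPGA, only evaluates its trigger condition once that event has been seen. Everything here is a hypothetical model: the `Station` class, the event names and the one-cycle delay before an event becomes visible are assumptions made for illustration and do not describe the actual event channel protocol.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TRIGGER = "TRIGGER"   # special event that stops capture in all stations


@dataclass
class Station:
    name: str
    condition: Callable[[int], bool]   # predicate on this station's sample
    emits: str                         # event raised when the condition holds
    requires: Optional[str] = None     # event that must be seen first, if any


def run_until_trigger(stations, cycles):
    """Return the cycle at which TRIGGER is raised, or None if it never is.

    cycles: iterable of {station_name: sample} dicts, one per clock cycle.
    Events raised in one cycle become visible to all stations in the next.
    """
    seen = set()
    for cycle, samples in enumerate(cycles):
        raised = set()
        for station in stations:
            enabled = station.requires is None or station.requires in seen
            if enabled and station.condition(samples[station.name]):
                raised.add(station.emits)
        if TRIGGER in raised:
            return cycle
        seen |= raised
    return None


stations = [
    Station("fpga0", lambda v: v == 0xAA, emits="init_seen"),
    Station("fpga1", lambda v: v == 0xFF, emits=TRIGGER, requires="init_seen"),
]
cycles = [
    {"fpga0": 0x00, "fpga1": 0xFF},   # fpga1 matches but is not yet enabled
    {"fpga0": 0xAA, "fpga1": 0x00},   # fpga0 raises "init_seen"
    {"fpga0": 0x00, "fpga1": 0x00},
    {"fpga0": 0x00, "fpga1": 0xFF},   # fpga1 is now enabled and triggers
]
stop_cycle = run_until_trigger(stations, cycles)   # 3
```

Note that the match on `fpga1` in cycle 0 is ignored because the prerequisite event has not yet occurred, which is the conditional behaviour the text describes.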
REAL DEBUG FOR FPGA PROTOTYPES
Traditional FPGA debuggers might be acceptable for single-FPGA designs, or where long instrumentation iterations might somehow be tolerable, but they run out of steam when faced with a full-scale FPGA prototype of today's complex SoCs. Certus Silicon Debug is ready to take up the baton and provide prototypers with the visibility and productivity they demand.
To take a closer look at Certus, please visit www.mentor.com/certus or contact your local Mentor representative. If you want to discuss your FPGA debug challenges, feel free to contact me at email@example.com.