by Avidan Efody, Verification Architect, Mentor Graphics.
In our previous article on ISO 26262 requirements with regard to random hardware faults, we explained what random hardware faults are, where they come from, how ISO requires that you classify them, and why classifying them wrongly can cost a lot of money and time. We also promised to show you how to classify them correctly, and in this post we will try to keep that promise. We will do so by examining a series of techniques, starting from the simplest and cheapest and working our way to the most elaborate. As we go, we will see how an increasing number of worst case assumptions give way to real life results, making the good fault bins grow at the expense of the bad ones.
Imagine you’re a verification engineer being asked to get a small 10K gate design ISO 26262 certified. Assuming you don’t take the smart decision to quit your job, what would be your first step? If you had been asked to do plain functional verification of the design, you would obviously start by reading the DUT spec. With random hardware fault analysis you’ll do just the same, only your spec will be called a “hardware safety requirements specification” (ISO 26262 5.6) instead of just “spec”, and it will describe how your design should behave in order not to violate any safety goals. The behavior specified might concern some outputs – for example, “output A should never send out a wrong value” (1); relationships between outputs – for example, “output A should never assume the value of X when output B is at a value Y” (2); or internal state – for example, “state machine must not jump from state X to state Y” (3). If we follow ISO to the letter, the same document would also describe the safety mechanisms that should be implemented in the design to make sure that the specified behavior is met. As a hardware design engineer, your job would be to implement the safety mechanisms. As a hardware verification engineer, your job would be to verify that the above requirements are met. A good chunk of the job would be just plain functional verification of requirements (1), (2) and (3). The additional special part is that you also have to show that, under a certain dose of random hardware faults, the requirements will still hold with a probability above a given threshold. Applying that to requirement (2) we get something like “in the presence of random hardware faults, output A should assume a value of X when output B is at a value Y less than once per 10^9 hours” (4). A rate of one failure per 10^9 hours is also known as 1 FIT.
Taking requirement (4) above and ignoring the others, and assuming no safety mechanisms in place (we will factor those in later), we could initially make the very prudent assumption that any fault in any gate in our design could lead to its violation. This would put the probability of a safety goal violation at 10000*(one gate fault probability), which will most probably be higher than 1 FIT. It is easy to see that this very prudent worst case assumption is also quite unrealistic, since it is highly unlikely that all 10K gates of the design are even physically connected to A or B. In this article we will look at three different ways of refining this worst case analysis and replacing the exaggerated estimate with a more realistic one.
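To make the worst case arithmetic concrete, here is a minimal sketch. The per-gate failure rate is a made-up number chosen only for illustration (real values come from technology and reliability data), and the names `worst_case_fit`, `ASSUMED_GATE_FIT` and `TARGET_FIT` are hypothetical.

```python
def worst_case_fit(num_gates: int, gate_fit: float) -> float:
    """Worst case: every fault in every gate is assumed to violate the safety goal."""
    return num_gates * gate_fit


TARGET_FIT = 1.0          # requirement (4): at most one failure per 10^9 hours
ASSUMED_GATE_FIT = 0.001  # hypothetical per-gate failure rate, illustration only

estimate = worst_case_fit(10_000, ASSUMED_GATE_FIT)
meets_target = estimate <= TARGET_FIT
print(f"worst case estimate: {estimate:.1f} FIT, meets target: {meets_target}")
```

With these assumed numbers the worst case estimate lands well above the target, which is exactly why the refinements discussed next are worth the effort.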
Just about any design includes some debug or status logic that usually can't drive outputs unless it has some telepathic powers. There's no reason in the world to include this logic in the risk probability calculation, especially since taking it out is so easy. All this requires is a good formal tool that can calculate the fan-in cone of any safety critical outputs or internal signals, and exclude all other parts of the design from the risk probability calculation. This usually gives a 10-20% reduction in risk probability, practically without having to get off the sofa. Actually, formal tools can usually go further than that and also take out of the calculation any logic that can't influence A or B under certain assumptions about the inputs.
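The structural part of this step can be sketched as a backward graph traversal. This is only a toy illustration, not how any particular formal tool works: the netlist is assumed to be a plain dictionary mapping each signal to the signals that drive it, and all signal names are hypothetical. It does not model the input assumptions that a formal tool can additionally exploit.

```python
from collections import deque


def fan_in_cone(drivers, roots):
    """Return every signal that is structurally connected to any of `roots`.

    `drivers` maps a signal name to the list of signals feeding it.
    """
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for src in drivers.get(node, ()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen


# Hypothetical netlist: A and B are safety critical, dbg_status is debug logic.
netlist = {
    "A": ["g1"],
    "B": ["g2"],
    "g1": ["in0", "in1"],
    "g2": ["g1", "in2"],
    "dbg_status": ["g3"],
    "g3": ["g1", "in3"],
}

cone = fan_in_cone(netlist, ["A", "B"])
excluded = set(netlist) - cone  # logic that cannot reach A or B
print(sorted(excluded))
```

In this example the debug path reads from `g1` but never drives anything in the cone of A or B, so it drops out of the calculation.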
Adding a gray scale to our black and white analysis is a big step up in pain and in gain. Instead of asking if a gate can or can't influence the safety goal, we ask what the gate's chances of influencing the safety goal are. If we take an OR gate for example, as long as one input is '1', the other inputs could go as crazy as they want. If one input is '1' for 99.9% of the time, faults in anything connected to the other inputs could be left out. That's where the big savings are hiding, but to get that information we must perform some measurements on a "running" design – i.e. not only connections but also actual operation. One way of doing that would be to log selected gates' inputs while the design is running, then run a script to find out the percentage of time in which each input is really influencing the output. The diagram on the right shows some example results.
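A script of that kind could look roughly like the sketch below. It assumes the logged data is available as a list of input vectors, one per sample point, and it treats an input as "influencing the output" in a sample if flipping that input alone would change the gate's output. The function names and the sample data are hypothetical.

```python
def influence_fraction(gate, samples):
    """For each input, return the fraction of samples in which flipping
    that input alone would change the gate output."""
    n_inputs = len(samples[0])
    counts = [0] * n_inputs
    for vec in samples:
        out = gate(vec)
        for i in range(n_inputs):
            flipped = list(vec)
            flipped[i] ^= 1
            if gate(flipped) != out:
                counts[i] += 1
    return [c / len(samples) for c in counts]


def or_gate(bits):
    return int(any(bits))


# Hypothetical log: input 0 is '1' in 999 of 1000 samples, input 1 is always '0'.
samples = [(1, 0)] * 999 + [(0, 0)]
fractions = influence_fraction(or_gate, samples)
print(fractions)
```

Here the second input can only change the output in the single sample where the first input is '0', which is the 0.1% window from the OR gate example above.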
Sounds simple? Well that's because I intentionally left a big gotcha out. To be really efficient, and provide a reduction that is worth the investment, the gates for which we perform this analysis would have to meet a somewhat difficult condition: gates that are in the fan-in cone of their inputs should not have another physical connection to the safety critical areas. Why? The graphic description below tries to make this clear. Basically, if faults in the fan-in cone don't have another way out, then we can multiply their cumulative probability of failure by the input's chances of influencing the output. If the faults have another way out, then we need to count them in full in the safety goal violation probability anyhow.
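The condition above can be sketched as a reachability check with the analyzed connection removed. This is a simplified illustration under assumed data structures: `fanout` maps each signal to the signals it drives, `blocked_edge` is the connection into the analyzed gate, and every name in the example is hypothetical. Gates whose only way out is the blocked connection are weighted by the measured influence; all others are counted in full.

```python
def reaches_without_edge(fanout, start, targets, blocked_edge):
    """True if `start` can reach any target without using `blocked_edge`."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node in targets:
            return True
        for nxt in fanout.get(node, ()):
            if (node, nxt) == blocked_edge or nxt in seen:
                continue
            seen.add(nxt)
            stack.append(nxt)
    return False


def weighted_gate_count(cone, fanout, targets, blocked_edge, influence):
    """Effective number of gates contributing to the violation probability."""
    total = 0.0
    for gate in cone:
        if reaches_without_edge(fanout, gate, targets, blocked_edge):
            total += 1.0        # has another way out: count in full
        else:
            total += influence  # only exit is the analyzed input
    return total


# Hypothetical example: bus3 also drives A directly, bypassing the mux.
fanout = {"bus1": ["bus2"], "bus2": ["mux"], "bus3": ["bus2", "A"], "mux": ["A"]}
effective = weighted_gate_count(
    ["bus1", "bus2", "bus3"], fanout, {"A"}, ("bus2", "mux"), influence=0.001
)
print(effective)
```

Multiplying the effective gate count by a per-gate failure rate then gives the refined contribution of this cone.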
At a high level, gates that meet the criteria above can be found at “strategic” points in the design: for example, at the interface between two blocks, at arbitration points, or where multiplexing/de-multiplexing takes place. A pretty common example of gates that could be excluded from the calculation using this method is the logic that belongs to some configuration interface (say APB or AHB). While configuration registers are usually read quite often by the DUT and safety critical logic, they’re written only rarely. This means that faults in the bus logic are buffered from safety critical logic, and have a very low probability of impacting it.
Another way to determine the probability of safe/dangerous faults is to inject a large enough sample of faults into the design and check their impact. The brute force method of doing that, which has been around for some time, is to run a test, force an internal signal to a wrong value, and then diff the outputs or some internal signals against the same test without a fault. If a difference is found then the fault is usually counted as dangerous.
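The brute force flow can be illustrated with a toy model. In this sketch the "design" is a three-input Python function standing in for a real simulator, a fault is a single-cycle inversion of one signal, and the sample is drawn at random; all of these are simplifying assumptions, and the names are hypothetical.

```python
import random


def simulate(stimulus, fault=None):
    """Toy DUT: out = (a AND b) OR c, evaluated once per cycle.

    `fault` is an optional (cycle, signal_name) pair; that signal is
    inverted for that one cycle.
    """
    outputs = []
    for cycle, (a, b, c) in enumerate(stimulus):
        def val(name, value):
            # Force the signal to the wrong value if the fault hits it now.
            return value ^ 1 if fault == (cycle, name) else value

        a, b, c = val("a", a), val("b", b), val("c", c)
        n1 = val("n1", a & b)
        outputs.append(val("out", n1 | c))
    return outputs


def fault_campaign(stimulus, signals, sample_size, seed=0):
    """Fraction of sampled faults whose outputs differ from the golden run."""
    rng = random.Random(seed)
    golden = simulate(stimulus)
    dangerous = 0
    for _ in range(sample_size):
        fault = (rng.randrange(len(stimulus)), rng.choice(signals))
        if simulate(stimulus, fault) != golden:
            dangerous += 1
    return dangerous / sample_size


stimulus = [(1, 1, 1), (1, 1, 0), (0, 1, 0), (0, 0, 1)]
ratio = fault_campaign(stimulus, ["a", "b", "c", "n1", "out"], sample_size=200)
print(f"dangerous fraction in sample: {ratio:.2f}")
```

Every sampled fault costs one extra simulation here, which is the regression time cost that makes this method expensive on a real design.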
Since the fault sample required might be quite big, the brute force method will obviously have a significant cost in regression time. An easy way to avoid additional runs is to look at the results of tests without fault injection, and try to recalculate their result as if a fault had been injected at some point. Since most random faults are masked by logic just after they happen, this could help eliminate a good chunk of the sample. For example, in the diagram to the right the fault injected at point A at time 200ns will be masked at point B. Note that in some cases the fault will propagate to a large funnel of logic, or to registers/memory. In this case this sort of "what if" analysis will become increasingly complex, until it takes just as long as a new run. Hence, it makes sense to stop this analysis if the fault isn't masked after an arbitrary number of cycles.
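A rough sketch of such a "what if" recalculation follows. It makes several simplifying assumptions that are not from the article: the logic is purely combinational, `logged` holds the fault-free values at one point in time, `order` is a topological ordering of the signals, and the cutoff is expressed as a budget of gate re-evaluations rather than cycles. All names are hypothetical.

```python
GATES = {
    "and": lambda ins: int(all(ins)),
    "or": lambda ins: int(any(ins)),
}


def what_if(netlist, order, logged, fault_node, outputs, max_evals):
    """Recompute only the logic downstream of a flipped signal.

    Returns "masked", "dangerous", or "needs_rerun" when the budget runs out.
    """
    faulty = {fault_node: logged[fault_node] ^ 1}
    evals = 0
    for node in order:
        if node == fault_node or node not in netlist:
            continue  # the forced signal and primary inputs are not recomputed
        kind, ins = netlist[node]
        if not any(i in faulty for i in ins):
            continue  # no corrupted input, so the logged value still holds
        evals += 1
        if evals > max_evals:
            return "needs_rerun"
        value = GATES[kind]([faulty.get(i, logged[i]) for i in ins])
        if value != logged[node]:
            faulty[node] = value  # the fault propagated through this gate
    return "dangerous" if any(o in faulty for o in outputs) else "masked"


netlist = {"n1": ("and", ["a", "b"]), "n2": ("or", ["n1", "c"]), "out": ("and", ["n2", "d"])}
order = ["a", "b", "c", "d", "n1", "n2", "out"]
log_c_high = {"a": 1, "b": 1, "c": 1, "d": 1, "n1": 1, "n2": 1, "out": 1}
log_c_low = {"a": 1, "b": 1, "c": 0, "d": 1, "n1": 1, "n2": 1, "out": 1}

print(what_if(netlist, order, log_c_high, "n1", {"out"}, max_evals=10))
print(what_if(netlist, order, log_c_low, "n1", {"out"}, max_evals=10))
```

Faults classified as masked need no extra run, and only those that come back as "needs_rerun" (or that you want to confirm) go on to a full fault injection simulation.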
The above fault analysis techniques should all be used together to determine safe/dangerous fault probability. Formal analysis bounds the “spatial” dimension, removing logic that is not physically connected to safety critical areas or that has no way of impacting them under given assumptions. Dynamic analysis bounds the time dimension, removing logic that influences safety critical areas only for negligible time windows. Finally, fault simulation is used to determine the probability of safe/dangerous faults in the remaining logic. In our next article we will bring the results derived from these various analysis techniques together and show how they can be turned into meaningful ISO 26262 metrics.