1. Introduction

    Shrinking nodes and reduced supply voltages make transient faults due to electromagnetic interference a growing concern for mission-critical ASICs (Application-Specific Integrated Circuits) and FPGAs (Field-Programmable Gate Arrays). Unexpected changes to data, configuration, state, or control elements within the design can lead to catastrophic failures if they are not taken into account and addressed upfront, either by additional safety mechanisms or through analysis showing that their risk is sufficiently low within the component’s life-cycle.

    The ISO 26262 automotive safety standard requires that the probabilities of random hardware failures be rigorously analyzed and quantified via a set of objective metrics that allow for external auditing. These metrics include the so-called architectural metrics, which assess the adequacy of the chosen safety mechanisms [2] and their ability to prevent faults from reaching safety-critical areas by detecting or correcting them. Although transient faults are only one possible cause of random hardware failures, quantifying their impact as part of the architectural metrics evaluation is one of the most challenging aspects of the certification process, due to the immense number of state and fault combinations.
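    For orientation, the two architectural metrics are the single-point fault metric (SPFM) and the latent fault metric (LFM). The sketch below follows the general form given in ISO 26262-5; the precise definitions and the classification of the failure rates are found in the standard itself.

    \[ \mathrm{SPFM} = 1 - \frac{\sum_{\mathrm{SR,HW}} \left( \lambda_{\mathrm{SPF}} + \lambda_{\mathrm{RF}} \right)}{\sum_{\mathrm{SR,HW}} \lambda}, \qquad \mathrm{LFM} = 1 - \frac{\sum_{\mathrm{SR,HW}} \lambda_{\mathrm{MPF,latent}}}{\sum_{\mathrm{SR,HW}} \left( \lambda - \lambda_{\mathrm{SPF}} - \lambda_{\mathrm{RF}} \right)} \]

    Here the sums run over the safety-related hardware elements, \(\lambda\) is an element's total failure rate, \(\lambda_{\mathrm{SPF}}\) and \(\lambda_{\mathrm{RF}}\) are its single-point and residual fault rates, and \(\lambda_{\mathrm{MPF,latent}}\) is its rate of latent multiple-point faults. Quantifying the transient-fault contribution to these failure rates is what drives the analysis effort discussed below.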

    In cases where any of the architectural metrics fails to meet the criteria defined for the product’s Automotive Safety Integrity Level (ASIL), the standard mandates a re-evaluation of the component’s safety concept, possibly followed by improvement of existing safety mechanisms or the introduction of new ones. Since transient-fault-related metrics are typically obtained from gate-level netlists, sometimes only after place and route, they are not available until late in the design cycle. At that stage, conceptual re-evaluation and substantial hardware modifications, together with the verification they require, can have a severe impact on the schedule.

    An analogy to the implementation of a power concept highlights important similarities and one important difference. Error rates per block can be allocated, just like power budgets, during the architectural exploration phase. Blocks are then translated into RTL (Register Transfer Level) and GL (Gate Level) descriptions, at which point their actual error rate or power consumption can be accurately measured and compared against the high-level target values. If a problem is discovered at this stage, the solution space is much more limited, and the saving per solution is usually much smaller. The main difference between the two flows is that while it may be acceptable for a block’s power consumption to end up very close to its initial target, for error rates this means a significant increase in analysis effort. This stems from the fact that, to comply with ISO 26262, error rates must be “proven”, and a proof intended for external auditing is much harder to achieve.

    The closer the metrics are to the defined criteria, the higher the accuracy to which they must be determined, and the higher the required accuracy, the more effort must be invested in fault analysis. Hence there is a trade-off between design cost and analysis cost. If a design’s safety concept is planned upfront with a significant margin from the defined criteria, the analysis can be relaxed; this means additional design cost in exchange for reduced analysis cost. The opposite holds as well: if a design is planned to meet its targets exactly, the analysis must become much more sensitive, resulting in extra analysis cost. As we will see in the sections below, this extra analysis cost manifests itself in a growth in the number of faults that need to be analyzed.
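    To make the last point concrete, consider the sample-size bound commonly used in statistical fault injection (a general statistical result, not necessarily the exact procedure adopted later in this paper). For a fault population of size \(N\), a confidence level with quantile \(t\) (e.g. \(t \approx 1.96\) for 95%), an estimated failure proportion \(p\) (worst case \(p = 0.5\)), and a margin of error \(e\), the number of faults to inject is approximately

    \[ n = \frac{N}{1 + e^{2}\,\dfrac{N-1}{t^{2}\,p\,(1-p)}} \;\approx\; \frac{t^{2}\,p\,(1-p)}{e^{2}} \quad \text{for large } N. \]

    Since \(n\) grows as \(1/e^{2}\) for large populations, halving the acceptable margin of error roughly quadruples the number of faults to inject; designing with a comfortable margin from the metric criteria therefore permits a correspondingly coarser, cheaper fault campaign.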

    In this paper we tackle the transient fault coverage space challenge from two sides. First, we explain how this practically infinite space can be managed, so that a limited number of faults suffices to obtain the relevant metrics at the required accuracy. We then describe a technique that allows the chosen fault sample to be run at reduced cost in time and resources.
