A Study in Printed Circuit Board Failure Analysis, Part 1

Over the course of a failure analyst’s career, they will be exposed to an extensive and varied array of devices. No matter the technology – whether nanoscopic silicon sensors with moving parts so small as to defy belief or massive circuit assemblies composed of thousands of discrete components and integrated circuits – no device is completely immune to failure. Variations in process control, insufficiently robust designs, and extended abuse by an end user can all spell early doom for a device. In our introductory article, we took a high-level overview of the failure analysis process, discussing the steps an analyst takes to turn a failing, rejected product into actionable knowledge for process improvement; in this column, we will see how these steps are applied to a specific failure. Naturally, examining a relatively trivial case would not provide the necessary depth of learning, so instead we choose an example of a failure many analysts dread: an intermittent failure on a printed circuit assembly.

In this study, a single printed circuit assembly was received as an RMA from an end user. The end user had identified the failing assembly only by swapping parts; lacking any sort of test equipment, the customer was unable to provide any detail that could narrow the scope of the analysis beyond the most basic of failure descriptions (“this part doesn’t work anymore”). The first step in the failure analysis process is to verify the failure; after initial photo documentation, the assembly was put into functional testing on an application test bench. Initial results were disheartening, to say the least: the assembly functioned as designed, with supply current and output levels within specifications. In the absence of any reproducible failure mode, an analyst must rack their brain, grasping at any explanation for why the product has miraculously returned to normal function. Could the product have been improperly used by the customer – for example, were all connectors fully seated? Were power supply voltages stable and held at the correct levels? Had this board been processed with a top-secret, self-healing material pulled straight from the annals of science fiction, one that had repaired whatever defect was responsible for the initial failure (hopefully not, lest our intrepid analyst find himself out of a job)?

Fortunately, in this case, our analyst was rescued from the throes of despair, and from a new career writing schlocky novellas about autonomous, regenerating electronic assemblies, by a sudden change in the functional test results: an output that was previously within specifications dropped out, with only a fraction of the expected current being supplied to its load. Though our analyst rejoiced at being returned firmly to the realm of reality, these results indicated that the most likely root cause of failure would be hard to pin down – an intermittent connection.
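As a rough sketch of how such a drop-out might be caught and timestamped on the bench, the Python fragment below logs an output’s load current against a pair of specification limits and reports the first out-of-spec sample. The limits, the polling interval, and the `read_current_a` callable (standing in for whatever instrument query the test bench actually provides) are all assumptions for illustration, not details of the actual test setup.

```python
import time

# Hypothetical specification limits for the output under test; these are
# assumed illustrative values, not figures from the actual product.
I_MIN_A = 0.90   # minimum acceptable load current, amperes
I_MAX_A = 1.10   # maximum acceptable load current, amperes

def monitor_output(read_current_a, duration_s=3600.0, interval_s=5.0):
    """Poll an output's load current and report the first out-of-spec sample.

    `read_current_a` is whatever callable the bench provides to query the
    measurement instrument (DMM, SMU, electronic load); it must return the
    measured current in amperes.
    """
    start = time.time()
    while time.time() - start < duration_s:
        elapsed = time.time() - start
        current = read_current_a()
        in_spec = I_MIN_A <= current <= I_MAX_A
        print(f"t={elapsed:7.1f} s  I={current:6.3f} A  {'OK' if in_spec else 'FAIL'}")
        if not in_spec:
            return elapsed, current   # time and value of the first failing sample
        time.sleep(interval_s)
    return None   # output stayed within limits for the whole test window
```

In a case like the one described above, such a log would show the output holding within limits for some warm-up period before dropping to a fraction of its expected current – exactly the kind of time-dependent signature that hints at an intermittent connection.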

The initial functional test led to several key observations that helped to characterize the failure. Initially, the assembly worked as intended, but after some period of time under power, the device would fail. Furthermore, the failure was not a “hard fail” (i.e., a short circuit or open circuit); power was still being supplied to the output pin, but insufficient drive current was available. After repeating the functional test and observing the same failure characteristics, it was hypothesized that some thermal effect (thermal expansion, for example) was causing the device to fail: when first powered up, the board was at room temperature, but after a length of time under bias, the power dissipated by the board caused enough self-heating to produce the failure. Environmental testing was performed, with the temperature of the board deliberately modulated; a strong correlation was noted between higher board temperature and reduced load current from the failing output. With the failure verified and characterized, the next step was to isolate the problem; in this case, isolation was done completely non-destructively, by tracing the circuit back from the failing output until an unexpectedly high resistance (48,000 ohms) was found between two points on the same node.
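The temperature dependence observed during environmental testing can be made quantitative with very little effort. The short Python sketch below computes the correlation between board temperature and the load current delivered by the failing output; the sample data are invented for illustration and simply mimic the trend described above, whereas a real log would come from the chamber controller and the bench instruments.

```python
import numpy as np

# Illustrative environmental-test log: board temperature (deg C) and the load
# current (A) delivered by the failing output at each step. These numbers are
# invented for the example, not actual measurements from the case study.
temperature_c  = np.array([25, 35, 45, 55, 65, 75])
load_current_a = np.array([1.02, 0.98, 0.85, 0.61, 0.40, 0.22])

# Pearson correlation coefficient between temperature and output current.
r = np.corrcoef(temperature_c, load_current_a)[0, 1]
print(f"correlation(temperature, load current) = {r:+.3f}")

# A coefficient close to -1 indicates that load current falls as the board
# heats up, consistent with a thermally sensitive intermittent connection
# rather than random drop-outs.
if r < -0.8:
    print("Strong negative correlation: failure tracks board temperature.")
```

A coefficient near −1 gives the thermal hypothesis numerical backing; a value near zero would instead suggest that the drop-outs are independent of temperature and that some other stress should be investigated.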

With the failure verified and isolated to a relatively small area, non-destructive testing procedures were performed. For PCB failures, x-ray analysis and optical inspection are chief among the non-destructive approaches available; other techniques, like acoustic microscopy, are more appropriate for component-level failures. At this point in the process, an analyst inspects for cracked solder joints, broken PCB traces, misaligned via drills, or any other anomalous features that might help to explain the failure mechanism; in this particular case, no issues were noted during non-destructive testing. While a negative result like this may seem to add no value to the analysis, the data can be used to rule out certain types of defects (for example, a crack in the copper trace between the two points caused by warping of the PCB is now unlikely).

Failure is the First Step on the Road to Success, Part 2


Non-destructive testing overlaps to a certain degree with the next step in the process, in which an analyst attempts to isolate the failure to as small an area as possible. This phase of the project may include both destructive and non-destructive work, as necessary to locate a defect site. Some problems may be fairly simple to isolate, given the correct tools: a low-resistance short between nodes of a board may be revealed in a matter of seconds using a thermal imaging camera, and the aforementioned cracked solder joint found during visual inspection can usually be probed for continuity with very little trouble. Other defects require patience, a steady hand, and a methodical plan of attack; finding a leakage site on a PCB, for example, may require an analyst to cut traces (both on the surface of the PCB and buried within it) in order to limit the number of possible locations for a defect.
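For the trace-cutting example, that methodical plan of attack often boils down to a binary search: isolate half of the suspect path, measure it, and keep halving whichever portion still shows the anomaly. The sketch below models only the bookkeeping of that search in Python; the segment names and the `measure_resistance` callable (here replaced by a fake) are hypothetical stand-ins for the physical cutting and probing an analyst would actually perform.

```python
def isolate_defect(segments, measure_resistance, threshold_ohms=1_000):
    """Binary-search an ordered list of series trace segments for the bad one.

    `segments` holds identifiers for the pieces of the net (e.g. between
    accessible vias or cut points); `measure_resistance(run)` returns the
    resistance measured across a contiguous run of segments. A healthy run
    is assumed to read well below `threshold_ohms`.
    """
    lo, hi = 0, len(segments)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Measure the first half of the remaining span (conceptually, after
        # cutting or lifting the trace at the midpoint).
        if measure_resistance(segments[lo:mid]) > threshold_ohms:
            hi = mid          # anomaly lies in the first half
        else:
            lo = mid          # first half is clean; look in the second half
    return segments[lo]       # the single segment still showing the anomaly

# Illustrative use with a fake measurement in place of real probing; the
# segment names are invented for the example.
net = ["U1.3-via12", "via12-via13", "via13-R5.1", "R5.1-J2.7"]
fake = lambda run: 48_000 if "via13-R5.1" in run else 0.2
print(isolate_defect(net, fake))   # -> "via13-R5.1"
```

The payoff is that a net spanning dozens of segments can be narrowed to a single suspect location in a handful of cuts rather than one measurement per segment.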

Once a potential defect site has been isolated, an analyst must be able to reveal the defect in all its glory. While the data gathered from isolation and non-destructive testing may be fairly strong, failure analysis follows the old clichés that “seeing is believing” and “a picture is worth a thousand words”; a failure analysis project is not truly finished until the analyst can produce images clearly showing a defect, removing any shadow of doubt that the anomaly found is at the heart of the reported problem.

This step is almost always destructive; the analyst must, figuratively speaking, tear away the veil of FR4 and copper shielding the defect from view in order to show it definitively. At the assembly level, this often involves cross-sectioning (to show cracked vias and solder joints, or defects between PCB layers) or PCB delayering (to reveal damaged traces and voided or burnt dielectrics). Once the defect has been uncovered, an appropriate imaging solution can be chosen depending on the nature of the defect: high-resolution optical or electron microscopes are sufficient for physical damage and defects, while tools like energy dispersive spectroscopy may be used to provide an “image” of contamination on a device that led to its early failure. With images in hand, an analyst’s work is almost finished.

In the final phase of a failure analysis project, an analyst must report their findings. The tools and techniques used by a failure analyst may not be familiar to their audience, who may be specialists in PCB assembly, metallurgy, or other disciplines. In some cases, the final audience of the report may be predisposed to disbelieve the results of an analysis (for example, when the evidence shows that a subcontractor’s PCBs do not meet required specifications, obligating them to re-run one or more lots of product). The failure analysis report must, therefore, be a clear, objective distillation of all data obtained during the course of the analysis, with a strong conclusion grounded in the facts revealed during the process. Whether the results point to a pervasive problem that must be remedied in order to meet reliability targets or simply indicate improper use by an end user, it is important to remember that the purpose of failure analysis is continuous improvement, not finger-pointing. Assigning blame does not offer a solution to a given problem; by understanding the nature of device failures, it is possible to implement corrective action (where necessary) to prevent recurrence of the same defect in future devices.

By following the various steps of the failure analysis process – verification, NDT, isolation, revelation, and reporting – it is possible to take a device that would have been relegated to the trash can and transform it into a vital learning tool. It has been said that failure is the first step on the road to success; understanding why a device has failed is a key starting point for creating a better device. Whether a defect was introduced during PCB manufacturing, solder reflow, or by an end user, all parties involved may learn from the anomaly and work to improve their own processes. While this article has provided only a generic overview of the failure analysis flow, future articles will dive into further detail, exploring case studies that show the impact failure analysis can have and examining the techniques that go into a successful investigation. Until then, remember the motto of one of the most beloved groups of television scientists around – “Failure is always an option” – and keep an open mind about what that malfunctioning PCA might really be telling you!