Meta warns bit flips, other hardware faults cause AI errors

It's no hallucination: '4 in 1,000 inferences inaccurate' due to this alone, depending on the setup

Meta has identified another reason AI might produce rubbish output: Hardware faults that corrupt data.

As noted in a paper emitted last week and a June 19 write-up, hardware faults can corrupt data. No prizes for Meta there – phenomena such as "bit flips" that see data values changed from zero to one, or vice versa, are well known and have even been attributed to cosmic rays hitting memory or hard disks.

Meta labels such "undetected" hardware faults – we'll assume they mean errors that aren't caught and dealt with on the fly – as "silent data corruptions" (SDCs). Its researchers suggest that when these kinds of faults occur in AI systems, they create "parameter corruption, where AI model parameters are corrupted and their original values are altered."

That could result in inaccurate, weird, or just generally bad output.
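To picture why a single flipped bit matters, here's a minimal sketch – in Python, and not Meta's code – of one bit flip in the IEEE-754 encoding of a float32 weight. The parameter value and the bit position are made up for illustration, but flipping a high exponent bit can turn a small weight into an astronomically large one.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-31) in the IEEE-754 float32 encoding of value."""
    packed = struct.unpack("<I", struct.pack("<f", value))[0]
    packed ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", packed))[0]

weight = 0.125                    # a made-up model parameter
corrupted = flip_bit(weight, 30)  # flip the top exponent bit
print(weight, "->", corrupted)    # 0.125 -> roughly 4.25e+37
```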

"When this occurs during AI inference/servicing it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services," as Meta's boffins put it.

As we said, bit flips are not a new thing – Meta has documented their prevalence in its own infrastructure, and it's hard to deal with these undetected faults at the best of times. In their latest paper, Meta's eggheads suggest the AI stack complicates matters further.

"The escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults," the paper states.

What to do? Meta suggests measuring hardware faults so that builders of AI systems at least understand the risks.

Its boffins therefore proposed the "parameter vulnerability factor" (PVF) – "a novel metric we've introduced with the aim to standardize the quantification of AI model vulnerability against parameter corruptions."

PVF is apparently "adaptable to different hardware fault models" and can be tweaked for different models and tasks.

"Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model's convergence capability," Meta's team asserted.

The paper explains that Meta simulated silent corruption incidents using "DLRM" – the deep learning recommendation model the social media giant uses to generate personalized content recommendations. Under some circumstances, Meta's techies found four in every thousand inferences would be incorrect due to bit flips alone.

Presumably that's on top of the usual accuracy, or lack thereof, delivered by LLMs.
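For the curious, here's a rough sketch of how a PVF-style figure could be estimated by fault injection: flip random bits in a model's parameters, rerun inference, and count how often the output changes. The toy linear model, array shapes, and trial count below are our own stand-ins, not the DLRM setup described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 8)).astype(np.float32)   # stand-in parameters
inputs = rng.normal(size=(100, 16)).astype(np.float32)  # stand-in queries

def predict(w: np.ndarray) -> np.ndarray:
    # Toy "inference": pick the highest-scoring of 8 items for each query
    return (inputs @ w).argmax(axis=1)

baseline = predict(weights)
trials, corrupted_outputs = 10_000, 0
for _ in range(trials):
    w = weights.copy()
    idx = rng.integers(w.size)          # choose one parameter at random
    bit = rng.integers(32)              # choose one bit of its float32 encoding
    flat = w.view(np.uint32).reshape(-1)
    flat[idx] ^= np.uint32(1 << bit)    # inject the bit flip
    if not np.array_equal(predict(w), baseline):
        corrupted_outputs += 1          # the corruption changed the output

print(f"output corrupted in {corrupted_outputs / trials:.4%} of injections")
```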

The paper concludes by suggesting that AI hardware designers and operators consider PVF, to help them balance fault protection with performance and efficiency.

If this all sounds a bit familiar, your déjà vu is spot on. PVF builds on the architectural vulnerability factor (AVF) – an idea described last year by researchers from Intel and the University of Michigan in the US. ®
