Meta warns bit flips, other hardware faults cause AI errors

It's no hallucination: '4 in 1,000 inferences inaccurate' due to this alone, depending on the setup

Meta has identified another reason AI might produce rubbish output: Hardware faults that corrupt data.

As noted in a paper emitted last week and a June 19 write-up, hardware faults can corrupt data. No prizes for Meta there – phenomena such as "bit flips" that see data values changed from zero to one, or vice versa, are well known and have even been attributed to cosmic rays hitting memory or hard disks.

Meta labels such "undetected" hardware faults – we'll assume they mean errors that aren't caught and dealt with on the fly – as "silent data corruptions" (SDCs). Its researchers suggest that when these kinds of faults occur in AI systems, they create "parameter corruption, where AI model parameters are corrupted and their original values are altered."

That could result in inaccurate, weird, or just generally bad output.
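To picture why a single flipped bit matters, here's a minimal sketch – in Python, and not Meta's code – of one bit flip in the IEEE-754 encoding of a float32 weight. The parameter value and the bit position are made up for illustration, but flipping a high exponent bit can turn a small weight into an astronomically large one.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-31) in the IEEE-754 float32 encoding of value."""
    packed = struct.unpack("<I", struct.pack("<f", value))[0]
    packed ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", packed))[0]

weight = 0.125                    # a made-up model parameter
corrupted = flip_bit(weight, 30)  # flip the top exponent bit
print(weight, "->", corrupted)    # 0.125 -> roughly 4.25e+37
```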

"When this occurs during AI inference/servicing it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services," as Meta's boffins put it.

As we said, bit flips are not a new thing – Meta has documented their prevalence in its own infrastructure, and it's hard to deal with these undetected faults at the best of times. In their latest paper, Meta's eggheads suggest the AI stack complicates matters further.

"The escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults," the paper states.

What to do? Meta suggests measuring hardware faults so that builders of AI systems at least understand the risks.

Its boffins therefore proposed the "parameter vulnerability factor" (PVF) – "a novel metric we've introduced with the aim to standardize the quantification of AI model vulnerability against parameter corruptions."

PVF is apparently "adaptable to different hardware fault models" and can be tweaked for different models and tasks.

"Furthermore, PVF can be extended to the training phase to evaluate the effects of parameter corruptions on the model's convergence capability," Meta's team asserted.

The paper explains that Meta simulated silent corruption incidents using "DLRM" – the deep learning recommendation model the social media giant uses to generate personalized content recommendations. Under some circumstances, Meta's techies found four in every thousand inferences would be incorrect due to bit flips alone.

Presumably that's on top of the usual accuracy, or lack thereof, delivered by LLMs.
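For the curious, here's a rough sketch of how a PVF-style figure could be estimated by fault injection: flip random bits in a model's parameters, rerun inference, and count how often the output changes. The toy linear model, array shapes, and trial count below are our own stand-ins, not the DLRM setup described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 8)).astype(np.float32)   # stand-in parameters
inputs = rng.normal(size=(100, 16)).astype(np.float32)  # stand-in queries

def predict(w: np.ndarray) -> np.ndarray:
    # Toy "inference": pick the highest-scoring of 8 items for each query
    return (inputs @ w).argmax(axis=1)

baseline = predict(weights)
trials, corrupted_outputs = 10_000, 0
for _ in range(trials):
    w = weights.copy()
    idx = rng.integers(w.size)          # choose one parameter at random
    bit = rng.integers(32)              # choose one bit of its float32 encoding
    flat = w.view(np.uint32).reshape(-1)
    flat[idx] ^= np.uint32(1 << bit)    # inject the bit flip
    if not np.array_equal(predict(w), baseline):
        corrupted_outputs += 1          # the corruption changed the output

print(f"output corrupted in {corrupted_outputs / trials:.4%} of injections")
```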

The paper concludes by suggesting that AI hardware designers and operators consider PVF, to help them balance fault protection with performance and efficiency.

If this all sounds a bit familiar, your déjà vu is spot on. PVF builds on the architectural vulnerability factor (AVF) – an idea described last year by researchers from Intel and the University of Michigan in the US. ®
