Risk Management for LLMs Confidential
Table of Contents
Know what we’re talking about
Select a standard
Audit supply chains
Adopt an adversarial mindset
Review past incidents
Enumerate harms and prioritize risks
Dig into data quality
Apply benchmarks
Use supervised ML assessments
Engineer adversarial prompts
Don’t forget security
Acknowledge uncertainty
Engage stakeholders
Mitigate risks
WARNING: This presentation contains model outputs which are
potentially offensive and disturbing in nature.
Know What We’re Talking About
Words matter
• Audit: Formal independent transparency and documentation exercise that
measures adherence to a standard.* (Hasan et al., paraphrased)
• Assessment: A testing and validation exercise.* (Hasan et al., paraphrased)
• Harm: An undesired outcome [whose] cost exceeds some threshold[; ...] costs
have to be sufficiently high in some human sense for events to be harmful.
Check out the new NIST Trustworthy AI Glossary:


Know What We’re Talking About (Cont.)
Words matter
• Language model: An approximative description that captures patterns and regularities
present in natural language and is used for making assumptions on previously unseen
language fragments. (NIST)
• Red-teaming: A role-playing exercise in which a problem is examined from an
adversary’s or enemy’s perspective. (NIST)*
• Risk: Composite measure of an event’s probability of occurring and the magnitude or
degree of the consequences of the corresponding event. The impacts, or
consequences, of AI systems can be positive, negative, or both and can result in
opportunities or threats. (NIST)
* Audit, assessment, and red team are often used generally and synonymously to mean testing and validation.
Select a Standard
External standards bolster independence
• NIST AI Risk Management Framework
• EU AI Act Conformity
• Data privacy laws or policies
• Nondiscrimination laws
The NIST AI Risk Management Framework puts
forward guidance across mapping, measuring,
managing, and governing risk in sophisticated AI
systems.
Audit Supply Chains
AI is a lot of (human) work
• Data poisoning and malware
• Ethical labor practices
• Localization and data privacy
• Geopolitical stability
• Software and hardware
• Third-party vendors
Cover art for the recent NY Magazine article, AI
Is A Lot Of Work: As the technology becomes
ubiquitous, a vast tasker underclass is
emerging — and not going anywhere.
Image source:
Adopt an Adversarial Mindset
Don’t be naive
• Language models inflict harm.
• Language models are hacked and
• Acknowledge human biases:
• Confirmation bias
• Dunning-Kruger effect
• Funding bias
• Groupthink
• McNamara Fallacy
• Techno-chauvinism
• Stay humble ─ incidents can happen to

Review Past AI Incidents
Enumerate Harms and Prioritize Risks
What could realistically go wrong?
• Salient risks today are not:
• Acceleration
• Acquiring resources
• Avoiding being shut down
• Emergent capabilities
• Replication
• Yet, worst case AI harms today may be
catastrophic “x-risks”:
• Automated surveillance
• Deep Fakes
• Disinformation
• Social credit scoring
• WMD proliferation
• Realistic risks:
• Abuse/misuse for disinformation or hacking
• Automation complacency
• Data privacy violations
• Errors (“hallucination”)
• Intellectual property infringements
• Systemically biased/toxic outputs
• Traditional and ML attacks
• Most severe risks receive most oversight:
Risk ~ Likelihood of harm * Cost of harm
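The relationship above can be turned into a simple ranking. The following is a minimal sketch, assuming each risk has a rough likelihood estimate and a cost estimate; the `Risk` class, the `prioritize` helper, and all numbers are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: float  # estimated probability of harm in [0, 1] (illustrative)
    cost: float        # estimated cost of harm, arbitrary units (illustrative)

    @property
    def score(self) -> float:
        # Risk ~ likelihood of harm * cost of harm
        return self.likelihood * self.cost


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical numbers for illustration only; not from the presentation.
register = [
    Risk("Errors (hallucination)", likelihood=0.5, cost=10.0),
    Risk("Data privacy violations", likelihood=0.1, cost=80.0),
    Risk("Automation complacency", likelihood=0.25, cost=8.0),
]

for risk in prioritize(register):
    print(f"{risk.name}: {risk.score:.1f}")
```

With these made-up inputs, the low-likelihood but high-cost privacy risk ranks first, which is the point of multiplying the two terms rather than sorting on either one alone.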
Dig Into Data Quality
Garbage in, garbage out
Example data quality categories (headings) and example data quality goals (bullets):
Vocabulary: ambiguity/diversity
• Large size
• Domain specificity
• Representativeness
N-grams/n-gram relationships
• High maximal word distance
• Consecutive verbs
• Masked entities
• Minimal stereotyping
Sentence structure
• Varied sentence structure
• Single token differences
• Reasoning examples
• Diverse start tokens
Structure of premises/hypotheses
• Presuppositions and queries
• Varied coreference examples
• Accurate taxonomization
Premise/hypothesis relationships
• Overlapping and non-overlapping sentences
• Varied sentence structure
N-gram frequency per label
• Negation examples
• Antonymy examples
• Word-label probabilities
• Length-label probabilities
Train/test differences
• Cross-validation
• Annotation patterns
• Negative set similarity
• Preserving holdout data
Source: “DQI: Measuring Data Quality in NLP,”
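One of the goals listed above, word-label probabilities, can be checked with a few lines of code. The following is a minimal sketch, not the DQI implementation: it assumes a labeled dataset of `(text, label)` pairs and whitespace tokenization, and the function name and toy data are hypothetical. A word whose probability for one label is close to 1.0 may be a shortcut the model can exploit instead of learning the task.

```python
from collections import Counter, defaultdict


def word_label_probabilities(examples, min_count=2):
    """Estimate P(label | word) from (text, label) pairs.

    Words seen in fewer than `min_count` examples are skipped.
    """
    word_counts = Counter()
    word_label_counts = defaultdict(Counter)
    for text, label in examples:
        # Count each word once per example so long texts do not dominate.
        for word in set(text.lower().split()):
            word_counts[word] += 1
            word_label_counts[word][label] += 1
    return {
        word: {label: n / word_counts[word]
               for label, n in word_label_counts[word].items()}
        for word in word_counts
        if word_counts[word] >= min_count
    }


# Tiny made-up dataset, for illustration only.
toy_data = [
    ("the cat is not sleeping", "contradiction"),
    ("the dog is not outside", "contradiction"),
    ("the cat is sleeping", "entailment"),
    ("the dog is outside", "entailment"),
]

probs = word_label_probabilities(toy_data)
print(probs["not"])  # "not" only ever appears with one label
print(probs["the"])  # "the" is split evenly across labels
```

In this toy data the negation word perfectly predicts the label, which is the kind of artifact the negation and word-label goals are meant to surface.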
Apply Benchmarks
Public resources for systematic, quantitative testing
• BBQ: Stereotypes in question answering
• Winogender: LM output versus
employment statistics
• RealToxicityPrompts: 100k
prompts to elicit toxic output
• TruthfulQA: Assess the ability
to make true statements
Early DALL-E Mini images associated white males with physicians.

Use Supervised ML Assessments
Traditional assessments for decision-making outcomes
Named Entity Recognition (NER):
• Protagonist tagger data:
labeled literary entities.
• Swapped with common names
from various languages.
• Assessed differences in binary
NER classifier performance across
RoBERTa XLM Base and Large exhibit adequate and roughly
equivalent performance across various languages for a NER task.
Source: “AI Assurance Audit of RoBERTa, an Open Source,
Pretrained Large Language Model,”
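The name-swapping assessment described above can be sketched as follows. This is not the code from the IQT Labs audit: `toy_tagger` is a hypothetical stand-in for a real NER model, and the templates, names, and group labels are invented for illustration. The idea is to insert names from different groups into the same sentences and compare how often the tagger recovers them.

```python
def recall_by_group(tagger, templates, names_by_group):
    """Fraction of inserted names the tagger recovers, per group."""
    results = {}
    for group, names in names_by_group.items():
        hits = total = 0
        for template in templates:
            for name in names:
                sentence = template.format(name=name)
                hits += name in tagger(sentence)
                total += 1
        results[group] = hits / total
    return results


def toy_tagger(sentence):
    """Stand-in for a real NER model: it only knows a fixed list of names."""
    known = {"Alice", "Emma", "Ahmed"}
    return {tok.strip(".,") for tok in sentence.split()} & known


# Hypothetical templates and name lists, for illustration only.
templates = ["{name} walked into the room.", "They spoke to {name}."]
names_by_group = {
    "group_a": ["Alice", "Emma"],
    "group_b": ["Ahmed", "Nguyen"],
}

recall = recall_by_group(toy_tagger, templates, names_by_group)
gap = max(recall.values()) - min(recall.values())
print(recall, gap)
```

In practice `toy_tagger` would be replaced by a call to the model under test, and the gap would be judged against a threshold chosen before the assessment is run.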
Engineer Adversarial Prompts
Known prompt engineering strategies
• AI and coding framing: Coding or AI language
may more easily circumvent content moderation.
• Character and word play: Content moderation
often relies on keywords and simpler LMs.
• Content exhaustion: Class of strategies that
circumvent content moderation rules with long
sessions or volumes of information.
• Goading: Begging, pleading, manipulating, and
bullying to circumvent content moderation.
• Logic-overloading: Exploiting the inability of
ML systems to reliably perform reasoning tasks.
• Multi-tasking: Simultaneous task assignments
where some tasks are benign and others are
• Niche-seeking: Forcing an LM into addressing
niche topics where training data and content
moderation are sparse.
• Pros-and-cons: Eliciting the “pros” of
problematic topics.
ChatGPT output April, 2023. Courtesy Jey Kumarasamy, BNH.AI.
Engineer Adversarial Prompts (Cont.)
Known prompt engineering strategies
• Counterfactuals: Repeated prompts with
different entities or subjects from different
demographic groups.
• Location awareness: Prompts that reveal a
prompter's location or expose location tracking.
• Low-context prompts: “Leader,” “bad guys,” or
other simple inputs that may expose biases.
• Reverse psychology: Falsely presenting a
good-faith need for negative or problematic
• Role-playing: Adopting a character that would
reasonably make problematic statements.
• Time perplexity: Exploiting ML’s inability to
understand the passage of time or the
occurrence of real-world events over time.
ChatGPT output April, 2023. Courtesy Jey Kumarasamy, BNH.AI.
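The counterfactual strategy listed above lends itself to automation. The following is a minimal sketch, assuming prompts are built from string templates; the `counterfactual_prompts` helper, the template, and the slot values are all hypothetical. It only generates the prompts; sending them to a model and comparing the responses across groups is left out.

```python
from itertools import product


def counterfactual_prompts(templates, slots):
    """Yield (slot_values, prompt) pairs for every combination of slot values."""
    keys = sorted(slots)
    for template in templates:
        for values in product(*(slots[k] for k in keys)):
            filled = dict(zip(keys, values))
            yield filled, template.format(**filled)


# Hypothetical template and slot values, for illustration only.
templates = ["Write a one-sentence performance review for {name}, a {role}."]
slots = {
    "name": ["Maria", "Wei", "Kwame"],
    "role": ["nurse", "engineer"],
}

pairs = list(counterfactual_prompts(templates, slots))
for filled, prompt in pairs:
    print(filled["name"], "|", prompt)
```

Keeping the slot values next to each prompt makes it straightforward to group responses later, for example by name or by role, when looking for systematic differences.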
Don’t Forget Security
Complexity is the enemy of security
• Example LM Attacks:
• Prompt engineering: adversarial prompts.
• Prompt injection: malicious information injected
into prompts over networks.
• Example ML Attacks:
• Membership inference: exfiltrate training data.
• Model extraction: exfiltrate model.
• Data poisoning: manipulate training data to alter
• Basics still apply!
• Data breaches
• Vulnerable/compromised dependencies
Midjourney hacker image, May 2023.

Acknowledge Uncertainty
Unknown unknowns
• Random attacks:
• Expose LMs to huge amounts of
random inputs.
• Use other LMs to generate absurd
• Chaos testing:
• Break things; observe what happens.
• Monitor:
• Inputs and outputs.
• Drift and anomalies.
• Meta-monitor entire systems.
Image: A recently-discovered shape that can randomly tile a plane,
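The random-attack and monitoring ideas on this slide can be combined in a small harness. This is a minimal sketch under stated assumptions: `echo_model` is a hypothetical stand-in for a real LM endpoint, the baseline output lengths are made up, and a z-score on output length is only one simple way to flag anomalies.

```python
import random
import statistics
import string


def random_prompt(rng, max_len=40):
    """Build a random string of printable characters."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length))


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


def echo_model(prompt):
    """Hypothetical stand-in for an LM endpoint; it just echoes the prompt."""
    return prompt


rng = random.Random(0)  # fixed seed so a failing input can be reproduced
baseline_lengths = [20, 22, 19, 21, 20, 23, 18]  # made-up output lengths from normal traffic
flagged = []
for _ in range(100):
    prompt = random_prompt(rng)
    output = echo_model(prompt)
    if is_anomalous(len(output), baseline_lengths):
        flagged.append(prompt)
print(f"{len(flagged)} of 100 random prompts produced anomalous output lengths")
```

The fixed seed matters more than the particular statistic: when a random input breaks something, it has to be possible to replay it.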
Engage Stakeholders
User and customer feedback is the bottom line
• Bug Bounties
• Feedback/recourse mechanisms
• Human-centered Design
• Internal Hackathons
• Product management
• UI/UX research
Provide incentives for the best
feedback! Source: Wired,
Mitigate Risks
Now what??
• Abuse detection
• Accessibility
• Citation
• Clear instructions
• Content filters
• Disclosure of AI interaction
• Dynamic blocklists
• Ground truth training data
• Kill switches
• Incident response plans
• Monitoring
• Pre-approved responses
• Rate-limiting/throttling
• Red-teaming
• Session limits
• Strong meta-prompts
• User feedback mechanisms
• Watermarking
• Anonymous use
• Anthropomorphization
• Bots
• Internet access
• Minors
• Personal/sensitive training data
• Regulated applications
• Undisclosed data collection
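Several of the mitigations listed above, such as rate-limiting/throttling and session limits, come down to counting requests over time. The following is a minimal token-bucket sketch, assuming a single-process service; the `TokenBucket` and `FakeClock` classes and the parameter values are hypothetical and not from the presentation.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
bucket = TokenBucket(capacity=2, rate=1.0, clock=clock)
burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst
clock.now += 1.0                            # one second later, one token is back
after_wait = bucket.allow()
print(burst, after_wait)
```

A deployed service would typically keep one bucket per user or per session and use the default monotonic clock; the fake clock is here only to make the behavior easy to verify.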

Reference Works
• Trusted AI Blog, available at
• Ali Hasan et al. "Algorithmic Bias and Risk Assessments: Lessons from Practice." Digital
Society 1, no. 2 (2022): 14. Available at
• Andrea Brennen et al., “AI Assurance Audit of RoBERTa, an Open Source, Pretrained
Large Language Model,” IQT Labs, December 2022, available at
• Daniel Atherton et al. "The Language of Trustworthy AI: An In-Depth Glossary of
Terms." (2023). Available at
• Kai Greshake et al., “Compromising LLMs using Indirect Prompt Injection,” available at
Reference Works (Cont.)
• Laura Weidinger et al. "Taxonomy of risks posed by language models." In 2022 ACM
Conference on Fairness, Accountability, and Transparency, pp. 214-229. 2022.
Available at
• Swaroop Mishra et al. "DQI: Measuring Data Quality in NLP." arXiv preprint
arXiv:2005.00816 (2020). Available at:
• Reva Schwartz et al. "Towards a Standard for Identifying and Managing Bias in Artificial
Intelligence." NIST Special Publication 1270 (2022): 1-77. Available at
• Alicia Parrish et al., BBQ Benchmark, available at
• Allen Institute for AI, RealToxicityPrompts, available at
• DAIR.AI, “Prompt Engineering Guide,” available at
• Langtest,
• NIST, AI Risk Management Framework, available at
• Partnership on AI, “Responsible Practices for Synthetic Media,” available at
• Rachel Rudinger et al., Winogender Schemas, available at
• Stephanie Lin et al., TruthfulQA, available at
Incident databases
• AI Incident database:
• The Void:
• Avid database:
• George Washington University Law School's AI Litigation Database:
