Dusty Chadwick’s Post

VP of Engineering / Head of AI @ Voze | Leadership in AI (AI Utah 100 Honoree), Cloud Architecture and Data Engineering

Hallucinations in LLMs are inevitable, no seriously! Josh Fritzsche and I have worked hard at Voze to tune, prompt and even plead 🙏 with LLMs to prevent the occasional 🍄 hallucination. It seemed like the better we got, the harder the edge cases became to solve, and eventually we realized that hallucinations tend to happen most often when LLMs are asked to predict or anticipate from "nothing". I remember at one point nearly going insane with frustration because when LLMs hallucinated, they were EPIC! 😱 Then we pushed through it and benefited from the effort!

How did we solve this common problem? We didn't (not completely). We got better at identifying the issues as they happened. Once we identified them in the data, we were able to start looking for patterns. The same catalysts that cause a hallucination once will, if left unchanged, cause it over and over again.

Our solution is astoundingly simple. Once you know it's likely to happen..... let it! Accept that the model will 🍄, but also be preemptive: don't fixate on it, and use alternative data sources, generators or providers. Change the conditions that cause it, but continue to monitor it as it happens. As Dan Caffee is fond of saying to me on a regular basis, "Dusty, we need the system to be adaptive and provide mechanisms to learn from its past." Well, he is right! All we needed was data and the people who contribute to its quality. Huge props 👏 to the strong support from Kathy's and Janelle's audit teams. Armed with knowledge, a desire to learn from past mistakes, and confidence that we could correct these problems, we could focus on the data to prevent their impact going forward.

When doing millions of LLM (Generative AI) calls regularly, Josh and I have learned that hallucinations will happen. We also know exactly how we will handle and prevent them going forward! Failure is not doing anything and accepting defeat.
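For anyone who wants a concrete picture of the pattern described above, here is a minimal sketch (not Voze's actual code; the function names and the detection heuristic are illustrative assumptions): wrap each LLM call with a hallucination check, log flagged responses for the audit teams who own data quality, and fall back to an alternative provider or data source when a known-risky condition shows up.

```python
# Illustrative sketch only: accept that hallucinations will happen, detect the
# risky conditions, log them for audit, and reroute to an alternative provider
# or data source instead of trying to prompt the problem away.
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-hallucination-monitor")


@dataclass
class LLMResult:
    text: str
    provider: str
    flagged: bool = False


def looks_like_hallucination(prompt: str, response: str) -> bool:
    """Placeholder heuristic: flag answers invented from 'nothing'.

    In practice this would be grounded checks (entities appear in the source
    data, numbers match the record, citations resolve, etc.).
    """
    no_source_data = len(prompt.strip()) == 0
    overconfident = "definitely" in response.lower()
    return no_source_data or (overconfident and no_source_data)


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> LLMResult:
    """Call the primary model; if the output looks hallucinated, log it and
    retry against an alternative provider or data source."""
    response = primary(prompt)
    if looks_like_hallucination(prompt, response):
        # Record the catalyst so the pattern is visible to the audit teams.
        log.info("hallucination flagged | prompt=%r response=%r", prompt, response)
        if fallback is not None:
            return LLMResult(text=fallback(prompt), provider="fallback", flagged=True)
        return LLMResult(text=response, provider="primary", flagged=True)
    return LLMResult(text=response, provider="primary")
```

The point of the sketch is the shape, not the specifics: detect, log for the people who contribute to data quality, and reroute, rather than expecting tuning and prompting alone to make hallucinations disappear.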

Hallucination is Inevitable: An Innate Limitation of Large Language Models

arxiv.org

Ber Zoidberg

Building web apps and data pipelines at-scale

The reality is that we approach LLMs wrong, and I think we do this because we don't understand the complex iterative process of human cognition. When you think about something, even something as simple as recalling information you have stored in your own head, most people go through a series of steps to self-validate that information before just spewing it out. LLMs don't have this luxury: the functions we use for training don't enforce this, the models don't seem to naturally develop this within their networks, and we are so obsessed with simple single-iteration IO against a single universal model that we completely miss that human intelligence comes from internal iteration and cross-communication across multiple specialized neural networks. The future of AI, and the solution to hallucination, is engineering a system of networks working together to solve problems, answer questions, and so on, beyond what we currently attempt with MoE models.
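A rough sketch of what that internal iteration could look like as an engineering pattern (the two-role split and the callable names are assumptions, not a reference to any specific system): one model drafts, a second specialized model critiques, and the draft is revised until the critic stops finding unsupported claims.

```python
# Illustrative sketch only: a draft/critique/revise loop across two specialized
# model roles, standing in for the "internal iteration and cross-communication"
# described above. `generate` and `critique` are placeholder callables.
from typing import Callable


def iterate_until_validated(
    question: str,
    generate: Callable[[str], str],       # drafting model
    critique: Callable[[str, str], str],  # validating model; returns "" if no issues
    max_rounds: int = 3,
) -> str:
    """Draft an answer, let a second model critique it, and revise until the
    critic finds nothing unsupported (or the round limit is reached)."""
    draft = generate(question)
    for _ in range(max_rounds):
        issues = critique(question, draft)
        if not issues:
            return draft  # the critic signed off
        # Feed the critique back so the next draft addresses it.
        draft = generate(
            f"{question}\n\nRevise this answer; issues found: {issues}\n\n"
            f"Previous answer: {draft}"
        )
    return draft  # best effort after max_rounds
```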
