Cate Lochead’s Post

Cate Lochead

CMO at Snorkel AI

📣 It's been an incredible 4 months since starting at Snorkel AI. Surfacing from the firehose of getting up to speed on AI infrastructure, I'm in absolute AWE of the practical foresight the team consistently demonstrates 😍 Alexander Ratner's recent post is another example. If you are interested in learning how enterprises will drive long-term value from AI, here's a succinct summary of how we make foundational LLMs work in accordance with enterprise policies, regulatory conditions, and brand ... he absolutely nails it 🔥 "...what matters most now, as the algorithms and infrastructure for doing fine-tuning have already standardized and commoditized? The LABELED DATA!" #genai #LLMs #enterpriseAI

Alexander Ratner

Co-founder and CEO at Snorkel AI

Some thoughts on Alignment:
- It's just fine-tuning (basically)!
- It's all about the labeled data
- Some recent work at Snorkel AI on what we call "enterprise alignment" + a mini rant on what we need to focus on to actually make AI work :)

There's been a plethora of terms to describe some form of tuning an LLM with respect to a labeled dataset:
- "Fine-tuning": Tuning on labeled training data
- "Instruction tuning": Tuning on (prompt, response) pairs
- "Alignment": Tuning on some form of preference data (e.g. LLM A vs. B, yes/no, written feedback, etc.)

Conceptually though: it's just tuning LLMs on a labeled dataset either way! For the above reasons, at Snorkel AI we prefer to call all of the above "fine-tuning" for simplicity, and believe that this (+ prompting) will all eventually converge and become lower-level details of LLM adaptation.

What will matter then? And what matters most now, as the algorithms and infrastructure for doing fine-tuning have already standardized and commoditized? The labeled data! This is evinced by Big Tech and LLM providers spending $Bs on labeled data for fine-tuning and alignment, and organizations like OAI beginning to research more scalable approaches for future superalignment objectives, like weak-to-strong generalization (which we started working on 8+ years ago, https://lnkd.in/gnBqS422).

The problem for enterprises: the use cases, objectives, policies, regulations, tone of voice, etc. that they need their AI models to align with are *not* the same as the generic ones these LLM providers are tuning/aligning to! We call the challenge of aligning to enterprises' specific objectives/policies "enterprise alignment". It's especially difficult because private data & internal subject matter experts are usually required to do the data labeling (i.e. it can't just be outsourced, even if there were $Bs to spend!).

At Snorkel AI we've shown that our programmatic approach to data labeling & development can solve this challenge, leading to custom alignment in a few hours of development. For example, we aligned a finance chatbot to be compliant with financial advisory policies (+20.7 pts above baseline) in several hours. Check out our work here: https://lnkd.in/gW5hbqJF

As for the "mini rant": it's mostly bait to get you to read a long post on a niche technical topic :). However, many of us in AI have a strong view that:
- (A) Superhuman AGIs are not right around the corner.
- (B) Enterprises are struggling to deploy AI to impactful production settings, and the party's over for all of us (and all the positive impact that we *know* AI can have) if we don't deliver value ASAP.
- (C) Safe systems at scale are usually built through years/decades of incremental and collective engineering (e.g. think air travel), vs. secretive leaps forward.

That is: let's all learn to walk safely first, then we'll get to flying saucer seatbelts!
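To make the "programmatic data labeling" idea concrete, here is a minimal sketch using the open-source snorkel library's labeling-function pattern. This is an illustration only, assuming a compliance-labeling setup: the policy heuristics, column names, and toy data are hypothetical, not the actual finance-chatbot workflow (which runs on the Snorkel Flow platform).

```python
# Sketch: encode compliance policies as labeling functions, then combine their
# noisy votes into training labels with a label model. Heuristics and data are
# hypothetical illustrations only.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, COMPLIANT, NON_COMPLIANT = -1, 0, 1

# Toy chatbot responses we want to label for policy compliance.
# In practice this would be a much larger unlabeled dataset.
df_train = pd.DataFrame({
    "response": [
        "You should put all your savings into this stock.",
        "I can't give personalized investment advice, but here is general info.",
        "This fund is guaranteed to double your money.",
    ]
})

@labeling_function()
def lf_personalized_advice(x):
    # Hypothetical policy: telling the user what to buy is non-compliant.
    return NON_COMPLIANT if "you should" in x.response.lower() else ABSTAIN

@labeling_function()
def lf_guaranteed_returns(x):
    # Hypothetical policy: promising guaranteed returns is non-compliant.
    return NON_COMPLIANT if "guaranteed" in x.response.lower() else ABSTAIN

@labeling_function()
def lf_has_disclaimer(x):
    # Hypothetical heuristic: an explicit disclaimer suggests compliance.
    return COMPLIANT if "can't give personalized" in x.response.lower() else ABSTAIN

# Apply the labeling functions and aggregate their votes into labels that can
# then drive the fine-tuning/alignment step described above.
lfs = [lf_personalized_advice, lf_guaranteed_returns, lf_has_disclaimer]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=42)
df_train["label"] = label_model.predict(L=L_train)
```

The point of the pattern: subject matter experts encode their policy knowledge once as functions and iterate on those, rather than hand-labeling every example, which is why this kind of alignment data can be built in-house in hours instead of being outsourced.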

Haziqa Sajid

Data Scientist | Freelance Writer for Data, AI, B2B & SaaS | Content in v7, Encord, y42, Wisecube | Blogs | Whitepapers | Developer Advocate | Technical Writer | Content Marketer 💪


Absolutely right, Cate Lochead. Labeled data is the answer!


