Mozilla.ai’s Post

2,138 followers

Mozilla and EleutherAI brought together experts to discuss a critical question: How do we create openly licensed and open-access LLM training datasets and how do we tackle the challenges faced by their builders? Mozilla.ai's Senior Director of Product Innovation, Julie V. Belião, participated in the workshop alongside many outstanding experts. Thank you to Kasia Odrozek, Stella Biderman, and Ayah Bdeir.

Julie V. Belião

Sr. Director of Product Innovation @ Mozilla.ai | Executive Consultant, Product, Strategy, Operations

I recently had the privilege of attending the Open Dataset Convening, organized by EleutherAI and Mozilla in Amsterdam. This event was super insightful as we discussed the creation and sustainability of openly licensed and open-access LLM training datasets 👐 Together, we explored how these datasets could be built, how to make them sustainable, and ways to support their producers. Topics included the potential impact on the industry, ensuring inclusivity and equity, addressing copyright and ethical considerations, and managing data privacy rights. Our discussions also delved into developing legal and ethical frameworks, establishing collaborative structures, and ensuring access via cultural and linguistic diversity. This aligns with many of our goals at Mozilla.ai ✨ where we are committed to transparency, access, agency and inclusivity. For more insights into these discussions and their implications, check out the great article published by Kasia Odrozek and Stella Biderman 🙏 (special thanks 💙 to Ayah Bdeir for organizing and inviting us, and to Santiago Martorana for his support and coordination 🤗 ) https://lnkd.in/d_9jjajG

The Dataset Convening: A community workshop on open AI datasets

https://blog.mozilla.org/en/


More Relevant Posts

Alix Dunn

I work with serious troublemakers to facilitate change. Technology ⇆ society.
2w
Ever wonder what it would take to build a training data set from only legally available materials? There are people working on this! I had the pleasure of facilitating a community of cutting-edge teams working to build training data sets that aren't...horrible. All to support a foundation of work contributing to more open development of emerging technologies. Interested in learning more? Check out the blog post below. As always, it was lovely to work with the teams at Mozilla and EleutherAI!

Kasia Odrozek

Director, Insights at Mozilla
2w Edited

Some weeks ago I had the pleasure of co-hosting an in-person workshop for Mozilla, in partnership with EleutherAI, bringing together a group of people passionate about building datasets for AI in a way that is better and fairer for us, the people. Knowing that openness alone does not guarantee legal compliance or ethical outcomes, we asked which decision points can contribute to datasets being more just and sustainable in terms of public good and data rights. It is a true Rubik's cube to figure out how all the elements fit, but it is an immense pleasure to do it as part of a value-driven community. So together with EleutherAI we asked: How do we create openly licensed and open-access LLM training datasets and how do we tackle the challenges faced by their builders? Thank you to our fantastic participants who spent 8.5 hours discussing thorny issues: Jennifer Ding, Mitchell Baker, Julie V. Belião, Kasia Chmielinski, Marzieh Fadaee, Paul Keller, Pierre-Carl Langlais, Greg Leppert, Hynek Kydlíček, Shayne Longpre, Cullen Miller, Victor Miller, Angela Oduor Lungati, Guilherme Penedo, Michael Running Wolf, Max Ryabinin, Andrew Strait, Mark Surman, Anna Tumadottir, Marteen Van Segbroeck, EM Lewis-Jong, Lee White, Leandro von Werra, Maurice Weber, Thomas Wolf, Maximilian Gahntz, Jillian B., and Kathleen Siminyu, and to the great co-hosts Aviya Skowron, Sebastian Majstorovic, Alix Dunn, Ayah Bdeir, and Stefan Baack. Stay tuned for a community publication soon!

Dataset Convening: A community workshop on openly licensed LLM datasets

https://blog.mozilla.org/en/

Brendan Quinn

Data Expert in Law and Technology: Helping Businesses Grow while using all their data lawfully and ethically – Data, Data Protection and IT law, Information Security, and AI Expertise
5mo Edited
An expert predicts that highly paid coders will be replaced by AI within 10 years, after generously supplying code for free to GitHub that was then used to train machine learning algorithms. I wonder whether GitHub's terms and conditions have stated this? No doubt you and I can assume that everything we have posted on social media platforms over the years is also being used to train AI, with its mix of knowledge and problematic free speech. I question the lawfulness of this in Europe under data protection laws such as the GDPR, because posts are mainly opinions and thus personal data, and they also contain copyright material. Code is protected by laws that, depending on the jurisdiction, range from copyright in the code to patent rights and contractual licensing terms. Something that might not be considered is that comments in the code, by its author and by others, are also likely to be personal data where they relate to that individual. Given the impact on people's livelihoods, this use of code in training AI will likely be tested in court. I see tonnes of infringements of data protection and other laws that could lead to material damages, and I am very interested in talking to as many litigation firms as possible, whether law firms or funders. In my book, the Data Protection Implementation Guide, A Legal, Risk and Technology Framework for the GDPR, I even detailed some of the more common infringements. Many of these infringements are now being identified in DPA decisions, but this is only a fraction of the infringements. New obligations will also be created under new laws such as the AI Act in Europe. I would be very interested in coders' views too. #privacy #gdpr #business #dataprotection #aiact #ai #copyright

Suffolk expert says AI will replace human coders in ten years

bbc.com

Javed Alam

Professor emeritus at Youngstown State University
3mo
Every major change in IT is followed by a newcomer taking the top spot, while the old winners are relegated to playing second fiddle and a few notable names, like Digital Research, disappear. This time around, I don't see this scenario changing: a newcomer will eventually emerge from what is now a relatively obscure position. Among the top 3, Google is in the most vulnerable position right now. To keep this post short, I will look only at their crown jewel, search. In fact, in every one of their product spaces they face a stiff challenge and are failing to meet it. A newcomer in the search space, Perplexity, showed how weak their position is. If I had to bet, I would bet that whoever provides the fastest tokens at the lowest price, an open-source LLM that is not big but better trained, open-source tools to work with these models, and interesting user applications will win over any other combination. Perplexity started as wrapper software and now has its own homegrown LLM to meet its needs. If someone wants to try out different open-source LLMs and doesn't want to navigate the complex Hugging Face website, they can try them at labs.perplexity.ai through a chat interface. I started with huggingface.co, a big repository website for open-source LLMs. It has a steep learning curve. Its courses are either too basic, resembling a "hello world" exercise in C, or require one to know a lot already. They lack learning tools at intermediate levels. One can figure the website out, and even use it, but it requires some time commitment. https://lnkd.in/gVz-Nsss

More than an OpenAI Wrapper: Perplexity Pivots to Open Source

https://thenewstack.io
AI Makerspace

7,854 followers
9mo
🕴️ What is an LLM agent in LlamaIndex?

Query engines, built on multiple indexes (indices) and retrievers, take natural language input and return “rich” responses.

Example pattern (quantitative + qualitative indices):
- Natural language query (provided to SQL database)
- SQL query: written automatically from natural language
- SQL response: given in natural language
- Transformed query given SQL response
- Query engine response: Yea or nay about sufficient context
- Final response: Addresses known and unknown parts of the initial query

In code: https://lnkd.in/gCheuRea

#agents #llamaindex #llmops
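The pattern in this post can be roughed out in plain Python. This is only a sketch under stated assumptions: it does not use LlamaIndex itself, the functions `text_to_sql`, `transform_query`, and `search_documents` are hypothetical stand-ins for the LLM and retriever steps, and the table and documents are made-up toy data.

```python
import sqlite3
from typing import Dict, List, Tuple


def text_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM step that writes SQL from natural language."""
    # Hard-coded for the demo; a real system would generate this from the question.
    return "SELECT name, population FROM city ORDER BY population DESC LIMIT 1"


def transform_query(question: str, rows: List[Tuple[str, int]]) -> str:
    """Rewrite the question using what the SQL step has already answered."""
    name = rows[0][0]
    return f"history of {name}"


def search_documents(docs: Dict[str, str], query: str) -> List[str]:
    """Toy qualitative retriever: return documents whose key appears in the query."""
    return [text for key, text in docs.items() if key in query]


def answer(question: str, conn: sqlite3.Connection, docs: Dict[str, str]) -> str:
    rows = conn.execute(text_to_sql(question)).fetchall()  # quantitative part
    follow_up = transform_query(question, rows)            # transformed query
    hits = search_documents(docs, follow_up)               # qualitative part
    sufficient = bool(hits)                                # "yea or nay" on context
    name, population = rows[0]
    known = f"{name} has the largest population ({population})."
    unknown = hits[0] if sufficient else "No qualitative context was found."
    return f"{known} {unknown}"


def demo() -> str:
    # Toy data, invented for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
    conn.executemany("INSERT INTO city VALUES (?, ?)", [("Alpha", 300), ("Beta", 200)])
    docs = {
        "Alpha": "Alpha was founded as a trading post.",
        "Beta": "Beta grew around a harbor.",
    }
    return answer("Tell me about the history of the most populous city", conn, docs)
```

Calling `demo()` runs the SQL step first and then uses its result to drive the document lookup, so the final string covers both the part the database could answer and the part it could not.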

Google Colaboratory

colab.research.google.com

ORCRO

99 followers
5mo
Andrew Katz, our CEO, will speak at FOSDEM 2024 on Sunday 3 February, in the vibrant city of Brussels, Belgium. In a world where technology often outpaces legislation, Andrew’s talk, entitled “Using code generated by AI: issues, misconceptions, and solutions”, will tackle pressing questions, including:
◾ What are the copyright issues with ingestion of training data, configuration and transfer of models, and the generation and use of code?
◾ Can GPL code inadvertently be incorporated into your non-GPL codebase?
◾ Who owns the copyright in generated code?
As a legal expert specialising in open source and AI, Andrew will provide insights and consider potential solutions. Read the full blog post here 👇 https://lnkd.in/eM9GvV3W #opensource #technology #AI

Andrew Katz to speak at FOSDEM | 3 & 4 February 2024 - Moorcrofts

moorcrofts.com
John Ellis

CTO at Linnworks
8mo Edited
My 2¢... generative "AI" is a marketing term that attempts to simplify the commoditization of toolchains that have been built & enhanced over decades. Before DALL·E came around we could muck about with VQGAN+CLIP, where OpenAI's CLIP model guides VQGAN's generative network to produce images based on text prompts. A low-code way to try out these lower-level libraries and experiment with different trained models is Katherine Crowson's Colaboratory notebook, where you can easily swap out models and parameters, and even tweak the Python script. If you are interested in seeing how databases like Flickr and WikiArt contribute to the "Generative AI" output you see today, this is a great way to get into the weeds: https://lnkd.in/d96RYgak Victor Basu's explainer on VQGAN and CLIP is great as well: https://lnkd.in/dYrPcuvN #generativeai #vqgan #clip #ai

Google Colaboratory

colab.research.google.com

Madhusudhan Reddy

AI Researcher | LLM's | Problem-Solving Skills | Entrepreneur
8mo Edited
The Lazy Predict library! 🚀

This amazing Python tool simplifies the model selection process for data scientists. With just a few lines of code, Lazy Predict quickly builds and evaluates a wide range of machine-learning models, helping you identify the most promising ones for your dataset. It's a game-changer for anyone looking to streamline their model selection and save valuable time. Give it a try and supercharge your data science projects.

Docs: https://lnkd.in/gQdHCDeM
Colab: https://lnkd.in/g5hY9FSj

Shankar P. #DataScience #MachineLearning #LazyPredict #Python #AI #nlp #naturallanguageprocessing #genai #llm

Google Colaboratory

colab.research.google.com
Timothy Dolan

Data Scientist | MLOps
4mo Edited
Create synthetic instruction datasets using open-source LLMs and Bonito🐟! With Bonito, you can generate synthetic datasets for a wide variety of supported tasks. The Bonito model introduces a novel approach for conditional task generation, transforming unannotated text into task-specific training datasets to facilitate zero-shot adaptation of large language models on specialized data. This methodology not only improves the adaptability of LLMs to new domains but also showcases the effectiveness of synthetic instruction tuning datasets in achieving substantial performance gains. AutoBonito🐟: https://lnkd.in/efNm3--J Original Repo: https://lnkd.in/ejVWbSdK Paper: https://lnkd.in/e34JNzrz

Google Colaboratory

colab.research.google.com
Open Source Initiative (OSI)

22,673 followers
2mo
In 2022 the OSI started an in-depth global initiative to engage key players, including corporations, academia, the legal community and organizations and nonprofits representing wider civil society, in a collaborative effort to draft a definition of Open Source AI that ensures that society at large can retain agency and control over the technology.

The Open Source AI Definition gets closer to reality with a global workshop series

https://opensource.org
Anton Parkhomenko

Data & AI Solutions | Langchain consultancy | Custom AI Agents | AI-enabled Operational Excellence
9mo Edited
User Interview Summarizer v0.2

A while ago I shared a mini-app for user interview summarization (https://lnkd.in/e_wi-uf4), and now it's time for an update.

Google Colab link: https://lnkd.in/eBiix5td

Features added:
- Whisper model selection (medium, large-v2, ..)
- Language selection (works quite well with Ukrainian)
- Structure templates (meeting notes, VPC, generic summary, add your own)
- Sources: file, Google Drive, YouTube
- Formats: .mp3, .mp4, .txt, YouTube URL

It now supports recordings of any length, choosing the summarization method depending on the size: gpt-4, gpt-3.5-turbo-16k, or langchain map_reduce.

The bottom line is the same: basically, this is what you will (most likely) find under the hood of every AI-powered user research assistant app, but customizable as hell 😈 It turned out to be quite viable and almost free, as long as you stay within your trial Colab and OpenAI budget or run it on your own CUDA hardware.

YouTube walkthrough: https://lnkd.in/e3xfneip

#python #ai #promtengineering #userresearch #openai #whisper #summarization #langchain
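The size-based choice of summarization method described in this post could look roughly like the sketch below. It is not the app's actual code: the word limits are hypothetical values picked for illustration, `first_words` is a stand-in for a real LLM call, and the map-reduce step is written by hand rather than through LangChain.

```python
from typing import Callable, List

# Hypothetical limits (in words), chosen only for illustration.
SHORT_LIMIT = 5_000
MEDIUM_LIMIT = 11_000


def choose_method(transcript: str) -> str:
    """Pick a summarization route from the transcript length."""
    n_words = len(transcript.split())
    if n_words <= SHORT_LIMIT:
        return "gpt-4"
    if n_words <= MEDIUM_LIMIT:
        return "gpt-3.5-turbo-16k"
    return "map_reduce"


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def map_reduce_summarize(text: str, summarize: Callable[[str], str], chunk_words: int = 1000) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial = [summarize(chunk) for chunk in split_into_chunks(text, chunk_words)]
    return summarize(" ".join(partial))


def first_words(text: str, n: int = 5) -> str:
    """Stand-in for an LLM call: keeps the first n words of the text."""
    return " ".join(text.split()[:n])
```

For example, `choose_method(transcript)` returns one of the three route names, and `map_reduce_summarize(transcript, first_words)` shows the shape of the long-recording path with the stub summarizer swapped in for a model.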

Google Colaboratory

colab.research.google.com

