Clem Delangue 🤗’s Post

Co-founder & CEO at Hugging Face

IMO, open datasets are more impactful than open models these days! You can now filter almost 200,000 of them on HF by modality, size and format: https://lnkd.in/dp_RqpY

Julien Chaumond

CTO at Hugging Face

🎉 Just shipped! Big update of the Hugging Face Datasets page. You can now filter datasets:
1. By Modalities (🖼️ Image, 🔊 Audio, 📝 Text, ...)
2. By Dataset Size (from 1k to ∞ samples)
3. By Format (JSON, CSV, Parquet, ...)
Should be easier to find the perfect dataset(s) for your next project(s)
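
For readers who prefer scripting to the web UI, the three filters can probably be approximated from Python. This is only a minimal sketch: it assumes the `huggingface_hub` package is installed and that the Hub exposes these filters as tags of the form `modality:…`, `size_categories:…` and `format:…`. The helper names `build_filter_tags` and `find_datasets` are hypothetical, and the tag values shown are examples, not an exhaustive list.

```python
from typing import Iterator, List, Optional


def build_filter_tags(
    modality: Optional[str] = None,
    size_category: Optional[str] = None,
    data_format: Optional[str] = None,
) -> List[str]:
    """Turn the three UI filters into tag strings (tag scheme is an assumption)."""
    tags = []
    if modality:
        tags.append(f"modality:{modality}")
    if size_category:
        tags.append(f"size_categories:{size_category}")
    if data_format:
        tags.append(f"format:{data_format}")
    return tags


def find_datasets(tags: List[str], limit: int = 10) -> Iterator[str]:
    """Yield ids of datasets carrying all tags. Needs network access."""
    # Imported lazily so build_filter_tags stays usable without the package.
    from huggingface_hub import HfApi

    for info in HfApi().list_datasets(filter=tags, limit=limit):
        yield info.id


tags = build_filter_tags(
    modality="image", size_category="1K<n<10K", data_format="parquet"
)
print(tags)
```

Calling `list(find_datasets(tags))` would then query the Hub; the results depend on what is published at the time, so none are shown here.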

12 Comments

Linda Yang

Could we have a filter by last-refreshed date as well? I want to use the latest ones…

Umar Butler

Assistant Director of Data Science @ Attorney-General of Australia

Clem Delangue 🤗 This is great! One thing I would really love, though, as an avid user of Hugging Face 🤗 (to the extent that it’s one of my most DNS-requested domains), is the ability to keyword search, or even better vector search, for models and datasets based on their tags and READMEs (I know full-text search exists, but AFAIK it searches files, not repositories). Right now, if you search for “law”, neither my EmuBert Australian legal model nor my Open Australian Legal Corpus shows up because they don’t contain the word “law” in their titles. Instead, I have to use Google to find domain-specific models and datasets, since there’s no guarantee that they have relevant keywords in their titles.
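
To make the request concrete, here is a rough sketch of repository-level keyword search that matches against tags and README text as well as the title. It is not how the Hub works; the records, ids and README snippets below are made up for illustration, and the scoring (count of matching query terms) is deliberately naive.

```python
import re
from typing import Dict, List, Set

# Hypothetical repository records, invented for this example.
REPOS: List[Dict] = [
    {
        "id": "example/legal-model",
        "tags": ["legal", "australia"],
        "readme": "A language model for Australian law and legal text.",
    },
    {
        "id": "example/cat-photos",
        "tags": ["image"],
        "readme": "Pictures of cats.",
    },
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query: str, repos: List[Dict]) -> List[str]:
    """Return repo ids whose id, tags or README share a term with the query."""
    terms = tokenize(query)
    hits = []
    for repo in repos:
        haystack = tokenize(" ".join([repo["id"], *repo["tags"], repo["readme"]]))
        score = len(terms & haystack)
        if score:
            hits.append((score, repo["id"]))
    # Best score first, ties broken alphabetically by id.
    return [repo_id for _, repo_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


print(search("law", REPOS))
```

Here the query “law” finds the first record through its README even though the word is absent from the id. A vector search would swap the term overlap for embedding similarity.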

The Artificially Intelligent Enterprise

Thanks for sharing!

Venugopal Adep

Lead Data Scientist | AI Leader | General Manager at Reliance Jio | LLM & GenAI Pioneer | AI Evangelist

18h

Useful tips

Venkat Gella

https://www.linkedin.com/posts/venkat-gella-b77822270_the-nerves-of-the-umbilical-cord-in-man-and-activity-7215156610338275329--chr?utm_source=combined_share_message&utm_medium=member_android

Shyam Sundar

Technology Evangelist | Code Writer | Customer-Centric Innovator | Senior Specialist Lead at Deloitte

This is going to foster a lot of innovation and experimentation.

Pablo M.

I ❤️ Hugging Face

Pierre-Henri Delville, PhD

Data Scientist | PhD in Physics | Passionate about AI & Machine Learning | #Python #SQL #MATLAB

You are right. Thank you!

Sukriti Sarkar

Love this! Towards truly open-source AI.

Jyothirmai Kottu

Actively looking for SDE/MLE opportunities | MS in Computer Science | Specialised in AI/ML and SDE

Very useful feature! Thank you for sharing. 🙌🏻

More Relevant Posts

Ayush Gupta

Building Private LLMs |  Apple, 🌲Stanford University
1w
If AI is the new electricity, data is the power grid ⚡
Hugging Face has been the largest repository of public datasets for training LLMs used by enterprises. Discovering relevant, aligned data faster and better is a significant value add. However, in my experience working with some of the biggest enterprises, performance on public data often doesn't align well with their custom use cases. Creating a 𝐆𝐨𝐥𝐝𝐞𝐧 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐃𝐚𝐭𝐚𝐬𝐞𝐭 tailored to one's use case is crucial for accurate evaluation.
Here's a strategy I recommend for training new models. This gets simpler if your enterprise uses tools like Snowflake, SuperAnnotate, or Lilac (acquired by Databricks).
1. 𝐂𝐨𝐥𝐥𝐞𝐜𝐭 𝐚 𝐆𝐨𝐥𝐝𝐞𝐧 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤 𝐃𝐚𝐭𝐚𝐬𝐞𝐭: Ensure it's human-reviewed and covers all possible outcomes in a balanced manner.
2. 𝐋𝐞𝐯𝐞𝐫𝐚𝐠𝐞 𝐔𝐬𝐞 𝐂𝐚𝐬𝐞-𝐒𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚: If you don't have your own, use Hugging Face's filters to find relevant datasets. Filter by use case, use in public models, popularity, number of examples, etc.
3. 𝐑𝐮𝐧 𝐃𝐢𝐬𝐭𝐫𝐢𝐛𝐮𝐭𝐢𝐨𝐧 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬: Compare the Golden Benchmark dataset with candidate public datasets to identify the closest match to your use case. While ML-driven clustering and analysis techniques are helpful, manual review is essential.
4. 𝐓𝐫𝐚𝐢𝐧 𝐘𝐨𝐮𝐫 𝐌𝐨𝐝𝐞𝐥: If you find a good match, train your LLM/model on that data. If not, invest time and resources in collecting your own data. Period.
5. 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐞 𝐂𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐭𝐥𝐲: Always evaluate on the Golden Benchmark dataset, prioritizing it over public/academic benchmarks.
Data is crucial. I've personally seen performance jumps of over 10% with data refinement. Let me know your thoughts.
#ArtificialIntelligence #MachineLearning #DataScience #HuggingFace #ModelTraining #DataQuality #AI #ML #Benchmarking
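One way the distribution-analysis step might look in practice is sketched below. The post does not name a method, so this is an assumption: it compares word-frequency distributions with the Jensen-Shannon divergence and ranks candidates by closeness to the benchmark. The toy texts and dataset names are invented, and real data would call for better tokenization plus the manual review the post insists on.

```python
import math
from collections import Counter
from typing import Dict, List


def token_distribution(texts: List[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lowercased token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution halfway between p and q.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a: Dict[str, float]) -> float:
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Invented toy data standing in for a golden benchmark and public candidates.
golden = ["refund my order", "cancel my subscription"]
candidates = {
    "support_tickets": ["cancel my order", "refund my subscription please"],
    "movie_reviews": ["great film", "boring plot"],
}

gold_dist = token_distribution(golden)
ranked = sorted(
    candidates,
    key=lambda name: js_divergence(gold_dist, token_distribution(candidates[name])),
)
print(ranked)  # closest candidate first
```

A lower divergence only says the vocabularies look alike; it does not confirm that labels or task framing match, which is where manual review comes in.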
SufeelAhmad Katevadi

Data Analyst | AI-Enabled Data Professional | Power BI, Python, SQL, Excel |
10mo
Day 7: COMPLETED
✓ Binary search algorithm in a 2D array
✓ Matrix is sorted row-wise and column-wise
✓ Just cracked a binary search algorithm problem! 🕵️
This time, it was all about transforming a 2D array: turning rows into columns to create a matrix. 💡 The problem pushed me to think creatively and optimize my solution.
#100daysofcodechallenge #dsawithkunal #dsachallenge #100daysofcode
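The post does not include its solution, so the following is only a sketch of one common approach to searching a matrix whose rows and columns are both sorted in ascending order: run a binary search on each row whose value range could contain the target. The function name and sample grid are made up for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def search_sorted_matrix(
    matrix: List[List[int]], target: int
) -> Optional[Tuple[int, int]]:
    """Return (row, col) of target, or None if it is absent."""
    for r, row in enumerate(matrix):
        if not row:
            continue
        if target < row[0]:
            # Columns are sorted too, so every later row starts even higher.
            break
        if target > row[-1]:
            continue
        # Binary search within this row; c is in range since target <= row[-1].
        c = bisect_left(row, target)
        if row[c] == target:
            return (r, c)
    return None


grid = [
    [1, 4, 7, 11],
    [2, 5, 8, 12],
    [3, 6, 9, 16],
    [10, 13, 14, 17],
]
print(search_sorted_matrix(grid, 9))   # (2, 2)
print(search_sorted_matrix(grid, 15))  # None
```

This costs O(m log n) for an m-by-n matrix. A "staircase" walk from the top-right corner reaches O(m + n) without any binary search, which can be faster when the matrix is close to square.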
CONTECO e.U.

67 followers
6mo
Day 3: Multidimensional Arrays 📏
Explore multidimensional arrays. Create a 2D array with:
import numpy as np
a = np.array([[1, 2], [3, 4]])
Use .shape to see its dimensions:
print(a.shape)
Great for representing matrices or grids! 🌐 #DataScienceJourney #30DaysOfNumPyDay03
✨ Now, it's your turn! Create your own 2D arrays, try different shapes, and experiment with accessing elements. Share your discoveries in the comments or on social media using #NumPyMasteryChallenge and #30DaysOfNumPy. Let's dive into the world of multidimensional arrays together! 💡💬
Muhammad Abdul Hanan

BS Artificial intelligence | FAST NUCES 26| Leetcode | Generative AI | LLMs | Web developer | Python developer | Future Ai and machine Learning Engineer
1mo
Greetings LinkedIn family,
Code 1:
Edge Detection and Contour Extraction: It loads an image and converts it to grayscale. Using the Canny edge detection algorithm, it identifies edges in the grayscale image. Contours are then extracted from the detected edges.
Vectorization and Tokenization: The contour points are extracted and stored as vectors. Each contour point serves as a token. These tokens (contour points) are vectorized, meaning they are represented as numerical arrays.
Finally, the original image, its grayscale version, and multi-color text overlaid on the original image are displayed using Matplotlib.
Code 2:
Import Libraries: Import necessary libraries - Librosa for audio processing and NumPy for numerical operations.
Load Audio: Load an audio file and its sample rate.
Extract MFCCs: Use Librosa to compute MFCC features from the audio signal.
Tokenize: Treat each MFCC frame as a token, organizing them into a matrix for further processing.
Vectorize: Convert each MFCC token into a numerical vector using NumPy arrays.
Print: Display the vectorized form of the audio.