🤔 Waiting for the next embedding model? Don't! Just tell us what you want your embeddings to excel at, e.g., car insurance claims, financial news, or Spanish dialogues. Specify your wish in a prompt; that prompt is your only input to our API. In about 30 minutes, we deliver a ready-to-use, fine-tuned embedding model that can be loaded via SentenceTransformers. Behind the scenes, we take care of everything else: generating useful synthetic data, managing the train-eval-test ML workflow, and finally uploading the fine-tuned model to the Hugging Face Hub. Yep, so much happens under this very minimal UI abstraction!
This is a new feature that we are alpha-testing with invited users: a minimalistic fine-tuning UX that eliminates the need to upload reference data or mine triplets and hard negatives manually. As a user, you simply specify your expectations. For instance, "I want my embeddings to excel at biomedical literature" or, as a more detailed instruction, "Please make it more effective on various subfields of artificial intelligence, particularly focusing on distinctions between machine learning, deep learning, and neural networks."
But how can we ensure the quality of the fine-tuned models? By feeding them high-quality data! To be frank, that is not an easy job, especially when the data is synthetic and comes from LLMs. It's easy to get started but hard to get right: simple prompting yields some ("boring") results, but producing diverse, effective triplets with genuinely hard negatives requires significant prompt engineering.
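To make the hard-negative idea concrete, here is a minimal, self-contained sketch (not the service's actual pipeline): given an anchor embedding and a positive, it keeps candidate negatives that sit close to the anchor yet clearly below the positive's similarity, ranking the hardest first. All vectors and the margin value are toy assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(anchor, positive, candidates, margin=0.05):
    """Keep candidates less similar to the anchor than the positive
    (by at least `margin`), ranked with the hardest negatives first."""
    pos_sim = cosine(anchor, positive)
    scored = [(c, cosine(anchor, c)) for c in candidates]
    hard = [(c, s) for c, s in scored if s < pos_sim - margin]
    return sorted(hard, key=lambda t: -t[1])

# Toy 3-d "embeddings".
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negatives = [
    [0.8, 0.3, 0.1],  # hard negative: close to the anchor
    [0.0, 1.0, 0.0],  # easy negative: unrelated
]
ranked = mine_hard_negatives(anchor, positive, negatives)
print([round(sim, 2) for _, sim in ranked])  # hardest first
```

The hard negative (high similarity, wrong answer) is what makes a triplet informative for training; the easy one contributes almost nothing to the loss.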
We propose a Stochastic Augmented Generation framework, which has proven highly effective at generating useful training data for embedding models under a configurable budget. Have a look at the graphic to find out more.