
Elon Musk’s xAI is working on making Grok multimodal


Users may soon be able to input images into Grok for text-based answers.


Illustration by Kristen Radtke / The Verge; Getty Images

Elon Musk’s AI company, xAI, is making progress on adding multimodal inputs to its Grok chatbot, according to public developer documents. In practice, that means users may soon be able to upload photos to Grok and receive text-based answers.

This was first teased last month in a blog post from xAI, which said Grok-1.5V will offer “multimodal models in a number of domains.” The latest update to the developer documents appears to show progress on shipping a new model.

In the developer documents, a sample Python script demonstrates how developers can use the xAI software development kit (SDK) to generate a response from both text and an image: the script reads an image file, sets up a text prompt, and passes both to the SDK to produce a completion.

A sample Python script that demonstrates how developers can use the xAI software development kit library to perform multimodal completion.
Image: xAI
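
The developer documents contain the authoritative sample; the sketch below is only a minimal illustration of the general pattern the article describes (read a local image, pair it with a text prompt, request a text completion). The import paths, the Client class, the chat.create/append/sample calls, and the "grok-vision" model name are assumptions for illustration and may not match the actual xAI SDK surface.

```python
# Illustrative sketch only: the class and method names below are assumed,
# not the confirmed xAI SDK API from the developer documents.
import base64
import os

from xai_sdk import Client            # assumed import path
from xai_sdk.chat import user, image  # assumed helpers for multimodal turns


def describe_image(path: str, prompt: str) -> str:
    # Read the image from disk and encode it as a data URL, mirroring the
    # docs' pattern of starting from a local image file.
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    data_url = f"data:image/jpeg;base64,{encoded}"

    # Authenticate with an API key and open a chat against a vision-capable
    # Grok model (the model name here is a placeholder).
    client = Client(api_key=os.environ["XAI_API_KEY"])
    chat = client.chat.create(model="grok-vision")

    # Combine the text prompt and the image in a single user turn, then ask
    # the model for a text-only completion.
    chat.append(user(prompt, image(data_url)))
    return chat.sample().content


if __name__ == "__main__":
    print(describe_image("photo.jpg", "What is shown in this image?"))
```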

This is a big update for Grok, which xAI first released in November 2023 and makes available to users who pay for the X Premium Plus subscription. The previous update, Grok 1.5 in March, brought improved reasoning capabilities.

The model is trained “on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by ... human reviewers,” according to a blog post from X. Grok-1 was not trained on X data (including public X posts), the blog added. However, Grok does have “real-time knowledge of the world,” including posts on X.

xAI, founded by Elon Musk in March 2023, is a relative newcomer to the AI field and trails competitors such as OpenAI’s ChatGPT. However, according to a blog post from xAI, its Grok 1.5 model is closing the gap with GPT-4 on benchmarks that span grade school to high school competition problems. It’s important to note that benchmarks for large language models are often criticized because a model can score well on a benchmark if that benchmark’s data appears in its training set. It’s sort of like memorizing test answers rather than actually learning the material.

Multimodal conversational chatbots seem to be the next frontier for AI, with multiple advancements announced at Google I/O and OpenAI releasing GPT-4o, so Grok’s lack of multimodal capabilities has left it behind the curve until now.