Google’s AI just got ears

AI chatbots are already capable of “seeing” the world through images and video. Now Google has announced audio-to-text functionality as part of its latest update to Gemini Pro. In Gemini 1.5 Pro, the chatbot can “hear” audio files uploaded into its system and extract text information from them.

The company has made this LLM version available as a public preview on its Vertex AI development platform. This lets more enterprise-focused users experiment with the feature and broadens its user base after a more private rollout in February, when the model was first announced and offered only to a limited group of developers and enterprise customers.

1. Breaking down + understanding a long video

I uploaded the entire NBA dunk contest from last night and asked which dunk had the highest score.

Gemini 1.5 was incredibly able to find the specific perfect 50 dunk and details from just its long context video understanding! pic.twitter.com/01iUfqfiAO

— Rowan Cheung (@rowancheung) February 18, 2024

Google shared the details about the update at its Cloud Next conference, which is currently taking place in Las Vegas. After calling the Gemini Ultra LLM that powers its Gemini Advanced chatbot the most powerful model of its Gemini family, Google is now calling Gemini 1.5 Pro its most capable generative model. The company added that this version is better at learning without additional tweaking of the model.

Gemini 1.5 Pro is multimodal in that it can convert different types of audio into text, including TV shows, movies, radio broadcasts, and conference call recordings. It’s also multilingual in that it can process audio in several different languages. The LLM may also be able to create transcripts from videos; however, their quality may be unreliable, as mentioned by TechCrunch.

When the model was first announced, Google explained that Gemini 1.5 Pro uses a token system to process raw data. A million tokens equate to approximately 700,000 words or 30,000 lines of code. In media form, that equals about an hour of video or around 11 hours of audio.
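
To make those figures concrete, here is a rough back-of-the-envelope sketch in Python. It assumes token usage scales linearly with the amount of content, which is a simplification, and it uses only the approximate equivalences quoted above. The function names are hypothetical and are not part of any Google API.

```python
# Rough estimate only: assumes tokens scale linearly with content size,
# using the approximate per-window figures quoted in the article.
CONTEXT_WINDOW_TOKENS = 1_000_000

# How much of each content type roughly fills one million tokens.
CAPACITY_PER_WINDOW = {
    "words": 700_000,
    "code_lines": 30_000,
    "video_hours": 1,
    "audio_hours": 11,
}


def estimate_tokens(**amounts: float) -> float:
    """Estimate the tokens used by a mix of content (hypothetical helper)."""
    total = 0.0
    for kind, amount in amounts.items():
        if kind not in CAPACITY_PER_WINDOW:
            raise ValueError(f"unknown content type: {kind}")
        # Fraction of a full window this content takes, scaled to tokens.
        total += amount / CAPACITY_PER_WINDOW[kind] * CONTEXT_WINDOW_TOKENS
    return total


def fits_in_window(**amounts: float) -> bool:
    """Return True if the estimated total stays within one million tokens."""
    return estimate_tokens(**amounts) <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # A three-hour conference call recording.
    print(round(estimate_tokens(audio_hours=3)))  # 272727
    # Two hours of video would exceed the window under these assumptions.
    print(fits_in_window(video_hours=2))  # False
```

By this estimate, a three-hour recording would use a little over a quarter of the window, while two hours of video would not fit. Actual token counts depend on the content and on how the model tokenizes it.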

There have been some private preview demos of Gemini 1.5 Pro that demonstrate how the LLM is able to find specific moments in a video transcript. For example, AI enthusiast Rowan Cheung got early access and detailed how his demo found an exact action shot in a sports contest and summarized the event, as seen in the tweet embedded above.

However, Google noted that other early adopters, including United Wholesale Mortgage, TBS, and Replit, are opting for more enterprise-focused use cases, such as mortgage underwriting, automating metadata tagging, and generating, explaining, and updating code.

Fionna Agomuoh
Fionna Agomuoh is a technology journalist with over a decade of experience writing about various consumer electronics topics…