Meta AI releases OpenEQA to spur 'embodied intelligence' in artificial agents

Credit: VentureBeat made with Midjourney

Meta AI researchers today released OpenEQA, a new open-source benchmark dataset that aims to measure an artificial intelligence system’s capacity for “embodied question answering” — developing an understanding of the real world that allows it to answer natural language questions about an environment.

The dataset, which Meta is positioning as a key benchmark for the nascent field of “embodied AI,” contains over 1,600 questions about more than 180 real-world environments such as homes and offices. The questions span seven categories that test skills including object and attribute recognition, spatial and functional reasoning, and commonsense knowledge.

“Against this backdrop, we propose that Embodied Question Answering (EQA) is both a useful end-application as well as a means to evaluate an agent’s understanding of the world,” the researchers wrote in a paper released today. “Simply put, EQA is the task of understanding an environment well enough to answer questions about it in natural language.”

Even the most advanced AI models, like GPT-4V, struggled to match human performance on OpenEQA, a new benchmark that measures an artificial intelligence system’s ability to understand and answer questions about the real world, according to a study by Meta AI researchers. (Credit: open-eqa.github.io)

Bringing together robotics, computer vision and language AI

The OpenEQA project sits at the intersection of some of the hottest areas in AI: computer vision, natural language processing, knowledge representation and robotics. The ultimate vision is to develop artificial agents that can perceive and interact with the world, communicate naturally with humans, and draw upon knowledge to assist us in our daily lives.


The researchers see two main applications for this “embodied intelligence” in the near-term. One is AI assistants embedded in augmented reality glasses or headsets that could draw upon video and other sensor data to essentially give a user a photographic memory, able to answer questions like, “Where did I leave my keys?” The other is mobile robots that could autonomously explore an environment to find information, for example searching a home to answer the question “Do I have any coffee left?”

Creating a challenging benchmark

To create the OpenEQA dataset, the Meta researchers first collected video data and 3D scans of real-world environments. They then showed the videos to humans and asked them to generate questions they might want to ask an AI assistant that had access to that visual data.

The resulting 1,636 questions test a wide range of perception and reasoning capabilities. For example, to answer the question, “How many chairs are around the dining table?” an AI would need to recognize the objects in the scene, understand the spatial concept of “around,” and count the relevant objects. Other questions require the AI to have basic knowledge about the uses and attributes of objects.
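To make that structure concrete, the sketch below shows what a single benchmark entry might look like as a Python record. The field names, category label, and answers here are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical OpenEQA-style entry (field names and values are illustrative
# assumptions, not the dataset's actual schema). Each entry pairs a natural
# language question about a scanned environment with human-written answers.
example_entry = {
    "episode_id": "home-042",          # assumed pointer to the video/3D scan of the environment
    "category": "object recognition",  # one of the seven question categories (assumed label)
    "question": "How many chairs are around the dining table?",
    "human_answers": ["four", "4", "there are four chairs"],  # multiple valid phrasings
}
```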

Each question also includes answers generated by multiple humans, to account for the fact that questions can be answered in many different ways. To measure the performance of AI agents, the researchers used large language models to automatically score how similar the AI-generated answer is to the human answers.
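As a rough illustration of that evaluation step, the sketch below implements an LLM-as-judge scorer: the agent's answer is rated against the human reference answers and the per-question scores are averaged over the benchmark. The prompt wording, the 1-to-5 rating scale, and the `ask_llm` callable are assumptions for illustration, not Meta's exact scoring code.

```python
# Minimal sketch of LLM-based answer scoring ("LLM-as-judge"), assuming a
# caller-supplied `ask_llm(prompt) -> str` function. Details such as the
# prompt text and the 1-5 scale are illustrative assumptions.

SCORING_PROMPT = """You are grading an answer to a question about a real-world scene.
Question: {question}
Reference answers (from humans): {references}
Candidate answer (from the AI agent): {candidate}

On a scale of 1 (completely wrong) to 5 (matches the human answers),
how correct is the candidate answer? Reply with a single number."""


def score_answer(ask_llm, question: str, references: list[str], candidate: str) -> float:
    """Ask a language model to rate how closely the candidate answer matches
    the human reference answers, then map the 1-5 rating to a 0-1 score."""
    prompt = SCORING_PROMPT.format(
        question=question,
        references="; ".join(references),
        candidate=candidate,
    )
    rating = int(ask_llm(prompt).strip())  # e.g. "4"
    return (rating - 1) / 4.0              # 1 -> 0.0, 5 -> 1.0


def benchmark_score(ask_llm, results: list[dict]) -> float:
    """Average the per-question scores over all answered benchmark questions."""
    scores = [
        score_answer(ask_llm, r["question"], r["human_answers"], r["model_answer"])
        for r in results
    ]
    return sum(scores) / len(scores)
```

Because the judge model only sees the question, the human reference answers, and the candidate answer, this kind of scorer does not depend on how the agent arrived at its answer, whether from a video, a 3D scan, or a live robot run.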