Grasping With Common Sense using VLMs and LLMs

How to leverage large language models for robotic grasping and code generation

Nikolaus Correll
Towards Data Science
6 min read · Mar 29, 2024


Grasping and manipulation remain hard, unsolved problems in robotics. Grasping is not just about identifying points at which to place your fingers to sufficiently constrain an object. It is also about applying just enough force to pick up the object without breaking it, while making sure it can be put to its intended use. At the same time, grasping provides critical sensor input for detecting what an object is and what its properties are. With mobility essentially solved, grasping and manipulation remain the final frontier in unlocking truly autonomous labor replacements.

Imagine you are sending your humanoid robot companion to the supermarket and telling it “Check the avocados for ripeness and grab an avocado for guacamole today”. There is a lot going on here, as the short sketch after this list illustrates:

  1. The quality “ripeness” is not obvious from the avocado’s color, as it would be for a strawberry or a tomato, but requires tactile information.
  2. “Grab an avocado”, in particular a ripe one, implies a certain gentleness when handling it. When picking up an avocado, you might be less careful than when picking up a raspberry, but you still need to be cognizant of not damaging the fruit.
  3. “Guacamole today” implies a certain ripeness level that is different from “for guacamole three days from now”.
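As a purely illustrative sketch of how an LLM can supply this kind of common sense, the snippet below asks a chat model to translate the shopping task into task-conditioned grasp parameters. The prompt, the JSON fields, and the model name are my own assumptions for the sake of the example, not code from any particular robot stack; it assumes the `openai` Python package and an API key in the environment.

```python
# Minimal sketch: ask an LLM for task-conditioned grasp parameters.
# The schema ("max_force_n", "target_firmness") and the model name are
# illustrative assumptions, not a tested or recommended pipeline.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You control a robot gripper. Given the task below, reply with JSON
containing: "max_force_n" (maximum grip force in newtons), "target_firmness"
(expected firmness of a suitable object: "soft", "medium", or "firm"), and
"rationale" (one sentence).

Task: Check the avocados for ripeness and grab an avocado for guacamole today.
"""

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model choice
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},
)

params = json.loads(response.choices[0].message.content)
print(params)
# e.g. {"max_force_n": 5.0, "target_firmness": "soft",
#       "rationale": "A ripe avocado bruises easily, so grip it gently."}
```

Parsing the returned JSON into a maximum grip force and an expected firmness gives the low-level controller exactly the kind of task-conditioned numbers that points 1 through 3 above call for.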

Nikolaus is a Professor of Computer Science and Robotics at the University of Colorado Boulder, robotics entrepreneur, and consultant.