LangSplat builds a 3D language field from 3D Gaussians that lets you search 3D scenes in natural language, right down to the ingredients in a bowl of soup.
Researchers at Tsinghua University and Harvard University have developed LangSplat, a new AI system that enables efficient and accurate open-vocabulary searches in 3D spaces. According to the paper, the system significantly outperforms the previous state-of-the-art method, LERF, in both speed and accuracy.
Language Embedded Radiance Fields (LERF) was presented by researchers at UC Berkeley in March 2023. The system embeds language features from vision-language models such as CLIP into NeRFs, so that objects in a 3D environment can be located by text query without task-specific training. For example, the researchers envision a user in the NeRF environment of a bookshop searching for a specific book title in natural language. The technology could also be used in robotics, for visual training of robots in simulations, and for human interaction with 3D worlds.
LangSplat is more accurate and almost 200 times faster
However, LERF is not suitable for real-time search and is relatively inaccurate. LangSplat instead constructs the 3D language field from 3D Gaussians, which, according to the researchers, avoids the complex rendering process that NeRFs require. At a resolution of 1440 x 1080 pixels, LangSplat is 199 times faster than LERF.
To build the 3D language field, LangSplat uses Meta’s Segment Anything Model (SAM) to learn hierarchical semantics from multiple images of a scene. Specifically, each image is decomposed into object masks with clear boundaries at three levels: whole objects, their parts, and their sub-parts. The masked regions are then encoded with CLIP, the resulting embeddings are used to train an autoencoder, and the autoencoder in turn is used to train LangSplat’s 3D language Gaussians.
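The hierarchy of masks and embeddings can be pictured with a small Python sketch. This is a toy illustration, not the authors' code: the `Mask` class, the `build_feature_maps` helper, and the three-number "embeddings" are hypothetical stand-ins for SAM masks and CLIP embeddings, and the autoencoder step is left out. It also assumes that each pixel simply takes on the embedding of the mask that covers it at a given level.

```python
from dataclasses import dataclass

# The three semantic levels named in the text.
LEVELS = ("whole", "part", "subpart")


@dataclass
class Mask:
    level: str         # one of LEVELS
    pixels: frozenset  # (row, col) coordinates covered by the mask
    embedding: tuple   # toy stand-in for a CLIP embedding


def build_feature_maps(masks, height, width):
    """Return {level: 2D grid}; each cell holds the embedding of the mask
    covering that pixel at that level, or None if no mask covers it."""
    maps = {lvl: [[None] * width for _ in range(height)] for lvl in LEVELS}
    for mask in masks:
        for (row, col) in mask.pixels:
            maps[mask.level][row][col] = mask.embedding
    return maps


# A 2x2 toy image: one "whole" object split into two "part" regions.
masks = [
    Mask("whole", frozenset({(0, 0), (0, 1), (1, 0), (1, 1)}), (1.0, 0.0, 0.0)),
    Mask("part", frozenset({(0, 0), (0, 1)}), (0.0, 1.0, 0.0)),
    Mask("part", frozenset({(1, 0), (1, 1)}), (0.0, 0.0, 1.0)),
]
feature_maps = build_feature_maps(masks, height=2, width=2)
print(feature_maps["part"][1][0])  # embedding of the lower "part" region
```

Keeping one map per level means the same pixel can answer a query about the whole bowl and a query about a single ingredient, depending on which level is consulted.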
LangSplat can tell soup ingredients apart
In practice, this makes LangSplat much more accurate. In one example, the team queries “tea in a glass”: LERF marks two cups, while LangSplat marks the liquid in the glass. In another, LangSplat marks individual ingredients in a bowl of ramen soup.
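How a text query picks out a region can be sketched as a nearest-match search over embeddings. The snippet below is a simplified illustration under stated assumptions: the vectors are made-up three-number stand-ins for CLIP embeddings, the region names and the `best_match` helper are hypothetical, and plain cosine similarity is used in place of whatever relevancy score LERF and LangSplat actually compute.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_embedding, region_embeddings):
    """Return the name of the region most similar to the query."""
    return max(
        region_embeddings,
        key=lambda name: cosine_similarity(query_embedding, region_embeddings[name]),
    )


# Toy embeddings (assumed values, not real CLIP outputs).
regions = {
    "cup": (0.9, 0.1, 0.1),
    "glass": (0.6, 0.6, 0.1),
    "liquid in glass": (0.2, 0.7, 0.7),
}
query = (0.1, 0.6, 0.8)  # stand-in for the text "tea in a glass"

print(best_match(query, regions))
```

With these toy values the query lands on the liquid rather than on either container, which mirrors the behaviour described for LangSplat in the example.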
The researchers tested LangSplat on two datasets, the LERF dataset and the 3D OVS dataset. On both, LangSplat significantly outperformed LERF in speed and accuracy. Specifically, LangSplat achieved an overall accuracy of 84.3% on the LERF dataset and 93.4% on the 3D OVS dataset, compared to 73.6% and 86.8%, respectively, for LERF.
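To put the reported figures side by side, the gap on each dataset can be worked out in a few lines of Python. The numbers are the ones quoted above; the `gain_in_points` helper is just a hypothetical convenience for the subtraction.

```python
# Accuracy figures (percent) as quoted in the article.
results = {
    "LERF dataset": {"LangSplat": 84.3, "LERF": 73.6},
    "3D OVS dataset": {"LangSplat": 93.4, "LERF": 86.8},
}


def gain_in_points(scores):
    """Difference between the two methods in percentage points."""
    return round(scores["LangSplat"] - scores["LERF"], 1)


gains = {name: gain_in_points(scores) for name, scores in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

On the quoted numbers, that is a lead of 10.7 percentage points on the LERF dataset and 6.6 on the 3D OVS dataset.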