To mark the 10th anniversary of Meta’s Fundamental AI Research (FAIR) team, the company presents three new research projects: Ego-Exo4D, Seamless Communication, and Audiobox.
Ego-Exo4D is a dataset and benchmark suite to support AI research in video learning and multimodal perception. Collected over two years by Meta's FAIR, Project Aria, and 15 university partners from around the world, Ego-Exo4D captures both "egocentric" views from a camera worn by a participant wearing Project Aria glasses and "exocentric" views from surrounding cameras.
The dataset focuses on complex human activities such as sports, music, cooking, dancing, and bicycle repair.
Meta sees applications in augmented reality (AR), where a person wearing smart glasses could quickly learn new skills with the help of a virtual AI trainer guiding them through an instructional video; in robot learning, where a robot observing the people around it could acquire new manipulation skills with less physical trial and error; and in social networks, where new communities could form around people sharing knowledge and complementary skills in videos.
The dataset, comprising more than 1,400 hours of video, will be released as open source in December, and a public benchmark challenge for Ego-Exo4D is planned for next year.
Seamless Communication aims to enable expressive and fast AI translations
Having presented the SeamlessM4T multimodal translation model in August, FAIR is now introducing a family of AI research models within the Seamless Communication project that build on it to enable more natural and authentic communication across language barriers.
The project consists of four models:
– SeamlessExpressive: Preserves the expression and nuance of speech across language barriers.
– SeamlessStreaming: Delivers speech and text translations with a latency of approximately two seconds.
– SeamlessM4T v2: A multilingual and multitask model for seamless voice and text communication.
– Seamless: Combines the capabilities of SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2 in a single model.
Meta has also published a demo of SeamlessExpressive in which you can hear your own voice translated.
Audiobox is a generative AI model for audio
Audiobox is Meta’s new audio generation model. It is capable of generating voices and sound effects through a combination of voice input and natural language text prompts, making it easier to create custom audio files for different use cases.
Compared with its direct predecessor, Voicebox, Audiobox offers improved controllability: users can describe a desired sound or style of speech in a natural language prompt.
The model will initially be made available to a select group of researchers and academic institutions to advance the state of the art in audio generation research and ensure the responsible development of artificial intelligence, Meta said.