Magma: A foundation model for multimodal AI agents

Jianwei Yang, Principal Researcher, Microsoft Research Redmond

This talk introduces Magma, a new multimodal agentic foundation model designed for UI navigation in digital environments and robotics manipulation in physical settings. It covers two new techniques, Set-of-Mark and Trace-of-Mark, for action grounding and planning, and details the unified pretraining pipeline that learns agentic capabilities.

Microsoft Research Forum, February 25, 2025

The following talk introduces Magma, an agentic foundation model, meaning a generalist model that has agentic abilities such as perceiving its environment, reasoning, and taking actions to achieve goals. Magma can understand multimodal inputs and predict actions for real-world goals in both the digital and the physical world.

JIANWEI YANG: Welcome everyone. My name is Jianwei Yang. I’m a researcher in the MSR [Microsoft Research] Deep Learning group, and I’m very excited to talk about Magma, our most recent work on building a foundation for multimodal AI agents.

When talking about multimodal agents, I would like to walk you through the multimodal models people have built in the past five years. Five years ago, vision-language models, or multimodal models, were mostly built on top of the BERT architecture. Typically, these models contained fewer than 1 billion parameters, and the training data was usually a small set of images. Later on, the CLIP model came out from OpenAI and scaled up multimodal training to billions of images. Back then, we built our own multimodal foundation model called Florence. Although the model size was still relatively small, it showed strong open-vocabulary and zero-shot recognition capability across a range of visual domains.

Most recently, we entered the era of large multimodal models. Connecting multimodal vision models, such as CLIP, with large language models, such as GPT, unlocks many advanced multimodal capabilities. Now we can have a multimodal chatbot such as GPT-4o or Phi-3.5-Vision, which can see, talk, and reason.

Nowadays, most existing multimodal models are built to make sense of the world, but they still lack the ability to interact with it, either virtually or physically. They cannot directly interact with the world, as their inputs are captured by separate sensors, leaving a disconnect between the environment and the large foundation model. We believe that a multimodal AI model should not only understand its inputs but also interact with the environment as an agent, in a human-like manner.

However, we are still facing a big gap between AI and humans in performing tasks as simple as web navigation and manipulation. With this in mind, we developed Magma, a foundation model for multimodal agents. We are striving for a single foundation model: a large multimodal model that can understand visual and textual inputs and also predict actions toward a real-world goal.

The whole model is pretty simple and straightforward. As you can see, it follows a common design: it takes an image or video and a task prompt as inputs and then generates textual, spatial, and action outputs for different tasks. The goal is to create a generalizable system capable of performing a wide range of agentic tasks in both digital and physical environments.
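
To make that input/output flow concrete, here is a minimal sketch of such an interface. The class and helper names (`run_agent`, `encode_frames`, `parse_outputs`) are hypothetical placeholders for illustration, not the released Magma API.

```python
# Illustrative sketch of the interface described above; names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional, Tuple, Dict

@dataclass
class AgentOutput:
    text: str                                     # verbal output: answers, plans, captions
    spatial: Optional[List[Tuple[float, float]]]  # grounded points/boxes, e.g. a click location
    actions: Optional[List[Dict]]                 # executable actions, e.g. {"type": "click", ...}

def run_agent(model, frames: List, task_prompt: str) -> AgentOutput:
    """One forward pass: encode the image/video frames, condition the language
    decoder on the vision tokens plus the task prompt, and parse the generated
    sequence into textual, spatial, and action outputs."""
    vision_tokens = model.encode_frames(frames)        # hypothetical vision-encoder call
    generated = model.generate(vision_tokens, task_prompt)
    return model.parse_outputs(generated)              # hypothetical output parser
```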

As we all know, pretraining large foundation models requires large-scale data. In this project, we explore a new way of leveraging a wide range of human instructional videos for model pretraining. The temporal motions in these videos are used as supervision for action grounding and planning during pretraining. Below are four sample videos and the corresponding object motions. As you can see, the motion represented by the object trajectories clearly indicates the actions taken by humans or robots.

However, the raw motions in the videos cannot be used directly, as they are usually very noisy and do not necessarily capture the meaningful objects in the scene. We need a way to convert these motions into meaningful actions for agentic models to learn. To achieve this goal, we introduce two techniques: Set-of-Mark for images and Trace-of-Mark for videos and robot data. Set-of-Mark is our earlier proposed method, which has been widely used by the community for UI and robotics tasks, as it helps ground the agent's actions spatially in the image.
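
As a rough illustration of Set-of-Mark prompting, the sketch below overlays numbered marks on candidate regions of an image so the model can refer to an action target by its mark index rather than by raw pixel coordinates. The candidate boxes are assumed to come from a UI parser or an object detector; the drawing details are illustrative and not Magma's exact annotation pipeline.

```python
# Minimal Set-of-Mark sketch: draw numbered marks on candidate regions so an
# action can be grounded as "click mark 3" instead of a raw (x, y) coordinate.
from PIL import Image, ImageDraw

def overlay_set_of_marks(image: Image.Image, boxes):
    """boxes: list of (x0, y0, x1, y1) candidate regions in pixel coordinates."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)  # candidate region
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")         # numeric mark
    return marked

# The model is then trained or prompted to answer with a mark index,
# which grounds the predicted action spatially in the image.
```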

Trace-of-Mark, on the other hand, is our newly developed method to capture the motions of foreground objects. The resulting traces, along with the actions, are shown at the bottom. In the end, we compiled roughly 20 million training samples, comprising image, video, and robotics data, each serving a slightly different goal. Given this pretraining data, we use a unified pretraining objective, similar to pretraining a large language model.
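
The sketch below gives one way to think about Trace-of-Mark supervision: starting from marks placed on the first frame, a point tracker produces each mark's future positions, and marks that barely move are discarded so the remaining traces reflect foreground motion. The `track_points` callable stands in for any off-the-shelf point tracker, and the motion threshold is an assumption, not Magma's exact recipe.

```python
# Trace-of-Mark sketch: keep only the trajectories of marks that actually move,
# and use those future positions as prediction targets during pretraining.
import numpy as np

def trace_of_mark(frames: np.ndarray, mark_points: np.ndarray,
                  track_points, min_motion: float = 5.0):
    """frames: (T, H, W, 3) video clip; mark_points: (N, 2) mark locations on frame 0.
    track_points(frames, points) -> (T, N, 2) tracked positions (placeholder tracker)."""
    tracks = track_points(frames, mark_points)             # (T, N, 2)
    traces = []
    for n in range(tracks.shape[1]):
        displacement = np.linalg.norm(tracks[-1, n] - tracks[0, n])
        if displacement >= min_motion:                     # drop static/background marks
            traces.append(tracks[:, n])                    # supervision target: future (x, y)
    return traces
```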

More specifically, our model takes visual and text data as input and then predicts verbal, spatial, and action outputs. We then prompted the pretrained model for action grounding and planning. At the top, we compare different amounts of pretraining data. As we can see, the more data we use for pretraining, the better our model is at action grounding and planning. At the bottom, we prompt the model with different task prompts. It shows good generalization across tasks given the same image input.
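
Conceptually, the unified objective is the standard causal language-modeling loss applied to one serialized sequence in which verbal, spatial, and action outputs are all just tokens. The snippet below sketches that idea, assuming a Hugging Face-style causal LM interface and labels in which prompt and vision positions are masked with -100; it is a simplification, not the actual Magma training code.

```python
# Unified next-token objective sketch: one cross-entropy loss over a serialized
# sequence of verbal, spatial, and action tokens.
import torch.nn.functional as F

def unified_lm_loss(model, input_ids, labels):
    """input_ids: serialized vision + prompt + target tokens, shape (B, T).
    labels: same shape, with prompt/vision positions set to -100 so that only
    the verbal/spatial/action output tokens are supervised."""
    logits = model(input_ids).logits                      # (B, T, vocab); HF-style output assumed
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),      # predict token t+1 from tokens <= t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```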

After pretraining, we evaluated our model in a zero-shot manner on different tasks. From left to right, we evaluated spatial grounding, digital UI navigation, and physical robot manipulation. Our Magma model shows advantages over counterpart methods, including GPT-4V. Note that our model is the first and only model that can perform all three agentic tasks simultaneously.

Given the pretrained Magma model, we can fine-tune it for robotics manipulation. Using the same amount of robot data as OpenVLA, the Magma model almost doubles the performance in different simulated environments. This indicates the effectiveness of our pretraining techniques and the potential of leveraging unlabeled image and video data for agentic pretraining. Afterwards, we further fine-tuned our model for real-world robot manipulation and UI navigation.

At the top, we tested both seen and unseen tasks, and Magma showed much better performance than OpenVLA, even though both methods were fine-tuned in exactly the same way. In the bottom table, we compare Magma with other methods on a more realistic UI navigation benchmark called Mind2Web. Using only image data as input, our Magma model achieved state-of-the-art performance in terms of success rate.

To summarize, in this project we developed the first agentic foundation model, Magma, which can understand multimodal inputs and also take actions in both digital and physical environments. Considering the limited amount of labeled pretraining data, we proposed two techniques, Set-of-Mark and Trace-of-Mark, to leverage large amounts of images and videos without human labels for model pretraining.

In the end, we get a very capable foundation model for a wide range of multimodal tasks, including both understanding and action prediction. We have released our code and model; feel free to try them out yourself. Finally, I want to highlight that this is joint work with many teammates in the Deep Learning group and across MSR [Microsoft Research], as well as many external collaborators.

Thank you all for your attention. 
