Jinming Wu$^*$, Zihao Deng$^*$, Wei Li, Yiding Liu, Bo You, Zejun Ma

$^*$Equal contribution




1. Introduction

In recent years, scaling visual-text paired data has significantly advanced the capabilities of Large Multimodal Models (LMMs) across a wide range of multimodal understanding tasks. However, this paradigm faces fundamental limitations when applied to complex and dynamic real-world knowledge. In particular, long-tail information, such as newly emerging facts beyond the model’s training cutoff, or proprietary domain-specific knowledge constrained by privacy, copyright, or security, remains difficult to capture during pretraining. Moreover, operating beyond their internal knowledge boundaries often leads models to hallucinate, undermining their reliability in applications where accuracy and authority are critical.

Retrieval-Augmented Generation (RAG) has emerged as a widely adopted solution to these challenges. Yet it introduces two key limitations: (1) its modular architecture decouples retrieval and generation, hindering end-to-end optimization; and (2) it follows a static “retrieve-then-generate” paradigm, often triggering unnecessary retrievals even when the model already possesses sufficient knowledge, thus leading to increased latency and computational overhead.
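To make the contrast concrete, the sketch below (our illustration, not code from this work) shows the two control flows with hypothetical stub functions: static RAG pays the retrieval cost on every query, whereas an on-demand policy retrieves only when the model judges its own knowledge to be insufficient.

```python
# Minimal sketch (illustration only) contrasting the two control flows.
# All helper functions are hypothetical stubs.

def retrieve(question: str, image) -> list[str]:
    return ["<retrieved web content>"]      # stand-in for a real search backend

def generate(question: str, image, context=None) -> str:
    return "<model answer>"                 # stand-in for LMM generation

def model_knowledge_insufficient(question: str, image) -> bool:
    return True                             # the decision MMSearch-R1 aims to learn

def static_rag(question, image):
    # Retrieve-then-generate: retrieval runs on every query, even when unnecessary.
    return generate(question, image, context=retrieve(question, image))

def on_demand_search(question, image):
    # Retrieval happens only when the model judges its own knowledge insufficient.
    if model_knowledge_insufficient(question, image):
        return generate(question, image, context=retrieve(question, image))
    return generate(question, image)
```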

Recent advancements, such as OpenAI’s o-series$^{[1]}$, DeepSeek-R1$^{[2]}$, and Kimi k1.5$^{[3]}$, have highlighted the potential of end-to-end reinforcement learning (RL) in enhancing the reasoning abilities of large-scale models. Additionally, Deep Research models$^{[4-6]}$ developed by OpenAI, Google Gemini, and Perplexity have shown that training models to interact with internet content can significantly improve their ability to tackle complex real-world tasks. Building on these insights, we explore an end-to-end RL framework for extending the capability boundaries of LMMs, and seek to answer the following questions:

(1) Can LMMs be trained to perceive their knowledge boundaries and learn to invoke search tools when necessary? (2) How effective and efficient is the RL approach? (3) Could the RL framework lead to the emergence of novel multimodal intelligent behaviors?

In this preliminary work, we introduce MMSearch-R1, an initial effort to equip LMMs with active image search capabilities through an end-to-end RL framework. Specifically, we begin by studying how LMMs can better perform visual question answering (VQA) tasks with the assistance of an image search tool. Our goal is to train models not only to determine when to invoke the image search tool but also to effectively extract, synthesize, and utilize relevant information to support downstream reasoning. This work represents a foundational step toward enabling LMMs to dynamically interact with external tools in a goal-directed manner, thereby enhancing their performance on long-tail and knowledge-intensive VQA tasks.
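The intended interaction pattern can be sketched as a simple multi-turn rollout; the `<search>`/`<answer>` tags, tool interface, and turn limit below are illustrative assumptions rather than the exact protocol used by MMSearch-R1.

```python
# Hypothetical rollout loop: the model either answers directly or requests an
# image search and then reasons over the returned results.

MAX_TURNS = 3

def image_search(image) -> list[str]:
    """Stand-in for a real image search tool returning related web content."""
    return ["<title and snippet of a visually matched web page>"]

def lmm_generate(messages) -> str:
    """Stand-in for the LMM; emits either a <search> request or a final answer."""
    return "<answer>placeholder</answer>"

def rollout(question: str, image) -> str:
    messages = [{"role": "user", "content": [image, question]}]
    reply = ""
    for _ in range(MAX_TURNS):
        reply = lmm_generate(messages)
        if reply.strip().startswith("<search>"):
            # The model judged its internal knowledge insufficient: call the tool
            # and append the results so the next turn can condition on them.
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "tool", "content": image_search(image)})
        else:
            break  # the model answered without (further) searching
    return reply
```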

Figure 1: Case Study. Our model is capable of determining whether it has sufficient knowledge about an image when performing visual question answering and can conduct on-demand image search if necessary.


Our preliminary experimental results are summarized in Figure 2.

Figure 2: Main Results. Left: model performance on various knowledge-intensive VQA benchmarks. Right: comparison of the effects of SFT and RL. EM: exact string matching between the model response and the ground-truth answer. 4o Judge: GPT-4o is prompted with the image, question, ground-truth answer, and model response to judge the correctness of the response. SR: search ratio, i.e., the fraction of questions on which the model invokes search. All numbers in the table are percentages.

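For concreteness, the sketch below shows how the EM and SR criteria, and a judge prompt in the spirit of the 4o Judge, might be implemented; the normalization, prompt wording, and record format are our assumptions rather than the authors' evaluation code.

```python
# Minimal sketch of the evaluation criteria in Figure 2 (assumed details, see above).
import re

def _norm(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparison.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def exact_match(response: str, ground_truth: str) -> bool:
    # EM: the normalized response matches the normalized ground-truth answer.
    return _norm(response) == _norm(ground_truth)

def build_judge_prompt(question: str, ground_truth: str, response: str) -> str:
    # 4o Judge: GPT-4o receives the image (attached separately), the question,
    # the ground-truth answer, and the model response, and judges correctness.
    return (
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model response: {response}\n"
        "Given the attached image, is the model response correct? Answer yes or no."
    )

def search_ratio(used_search: list[bool]) -> float:
    # SR: fraction of questions on which the model invoked the search tool.
    return sum(used_search) / max(len(used_search), 1)
```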

2. Method

2.1. FactualVQA Dataset


Figure 3: Data Engineering. Construction pipeline of the FactualVQA dataset.

As a starting point, we aim to explore whether outcome-based RL can be used to train LMMs end-to-end to develop effective search decision-making behaviors. However, existing VQA datasets typically contain open-ended answers of variable length and format, which poses significant challenges for designing precise and consistent outcome-based reward signals. To support our investigation, we require a dataset that meets two key criteria: (1) each question should have a unique, unambiguous answer, and (2) the answers should be easily and reliably evaluated using simple, automated methods. To meet these criteria, we construct a factual visual question answering dataset, which we refer to as FactualVQA. Below, we detail the data construction process.
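Because every FactualVQA question is constructed to have a short, unique answer, a simple string comparison can already serve as an outcome-based reward signal. The sketch below illustrates such a 0/1 reward under that assumption; it is not necessarily the exact reward used in training.

```python
# Illustrative outcome-based reward enabled by FactualVQA's unique short answers.
import re

def _normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparison.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def outcome_reward(response: str, ground_truth: str) -> float:
    # A unique, unambiguous answer makes correctness checkable by string matching,
    # yielding a precise and consistent reward for end-to-end RL.
    return 1.0 if _normalize(response) == _normalize(ground_truth) else 0.0
```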