Jinming Wu$^*$, Zihao Deng$^*$, Wei Li, Yiding Liu, Bo You, Zejun Ma

$^*$Equal contribution




1. Introduction

In recent years, scaling visual-text paired data has significantly advanced the capabilities of Large Multimodal Models (LMMs) across a wide range of multimodal understanding tasks. However, this paradigm faces fundamental limitations when applied to complex and dynamic real-world knowledge. In particular, long-tail information, such as newly emerging facts beyond the model’s training cutoff, or proprietary domain-specific knowledge constrained by privacy, copyright, or security, remains difficult to capture during pretraining. Moreover, operating beyond their internal knowledge boundaries often leads models to hallucinate, undermining their reliability in applications where accuracy and authority are critical.

Retrieval-Augmented Generation (RAG) has emerged as a widely adopted solution to these challenges. Yet it introduces two key limitations: (1) its modular architecture decouples retrieval and generation, hindering end-to-end optimization; and (2) it follows a static “retrieve-then-generate” paradigm, often triggering unnecessary retrievals even when the model already possesses sufficient knowledge, thus leading to increased latency and computational overhead.
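To make the contrast concrete, the sketch below (our illustration, not code from this work) shows the two control flows with hypothetical stub functions: static RAG pays the retrieval cost on every query, whereas an on-demand policy retrieves only when the model judges its own knowledge to be insufficient.

```python
# Minimal sketch (illustration only) contrasting the two control flows.
# All helper functions are hypothetical stubs.

def retrieve(question: str, image) -> list[str]:
    return ["<retrieved web content>"]      # stand-in for a real search backend

def generate(question: str, image, context=None) -> str:
    return "<model answer>"                 # stand-in for LMM generation

def model_knowledge_insufficient(question: str, image) -> bool:
    return True                             # the decision MMSearch-R1 aims to learn

def static_rag(question, image):
    # Retrieve-then-generate: retrieval runs on every query, even when unnecessary.
    return generate(question, image, context=retrieve(question, image))

def on_demand_search(question, image):
    # Retrieval happens only when the model judges its own knowledge insufficient.
    if model_knowledge_insufficient(question, image):
        return generate(question, image, context=retrieve(question, image))
    return generate(question, image)
```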

Recent advancements, such as OpenAI’s o-series$^{[1]}$, DeepSeek-R1$^{[2]}$, and Kimi k1.5$^{[3]}$, have highlighted the potential of end-to-end reinforcement learning (RL) in enhancing the reasoning abilities of large-scale models. Additionally, Deep Research models$^{[4-6]}$ developed by OpenAI, Google Gemini, and Perplexity have shown that training models to interact with internet content can significantly improve their ability to tackle complex real-world tasks. Building on these insights, we explore an end-to-end RL framework for extending the capability boundaries of LMMs, and seek to answer the following questions:

(1) Can LMMs be trained to perceive their knowledge boundaries and learn to invoke search tools when necessary? (2) How effective and efficient is the RL approach? (3) Could the RL framework lead to the emergence of novel multimodal intelligent behaviors?

In this preliminary work, we introduce MMSearch-R1, an initial effort to equip LMMs with active image search capabilities through an end-to-end RL framework. Specifically, we begin by studying how LMMs can better perform visual question answering (VQA) tasks with the assistance of an image search tool. Our goal is to train models not only to determine when to invoke the image search tool but also to effectively extract, synthesize, and utilize relevant information to support downstream reasoning. This work represents a foundational step toward enabling LMMs to dynamically interact with external tools in a goal-directed manner, thereby enhancing their performance on long-tail and knowledge-intensive VQA tasks.
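The intended interaction pattern can be sketched as a simple multi-turn rollout; the `<search>`/`<answer>` tags, tool interface, and turn limit below are illustrative assumptions rather than the exact protocol used by MMSearch-R1.

```python
# Hypothetical rollout loop: the model either answers directly or requests an
# image search and then reasons over the returned results.

MAX_TURNS = 3

def image_search(image) -> list[str]:
    """Stand-in for a real image search tool returning related web content."""
    return ["<title and snippet of a visually matched web page>"]

def lmm_generate(messages) -> str:
    """Stand-in for the LMM; emits either a <search> request or a final answer."""
    return "<answer>placeholder</answer>"

def rollout(question: str, image) -> str:
    messages = [{"role": "user", "content": [image, question]}]
    reply = ""
    for _ in range(MAX_TURNS):
        reply = lmm_generate(messages)
        if reply.strip().startswith("<search>"):
            # The model judged its internal knowledge insufficient: call the tool
            # and append the results so the next turn can condition on them.
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "tool", "content": image_search(image)})
        else:
            break  # the model answered without (further) searching
    return reply
```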

Figure 1: Case Study. Our model is capable of determining whether it has sufficient knowledge about an image when performing visual question answering and can conduct on-demand image search if necessary.


Our preliminary experimental results are summarized in Figure 2.

Figure 2: Main Results. Left: model performance on various knowledge-intensive VQA benchmarks. Right: comparison of the effects of SFT and RL. EM: exact string matching between the model response and the ground-truth answer. 4o Judge: GPT-4o is prompted with the image, question, ground-truth answer, and model response to judge the correctness of the response. SR: search ratio, i.e., the fraction of questions on which the model invokes search. All numbers in the table are percentages.

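For concreteness, the sketch below shows how the EM and SR criteria, and a judge prompt in the spirit of the 4o Judge, might be implemented; the normalization, prompt wording, and record format are our assumptions rather than the authors' evaluation code.

```python
# Minimal sketch of the evaluation criteria in Figure 2 (assumed details, see above).
import re

def _norm(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparison.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def exact_match(response: str, ground_truth: str) -> bool:
    # EM: the normalized response matches the normalized ground-truth answer.
    return _norm(response) == _norm(ground_truth)

def build_judge_prompt(question: str, ground_truth: str, response: str) -> str:
    # 4o Judge: GPT-4o receives the image (attached separately), the question,
    # the ground-truth answer, and the model response, and judges correctness.
    return (
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Model response: {response}\n"
        "Given the attached image, is the model response correct? Answer yes or no."
    )

def search_ratio(used_search: list[bool]) -> float:
    # SR: fraction of questions on which the model invoked the search tool.
    return sum(used_search) / max(len(used_search), 1)
```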

2. Method

2.1. FactualVQA Dataset


Figure 3: Data Engineering. Construction pipeline of the FactualVQA dataset.

As a starting point, we aim to explore whether outcome-based RL can be used to train LMMs end-to-end to develop effective search decision-making behaviors. However, existing VQA datasets typically contain open-ended answers of variable length and format, which poses significant challenges for designing precise and consistent outcome-based reward signals. To support our investigation, we require a dataset that meets two key criteria: (1) each question should have a unique, unambiguous answer, and (2) the answers should be easily and reliably evaluated using simple, automated methods. To meet these criteria, we construct a factual visual question answering dataset, which we refer to as FactualVQA. Below, we detail the data construction process.
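Because every FactualVQA question is constructed to have a short, unique answer, a simple string comparison can already serve as an outcome-based reward signal. The sketch below illustrates such a 0/1 reward under that assumption; it is not necessarily the exact reward used in training.

```python
# Illustrative outcome-based reward enabled by FactualVQA's unique short answers.
import re

def _normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparison.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def outcome_reward(response: str, ground_truth: str) -> float:
    # A unique, unambiguous answer makes correctness checkable by string matching,
    # yielding a precise and consistent reward for end-to-end RL.
    return 1.0 if _normalize(response) == _normalize(ground_truth) else 0.0
```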