
Beyond Surface Matching: Reasoning, Grounding, and Retrieval in Vision-Language Models

If you have a question about this talk, please contact Lucas Resck.

Abstract: Vision-language models have made remarkable progress on multimodal benchmarks, yet much of this performance rests on shallow pattern matching: single-vector compression in retrieval, brute-force scaling of training in reasoning, and surface-level lexical cues in grounding. In this talk, I present recent work that addresses these limitations.

I begin with MetaEmbed, a flexible multi-vector retrieval framework that introduces learnable meta tokens processed by a vision-language backbone; their contextualized representations enable late interaction at variable granularity. Through a Matryoshka multi-vector training objective, MetaEmbed learns coarse-to-fine embeddings that let users trade off retrieval quality against efficiency at test time, achieving state-of-the-art results on the MMEB and ViDoRe benchmarks across model scales up to 32B parameters.

I then present ProxyThinker, an inference-time method that transfers visual reasoning capabilities from small reinforcement-fine-tuned models to larger base models without any additional training. By steering the large model's token distributions with the logit difference between a small reasoning expert and its base counterpart, ProxyThinker elicits slow-thinking behaviors such as self-verification and backtracking, achieving performance competitive with full-scale reinforcement fine-tuning at a fraction of the cost.

I conclude with a brief overview of two ongoing directions: Referring Scenario Comprehension, a benchmark that challenges grounding models with non-literal, scenario-based queries requiring reasoning over user intent and relational context; and Retrieval-Augmented Reinforcement Fine-Tuning, which trains language models to reason by analogy using retrieved demonstrations selected for reasoning utility rather than surface similarity.
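To make the variable-granularity late interaction in the abstract concrete, here is a minimal sketch of scoring with a test-time token budget. This is not the MetaEmbed implementation: the function name, tensor shapes, cosine normalization, and the MaxSim-style scoring rule are illustrative assumptions about how a coarse-to-fine multi-vector scorer of this kind could look.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_vecs: torch.Tensor,
                           cand_vecs: torch.Tensor,
                           budget: int) -> torch.Tensor:
    """Score one query against a batch of candidates with MaxSim-style late interaction.

    query_vecs: (Nq, dim)     contextualized meta-token embeddings for the query
    cand_vecs:  (C, Nc, dim)  meta-token embeddings for C candidates
    budget:     number of leading meta tokens to use at test time; the Matryoshka-style
                idea is that a short prefix already gives a coarse but usable embedding,
                while longer prefixes refine the match.
    """
    q = F.normalize(query_vecs[:budget], dim=-1)        # (b, dim)
    c = F.normalize(cand_vecs[:, :budget], dim=-1)      # (C, b, dim)
    sim = torch.einsum("qd,ctd->cqt", q, c)             # all pairwise token similarities
    # For each query token, keep its best-matching candidate token, then sum.
    return sim.max(dim=-1).values.sum(dim=-1)           # (C,)
```

With `budget=1` this degenerates to something close to single-vector retrieval; raising the budget spends more compute on finer-grained matching, which is the test-time quality/efficiency dial the abstract refers to.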
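The logit-difference steering behind ProxyThinker can likewise be illustrated at the level of a single decoding step. This is a minimal sketch, not the released method: it assumes all three models share a tokenizer and vocabulary, and the guidance weight `alpha` and the greedy token choice are added for illustration (the abstract only specifies adding the logit difference).

```python
import torch

@torch.no_grad()
def proxy_steered_logits(large_base_logits: torch.Tensor,
                         small_expert_logits: torch.Tensor,
                         small_base_logits: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Shift the large base model's next-token logits by the expert-minus-base delta
    of the small model pair. All inputs are per-step logits of shape (vocab_size,)
    from models sharing one vocabulary."""
    return large_base_logits + alpha * (small_expert_logits - small_base_logits)

def pick_next_token(large_base_logits: torch.Tensor,
                    small_expert_logits: torch.Tensor,
                    small_base_logits: torch.Tensor) -> int:
    # Greedy choice from the steered distribution; sampling works the same way
    # after a softmax over the steered logits.
    steered = proxy_steered_logits(large_base_logits,
                                   small_expert_logits,
                                   small_base_logits)
    return int(torch.argmax(steered))
```

Because the steering happens purely at decoding time, the large model's weights are never updated, which is what makes the approach training-free.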

Bio: Vicente Ordóñez-Román is an Associate Professor in the Department of Computer Science at Rice University. His research interests lie at the intersection of computer vision, natural language processing, and machine learning. His focus is on building efficient visual recognition models that can perform tasks leveraging both images and text. He received a Best Paper Award at the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2017 and the Best Paper Award (Marr Prize) at the International Conference on Computer Vision (ICCV) 2013. He has also been the recipient of an NSF CAREER Award, an IBM Faculty Award, a Google Faculty Research Award, and a Facebook Research Award. From 2016 to 2021, he was an Assistant Professor in the Department of Computer Science at the University of Virginia. Vicente obtained his PhD in Computer Science at the University of North Carolina at Chapel Hill, an MS at Stony Brook University, and an engineering degree at the Escuela Superior Politécnica del Litoral in Ecuador. He has also been a visiting researcher at the Allen Institute for Artificial Intelligence, Adobe Research, Amazon Alexa AI, and the Amazon AGI Foundations team.

This talk is part of the Language Technology Lab Seminars series.
