Alberto Baldrati

I am Alberto Baldrati, a Research Scientist at the Samsung AI Center in Cambridge, UK, focusing on computer vision and machine learning research.

Previously, I successfully defended my PhD thesis in February 2025 as part of the AI Italian National Doctorate program at the University of Pisa. During my PhD, I was hosted by the University of Florence and worked at the Media Integration and Communication Center (MICC) under the supervision of Prof. Marco Bertini and Andrew David Bagdanov.

During my PhD, my research covered two areas: vision and language, with a particular focus on prompt learning and composed image retrieval; and fashion image generation, with a focus on multimodal fashion image editing and virtual try-on. I also interned as a Computer Vision Research Scientist at the Huawei Finland Research Center from March to September 2024, where I worked on video generation.

Currently, my research focuses on efficient vision-and-language applications, particularly efficient VLLMs (see this paper).

News

Apr 6, 2026	One paper about multi-image VLLMs accepted at ACL 2026 (Findings).
Feb 20, 2026	One paper about efficient VLLMs accepted at CVPR 2026.
Jan 10, 2026	The extended version of our ICCV 2023 paper on multimodal fashion image editing has been accepted at ACM TOMM.
Jul 26, 2025	The extended version of our ICCV 2023 paper on composed image retrieval has been accepted at TPAMI.
Mar 3, 2025	Joined Samsung AI Center in Cambridge as a Research Scientist.

Selected Publications

2026

CVPR
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

A. Bulat*, A. Baldrati*, I. Metaxas*, Y. Ouali, and G. Tzimiropoulos

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
@inproceedings{bulat2026vision,
  title = {VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions},
  author = {Bulat*, A. and Baldrati*, A. and Metaxas*, I. and Ouali, Y. and Tzimiropoulos, G.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {31920--31930},
  year = {2026}
}

2025

ICLR
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

M. Mistretta*, A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Bagdanov

In The Thirteenth International Conference on Learning Representations, 2025

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
@inproceedings{mistretta2025cross,
  title = {Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion},
  author = {Mistretta*, M. and Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Bagdanov, A.},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year = {2025}
}

2023

ICCV
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

A. Baldrati*, D. Morelli*, G. Cartella, M. Cornia, M. Bertini, and R. Cucchiara

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs.
@inproceedings{baldrati2023multimodal,
  author = {Baldrati*, A. and Morelli*, D. and Cartella, G. and Cornia, M. and Bertini, M. and Cucchiara, R.},
  title = {Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2023},
  pages = {23393--23402}
}

ICCV
Zero-Shot Composed Image Retrieval with Textual Inversion

A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Del Bimbo

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
@inproceedings{baldrati2023zero,
  title = {Zero-Shot Composed Image Retrieval with Textual Inversion},
  author = {Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {15338--15347},
  year = {2023}
}