Alberto Baldrati

I am Alberto Baldrati, a Research Scientist at the Samsung AI Center in Cambridge, UK, focusing on computer vision and machine learning research.

I defended my PhD thesis in February 2025 as part of the AI Italian National Doctorate program at the University of Pisa. During my PhD, I was hosted by the University of Florence, where I worked at the Media Integration and Communication Center (MICC) under the supervision of Prof. Marco Bertini and Andrew David Bagdanov.

My PhD research covered two areas: vision and language, with a particular focus on prompt learning and composed image retrieval; and fashion image generation, with a focus on multimodal fashion image editing and virtual try-on. From March to September 2024, I also interned as a Computer Vision Research Scientist at Huawei Finland Research Center, where I worked on video generation.

If you wish to learn more about my research or explore potential collaborations, please feel free to reach out via email!

News

Jul 26, 2025	The extended version of our ICCV 2023 paper on composed image retrieval has been accepted at TPAMI.
Mar 3, 2025	Joined Samsung AI Center in Cambridge as a Research Scientist.
Feb 19, 2025	Successfully defended my PhD thesis at the University of Pisa.
Jan 21, 2025	One paper about CLIP representations accepted at ICLR 2025.
Jul 9, 2024	One paper about prompt learning accepted at ECCV 2024.

Selected Publications

2025

ICLR
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

M. Mistretta*, A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Bagdanov

In The Thirteenth International Conference on Learning Representations, 2025

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
@inproceedings{mistretta2025cross,
  title = {Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion},
  author = {Mistretta*, M. and Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Bagdanov, A.},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year = {2025},
}

2024

ECCV
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

M. Mistretta*, A. Baldrati*, M. Bertini, and A. Bagdanov

In European Conference on Computer Vision, 2024

Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge.
@inproceedings{mistretta2025improving,
  title = {Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation},
  author = {Mistretta*, M. and Baldrati*, A. and Bertini, M. and Bagdanov, A.},
  booktitle = {European Conference on Computer Vision},
  pages = {459--477},
  year = {2024},
  organization = {Springer},
}

2023

ICCV
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

A. Baldrati*, D. Morelli*, G. Cartella, M. Cornia, M. Bertini, and R. Cucchiara

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs.
@inproceedings{baldrati2023multimodal,
  author = {Baldrati*, A. and Morelli*, D. and Cartella, G. and Cornia, M. and Bertini, M. and Cucchiara, R.},
  title = {Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2023},
  pages = {23393--23402},
}

ICCV
Zero-Shot Composed Image Retrieval with Textual Inversion

A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Del Bimbo

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
@inproceedings{baldrati2023zero,
  title = {Zero-Shot Composed Image Retrieval with Textual Inversion},
  author = {Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Del Bimbo, A.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {15338--15347},
  year = {2023},
}