Publications
* denotes equal contribution
An up-to-date list is available on Google Scholar.
2025
- ICLR
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
M. Mistretta*, A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Bagdanov
In The Thirteenth International Conference on Learning Representations, 2025
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
@inproceedings{mistretta2025cross, title = {Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion}, author = {Mistretta*, M. and Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Bagdanov, A.}, booktitle = {The Thirteenth International Conference on Learning Representations}, year = {2025}, }
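A minimal, self-contained sketch of the optimization-based inversion idea above, written with toy stand-in encoders in plain PyTorch rather than CLIP itself; the `encode_pseudo_text` helper, the dimensions, and the optimization schedule are illustrative assumptions, not the paper's implementation. The only point is the flow: the query is mapped to the complementary modality by optimization and then compared inter-modally against the gallery.

```python
# Toy sketch of optimization-based textual inversion for intra-modal retrieval.
# Tiny linear layers stand in for the frozen CLIP image/text towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, EMB = 512, 64  # shared feature space / pseudo-token size (arbitrary)

image_encoder = torch.nn.Linear(3 * 32 * 32, DIM).requires_grad_(False)  # stand-in for a frozen image tower
text_tower = torch.nn.Linear(EMB, DIM).requires_grad_(False)             # stand-in for a frozen text tower

def encode_image(x):
    return F.normalize(image_encoder(x.flatten(1)), dim=-1)

def encode_pseudo_text(token_emb):
    # Hypothetical helper: encode a single learnable pseudo-token through the text tower.
    return F.normalize(text_tower(token_emb), dim=-1)

def invert_image(img_feat, steps=200, lr=0.1):
    """Optimize a pseudo-token so its text-side feature aligns with img_feat."""
    token = torch.zeros(1, EMB, requires_grad=True)
    opt = torch.optim.Adam([token], lr=lr)
    for _ in range(steps):
        loss = 1 - F.cosine_similarity(encode_pseudo_text(token), img_feat).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return encode_pseudo_text(token).detach()

# Image-to-image retrieval, approached inter-modally:
query = torch.randn(1, 3, 32, 32)
gallery = torch.randn(10, 3, 32, 32)
q_inv = invert_image(encode_image(query))      # query mapped to the text side
scores = q_inv @ encode_image(gallery).T       # inter-modal comparison with the gallery
print(scores.argsort(descending=True))
```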
2024
- ECCV
Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
M. Mistretta*, A. Baldrati*, M. Bertini, and A. Bagdanov
In European Conference on Computer Vision, 2024
Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge.
@inproceedings{mistretta2025improving, title = {Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation}, author = {Mistretta*, M. and Baldrati*, A. and Bertini, M. and Bagdanov, A.}, booktitle = {European Conference on Computer Vision}, pages = {459--477}, year = {2024}, organization = {Springer}, }
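A hedged sketch of the unsupervised distillation signal described above, with toy stand-in encoders in place of the student and teacher VLMs; the single-vector per-class context, the sizes, and the schedule are assumptions for illustration only.

```python
# Toy sketch: distill a stronger teacher's soft predictions on unlabeled images
# into a learnable prompt context of a frozen student, without any ground-truth labels.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, CTX, N_CLS, N_IMG = 128, 16, 5, 32

teacher_img = torch.nn.Linear(64, DIM).requires_grad_(False)   # stand-in for the frozen teacher image tower
teacher_txt = torch.randn(N_CLS, DIM)                          # stand-in teacher class-name embeddings
student_img = torch.nn.Linear(64, DIM).requires_grad_(False)   # frozen student image tower
student_txt = torch.nn.Linear(CTX, DIM).requires_grad_(False)  # frozen student text tower

# Learnable prompt context, one vector per class for simplicity (an assumption).
context = torch.nn.Parameter(torch.randn(N_CLS, CTX) * 0.02)
opt = torch.optim.SGD([context], lr=0.1)

def logits(img_feat, cls_feat, tau=0.07):
    img_feat = F.normalize(img_feat, dim=-1)
    cls_feat = F.normalize(cls_feat, dim=-1)
    return img_feat @ cls_feat.T / tau

images = torch.randn(N_IMG, 64)                                # unlabeled adaptation images
for _ in range(100):
    with torch.no_grad():
        t_probs = logits(teacher_img(images), teacher_txt).softmax(-1)  # soft teacher labels
    s_logits = logits(student_img(images), student_txt(context))        # student with learned prompts
    loss = F.kl_div(s_logits.log_softmax(-1), t_probs, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```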
- arXiv
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
L. Agnolucci*, A. Baldrati*, M. Bertini, and A. Del Bimbo
2024
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets – FashionIQ, CIRR, and the proposed CIRCO – and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.
@article{agnolucci2024isearle, title = {iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval}, author = {Agnolucci*, L. and Baldrati*, A. and Bertini, M. and Del Bimbo, A.}, year = {2024}, }
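A conceptual sketch of the query construction described above, with toy stand-ins for the CLIP towers; `inject_pseudo_word`, the mapping network `phi`, and all dimensions are hypothetical placeholders, not the paper's architecture.

```python
# Toy sketch of zero-shot composed image retrieval with textual inversion:
# map the reference image to a pseudo-word token, place it in the relative caption,
# encode the result, and rank the gallery by similarity.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, EMB = 256, 64

image_encoder = torch.nn.Linear(3 * 32 * 32, DIM).requires_grad_(False)   # stand-in image tower
phi = torch.nn.Sequential(torch.nn.Linear(DIM, EMB), torch.nn.GELU(),
                          torch.nn.Linear(EMB, EMB))                      # textual-inversion mapping (untrained toy)
text_encoder = torch.nn.GRU(EMB, DIM, batch_first=True)                   # stand-in text tower

def encode_image(x):
    return F.normalize(image_encoder(x.flatten(1)), dim=-1)

def inject_pseudo_word(caption_emb, pseudo_token):
    # Hypothetical helper: prepend the pseudo-word token to the caption's token embeddings.
    return torch.cat([pseudo_token.unsqueeze(1), caption_emb], dim=1)

def encode_query(reference_img, caption_emb):
    pseudo = phi(encode_image(reference_img))          # image feature -> pseudo-word token
    tokens = inject_pseudo_word(caption_emb, pseudo)   # "<pseudo-word> + relative caption"
    _, h = text_encoder(tokens)
    return F.normalize(h[-1], dim=-1)

reference = torch.randn(1, 3, 32, 32)
caption_emb = torch.randn(1, 7, EMB)                   # toy token embeddings of the relative caption
gallery = torch.randn(100, 3, 32, 32)
scores = encode_query(reference, caption_emb) @ encode_image(gallery).T
top5 = scores.topk(5).indices
```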
- arXiv
Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing
A. Baldrati*, D. Morelli*, M. Cornia, M. Bertini, and R. Cucchiara
2024
Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
@article{baldrati2024multimodalconditioned, title = {Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing}, author = {Baldrati*, A. and Morelli*, D. and Cornia, M. and Bertini, M. and Cucchiara, R.}, year = {2024}, }
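A very rough sketch of the multimodal conditioning idea, assuming a single cross-attention block that attends to the concatenation of the prompt embeddings; the dimensions, sequence lengths, and block structure are illustrative assumptions, not the paper's denoising network.

```python
# Toy denoising block whose cross-attention attends to concatenated multimodal
# prompt embeddings (text, pose, sketch, texture).
import torch

DIM, HEADS = 256, 4
attn = torch.nn.MultiheadAttention(DIM, HEADS, batch_first=True)

def conditioned_block(latent_tokens, text_emb, pose_emb, sketch_emb, texture_emb):
    context = torch.cat([text_emb, pose_emb, sketch_emb, texture_emb], dim=1)  # (B, L_ctx, DIM)
    out, _ = attn(query=latent_tokens, key=context, value=context)
    return latent_tokens + out                                                 # residual update

B = 2
latents = torch.randn(B, 64, DIM)  # toy latent tokens of the noisy image
out = conditioned_block(latents,
                        torch.randn(B, 16, DIM), torch.randn(B, 4, DIM),
                        torch.randn(B, 8, DIM), torch.randn(B, 8, DIM))
```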
2023
- ACM TOMM
Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
A. Baldrati, M. Bertini, T. Uricchio, and A. Del Bimbo
ACM Transactions on Multimedia Computing, Communications and Applications, 2023
Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval.
@article{baldrati2023composed, title = {Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features}, author = {Baldrati, A. and Bertini, M. and Uricchio, T. and Del Bimbo, A.}, journal = {ACM Transactions on Multimedia Computing, Communications and Applications}, publisher = {ACM New York, NY}, year = {2023}, }
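A simplified sketch of a Combiner-style fusion module and of the contrastive training signal; the layer sizes and gating below are assumptions, and only the overall idea (a learned fusion of CLIP image/text features trained contrastively against the target image) comes from the paper.

```python
# Toy Combiner: fuse image and text features, then train so that each combined
# query matches its own target image within the batch.
import torch
import torch.nn.functional as F

class Combiner(torch.nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.img_proj = torch.nn.Linear(dim, hidden)
        self.txt_proj = torch.nn.Linear(dim, hidden)
        self.fuse = torch.nn.Sequential(torch.nn.Linear(2 * hidden, hidden),
                                        torch.nn.ReLU(),
                                        torch.nn.Linear(hidden, dim))
        self.gate = torch.nn.Sequential(torch.nn.Linear(2 * hidden, 1), torch.nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        h = torch.cat([F.relu(self.img_proj(img_feat)), F.relu(self.txt_proj(txt_feat))], dim=-1)
        a = self.gate(h)                                      # learned mixing weight
        combined = self.fuse(h) + a * img_feat + (1 - a) * txt_feat
        return F.normalize(combined, dim=-1)

def contrastive_loss(query_feat, target_feat, tau=0.07):
    logits = query_feat @ F.normalize(target_feat, dim=-1).T / tau
    labels = torch.arange(logits.size(0))                     # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random "CLIP" features.
img, txt, tgt = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = contrastive_loss(Combiner()(img, txt), tgt)
loss.backward()
```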
- ACM MM
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
D. Morelli*, A. Baldrati*, G. Cartella, M. Cornia, M. Bertini, and R. Cucchiara
In Proceedings of the ACM International Conference on Multimedia, 2023
The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model’s characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
@inproceedings{morelli2023ladi, title = {{LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On}}, author = {Morelli*, D. and Baldrati*, A. and Cartella, G. and Cornia, M. and Bertini, M. and Cucchiara, R.}, booktitle = {Proceedings of the ACM International Conference on Multimedia}, year = {2023}, }
- ICCV
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing
A. Baldrati*, D. Morelli*, G. Cartella, M. Cornia, M. Bertini, and R. Cucchiara
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023
Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs.
@inproceedings{baldrati2023multimodal, author = {Baldrati*, A. and Morelli*, D. and Cartella, G. and Cornia, M. and Bertini, M. and Cucchiara, R.}, title = {Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = oct, year = {2023}, pages = {23393-23402}, }
- ICCV
Zero-Shot Composed Image Retrieval with Textual Inversion
A. Baldrati*, L. Agnolucci*, M. Bertini, and A. Del Bimbo
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
@inproceedings{baldrati2023zero, title = {Zero-Shot Composed Image Retrieval with Textual Inversion}, author = {Baldrati*, A. and Agnolucci*, L. and Bertini, M. and Del Bimbo, A.}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, pages = {15338--15347}, year = {2023}, }
- ICCV Workshop
Mapping Memes to Words for Multimodal Hateful Meme Classification
G. Burbi*, A. Baldrati*, L. Agnolucci, M. Bertini, and A. Del Bimbo
In Proceedings of the IEEE/CVF International Conference on Computer Vision, Oct 2023
Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. The code and the pre-trained models are publicly available at https://github.com/miccunifi/ISSUES.
@inproceedings{burbi2023mapping, title = {Mapping Memes to Words for Multimodal Hateful Meme Classification}, author = {Burbi*, G. and Baldrati*, A. and Agnolucci, L. and Bertini, M. and Del Bimbo, A.}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision}, pages = {2832--2836}, year = {2023}, }
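A rough sketch of the final classification stage, assuming a simple concatenation of image features and text features (which, in the paper, are enriched with a pseudo-word obtained by textual inversion of the image) followed by a small head; the fusion strategy and layer sizes are placeholders, not the ISSUES architecture.

```python
# Toy meme classifier on top of CLIP-style features.
import torch
import torch.nn.functional as F

DIM = 512
classifier = torch.nn.Sequential(torch.nn.Linear(2 * DIM, 256),
                                 torch.nn.ReLU(),
                                 torch.nn.Linear(256, 2))      # hateful vs. not hateful

def classify(img_feat, txt_feat):
    fused = torch.cat([F.normalize(img_feat, dim=-1),
                       F.normalize(txt_feat, dim=-1)], dim=-1)
    return classifier(fused).softmax(-1)

probs = classify(torch.randn(4, DIM), torch.randn(4, DIM))     # toy features for 4 memes
```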
- ICCV Workshop
ECO: Ensembling Context Optimization for Vision-Language Models
L. Agnolucci*, A. Baldrati*, F. Todino, F. Becattini, M. Bertini, and A. Del Bimbo
In Proceedings of the IEEE/CVF International Conference on Computer Vision, Oct 2023
Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP’s classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
@inproceedings{agnolucci2023eco, title = {ECO: Ensembling Context Optimization for Vision-Language Models}, author = {Agnolucci*, L. and Baldrati*, A. and Todino, F. and Becattini, F. and Bertini, M. and Del Bimbo, A.}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision}, pages = {2811--2815}, year = {2023}, }
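A short sketch of prompt ensembling at inference, assuming that the class probabilities obtained with each learned context are simply averaged; the encoder and contexts are toy stand-ins, and averaging probabilities is an illustrative choice rather than the paper's exact recipe.

```python
# Toy inference with an ensemble of learned prompt contexts.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, CTX, N_CLS, N_PROMPTS = 128, 8, 10, 4

text_tower = torch.nn.Linear(CTX, DIM).requires_grad_(False)       # stand-in frozen text encoder
contexts = [torch.randn(N_CLS, CTX) for _ in range(N_PROMPTS)]     # stand-ins for independently learned contexts

def class_probs(img_feat, context, tau=0.07):
    cls = F.normalize(text_tower(context), dim=-1)
    return (F.normalize(img_feat, dim=-1) @ cls.T / tau).softmax(-1)

img_feat = torch.randn(16, DIM)                                    # toy image features
probs = torch.stack([class_probs(img_feat, c) for c in contexts]).mean(0)
pred = probs.argmax(-1)
```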
- ICIAP
OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
G. Cartella, A. Baldrati, D. Morelli, M. Cornia, M. Bertini, and R. Cucchiara
In International Conference on Image Analysis and Processing, 2023
The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall.
@inproceedings{cartella2023openfashionclip, title = {OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data}, author = {Cartella, G. and Baldrati, A. and Morelli, D. and Cornia, M. and Bertini, M. and Cucchiara, R.}, booktitle = {International Conference on Image Analysis and Processing}, pages = {245--256}, year = {2023}, organization = {Springer}, }
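For reference, a minimal version of the standard symmetric image-text contrastive (CLIP-style InfoNCE) objective that this kind of vision-and-language pre-training relies on; the features and temperature below are placeholders.

```python
# Symmetric image-text contrastive loss over a batch of paired features.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, tau=0.07):
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.T / tau          # pairwise similarities in the batch
    labels = torch.arange(logits.size(0))         # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```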
2022
- CVPR Workshop
Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features
A. Baldrati, M. Bertini, T. Uricchio, and A. Del Bimbo
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR), an image is combined with a text that provides information regarding user intentions and is relevant for application domains like e-commerce. The proposed method is based on an initial training stage where a simple combination of visual and textual features is used to fine-tune the CLIP text encoder. Then in a second training stage, we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.
@inproceedings{baldrati2022conditioned, title = {Conditioned and composed image retrieval combining and partially fine-tuning clip-based features}, author = {Baldrati, A. and Bertini, M. and Uricchio, T. and Del Bimbo, A.}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages = {4959--4968}, year = {2022}, }
- CVPR Demo
Effective conditioned and composed image retrieval combining CLIP-based features
A. Baldrati, M. Bertini, T. Uricchio, and A. Del Bimbo
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Conditioned and composed image retrieval extend CBIR systems by combining a query image with an additional text that expresses the intent of the user, describing additional requests w.r.t. the visual content of the query image. This type of search is interesting for e-commerce applications, e.g. to develop interactive multimodal searches and chatbots. In this demo, we present an interactive system based on a combiner network, trained using contrastive learning, that combines visual and textual features obtained from the OpenAI CLIP network to address conditioned CBIR. The system can be used to improve e-shop search engines. For example, considering the fashion domain, it lets users search for dresses, shirts and toptees using a candidate start image and expressing some visual differences w.r.t. its visual content, e.g. asking to change color, pattern or shape. The proposed network obtains state-of-the-art performance on the FashionIQ dataset and on the more recent CIRR dataset, showing its applicability to the fashion domain for conditioned retrieval, and to more generic content considering the more general task of composed image retrieval.
@inproceedings{baldrati2022effective, title = {Effective conditioned and composed image retrieval combining clip-based features}, author = {Baldrati, A. and Bertini, M. and Uricchio, T. and Del Bimbo, A.}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages = {21466--21474}, year = {2022}, }
2021
- ACM MM Asia
Conditioned image retrieval for fashion using contrastive learning and CLIP-based features
A. Baldrati, M. Bertini, T. Uricchio, and A. Del Bimbo
In ACM Multimedia Asia, 2021
Building on the recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description and provide combined features used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
@incollection{baldrati2021conditioned, title = {Conditioned image retrieval for fashion using contrastive learning and CLIP-based features}, author = {Baldrati, A. and Bertini, M. and Uricchio, T. and Del Bimbo, A.}, booktitle = {ACM Multimedia Asia}, pages = {1--5}, year = {2021}, }