Decoupled Global-Local Alignment for Improving Compositional Understanding

Xiaoxing Hu1* Kaicheng Yang2* Jun Wang2 Haoran Xu3 Ziyong Feng2 Yupei Wang1†

1Beijing Institute of Technology 2DeepGlint 3Zhejiang University
*Equal contribution †Corresponding author

Code Paper

📖 Introduction

(Figure: teaser)

Motivation

Prior studies have demonstrated that CLIP lacks compositional understanding—particularly in comprehending relational and attributive concepts. While existing approaches attempt to address this limitation through fine-tuning with hard negative samples, they exhibit a critical drawback: the improvement in compositional understanding often comes at the expense of significant degradation in general performance.

This raises a crucial question: how can we enhance CLIP's compositional understanding while preserving its general capabilities, maintaining a good trade-off between the two?

🛠️ Our Solution

(Figure: teaser)

LLM-driven Hard Negative Generation

Unlike previous rule-based or unmasking-based hard negative generation methods, we propose a hard negative generation method that leverages large language models (LLMs). Specifically, we first use ChatGPT to generate high-quality rewritten examples for each type of negative sample. From these, we manually select 50 examples to serve as templates for large-scale rewriting. We then harness the in-context learning capability of the Llama-3.1-8B-Instruct model to conduct large-scale rewrites, generating high-quality hard negative samples for subsequent fine-tuning.
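As a concrete illustration, the snippet below sketches this in-context rewriting step with the Hugging Face transformers library. The prompt template, few-shot demonstrations, and the `generate_hard_negative` helper are illustrative assumptions for a single negative type; the paper's actual templates are the 50 manually selected ChatGPT rewrites described above.

```python
# Minimal sketch of LLM-driven hard negative generation via in-context learning.
# The few-shot prompt below is illustrative, NOT the paper's exact template.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # model named in the paper
    device_map="auto",
)

# A few manually curated rewrites act as in-context demonstrations
# (the paper curates 50 ChatGPT-written examples per negative type).
FEW_SHOT = (
    "Rewrite the caption by swapping two attributes while keeping it fluent.\n"
    "Caption: a black dog next to a white cat\n"
    "Negative: a white dog next to a black cat\n"
    "Caption: a tall man holding a small umbrella\n"
    "Negative: a small man holding a tall umbrella\n"
)

def generate_hard_negative(caption: str) -> str:
    """Generate one attribute-swap hard negative for `caption` (hypothetical helper)."""
    prompt = FEW_SHOT + f"Caption: {caption}\nNegative:"
    out = generator(prompt, max_new_tokens=40, do_sample=False,
                    return_full_text=False)
    # Keep only the first generated line as the rewritten caption.
    return out[0]["generated_text"].strip().splitlines()[0]

print(generate_hard_negative("a red car parked beside a blue bicycle"))
```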

DeGLA training framework

To mitigate the degradation in general performance while improving compositional understanding, we propose the DeGLA training framework, which decouples global and local alignment during training (a schematic loss sketch follows the list below).
- Global Alignment: On top of the NegCLIP loss, we integrate self-distillation between the fine-tuned model and a frozen EMA model to constrain subtle adjustments in the pre-trained embedding space.
- Local Alignment: We propose two local alignment losses, Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC), to further enhance the model's understanding of compositional concepts.
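For concreteness, here is a schematic PyTorch sketch of how the decoupled objective can be assembled. The structure follows the description above, but the exact formulations, loss weights, temperature, and EMA momentum are assumptions rather than the paper's equations.

```python
# Schematic sketch of the decoupled DeGLA objective described above.
# Everything below is an illustrative assumption built from the text,
# not the paper's exact formulation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Maintain a frozen EMA copy of the fine-tuned model as the distillation teacher."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


def global_alignment_loss(img, txt, img_ema, txt_ema, tau=0.07):
    """NegCLIP-style contrastive loss plus self-distillation toward the EMA teacher,
    constraining drift away from the pre-trained embedding space."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(img.size(0), device=img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    with torch.no_grad():  # teacher similarities from the frozen EMA model
        ema_logits = (F.normalize(img_ema, dim=-1) @
                      F.normalize(txt_ema, dim=-1).t()) / tau
    distill = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(ema_logits, dim=-1), reduction="batchmean")
    return contrastive + distill


def grounded_contrast(anchor, pos, negs, tau=0.07):
    """Shared form of the local losses: each anchor must rank its positive above
    its K hard negatives. IGC uses image anchors with caption candidates;
    TGC is the analogous text-grounded term."""
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.cat([pos.unsqueeze(1), negs], dim=1), dim=-1)  # (B, 1+K, D)
    sims = torch.einsum("bd,bkd->bk", anchor, cands) / tau
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)  # positive at index 0
    return F.cross_entropy(sims, target)
```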

📊 Experiment Results

Compositional Reasoning - VALSE

| Model | #Params | Existence (quantifiers) | Plurality (number) | Counting | Sp. rel. (relations) | Actions (repl.) | Actions (actant swap) | Coreference (standard) | Coreference (clean) | Foil-it! | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP | 583M | 86.3 | 73.2 | 68.1 | 71.5 | 77.2 | 61.1 | 53.8 | 48.2 | 93.8 | 70.0 |
| BEIT3 | 1.9B | 77.4 | 74.6 | 68.8 | 74.0 | 86.7 | 65.2 | 50.0 | 44.2 | 96.0 | 70.4 |
| BLIP2 | 3.4B | 55.5 | 71.5 | 66.0 | 62.4 | 83.6 | 51.6 | 48.6 | 51.9 | 95.9 | 65.4 |
| MiniGPT-4 | >9B | 65.5 | 72.5 | 67.4 | 68.4 | 83.2 | 58.8 | 52.6 | 51.0 | 95.8 | 68.4 |
| *Hard negative based methods* | | | | | | | | | | | |
| XVLM-coco | 216M | 83.0 | 75.6 | 67.5 | 70.2 | 73.8 | 68.6 | 46.4 | 49.6 | 94.8 | 69.5 |
| CE-XVLM | 216M | 83.5 | 72.8 | 72.1 | 68.7 | 71.8 | 69.1 | 51.0 | 46.8 | 93.8 | 70.8 |
| CLIP | 151M | 68.7 | 57.1 | 61.0 | 65.4 | 77.8 | 71.8 | 54.1 | 51.0 | 89.8 | 65.3 |
| CyCLIP | 151M | 69.3 | 58.3 | 61.0 | 66.4 | 78.1 | 72.0 | 53.2 | 51.6 | 88.8 | 65.5 |
| NegCLIP | 151M | 76.8 | 72.0 | 65.2 | 72.7 | 81.6 | 84.8 | 58.9 | 54.8 | 91.8 | 71.7 |
| Structure-CLIP | 151M | 75.6 | 67.1 | 62.0 | 68.2 | 80.4 | 88.3 | 44.5 | 58.7 | 91.2 | 69.1 |
| DeGLA (ours) | 151M | 82.4 | 73.8 | 68.3 | 75.3 | 82.6 | 88.8 | 58.5 | 54.8 | 93.8 | 74.1 (+1.9) |

Compositional Reasoning - SugarCrepe

| Model | REPLACE Object | REPLACE Attribute | REPLACE Relation | REPLACE Avg. | SWAP Object | SWAP Attribute | SWAP Avg. | ADD Object | ADD Attribute | ADD Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 100.0 | 99.0 | 97.0 | 98.7 | 99.0 | 100.0 | 99.5 | 99.0 | 99.0 | 99.0 |
| Vera | 49.4 | 49.6 | 49.1 | 49.4 | 49.4 | 49.2 | 49.3 | 49.4 | 49.6 | 49.5 |
| Grammar | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| BLIP2 | - | - | - | 86.7 | - | - | 69.8 | - | - | 86.5 |
| *Hard negative based methods* | | | | | | | | | | |
| CLIP | 90.9 | 80.0 | 69.2 | 80.2 | 62.7 | 61.4 | 64.0 | 77.2 | 68.2 | 72.7 |
| NegCLIP | 92.7 | 85.9 | 76.5 | 85.0 | 75.3 | 75.2 | 75.4 | 88.8 | 82.8 | 85.8 |
| Structure-CLIP | 91.4 | 85.0 | 74.4 | 83.6 | 72.7 | 80.5 | 76.6 | 85.5 | 81.1 | 83.3 |
| CE-CLIP | 93.1 | 88.8 | 79.0 | 87.0 | 72.8 | 77.0 | 74.9 | 92.4 | 93.4 | 92.9 |
| DeGLA | 94.5 | 92.6 | 84.2 | 90.5 (+3.5) | 81.6 | 82.1 | 81.9 (+6.9) | 93.8 | 95.7 | 94.8 (+1.9) |

Compositional Reasoning - ARO

| Model | Relation | Attribute | COCO-order | Flickr-order | Avg. |
|---|---|---|---|---|---|
| BLIP | 59.0 | 88.0 | - | - | - |
| BEIT3 (Wang et al., 2022) | 60.6 | 74.6 | - | - | - |
| BLIP2 | 41.2 | 71.3 | - | - | - |
| MiniGPT-4 | 46.9 | 55.7 | - | - | - |
| *Hard negative based methods* | | | | | |
| CLIP | 59.2 | 62.9 | 48.4 | 59.1 | 57.4 |
| CyCLIP | 59.1 | 65.4 | - | - | - |
| NegCLIP | 80.4 | 70.5 | 86.9 | 90.5 | 82.1 |
| Structure-CLIP | 81.8 | 80.5 | 81.7 | 83.9 | 82.0 |
| CE-CLIP | 83.9 | 76.4 | 80.9 | 83.7 | 81.2 |
| DeGLA | 81.6 | 74.3 | 93.8 | 94.7 | 86.1 (+4.9) |

Zero-shot Classification on 11 datasets

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pre-trained model* | | | | | | | | | | | | |
| CLIP | 86.5 | 61.0 | 78.5 | 79.6 | 58.4 | 59.9 | 48.8 | 38.7 | 86.3 | 15.3 | 57.9 | 61.0 |
| *Hard negative based methods* | | | | | | | | | | | | |
| NegCLIP | 86.1 | 59.9 | 72.1 | 78.7 | 53.9 | 56.8 | 43.5 | 37.7 | 84.3 | 11.6 | 54.0 | 58.1 |
| Structure-CLIP | 76.8 | 47.4 | 55.1 | 61.4 | 31.3 | 48.3 | 16.4 | 29.4 | 71.0 | 7.6 | 37.3 | 43.8 |
| CE-CLIP | 80.5 | 54.1 | 57.6 | 59.0 | 30.1 | 49.2 | 22.8 | 27.6 | 74.4 | 9.1 | 38.1 | 45.7 |
| DeGLA | 86.5 | 59.5 | 75.6 | 76.0 | 52.8 | 59.5 | 45.7 | 38.1 | 84.0 | 14.1 | 54.5 | 58.7 (+13.0) |

Linear probe on 11 datasets

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pre-trained model* | | | | | | | | | | | | |
| CLIP | 95.0 | 80.1 | 88.5 | 89.3 | 94.6 | 74.1 | 80.8 | 73.6 | 90.5 | 44.8 | 74.3 | 80.5 |
| *Hard negative based methods* | | | | | | | | | | | | |
| NegCLIP | 94.6 | 80.0 | 86.1 | 89.6 | 93.9 | 72.9 | 78.8 | 72.9 | 90.0 | 43.2 | 72.9 | 79.5 |
| Structure-CLIP | 91.9 | 75.5 | 81.2 | 86.2 | 89.6 | 69.0 | 67.4 | 67.7 | 65.2 | 37.7 | 67.7 | 72.7 |
| CE-CLIP | 94.3 | 78.5 | 84.3 | 88.1 | 92.6 | 71.0 | 74.1 | 71.8 | 88.3 | 39.6 | 70.7 | 77.6 |
| DeGLA | 95.1 | 80.5 | 86.7 | 89.5 | 94.6 | 74.0 | 78.8 | 73.0 | 89.6 | 43.5 | 73.4 | 79.9 (+2.3) |

Zero-shot image-text retrieval

(Figure: zero-shot image-text retrieval results)

👀 Qualitative Results

(Figure: qualitative results)

BibTeX

If you find this work useful, please consider citing our paper:

```bibtex
@misc{hu2025decoupledgloballocalalignmentimproving,
  title={Decoupled Global-Local Alignment for Improving Compositional Understanding},
  author={Xiaoxing Hu and Kaicheng Yang and Jun Wang and Haoran Xu and Ziyong Feng and Yupei Wang},
  year={2025},
  eprint={2504.16801},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.16801},
}
```