Decoupled Global-Local Alignment for Improving Compositional Understanding

Xiaoxing Hu1* Kaicheng Yang2* Jun Wang2 Haoran Xu3 Ziyong Feng2 Yupei Wang1†

1Beijing Institute of Technology 2DeepGlint 3Zhejiang University
*Equal contribution †Corresponding author

Code Paper

📖 Introduction

(Figure: teaser)

Motivation

Prior studies have demonstrated that CLIP lacks compositional understanding—particularly in comprehending relational and attributive concepts. While existing approaches attempt to address this limitation through fine-tuning with hard negative samples, they exhibit a critical drawback: the improvement in compositional understanding often comes at the expense of significant degradation in general performance.

This raises a crucial question: how can we enhance CLIP's compositional understanding while preserving its general capabilities, maintaining a good trade-off between the two?

🛠️ Our Solution

(Figure: teaser)

LLM-driven Hard Negative Generation

Unlike previous rule-based or unmasking-based hard negative generation methods, we propose a hard negative generation method that leverages large language models (LLMs). Specifically, we first use ChatGPT to generate high-quality rewritten examples for each type of negative sample. From these, we manually select 50 examples to serve as templates for large-scale rewriting. We then harness the in-context learning capability of the Llama-3.1-8B-Instruct model to conduct large-scale rewrites, generating high-quality hard negative samples for subsequent fine-tuning.
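As a concrete illustration, the snippet below sketches this in-context rewriting step with the Hugging Face transformers library. The prompt template, few-shot demonstrations, and the `generate_hard_negative` helper are illustrative assumptions for a single negative type; the paper's actual templates are the 50 manually selected ChatGPT rewrites described above.

```python
# Minimal sketch of LLM-driven hard negative generation via in-context learning.
# The few-shot prompt below is illustrative, NOT the paper's exact template.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # model named in the paper
    device_map="auto",
)

# A few manually curated rewrites act as in-context demonstrations
# (the paper curates 50 ChatGPT-written examples per negative type).
FEW_SHOT = (
    "Rewrite the caption by swapping two attributes while keeping it fluent.\n"
    "Caption: a black dog next to a white cat\n"
    "Negative: a white dog next to a black cat\n"
    "Caption: a tall man holding a small umbrella\n"
    "Negative: a small man holding a tall umbrella\n"
)

def generate_hard_negative(caption: str) -> str:
    """Generate one attribute-swap hard negative for `caption` (hypothetical helper)."""
    prompt = FEW_SHOT + f"Caption: {caption}\nNegative:"
    out = generator(prompt, max_new_tokens=40, do_sample=False,
                    return_full_text=False)
    # Keep only the first generated line as the rewritten caption.
    return out[0]["generated_text"].strip().splitlines()[0]

print(generate_hard_negative("a red car parked beside a blue bicycle"))
```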

DeGLA training framework

To mitigate the degradation in general performance while improving compositional understanding, we propose the DeGLA training framework, which decouples global and local alignment during training (a schematic loss sketch follows the list below).
- Global Alignment: On top of the NegCLIP loss, we integrate self-distillation between the fine-tuned model and a frozen EMA model to constrain subtle adjustments in the pre-trained embedding space.
- Local Alignment: We propose two local alignment losses, Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC), to further enhance the model's understanding of compositional concepts.
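For concreteness, here is a schematic PyTorch sketch of how the decoupled objective can be assembled. The structure follows the description above, but the exact formulations, loss weights, temperature, and EMA momentum are assumptions rather than the paper's equations.

```python
# Schematic sketch of the decoupled DeGLA objective described above.
# Everything below is an illustrative assumption built from the text,
# not the paper's exact formulation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Maintain a frozen EMA copy of the fine-tuned model as the distillation teacher."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


def global_alignment_loss(img, txt, img_ema, txt_ema, tau=0.07):
    """NegCLIP-style contrastive loss plus self-distillation toward the EMA teacher,
    constraining drift away from the pre-trained embedding space."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(img.size(0), device=img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    with torch.no_grad():  # teacher similarities from the frozen EMA model
        ema_logits = (F.normalize(img_ema, dim=-1) @
                      F.normalize(txt_ema, dim=-1).t()) / tau
    distill = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(ema_logits, dim=-1), reduction="batchmean")
    return contrastive + distill


def grounded_contrast(anchor, pos, negs, tau=0.07):
    """Shared form of the local losses: each anchor must rank its positive above
    its K hard negatives. IGC uses image anchors with caption candidates;
    TGC is the analogous text-grounded term."""
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.cat([pos.unsqueeze(1), negs], dim=1), dim=-1)  # (B, 1+K, D)
    sims = torch.einsum("bd,bkd->bk", anchor, cands) / tau
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)  # positive at index 0
    return F.cross_entropy(sims, target)
```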

📊 Experiment Results

Compositional Reasoning - VALSE

| Model | #Params | Existence (quantifiers) | Plurality (number) | Counting | Sp. rel. (relations) | Actions (repl.) | Actions (actant swap) | Coreference (standard) | Coreference (clean) | Foil-it! | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP | 583M | 86.3 | 73.2 | 68.1 | 71.5 | 77.2 | 61.1 | 53.8 | 48.2 | 93.8 | 70.0 |
| BEIT3 | 1.9B | 77.4 | 74.6 | 68.8 | 74.0 | 86.7 | 65.2 | 50.0 | 44.2 | 96.0 | 70.4 |
| BLIP2 | 3.4B | 55.5 | 71.5 | 66.0 | 62.4 | 83.6 | 51.6 | 48.6 | 51.9 | 95.9 | 65.4 |
| MiniGPT-4 | >9B | 65.5 | 72.5 | 67.4 | 68.4 | 83.2 | 58.8 | 52.6 | 51.0 | 95.8 | 68.4 |
| *Hard negative based methods* | | | | | | | | | | | |
| XVLM-coco | 216M | 83.0 | 75.6 | 67.5 | 70.2 | 73.8 | 68.6 | 46.4 | 49.6 | 94.8 | 69.5 |
| CE-XVLM | 216M | 83.5 | 72.8 | 72.1 | 68.7 | 71.8 | 69.1 | 51.0 | 46.8 | 93.8 | 70.8 |
| CLIP | 151M | 68.7 | 57.1 | 61.0 | 65.4 | 77.8 | 71.8 | 54.1 | 51.0 | 89.8 | 65.3 |
| CyCLIP | 151M | 69.3 | 58.3 | 61.0 | 66.4 | 78.1 | 72.0 | 53.2 | 51.6 | 88.8 | 65.5 |
| NegCLIP | 151M | 76.8 | 72.0 | 65.2 | 72.7 | 81.6 | 84.8 | 58.9 | 54.8 | 91.8 | 71.7 |
| Structure-CLIP | 151M | 75.6 | 67.1 | 62.0 | 68.2 | 80.4 | 88.3 | 44.5 | 58.7 | 91.2 | 69.1 |
| DeGLA (ours) | 151M | 82.4 | 73.8 | 68.3 | 75.3 | 82.6 | 88.8 | 58.5 | 54.8 | 93.8 | 74.1 (+1.9) |

Compositional Reasoning - SugarCrepe

| Model | REPLACE Object | REPLACE Attribute | REPLACE Relation | REPLACE Avg. | SWAP Object | SWAP Attribute | SWAP Avg. | ADD Object | ADD Attribute | ADD Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 100.0 | 99.0 | 97.0 | 98.7 | 99.0 | 100.0 | 99.5 | 99.0 | 99.0 | 99.0 |
| Vera | 49.4 | 49.6 | 49.1 | 49.4 | 49.4 | 49.2 | 49.3 | 49.4 | 49.6 | 49.5 |
| Grammar | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| BLIP2 | - | - | - | 86.7 | - | - | 69.8 | - | - | 86.5 |
| *Hard negative based methods* | | | | | | | | | | |
| CLIP | 90.9 | 80.0 | 69.2 | 80.2 | 62.7 | 61.4 | 64.0 | 77.2 | 68.2 | 72.7 |
| NegCLIP | 92.7 | 85.9 | 76.5 | 85.0 | 75.3 | 75.2 | 75.4 | 88.8 | 82.8 | 85.8 |
| Structure-CLIP | 91.4 | 85.0 | 74.4 | 83.6 | 72.7 | 80.5 | 76.6 | 85.5 | 81.1 | 83.3 |
| CE-CLIP | 93.1 | 88.8 | 79.0 | 87.0 | 72.8 | 77.0 | 74.9 | 92.4 | 93.4 | 92.9 |
| DeGLA | 94.5 | 92.6 | 84.2 | 90.5 (+3.5) | 81.6 | 82.1 | 81.9 (+6.9) | 93.8 | 95.7 | 94.8 (+1.9) |

Compositional Reasoning - ARO

| Model | Relation | Attribute | COCO-order | Flickr-order | Avg. |
|---|---|---|---|---|---|
| BLIP | 59.0 | 88.0 | - | - | - |
| BEIT3 (Wang et al., 2022) | 60.6 | 74.6 | - | - | - |
| BLIP2 | 41.2 | 71.3 | - | - | - |
| MiniGPT-4 | 46.9 | 55.7 | - | - | - |
| *Hard negative based methods* | | | | | |
| CLIP | 59.2 | 62.9 | 48.4 | 59.1 | 57.4 |
| CyCLIP | 59.1 | 65.4 | - | - | - |
| NegCLIP | 80.4 | 70.5 | 86.9 | 90.5 | 82.1 |
| Structure-CLIP | 81.8 | 80.5 | 81.7 | 83.9 | 82.0 |
| CE-CLIP | 83.9 | 76.4 | 80.9 | 83.7 | 81.2 |
| DeGLA | 81.6 | 74.3 | 93.8 | 94.7 | 86.1 (+4.9) |

Zero-shot Classification on 11 datasets

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pre-trained model* | | | | | | | | | | | | |
| CLIP | 86.5 | 61.0 | 78.5 | 79.6 | 58.4 | 59.9 | 48.8 | 38.7 | 86.3 | 15.3 | 57.9 | 61.0 |
| *Hard negative based methods* | | | | | | | | | | | | |
| NegCLIP | 86.1 | 59.9 | 72.1 | 78.7 | 53.9 | 56.8 | 43.5 | 37.7 | 84.3 | 11.6 | 54.0 | 58.1 |
| Structure-CLIP | 76.8 | 47.4 | 55.1 | 61.4 | 31.3 | 48.3 | 16.4 | 29.4 | 71.0 | 7.6 | 37.3 | 43.8 |
| CE-CLIP | 80.5 | 54.1 | 57.6 | 59.0 | 30.1 | 49.2 | 22.8 | 27.6 | 74.4 | 9.1 | 38.1 | 45.7 |
| DeGLA | 86.5 | 59.5 | 75.6 | 76.0 | 52.8 | 59.5 | 45.7 | 38.1 | 84.0 | 14.1 | 54.5 | 58.7 (+13.0) |

Linear probe on 11 datasets

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pre-trained model* | | | | | | | | | | | | |
| CLIP | 95.0 | 80.1 | 88.5 | 89.3 | 94.6 | 74.1 | 80.8 | 73.6 | 90.5 | 44.8 | 74.3 | 80.5 |
| *Hard negative based methods* | | | | | | | | | | | | |
| NegCLIP | 94.6 | 80.0 | 86.1 | 89.6 | 93.9 | 72.9 | 78.8 | 72.9 | 90.0 | 43.2 | 72.9 | 79.5 |
| Structure-CLIP | 91.9 | 75.5 | 81.2 | 86.2 | 89.6 | 69.0 | 67.4 | 67.7 | 65.2 | 37.7 | 67.7 | 72.7 |
| CE-CLIP | 94.3 | 78.5 | 84.3 | 88.1 | 92.6 | 71.0 | 74.1 | 71.8 | 88.3 | 39.6 | 70.7 | 77.6 |
| DeGLA | 95.1 | 80.5 | 86.7 | 89.5 | 94.6 | 74.0 | 78.8 | 73.0 | 89.6 | 43.5 | 73.4 | 79.9 (+2.3) |

Zero-shot image-text retrieval

(Figure: zero-shot image-text retrieval results)

👀 Qualitative Results

(Figure: qualitative results)

BibTeX

If you find this work useful, please consider citing our paper:

```bibtex
@misc{hu2025decoupledgloballocalalignmentimproving,
  title={Decoupled Global-Local Alignment for Improving Compositional Understanding},
  author={Xiaoxing Hu and Kaicheng Yang and Jun Wang and Haoran Xu and Ziyong Feng and Yupei Wang},
  year={2025},
  eprint={2504.16801},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.16801},
}
```