Prior studies have demonstrated that CLIP lacks compositional understanding, particularly of relational and attributive concepts. Existing approaches attempt to address this limitation by fine-tuning with hard negative samples, but they exhibit a critical drawback: the gain in compositional understanding often comes at the cost of a significant degradation in general performance.
This raises a crucial question: how can we enhance CLIP's compositional understanding while preserving its general capabilities, maintaining a favorable trade-off between the two?
Unlike previous rule-based or unmasking-based hard negative generation methods, we propose a hard negative generation method that leverages large language models (LLMs). Specifically, we first use ChatGPT to produce high-quality rewritten examples for each type of negative sample. From these, we manually select 50 examples to serve as templates for large-scale rewriting. We then harness the in-context learning capability of the Llama3.1-8B-Instruct model to perform rewriting at scale, generating high-quality hard negative samples for subsequent fine-tuning, as sketched below.
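As a rough illustration of this stage, the sketch below feeds a few curated rewrite templates to Llama3.1-8B-Instruct as in-context examples. The prompt wording, the few-shot pair, and the generation settings are our own assumptions for illustration, not the exact ones used for the paper.

```python
# Hedged sketch: LLM-based hard-negative rewriting via in-context learning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto")

# A few manually curated ChatGPT rewrites serve as in-context templates.
# This single pair is a placeholder; the paper uses 50 curated examples.
FEW_SHOT = [
    ("A dog chases a cat across the yard.",    # original caption
     "A cat chases a dog across the yard."),   # relation-swapped hard negative
]

def make_hard_negative(caption: str) -> str:
    """Rewrite one caption into a relation-type hard negative."""
    messages = [{"role": "system",
                 "content": "Rewrite the caption by swapping the relation "
                            "between its subject and object. Change nothing else."}]
    for src, tgt in FEW_SHOT:
        messages.append({"role": "user", "content": src})
        messages.append({"role": "assistant", "content": tgt})
    messages.append({"role": "user", "content": caption})
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()
```

In practice this loop would run over every caption in the fine-tuning corpus, once per negative-sample type (relation, attribute, etc.), each with its own instruction and templates.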
To counter the degradation in general performance while improving compositional understanding, we propose the DeGLA training framework, which decouples global and local alignment during training:
- **Global Alignment**: On top of the NegCLIP loss, we integrate self-distillation between the fine-tuned model and a frozen EMA model, constraining fine-tuning to subtle adjustments of the pre-trained embedding space (see the first sketch after this list).
- **Local Alignment**: We propose two local alignment losses, Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC), to further enhance the model's understanding of compositional concepts (see the second sketch after this list).
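A minimal sketch of the global-alignment self-distillation, assuming the EMA model acts as a frozen teacher whose image-text similarity distribution the fine-tuned student is regularized toward. The momentum value, temperature, and KL form are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Momentum-update the frozen EMA teacher from the fine-tuned student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def self_distill_loss(s_img, s_txt, t_img, t_txt, tau=0.5):
    """KL divergence between the EMA teacher's and the student's image-to-text
    similarity distributions; keeps fine-tuning close to the pre-trained
    embedding space. All inputs are L2-normalized embeddings of shape (B, D)."""
    s_logits = s_img @ s_txt.t() / tau
    t_logits = t_img @ t_txt.t() / tau
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```

Here the teacher would start as a copy of the pre-trained CLIP weights, `ema_update` would run after each optimizer step, and `self_distill_loss` would be added to the NegCLIP objective with a weighting coefficient.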
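For the local alignment losses, the sketch below shows one plausible reading rather than the paper's exact formulation: IGC anchors each image against its positive caption and its K generated hard-negative captions, while TGC anchors each positive caption against its image and those same hard negatives, both as per-sample (local) softmaxes. Shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def local_contrast(anchor, pos, negs, tau=0.07):
    """Per-sample (local) contrast: rank the positive above K hard negatives.
    anchor: (B, D), pos: (B, D), negs: (B, K, D); all L2-normalized."""
    pos_sim = (anchor * pos).sum(dim=-1, keepdim=True)    # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negs)    # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau   # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

# IGC: the image is the anchor, contrasted against its positive caption and
# the hard-negative captions generated for that caption (assumed roles):
#   igc = local_contrast(image_emb, pos_text_emb, hard_neg_text_emb)
# TGC: the positive caption is the anchor, pulled toward its image and pushed
# away from the hard negatives in embedding space (assumed roles):
#   tgc = local_contrast(pos_text_emb, image_emb, hard_neg_text_emb)
```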
**Compositional reasoning on VALSE (accuracy, %):**

| Model | #Params | Existence (quantifiers) | Plurality (number) | Counting | Sp. rel. (relations) | Actions (repl.) | Actions (actant swap) | Coref. (standard) | Coref. (clean) | Foil-it! | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP | 583M | 86.3 | 73.2 | 68.1 | 71.5 | 77.2 | 61.1 | 53.8 | 48.2 | 93.8 | 70.0 |
| BEIT3 | 1.9B | 77.4 | 74.6 | 68.8 | 74.0 | 86.7 | 65.2 | 50.0 | 44.2 | 96.0 | 70.4 |
| BLIP2 | 3.4B | 55.5 | 71.5 | 66.0 | 62.4 | 83.6 | 51.6 | 48.6 | 51.9 | 95.9 | 65.4 |
| MiniGPT-4 | >9B | 65.5 | 72.5 | 67.4 | 68.4 | 83.2 | 58.8 | 52.6 | 51.0 | 95.8 | 68.4 |
| *Hard-negative-based methods* | | | | | | | | | | | |
| XVLM-coco | 216M | 83.0 | 75.6 | 67.5 | 70.2 | 73.8 | 68.6 | 46.4 | 49.6 | 94.8 | 69.5 |
| CE-XVLM | 216M | 83.5 | 72.8 | 72.1 | 68.7 | 71.8 | 69.1 | 51.0 | 46.8 | 93.8 | 70.8 |
| CLIP | 151M | 68.7 | 57.1 | 61.0 | 65.4 | 77.8 | 71.8 | 54.1 | 51.0 | 89.8 | 65.3 |
| CyCLIP | 151M | 69.3 | 58.3 | 61.0 | 66.4 | 78.1 | 72.0 | 53.2 | 51.6 | 88.8 | 65.5 |
| NegCLIP | 151M | 76.8 | 72.0 | 65.2 | 72.7 | 81.6 | 84.8 | 58.9 | 54.8 | 91.8 | 71.7 |
| Structure-CLIP | 151M | 75.6 | 67.1 | 62.0 | 68.2 | 80.4 | 88.3 | 44.5 | 58.7 | 91.2 | 69.1 |
| DeGLA (ours) | 151M | 82.4 | 73.8 | 68.3 | 75.3 | 82.6 | 88.8 | 58.5 | 54.8 | 93.8 | 74.1 (+1.9) |
**Compositional reasoning on SugarCrepe (accuracy, %):**

| Model | REPLACE (Object) | REPLACE (Attribute) | REPLACE (Relation) | REPLACE (Avg.) | SWAP (Object) | SWAP (Attribute) | SWAP (Avg.) | ADD (Object) | ADD (Attribute) | ADD (Avg.) |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 100.0 | 99.0 | 97.0 | 98.7 | 99.0 | 100.0 | 99.5 | 99.0 | 99.0 | 99.0 |
| Vera | 49.4 | 49.6 | 49.1 | 49.4 | 49.4 | 49.2 | 49.3 | 49.4 | 49.6 | 49.5 |
| Grammar | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| BLIP2 | - | - | - | 86.7 | - | - | 69.8 | - | - | 86.5 |
| *Hard-negative-based methods* | | | | | | | | | | |
| CLIP | 90.9 | 80.0 | 69.2 | 80.2 | 62.7 | 61.4 | 64.0 | 77.2 | 68.2 | 72.7 |
| NegCLIP | 92.7 | 85.9 | 76.5 | 85.0 | 75.3 | 75.2 | 75.4 | 88.8 | 82.8 | 85.8 |
| Structure-CLIP | 91.4 | 85.0 | 74.4 | 83.6 | 72.7 | 80.5 | 76.6 | 85.5 | 81.1 | 83.3 |
| CE-CLIP | 93.1 | 88.8 | 79.0 | 87.0 | 72.8 | 77.0 | 74.9 | 92.4 | 93.4 | 92.9 |
| DeGLA | 94.5 | 92.6 | 84.2 | 90.5 (+3.5) | 81.6 | 82.1 | 81.9 (+6.9) | 93.8 | 95.7 | 94.8 (+1.9) |
**Compositional reasoning on ARO (accuracy, %):**

| Model | Relation | Attribute | COCO-order | Flickr-order | Avg. |
|---|---|---|---|---|---|
| BLIP | 59.0 | 88.0 | - | - | - |
| BEIT3 (Wang et al., 2022) | 60.6 | 74.6 | - | - | - |
| BLIP2 | 41.2 | 71.3 | - | - | - |
| MiniGPT-4 | 46.9 | 55.7 | - | - | - |
| *Hard-negative-based methods* | | | | | |
| CLIP | 59.2 | 62.9 | 48.4 | 59.1 | 57.4 |
| CyCLIP | 59.1 | 65.4 | - | - | - |
| NegCLIP | 80.4 | 70.5 | 86.9 | 90.5 | 82.1 |
| Structure-CLIP | 81.8 | 80.5 | 81.7 | 83.9 | 82.0 |
| CE-CLIP | 83.9 | 76.4 | 80.9 | 83.7 | 81.2 |
| DeGLA | 81.6 | 74.3 | 93.8 | 94.7 | 86.1 (+4.9) |
**Zero-shot image classification on 11 datasets (top-1 accuracy, %):**

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pretrained model* | | | | | | | | | | | | |
| CLIP | 86.5 | 61.0 | 78.5 | 79.6 | 58.4 | 59.9 | 48.8 | 38.7 | 86.3 | 15.3 | 57.9 | 61.0 |
| *Hard-negative-based methods* | | | | | | | | | | | | |
| NegCLIP | 86.1 | 59.9 | 72.1 | 78.7 | 53.9 | 56.8 | 43.5 | 37.7 | 84.3 | 11.6 | 54.0 | 58.1 |
| Structure-CLIP | 76.8 | 47.4 | 55.1 | 61.4 | 31.3 | 48.3 | 16.4 | 29.4 | 71.0 | 7.6 | 37.3 | 43.8 |
| CE-CLIP | 80.5 | 54.1 | 57.6 | 59.0 | 30.1 | 49.2 | 22.8 | 27.6 | 74.4 | 9.1 | 38.1 | 45.7 |
| DeGLA | 86.5 | 59.5 | 75.6 | 76.0 | 52.8 | 59.5 | 45.7 | 38.1 | 84.0 | 14.1 | 54.5 | 58.7 (+13.0) |
**Linear probing on 11 datasets (top-1 accuracy, %):**

| Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD | Caltech101 | Aircraft | ImageNet | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Pretrained model* | | | | | | | | | | | | |
| CLIP | 95.0 | 80.1 | 88.5 | 89.3 | 94.6 | 74.1 | 80.8 | 73.6 | 90.5 | 44.8 | 74.3 | 80.5 |
| *Hard-negative-based methods* | | | | | | | | | | | | |
| NegCLIP | 94.6 | 80.0 | 86.1 | 89.6 | 93.9 | 72.9 | 78.8 | 72.9 | 90.0 | 43.2 | 72.9 | 79.5 |
| Structure-CLIP | 91.9 | 75.5 | 81.2 | 86.2 | 89.6 | 69.0 | 67.4 | 67.7 | 65.2 | 37.7 | 67.7 | 72.7 |
| CE-CLIP | 94.3 | 78.5 | 84.3 | 88.1 | 92.6 | 71.0 | 74.1 | 71.8 | 88.3 | 39.6 | 70.7 | 77.6 |
| DeGLA | 95.1 | 80.5 | 86.7 | 89.5 | 94.6 | 74.0 | 78.8 | 73.0 | 89.6 | 43.5 | 73.4 | 79.9 (+2.3) |
If you find this work useful, please consider citing our paper:
```bibtex
@misc{hu2025decoupledgloballocalalignmentimproving,
  title={Decoupled Global-Local Alignment for Improving Compositional Understanding},
  author={Xiaoxing Hu and Kaicheng Yang and Jun Wang and Haoran Xu and Ziyong Feng and Yupei Wang},
  year={2025},
  eprint={2504.16801},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.16801},
}
```