gScoreCAM: What objects is CLIP looking at?
Peijie Chen, Qi Li, Saad Biaz, Trung Bui, Anh Nguyen
Links: pdf | code | project page
Large-scale, multimodal models trained on web data, such as OpenAI's CLIP, are becoming the foundation of many applications. Yet, they are also harder to understand, test, and align with human values. In this paper, we propose gScoreCAM, a state-of-the-art method for visualizing the main objects that CLIP looks at in an image. On zero-shot object detection, gScoreCAM performs similarly to ScoreCAM, the best prior art on CLIP, yet is 8 to 10 times faster. Our method outperforms other existing, well-known methods (HilaCAM, RISE, and the entire CAM family) by a large margin, especially in multi-object scenes. gScoreCAM sub-samples the k = 300 highest-gradient channels (out of 3,072 channels, i.e. reducing complexity by roughly 10 times) and linearly combines them into a final "attention" visualization. We demonstrate the utility and superiority of our method on three datasets: ImageNet, COCO, and PartImageNet. Our work opens up interesting future directions in understanding and de-biasing CLIP.
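For readers who want the gist in code, below is a minimal, unofficial sketch of the top-k-gradient idea summarized above (see the code link for our released implementation). It assumes OpenAI's `clip` package, an RN50x16 backbone whose last convolutional block `model.visual.layer4` outputs 3,072 channels, channel ranking by mean absolute gradient, and softmax-normalized masked-image scores; these details are illustrative assumptions rather than the paper's exact settings.

```python
# Unofficial gScoreCAM-style sketch (illustrative, not the released code).
# Assumes: pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x16", device=device)  # layer4 outputs 3,072 channels

activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["feat"] = output

def bwd_hook(_, __, grad_output):
    gradients["feat"] = grad_output[0]

# Hook the last conv block of CLIP's visual encoder (an assumed hook point).
h_fwd = model.visual.layer4.register_forward_hook(fwd_hook)
h_bwd = model.visual.layer4.register_full_backward_hook(bwd_hook)

def gscorecam(image_path, text, k=300, batch=64):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([text]).to(device)
    text_feat = F.normalize(model.encode_text(tokens), dim=-1)

    # 1) Forward + backward: image-text similarity and its gradients
    #    w.r.t. the last conv activations.
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    score = (img_feat * text_feat).sum()
    model.zero_grad()
    score.backward()

    acts = activations["feat"].detach()   # (1, C, h, w)
    grads = gradients["feat"].detach()    # (1, C, h, w)

    # 2) Keep only the k channels with the largest mean |gradient|
    #    (the "g" in gScoreCAM).
    channel_grad = grads.abs().mean(dim=(2, 3)).squeeze(0)   # (C,)
    topk = channel_grad.topk(k).indices
    maps = acts[0, topk]                                     # (k, h, w)

    # 3) ScoreCAM-style weighting: mask the image with each upsampled,
    #    normalized channel map and score the masked image against the text.
    H, W = image.shape[-2:]
    up = F.interpolate(maps.unsqueeze(1), size=(H, W),
                       mode="bilinear", align_corners=False)
    up = (up - up.amin(dim=(2, 3), keepdim=True)) / (
        up.amax(dim=(2, 3), keepdim=True) - up.amin(dim=(2, 3), keepdim=True) + 1e-8)

    weights = []
    with torch.no_grad():
        for i in range(0, k, batch):
            masked = image * up[i:i + batch]                  # (b, 3, H, W)
            feats = F.normalize(model.encode_image(masked), dim=-1)
            weights.append((feats @ text_feat.T).squeeze(-1))
    weights = torch.softmax(torch.cat(weights), dim=0)        # (k,)

    # 4) Weighted sum of the k maps -> ReLU -> normalize -> heatmap.
    cam = torch.relu((weights.view(k, 1, 1) * maps).sum(0))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=(H, W), mode="bilinear",
                         align_corners=False)[0, 0].cpu()
```

Calling, for example, `gscorecam("image.jpg", "a photo of a dog")` (a hypothetical input) returns an H×W heatmap in [0, 1] to overlay on the image. Because only the k = 300 selected channels need masked forward passes instead of all 3,072, this is where the roughly 10x reduction over full ScoreCAM comes from.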
Acknowledgment: This work is supported by the National Science Foundation under Grants No. 1850117 and 2145767, and by donations from Adobe Research and the NaphCare Foundation.
Conference: Asian Conference on Computer Vision (ACCV 2022). Oral presentation (acceptance rate: 41/836 = 4.9%) | 6-min video | slide deck
- ⭐️ An interactive demo created by Replicate.com just for our work ⭐️
- A Google Colab demo by Peijie Chen