PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck

* Equal contribution.

CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or on classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB, an explainable and editable classifier that (1) expresses each class name as a set of text descriptors that describe the visual parts of that class; and (2) matches the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a large margin (~10x in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) in the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 🦜 and Dogs-120 🦮, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
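
To make step (2) concrete, here is a minimal sketch of how a per-class logit could be computed from part-to-descriptor matching, and why editing descriptors yields a new classifier without re-training. It assumes that the logit is the sum of dot products between each detected part embedding and the embedding of that part's descriptor; the function names, the toy 2-D vectors, and the class names `class_a` and `class_b` are hypothetical and not taken from the PEEB code.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def class_logit(part_embeddings: List[Vector],
                descriptor_embeddings: List[Vector]) -> float:
    """Assumed scoring rule: sum of part-to-descriptor dot products."""
    if len(part_embeddings) != len(descriptor_embeddings):
        raise ValueError("need exactly one descriptor per part")
    return sum(dot(p, d) for p, d in zip(part_embeddings, descriptor_embeddings))


def classify(part_embeddings: List[Vector],
             class_descriptors: Dict[str, List[Vector]]) -> str:
    """Return the class whose descriptors best match the detected parts."""
    logits = {name: class_logit(part_embeddings, descs)
              for name, descs in class_descriptors.items()}
    return max(logits, key=logits.get)


# Toy example: two parts, 2-D embeddings (all values are made up).
parts = [[1.0, 0.0], [0.0, 1.0]]
classes = {
    "class_a": [[1.0, 0.0], [0.0, 1.0]],
    "class_b": [[0.0, 1.0], [1.0, 0.0]],
}
before_edit = classify(parts, classes)

# "Editing" a class only swaps its descriptor embeddings; no weights change.
classes["class_b"] = [[1.0, 0.0], [0.0, 2.0]]
after_edit = classify(parts, classes)
```

In this sketch the classifier is fully determined by the descriptor embeddings, so replacing the descriptors of `class_b` changes the prediction without touching any learned parameters.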

🌟 Interactive Demo 🌟 on HuggingFace that shows how to edit a class's textual descriptors at inference time.

Acknowledgment: This work is supported by the National Science Foundation CAREER Award (No. 2145767), Adobe Research, and donations from the NaphCare Foundation.

Conference: NAACL 2024 Findings – Long Paper (acceptance rate: 869/2,434 = 35.7%). Poster download.

Video: 4-min PEEB demo video | 11-min presentation at NAACL 2024.


Figure 1: Existing explanations are either (a) textual but at the image level; or (b) part-level but not textual. Combining the best of both worlds, PEEB (c) first matches each detected object part to a text descriptor, then uses the part-level matching scores to classify the image.

Figure 3: During inference, the 12 visual part embeddings with the highest cosine similarity to the encoded part names are selected (a). These visual part embeddings are then mapped (→) to bounding boxes via the Box MLP. Simultaneously, the same embeddings are forwarded to the Part MLP, and its outputs are then matched (b) with textual part descriptors to make classification predictions (→). Fig. A1 shows a more detailed view of the same process.
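
The selection step (a) can be sketched as follows. This is only an illustration under one reading of the caption: for each encoded part name, the single visual embedding with the highest cosine similarity is kept, which gives one embedding per part (12 in PEEB). The helper names and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_part_embeddings(visual_embeddings: List[Vector],
                           part_name_embeddings: List[Vector]) -> List[int]:
    """For each part name, return the index of the best-matching visual embedding."""
    selected = []
    for name_emb in part_name_embeddings:
        sims = [cosine(v, name_emb) for v in visual_embeddings]
        selected.append(max(range(len(sims)), key=sims.__getitem__))
    return selected


# Toy example: three candidate visual embeddings, two part names.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
part_names = [[0.0, 2.0], [3.0, 3.0]]
indices = select_part_embeddings(visual, part_names)
```

The selected embeddings would then be passed both to a box head (to localize each part) and to a part head (to match against the textual descriptors), as the caption describes.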

PEEB achieves SOTA CUB-200 accuracy among the text descriptor-based classifiers in GZSL.

Table 4: PEEB consistently outperforms other vision-language methods in harmonic mean, especially on the hard split (SCE), where the margin is +5 to +15 points, highlighting its generalization capability in ZSL.
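
For readers unfamiliar with the metric, the harmonic mean commonly reported in generalized zero-shot learning combines accuracy on seen and unseen classes so that a model cannot score well by excelling on only one of them. The sketch below assumes Table 4 uses this standard definition; the function name and the input values are illustrative and are not results from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy: 2*S*U / (S + U)."""
    if seen_acc < 0 or unseen_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0  # both accuracies are zero
    return 2.0 * seen_acc * unseen_acc / total


# Made-up accuracies (in %): a large seen/unseen gap pulls the score down.
example = harmonic_mean(80.0, 20.0)
```

With these made-up inputs the arithmetic mean would be 50, while the harmonic mean is much closer to the weaker of the two accuracies, which is why it is a stricter measure of generalization.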
