Visual Classification via Description from Large Language Models

ICLR 2023, Notable Top 5% (Oral)

Columbia University
An example image of a hen.

By comparing images to text descriptors of visual categories ("spots") rather than just their names ("Dalmatian"), we achieve interpretable, editable image classification with higher accuracy.

Additional examples of classification decisions.


Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By using only the category name, this procedure neglects the rich context of additional information that language affords. It provides no intermediate understanding of why a category is chosen, and furthermore no mechanism for adjusting the criteria used to reach that decision.

We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way.
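
Concretely, the decision rule scores each category by the average similarity between the image embedding and the embeddings of that category's descriptor phrases, then picks the highest-scoring category. A minimal sketch with NumPy (function and variable names are illustrative; in the paper the embeddings come from CLIP's encoders and the descriptors from GPT-3):

```python
import numpy as np

def classify_by_description(image_emb, class_descriptor_embs):
    """Score each class by the mean cosine similarity between the image
    embedding and that class's descriptor embeddings; return the best class.

    image_emb: (d,) unit-normalized image embedding (e.g., from CLIP).
    class_descriptor_embs: dict mapping class name -> (k, d) array of
        unit-normalized text embeddings, one per descriptor phrase such as
        "tiger, which has stripes".
    """
    scores = {
        name: float(np.mean(descs @ image_emb))  # average descriptor similarity
        for name, descs in class_descriptor_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Because the score decomposes over descriptors, the per-descriptor similarities double as an explanation of the decision, and editing the descriptor list directly edits the classification criteria.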

Extensive experiments show our framework has numerous advantages beyond interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.


Descriptor Explorer

You can explore the descriptors for all 1,000 ImageNet classes and their top retrievals here.

For example:
  • killer whale
    • long, curved dorsal fin
    • large flippers
  • buckle
    • a metal or plastic fastener
    • used to secure a belt, strap, or other piece of clothing
    • can be decorated or plain
    • may have a logo or other design on it
  • volcano
    • a large, cone-shaped mountain
    • lava or ash flowing from the crater
  • hummingbird
    • long, thin beak
    • wings that move very quickly
  • Newfoundland dog
    • thick, waterproof coat
    • soulful eyes
  • church
    • a tall, pointed roof
    • stained glass windows
  • ...continued
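
Descriptor lists like the ones above are generated automatically by prompting a large language model for the visual features of each category. A hedged sketch of that step (the prompt wording is paraphrased from the paper; `descriptor_prompt` and `parse_descriptors` are illustrative helpers, not the authors' code):

```python
def descriptor_prompt(category: str) -> str:
    """Build a GPT-3-style prompt asking for visual features of a category.

    The trailing "-" primes the model to continue a dashed list.
    """
    return (
        f"Q: What are useful visual features for distinguishing a {category} "
        f"in a photo?\n"
        f"A: There are several useful visual features to tell there is a "
        f"{category} in a photo:\n-"
    )

def parse_descriptors(completion: str) -> list[str]:
    """Split the model's dashed-list continuation into descriptor strings."""
    return [
        line.strip("- ").strip()
        for line in completion.splitlines()
        if line.strip("- ").strip()
    ]
```

Each parsed descriptor is then combined with the category name (e.g., "hummingbird, which has a long, thin beak") before being embedded by the VLM's text encoder.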



Per-descriptor similarity scores for an example query image; the number beside each class name is the average of its descriptors' scores:

  • breakfast burrito  31.4
    • eggs 32.2
    • a flour tortilla 31.8
    • vegetables 31.5
    • salsa 31.4
    • hot sauce 31.1
    • meat 31.0
    • cheese 30.7
  • ceviche  28.17
    • served with onions, peppers, and cilantro 28.7
    • a dish of seafood 28.5
    • may be garnished with avocado, lime, and/or chili peppers 28.2
    • typically includes fish, shrimp, and/or squid 28.0
    • marinated in citrus juice 27.6
  • cannoli  25.54
    • chocolate chips 27.6
    • an Italian pastry 25.4
    • a tube-shaped shell 25.0
    • filled with sweetened ricotta 25.0
    • made of fried dough 24.7
  • bibimbap  23.22
    • can be served with kimchi on the side 25.2
    • topped with vegetables, meat, and/or an egg 24.5
    • often served with gochujang (red chili pepper paste) 21.6
    • a bowl of rice 21.6
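
As the numbers above suggest, each class score is simply the mean of its descriptor scores; for instance, averaging the five cannoli descriptor similarities recovers the 25.54 class score:

```python
# Descriptor similarity scores transcribed from the cannoli entry above.
cannoli_scores = [27.6, 25.4, 25.0, 25.0, 24.7]
class_score = sum(cannoli_scores) / len(cannoli_scores)
print(round(class_score, 2))  # → 25.54, matching the class score shown
```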


BibTeX

@article{menon2023visual,
  author    = {Menon, Sachit and Vondrick, Carl},
  title     = {Visual Classification via Description from Large Language Models},
  journal   = {ICLR},
  year      = {2023},
}