Spurious Features Robustness

1. A Sober Look at the Robustness of CLIPs to Spurious Features

The elephant in the room is that the test sets commonly adopted to evaluate the robustness of CLIP models (i.e., ImageNet variants) were designed primarily for ImageNet-trained models. These datasets may not faithfully reflect the robustness of CLIP, because CLIP models are pre-trained on large-scale data that may already cover, and possibly extend beyond, the distributions of those ImageNet variants.

Is there a benchmark that reflects CLIP's actual reliance on spurious features?

CounterAnimal:

  • The easy group: animals in commonly appearing backgrounds, on which CLIP models tend to make correct predictions.
  • The hard group: animals in less common yet still plausible backgrounds, on which CLIP models are likely to misclassify them.
  • CounterAnimal captures general spurious correlations within CLIP.
  • ImageNet models are more robust than CLIP models to the spurious correlations captured by CounterAnimal.
  • Larger CLIP models are more robust.
  • CLIP models trained on high-quality data are more robust.
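
One natural way to quantify reliance on the background, given the easy and hard groups above, is to compare a model's accuracy on the two groups: a large drop from easy to hard suggests the predictions lean on the background. The sketch below is only an illustration under that assumption. The function names, the record layout (group, true label, predicted label), and the toy predictions are all hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict


def group_accuracy(records):
    """Return accuracy per group.

    records: iterable of (group, true_label, predicted_label) tuples.
    This record layout is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted_label in records:
        total[group] += 1
        correct[group] += int(true_label == predicted_label)
    return {group: correct[group] / total[group] for group in total}


def easy_to_hard_drop(records):
    """Accuracy on the easy group minus accuracy on the hard group."""
    acc = group_accuracy(records)
    return acc["easy"] - acc["hard"]


# Made-up predictions for one class, for demonstration only.
toy_records = [
    ("easy", "polar bear", "polar bear"),
    ("easy", "polar bear", "polar bear"),
    ("easy", "polar bear", "polar bear"),
    ("easy", "polar bear", "brown bear"),
    ("hard", "polar bear", "polar bear"),
    ("hard", "polar bear", "brown bear"),
    ("hard", "polar bear", "brown bear"),
    ("hard", "polar bear", "polar bear"),
]

toy_acc = group_accuracy(toy_records)
toy_drop = easy_to_hard_drop(toy_records)
print(toy_acc, toy_drop)
```

Comparing this drop across models (for example, CLIP models of different sizes, or CLIP models versus ImageNet-trained models) is one way to read the robustness claims in the list above.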

References

  1. A Sober Look at the Robustness of CLIPs to Spurious Features