The shape and simplicity biases of adversarially robust ImageNet-trained CNNs
Peijie Chen*, Chirag Agarwal*, Anh Nguyen*
Links: pdf | code | project page
* All authors contributed equally.
Adversarial training has been the subject of dozens of studies and is a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across the AlexNet, GoogLeNet, and ResNet-50 architectures. We find that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of “robustifying” the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns, i.e., a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features, i.e., textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the mechanisms that make networks more adversarially robust and also explain some recent findings, e.g., why R networks benefit from much larger capacity and can act as a strong image prior in image synthesis.
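For context, the R classifiers analyzed here are produced by adversarial training, which replaces clean training examples with worst-case perturbed ones found by projected gradient descent (PGD) and minimizes the loss on those. Below is a minimal PyTorch sketch of one such training step; the function names, the L-infinity budget eps, the step size alpha, and the step count are illustrative assumptions, not the exact recipe used to train the models studied in the paper.

```python
# Minimal sketch of PGD-based adversarial training (min-max formulation).
# Hyperparameters and helper names are illustrative assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Inner maximization: gradient ascent on the loss within an L-inf ball of radius eps.
    Assumes inputs x are already scaled to [0, 1]."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()        # ascend the loss
            delta.clamp_(-eps, eps)                   # stay inside the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep the perturbed image valid
        delta.grad.zero_()
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: one optimizer step on adversarial examples instead of clean ones."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()  # discard gradients accumulated during the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step over the training set yields a robust classifier; the paper then compares the hidden representations and out-of-distribution behavior of such R networks against their standardly trained counterparts.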
Acknowledgment: This work is supported by the National Science Foundation under Grant No. 1850117.
4-min Spotlight presentation at the 5th Annual Workshop on Human Interpretability in Machine Learning (WHI 2020)