Distract Your Attention: Multi-Head Cross Attention Network for Facial Expression Recognition


This paper presents a novel facial expression recognition network, called Distract Your Attention Network (DAN). Our method is based on two key observations about visual perception. Firstly, multiple facial expression classes share an inherently similar underlying facial appearance, and their differences can be subtle. Secondly, facial expressions exhibit themselves through multiple facial regions simultaneously, so recognition requires a holistic approach that encodes high-order interactions among local features. To address these issues, this work proposes DAN with three key components: a Feature Clustering Network (FCN), a Multi-head cross Attention Network (MAN), and an Attention Fusion Network (AFN). Specifically, the FCN extracts robust features by adopting a large-margin learning objective to maximize class separability. In addition, the MAN instantiates a number of attention heads to attend simultaneously to multiple facial areas and build attention maps on these regions. Further, the AFN distracts these attentions to multiple locations before fusing the feature maps into a comprehensive one. Extensive experiments on three public datasets (AffectNet, RAF-DB, and SFEW 2.0) verify that the proposed method consistently achieves state-of-the-art facial expression recognition performance. The DAN code is publicly available.
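
To make the FCN -> MAN -> AFN flow concrete, the sketch below shows one plausible PyTorch rendering of the pipeline. The backbone, the per-head attention design, the layer sizes, and the simple mean fusion are illustrative assumptions rather than the authors' implementation; the large-margin objective on FCN features and the regularization that "distracts" the heads toward different regions are omitted here.

```python
# A minimal sketch of the DAN pipeline, under the assumptions stated above.
import torch
import torch.nn as nn


class AttentionHead(nn.Module):
    """One MAN head: builds a spatial attention map and pools the attended features."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),  # per-location attention logit
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        attn = self.spatial(x)               # (B, 1, H, W) attention map
        return (x * attn).mean(dim=(2, 3))   # attended feature vector: (B, C)


class DANSketch(nn.Module):
    def __init__(self, channels: int = 512, num_heads: int = 4, num_classes: int = 7):
        super().__init__()
        # FCN stand-in: any convolutional backbone producing a (B, C, H, W) map;
        # in the paper, a large-margin objective on these features maximizes
        # class separability (not implemented in this sketch).
        self.fcn = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # MAN: several heads attending to (ideally different) facial regions.
        self.heads = nn.ModuleList(AttentionHead(channels) for _ in range(num_heads))
        # AFN stand-in: fuse per-head features, then classify expressions.
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):
        fmap = self.fcn(x)                                            # shared feature map
        per_head = torch.stack([h(fmap) for h in self.heads], dim=1)  # (B, K, C)
        fused = per_head.mean(dim=1)                                  # naive fusion over heads
        return self.classifier(fused)                                 # expression logits


logits = DANSketch()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```

In this reading, each head learns its own spatial attention map over the shared feature map, so the network can cover multiple facial areas at once; the fusion step is where the AFN would combine the per-head evidence into a single comprehensive representation.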