DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover’s Distance Improves Out-Of-Distribution Face Identification

Hai Phan, Anh Nguyen

Links: pdf | code | project page

Face identification (FI) is ubiquitous and drives many high-stakes decisions made by law enforcement. State-of-the-art FI approaches compare two images by taking the cosine similarity between their image embeddings. Yet, this approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) not included in the training set or the gallery. Here, we propose a re-ranking approach that compares two faces using the Earth Mover’s Distance on the deep, spatial features of image patches. Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without finetuning feature extractors, our method consistently improves accuracy on all tested OOD queries (masked, cropped, rotated, and adversarially perturbed) while obtaining similar results on in-distribution images.
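To make the baseline concrete, below is a minimal sketch of the standard image-level comparison the abstract describes: ranking gallery faces by cosine similarity between their embeddings. The function name and tensor shapes are illustrative assumptions, not our exact API; the embeddings would come from any pretrained face encoder (e.g., ArcFace).

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_emb: torch.Tensor, gallery_embs: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted by descending cosine similarity.

    query_emb:    (D,)   image-level embedding of the query face.
    gallery_embs: (N, D) image-level embeddings of the gallery faces.
    """
    q = F.normalize(query_emb, dim=0)      # unit-length query vector
    g = F.normalize(gallery_embs, dim=1)   # unit-length gallery vectors
    sims = g @ q                           # (N,) cosine similarities
    return sims.argsort(descending=True)   # most similar first
```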

Acknowledgment: This work is supported by the National Science Foundation under Grant No. 1850117, and donations from the NaphCare Foundation.

Conference: CVPR 2022 (acceptance rate: 2,067/8,161 = 25.33%)

🌟 Interactive Demo 🌟  https://aub.ie/face (users can edit the query with StyleGAN-v2 and observe how changes, e.g., adding masks or changing glasses or lighting, affect the retrieved results).


Figure 1: Traditional face identification ranks gallery images based on their cosine distance with the query (top row) at the image-level embedding, which yields large errors upon out-of-distribution changes in the input (e.g. masks or sunglasses; b–d). We find that re-ranking the top-$k$ shortlisted faces from Stage 1 (leftmost column) using their patch-wise EMD similarity w.r.t. the query substantially improves the precision (Stage 2) on challenging cases (b–d). The “flow” visualization intuitively shows the patch-wise reconstruction of the query face using the most similar patches (i.e. highest flow) from the retrieved face.


Figure 2: Our 2-stage face identification pipeline. Stage 1 ranks gallery images based on their cosine distance with the query face at the image-embedding level. Stage 2 then re-ranks the top-$k$ shortlisted candidates from Stage 1 using EMD at the patch-embedding level.
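The sketch below illustrates Stage 2 under simplifying assumptions: each face is represented by patch embeddings of shape (P, D), patch weights are uniform (the weighting schemes in Figure 3 would replace `w`), and entropy-regularized Sinkhorn iteration stands in as one standard approximate EMD solver. Helper names and hyperparameters are illustrative, not necessarily our exact implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_emd(cost: torch.Tensor, w_q: torch.Tensor, w_g: torch.Tensor,
                 eps: float = 0.05, iters: int = 100):
    """Entropy-regularized optimal transport; returns (distance, flow)."""
    K = torch.exp(-cost / eps)                 # (P, P) Gibbs kernel
    u = torch.ones_like(w_q)
    for _ in range(iters):                     # alternating scaling updates
        u = w_q / (K @ (w_g / (K.t() @ u)))
    v = w_g / (K.t() @ u)
    flow = u[:, None] * K * v[None, :]         # (P, P) transport plan
    return (flow * cost).sum(), flow

def emd_rerank(query_patches, gallery_patches_list, topk_idx):
    """Re-rank Stage 1's top-k shortlist by patch-wise EMD (lower = better)."""
    q = F.normalize(query_patches, dim=1)      # (P, D) query patch embeddings
    P = q.shape[0]
    w = torch.full((P,), 1.0 / P)              # uniform patch weights
    dists = []
    for i in topk_idx:
        g = F.normalize(gallery_patches_list[i], dim=1)
        cost = 1.0 - q @ g.t()                 # cosine distance per patch pair
        d, _ = sinkhorn_emd(cost, w, w)
        dists.append(d)
    order = torch.stack(dists).argsort()       # ascending EMD
    return [topk_idx[i] for i in order.tolist()]
```

Because Stage 2 only touches the top-$k$ candidates, its cost is a small constant on top of Stage 1's linear scan over the gallery.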

Figure 3: The results of assigning weights to 4×4 patches for ArcFace under three different techniques. Based on the per-patch density of detected landmarks (– – –), LMK (c) often assigns a higher weight to the center of a face (regardless of occlusions). In contrast, SC and APC assign a higher weight to patches of higher patch-wise and patch-vs-image similarity, respectively. APC tends to down-weight a facial feature (e.g., blue regions around sunglasses or mouth) if its corresponding feature is occluded in the other image (b). In contrast, SC is insensitive to occlusions (a).
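A hedged sketch of APC-style weighting as the caption describes it: each patch of one image is weighted by its similarity to the whole other image, approximated here by the other image's average-pooled patch embedding, so patches whose counterpart is occluded receive low weight. The clamping and normalization details are assumptions for illustration, not our exact formulation.

```python
import torch
import torch.nn.functional as F

def apc_weights(patches_a: torch.Tensor, patches_b: torch.Tensor) -> torch.Tensor:
    """Weight each patch of image A by its correlation with image B's
    average-pooled embedding; clamp negatives and normalize to a distribution.

    patches_a: (P, D) patch embeddings of image A.
    patches_b: (P, D) patch embeddings of image B.
    """
    avg_b = F.normalize(patches_b.mean(dim=0), dim=0)  # (D,) global summary of B
    sims = F.normalize(patches_a, dim=1) @ avg_b       # (P,) patch-vs-image similarity
    w = torch.clamp(sims, min=0)                       # keep non-negative mass only
    return w / w.sum().clamp(min=1e-8)                 # normalize into a distribution
```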

Figure 4: Our re-ranking based on patch-wise similarity using ArcFace (4×4 grid; APC) pushes more relevant gallery images higher up (here, we show top-5 results), improving face identification precision under various types of occlusions. The “flow” visualization intuitively shows the patch-wise reconstruction of the query (top-left) given the highest-correspondence patches (i.e. largest flow) from a gallery face. The darker a patch, the lower the flow. For example, despite ∼50% of the face being masked out (a), Nelson Mandela can be correctly retrieved as Stage 2 finds gallery faces with similar forehead patches.
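A minimal sketch of the “flow” visualization described above: reconstruct the query by copying, into each query patch position, the gallery patch that sends it the most flow under the transport plan (darkening by flow magnitude is omitted here). The grid size and function name are illustrative assumptions.

```python
import torch

def reconstruct_from_flow(flow: torch.Tensor, gallery_img: torch.Tensor,
                          grid: int = 4) -> torch.Tensor:
    """flow: (P, P) transport plan (query patches x gallery patches);
    gallery_img: (C, H, W) gallery image whose patches are copied into a
    query-shaped canvas at the positions they best explain."""
    C, H, W = gallery_img.shape
    ph, pw = H // grid, W // grid                    # patch height / width
    recon = torch.zeros_like(gallery_img)
    best = flow.argmax(dim=1)                        # top gallery patch per query patch
    for q in range(grid * grid):
        g = best[q].item()
        qr, qc = divmod(q, grid)                     # query patch row/col
        gr, gc = divmod(g, grid)                     # matched gallery patch row/col
        recon[:, qr*ph:(qr+1)*ph, qc*pw:(qc+1)*pw] = \
            gallery_img[:, gr*ph:(gr+1)*ph, gc*pw:(gc+1)*pw]
    return recon
```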