Figure 4: Properties of the six networks evaluated in this work.
We group the models into two types: 1-image and 2-image.
1-image models include CNN (C) and ViT (V) while the 2-image group contains DeepFace-EMD (D).
Hybrid-ViT can be 1-image (H1) or 2-image (H2 and H2L).
H2 and H2L differ in the Transformer output: [CLS] for H2 vs. 2-Linear for H2L.
