Abstract
As technology improves, criminals find new ways to gain unauthorised access. Accordingly, face spoofing
has become more prevalent in face recognition systems, with attackers gaining illegal access using simple,
non-intrusive presentation attacks, such as replaying a video containing the victim’s face. With social media
making it easy to obtain images and videos without raising suspicion, we must detect these presentation
attacks to prevent attackers from causing harm. Traditional face antispoofing methods relied on human-engineered features, whose limited representation capacity left a gap that deep learning has filled in recent years. However, these deep learning methods still need further improvements,
especially for presentation attack detection in the wild. In this study, we use generative models as a data
augmentation strategy to improve the face antispoofing performance of a vision transformer. Furthermore,
we propose an unsupervised keyframe selection process to remove near-duplicate frames and increase
the variation among the samples. More specifically, we trained a separate StyleGAN3-R model for each attack vector and used these models to generate candidate samples. We implemented two generative data augmentation
approaches: one trained on all the available frames (GAN3) and the other trained with only the keyframes
(KFGAN3). To provide a point of comparison with our generative approach, we also generated candidate samples using traditional data augmentation methods. We preserved each candidate sample's label by using only the following geometric transformations: random horizontal flips, rotations (within 15 degrees), and enlargements (within 20%).
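As a minimal sketch (not taken from our implementation), these label-preserving geometric augmentations could be expressed with torchvision; the flip probability below is an illustrative assumption, while the rotation and scale bounds mirror the description above.

import torchvision.transforms as T

# Label-preserving geometric augmentations: horizontal flips, rotations
# within 15 degrees, and enlargements of up to 20%.
geometric_augmentations = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # flip probability is an assumption
    T.RandomRotation(degrees=15),                  # rotation sampled in [-15, 15] degrees
    T.RandomAffine(degrees=0, scale=(1.0, 1.2)),   # enlargement of up to 20%
])

# Hypothetical usage on a PIL face crop:
# candidate_sample = geometric_augmentations(face_crop)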
We selected a ViT-B/32 Vision Transformer, pre-trained on the ImageNet dataset, as our baseline face
antispoofing model. We constructed a separate face antispoofing pipeline for each data augmentation approach, distinguished by the candidate samples it used. We conducted our experiments on
the Spoof in the Wild (SiW) dataset and CASIA Face Antispoofing Database (CASIA-FASD) using the
following data augmentation percentages: 5%, 10%, 20%, and 30%. Our GAN3 approach performed
the best on SiW protocol 2, achieving an Average Classification Error Rate (ACER) of 3.29%, and our
KFGAN3 approach performed the best on protocol 3, achieving an ACER of 7.37%. As for CASIA-FASD,
our GAN3 approach achieved the best Equal Error Rate (EER) of 1.72%, and our KFGAN3 achieved the
best ACER of 1.34%. We conducted an ablation study using dependent frame analysis to classify each
video. Our KFGAN3 approach achieved an ACER of 0% on both SiW protocols, using a window size of
15 frames. Furthermore, our GAN3 approach achieved an ACER of 1.11% on CASIA-FASD protocol
7, using a window size of 7 frames. Accordingly, we achieved state-of-the-art performance on both datasets in terms of ACER. We found that keyframes were essential for improving the performance
of unknown presentation attack detection. Our results suggest that GAN-based data augmentation is an
effective method for enhancing face antispoofing performance, especially when the models are trained
using keyframes.
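For reference, the error metrics reported above follow the standard ISO/IEC 30107-3 conventions; the definitions below are standard restatements, not contributions of this work:

\[
\mathrm{ACER} = \frac{\mathrm{APCER} + \mathrm{BPCER}}{2},
\]

where APCER (Attack Presentation Classification Error Rate) is the proportion of attack presentations misclassified as bona fide, BPCER (Bona Fide Presentation Classification Error Rate) is the proportion of bona fide presentations misclassified as attacks, and the EER is the operating point at which the false acceptance rate equals the false rejection rate.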