Biometric Authentication Based on Enhanced Remote Photoplethysmography Signal Morphology

Zhaodong Sun1, Xiaobai Li2,1, Jukka Komulainen1, and Guoying Zhao1
1Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
2State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
{zhaodong.sun, jukka.komulainen, guoying.zhao}@oulu.fi, xiaobai.li@zju.edu.cn
Corresponding author.
Abstract

Remote photoplethysmography (rPPG) is a non-contact method for measuring cardiac signals from facial videos, offering a convenient alternative to contact photoplethysmography (cPPG) obtained from contact sensors. Recent studies have shown that each individual possesses a unique cPPG signal morphology that can be utilized as a biometric identifier, which has inspired us to utilize the morphology of rPPG signals extracted from facial videos for person authentication. Since the facial appearance and rPPG are mixed in the facial videos, we first de-identify facial videos to remove facial appearance while preserving the rPPG information, which protects facial privacy and guarantees that only rPPG is used for authentication. The de-identified videos are fed into an rPPG model to get the rPPG signal morphology for authentication. In the first training stage, unsupervised rPPG training is performed to get coarse rPPG signals. In the second training stage, an rPPG-cPPG hybrid training is performed by incorporating external cPPG datasets to achieve rPPG biometric authentication and enhance rPPG signal morphology. Our approach needs only de-identified facial videos with subject IDs to train rPPG authentication models. The experimental results demonstrate that rPPG signal morphology hidden in facial videos can be used for biometric authentication. The code is available at https://github.com/zhaodongsun/rppg_biometrics.

1 Introduction

Facial videos contain subtle skin color changes, invisible to the naked eye, that are caused by blood volume changes during each heartbeat. The remote photoplethysmography (rPPG) signals recovered from these changes provide valuable cardiovascular information, such as heart rate. Similarly, contact photoplethysmography (cPPG) uses contact sensors to capture color changes in fingertips and thereby monitor blood volume changes. cPPG signals have been used for biometric authentication [12, 11]. Given the similar nature and measurement principles of rPPG and cPPG [26], rPPG also has potential for biometric authentication. However, the feasibility of rPPG biometric authentication has not yet been validated. Hence, our research questions are: 1) Can rPPG signals be employed for biometric authentication? 2) If so, how can an rPPG-based biometric system be developed? 3) What are the advantages of rPPG biometrics?

Figure 1: (a) rPPG authentication system. (b) rPPG morphology enhancement: our method improves the morphology information of rPPG signals. Fiducial points [22] such as the systolic and diastolic peaks are the main subject-specific biometric characteristics in rPPG signals.

We first examine the quality and discriminative power of rPPG signals. rPPG signals are derived from subtle changes in facial color caused by blood volume changes during heartbeats. Recent advances [49, 20] have achieved high-quality rPPG measurement, especially when the face has little or no movement, so obtaining high-quality rPPG signals is feasible. The open question is whether these high-quality rPPG signals contain subject-specific biometric characteristics. One previous work [32] tried using rPPG for biometrics, but this preliminary study was limited by a small-scale dataset and low-quality rPPG, and its authentication performance was inadequate for practical applications.

In this paper, we propose an rPPG-based method for biometric authentication, as shown in Fig. 1(a). Because facial appearance and rPPG are mixed together in facial videos, we first de-identify the facial videos while preserving the rPPG information. This step guarantees that only rPPG information, and not facial appearance, is used for biometric authentication, and it also conceals sensitive facial appearance for privacy protection. The first module is the rPPG model, which extracts rPPG signals from the de-identified facial videos. The second module is the rPPG-Authn model, which uses rPPG morphology to output person authentication results. To exploit rPPG morphology for biometric authentication, we design a two-stage training strategy whose second stage is an rPPG-cPPG hybrid training that incorporates external cPPG datasets. Fig. 1(b) illustrates the resulting rPPG morphology enhancement. Note that we only use de-identified videos with subject IDs for rPPG biometrics.

rPPG biometrics has several advantages. Compared with face recognition, an rPPG biometric system only uses de-identified facial videos, eliminating the need for sensitive facial appearance. Moreover, rPPG biometrics offers additional resistance to spoofing, as rPPG inherently serves as a countermeasure to presentation attacks [21, 19]. In contrast, without dedicated presentation attack detection (PAD) methods, conventional face recognition algorithms are vulnerable to presentation attacks and thus less secure than rPPG-based biometrics. Additionally, since rPPG biometrics and face recognition both use facial videos as their data source, combining the two modalities can potentially enhance both accuracy and security. Compared with cPPG biometrics, rPPG biometrics is non-contact and only requires off-the-shelf cameras, whereas cPPG biometrics requires dedicated contact sensors such as pulse oximeters. Compared with iris recognition [46, 5], which requires iris scanners, rPPG biometrics only requires inexpensive RGB cameras and is robust to presentation attacks.

Our contributions include:

  1. We propose a new biometric authentication method based on rPPG. We use two-stage training to enhance rPPG morphology and achieve accurate biometric authentication. We show that de-identified facial videos are effective for rPPG biometric authentication while protecting facial appearance privacy.

  2. We conduct comprehensive experiments on multiple datasets to validate the discriminative power of rPPG biometrics. We demonstrate that rPPG biometrics can achieve performance comparable to cPPG biometrics. We also investigate factors that may influence the performance of rPPG biometrics.

  3. We find that our rPPG-based biometric method enhances rPPG morphology, which opens up possibilities for learning rPPG morphology from facial videos.

2 Related Work

2.1 rPPG Measurement

[41] first proposed measuring rPPG from face videos via the green channel. Subsequent handcrafted methods were introduced to improve the quality of the rPPG signal [34, 6, 18, 40, 45]. Recently, deep learning (DL) approaches for rPPG measurement have grown rapidly. Several studies [4, 37, 20, 31, 16] feed consecutive video frames into 2D convolutional neural networks (CNNs) for rPPG measurement. Another group of DL-based methods [28, 29, 23, 24, 7] builds a spatiotemporal signal map from different facial regions and feeds it into 2DCNN models. 3DCNN-based methods [50] and transformer-based methods [52, 51] have been proposed to improve spatiotemporal modeling and long-range spatiotemporal perception.

Additionally, multiple unsupervised rPPG methods [8, 43, 39, 36, 47, 53] have been proposed. Since ground-truth (GT) signals are expensive to collect and synchronize in rPPG datasets, unsupervised rPPG methods train on facial videos alone, without any GT signal, and achieve performance similar to supervised methods. However, most work on rPPG measurement focuses on the accuracy of heart rate estimation and neglects rPPG morphology.

2.2 cPPG-based Biometrics

[10] made the first attempt to use cPPG for biometric authentication, extracting fundamental morphological features such as the upward and downward slopes of peaks. Subsequent studies explored additional morphological features, including cPPG derivatives [48] and fiducial points [22]. More recently, researchers have employed DL methods to extract morphological features automatically. [25, 2, 15] directly input cPPG signals into 1DCNN or long short-term memory (LSTM) architectures for biometric authentication, while [12, 11] cut cPPG signals into periodic segments and feed multiple representations of these segments into a 1DCNN model. Furthermore, [12] collected datasets for cPPG biometrics and investigated the permanence of cPPG biometrics. There is one preliminary work on rPPG biometrics [32], but it only applied a traditional independent component analysis (ICA) based method [34] for rPPG extraction, which yields low-quality rPPG morphology for biometric authentication.

3 Method

Our method consists of facial video de-identification and two training stages. As the rPPG signal does not rely on facial appearance, we first de-identify the input video so that our method cannot use facial appearance. In the first training stage, we perform unsupervised rPPG training on the de-identified videos to achieve basic rPPG signal measurement. In the second training stage, we use rPPG-cPPG hybrid training for biometric authentication and rPPG morphology enhancement.

3.1 Face De-identification for rPPG Biometrics

We propose to de-identify facial videos using spatial downsampling and pixel permutation. This step obfuscates facial appearance while preserving the rPPG information. Since rPPG signals are spatially redundant across facial regions and largely independent of spatial information, as shown by [40, 27], they are well preserved by this step, while facial appearance is completely erased. Face de-identification serves two purposes. First, facial appearance and rPPG information are intertwined in facial videos; removing facial appearance ensures that the biometric model performs recognition based solely on rPPG information. Second, removing facial appearance protects facial privacy during rPPG authentication.


Figure 2: Face de-identification for rPPG biometrics. The facial appearance is obfuscated while rPPG information is retained.

The facial video is de-identified as shown in Fig. 2. Faces in the original videos are cropped using OpenFace [1] by locating the boundary landmarks. The cropped facial video $v\in\mathbb{R}^{T\times H\times W\times 3}$, where $T$, $H$, and $W$ are the time length, height, and width, is downsampled by averaging the pixels within each sampling region to obtain $v_d\in\mathbb{R}^{T\times 6\times 6\times 3}$. Such downsampled facial videos have been shown to remain effective for rPPG estimation [40, 27]. Since rPPG signal extraction does not depend strongly on spatial information [40], we further permute the pixels to completely obfuscate the spatial information, obtaining $v_{de}\in\mathbb{R}^{T\times 6\times 6\times 3}$. The permutation pattern is the same for every frame within a video but differs between videos. Since the spatial information has been eliminated, we reshape the de-identified video $v_{de}$ into a spatiotemporal (ST) map $M\in\mathbb{R}^{36\times T\times 3}$ as a compact rPPG representation, following [27].
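The de-identification step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `deidentify_to_st_map` is hypothetical, OpenFace cropping is assumed to have been done already, and the "sampling regions" are assumed to be the cells of an evenly spaced 6×6 grid, which the text does not specify.

```python
import numpy as np

def deidentify_to_st_map(video, grid=6, rng=None):
    """Hypothetical sketch: downsample a cropped face video (T, H, W, 3) to a
    grid x grid x 3 mosaic per frame, permute the cells with one video-level
    permutation, and return an ST map of shape (grid*grid, T, 3)."""
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, C = video.shape
    # Assumed sampling regions: an evenly spaced grid over the cropped face.
    h_edges = np.linspace(0, H, grid + 1).astype(int)
    w_edges = np.linspace(0, W, grid + 1).astype(int)
    v_d = np.empty((T, grid, grid, C))
    for r in range(grid):
        for c in range(grid):
            cell = video[:, h_edges[r]:h_edges[r + 1], w_edges[c]:w_edges[c + 1], :]
            v_d[:, r, c, :] = cell.mean(axis=(1, 2))  # average pixels in the cell
    # One permutation of the 36 cells, shared by every frame of this video;
    # calling the function again draws a new permutation for the next video.
    perm = rng.permutation(grid * grid)
    v_de = v_d.reshape(T, grid * grid, C)[:, perm, :]
    # (T, 36, 3) -> (36, T, 3): each row is one permuted cell's color trace.
    return np.transpose(v_de, (1, 0, 2))
```

With equally sized cells, averaging the 36 rows of the returned map reproduces the per-frame mean color of the cropped face, which illustrates why a global pulse trace survives the permutation even though the spatial layout does not.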

3.2 The 1st training stage: rPPG Unsupervised Pre-training

This stage aims to train a basic rPPG model capable of extracting rPPG with precise heartbeats. We use unsupervised training to obtain the basic rPPG model. The main reasons for unsupervised training are: 1) Unsupervised rPPG training does not require GT PPG signals from contact sensors, which means only facial videos with subject IDs are required in our entire method. 2) The performance of unsupervised rPPG training [8, 39] is on par with supervised methods.


Figure 3: The diagram of Contrast-Phys-2D (CP2D) for rPPG unsupervised pre-training based on contrastive learning.

We adopt the unsupervised Contrast-Phys (CP) architecture [39] and customize it for 2D ST-map inputs, since CP only accepts face videos as inputs. The modified method, called Contrast-Phys-2D (CP2D), is shown in Fig. 3. Two ST maps $M, M'\in\mathbb{R}^{36\times T\times 3}$ from two different videos are the inputs of the rPPG model $G$, where $T$ corresponds to 10 seconds. The rPPG model $G$ is a 2D convolutional neural network that outputs rPPG ST maps $M_{rppg}, M'_{rppg}\in\mathbb{R}^{S\times T}$, in which rPPG signals are stacked vertically. As in CP, the spatial dimension $S$ is set to 4. The architecture of the rPPG model $G$ is presented in the supplementary materials.
Inspired by the spatiotemporal rPPG sampling in CP, we use a patch of shape $(1, T/2)$ to randomly draw $N=16$ rPPG ST samples $\{m_{rppg(1)},\dots,m_{rppg(N)}\}$ and $\{m'_{rppg(1)},\dots,m'_{rppg(N)}\}$ from the rPPG ST maps $M_{rppg}$ and $M'_{rppg}$, respectively.
The rPPG ST samples are averaged along the spatial dimension to obtain rPPG samples $\{p_1,\dots,p_N\}$ and $\{p'_1,\dots,p'_N\}$ and their power spectral densities (PSDs) $\{f_1,\dots,f_N\}$ and $\{f'_1,\dots,f'_N\}$. Using rPPG prior knowledge [39], namely rPPG spatiotemporal similarity and cross-video rPPG dissimilarity, we form positive pairs $(f_i, f_j)$ or $(f'_i, f'_j)$ and negative pairs $(f_i, f'_j)$, which enter the positive and negative terms of the contrastive loss $L$.
The contrastive loss $L$ pulls together PSDs originating from the same video and pushes apart PSDs from different videos; it is given below. During inference, the rPPG ST map $M_{rppg}\in\mathbb{R}^{S\times T}$ is averaged along the spatial dimension to obtain the rPPG signal $s_{rppg}\in\mathbb{R}^{T}$.

$$L=\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N}\frac{\lVert f_i-f_j\rVert^2+\lVert f'_i-f'_j\rVert^2}{2N(N-1)}-\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{\lVert f_i-f'_j\rVert^2}{N^2} \tag{1}$$
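The sampling, PSD, and loss computations of Eq. (1) can be illustrated with NumPy, applied to rPPG ST maps that a model $G$ has already produced. This is a sketch under assumptions, not the CP2D implementation: the helper names are hypothetical, the PSD is a plain FFT periodogram restricted to an assumed 0.66 to 3 Hz heart-rate band and normalized to unit sum (the text gives none of these details), and real training would compute the loss inside an autodiff framework so that gradients reach $G$.

```python
import numpy as np

def sample_st_patches(m_rppg, n_samples=16, rng=None):
    """Randomly crop n_samples patches of shape (1, T/2) from an rPPG ST map
    of shape (S, T). A one-row patch is already 1-D, so averaging it along
    the spatial axis leaves it unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    S, T = m_rppg.shape
    L = T // 2
    rows = rng.integers(0, S, size=n_samples)
    starts = rng.integers(0, T - L + 1, size=n_samples)
    return np.stack([m_rppg[r, s:s + L] for r, s in zip(rows, starts)])

def normalized_psd(samples, fs, band=(0.66, 3.0)):
    """Periodogram of each sample, restricted to an assumed heart-rate band
    (0.66-3 Hz, about 40-180 bpm) and normalized to sum to 1."""
    x = samples - samples.mean(axis=1, keepdims=True)
    power = np.abs(np.fft.rfft(x, axis=1)) ** 2
    freqs = np.fft.rfftfreq(x.shape[1], d=1.0 / fs)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    p = power[:, keep]
    return p / (p.sum(axis=1, keepdims=True) + 1e-12)

def cp2d_loss(f, f_prime):
    """Eq. (1): averaged intra-video PSD distances (positive pairs) minus
    averaged cross-video PSD distances (negative pairs)."""
    N = f.shape[0]
    d_ff = ((f[:, None, :] - f[None, :, :]) ** 2).sum(axis=-1)
    d_pp = ((f_prime[:, None, :] - f_prime[None, :, :]) ** 2).sum(axis=-1)
    d_fp = ((f[:, None, :] - f_prime[None, :, :]) ** 2).sum(axis=-1)
    # Diagonal terms of d_ff and d_pp are zero, so summing every entry is
    # the same as summing over j != i.
    positive = (d_ff.sum() + d_pp.sum()) / (2 * N * (N - 1))
    negative = d_fp.sum() / N ** 2
    return positive - negative
```

For example, with $M_{rppg}$ holding a 1.2 Hz pulse and $M'_{rppg}$ a 2.0 Hz pulse (10 s at an assumed 30 fps), the PSDs within each video coincide and the two videos' PSDs are disjoint, so the loss is close to its minimum of -2. At inference, `m_rppg.mean(axis=0)` gives the signal $s_{rppg}$.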


Figure 4: GT cPPG signal and rPPG signal extracted by CP2D. After the first training stage, the rPPG signal has accurate heartbeats but lacks morphology information.

However, since CP2D does not utilize any prior knowledge about morphology, the resulting rPPG signals lack morphology information. Fig. 4 shows a GT cPPG signal and an rPPG signal produced by CP2D. CP2D generates an rPPG signal with accurate heartbeats that align with those of the cPPG signal. However, the morphological features, such as the dicrotic notch and diastolic peak evident in the cPPG morphology, are not clearly discernible in the rPPG signals. Since these morphological features play a crucial role in differentiating individuals, we aim to further refine the rPPG signal morphology at the second training stage.

3.3 The 2nd training stage: rPPG-cPPG Hybrid Training

In the second training stage, we further refine the rPPG signals to recover morphology information. Fig. 5 shows the rPPG-cPPG hybrid training, where the rPPG branch uses face videos and ID labels during training. The cPPG branch uses external cPPG biometric datasets to encourage the PPG-Morph model $H$ to learn morphology information, which is incorporated into the rPPG branch through the shared model $H$. The PPG-Morph model $H$ comprises 1DCNN layers and transformer layers that extract morphological features from periodic segments. The two branches are trained alternately so that morphology information is shared between them. Note that our method only requires de-identified facial videos with subject IDs during training (enrollment) and only de-identified facial videos during inference.


Figure 5: rPPG-cPPG hybrid training. The rPPG branch and cPPG branch are trained alternately so that external cPPG signals can be fully exploited to enhance rPPG morphology.

3.3.1 rPPG Branch

The rPPG branch extracts rPPG morphology and uses it to differentiate individuals. This branch only requires a de-identified facial video $v_{de}$ and its ID label $i_{rppg}$, and does not need any GT cPPG signal for training. Therefore, de-identified facial videos with ID labels are sufficient for enrollment in the proposed rPPG biometrics scheme. The ST map $M$ derived from the de-identified facial video $v_{de}$ is fed into the rPPG model $G$, pre-trained in the first (unsupervised) training stage, to obtain the rPPG signal $s_{rppg}$. To segment the signal, the systolic peaks are located and $s_{rppg}\in\mathbb{R}^{T}$ is divided into $K$ clips. Because of heart rate variability, the $K$ clips may have different lengths, so each clip is interpolated to a length of 90 to obtain rPPG periodic segments. The length of 90 is chosen because, for a signal sampled at 60 Hz, the minimum heart rate considered (40 beats per minute) produces the longest periodic segment: $60 \times 60/40 = 90$ samples. Consequently, we obtain $C_{rppg}\in\mathbb{R}^{K\times 90}$.
To predict an authentication score for an individual, we apply the PPG-Morph model $H$ followed by the rPPG classification head $h_{rppg}$, which produce the rPPG morphology representation $f_{rppg}\in\mathbb{R}^{64}$ and the ID probabilities $y_{rppg}\in[0,1]^{K\times N_{rppg}}$, where $N_{rppg}$ is the number of individuals in the rPPG biometric dataset. The cross-entropy loss is used for ID classification:

$$L_{rppg\text{-}id}(y_{rppg}, i_{rppg}) = -\frac{1}{K}\sum_{k=1}^{K}\log\left(y_{rppg}^{k,i_{rppg}}\right) \tag{2}$$

where $y_{rppg}^{k,i_{rppg}}$ is the predicted probability that the $k$th periodic segment belongs to the ID label $i_{rppg}\in\{1,2,\dots,N_{rppg}\}$.
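The segmentation and the per-segment loss of Eq. (2) can be sketched in NumPy as follows. This is an illustrative reading of the text, not the authors' implementation: all function names are hypothetical; clips are assumed to run from one systolic peak to the next; peaks are found with a simple local-maximum search whose minimum spacing comes from an assumed 180 bpm upper bound; clips are resampled by linear interpolation (`np.interp`); and IDs are 0-based in code, whereas the text uses 1-based IDs.

```python
import numpy as np

FS = 60                    # sampling rate (Hz) used in the text
SEG_LEN = FS * 60 // 40    # 40 bpm -> 1.5 s per beat -> 90 samples

def find_systolic_peaks(s, fs=FS, max_hr_bpm=180):
    """Local maxima kept at least one shortest beat apart (assumed 180 bpm cap)."""
    min_dist = int(fs * 60 / max_hr_bpm)
    cand = np.where((s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]))[0] + 1
    kept = []
    for idx in cand[np.argsort(s[cand])[::-1]]:  # highest peaks first
        if all(abs(idx - k) >= min_dist for k in kept):
            kept.append(idx)
    return np.sort(np.array(kept, dtype=int))

def periodic_segments(s, fs=FS, seg_len=SEG_LEN):
    """Cut s into peak-to-peak clips and resample each to seg_len samples,
    giving an array of shape (K, seg_len)."""
    peaks = find_systolic_peaks(s, fs)
    segs = []
    for a, b in zip(peaks[:-1], peaks[1:]):
        clip = s[a:b]
        x_old = np.linspace(0.0, 1.0, len(clip))
        x_new = np.linspace(0.0, 1.0, seg_len)
        segs.append(np.interp(x_new, x_old, clip))
    return np.stack(segs) if segs else np.empty((0, seg_len))

def segment_id_loss(probs, label):
    """Eq. (2): mean negative log-probability of the true ID over K segments.
    probs has shape (K, N_ids); label is a 0-based ID index."""
    K = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(K), label] + 1e-12))
```

For a 10 s, 72 bpm sinusoid sampled at 60 Hz, `periodic_segments` finds 12 peaks and returns 11 segments of 90 samples each; a classifier that assigns uniform probability over 4 IDs gives a loss of $\log 4$. The same functions apply unchanged to the cPPG branch and Eq. (3).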

3.3.2 cPPG Branch

The cPPG branch uses external cPPG biometric datasets, including Biosec2 [12], BIDMC [33, 9], and PRRB [14], to learn PPG morphology. These external datasets are publicly available and unrelated to the facial videos in the rPPG branch. As in the rPPG branch, the cPPG signal $s_{cppg}\in\mathbb{R}^{T}$ is processed to obtain $K'$ cPPG periodic segments $C_{cppg}\in\mathbb{R}^{K'\times 90}$. The PPG-Morph model $H$ and the cPPG classification head $h_{cppg}$ generate the cPPG morphology representation $f_{cppg}\in\mathbb{R}^{64}$ and the ID probability prediction $y_{cppg}\in[0,1]^{K'\times N_{cppg}}$, where $N_{cppg}$ is the number of individuals in the external cPPG biometric datasets.
Because the PPG-Morph model $H$ is shared by the rPPG and cPPG branches, the cPPG branch can transfer the morphology information it learns to the rPPG branch. The cross-entropy loss is also used in this branch:

$$L_{cppg\text{-}id}(y_{cppg}, i_{cppg}) = -\frac{1}{K'}\sum_{k=1}^{K'}\log\left(y_{cppg}^{k,i_{cppg}}\right) \tag{3}$$

where $y_{cppg}^{k,i_{cppg}}$ is the predicted probability that the $k$th periodic segment belongs to the ID label $i_{cppg}\in\{1,2,\dots,N_{cppg}\}$.

3.3.3 Alternate Backpropagation

We alternately train the two branches and backpropagate the gradient of the two loss functions Lrppgidsubscript𝐿𝑟𝑝𝑝𝑔𝑖𝑑L_{rppg-id}italic_L start_POSTSUBSCRIPT italic_r italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT and Lcppgidsubscript𝐿𝑐𝑝𝑝𝑔𝑖𝑑L_{cppg-id}italic_L start_POSTSUBSCRIPT italic_c italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT to achieve rPPG-cPPG hybrid training. During the first step, de-identified facial videos and ID labels are sampled from the rPPG biometric dataset to calculate the loss Lrppgidsubscript𝐿𝑟𝑝𝑝𝑔𝑖𝑑L_{rppg-id}italic_L start_POSTSUBSCRIPT italic_r italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT, and the gradient of Lrppgidsubscript𝐿𝑟𝑝𝑝𝑔𝑖𝑑L_{rppg-id}italic_L start_POSTSUBSCRIPT italic_r italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT is backpropagated to update the rPPG model G𝐺Gitalic_G, the PPG-Morph model H𝐻Hitalic_H, and the rPPG classification head hrppgsubscript𝑟𝑝𝑝𝑔h_{rppg}italic_h start_POSTSUBSCRIPT italic_r italic_p italic_p italic_g end_POSTSUBSCRIPT. During the second step, cPPG signals and ID labels are sampled from external cPPG biometric datasets to calculate the loss Lcppgidsubscript𝐿𝑐𝑝𝑝𝑔𝑖𝑑L_{cppg-id}italic_L start_POSTSUBSCRIPT italic_c italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT, and the gradient of Lcppgidsubscript𝐿𝑐𝑝𝑝𝑔𝑖𝑑L_{cppg-id}italic_L start_POSTSUBSCRIPT italic_c italic_p italic_p italic_g - italic_i italic_d end_POSTSUBSCRIPT is backpropagated to update PPG-Morph model H𝐻Hitalic_H and the cPPG classification head hcppgsubscript𝑐𝑝𝑝𝑔h_{cppg}italic_h start_POSTSUBSCRIPT italic_c italic_p italic_p italic_g end_POSTSUBSCRIPT. These two steps are repeated in an alternating manner, allowing the two branches to be trained in turns. The cPPG branch uses external cPPG datasets to encourage the PPG-Morph model H𝐻Hitalic_H to learn morphology information. 
Because the PPG-Morph model $H$ is shared by the cPPG and rPPG branches, the morphology features learned in the cPPG branch carry over to the rPPG branch, thereby enhancing the rPPG features. The supplementary materials provide a detailed description of the algorithm.
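The parameter routing of the alternating schedule can be illustrated with a toy sketch. It replaces $G$, $H$, $h_{rppg}$, and $h_{cppg}$ with single scalar weights and the identification losses with squared errors so that gradients can be written by hand; these stand-ins, and all names below, are assumptions for illustration and not the paper's networks. What it shows is that the rPPG step updates $G$, $H$, and $h_{rppg}$, while the cPPG step updates only $H$ and $h_{cppg}$, so $H$ is the only parameter touched by both losses.

```python
from dataclasses import dataclass


@dataclass
class ToyParams:
    # Scalar stand-ins for the real networks (hypothetical, illustration only).
    G: float = 0.5       # rPPG model
    H: float = 0.5       # shared PPG-Morph model
    h_rppg: float = 0.5  # rPPG classification head
    h_cppg: float = 0.5  # cPPG classification head


def rppg_step(p, video_feat, target, lr=0.1):
    """Step 1: a de-identified video passes through G -> H -> h_rppg."""
    err = p.h_rppg * p.H * p.G * video_feat - target  # toy loss = err ** 2
    # Compute all gradients before applying any update.
    g_G = 2 * err * p.h_rppg * p.H * video_feat
    g_H = 2 * err * p.h_rppg * p.G * video_feat
    g_head = 2 * err * p.H * p.G * video_feat
    p.G -= lr * g_G
    p.H -= lr * g_H
    p.h_rppg -= lr * g_head
    return err ** 2


def cppg_step(p, cppg_feat, target, lr=0.1):
    """Step 2: an external cPPG signal bypasses G and goes through H -> h_cppg."""
    err = p.h_cppg * p.H * cppg_feat - target  # toy loss = err ** 2
    g_H = 2 * err * p.h_cppg * cppg_feat
    g_head = 2 * err * p.H * cppg_feat
    p.H -= lr * g_H
    p.h_cppg -= lr * g_head
    return err ** 2


def hybrid_training(p, rppg_batches, cppg_batches, lr=0.1):
    """Alternate the two steps: one rPPG batch, then one cPPG batch."""
    losses = []
    for (v, y_r), (c, y_c) in zip(rppg_batches, cppg_batches):
        losses.append((rppg_step(p, v, y_r, lr), cppg_step(p, c, y_c, lr)))
    return losses
```

Because gradients of the cPPG loss never reach `G`, any morphology knowledge from the external cPPG data can only reach the rPPG branch through the shared `H`, which mirrors the argument in the text.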

4 Experiments

Signal length (EER↓/AUC↑) | OBF intra-session | OBF cross-session | UBFC-rPPG intra-session | PURE intra-session | PURE cross-session
20 heartbeats (~20 s) | 0.17%/99.97% | 2.16%/98.10% | 0%/100% | 0%/100% | 9.59%/93.70%
10 heartbeats (~10 s) | 0.14%/99.98% | 2.61%/98.04% | 0%/100% | 0.33%/99.67% | 14.00%/91.17%
5 heartbeats (~5 s) | 0.33%/99.97% | 3.81%/97.89% | 0.01%/99.99% | 0.58%/99.36% | 18.32%/86.81%
Table 1: EER and AUC for rPPG authentication on OBF, UBFC-rPPG, and PURE datasets.

4.1 Implementation Details

Datasets. We considered three public rPPG datasets, namely OBF [17], PURE [38], and UBFC-rPPG [3]. The scales of these datasets are sufficient to validate the feasibility of rPPG biometrics, since previous cPPG biometric datasets [12, 11] have similar scales. These rPPG datasets consist of facial videos, ground-truth (GT) cPPG signals, and ID labels, but our method does not require the GT cPPG. The OBF dataset [17] consists of data from 100 healthy subjects. Two 5-minute RGB facial videos were recorded for each participant: the first at rest and the second after exercise. During the recording, participants remained seated without head or facial motions. Videos have a resolution of 1920×1080 at 60 frames per second (fps). The UBFC-rPPG dataset [3] was captured using a webcam at a resolution of 640×480 at 30 fps. In each recording, the subject was positioned 1 meter away from the camera while playing a mathematical game, with the face centrally located within the video frame. The database consists of data from 42 participants, each with a 1-minute video. The PURE dataset [38] contains data from 10 subjects. Face videos for each subject were captured in 6 distinct scenarios (steady, talking, slow translation, fast translation, small rotation, and medium rotation), for a total of 60 one-minute RGB videos. Videos have a resolution of 640×480 at 30 fps.

Additionally, we combined the Biosec2 [12], BIDMC [33, 9], and PRRB [14] datasets to create the external cPPG biometric dataset, which contains cPPG signals from 195 subjects and is used for the cPPG branch in the rPPG-cPPG hybrid training. More details about the datasets are provided in the supplementary materials.

Experimental Setup. Our rPPG biometric experiments follow the previous cPPG biometric protocol [12, 11], in which the training and test sets contain the same subjects, recorded either in the same session (intra-session test) or in different sessions (cross-session test). For the OBF dataset, we divide each pre-exercise video into three parts: the first 60% is used for training, the following 20% for validation, and the last 20% for intra-session testing. The post-exercise videos are reserved for cross-session testing. The same division is applied to each video of the UBFC-rPPG dataset; since each subject contributes only one video, only intra-session testing can be conducted on this dataset. For the PURE dataset, the same division is applied to each steady video, and the videos involving head motion tasks are used exclusively for cross-session testing. In the first training stage, we select the rPPG model with the lowest irrelevant power ratio (IPR) on the validation set, as in [8, 39]. In the second training stage, we select the model with the lowest equal error rate (EER) on the validation set. Both training stages are carried out on a single Nvidia V100 GPU using the Adam optimizer with a learning rate of 1e-3. During inference, the predicted probabilities of consecutive periodic segments are averaged over 5, 10, or 20 heartbeats.
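The chronological 60/20/20 split and the inference-time averaging can be sketched as below. The function names, the frame-index representation, integer truncation at the split boundaries, and non-overlapping averaging windows are all assumptions; the paper's own rounding and windowing may differ.

```python
def split_video(num_frames, train_pct=60, val_pct=20):
    """Chronological split of one video's frame indices into train/val/test.

    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    n_train = num_frames * train_pct // 100
    n_val = num_frames * val_pct // 100
    idx = list(range(num_frames))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def average_segment_probs(segment_probs, num_beats):
    """Average per-segment class probabilities over consecutive heartbeats.

    segment_probs: one probability vector per periodic segment (heartbeat).
    Returns one averaged vector per non-overlapping window of num_beats segments;
    a trailing partial window is dropped (an assumption of this sketch).
    """
    windows = []
    for start in range(0, len(segment_probs) - num_beats + 1, num_beats):
        window = segment_probs[start:start + num_beats]
        n_classes = len(window[0])
        windows.append([sum(p[c] for p in window) / num_beats for c in range(n_classes)])
    return windows
```

For a 5-minute OBF video at 60 fps (18,000 frames), `split_video` yields 10,800 training, 3,600 validation, and 3,600 test frames.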

Evaluation Metrics. Since the model performs multi-class classification, we use a one-vs-rest strategy to obtain authentication results for each person, so that each person corresponds to a binary classification problem. For each person, varying the threshold on the model's output for that person yields binary predictions, and plotting the true positive rate against the false positive rate over all thresholds gives the receiver operating characteristic (ROC) curve. The area under the curve (AUC) is the area under this ROC curve. The equal error rate (EER) is the false positive rate (equivalently, the false negative rate) at the threshold where the two rates are equal. The final EER and AUC are averaged across all subjects. To evaluate rPPG morphology, we calculate the Pearson correlation between the mean periodic segments of rPPG and of the GT cPPG. More details are in the supplementary materials.
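The one-vs-rest EER and AUC computation can be sketched in plain Python. This is an illustration, not the paper's evaluation code: sweeping the threshold over all observed scores, the trapezoidal AUC, and the discrete EER estimate (taking the ROC point where FPR is closest to FNR and averaging the two, rather than interpolating) are common choices assumed here.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a threshold from high to low.

    scores: one subject's predicted probability for every test sample.
    labels: 1 if the sample belongs to that subject (genuine), else 0 (impostor).
    Assumes at least one genuine and one impostor sample.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Accept all samples whose score equals the current threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def eer(points):
    """Discrete EER estimate at the ROC point where FPR is closest to FNR = 1 - TPR."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2


def mean_eer_auc(prob_matrix, true_ids):
    """One-vs-rest evaluation averaged over subjects.

    prob_matrix[j][s]: predicted probability that test sample j belongs to subject s.
    true_ids[j]: ground-truth subject index of sample j.
    """
    n_subjects = len(prob_matrix[0])
    eers, aucs = [], []
    for s in range(n_subjects):
        scores = [row[s] for row in prob_matrix]
        labels = [1 if t == s else 0 for t in true_ids]
        pts = roc_points(scores, labels)
        eers.append(eer(pts))
        aucs.append(auc(pts))
    return sum(eers) / n_subjects, sum(aucs) / n_subjects
```

When every genuine score exceeds every impostor score, the sketch returns an EER of 0 and an AUC of 1, matching the 0%/100% entries in the tables.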

4.2 Results and discussions

4.2.1 Results and discussions about rPPG authentication.

Table 1 presents the results of rPPG authentication with varying signal lengths. Performance generally improves with longer signals (e.g., 20 heartbeats) compared with shorter ones (10 or 5 heartbeats); the only exception is OBF intra-session testing, where all three lengths are already near-perfect. On all three datasets, intra-session performance is satisfactory, with EERs below 1% and AUCs above 99%. Performance decreases, however, in cross-session testing. On the OBF dataset, cross-session (pre-exercise → post-exercise) performance is slightly lower than intra-session (pre-exercise → pre-exercise) performance but still reaches an EER of 2.16% with 20 heartbeats. On the PURE dataset, performance drops substantially in cross-session testing (steady → motion tasks) compared with intra-session testing (steady → steady), because the motion tasks degrade the quality of the rPPG signals. By contrast, although the OBF dataset includes exercise to raise heart rates, it does not involve facial movements. This indicates that rPPG biometrics is sensitive to low-quality rPPG signals caused by facial motion, while rPPG still carries reliable, subject-specific biometric information when the same person's heart rate varies. In practical use, users are expected to face the camera and keep still (as in face recognition), so such large deliberate head motions should not be a concern.

The rPPG periodic segments from different subjects (subjects A to I) in Fig. 6 are consistent with these quantitative results. The rPPG periodic segments from the OBF dataset exhibit consistent morphology before and after exercise (Fig. 6(a)). In contrast, the motion tasks in the PURE dataset considerably alter the morphology (Fig. 6(c)), resulting in noisy rPPG signals and a performance drop in cross-session testing. Furthermore, the rPPG periodic segments from all three datasets display distinct morphologies for different subjects, highlighting the discriminative power of rPPG morphology. Fig. 7 shows the subject-specific characteristics of rPPG morphology in detail: the rPPG periodic segments of the two subjects differ in fiducial points [22], such as the systolic peak, diastolic peak, dicrotic notch, and onset/offset, which carry identity information.

Figure 6: rPPG periodic segments from (a) OBF dataset, (b) UBFC-rPPG dataset, and (c) PURE dataset. The red curves are the means of periodic segments.

Figure 7: rPPG periodic segments and fiducial points from two subjects.

Regarding fairness, prior studies [30, 42] have reported a skin-tone bias in rPPG signal quality: dark skin may yield lower-quality rPPG signals, which can affect authentication performance. We assess authentication performance for light and dark skin groups in the OBF dataset with a 20-heartbeat signal length and cross-session testing. For light skin, EER and AUC are 2.52% and 97.79%, respectively; for dark skin, they are 4.04% and 96.74%. Performance for dark skin is slightly lower than for light skin, indicating a skin-tone bias in rPPG biometrics. Addressing this fairness issue may involve collecting more data from dark-skinned people or developing new algorithms, which remains a topic for future research.

Biometric method (EER↓/AUC↑) | OBF intra-session | OBF cross-session | UBFC-rPPG intra-session | PURE intra-session | PURE cross-session
FaceNet [35] ▲ | 32.07%/65.87% | 36.58%/60.84% | 36.15%/61.03% | 31.67%/66.67% | 35.67%/65.11%
Privacy-preserving FR [13] ▲ | 6.46%/91.24% | 6.52%/91.92% | 7.26%/90.25% | 6.88%/91.27% | 7.82%/90.77%
Hwang2021 [11] ◆ | 1.21%/99.30% | 16.72%/84.74% | 6.30%/94.02% | 0%/100% | 4.23%/98.14%
Patil2018 [32] ★ | 14.97%/89.42% | 39.79%/62.14% | 8.53%/88.70% | 4.00%/92.00% | 32.68%/72.11%
Ours w/ rPPG training ★ | 0%/100% | 3.23%/96.92% | -* | 0%/100% | 11.68%/92.61%
Ours w/ rPPG-cPPG hybrid training ★ | 0.17%/99.97% | 2.16%/98.10% | 0%/100% | 0%/100% | 9.59%/93.70%
  • ▲: face recognition (FR), ◆: cPPG biometrics, ★: rPPG biometrics, *: training does not converge.

Table 2: Performance comparison between biometric methods including face recognition, cPPG biometrics, and rPPG biometrics. Note that the de-identified videos proposed in this paper are used as input for both face recognition and rPPG biometrics.

4.2.2 Comparison with other biometrics.

In Table 2, we compare rPPG biometrics with related biometric methods, including face and cPPG biometrics, using a signal length of 20 heartbeats. For face recognition, we choose the widely cited FaceNet [35] to test whether a general face recognition method works on de-identified facial videos. We use FaceNet to extract embeddings from de-identified images and train two fully connected layers on the embeddings to obtain classification results. Table 2 shows that FaceNet [35] fails on de-identified videos (EERs above 31% on all test sets), indicating that the de-identified videos contain no usable facial appearance information. Since our rPPG biometric method preserves facial privacy, we also compare it with the recent privacy-preserving face recognition method [13]; our method achieves a lower EER on all test sets except PURE cross-session (9.59% vs. 7.82%). Our rPPG biometric authentication completely removes facial appearance, whereas the privacy-preserving face recognition method [13] only adds noise to partially remove facial appearance in order to maintain recognition performance, which may still leave a risk of privacy leakage. In addition, we compare our method w/ rPPG-cPPG hybrid training with our method w/ rPPG training (only the rPPG branch in Fig. 5 is used for training, and the cPPG branch is disabled).

On the OBF dataset, ours w/ rPPG-cPPG hybrid training achieves intra-session performance similar to ours w/ rPPG training, but the best cross-session performance among all methods. This suggests that the morphology information introduced by the external cPPG datasets improves generalization, e.g., cross-session performance. Furthermore, our rPPG biometrics outperforms cPPG biometrics [11]. This is likely because rPPG signals are extracted from both spatial and temporal representations, which provide more information than cPPG signals measured from a single spatial point over time. However, this holds only when the rPPG signals are of high quality.

On the UBFC-rPPG dataset, ours w/ rPPG-cPPG hybrid training achieves 100% AUC, whereas the training of ours w/ rPPG training does not converge. A possible reason is that it is difficult for the model to learn rPPG morphology from the small-scale UBFC-rPPG dataset without the help of the external cPPG datasets, which suggests that these datasets help the model learn discriminative rPPG morphology information. Moreover, cPPG biometrics [11] still performs worse than our rPPG biometrics (6.30% vs. 0% EER).

On the PURE dataset, both rPPG and cPPG biometrics perform well in intra-session testing. In cross-session testing, however, our rPPG biometrics is surpassed by cPPG biometrics [11] (9.59% vs. 4.23% EER). This is likely due to significant facial motions in the test videos, which degrade the quality of rPPG signals and morphology, as shown in Fig. 6(c). In contrast, cPPG signals measured from fingertips are less affected by facial motions, allowing for better performance in this scenario.

4.2.3 Results and discussions about rPPG morphology.

We also found that the rPPG-cPPG hybrid training substantially improves rPPG morphology reconstruction. Table 3 shows the Pearson correlations between the mean periodic segments of GT cPPG and rPPG; a higher correlation means the rPPG morphology more closely resembles the corresponding GT cPPG. Since our method does not require any GT cPPG for rPPG morphology reconstruction, we compare it with unsupervised rPPG methods, namely POS [44], ICA [34], and Gideon2021 [8]. Ours w/ rPPG-cPPG hybrid training achieves a substantially higher Pearson correlation (0.87) than the baseline methods, CP2D, and ours w/ rPPG training (at most 0.78), as the external cPPG datasets introduce helpful morphology information via the hybrid training to refine the rPPG morphology. Because such cPPG datasets are publicly available, they add no data collection cost.
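The morphology metric reduces to a Pearson correlation between two averaged waveforms. Heartbeats differ in length, so the sketch below assumes each periodic segment is first resampled to a fixed number of points by linear interpolation; the resampling method, the length of 64, and the function names are assumptions not stated in this section.

```python
import math


def resample(seg, n=64):
    """Linearly interpolate one periodic segment onto n evenly spaced points (n >= 2)."""
    if len(seg) == 1:
        return [seg[0]] * n
    out = []
    for j in range(n):
        pos = j * (len(seg) - 1) / (n - 1)
        i = min(int(pos), len(seg) - 2)
        frac = pos - i
        out.append(seg[i] * (1 - frac) + seg[i + 1] * frac)
    return out


def mean_segment(segments, n=64):
    """Point-wise mean of all periodic segments after resampling to n points."""
    resampled = [resample(s, n) for s in segments]
    return [sum(r[j] for r in resampled) / len(resampled) for j in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def morphology_correlation(rppg_segments, cppg_segments, n=64):
    """Correlation between the mean rPPG segment and the mean GT cPPG segment."""
    return pearson(mean_segment(rppg_segments, n), mean_segment(cppg_segments, n))
```

Since Pearson correlation is invariant to offset and positive scaling, the metric compares waveform shape only, which is why it suits morphology rather than signal amplitude.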

Methods | Pearson correlation↑
POS [44]* | 0.78
ICA [34]* | 0.77
Gideon2021 [8]* | 0.77
CP2D (after the 1st training stage) | 0.78
Ours w/ rPPG training (after the 1st and 2nd training stages) | 0.70
Ours w/ rPPG-cPPG hybrid training (after the 1st and 2nd training stages) | 0.87
Table 3: Pearson correlations between GT cPPG periodic segments and the rPPG periodic segments.

5 Conclusion

In this paper, we validated the feasibility of rPPG biometrics from facial videos. We proposed a two-stage training scheme with a novel rPPG-cPPG hybrid training that uses external cPPG biometric datasets to improve rPPG biometric authentication. Our method achieves good performance on both rPPG biometric authentication and rPPG morphology reconstruction. In addition, because our method uses de-identified facial videos for authentication, it protects sensitive facial appearance information. Future work will focus on collecting a large-scale rPPG biometric dataset and studying influencing factors such as temporal stability, lighting, and recording devices.

Acknowledgments

This work was supported by the Research Council of Finland (former Academy of Finland) Academy Professor project EmotionAI (grants 336116, 345122), ICT 2023 project TrustFace (grant 345948), the University of Oulu & Research Council of Finland Profi 7 (grant 352788), and by Infotech Oulu. The work was also supported by the Spearhead project ’Gaze on Lips’ funded by the Eudaimonia Institute of the University of Oulu, Finland. The authors also acknowledge CSC-IT Center for Science, Finland, for providing computational resources.

References

  • [1] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE, 2018.
  • [2] D. Biswas, L. Everson, M. Liu, M. Panwar, B.-E. Verhoef, S. Patki, C. H. Kim, A. Acharyya, C. Van Hoof, M. Konijnenburg, et al. CorNet: Deep learning framework for PPG-based heart rate estimation and biometric identification in ambulant environment. IEEE Transactions on Biomedical Circuits and Systems, 2019.
  • [3] S. Bobbia, R. Macwan, Y. Benezeth, A. Mansouri, and J. Dubois. Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recognition Letters, 124:82–90, 2019.
  • [4] W. Chen and D. McDuff. DeepPhys: Video-based physiological measurement using convolutional attention networks. In ECCV, pages 349–365, 2018.
  • [5] J. Daugman. How iris recognition works. In The essential guide to image processing, pages 715–739. Elsevier, 2009.
  • [6] G. De Haan and V. Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering, 60(10):2878–2886, 2013.
  • [7] J. Du, S.-Q. Liu, B. Zhang, and P. C. Yuen. Dual-bridging with adversarial noise generation for domain adaptive rPPG estimation. In CVPR, 2023.
  • [8] J. Gideon and S. Stent. The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video. In ICCV, pages 3995–4004, 2021.
  • [9] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 2000.
  • [10] Y. Gu, Y. Zhang, and Y. Zhang. A novel biometric approach in human verification by photoplethysmographic signals. In 4th International IEEE EMBS Special Topic Conference on Information Technology Applications in Biomedicine, 2003. IEEE, 2003.
  • [11] D. Y. Hwang, B. Taha, and D. Hatzinakos. Variation-stable fusion for PPG-based biometric system. In ICASSP. IEEE, 2021.
  • [12] D. Y. Hwang, B. Taha, D. S. Lee, and D. Hatzinakos. Evaluation of the time stability and uniqueness in PPG-based biometric system. IEEE Transactions on Information Forensics and Security, 2020.
  • [13] J. Ji, H. Wang, Y. Huang, J. Wu, X. Xu, S. Ding, S. Zhang, L. Cao, and R. Ji. Privacy-preserving face recognition with learnable privacy budgets in frequency domain. In ECCV, pages 475–491. Springer, 2022.
  • [14] W. Karlen, S. Raman, J. M. Ansermino, and G. A. Dumont. Multiparameter respiratory rate estimation from the photoplethysmogram. IEEE Transactions on Biomedical Engineering, 2013.
  • [15] E. Lee, A. Ho, Y.-T. Wang, C.-H. Huang, and C.-Y. Lee. Cross-domain adaptation for biometric identification using photoplethysmogram. In ICASSP. IEEE, 2020.
  • [16] J. Li, Z. Yu, and J. Shi. Learning motion-robust remote photoplethysmography through arbitrary resolution videos. In AAAI, 2023.
  • [17] X. Li, I. Alikhani, J. Shi, T. Seppanen, J. Junttila, K. Majamaa-Voltti, M. Tulppo, and G. Zhao. The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 242–249. IEEE, 2018.
  • [18] X. Li, J. Chen, G. Zhao, and M. Pietikainen. Remote heart rate measurement from face videos under realistic situations. In CVPR, pages 4264–4271, 2014.
  • [19] S.-Q. Liu, X. Lan, and P. C. Yuen. Learning temporal similarity of remote photoplethysmography for fast 3d mask face presentation attack detection. IEEE Transactions on Information Forensics and Security, 2022.
  • [20] X. Liu, J. Fromm, S. Patel, and D. McDuff. Multi-task temporal shift attention networks for on-device contactless vitals measurement. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, NeurIPS, volume 33, pages 19400–19411, 2020.
  • [21] Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. In CVPR, 2018.
  • [22] G. Lovisotto, H. Turner, S. Eberz, and I. Martinovic. Seeing red: PPG biometrics using smartphone cameras. In CVPRW, 2020.
  • [23] H. Lu, H. Han, and S. K. Zhou. Dual-GAN: Joint BVP and noise modeling for remote physiological measurement. In CVPR, pages 12404–12413, 2021.
  • [24] H. Lu, Z. Yu, X. Niu, and Y.-C. Chen. Neuron structure modeling for generalizable remote physiological measurement. In CVPR, pages 18589–18599, 2023.
  • [25] J. Luque, G. Cortes, C. Segura, A. Maravilla, J. Esteban, and J. Fabregat. End-to-end photoplethysmography (PPG) based biometric authentication by using convolutional neural networks. In 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018.
  • [26] D. McDuff, S. Gontarek, and R. W. Picard. Remote detection of photoplethysmographic systolic and diastolic peaks using a digital camera. IEEE Transactions on Biomedical Engineering, 2014.
  • [27] X. Niu, H. Han, S. Shan, and X. Chen. SynRhythm: Learning a deep heart rate estimator from general to specific. In ICPR, pages 3580–3585. IEEE, 2018.
  • [28] X. Niu, S. Shan, H. Han, and X. Chen. RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation. IEEE Transactions on Image Processing, 29:2409–2423, 2019.
  • [29] X. Niu, Z. Yu, H. Han, X. Li, S. Shan, and G. Zhao. Video-based remote physiological measurement via cross-verified feature disentangling. In ECCV, pages 295–310. Springer, 2020.
  • [30] E. M. Nowara, D. McDuff, and A. Veeraraghavan. A meta-analysis of the impact of skin tone and gender on non-contact photoplethysmography measurements. In CVPRW, 2020.
  • [31] E. M. Nowara, D. McDuff, and A. Veeraraghavan. The benefit of distraction: Denoising camera-based physiological measurements using inverse attention. In ICCV, pages 4955–4964, 2021.
  • [32] O. R. Patil, W. Wang, Y. Gao, W. Xu, and Z. Jin. A non-contact PPG biometric system based on deep neural network. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2018.
  • [33] M. A. Pimentel, A. E. Johnson, P. H. Charlton, D. Birrenkott, P. J. Watkinson, L. Tarassenko, and D. A. Clifton. Toward a robust estimation of respiratory rate from pulse oximeters. IEEE Transactions on Biomedical Engineering, 2016.
  • [34] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Transactions on Biomedical Engineering, 58(1):7–11, 2010.
  • [35] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
  • [36] J. Speth, N. Vance, P. Flynn, and A. Czajka. Non-contrastive unsupervised learning of physiological signals from video. In CVPR, 2023.
  • [37] R. Špetlík, V. Franc, and J. Matas. Visual heart rate estimation with convolutional neural network. In BMVC, pages 3–6, 2018.
  • [38] R. Stricker, S. Müller, and H.-M. Gross. Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 1056–1062. IEEE, 2014.
  • [39] Z. Sun and X. Li. Contrast-Phys: Unsupervised video-based remote physiological measurement via spatiotemporal contrast. In ECCV, pages 492–510. Springer, 2022.
  • [40] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In CVPR, pages 2396–2404, 2016.
  • [41] W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Optics Express, 16(26):21434–21445, 2008.
  • [42] A. Vilesov, P. Chari, A. Armouti, A. B. Harish, K. Kulkarni, A. Deoghare, L. Jalilian, and A. Kadambi. Blending camera and 77 GHz radar sensing for equitable, robust plethysmography. ACM Trans. Graph. (SIGGRAPH), 2022.
  • [43] H. Wang, E. Ahn, and J. Kim. Self-supervised representation learning framework for remote physiological measurement using spatiotemporal augmentation loss. In AAAI, 2022.
  • [44] W. Wang, A. C. den Brinker, S. Stuijk, and G. De Haan. Algorithmic principles of remote PPG. IEEE Transactions on Biomedical Engineering, 64(7):1479–1491, 2016.
  • [45] W. Wang, S. Stuijk, and G. De Haan. Exploiting spatial redundancy of image sensor for motion robust rPPG. IEEE Transactions on Biomedical Engineering, 62(2):415–425, 2014.
  • [46] R. P. Wildes. Iris recognition: an emerging biometric technology. Proceedings of the IEEE, 85(9):1348–1363, 1997.
  • [47] Y. Yang, X. Liu, J. Wu, S. Borac, D. Katabi, M.-Z. Poh, and D. McDuff. SimPer: Simple self-supervised learning of periodic targets. In ICLR, 2022.
  • [48] J. Yao, X. Sun, and Y. Wan. A pilot study on using derivatives of photoplethysmographic signals as a biometric identifier. In 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2007.
  • [49] Z. Yu, X. Li, and G. Zhao. Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks. In BMVC, page 277. BMVA Press, 2019.
  • [50] Z. Yu, W. Peng, X. Li, X. Hong, and G. Zhao. Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement. In ICCV, pages 151–160, 2019.
  • [51] Z. Yu, Y. Shen, J. Shi, H. Zhao, Y. Cui, J. Zhang, P. Torr, and G. Zhao. PhysFormer++: Facial video-based physiological measurement with slowfast temporal difference transformer. International Journal of Computer Vision, 2023.
  • [52] Z. Yu, Y. Shen, J. Shi, H. Zhao, P. H. Torr, and G. Zhao. PhysFormer: Facial video-based physiological measurement with temporal difference transformer. In CVPR, pages 4186–4196, 2022.
  • [53] Z. Yue, M. Shi, and S. Ding. Facial video-based remote physiological measurement via self-supervised learning. TPAMI, 2023.