Addressing Gaps in the Facial Biometric Liveness Detection Testing Landscape

Serve Legal have successfully developed a testing framework that incorporates a carefully and ethically curated dataset for testing the accuracy and fairness of facial biometric age estimation technologies. The leading age estimation technologies go beyond just estimating the age of the subject being presented and include a liveness detection capability for added security.

Serve Legal are consulting with those age assurance and digital ID providers to develop the most comprehensive liveness detection test available. This article will explore the current testing landscape and some of the gaps that must be addressed to properly support the age assurance and digital ID industries by enabling potential deployers of those technologies to make informed comparisons.

Liveness detection is a key component of biometric age assurance and digital ID technology.

Its role is to verify that what is being presented to the biometric system is a live human and not a photo, prosthetic, digitally injected video or other artefact. This verification exists to prevent abuses such as presenting somebody else's digital ID or injecting a deepfake video in order to trick biometric analysis.

Liveness detection is being used in a rapidly growing number of applications such as banking, government services and retail to name just a few.

Liveness detection techniques can be broadly categorised as passive or active. Passive liveness detection means the user does not need to take any specific actions to interact with the interface performing the detection. Active liveness detection means the user must take actions such as blinking, turning their head, or adjusting the distance of the camera.

Each technique has pros and cons and may be more or less suited to different deployment contexts. For example, passive liveness detection is likely to introduce less friction into the process for users since those users do not need to carry out any interactive instructions. Active liveness detection adds some friction, but providers of such technologies posit that this friction is outweighed by the increase in accuracy. With the independent testing currently available, providers of neither approach can quantify these characteristics in a way that allows solutions to be compared on both performance and user experience.

This presents a challenge for companies such as retailers, bookmakers and banks that are looking to integrate a liveness detection solution into their systems. When adopting such a technology, a company is likely to perform due diligence checks such as:

  1. Is the accuracy sufficient for the deployment context?
  2. Is the user experience good enough to keep customers happy?
  3. Which solution gives the best tradeoff between fraud prevention and user experience?

In the facial biometric space, NIST (National Institute of Standards and Technology), a US Government agency, is the established steward of performance benchmarks across facial recognition and facial analysis algorithms. However, liveness detection falls outside NIST's core testing capability. This is because NIST's facial biometric tests rely on pre-collected facial image datasets gathered from mugshots and border control, to which it has access by virtue of its status as a US Government agency.

Since NIST's tests are based on datasets of static images, they do not involve live test subjects who can interact with a liveness detection system while it is being tested.

Passive liveness detection solutions on the market that assess liveness based on a single 2D image capture can be submitted to NIST's Face Analysis Technology Evaluation (FATE) PAD programme, where PAD stands for Presentation Attack Detection. PAD refers to the ability of a facial analysis tool to detect a spoof, that is, a bad actor trying to pass themselves off as somebody else in order to trick a facial biometric check.

Unfortunately, this muddies the waters when talking about liveness detection solutions because, as NIST make clear in their report, Part 10: Performance of Passive, Software-Based Presentation Attack Detection (PAD) Algorithms:

"In this test, we evaluated passive PAD approaches that operated on pre-collected imagery without any sort of user interaction. PAD approaches that require user interaction are out of scope for FATE PAD."

The semantics in this statement are important. NIST's evaluation of "passive PAD approaches" is not an evaluation of passive liveness approaches. To illustrate, a passive PAD evaluation could involve taking a selfie of somebody wearing an expertly created and expensive prosthetic mask. The 2D image captured from the selfie could subsequently be passed to a biometric system for evaluation. If the PAD algorithm rejects the image, then the attack was correctly identified as a spoof and the test was passed. But a passive liveness test, while not requiring the user to take specific actions, does require a live interaction and not merely a pre-collected 2D image. For example, the system might vary the lighting emitted by the device while the selfie is taken, with the resulting capture being evaluated in real time to assess whether the lighting appears as expected on human skin and eyes, as opposed to latex, silicone or, in the case of video playback, glass.

Therefore, caution is required when assessing the performance of a passive liveness tool based on this PAD evaluation. While an algorithm scoring highly for security in this PAD test can be considered to have strong performance at detecting attacks such as 2D images of subjects using masks, or holding up photos, this does not equate to testing that the user is physically present at the time of presentation and might therefore be considered a weaker test of performance.

A strong score for convenience means the algorithm had a low false detection rate, that is, it rarely flagged genuine presentations as attacks. This is an important metric, but this type of convenience is quite different from the user experience one might wish to assess when comparing active and passive liveness detection solutions.
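To make the distinction between security and convenience concrete, the sketch below computes the two error rates usually reported in PAD testing: APCER (Attack Presentation Classification Error Rate, the share of attacks wrongly accepted) and BPCER (Bona Fide Presentation Classification Error Rate, the share of genuine users wrongly rejected). It is a simplified illustration, not NIST's or any test lab's implementation. The function name, the score convention and the sample scores are assumptions made for this example, and all attack types are pooled together, whereas a full evaluation would normally report results per attack type.

```python
from typing import Sequence, Tuple


def error_rates(
    attack_scores: Sequence[float],
    bona_fide_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (APCER, BPCER) for one decision threshold.

    Assumption: a higher score means "more likely an attack", and a
    presentation is classified as an attack when score >= threshold.
    """
    if not attack_scores or not bona_fide_scores:
        raise ValueError("both score lists must be non-empty")
    # APCER: share of attack presentations wrongly accepted as genuine.
    missed_attacks = sum(1 for s in attack_scores if s < threshold)
    # BPCER: share of genuine presentations wrongly flagged as attacks.
    false_alarms = sum(1 for s in bona_fide_scores if s >= threshold)
    return (
        missed_attacks / len(attack_scores),
        false_alarms / len(bona_fide_scores),
    )


# Made-up scores for illustration only.
attack_scores = [0.91, 0.85, 0.40, 0.77, 0.95]
bona_fide_scores = [0.05, 0.10, 0.62, 0.20, 0.08, 0.15, 0.30, 0.02, 0.12, 0.25]

apcer, bpcer = error_rates(attack_scores, bona_fide_scores, threshold=0.5)
print(f"APCER={apcer:.2f} (security), BPCER={bpcer:.2f} (convenience)")
```

Moving the threshold trades one error rate against the other, which is why a single headline figure for either security or convenience says little on its own.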

Fortunately, ISO/IEC 30107-3:2023 provides a PAD testing framework that does cover liveness detection. Unfortunately, the tests available against this standard are reported to exhibit weaknesses that leave the age assurance and digital ID industries requiring more robust, independent, comparable tests of liveness detection systems.

Specific concerns from industry are that the tests are too easy to pass, too variable between vendors and do not consider digital injection attacks.

Other deficiencies exist in the testing landscape in terms of ensuring that systems are developed in accordance with ethical best practices. For instance, testing should be performed with sufficient sample sizes with appropriate diversity to detect any significant discrepancy in Bona Fide Presentation Classification Error Rate (BPCER) between demographic groups. Such a fairness metric is essential for deployers of these technologies, whether they are banks, supermarkets or otherwise, in order to have confidence that they will not fall foul of equality legislation by deploying a liveness detection system that exhibits bias in liveness detection for certain demographic groups.
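As a rough illustration of why sample size matters for this kind of fairness check, the sketch below computes BPCER per demographic group together with a 95% Wilson score interval. The group labels and counts are invented for the example, and the helper names are hypothetical. This is one common way of putting an uncertainty range around an error rate, not a description of any particular test lab's method.

```python
import math
from typing import Dict, Tuple


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def bpcer_by_group(
    results: Dict[str, Tuple[int, int]],
) -> Dict[str, Tuple[float, float, float]]:
    """Map group -> (BPCER, lower bound, upper bound).

    `results` maps each group to (false rejections, bona fide attempts).
    """
    return {
        group: (errors / trials, *wilson_interval(errors, trials))
        for group, (errors, trials) in results.items()
    }


# Invented counts for illustration only: (false rejections, genuine attempts).
sample = {"group_a": (4, 400), "group_b": (12, 400)}

for group, (rate, low, high) in bpcer_by_group(sample).items():
    print(f"{group}: BPCER={rate:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

In this invented example one group is rejected three times as often as the other, yet with 400 attempts per group the two intervals still overlap, so the sample would arguably be too small to say with confidence whether the difference is real. Overlapping intervals are only an informal guide, and a formal comparison would need a proper statistical test.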

Further, deployers of these technologies lack independently verified metrics that allow them to make risk-based assessments about which system will deliver the optimal balance between accuracy and user experience, given the context in which the system will be deployed.

Lastly, there is no metric to facilitate comparison of the running costs of the different solutions. For example, it is possible, and perhaps even likely, that some approaches will require significantly more compute resources than others to perform a liveness check. If in aggregate those costs are substantial, they might also influence the risk-based decision making of companies that need to integrate liveness detection.
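A back-of-the-envelope sketch shows how per-check compute differences could add up at scale. Every figure here is hypothetical (check volume, compute time per check and hourly price), as is the function name. No measured values for real liveness detection products are implied.

```python
def monthly_compute_cost(
    checks_per_month: int,
    compute_seconds_per_check: float,
    cost_per_compute_hour: float,
) -> float:
    """Estimate monthly compute spend for liveness checks.

    All inputs are assumptions supplied by the caller; the currency unit is
    whatever `cost_per_compute_hour` is expressed in.
    """
    compute_hours = checks_per_month * compute_seconds_per_check / 3600
    return compute_hours * cost_per_compute_hour


# Hypothetical comparison of a lighter and a heavier solution.
lighter = monthly_compute_cost(1_000_000, 0.2, 0.90)
heavier = monthly_compute_cost(1_000_000, 1.5, 0.90)

print(f"lighter solution: {lighter:.2f} per month")
print(f"heavier solution: {heavier:.2f} per month")
print(f"ratio: {heavier / lighter:.1f}x")
```

An independently measured compute time per check is the input that is missing today. With it, a deployer could plug in their own volumes and prices to compare solutions on cost as well as accuracy.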

If you are a liveness detection provider please get in touch with our facial biometrics team who will be delighted to explore your challenges in relation to independently verifying your tool's performance, fairness, efficiency and more.