In December 2019, the National Institute of Standards and Technology (NIST), a non-partisan federal agency, published a landmark study presenting further evidence that facial recognition algorithms, across the board, are not ready for prime time. The researchers found that face recognition algorithms perform more poorly when examining the faces of women, people of color, the elderly, and children, raising serious concerns about police use of the technology across the United States and around the world, and underscoring the need to press pause on government use of the tech.
The December NIST report is the third in a series publishing results from the agency's ongoing Face Recognition Vendor Test (FRVT), which is widely regarded as an international standard for facial recognition evaluation. While the prior two tests studied the accuracy of various facial verification (1-to-1 matching) and identification (1-to-many matching) algorithms, respectively, the most recent report examines how such algorithms perform on faces of different demographics.
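To make that distinction concrete, here is a minimal sketch, not drawn from NIST's test harness or any vendor's software, of how 1-to-1 verification and 1-to-many identification differ. The embeddings, the cosine similarity helper, and the threshold value are all hypothetical stand-ins for whatever a real system would use.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two face embeddings (hypothetical representation)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(probe: np.ndarray, reference: np.ndarray, threshold: float = 0.6) -> bool:
    """1-to-1 verification: does the probe face match this single reference face?"""
    return cosine_similarity(probe, reference) >= threshold


def identify(probe: np.ndarray, gallery: dict, threshold: float = 0.6) -> list:
    """1-to-many identification: search a whole gallery and return candidate matches.

    `gallery` maps a person's name to their enrolled embedding. Every gallery
    entry is compared against the probe, so a single bad threshold can produce
    many false matches at once.
    """
    scores = {name: cosine_similarity(probe, emb) for name, emb in gallery.items()}
    candidates = {name: s for name, s in scores.items() if s >= threshold}
    # Strongest candidates first; the list may be empty if nothing clears the threshold.
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
```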
Here are the five primary things you should know.
1. The report is legit, and the researchers were rigorous.
The December report is the first-ever study looking at bias in facial identification – rather than verification – across different demographic groups. The complete report is over 1,200 pages long, reporting test results for 189 algorithms from 99 distinct vendors and detailing the results of 13 different tests: 11 studying verification and 2 studying identification.
Unlike most preceding work, NIST's demographic study takes care to report rates of both false positives – when a pair of photos of different individuals is wrongly identified as a match – and false negatives – when a pair of photos of the same individual is wrongly identified as a non-match. Additionally, a common issue in large-scale facial recognition tests like this is that each algorithm requires the user to specify a threshold for identifying matches, but such thresholds are far from consistent across vendors. NIST's researchers get around this issue by measuring relative, rather than absolute, performance: for each algorithm, they determine the threshold that results in a benchmark success rate for white men (e.g., a false match rate of 0.0001), and then evaluate performance for other demographics relative to that individualized calibration point, as sketched below.
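For readers who want the mechanics, here is a minimal sketch, not NIST's actual code or methodology, of that relative calibration idea: find the threshold that yields a benchmark false match rate on one group's impostor (different-person) comparison scores, then measure how often other groups' impostor scores clear that same threshold. The score arrays, group labels, and the 0.0001 target below are hypothetical inputs.

```python
import numpy as np


def threshold_for_fmr(impostor_scores: np.ndarray, target_fmr: float = 1e-4) -> float:
    """Find the score cutoff at which these impostor scores yield target_fmr.

    The (1 - target_fmr) quantile of impostor (different-person) scores is the
    threshold above which roughly target_fmr of impostor pairs still count as matches.
    """
    return float(np.quantile(impostor_scores, 1.0 - target_fmr))


def fmr_at_threshold(impostor_scores: np.ndarray, threshold: float) -> float:
    """Fraction of impostor (different-person) pairs scored at or above the threshold."""
    return float(np.mean(impostor_scores >= threshold))


# Hypothetical usage: calibrate on one group, then compare another group at
# the same threshold.
#   t = threshold_for_fmr(reference_group_impostor_scores, target_fmr=1e-4)
#   relative_fmr = fmr_at_threshold(other_group_impostor_scores, t) / 1e-4
# A relative_fmr of 10 or 100 corresponds to the "10 to 100 times more false
# matches" disparities described below.
```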
2. The report found widespread disparities across race, sex, and age.
The majority of the algorithms tested perform worse on Black, Asian, and Native American faces, and show bias against women, the elderly, and children. When results are broken down by nationality, faces from West Africa, the Caribbean, East Africa, and East Asia produce more uncertainty and more false matches. Across the board, facial verification and identification scans perform best on middle-aged white men, and worse on everyone else.
3. Those disparities are significant, not minor.
Concerningly large demographic differences arise in false positive rates – or how often a face is wrongly identified as someone else – with the majority of algorithms finding between 10 and 100 times more false matches for Black women than for white men. White women aren't safe either: most algorithms' false match rates for them come in between 2 and 10 times higher than for white men. Disparities are smaller for false negative rates, yet still show bias against non-European faces.
4. Even these dismal results represent a best-case scenario.
Notably, NIST's tests were run in a quality-controlled research setting using standardized photographs, such as police mugshots and visa application portraits. Using these algorithms to identify faces in "wild" photographs taken from surveillance footage, as police departments would, will only worsen demographic disparities, because those photos are often of very low quality.
5. Amazon refused to participate – again.
Amazon's flagship facial recognition software, Rekognition, was not submitted to NIST for testing in Parts I, II, or III of the FRVT. Amazon claims that "Rekognition can't be 'downloaded' for testing outside of AWS, and components cannot be tested in isolation while replicating how customers would use the service in the real world." However, 99 other vendors successfully packaged their proprietary algorithms for NIST testing, including multinational companies Microsoft, Toshiba, and IDEMIA. Amazon's continued non-participation is evidence that the company prioritizes protecting its reputation over independent verification of its software. And it may indeed have reason to worry about that reputation, as existing research into Rekognition's demographic bias suggests the results might not be pretty.
Ultimately, NIST’s FRVT results are both impressive and concerning, demonstrating widespread demographic bias in facial recognition via thousands of individual tests. But this work must be only the beginning. One of the most important tools in the fight against government face surveillance is the publication of rigorous third-party evaluations, just like this one. We need more audits from non-commercial government and nonprofit groups, studying more algorithms, using more kinds of photos. After all, we can’t manage what we don’t measure.
In the meantime, legislatures like ours here in Massachusetts must press pause on the government's use of face recognition technology. This tech is dangerous even when it works, and it's absolutely inexcusable for government agencies to use technology that is inherently biased against the majority of the population.
This blog post was written by ACLU of Massachusetts Technology Fellow Lauren Chambers.