Face recognition (FR) as a technology has been the topic of much debate among both policymakers and AI practitioners recently. And justifiably so. It is one of the most potent technologies for identifying and tracking people.
Here, we present a conversation, in the form of questions and answers, between a policy analyst and a technologist. We hope that this will help illuminate a few key points for the reader interested in the cost-benefit analysis of this technology. More importantly, we hope that it conveys the message that the devil is in the details and unless the debate involves an interrogation of those details, there’s little hope for progress on the policy and decision-making front.
Do you need a refresher on what face recognition is? Read this, or quickly review the figure below:
Q How accurate is FR? We’ve heard differing accounts of the state of the technology: some experts say it’s nowhere near ready for field deployment, yet we hear stories both internationally and locally about FR being used in a variety of ways. We need to understand how accurate FR tech is, as well as its anticipated rate of development. I assume that the answer to this question depends on the particular circumstances (e.g. stationary individuals at an airport vs. live CCTV feeds). Can you give us a sense of how close FR tech is to being usable in the field, such as on body-worn cameras?
A Short answer: We are getting closer every year, but we need at least a two-order-of-magnitude reduction in error rates to make the technology feasible for real-time applications with uncooperative subjects, such as body-worn cameras.
I take FR to mean its most common definition: one-to-many identification of faces in two-dimensional camera images. There are two main algorithms at the core of FR. The accuracy of both matters for an end-to-end FR application:
- Face Detection (FD): given an image, find where the faces are, if any
- Face Matching (FM): given a database of faces (target database) and a detected face from an image (step 1), determine whether the individual in the detected face matches one of the individuals in the database
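To make the two stages concrete, here is a minimal sketch of such a pipeline in Python. The `detect_faces` and `embed_face` functions are hypothetical stand-ins for trained models (a face detector and a face-embedding network), and the matching stage shown is a simple nearest-neighbor search over cosine similarity, which is one common design among several:

```python
import numpy as np

def cosine_similarity(a, b):
    # Similarity between two face-embedding vectors (1.0 = identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(image, database, detect_faces, embed_face, threshold=0.6):
    """Two-stage FR: detect faces, then match each one against a database.

    database:     dict mapping identity name -> reference embedding vector.
    detect_faces: stand-in model, returns (bounding_box, face_crop) pairs.
    embed_face:   stand-in model, maps a face crop to an embedding vector.
    Returns a list of (bounding_box, best_identity_or_None).
    """
    results = []
    for box, crop in detect_faces(image):        # stage 1: Face Detection
        query = embed_face(crop)
        best_id, best_sim = None, threshold      # require sim above threshold
        for identity, ref in database.items():   # stage 2: Face Matching
            sim = cosine_similarity(query, ref)
            if sim > best_sim:
                best_id, best_sim = identity, sim
        results.append((box, best_id))           # None = no match found
    return results
```

Note that the overall error compounds: a face missed in stage 1 can never be matched in stage 2, which is why the accuracy of both algorithms matters for the end-to-end application.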
The accuracy of each of these steps depends on the particular circumstances (as you mentioned in the question: the quality of the image sensor, the distance of the face from the camera, motion blur, the angle of the face, occlusion, the compute power of the FR system, etc.). Unfortunately, there is no way to simplify that further. Depending on the circumstances, the error rates could be as high as 20% or as low as 0.1%.
There is a vast gulf between the accuracy reported in academic publications and marketing materials and the performance achievable in a real-world application. Any results claiming greater than 90% accuracy should not be trusted as indicative of how the algorithms will perform outside of controlled environments and narrow applications.
To provide a tangible example of this difference between published metrics and reality, let us consider the (real) anecdote of an AI researcher who has deployed Face Detection in a product that has to perform in the wild. It is interesting to see how each constraint of the real-world application lowers the accuracy:
- Due to speed requirements, they could not use the best models with a reported accuracy of >99.5% on the benchmark datasets
- They ended up using a learning model which satisfied their runtime requirements on the CPU and was reported to have >98.9% accuracy on public datasets
- After training on all publicly available datasets that they could use, as well as internal data, the model achieved only 80% precision and 70% recall (notice that they didn’t report an “accuracy” number; that’s because precision/recall translated better to the specific problem their customers cared about)
- They then tried a third-party face detection algorithm and got 71% precision (9% lower than their in-house model) and 36% recall (33% lower). This was despite the fact that the third-party algorithm had very high reported accuracy numbers, including being a front-runner in the latest Face Recognition Vendor Test (FRVT) published by the National Institute of Standards and Technology (NIST).
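The precision and recall figures in the anecdote can be computed directly from detection counts. As an illustration (the counts below are hypothetical, chosen only to reproduce the 80%/70% figures from the anecdote), a short sketch in Python:

```python
def precision_recall(tp, fp, fn):
    # precision: of the faces the detector reported, how many were real faces?
    # recall:    of the real faces present, how many did the detector find?
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts from an in-the-wild evaluation:
# 700 correct detections, 175 false detections, 300 missed faces.
p, r = precision_recall(tp=700, fp=175, fn=300)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.70
```

Unlike a single “accuracy” number, this pair separates the two failure modes that matter in deployment: flagging non-faces and missing real faces.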
Q There are some who have been pushing the National Institute of Standards and Technology’s Face Recognition Vendor Tests (FRVT) as an objective way to measure the accuracy of FR algorithms. But we’ve heard from some companies that it’s a poor assessment not suited to more advanced technologies. Do you have thoughts about the FRVT? Or about other ways that one might compare relative accuracy rates of different FR algorithms?
A Short answer: FRVT is interesting and informative, but absolutely should not be interpreted as an objective way to measure the accuracy of FR algorithms in the real world.
I completely agree with the two main assertions of the FRVT report:
- There has been an “industrial revolution” in AI, and specifically in FR, since 2013
- The revolution is not over yet.
However, unfortunately, the major useful takeaways stop there. The reported numbers seem promising at first glance and might be realistic for some limited applications. Like everything in life, the devil is in the details:
- [Major problem] The test datasets are heavily sourced from publicly available datasets and from scraping the internet. The competing algorithms are from giant tech companies who have been scraping the same sources of data for decades to train their models. This violates the principle of blind testing (called “disjoint testing” in Machine Learning terminology); if one trains and tests algorithms on the same sets of data, they often end up memorizing the seen data. The entire point of machine learning is generalization to unseen data.
- Another major difference between real applications and the FRVT datasets is the imbalance between match and non-match examples. The assumed ratio of matches to non-matches has significant implications for the accuracy numbers. This seems to be mostly ignored, and a match rate of 0.5 is assumed, while a match rate of 0.001 is not unusual in real applications.
- Only one of the multiple scenarios covered in the study is even remotely relevant to FR on body-worn cameras, and the accuracy numbers for it are sobering. Non-cooperative subjects (page 129 of the PDF report): the best model (microsoft_4) gives a 4% false-positive rate for N≈690k. If you had to reduce the FPIR to 0.1% (one in a thousand), you would miss 10% of the matches (FNIR = 0.10).
- The report doesn’t mention memory requirements for running the winning FR algorithms. That makes it hard to know whether it’s feasible to run these on edge devices. The total time (including template generation and search) is around 1 second on a 2.2GHz Intel CPU for the leading competitors.
- The 1:N report (most relevant to FR on body-cam scenario) does not include a detailed breakdown of the accuracy metrics for different demographic factors, such as race, age, and gender. Therefore, we can’t know how they perform on minority groups.
- The 1:1 report does have a detailed breakdown of the accuracy metrics based on country of birth (if we can take that as a proxy for race) and age. The results are very interesting, but surprisingly don’t show a clear bias in favor of male or white populations.
- Counter-intuitive observation from page 125, Fig. 93, of the 1:1 report: Accuracy is consistently higher for Asian countries (China, Vietnam, Philippines, India) than for countries with (at least some) European roots (Venezuela, Ecuador, Ukraine, Russia).
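The impact of the assumed match rate (second bullet above) can be quantified with a little arithmetic. The sketch below uses the operating point quoted above (FPIR = 0.001, FNIR = 0.10) and shows how the fraction of returned candidates that are true matches collapses when matches are rare:

```python
def search_precision(match_rate, fpir, fnir):
    """Fraction of returned candidates that are true matches, given the
    prior probability (match_rate) that a probe has a mate in the database."""
    true_hits = match_rate * (1 - fnir)        # mated probes correctly found
    false_hits = (1 - match_rate) * fpir       # non-mated probes wrongly matched
    return true_hits / (true_hits + false_hits)

# Benchmark-style assumption: half of all probes have a mate.
print(search_precision(match_rate=0.5, fpir=0.001, fnir=0.10))    # ~0.999
# Realistic assumption: one probe in a thousand has a mate.
print(search_precision(match_rate=0.001, fpir=0.001, fnir=0.10))  # ~0.47
```

With the benchmark-style match rate, almost every alert is correct; with the realistic one, roughly every other alert is a false accusation, at the exact same FPIR/FNIR operating point.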
Q Relatedly, when companies say that their product should be set to only provide a match at a “99%” accuracy level, what does that mean?
A Short answer: Such metrics are meant as marketing tools and are meaningless for all practical purposes.
- The definition of accuracy is ambiguous, and each alternative definition will give you vastly different accuracy numbers.
- Speaking of accuracy without the test set and its conditions is meaningless. Recognizing dark-skinned individuals at low-light environments with low-resolution cameras, motion blur, and limited processing power is a completely different problem than recognizing people who pose for their social media profile pictures with high-quality cameras and ideal imaging conditions. Lumping these two together makes the reported metrics meaningless for both scenarios.
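A quick illustration of why a headline “accuracy” number is meaningless without knowing the test set: on a test set with a realistic match rate, a degenerate system that never matches anyone still scores 99.9% accuracy. The numbers below are hypothetical:

```python
# 1,000 probes, only 1 of which has a mate in the database (match rate 0.001).
n_probes, n_matches = 1000, 1

# A "system" that always answers "no match" never identifies anyone,
# yet it is correct on all 999 non-match probes.
correct = n_probes - n_matches
accuracy = correct / n_probes
print(f"accuracy={accuracy:.1%}")  # accuracy=99.9%
```

This is why precision/recall (or FPIR/FNIR) on a clearly specified test distribution are the only numbers worth discussing.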
Q Finally, one consistently hears about demographic accuracy disparities (e.g. performing worse on women, people of color, etc.). The typical response is that this is a problem with the training data and is easily solvable. Is that the case? Are there demographic characteristics where disparities are likely to persist? (For example, we’ve heard that children are much more difficult to recognize and that those age disparities are likely to persist.)
A It is theoretically possible to solve the bias problem with a balanced training data set. The problem is that the theoretical ideal is not going to be realized in the real world without very significant efforts for data collection and careful design of training procedures.
The most comprehensive effort on this front that I know of is IBM’s Diversity in Faces (DiF) dataset. The images are collected from Flickr and have very obvious and serious shortcomings: most were taken by hobbyist photographers and are of high quality, which doesn’t generalize to scenarios such as CCTV or body-worn cameras.
Q How would an FR product actually work? For those of us who are less technologically sophisticated, one question that has arisen is what form FR tech might take. (For example, we assume that an FR algorithm could be developed and integrated into a website so as to allow clients to analyze photos/videos they’ve uploaded; is that correct?) We could generally use some help understanding how it might operate, and at which points human-in-the-loop measures could be integrated.
A FR products could take many shapes and forms depending on what problem we are trying to solve for the users. For example, one could build features into a website that hosts images with faces to fulfill queries such as “find all the faces in this video”, or “find all the faces with hats or glasses in this video”, etc. If connected to a database of mugshots along with their identities, one could also build a feature to enable queries such as “find all video frames in a set of videos where individual X (who has his mugshot in our database) appears”.
Note: These are hypothetical possibilities, and, needless to say, I’m ignoring accuracy and privacy concerns for the sake of the examples.
Q Relatedly, what does it mean (technically/operationally) for a body-worn camera to have FR capability? Is it that the body-worn live streams back to the cloud, where FR analysis is then performed? It does not necessarily need to be real-time, correct?
A That is correct; both scenarios are possible: streaming footage back to the cloud for (possibly delayed) analysis, or running FR on the device itself. In the on-device case there are serious considerations/limitations regarding the processing power and battery life of the device.
Q [Bonus question] How do FR systems compare with human face recognition? In other words, can computers do a better job than humans at recognizing faces?
A Short answer: They are far behind humans in most cases, but many misleading headlines (the oldest we could find was from 2006) have claimed that FR has surpassed human-level performance.
The most common benchmark for comparing human performance with FR is the Labeled Faces in the Wild (LFW) dataset. The naïve interpretation is that human performance on this dataset is 97.35% accuracy, while the best FR model (Google FaceNet, as of Feb 2019) achieves 99.63%.
However, there are multiple serious issues with this benchmark:
- The faces are too easy compared to reality. The faces were scraped from the web using an old but fast face detection algorithm (2001-level technology). The result is that the faces are always completely within the field of view of the camera, show a frontal view, and have very good illumination/distortion/blur characteristics. These conditions alone invalidate the dataset as a good benchmark.
- In realistic applications, the number of non-matches is significantly higher than matches. In LFW, they are assumed to be equal. Again, this completely changes the equations and invalidates the benchmark as a reliable representation of reality.
- Humans in these studies get a video-game-like experience: they watch two faces on a computer monitor and choose whether or not they belong to the same person. This may be similar to what humans do when manually matching faces in a very narrow set of applications. However, it is vastly different from how humans recognize faces in the real world, where they can study the face (a 3D object) from different angles over time. Therefore, these studies do not represent a fair comparison between human and machine performance.
We were not able to find a study where human performance is measured in the physical world. It would be interesting to conduct a comprehensive study of human face recognition performance, including its breakdown across different races, ages, and genders.
Major takeaway: One use-case where FR systems do seem to surpass human performance is matching frontal images of cooperative subjects against a mugshot database of N images. A well-trained algorithm can achieve <1% error rates for target mugshot databases of size N=10⁶ (one million); see Figure 21 in the 2018 FRVT report. In comparison, humans can’t realistically search a database larger than a few dozen faces. Even with smaller datasets, one study has shown that humans perform very poorly (50% to 60% error rates) for a small target size of N=8. The current state of the technology appears to be accurate enough for this use case. Of course, there are still privacy and other issues that must be considered.
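To give a sense of why machines scale to N=10⁶ while humans cannot, here is a minimal sketch of 1:N identification as a vectorized similarity search over precomputed embeddings. The gallery below is a tiny random toy example, not a real model, and the threshold is an arbitrary illustrative value:

```python
import numpy as np

def search_gallery(query, gallery, threshold=0.5):
    """1:N identification: compare one probe embedding against all N
    gallery embeddings at once and return the best match above threshold.

    gallery: (N, d) array of L2-normalized mugshot embeddings.
    query:   (d,) L2-normalized probe embedding.
    """
    sims = gallery @ query                 # N cosine similarities in one matmul
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best, float(sims[best])
    return None, None                      # no candidate above threshold

# Toy gallery of N=4 identities in a 3-dimensional embedding space.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(4, 3))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

probe = gallery[2]                         # probe is identity 2's own mugshot
idx, sim = search_gallery(probe, gallery)
print(idx, round(sim, 3))                  # 2 1.0
```

The search cost is a single matrix-vector product, so scanning a million-entry gallery takes a fraction of a second on commodity hardware, which is exactly the regime the FRVT timing numbers describe.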