1 Introduction

This integrative review synthesizes research findings from 2008 to 2020 on facial recognition software as it is deployed for young adult and adolescent populations. The aim is to determine the extent to which tests deem these technologies effective, and the extent to which test design considers potential human factors and ethical issues inherent in deployments of the software for this group. The review answers the following questions: How are these applications tested? What are the strengths and weaknesses of test design? And what human factors issues do the tests address or implicate?

1.1 Pervasiveness

Facial recognition software has been deployed to help supervise adolescents and young adults. Summer camps, for example, use apps such as Bunk1 and Waldo Photos to allow parents to monitor their children's activities each day [1, 2]. Both require parents to upload sample images to give machine learning algorithms a baseline for recognizing their children. Video sensors positioned around camp then allow parents to track their child's activities. Another company released FINE, which it calls an empathy engine and platform for tracking emotional wellness, including for preteens and teens. FINE, or Feeling Insecure Nervous Emotional, is said to be designed to play an active role in supporting mental wellness by predicting sadness and depression through facial expressions, AI, and digital technology [3]. In India, to reduce the number of cases of missing or abducted children and young adults, police recently launched facial recognition software to help trace children using public surveillance technology [4]. These applications promise to improve childcare provision, well-being, and education.

1.2 Lingering Questions

Still, questions linger regarding the user experience, legality, and ethics of using the software to monitor young adults [5]. There are uncertainties about how facial recognition impacts human factors: infringements on rights, privacy, and liberties; human resources and labor displaced by automation; and technologies that perpetuate bias and exclusion. The Federal Trade Commission is considering updates to online child privacy rules that would deem children's faces, voices, and other biometric data "personal information" protected under federal law [6]. Questions of ownership pose an ethical challenge. Are voices and faces personal information? The digital and open nature of the internet gives long life and wide dissemination to photos and recordings that can be easily edited, manipulated, and used in unintended ways, such as facial profiling and data mining. When parents voluntarily upload images of children online, they usher them into this complex tangle of problems [7]. Ownership terms are also a point of contention. Social media companies write themselves transferable, royalty-free, worldwide licenses to use uploaded images as they see fit for as long as they see fit. Facial recognition software combined with corporate data mining might infringe on the privacy of children whose images are posted often [8]. These issues require thought and care during the design and testing stages of software.

Take, for example, the case of the New York Police Department, which has used the technology to compare crime scene images with juvenile mugshots for about four years. Per the NYPD, a positive match would not be the sole grounds upon which an arrest is made [9]. Reports suggest that the department has done so without full disclosure or oversight, and without the awareness of civilians and civil rights groups [10]. Other departments and cities have had more public debates over deploying facial recognition in this population. The New York State Department of Education delayed one school district's use of facial recognition in school settings over privacy concerns. San Francisco citizens, uneasy about potential abuse, blocked city agencies from using facial recognition. Detroit citizens complain about the accuracy of facial recognition software deployed there, particularly because the software has been shown to have lower accuracy identifying darker-skinned subjects [9, 10]. Fears about the accuracy of facial recognition software strain implementation.

What compounds these problems is that the technology carries a higher risk of false matches for younger faces. The National Institute of Standards and Technology, which evaluates facial recognition algorithms for accuracy, reported that several algorithms have a higher rate of mistaken matches among children, among subjects photographed across long-term aging, and among subjects with injuries [11, 12]. The error rate was highest in young children but also noticeable in ages 10 to 16. Photos kept on file for several years become outdated and might further degrade comparisons and accuracy [9, 10]. Children's faces change substantially between ages 10 and 19; facial recognition software must accurately account for rapid development and change in facial features over time [13,14,15]. Furthermore, the judicial system handles juveniles differently; a deployment of facial recognition technology must therefore abide by long-standing differentiations in policy and precedent that afford juveniles more privacy and protection than adult suspects [9, 10]. These issues require thorough consideration during the design and testing stages of the product life cycle.

1.3 Answering Questions with User Testing

These lingering questions require attention and answers, and user testing is essential both for answering them and for improving software design. According to the Institute of Electrical and Electronics Engineers (IEEE), post product launch, developers spend 50 percent of their time reworking software to fix problems that could have been avoided by preliminary testing [16]. However, product, usability, and user experience research and testing on products and interventions for children are difficult to design for many reasons. Some studies suggest that children and their parents welcome participation in testing [17]. Motivating factors include benefits for the children, altruism, trust in research, and rapport with researchers. However, fear of risks, distrust of research, logistical burdens, daily life disruptions, and feeling like a "guinea pig" are deterrents [17]. Populations that are young, less educated, ethnic minorities, or at or below the poverty line also face barriers to participating in testing [18]. Finally, the representativeness and power of the sample matter for generalizing findings. Some user testing standards suggest that testing five participants can expose many of the problems with software [19, 20]. However, many studies refute this finding by demonstrating that testing five participants neither yields enough prospective problems nor uncovers complex ones [21,22,23,24]. Furthermore, for software intended for clinical, legal, or other applications affecting life and livelihood, a higher standard of sample power, and therefore more participants, may be required, particularly when the research design demands it, as in controlled trials and predictive statistical analyses.
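The five-participant guideline rests on a problem-discovery model often attributed to the sources behind [19, 20]: the expected share of usability problems found by n testers is 1 − (1 − p)^n, where p is the probability that any single tester encounters a given problem. The short sketch below, with p values assumed purely for illustration (the commonly cited average is about 0.31), shows why five testers can surface most frequent problems yet miss rare or complex ones:

```python
def discovery_rate(p: float, n: int) -> float:
    """Expected share of usability problems found by n testers, where each
    tester independently encounters a given problem with probability p."""
    return 1 - (1 - p) ** n

# p values are assumptions for illustration, not figures from the reviewed studies.
for p in (0.31, 0.10, 0.01):
    for n in (5, 15, 50):
        print(f"p={p:.2f}, n={n:2d}: {discovery_rate(p, n):.0%} of problems found")
```

At p = 0.31, five testers expose roughly 85 percent of problems; at p = 0.01, they expose about 5 percent, which is consistent with the critiques that five participants cannot uncover infrequent or complex problems [21,22,23,24].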

2 Methods

An integrative review involves using research databases to compile and systematically synthesize the literature published on a topic; it allows researchers to integrate both qualitative and quantitative findings [25]. For this integrative review, inclusion criteria comprised peer-reviewed research articles published in English from 2008 to 2020; theses and books were excluded. I searched Scopus, Web of Science, PubMed, ERIC, IEEE Xplore, Springer Link, Science Direct, Google Scholar, ACM, and JSTOR to identify sources; access to full-text versions of the articles was limited by library holdings. Search terms included facial recognition, adolescent, teen, university, student, and young adult (truncated to their root words). Database previews, article abstracts, and titles were searched and screened for eligibility. Once studies were screened as eligible, I read each article in full to discern its content and findings. Articles that did not discuss facial recognition software were eliminated. All paragraphs that developed arguments or main points about ethical dynamics, along with the articles' characteristics, were collected in a spreadsheet for analysis and synthesis. Inductive content analysis was used to reduce and group the data and identify the main insights and findings of each article [26,27,28].
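As an illustration of how these root-word terms translate into a database query, the following is a hypothetical Scopus-style search string; the exact syntax varies by database, and this reconstruction is an assumption rather than the literal query used for this review:

```
TITLE-ABS-KEY ( "facial recognition" AND ( adolescen* OR teen* OR "young adult*" OR student* OR universit* ) ) AND PUBYEAR > 2007 AND PUBYEAR < 2021
```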

3 Results

The studies revealed overall potential for facial recognition deployed in young adult and adolescent populations. However, important limitations persist.

3.1 Included Studies

Table 1 displays a flow chart of included studies. The search yielded 92 studies, most of which were eliminated because they were duplicates (n = 29); did not report findings on facial recognition technology, software, or applications (n = 26); used numerical validation without including participants who served as raters in the validation process (n = 15); included only a small sample of young adults within a larger sample of older adults (n = 10); or were not available from the university library (n = 2). Ten studies remained for inclusion in the review.
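The exclusion counts account for the full search yield:

```latex
% Exclusions sum to the search yield minus the ten included studies.
\[
  92 - (29 + 26 + 15 + 10 + 2) = 92 - 82 = 10
\]
```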

Table 1. Inclusion/Exclusion.

3.2 Study Participants

Table 2 shows that the articles included a total of 1551 participants ranging in age from newborn to 33. Cohorts and subgroups ranged in size from n = 51 to n = 500. Five of the studies discussed facial recognition as it pertains to identifying medical conditions such as dysmorphia and related disabilities such as autism. Two discussed facial recognition as it pertains to identifying an ethnic group or estimating age [29, 30]. Only one tested the software's educational value for this population [32]. Some included a few participants who were not adolescents or young adults between the ages of 10 and 20 [29,30,31, 33, 34, 36, 38]. Six did not report ethnicity [29, 32,33,34, 36, 38]. Two included homogeneous samples [29, 37]. See Table 2.

Table 2. Participant demographics.

3.3 Study Methods

See Table 3 for details; bold text indicates the main points of the methods summaries.

Table 3. Study methods.

Table 3 summarizes study methods. Four studies used experimental designs that tested subgroups against controls [30, 32, 34, 37]. Five were validation studies testing software performance and accuracy against actual diagnoses or conditions, particularly comparing human versus computer ratings [31, 33, 35, 36, 38]. In terms of measurement, six studies tracked some form of internal validity to evaluate identification effectiveness or accuracy [29, 30, 32, 34, 36, 37]. Four tracked external validity by way of human raters, physiological responses, or other external data points [31, 33, 35, 38].

Although all studies assembled their own archives of images for testing, six were retrospective insofar as their archives drew on preexisting image databases [31,32,33,34, 36, 37].

Two of the studies tested proprietary software [31, 36]. Two acquired their own images rather than use preexisting data sets [34, 35]. Four studies reported training processes in addition to running accuracy tests [30, 32, 34, 37]. And two enlisted human coders alongside software coding for validation [35, 38].

3.4 Study Findings

Table 4 presents study findings. All studies reported some degree of accuracy, and more points of comparison increased accuracy [30, 32]. However, some studies reported problems with performance, such as an inability to generalize across subsamples due to insufficient variability and distinctiveness in the features [29, 38]; facial features and parameters that yielded insignificant differentiation [32]; and false positives, false negatives, and undetected positives [31, 33, 34, 37]. None of the articles that tested apps factored usability, user experience, or satisfaction into the analysis. None of the studies considered the ethics of the applications. None of the studies discussing possible medical applications discussed the ethics of using such applications for diagnoses. None of the studies requested user feedback from the industries that might use the software, nor did they consult participants in the age group in question about the usability, user experience, or ethics of the proposed applications. Two studies factored facial change over time into the software evaluation and analysis [30, 34].

Overall, all studies reported that the software was capable of high discrimination, but with wide variability. Software ranged from 50% to 98% accurate, or proved more accurate than comparison software [30, 32,33,34, 36]. Results reveal that the more features under consideration, the better [30]. But the studies show that facial recognition proved less accurate in same-sex samples and samples with a narrow age range [39]. Examination of some facial features proved more accurate than others [32]. Some forms of dysmorphia were harder to detect than others [33]. Some software was more accurate than other software yet less accurate than human coding [31, 35]. Some software was deemed inaccurate about a fourth of the time and inconsistent with human coders [34, 36].

Table 4. Study findings.
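To make the error patterns in Table 4 concrete, the sketch below computes accuracy, sensitivity, and precision from a hypothetical confusion matrix; the counts are invented for illustration and are not drawn from the reviewed studies. It shows how a headline accuracy near 90% can coexist with a substantial share of false positives and undetected positives:

```python
# Hypothetical confusion-matrix counts (illustrative only, not from the reviewed studies).
tp, fp, fn, tn = 85, 10, 15, 90  # true/false positives, false negatives, true negatives

total = tp + fp + fn + tn
accuracy = (tp + tn) / total   # share of all classifications that are correct
sensitivity = tp / (tp + fn)   # share of actual positives detected (misses = undetected positives)
precision = tp / (tp + fp)     # share of flagged positives that are correct

print(f"accuracy:    {accuracy:.0%}")     # 88%
print(f"sensitivity: {sensitivity:.0%}")  # 85% -> 15% of true cases go undetected
print(f"precision:   {precision:.0%}")    # 89% -> 11% of matches are false positives
```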

3.5 Study Limitations

The studies considered their own limitations and potential for future work, calling for additional variables and conditions [29, 31, 35,36,37,38], larger and more complex samples [30, 33,34,35], time frame adjustments [31, 32], clinician involvement [33, 34], and greater sample integrity and control [32, 33] (Table 5).

Table 5. Study limitations.

One study [35] factored in participant fatigue and rater protocol design as limitations; it was also the only study to acknowledge the racial homogeneity of its sample as a limitation. Another study [37] warned that clinician expertise is crucial to using the software. The studies also admitted limitations of their own, such as small sample sizes [30, 33,34,35,36], limited detection and reliability [29, 32, 36, 37], failure to factor in the context of facial expressions [31, 38], and archive quality [33, 35, 38]. Only three studies reported adverse events [35] or controversies over using facial recognition for the intended purposes [36, 37].

3.6 Study Quality

Reporting quality was low in most of the studies. All were missing some form of information, such as the following: approval by ethics, human subjects, or institutional review board committees [29, 31]; sufficient demographic details about exact numbers of participants per age group [29,30,31, 33, 36, 38] or ethnicity [29, 32,33,34, 36, 38]; and information about conflicts of interest or acknowledgment of funding and participant contributions [36, 38]. The articles had a relatively high field-weighted citation impact (FWCI). The global mean FWCI is 1.0, so an FWCI of 1.50 means an article was cited 50% more than the world average. The articles ranged from 0 to 4.43 FWCI. However, most of the articles were cited rarely (0 to 9 times); one was cited often and had a high FWCI [31]. Two studies did not publish photos of young adults and adolescents from their samples [31, 37]. Two included photos with some facial parts redacted or cropped [29, 32]. Most of the studies included full photos of the faces of young adults and adolescents with no redacting [30, 32,33,34,35,36, 38]. Five of the studies were published in relatively ethnically diverse countries, including the United States [35,36,37,38] and India [34]. Others were published in relatively ethnically homogeneous countries. See Table 6.
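For reference, the FWCI discussed above is conventionally defined (in bibliometric tools such as Scopus/SciVal, an assumption about the metric's provenance rather than a detail reported in the studies) as the ratio of citations received to citations expected for comparable publications:

```latex
% Field-weighted citation impact: a value of 1.0 equals the world average.
\[
  \mathrm{FWCI} = \frac{\text{citations received by the article}}
                       {\text{expected citations for similar articles (same field, type, and year)}}
\]
```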

Table 6. Study quality.

4 Conclusion

This integrative review revealed that the field sorely needs more research into facial recognition software for young adults and adolescents. Studies of facial recognition software often used experimental designs but failed to meet the sampling standards necessary for validating and generalizing findings. The software tested, the study designs, and the topics covered left lingering questions about the potential clinical and social applications of the technology. The studies also overwhelmingly did not address the complexities of facial change over time that confound the accuracy of facial recognition software. Study designs yielded retrospective and confirmatory findings, insofar as the facial recognition software was mostly deployed to predict preexisting diagnoses and ethnicities. However, the literature did not specify to what end facial recognition software would be deployed prescriptively, nor did it sufficiently address the ethical challenges that practical applications present.

Of the peer-reviewed articles making some reference to the subject matter, only a few provided relevant and sufficient detail. Of the studies that met inclusion criteria, the combined sample of 1551 participants could not undergo rigorous meta-analysis because of insufficient reporting of the exact numbers of participants in key subgroups. Furthermore, samples were often homogeneous, making it difficult to generalize the findings beyond the relatively small subgroups included in the studies. These factors matter because they affect the extent to which findings can be safely and reliably generalized. Therefore, while all studies reported moderate to high levels of facial recognition accuracy, the small sample sizes and (in two cases) lack of experimental design preclude generalizing those accuracy measures to the larger population. Limitations of testing design also call into question the validity of the accuracy findings themselves. If the findings are indicative only of the small participant samples, then they cannot account for the myriad variations and differentiations present in the general population. This matters because five of the studies covered medical and social topics such as disability and ethnicity, where generalizations made from poorly designed tests may find their way into clinical practice and political policy that could negatively impact young adults and adolescents. Error percentages reported as marginal or insignificant, when tabulated across the general population, might negatively impact hundreds of thousands of lives.
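A rough back-of-the-envelope calculation illustrates this scaling concern; both figures are assumptions for the sake of arithmetic (a 1% false-match rate, and roughly 40 million U.S. residents aged 10 to 19), not numbers drawn from the reviewed studies:

```latex
% A "marginal" 1% false-match rate scaled to an assumed population
% of ~40 million 10-to-19-year-olds.
\[
  0.01 \times 40{,}000{,}000 = 400{,}000 \ \text{potential false matches}
\]
```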

Other aspects of research design also posed problems. Most samples were ethnically homogeneous. On one hand, homogeneous studies of non-white ethnic groups were meant as a corrective to studies of facial recognition software originally tested on white participants, which skewed accuracy and reliability toward white populations and, conversely, skewed false positives and false negatives toward populations of color. On the other hand, only a few studies included participants with darker skin. More research is needed into the interaction of the two most common confounders of facial recognition software: different shades of skin and facial change over time. No studies testing products currently on the market included user experience or usability research. Such feedback would be important for selecting and confirming test design. Studies in which facial recognition was deployed to identify disability were designed retrospectively: prior diagnoses were obtained and used to confirm computer diagnoses. Given this limitation, application to prospective diagnoses may not be prudent. User experience and usability feedback can give a more well-rounded and evidence-based picture of potential applications. Findings that integrate feedback from the clinicians and patient populations whose practice and lives would be impacted by the software would be stronger, more ethical, and more reliable.

Some aspects of the ethics of study reporting were also questionable. Although 8 of the 10 studies did seek and obtain approval from institutional review boards or ethics committees, and did report acknowledgements and conflicts of interest, most of the studies presented several full photos of the faces of young adults and adolescents with no redacting. Technically, publishing the photos might have been permitted by the IRB, and the photos might also have been approved by the participants or their care providers. Even so, the aggregation of facial photos, particularly of participants with visible disabilities, creates the possibility that readers could use Google Images or other reverse image search tools to trace the photos back to the identities of these young adults and adolescents. Visual research ethical guidelines recommend that researchers anonymize and redact photos as much as possible in ways that protect the personal information of participants [40, 41]. They also treat people's faces and voices as personal information, given that digital files make compromising identity easier, and they recommend that images of vulnerable populations such as adolescents be further protected behind firewalls that require verification to access. In the past, aggregating and reducing people to physical features in order to categorize differences among them, without taking care to protect their personhood, volition, and autonomy, has been used for pernicious and unethical ends such as eugenics and race science [43,44,45,46]. Therefore, it is important to conduct and report research in this area in ways that respect participants as much as possible.

Overall, there is some potential for facial recognition products for young adults and adolescents, particularly for medical purposes. The aggregate data across 1551 participants showed that facial recognition software exhibited some degree of internal and external validity by comparison to prior diagnoses. However, external validity was rarely established between software and human raters, which would provide more verisimilitude. Furthermore, human raters of all levels of expertise (from young adults themselves to the teachers and clinicians who would use the software) should be enlisted to provide usability and user experience feedback, which none of the studies gathered. Asking users and samples of target populations about the ethical and practical implications of the software could improve design. The studies show that more variables improve software accuracy. Varying study design to include more longitudinal controlled trials as well as user experience research might render the studies more reliable and citable in the research community. More studies are in order that investigate more complex variables and pay more attention to facial change over time and differences in ethnicity. Future studies should avoid existing limitations, such as small and homogeneous samples and protocols that fatigue participants or bias raters. Finally, future studies should take more care to report sufficient details for replication, validation, and synthesis by systematic and integrative reviews.