Junk science: Detecting deception in 911 homicide calls

Developing a forensic test that can legitimately be used in law enforcement investigations is hard. Many of those proposed end up being junk, mostly due to basic scientific errors in showing that the test can actually identify what it claims to identify with any reasonable reliability and accuracy.

A new test on the scene comes from individuals claiming to be able to tell when someone is lying in 911 calls. This is junk, and it has already been banned in several state courts. I am writing this post because another researcher has published several papers on the topic, including a recent one in a journal I have also published in, Artificial intelligence as a tool for detecting deception in 911 homicide calls. Here are the highlights and abstract:

Highlights
- AI shows potential for improving deception detection in 911 homicide calls.
- ChatGPT analyzed 86 behavioral cues to assess caller behavior in emergencies.
- Random forest model classified deceptive and truthful callers with 70.68% accuracy.
- Deceptive callers displayed emotional, uncooperative behaviors; truthful callers were composed.
- Findings highlight promise but underscore the need for further validation.

Abstract
This paper investigates the application of Artificial Intelligence (AI), specifically the use of a Large Language Model (ChatGPT), in analyzing 911 calls to identify deceptive reports of homicides. The study sampled an equal number of False Allegation Callers (FACs) and True Report Callers (TRCs), categorized through judicial outcomes. Calls were processed using ChatGPT, which assessed 86 behavioral cues from 142 callers. Using a random forest model with k-fold cross-validation and repeated sampling, the analysis achieved an accuracy rate of 70.68%, with sensitivity and specificity rates at 71.44% and 69.92%, respectively. The study revealed distinct behavioral patterns that differentiate FACs and TRCs. AI characterized FACs as somewhat unhelpful and emotional, displaying behaviors such as awkwardness, unintelligibility, moodiness, uncertainty, making situations more complicated, expressing regret, and self-dramatizing. In contrast, AI identified TRCs as helpful and composed, marked by responsiveness, cooperativeness, a focus on relevant issues, consistency, plausibility in their messages, and candidness.

Seems fancy, right? Large language models, random forests, AI. It is easy to be taken in just by reading an abstract in a peer-reviewed paper. This analysis, though, has a major flaw that makes it impossible to tell whether the test works in any way that is helpful. I will not go through the overall nature of the test; in short, it classifies statements into categories such as “awkward” and “nervous”, and then uses these as cues to predict that someone is lying in a 911 call. I will focus on the statistics.

Before we start, there are many metrics one can look at, but in terms of probative value in forensic investigations, we really only care about one: the probability someone is lying given the test flags positive, P(Lying|Test Positive), also known as the positive predictive value. Note this is not the sensitivity reported in the authors' abstract; sensitivity is the reverse conditional, P(Test Positive|Lying), and the two can differ dramatically when lying is rare. While there are circumstances in which a test is useful for the complementary prediction, the probability one is telling the truth conditional on a negative test (the negative predictive value), that is not helpful in this scenario. The reason is that, I posit, the vast majority of people who call 911 do so truthfully, so even a random test would have P(Truth|Test Negative) ~ 1. That is, guessing that people always tell the truth in 911 calls is probably pretty accurate.
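To see how much the positive predictive value depends on the base rate, here is a quick back-of-the-envelope calculation using the sensitivity and specificity reported in the abstract. The candidate prevalences of lying, including the 1% figure I use throughout this post, are my own illustrative assumptions, not estimates from any data.

```python
# Positive predictive value, P(Lying | Test Positive), via Bayes' rule.
# Sensitivity and specificity are taken from the paper's abstract; the
# prevalences of lying among 911 callers are my own illustrative assumptions.

def ppv(sensitivity, specificity, prevalence):
    """P(Lying | Test Positive) at a given base rate of lying."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.7144, 0.6992  # reported in the abstract

for prev in (0.50, 0.10, 0.01):
    print(f"prevalence {prev:>5.0%}: P(Lying | Test+) = {ppv(sens, spec, prev):.1%}")

# prevalence   50%: P(Lying | Test+) = 70.4%
# prevalence   10%: P(Lying | Test+) = 20.9%
# prevalence    1%: P(Lying | Test+) = 2.3%
```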

This is an incredibly important point when evaluating empirical results that purport to validate a forensic test. If you artificially construct your sample to have more liars, your sample statistics and predictions will be biased, making the test appear more accurate than it is in reality. You need to evaluate out-of-sample statistics on a set of data that mimics the actual prevalence in the overall population, which the authors did not do here.
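As a sketch of what that kind of evaluation looks like, here is a simple simulation of a test with the reported sensitivity and specificity applied to calls at a realistic base rate. The 1% lying rate and the 100,000-call volume are assumptions of mine, purely for illustration.

```python
import random

# Simulate evaluating the test at a realistic prevalence, using the
# sensitivity/specificity reported in the abstract. The 1% lying rate and
# the 100,000-call volume are illustrative assumptions, not data.
random.seed(0)

SENS, SPEC, PREV, N = 0.7144, 0.6992, 0.01, 100_000

true_pos = false_pos = true_neg = false_neg = 0
for _ in range(N):
    lying = random.random() < PREV
    flagged = random.random() < (SENS if lying else 1 - SPEC)
    if lying and flagged:
        true_pos += 1
    elif lying:
        false_neg += 1
    elif flagged:
        false_pos += 1
    else:
        true_neg += 1

print("flagged callers:", true_pos + false_pos)
print("of which actually lying:", true_pos)
# With these inputs, roughly 30,000 callers get flagged and only around 700 of
# them are actually lying, i.e. the vast majority of positives are false alarms.
```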

Consider this hypothetical: I have a test that always says someone is lying. In the authors' sample here, since they artificially constructed the sample to be 50/50, my test is 50% accurate. It is obviously absurd, though, and has no real value. In a real sample of 911 calls, if only 1% of people lie, the accuracy of my “everyone is a liar” test is actually only 1%. The authors have that fundamental issue here: they can claim whatever accuracy they want using whatever model, and it has no bearing on the accuracy or positive predictive value in a real sample of 911 calls.
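The arithmetic for the degenerate test is trivial, but to make it explicit (again assuming a 1% rate of lying purely for illustration):

```python
# Accuracy of a degenerate "everyone is a liar" test: it is correct on every
# liar and wrong on every truthful caller, so its accuracy equals the
# proportion of liars in whatever sample it is evaluated on.

def always_liar_accuracy(prevalence_of_lying):
    return 1.0 * prevalence_of_lying + 0.0 * (1 - prevalence_of_lying)

print(always_liar_accuracy(0.50))  # 0.5  -> 50% accurate in the authors' 50/50 sample
print(always_liar_accuracy(0.01))  # 0.01 -> 1% accurate if only 1% of real callers lie
```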

For example, if their model, trained on the artificially balanced sample, predicts that a 911 caller has a 90% probability of lying, and the true prevalence of lying in 911 calls is only 1%, the prevalence-adjusted probability estimate is only about 8%.
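That adjustment is just Bayes' rule in odds form: a 90% prediction from a 50/50 training sample corresponds to a likelihood ratio of 0.9/0.1 = 9, which is then applied to the 1-to-99 prior odds. A minimal sketch (the function name and the 1% prevalence are mine, for illustration):

```python
# Prevalence adjustment of a predicted probability via Bayes' rule in odds form.
# The model was fit on a 50/50 sample, so a predicted 90% corresponds to a
# likelihood ratio of 9. Re-anchoring to a 1% base rate (my own illustrative
# prevalence) gives the probability relevant to real 911 calls.

def adjust_probability(p_model, train_prevalence, true_prevalence):
    """Rescale a predicted probability from the training base rate to the real one."""
    model_odds = p_model / (1 - p_model)
    train_odds = train_prevalence / (1 - train_prevalence)
    true_odds = true_prevalence / (1 - true_prevalence)
    adjusted_odds = model_odds / train_odds * true_odds
    return adjusted_odds / (1 + adjusted_odds)

print(adjust_probability(0.90, train_prevalence=0.50, true_prevalence=0.01))
# ~0.083, i.e. roughly 8% rather than 90%
```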

The authors claim that larger samples will not result in better predictions, but this is just confused logic (random forests should, in my opinion, typically have more like 20,000 observations, not fewer than 200). They also claim the study is high powered, which makes no sense, as they do not estimate any effects in the sample. In short, this research is very poor and should not be used as evidence that one can tell when someone is lying in a 911 call. Using ChatGPT and advanced machine learning merely creates the illusion that the authors did something technically impressive.