Thursday, December 12, 2024

Methodological Differences in Risk Assessments

By Sharon M. Kelley

Sand Ridge Secure Treatment Center – Evaluation Unit, Madison, Wisconsin

A few years ago, at a conference in Australia, Katie Gotch, MA, LPC asked me how I would characterize my research. I replied, “Things that annoy Sharon Kelley.” A joke, of course, but with some truth to it, since there are questions I have had to answer in court without knowing the answer (e.g., what is the rate of undetected sexual offending?). I would characterize the focus of my research as best practices in risk assessment. As part of this, I have published and co-authored articles on static and dynamic risk tools, protective factors including the Structured Assessment of Protective Factors against Sexual Offending (SAPROF-SO), undetected sexual offending in risk assessment, practical guidance in applying time-free and long-term risk estimates, and potential differences in evaluators’ judgment outside of empirically predictive factors. I suppose when some professionals persistently fail to attend to best practices in the field, I can find this annoying, too (I am not a perfect human). This is not to say people cannot have different opinions than me. Absolutely. Which risk tool is better for a particular population or referral question? That is certainly debatable. When is a brand-new methodology ready for use? It depends on a variety of factors (e.g., Does it need to meet admissibility standards in court? Is one using it for a treatment needs assessment versus an assessment that considers the ultimate risk probability?). There are plenty of issues still arguably up for debate, especially in certain contexts.

Where I struggle is with issues that have been debated for the past decade, and where there are sufficient empirical studies that should largely resolve the argument, yet the dead horse is still not in the ground. I would like to talk about some of those issues.

Not Using a Measure of Dynamic Risk

There was some discussion of this in professional circles recently, and I want to expand on it. The meta-analysis by Mann et al. (2010) on dynamic risk factors / criminogenic factors / psychologically meaningful factors was, by all accounts, an important and oft-cited study. According to Google Scholar, it has been cited 1,326 times since the article became available. At the time the paper was written – over a decade ago – the authors noted that formal measures of dynamic risk factors (DRFs) were “still sufficiently underdeveloped that important questions remain concerning the conceptual foundations of these scales, whether they target the most relevant factors and the extent to which it is possible to associate recidivism rates with specific scores.” This made sense in 2010; tools like the STABLE-2007 and VRS-SO had only been in circulation for a few years and did not offer the type of norms and validation studies they have now. It was also important to explore any potential DRFs that had too few studies when earlier meta-analyses were conducted. However, the Mann et al. (2010) article was never meant to replace a formal measure of dynamic risk. Indeed, the authors offer these final conclusions:

First, evaluators should avoid being overinfluenced by the presence of any single risk factor, however floridly manifested. Second, only relatively comprehensive assessment of a range of psychological risk factors will make it possible for this kind of assessment to have useful predictive power. Third, this is precisely the kind of situation (a relatively large number of risk factors, each making only a small contribution to prediction) in which mechanical integration of risk factors can be expected to outperform human judgment (Kahneman & Klein, 2009).

The authors were directly telling us not to use this paper in lieu of a formal risk tool. As the authors alluded to in their conclusion, using an “empirically-guided” approach to risk assessment involves the evaluator attending to the risk factors listed in the Mann et al. (2010) article but assigning their own definition and coding instructions to each factor, as well as its predictive weight and importance. There is no reliability between evaluators, no reliability within evaluators (i.e., the same evaluator can be inconsistent in how this is applied between cases), no known predictive validity, and no known error rate. Further, evaluators can be unduly influenced by one or two factors, despite the Mann et al. (2010) article clearly specifying that none of these factors are hugely predictive by themselves or “super factors.” Meanwhile, there are well-established and validated mechanical instruments that measure DRFs. These have been studied in a variety of samples and in independent validation studies, and they offer published test manuals, known error rates, general acceptance in the field, and sufficient reliability and validity. Why professionals opt for a method that has been shown to be less predictive and potentially more fraught with problems and influences unrelated to prediction (e.g., differences among the evaluators themselves) continues to puzzle me.

Double Counting Risk Factors and Clinical Overrides

The second issue I sometimes see is when evaluators use a formal measure of DRFs but override the result. This is done in one of several ways. First, the evaluator identifies a clinical factor that appears especially important in the treatment needs/risk profile and sees it as needing additional predictive weight to account for its perceived importance. Take, for example, an individual who is demonstrating a high sexual drive, which raises concerns. The evaluator scores the Static-99R and a DRF tool like the STABLE-2007. The evaluator assigns high scores (2) for both Sex Drive and Sex as Coping, among several other items relevant to risk. When integrating the Static-99R and STABLE-2007 scores, the evaluator finds the individual to be in the average risk range. Yet, in the evaluation report, the evaluator concludes that this is likely an underestimate, given that the individual is preoccupied with pornography and has a high rate of masturbation in their current setting. This might also be linked to the individual having had a high sexual drive at the time of past sexual offenses, so the evaluator specifies that this is especially risk relevant. What the evaluator is doing is counting the same data twice: first to justify giving the individual a score of “2,” and then again as if it were a separate risk factor that will incrementally contribute to risk. There is no empirical study I can think of that would support this methodology. Here the evaluator is treating this as a “super factor,” even though Mann et al. (2010) found no such evidence of super factors.

Second, the evaluator identifies the person as having “unique” features outside the tool’s sampling frame (aka the individual is a “black swan”), which justifies an override. This is usually done for men over the age of 60 whom the evaluator continues to see as risky and not aging in a “normal” manner. The conclusion is to disregard the protective effect of age, which has otherwise been repeatedly identified as an important predictive variable even when the index offense occurred when the individual was in their 60s (see Jeff Sandler’s work). Without diving too deep, any conclusion by the evaluator that their case is unique should be reached rarely, with caution, and with good justification. If one is finding “black swans” frequently, they are probably not black swans.

Third, any clinical override should be well justified. Sometimes overrides are understandable. In their ATSA Forum article (2021, Vol. 34, Issue 1), Hanson and Thornton mention exceptional factors, or Meehl’s “broken leg” factors, that could invalidate actuarial assessments. These include “clear evidence that the individual has decided to reoffend” or being imminently at risk of reoffending, such as actively setting up an imminent offense or making statements about an intention to reoffend. A clinical override should not be used simply because the individual is demonstrating DRFs that are already captured in a formal dynamic tool.

Evaluators may wish to disagree with me. They might argue that clinical judgment separate from scoring DRFs is of critical importance in risk assessments and provides predictive value above and beyond formal tools. However, since a succession of studies has found that clinical overrides do not improve prediction, the person making this clinical judgment needs to show why their judgment is far superior to that of other clinicians.

Overall, the impact of evaluators relying on empirically-guided clinical judgment or “super factors” when completing high-stakes assessments is potentially far more concerning in court than contending with admissibility issues related to established DRF tools like the STABLE-2007 or the VRS-SO. Evaluators appear to have internal differences that contribute to risk assessment conclusions, especially when using less structured techniques (see research by Marcus Boccaccini and Daniel Murrie, among others). Rachel Kahn, I, and others at Sand Ridge found this includes over-weighting risk factors as well as weighting factors not empirically related to risk. Further, we found that some evaluators tend to be more risk-sensitive despite available base rates of sexual recidivism. As a result, the assessment outcome was best predicted by the evaluator assigned to the case. Risk assessments, especially in high-stakes settings, should not be about the luck of the draw. After more than a decade of research, we can do better.

