The Efficacy of the Outcome Measures in Psychotherapy

Table of Contents

  • Method
  • Participants and Procedure
  • Outcome Measures
  • Researcher Rating
    Self-Report Measures
    Biological Outcome Measure
  • Conclusion

Psychotherapy research typically uses a range of different measures to assess the efficacy or outcome of psychotherapeutic treatments. How outcome is assessed is therefore of central importance. According to Achenbach (2006), one of the most powerful assessment strategies in clinical science is the use of multiple measures of the same disorder and/or its symptoms, which may yield different measurement outcomes. This has several far-reaching implications for the evaluation of therapy outcome, in particular in a Randomized Controlled Trial (RCT), the design commonly considered the gold standard for evaluating the efficacy of treatments.

RCTs typically use several outcome measures, and it is quite possible that these different measures yield different results. Nonetheless, RCTs report on the efficacy of treatment based on only one measure, the so-called primary outcome measure. The primary outcome methodology was born out of good intentions and has been embraced in the field of psychotherapy research because it prevents researchers from ‘cherry picking’ results. This notwithstanding, De Los Reyes (2011) argues that ‘cherry picking’ would also be prevented if researchers reported on all the measures they used. If no ‘definitive gold standard’ outcome measure exists, there is no clear rational ground for selecting one above the others. What makes most sense, then, is to report on all measures.

The argument presented by De Los Reyes (2011) is logical, but it has never been empirically addressed whether or not different outcome measures lead to different results. If they do not, then one can say that reporting on all of them is unnecessary and merely time consuming. If they do, this shows the problematic character of the primary outcome methodology, as it excludes relevant, additional information on outcome that is not captured by one single measure.

In the current paper, we investigate empirically whether or not different outcome measures lead to different conclusions on the efficacy of psychotherapy. To do so, we draw on data from the Ghent Psychotherapy Study (GPS), an RCT comparing the efficacy of cognitive behavioural therapy (CBT) and psychodynamic therapy (PDT) for the treatment of Major Depressive Disorder (Meganck et al., 2017). While the GPS addresses an interaction effect between therapy type and personality style, this interaction effect is not the focus of this paper. To maximize the power of our tests, we only examined whether treatment outcomes differed significantly in the entire sample. Firstly, we assessed pre-post differences in outcome using six different outcome measures. Secondly, we examined whether the test results obtained with the six measures differed significantly from each other.


Participants and Procedure

The sample consisted of 100 participants diagnosed with Major Depressive Disorder (MDD), recruited in Flanders, Belgium, over the course of 16 months. Patients were recruited through self-referral (based on advertisements, flyers, and social media) and through referral by general practitioners and mental health care centres. During the intake procedure, members of the research team administered the Structured Clinical Interview for DSM-IV Axis I disorders (SCID-I; First et al., 2002) and Axis II disorders (SCID-II; First et al., 1986) in combination with the Hamilton Rating Scale for Depression (HRSD; Hamilton, 1967) to ascertain whether participants met the criteria for MDD (DSM-IV; American Psychiatric Association, 2000). Other inclusion criteria were an HRSD total score ≥ 14, age between 18 and 65 years, and sufficient knowledge of the Dutch language. Exclusion criteria were a current diagnosis of psychosis, delusions or bipolar disorder; acute suicidal risk; a primary diagnosis of substance abuse; evidence of a significant medical condition that might prevent full participation in the treatments; and participation in another ongoing treatment. Patients eligible for the study were randomized over the two treatment conditions: 50 patients received PDT and 50 patients received CBT. Treatment in both conditions consisted of 16 to 20 sessions, and 79 participants completed the treatment. Because all participants were sufficiently fluent in Dutch, we used the Dutch versions of the self-report outcome measures and researcher ratings. Missing data for one or more variables were replaced with the mean score of the scale when they did not exceed 20%; any missing data exceeding this cut-off were excluded from the statistical analyses.
Therapists in both conditions were blind to the research hypotheses and to the outcome of the screening measures and interviews, and they received regular supervision from experienced psychotherapists throughout the course of the study. For a more detailed description of the research procedure, we refer to Meganck et al. (2016).

Outcome Measures

Researcher Rating

The Hamilton Rating Scale for Depression (HRSD; Hamilton, 1967) is the most widely used clinician-rated instrument in depression treatment studies. The HRSD has 21 items to assess severity of depression, but only the first 17 items are used. Interrater reliability is considered good in most studies (Bagby et al., 2004), but in some studies internal consistency has been found to be insufficient (Cuijpers et al., 2010). The HRSD has frequently been criticized for its low internal consistency (Cronbach’s alpha in the current study was ***) and for its validity in terms of factor structure. Regardless of these limitations, the HRSD is still used as a gold standard primary outcome measure in psychotherapy studies for depression (Bagby et al., 2004) and was designated as the primary outcome measure in the GPS.

Self-Report Measures

The Beck Depression Inventory-II (BDI-II; Beck, Steer & Brown, 1996) is the best known and most used self-report measure of depression. It is a 21-item self-report questionnaire that measures severity of depressed mood. For each symptom, statements are listed in ascending order, from 0 (non-depressed) to 3 (severely depressed), with total scores ranging from 0 to 63. The BDI-II is generally considered to have good validity and reliability (Beck et al., 1996a, 1996b; Richter et al., 1998; Burt & Ishak, 2002), although Hagen (2007) has suggested that the BDI-II is very heavily based on a cognitive model. The psychometric properties of the Dutch translation are acceptable and comparable to those of the original version of the BDI-II (Van der Does, 2002).

The Depression, Anxiety and Stress Scales (DASS-21; Lovibond and Lovibond, 1995) is a self-report questionnaire consisting of 21 items, with 7 items for each of three subscales: depression, anxiety and stress. Respondents score items on a 4-point (0-3) Likert scale regarding the frequency/severity with which they have experienced each of the 21 negative emotional symptoms during the previous week. To obtain scores comparable with the full DASS (DASS-42), each 7-item subscale score was multiplied by two. The higher the score, the more severe the emotional distress. The Dutch translation of the DASS was previously found to have good psychometric properties (De Beurs et al., 2001).
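
The subscale scoring described above can be illustrated with a minimal sketch. The item-to-subscale key used here is the commonly published DASS-21 key and is an assumption on our part (the present text does not list it); it should be verified against the official scoring template. The function name `score_dass21` is hypothetical.

```python
# Assumed item key (1-based item numbers); verify against the official DASS-21 template.
DASS21_KEY = {
    "depression": [3, 5, 10, 13, 16, 17, 21],
    "anxiety": [2, 4, 7, 9, 15, 19, 20],
    "stress": [1, 6, 8, 11, 12, 14, 18],
}


def score_dass21(responses):
    """Return DASS-42-comparable subscale scores from 21 item responses (each 0-3)."""
    if len(responses) != 21:
        raise ValueError("DASS-21 requires exactly 21 item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be an integer from 0 to 3")
    scores = {}
    for subscale, items in DASS21_KEY.items():
        raw = sum(responses[i - 1] for i in items)  # sum of the 7 subscale items
        scores[subscale] = 2 * raw  # doubled to match the DASS-42 range (0-42)
    return scores
```

Doubling the raw sum of seven items scored 0-3 gives a range of 0 to 42 per subscale, which is the range of the corresponding 14-item DASS-42 subscale.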

The Symptom Checklist-90-Revised (SCL-90-R; Derogatis, 1992) is a 90-item questionnaire scored on a five-point rating scale of distress. Each item describes a physical or psychological symptom rated from 1 (not at all) to 5 (extremely). This self-report inventory covers 8 dimensions of psychological distress: depression, anxiety, phobic anxiety, hostility, cognitive-performance deficits, interpersonal sensitivity, somatization and sleep difficulties. Respondents were asked to rate to what extent the symptoms of the SCL-90-R had manifested in the preceding week. The Dutch version of the SCL-90-R (Arrindell & Ettema, 2003) is considered a reliable and valid instrument; the subscales have demonstrated good to excellent internal consistency, with Cronbach’s alpha coefficients ranging from 0.73 to 0.97. The total score of the SCL-90-R (the so-called Global Severity Index, GSI) is often calculated to express average symptom severity and is often used in outcome research.

The Outcome Questionnaire (OQ-45; Lambert et al., 1996) consists of 45 items scored on a five-point rating scale, ranging from never (0) to almost always (4), with a total score ranging from 0 to 180. The instrument was constructed to measure both symptom reduction and social functioning. It has three general dimensions: psychiatric symptom distress (25 items associated with the most common disorders in public mental health care: depression, anxiety, addiction), interpersonal relations (11 items associated with the functioning of the patient in relationships with a partner, family and friends), and social role functioning (9 items associated with functioning at work, school and leisure). The instrument has nine reverse-scored items. The original and Dutch versions have both been found to have good reliability and good validity (Lambert et al., 1996; De Jong et al., 2007; alpha = 0.68 and 0.95).
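
As an illustration of how reverse-scored items enter the total score, the following is a minimal sketch. The set of nine reversed item numbers is an assumption for illustration only (the present text does not name them) and must be checked against the OQ-45 manual; the function name `score_oq45_total` is hypothetical.

```python
# Assumed reverse-scored items (1-based); illustrative only, check the OQ-45 manual.
REVERSED_ITEMS = {1, 12, 13, 20, 21, 24, 31, 37, 43}
MAX_ITEM_SCORE = 4  # items are rated from 0 (never) to 4 (almost always)


def score_oq45_total(responses):
    """Return the OQ-45 total score (0-180) from 45 raw item responses."""
    if len(responses) != 45:
        raise ValueError("OQ-45 requires exactly 45 item responses")
    total = 0
    for item_number, raw in enumerate(responses, start=1):
        if raw not in (0, 1, 2, 3, 4):
            raise ValueError("each response must be an integer from 0 to 4")
        # Reverse-scored items are flipped so that higher always means more distress.
        total += MAX_ITEM_SCORE - raw if item_number in REVERSED_ITEMS else raw
    return total
```

Flipping the reversed items before summing keeps the direction of the scale consistent, so that a higher total always indicates poorer functioning.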

Biological Outcome Measure

Salivary Cortisol Assessment: Cortisol levels (in µg/dl), used as an indicator of stress and a biological marker for depression, were measured using saliva samples. Adam & Kumari (2009) propose that four or five time points over three to five days are characteristic of “moderately high intensity protocols” for diurnal salivary cortisol collection. Thus, each participant provided 4 saliva samples upon awakening on 4 consecutive days at baseline, before every fourth session, post-treatment and at follow-up. Saliva samples were collected in a salivette by passive drool. Participants were instructed not to drink or eat anything for one hour before sampling, as this may influence cortisol levels (Kudielka et al., 2007). Participants were also given a card on which they were asked to record the date and time of sampling. On completion of sample collection, the samples were mailed back to the GPS research group and stored at -80° Celsius until analysis. Cortisol levels were quantified by means of mass spectrometry, following standard practice in salivary hormone research (e.g. Kirschbaum, Bartussek, & Strasburger, 1992).

The reliable change index (RCI) suggested a significant therapeutic effect when applied to the researcher ratings (Hamilton Rating Scale for Depression) and the self-report measures (BDI-II, DASS-21, SCL-90-R, OQ-45). However, no significant effect was found for the biological measure, cortisol. When comparing the RCIs obtained for the different outcome measures, we observed that researcher ratings and self-report measures lead to similar (i.e. not significantly different) conclusions with respect to therapeutic efficacy, but that there are significant differences between these measures on the one hand and the biological measure on the other.
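
For readers unfamiliar with the RCI, the following is a minimal sketch of one widely used formulation (Jacobson and Truax), in which the pre-post difference is divided by the standard error of the difference. Whether this exact formulation was used in the analyses reported here is an assumption; the function names and the example values are hypothetical and are not study data.

```python
import math


def reliable_change_index(pre, post, sd_pre, reliability):
    """Jacobson-Truax style RCI: (post - pre) / standard error of the difference.

    sd_pre is the standard deviation of the measure at baseline and
    reliability is its reliability coefficient (for example, Cronbach's alpha).
    """
    if sd_pre <= 0:
        raise ValueError("sd_pre must be positive")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in the interval [0, 1)")
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0) * se_measurement  # standard error of a difference score
    return (post - pre) / s_diff


def is_reliable_change(rci, critical=1.96):
    """Change is treated as reliable when |RCI| exceeds the critical z value."""
    return abs(rci) > critical


# Hypothetical example values, not taken from the study.
example_rci = reliable_change_index(pre=30, post=15, sd_pre=7.5, reliability=0.9)
```

Because the RCI expresses change in units of measurement error, it puts instruments with different score ranges on a common footing, which is what makes a comparison across outcome measures possible in the first place.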

We consistently found that all measures differed from cortisol as an indicator of outcome. Cortisol levels in our study were lower than those found in previous studies (e.g. Wust et al., 2000b), which could reflect differences in the cortisol assays used or non-compliance with instructions for collecting samples, as suggested by Kudielka et al. (2003). Although monitoring compliance was not possible, evidence suggests high concordance between objectively measured and self-reported collection times of morning samples (DeSantis et al., 2010). Follow-up findings again showed significant pre-post treatment differences for the self-report measures but not for the cortisol measures. Some studies have shown high cortisol levels in morning samples from depressed populations (Bhaguagar, 2005; Pruessner, 2003), while other studies have reported low cortisol levels in depressed populations (Huber et al., 2006; Stetler & Miller, 2005). Findings from our study fit the latter pattern of low cortisol levels. Although there is support for higher cortisol levels in morning samples from depressed populations, the evidence remains inconclusive. In our study, cortisol was not an indicator of depressive symptoms, which is inconsistent with the findings of Knight et al. (2010), who reported that salivary cortisol was strongly and independently associated with depressive symptoms.

As far as we know, this is the first study to adopt the approach proposed by De Los Reyes et al. (2011), namely to examine and report on all outcome measures included in an RCT rather than to focus solely on the primary outcome measure, as is the norm in psychotherapy research. In examining the convergence or divergence of outcome measures, we found convergence among five of the six outcome measures administered. The guiding principle behind testing multiple measures, according to De Los Reyes et al. (2011), is that no definitive outcome measure exists to ascertain treatment response for any one intervention. Thus, it is logical to report on all outcome measures included in a study. This is even more prudent when the outcome measures used to assess treatment outcome diverge, even though this was largely not the case in our study.

Contrary to our hypothesis, we did not find significant convergence between cortisol and the other outcome measures used to examine treatment efficacy. Nonetheless, our study is unique in that it examines convergence among biological measures, self-report measures and researcher ratings in testing the efficacy of treatment for major depression. Examining and reporting on all the outcome measures in our study provides a broad overview of possibly relevant factors. In addition, our sample comprised individuals with previous and current depressive episodes. Nevertheless, some limitations need to be taken into account.


We encountered problems with missing data on self-report measures and cortisol samples at the post-treatment and follow-up assessments, and our sample as a whole was also subject to attrition, which limits the generalizability of our findings. Furthermore, attrition may have resulted in lower power to demonstrate efficacy for some of the outcome measures at the post-treatment and follow-up assessments, making it difficult to accurately identify convergence or divergence.

The HRSD assessors were not blind to the treatment condition; therefore, we cannot rule out observer bias in the researcher ratings. However, controlling for assessor-rated treatment expectations did not alter the pattern of results, and we did find convergence between the researcher ratings and the patient-rated outcomes. The recorded times of the saliva samples were monitored only by the patients themselves; thus, we cannot comment on total patient compliance with the instructions given for sample collection.

It could be argued that examining and reporting on all outcome measures is time consuming and somewhat unnecessary, especially when the measures show convergence. However, in order to know the expected rate of convergence, one would need estimates of the statistical power to observe replicated effects across measures, which go beyond the parameters of interest in a power analysis for a single measure.

Power analyses for the GPS (Meganck et al., 2017) were conducted on the basis of the primary outcome method, that is, to determine the statistical power needed to detect a given effect on one measure. We did not conduct a separate power analysis to determine the statistical power needed to observe replicated effects across multiple measures.

This study does not speak to the reasons why these interventions with markedly different theoretical backgrounds might result in comparable outcomes.


In mental health research, it is important to address whether a treatment evaluated within a randomized controlled trial is successful based on all reliable and valid outcomes. Researchers assert that they have created techniques that alleviate suffering that is difficult to treat, let alone assess (De Los Reyes et al., 2011). These assertions are extraordinary, to say the least. However, how can we call them extraordinary if they do not rest on extraordinary evidence? Based on our study, we agree with De Los Reyes (2011) that it is illogical to report that a treatment successfully improves outcome based on one ‘nondefinitive’ measure when there are other measures that could justifiably be used as well. As researchers, we aim to collect data and make interpretations based on extraordinary evidence in order to further and improve current evidentiary practices. To do this, value must be placed on the use of multiple measures, especially when there is no single definitive measure. Shifting the focus to how different indicators of outcome converge or diverge can be very informative about whether and how our treatments work.
