Table of Contents
- Data Collection Methods
- Data Analysis
- Ethical Considerations
Thorough consideration is needed regarding the representativeness and generalisability of this review's findings to the wider population. Apart from Smith et al. (2015), all studies reported no significant differences between study groups at baseline; these statements were further supported by p-values greater than 0.1 and the data in each study's baseline characteristics summary table. The Joanna Briggs Institute (2017) advised that statistical significance should not be the only measure when commenting on baseline characteristics. Smith et al. (2015) did not report information regarding participants' demographics at baseline. However, results from separate linear regression models showed that variables such as age and gender did not have a moderating effect on clinical outcomes (all p > 0.8). Although participants' demographics were not confounding in Smith et al. (2015), it remains inconclusive whether participants were similar at baseline. Differences between participants in compared groups could reduce comparability (Martínez-Mesa et al., 2016) and threaten the internal validity of the study (Lund Research, 2012).
All selected studies had samples with a mean age under 24, as per the inclusion and exclusion criteria of this review. Although the mean age in McCall et al. (2018) was 21.86, participants' ages ranged from 17 to 46 years, and the standard deviation of 5.5 indicated that individual ages could exceed 24. It is unclear how many participants over the age of 24 were included in this study and whether their results contrasted with those of the other participants. As a result, findings may not be representative of the target population of this review.
Several studies (De Voogd et al., 2017; Fitzgerald et al., 2016; Ip et al., 2016; McCall et al., 2018) had a higher proportion of females than males within their samples, suggesting potential gender bias. Fitzgerald et al. (2016) argued that such gender imbalance reflected the wider population, as mental illness is more prevalent among women than men (Mental Health Foundation, 2016). A separate group analysis for males and females was conducted in McCall et al. (2018), and its results suggested that gender might not be confounding. Moreover, McCall et al. (2018) was conducted among psychology undergraduate students, the majority of whom were female. The unequal sex ratio and high educational level of the sample limited the generalisability of findings from this study to other populations.
Lastly, five of the seven studies were conducted in Western countries, while the other two were from Asia. Since the UK has an ethnically diverse population, careful examination is needed before generalising and applying these findings to the general UK population.
Data Collection Methods
For the purpose of this review, this section focuses only on outcome measures that address the research question. All included studies used well-established, standardised outcome measures for data collection, which strengthened the reliability of findings. Apart from Yang et al. (2016), all studies clearly specified their choice of outcome measure along with a justification. Instead of categorising clinical outcomes (e.g. primary outcome: depressive symptoms; secondary outcome: resilience), Yang et al. (2016) grouped them by the actual outcome measure used (e.g. primary outcomes: depressive symptom count and HAM-D). The reasons behind this reporting style were unexplained. According to Garg and Choudhary (2011), it is crucial to differentiate primary and secondary outcomes; this reporting style made it unclear and confusing for readers to identify the desired clinical outcomes and focus of the study.
Apart from Fitzgerald et al. (2016), all other studies commented on the validity of the selected outcome measures. Cronbach's alpha was referenced to demonstrate the internal consistency of the outcome measures used: α = 0.93 for SCARED and α = 0.89 for CDI (De Voogd et al., 2017); α = 0.85–0.90 for CESD-R (Ip et al., 2016); α = 0.90 for MFQ-C (Smith et al., 2015); and α = 0.82 for HAM-D (Yang et al., 2016). Such high to excellent scores indicated that these scales were reliable across items (Langdridge and Hagger-Johnson, 2013). Additionally, CESD-R had high test–retest reliability, showing that the scale was not only highly reliable across items but also consistent over time (Ip et al., 2016). Rickhi et al. (2015) reported high inter-rater reliability for all primary outcomes, demonstrating a high degree of agreement between raters and increasing the reliability of findings (Langdridge and Hagger-Johnson, 2013).
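To illustrate how Cronbach's alpha summarises internal consistency across items, the sketch below computes it from a hypothetical respondents-by-items score matrix (the data and function are illustrative only, not drawn from the reviewed studies):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = scores.shape[1]                         # number of items in the scale
    item_vars = scores.var(axis=0, ddof=1)      # variance of each individual item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: four items measuring the same underlying construct
rng = np.random.default_rng(42)
trait = rng.normal(size=(30, 1))                     # each respondent's underlying level
items = trait + rng.normal(scale=0.3, size=(30, 4))  # items = trait + small noise
alpha = cronbach_alpha(items)                        # high alpha expected here
```

Values above roughly 0.8, such as those reported by the included studies, are conventionally read as good internal consistency.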
Reliability refers to the extent to which results can be reproduced if a study were repeated under the same conditions (Goodwin and Goodwin, 2012). In other words, reliability is compromised if researchers fail to articulate sufficient detail for replication. Results were collected at pre- and post-treatment in all studies. De Voogd et al. (2017), Fitzgerald et al. (2016), McCall et al. (2018), Smith et al. (2015) and Yang et al. (2016) did not outline their methodology and procedures for data collection, which limited replicability and raised concerns regarding the reliability of findings.
All studies apart from De Voogd et al. (2017) included the CONSORT flow chart, which listed the number of participants who were randomly assigned, received the intended treatment and were analysed for primary outcomes (Schulz et al., 2010). Except for Yang et al. (2016), all studies reported the amount of incomplete data and provided adequate information regarding attrition, together with detailed impact analyses. Such transparent reporting reduced the risk of selection bias and ensured that participants were not omitted intentionally. The high dropout rate within the control group in De Voogd et al. (2017) reduced the sample size and affected the representativeness of findings. Careful consideration is required when including findings from this study, as comparability between study groups might have been distorted.
After randomisation, participants may deviate from the protocol and not complete the study for various reasons, such as non-compliance, dropping out, withdrawal or not receiving the allocated treatment (Schulz et al., 2010). Rickhi et al. (2015) and McCall et al. (2018) did not report how non-adherence data were managed, so it is arguable whether the two study groups remained comparable at baseline during data analysis. This might have interfered with the data and led to potential bias. De Voogd et al. (2017), Fitzgerald et al. (2016), Ip et al. (2016), Smith et al. (2015) and Yang et al. (2016) adopted intention-to-treat (ITT) analysis to minimise bias caused by non-adherence. ITT ensures continuous and representative findings (Joanna Briggs Institute, 2017): all participants who entered the trial were analysed in their originally assigned group from the initial randomisation, regardless of whether they remained in those groups throughout the study (Gupta, 2011).
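A minimal sketch of the ITT principle, using invented records for illustration: each participant is analysed under the group they were randomised to, even when they did not complete the study.

```python
# Hypothetical trial records: (assigned_group, completed_study, post_score)
records = [
    ("treatment", True, 12.0),
    ("treatment", False, 18.0),  # dropped out, but still analysed as 'treatment'
    ("control", True, 20.0),
    ("control", True, 19.0),
    ("control", False, 21.0),    # withdrew, but still analysed as 'control'
]

def itt_group_means(records):
    """Mean post-score per originally assigned group (intention to treat)."""
    groups = {}
    for group, _completed, score in records:
        groups.setdefault(group, []).append(score)  # keep ALL randomised participants
    return {group: sum(scores) / len(scores) for group, scores in groups.items()}

means = itt_group_means(records)  # {'treatment': 15.0, 'control': 20.0}
```

Excluding the non-completers instead would change both group means and risk exactly the attrition bias discussed above.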
Data Analysis
None of the studies reported confidence intervals when presenting their findings. Since the range of possible true treatment differences was not recorded, it is difficult to determine the clinical significance of findings by examining statistical significance alone (Young and Lewis, 1997).
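To illustrate why confidence intervals matter, the sketch below computes an approximate 95% CI for a difference in means from hypothetical scores (it uses the Welch standard error with a normal critical value, an approximation for small samples):

```python
import math
from statistics import mean, stdev

def mean_diff_ci(a, b, z=1.96):
    """Approximate 95% CI for mean(a) - mean(b), using the Welch standard error."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff - z * se, diff + z * se

# Hypothetical post-treatment scores for two study groups
lo, hi = mean_diff_ci([10.1, 9.5, 11.2, 10.4], [12.0, 11.4, 12.9, 12.3])
```

If the interval excludes 0, the difference is statistically significant; crucially, its width also shows the range of plausible true effects, which is what a p-value alone cannot convey about clinical significance.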
Inferential statistics apply results beyond the data collected in a study, drawing conclusions about the whole population based on a small sample (Goodwin and Goodwin, 2012). Inappropriate statistical analysis can bring about statistical inference errors; thus, the statistical tests conducted were examined against Hick's decision tree (Appendix 4). Since all the studies investigated a difference between groups of interval/ratio data, parametric statistical tests were appropriate.
All studies selected an appropriate data analysis technique as per Hick's decision tree. Ip et al. (2016), McCall et al. (2018) and Rickhi et al. (2015) applied the independent t-test to compare scores of two different samples: the experimental group and the control group. Fitzgerald et al. (2016) and Yang et al. (2016) analysed their data using analysis of variance (ANOVA). ANOVA is similar to the t-test, but it also examines the variation of scores within each set of data and can compare more than two groups (Howitt and Cramer, 2007). In both studies, ANOVA was used to compare scores between three sets of data: pre-treatment, post-treatment and follow-up.
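The two techniques described above can be sketched with `scipy` on hypothetical score data (the numbers are invented; only the choice of test mirrors the studies):

```python
from scipy import stats

# Hypothetical post-treatment depression scores for two groups
experimental = [11.2, 9.8, 10.5, 8.9, 9.4]
control = [13.1, 12.4, 14.0, 12.8, 13.5]

# Independent t-test: compares the means of two separate samples
t_stat, p_two_groups = stats.ttest_ind(experimental, control)

# One-way ANOVA: extends the comparison to three or more sets of scores,
# e.g. pre-treatment, post-treatment and follow-up
pre = [14.1, 13.5, 14.8]
post = [10.2, 9.7, 10.9]
follow_up = [10.5, 9.9, 11.0]
f_stat, p_three_groups = stats.f_oneway(pre, post, follow_up)
```

In both cases a p-value below the conventional 0.05 threshold would indicate a statistically significant difference between the compared sets of scores.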
De Voogd et al. (2017) and Smith et al. (2015) assumed data were missing at random. Smith et al. (2015) fitted linear mixed models using maximum likelihood to test whether treatment effects were maintained in the long term, while De Voogd et al. (2017) conducted regression analyses to investigate the interaction between outcome measure scores and time point in each study group; this selection was justified, as the study sought to find a relationship between one sample of two measures (Langdridge and Hagger-Johnson, 2013).
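A simplified sketch of the score-by-time interaction idea: fit a score-on-time slope within each study group and compare the slopes (the two-time-point design and all numbers are our own illustration, not the studies' actual models):

```python
import numpy as np

# Hypothetical scores at two time points (0 = pre-treatment, 1 = post-treatment)
time = np.array([0, 0, 0, 1, 1, 1])
treatment_scores = np.array([14.0, 13.5, 14.5, 9.0, 9.5, 8.5])
control_scores = np.array([14.2, 13.8, 14.4, 13.9, 14.1, 13.6])

# Slope of score over time within each group (change per time point)
slope_treatment = np.polyfit(time, treatment_scores, 1)[0]
slope_control = np.polyfit(time, control_scores, 1)[0]
```

A markedly steeper decline in one group than the other suggests a group-by-time interaction, i.e. that the outcome's trajectory depends on the treatment received.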
Ethical Considerations
All selected studies were conducted on human participants; hence, rigorous ethical principles must be applied (The British Psychological Society, 2014). Many of the participants included in this review were under 18 years old, so, in addition to informed consent from participants, parental consent was also required. Parental consent was obtained for all participants under 18, and informed consent was gained from all participants. Findings were recorded appropriately, and participants were anonymised and not identifiable. To increase rigour, all studies underwent ethical review and gained ethical approval from their local institutional ethics committees.
Debriefing serves an ethical purpose by undoing deception and removing negative feelings caused by an experiment (Neaton, 2015). It is critical that researchers debrief their participants after a study (Howitt and Cramer, 2007), especially in this case, where all participants experienced symptoms of depression and/or anxiety. Both De Voogd et al. (2017) and Fitzgerald et al. (2016) demonstrated good ethical consideration by debriefing their participants to inform them of the nature, results and conclusions of the study (Goodwin and Goodwin, 2012).