Fear of childbirth measurement: appraisal of the content overlap of four instruments

ABSTRACT
Objective: To evaluate empirically the degree of content overlap between four self-report measures of fear of childbirth (FoC) identified as ‘best in class’ by a recent review.
Background: FoC and tokophobia are an area of increasing clinical concern and have been linked to poor maternal and neonatal outcomes. Clinical pathways have been established to improve care and interventions for FoC; however, ambiguity and inconsistency remain regarding the most appropriate assessment measures.
Method: A multi-rater and consensus content analysis was undertaken to determine the degree of overlap between four ‘best in class’ measures of FoC/tokophobia.
Results: The Slade-Pais Expectations of Childbirth Scale (SPECS) was found to be the preferred measure of the tools evaluated in terms of symptom overlap; however, the overall level of overlap among these measures was weak.
Conclusion: Limitations inherent to the current battery of preferred measures of FoC suggest that the development of a theoretically-grounded, psychometrically robust and accurate FoC assessment measure is both desirable and urgent. Current measures of FoC are not interchangeable.


Introduction
Fear of childbirth (FoC) represents an area of contemporary clinical concern and research focus and describes anxiety related to the event of childbirth (Dencker et al., 2019; Nilsson et al., 2018). The underlying complexity of FoC as a phenomenon is important to unpack both within the context of identification (Richens et al., 2018; Toohill et al., 2015) and intervention (Klabbers et al., 2019; Toohill et al., 2014). The urgency for this relates not only to the well-being of the woman herself, but also to the potential impact on the future relationship between mother and baby (Pazzagli et al., 2015). An existing tension in the literature relates to differentiating (or not) FoC from tokophobia (tocophobia), the challenge being that FoC may be conceptualised within a continuum model where a degree of FoC may be considered both normal and expected (Richens et al., 2018). Tokophobia implies severe FoC and thus a dichotomy between absence/presence of a clinical presentation that is defined as a specific phobia (Hofberg & Brockington, 2000; Striebich et al., 2018). FoC and tokophobia appear to be used interchangeably within the literature, an equivalence with doubtful support given the fundamental disagreement over whether childbirth-related fear is a continuum (Richens et al., 2018) or a distinct pathological state (Poggi et al., 2018). A definitive position on the continuum/state differentiation is unlikely to be realised soon, particularly given that these distinctions represent seemingly irreconcilable positions in many areas of mental health, for example, schizophrenia (Fleming & Martin, 2011). The recent development of clinical pathways specific to FoC represents a step-change in access to, and provision of, evidence-based interventions in severe FoC or tokophobia such as cognitive-behavioural therapy (Larsson et al., 2017).
Somewhat surprisingly, the most fundamental component of a clinical referral pathway, the initial screen, remains an area of ambiguity in terms of choice of tool (Pallant et al., 2016; Richens et al., 2018) and associated confidence in the measurement acuity of the tools that are available (König, 2019).
A number of measurement tools appropriate for FoC screening have been developed, ranging from single items to questionnaires with 70+ items (Richens et al., 2018). The most widely-used measure to date has been the 33-item Wijma Delivery Expectancy Questionnaire (W-DEQ; Wijma et al., 1998), in many respects the 'gold standard' of the FoC screening genre due to both extensive use in clinical research (Nilsson et al., 2018) and widespread translation and validation internationally (Korukcu et al., 2012; MoghaddamHosseini et al., 2019; Mortazavi, 2017; Takegata et al., 2013). A central tenet of the W-DEQ, and indeed a core feature of both its use and conceptual underpinning, is that the tool assesses a single dimension of FoC (Wijma et al., 1998); thus interpretation is based on the total score and preferred screening threshold level (Nilsson et al., 2018). This core ascribed attribute (uni-dimensionality) presents a significant challenge to the conceptual alignment of the W-DEQ to clinical application in that, invariably, studies that have examined the underlying measurement characteristics of the W-DEQ using factor analysis have found the tool to be multi-dimensional (Fenaroli et al., 2019; Johnson & Slade, 2002; König, 2019; Pallant et al., 2016). Challenges to the accepted dimensionality of a screening measure may foster new and useful insights into what a tool really does measure. The Hospital Anxiety and Depression Scale (HADS; Zigmond & Snaith, 1983), for example, is conceptualised and scored as a two-dimensional (anxiety and depression) measure but has been found to be tri-dimensional in many studies (for example, Christensen et al., 2020), findings that have been valuable in contextualising the measure within an alternative and coherent model of depression (Martin et al., 2008). However, such insights are generally based on consistency between factor analytic findings across studies. Many of these are specific to the W-DEQ.
Findings from factor analysis studies vary widely, with large variation between the number of factors found and fundamental differences in the pattern of item-factor loadings observed (Fenaroli et al., 2019; Johnson & Slade, 2002; Mortazavi, 2017; Pallant et al., 2016). It has been suggested that translations of the W-DEQ may be affected significantly by cultural context and that the translation of some items may be problematic (Richens et al., 2018). However, the relative merits of this perspective are profoundly limited by the measurement model of the W-DEQ being uni-dimensional (Wijma et al., 1998). The most extensive measurement model evaluation of the W-DEQ was undertaken by Pallant et al. (2016) using both exploratory and confirmatory factor analysis and, in addition, a Rasch analysis for scale and sub-scale uni-dimensionality. Pallant et al. (2016) concluded from their analysis that (a) the W-DEQ is multi-dimensional, comprising four distinct factors, (b) a shortened revision with redundant items removed may be useful and (c) the W-DEQ should not be used in its current form.
Given the above concerns about the W-DEQ and the length of the tool for practical clinical application (Richens et al., 2018), a short instrument, such as the Fear of Birth Scale (FOBS; Haines et al., 2011), may have greater potential, particularly within the context of both research and clinical practice (Richens et al., 2019). The FOBS does seem promising for an initial screen, comprising just two items (fear and worry), each scored on a 10 cm visual analogue scale, with the mean score compared against a threshold. Despite its brevity, the FOBS has been shown to be as effective as the W-DEQ for screening (Richens et al., 2018) and, for the pragmatics of stepped screening within a clinical pathway, the notion of using both tools has garnered interest. A concern raised regarding the W-DEQ has been inconsistent threshold/cut-off scores for the identification of severe or significant FoC, a concern that may also be inferred from studies of the FOBS, which have indicated varying cut-off scores (Ternstrom et al., 2015) and even utilised alternative thresholds within the same study (Richens et al., 2019). A recent investigation also highlighted what may be a fundamental limitation of the FOBS, namely inherent measurement error, the suggestion being that at the very least the FOBS requires further psychometric appraisal and potentially modification (Richens et al., 2019).
The literature thus presents a service provision conundrum: the goal of providing evidence-based support for clinically-relevant FoC set against the lack of an agreed definition (Nilsson et al., 2018) and significant limitations in the two most widely-used screening tools (Pallant et al., 2016; Richens et al., 2019). Consequently, the operationalising of a clinical pathway in these circumstances is limited not only in terms of accurate screening, and thus appropriate access to a service, but also in relation to accurate assessment of outcomes, since both the W-DEQ and FOBS are also used to assess intervention efficacy. In recognition that the issues above are largely a function of the lack of an unambiguous and evidence-based construct of FoC, a meta-synthesis of the literature was undertaken to identify and understand the content of FoC from the woman's perspective. The findings from this work were incorporated into a further study, combined with in-depth interviews with pregnant women experiencing FoC and midwives, to identify the underlying components of FoC which may be used to develop measurement tools that are theoretically and conceptually anchored (Slade et al., 2016, 2019). Slade et al. (2019) detailed that the next stage in their stepwise project was to examine women's appraisal of items used in existing measures and to map the constructs from their study on to these tools. A recently published study (Slade et al., 2020) highlighted the potential use of four instruments, these being the W-DEQ, the FOBS, the Slade-Pais Expectations of Childbirth Scale (SPECS; Slade et al., 2016) and the Oxford Worries about Labour Scale (OWLS; Redshaw et al., 2009). The SPECS is a 50-item multidimensional measure of birth expectancy of which fear of childbirth represents a distinct sub-scale (10 items), alongside many items which are conceptually related to fear, for example, loss of control.
Twenty-six items from the SPECS have been suggested for use in clinical practice for identification of FoC, though as far as the authors are aware this measure is currently in clinical use at one site in the UK (see Note 1). Uniquely among the four instruments, the OWLS was never conceived as an instrument to assess any domain of FoC. Indeed, the OWLS was developed and originally validated as a nine-item multi-dimensional measure of worries about the labour experience. The OWLS has been used, or is planned for use, in a number of studies, none of which has a primary focus on FoC (Henderson et al., 2018; Henderson & Redshaw, 2016; Krusche et al., 2019; Roch et al., 2018). The OWLS assesses three distinct but correlated domains of distress, uncertainty and interventions with respect to labour. Despite profound differences in conceptual heritage and measurement characteristics between these four measures, it is important to be aware that the selection of these measures as representing key aspects of relevance of FoC to women themselves is representative of an endpoint of exhaustive reviews of the literature (Slade et al., 2020) and in-depth interviews with practitioners and women experiencing FoC (Slade et al., 2019). Further, these measures, and in particular the SPECS and the OWLS, received endorsement from women themselves as including items that best represent their experience.
Reflecting on the range of instruments that are used to assess FoC, the observation that they are used interchangeably without a selection rationale, and the context that FoC itself has only recently received focused attention in conceptual alignment from the woman's perspective (Slade et al., 2019), the overlap between the four measures highlighted by  and Slade et al. (2020) is of interest to appraise for two reasons. Firstly, the rich qualitative insights that have led to a focus on these four tools have yet to be triangulated using a quantitative approach. This is methodologically relevant because limitations in existing tools have highlighted the quantitative aspects of measurement to a significant degree, for example, the work of  and Slade et al. (2019). Secondly, the notion that measures of a concept may be used interchangeably has been emphasised as a highly contentious practice (Fried, 2017). Indeed, such assumed equivalence of measures may be one of the contributors to the 'replicability crisis' currently confronting the behavioural sciences (Anderson & Maxwell, 2017; Bardi & Zentner, 2017; Coiera et al., 2018; Loken & Gelman, 2017). The findings from  and Slade et al. (2020) regarding the use of the FOBS, W-DEQ, SPECS and OWLS in terms of representing, to a lesser or greater degree, women's symptoms and experience may suggest that these measures could be used interchangeably if there is sufficient overlap in symptoms within the scales. Recent influential work in this area by Fried (2017) in relation to depression, where self-report measures are indeed used interchangeably, found the degree of overlap between measures to be low and suggested that this may be a significant contributor to replicability failure. Given the diversity inherent in the measures of FoC outlined above, Fried's (2017) perspective would appear relevant to investigate in the context of these tools.

Aim
The aim of the current investigation was to evaluate the four instruments identified by  and Slade et al. (2020) as best representing women's experience of FoC in terms of overlap of symptoms intrinsic within each scale across scales.

Methods
Using an adaptation of the approach of Fried (2017), an empirical content analysis of the FOBS, W-DEQ, SPECS and OWLS was undertaken to evaluate item overlap across the scales. The approach of Fried (2017) was modified for two key reasons. Firstly, Fried (2017) evaluated depression screening measures against an established diagnostic entity, specifically symptoms associated with depression, which guided the selection of items from each questionnaire so that they could be meaningfully compared. Consequently, in that study, from a total of 125 items in the seven measures evaluated, fewer than half of the items were evaluated for overlap, by condensing items to specific depression symptoms. However, in the case of FoC, a comparable diagnostic entity does not currently exist; therefore, all items across the four measures were included for overlap comparison. Statistically exquisite and undoubtedly methodologically innovative as Fried's (2017) study was, the author himself highlighted that the condensation of items for comparison was subjective. Fried (2017) also emphasised that a less conservative approach would be to consider all items. Reflecting on the above, our approach was therefore to select all 70 items from the four measures for overlap analysis to ensure that all FoC experiences captured by the tools were incorporated into the analysis without subjective bias. Fried (2017) differentiated between specific symptoms, those that would be generally considered more or less identical between items, and compound symptoms, those which shared a high degree of similarity between items but were not equal. We adopted the same approach, but differentiated between high overlap and moderate/modest overlap for our overlap categorisation, each compared to a no overlap categorisation.
Secondly, a further elaboration of Fried's (2017) methodology in the current study was that we conducted four content analyses (in contrast to the single content analysis in Fried (2017)) based on the appraisal of overlap by practitioners knowledgeable in the area of FoC and who had developed a clinical pathway for FoC (see Note 2). Three further content analyses were also conducted by academic and clinical colleagues with no significant specific knowledge of FoC but with familiarity with content analysis, thus offering an opportunity to evaluate any marked variability between those with and without specialist FoC knowledge. A consensus content analysis was then constructed by using the modal value (mode score) across raters for each overlap score. The consensus content analysis was then subject to statistical analysis.
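The construction of the consensus analysis from the modal value across raters can be illustrated with a minimal sketch. The 0/1/2 coding, the example scores and the tie-breaking rule (ties resolved to the lower overlap code) are assumptions introduced here for illustration only; the tie-handling used in the study is not described in the text.

```python
from collections import Counter

def consensus_mode(ratings):
    """Return the modal rating across raters.

    Assumption for this sketch: ties are resolved to the lower
    (more conservative) overlap code.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(code for code, n in counts.items() if n == top)

# Hypothetical coding: 0 = no overlap, 1 = moderate overlap, 2 = high overlap.
# Each row holds the seven raters' scores for one item-by-scale comparison.
rater_scores = [
    [2, 2, 1, 2, 0, 2, 1],
    [0, 0, 1, 0, 0, 1, 0],
    [1, 2, 1, 0, 1, 2, 1],
]

# One consensus score per comparison, taken as the mode across raters.
consensus = [consensus_mode(row) for row in rater_scores]
print(consensus)
```

With seven raters and three categories a tie (for example 3-3-1) is possible, so any real application would need to state its tie rule explicitly.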

Statistical analysis
Consistent with Fried (2017), the content overlap of items was estimated using the Jaccard Index (JI). This metric represents a similarity coefficient specifically for binary data with a 0-1 range, where 0 represents an absence of overlap and 1 represents complete overlap. The calculation of the JI is described in detail in Fried (2017); within the context of the current investigation, the binary classification required for calculating the JI was obtained by collapsing the high and moderate overlap classifications into one category for comparison against the no overlap classification, thus giving a dichotomous categorisation. We also adopted the same criteria used by Fried (2017) for evaluation of the strength of the JI coefficient. The criteria of Evans (1996) range from very weak (0.00-0.19), increasing incrementally through weak, moderate and strong, to very strong (0.80-1.00). Consistent with Fried (2017), and as an abstraction of the specific/compound conceptualisation of scale overlap, we calculated separate item high vs. moderate overlap estimations across scales and also calculated the frequency of idiosyncratic items per scale, essentially those that appeared in no other scale.
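As a minimal sketch of the calculation described above, the JI for two scales is the number of shared content categories divided by the number of categories present in either scale. The category names and codes below are hypothetical, and the three intermediate bands (0.20-0.39 weak, 0.40-0.59 moderate, 0.60-0.79 strong) are the commonly cited Evans (1996) bands rather than values stated in this text.

```python
def collapse_to_present(codes):
    """Dichotomise overlap codes: moderate (1) and high (2) both count as present."""
    return {name for name, code in codes.items() if code > 0}

def jaccard_index(a, b):
    """Shared categories divided by categories present in either scale."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)

def evans_label(value):
    """Label a 0-1 coefficient using the Evans (1996) bands."""
    bands = [(0.20, "very weak"), (0.40, "weak"), (0.60, "moderate"), (0.80, "strong")]
    for upper, label in bands:
        if value < upper:
            return label
    return "very strong"

# Hypothetical consensus codes (0 = none, 1 = moderate, 2 = high) for two scales.
scale_a_codes = {"fear": 2, "worry": 1, "pain": 0, "control": 0}
scale_b_codes = {"fear": 2, "worry": 0, "pain": 1, "control": 2}

scale_a = collapse_to_present(scale_a_codes)
scale_b = collapse_to_present(scale_b_codes)
ji = jaccard_index(scale_a, scale_b)
print(ji, evans_label(ji))
```

In this toy example one category is shared out of four present in either scale, so the JI is 0.25, which falls in the weak band.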
Inter-rater reliability was calculated across the seven raters for each combination of responses using Fleiss' kappa (Fleiss, 1971) with level of agreement determined by reference to the thresholds of Landis and Koch (1977).
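For readers unfamiliar with the statistic, Fleiss' kappa can be sketched as follows. The ratings table is hypothetical (four comparisons, seven raters, three overlap categories) and is not study data; the "poor" label for negative values follows Landis and Koch (1977) and the remaining bands are those quoted in the note to Table 1.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    who assigned subject i to category j (same number of raters per subject)."""
    n_subjects = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    # Observed agreement for each subject, then averaged.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in table) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return 1.0  # every rating fell in one category; treated as full agreement here
    return (p_bar - p_e) / (1 - p_e)

def landis_koch_label(kappa):
    """Agreement label using the Landis and Koch (1977) thresholds."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "near perfect"

# Hypothetical counts per category (no / moderate / high overlap), seven raters per row.
ratings_table = [[5, 2, 0], [1, 1, 5], [0, 6, 1], [3, 3, 1]]
kappa = fleiss_kappa(ratings_table)
print(round(kappa, 3), landis_koch_label(kappa))
```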

Results
Seven content analyses were completed individually by four specialist practitioners in perinatal mental health (drawn from the disciplines of nursing, midwifery, and psychology), a mental health nursing practitioner, a medical general practitioner and a statistician. Fleiss' kappa calculated for each instrument revealed significant agreement between raters (Table 1) ranging from fair to moderate agreement with reference to the criteria of Landis and Koch (1977).
The JI overlap for each scale by each rater is summarised in Table 2. Notwithstanding variability between raters, the SPECS was found to have consistently the most overlap. The consensus content analysis is also summarised in Table 2, revealing the SPECS to have the most overlap and the OWLS the least.
The JI coefficients for the consensus content analysis are summarised in Table 3. The mean JI across scales based on the consensus content analysis was 0.247, which according to the criteria of Evans (1996) is a weak level of overlap.
The OWLS and the FOBS were observed to capture the lowest percentage similarity of the total seventy items at 26% (18 items) each, while the SPECS and W-DEQ captured the most at 67% (47 items) each. The total number of idiosyncratic items, those not represented in any other scale, was 29 (41%). The FOBS had no idiosyncratic items, whereas the SPECS had 7 (27%), the W-DEQ had 16 (48%) and the OWLS had 6 (67%).
Note to Table 1: Landis and Koch (1977) values of 0-0.20 = slight agreement, 0.21-0.40 = fair agreement, 0.41-0.60 = moderate agreement, 0.61-0.80 = substantial agreement, and 0.81-1.00 = near perfect agreement.
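The reported idiosyncratic-item percentages can be checked with simple arithmetic. The sketch below assumes the per-scale item counts given earlier in the text (FOBS 2, W-DEQ 33, OWLS 9) and that the SPECS contributed its 26 clinically suggested items, which is an inference from the stated total of 70 items rather than something the text states directly.

```python
# Item counts per scale; the SPECS value of 26 is an assumption (see lead-in).
items = {"FOBS": 2, "W-DEQ": 33, "SPECS": 26, "OWLS": 9}
# Idiosyncratic items per scale as reported in the Results.
idiosyncratic = {"FOBS": 0, "W-DEQ": 16, "SPECS": 7, "OWLS": 6}

total_items = sum(items.values())
total_idiosyncratic = sum(idiosyncratic.values())

# Percentage of each scale's items that appear in no other scale.
pct_by_scale = {
    scale: round(100 * idiosyncratic[scale] / items[scale]) for scale in items
}
pct_overall = round(100 * total_idiosyncratic / total_items)

print(total_items, total_idiosyncratic, pct_overall)
print(pct_by_scale)
```

Under these assumptions the totals come to 70 items and 29 idiosyncratic items (41%), and the per-scale percentages reproduce the 27%, 48% and 67% reported above.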
Comparison between items across scales in relation to degree of similarity (high vs. moderate overlap vs. no overlap) is summarised in Figure 1.

Discussion
The findings from the current investigation raise a number of questions concerning the measurement of FoC. The weak level of overlap between measures is highly indicative that the tools are not interchangeable, thus confirming a significant inherent source of error if comparisons are made between studies based on different FoC measures. These findings are thus consistent with the assertion of Fried (2017) that assumed interchangeability of measures is erroneous. This is fundamentally important because of the suggestion that assumed interchangeability is a potential contributory factor to the replicability crisis (Fried, 2017). Moreover, these findings are consistent with  and Slade et al. (2020), which revealed, from women's perspectives, that FoC symptoms are not comprehensively captured by any one instrument. However, the current study does indicate that in terms of overlap across scales, as assessed by the JI, the SPECS would seem to be the current tool of preference, capturing the larger component of symptoms across the total seventy items of all scales combined. It is noteworthy that though the W-DEQ captured an identical absolute percentage of symptoms across scales as the SPECS (67%), the W-DEQ is also a longer measure and, more importantly, had a much larger percentage of idiosyncratic items than the SPECS (48% vs. 27%). Further supportive evidence for the preference for the SPECS can be inferred from the individual content analyses which, though exhibiting a degree of variability between raters, also consistently found the SPECS to offer the most overlap across tools.
The caveat in suggesting the SPECS is the preferred measure from the four evaluated in the current study must be the weak level of overlap between scales, which suggests the development of an experience-informed, theoretically-grounded and psychometrically robust measure of FoC is a pressing contemporary need, particularly given the pre-eminence of accurate assessment within the establishment of clinical pathways. The OWLS, in contrast to the SPECS, had very little overlap and the highest percentage of idiosyncratic items. This should perhaps not be an entirely surprising finding as, uniquely among the four tools, it was never designed to be a measure of FoC. However, it is also important to reflect that the OWLS was selected by a review of measures and evaluated by women with FoC to be a measure which represented their experiences. The SPECS, in contrast, was designed intrinsically to assess FoC and followed a robust instrument development process (Slade et al., 2016); however, the findings from  and Slade et al. (2020) in terms of the diversity of the four tools and women's experiences would indicate that the SPECS does not assess the core aspects of FoC comprehensively. Indeed, as highlighted by Slade et al. (2020), none of the four instruments was optimal in terms of content validity, understanding and acceptability from the woman's perspective.
The study had one important limitation. The approach to content analysis taken is relatively novel and therefore findings must be tempered within the parameters of an approach which has yet to penetrate the mainstream literature. Further, although the statistical analysis undertaken was sophisticated, the use of content analysis and multiple raters is an established approach and, it is hoped, also contextually sensitive to the qualitative research (Slade et al., 2019, 2020) which underpinned the approach taken for the current study.
Given that even the SPECS was observed to have inherent limitations in assessing all key aspects of FoC, future research to develop a definitive measure of FoC that addresses these deficits is suggested. This may meaningfully incorporate the elements of the measures in the current study that both overlap and are appraised by women to be representative and sensitive to their individual experience. Further, since the approach to content analysis undertaken was found to be both useful and insightful, application to other aspects of perinatal mental health, such as anxiety, and perinatal wellbeing, such as quality of life, is also suggested.
Finally, in highlighting issues related to tools that have been either specifically designed to assess FoC or suggested for that purpose, it is useful to consider that a further crucial need is to develop an evidence-based and universally agreed definition of FoC in which measures can be conceptually grounded. Indeed, the recently published consensus statement by Jomeen and colleagues emphasises the potentially negative clinical implications of the current rudimentary theoretical and knowledge base regarding FoC. Central to this is the impact on adequacy of screening, assessment and intervention, which have been emphasised as key areas of pressing future research. The consensus statement also highlights the use of measures and an understanding of their implicit measurement characteristics.
In summary, the current study took a methodologically novel approach to reflect upon and consider a fundamental component of the FoC literature, namely the accurate and appropriate measurement of the concept by existing measures. Principally informed by qualitative research, our quantitative approach has indicated not only a preference for the SPECS among the instruments evaluated but also highlighted the limitations of the same.

Notes
1. The use of the 26 items from the SPECS for FoC assessment and the use of the measure in one site in the UK comes from personal communication with the SPECS study lead author.
2. Personal communication with Dr Fried supported the adoption of multiple content analyses to offer enhanced rigour in terms of inter-rater reliability.