A validation and calibration process for self-reported tobacco use with participants’ cotinine levels: an example from the Building Blocks trial

Introduction: Reducing smoking in pregnancy was a primary outcome in our Building Blocks trial of the Family Nurse Partnership [1]. We calibrated maternal reports of smoking using cotinine values derived from urine samples to assess tobacco use [2]. This involves identifying the extent to which an individual accurately reports smoking and requires complete and synchronized data collection over time. However, some urine samples may be missed or collected at a different time from the self-report (non-synchronized). Methods: We used statistical validation processes to address both non-synchronized and incomplete data. First, we examined consistency in reporting behaviours at baseline and follow-up for participants grouped by the extent of non-synchronization between collection times. Second, we used data from complete cases to infer values for mothers with missing urine samples at follow-up. We then used a Markov chain transition rate matrix, constructed from the complete cases, to assess the robustness of such inferences. Results: Maternal under- and over-reporting of smoking were consistent across the 870 participants grouped by different levels of non-synchronized data collection (Breslow-Day test: p=0.24; Chi-squared test: p=0.69). Using participants’ baseline reporting behaviours to infer their follow-ups provided comparable smoking outcomes (4.5 cigarettes per day with SD of 5.5) to the simulated counterparts (4.5 cigarettes per day with SD


INTRODUCTION
The Family Nurse Partnership (FNP) is a licensed intensive home-visiting intervention developed in the USA that involves up to 64 structured home visits by specially recruited and trained family nurses, from early pregnancy until the child's second birthday. The Building Blocks trial [1] aimed to assess the effectiveness of FNP in England, for adolescent first-time mothers. This pragmatic, nonblinded, randomised controlled, parallel-group trial in community midwifery settings was carried out within 18 partnerships between local authorities and primary and secondary care organisations in England.
In the Building Blocks trial [1], one of the primary outcomes was the effectiveness of FNP in reducing smoking during pregnancy. We collected self-reported data on the number of cigarettes smoked per day, at trial baseline and late in pregnancy. Self-report is an effective method of data collection in terms of time, efficiency and feasibility, and its accuracy can be reasonable within research studies, especially for less sensitive health behaviours. However, self-reported smoking can be inaccurate, and some participants are likely to report smoking fewer (or more) cigarettes than they actually do [2,3]. Recall bias, the difficulty in remembering information over time, could randomly cause both over-reporting and under-reporting [4,5]. Other factors tend to lead to bias through under-reporting only; for example, younger people and pregnant women are more likely than others to answer in ways that are congruent with social norms (social desirability bias) [6,7,8,9]. Misreporting of self-reported smoking is not trivial in prevalence. In one trial with adolescent participants, 30% either under-reported, over-reported or both [4]. In another study, almost 26% of pregnant women reporting themselves as non-smokers were subsequently classified as smokers after validation by serum cotinine measurement [8].
The accuracy of self-reported health behaviours has been discussed widely, and the use of validation techniques and objective measures is strongly recommended to minimize bias [10]. For self-reported smoking data, several biochemical measures are available as validation techniques [11-17]. In the Building Blocks trial, we used urinary cotinine levels to supplement the information gained from participants' self-reported behaviours [1]. We used the analytical approach of Dukic et al. [2] to combine self-report and biochemical information and so increase the precision of the smoking measurements. This calibration approach requires both complete and well-synchronized collection of self-reports and urine samples, i.e., the self-reported number of cigarettes and the urine sample should be collected on the same day. This is especially important among pregnant women, as their smoking patterns are known to fluctuate over short periods of time [18,19].
In the Building Blocks trial we achieved high levels of synchronization in data collection for self-report and cotinine measures at baseline, as both were obtained during a single face-to-face interview. However, we faced greater challenges in synchronized collection at follow-up in late pregnancy, when separate collection approaches were necessary (women self-reported their smoking during telephone interviews and were asked to return urine samples by post). For most participants, the follow-up self-report and urine sample collection occurred on different dates. In addition, there were more missing data at follow-up. Some participants provided only self-reports, while some had neither self-reports nor urine samples. These issues created the risk of losing participants from the analyses (loss of power) and of potential bias.
This research aimed to provide a pragmatic solution that facilitates the calibration of self-reported tobacco use and retains adequate power without introducing undue bias. We propose an integrated validation and calibration process to tackle the data incompleteness and non-synchronization issues.

Calibration approach
The calibration approach we adopted from Dukic et al. [2] can be summarized in five steps. The first two steps were to calculate the cotinine-weighted number of cigarettes ($N_{\text{cot}}$), based on the participant's cotinine level, and the self-report-weighted number of cigarettes ($N_{\text{self}}$), based on the participant's self-report of the number of cigarettes smoked on each of the 3 days prior to interview. The third step was to classify the participants into four reporting groups (over-reporter, accurate reporter, under-reporter and extreme under-reporter) by comparing their $N_{\text{cot}}$ with their $N_{\text{self}}$. Over-reporters were participants whose cotinine level was at least 30% less than that expected according to their self-reported average number of cigarettes smoked. Accurate reporters had urine cotinine levels within 30% of the level expected for their self-reported average. Under-reporters had cotinine levels 30%-80% greater than expected for their self-reported average, while extreme under-reporters had cotinine levels >80% greater than expected for their self-reported average. At step four, for each reporting group, we calculated the average difference between the self-reported number of cigarettes and the number based on cotinine samples. Finally, for each participant, these average differences were used to calibrate the self-reported number of cigarettes (details of this approach are provided in supplementary materials S1).
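Steps three to five can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of Dukic et al. [2]: it assumes that $N_{\text{cot}}$ and $N_{\text{self}}$ have already been computed (the weighting formulas are in supplementary materials S1 and are not reproduced here), that the 30% and 80% thresholds are applied to the relative difference $(N_{\text{cot}} - N_{\text{self}})/N_{\text{self}}$, and that the calibration adds the group mean difference to each self-report. The function names, the handling of boundary values and the treatment of a zero self-report are all assumptions made for the example.

```python
from statistics import mean


def classify_reporter(n_self, n_cot):
    """Assign a reporting group from the relative gap between the
    cotinine-based and self-reported cigarette counts."""
    if n_self == 0:
        # Assumption: the ratio is undefined for a zero self-report, so any
        # cotinine-based smoking is treated as extreme under-reporting.
        return "accurate" if n_cot == 0 else "extreme_under"
    rel = (n_cot - n_self) / n_self
    if rel <= -0.30:      # cotinine at least 30% below expectation
        return "over"
    if rel < 0.30:        # within 30% of expectation
        return "accurate"
    if rel <= 0.80:       # 30% to 80% above expectation
        return "under"
    return "extreme_under"  # more than 80% above expectation


def calibrate(records):
    """records: list of (n_self, n_cot) pairs.
    Returns one calibrated self-report per participant."""
    groups = [classify_reporter(s, c) for s, c in records]
    # Step 4: mean (cotinine-based minus self-reported) difference per group.
    diffs = {}
    for g, (s, c) in zip(groups, records):
        diffs.setdefault(g, []).append(c - s)
    mean_diff = {g: mean(v) for g, v in diffs.items()}
    # Step 5: shift each self-report by its group's mean difference
    # (floored at zero, an assumption to avoid negative cigarette counts).
    return [max(0.0, s + mean_diff[g]) for g, (s, _) in zip(groups, records)]


# Hypothetical participants: (self-reported, cotinine-based) cigarettes per day.
example = [(10, 6), (10, 10), (10, 11), (10, 15), (10, 25), (0, 0)]
print([classify_reporter(s, c) for s, c in example])
print(calibrate(example))
```

Because the correction is a group-level average, the calibrated values are best read as population-level estimates rather than as accurate individual measurements.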

Study design and participants
The Building Blocks trial was originally registered with ISRCTN (number ISRCTN23019866). Completeness and synchronization of data collection were well achieved at baseline (95.7% of data were completely collected, and 96.1% of these were collected on the same day). We therefore focused on non-synchronization at follow-up. We first grouped the trial participants according to the completeness of their data collection at follow-up. Participants who had both a self-report and a cotinine sample collected at follow-up were categorized as full data cases, for whom only the non-synchronization issue existed. Participants with only self-reports at follow-up formed the partial data cases (incomplete data collection). Participants who had neither a self-report nor a urine sample at follow-up were categorized as insufficient data cases (not valid for analysis).

Statistical analysis
A participant flow chart for the validation and calibration process is presented in Figure 1. Full data cases were further divided into four core groups (Core 1 to Core 4) according to the gap between self-report and urine sample collection dates. Within each core group, participants were cross-tabulated by reporting behaviour at baseline and follow-up; homogeneity of reporting behaviours across core groups was assessed with the Breslow-Day test, and the baseline/follow-up association was evaluated with the Mantel-Haenszel test. A contingency table of reporting behaviour shifting patterns was also analysed with a Chi-squared test of homogeneity. For partial data cases, we tested the robustness of inferring participants' follow-up reporting behaviours from their baseline counterparts. We first directly imputed the follow-up reporting behaviours from their baseline counterparts and calculated the calibrated tobacco use. We then conducted a Monte Carlo simulation to impute the follow-up reporting behaviours using the Markov chain transition rate matrix constructed from the full data cases. Using multiple imputation, ten simulated data sets were generated, and the pooled results for calibrated tobacco use were calculated and compared with the results from direct imputation to assess robustness. R version 3.4 and SPSS 20 were used for the analysis.
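To make the homogeneity check concrete, the following is a minimal Python sketch of a Breslow-Day test across strata of 2x2 tables (baseline by follow-up reporting, one table per core group). It is not the code used in the trial, which was analysed in R and SPSS. It assumes the uncorrected Breslow-Day statistic (no Tarone adjustment), tables without zero cells, and a hand-written chi-squared tail function; the function names and the example counts are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        s, q = 1.0, math.exp(-y)              # Q(1, y) = exp(-y)
    else:
        s, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y) = erfc(sqrt(y))
    while s < df / 2.0:
        # Recurrence Q(s + 1, y) = Q(s, y) + y^s exp(-y) / Gamma(s + 1).
        q += y ** s * math.exp(-y) / math.gamma(s + 1.0)
        s += 1.0
    return q


def breslow_day(strata):
    """strata: list of 2x2 tables ((a, b), (c, d)), one per stratum.
    Returns (statistic, p_value) for the hypothesis of a common odds ratio."""
    # Mantel-Haenszel estimate of the common odds ratio.
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in strata)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in strata)
    psi = num / den
    stat = 0.0
    for (a, b), (c, d) in strata:
        n1, n2, m1 = a + b, c + d, a + c
        # The expected top-left count A under odds ratio psi solves
        # A * (n2 - m1 + A) = psi * (n1 - A) * (m1 - A).
        qa = 1.0 - psi
        qb = n2 - m1 + psi * (n1 + m1)
        qc = -psi * n1 * m1
        if abs(qa) < 1e-12:
            exp_a = -qc / qb                  # linear case when psi = 1
        else:
            root = math.sqrt(qb * qb - 4.0 * qa * qc)
            lo, hi = max(0, m1 - n2), min(n1, m1)
            candidates = ((-qb + root) / (2 * qa), (-qb - root) / (2 * qa))
            exp_a = next(r for r in candidates if lo <= r <= hi)
        var = 1.0 / (1 / exp_a + 1 / (n1 - exp_a)
                     + 1 / (m1 - exp_a) + 1 / (n2 - m1 + exp_a))
        stat += (a - exp_a) ** 2 / var
    return stat, chi2_sf(stat, len(strata) - 1)


# Hypothetical strata (rows: baseline positive/negative reporter,
# columns: follow-up positive/negative reporter); not the trial's Table 1.
hypothetical = [((15, 9), (8, 17)), ((200, 110), (105, 205)),
                ((48, 25), (24, 49)), ((17, 10), (9, 19))]
statistic, p_value = breslow_day(hypothetical)
print(f"Breslow-Day statistic = {statistic:.3f}, p = {p_value:.3f}")
```

A large p-value in such a test is compatible with a common baseline/follow-up odds ratio across strata, which is the condition used here to justify pooling the core groups.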

RESULTS
Of the 1618 participants in the Building Blocks study, 526 were insufficient data cases that provided too little information and were therefore excluded from the analysis. Of the remaining 1092 participants, 870 were full data cases and 222 were partial data cases (Figure 1).

Non-synchronization in data collection at follow-up
To check if the duration of the time lag between the urine sample collection date and interview date caused heterogeneities in smoking outcomes, we divided the participants into the following four core groups (Figure 1): Core 1: Participants whose late pregnancy interview date and urine sample collection were on the same day (49 in total). Core 2: Participants whose interview date and urine sample date were within 2 weeks (620 in total). Core 3: Participants whose interview date and urine sample date were more than 2 weeks, but less than 4 weeks apart (146 in total). Core 4: Participants with greater than 4 weeks lag between interview and urine sample date (55 in total). Applying Dukic and colleagues' calibration approach [2], these 870 participants were cross-tabulated according to their reporting behaviours (over-reporter, accurate reporter, under-reporter and extreme under-reporter) at baseline and follow-up (Supplementary materials S2).
To apply the Breslow-Day test for homogeneity, we collapsed these reporting behaviours into two simplified groups: accurate and over-reporters formed the positive reporters, and under- and extreme under-reporters formed the negative reporters (Table 1). The Breslow-Day test on Table 1 showed that the four core groups had homogeneous reporting behaviours (p=0.24). This implied that the impact of non-synchronization in data collection on reporting behaviour patterns was negligible, supporting the combination of all 870 participants for analysis. Additionally, the Mantel-Haenszel chi-squared test showed that participants' baseline and follow-up reporting behaviours were associated (p<0.01).
Table 1. Simplified reporting behaviours at baseline and follow-up for Core 1 to Core 4. In each core, participants were classified as positive or negative reporters at baseline and follow-up.
To investigate the homogeneity of shifts in reporting behaviour between the four core groups, participants were regrouped into three categories: consistent reporters, positive-shifting reporters and negative-shifting reporters (Table 2). Consistent reporters were participants whose reporting behaviours at follow-up were the same as at baseline (second row). Positive-shifting reporters were participants who became more positive in reporting at follow-up (first row), while negative-shifting reporters became more negative in reporting at follow-up (reporting less smoking in self-report; third row). The Chi-squared test on Table 2 showed that the proportions of participants with different reporting behaviour shifts were homogeneous across the four core groups (p=0.69). This evidence also supported the validity of combining the 870 participants.
In the Building Blocks trial, the impact of the intervention on participants' reporting behaviours was also of scientific interest [1], and we carried out an additional analysis to investigate this. The results (Supplementary material S3) showed that 263 of 431 (61.0%) participants who received usual care were consistent reporters, compared with 272 of 439 (62.0%) participants who received the FNP intervention. The Breslow-Day test (Supplementary material S4) indicated no significant treatment effect on participants' reporting behaviours (p=0.92).
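The Chi-squared test of homogeneity applied to the shifting categories can be illustrated with a short Python sketch. This is a sketch under stated assumptions rather than the analysis code: the function name is hypothetical, the closed-form tail probability used here is valid only for even degrees of freedom (a 3 x 4 table gives df = 6), and the cell counts below are invented for illustration. Only the margins of the example table (154, 535 and 181 by shifting category; 49, 620, 146 and 55 by core group) come from the text; the cells are not the trial's Table 2.

```python
import math


def chi2_homogeneity(table):
    """table: rows = shifting category, columns = core group (observed counts).
    Returns (statistic, df, p_value) for the Pearson chi-squared test."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    # Sum of (observed - expected)^2 / expected, expected = row * col / n.
    stat = sum((obs - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_tot)
               for obs, ct in zip(row, col_tot))
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df % 2:
        raise ValueError("this sketch only handles even degrees of freedom")
    # Closed-form upper tail of the chi-squared distribution for even df.
    y = stat / 2.0
    p = math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
    return stat, df, p


# Invented cell counts (rows: positive-shifting, consistent, negative-shifting;
# columns: Core 1 to Core 4). Margins match the text, cells do not.
hypothetical_table = [
    [9, 110, 26, 9],
    [30, 381, 90, 34],
    [10, 129, 30, 12],
]
stat, df, p = chi2_homogeneity(hypothetical_table)
print(f"chi-squared = {stat:.3f}, df = {df}, p = {p:.3f}")
```

A table whose columns are close to proportional, as in this invented example, gives a small statistic and a large p-value, which is the pattern reported for Table 2.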

Incompleteness in data collection at follow-up
To address the incompleteness of data collection in partial data cases (222 participants with missing cotinine levels at follow-up), the reporting behaviours of the 870 full data cases were tabulated in Table 3. In terms of shifts in reporting behaviour, 535 (61.5%) participants were consistent reporters (diagonal entries), while 154 (17.7%) participants became positive-shifting reporters (entries below the diagonal) and 181 (20.8%) became negative-shifting reporters (entries above the diagonal). We then used this table as a transition rate matrix to test the robustness of assuming that the partial data cases were consistent reporters. The first step was to use these participants' baseline reporting behaviours to directly impute their follow-up reporting behaviours under the consistent reporter assumption. We then calculated their calibrated self-reported tobacco use, which averaged 4.5 (SD=6.0) cigarettes per day. In the second step, we used Table 3 as the benchmark transition rates to carry out multiple imputation of follow-up reporting behaviours. For example, if participant A was an over-reporter at baseline, then following the first row of the table, her probability of being an over-reporter at follow-up would be 8/(8+15+4+12)=20.5%. Similarly, we could work out the probability of her becoming an accurate reporter (15/(8+15+4+12)=38.5%), an under-reporter (4/(8+15+4+12)=10.3%) or an extreme under-reporter (12/(8+15+4+12)=30.8%). The reporting behaviour of participant A at follow-up was then allocated by Monte Carlo simulation using these transition probabilities, i.e., we would allocate participant A as an over-, accurate, under- or extreme under-reporter with probabilities of 20.5%, 38.5%, 10.3% and 30.8% respectively. For the multiple imputation process, ten simulated data sets were generated with imputed reporting behaviours at follow-up.
The pooled mean and standard deviation of the calibrated number of cigarettes were then calculated, giving an average of 4.5 cigarettes per day (SD=5.5). These figures were consistent with the results from step one, supporting the robustness of the consistent reporter assumption.
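The transition probabilities and the Monte Carlo allocation described above can be sketched in Python as follows. This is a minimal sketch, not the trial's R code: the function names, the random seed and the number of hypothetical participants are assumptions, and only the over-reporter row of Table 3 (counts 8, 15, 4 and 12) is used because it is the only row quoted in the text.

```python
import random

GROUPS = ["over", "accurate", "under", "extreme_under"]


def transition_probs(row_counts):
    """Convert one row of transition counts into transition probabilities."""
    total = sum(row_counts)
    return [count / total for count in row_counts]


def impute_follow_up(baseline, matrix, rng):
    """Draw a follow-up reporting group for each participant, using the row
    of transition counts that matches her baseline reporting group."""
    return [rng.choices(GROUPS, weights=matrix[b])[0] for b in baseline]


def multiple_imputation(baseline, matrix, n_sets=10, seed=0):
    """Generate n_sets simulated data sets of follow-up reporting groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [impute_follow_up(baseline, matrix, rng) for _ in range(n_sets)]


# Row of Table 3 for baseline over-reporters, as quoted in the text.
over_row = [8, 15, 4, 12]
print([round(100 * p, 1) for p in transition_probs(over_row)])
# -> [20.5, 38.5, 10.3, 30.8]

# Hypothetical use: 100 participants who were over-reporters at baseline.
# A full analysis would supply one row of counts per baseline group.
matrix = {"over": over_row}
data_sets = multiple_imputation(["over"] * 100, matrix)
print(len(data_sets), len(data_sets[0]))
```

In a full analysis, each simulated data set would be passed through the calibration step, and the resulting means and standard deviations would be pooled across the ten data sets before comparison with the direct imputation result.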

DISCUSSION
To the best of our knowledge, this is the first study to tackle the non-synchronization issue in smoking data analysis with a formal testing procedure. For the Building Blocks study, we demonstrated no heterogeneity in reporting behaviours among participants with differing extents of non-synchronization in smoking data collection. This result supported our decision to combine these participants in the main trial analysis.
We also formally tested the robustness of assuming consistent reporting behaviours among participants with partial data, by borrowing strength from the transition rate matrix constructed from the full data cases. The underlying assumption was that the partial data cases and the full data cases did not differ in the consistency of their reporting over time. Again, the results supported the validity of this assumption for our study.
There were 526 participants with neither urine samples nor self-reports collected at follow-up; these participants were not included in the main trial analysis. One theoretical analytic option for this participant group would be to impute both their follow-up cotinine levels and self-reports from their baseline values. This approach was clearly not appropriate in the Building Blocks trial, as the intervention was hypothesised to reduce participants' smoking activities. In observational studies of smoking, direct imputation may be a reasonable approach.
At the baseline interview, about 50% of participants were negative reporters (either under-reporters or extreme under-reporters). This high rate of underestimation suggests that reporting smoking carries strong social undesirability in this population group (teenage women experiencing their first pregnancy, aged 16.9-18.8).
In the Building Blocks trial [1], we found no difference between the treatment and control groups in the proportion of smokers or the average number of cigarettes smoked per day. Our additional analysis assessed the impact of the intervention on participants' reporting behaviours at follow-up; the absence of a significant difference in reporting behaviours across trial arms provides additional support for the main trial conclusions.
There are limitations to this study. The calibration approach we employed [2] did not consider participants' demographic and metabolic heterogeneity. It has been shown that women differ in how quickly they metabolise nicotine, and this may also change over time during pregnancy [20]. Therefore, the interpretation of the results should be restricted to the population level rather than the individual level. With 1618 participants, the Building Blocks study has strengthened our understanding of smoking behaviours in young women with first-time pregnancies; however, we should be cautious in extrapolating the results to other populations.