Models of mortality rates – analysing the residuals

ABSTRACT The area of mortality modelling has received significant attention over the last 25 years owing to the need to quantify and forecast improving mortality rates. This need is driven primarily by the concern of governments, insurance and actuarial professionals and individuals to be able to fund their old age. In particular, to quantify the costs of increasing longevity we need suitable model of mortality rates that capture the dynamics of the data and forecast them with sufficient accuracy to make them useful. In this article, we test several of the leading time series models by considering the fitting quality and in particular, testing the residuals of those models for normality properties. In a wide ranging study considering 30 countries we find that almost exclusively the residuals do not demonstrate normality. Further, in Hurst tests of the residuals we find evidence that structure remains that is not captured by the models.


Introduction
The rapid growth in the development of models of mortality designed to capture patterns in mortality data and accurately forecast and quantify future mortality rates has been dramatic. Over the recent decades, life expectancy in developed countries has risen to historically unprecedented levels and there is clearly a need from a demographic, financial, social and actuarial perspective to understand and predict these improvements for the future. The prospects of future reductions in mortality rates are of fundamental importance in various areas such as public health and old age care planning, social insurance planning, welfare benefit forecasting and economic policy. Over recent years, significant progress has been made in mortality forecasting (for reviews see Booth and Tickle 2008;Plat 2009;O'Hare and Li 2012) with the most popular approaches to longterm forecasting being based on the Lee and Carter (1992) model. A time series model, it describes the movement of age-specific mortality as a function of a latent level of mortality, also known as the overall mortality index, which can be forecasted using simple time-series methods. The method was initially used to forecast mortality in the US, but since then has been applied to many other countries (amongst others see Carter and Prskawetz 2001;Lee and Miller 2001;Booth, Maindonald, and Smith 2002;Brouhns, Denuit, and Vermunt 2002;Renshaw and Haberman 2003;Koissi, Shapiro, and Hognas 2005).
The success of the Lee Carter model can be seen in the number and variety of mortality models that extend the Lee Carter approach (see O'Hare and Li 2012) for examples of these extensions. One thread of extensions to the Lee Carter model involves including additional latent age and period effects with the objective of better fitting the data, producing a less simplistic correlation structure between ages and capturing the non-linear profile of mortality data. This has led for example to the models of Renshaw and Haberman (2003), Cairns, Blake, and Dowd (CBD)(2006, Cairns et al. (2009), Plat (2009 and O'Hare and Li (2012) for example. These models extend the Lee Carter approach by including additional period effects and in some cases cohort effects and improve upon each other by producing better fits to the data and in the main better forecasts. In the literature, however, there has been limited attempts to test the fitting of such models. The majority of papers calculate point estimates of the average errors produced between the fitted and actual rates using one of several measures (for example root mean square errors, mean average percentage errors, etc.). There has been little work looking at the patterns of such errors.
One such paper that considered the shape of the residuals in a range of mortality models is that of Dowd et al. (2010) where the authors assess the residuals for normality carrying out several tests of the mean, variance and skewness of the residuals. Dowd et al. (2010) fitted a range of models, primarily the Lee and Carter (1992) model and a selection of CBD models to data and then after calculating the in sample forecasts they derived standardized residuals from the forecasts and tested these for normality. Their paper concluded that none of the models considered performed well under these tests. In this article, we extend and modify this work in three ways. First, rather than forecasting and testing the derived residuals we calculate the residuals directly from the fitted models. This will enable us to test the model to ensure that all of the structure of mortality is being captured prior to forecasting. Second, we extend the work by considering several multi factorial models, namely Plat (2009) andO'Hare andLi (2012), that were not considered in the previous study. Finally, in addition to the normality tests we calculate Hurst exponents for each of the residual time series for each country and gender to test for the presence of autocorrelations within the period or age dimensions of the residuals.
The article is organized as follows. Section 2 presents a brief review of extrapolative models such as the Lee-Carter model and its extensions. Section 3 discusses the data we have used in this study. In section 4, we discuss the methodology we use to test the residuals for normality and in section 5 we present the results of our analysis and discuss some of the implications for pricing, forecasting and the making of policy. Finally, Section 6 concludes with some ideas for further research.

Lee-Carter and its variants
The current leading method for forecasting mortality rates is the stochastic extrapolation approach. In this method, data is first transformed (by taking natural logarithms) and then analysed using statistical methods to identify and extract patterns. These patterns are then forecast using well-known time series approaches. The resulting forecasts are then used to predict future mortality rates. The first and most well-known stochastic mortality model of this type is the Lee and Carter (1992) model. Based on US data, the model uses a stochastic, time series framework to identify a single-period effect pattern in the natural logarithm of mortality rates. This linear trend over time is extracted and using Box-Jenkins an appropriate ARIMA processes is fitted to the data (a random walk with drift in each case). The random walk with drift is forecast and resulting future mortality rates predicted. Also known as a one factor or one principle component approach, the model became a benchmark and underlined a new approach to modelling mortality rates for several reasons: the model has an extremely simple structure and so is very easy to communicate; and the use of the random walk with drift enabled the authors not only to predict the expected future mortality rates but also to visualize the uncertainty associated with the predictions. The Lee-Carter model, outlined below includes two age-dependent parameters a x and b x which, respectively, represent the intercept and gradient for the log mortality rate at each age and the time or period trend κ t which is forecast using a random walk with drift: (1) where a x and b x are age effects and κ t is a random period effect. The model is known to be over-parameterized and applying the necessary constraints as in the original Lee and Carter (1992) paper leads to an estimate of a x of ln m x;t : In the original paper, the bilinear part b x κ t of the model specification was determined as the first singular component of a singular value decomposition (SVD), with the remaining information from the SVD considered to be part of the error structure. The κ t were then estimated and refitted to ensure the model mapped onto historic data. Finally, the subsequent time series κ t was used to forecast mortality rates.
Despite the attractiveness of the model's simplicity it has several weaknesses. Among many discussions of the Lee-Carter model, Cairns, Blake, and Dowd (2006), Cairns et al. (2009), and(2011) summarized the main disadvantage of the model as having only one factor, resulting in mortality improvements at all ages being perfectly correlated (trivial correlation structure). They also note that for countries where a cohort effect is observed in the past, the model gives a poor fit to historical data. The uncertainty in future death rates is proportional to the average improvement rate b x which for high ages can lead to this uncertainty being too low, since historical improvement rates have often been lower at high ages. Also, the model can result in a lack of smoothness in the estimated age effect b x .
Despite the weaknesses of the Lee-Carter model its simplicity has led to it being taken as a benchmark against which other stochastic mortality models can be assessed. There has been a significant amount of literature developing additions to, or modifications of, the Lee-Carter model. For example Booth, Maindonald, and Smith (2002), Brouhns, Denuit, and Vermunt (2002), Lee and Miller (2001), Girosi and King (2008), De Jong and Tickle (2006), Delwarde, Denuit, and Eilers (2007) and Haberman (2003, 2006). In this article, we consider four models from the time series mortality modelling literature. The Lee and Carter (1992) Table 1. We have selected a range of simple factor models and larger multifactorial models to see if the addition of latent factors affects the residuals in any way.
For a review of the main extensions and modifications of the Lee-Carter model, the interested reader is directed to O'Hare and Li (2012).

Data
The data that we use in this article comes from the Human Mortality Database. 1 The data available for each country includes number of deaths D x;t and exposure to death E x;t for lives aged x last birthday during year t. We can use this to gain a proxy for the central mortality rate for lives aged x during year t as: The data provides an estimate of the true mortality due to issues with the recording of data. Death data tends to be recorded accurately, with death certificates in most cases. However, exposure data are taken from census data which may only be accurately recorded every 5 or 10 years adjusting these figures for migration, deaths and births, etc. The resulting mortality estimates are therefore quite noisy, particularly at the older ages where there is less data available. Data are available going back to the mid nineteenth century in some cases but we have restricted this study to data from 1960 to 2009 in order to have a consistent period across all countries. This has resulted in the 30 countries we have considered in this article. The countries along with their 3 letter codes are outlined in Table 2.
The wide ranges of countries give a good spread of populations both geographically and in terms of economic development. The inclusion of Male and Female data also enables gender differences to be considered. We focus on the age range 20-89 for several reasons. First, the models upon which we Table 1. The stochastic mortality models.

Name
Model and Name M1 Lee and Carter (1992) The models selected form a sample of the existing time series models in the literature and represent models with both small and large numbers of factors. have based our comparisons are also fitted to this age range. Secondly, and as identified by Currie (2011), data at the older ages provide additional problems in terms of the reliability. Indeed in several cases mortality rates determined using older data appear to fall sharply beyond age 95.

Methodology
We begin by fitting each of the models considered to the data above for the 30 countries considered and for both males and females. In this article, we will consider the four models Lee and Carter (1992), CBD(2006), Plat (2009) and O'Hare and Li (2012). We fit the models using a maximum likelihood approach using code developed in R and publicly available for several of the models. 2 The results of fitting are assessed and presented using three point measures of fit quality outlined below. The average error, E1this equals the average of the standardized errors, this is a measure of the overall bias in the projections. The average absolute error, E2this equals the average of absolute value of the standardized errors, this is a measure of the magnitude of the differences between the actual and projected rates. The standard deviation of the error, E3this equals the square root of the average of the squared errors, where X 1 and X 2 and the age limits of our sample X 1 ¼ 20 and X 2 ¼ 89, and T ¼ 60 is the number of years of data we have in our sample. The models are fitted by assuming that death rates are drawn from a Poisson distribution with parameter given by E x;t m x;t . We then calculate the corresponding fitted mortality ratesm x;t and calculate the standardized residuals using the following formulam This approach to calculating the residuals is consistent with that of the Dowd et al. (2010) paper and should represent samples drawn from a standard normal distribution if indeed the residuals are reflecting no more than random noise. The tests used in this section aim to identify whether the mortality residuals described above are consistent with i.i.d. Nð0; 1Þ. We carry out the following tests on the matrix of mortality residuals: • A t-test of the prediction that their mean should be 0. • A variance ratio (VR) test of the prediction that the variance should be 1 (see Cochrane, 1988;Lo andMacKinley, 1988, 1989), and • A Jarque Bera normality test based on the skewness and kurtosis predictions (see Jarque and Bera, 1980).
In addition, we calculate Hurst exponent, H, for each of the time series extracted from the residuals. The Hurst exponent is referred to as the 'index of dependence' or 'index of long-range dependence'. It quantifies the relative tendency of a time series either to regress strongly to the mean or to cluster in a direction. A value of H in the range 0:5 < H < 1 indicates a time series with long-term positive autocorrelation, meaning both that a high value in the series will probably be followed by another high value and that the values a long time into the future will also tend to be high. A value in the range 0 < H < 0:5 indicates a time series with long-term switching between high and low values in adjacent pairs, meaning that a single high value will probably be followed by a low value and that the value after that will tend to be high, with this tendency to switch between high and low values lasting a long time into the future. A value of H ¼ 0:5 can indicate a completely uncorrelated series, but in fact it is the value applicable to series for which the autocorrelations at small time lags can be positive or negative but where the absolute values of the autocorrelations decay exponentially quickly to zero. Given that we are expecting the residuals to be samples for a Nð0; 1Þ distribution we should not expect any correlations between residuals. In other words a Hurst exponent of 0,5 would be ideal.
The Hurst exponent, H, is defined in terms of the asymptotic behaviour of the rescaled range as a function of the time span of a time series as follows

E½
RðnÞ where; • RðnÞ is the range of the first n values, and SðnÞ is their standard deviation • E½ is the expected value • n is the time span of the observation (number of data points in a time series) • C is a constant In order to consider the Hurst exponent analysis we must apply it to a time series of residuals not a matrix of residuals. We therefore consider both the age dimension and the period dimension separately. We should not expect any correlations between residuals across age nor should we expect any across the period dimension. In the empirical section following, we present the analysis in both dimensions and comment accordingly.

Empirical analysis
In this action, we present and discuss our findings. We first show the fitting results measured using the standard E1, E2 and E3 measures of fitting quality. These are calculated as shown in the methodology section and in the main confirm the reported findings of each of the previous papers proposing the models. We follow this with a discussion of some of the residuals calculated for each of the countries in the study. We present some of the residual plots and comment on some common characteristics we find. Finally, we empirically test the residuals using a range of tests as discussed above.

Fitting the models and assessing with point estimates
We consider each of the 30 countries covered in the paper for both male and female data, fitting the models to data from 1960 to 2009. We present results below in Tables 3-5 using the three measures of error E1, E2 and E3 outlines earlier.
The results in Tables 3-5 show the fitting results across the 30 countries and the four models considered do vary significantly. Some noticeable comments include; • The results for the multifactorial models, Plat (2009) andO'Hare andLi (2012), are markedly better than those for the smaller models of Lee and Carter (1992) and Cairns, Blake, and Dowd (2006). This is to be expected given the additional parameters in the larger models. The results show that if the purpose of the exercise is purely to find the best fitting model then from these results the multi factor models do outperform. One of the questions of this research, however, is do the additional factors and additional structure lead to models which capture more of the structure of mortality. In other words do they results in residuals that conform more to the random noise we should expect.

Analysing the residuals
To test for the normality of the residuals, we follow an approach similar to that of Dowd et al. (2010). We carry out three statistical tests of the residuals. First, we test the mean of the residuals. If the residuals do represent random noise then their mean should be zero. We carry out a t-test to test for this. Similarly, the variance of the residuals should be 1, and we use the variance ratio test to test this. Finally, we test for skewness using the  Jarque Bera test. Below we plot the results of the sample mean, variance and skewness before presenting the test results. Figure 1 shows the plots of the sample means and variances for the male and female residuals after the models have been fitted. The models represented in these plots are the Lee Carter model, the Cairns, Blake and Dowd model, the Plat model and the O'Hare and Li model. As can be seen, the mean figures are above zero and the variances are significantly different from 1. Figures 2 through to 4 show the t-test results, the variance ratio test results and the Jarque Bera test results.
As can be seen for the test results in every case, the fitted models fail the basic normality tests suggesting that the residuals mean and variances do not conform to those of the standard normal distribution, nor do the higher moments. In addition, the Hurst exponent calculations show long-term positive correlation in the residuals both in the age dimension and the period dimension. This suggests that perhaps there is still some structure in the residuals that might be identified. In particular, the inclusion of additional period effects (as in the Plat 2009; and O'Hare and Li, 2013) models does not compensate for this. This is an area of further research.
Finally, Figures 5 and 6 show point plots and averages of the Hurst exponent calculations across the age dimension and the period dimension, respectively. As can be seen from the plots, and from the tables in the appendix, the Hurst exponents primarily lie between 0.5 and 1 showing that there is some autocorrelation present in both the age and period dimensions of the residuals. The Hurst calculation being larger on average across the period dimension than the age one. This suggests that far from being samples of random noise the residuals are still showing some patterns or structure in both age and period. Note also that the inclusion of more period effects (as in the Plat (2009) and O'Hare and Li (2012)) models does not diminish this effect. This is an area for further research.

Implications for pricing, forecasting and policy
The analysis in this article has shown that within the extrapolative family of models of mortality considered there can still be found some structure within Table 5. Fitting results (expressed as percentages) measured using the root mean square percentage error (E3) for Males and Females for the Lee-Carter, CBD, Plat, and OL  the residuals after fitting to the data. In particular, for all the models considered, and the vast majority of the counties considered, the models failed the tests on the normality of those residuals. For comparative purposes, the lack of normality of the residuals may be considered a minor issue, particularly when the aim is to compare the ability of various models to fit to the same mortality data. However, when using such models to forecast mortality rates, particularly long-term forecasting for welfare purposes, for pension pricing purposes or for government longer-term planning there remains a significant risk that the forecasts will be inappropriate. In particular, the prevailing trend of mortality improvement determined through the modelling process could be mis-specified. In addition, the assumption of independence in the residuals could result in forecasting confidence intervals being  narrower than they should be, and hence providing false confidence in the forecasts. In some cases, these concerns may be overcome by repeatedly refitting and forecasting at periodic intervals, for example as actuaries do when creating updated life tables, In many other cases, however, long-term forecasts are used to determine funding and resource allocations. In these cases, it would be important to appreciate the limitations of the modelling and forecasting process identified here.

Conclusions
In this article, we have considered several of the leading extrapolative models of mortality rates and have applied normality tests and Hurst calculations to the fitted residuals. More specifically we have fitted the models of Lee and Carter (1992), Cairns, Blake, and Dowd (2006), Plat (2009) andO'Hare andLi (2012) to the data for 30 countries for both males and females and tests the resulting residuals using t-tests, variance  ratio tests and Jaque bear tests. We have also calculated age and prior Hurst exponents for each of the countries and genders and note that exclusively these Hurst exponents lie in the region 0.5 < 1. This suggests some positive correlations between residuals.
Further research will now focus further on the Hurst exponents analysing in more detail the patterns found within these exponents to try to identify what if any structure still remains in the data after fitting such time series model.

Acknowledgements
A version of this article was presented at the IME Congress in Shanghai in 2014, the authors are grateful for the comments received.

Disclosure statement
No potential conflict of interest was reported by the authors.

Appendix B. Hurst Tests
The Hurst exponent calculations are done by first splitting the matrix of residuals into time series of age-specific residuals and period specific residuals. In other words, by considering the columns and rows of the matrix separately. Of course we might also consider the cohort pattern (or the diagonals) of the matrix also but we defer this to further study. The results presented below should the Hurst exponents over period and over age.