## Statistical models for mouth-level caries data

Mouth-level data resulting from the DMF index are typically analyzed as unbounded or bounded counts. For unbounded counts, a Poisson regression model, or its extension the negative binomial regression model, which accounts for overdispersion in the data, is often used. For bounded counts, a binomial regression model is often advocated.

For unbounded counts, these models assume that the underlying distribution of the data is either Poisson or negative binomial. The Poisson distribution is the simplest distribution for nonnegative discrete data and is entirely specified by one positive parameter, the mean. This mean is often related to potential explanatory variables using a log link function. Specifically, let $Y$ denote the outcome variable and $X$ the set of explanatory variables. A Poisson regression model for the mean is defined as $E(Y|X) = e^{\alpha + X\beta}$, where $\alpha$ and $\beta$ are the intercept and the regression parameter vector associated with $X$. The probability mass function of $Y$ is given by:

$$P(Y = y|X) = \frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 0, 1, \ldots,$$

where $\mu = E(Y|X)$ is the conditional mean, which depends on covariates.
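As a minimal sketch of the Poisson regression model described above, the following Python snippet evaluates the log-link mean and the probability mass function, then checks numerically that the mean and the variance coincide. The function names and the coefficient and covariate values are illustrative assumptions, not estimates from any study.

```python
import math

def log_link_mean(alpha, beta, x):
    """Conditional mean E(Y|X) = exp(alpha + x . beta) under a log link."""
    return math.exp(alpha + sum(b * xi for b, xi in zip(beta, x)))

def poisson_pmf(y, mu):
    """P(Y = y) = exp(-mu) * mu**y / y!, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

# Hypothetical intercept, slopes, and covariate values (illustration only).
mu = log_link_mean(alpha=0.5, beta=[0.3, -0.1], x=[1.0, 2.0])

# Moments from a truncated sum; the tail beyond 200 is negligible for this mu.
support = range(200)
mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
print(f"mu = {mu:.4f}, mean = {mean:.4f}, variance = {var:.4f}")
```

The printed mean and variance should agree with `mu` up to rounding, which is the equidispersion property that the Poisson model imposes.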

One major restriction of the Poisson regression model is that its variance is equal to its mean. For dental caries data, however, it is not uncommon for the variance to be much greater than the mean. For such data, a negative binomial regression model has been advocated as an alternative; it is typically used when the variability in the data cannot be properly captured by a Poisson regression model. The negative binomial model is a conjugate mixture distribution for count data (Agresti, 2002). It is entirely specified by two parameters, its mean and the overdispersion parameter. As in the Poisson regression model, the mean is related to potential explanatory variables using a log link function. However, the probability mass function of $Y$ is given by:

$$P(Y = y|X) = \frac{\Gamma(y + 1/\kappa)}{\Gamma(1/\kappa)\,\Gamma(y + 1)} \left(\frac{\kappa\mu}{1 + \kappa\mu}\right)^{y} \left(\frac{1}{1 + \kappa\mu}\right)^{1/\kappa}, \quad y = 0, 1, \ldots,$$

where $\mu = E(Y|X) = e^{\alpha + X\beta}$ is the conditional mean, which depends on covariates, and $\kappa$ is the overdispersion parameter. This distribution has variance $\mu + \kappa\mu^{2}$. Parameter $\kappa$ is typically unknown and is estimated from the data to evaluate the extent of overdispersion. When $\kappa$ tends to zero, the negative binomial model converges to the Poisson model (Agresti, 2002).
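The variance formula and the Poisson limit can be checked numerically. The sketch below assumes the common parameterization of the negative binomial distribution with mean $\mu$ and variance $\mu + \kappa\mu^{2}$; the function names and the values of `mu` and `kappa` are hypothetical choices made for illustration.

```python
import math

def negbin_pmf(y, mu, kappa):
    """Negative binomial pmf with mean mu and variance mu + kappa * mu**2 (kappa > 0)."""
    r = 1.0 / kappa
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + y * math.log(kappa * mu / (1.0 + kappa * mu))
             - r * math.log(1.0 + kappa * mu))
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """Poisson pmf, used here only as the kappa -> 0 reference."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

mu, kappa = 3.0, 0.5  # hypothetical values
support = range(400)  # truncation point; the tail is negligible here
mean = sum(y * negbin_pmf(y, mu, kappa) for y in support)
var = sum((y - mean) ** 2 * negbin_pmf(y, mu, kappa) for y in support)
print(f"mean = {mean:.4f}, variance = {var:.4f}, mu + kappa*mu^2 = {mu + kappa * mu**2:.4f}")

# Largest pointwise gap between the NB pmf with a tiny kappa and the Poisson pmf.
gap = max(abs(negbin_pmf(y, mu, 1e-4) - poisson_pmf(y, mu)) for y in range(50))
print(f"max |NB - Poisson| at kappa = 1e-4: {gap:.2e}")
```

The gap shrinks as `kappa` decreases, which illustrates why an estimate of $\kappa$ near zero indicates little overdispersion.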

The presence of an upper bound for possible values taken by DMF scores suggests a model based on the binomial rather than the Poisson distribution (Hall, 2000). Data are then viewed as being generated from a binomial process with $m$ trials and success probability $\pi_X$. Here $m$ represents the maximum number of teeth or tooth surfaces in the mouth susceptible to decay, and $\pi_X$ the probability that a tooth or tooth surface presents a sign of decay. The binomial model is given by:

$$P(Y = y|X) = \frac{\Gamma(m + 1)}{\Gamma(m - y + 1)\,\Gamma(y + 1)} (\pi_X)^{y} (1 - \pi_X)^{m - y}, \quad y = 0, 1, \ldots, m,$$

where the success probability is related to covariates as $\pi_X = \frac{e^{\alpha + X\beta}}{1 + e^{\alpha + X\beta}}$, with $\alpha$ and $\beta$ being the intercept and the regression parameter vector associated with $X$. One should note, however, that Poisson and negative binomial distributions provide a reasonable approximation to the binomial distribution in dental caries research.
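A short sketch of the bounded-count model follows, assuming a logit link for the success probability. The number of teeth at risk `m`, the coefficients, and the covariate value are hypothetical and serve only to show how the pieces fit together.

```python
import math

def logit_prob(alpha, beta, x):
    """Success probability exp(eta) / (1 + exp(eta)) with eta = alpha + x . beta."""
    eta = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_pmf(y, m, pi):
    """P(Y = y) for y = 0..m, with the binomial coefficient written via gamma functions."""
    log_coef = math.lgamma(m + 1) - math.lgamma(m - y + 1) - math.lgamma(y + 1)
    return math.exp(log_coef + y * math.log(pi) + (m - y) * math.log(1.0 - pi))

m = 20  # hypothetical number of teeth susceptible to decay
pi_x = logit_prob(alpha=-2.0, beta=[0.4], x=[1.5])

probs = [binomial_pmf(y, m, pi_x) for y in range(m + 1)]
mean = sum(y * p for y, p in enumerate(probs))
print(f"pi_X = {pi_x:.4f}, expected count = {mean:.4f}, m * pi_X = {m * pi_x:.4f}")
```

Unlike the Poisson and negative binomial models, this distribution places no probability on counts above `m`.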

Dental caries data with excess zeros are common in statistical practice. For example, in young children, DMF scores generally produce an excessive number of zeros because many children have not experienced dental caries. This is typically due to a short exposure time to caries development. The limitations of Poisson and negative binomial regression models for analyzing such data are well established (see, for example, Lambert, 1992; Hall, 2000). One approach to analyzing count data with many zeros is to use zero-inflated models. This class of models views the data as being generated from a mixture $P(Y = y|X)$ of a zero point mass and a non-degenerate homogeneous discrete distribution $P_1(Y = y|X)$ as follows:

$$P(Y = y|X) = \begin{cases} \omega + (1 - \omega)\,P_1(Y = 0|X), & y = 0, \\ (1 - \omega)\,P_1(Y = y|X), & y > 0, \end{cases}$$

where $0 \le \omega \le 1$ represents the mixing probability that captures the heterogeneity of zeros in the population. The choice of the homogeneous distribution $P_1(Y = y|X)$ depends for the most part on the nature of the counts under consideration. For bounded counts, a binomial distribution is typically used (Hall, 2000). Poisson and negative binomial distributions are the standard for unbounded counts (Bohning et al., 1999). Ridout, Demetrio and Hinde (1998) provide an extensive review of this literature. In real applications of these models in dental caries research, the mixing probability is often related to covariates using, for example, a logistic model.
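The zero-inflated construction can be sketched generically: a point mass at zero with weight $\omega$ is mixed with any homogeneous count distribution. The example below uses a Poisson component; the function names and the values of `mu` and `omega` are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """Homogeneous component P1(Y = y); any count pmf could be used instead."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zero_inflated_pmf(y, omega, base_pmf):
    """Mixture of a point mass at zero (weight omega) and base_pmf (weight 1 - omega)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    p = (1.0 - omega) * base_pmf(y)
    return omega + p if y == 0 else p

def zip_pmf(y, mu=2.0, omega=0.3):
    """Zero-inflated Poisson pmf with hypothetical mu and omega."""
    return zero_inflated_pmf(y, omega, lambda k: poisson_pmf(k, mu))

p_zero = zip_pmf(0)
total = sum(zip_pmf(y) for y in range(200))
mean = sum(y * zip_pmf(y) for y in range(200))
print(f"P(Y=0) = {p_zero:.4f} vs Poisson {poisson_pmf(0, 2.0):.4f}")
print(f"total = {total:.6f}, mean = {mean:.4f}")
```

The probability of a zero exceeds that of the Poisson component alone, and the overall mean shrinks to $(1 - \omega)\mu$.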

We illustrate below how some of these simple models can be applied to dental caries scores from a survey designed to collect oral health information on low-income African American children (0-5 years) living in the city of Detroit (see Tellez et al., 2006). This study aimed at promoting oral health and reducing oral health disparities within this community through an understanding of the determinants of dental caries. Dental caries were measured using DMF scores, which represent the cumulative severity of the disease for each surveyed participant. Possible covariates include the study participant's age (AGE) and his/her sugar intake (SI). In Table 1, we present the fitted regression models applied to the children's data. For these data, the mean structure of the homogeneous model is specified as $E(Y|X) = e^{\alpha + X\beta}$, where $X = (\text{AGE}, \text{SI}, \text{AGE} \times \text{SI})$, with $\text{AGE} \times \text{SI}$ being a multiplicative interaction, and $\beta = (\beta_1, \beta_2, \beta_3)'$. Parameter $\kappa$ of the Negative Binomial model captures overdispersion in the data.

| Parameter | Homogeneous Poisson | Homogeneous Negative Binomial | Zero-inflated Negative Binomial (mixing weight depends on covariates) |
|---|---|---|---|
| $\alpha$ | 1.3994 (0.0209)* | 1.3484 (0.0725)* | 2.0158 (0.0676)* |
| $\beta_1$ | 0.6981 (0.0193)* | 0.9188 (0.0861)* | 0.2350 (0.0679)* |
| $\beta_2$ | 0.2696 (0.0203)* | 0.2378 (0.0853)* | 0.0573 (0.0695) |
| $\beta_3$ | -0.2790 (0.0219)* | -0.3314 (0.0877)* | -0.0728 (0.0739) |
| $\gamma_0$ | - | - | -0.6131 (0.1595)* |
| $\gamma_1$ | - | - | -1.7191 (0.2276)* |
| $\gamma_2$ | - | - | -0.2226 (0.1509) |
| $\gamma_3$ | - | - | 0.3163 (0.2022) |
| $\kappa$ | - | 2.6178 (0.1753)* | 0.9295 (0.1058)* |
| -2logLik | 8455.7 | 4059.1 | 3815.4 |
| AIC | 8463.7 | 4069.1 | 3833.4 |

Table 1. Parameter estimates (standard errors) from a homogeneous Poisson model, a homogeneous Negative Binomial model, and a zero-inflated Negative Binomial model with covariate-dependent mixing weights, applied to DMF scores.


As a basic starting model, a homogeneous Poisson regression model is fit and compared to a homogeneous Negative Binomial model. In view of the AIC (4069.1 versus 8463.7), the homogeneous Negative Binomial model provides a much better fit than the Poisson model. This result is consistent with the overdispersion parameter $\kappa$ in the homogeneous Negative Binomial model being statistically significant at the 5% level, suggesting that overdispersion cannot be ignored in these data. As a result, the standard errors of the parameter estimates in the mean model are larger under the homogeneous Negative Binomial model than under the homogeneous Poisson model. The homogeneous Negative Binomial model is further compared to a zero-inflated Negative Binomial model, which can accommodate extra zeros in the data. In the latter model, the mixing weight $\omega$ is related to covariates as $\omega = \{1 + e^{-Z\gamma}\}^{-1}$, where $Z = (1, \text{AGE}, \text{SI}, \text{AGE} \times \text{SI})$ and $\gamma = (\gamma_0, \gamma_1, \gamma_2, \gamma_3)'$. In view of the AIC (3833.4 versus 4069.1), this model provides a better representation of the data than the homogeneous Negative Binomial model. This is consistent with findings from the literature: dental caries data in young children typically exhibit overdispersion in addition to zero-inflation (Bohning et al., 1999).
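The AIC values in Table 1 can be reproduced from the reported -2 log-likelihoods, assuming the usual definition AIC = -2 logLik + 2k, where k is the number of estimated parameters. The parameter counts used below (4, 5, and 9) are read off the rows of Table 1 and are an interpretation rather than something stated explicitly in the text.

```python
# (-2 log-likelihood, number of estimated parameters) for each model in Table 1.
models = {
    "Homogeneous Poisson": (8455.7, 4),            # alpha, beta1, beta2, beta3
    "Homogeneous Negative Binomial": (4059.1, 5),  # the above plus kappa
    "Zero-inflated Negative Binomial": (3815.4, 9),  # the above plus gamma0..gamma3
}

# AIC = -2 logLik + 2k; smaller values indicate a better trade-off of fit and complexity.
aic = {name: m2ll + 2 * k for name, (m2ll, k) in models.items()}
best = min(aic, key=aic.get)

for name, value in aic.items():
    print(f"{name}: AIC = {value:.1f}")
print(f"Preferred by AIC: {best}")
```

The large drop from the Poisson to the Negative Binomial model reflects overdispersion, while the further, smaller drop reflects the extra zeros.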

The zero-inflated regression models provide an interesting parametric framework to accommodate heterogeneity in a population. A prevailing concern, however, is that these models only accommodate an inflation of zeros in the population, whereas both inflation and deflation at zero arise in practical applications. Homogeneous models (Poisson and negative binomial regression models), when applied to data from the Detroit study, typically reveal an inflation of zeros (fewer children with no dental caries predicted than observed) for younger children and a deflation of zeros (more children with no dental caries predicted than observed) for older children. For such data, a model that captures only inflation of zeros may fail to properly represent heterogeneity in the population. This necessitates the use of models that can accommodate both inflation and deflation in the population. A good example of such models is the two-stage model, also known as the hurdle model (Mullahy, 1986). An alternative approach is to use the marginal distribution derived from the mixture distribution:

$$P(Y = y|X) = \begin{cases} \omega + (1 - \omega)\,P_1(Y = 0|X), & y = 0, \\ (1 - \omega)\,P_1(Y = y|X), & y > 0, \end{cases}$$

where $\omega$ is the mixing weight, $P_1(Y = y|X)$ is the homogeneous distribution, and $-\frac{P_1(Y = 0|X)}{1 - P_1(Y = 0|X)} \le \omega \le 1$. Note here that the constraints on the mixing weight are obtained only by imposing that $0 \le P(Y = y|X) \le 1$ for all $y$. The mixing weight is potentially negative, to accommodate deflation in the data. For this class of models, the marginal mixture model maintains its hierarchical representation only if the mixing weight is bounded between 0 and 1. When the mixing weight is negative, the marginal mixture model loses its hierarchical representation.

Finally, the models described above are basic starting models and should be extended to accommodate unique features of the data under consideration. For example, it is often the case that the sampling design used to recruit study participants leads to clustered data. In survey research, sampled subjects living in the same neighborhood are more likely to share common, typically unmeasured, predispositions or characteristics that lead to dependent data. This necessitates the use of models for clustered or correlated data. An example of such models is described by Todem et al. (2010) for the analysis of dental caries for low-income African American children under the age of six living in the city of Detroit. These authors extended the family of Poisson and negative binomial models to derive the joint distribution of clustered count outcomes with extra zeros. Two random effects models were formulated. The first model assumed a shared random effects term between the logistic model of the conditional probability of perfect zeros and the conditional mean of the imperfect state. The second formulation relaxed the shared random effects assumption by relating the conditional probability of perfect zeros and the conditional mean of the imperfect state to two correlated random effects variables. Under the conditional independence assumption and the missing at random assumption, a direct optimization of the marginal likelihood and an EM algorithm were proposed to fit the proposed models.
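Returning to the marginal mixture model with a possibly negative mixing weight, the following sketch shows how zero deflation arises. It assumes a Poisson homogeneous component and uses the lower bound $-P_1(0)/(1 - P_1(0))$, which follows from requiring the probability of a zero to be nonnegative. The function names and the values of `mu` and `omega` are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Homogeneous component P1(Y = y)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def omega_lower_bound(p1_zero):
    """Smallest mixing weight that keeps P(Y = 0) >= 0."""
    return -p1_zero / (1.0 - p1_zero)

def marginal_mixture_pmf(y, omega, base_pmf):
    """Marginal mixture pmf; omega may be negative (deflation) within its bounds."""
    lower = omega_lower_bound(base_pmf(0))
    if not lower <= omega <= 1.0:
        raise ValueError(f"omega must lie in [{lower:.4f}, 1]")
    p = (1.0 - omega) * base_pmf(y)
    return omega + p if y == 0 else p

mu, omega = 2.0, -0.1  # hypothetical values; omega < 0 deflates zeros

def base(y):
    return poisson_pmf(y, mu)

lower = omega_lower_bound(base(0))
p_zero = marginal_mixture_pmf(0, omega, base)
total = sum(marginal_mixture_pmf(y, omega, base) for y in range(200))
print(f"lower bound for omega = {lower:.4f}")
print(f"P(Y=0) = {p_zero:.4f} vs Poisson {base(0):.4f}, total = {total:.6f}")
```

With a negative weight the probabilities still sum to one, but there is no longer a latent "perfect zero" class with a valid membership probability, which is the loss of hierarchical representation noted in the text.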
