Dealing with highly kurtotic count data with excess zeroes: comparing different treatments in the control of dairy cattle gastrointestinal parasites

In this paper we provide a flexible statistical framework to compare different treatments, conventional and selective strategical, in the gastrointestinal parasites control of dairy cattle through the count of Giardia cysts. Distributional regression models are considered, allowing the modelling of not only the average of these counts, but also the extra probability that calves do not present any cysts. Our findings show a positive relationship between the count of cysts with the animal body temperature and in animals until the age 150 days (and then, the count decreases). Higher responses are observed during summertime. Animals submitted to the selective strategical treatment present a lower count of cysts than the ones submitted to the conventional option. Further, there is a smaller chance that the sample does not present any cyst during wintertime. Conversely, this chance increases if the animal is submitted to the ST and in the earliest ages. Finally, the probability that no cysts are observed in the sample is roughly constant up to 40 o C and then rapidly increases. Hence, distributional regression models provide a great alternative to explicitly select features to model different aspects (average and extra probability of zero) of the count of Giardia cysts.


Introduction
Brazil is considered one of the largest milk producers in the world (Barbero et al., 2022).Livestock management has been improved in the past years, increasing the quality due to disease control.However, climatic changes and incorrect use of medicines are causing the increasing of parasitic diseases with high resistance, and consequently, increasing costs, implying millionaire losses in beef production (Oliveira et al., 2017).
Giardia duodenalis is a unicellular protozoan (Plutzer et al., 2010) and in cattle clinical infection mainly affects young animals (Geurden et al., 2010).Giardiasis is a public health problem (Coelho et al., 2017), with outbreaks reported by the use of contaminated water for consumption or recreation with parasite cysts and estimated to cause 280 million diarrhoea infections annually (Einarsson et al., 2016;Efstratiou et al., 2017).Its cycle is related to the gastrointestinal tract of vertebrate hosts (Xiao & Fayer, 2008), generating intestinal disorders, with diarrhoea that does not respond to the treatment using antibiotics and coccidiostats (Geurden et al., 2010).The control of this disease is characterised using prophylactic treatments such as fenbendazole, showing a control of giardiasis in young calves (Blanco et al., 2017).Blanco (2015) presented a study where they compared the efficiency between a so-called selective strategical treatment (ST) against the conventional treatment (CT) in gastrointestinal parasites control (G.duodenalis) of dairy cattle through the count of Giardia cysts.Due to some of its characteristics, the authors performed a transformation and used as the response, a binary variable, which may imply a loss of information and negative consequences, leading researchers to a not reliable conclusion.In fact, in Blanco's dissertation, the author has not detected any difference between both treatments (one of its main hypotheses).
Hence, in this paper we revisit the data available in Blanco (2015) and propose a different and more robust statistical framework in order to understand whether the ST performs better than the CT, considering all the available information in the variability of count of Giardia cysts, i.e., instead of simplifying this variable into a dichotomous response, we model its original values using a distributional regression approach.In addition to which treatment was applied to explain the number of cysts, we also consider the season the data was collected and the age, body weight and height of the calf as possible candidates to explain such behaviour.

The data
In this paper we are considering the same data set studied by Blanco (2015).In the author's experiment, the efficiency of a selective strategical treatment (ST) against the conventional treatment (CT) in gastrointestinal parasites control of dairy cattle has been evaluated.The efficiency of both treatments was measured by the count of Giardia cysts.
Faecal samples from the heifers (30 up to 50 g) were collected directly from the rectum every 15 days, massaging the rectal walls using disposable gloves, which were then removed in an inverted way, eliminating the air, tied and identified with their respective name and number of the animal, with further analyses at the Laboratory of Parasitic Diseases of DMV/UFLA.These stool samples were used to count cysts per gram of stool based on Gordon & Whitlock (1939), and the zincsulphate flotation technique at 33% to search for cysts of Giardia spp.
In addition to the two treatments (ST and CT), the following explanatory variables were considered as possible candidates to explain the response: • treatment: a two-levels factor indicating whether the calf received the CT or ST; • season: a four-levels factor indicating in which season the data was collected (summer, autumn, winter or spring); • age: a quantitative variable indicating the age of the calf; • weight: a quantitative variable indicating the body weight of the calf; • temperature: a quantitative variable indicating the body temperature of the calf.
Due to the complexity of these data, we have used a very flexible class of regression models known as distributional regression models, which as stated by Heller et al. (2022), were first proposed as generalised additive models for location, scale and shape (GAMLSS) (Rigby & Stasinopoulos, 2005).The key advantage is that they are able to consider different flexible distributions and distinct regression structures (subsets of covariates) in each of these distribution parameters.

Statistical modelling
GAMLSS are a very flexible semiparametric class of regression models, which involves a distribution for the response variable and may involve non-parametric smoothing terms when modelling any and/or all parameters of the response variable distribution, generalising the well-established generalised linear (Nelder & Wedderburn, 1972) and generalised additive (Hastie & Tibshirani, 1990) models.Mathematically, let Y ∼ D(θ), where D is the response variable density and θ is its vector of parameters of length p.The GAMLSS can be written as for k = 1, . . ., p, where g(•) is a known monotonic link function, η k denotes a predictor, X k is a known design matrix, β ⊤ k = β 1k , . . ., β J k k is a parameter vector of length J k k and s jk (•) is a smooth non-parametric function of explanatory variable x jk , e.g., P-splines (Eilers & Marx, 1996;Eilers et al., 2015).We shall highlight that when a smoothing term is included in the model, its coefficient (and standard error) should not be interpreted, as stated by Ramires et al. (2019) .
The estimation process of GAMLSS consists in maximising the penalised log-likelihood function, which is achieved through three different options, based on an iteratively reweighted penalised least squares algorithm: CG algorithm, RS algorithm and a combination of both methods (Stasinopoulos et al., 2017).

Selecting the response variable distribution
As it will be highlighted in Section 3, we are dealing with count data characterised by excess zero responses and long tail.Hence, a natural choice here is to select zero inflated distributions to explain the response variable counts of Giardia cysts.The most used distributions in literature with this feature are the zero inflated Poisson (ZIP) and zero inflated negative binomial (ZINB) distributions (some examples of their applications may be seen in Morales & Higuchi (2018) and Ostfeld & Keesing (2018)).Nevertheless, in this paper we will also consider the zero inflated Poisson inverse Gaussian (ZIPIG) distribution that has longer upper tails than the ZINB distribution.
Within the GAMLSS framework and the gamlss package (Stasinopoulos & Rigby, 2007) in R (R Core Team, 2022), most distributions are reparametrized (Rigby et al., 2019) to facilitate the interpretation of the results.The ZIP distribution has two parameters where µ > 0 and 0 < σ < 1 are the mean of the Poisson component and the inflated probability that Y = 0 (i.e., there are no cysts found and the calf is healthy), respectively.Its probability mass function (pmf ) is given by On the other hand, ZINB is a three-parameter distribution with pmf given by where µ > 0 is is the mean of the negative binomial component, σ > 0 is a dispersion parameter related to the negative binomial component, 0 < ν < 1 is the extra probability that Y = 0 and t)dt is the gamma function.The ZIPIG is a three-parameter distribution as well, and its pmf can also be expressed as (2).However, in this case, µ > 0 is the mean of the Poisson inverse Gaussian (PIG) component, σ > 0 is the dispersion of the PIG component, 0 < ν < 1 is the extra probability that Y = 0, dx is the Bessel function of the second kind.

Selecting regression structures and residual analysis
In summary, the main advantages of the GAMLSS model are that any distribution can be assumed to model the response variable and we may select different subsets of explanatory variables to explain each of its parameters (e.g., a parameter related to the inflated probability that no cysts are observed).These regression structures selection can be performed by several distinct methods.Within GAMLSS framework, the most frequently used method for model selection is a stepwisebased procedure (called Strategy A) which uses the generalised Akaike information criterion (GAIC) (Voudouris et al., 2012) given by GAIC(κ) = -2 × l + κ × df , where l is the fitted log-likelihood function, df are the effective degrees of freedom of the fitted model and κ is a penalty.If κ = 2 or κ = log(n), it reduces to the Akaike information criterion (AIC) (Akaike, 1974) and Bayesian information criterion (BIC) (Schwarz, 1978) The adequacy of the final fitted model can be checked through the normalised randomised quantile residuals (Dunn & Smyth, 1996) within the GAMLSS framework.The main advantage of these residuals is that despite the distribution assumed to explain the response variable, if the fitted model is reasonable, they follow a standard normal distribution.As stated by Stasinopoulos et al. (2017), when dealing with a discrete response, they can be defined as where Φ -1 (•) is the inverse cumulative distribution function of the standard normal distribution, u is defined as a random value from the uniform distribution on the interval [u 1 , u 2 ] = F (y -1|θ) , F (y|θ) and F (y|θ) is the model cumulative distribution function.Because of this randomization, it is advisable to compute the median of a collection of residual realisations.

Results and discussion
Table 1 displays some descriptive statistics for the response variable.We note that the count of Giardia cysts presents a large range as its minimum and maximum values are zero and 24,000, respectively.The average count of cysts is 605.33 and its standard deviation (SD) is 1,670.30.This variable is highly positively skewed (6.81) and has an extreme excess of kurtosis (70.55), i.e., the count's density is very asymmetrical and heavy-tailed (leptokurtic) as displayed in Figure 1.It is noteworthy that more than 50% of the observations are zeroes (342 cases) and only 25% of the count are higher than 400 (even though the maximum value is 24,000).The high number of outliers present in the data is a characteristic of this variable rather than a measurement error.Figure 2 displays plots of the response against each of the considered explanatory variables, which may give us an idea of the relationships in the data despite the interplay between the covariates.Panels (a) and (b) show boxplots of the explanatory factors where animals submitted to the conventional treatment and during summer presented higher counts than the other levels.Panels (c) and (d) display the response against age and body weight, respectively, where we can note that the highest counts were collected in individuals where the date of collection is in the age range of 100 to 200 days and with 100 to 150 kg.Finally, as expected, higher counts were detected in animals that present higher body temperatures, as can be seen in Panel (e).It is noteworthy the presence of excess zeroes in the response variable that can be checked in all these panels as already highlighted in Table 1.
We then proceed to the distributional regression approach.The stepwise-based procedure (Strategy A) was performed in all GAMLSS based on the ZIP, ZINB and ZIPIG distributions.Table 2 displays values of the global deviance (GD, equals minus twice the fitted log-likelihood), AIC and BIC for each model, which were used to compare the fitted model.As we can see, the ZIPIG model clearly outperformed the others since it returned the lowest AIC and BIC values (5,689 and 5,808, respectively).All linear coefficients were statistically significant at a 5% level (i.e., p-value<0.05),except for spring and autumn levels in the µ structure, indicating that there is no statistical difference between the observed average of count of Giardia cysts during these seasons and wintertime.The fitted model for the mean of the PIG component µ, in equation ( 3), indicates that, as expected, as the body temperature of the animal increases, the count of cysts also increases.More precisely, for each degree Celsius it is expected that the average count increases in exp(0.545)= 1.725 times, considering calves in the same condition (i.e., setting all other covariates as constants).Furthermore, in average, animals submitted to the selective strategical treatment present a lower count of Giardia cysts than the ones submitted to the conventional option.
Another covariate considered to model the mean of the PIG component was the season time.During summer, the average of counts is exp(0.548)= 1.730 times greater than the ones observed during wintertime.As previously stated, there are no statistical differences between the other levels and winter, which contradicts the findings of Ehsan et al. (2015), who found that there was no greater presentation of positive cattle for Giardia spp.During the humidity and precipitation season.
The last considered covariate to model the average of the counts is age.Since a non-linear effect of covariate age in the parameter was detected, a smoothing function (p-spline) was considered to model this relationship (Figure 3(a)).As we can see, the expected count of cysts increases until age 150 days, then decreases up to age 275 days, after which is roughly linear until age 365 days.
Regarding the dispersion parameter σ, it can be seen from equation ( 3) that no covariates were selected to compose its regression structure.Hence, σ was modelled as a constant, where σ = exp(0.621)= 1.861.
The extra probability of zeroes ν, i.e., the probability that no cysts are observed in the sample, was modelled using four different covariates.Based on the estimated coefficients in equation ( 3), this probability will be higher if the calf is submitted to the selective strategical treatment.Regarding the seasons of the collection, during the summer there is a smaller probability of zero than in the other three seasons and this probability is higher during the winter.Smoothing functions were considered for both age and temperature in the ν structure.
Cattle feeding varies according to their nutritional requirements based on race, sex and age, and in the higher phase of growth and consumption of these parasites, it increases in moderate amounts in the feed or pasture provided, with a lower risk of characteristic signs of the disease, as it allows the development of protective immunity in the animal (Taylor et al., 2013).Volpato et al. (2017) concluded that the infection of Giardia spp. is higher in calves, which interferes with the animal's growth and low yield.Nonetheless, results in our paper show that there is a rapid decrease of the probability of zero up to age 200 days (the largest probabilities are observed during the early days of the animal age), and then a slight increase until age 365 days (Figure 3 (b)).
The extra probability that no cysts are observed is roughly constant up to 40 o C and then rapidly increases from that point (Figure 3(c)).The increase in temperature that have been shown by the animals may be directly related to the presence of other agents found when conducting this study, as presented by Blanco et al. (2017) and Barbieri et al. (2017), or good livestock practices that are of great relevance for the development of animals (Hernández-Gallo & Cortés-Vecino, 2012), so the presence of some symptoms cannot be directly related to parasitosis, since the loads found do not affect animal development.
In the way the GAMLSS is defined in equation ( 1), since we are modelling all parameters of the response variable distribution, different sets of values of the selected regression structures for each parameter may result in different shapes.Figure 4 summarises the behaviour of all predicted ν values in our data, divided by the two distinct treatments applied to the animals.In general, the selective strategical treatment returned higher ν values, suggesting that it may be more effective than the conventional treatment, in the sense that higher values of ν imply a higher probability of counting zero Giardia cysts -which occurs when the calf is healthy -and a smaller average count.
As highlighted in Section 1, one of the main hypotheses tested by Blanco (2015) was how se- lective strategical treatment works over the conventional one in the count of Giardia cysts.Despite the author having not found any statistical difference between both treatments, as can be seen in equation ( 3), the covariate treatment was statistically significant to explain both average cysts count and the extra probability that the animal is healthy.Note that this finding was only possible since we are considering the GAMLSS framework, where different regression structures may be considered in all the response variable distribution parameters.As stated by Ramires et al. (2021a), considering a more sophisticated regression structure such as GAMLSS may result in a better (and more parsimonious) model than when using a more complex location model (e.g., modelling only the mean of cysts count).Different combinations may be made to obtain the extra probability of zeroes and/or the expected number of cysts.Moreover, several statistical distribution shapes may be obtained depending on the combinations of covariates in both µ and ν parameters.Thus, an interactive plot to find these probabilities, values and shapes is provided by the authors through the link: https://ramires.shinyapps.io/EPG-paper/.
The adequacy of the final ZIPIG fitted model is performed through the normalised randomised quantile residuals.In order to graphically check these residuals, we display a set of randomised residual realisations (greyish dots) in a QQ-plot and a worm plot (van Buuren & Fredriks, 2001) in Figure 5.In the worm plot (Panel (b)), if the assumed distribution for the response variable is correct, then at least 95% of the median residuals (black points) must lie between the elliptical 95% pointwise interval band curves and no particular shape may be observed.With that said, and given that the median residuals (black points) in the QQ-plot (Panel (a)) are essentially linear, the final fitted model is reasonable and the ZIPIG regression model provides a good fit to these data.

Concluding remarks
This paper detailed the analysis of a positively skewed and highly kurtotic count data with excess zeroes from a real experiment.Although zero inflated Poisson and/or zero inflated negative binomial regression models may be promptly considered in such cases, we have shown that they were not able to model such characteristics, even under the generalised additive models for location, scale and shape (GAMLSS) framework, a very flexible class of semi-parametric regression models.In this sense, a distribution with longer upper tails than the previously mentioned ones was required, and thus the zero inflated Poisson inverse Gaussian (ZIPIG) was considered.Due to the ease of interpretation of each of the parameters of the ZIPIG distribution, based on the GAMLSS framework, we were able to understand how different features (age, temperature, treatment and season) impacted the average count of cysts and the extra probability that the animal is healthy.This analysis showed that GAMLSS are a competitive approach to model highly complex data and can be applied to other problems that face these situations.

FrequencyFigure 1 .
Figure 1.Bar plot of the count of Giardia cysts.For clarity of presentation zeroes (342 cases) are not displayed.

Figure 2 .Figure 3 .
Figure 2. Relationship between the count of Giardia cysts and each of the covariates: (a) treatment; (b) season; (c) age; (d) weight; and (e) temperature.

Figure 4 .
Figure 4.Estimated probability that the calf is healthy fixing winter season, changing age and temperature, considering (a) selective strategical treatment and (b) conventional treatment.

Figure 5 .
Figure 5. (a) QQ-plot and (b) worm plot of the final ZIPIG fitted model based on the GAMLSS framework.

Table 1 .
Some descriptive statistics for the count of Giardia cysts