Spatial autocorrelation analysis of multivariate rural insurance data

The environment in which agricultural activities are developed presents high risk and great uncertainty. Several factors related to the agricultural sector can generate fluctuations in the income of producers. These fluctuations must be faced through risk management support policies such as, for example, the hiring of rural insurance. This type of insurance enables the recovery of the financial capacity of the producer in the occurrence of adverse events that cause economic damage. Considering the relevance of rural insurance in the agricultural sector, this study aims to evaluate the spatial distribution of the variables of this type of insurance in Brazilian municipalities from 2006 to 2019. The data used were obtained from rural insurance censuses compiled by the Ministry of Agriculture, Livestock, and Supply. Principal Component Analysis (PCA) was used to reduce data dimensionality and Exploratory Spatial Data Analysis (ESDA) using the scores of the first PC was used to investigate the presence of spatial distribution patterns of rural insurance. By using PC scores, it was found that the highest concentrations of rural insurance policies are located in the South and Midwest regions of Brazil, and there is a tendency for an increase in the spatial dependence of rural insurance throughout the analyzed period. The identification of these areas shows how rural insurance is heterogeneously distributed in Brazil. This result suggests that some strategies can be adopted by policy makers and insurers in order to serve areas that have demand and are not yet covered by rural insurance.


Introduction
Agricultural activities are of great relevance in the Brazilian economy.According to a study carried out by the Center for Advanced Studies in Applied Economics, the agribusiness's share in the Brazilian GDP was 20.5% in 2019 and, in 2020, this percentage reached 26.6% (CEPEA, 2021).According to the latest Agricultural Census, between 2006 and 2017, there was an increase of approx-imately 5.8% in the total area of agricultural establishments, and both the total area and agricultural and livestock production showed considerable growth (IBGE, 2019).
However, the environment in which agricultural activities are carried out presents high risk and relevant uncertainty.This insecurity is mainly due to climatic instabilities and sanitary threats, which can affect production, or market reasons, such as exchange and interest rates, or conditions related to the business environment, such as changes in regulatory frameworks and in public policies.All these variables related to agricultural markets generate variations in the sector's income, which are commonly addressed through policies to support risk management (BRASIL, 2018).
Risk management in agriculture can occur in various ways.However, the purchase of rural insurance is one of the most common forms.This type of insurance works to mitigate losses and enable the producer to recover's financial recovery in the event of losses.Rural insurance provides a more favorable environment for the development of agricultural activities, as it guarantees tincome flow, favors an increase in the planted area and facilitates obtaining financing.In addition, rural insurance is an instrument that enables the sharing of agricultural risk with other economic agents and sectors (BRASIL, 2019).
It is important to highlight that the rural insurance market does not consolidate without the participation of the State.Problems such as high investments and administrative costs, the possibility of catastrophic risks, the strong influence of moral hazard and adverse selection in the formation of portfolios,formation limit the efficiency of private initiatives in offering products.In this sense, the government is called upon to interfere in the market, either by acting directly as an insurer or by creating programs that stimulate the supply and demand for insurance products (ibid.).
Considering the relevance of rural insurance in the agricultural sector, this study aims to analyze the spatial distribution of multivariate data on rural insurance in Brazil municipalities between 2006 and 2019.Furthermore, the study seeks to investigate the presence of spatial distribution patterns in the data, more specifically, whether there is a presence of spatial dependence or heterogeneity.Finally, this work intends to provide information that can contribute to the debate around the improvement of the rural insurance system in Brazil.To achieve these objectives, the work makes use of Principal Component Analysis (PCA) to reduce the dimensionality of the data and Exploratory Spatial Data Analysis (ESDA) to investigate the presence of spatial distribution patterns.The data used come from the Rural Insurance Census, compiled by the Ministry of Agriculture, Livestock and Supply (BRASIL, 2021).
This paper is structured as follows: section 2 describes the methodology used in this study, the data source and the computational resources.Section 3 discusses the results obtained.Finally, section 4 brings the final considerations.

Source of data and computational resources used
The data used are annual, referring to rural insurance policies of Brazilian municipalities from the year 2006 to the year 2019.The variables used in the analysis are presented in Table 1.Regarding spatial aggregation, for all variables, the municipal aggregation level was used.In cases where the available information referred to other localities, such as districts, farms and villages, such information was assigned as the corresponding municipalities.
The data on rural insurance come from the website of the Ministry of Agriculture, Livestock and Supply (MAPA).Data containing geographic attributes, such as the position and shape of the Brazilian territory, were obtained from the website of the Brazilian Institute of Geography and Statistics (IBGE, 2021).
Before performing the analysis itself, some descriptive statistics of the variables were obtained for all years to provide a general sense of what the original data looks like.(Hunter, 2007) and Seaborn (Waskom, 2021), which are libraries to create graphics and Jenkspy for the use of the Fisher-Jenks algorithm (Jenks, 1977).PCA was performed using the library Scikit-learn and the libraries GeoPandas (Jordahl, 2014) and PySAL (Rey & Anselin, 2010) enabled the spatial analysis.

Principal component analysis
First, the correlation matrix between the variables was obtained, and then the principal component analysis (PCA) was used to explain the covariance structure of the p variables, X T = [X 1 X 2 . . .X p ], through linear combinations of the original variables.The principal components (PCs) obtained are the p linear combinations Y T = [Y 1 Y 2 . . .Y p ], not being correlated with each other (Everitt & Hothorn, 2011;Mingoti, 2007).
Furthermore, the PCs are ordered in such a way that the first ones already explain most of the variance of the original variables.Therefore, they can be used to provide a reduction in the dimensionality of the data (Everitt & Hothorn, 2011;Mingoti, 2007).According to Mingoti (2007), the j-th principal component is defined by: (1) The vector of coefficients defining the j-th principal component, a j , is the eigenvector of the covariance matrix of the data (S), associated with its j-th largest eigenvalue.The variance of the j-th principal component is given by λ j , where λ 1 , λ 2 , . . ., λ p are the eigenvalues of S subject to the constraint a T j a j = 1 (Everitt & Hothorn, 2011).As PCA was applied to the correlation matrix (R), the j-th principal component shown in (1) is defined by: where Z i is the standardized variable, that is, subtracted from the mean and divided by the standard deviation.The vector of coefficients defining the j-th principal component, a j , is the eigenvector of the data correlation matrix (R), associated with its j-th largest eigenvalue.The variance of the j-th principal component is given by λ j , where λ 1 , λ 2 , . . ., λ p are the eigenvalues of R. The proportion of the total variance of Z explained by the j-th principal component is defined by The scores, that is, the numerical values of the components, are obtained by replacing the values of the standardized variables (Z i ) in each of the components Y j of (2).The scores of the first principal component (CP1) were used as the variable to be used in the exploratory analysis of spatial data.

Exploratory Spatial Data Analysis
In this work, the Exploratory Spatial Data Analysis (ESDA) was used with the objective of evaluating the spatial distribution of rural insurance in Brazilian municipalities in the period from 2006 to 2019.The ESDA seeks to capture patterns of spatial association of variables and verify the existence of spatial clusters.In addition, the ESDA seeks to analyze the presence of different spatial regimes or the identification of atypical observations, the spatial outliers (Almeida, 2012).
To perform ESDA, it is necessary to use the matrix of spatial weights in which each entry represents the connection between two municipalities and it is called spatial weight.The spatial weighting matrix, with dimension n × n, is denoted by W.
The continuity spatial weight matrix with the queen convention was adopted, considering the first-order neighbors.In addition, the municipalities of Fernando de Noronha and Ilhabela, belonging to the states of Pernambuco and São Paulo, respectively, were removed.These municipalities were disregarded because, in addition to being islands, they do not have any rural insurance policy taken out during the years under analysis.
In the binary contiguity weighting matrix used, the municipalities that have physical boundaries in comon are considered neighbors and the value 1 is assigned to the corresponding element in the matrix, otherwise it is assigned the value 0.Moreover, by convention, it is assumed that w ij = 0, for all i = j, that is, a region is not considered a neighbor of itself (ibid.).
Spatial autocorrelation was used in order to measure the relationship of a certain variable in a region with the values of that same variable in neighboring regions.One of the ways to calculate the autocorrelation is using Moran's I.This indicator measures the ratio of the standard deviation of a variable z in an area I to the standard deviation of neighboring areas of the same variable z.Formally, the statistic is expressed by: where z i denotes the value of the variable of interest standardized in region i, n represents the number of regions and S 0 is the sum of all weights matrix W (ibid.).The global Moran's I was calculated and its significance obtained through the pseudo-p, obtained from 999 random permutations.In this study, a significance level of α = 0.05 was considered.
The calculation of I i is performed considering only the neighboring regions of i, defined through the spatial weight matrix.As the calculation of the I i is performed for each of the n observations, it generates a large amount of information, a more efficient way to visualize the set of generated statistics is to present them in a cluster (Almeida, 2012).

Results and discussion
Table 2 presents the mean, median, standard deviation and maximum values of the rural insurance variables in all the years analyzed.It is possible to observe that the median value of the variables is equal to 0, that is, more than half of the Brazilian municipalities present rural insurance data in the years studied.Furthermore, it can be seen that the averages of the variables referring to the premium and subsidy values present growth during the analyzed period.Analyzing Table 2 one can also see that, with the exception of the variables related to indemnity, the averages of the variables showed a drop between the years 2014 and 2015.The drop in the mean values of the variables in 2015 may be caused by the containment undertaken by the federal government, which released only R$400 million of the R$700 million expected to subsidize rural insurance that year (Andrade, 2017).
It can be seen in Table 2 that the average ARP rises between 2006 and 2019.According to Santos & Silva (2017), it was expected that, as the rural insurance system consolidates, policy prices would decrease due to agricultural productivity gains and the reduction of risk factors.These reductions could occur due to the adoption of agricultural zoning guidelines, a greater knowledge of the history of climatic events and claims, as well as the use of technologies such as irrigation, for example.However, it was found that this reduction in rates did not occur during the analyzed period.
Figure 1 presents the values of Pearson's correlations between each of the pairs of rural insurance variables.The lighter the color of the rectangle, the closer to 1 is the value of the correlation coefficient and, therefore, the greater is the degree of positive association between the variables.On the other hand, the darker the color of the rectangle, the lower is the value of the correlation coefficient and, therefore, the lower is the degree of association between the variables.Although the correlation values may be in the range of -1 to 1, in the case of the data in this work, no value was negative.
When analyzing Figure 1, it is possible to observe that there are groups of variables that present a greater correlation with each other.The variable that represents the total amount of subsidy granted (TSB) presents, in almost all years, a high correlation with variables such as the number of policies contracted (TPC), sum of insured amount (SIA) and the sum of premiums (SPR).In particular, the correlation value of the total amount of subsidy with the sum of the premiums is equal to 0.94 in the year 2006 and, approximately, equal to 1 in the other analyzed years.The positive correlation between the total subsidy granted and the variables related to the total insured and the premium indicates that, as pointed out by Ferreira & da Rocha Ferreira (2009), the government participation, through the subsidy granted by the PSR, is fundamental and must guarantee the sustainability of rural insurance in order to ensure the stability of farmers and other participants in the system.
In addition, there is a high correlation between the variable number of policies contracted (TPC) and total insured (SIA), as well as between the total insured (SIA) and the premium amount (SPR).The variables that showed lowest correlation, although still showing a positive correlation, are the variables that are related to the occurrence of claims, such as the number of indemnities (NIP) and  total indemnities paid (SIP).These variables present, in all years, lower correlations with the other variables of the data set.The graph in Figure 2 presents the proportion of total variance explained by the principal components for each year analyzed.In this graph it is possible to observe that the first principal component is responsible for most of the explained variance in all the years analyzed.That is, only the first principal component is able to provide information on the variance of the entire set of rural insurance variables.For example, in 2006, the first PC explained 53% of the total variance of the data, and in 2019, the first PC explained 78% of the total variance.Figure 3 shows the effect of rural insurance variables on each of the principal components.The lighter the color of the rectangle, the closer to 1 is the coefficient value, therefore, the greater the degree of positive association between the principal components and the variables.On the other hand, the darker the color of the rectangle, the lower the value of the coefficient, therefore, the lower the degree of association between the principal components and the variables.
It is possible to see that, in all years between 2006 and 2019, the first principal component is positively correlated with all variables in the dataset.This makes it easier to interpret the contribution of each variable to the value of the PC.The higher the value of each variable, the higher the value of the PC.The distribution of the effect of the variables in the first component is relatively homogeneous among the variables analyzed, since the values of the coefficients have similar values.Despite SPR and TSB variables present higher weight values, the effects of all the variables in the first component are close to each other.Figure 4 presents the biplots between the first two PCs.In this graph, each point represents a municipality and the axes show the PC scores.All vectors start at the origin and their values show the weight of the variable in the PC under which the vector is projected.Furthermore, the correlation between pairs of variables is represented by the angle between the vectors, in this case, the greater the angle, the lower the correlation between the variables.
By analyzing the Figure 4, it is possible to observe that all variables have a strong positive relationship with the first PC.The variables NIP and SIP, with the exception of 2007, have greater weights related to the second PC, as well as NIP and ARP for most of the years analyzed.In addition, it appears that the SPR and TSB variables are correlated in all years and the SIP and NIP variables are positively correlated.In any case, as shown in the first column of Figure 3, the correlations between all variables and the first PC are expressive.Thus, due to the correlation structure between the rural insurance variables, it was possible to use the scores of the first principal component as a representative variable of the other original variables.Moreover, in all the years analyzed, the accumulated explained variance of the first component was greater than 55.3%.Thus, the value of the first PC score is representative to summarize the information about all the rural insurance variables presented here.The higher the value of this score for a municipality, the higher the values of policies, insured amounts, premiums, etc.
The first group of maps, presented in Figure 5, shows the spatial distribution of the scores of the first principal component in Brazilian municipalities between the years 2006 and 2019.Through these maps, we seek to visually identify whether there are patterns in the spatial distribution of the set of rural insurance variables represented by the scores of the first component.As evidenced earlier, the higher the PC score, the higher the values of variables related to rural insurance (Table 1).It is important to note that, as the correlation matrix was used to apply the PCA, the PCs are linear combinations of the standardized variables.Even so, it is possible to state that municipalities with higher scores have higher values of the rural insurance variables.
To construct the maps, the values of the variable, the PC1 scores, were divided into five intervals by using the Fisher-Jenks algorithm to the values of the variables other than 0. In addition, to make comparison between the years possible, the intervals were created with the values of the year 2019 as reference (Jenks, 1977).
By analyzing Figure 5, it is possible to observe that the spatial distribution of rural insurance in Brazilian municipalities has changed over the years, although it is mainly concentrated in the South, Southeast and Center-West regions.It is also possible to highlight that, during the period analyzed, there is evidence of spatial concentrations in the west of State of Bahia, Southwest of Mato Grosso do Sul, South of Goiás in the State of Goiás and in Southeastern Brazil, in the south of the State of São Paulo.The value of Moran's I was calculated using the scores of the first component for all years between 2006 and 2019.The results obtained are shown in Figure 6.When analyzing the graph in Figure 6, it is possible to observe that all values I were positive and significant (the average Moran's I was 0.46 in the analyzed period).This indicates the presence of positive spatial autocorrelation of rural insurance in Brazilian municipalities.That is, the fact that a municipality has above-average values for variables related to rural insurance is influenced, in addition to other factors, by the value of this set of variables in the region neighboring this municipality.Thus, the fact that a municipality has a high value of the sum insured is influenced, among other factors, by the sum of the sum insured in the municipalities in its surroundings.The same applies to the variables sum of premiums, total subsidy, sum of indemnities paid, average rate applied to policies and number of indemnified policies.
In addition to the presence of positive autocorrelation in all years, the analysis of Figure 6 indicates that the spatial dependence grew over the analyzed period.The lowest observed spatial autocorrelation value is equal to 0.216 in 2007 and the highest value occurred in 2019 and was equal to 0.606.Between 2006 and 2019, the value of spatial autocorrelation showed a growth of about 2.62 times the value observed in 2006.Thus, it is possible to conclude that, during the analyzed period, there was an increase in spatial autocorrelation in all the rural insurance variables analyzed.

Local spatial autocorrelation
The LISA maps presented in Figure 7 contain four groups with different statistically significant spatial association characteristics for each of the years analyzed.In 2006, HH-type clusters, that is, groups formed by municipalities whose variables presented high values and that had neighbors also with high values of rural insurance variables, were concentrated exclusively in the Midwest, Southeast and South regions.In 2006, the South region had 74.56% of the municipalities in the HH group, 64.32% of which belonged to the state of Paraná, 6.43% to Rio Grande do Sul and 3.8% to Santa Catarina.
In the Southeast region, in 2006, about 50 municipalities belonged to the state of São Paulo, a percentage corresponding to 14.6% of the municipalities in the HH group.In addition, the Midwest region had 10.82% of the municipalities belonging to HH cluster HH.The state of Mato Grosso had 7.6% of the municipalities of HH group and the state of Mato Grosso do Sul had 2.63% of the HH municipalities.
In 2019, the municipalities whose belonging to the HH group were significant, were also concentrated in the Midwest, Southeast and South regions.More specifically, 8.10% of the municipalities in the HH group were located in the state of Mato Grosso do Sul, 5.56% were located in the state of Goiás and 2.78% in Mato Grosso.In the Southeast, significant HH municipalities were located mainly in São Paulo (11.57% of the total) and Minas Gerais had only 0.46% of HH municipalities.
The South Region, in 2019, concentrated about 79.86% of significant HH type municipalities.Most of these municipalities (48.61%) were located in the state of Paraná, 19.67% of the type HH municipalities belonged, in 2019, to the state of Rio Grande do Sul and, finally, 3.24% belonged to Santa Catarina state.In addition, it is worth noting that the years 2007 and 2015, even with positive and significant spatial autocorrelation, had the lowest numbers of municipalities belonging to the BB group, that is, groups formed by municipalities whose variables presented low values and that had neighbors also with low values of rural insurance variables.
According to dos Santos & da Silva (2017), historically, the policies contracted for rural insurance, the insured amount and, consequently, the subsidy amount through the PSR are regionally concentrated.This is due to the presence of higher weather risks in the Southern Region states, in São Paulo and Minas Gerais.States such as Mato Grosso do Sul, Mato Grosso, Goiás, and the region formed by the states of Maranhão, Tocantins, Piauí and Bahia, more recently, have also been adhering to rural insurance contracts due to climatic factors.
In addition, several factors can be pointed out as determinants of the concentration of the rural insurance market in a few regions, with few crops, with reduced participation of insurance companies.According to dos Santos et al. (2013), some factors that should be considered are: the small number and discrepancy in the size of insurers that offer this type of insurance; difficulties by bank- ing institutions with operations in the rural environment, leading to inaccurate information and the elevation of risks and prices; government partnerships with operators, such as official credit programs operated by Banco do Brasil, which is also the controller of the largest agricultural insurance company; the degree of opportunity assessed as small by insurers, and finally; the small weight of partnerships and assessment of opportunities involving the brokerage segment.

Conclusions
With the use of PCA, it was possible to verify that one variable was able to represent the set of rural insurance variables efficiently.Instead of using the original seven variables, it was possible to perform the spatial analysis with only one, the first principal component.For almost all the years analyzed, except 2006 and 2007, the first principal component was able to explain more than 70% of the variation in the original data.
The ESDA results indicate that there is spatial dependence in all the years analyzed, that is, there are statistically significant spatial association patterns in the rural insurance data.It was also possible to identify the presence of significant spatial clusters.In general, it was identified that the largest concentrations of rural insurance policies are located in the South, Midwest and Southeast regions, in the southern part of the State of São Paulo.The identification of these areas shows how rural insurance is heterogeneously distributed in Brazil.This result suggests that some strategies can be adopted by policy makers and insurers in order to serve areas that have demand and are not yet covered by rural insurance.
In addition, it was found that there is an increase in the spatial dependence of rural insurance over the analyzed period.That is, municipalities that have greater adherence to rural insurance tend to be geographically close to municipalities that also have greater adherence to rural insurance.However, the analysis of the LISA indicates that the spatial concentration of type HH of rural insurance has changed little over the years.
Since it was found that there is an increase in the spatial dependence of rural insurance over the analyzed period, there is a market to be explored.A cooperation between insurers and policy makers can be established in order to serve regions with lower contracting, especially in the North and Northeast regions of Brazil.Insurers can use this information to identify areas with high demand and low supply and focus their marketing efforts to serve them.
A limitation of the study was the lack of incorporation of the temporal dimension in the analysis, thus some aspects of the dynamics of the variables involved may not have been captured, which could be addressed in future work.Another suggestion for future work would be to identify producing areas in Brazil that still do not contract rural insurance and cross-reference the results of this research.Thus, it would be possible to identify potential regions to be served by this type of insurance that do not yet do so.
The local spatial autocorrelation indicator used to detect local patterns of autocorrelation was the local indicator of spatial association (LISA), also called Local Moran's I.While Moran's I provides a global statistic (for the entire distribution), LISA provides a local statistic (for each observation), thus allowing to verify if there are spatial clusters statistically significant.For Anselin (1995), a local indicator of spatial association must satisfy two criteria: a) it must indicate spatial clusters statistically significant; and b) the sum of the local indicators should lead to the global indicator.The local Moran coefficient I can be expressed as

Figure 2 .
Figure 2. Proportion of total variance explained by the principal components.Brazil 2006 -2019 It is necessary to emphasize that, in the years 2006 and 2007 (highlighted in Figure 2), the variance explained by the first component was less than 0.72, being equal to 0.55 in 2006 and 0.65 in 2007.However, it is possible to consider that the use of the scores of the first principal component

Table 1 .
Description of the variables used.This study was carried out using the programming language Python (Van Rossum et al., 2007), using the interface Jupyter (Kluyver et al., 2016).In addition, the following libraries were used: Pandas (McKinney et al., 2010), for data manipulation, NumPy (Van Der Walt et al., 2011), which enables numerical computation with Python, Matplotlib

Table 2 .
Statistical summary of rural insurance variables.Brazil 2006 -2019