\section{Methodology}\label{met}
The logistic regression model is widely used for the analysis of binary outcomes
such as the presence or absence of a characteristic of interest.
For the presence of plant disease it is particularly relevant to consider
possible spatial dependence, since it is reasonable to assume that neighbouring trees
are more likely to share the same status. The autologistic model~\cite{besag:72}
extends the usual logistic regression to account for such spatial structure.
\subsection{Autologistic model}
The autologistic model describes the probability of a plant having the disease given the status of
the neighbouring plants, through covariates connected to the outcome by the link function,
\begin{equation}\label{eq:prob}
logit(p_{ij}) = \beta_0 + \gamma_1 (y_{i-1,j} + y_{i+1,j}) +
\gamma_2 (y_{i,j-1} + y_{i,j+1}) \; ,
\end{equation}
with $p_{ij}$ being the probability for the plant in the $i^{th}$ row and $j^{th}$
column; $y_{i-1,j}$ and $y_{i+1,j}$ are the status of the plants in the adjacent rows,
which are combined to produce the \textit{row covariate};
$y_{i,j-1}$ and $y_{i,j+1}$ are the status of the plants in the adjacent columns,
producing the \textit{column covariate};
$\gamma_1$ and $\gamma_2$ are the respective parameters measuring the
effect of these spatial covariates. The separation of row and column effects accommodates the fact that the spacing
is typically different within and between rows, allowing the study of directional effects.
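As an illustration, the linear predictor in (\ref{eq:prob}) can be evaluated for all trees on the grid at once. The sketch below is in Python/NumPy rather than the \textsf{R} implementation used here, and assumes trees outside the grid boundary count as status 0 (a convention not stated in the text):

```python
import numpy as np

def autologistic_prob(y, beta0, gamma1, gamma2):
    """Disease probability for every tree on a rectangular grid,
    following the linear predictor of the autologistic model.
    Trees outside the grid are treated as status 0 (an assumption)."""
    padded = np.pad(y, 1)                         # zero border around the grid
    rows = padded[:-2, 1:-1] + padded[2:, 1:-1]   # y[i-1, j] + y[i+1, j]
    cols = padded[1:-1, :-2] + padded[1:-1, 2:]   # y[i, j-1] + y[i, j+1]
    eta = beta0 + gamma1 * rows + gamma2 * cols   # logit(p_ij)
    return 1.0 / (1.0 + np.exp(-eta))             # inverse link
```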
A naïve method to obtain estimates of the parameters $\gamma = (\gamma_1, \gamma_2)$
is based on the maximisation of the pseudo-likelihood~\cite{besag:77}
\begin{equation}
\tilde{L}(\gamma; y) = \prod_i \prod_j f(y_{ij}; p_{ij})\;,
\end{equation}
where $f(\cdot)$ is the density of the Bernoulli probability distribution.
This method provides consistent point estimates of the parameters, but it underestimates the
associated standard errors. Intuitively, this is because each observation is used as a response
variable while also providing information for the covariates in the model.
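To make the pseudo-likelihood concrete, the sketch below (Python with NumPy/SciPy, not the \textsf{R} code used in this work) evaluates the negative log pseudo-likelihood and maximises it with a generic optimiser; the zero-status boundary convention is an assumption:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_pseudo_lik(params, y):
    """Negative log pseudo-likelihood: sum over sites of Bernoulli
    log-densities log f(y_ij; p_ij), with p_ij from the autologistic
    model and trees outside the grid treated as status 0."""
    beta0, g1, g2 = params
    padded = np.pad(y, 1)
    rows = padded[:-2, 1:-1] + padded[2:, 1:-1]
    cols = padded[1:-1, :-2] + padded[1:-1, 2:]
    eta = beta0 + g1 * rows + g2 * cols
    # Bernoulli/logit pair: log f(y; p) = y * eta - log(1 + exp(eta))
    return -(y * eta - np.logaddexp(0.0, eta)).sum()

def fit_pseudo(y):
    """Maximise the pseudo-likelihood with a generic optimiser."""
    return minimize(neg_log_pseudo_lik, x0=np.zeros(3), args=(y,)).x
```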
One possible workaround is to use resampling methods; however, within the context of spatial patterns this is
not straightforward, given the need to preserve the spatial structure.
This can be achieved by block resampling~\cite{cressie:93}, for instance using a Gibbs sampler~\cite{gumpertz:97}.
The basic idea is to sample from the distribution of each observation $y_{ij}$
conditioning on the current status of its neighbours, with probabilities given by the autologistic model (\ref{eq:prob}).
This is a sequential algorithm which goes as follows.
We start with the observed values $y^{(0)}$, from which we obtain pseudo-likelihood parameter estimates
$\hat{\gamma}^{(0)}$. Next we generate bootstrap resamples
$(y^{(1)}, \ldots, y^{(n)})$, and for each of them estimates
$(\hat{\gamma}^{(1)}, \ldots, \hat{\gamma}^{(n)})$ are obtained.
The bootstrap samples are obtained through the following steps:
\begin{enumerate}
\item starting from an arbitrary location (tree), update the status by sampling from the Bernoulli distribution with probability given by the fitted model parameters and the current status of the plants, $f(y^{(t')}; \hat{\gamma}^{(0)})$. This is done for all units in a random sequence until the cycle is complete, i.e.\ all the status are updated;
\item once a cycle is completed, obtain parameter estimates $\hat{\gamma}^{(t)}$ by maximising the pseudo-likelihood function
$\tilde{L}(\gamma; y^{(t)})$;
\item repeat steps 1 and 2 $n$ times, where $n$ is the number of bootstrap resamples.
\end{enumerate}
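The update cycle in step 1 can be sketched as follows (a Python illustration, not the \pkg{Rcitrus} implementation; border trees simply have fewer neighbours):

```python
import numpy as np

def gibbs_cycle(y, beta0, g1, g2, rng):
    """One complete cycle of step 1: visit every tree in random order and
    redraw its status from the conditional Bernoulli distribution given
    the current status of its neighbours (border trees have fewer)."""
    y = y.copy()
    n, m = y.shape
    sites = [(i, j) for i in range(n) for j in range(m)]
    rng.shuffle(sites)                    # random visiting sequence
    for i, j in sites:
        s_row = (y[i-1, j] if i > 0 else 0) + (y[i+1, j] if i < n-1 else 0)
        s_col = (y[i, j-1] if j > 0 else 0) + (y[i, j+1] if j < m-1 else 0)
        p = 1.0 / (1.0 + np.exp(-(beta0 + g1 * s_row + g2 * s_col)))
        y[i, j] = rng.random() < p        # Bernoulli(p) draw
    return y
```

Iterating `gibbs_cycle` and refitting after each cycle yields the resamples $(y^{(1)}, \ldots, y^{(n)})$ and estimates $(\hat{\gamma}^{(1)}, \ldots, \hat{\gamma}^{(n)})$.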
The variance of the estimate $\hat{\gamma}$ is then given simply by the variance of the
estimates $(\hat{\gamma}^{(1)}, \ldots, \hat{\gamma}^{(n)})$.
It is also advisable to discard a certain number $m$ of initial resamples
and to thin the simulations by keeping one in every
$k$ steps. This allows for convergence of the chain and reduces the number of stored simulations.
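The burn-in and thinning rule amounts to a simple index selection over the resamples, for example:

```python
def kept_indices(n_total, burn_in, thin):
    """Indices of the resamples retained for variance estimation,
    after discarding the first `burn_in` and keeping one in every
    `thin` steps."""
    return list(range(burn_in, n_total, thin))
```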
These procedures were implemented in the add-on package \pkg{Rcitrus}~\cite{Rcitrus:05}
for the \textsf{R} environment for statistical computing~\cite{R:07}.
\subsection{Models}
The data considered here come from 11 visits to the citrus field on different
calendar dates. Three models were considered for the analysis.
The first model ($m1$) considers as spatial covariates the neighbouring observations within and between rows at the present time:
\[\begin{array}{lll}
logit(p_{ij}^t) & = & \beta_0 + \gamma_1 (y_{i-1,j}^t + y_{i+1,j}^t) +
\gamma_2 (y_{i,j-1}^t + y_{i,j+1}^t) \;.
\end{array}\]
Model $m2$ considers the same neighbourhood, but
with data reflecting the status of the plants at the previous observation time:
\[\begin{array}{lll}
logit(p_{ij}^t) & = & \beta_0 +
\gamma_1 (y_{i-1,j}^{t-1} + y_{i+1,j}^{t-1}) +
\gamma_2 (y_{i,j-1}^{t-1} + y_{i,j+1}^{t-1}) \;.
\end{array}\]
Finally, model $m3$ includes both contemporary and previous times
in the covariates:
\[\begin{array}{lll}
logit(p_{ij}^t) & = & \beta_0 +
\gamma_1 (y_{i-1,j}^{t-1} + y_{i+1,j}^{t-1}) +
\gamma_2 (y_{i,j-1}^{t-1} + y_{i,j+1}^{t-1}) + \\
&&\gamma_3 (y_{i-1,j}^t + y_{i+1,j}^t) +
\gamma_4 (y_{i,j-1}^t + y_{i,j+1}^t) \;.
\end{array}
\]
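The covariates of the three models differ only in which time slice the neighbour sums are taken from. A sketch of the design matrix for $m3$ (Python/NumPy; the helper names are hypothetical and the zero-status boundary is an assumption):

```python
import numpy as np

def neighbour_sums(y):
    """Row and column neighbour sums for one time slice
    (trees outside the grid treated as status 0)."""
    p = np.pad(y, 1)
    return p[:-2, 1:-1] + p[2:, 1:-1], p[1:-1, :-2] + p[1:-1, 2:]

def m3_design(y_prev, y_now):
    """Design matrix for model m3: lagged (t-1) and contemporary (t)
    row/column neighbour sums, one column per covariate."""
    r_lag, c_lag = neighbour_sums(y_prev)
    r_now, c_now = neighbour_sums(y_now)
    return np.column_stack([a.ravel() for a in (r_lag, c_lag, r_now, c_now)])
```

Dropping either pair of columns recovers the design of $m2$ (lagged only) or $m1$ (contemporary only).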
The significance test for the regression parameters is based
on the usual approximation in generalised linear models that
$\hat{\gamma} / \sqrt{Var(\hat{\gamma})} \sim N(0,1) $.
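Under this normal approximation, the test statistic and its two-sided p-value can be computed as below (a Python sketch using only the standard library):

```python
import math

def wald_test(gamma_hat, var_hat):
    """Wald z statistic and two-sided p-value under the
    N(0,1) approximation for gamma_hat / sqrt(Var(gamma_hat))."""
    z = gamma_hat / math.sqrt(var_hat)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return z, p_value
```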
For $m1$, the significance test for the coefficients
allows the detection of spatial effects, as well as distinguishing
between the close-neighbour effects given by the within-row
covariate and the more distant neighbour effects given by the between-row covariate.
Model $m2$ assesses the predictive ability of the model
through the lagged information built into the covariates, since the
present status of the trees would allow prediction of the probability
of trees becoming infected at the next observation time.
The different covariate effects assess patterns in the spread
of the disease.
The last model, $m3$, combines contemporary and lagged covariates
to check whether this improves the model fit.
The Akaike Information Criterion (AIC) is used to assess and compare
model fits. It penalises the log-likelihood by model complexity and is given by
$\mathrm{AIC} = -2\log(L(\hat{\theta}; y)) + 2k$, where $k$ is the number of parameters included in the model.
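Comparing fits by AIC (lower is better) then reduces to a one-line computation, sketched here in Python:

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: minus twice the maximised
    log-likelihood plus twice the number of parameters."""
    return -2.0 * log_lik + 2.0 * n_params
```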