Regression Models and Applications in the R Environment

April 13 to 17, 2015 - Manaus - AM
Prof. Dr. Walmes M. Zeviani
Fundação Oswaldo Cruz - FIOCRUZ
Lab. de Estatística e Geoinformação - LEG
Departamento de Estatística - UFPR
##-----------------------------------------------------------------------------
## Session settings.
pkg <- c("lattice", "latticeExtra", "gridExtra", "car", "alr3", "asbio",
"plyr", "aod", "multcomp", "ellipse", "boot", "wzRfun")
sapply(pkg, require, character.only=TRUE)
## lattice latticeExtra gridExtra car alr3 asbio
## TRUE TRUE TRUE TRUE TRUE TRUE
## plyr aod multcomp ellipse boot wzRfun
## TRUE TRUE TRUE TRUE TRUE TRUE
trellis.device(color=FALSE)
##-----------------------------------------------------------------------------
Consider the model \[ \begin{aligned} \text{E}(Y|x) = \beta_0+\beta_1 x_1+\beta_2 x_2 +\ldots+ \beta_k x_k. \end{aligned} \] Testing hypotheses about the parameters, \(\text{H}_0: \beta_i = 0\), shows which covariates influence the expected value of the observed variable. The hypothesis test can be carried out in several ways:
##-----------------------------------------------------------------------------
## Journal of Applied Ecology, vol. 32, 1995.
url <- "http://www.leg.ufpr.br/~walmes/data/
business_economics_dataset/EXERCISE/SNOWGEES.DAT"
da <-
read.table(paste0(strwrap(url), collapse=""), header=FALSE)
da[,1] <- NULL
names(da) <- c("diet", "wc", "de", "adf")
str(da)
## 'data.frame': 42 obs. of 4 variables:
## $ diet: Factor w/ 2 levels "Chow","Plants": 2 2 2 2 2 2 2 2 2 2 ...
## $ wc : num -6 -5 -4.5 0 2 3.5 -2 -2.5 -3.5 -2.5 ...
## $ de : num 0 2.5 5 0 0 1 2.5 10 20 12.5 ...
## $ adf : num 28.5 27.5 27.5 32.5 32 30 34 36.5 28.5 29 ...
## Data on gosling weight change (wc), digestion efficiency (de),
## acid-detergent fiber (adf) (all in %) and diet (plants or duck chow)
## for 42 feeding trials. The botanists were interested in predicting
## weight change as a function of the other variables.
pairs(da[,-1])
##-----------------------------------------------------------------------------
## Model fit.
## m0 <- lm(wc~de+adf+de:adf, data=da)
m0 <- lm(wc~de+adf, data=da)
par(mfrow=c(2,2)); plot(m0); layout(1)
##-----------------------------------------------------------------------------
## Inference for each beta.
## summary(m0)
summary(m0)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.18043771 4.4023590 2.7667979 0.0086100803
## de -0.02653926 0.0534876 -0.4961759 0.6225553001
## adf -0.45783404 0.1282758 -3.5691367 0.0009688975
## How are these values obtained?
X <- model.matrix(m0); head(X)
## (Intercept) de adf
## 1 1 0.0 28.5
## 2 1 2.5 27.5
## 3 1 5.0 27.5
## 4 1 0.0 32.5
## 5 1 0.0 32.0
## 6 1 1.0 30.0
vcov <- solve(crossprod(X))*summary(m0)$sigma^2; vcov
## (Intercept) de adf
## (Intercept) 19.3807647 -0.221914245 -0.551723317
## de -0.2219142 0.002860923 0.006038197
## adf -0.5517233 0.006038197 0.016454693
vcov(m0) ## The variance-covariance matrix of the betas.
## (Intercept) de adf
## (Intercept) 19.3807647 -0.221914245 -0.551723317
## de -0.2219142 0.002860923 0.006038197
## adf -0.5517233 0.006038197 0.016454693
sqrt(diag(vcov))
## (Intercept) de adf
## 4.4023590 0.0534876 0.1282758
## Value of the t statistic.
coef(m0)/sqrt(diag(vcov(m0)))
## (Intercept) de adf
## 2.7667979 -0.4961759 -3.5691367
2*pt(2.7667979, df=df.residual(m0), lower.tail=FALSE)
## [1] 0.008610081
##-----------------------------------------------------------------------------
## Tests via analysis of variance.
## Sequential.
anova(m0)
## Analysis of Variance Table
##
## Response: wc
## Df Sum Sq Mean Sq F value Pr(>F)
## de 1 384.24 384.24 31.020 2.05e-06 ***
## adf 1 157.79 157.79 12.739 0.0009689 ***
## Residuals 39 483.08 12.39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Marginal.
## drop1(m0, test="F")
drop1(m0, scope=.~., test="F")
## Single term deletions
##
## Model:
## wc ~ de + adf
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 483.08 108.59
## de 1 3.05 486.13 106.85 0.2462 0.6225553
## adf 1 157.79 640.88 118.46 12.7387 0.0009689 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m0, type="II")
## Anova Table (Type II tests)
##
## Response: wc
## Sum Sq Df F value Pr(>F)
## de 3.05 1 0.2462 0.6225553
## adf 157.79 1 12.7387 0.0009689 ***
## Residuals 483.08 39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m0, type="III")
## Anova Table (Type III tests)
##
## Response: wc
## Sum Sq Df F value Pr(>F)
## (Intercept) 94.82 1 7.6552 0.0086101 **
## de 3.05 1 0.2462 0.6225553
## adf 157.79 1 12.7387 0.0009689 ***
## Residuals 483.08 39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## m1 <- lm(cbind(wc, wc^2)~de+adf, data=da)
## anova(m1)
## summary(m1)
The t tests returned by summary() are marginal, that is, they place no restriction on the remaining parameters. Thus the test of whether de has a null effect is carried out with adf kept in the model (not fixed at zero or any other particular value). By contrast, anova() returns a list of sequential tests: de is tested against the reference model containing only the intercept (adf absent, i.e., with zero effect), and adf is then tested against the reference model that already contains de. The drop1() function also performs marginal tests, but based on the F distribution.
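As a side check, the connection between the two marginal tests can be verified numerically: for a term with a single degree of freedom, the F statistic from drop1() is the square of the t statistic from summary(). A minimal sketch with simulated data (all names are hypothetical):

```r
## Sketch with simulated data: for a 1-df term, the marginal F test
## from drop1() equals the squared t statistic from summary().
set.seed(123)
n <- 30
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 0.5*x1 + rnorm(n)   # x2 has no true effect
fit <- lm(y~x1+x2)
tval <- summary(fit)$coefficients["x2", "t value"]
Fval <- drop1(fit, scope=.~., test="F")["x2", "F value"]
all.equal(Fval, tval^2)      # the two marginal tests agree
```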
##-----------------------------------------------------------------------------
## Joint tests.
## Test whether the effects of de and adf are simultaneously zero.
## Under this hypothesis, the model without covariates (intercept only)
## is the reference model. This test can be performed in several ways.
summary(m0)
##
## Call:
## lm(formula = wc ~ de + adf, data = da)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0649 -2.0241 0.5645 2.4590 6.8556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.18044 4.40236 2.767 0.008610 **
## de -0.02654 0.05349 -0.496 0.622555
## adf -0.45783 0.12828 -3.569 0.000969 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.519 on 39 degrees of freedom
## Multiple R-squared: 0.5288, Adjusted R-squared: 0.5046
## F-statistic: 21.88 on 2 and 39 DF, p-value: 4.25e-07
m00 <- update(m0, formula=.~1)
anova(m00, m0)
## Analysis of Variance Table
##
## Model 1: wc ~ 1
## Model 2: wc ~ de + adf
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 1025.12
## 2 39 483.08 2 542.03 21.88 4.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Wald test.
coef(m0)
## (Intercept) de adf
## 12.18043771 -0.02653926 -0.45783404
C <- cbind(0, diag(2))
linearHypothesis(m0, hypothesis.matrix=C, rhs=c(0,0), test="F")
## Linear hypothesis test
##
## Hypothesis:
## de = 0
## adf = 0
##
## Model 1: restricted model
## Model 2: wc ~ de + adf
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 1025.12
## 2 39 483.08 2 542.03 21.88 4.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## linearHypothesis(m0, hypothesis.matrix=C, rhs=c(1,1), test="F")
wald.test(Sigma=vcov(m0), b=coef(m0),
Terms=2:3, H0=c(0,0), df=df.residual(m0))
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 43.8, df = 2, P(> X2) = 3.1e-10
##
## F test:
## W = 21.9, df1 = 2, df2 = 39, P(> W) = 4.2e-07
## wald.test(Sigma=vcov(m0), b=coef(m0),
## L=C, H0=c(0,0), df=df.residual(m0))
##-----------------------------------------------------------------------------
## Simple hypothesis tests with multiplicity correction.
## Without correction. Equivalent to summary().
cftest(m0)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = wc ~ de + adf, data = da)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) == 0 12.18044 4.40236 2.767 0.008610 **
## de == 0 -0.02654 0.05349 -0.496 0.622555
## adf == 0 -0.45783 0.12828 -3.569 0.000969 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Univariate p values reported)
## With correction.
g <- glht(m0, linfct=C)
summary(g)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = wc ~ de + adf, data = da)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.02654 0.05349 -0.496 0.75768
## 2 == 0 -0.45783 0.12828 -3.569 0.00155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(g, test=adjusted(type="none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = wc ~ de + adf, data = da)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.02654 0.05349 -0.496 0.622555
## 2 == 0 -0.45783 0.12828 -3.569 0.000969 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
summary(g, test=adjusted(type="bonferroni"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = wc ~ de + adf, data = da)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.02654 0.05349 -0.496 1.00000
## 2 == 0 -0.45783 0.12828 -3.569 0.00194 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- bonferroni method)
summary(g, test=adjusted(type="fdr"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = wc ~ de + adf, data = da)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.02654 0.05349 -0.496 0.62256
## 2 == 0 -0.45783 0.12828 -3.569 0.00194 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- fdr method)
summary(m0)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.18043771 4.4023590 2.7667979 0.0086100803
## de -0.02653926 0.0534876 -0.4961759 0.6225553001
## adf -0.45783404 0.1282758 -3.5691367 0.0009688975
## Joint test.
summary(g, test=Ftest())
##
## General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate
## 1 == 0 -0.02654
## 2 == 0 -0.45783
##
## Global Test:
## F DF1 DF2 Pr(>F)
## 1 21.88 2 39 4.25e-07
##-----------------------------------------------------------------------------
## What is the difference between marginal and joint tests?
ci <- cbind(coef(m0), confint(m0))
plot(ellipse(m0, which=c("de", "adf"), level=0.95), type="l")
abline(v=ci["de",1], h=ci["adf",1], lty=2)
abline(v=ci["de",-1], h=ci["adf",-1], col=2, lty=2)
\[ \begin{aligned} \text{H}_0 &: \begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2 \end{pmatrix} = \begin{pmatrix} m_1\\ m_2 \end{pmatrix}\\ &: C\beta = M \\ \hat{V}_{\beta} &= \hat{\sigma}^2 (X^\top X)^{-1}\\ F &= \frac{(C\hat{\beta}-M)^\top (C\hat{V}_{\beta}C^\top)^{-1} (C\hat{\beta}-M)}{\text{rank}(C)} \stackrel{H_0}{\sim} F_{\text{rank}(C), n-p} \end{aligned} \]
##-----------------------------------------------------------------------------
## How to perform this test with matrix operations and arrive at the
## F value?
b <- coef(m0); V <- vcov(m0); M <- c(0,0)
CB <- (C%*%b-M)
CVi <- solve(C%*%V%*%t(C))
F <- t(CB)%*%CVi%*%CB/nrow(C); F
## [,1]
## [1,] 21.87958
## Software that performs these computations almost certainly relies on
## numerical methods, such as matrix decompositions, rather than the
## literal matrix products shown above.
http://socserv.socsci.mcmaster.ca/jfox/Courses/SPIDA/linear-models-notes.pdf
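To illustrate that point, the coefficient standard errors can be obtained from the QR decomposition of \(X\), as lm() does internally, avoiding the explicit inverse of \(X^\top X\). A minimal sketch with simulated data (names are hypothetical):

```r
## Sketch: standard errors via the QR decomposition instead of
## explicitly inverting X'X.
set.seed(3)
x1 <- runif(20); x2 <- runif(20)
y <- 1 + 2*x1 - x2 + rnorm(20)
fit <- lm(y~x1+x2)
X <- model.matrix(fit)
R <- qr.R(qr(X))             # X = QR, so (X'X)^{-1} = (R'R)^{-1}
XtX.inv <- chol2inv(R)
s2 <- sum(residuals(fit)^2)/df.residual(fit)
all.equal(sqrt(diag(s2*XtX.inv)), unname(sqrt(diag(vcov(fit)))))
```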
The confidence interval for a single parameter is given by \[ \begin{aligned} \frac{\hat{\beta}_i-\beta_i}{\sqrt{\text{diag}(\hat{V}_{\beta})_i}} &\stackrel{H_0}{\sim} t_{n-p} \\ \text{IC}(\beta_i) &= \left\{\beta_i : |\beta_i-\hat{\beta_i}|\leq t_{\alpha/2}\sqrt{\text{diag}(\hat{V}_{\beta})_i}\right\} \end{aligned} \]
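A quick numerical check of that interval on simulated data (names are hypothetical): \(\hat{\beta}_i \pm t_{\alpha/2}\sqrt{\text{diag}(\hat{V}_{\beta})_i}\) reproduces confint() exactly.

```r
## Sketch: the t-based interval above reproduces confint().
set.seed(42)
x <- 1:20
y <- 2 + 0.3*x + rnorm(20)
fit <- lm(y~x)
se <- sqrt(diag(vcov(fit)))
tc <- qt(0.975, df=df.residual(fit))
manual <- cbind(coef(fit)-tc*se, coef(fit)+tc*se)
all.equal(unname(manual), unname(confint(fit)))
```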
The confidence region is given by \[ \begin{aligned} F &= \frac{(\hat{\beta}_{1:k}-\beta_{1:k})^\top (\hat{V}_{1:k})^{-1} (\hat{\beta}_{1:k}-\beta_{1:k})}{k} \stackrel{H_0}{\sim} F_{k, n-p}\\ RC(\beta_{1:k}) &= \{\beta_{1:k} : F\leq F_{k,n-p}\}. \end{aligned} \]
The confidence interval for a linear function of the parameters is obtained from \[ \begin{aligned} F &= (L\hat{\beta})^\top (L\hat{V}_{\beta}L^\top)^{-1} (L\hat{\beta}) \stackrel{H_0}{\sim} F_{1, n-p} = t^2_{n-p}\\ \text{IC}(L\beta) &= L\hat{\beta} \pm t_{\alpha/2, n-p}\sqrt{L\hat{V}_{\beta}L^\top}. \end{aligned} \] The most frequent case of a linear function of the parameters is the estimation of values predicted by the model. In that case \(L = x_i^{+}\) is a vector of covariate values at which the expected value is desired. The interval for a future observation is given by \[ \text{IC}(L\beta) = L\hat{\beta} \pm t_{\alpha/2, n-p}\sqrt{\hat{\sigma}^2+L\hat{V}_{\beta}L^\top} \]
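The formula for the future-observation interval can likewise be checked against predict(..., interval="prediction") on simulated data (names are hypothetical):

```r
## Sketch: interval for a future observation built from
## sigma^2 + L V L', compared with predict().
set.seed(7)
x <- runif(25, 0, 10)
y <- 1 + 2*x + rnorm(25)
fit <- lm(y~x)
L <- cbind(1, 5)                       # covariate profile: x = 5
tc <- qt(0.975, df=df.residual(fit))
s2 <- summary(fit)$sigma^2
se.fut <- sqrt(s2 + L%*%vcov(fit)%*%t(L))
manual <- drop(L%*%coef(fit)) + c(-1, 1)*tc*drop(se.fut)
pr <- predict(fit, newdata=data.frame(x=5), interval="prediction")
all.equal(manual, c(pr[, "lwr"], pr[, "upr"]), check.attributes=FALSE)
```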
##-----------------------------------------------------------------------------
## Forest inventory data.
rm(list=ls())
dap <- read.table("http://www.leg.ufpr.br/~walmes/data/dap.txt",
header=TRUE, sep="\t")
str(dap)
## 'data.frame': 991 obs. of 2 variables:
## $ DAP: num 4.87 7.38 5.95 5.73 6.4 ...
## $ HT : num 9.5 9.8 13 13.5 13.5 13.5 13.5 14.3 14.8 14.8 ...
## Rename the variables.
names(dap) <- c("d","h")
## Sort by d to avoid the spaghetti effect when plotting.
dap <- dap[order(dap$d),]
## New data set keeping only records with both d and h observed.
dapcc <- dap[complete.cases(dap),]
## Reset the row names.
rownames(dapcc) <- NULL
head(dapcc)
## d h
## 1 4.8701 9.5
## 2 5.7296 13.5
## 3 5.9524 13.0
## 4 6.3344 15.5
## 5 6.3980 13.5
## 6 6.7482 13.5
##-----------------------------------------------------------------------------
## Visualization.
xyplot(h~d, data=dap, type=c("p","smooth"))
##-----------------------------------------------------------------------------
## Model fitting.
## Linear in the parameters (d and sqrt(d)).
m0 <- lm(h~d+sqrt(d), data=dapcc)
summary(m0)
##
## Call:
## lm(formula = h ~ d + sqrt(d), data = dapcc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3596 -1.1310 0.0444 1.1666 10.8313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.0753 4.2832 -3.987 9.12e-05 ***
## d -1.2162 0.2889 -4.210 3.73e-05 ***
## sqrt(d) 15.3125 2.2487 6.809 9.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.099 on 220 degrees of freedom
## Multiple R-squared: 0.7843, Adjusted R-squared: 0.7823
## F-statistic: 399.9 on 2 and 220 DF, p-value: < 2.2e-16
## Result.
layout(matrix(c(1,1,2,3,4,5),2,3))
plot(h~d, dapcc)
lines(fitted(m0)~d, dapcc, col=2)
plot(m0)
layout(1)
##-----------------------------------------------------------------------------
## Influence measures.
inf <- influence.measures(m0)
summary(inf)
## Potentially influential observations of
## lm(formula = h ~ d + sqrt(d), data = dapcc) :
##
## dfb.1_ dfb.d dfb.sq() dffit cov.r cook.d hat
## 1 -0.25 -0.22 0.24 -0.27 1.17_* 0.02 0.14_*
## 2 0.12 0.11 -0.12 0.14 1.11_* 0.01 0.09_*
## 3 -0.01 0.00 0.01 -0.01 1.10_* 0.00 0.08_*
## 4 0.19 0.16 -0.18 0.22 1.07_* 0.02 0.06_*
## 5 -0.04 -0.03 0.04 -0.05 1.08_* 0.00 0.06_*
## 6 -0.09 -0.08 0.08 -0.11 1.07_* 0.00 0.05_*
## 7 -0.03 -0.03 0.03 -0.04 1.07_* 0.00 0.05_*
## 8 -0.12 -0.10 0.11 -0.15 1.06_* 0.01 0.05_*
## 9 0.10 0.08 -0.09 0.12 1.06_* 0.00 0.05_*
## 10 -0.41 -0.33 0.37 -0.56_* 0.94_* 0.10 0.04
## 11 -0.04 -0.03 0.04 -0.05 1.05_* 0.00 0.04
## 15 0.38 0.25 -0.31 0.78_* 0.71_* 0.18 0.02
## 29 -0.02 0.03 -0.01 -0.25 0.96_* 0.02 0.01
## 41 -0.17 -0.26 0.23 0.58_* 0.69_* 0.10 0.01
## 160 0.02 0.00 -0.01 -0.20 0.94_* 0.01 0.01
## 194 -0.08 -0.11 0.09 -0.27 0.93_* 0.02 0.01
## 209 -0.25 -0.31 0.28 -0.50_* 0.87_* 0.08 0.02
## 221 0.03 0.04 -0.04 0.05 1.05_* 0.00 0.03
## 222 0.13 0.15 -0.14 0.19 1.04_* 0.01 0.04
## 223 -0.19 -0.22 0.20 -0.26 1.05_* 0.02 0.05_*
## Flagging the influential points.
## str(inf)
## Influential by DFFITS (influence on the fit).
dfits <- inf$is.inf[,4]
plot(h~d, dapcc)
lines(fitted(m0)~d, dapcc, col=2)
with(dapcc, points(d[dfits], h[dfits], col=2, pch=19))
##-----------------------------------------------------------------------------
## Identify/remove the discrepant/influential points manually
## (visual criterion).
id <- which(dfits)
## Redo the analysis with the points removed.
m1 <- update(m0, data=dapcc[-id,])
summary(m1)
##
## Call:
## lm(formula = h ~ d + sqrt(d), data = dapcc[-id, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6150 -1.0316 0.1181 1.2315 3.9940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.210 3.585 -4.243 3.28e-05 ***
## d -1.034 0.241 -4.291 2.68e-05 ***
## sqrt(d) 14.081 1.878 7.497 1.66e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.729 on 216 degrees of freedom
## Multiple R-squared: 0.8459, Adjusted R-squared: 0.8444
## F-statistic: 592.7 on 2 and 216 DF, p-value: < 2.2e-16
##-----------------------------------------------------------------------------
## Prediction.
er <- extendrange(dapcc$d, f=0.1)
d.new <- seq(er[1], er[2], length=100)
c(head(d.new), tail(d.new))
## [1] 2.606910 2.881236 3.155562 3.429888 3.704214 3.978540 28.393560 28.667886
## [9] 28.942212 29.216538 29.490864 29.765190
pred <- data.frame(d=d.new)
## Predictions with confidence and prediction (future observation)
## intervals.
Yp <- predict(m1, newdata=pred, interval="confidence")
Yf <- predict(m1, newdata=pred, interval="prediction")
colnames(Yf) <- toupper(colnames(Yf))
pred <- cbind(pred, Yp, Yf)
str(pred)
## 'data.frame': 100 obs. of 7 variables:
## $ d : num 2.61 2.88 3.16 3.43 3.7 ...
## $ fit: num 4.83 5.71 6.54 7.32 8.06 ...
## $ lwr: num 2.45 3.5 4.49 5.41 6.29 ...
## $ upr: num 7.21 7.92 8.59 9.23 9.83 ...
## $ FIT: num 4.83 5.71 6.54 7.32 8.06 ...
## $ LWR: num 0.672 1.651 2.563 3.416 4.219 ...
## $ UPR: num 8.99 9.77 10.52 11.23 11.9 ...
## Adding the model expression to the plot.
c0 <- coef(m1)
co <- format(c(abs(c0), summary(m1)$r.squared),
digits=3, trim=TRUE)
sinal <- ifelse(c0<0, "-", "+")
co[seq_along(sinal)] <- paste(sinal, co[seq_along(sinal)], sep="")
l <- c("Predito", "IC predito", "IC obs. futura")
key <- list(#corner=c(0.05,0.95),
lines=list(lty=1),
text=list(l[1]),
rect=list(col=c("red"), alpha=0.1, lty=3),
text=list(l[2]),
rect=list(col=c("blue"), alpha=0.1, lty=3),
text=list(l[3]))
p1 <- xyplot(h~d, data=dapcc[-id,], xlim=er,
xlab="DAP (cm)", ylab="Altura (m)", key=key)
p2 <- xyplot(fit~d, data=pred, type="l",
ly=pred$lwr, uy=pred$upr, cty="bands", fill="red",
prepanel=prepanel.cbH, panel=panel.cbH)
p3 <- xyplot(FIT~d, data=pred, type="l",
ly=pred$LWR, uy=pred$UPR, cty="bands", fill="blue",
prepanel=prepanel.cbH, panel=panel.cbH)
p1+as.layer(p2)+as.layer(p3)+
layer(panel.text(
x=20, y=10,
label=substitute(hat(h)==b0~b1*d~b2*sqrt(d)~~~~(R^2==r2),
list(b0=co[1], b1=co[2], b2=co[3], r2=co[4])), bty="n"))
##-----------------------------------------------------------------------------
## Computing the confidence bands with matrix operations.
formula(m1)
## h ~ d + sqrt(d)
## X <- model.matrix(~d+sqrt(d), data=pred)
X <- model.matrix(m1)
L <- model.matrix(formula(m1)[-2], data=pred)
V <- vcov(m1)
U <- chol(V)
s2 <- summary(m1)$sigma^2
dfr <- df.residual(m1)
## solve(crossprod(X))*summary(m1)$sigma^2
## V
## Standard error of each fitted value (\hat{y}).
## predict(m1, newdata=pred, se=TRUE)
## sqrt(diag(L%*%V%*%t(L)))
se.fit <- sqrt(apply(L%*%t(U), 1, function(x) sum(x^2)))
## Standard error of each future observation (y).
## sqrt(s2+diag(L%*%V%*%t(L)))
se.fut <- sqrt(s2+apply(L%*%t(U), 1, function(x) sum(x^2)))
## Upper limit of each interval, just for comparison.
bc <- data.frame(predito=L%*%coef(m1)+qt(0.975, df=dfr)*se.fit,
obsfut=L%*%coef(m1)+qt(0.975, df=dfr)*se.fut)
cbind(head(pred[,c("upr","UPR")]), head(bc))
## upr UPR predito obsfut
## 1 7.210138 8.986372 7.210138 8.986372
## 2 7.920705 9.772898 7.920705 9.772898
## 3 8.591321 10.517627 8.591321 10.517627
## 4 9.227137 11.225693 9.227137 11.225693
## 5 9.832295 11.901201 9.832295 11.901201
## 6 10.410188 12.547488 10.410188 12.547488
##-----------------------------------------------------------------------------
## The intervals can also be obtained with glht(). This is useful when
## the model contains local control terms, such as blocks, and a
## prediction averaged over the blocks is desired.
## head(predict(m1, newdata=pred, se.fit=TRUE))
## g <- summary(glht(m1, linfct=X), test=adjusted(type="none"))
g <- summary(glht(m1, linfct=L), test=adjusted(type="none"))
ci <- confint(g, calpha=univariate_calpha())
ci <- ci$confint
head(ci)
## Estimate lwr upr
## 1 4.829259 2.448379 7.210138
## 2 5.711868 3.503030 7.920705
## 3 6.540154 4.488987 8.591321
## 4 7.321067 5.414998 9.227137
## 5 8.060191 6.288088 9.832295
## 6 8.762092 7.113995 10.410188
head(pred[c("fit","lwr","upr")])
## fit lwr upr
## 1 4.829259 2.448379 7.210138
## 2 5.711868 3.503030 7.920705
## 3 6.540154 4.488987 8.591321
## 4 7.321067 5.414998 9.227137
## 5 8.060191 6.288088 9.832295
## 6 8.762092 7.113995 10.410188
##-----------------------------------------------------------------------------
## help(ais)
rm(list=ls())
str(ais)
## 'data.frame': 202 obs. of 14 variables:
## $ Sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ht : num 196 190 178 185 185 ...
## $ Wt : num 78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
## $ LBM : num 63.3 58.5 55.4 57.2 53.2 ...
## $ RCC : num 3.96 4.41 4.14 4.11 4.45 4.1 4.31 4.42 4.3 4.51 ...
## $ WCC : num 7.5 8.3 5 5.3 6.8 4.4 5.3 5.7 8.9 4.4 ...
## $ Hc : num 37.5 38.2 36.4 37.3 41.5 37.4 39.6 39.9 41.1 41.6 ...
## $ Hg : num 12.3 12.7 11.6 12.6 14 12.5 12.8 13.2 13.5 12.7 ...
## $ Ferr : int 60 68 21 69 29 42 73 44 41 44 ...
## $ BMI : num 20.6 20.7 21.9 21.9 19 ...
## $ SSF : num 109.1 102.8 104.6 126.4 80.3 ...
## $ Bfat : num 19.8 21.3 19.9 23.7 17.6 ...
## $ Label: Factor w/ 17 levels "f-b_ball","f-field",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sport: Factor w/ 10 levels "b_ball","field",..: 1 1 1 1 1 1 1 1 1 1 ...
## xyplot(LBM~Ht, groups=Sex, ais, type=c("p","smooth"))
xyplot(LBM~Wt, groups=Sex, ais, type=c("p","smooth"))
m0 <- lm(LBM~Sex*Wt, data=ais)
par(mfrow=c(2,2)); plot(m0); layout(1)
pred <- ddply(ais, .(Sex), summarise,
Wt={
er <- extendrange(Wt, f=0.1)
seq(er[1], er[2], l=20)
})
bc <- predict(m0, newdata=pred, interval="confidence")
pred <- cbind(pred, bc)
p1 <- xyplot(LBM~Wt, groups=Sex, data=ais,
xlim=range(pred$Wt))
p2 <- xyplot(fit~Wt, groups=Sex, data=pred, type="l",
ly=pred$lwr, uy=pred$upr, cty="bands",
prepanel=prepanel.cbH,
panel.groups=panel.cbH, panel=panel.superpose)
p1+as.layer(p2, under=TRUE)
##-----------------------------------------------------------------------------
## Are the lines parallel?
anova(m0)
## Analysis of Variance Table
##
## Response: LBM
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 19720.1 19720.1 3000.294 < 2e-16 ***
## Wt 1 13075.2 13075.2 1989.307 < 2e-16 ***
## Sex:Wt 1 240.2 240.2 36.544 7.3e-09 ***
## Residuals 198 1301.4 6.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m0)
##
## Call:
## lm(formula = LBM ~ Sex * Wt, data = ais)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5195 -1.1987 0.4867 1.6712 6.1258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.85097 1.71575 6.324 1.66e-09 ***
## Sex 4.72013 2.35299 2.006 0.0462 *
## Wt 0.77318 0.02056 37.602 < 2e-16 ***
## Sex:Wt -0.18925 0.03131 -6.045 7.30e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.564 on 198 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9615
## F-statistic: 1675 on 3 and 198 DF, p-value: < 2.2e-16
##-----------------------------------------------------------------------------
## Data.
rm(list=ls())
url <- "http://www.leg.ufpr.br/~walmes/data/cresmicelial.txt"
## url <- "/home/walmes/Dropbox/Consultorias/Paulo Lichtemberg/cresmicelial.txt"
da <- read.table(url, header=TRUE, sep="\t")
da$isolado <- factor(da$isolado)
str(da)
## 'data.frame': 192 obs. of 3 variables:
## $ isolado: Factor w/ 16 levels "1","2","4","5",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ temp : num 20 20 20 20 20 20 20 20 20 20 ...
## $ mmdia : num 14.25 13.25 13.25 9.75 12.75 ...
panel.quad <- function(x, y, ...){
panel.xyplot(x, y, ...)
m <- lm(y~poly(x, degree=2))
er <- extendrange(x, f=0.1)
xp <- seq(er[1], er[2], l=20)
yp <- predict(m, newdata=list(x=xp))
panel.lines(xp, yp)
}
xyplot(mmdia~temp|isolado, data=da, panel=panel.quad)
##-----------------------------------------------------------------------------
## Fit the model for isolate 6 and make inference for the point of
## maximum.
db <- subset(da, isolado=="6")
m0 <- lm(mmdia~temp+I(temp^2), data=db)
b <- coef(m0); b
## (Intercept) temp I(temp^2)
## -192.99424 17.91204 -0.38257
-b["temp"]/(2*b["I(temp^2)"])
## temp
## 23.41014
## Standard error via the delta method.
dm <- deltaMethod(m0, "-0.5*b1/b2",
parameterNames=paste0("b", 0:2))
dm
## Estimate SE
## -0.5 * b1/b2 23.41014 0.3515234
## Confidence interval.
ic <- with(dm, Estimate+c(-1,1)*qnorm(0.975)*SE); ic
## [1] 22.72117 24.09912
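The delta-method standard error can also be computed by hand: for \(g(\beta) = -\beta_1/(2\beta_2)\), \(\text{SE} = \sqrt{\nabla g^\top \hat{V}_{\beta} \nabla g}\). A self-contained sketch on simulated data (names are hypothetical), using only base R:

```r
## Sketch: delta-method SE for the vertex -b1/(2*b2) of a quadratic,
## using the analytic gradient and base R only.
set.seed(1)
x <- rep(c(18, 22, 26, 30), each=3)
y <- 15 - 0.4*(x-24)^2 + rnorm(length(x))   # true maximum at x = 24
fit <- lm(y~x+I(x^2))
b <- coef(fit); V <- vcov(fit)
xm <- -b[2]/(2*b[3])                        # estimated point of maximum
grad <- c(0, -1/(2*b[3]), b[2]/(2*b[3]^2))  # gradient of -b1/(2*b2)
se <- sqrt(drop(t(grad)%*%V%*%grad))
c(estimate=unname(xm), se=se)
```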
##-----------------------------------------------------------------------------
## Inference via nonparametric bootstrap.
proc <- function(data, idx){
m0 <- lm(fitted+resid[idx]~temp+I(temp^2), data=data)
b <- coef(m0)
xm <- -b["temp"]/(2*b["I(temp^2)"])
return(xm)
}
db <- cbind(db, fitted=fitted(m0), resid=residuals(m0))
names(db)
## [1] "isolado" "temp" "mmdia" "fitted" "resid"
bt <- boot(data=db, statistic=proc, R=999)
plot(density(bt$t))
rug(bt$t)
abline(v=quantile(bt$t, c(0.025, 0.975)), lty=2)
abline(v=ic, col=2, lty=2)
##-----------------------------------------------------------------------------
## Inference via parametric bootstrap.
dat.fun <- function(data){
## Computes the statistic of interest as a function of the data
## only.
m0 <- lm(mmdia~temp+I(temp^2), data=data)
b <- coef(m0)
xm <- -b["temp"]/(2*b["I(temp^2)"])
return(xm)
}
dat.rg <- function(data, mle){
## Generates simulated data from the original data.
out <- data
out$mmdia <- rnorm(mle$Ey, mean=mle$Ey, sd=mle$sigma)
return(out)
}
btp <- boot(data=db, statistic=dat.fun,
R=999, sim="parametric",
ran.gen=dat.rg,
mle=list(Ey=fitted(m0), sigma=summary(m0)$sigma))
plot(density(btp$t))
rug(btp$t)
abline(v=quantile(btp$t, c(0.025, 0.975)), lty=2)
abline(v=ic, col=2, lty=2)
##-----------------------------------------------------------------------------
## Overlaying the two.
plot(ecdf(btp$t))
lines(ecdf(bt$t), col=2)
quantile(btp$t, probs=c(0.025, 0.975))
quantile(bt$t, probs=c(0.025, 0.975))
##-----------------------------------------------------------------------------
## Likelihood profile.
db <- subset(da, isolado=="6")
str(db)
## 'data.frame': 12 obs. of 3 variables:
## $ isolado: Factor w/ 16 levels "1","2","4","5",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ temp : num 20 20 20 25 25 25 30 30 30 18.5 ...
## $ mmdia : num 11 21.2 12.8 16.2 12.2 ...
n0 <- nls(mmdia~thy+thc*(temp-thx)^2, data=db,
start=list(thy=15, thx=23.4, thc=-0.38), trace=TRUE)
## 152.9888 : 15.00 23.40 -0.38
## 121.684 : 16.66740 23.41021 -0.38257
## 121.684 : 16.66744 23.41014 -0.38257
pfl <- profile(n0, which=2)
## 129.1757 : 16.66744 -0.38257
## 126.6535 : 16.1108927 -0.3546916
## 139.8982 : 15.5142283 -0.3248035
## 139.8524 : 15.4910464 -0.3219603
## 160.4033 : 14.7624456 -0.2834861
## 160.3873 : 14.8142422 -0.2845005
## 187.306 : 13.9884530 -0.2387947
## 187.1458 : 14.1194831 -0.2439575
## 219.6601 : 13.2313854 -0.1921321
## 218.9192 : 13.4481298 -0.2019116
## 256.7664 : 12.5416070 -0.1451373
## 254.2865 : 12.8507454 -0.1600461
## 299.1534 : 11.98783119 -0.09957198
## 291.5582 : 12.3952607 -0.1201382
## 129.1816 : 16.66744 -0.38257
## 128.1241 : 17.1131998 -0.4030493
## 148.5081 : 17.5330003 -0.4223358
## 148.1548 : 17.3277703 -0.4103218
## 180.146 : 17.5244575 -0.4169882
## 179.7018 : 17.2679812 -0.4029819
## 221.3291 : 17.2101760 -0.3958854
## 220.8958 : 16.9386614 -0.3818288
## 270.3907 : 16.6028265 -0.3602574
## 270.0324 : 16.3423303 -0.3475108
plot(pfl)
##-----------------------------------------------------------------------------
ic ## Delta method.
## [1] 22.72117 24.09912
quantile(btp$t, probs=c(0.025, 0.975)) ## Parametric bootstrap.
quantile(bt$t, probs=c(0.025, 0.975)) ## Nonparametric bootstrap.
confint(pfl) ## Likelihood profile.
## 2.5% 97.5%
## 22.24649 24.10730
##-----------------------------------------------------------------------------
## Pure error: a variance estimate free of the specification of the
## model for the mean, available when there are replicates. It is
## common in designed experiments but rarely occurs in observational
## studies. It allows testing the lack of fit of the mean model
## relative to the model that treats the levels as categories.
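The idea can be sketched with simulated data and base R alone (names are hypothetical): comparing the candidate mean model with the saturated cell-means model yields the lack-of-fit F test.

```r
## Sketch: lack-of-fit test against pure error, base R only.
set.seed(10)
dose <- rep(c(0, 1, 2, 3, 4), each=4)   # replicated levels
y <- 5 + 2*dose + rnorm(length(dose))   # true mean is linear
m.lin <- lm(y~dose)                     # candidate model for the mean
m.sat <- lm(y~factor(dose))             # saturated model (pure error)
anova(m.lin, m.sat)                     # lack-of-fit F test
```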
str(turk0)
## 'data.frame': 35 obs. of 2 variables:
## $ A : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Gain: int 644 631 661 624 633 610 615 605 608 599 ...
plot(Gain~A, data=turk0)
xtabs(~A, data=turk0)
## A
## 0 0.04 0.1 0.16 0.28 0.44
## 10 5 5 5 5 5
m0 <- lm(Gain~factor(A), data=turk0)
m1 <- lm(Gain~A+I(A^2), data=turk0)
## Lack-of-fit test.
anova(m0, m1)
## Analysis of Variance Table
##
## Model 1: Gain ~ factor(A)
## Model 2: Gain ~ A + I(A^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 9823.6
## 2 32 11339.9 -3 -1516.2 1.492 0.2374
## Built-in lack-of-fit test.
pureErrorAnova(m1)
## Analysis of Variance Table
##
## Response: Gain
## Df Sum Sq Mean Sq F value Pr(>F)
## A 1 124689 124689 368.090 < 2.2e-16 ***
## I(A^2) 1 23836 23836 70.366 3.028e-09 ***
## Residuals 32 11340 354
## Lack of fit 3 1516 505 1.492 0.2374
## Pure Error 29 9824 339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##-----------------------------------------------------------------------------
## Session information.
Sys.time()
## [1] "2015-04-15 10:05:05 BRT"
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: i686-pc-linux-gnu (32-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pt_BR.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines methods tcltk grid stats graphics grDevices utils
## [9] datasets base
##
## other attached packages:
## [1] wzRfun_0.4 boot_1.3-13 ellipse_0.3-8 multcomp_1.3-7
## [5] TH.data_1.0-5 survival_2.37-7 mvtnorm_1.0-1 aod_1.3
## [9] plyr_1.8.1 asbio_1.1-1 alr3_2.0.5 car_2.0-22
## [13] gridExtra_0.9.1 latticeExtra_0.6-26 RColorBrewer_1.0-5 lattice_0.20-29
## [17] rmarkdown_0.3.3 knitr_1.8
##
## loaded via a namespace (and not attached):
## [1] deSolve_1.11 digest_0.6.4 doBy_4.5-12 evaluate_0.5.5
## [5] formatR_1.0 htmltools_0.2.6 MASS_7.3-37 Matrix_1.1-5
## [9] multcompView_0.1-5 nnet_7.3-8 pixmap_0.4-11 plotrix_3.5-10
## [13] Rcpp_0.11.3 sandwich_2.3-2 scatterplot3d_0.3-35 stringr_0.6.2
## [17] tkrplot_0.0-23 tools_3.1.2 yaml_2.1.13 zoo_1.7-11