Black Cherry Trees ()
Variável resposta (Y) = Volume (volume de madeira)
Variável explicativa (X1) = Girth (diâmetro da árvore a uma determinada altura do solo)
Variável explicativa (X2) = Height (altura da árvore)
data(trees)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 7 11.0 66 15.6
## 8 11.0 75 18.2
## 9 11.1 80 22.6
## 10 11.2 75 19.9
## 11 11.3 79 24.2
## 12 11.4 76 21.0
## 13 11.4 76 21.4
## 14 11.7 69 21.3
## 15 12.0 75 19.1
## 16 12.9 74 22.2
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 19 13.7 71 25.7
## 20 13.8 64 24.9
## 21 14.0 78 34.5
## 22 14.2 80 31.7
## 23 14.5 74 36.3
## 24 16.0 72 38.3
## 25 16.3 77 42.6
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 28 17.9 80 58.3
## 29 18.0 80 51.5
## 30 18.0 80 51.0
## 31 20.6 87 77.0
Há correlação entre as variáveis?
## Girth Height Volume
## Girth 1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000
Vemos que existe uma forte correlação positiva entre as variáveis Volume e Girth e uma correlação positiva de teor médio entre as variáveis Volume x Height e Girth x Height.
Vamos tentar ajustar um modelo de regressão linear.
Esse primeiro modelo não seria o primeiro a ser explorado, pois usualmente métodos computacionais (forward, backward, stepwise) são usados para encontrar um modelo inicial, mas por questões didaticas vamos a esse:
Volume ~ Altura
mod1<-lm(trees$Volume~trees$Height)
anova(mod1)
## Analysis of Variance Table
##
## Response: trees$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## trees$Height 1 2901.2 2901.19 16.165 0.0003784 ***
## Residuals 29 5204.9 179.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
##
## Call:
## lm(formula = trees$Volume ~ trees$Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.274 -9.894 -2.894 12.068 29.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.1236 29.2731 -2.976 0.005835 **
## trees$Height 1.5433 0.3839 4.021 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
## F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784
par(mfrow=c(2,2))
plot(mod1,which=c(1:4),add.smooth=FALSE, pch=20)
Esse modelo está explicando muito pouco da variabilidade pode-se verificar olhando para o R² dentro do sumario e tambem pela soma de quadrados residual apresentada na Anova.
Modelo 2:
Volume ~ Diâmetro
mod2<-lm(trees$Volume~trees$Girth)
anova(mod2)
## Analysis of Variance Table
##
## Response: trees$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## trees$Girth 1 7581.8 7581.8 419.36 < 2.2e-16 ***
## Residuals 29 524.3 18.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)
##
## Call:
## lm(formula = trees$Volume ~ trees$Girth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.065 -3.107 0.152 3.495 9.587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.9435 3.3651 -10.98 7.62e-12 ***
## trees$Girth 5.0659 0.2474 20.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9331
## F-statistic: 419.4 on 1 and 29 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod2, which=c(1:4),add.smooth=FALSE, pch=20)
Nesse modelo tivemos boa parte da variabilidade explicada com R² maior que 90%, partimos então para análise dos resíduos, cujas presuposições aparentemente estão sendo atendidas, talvez a normalidade está um pouco fora nos extremos, e os residuos estão em um intervalo grande com um formato concavo, podemos tentar corrigir com outro modelo.
Modelo 3:
Volume ~ Altura + Diâmetro
mod3<-lm(trees$Volume~trees$Height +trees$Girth)
anova(mod3)
## Analysis of Variance Table
##
## Response: trees$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## trees$Height 1 2901.2 2901.2 192.53 4.503e-14 ***
## trees$Girth 1 4783.0 4783.0 317.41 < 2.2e-16 ***
## Residuals 28 421.9 15.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod3)
##
## Call:
## lm(formula = trees$Volume ~ trees$Height + trees$Girth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## trees$Height 0.3393 0.1302 2.607 0.0145 *
## trees$Girth 4.7082 0.2643 17.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod3, which=c(1:4),add.smooth=FALSE, pch=20)
summary(influence.measures(mod3))
## Potentially influential observations of
## lm(formula = trees$Volume ~ trees$Height + trees$Girth) :
##
## dfb.1_ dfb.tr$H dfb.tr$G dffit cov.r cook.d hat
## 31 -0.74 0.34 0.97 1.50_* 0.68 0.61 0.23
O modelo 3 não nos da uma melhora tão relevante, porem talvez se tirarmos a observação 31, cujo podemos considerar um outlier, faça com que nosso gráfico normal fique mais alinhado.
Modelo 3 sem a observação 31:
treess31 <- trees[-31,]
mod3s31 <- lm(treess31$Volume~treess31$Height +treess31$Girth)
anova(mod3s31)
## Analysis of Variance Table
##
## Response: treess31$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## treess31$Height 1 1661.3 1661.3 136.43 4.591e-12 ***
## treess31$Girth 1 3849.9 3849.9 316.16 < 2.2e-16 ***
## Residuals 27 328.8 12.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod3s31)
##
## Call:
## lm(formula = treess31$Volume ~ treess31$Height + treess31$Girth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6396 -1.2786 -0.4938 2.0262 6.4599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52.2362 8.0390 -6.498 5.76e-07 ***
## treess31$Height 0.2992 0.1179 2.538 0.0172 *
## treess31$Girth 4.4773 0.2518 17.781 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.49 on 27 degrees of freedom
## Multiple R-squared: 0.9437, Adjusted R-squared: 0.9395
## F-statistic: 226.3 on 2 and 27 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod3s31, which=c(1:4),add.smooth=FALSE, pch=20)
Ao tirarmos a observação 31, o gráfico normal fica muito desbalanceado, agora vamos tentar alguma transformação entre as variáveis.
Modelo 4:
Volume ~ Altura + Diâmetro + Diâmetro²
mod4<-lm(trees$Volume~trees$Height+trees$Girth+I(trees$Girth^2))
anova(mod4)
## Analysis of Variance Table
##
## Response: trees$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## trees$Height 1 2901.2 2901.2 421.113 < 2.2e-16 ***
## trees$Girth 1 4783.0 4783.0 694.259 < 2.2e-16 ***
## I(trees$Girth^2) 1 235.9 235.9 34.243 3.13e-06 ***
## Residuals 27 186.0 6.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod4)
##
## Call:
## lm(formula = trees$Volume ~ trees$Height + trees$Girth + I(trees$Girth^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2928 -1.6693 -0.1018 1.7851 4.3489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.92041 10.07911 -0.984 0.333729
## trees$Height 0.37639 0.08823 4.266 0.000218 ***
## trees$Girth -2.88508 1.30985 -2.203 0.036343 *
## I(trees$Girth^2) 0.26862 0.04590 5.852 3.13e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.625 on 27 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
## F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod4, which=c(1:4),add.smooth=FALSE, pch=20)
Tivemos um ganho no R² para 97%, mas a análise dos residuos mostra uma forma não muito aleatória.
Modelo 5:
Volume ~ Altura + Altura² + Diâmetro + Diâmetro²
mod5<-lm(trees$Volume~trees$Height+I(trees$Height^2)+trees$Girth+I(trees$Girth^2))
anova(mod5)
## Analysis of Variance Table
##
## Response: trees$Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## trees$Height 1 2901.2 2901.2 405.841 < 2.2e-16 ***
## I(trees$Height^2) 1 98.7 98.7 13.800 0.0009789 ***
## trees$Girth 1 4731.9 4731.9 661.928 < 2.2e-16 ***
## I(trees$Girth^2) 1 188.5 188.5 26.373 2.351e-05 ***
## Residuals 26 185.9 7.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod5)
##
## Call:
## lm(formula = trees$Volume ~ trees$Height + I(trees$Height^2) +
## trees$Girth + I(trees$Girth^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.368 -1.670 -0.158 1.792 4.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.955101 63.013630 -0.015 0.988
## trees$Height 0.119372 1.784588 0.067 0.947
## I(trees$Height^2) 0.001717 0.011905 0.144 0.886
## trees$Girth -2.796569 1.468677 -1.904 0.068 .
## I(trees$Girth^2) 0.265446 0.051689 5.135 2.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.674 on 26 degrees of freedom
## Multiple R-squared: 0.9771, Adjusted R-squared: 0.9735
## F-statistic: 277 on 4 and 26 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod5, which=c(1:4),add.smooth=FALSE, pch=20)
Temos o último modelo, o ganho com esse modelo é baixo, mas os presupostos aparentemente estão sendo atendidos.