Black Cherry Trees ()
Variável resposta (Y) = Volume (volume de madeira)
Variável explicativa (X1) = Girth (diâmetro da árvore a uma determinada altura do solo)
Variável explicativa (X2) = Height (altura da árvore)

data(trees)
##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1
## 16  12.9     74   22.2
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 19  13.7     71   25.7
## 20  13.8     64   24.9
## 21  14.0     78   34.5
## 22  14.2     80   31.7
## 23  14.5     74   36.3
## 24  16.0     72   38.3
## 25  16.3     77   42.6
## 26  17.3     81   55.4
## 27  17.5     82   55.7
## 28  17.9     80   58.3
## 29  18.0     80   51.5
## 30  18.0     80   51.0
## 31  20.6     87   77.0

Há correlação entre as variáveis?

##            Girth    Height    Volume
## Girth  1.0000000 0.5192801 0.9671194
## Height 0.5192801 1.0000000 0.5982497
## Volume 0.9671194 0.5982497 1.0000000

Vemos que existe uma forte correlação positiva entre as variáveis Volume e Girth e uma correlação positiva de teor médio entre as variáveis Volume x Height e Girth x Height.

Vamos tentar ajustar um modelo de regressão linear.

Esse primeiro modelo não seria o primeiro a ser explorado, pois usualmente métodos computacionais (forward, backward, stepwise) são usados para encontrar um modelo inicial, mas por questões didaticas vamos a esse:
Volume ~ Altura

mod1<-lm(trees$Volume~trees$Height)
anova(mod1)
## Analysis of Variance Table
## 
## Response: trees$Volume
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## trees$Height  1 2901.2 2901.19  16.165 0.0003784 ***
## Residuals    29 5204.9  179.48                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1)
## 
## Call:
## lm(formula = trees$Volume ~ trees$Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.274  -9.894  -2.894  12.068  29.852 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -87.1236    29.2731  -2.976 0.005835 ** 
## trees$Height   1.5433     0.3839   4.021 0.000378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 29 degrees of freedom
## Multiple R-squared:  0.3579, Adjusted R-squared:  0.3358 
## F-statistic: 16.16 on 1 and 29 DF,  p-value: 0.0003784
par(mfrow=c(2,2))
plot(mod1,which=c(1:4),add.smooth=FALSE, pch=20)

Esse modelo está explicando muito pouco da variabilidade pode-se verificar olhando para o R² dentro do sumario e tambem pela soma de quadrados residual apresentada na Anova.
Modelo 2:
Volume ~ Diâmetro

mod2<-lm(trees$Volume~trees$Girth)
anova(mod2)
## Analysis of Variance Table
## 
## Response: trees$Volume
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## trees$Girth  1 7581.8  7581.8  419.36 < 2.2e-16 ***
## Residuals   29  524.3    18.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2)
## 
## Call:
## lm(formula = trees$Volume ~ trees$Girth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.065 -3.107  0.152  3.495  9.587 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
## trees$Girth   5.0659     0.2474   20.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.252 on 29 degrees of freedom
## Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
## F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod2, which=c(1:4),add.smooth=FALSE, pch=20)

Nesse modelo tivemos boa parte da variabilidade explicada com R² maior que 90%, partimos então para análise dos resíduos, cujas presuposições aparentemente estão sendo atendidas, talvez a normalidade está um pouco fora nos extremos, e os residuos estão em um intervalo grande com um formato concavo, podemos tentar corrigir com outro modelo.
Modelo 3:
Volume ~ Altura + Diâmetro

mod3<-lm(trees$Volume~trees$Height +trees$Girth)
anova(mod3)
## Analysis of Variance Table
## 
## Response: trees$Volume
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## trees$Height  1 2901.2  2901.2  192.53 4.503e-14 ***
## trees$Girth   1 4783.0  4783.0  317.41 < 2.2e-16 ***
## Residuals    28  421.9    15.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod3)
## 
## Call:
## lm(formula = trees$Volume ~ trees$Height + trees$Girth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -57.9877     8.6382  -6.713 2.75e-07 ***
## trees$Height   0.3393     0.1302   2.607   0.0145 *  
## trees$Girth    4.7082     0.2643  17.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod3, which=c(1:4),add.smooth=FALSE, pch=20)

summary(influence.measures(mod3))
## Potentially influential observations of
##   lm(formula = trees$Volume ~ trees$Height + trees$Girth) :
## 
##    dfb.1_ dfb.tr$H dfb.tr$G dffit   cov.r cook.d hat  
## 31 -0.74   0.34     0.97     1.50_*  0.68  0.61   0.23

O modelo 3 não nos da uma melhora tão relevante, porem talvez se tirarmos a observação 31, cujo podemos considerar um outlier, faça com que nosso gráfico normal fique mais alinhado.
Modelo 3 sem a observação 31:

treess31 <- trees[-31,]
mod3s31 <- lm(treess31$Volume~treess31$Height +treess31$Girth)
anova(mod3s31)
## Analysis of Variance Table
## 
## Response: treess31$Volume
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## treess31$Height  1 1661.3  1661.3  136.43 4.591e-12 ***
## treess31$Girth   1 3849.9  3849.9  316.16 < 2.2e-16 ***
## Residuals       27  328.8    12.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod3s31)
## 
## Call:
## lm(formula = treess31$Volume ~ treess31$Height + treess31$Girth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6396 -1.2786 -0.4938  2.0262  6.4599 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -52.2362     8.0390  -6.498 5.76e-07 ***
## treess31$Height   0.2992     0.1179   2.538   0.0172 *  
## treess31$Girth    4.4773     0.2518  17.781  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.49 on 27 degrees of freedom
## Multiple R-squared:  0.9437, Adjusted R-squared:  0.9395 
## F-statistic: 226.3 on 2 and 27 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod3s31, which=c(1:4),add.smooth=FALSE, pch=20)

Ao tirarmos a observação 31, o gráfico normal fica muito desbalanceado, agora vamos tentar alguma transformação entre as variáveis.
Modelo 4:
Volume ~ Altura + Diâmetro + Diâmetro²

mod4<-lm(trees$Volume~trees$Height+trees$Girth+I(trees$Girth^2))
anova(mod4)
## Analysis of Variance Table
## 
## Response: trees$Volume
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## trees$Height      1 2901.2  2901.2 421.113 < 2.2e-16 ***
## trees$Girth       1 4783.0  4783.0 694.259 < 2.2e-16 ***
## I(trees$Girth^2)  1  235.9   235.9  34.243  3.13e-06 ***
## Residuals        27  186.0     6.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod4)
## 
## Call:
## lm(formula = trees$Volume ~ trees$Height + trees$Girth + I(trees$Girth^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2928 -1.6693 -0.1018  1.7851  4.3489 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -9.92041   10.07911  -0.984 0.333729    
## trees$Height      0.37639    0.08823   4.266 0.000218 ***
## trees$Girth      -2.88508    1.30985  -2.203 0.036343 *  
## I(trees$Girth^2)  0.26862    0.04590   5.852 3.13e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.625 on 27 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.9745 
## F-statistic: 383.2 on 3 and 27 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod4, which=c(1:4),add.smooth=FALSE, pch=20)

Tivemos um ganho no R² para 97%, mas a análise dos residuos mostra uma forma não muito aleatória.
Modelo 5:
Volume ~ Altura + Altura² + Diâmetro + Diâmetro²

mod5<-lm(trees$Volume~trees$Height+I(trees$Height^2)+trees$Girth+I(trees$Girth^2))
anova(mod5)
## Analysis of Variance Table
## 
## Response: trees$Volume
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## trees$Height       1 2901.2  2901.2 405.841 < 2.2e-16 ***
## I(trees$Height^2)  1   98.7    98.7  13.800 0.0009789 ***
## trees$Girth        1 4731.9  4731.9 661.928 < 2.2e-16 ***
## I(trees$Girth^2)   1  188.5   188.5  26.373 2.351e-05 ***
## Residuals         26  185.9     7.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod5)
## 
## Call:
## lm(formula = trees$Volume ~ trees$Height + I(trees$Height^2) + 
##     trees$Girth + I(trees$Girth^2))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.368 -1.670 -0.158  1.792  4.358 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.955101  63.013630  -0.015    0.988    
## trees$Height       0.119372   1.784588   0.067    0.947    
## I(trees$Height^2)  0.001717   0.011905   0.144    0.886    
## trees$Girth       -2.796569   1.468677  -1.904    0.068 .  
## I(trees$Girth^2)   0.265446   0.051689   5.135 2.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.674 on 26 degrees of freedom
## Multiple R-squared:  0.9771, Adjusted R-squared:  0.9735 
## F-statistic:   277 on 4 and 26 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod5, which=c(1:4),add.smooth=FALSE, pch=20)

Temos o último modelo, o ganho com esse modelo é baixo, mas os presupostos aparentemente estão sendo atendidos.