Using the Framingham dataset
setwd("c:/pendrivok/oktatos/oktatas_2020tavasz/nemet_stat")
frmgham2_vds <- read.csv("frmgham2_vds.csv", sep = ";")  # semicolon-separated file
fr <- frmgham2_vds[frmgham2_vds$PERIOD == 1, ]           # keep the first examination period only
Questions: Is there a relation between two variables? How strong is it?
Answer: using correlation and regression.
Correlation: a symmetric relation of 2 random variables.
Type of the relation: look at scatterplots first:
plot(fr$DIABP, fr$DIABP)     # a variable against itself: a perfect relation
plot(fr$SYSBP, fr$DIABP)     # a strong relation
plot(fr$AGE, fr$SYSBP)       # a weaker relation
plot(fr$HEARTRTE, fr$AGE)    # is there any relation?
Correlation: appropriate if the relation is monotonic (consistently positive or negative).
plot(fr$SYSBP, fr$DIABP, xlab = "systolic", ylab = "diastolic",
main = "correlation1", col = c("blue", "red")[fr$SEX])
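A small simulated illustration of the difference (made-up x and y, not from the dataset): Spearman's rank correlation stays near 1 for any monotonically increasing relation, while Pearson's r measures linear association.
set.seed(1)                       # arbitrary seed, for reproducibility
x <- runif(200, 0, 4)
y <- exp(x) + rnorm(200, sd = 2)  # increasing, but curved
cor(x, y, method = "pearson")     # clearly below 1: the relation is not linear
cor(x, y, method = "spearman")    # close to 1: the relation is monotonic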
“Strength” of the correlation
Measures, usually: Pearson's r and Spearman's rank correlation.
Values: between -1 and 1.
Calculation: measures the co-movement of the deviations from the means: \(r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\sqrt{\sum_i (y_i-\bar{y})^2}}\)
cor(fr$SYSBP, fr$DIABP, method = "pearson")
[1] 0.7842
cor(fr$SYSBP, fr$DIABP, method = "spearman")
[1] 0.7762
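As a check, the deviation formula above can be typed in directly; this sketch (the names ok, x, y are ours) should reproduce the 0.7842:
ok <- complete.cases(fr$SYSBP, fr$DIABP)   # keep complete pairs only
x <- fr$SYSBP[ok]; y <- fr$DIABP[ok]
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))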
Hypothesis test: Assumptions: independent observations. H0: \(\rho = 0\). Test statistic: \(t = r\sqrt{\frac{n-2}{1-r^2}}\), with \(n-2\) degrees of freedom.
cor.test(fr$SYSBP, fr$DIABP, use = "complete.obs", method = "pearson")
Pearson's product-moment correlation
data: fr$SYSBP and fr$DIABP
t = 84, df = 4432, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7726 0.7953
sample estimates:
cor
0.7842
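Reproducing the test by hand from r and n (this should give the t = 84 on df = 4432 above; t_stat is our name):
r <- cor(fr$SYSBP, fr$DIABP, use = "complete.obs")
n <- sum(complete.cases(fr$SYSBP, fr$DIABP))
t_stat <- r * sqrt((n - 2) / (1 - r^2))
t_stat
2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value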
Meaning: a strong positive correlation, significantly different from zero.
But is it RELEVANT?! With n in the thousands even a tiny correlation becomes statistically significant, so judge the size of r, not only the p-value.
Regression: a functional relation (NOT symmetric) between a dependent (outcome, result, Y) variable and independent (explanatory, predictor, X) variable(s).
Y depends on X! This direction comes from subject-matter knowledge, not from the statistics.
Question: how does Y change with X?
Now we will talk about linear regression.
(Its assumptions are checked later with diagnostic plots.)
For 2 variables, correlation and regression questions are "transformable" into each other.
(For comparing two raters or two measuring devices with no gold standard, do NOT use correlation or regression: next lecture…)
Linear regression:
Linear function: \(Y = \beta_0 + \beta_1 X + \epsilon\)
With the fitted line we estimate the y value (\(\hat{y}\)) for a given x:
\(\hat{y} = b_0 + b_1 x\)
Meaning of intercept and slope (learned before, but worth restating): \(b_0\) is the estimated Y at X = 0, and \(b_1\) is the estimated change in Y when X increases by one unit.
For the estimation of the line we (usually) use OLS (the Ordinary Least Squares method), which minimizes the sum of the squared residuals.
Residuals: the vertical differences between the points and the line (measured minus estimated values).
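For one predictor the OLS solution has a closed form; a minimal sketch (the names ok, b1, b0 are ours), which should match the coefficients of lm() below:
ok <- complete.cases(fr$BMI, fr$SYSBP)   # drop the 19 incomplete rows
x <- fr$BMI[ok]; y <- fr$SYSBP[ok]
b1 <- cov(x, y) / var(x)                 # slope: covariance over variance of x
b0 <- mean(y) - b1 * mean(x)             # the line goes through the point of means
c(intercept = b0, slope = b1)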
plot(fr$BMI, fr$SYSBP)
abline(lm(fr$SYSBP ~ fr$BMI))
lm(fr$SYSBP ~ fr$BMI)
Call:
lm(formula = fr$SYSBP ~ fr$BMI)
Coefficients:
(Intercept) fr$BMI
86.67 1.79
H0: theoretical slope = 0. Test statistic: \(t = b_1/SE(b_1)\), with \(n-2\) degrees of freedom.
summary(lm(fr$SYSBP ~ fr$BMI))
Call:
lm(formula = fr$SYSBP ~ fr$BMI)
Residuals:
Min 1Q Median 3Q Max
-56.21 -14.78 -3.83 10.42 138.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 86.6668 2.0286 42.7 <2e-16 ***
fr$BMI 1.7885 0.0775 23.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.1 on 4413 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.108, Adjusted R-squared: 0.107
F-statistic: 532 on 1 and 4413 DF, p-value: <2e-16
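The t values in the table are simply estimate / standard error; a small check, pulling the coefficient matrix out of summary() (written with the data = fr form, so the row is named "BMI"):
co <- summary(lm(SYSBP ~ BMI, data = fr))$coefficients
co["BMI", "Estimate"] / co["BMI", "Std. Error"]   # ~23.1, as printed above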
H0: X and Y are (linearly) independent, i.e. \(\beta_1 = 0\). Test statistic: \(F = \frac{SS_R/1}{SS_E/(n-2)}\)
The variability of Y is divided into 2 parts: total variability of Y = variability explained by the regression + variability of the random error, i.e. \(SS_T = SS_R + SS_E\)
anova(lm(fr$SYSBP ~ fr$BMI))
Analysis of Variance Table
Response: fr$SYSBP
Df Sum Sq Mean Sq F value Pr(>F)
fr$BMI 1 237560 237560 532 <2e-16 ***
Residuals 4413 1969349 446
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In case of 1 X the two hypotheses are the same test: \(F = t^2\) (here \(23.1^2 \approx 532\)).
How much of the variability of Y is explained by X?
\(R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}\)
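The decomposition and \(R^2\) can be reconstructed from the fitted model; a sketch (the names m, SSE, SST, SSR are ours):
m <- lm(SYSBP ~ BMI, data = fr)
SSE <- sum(residuals(m)^2)                          # error (residual) sum of squares
SST <- sum((m$model$SYSBP - mean(m$model$SYSBP))^2) # total sum of squares of Y
SSR <- SST - SSE                                    # part explained by the regression
c(R2 = SSR / SST, F = (SSR / 1) / (SSE / df.residual(m)))  # ~0.108 and ~532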
(Interpretation: if we repeated the study many times, 95% of the confidence intervals computed this way would contain the true parameter.)
confint(lm(fr$SYSBP ~ fr$BMI))
2.5 % 97.5 %
(Intercept) 82.690 90.64
fr$BMI 1.637 1.94
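The same interval by hand: estimate plus/minus the t critical value times the standard error, on the residual degrees of freedom (a sketch):
m <- lm(SYSBP ~ BMI, data = fr)
co <- summary(m)$coefficients
co["BMI", "Estimate"] +
  c(-1, 1) * qt(0.975, df.residual(m)) * co["BMI", "Std. Error"]  # ~1.637 to 1.94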
Confidence interval for the expected (mean) SYSBP at a given BMI:
reg_mod <- lm(SYSBP ~ BMI, data = fr)
x <- data.frame(BMI = 20)
predict(reg_mod, newdata = x, interval = "confidence")
fit lwr upr
1 122.4 121.4 123.5
For several x values:
x <- data.frame(BMI = 20:25)
predict(reg_mod, newdata = x, interval = "confidence")
fit lwr upr
1 122.4 121.4 123.5
2 124.2 123.3 125.2
3 126.0 125.2 126.9
4 127.8 127.0 128.6
5 129.6 128.9 130.3
6 131.4 130.7 132.0
Prediction interval: an interval that contains a new observation with 95% probability.
x <- data.frame(BMI = 20)
predict(reg_mod, newdata = x, interval = "prediction")
fit lwr upr
1 122.4 81.01 163.9
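The difference between the two intervals is easiest to see over a grid of BMI values: the confidence band for the mean is narrow, the prediction band for a new subject is wide. A sketch (the grid and its range are our choice):
grid <- data.frame(BMI = seq(15, 50, by = 0.5))
ci <- predict(reg_mod, newdata = grid, interval = "confidence")
pi <- predict(reg_mod, newdata = grid, interval = "prediction")
plot(fr$BMI, fr$SYSBP, col = "grey")
lines(grid$BMI, ci[, "fit"])                                                  # fitted line
lines(grid$BMI, ci[, "lwr"], lty = 2); lines(grid$BMI, ci[, "upr"], lty = 2)  # confidence band
lines(grid$BMI, pi[, "lwr"], lty = 3); lines(grid$BMI, pi[, "upr"], lty = 3)  # prediction band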
We have assumptions: linearity, independent errors with constant variance, (roughly) normally distributed errors. Check them with the diagnostic plots:
plot(reg_mod)
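plot() on an lm object shows four diagnostic plots one after the other; to see them on one screen:
par(mfrow = c(2, 2))   # 2 x 2 plot layout
plot(reg_mod)
par(mfrow = c(1, 1))   # restore the default layout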
Always write up the fitted equation! Here: \(\widehat{SYSBP} = 86.67 + 1.79 \cdot BMI\)
With 2 X variables (multiple regression):
regmod2 <- lm(formula = SYSBP ~ BMI + TOTCHOL, data = fr, na.action = na.exclude)
summary(regmod2)
Call:
lm(formula = SYSBP ~ BMI + TOTCHOL, data = fr, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-50.73 -14.29 -3.66 10.16 138.96
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.94431 2.47293 28.3 <2e-16 ***
BMI 1.68506 0.07746 21.8 <2e-16 ***
TOTCHOL 0.08176 0.00711 11.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.8 on 4361 degrees of freedom
(70 observations deleted due to missingness)
Multiple R-squared: 0.134, Adjusted R-squared: 0.134
F-statistic: 339 on 2 and 4361 DF, p-value: <2e-16
And without any X variable (the null model)?
regmod_null <- lm(formula = SYSBP ~ 1, data = fr, na.action = na.exclude)
summary(regmod_null) # intercept is the mean!
Call:
lm(formula = SYSBP ~ 1, data = fr, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-49.41 -15.41 -3.91 11.09 162.09
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 132.908 0.337 395 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.4 on 4433 degrees of freedom
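Quick check that the null-model intercept is just the sample mean:
mean(fr$SYSBP, na.rm = TRUE)   # ~132.908, as in the summary above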
Which model is the best? Which variable is good to put in? Let's fit two more candidates, then compare them (see the anova()/AIC sketch below).
regmod3 <- lm(formula = SYSBP ~ BMI + AGE, data = fr, na.action = na.exclude)
summary(regmod3)
Call:
lm(formula = SYSBP ~ BMI + AGE, data = fr, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-62.71 -12.78 -2.46 10.10 129.25
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.1360 2.3799 19.8 <2e-16 ***
BMI 1.5254 0.0725 21.1 <2e-16 ***
AGE 0.9282 0.0343 27.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.6 on 4412 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.235, Adjusted R-squared: 0.234
F-statistic: 676 on 2 and 4412 DF, p-value: <2e-16
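One formal way to compare: for nested models an F test via anova(), and AIC more generally. A sketch, assuming reg_mod (SYSBP ~ BMI) is still in the workspace; both models drop the same 19 incomplete rows here, so they are comparable:
anova(reg_mod, regmod3)   # does adding AGE significantly improve the fit?
AIC(reg_mod, regmod3)     # smaller AIC is better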
regmod4 <- lm(formula = SYSBP ~ CURSMOKE, data = fr, na.action = na.exclude)
summary(regmod4)
Call:
lm(formula = SYSBP ~ CURSMOKE, data = fr, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-50.37 -15.37 -3.87 11.15 159.13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.868 0.468 290.22 <2e-16 ***
CURSMOKEyes -6.018 0.668 -9.02 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.2 on 4432 degrees of freedom
Multiple R-squared: 0.018, Adjusted R-squared: 0.0178
F-statistic: 81.3 on 1 and 4432 DF, p-value: <2e-16
This looks like comparing SYSBP between 2 groups: a t-test? (The slope test in lm is exactly the classical equal-variance two-sample t-test; the Welch t-test below does not assume equal variances, so its numbers differ slightly.)
t.test(fr$SYSBP ~ fr$CURSMOKE)
Welch Two Sample t-test
data: fr$SYSBP by fr$CURSMOKE
t = 9, df = 4400, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.712 7.324
sample estimates:
mean in group no mean in group yes
135.9 129.8
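To see the equivalence exactly, force the classical pooled-variance t-test; its t value should match the CURSMOKEyes row of regmod4 (up to sign):
t.test(SYSBP ~ CURSMOKE, data = fr, var.equal = TRUE)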