REVIEW
We’ve studied distributionin histogram , measures of location/ spread. range. and more for normaldistributions.
Lecture 6.1
so relationship ofdatas ?
we start by looking at twovariable , start from histogram analysis.
linear relation / correlation
univariate data
Bivariate data : scatterdiagram
cant see relation
( scatter. As in Octave )
Bivariate data : positiveassociation , linear
Terminology:
association : any relation between variables.
positive association: above average values of onevariable tend to go with the above average values of the other; the scatterslopes up
and corresponding to negative association
linear association : roughly , the scatter diagram is clustered around a straight line.
E.G.
Scatterplot of GMAT data
Football-shaped scatterplot[橄榄球]
Heteroscedastic Data [Heteroscedastic: 异常的,unnormal,异方差的]
May lay out with outlier
The correlation coefficient: Calculation and properties
for the same scale, wecould see how linear two variables are
correlationcoefficient ( r ) : a number to discribe how linear a set of data is.
Intuition
could see the correspondngdirections from mean of xs and ys.
Formula
1. convert both lists instd.units.
2. multiply from thecorrespoding position
3. mean of products.
Math
r = 1/n *SUM( ( stdx ) * (stdy) )
Properties of r
1. a pure number.
2. -1<=r<=1 ,intuition : the mean of the products in the std.units.
3. switch the variables x,y, r stays the same.
for linear transformations:
4. add a constant to onethe of lists , r stays the same , u know.
5. multiplying one thelists by a positive constant does not change standard units , ( think , as howto measure , or using different units for data.) , so r stays
6. and for a negativemultiplier ?~ u can imagine!
R makes the degreeof the linear associations.~
Lecture 6.2 Using R with Caution !!
think of the bigger shoes of children wearingon , the more ability of reading to them !
Association is not causation
( the basic statistical principle! )
if two variables have anon-zero correlation, they are related to each other in some way , but doesn'tmean that one is the cause of the other!
correlated : linearlyrelation.
Outlier
one point , noticableeffect on r , for a outlier
Ecological correlation
Lecture 7 Regression
consider , knowing twovariables are linear associated , and knowing the correlation .
can u estimate another fromone of the known variable data. ==> Regression ,u know how important it is.
Estimate
Estimate: one variable
Heights: average 67 inches,SD 3inches.
so , one of these people ispicked, u have to estimate the person's height.
so , u guess 67 inches.
error : actual height - 67inches = actual height - average height.
error in using the averageas the estimate : deviation from average
rough size of errors = r.m.s of deviation from average = SD = 3 inches.
chebychev: For at least 75% of the people , the estimate will be correct to within 6inches.
and if roughly normal distribution , 95% , will be correct to within 6 inches.
HOW TO CHOOSE ESTIMATE c?
makes the smallest error.
THE r.m.s of the errorswill be smallest if u choose c = average
average: least squaresestimate
for two variables
Given the values of onevariable , and estimate the other.
Lecture 7.2
Regression line: intuition; the equation in standards units; regression estimates.
How to identify the best line- > Equation of the regression line
estimate y = r*x given x ,correlation r.
in standard units of y,x
SO WHY , CAN U INTUIT IT ?
E.G.
1.Heights: average 67inches, SD 3 inches.
Weights : average 160pounds , SD 20 pounds.
r = 0.6
scatter diagram is roughlyfootball shaped.
Estimate a person with 73inches may weigh how many pounds ?
73 inches in standard units= (73-67)/3 = 2
estimate of weight instandard units = 2*0.6 = 1.2
so estimate of weight inpound = 1.2 * 20 + 160 = 184 pounds.
2.Midterm and final coursesin a large class have a correlation of 0.5 .
The scatter diagram is roughlyfootball shaped.
One of the students is onthe 80th percentile of midterm scores. Estimate the students' percentile rankon the final.
so ,using football shapedproperty: roughly normal.
Lecture 7.3 Regression effect , Galton, and the regression fallacy
Regression Effect
Sir Francis Galton , 1822 –1911 , given the following terminology first:
SD
correlation
regression
Galton's observation:Fathers who are tall tend to have sons who are note quite that tall , onaverage.
Further Explanition
we saw that forfootball-shaped scatterplots the graph of averages is not as steep as the SD line, unless r=±1: If 0<r<1, the average value of Y for individuals whose values of X areabout kSDX above the mean(X) is less than kSDY above themean(Y). Similarly, if −1<r<0,the average value of Y for individuals whose values of X are about kSDX abovemean(X) is less than kSDY below mean(Y).
This phenomenon is calledthe regression effect or regression towards the mean.
Individualswith a given value of X tend to have values of Y that are closer to themean, where closer means fewer SD away.
Consider the IQs of a large group of married couples. Essentially bydefinition, the average IQ score is 100. The SD of IQ is about 15 points.Suppose that for this group, the correlation between the IQs of spouses is0.7—women with above average IQ tend to marry men with above average IQ, andvice versa. Consider a woman in the group whose IQ is 150 (genius level). Whatis our best estimate of her husband's IQ? We shall estimate his IQ using theregression line: Her IQ is 150, which is 50 points above average. 50 points is
3*1/3×15points=3*1/3
so we would estimatethe husband's IQ to be r×31/3SD=0.7×3 1/3SD above average, or about 2 1/3SD above average.Now 2 1/3SD is 35 points, so we expect the husband's IQ to be about 135, notnearly as "smart" as she is.
Now let's predict theIQ of the wife of a man whose IQ is 135. His IQ is 2 1/3SD above average,so we expect her IQ to be 0.7×21/3SD above average. That's about 1.63 SD or 1.63×15=24.12 points aboveaverage, or 124.12, not as "smart" as he is. How can this be consistent?
Thealgebra is correct. The phenomenon is quitegeneral. It is called the regressioneffect. The regression effect is caused bythe same thing that makes the slope of the regression line smaller inmagnitude than the slope of the SD line. If the scatterplot is football-shaped and r is at leastzero but less than 1, then
In a vertical slicecontaining above-average values of X, most of the y coordinates are below theSD line.
In a vertical slicecontaining below-average values of X, most of the y coordinates are above theSD line.
this is confusing for me right now , maybe later , i shouldread it again.
but i understand the general meaning of the regressioneffect , that is the estimate performs more to the average .
Rearrangement: estimate of y = slope xx + intercept
slope = r*sigmay/ sigmax ;intercept = mu y - slope * mu * x
use the slope equation tocalculate more conveniently.
"Plug in" mu x asthe value of x.
estimateof y = mu y ==> the regression line passes through the point ofaverages.
Interpertation of slope
the data meaning , a grouppeople measured at the same time period.