AP Statistics – Chapter 3 Notes
§3.1 Scatterplots and Correlations
Consider the following scenarios:
A medical study finds that short women are more likely to have heart attacks than women of
average height, while tall women have the fewest heart attacks.
An insurance group reports that heavier cars have fewer deaths per 100,000 vehicles than lighter
cars do.
What are these studies looking to establish?
A _____________________________ measures an outcome of a study. An _______________________
_______________________ may help explain or influence changes in a response variable.
The goal of many statistical studies is to show that changes in one or more explanatory variables actually
cause changes in a response variable.
Example: Julie asks, “Can I predict a state’s mean SAT Math score if I know its mean SAT Critical
Reading score?” Jim wants to know how the mean SAT Math and Critical Reading scores this year in the
50 states are related to each other.
For each student, identify the explanatory variable and the response variable, if possible.
Example: Julie wants to know if she can predict a student’s weight from his or her height. Information
about height is easier to obtain than information about weight. Jim wants to know if there is a
relationship between height and weight.
For each student, identify the explanatory variable and response variable, if possible.
CYU: Pg.144
Displaying Relationships: ______________________________
Shows relationship between two ______________________ variables measured on the same
individuals.
The values of one variable appear on the horizontal axis, and the values of the other variable
appear on the vertical axis.
Each individual in the data appears as a point in the graph.
To make a scatterplot:
1) Decide which variable should go on each axis.
2) Label and scale your axes.
3) Plot individual data values.
To interpret a scatterplot:
Look for overall patterns and striking departures from
the pattern:
Describe overall pattern:
_________________: positive or negative association
_____________: linear, roughly linear, curved
________________ of the relationship: strong,
moderately strong, weak
Departure from pattern: ______________________.
Alt. Examples: Track and Field Day!
The table below shows data for 13 students in a statistics class. Each member of the class ran a 40-yard
sprint and then did a long jump (with a running start). Make a scatterplot of the relationship between
sprint time (in seconds) and long jump distance (in inches). Interpret the scatterplot.
Sprint Time (s)    Long Jump Distance (in)
5.41               171
5.05               184
9.49               48
8.09               151
7.01               90
7.17               65
6.83               94
6.73               78
8.01               71
5.68               130
5.78               173
6.31               143
6.04               141
**Pick ‘nice’ values to mark on each axis. Be sure to cover the range of each variable. You don’t need to
start at 0 or to have both axes on the same scale. However, each scale must be consistent. Also,
remember to clearly label the variable name on each axis.
Interpretation: (DOFS in context!)
**Don’t let unusual values influence your description. If covering up one value makes the form go from
nonlinear to linear, you should call it a linear association with an outlier. This is especially true for small
data sets.
Since it is easy to be confused by different scales or by the amount of space around the cloud of points in
a scatterplot, we need to use a numerical measure to supplement the graph. The ________________ (r)
measures the direction and strength of the linear relationship between two quantitative variables.
Facts about r = correlation
r is always a number between ____________________.
The sign indicates the direction: r > 0 denotes a __________________ association and r < 0
denotes a _______________________ association.
Values of r near 0 indicate a very __________________________ relationship.
Values of r near -1 or 1 indicate a ____________________________ relationship.
r is exactly -1 or 1 if the relationship is a ___________________ linear relationship. (all points lie
exactly along a straight line)
A correlation greater than 0.8 is generally described as strong, whereas a correlation less than 0.5
is generally described as weak. These values can vary based upon the "type" of data being
examined. A study utilizing scientific data may require a stronger correlation than a study using
social science data.
Correlation makes no distinction between explanatory and response variables.
r does _______ change when we change the units of measurement of x, y, or both.
Correlation r is just a number; it does not have a ___________ of measurement.
Correlation requires that both variables be ____________________.
Correlation measures the strength of only the linear relationship between two variables. It does
not describe ________________ relationships!
Correlation is not resistant. It is strongly affected by ______________________.
Correlation is not a complete summary of two-variable data. You should also report the means
and standard deviations of both x and y and look at a plot of the data. (Numerical summaries
complement plots of data; they do not replace them!)
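The facts about r listed above can be checked numerically. The sketch below is one way to do it, assuming the standard textbook formula for r (the sum of the products of the z-scores, divided by n - 1), which these notes do not state explicitly. The function name `correlation` and the inches-to-centimeters conversion are illustrative choices, not part of the notes. The data are the sprint times and long-jump distances from the Track and Field example.

```python
from statistics import mean, stdev

# Sprint times (s) and long-jump distances (in) for the 13 students.
sprint = [5.41, 5.05, 9.49, 8.09, 7.01, 7.17, 6.83, 6.73, 8.01, 5.68, 5.78, 6.31, 6.04]
jump = [171, 184, 48, 151, 90, 65, 94, 78, 71, 130, 173, 143, 141]

def correlation(x, y):
    """Sum of the products of z-scores, divided by n - 1 (assumed formula)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

r = correlation(sprint, jump)

# Swapping the roles of x and y should not change r.
r_swapped = correlation(jump, sprint)

# Changing units (inches to centimeters) should not change r either.
jump_cm = [2.54 * d for d in jump]
r_cm = correlation(sprint, jump_cm)

print(f"r = {r:.3f}")
print(f"r with x and y swapped = {r_swapped:.3f}")
print(f"r with jump distance in cm = {r_cm:.3f}")
```

All three printed values should agree, and the sign of r should match the direction you see in the scatterplot.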
§3.2 Least-Squares Regression
A ____________________________ describes how a response variable y changes as an explanatory
variable x changes. We often use a _________________ to predict the value of y for a given value of x.
A ____________________________________ is a ______________ for the data.
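Regression line formula: $\hat{y} = a + bx$, where $\hat{y}$ is the predicted value of y, b is the slope,
and a is the y-intercept.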
__________________________: the use of a regression line for prediction far outside the interval of
values on the explanatory variable x used to obtain the line. Such predictions are often not accurate.
**Don’t make predictions using values of x that are much larger or much smaller than those that actually
appear in your data.
Example: Used Hondas
The following data shows the number of miles driven and advertised price for 11 used Honda CR-Vs
from the 2002-2006 model years (prices found at www.carmax.com).
Miles Driven (thousands)    Cost (dollars)
22                          17998
29                          16450
35                          14998
39                          13998
45                          14599
49                          14988
55                          13599
56                          14599
69                          11998
70                          14450
86                          10998
a) Draw and interpret a scatterplot of the data.
b) Find the correlation coefficient r.
c) Find the regression line that models this data. Identify the slope and y-intercept of the regression
line. Interpret each value in context.
d) Predict the price for a car with 50,000 miles.
e) Should we predict the asking price for a used 2002-2006 Honda CR-V with 250,000 miles?
A ________________________ is the difference between an observed value of the response variable
and the value predicted by the regression line.
Formula: residual = observed y - predicted y = $y - \hat{y}$
**When interpreting residuals, be sure to address both the distance from what we predicted and the
direction of the error.
f) Find and interpret the residuals for the Hondas with 39,000 and 70,000 miles.
The __________________________________________ line of y on x is the line that makes the sum
of the squared residuals as small as possible.
Remember that correlation will stay the same no matter how you rescale the units. The slope
does not share this property!
g) Find the least-squares regression line for the Honda CR-V data.
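As a check on hand or calculator work for the Honda CR-V example, here is a minimal sketch that fits the least-squares line from the data table, makes the prediction at 50 thousand miles, flags the 250-thousand-mile request as extrapolation, and computes the two requested residuals. It assumes the usual least-squares formulas (slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope·x̄), which these notes do not write out. The helper names `least_squares`, `predict`, and `is_extrapolation` are hypothetical.

```python
from statistics import mean

# Honda CR-V data: miles driven (thousands) and advertised price (dollars).
miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]
price = [17998, 16450, 14998, 13998, 14599, 14988, 13599, 14599, 11998, 14450, 10998]

def least_squares(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

a, b = least_squares(miles, price)

def predict(x):
    """Predicted price (dollars) for x thousand miles."""
    return a + b * x

def is_extrapolation(x):
    """True when x lies outside the range of miles used to fit the line."""
    return x < min(miles) or x > max(miles)

print(f"predicted price = {a:.0f} + ({b:.2f})(thousands of miles)")
print(f"prediction at 50 thousand miles: {predict(50):.0f} dollars")
print(f"250 thousand miles is extrapolation: {is_extrapolation(250)}")

# Residual = observed price - predicted price, for the cars in part (f).
for x_obs, y_obs in zip(miles, price):
    if x_obs in (39, 70):
        print(f"residual at {x_obs} thousand miles: {y_obs - predict(x_obs):.0f} dollars")
```

A negative residual means the car is priced below what the line predicts for its mileage; a positive residual means it is priced above.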
CYU: Pg.167
CYU: Pg.171
Example: Working backwards: The equation of the least-squares regression line for the sprint time and
long-jump distance data is:
predicted long-jump distance = 304.56 - 27.63 (sprint time)
a) Find and interpret the residual for the student who had a sprint time of 8.09 seconds.
b) What was the actual long-jump distance for a student who had a time of 5.78 seconds and a
residual of 28.14?
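Both parts of this example rearrange the same relationship, residual = observed y - predicted y. The sketch below works through the arithmetic using the regression equation given above and the observed jump of 151 inches for the 8.09-second sprinter from the data table. The function name `predicted_jump` is an illustrative choice.

```python
def predicted_jump(sprint_time):
    """Predicted long-jump distance (in) from the given regression equation."""
    return 304.56 - 27.63 * sprint_time

# (a) Residual = observed - predicted, for the student with a sprint time of 8.09 s.
observed_a = 151
residual_a = observed_a - predicted_jump(8.09)

# (b) Working backwards: observed = predicted + residual.
actual_b = predicted_jump(5.78) + 28.14

print(f"(a) residual = {residual_a:.2f} inches")
print(f"(b) actual distance = {actual_b:.0f} inches")
```

The result for part (b) can be checked against the data table entry for the student with a sprint time of 5.78 seconds.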
______________________________________: How well does a line fit the data?
Since residuals show how far the data fall from the regression line, examining residuals helps us
assess how well the line describes the data.
**The mean of the least-squares residuals is always zero.
A residual plot is a _______________________ of the residuals against the explanatory variable.
Residual plots help us assess how well a regression line fits the data.
**Only a residual plot can adequately address whether a line is an appropriate model for the data.
CYU: Pg.176
Examining Residual Plots:
A residual plot turns the regression line horizontal.
It magnifies the deviation of the points from the line.
Makes it easier to see unusual observations and patterns.
The residual plot should show no obvious pattern. (This is how you know that the regression
line captures the overall pattern of the data.)
A curved pattern in a residual plot shows that the relationship is not linear.
The residuals should be relatively small in size. (A regression line that fits the data well
should come close to most of the points.)
Standard deviation of the residuals:
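If we use a least-squares line to predict the values of a response variable y from an explanatory variable x,
the standard deviation of the residuals is given by:
$s = \sqrt{\dfrac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$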
In general, the standard deviation of the residuals measures the typical distance of actual values from
their predicted values. It gives us the approximate size of the typical prediction error (residual), and it
is measured in the units of the response variable.
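To see the standard deviation of the residuals in action, the sketch below refits the least-squares line for the Honda CR-V data and computes s. It assumes the usual definition, s = sqrt(Σ residuals² / (n - 2)), and the usual least-squares slope and intercept formulas. It also checks the earlier claim that the mean of the least-squares residuals is zero (up to rounding error). Variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Honda CR-V data: miles driven (thousands) and advertised price (dollars).
miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]
price = [17998, 16450, 14998, 13998, 14599, 14988, 13599, 14599, 11998, 14450, 10998]

# Least-squares slope and intercept.
x_bar, y_bar = mean(miles), mean(price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, price))
s_xx = sum((x - x_bar) ** 2 for x in miles)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual = observed price - predicted price, one per car.
residuals = [y - (intercept + slope * x) for x, y in zip(miles, price)]

# The least-squares residuals average to zero (up to floating-point rounding).
mean_residual = mean(residuals)

# Standard deviation of the residuals, dividing by n - 2.
n = len(residuals)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"mean residual = {mean_residual:.6f}")
print(f"s = {s:.0f} dollars")
```

Because s is in dollars, it reads directly as the typical size of the error when predicting price from miles driven with this line.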
CYU: Pg.179
The coefficient of determination: r² in regression
The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for
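by the least-squares regression line of y on x. Formula: $r^2 = 1 - \dfrac{SSE}{SST}$, where
$SSE = \sum (y_i - \hat{y}_i)^2$ and $SST = \sum (y_i - \bar{y})^2$.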
Interpreting r²:
“The least-squares regression line accounts for _________% of the variation in the (response variable
name).”
Or
“____________% of the variation in (response variable name) is accounted for by the regression line.”
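The interpretation template can be filled in for the Honda CR-V data with a short calculation. This sketch assumes the common definition r² = 1 - SSE/SST, where SSE is the sum of squared residuals about the regression line and SST is the sum of squared deviations of y about its mean. It also confirms that this equals the square of the correlation r. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Honda CR-V data: miles driven (thousands) and advertised price (dollars).
miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]
price = [17998, 16450, 14998, 13998, 14599, 14988, 13599, 14599, 11998, 14450, 10998]

x_bar, y_bar = mean(miles), mean(price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, price))
s_xx = sum((x - x_bar) ** 2 for x in miles)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# SSE: squared prediction errors using the regression line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(miles, price))
# SST: squared prediction errors using the mean price for every car.
sst = sum((y - y_bar) ** 2 for y in price)

r_squared = 1 - sse / sst

# Cross-check: the correlation r, squared, should give the same value.
r = s_xy / sqrt(s_xx * sst)

print(f"r^2 from SSE and SST = {r_squared:.3f}")
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
print(f"About {100 * r_squared:.1f}% of the variation in price is accounted for "
      "by the least-squares regression line on miles driven.")
```

Comparing SSE with SST shows how much smaller the prediction errors are when the line is used instead of simply guessing the mean price for every car.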
CYU: Pg. 181
See Computer Output: pg.182
Correlation vs. Regression
Correlation: The distinction between explanatory and response variables is NOT important in
correlation. Switching x and y doesn’t affect the value of r.
Regression: The distinction between explanatory and response variables is important in regression.
Correlation: Describes linear relationships only.
Regression: Describes linear relationships only.
Correlation: Not resistant; one unusual point in a scatterplot can greatly change the value of r.
Regression: Not resistant; one unusual point in a scatterplot can greatly change the value of r.
Correlation: Association does not imply causation. Even a very strong association is not by itself
good evidence that changes in x actually cause changes in y.
Regression: Association does not imply causation. Even a very strong association is not by itself
good evidence that changes in x actually cause changes in y.
Outliers: Observations that lie outside the overall pattern of the other observations. Points that are
outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may
not have large residuals.
An observation is ________________ for a statistical calculation if removing it would markedly change
the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential
for the least-squares regression line.
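One way to check whether a point is influential is to refit the line without it and compare. The sketch below does this for the sprint and long-jump data, removing the student with the slowest sprint time (9.49 s, 48 in), who sits far from the others in the x direction. The choice of that point and the helper name `least_squares` are illustrative assumptions. Whether the resulting change counts as "marked" is a judgment call the code does not make for you.

```python
from statistics import mean

# Sprint times (s) and long-jump distances (in) for the 13 students.
sprint = [5.41, 5.05, 9.49, 8.09, 7.01, 7.17, 6.83, 6.73, 8.01, 5.68, 5.78, 6.31, 6.04]
jump = [171, 184, 48, 151, 90, 65, 94, 78, 71, 130, 173, 143, 141]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Fit with all 13 students.
intercept_all, slope_all = least_squares(sprint, jump)

# Refit after dropping the student with the largest sprint time.
drop = sprint.index(max(sprint))
sprint_without = sprint[:drop] + sprint[drop + 1:]
jump_without = jump[:drop] + jump[drop + 1:]
intercept_without, slope_without = least_squares(sprint_without, jump_without)

print(f"all points:    predicted jump = {intercept_all:.2f} + ({slope_all:.2f})(sprint time)")
print(f"point removed: predicted jump = {intercept_without:.2f} + ({slope_without:.2f})(sprint time)")
print(f"change in slope = {slope_without - slope_all:.2f} inches per second")
```

The same comparison can be repeated for any other suspicious point, or for r instead of the slope, since correlation is not resistant either.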