AP Statistics – Chapter 3 Notes

§3.1 Scatterplots and Correlations

Consider the following scenarios:

A medical study finds that short women are more likely to have heart attacks than women of average height, while tall women have the fewest heart attacks.

An insurance group reports that heavier cars have fewer deaths per 100,000 vehicles than lighter cars do.

What are these studies looking to establish?

A _____________________________ measures an outcome of a study.

An _______________________ _______________________ may help explain or influence changes in a response variable.

The goal of many statistical studies is to show that changes in one or more explanatory variables actually cause changes in a response variable.

Example: Julie asks, “Can I predict a state’s mean SAT Math score if I know its mean SAT Critical Reading score?” Jim wants to know how the mean SAT Math and Critical Reading scores this year in the 50 states are related to each other. For each student, identify the explanatory variable and the response variable, if possible.

Example: Julie wants to know if she can predict a student’s weight from his or her height. Information about height is easier to obtain than information about weight. Jim wants to know if there is a relationship between height and weight. For each student, identify the explanatory variable and the response variable, if possible.

CYU: Pg.144

Displaying Relationships: ______________________________

Shows the relationship between two ______________________ variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point in the graph.

To make a scatterplot:
1) Decide which variable should go on each axis.
2) Label and scale your axes.
3) Plot individual data values.
To interpret a scatterplot: Look for the overall pattern and for striking departures from that pattern.

Describe the overall pattern:
_________________: positive or negative association
_____________: linear, roughly linear, curved
________________ of the relationship: strong, moderately strong, weak

Departure from pattern: ______________________.

Alt. Example: Track and Field Day!
The table below shows data for 13 students in a statistics class. Each member of the class ran a 40-yard sprint and then did a long jump (with a running start). Make a scatterplot of the relationship between sprint time (in seconds) and long jump distance (in inches). Interpret the scatterplot.

Sprint Time (s)    Long Jump Distance (in)
5.41               171
5.05               184
9.49               48
8.09               151
7.01               90
7.17               65
6.83               94
6.73               78
8.01               71
5.68               130
5.78               173
6.31               143
6.04               141

**Pick ‘nice’ values to mark on each axis. Be sure to cover the range of each variable. You don’t need to start at 0 or to have both axes on the same scale. However, each scale must be consistent. Also, remember to clearly label each axis with the variable name.

Interpretation: (DOFS in context!)

**Don’t let unusual values influence your description. If covering up one value makes the form go from nonlinear to linear, you should call it a linear association with an outlier. This is especially true for small data sets.

Since it is easy to be confused by different scales or by the amount of space around the cloud of points in a scatterplot, we need a numerical measure to supplement the graph.

The ________________ (r) measures the direction and strength of the linear relationship between two quantitative variables.

Facts about r (correlation):
r is always a number between ____________________.
The sign indicates the direction: r > 0 denotes a __________________ association and r < 0 denotes a _______________________ association.
Values of r near 0 indicate a very __________________________ relationship.
Values of r near -1 or 1 indicate a ____________________________ relationship.
r is exactly -1 or 1 if the relationship is a ___________________ linear relationship (all points lie exactly along a straight line).
A correlation greater than 0.8 in absolute value is generally described as strong, whereas a correlation less than 0.5 in absolute value is generally described as weak. These cutoffs can vary with the type of data being examined. A study using scientific data may require a stronger correlation than a study using social science data.
Correlation makes no distinction between explanatory and response variables.
r does _______ change when we change the units of measurement of x, y, or both.
Correlation r is just a number; it does not have a ___________ of measurement.
Correlation requires that both variables be ____________________.
Correlation measures the strength of only the linear relationship between two variables. It does not describe ________________ relationships!
Correlation is not resistant. It is strongly affected by ______________________.
Correlation is not a complete summary of two-variable data. You should also look at the means and standard deviations of both variables, as well as a plot of the data. (Numerical summaries complement plots of data; they do not replace them!)

§3.2 Least-Squares Regression

A ____________________________ describes how a response variable y changes as an explanatory variable x changes. We often use a _________________ to predict the value of y for a given value of x.

A ____________________________________ is a ______________ for the data.

Regression line formula:

__________________________: the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
**Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.

Example: Used Hondas
The following data show the number of miles driven and the advertised price for 11 used Honda CR-Vs from the 2002-2006 model years (prices found at www.carmax.com).

Miles Driven (thousands)    Cost (dollars)
22                          17998
29                          16450
35                          14998
39                          13998
45                          14599
49                          14988
55                          13599
56                          14599
69                          11998
70                          14450
86                          10998

a) Draw and interpret a scatterplot of the data.
b) Find the correlation coefficient r.
c) Find the regression line that models this data. Identify the slope and y-intercept of the regression line. Interpret each value in context.
d) Predict the price for a car with 50,000 miles.
e) Should we predict the asking price for a used 2002-2006 Honda CR-V with 250,000 miles?

A ________________________ is the difference between an observed value of the response variable and the value predicted by the regression line.

Formula:

**When interpreting residuals, be sure to address both the distance from what we predicted and the direction of the error.

f) Find and interpret the residuals for the Hondas with 39,000 and 70,000 miles.

The __________________________________________ line of y on x is the line that makes the sum of the squared residuals as small as possible.

Remember that correlation will stay the same no matter how you rescale the units. The slope does not share this property!

g) Find the least-squares regression line for the Honda CR-V data.

CYU: Pg.167
CYU: Pg.171

Example: Working backwards
The equation of the least-squares regression line for the sprint time and long-jump distance data is:

predicted long-jump distance = 304.56 - 27.63 (sprint time)

a) Find and interpret the residual for the student who had a sprint time of 8.09 seconds.
b) What was the actual long-jump distance for a student who had a time of 5.78 seconds and a residual of 28.14?

______________________________________: How well does a line fit the data?
Since residuals show how far the data fall from the regression line, examining residuals helps us assess how well the line describes the data.

**The mean of the least-squares residuals is always zero.

A residual plot is a _______________________ of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.

**Only a residual plot can adequately address whether a line is an appropriate model for the data.

A residual plot turns the regression line horizontal. It magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns. If the regression line captures the overall pattern of the data, the residual plot should show no obvious pattern.

CYU: Pg.176

Examining Residual Plots:
A residual plot turns the regression line horizontal. It magnifies the deviations of the points from the line, which makes it easier to see unusual observations and patterns.
The residual plot should show no obvious pattern. (This is how you know that the regression line captures the overall pattern of the data.) A curved pattern in a residual plot shows that the relationship is not linear.
The residuals should be relatively small in size. (A regression line that fits the data well should come close to most of the points.)

Standard deviation of the residuals:
If we use a least-squares line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals is given by:

In general, the standard deviation is the average distance of actual values from their expected values. It gives us the approximate size of the typical or average prediction error (residual), and it is measured in the units of the response variable.

CYU: Pg.179

The coefficient of determination: r² in regression
The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x.
Formula:

Interpreting r²: “The least-squares regression line accounts for _________% of the variation in the (response variable name).” Or “____________% of the variation in (response variable name) is accounted for by the regression line.”

CYU: Pg. 181
See Computer Output: pg.182

Correlation vs. Regression

Correlation:
The distinction between explanatory and response variables is NOT important in correlation. Switching x and y doesn’t affect the value of r.
Describes linear relationships only.
Not resistant; one unusual point in a scatterplot can greatly change the value of r.
Association does not imply causation. Even a very strong association is not by itself good evidence that changes in x actually cause changes in y.

Regression:
The distinction between explanatory and response variables is important in regression.
Describes linear relationships only.
Not resistant; one unusual point in a scatterplot can greatly change the value of r.
Association does not imply causation. Even a very strong association is not by itself good evidence that changes in x actually cause changes in y.

Outliers: Observations that lie outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.

An observation is ________________ for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
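
Checking the Used Hondas example by computer: the sketch below is one possible way to verify hand or calculator work for parts b), c), d), and f), not the method the notes require. It uses the usual textbook formulas: slope = (sum of products of x and y deviations) / (sum of squared x deviations), intercept = mean of y minus slope times mean of x, residual = observed y minus predicted y, and s = square root of (sum of squared residuals / (n - 2)). The function and variable names (least_squares, predict, miles, price) are made up for this illustration.

```python
from math import sqrt

# Data from the Used Hondas example (11 used Honda CR-Vs).
miles = [22, 29, 35, 39, 45, 49, 55, 56, 69, 70, 86]   # thousands of miles
price = [17998, 16450, 14998, 13998, 14599, 14988,
         13599, 14599, 11998, 14450, 10998]            # dollars


def mean(values):
    return sum(values) / len(values)


def least_squares(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)            # squared x deviations
    syy = sum((yi - y_bar) ** 2 for yi in y)            # squared y deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = least_squares(miles, price)


def predict(x):
    """Predicted price (dollars) for x thousand miles."""
    return intercept + slope * x


# Residual = observed y - predicted y, one per car.
residuals = [yi - predict(xi) for xi, yi in zip(miles, price)]

# Standard deviation of the residuals and coefficient of determination.
s = sqrt(sum(e ** 2 for e in residuals) / (len(miles) - 2))
r_squared = r ** 2

# Changing units (thousands of miles -> miles) changes the slope but not r.
slope_miles, _, r_miles = least_squares([1000 * m for m in miles], price)

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, s = {s:.0f}")
print(f"predicted price = {intercept:.0f} + ({slope:.2f})(thousands of miles)")
print(f"prediction at 50 thousand miles: {predict(50):.0f}")
print(f"residual at 39 thousand miles: {residuals[3]:.0f}")
print(f"residual at 70 thousand miles: {residuals[9]:.0f}")
```

The printed slope should come out near -86.18 dollars per thousand miles, with r near -0.874. Rerunning least_squares with x measured in miles instead of thousands of miles divides the slope by 1000 and leaves r unchanged, which matches the note that correlation does not depend on units but slope does.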