Computer Assignment 1
Exploratory Analysis Part II
- Created by James D. Wilson, UNC Chapel Hill
- Edited by Andrew Nobel, UNC Chapel Hill

In this assignment, we will explore several data sets that are simulated (independent) samples of various random variables. In a later assignment we will learn how to simulate random quantities in R; for now, we will focus a little more on exploratory analysis.

1. Loading data from external sources: An important component of the R software is its input/output (I/O) features. The read.table() function can be used to import, or read, data from saved text files such as .txt and .csv files, including files available online. Similarly, the write.table() function can be used to write a saved R variable to a .txt or .csv file. Load each of the data sets (saved as .txt files) from the course website using the following commands.

    dat.1 = read.table("http://www.unc.edu/%7Ejameswd/data/dat1.txt")
    dat.2 = read.table("http://www.unc.edu/%7Ejameswd/data/dat2.txt")
    dat.3 = read.table("http://www.unc.edu/%7Ejameswd/data/dat3.txt")
    dat.4 = read.table("http://www.unc.edu/%7Ejameswd/data/dat4.txt")
    dat.5 = read.table("http://www.unc.edu/%7Ejameswd/data/dat5.txt")
    dat.6 = read.table("http://www.unc.edu/%7Ejameswd/data/dat6.txt")

Note that read.table() returns a data frame. Commands such as sd(), ecdf(), and hist() expect a numeric vector, so you may need to extract the column of values first (for example, dat.1[, 1] if the file holds a single column).

Calculate the standard deviation and the 5-number summary for each of the samples using the summary() and sd() commands described in Computer Assignment 0. Also, estimate the coefficient of variation (cv = σ/µ) for each of the samples. Provide the code needed to do this in your final R code printout. Answer the following questions:

Questions
(a) Make a table (hand-written is fine) which includes the following information about the first two samples:
    i. Number of observations
    ii. Whether the data are integer or non-integer valued
    iii. Mean
    iv. Median
    v. Standard deviation
    vi. Coefficient of variation
(b) Comment on the similarities and differences between these samples.
(c) Based on the estimated coefficient of variation you calculated above, which of these samples might we expect to have the highest signal? Which sample has the lowest signal?

2. Empirical cumulative distribution functions: The empirical cumulative distribution function (ECDF) of a random sample summarizes the sample based on the order (smallest to largest) of its observations. When the sample is regarded as coming from a probability distribution, the ECDF can be used to estimate the true cumulative distribution function of that distribution. We can calculate the ECDF of a sample by using the ecdf() command in R. Plot the ECDF of the first three samples dat.1, dat.2, and dat.3 on the same plot using the following commands.

    #plot the ecdf of dat.1 and color it green
    plot(ecdf(dat.1), col = "green")
    #plot the ecdf of dat.2, color it blue, and add this plot to the previous figure
    plot(ecdf(dat.2), col = "blue", add = TRUE)
    #plot the ecdf of dat.3, color it red, and add this plot to the previous figure
    plot(ecdf(dat.3), col = "red", add = TRUE)
    #add a legend to this plot (place it in the bottom right of the figure)
    #the first argument gives the location, the second gives the names,
    #the third gives the line type, and the fourth is the color of each label
    legend("bottomright", c("dat.1","dat.2","dat.3"), lty = c(1,1,1),
           col = c("green","blue","red"))

Next, plot the histograms of each of the samples using the following code:

    #plot the histogram of each sample
    hist(dat.1, main = "Histogram of Sample 1")
    hist(dat.2, main = "Histogram of Sample 2")
    hist(dat.3, main = "Histogram of Sample 3")

Questions
(a) Note that the means of dat.1, dat.2, and dat.3 are approximately equal. It may be tempting to think that two samples are similar (or even that they are samples from the same population) when they share the same mean. Do the ECDFs of these variables support this claim?
(b) Comment on what the histograms of each of these samples provide.
Do these histograms support the claim that the samples are realizations of the same random variable?

3. Bivariate relationships and Correlation: Correlation and covariance are two descriptive statistics that quantify the association between two variables. In R, we can calculate the sample correlation between two quantitative variables x and y using the cor(x,y) command. Similarly, we can use the cov(x,y) command to calculate the covariance between x and y. Calculate the pairwise correlations and covariances between dat.4, dat.5, and dat.6 using the code below:

    #calculate pairwise covariances
    cov.45 = cov(dat.4,dat.5)
    cov.46 = cov(dat.4,dat.6)
    cov.56 = cov(dat.5,dat.6)
    #calculate pairwise correlations
    cor.45 = cor(dat.4,dat.5)
    cor.46 = cor(dat.4,dat.6)
    cor.56 = cor(dat.5,dat.6)

Using the plot() command, draw a scatterplot for each pair of the above three variables. Be sure to appropriately label each of these plots.

Questions
(a) Make a table (again hand-written is OK) for each of the following pairs of data sets (dat.4,dat.5), (dat.4,dat.6), and (dat.5,dat.6) that includes the following:
    i. Correlation
    ii. Covariance
    iii. Product of standard deviations of each variable in the pair
(b) Verify that in each case, the correlation is the quotient of the covariance and the product of the standard deviations.
(c) Comment on each of the generated scatterplots. What does the correlation tell us about the relationships shown in the scatterplots? Does the covariance provide similar information as the correlation? Be sure to explain the “peculiarity” of the plot between dat.4 and dat.6.

4. t-tests: One way to test for statistically significant differences between two samples is to use a formal hypothesis test known as the t-test. There were two types of t-statistics described in class: the Student’s t-statistic and the Welch t-statistic. These statistics are used in different situations depending on the variances of the two samples being compared.
Consider comparing two samples x and y. We can calculate either of these t-statistics using the function t.test(). In particular, if the variances of the two samples are not equal, then we use the command t.test(x, y, var.equal = FALSE). If the variances are equal, then we use the command t.test(x, y, var.equal = TRUE).

Questions
(a) Which t-statistic is appropriate to compare the samples dat.1 and dat.2? How about dat.1 and dat.3?
(b) Calculate the t-statistic to compare dat.1 and dat.2. Are the two samples statistically significantly different at the 0.05 level? Be sure to include your code in your R output file.
(c) Repeat (b) for dat.1 and dat.3.

5. Fisher’s iris data: Now we will apply the above techniques to further explore the iris data set in R. Answer each of the following questions and include any R code used in your R script file.

Questions
(a) Load the iris data and store the species names, as well as the petal length and the petal width, of each flower. Use the following code:

    #load the iris data
    data(iris)
    #store the species names of each flower
    species.names = iris$Species
    #convert these to a character vector for later use
    species.names = as.character(species.names)
    #store the petal length
    petal.length = iris$Petal.Length
    #store the petal width
    petal.width = iris$Petal.Width

For later use, store the petal length and petal width of the setosa and virginica species.
    #store the petal length and width of the setosa species
    petal.length.setosa = petal.length[which(species.names == "setosa")]
    petal.width.setosa = petal.width[which(species.names == "setosa")]
    #store the petal length and width of the virginica species
    petal.length.virginica = petal.length[which(species.names == "virginica")]
    petal.width.virginica = petal.width[which(species.names == "virginica")]

Make a table (hand-written) that includes the five-number summary as well as the standard deviation of the petal length and petal width of (i) all 150 flowers, (ii) the setosa species, and (iii) the virginica species.
(b) Generate an appropriately labeled scatterplot showing the relationship between the petal length and petal width of all 150 flowers. Calculate the correlation and covariance between these two variables. Based on what you see, comment on the relationship between these two variables.
(c) Repeat part (b) for the petal length and petal width of (i) just the setosa species, and (ii) just the virginica species. Comment on the differences between these two scatterplots and on any observations that may be useful in distinguishing the two species.
(d) Plot the ECDF of the petal length of the setosa species and the ECDF of the petal length of the virginica species in two different plots. Do the same for the petal width of these two species. Be sure to appropriately label each of the four plots. What do these plots reveal about the relationship between these two species?
(e) Which t-statistic is appropriate for testing the difference between the petal lengths of the setosa and virginica species? Why? Calculate the t-statistic. Are the petal lengths of these two species statistically significantly different?
(f) Which t-statistic is appropriate for testing the difference between the petal widths of the setosa and virginica species? Why? Calculate the t-statistic. Are the petal widths of these two species statistically significantly different?
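
The commands used in Sections 1 through 4 can be tried out on made-up data before applying them to the course samples. The following is only a minimal sketch and not a solution to any question above: the samples x, y, and z, the seed, the sample sizes, and the helper cv() are all hypothetical and were chosen purely for illustration. It assumes the standard definition of the coefficient of variation (standard deviation divided by mean).

```r
# Hypothetical simulated samples (NOT the course data sets)
set.seed(1)
x <- rnorm(200, mean = 5, sd = 1)   # smaller spread
y <- rnorm(50,  mean = 5, sd = 3)   # same mean, larger spread, fewer points
z <- 2 * x + rnorm(200)             # a variable linearly related to x

# Five-number summary, standard deviation, and estimated
# coefficient of variation (cv() is a hypothetical helper)
five.x <- fivenum(x)
cv <- function(v) sd(v) / mean(v)
print(five.x)
print(c(sd.x = sd(x), cv.x = cv(x), cv.y = cv(y)))

# ecdf() returns a function: Fx(q) is the fraction of observations <= q
Fx <- ecdf(x)
Fy <- ecdf(y)
print(c(Fx.at.4 = Fx(4), Fy.at.4 = Fy(4)))

# The correlation equals the covariance divided by the product
# of the two standard deviations
ratio <- cov(x, z) / (sd(x) * sd(z))
print(all.equal(cor(x, z), ratio))

# Student (pooled variance) versus Welch (unequal variance) t-tests
student <- t.test(x, y, var.equal = TRUE)
welch   <- t.test(x, y, var.equal = FALSE)
print(c(student = unname(student$statistic), welch = unname(welch$statistic)))
# Degrees of freedom: n_x + n_y - 2 for Student, no larger than that for Welch
print(c(student = unname(student$parameter), welch = unname(welch$parameter)))
```

The two samples x and y were given different sizes on purpose: when the sample sizes are equal, the Student and Welch t-statistics coincide and only the degrees of freedom differ, which would hide the difference between the two tests.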