Computational HW 1

Computer Assignment 1
Exploratory Analysis Part II
- Created by James D. Wilson, UNC Chapel Hill
- Edited by Andrew Nobel, UNC Chapel Hill
In this assignment, we will explore several data sets which are simulated (independent) samples of various
random variables. In a later assignment, we will learn how to simulate random quantities in R but for now,
we will focus a little more on exploratory analysis.
1. Loading data from external sources: An important component of the R software is its input / output
(I/O) features. The read.table() function can be used to import, or read, data from pre-saved delimited
text files, including .txt and .csv files, as well as files available online. Similarly, the write.table()
function can be used to write a saved R variable to a .txt or .csv file. (Excel .xlsx files are not plain
text and require an add-on package.) Load each of the data sets (saved as .txt files)
from the course website using the following commands.
dat.1 = read.table("http://www.unc.edu/%7Ejameswd/data/dat1.txt")
dat.2 = read.table("http://www.unc.edu/%7Ejameswd/data/dat2.txt")
dat.3 = read.table("http://www.unc.edu/%7Ejameswd/data/dat3.txt")
dat.4 = read.table("http://www.unc.edu/%7Ejameswd/data/dat4.txt")
dat.5 = read.table("http://www.unc.edu/%7Ejameswd/data/dat5.txt")
dat.6 = read.table("http://www.unc.edu/%7Ejameswd/data/dat6.txt")
Calculate the standard deviation and the 5-number summary for each of the samples using the
summary() and sd() commands described in Computer Assignment 0. Also, estimate the coefficient
of variation (cv = µ/σ) for each of the samples. Provide the code needed to do this in your final R
code printout. Answer the following questions:
Questions
(a) Make a table (hand-written is fine) which includes the following information about the first two
samples:
i. Number of observations
ii. Whether the data is integer or non-integer valued
iii. Mean
iv. Median
v. Standard deviation
vi. Coefficient of variation
(b) Comment on the similarities and differences between each of these samples.
(c) Based on the estimated coefficient of variation you calculated above, which of these samples
might we expect to have the highest signal? Which sample has the lowest signal?
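As a minimal sketch of the workflow in this problem, the code below writes a small made-up sample to a temporary text file, reads it back with read.table(), and computes the summary statistics. The vector x, the path tmp, and the other variable names are illustrative assumptions, not the course data. Because read.table() returns a data frame, the sketch extracts the first column as a numeric vector before calling sd(); whether the course files need the same step depends on their layout. The ratio follows the definition written in this assignment (mean divided by standard deviation); some texts define the coefficient of variation as the reciprocal.

```r
# Hypothetical sample (not the course data)
x <- c(2.1, 3.4, 1.9, 4.0, 2.8, 3.3)

# Write the sample to a temporary .txt file, then read it back
tmp <- tempfile(fileext = ".txt")
write.table(x, file = tmp, row.names = FALSE, col.names = FALSE)
dat <- read.table(tmp)   # a data frame with a single column

# Extract the single column as a numeric vector
samp <- dat[, 1]

# Minimum, quartiles, median, mean, and maximum
summary(samp)

# Sample mean, sample standard deviation, and their ratio
m <- mean(samp)
s <- sd(samp)
cv <- m / s              # ratio as defined in this assignment
cv
```

The same pattern could be repeated for each sample, for example by wrapping the last few lines in a small helper function.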
2. Empirical cumulative distribution functions: The empirical cumulative distribution function (ECDF)
of a random sample provides a summary of the sample based on the order (smallest to largest) of the
sample. When the sample is perceived to come from a probability distribution, the ECDF can be
used to estimate the true cumulative distribution function of the sample. We can calculate the ECDF
of a sample by using the ecdf() command in R. Plot the ECDF of the first three samples dat.1,
dat.2, and dat.3 on the same plot using the following commands.
#plot the ecdf of dat.1 and color it green
plot(ecdf(dat.1), col = "green")
#plot the ecdf of dat.2, color it blue, and add this plot to the previous figure
plot(ecdf(dat.2), col = "blue", add = TRUE)
#plot the ecdf of dat.3, color it red, and add this plot to the previous figure
plot(ecdf(dat.3), col = "red" , add = TRUE)
#add a legend to this plot (place it in the bottomright of the figure)
#the first argument gives location, the second gives the names
#the third gives the line type, and the fourth is the color of each label
legend("bottomright", c("dat.1","dat.2","dat.3"), lty = c(1,1,1), col = c("green","blue","red"))
Next, plot the histograms of each of the samples using the following code:
#plot the histogram of each sample
hist(dat.1, main = "Histogram of Sample 1")
hist(dat.2, main = "Histogram of Sample 2")
hist(dat.3, main = "Histogram of Sample 3")
Questions
(a) Note that the means of dat.1, dat.2, and dat.3 are approximately equal. It may be tempting to
think that two samples are similar (or even that they are samples from the same population)
when they share the same mean. Do the ECDFs of these variables support this claim?
(b) Comment on what the histograms of each of these samples provide. Do these histograms
support the claim that the samples are realizations of the same random variable?
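To make the definition of the ECDF concrete, here is a small sketch on a made-up sample (the vector x and the name Fn are illustrative, not part of the assignment data). It shows that ecdf() returns a function, and that its value at a point t is the proportion of observations less than or equal to t.

```r
# Hypothetical sample of four observations
x <- c(3, 1, 2, 2)

# ecdf() returns a function that can be evaluated at any point
Fn <- ecdf(x)

Fn(2)            # 0.75: three of the four observations are <= 2
Fn(c(0, 1, 3))   # 0.00 0.25 1.00

# The same proportion computed directly from the definition
mean(x <= 2)     # 0.75
```

Since the ECDF depends on every ordered observation, two samples with equal means can still have very different ECDFs.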
3. Bivariate relationships and Correlation: Correlation and covariance are two descriptive statistics that
quantify the association between two variables. In R, we can calculate the sample correlation
between two quantitative variables x and y using the cor(x,y) command. Similarly, we can use the
cov(x,y) command to calculate the covariance between x and y. Calculate the pairwise correlations
and covariances between dat.4, dat.5, and dat.6 using the code below:
#calculate pairwise covariances
cov.45 = cov(dat.4,dat.5)
cov.46 = cov(dat.4,dat.6)
cov.56 = cov(dat.5,dat.6)
#calculate pairwise correlations
cor.45 = cor(dat.4,dat.5)
cor.46 = cor(dat.4,dat.6)
cor.56 = cor(dat.5,dat.6)
Using the plot() command, plot a scatterplot between each pair of the above three variables. Be sure
to appropriately label each of these plots.
Questions
(a) Make a table (again hand-written is OK) for each of the following datasets (dat.4,dat.5),
(dat.4,dat.6), and (dat.5,dat.6) that includes the following:
i. Correlation
ii. Covariance
iii. Product of standard deviations of each variable in the pair
(b) Verify that in each case, the correlation is the quotient of the covariance and the product of the
standard deviations.
(c) Comment on each of the generated scatterplots. What does the correlation tell us about the
relationships shown in the scatterplots? Does the covariance provide similar information as the
correlation? Be sure to explain the “peculiarity” of the plot between dat.4 and dat.6.
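The relationship in part (b), correlation = covariance / (product of standard deviations), can be checked numerically. The sketch below uses two short made-up vectors (x and y are illustrative, not the course data) and also shows how each statistic responds when one variable is rescaled.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation computed directly and from its definition
r <- cor(x, y)
ratio <- cov(x, y) / (sd(x) * sd(y))
all.equal(r, ratio)   # TRUE

# Rescaling x by 10 multiplies the covariance by 10 ...
cov.scaled <- cov(10 * x, y)
# ... but leaves the correlation unchanged
cor.scaled <- cor(10 * x, y)
```

With the course data, the same check would use the stored values, such as cov.45 and cor.45, together with the standard deviations of the corresponding samples.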
4. t-tests: One way to test for statistically significant differences between two samples is to use a formal
hypothesis test known as the t-test. There were two types of t-statistics described in class: the
Student’s t-statistic and the Welch t-statistic. These statistics are used in different situations
depending on the variance of the two samples being compared.
Consider comparing two samples x and y. We can calculate either of these t-test statistics using the
function t.test(). In particular, if the variances of the two samples are not equal, then we use the
command t.test(x, y, var.equal = FALSE). If the variances are equal, then we use the command
t.test(x, y, var.equal = TRUE).
Questions
(a) Which t-statistic is appropriate to compare the samples dat.1 and dat.2? How about dat.1 and
dat.3?
(b) Calculate the t-statistic to compare dat.1 and dat.2. Are the two samples statistically
significantly different at a 0.05 level? Be sure to include your code in your R output file.
(c) Repeat (b) for dat.1 and dat.3.
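As a hedged illustration of the two calls described above, the sketch below runs both versions of t.test() on simulated samples and pulls out the pieces needed to answer questions like (b). The samples x and y, the seed, and the chosen means and standard deviations are assumptions made only for this example.

```r
# Simulated samples (hypothetical, not the course data)
set.seed(1)
x <- rnorm(30, mean = 0, sd = 1)
y <- rnorm(30, mean = 0.5, sd = 3)

# Welch t-test: does not assume equal variances
welch <- t.test(x, y, var.equal = FALSE)

# Student's t-test: assumes equal variances
student <- t.test(x, y, var.equal = TRUE)

# The returned object stores the statistic, degrees of freedom, and p-value
welch$statistic
welch$parameter     # Welch degrees of freedom (generally not an integer)
welch$p.value
student$parameter   # n1 + n2 - 2 = 58

# TRUE if the difference is significant at the 0.05 level
welch$p.value < 0.05
```

Printing the object itself (for example, typing welch at the prompt) shows the same quantities along with a confidence interval for the difference in means.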
5. Fisher’s iris data: Now we will apply the above techniques to further explore the iris data set in R.
Answer each of the following questions and include any R code used in your R script file.
Questions
(a) Load the iris data and store the species names, as well as the petal length and the petal
width of each of the samples. Use the following code:
#load the iris data
data(iris)
#store the species names of each flower
species.names = iris$Species
#convert these to a character string for later use
species.names = as.character(species.names)
#store the petal length
petal.length = iris$Petal.Length
#store the petal width
petal.width = iris$Petal.Width
For later use, store the petal length and petal width of the setosa and virginica species.
#store the petal length and width of the setosa species
petal.length.setosa = petal.length[which(species.names == "setosa")]
petal.width.setosa = petal.width[which(species.names == "setosa")]
#store the petal length and width of the virginica species
petal.length.virginica = petal.length[which(species.names == "virginica")]
petal.width.virginica = petal.width[which(species.names == "virginica")]
Make a table (hand-written) that includes the five-number summary as well as standard
deviation of the petal length and petal width of a) all 150 flowers, b) setosa species, and c)
virginica species.
(b) Generate an appropriately labeled scatterplot showing the relationship between the petal length
and petal width of all 150 flowers. Calculate the correlation and covariance between these two
variables. Based on what you see, comment on the relationship between these two variables.
(c) Repeat part (b) for the petal length and petal width of a) just the setosa species, and b) just the
virginica species. Comment on the differences between these two scatterplots and any
observations that may be useful in distinguishing the two species.
(d) Plot the ECDF of the petal length of the setosa and the petal length of the virginica species in
two different plots. Do the same for the petal width of these two species. Be sure to
appropriately label each of the four plots. What do these plots reveal about the relationship
between these two species?
(e) Which t-statistic is appropriate for testing the difference between the petal length of the setosa
and virginica species? Why? Calculate the t-statistic. Are the petal lengths of these two
species statistically significantly different?
(f) Which t-statistic is appropriate for testing the difference between the petal width of the setosa
and virginica species? Why? Calculate the t-statistic. Are the petal widths of these two
species statistically significantly different?
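The subsetting in part (a) can also be written in other ways. The sketch below is one possible alternative, not the required approach: it shows that logical indexing selects the same values as which(), and uses tapply() to apply a function to each species at once. The names setosa.a, setosa.b, and sds are illustrative.

```r
# Load the built-in iris data
data(iris)

# Indexing with which() and with a logical vector select the same values
setosa.a <- iris$Petal.Length[which(iris$Species == "setosa")]
setosa.b <- iris$Petal.Length[iris$Species == "setosa"]
identical(setosa.a, setosa.b)   # TRUE

# Standard deviation of petal length within each species
sds <- tapply(iris$Petal.Length, iris$Species, sd)
sds

# Summary of petal length within each species
tapply(iris$Petal.Length, iris$Species, summary)
```

The grouped results are named by species, so a single value can be extracted with, for example, sds["setosa"].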