# loading the data
sim_data <- read.csv("sim_data.csv")
Problem Set 1
These exercises cover Modules 1 through 3 of Machine Learning Methods and Applications. Each exercise has a point value, for a total of 24 points across two exercises. Submit your .qmd and your rendered PDF or HTML with your name in both filenames. You can do this! Good luck!
Instructions
For each exercise, write your answers in a well-formatted combination of text and code blocks. You are required to submit your answers in Quarto because of its ability to combine text and code seamlessly. Answer the question first, then follow that with any supporting code. Commenting your code will make it easier for me to grade your work as you intended it to be understood.
The guidance within the document uses R, but you can use R or Python as you wish. Note that while specific functions may be recommended, you won’t be penalized for utilizing alternatives that produce similar results. Points may be deducted for untidy formatting. You will submit your complete .qmd file and your rendered .pdf or .html document.
You may collaborate with your classmates, but every keystroke that goes into your final work must be your own. Do not copy or paste another student’s exact language or substantive or numeric examples. You can talk about the assignment and study together, but I expect you to submit your own work. I understand that there are only so many ways to code simple mathematical operations. But if I see substantive examples, critical reasoning/interpretations/responses, or vectors of example numbers that are the same, this can be problematic from an academic integrity standpoint. Make this your own work.
I assume that some of you will find using an LLM like Google Gemini or ChatGPT helpful in completing this assignment. I do not object to you using these tools to find information or generate ideas. However, academic integrity and good professional practice require that you not copy the results of an LLM query blindly or uncritically. If you use an LLM, be prepared to describe your queries and how you adapted the responses to represent your own work. You should know that LLMs include a lot of comments in their code when you ask them a coding question. If I see an unnatural level of documentation in your code, I may assume that you copied it directly from an LLM. Similarly, many LLMs are quite verbose and explain every step when you ask them a question. I don’t need you to document every line in painstaking detail. Answer the questions presented directly and concisely. This presentation may factor into the points you earn on a question.
Objectives
These exercises will help you develop your skills with the following:
Calculate and interpret descriptive statistics. (CLO 1)
Calculate and interpret regression results. (CLO 1, 3)
Recommended Timeline
I highly recommend completing this problem set over the course of several weeks. Here’s one possible timeline you might follow:
Exercise 1 (1 week)
Exercise 2 (1 week)
Exercise 1 - 11 points
This exercise is modeled off Exercises 2.9 and 4.14 in An Introduction to Statistical Learning (with Applications in R) (second edition).
Download sim_data.csv, which is posted on Canvas. You should subset your data to include only absentee as your outcome variable and four quantitative predictor variables. Note that you may find it helpful to use the data.frame() function to create a single data set containing both absentee and the four predictor variables. Make sure that the missing values have been removed from the data.
# exploring the data
names(sim_data)
[1] "X" "absentee" "nonwhite" "lunch" "IEP" "GPA" "change"
[8] "income"
# checking data types and summary
summary(sim_data)
X absentee nonwhite lunch
Min. : 1.0 Min. : 0.5646 Min. :35.04 Min. :19.80
1st Qu.: 250.8 1st Qu.: 5.3206 1st Qu.:50.00 1st Qu.:38.09
Median : 500.5 Median : 6.3702 Median :54.24 Median :44.39
Mean : 500.5 Mean : 6.3408 Mean :54.55 Mean :44.55
3rd Qu.: 750.2 3rd Qu.: 7.3542 3rd Qu.:58.94 3rd Qu.:51.09
Max. :1000.0 Max. :11.3776 Max. :73.26 Max. :74.98
NA's :2 NA's :1
IEP GPA change income
Min. : 3.471 Min. :1.052 Min. :2.591 Min. : 12335
1st Qu.:12.778 1st Qu.:1.775 1st Qu.:3.690 1st Qu.: 50519
Median :14.878 Median :1.980 Median :3.990 Median : 59333
Mean :14.890 Mean :1.994 Mean :3.984 Mean : 59812
3rd Qu.:17.157 3rd Qu.:2.220 3rd Qu.:4.298 3rd Qu.: 69707
Max. :23.530 Max. :3.077 Max. :5.245 Max. :100574
NA's :1 NA's :1
# subsetting the data
subset <- data.frame(
absentee = sim_data$absentee,
GPA = sim_data$GPA,
income = sim_data$income,
change = sim_data$change,
IEP = sim_data$IEP
)
# clean the data for missing values
subset <- na.omit(subset)
# load packages
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
1.1 - 1 point
What is the range of each quantitative predictor? You can answer this using the range() function.
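As a rough illustration of how range() can be applied to several columns at once, the sketch below uses a small hypothetical data frame named toy (its values are made up and are not taken from sim_data.csv). The same pattern should carry over to your own cleaned subset.

```r
# Hypothetical toy data standing in for a cleaned subset (values are made up)
toy <- data.frame(
  GPA    = c(1.8, 2.1, 2.4, 1.9, 2.0),
  income = c(52000, 61000, 48000, 70500, 59000)
)

# range() returns c(min, max); sapply() applies it to each column in turn
ranges <- sapply(toy, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```

Each column of the resulting matrix holds the minimum and maximum of one predictor.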
1.2 - 1 point
What are the mean and standard deviation of each quantitative predictor?
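One possible way to get both statistics in a single table is to apply a small anonymous function to every column. The data frame toy below is hypothetical and only illustrates the mechanics; alternatives such as dplyr's summarise() would work equally well.

```r
# Hypothetical toy data (values are made up, not from sim_data.csv)
toy <- data.frame(
  GPA    = c(1.8, 2.1, 2.4, 1.9, 2.0),
  income = c(52000, 61000, 48000, 70500, 59000)
)

# For each column, return a named vector with the mean and the
# sample standard deviation (sd() divides by n - 1)
mean_sd <- sapply(toy, function(x) c(mean = mean(x), sd = sd(x)))
print(round(mean_sd, 3))
```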
1.3 - 2 points
Now take a random sample of at least 100 observations. What is the range, mean, and standard deviation of each predictor in this sample? Do these differ from the values in the full sample?
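A minimal sketch of drawing a random sample of rows and comparing its summary statistics with those of the full data is shown below. The data frame full is simulated purely for illustration, the seed value is arbitrary, and describe() is a hypothetical helper name.

```r
set.seed(42)  # arbitrary seed, used only so the sample is reproducible

# Simulated stand-in for the full cleaned data (not sim_data.csv)
full <- data.frame(
  GPA = rnorm(500, mean = 2, sd = 0.3),
  IEP = rnorm(500, mean = 15, sd = 3)
)

# sample() draws row indices without replacement by default
idx  <- sample(nrow(full), size = 100)
samp <- full[idx, ]

# Hypothetical helper: min, max, mean, and sd for every column
describe <- function(d) {
  sapply(d, function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)))
}

full_stats <- describe(full)
samp_stats <- describe(samp)
print(round(full_stats, 3))
print(round(samp_stats, 3))
```

Because the sample is a subset of the full data, its range can never be wider than the full range, while its mean and standard deviation will usually differ slightly from the full-data values.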
1.4 - 2 points
Using the full data set, investigate the predictors and outcome graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
1.5 - 2 points
Suppose that we wish to predict absentee on the basis of the other variables. Do your plots suggest that any of the variables might be useful in predicting absentee? Justify your answer.
1.6 - 1 point
Create a binary variable, absentee_bin, that contains a 1 if absentee contains a value above its median, and a 0 if absentee contains a value below its median. You can compute the median using the median() function.
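The recoding described above might be sketched as follows, using a hypothetical six-row data frame. One assumption to flag: the exercise does not say how to treat a value exactly equal to the median, and this sketch assigns such values a 0.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(absentee = c(4.2, 6.9, 5.5, 7.8, 6.1, 5.0))

# Median of the outcome
med <- median(toy$absentee)

# 1 if strictly above the median, 0 otherwise
# (assumption: values equal to the median are coded 0)
toy$absentee_bin <- ifelse(toy$absentee > med, 1, 0)

print(toy)
table(toy$absentee_bin)
```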
1.7 - 2 points
Explore the data graphically in order to investigate the association between absentee_bin and the four selected predictors/features. Which of the other features seem most likely to be useful in predicting absentee_bin? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
Exercise 2 - 13 points
This exercise is modeled off Exercises 3.8 and 3.9 in An Introduction to Statistical Learning (with Applications in R) (second edition).
Download sim_data.csv, which is posted on Canvas. You should subset your data to include only absentee as your outcome variable and four quantitative predictor variables. Avoid selecting predictor variables that could not have a plausible causal effect on the response. Note that you may find it helpful to use the data.frame() function to create a single data set containing both absentee and the four predictor variables. Make sure that the missing values have been removed from the data.
# Create subset2 for Exercise 2, just to be clean!
subset2 <- data.frame(
absentee = sim_data$absentee,
GPA = sim_data$GPA,
income = sim_data$income,
change = sim_data$change,
IEP = sim_data$IEP
)
# Remove missing values
subset2 <- na.omit(subset2)
2.1 - 2 points
Use the lm() function to perform a simple bivariate linear regression with absentee as the response and only one of the four quantitative variables as the predictor. Use the summary() function to print the results. Comment on the output. For example:
Is there a relationship between the predictor and the response?
How strong is the relationship between the predictor and the response?
Is the relationship between the predictor and the response positive or negative?
What is the predicted absentee associated with a specific value of the predictor of your choosing? What are the associated 95% confidence and prediction intervals?
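The workflow in the questions above (fit, summarize, then predict with intervals) could look roughly like the sketch below. The data are simulated with an assumed negative slope purely for illustration, and the predictor value of 2.0 is an arbitrary choice; none of this reflects the actual relationships in sim_data.csv.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in data with an assumed linear relationship
toy <- data.frame(GPA = runif(200, min = 1, max = 3))
toy$absentee <- 9 - 1.5 * toy$GPA + rnorm(200, sd = 1)

# Simple bivariate regression
fit <- lm(absentee ~ GPA, data = toy)
print(summary(fit))

# Predicted response at one chosen predictor value
new_obs <- data.frame(GPA = 2.0)

# Confidence interval: uncertainty about the mean response at GPA = 2.0
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about a single new observation at GPA = 2.0
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both intervals share the same point estimate, and the prediction interval is always the wider of the two because it also accounts for the residual variation of an individual observation.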
2.2 - 2 points
Plot the relationship between the response and the predictor. Use the abline() function to display the least squares regression line.
2.3 - 2 points
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit. Do residual plots suggest any unusually large outliers? Does a leverage plot identify any observations with unusually high leverage?
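Calling plot() on a fitted lm object produces the diagnostic plots; the numeric quantities behind them can also be inspected directly, as in the sketch below. The data are simulated with one deliberately extreme predictor value, and the cutoffs used (twice the average leverage, absolute studentized residual above 3) are common rules of thumb adopted here as assumptions, not fixed standards.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated data: 49 ordinary points plus one with an extreme x value
toy <- data.frame(x = c(rnorm(49), 8))
toy$y <- 1 + 2 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)

lev  <- hatvalues(fit)   # leverage of each observation
stud <- rstudent(fit)    # studentized residuals

n <- nrow(toy)
p <- length(coef(fit))   # number of estimated coefficients, including intercept

# Rule-of-thumb flags (assumed cutoffs)
high_lev  <- which(lev > 2 * p / n)
big_resid <- which(abs(stud) > 3)

print(high_lev)
print(big_resid)
```

The leverages always sum to the number of estimated coefficients, so p / n is the average leverage that each observation is compared against.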
2.4 - 1 point
Compute the matrix of correlations between all five selected variables using the function cor().
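For reference, cor() accepts a whole data frame of numeric columns and returns a symmetric matrix with ones on the diagonal. The sketch below uses a hypothetical three-column data frame with made-up values.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 4, 6, 8, 10),   # exact multiple of x1, so correlation is 1
  x3 = c(5, 3, 4, 1, 2)
)

# Pearson correlation matrix for all columns
cor_mat <- cor(toy)
print(round(cor_mat, 2))
```

If the data still contained missing values, cor() would return NA for the affected pairs unless its use argument (for example use = "complete.obs") were set.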
2.5 - 2 points
Identify two quantitative variables that are plausible confounders for the relationship explored above in 2.1. Describe your reasoning for choosing these variables as potential confounders.
2.6 - 2 points
Use the lm() function to perform a multiple linear regression with absentee as the response and with the same variable from 2.1 and the two plausible confounders from 2.5 as predictors. Use the summary() function to print the results. Comment on the output. For example:
Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
How did the coefficient for the variable from 2.1 change? Does this change constitute evidence that one or more of the predictors added in 2.6 was a confounder?
Calculate the variance inflation factor for each of the three predictors. Should any of these predictors be dropped from the model?
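The variance inflation factor for a predictor can be computed from its definition, 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below does this in base R on simulated data in which x2 is deliberately built to be correlated with x1; vif_manual() is a hypothetical helper name, and the vif() function in the car package is a commonly used alternative.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Simulated data: x2 is constructed to be strongly correlated with x1
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j is from
# regressing predictor j on all of the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

In this simulation x1 and x2 should show clearly inflated values while x3 stays close to 1. Which threshold signals a problem is a judgment call; values of 5 or 10 are often quoted as rough guides.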
2.7 - 2 points
Provide a one-paragraph, high-level, non-technical summary of your analysis that’s targeted toward a specific stakeholder. Identify an interest the stakeholder might have in your results and connect the results to a recommended action they could take in pursuit of this interest.
Reflection on LLM Usage
Transparency is an important objective in data analytics. Use this section to describe whether and how you used an LLM to help you complete this problem set.