156 changes: 156 additions & 0 deletions submissions/Day4Exercise_HuangZixiao.Rmd
@@ -0,0 +1,156 @@
---
title: "Exercises Day 2"
author: "Richard Paquin Morel, adapted from exercises by Christina Maimone"
date: "`r Sys.Date()`"
output:
pdf_document: default
html_document: default
params:
answers: yes
---


```{r, echo=FALSE, eval=TRUE}
answers<-params$answers
```

```{r global_options, echo = FALSE, include = FALSE}
knitr::opts_chunk$set(echo=answers, eval=answers,
warning = FALSE, message = FALSE,
cache = FALSE, tidy = FALSE)
```

## Load the data

Load the `gapminder` dataset.

```{asis}
### Answer
```

```{r}
gapminder <- read.csv(here::here("Desktop/Northwestern/Bootcamp/bootcamp-2020/data/gapminder5.csv"), stringsAsFactors=FALSE)
```

## Class Example
generation <- read_csv("Desktop/Northwestern/Bootcamp/bootcamp-2020/data/ca_energy_generation.csv")
imports <- read_csv("Desktop/Northwestern/Bootcamp/bootcamp-2020/data/ca_energy_imports.csv")
merged_energy <- merge(generation, imports, by = "datetime")
dim(merged_energy)
head(merged_energy)
long_merged_energy <- gather(merged_energy, key = source, value = usage, -datetime)
head(long_merged_energy)
dim(long_merged_energy)

## If Statement

Use an if() statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset. Now do the same for 2012.

Hint: use the `any` function.

```{asis}
### Answer
```

```{r}
year<-2002
if (any(gapminder$year == year)) {
  print(paste("Record(s) for the year", year, "found."))
} else {
  print(paste("No records for year", year))
}
```
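The exercise also asks for the same check on 2012. One possible way to cover both years is sketched below. The data frame `toy` and the helper `report_year()` are hypothetical stand-ins introduced only for illustration; they are not part of the original answer or of the gapminder data.

```r
# Hypothetical stand-in for the gapminder data (illustrative values only)
toy <- data.frame(year = c(1997, 2002, 2007))

# Illustrative helper: build the message for a single year
report_year <- function(df, year) {
  if (any(df$year == year)) {
    paste("Record(s) for the year", year, "found.")
  } else {
    paste("No records for year", year)
  }
}

# Run the same check for each year of interest
for (y in c(2002, 2012)) {
  print(report_year(toy, y))
}
```

Wrapping the check in a function avoids copying the `if` block once per year.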


## Loop and If Statements

Write a script that finds the mean life expectancy by country for countries whose population is below the mean for the dataset.

Write a script that loops through the `gapminder` data by continent and prints out whether the mean life expectancy is smaller than 50, between 50 and 70, or greater than 70.

```{asis}
### Answer
```

```{r}
overall_mean <- mean(gapminder$pop)

for (i in unique(gapminder$country)) {
  country_mean <- mean(gapminder$pop[gapminder$country == i])

  if (country_mean < overall_mean) {
    mean_le <- mean(gapminder$lifeExp[gapminder$country == i])
    print(paste("Mean Life Expectancy in", i, "is", mean_le))
  }
} # end for loop
```

```{r}
lower_threshold <- 50
upper_threshold <- 70

for (i in unique(gapminder$continent)) {
  tmp <- mean(gapminder$lifeExp[gapminder$continent == i])

  if (tmp < lower_threshold) {
    print(paste("Average Life Expectancy in", i, "is less than", lower_threshold))
  } else if (tmp < upper_threshold) {
    # Reached only when tmp >= lower_threshold, so a mean of exactly 50 is reported here
    print(paste("Average Life Expectancy in", i, "is between", lower_threshold, "and", upper_threshold))
  } else {
    print(paste("Average Life Expectancy in", i, "is greater than", upper_threshold))
  }
}
```


## Writing Functions

Create a function that given a data frame will print the name of each column and the class of data it contains. Use the gapminder dataset. Hint: Use `mode()` or `class()` to get the class of the data in each column. Remember that `names()` or `colnames()` returns the name of the columns in a dataset.

```{asis}
### Answer

Note: Some of these were taken or modified from https://www.r-bloggers.com/functions-exercises/
```

```{r}
data_frame_info <- function(df) {
  cols <- names(df)
  for (i in cols) {
    # df[[i]] extracts the column as a vector for both data frames and tibbles
    print(paste0(i, ": ", mode(df[[i]])))
  }
}
data_frame_info(gapminder)
```
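The hint mentions both `mode()` and `class()`, and the two can disagree. The sketch below uses a small hypothetical data frame, `toy_df` (not part of the original data), to show the difference on a factor column: `mode()` reports the underlying storage, while `class()` reports the type the column presents as.

```r
# Hypothetical data frame with a character, a factor, and a numeric column
toy_df <- data.frame(
  country = c("A", "B"),
  continent = factor(c("X", "Y")),
  pop = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# First class of each column, and the mode of each column
col_classes <- vapply(toy_df, function(col) class(col)[1], character(1))
col_modes <- vapply(toy_df, mode, character(1))

print(col_classes)
print(col_modes)
```

For a factor, `class()` gives `"factor"` while `mode()` gives `"numeric"`, so `class()` is usually the more informative choice when describing a data frame.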

Create a function that, given a **vector**, prints its mean and standard deviation and optionally also prints the median. Hint: include an argument that takes a boolean (`TRUE`/`FALSE`) value, then use an `if` statement.

```{asis}
### Answer

```

```{r}
vector_info <- function(x, include_median = FALSE) {
  print(paste("Mean:", mean(x)))
  print(paste("Standard Deviation:", sd(x)))
  if (include_median) {
    print(paste("Median:", median(x)))
  }
}

le <- gapminder$lifeExp
# Spell out TRUE/FALSE: T and F are ordinary variables that can be reassigned
vector_info(le, include_median = FALSE)
vector_info(le, include_median = TRUE)
```

## Analyzing the relationship between GDP per capita and life expectancy

Use what you've learned so far to answer the following questions using the `gapminder` dataset. Be sure to include some visualizations!

1. What is the relationship between GDP per capita and life expectancy? Does this relationship change over time? (Hint: Use the natural log of both variables.)

2. Does the relationship between GDP per capita and life expectancy vary by continent? Make sure you divide the Americas into North and South America.
211 changes: 211 additions & 0 deletions submissions/FinalRExercise_HuangZixiao.Rmd
@@ -0,0 +1,211 @@
---
title: "FinalRExercise_HuangZixiao.Rmd"
author: "Zixiao Huang"
date: "9/15/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r}
library(tidyverse)
library(data.table)
```

# Task 1: Import your data (nys_schools.csv and nys_acs.csv)
```{r}
# Read the data of schools
schools <- read.csv(here::here("Desktop/Northwestern/Bootcamp/bootcamp-2020/data/nys_schools.csv"))

# Read the data of counties
counties <- read.csv(here::here("Desktop/Northwestern/Bootcamp/bootcamp-2020/data/nys_acs.csv"))
```

# Task 2: Explore your data
```{r}
summary(schools)
summary(counties)
```

Answer:
The datasets contain several types of variables, including character and numeric
(integer/double) variables. Some data are missing: missing values are encoded as -99, which shows
up as a minimum value of -99 for several variables. The two datasets also cover different time
ranges. The schools data frame runs from 2008 to 2017, while the counties data frame runs from
2009 to 2016. Therefore, if we merge the two data frames on year, the entries in the schools data
frame from years without county data will be dropped.

# Task 3: Recoding and Variable Manipulation
1. Deal with missing values, which are currently encoded as -99
```{r}
# Set all missing values to NA since later in our calculations and analysis,
# we can just ignore them.
schools <- replace(schools, schools == -99, NA)
counties <- replace(counties, counties == -99, NA)
```
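To see why the recoding matters, the following sketch uses a small hypothetical data frame, `toy_scores` (invented values, not taken from the real data). Leaving -99 in place would pull the mean down, while `NA` can be skipped with `na.rm = TRUE`.

```r
# Hypothetical scores with one missing value encoded as -99
toy_scores <- data.frame(
  school = c("s1", "s2", "s3"),
  mean_math_score = c(310, -99, 295),
  stringsAsFactors = FALSE
)

# Mean with the sentinel still in place (misleadingly low)
mean_with_sentinel <- mean(toy_scores$mean_math_score)

# Recode the sentinel to NA across the whole data frame
toy_scores[toy_scores == -99] <- NA

# Mean that ignores the missing value
mean_without_missing <- mean(toy_scores$mean_math_score, na.rm = TRUE)

print(mean_with_sentinel)
print(mean_without_missing)
```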

2. Create a categorical variable that groups counties into "high", "medium", and "low" poverty groups.
```{r}
# Group counties into three different poverty groups by using the median_household_income column
# Set the counties with lowest 25% median household income (first quartile) as "high" poverty group (income <= 46347)
# Set the counties with highest 25% median household income (fourth quartile) as "low" poverty group (income > 56448)
# Set the middle 50% as "medium" poverty group (46347 < income <= 56448)

# Start by creating a new variable with all missing values
counties$poverty_level <- NA
# Replace lowest 25% value with "high"
counties$poverty_level[counties$median_household_income <= 46347] <- "high"
# Replace middle 50% value with "medium"
counties$poverty_level[counties$median_household_income <= 56448 & counties$median_household_income > 46347] <- "medium"
# Replace highest 25% value with "low"
counties$poverty_level[counties$median_household_income > 56448] <- "low"
```
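The cut-offs above are typed in by hand. As a possible alternative, the quartile breaks could be computed from the data with `quantile()` and applied with `cut()`. The sketch below uses a hypothetical income vector, `toy_income`, rather than the real ACS values.

```r
# Hypothetical median household incomes (illustrative values only)
toy_income <- c(40000, 45000, 50000, 52000, 55000, 58000, 60000, 70000)

# Breaks at the minimum, first quartile, third quartile, and maximum
breaks <- quantile(toy_income, probs = c(0, 0.25, 0.75, 1), na.rm = TRUE)

# Lowest incomes are the "high" poverty group; intervals are closed on the right
poverty_level <- cut(toy_income,
                     breaks = breaks,
                     labels = c("high", "medium", "low"),
                     include.lowest = TRUE)

print(table(poverty_level))
```

Computing the breaks this way keeps the groups consistent if the data are updated, at the cost of the thresholds no longer being visible in the code.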

3. Create a new variable that is the standardized z-score for math and English Language Arts (ELA)
for each year.
```{r}
# First group by year, then use the scale() function
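schools <- schools %>%
  group_by(year) %>%
  # scale() returns a one-column matrix, so convert it back to a plain numeric vector
  mutate(z_score_math = as.numeric(scale(mean_math_score)),
         z_score_ela = as.numeric(scale(mean_ela_score))) %>%
  ungroup()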
```
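A z-score is the value minus the group mean, divided by the group standard deviation. The base R sketch below computes it within each year on a hypothetical data frame, `toy_z`, using an illustrative helper `zscore()`; both names are assumptions made for this example.

```r
# Hypothetical scores for two years (illustrative values only)
toy_z <- data.frame(
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  score = c(300, 310, 320, 280, 300, 320)
)

# Illustrative helper: standardize a vector, ignoring missing values
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# ave() applies the helper separately within each year
toy_z$z <- ave(toy_z$score, toy_z$year, FUN = zscore)

print(toy_z)
```

Within each year the z-scores have mean 0 and standard deviation 1, which makes scores comparable across years even if the test scale changes.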

# Task 4: Merge datasets
Create a county-level dataset that merges variables from the schools dataset and the ACS dataset.
```{r}
county_school <- merge(schools, counties, by = c("county_name", "year"))
```
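By default `merge()` performs an inner join, so rows without a match in both data frames are dropped. The sketch below illustrates this with two hypothetical data frames, `toy_schools` and `toy_acs`, whose year ranges only partly overlap; the values are invented.

```r
# Hypothetical school-side and county-side data with different year ranges
toy_schools <- data.frame(county_name = "ALBANY",
                          year = 2008:2010,
                          total_enroll = c(100, 110, 120))
toy_acs <- data.frame(county_name = "ALBANY",
                      year = 2009:2011,
                      county_per_poverty = c(0.12, 0.13, 0.11))

# Inner join (the default): only years present in both tables survive
inner <- merge(toy_schools, toy_acs, by = c("county_name", "year"))

# Left join: keep every school row, filling unmatched county columns with NA
left <- merge(toy_schools, toy_acs, by = c("county_name", "year"), all.x = TRUE)

print(inner)
print(left)
```

Comparing `nrow()` before and after a merge is a quick way to check how many rows were lost.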

# Task 5: Generate summary tables
1. For each county: total enrollment, percent of students qualifying for free or reduced price lunch, and percent of
population in poverty.
```{r}
summary1 <- county_school %>%
  # Count the students receiving free and reduced-price lunch at each school
  mutate(free_lunch = total_enroll * per_free_lunch,
         reduced_lunch = total_enroll * per_reduced_lunch) %>%
  group_by(county_name) %>%
  summarise(sum_enroll = sum(total_enroll, na.rm = TRUE),
            # Enrollment-weighted share: qualifying students divided by total enrollment
            per_free_lunch = sum(free_lunch, na.rm = TRUE) / sum(total_enroll, na.rm = TRUE),
            per_reduced_lunch = sum(reduced_lunch, na.rm = TRUE) / sum(total_enroll, na.rm = TRUE),
            # Calculate the poverty rate as the average over years
            per_poverty = mean(county_per_poverty, na.rm = TRUE))

summary1
```
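A county-level percentage is the number of qualifying students divided by total enrollment, which is not the same as averaging the school-level percentages. The sketch below shows the difference on a hypothetical two-school data frame, `toy_lunch`, with invented values.

```r
# Hypothetical county with one small and one large school
toy_lunch <- data.frame(total_enroll = c(100, 900),
                        per_free_lunch = c(0.80, 0.20))

# Unweighted mean treats both schools as equally large
simple_mean <- mean(toy_lunch$per_free_lunch)

# Enrollment-weighted share: qualifying students / all students
weighted <- sum(toy_lunch$total_enroll * toy_lunch$per_free_lunch) /
  sum(toy_lunch$total_enroll)

print(simple_mean)
print(weighted)
```

The weighted figure matches `weighted.mean(per_free_lunch, total_enroll)` and reflects the share of students in the county rather than the share of schools.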

2. For the counties with the top 5 and bottom 5 poverty rate: percent of population in poverty, percent of students
qualifying for free or reduced price lunch, mean reading score, and mean math score.
```{r}
# Create a temporary table with the mean reading score and mean math score for each county
tmp <- county_school %>%
  group_by(county_name) %>%
  summarise(mean_ela = mean(mean_ela_score, na.rm = T),
            mean_math = mean(mean_math_score, na.rm = T))

# Merge the temporary table with the summary1 table in the previous task
summary2 <- merge(summary1, tmp, by = c("county_name"))

# Select the counties with the top 5 and bottom 5 poverty rates:
# sort by poverty rate in descending order, then take the first and last five rows
summary2 <- summary2[order(-summary2$per_poverty),]
tmp <- summary2[1:5,]
tmp2 <- tail(summary2, 5)

# Combine tmp and tmp2 together
summary2 <- rbind(tmp, tmp2)

# Select the required columns of summary2
summary2 <- summary2 %>% select(-sum_enroll)
summary2
```
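Selecting the extremes does not need to depend on the number of rows. As a sketch, `head()` and `tail()` pick the first and last rows of a sorted table regardless of its length. The data frame `toy_rates` below is hypothetical and uses invented county labels and rates.

```r
# Hypothetical poverty rates for twelve counties labelled A to L
toy_rates <- data.frame(county_name = LETTERS[1:12],
                        per_poverty = seq(5, 27, by = 2) / 100,
                        stringsAsFactors = FALSE)

# Sort from highest to lowest poverty rate
ordered <- toy_rates[order(-toy_rates$per_poverty), ]

# Five highest followed by five lowest
top_bottom <- rbind(head(ordered, 5), tail(ordered, 5))

print(top_bottom)
```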

# Task 6: Data Visualization
1. The relationship between access to free/reduced price lunch and test performance, at the school level.
```{r}
# Use the schools dataframe
# Relationship between free price lunch and ela score
ggplot(data = schools) +
  geom_point(aes(x = per_free_lunch, y = z_score_ela)) +
  labs(title = "Relationship between percentage of free lunch and ela score", x = "Percentage of free lunch",
       y = "z-score of ELA") +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(-5,5)) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```

```{r}
# Relationship between free price lunch and math score
ggplot(data = schools) +
  geom_point(aes(x = per_free_lunch, y = z_score_math)) +
  labs(title = "Relationship between percentage of free lunch and math score", x = "Percentage of free lunch",
       y = "z-score of math") +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(-5,5)) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```

```{r}
# Relationship between reduced price lunch and ela score
ggplot(data = schools) +
  geom_point(aes(x = per_reduced_lunch, y = z_score_ela)) +
  labs(title = "Relationship between percentage of reduced lunch and ela score", x = "Percentage of reduced lunch",
       y = "z-score of ELA") +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(-5,5)) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```

```{r}
# Relationship between reduced price lunch and math score
ggplot(data = schools) +
  geom_point(aes(x = per_reduced_lunch, y = z_score_math)) +
  labs(title = "Relationship between percentage of reduced lunch and math score", x = "Percentage of reduced lunch",
       y = "z-score of math") +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(-5,5)) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```

2. Average test performance across counties with high, low, and medium poverty.
```{r}
# ELA test performance across counties with high, low, and medium poverty
county_school %>%
  group_by(year, poverty_level) %>%
  summarise(mean_z_score_ela = mean(z_score_ela, na.rm = T)) %>%
  ggplot() +
  geom_line(aes(x = year, y = mean_z_score_ela, group = poverty_level, col = poverty_level)) +
  labs(title = "Relationship between poverty level and ela score across years", x = "year", y = "ELA z-score") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```
```{r}
# Math test performance across counties with high, low, and medium poverty
county_school %>%
  group_by(year, poverty_level) %>%
  summarise(mean_z_score_math = mean(z_score_math, na.rm = T)) %>%
  ggplot() +
  geom_line(aes(x = year, y = mean_z_score_math, group = poverty_level, col = poverty_level)) +
  labs(title = "Relationship between poverty level and math score across years", x = "year", y = "Math z-score") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face="bold"), panel.border = element_blank())
```

# Task 7: Answering questions
What can the data tell us about the relationship between poverty and test performance in New York public schools? Has this relationship changed over time? Is this relationship at all moderated by access to free/reduced price lunch?

Answer:
The data tell us that the lower the poverty level, the better the test performance in New York public schools. This
relationship hasn't changed over time. It does not seem to be moderated by access to free/reduced price
lunch, because the gap in performance between poverty levels grew during the past few years.