From 3e743caeebf19f8aa92196a9dcdd2f9758af02fa Mon Sep 17 00:00:00 2001
From: xavierdong
Date: Thu, 3 Sep 2020 10:34:04 -0500
Subject: [PATCH 1/3] updated reshape2 to tidyr

---
 lectureslides/day4_Rmd-datamanip1_slides.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lectureslides/day4_Rmd-datamanip1_slides.Rmd b/lectureslides/day4_Rmd-datamanip1_slides.Rmd
index 748b701..8ca68f5 100644
--- a/lectureslides/day4_Rmd-datamanip1_slides.Rmd
+++ b/lectureslides/day4_Rmd-datamanip1_slides.Rmd
@@ -140,7 +140,7 @@ head(generation)
 ```
 
-## Using `reshape2`
+## Using `tidyr`
 
 - `gather` --> make data long
 - `spread` --> make data wide

From 12299e7f8d8120eb4af18f110c89966d615573e5 Mon Sep 17 00:00:00 2001
From: xavierdong
Date: Thu, 3 Sep 2020 13:23:18 -0500
Subject: [PATCH 2/3] submit day4 slide

---
 .../day4_Rmd-datamanip1_slides_Xavier.Rmd | 271 ++++++++++++++++++
 1 file changed, 271 insertions(+)
 create mode 100644 submissions/day4_Rmd-datamanip1_slides_Xavier.Rmd

diff --git a/submissions/day4_Rmd-datamanip1_slides_Xavier.Rmd b/submissions/day4_Rmd-datamanip1_slides_Xavier.Rmd
new file mode 100644
index 0000000..8ca68f5
--- /dev/null
+++ b/submissions/day4_Rmd-datamanip1_slides_Xavier.Rmd
@@ -0,0 +1,271 @@
+---
+title: "Day 4: Rmarkdown, RStudio Git, RMD, and Advanced Data Manipulation 1"
+author: "Amanda Sahar d'Urso (materials from: R.P. 
Morel)"
date: "`r Sys.Date()`"
output: revealjs::revealjs_presentation
params:
  notes: no
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.height = 4.35, fig.width = 4.75)
```

```{r, include=FALSE}
notes <- params$notes
```

## Session Outline

- Pull request (theory and practice) by the Wonderful Ali
- Open exercise file, save with name in submissions folder
- RMD and Intro to data manipulation
- Git commit, push, pull **in RStudio**
- Pull request your submission to MSiA by the Wonderful Ali

# Reporting analysis with Rmarkdown and GitHub

## Rmarkdown & GitHub

- Rmarkdown creates dynamic reports in HTML, PDF, and Word
- Combine text (using the markdown language) and R code
- Rmarkdown runs R code, compiles, and produces a report in the chosen format
- This presentation was created using Rmarkdown!

## Exercises in R

1) Open the `day4_Rmd-datamanip1_exercises.Rmd`
2) Save as: `Day4Exercise_LastnameFirstname.Rmd` within the `submissions` folder
3) Read in the gapminder data set
4) As you answer questions, be sure to annotate your work with as much detail as possible!
5) At the end of the day, we'll push it over to MSiA

# RMD

>- If you're familiar with LaTeX, Rmarkdown (RMD) will feel familiar: you mix written text with markup and compile to a finished document
>- RMD documents have two major components: the written text and the code
>- This means you can write up analyses and insert figures and tables in the same document!
>- It does mean that there are special codes required to be able to do this, however
>- Basically, the R script is the scratch paper and the RMD is where you put your final code and analysis into one document

## RMD Codes

>- The white space in an RMD file looks similar to a script, but it will not run as code; it is where you write out prose: analyses, document headings, etc. 
>- To work with code, you must initiate a chunk via \`\`\`\{r\} and end it with \`\`\`

## Useful specifications, example

![specifications](figures/rmd cheats.png)

[Useful Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)

## Knitting

As you work through your file, you can knit it to different formats to see what it looks like. Ultimately you can share the output as a document or memo.

- Knitting to PDF requires a LaTeX installation
- Other knit formats can be chosen from the dropdown arrow next to the Knit button

![knit image](figures/knit.png)


# Advanced data manipulation, part 1: reshaping and merging

Let's get set up!

## A new dataset

California energy data

- file names: `ca_energy_generation.csv` and `ca_energy_imports.csv`
- Read these two files into your working environment
  - They are in the "data" folder

## Reading in the data

```{r importing, warning=F, message = F}
library(tidyverse)
generation <- read_csv("../data/ca_energy_generation.csv")
imports <- read_csv("../data/ca_energy_imports.csv")
```

## Tidy data: What the heck is the `tidyverse`?

>- “System of packages for data manipulation, exploration, and visualization that share a common design philosophy” - Rickert
>- Mainly developed by Hadley Wickham
>- These packages work together using consistent language structures: once you learn them, the packages will feel as one

## Tidy data: Tidyverse packages

![packages included in the tidyverse, always adding new packages](figures/tidyverse universe.png)

## What is Tidy Data?

>- "Happy families are all alike; every unhappy family is unhappy in its own way" - Leo Tolstoy
>- "Tidy datasets are all alike, but every messy dataset is messy in its own way" - Hadley Wickham
>- Basically, tidy data are the way your data should be organized

## Three rules for tidy data:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. 
Each value must have its own cell.

## 1. Each variable must have its own column

![variables to column](figures/variables to column.png)

## 2. Each observation must have its own row

![observations to row](figures/observations to row.png)

## 3. Each value must have its own cell.

![values to cell](figures/values to cell.png)

## Long to wide

![long to wide](figures/long to wide.png)

## Wide versus long data

- Often, we want to make wide data long (or tidy) for analysis

![Wide to long](figures/wide to long.png)

## Wide versus long data

```{r wide data}
head(generation)
```

## Using `tidyr`

- `gather` --> make data long
- `spread` --> make data wide

## Reshaping CA energy data

- Right now, the `generation` dataframe has several observations per row

```{r untidy data}
head(generation)
```

## `gather` the generation data

`gather(df, key = new column name for key variable, value = new column name for data)`

- Specify the variable that _doesn't_ gather through `-variable` (or specify the ones you want to gather)

```{r gather}
long_gen <- gather(generation, key = source, value = usage, -datetime)
head(long_gen)
```

## Reordering the gathered generation data by time

```{r reordering}
head(long_gen[order(long_gen$datetime), ])
```

`spread` is the inverse of `gather`: it makes the long data wide again

```{r}
spread(long_gen, key = source, value = usage) %>% slice(1:5)
```

# Dealing with dates/time
How the heck am I supposed to work with "2019-09-03 00:00:00"?

## Dealing with dates and times

- Notice that the first variable in both datasets is called "datetime"
- What class are these variables? 
```{r}
class(generation$datetime)
class(imports$datetime)
```

## Dealing with dates/times: `lubridate`

- The best way to deal with date-time data is to use the `lubridate` package
- You can convert character variables into datetime format using the `as_datetime` function
  - One advantage of `readr::read_csv` is that it will often detect and convert datetime variables when importing

```{r, warning=F, message=F}
library(lubridate)
```

## Dealing with dates/times with `lubridate`

```{r datetime}
generation$datetime <- as_datetime(generation$datetime)
head(generation$datetime)
```

```{r}
imports$datetime <- as_datetime(imports$datetime)
head(imports$datetime)
```

# Merging data

## Merging CA energy data

- Sometimes you have data from two (or more) sources that you want to analyze
- You need to merge these dataframes together
- To merge, you need to choose the columns that have common values between the dataframes
  - Usually a variable with ids or years, or both

## Merging with `merge`

`merge(x, y, by = c("id", "year"))`

- Key arguments:
  - `x`: first dataframe
  - `y`: second dataframe
  - `by`: variables to match (must have common name)

## More `merge` arguments

```{r, eval = F}
merge(x, y, by.x = "id", by.y = "cd", all.x = T, all.y = T)
```

- Advanced arguments:
  - Use `by.x` and `by.y` if the dataframes have different variable names
  - Use `all.x = T` if you want to keep all the observations in the first dataframe (unmatched observations in `y` are dropped!)
  - Use `all.y = T` if you want to keep all observations in the second dataframe (unmatched observations in `x` are dropped!)
  - Use both (or, simply `all = T`) to keep all observations!
  - By **default** R will drop unmatched observations from **both** dataframes! 
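## A toy `merge` example

To see what these arguments do, here is a quick sketch with two made-up dataframes (`df1` and `df2` are hypothetical, not part of the CA energy data):

```{r merge_toy_example}
# the id columns only partially overlap: df1 has ids 1-3, df2 has ids 2-4
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

merge(df1, df2, by = "id")             # default: only ids 2 and 3 survive
merge(df1, df2, by = "id", all.x = T)  # keeps id 1 from df1, with NA for y
merge(df1, df2, by = "id", all = T)    # keeps all ids 1 through 4
```

Setting `all = T` is shorthand for `all.x = T` and `all.y = T` together: unmatched rows from both sides are kept, with `NA` filling the missing columns.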
+ +## Merge by `datetime` + +- Use `merge` to join the `generation` and `imports` dataframes, using the `datetime` variable to match + +## Merge by `datetime` + +- Always check your merge! + +```{r merge} +merged_energy <- merge(generation, imports, by = "datetime") +dim(merged_energy) +head(merged_energy) +``` + +## Try reshaping the merged data! + +- Our merged dataframe is still wide and untidy + - Create a long version called `long_merged_energy` + - Take a peek to make sure the long version looks correct + +## Try reshaping the merged data! + +```{r gather exercise} +long_merged_energy <- gather(merged_energy, key = source, value = usage, -datetime) +head(long_merged_energy) +``` From cd637a3a80bdf8436ae3342b87882ca5191716c2 Mon Sep 17 00:00:00 2001 From: xavierdong Date: Thu, 17 Sep 2020 00:29:29 -0500 Subject: [PATCH 3/3] submit finalRexercise --- submissions/FinalRExercise_DongXavier.Rmd | 208 ++++++++++++++++++++++ 1 file changed, 208 insertions(+) create mode 100644 submissions/FinalRExercise_DongXavier.Rmd diff --git a/submissions/FinalRExercise_DongXavier.Rmd b/submissions/FinalRExercise_DongXavier.Rmd new file mode 100644 index 0000000..b3c6f37 --- /dev/null +++ b/submissions/FinalRExercise_DongXavier.Rmd @@ -0,0 +1,208 @@ +--- +title: " FinalRExercise-Xavier-Dong" +author: "Xavier Dong" +date: "`r Sys.Date()`" +output: pdf_document + +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +```{r global_options, echo = FALSE, include = FALSE} +knitr::opts_chunk$set(echo=TRUE, eval=TRUE, + warning = FALSE, message = FALSE, + cache = FALSE, tidy = TRUE) +``` + +```{r include = FALSE, eval=TRUE} +library(tidyverse) +library(dplyr) +library(ggplot2) +library(reshape2) +library(data.table) +options(tibble.width = Inf) +``` + +# MSIA Boot Camp - Final R exercise + +You've learned quite a lot about R in a short time. Congratulations! 
This exercise is designed to give you some additional practice on the material we have discussed this week while the lectures are still fresh in your mind, and to integrate different tools and skills that you have learned.

## Instructions

#### Task 1: Import your data

Read the data files `nys_schools.csv` and `nys_acs.csv` into R. These data come from two different sources: one is data on *schools* in New York state from the [New York State Department of Education](http://data.nysed.gov/downloads.php), and the other is data on *counties* from the American Communities Survey from the US Census Bureau. Review the codebook file so that you know what each variable name means in each dataset.

```{r echo=TRUE,eval=TRUE}
schools <- read.csv(here::here("data/nys_schools.csv"), stringsAsFactors=F)
acs <- read.csv(here::here("data/nys_acs.csv"), stringsAsFactors=F)
head(schools)
head(acs)
```

#### Task 2: Explore your data

Getting to know your data is a critical part of data analysis. Take the time to explore the structure of the two dataframes you have imported. What types of variables are there? Is there any missing data? How can you tell? What else do you notice about the data?

- `schools`: categorical variables (school name, district, etc.), numerical variables (integer for `total_enroll`, float for scores), and year
- `acs`: a categorical variable (county name) and numerical variables (integer for median household income, float for county percent poverty), plus year

#### Task 3: Recoding and variable manipulation

1. Deal with missing values, which are currently coded as `-99`.
2. Create a categorical variable that groups counties into "high", "medium", and "low" poverty groups. Decide how you want to split up the groups and briefly explain your decision.
3. The tests that the NYS Department of Education administers change from time to time, so scale scores are not directly comparable year-to-year. 
Create a new variable that is the standardized z-score for math and English Language Arts (ELA) for each year (hint: group by year and use the `scale()` function)

1. Replace `-99` with `NA`:
```{r echo=TRUE,eval=TRUE}
# -99 is the missing-data code in both files
schools[schools == -99] <- NA
acs[acs == -99] <- NA
```

2. Split up the groups based on quantiles: below the 25th percentile is low, between the 25th and 75th is medium, above the 75th is high.
```{r echo=TRUE,eval=TRUE}
# na.rm = TRUE so the NAs introduced above do not break quantile()
poverty_quantile <- quantile(acs$county_per_poverty, c(.25, .75), na.rm = TRUE)
acs$poverty_group <- cut(acs$county_per_poverty,
                         breaks=c(-Inf, poverty_quantile[[1]], poverty_quantile[[2]], Inf),
                         labels=c("low","medium","high"))
```

3. Standardize the scores within each year:
```{r echo=TRUE,eval=TRUE,tidy=FALSE}
schools_zscore <- schools %>% 
  group_by(year) %>% 
  # as.numeric() drops the matrix attributes that scale() returns
  mutate(ela_zscore = as.numeric(scale(mean_ela_score)),
         math_zscore = as.numeric(scale(mean_math_score))) %>% 
  ungroup()

head(schools_zscore %>% select(year,mean_ela_score,mean_math_score,ela_zscore,math_zscore))
```

#### Task 4: Merge datasets

Create a county-level dataset that merges variables from the schools dataset and the ACS dataset. Remember that you have learned multiple approaches on how to do this, and that you will have to decide how to summarize data when moving from the school to the county level.

```{r echo=TRUE,eval=TRUE,tidy=FALSE}
schools_grouped <- schools %>% 
  group_by(county_name,year) %>% 
  summarise(total_enroll = sum(total_enroll,na.rm=T),
            per_free_lunch = mean(per_free_lunch,na.rm=T),
            per_reduced_lunch = mean(per_reduced_lunch,na.rm=T),
            per_lep = mean(per_lep,na.rm=T),
            mean_ela_score = mean(mean_ela_score,na.rm=T),
            mean_math_score = mean(mean_math_score,na.rm=T))

county_school_merged <- merge(acs, schools_grouped, by = c("county_name", "year"))
head(county_school_merged)
```

#### Task 5: Create summary tables

Generate tables showing the following:

1. For each county: total enrollment, percent of students qualifying for free or reduced price lunch, and percent of population in poverty. 
Group by county and summarize over all the years that have data, since the last section grouped by county AND year.

1. 
```{r echo=TRUE,eval=TRUE,tidy=FALSE}
summary_table <- county_school_merged %>% 
  group_by(county_name) %>% 
  summarize(total_enroll = round(mean(total_enroll,na.rm=T)),
            per_free_lunch = mean(per_free_lunch,na.rm=T),
            per_reduced_lunch = mean(per_reduced_lunch,na.rm=T),
            county_per_poverty = mean(county_per_poverty,na.rm=T))

print(summary_table)
```

2. For the counties with the top 5 and bottom 5 poverty rates: percent of population in poverty, percent of students qualifying for free or reduced price lunch, mean reading score, and mean math score.

```{r echo=TRUE,eval=TRUE,tidy=FALSE}
poverty_table <- county_school_merged %>% 
  group_by(county_name) %>% 
  summarize(county_per_poverty = mean(county_per_poverty,na.rm=T),
            per_free_lunch = mean(per_free_lunch,na.rm=T),
            per_reduced_lunch = mean(per_reduced_lunch,na.rm=T),
            mean_ela_score = mean(mean_ela_score,na.rm=T),
            mean_math_score = mean(mean_math_score,na.rm=T))

top_poverty_table <- arrange(poverty_table, desc(county_per_poverty))
print(top_poverty_table[1:5,])

bot_poverty_table <- arrange(poverty_table, county_per_poverty)
print(bot_poverty_table[1:5,])
```

#### Task 6: Data visualization

Using `ggplot2`, visualize the following:

1. The relationship between access to free/reduced price lunch and test performance, at the *school* level. 
```{r echo=TRUE,eval=TRUE,tidy=FALSE}
schools %>% 
  group_by(school_name) %>% 
  summarize(per_free_lunch = mean(per_free_lunch,na.rm=T),
            per_reduced_lunch = mean(per_reduced_lunch,na.rm=T),
            mean_ela_score = mean(mean_ela_score,na.rm=T),
            mean_math_score = mean(mean_math_score,na.rm=T)) %>% 
  mutate(per_reduced_and_free_lunch = per_free_lunch + per_reduced_lunch) %>% 
  mutate(test_performance = (mean_ela_score + mean_math_score)/2) %>% 
  ggplot() +
  # map informative labels inside aes() so the legend cannot mislabel the series
  geom_point(aes(x=per_reduced_and_free_lunch, y=mean_ela_score, col = 'ELA Score')) +
  geom_point(aes(x=per_reduced_and_free_lunch, y=mean_math_score, col = 'Math Score')) +
  xlim(0, 1) +
  labs(title = "Relationship between Reduced/Free Lunch and Test Performance",
       x = "Percentage of Reduced and Free Lunch", y = "Test Score") +
  scale_colour_manual(name = 'Legend',
                      values = c('ELA Score' = 'red', 'Math Score' = 'blue'))
```

2. Average test performance across *counties* with high, low, and medium poverty.
```{r echo=TRUE,eval=TRUE,tidy=FALSE}
county_school_merged %>% 
  group_by(poverty_group) %>% 
  summarize(mean_ela_score = mean(mean_ela_score,na.rm=T),
            mean_math_score = mean(mean_math_score,na.rm=T)) %>% 
  mutate(test_performance = (mean_ela_score + mean_math_score)/2) %>% 
  ggplot() +
  # position is an argument to geom_col(), not an aesthetic
  geom_col(aes(x=poverty_group, y=test_performance, fill=poverty_group), position="dodge") +
  labs(title="Poverty Level and Average Test Performance", x="Poverty Level", y="Test Performance")
```

#### Task 7: Answering questions

Using the skills you have learned in the past three days, tackle the following question:

> What can the data tell us about the relationship between poverty and test performance in New York public schools? Has this relationship changed over time? Is this relationship at all moderated by access to free/reduced price lunch?

You may use summary tables, statistical models, and/or data visualization in pursuing an answer to this question. 
Feel free to build on the tables and plots you generated above in Tasks 5 and 6.

Given the short time period, any answer will of course prove incomplete. The goal of this task is to give you some room to play around with the skills you've just learned. Don't hesitate to try something even if you don't feel comfortable with it yet. Do as much as you can in the time allotted.

A county's poverty level is inversely related to test performance: the higher the poverty level in a county, the lower the average test performance across schools in that county.

Let's now take a look at how poverty level relates to the percentage of free/reduced lunch.
We can see a positive correlation between the percentage of free/reduced lunch and poverty level.

```{r echo=TRUE,eval=TRUE,tidy=FALSE}
ggplot(summary_table) +
  geom_point(aes(x=county_per_poverty, y=per_reduced_lunch), col = 'orange') +
  geom_point(aes(x=county_per_poverty, y=per_free_lunch), col = 'green') +
  labs(title = "Poverty Rate vs. Free (green) and Reduced (orange) Lunch Rates",
       x = "County Percent in Poverty", y = "Share of Students")
```

## GitHub submission

When you have completed the exercise, save your Markdown file in the `submissions` folder of your forked repo using this naming convention: `FinalRExercise_LastnameFirstname.Rmd`. Commit changes periodically, and push commits when you are done.

You can optionally create a pull request to submit this file (and other exercise files from the bootcamp sessions) to the base repo that lives in the MSiA organization. If you would like to do this, make sure that all new files you have created are in the `submissions` folder, and then create a pull request that asks to merge changes from your forked repo to the base repo.

## Reminders

- Remember to **load necessary packages**.
- Remember to **comment extensively** in your code. Since you will be working in an RMarkdown file, you can describe your workflow in the text section. But you should also comment within all of your code chunks.
- Attempt to knit your Markdown file into HTML format before committing it to GitHub. 
Troubleshoot any errors with the knit process by checking the lines referred to in the error messages. +