chapter 05/06/07 - fix grammar and rendering #453
Open. gsmafra wants to merge 4 commits into matheusfacure:master from gsmafra:chapter-05-fix-grammar
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,7 +13,7 @@ | |
| "\n", | ||
| "As a motivating example, let's suppose you are a data scientist in the collections team of a fintech. Your next task is to figure out the impact of sending an email asking people to negotiate their debt. Your response variable is the amount of payments from the late customers.\n", | ||
| "\n", | ||
| "To answer this question, your team selects 5000 random customers from your late customers base to do a random test. For every customer, you flip a coin, if its heads, the customer receives the email; otherwise, it is left as a control. With this test, you hope to find out how much extra money the email generates." | ||
| "To answer this question, your team selects 5000 random customers from your late customers base to do a random test. For every customer, you flip a coin, if it's heads, the customer receives the email; otherwise, it is left as a control. With this test, you hope to find out how much extra money the email generates." | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -1196,7 +1196,7 @@ | |
| "source": [ | ||
| "## Bad Controls - Selection Bias\n", | ||
| "\n", | ||
| "Let's go back to the collections email example. Remember that the email was randomly assigned to customers. We've already explained what `credit_limit` and `risk_score` is. Now, let's look at the remaining variables. `opened` is a dummy variable for the customer opening the email or not. `agreement` is another dummy marking if the customers contacted the collections department to negotiate their debt, after having received the email. Which of the following models do you think is more appropriate? The first is a model with the treatment variable plus `credit_limit` and `risk_score`; the second adds `opened` and `agreement` dummies." | ||
| "Let's go back to the collections email example. Remember that the email was randomly assigned to customers. We've already explained what `credit_limit` and `risk_score` are. Now, let's look at the remaining variables. `opened` is a dummy variable for the customer opening the email or not. `agreement` is another dummy marking if the customers contacted the collections department to negotiate their debt, after having received the email. Which of the following models do you think is more appropriate? The first is a model with the treatment variable plus `credit_limit` and `risk_score`; the second adds `opened` and `agreement` dummies." | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -1484,11 +1484,11 @@ | |
| "email -> opened -> agreement -> payment \n", | ||
| "$\n", | ||
| "\n", | ||
| "We also think that different levels of risk and line have different propensity of doing an agreement, so we will mark them as also causing agreement. As for email and agreement, we could make an argument that some people just read the subject of the email and that makes them more likely to make an agreement. The point is that email could also cause agreement without passing through open.\n", | ||
| "We also think that different levels of risk and limit have different propensity of doing an agreement, so we will mark them as also causing agreement. As for email and agreement, we could make an argument that some people just read the subject of the email and that makes them more likely to make an agreement. The point is that email could also cause agreement without passing through open.\n", | ||
| "\n", | ||
| "What we notice with this graph is that opened and agreement are both in the causal path from email to payments. So, if we control for them with regression, we would be saying \"this is the effect of email while keeping `opened` and `agreement` fixed\". However, both are part of the causal effect of the email, so we don't want to hold them fixed. Instead, we could argue that email increases payments precisely because it boosts the agreement rate. If we fix those variables, we are removing some of the true effect from the email variable. \n", | ||
| "\n", | ||
| "With potential outcome notation, we can say that, due to randomization $E[Y_0|T=0] = E[Y_0|T=1]$. However, even with randomization, when we control for agreement, treatment and control are no longer comparable. In fact, with some intuitive thinking, we can even guess how they are different:\n", | ||
| "With potential outcomes notation, we can say that, due to randomization $E[Y_0|T=0] = E[Y_0|T=1]$. However, even with randomization, when we control for agreement, treatment and control are no longer comparable. In fact, with some intuitive thinking, we can even guess how they are different:\n", | ||
| "\n", | ||
| "\n", | ||
| "$\n", | ||
|
|
@@ -1507,13 +1507,13 @@ | |
| "\n", | ||
| "\n", | ||
| "\n", | ||
| "Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. Often, they appear underneath some supposedly clever idea, making it even harder to uncover. Here are some examples of selection biased I've encountered:\n", | ||
| "Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. Often, they appear underneath some supposedly clever idea, making it even harder to uncover. Here are some examples of selection biases I've encountered:\n", | ||
| "\n", | ||
| " 1. Adding a dummy for paying the entire debt when trying to estimate the effect of a collections strategy on payments.\n", | ||
| " 2. Controlling for white vs blue collar jobs when trying to estimate the effect of schooling on earnings\n", | ||
| " 3. Controlling for conversion when estimating the impact of interest rates on loan duration\n", | ||
| " 4. Controlling for marital happiness when estimating the impact of children on extramarital affairs\n", | ||
| " 5. Breaking up payments modeling E[Payments] into one binary model that predict if payment will happen and another model that predict how much payment will happen given that some will: E[Payments|Payments>0]*P(Payments>0)\n", | ||
| "1. Adding a dummy for paying the entire debt when trying to estimate the effect of a collections strategy on payments.\n", | ||
| "2. Controlling for white vs blue collar jobs when trying to estimate the effect of schooling on earnings\n", | ||
| "3. Controlling for conversion when estimating the impact of interest rates on loan duration\n", | ||
| "4. Controlling for marital happiness when estimating the impact of children on extramarital affairs\n", | ||
| "5. Breaking up payments modeling $E[Payments]$ into one binary model that predicts if payment will happen and another model that predict how much payment will happen given that some will: $E[Payments|Payments>0]*P(Payments>0)$\n", | ||
|
Comment on lines +1512 to +1516:
Current rendering makes this a code block. Removing tabs makes it text. Tested with a local build.
| " \n", | ||
| "What is notable about all these ideas is how reasonable they sound. Selection bias often does. Let this be a warning. As a matter of fact, I myself have fallen into the traps above many many times before I learned how bad they were. One in particular, the last one, deserves further explanation because it looks so clever and catches lots of data scientists off guard. It's so pervasive that it has its own name: **The Bad COP**!\n", | ||
| "\n", | ||
|
|
@@ -1595,7 +1595,7 @@ | |
| "\\end{align*} \n", | ||
| "$$\n", | ||
| " \n", | ||
| "where the second equality comes after we add and subtract $E[Y_{i0}|Y_{i1}>0]$. When we break up the COP effect, we get first the causal effect on the participant subpopulation. In our example, this would be the causal effect on those that decide to spend something. Second, we get a bias term which is the difference in $Y_0$ for those that decide to participate when assigned to the treatment ($E[Y_{i0}|Y_{i1}>0]$) and those that that participate even without the treatment ($E[Y_{i0}|Y_{i0}>0]$). In our case, this bias is probably negative, since those that spend when assigned to the treatment, had they not received the treatment, would probably spend less than those that spend even without the treatment $E[Y_{i0}|Y_{i1}>0] < E[Y_{i0}|Y_{i0}>0]$.\n", | ||
| "where the second equality comes after we add and subtract $E[Y_{i0}|Y_{i1}>0]$. When we break up the COP effect, we get first the causal effect on the participant subpopulation. In our example, this would be the causal effect on those that decide to spend something. Second, we get a bias term which is the difference in $Y_0$ for those that decide to participate when assigned to the treatment ($E[Y_{i0}|Y_{i1}>0]$) and those that would participate even without the treatment ($E[Y_{i0}|Y_{i0}>0]$). In our case, this bias is probably negative, since those that spend when assigned to the treatment, had they not received the treatment, would probably spend less than those that spend even without the treatment $E[Y_{i0}|Y_{i1}>0] < E[Y_{i0}|Y_{i0}>0]$.\n", | ||
| " \n", | ||
| "\n", | ||
| " \n", | ||
|
|
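The sign of the bias term discussed in the last hunk can be checked with a small simulation. This is a minimal sketch, not code from the notebook: the censored-normal data generating process, the function name `simulate_cop_bias`, and the `effect` parameter are all assumptions made for illustration. It builds both potential outcomes for every unit, so the decomposition of the COP effect into a causal term and a bias term can be computed directly.

```python
import numpy as np


def simulate_cop_bias(n=200_000, effect=0.5, seed=0):
    """Decompose the COP effect on simulated potential outcomes.

    Assumed (hypothetical) data generating process: spending is a latent
    normal variable censored at zero, and the treatment shifts the latent
    variable up by `effect`.
    """
    rng = np.random.default_rng(seed)
    latent = rng.normal(0.0, 1.0, size=n)

    # Potential outcomes: nobody spends a negative amount.
    y0 = np.clip(latent, 0.0, None)
    y1 = np.clip(latent + effect, 0.0, None)

    treated_participants = y1 > 0    # units that spend when treated
    untreated_participants = y0 > 0  # units that spend even without treatment

    # E[Y1 - Y0 | Y1 > 0]: causal effect on the participant subpopulation.
    causal = (y1[treated_participants] - y0[treated_participants]).mean()

    # E[Y0 | Y1 > 0] - E[Y0 | Y0 > 0]: the selection bias term.
    bias = y0[treated_participants].mean() - y0[untreated_participants].mean()

    # E[Y1 | Y1 > 0] - E[Y0 | Y0 > 0]: what the COP comparison measures.
    cop = y1[treated_participants].mean() - y0[untreated_participants].mean()

    return causal, bias, cop


causal, bias, cop = simulate_cop_bias()
print(f"causal effect on participants: {causal:.3f}")
print(f"selection bias term:           {bias:.3f}")
print(f"COP effect (causal + bias):    {cop:.3f}")
```

Under these assumptions the bias term comes out negative, because the group that spends when treated includes units that would have spent nothing without the treatment, which pulls $E[Y_{i0}|Y_{i1}>0]$ below $E[Y_{i0}|Y_{i0}>0]$.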
Not a grammar fix, but it keeps the vocabulary consistent.