Conversation

@gsmafra (Contributor) commented Oct 23, 2025

No description provided.

"3. $E[T_{1i}-T_{0i}] \\neq 0$. This is the existence of a 1st stage. It is saying that the potential outcome of the 1st stage, that is, the potential treatment, is NOT the same. Another way of saying this is that the instrument does affect the treatment.\n",
"\n",
"4. $T_{i1} > T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be higher than if everyone had the treatment turned off. \n",
"4. $T_{i1} \\geq T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be equal or higher than if everyone had the instrument turned off. \n",

"As it turns out, the answer is quite simple and intuitive. It is easy to find people that match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. In more general terms, the more features we have, the higher will be the distance between units and their matches. \n",
"\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
@gsmafra (Contributor, Author) commented:

GitHub is not rendering this quite well, but the only difference is phenomena -> phenomenon
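The cell-count arithmetic in the quoted paragraph is easy to sanity-check; a minimal Python sketch (the 10,000 sample size is an arbitrary assumption, only there to show how thin the cells get):

```python
# subclassification cells grow exponentially with the number of discretised features
cells_2_features = 5 ** 2    # 2 features with 5 buckets each -> 25 cells
cells_10_features = 3 ** 10  # 10 features with 3 buckets each -> 59049 cells
print(cells_2_features, cells_10_features)  # 25 59049

# even a fairly large sample spreads very thin over that many cells
n_samples = 10_000  # hypothetical sample size
print(n_samples / cells_10_features)  # ~0.17 units per cell on average
```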

"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"\n",
"![img](./data/img/curse-of-dimensionality.jpg)",
@gsmafra (Contributor, Author) commented:

We're currently rendering it like this: [screenshot attached]

"To estimate the impacts of alcohol on death, we could use the fact that legal drinking age imposes a discontinuity on nature. In the US, those just under 21 years old don't drink (or drink much less) while those just older than 21 do drink. This means that the probability of drinking jumps at 21 years and that is something we can explore with an RDD.\n",
"\n",
"```note\n",
"```{note}\n",
@gsmafra (Contributor, Author) commented:

Currently rendering like a code block: [screenshot attached]
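Separately from the rendering issue, the RDD idea in the quoted paragraph can be made concrete with a short sketch. Everything here is hypothetical (synthetic data, made-up column names `age` and `deaths`), just to show the shape of the estimator:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def rdd_jump(df, cutoff=21, bandwidth=2):
    """Estimate the jump in `deaths` at the cutoff with separate linear fits on each side."""
    window = df[(df["age"] >= cutoff - bandwidth) & (df["age"] <= cutoff + bandwidth)].copy()
    window["age_c"] = window["age"] - cutoff          # center the running variable
    window["above"] = (window["age"] >= cutoff).astype(int)
    model = smf.ols("deaths ~ above * age_c", data=window).fit()
    return model.params["above"]                      # coefficient on `above` = jump at the cutoff

# synthetic example with a jump of 8 built in, just to exercise the function
rng = np.random.default_rng(0)
ages = rng.uniform(19, 23, size=2_000)
deaths = 90 + 1.5 * (ages - 21) + 8 * (ages >= 21) + rng.normal(0, 5, size=ages.size)
print(rdd_jump(pd.DataFrame({"age": ages, "deaths": deaths})))  # roughly 8
```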

"Here, notice how there are model bands where the net value is super negative, while there are also bands where it is very positive. Also, there are bands where we don't know exactly if the net value is negative or positive. Finally, notice how they have an upward trend, from left to right. Since we are predicting net value, it is expected that the prediction will be proportional to what it predicts.\n",
"\n",
"Now, to compare this policy using a machine learning model with the one using only the regions we can also show the histogram of net gains, along with the total net value in the test set."
"Now, using a model policy that selects customers where $\\hat{E}[NetValue|Region, Income, Age] > 0$, and comparing it against the one using only the regions, we can show their histogram of net gains, along with their average net value in the test set."
@gsmafra (Contributor, Author) commented:

Just making clear that the policy is score>0. This is not immediately obvious (wasn't to me at least). FWIW, we would optimize the metric "average net value in the test set" with a higher threshold than zero
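A minimal sketch of that comparison, with everything hypothetical (synthetic data, a stand-in regressor, made-up column names), just to pin down what "average net value in the test set" means for each policy:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical test set with a realised net_value per customer
rng = np.random.default_rng(1)
n = 5_000
test = pd.DataFrame({
    "region": rng.integers(0, 5, n),
    "income": rng.normal(1_000, 300, n),
    "age": rng.integers(18, 70, n),
})
test["net_value"] = (test["income"] - 900) / 50 + 2 * (test["region"] % 2) - 5 + rng.normal(0, 5, n)

# stand-in for the fitted model of E[NetValue | Region, Income, Age]
# (fit on the test set only to keep the sketch short; in practice fit on training data)
model = GradientBoostingRegressor().fit(test[["region", "income", "age"]], test["net_value"])

# model policy: engage customers with predicted net value above zero
model_mask = model.predict(test[["region", "income", "age"]]) > 0

# region policy: engage only the regions that are profitable on average
region_means = test.groupby("region")["net_value"].mean()
region_mask = test["region"].isin(region_means[region_means > 0].index)

print("model policy avg net value:", test.loc[model_mask, "net_value"].mean())
print("region policy avg net value:", test.loc[region_mask, "net_value"].mean())
```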

" \n",
"$$\n",
"sales_i = \\beta_0 + \\beta_1 price_i + \\beta_2 price_i * temp_i * + \\pmb{\\beta_3}X_i + e_i\n",
"sales_i = \\beta_0 + \\beta_1 price_i + \\beta_2 price_i * temp_i * + \\beta_3 X_i + e_i\n",
@gsmafra (Contributor, Author) commented:

Not clear why this is bold? We're not quite interested in the effects of $X_i$. Or I may be missing something
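One thing worth spelling out about the quoted regression (assuming the stray `*` after $temp_i$ is a typo and the intended model has a price-by-temperature interaction): the price sensitivity implied by it is

$$
\frac{\partial \, sales_i}{\partial \, price_i} = \beta_1 + \beta_2 \, temp_i
$$

so the interaction term is what lets the sensitivity vary with temperature, while $\beta_3$ only picks up the adjustment for the controls in $X_i$, which is consistent with the point above about not being interested in its effect.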

"metadata": {},
"source": [
"Notice how the predictions are numbers that go from something like -9 to something 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bet, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n",
"Notice how the predictions are numbers that go from something like -9 to something like 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bat, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n",
@gsmafra (Contributor, Author) commented:

something 1 -> something like 1
off the bet -> off the bat
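Since the quoted passage argues that only the ordering of the sensitivity predictions matters, here is a minimal sketch of acting on that ordering; the predictions are synthetic and the band labels arbitrary:

```python
import numpy as np
import pandas as pd

# hypothetical sensitivity predictions; only their ranks matter, not their scale
df = pd.DataFrame({"sensitivity_pred": np.random.default_rng(0).normal(-4, 3, 1_000)})

# five bands from most price-sensitive (most negative prediction) to least sensitive
df["band"] = pd.qcut(df["sensitivity_pred"], q=5,
                     labels=["most sensitive", "2nd", "3rd", "4th", "least sensitive"])
print(df.groupby("band", observed=True)["sensitivity_pred"].mean())
```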

"# 19 - Evaluating Causal Models\n",
"\n",
"In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we did in the When Prediction Fails chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why should it replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
"In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we do in the When Prediction Fails appendix chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why should it replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
@gsmafra (Contributor, Author) commented:

I was confused about how we're referencing a chapter that comes after the current one in the past tense (did)

"metadata": {},
"source": [
"Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples are we treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% units is -1. Ideally, we want the highest sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing we can treat a high percentage of units while still maintaining an above average sensitivity. \n",
"Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples we are treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% units is -1. Ideally, for this case, we want the least negative sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing we can treat a high percentage of units while still maintaining a low magnitude sensitivity.\n",
@gsmafra (Contributor, Author) commented:

Disambiguating the "highest" and "above average", which could be interpreted as "highest in magnitude". In everyday speech, I think most people wouldn't take the sign into account when talking about higher or lower sensitivity
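For reference, a minimal sketch of how a curve like this could be computed, using the OLS slope of outcome on treatment as the sensitivity estimate; column names are hypothetical and the chapter's exact implementation may differ:

```python
import numpy as np
import pandas as pd

def sensitivity(df, treatment, outcome):
    # OLS slope of outcome on treatment: cov(t, y) / var(t)
    t, y = df[treatment], df[outcome]
    return np.cov(t, y)[0, 1] / t.var()

def cumulative_sensitivity_curve(df, prediction, treatment, outcome, steps=20):
    # sort so the units predicted to be least negatively affected come first,
    # then compute the sensitivity of the top k% for increasing k
    ordered = df.sort_values(prediction, ascending=False).reset_index(drop=True)
    sizes = np.linspace(len(df) // steps, len(df), steps).astype(int)
    return pd.Series({n / len(df): sensitivity(ordered.head(n), treatment, outcome)
                      for n in sizes})

# usage (hypothetical): cumulative_sensitivity_curve(test_df, "sensitivity_pred", "price", "sales")
```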

"This seems very odd, because you are saying that the effect of the email can be a negative number, but bear with me. If we do a little bit of math, we can see that, on average or in expectation, this transformed target will be the treatment effect. This is nothing short of amazing. What I'm saying is that by applying this somewhat wacky transformation, I get to estimate something that I can't even observe. \n",
" \n",
"To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is the definition of independence.\n",
"To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is a consequence of independence.\n",
@gsmafra (Contributor, Author) commented:

The definition of independence is stronger than this
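For reference, here is where that expectation step gets used in one standard version of the transformed-target argument, for a binary treatment with assignment probability $p = P(T_i = 1)$ (the chapter's exact transformation may differ):

$$
Y^*_i = Y_i \, \frac{T_i - p}{p(1-p)}, \qquad
E[Y^*_i] = \frac{E\big[(T_i Y_{1i} + (1-T_i) Y_{0i})(T_i - p)\big]}{p(1-p)}
         = \frac{p(1-p) E[Y_{1i}] - p(1-p) E[Y_{0i}]}{p(1-p)}
         = E[Y_{1i} - Y_{0i}]
$$

where the middle equality uses $T_i^2 = T_i$ together with exactly the fact above, $E[T_i Y_{1i}] = E[T_i] E[Y_{1i}]$ (and likewise for $Y_{0i}$).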

@gsmafra changed the title from "grammar, rendering, typos, etc" to "chapters 8 to 20 - grammar, rendering, typos, etc" on Oct 23, 2025
@gsmafra changed the title from "chapters 8 to 20 - grammar, rendering, typos, etc" to "chapters 8 to 20 - grammar, rendering, typos, phrasing, etc" on Oct 23, 2025