Conversation

@gsmafra (Contributor) commented Oct 23, 2025

No description provided.

"3. $E[T_{1i}-T_{0i}] \\neq 0$. This is the existence of a 1st stage. It is saying that the potential outcome of the 1st stage, that is, the potential treatment, is NOT the same. Another way of saying this is that the instrument does affect the treatment.\n",
"\n",
"4. $T_{i1} > T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be higher than if everyone had the treatment turned off. \n",
"4. $T_{i1} \\geq T_{i0}$. This is the monotonicity assumption. It is saying that if everyone had the instrument turned on, the treatment level would be equal or higher than if everyone had the instrument turned off. \n",

"As it turns out, the answer is quite simple and intuitive. It is easy to find people that match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. In more general terms, the more features we have, the higher will be the distance between units and their matches. \n",
"\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
@gsmafra (Contributor, Author) commented:

GitHub is not rendering this quite well, but the only difference is phenomena -> phenomenon
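The cell-count arithmetic in the quoted paragraph is easy to sanity-check; a minimal Python sketch (the 10,000 sample size is an arbitrary assumption, only there to show how thin the cells get):

```python
# subclassification cells grow exponentially with the number of discretised features
cells_2_features = 5 ** 2    # 2 features with 5 buckets each -> 25 cells
cells_10_features = 3 ** 10  # 10 features with 3 buckets each -> 59049 cells
print(cells_2_features, cells_10_features)  # 25 59049

# even a fairly large sample spreads very thin over that many cells
n_samples = 10_000  # hypothetical sample size
print(n_samples / cells_10_features)  # ~0.17 units per cell on average
```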

"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomena pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. Early on, in that contrived medicine example where with man and woman, it was quite easy to build the subclassification estimator. That was because we only had 2 cells: man and woman. But what would happen if we had more? Let's say we have 2 continuous features like age and income and we manage to discretise them into 5 buckets each. This will give us 25 cells, or $5^2$. And what if we had 10 covariates with 3 buckets each? Doesn't seem like a lot right? Well, this would give us 59049 cells, or $3^{10}$. It's easy to see how this can blow out of proportion pretty quickly. This is a phenomenon pervasive in all data science, which is called the **The Curse of Dimensionality**!!!\n",
"\n",
"![img](./data/img/curse-of-dimensionality.jpg)",
@gsmafra (Contributor, Author) commented:

We're currently rendering it like this: [screenshot attached]

"To estimate the impacts of alcohol on death, we could use the fact that legal drinking age imposes a discontinuity on nature. In the US, those just under 21 years old don't drink (or drink much less) while those just older than 21 do drink. This means that the probability of drinking jumps at 21 years and that is something we can explore with an RDD.\n",
"\n",
"```note\n",
"```{note}\n",
@gsmafra (Contributor, Author) commented:

Currently rendering like a code block: [screenshot attached]
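Separately from the rendering issue, the RDD idea in the quoted paragraph can be made concrete with a short sketch. Everything here is hypothetical (synthetic data, made-up column names `age` and `deaths`), just to show the shape of the estimator:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def rdd_jump(df, cutoff=21, bandwidth=2):
    """Estimate the jump in `deaths` at the cutoff with separate linear fits on each side."""
    window = df[(df["age"] >= cutoff - bandwidth) & (df["age"] <= cutoff + bandwidth)].copy()
    window["age_c"] = window["age"] - cutoff          # center the running variable
    window["above"] = (window["age"] >= cutoff).astype(int)
    model = smf.ols("deaths ~ above * age_c", data=window).fit()
    return model.params["above"]                      # coefficient on `above` = jump at the cutoff

# synthetic example with a jump of 8 built in, just to exercise the function
rng = np.random.default_rng(0)
ages = rng.uniform(19, 23, size=2_000)
deaths = 90 + 1.5 * (ages - 21) + 8 * (ages >= 21) + rng.normal(0, 5, size=ages.size)
print(rdd_jump(pd.DataFrame({"age": ages, "deaths": deaths})))  # roughly 8
```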

"Here, notice how there are model bands where the net value is super negative, while there are also bands where it is very positive. Also, there are bands where we don't know exactly if the net value is negative or positive. Finally, notice how they have an upward trend, from left to right. Since we are predicting net value, it is expected that the prediction will be proportional to what it predicts.\n",
"\n",
"Now, to compare this policy using a machine learning model with the one using only the regions we can also show the histogram of net gains, along with the total net value in the test set."
"Now, using a model policy that selects customers where $\\hat{E}[NetValue|Region, Income, Age] > 0$, and comparing it against the one using only the regions, we can show their histogram of net gains, along with their average net value in the test set."
@gsmafra (Contributor, Author) commented:

Just making clear that the policy is score>0. This is not immediately obvious (wasn't to me at least). FWIW, we would optimize the metric "average net value in the test set" with a higher threshold than zero
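A minimal sketch of that comparison, with everything hypothetical (synthetic data, a stand-in regressor, made-up column names), just to pin down what "average net value in the test set" means for each policy:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical test set with a realised net_value per customer
rng = np.random.default_rng(1)
n = 5_000
test = pd.DataFrame({
    "region": rng.integers(0, 5, n),
    "income": rng.normal(1_000, 300, n),
    "age": rng.integers(18, 70, n),
})
test["net_value"] = (test["income"] - 900) / 50 + 2 * (test["region"] % 2) - 5 + rng.normal(0, 5, n)

# stand-in for the fitted model of E[NetValue | Region, Income, Age]
# (fit on the test set only to keep the sketch short; in practice fit on training data)
model = GradientBoostingRegressor().fit(test[["region", "income", "age"]], test["net_value"])

# model policy: engage customers with predicted net value above zero
model_mask = model.predict(test[["region", "income", "age"]]) > 0

# region policy: engage only the regions that are profitable on average
region_means = test.groupby("region")["net_value"].mean()
region_mask = test["region"].isin(region_means[region_means > 0].index)

print("model policy avg net value:", test.loc[model_mask, "net_value"].mean())
print("region policy avg net value:", test.loc[region_mask, "net_value"].mean())
```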

" \n",
"$$\n",
"sales_i = \\beta_0 + \\beta_1 price_i + \\beta_2 price_i * temp_i * + \\pmb{\\beta_3}X_i + e_i\n",
"sales_i = \\beta_0 + \\beta_1 price_i + \\beta_2 price_i * temp_i * + \\beta_3 X_i + e_i\n",
@gsmafra (Contributor, Author) commented:

Not clear why this is bold? We're not quite interested in the effects of $X_i$. Or I may be missing something
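One thing worth spelling out about the quoted regression (assuming the stray `*` after $temp_i$ is a typo and the intended model has a price-by-temperature interaction): the price sensitivity implied by it is

$$
\frac{\partial \, sales_i}{\partial \, price_i} = \beta_1 + \beta_2 \, temp_i
$$

so the interaction term is what lets the sensitivity vary with temperature, while $\beta_3$ only picks up the adjustment for the controls in $X_i$, which is consistent with the point above about not being interested in its effect.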

"metadata": {},
"source": [
"Notice how the predictions are numbers that go from something like -9 to something 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bet, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n",
"Notice how the predictions are numbers that go from something like -9 to something like 1. Those are not predictions of the sales column, which is in the order of the hundreds. Rather, **it's a prediction of how much sales would change if we increased price by one unit**. Right off of the bat, we can see some strange numbers. For example, take a look at day 4764. It's predicting a positive sensitivity. In other words, we are predicting that sales will increase if we increase the price of ice cream. This doesn't appeal to our economic sense. It's probably the case that the model is doing some weird extrapolation on that prediction. Fortunately, you don't have to worry too much about it. Remember that our ultimate goal is to segment the units by how sensitive they are to the treatment. It's **not** to come up with the most accurate sensitivity prediction ever. For our main goal, it suffices if the sensitivity predictions orders the units according to how sensitive they are. In other words, even if positive sensitivity predictions like 1.1, or 0.5 don't make much sense, all we need is that the ordering is correct, that is, we want the units with prediction 1.1 to be less impacted by price increase than units with predictions 0.5. \n",
@gsmafra (Contributor, Author) commented:

something 1 -> something like 1
off the bet -> off the bat
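Since the quoted passage argues that only the ordering of the sensitivity predictions matters, here is a minimal sketch of acting on that ordering; the predictions are synthetic and the band labels arbitrary:

```python
import numpy as np
import pandas as pd

# hypothetical sensitivity predictions; only their ranks matter, not their scale
df = pd.DataFrame({"sensitivity_pred": np.random.default_rng(0).normal(-4, 3, 1_000)})

# five bands from most price-sensitive (most negative prediction) to least sensitive
df["band"] = pd.qcut(df["sensitivity_pred"], q=5,
                     labels=["most sensitive", "2nd", "3rd", "4th", "least sensitive"])
print(df.groupby("band", observed=True)["sensitivity_pred"].mean())
```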

"# 19 - Evaluating Causal Models\n",
"\n",
"In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we did in the When Prediction Fails chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why should it replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
"In the vast majority of material about causality, researchers use synthetic data to check if their methods are any good. Much like we do in the When Prediction Fails appendix chapter, they generate data on both $Y_{0i}$ and $Y_{1i}$ so that they can check if their model is correctly capturing the treatment effect $Y_{1i} - Y_{0i}$. That's fine for academic purposes, but in the real world, we don't have that luxury. When applying these techniques in the industry, we'll be asked time and again to prove why our model is better, why should it replace the current one in production or why it won't fail miserably. This is so crucial that it's beyond my comprehension why we don't see any material whatsoever explaining how we should evaluate causal inference models with real data. \n",
@gsmafra (Contributor, Author) commented:

I was confused about how we're referencing a chapter that comes after the current one in the past tense (did)

"metadata": {},
"source": [
"Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples are we treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% units is -1. Ideally, we want the highest sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing we can treat a high percentage of units while still maintaining an above average sensitivity. \n",
"Interpreting a Cumulative Sensitivity Curve can be a bit challenging, but here is how I see it. Again, it might be easier to think about the binary case. The X axis of the curve represents how many samples we are treating. Here, I've normalized the axis to be the proportion of the dataset, so .4 means we are treating 40% of the samples. The Y axis is the sensitivity we should expect at that many samples. So, if a curve has value -1 at 40%, it means that the sensitivity of the top 40% units is -1. Ideally, for this case, we want the least negative sensitivity for the largest possible sample. An ideal curve then would start high up on the Y axis and descend very slowly to the average sensitivity, representing we can treat a high percentage of units while still maintaining a low magnitude sensitivity.\n",
@gsmafra (Contributor, Author) commented:

Disambiguating the "highest" and "above average", which could be interpreted as "highest in magnitude". In everyday speech, I think most people wouldn't take the sign into account when talking about higher or lower sensitivity
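For reference, a minimal sketch of how a curve like this could be computed, using the OLS slope of outcome on treatment as the sensitivity estimate; column names are hypothetical and the chapter's exact implementation may differ:

```python
import numpy as np
import pandas as pd

def sensitivity(df, treatment, outcome):
    # OLS slope of outcome on treatment: cov(t, y) / var(t)
    t, y = df[treatment], df[outcome]
    return np.cov(t, y)[0, 1] / t.var()

def cumulative_sensitivity_curve(df, prediction, treatment, outcome, steps=20):
    # sort so the units predicted to be least negatively affected come first,
    # then compute the sensitivity of the top k% for increasing k
    ordered = df.sort_values(prediction, ascending=False).reset_index(drop=True)
    sizes = np.linspace(len(df) // steps, len(df), steps).astype(int)
    return pd.Series({n / len(df): sensitivity(ordered.head(n), treatment, outcome)
                      for n in sizes})

# usage (hypothetical): cumulative_sensitivity_curve(test_df, "sensitivity_pred", "price", "sales")
```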

"This seems very odd, because you are saying that the effect of the email can be a negative number, but bear with me. If we do a little bit of math, we can see that, on average or in expectation, this transformed target will be the treatment effect. This is nothing short of amazing. What I'm saying is that by applying this somewhat wacky transformation, I get to estimate something that I can't even observe. \n",
" \n",
"To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is the definition of independence.\n",
"To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T, Y(t)]=E[T]*E[Y(t)]$, which is a consequence of independence.\n",
@gsmafra (Contributor, Author) commented:

The definition of independence is stronger than this
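For reference, here is where that expectation step gets used in one standard version of the transformed-target argument, for a binary treatment with assignment probability $p = P(T_i = 1)$ (the chapter's exact transformation may differ):

$$
Y^*_i = Y_i \, \frac{T_i - p}{p(1-p)}, \qquad
E[Y^*_i] = \frac{E\big[(T_i Y_{1i} + (1-T_i) Y_{0i})(T_i - p)\big]}{p(1-p)}
         = \frac{p(1-p) E[Y_{1i}] - p(1-p) E[Y_{0i}]}{p(1-p)}
         = E[Y_{1i} - Y_{0i}]
$$

where the middle equality uses $T_i^2 = T_i$ together with exactly the fact above, $E[T_i Y_{1i}] = E[T_i] E[Y_{1i}]$ (and likewise for $Y_{0i}$).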

@gsmafra changed the title from "grammar, rendering, typos, etc" to "chapters 8 to 20 - grammar, rendering, typos, etc" on Oct 23, 2025
@gsmafra changed the title from "chapters 8 to 20 - grammar, rendering, typos, etc" to "chapters 8 to 20 - grammar, rendering, typos, phrasing, etc" on Oct 23, 2025