Commit aec340b

primer: fixed mistakes (#36)
Added some lines to primer to enable people to follow along. Also fixed some minor mistakes in the calculation of standard deviation.
1 parent 15e2eca commit aec340b

1 file changed

Lines changed: 7 additions & 3 deletions

docs/exploratory_data_analysis_primer.md

@@ -27,6 +27,8 @@ For categorical data the best univariate non-graphical analysis is a tabulation
 
 ```haskell
 ghci> import qualified DataFrame as D
+ghci> df <- D.readCsv "./housing.csv"
+ghci> :set -XOverloadedStrings
 ghci> D.frequencies "ocean_proximity" df
 
 ------------------------------------------------------------------------------
@@ -100,8 +102,10 @@ In the housing dataset it'll tell how "typical" our typical home price is.
 ```haskell
 ghci> import Data.Maybe
 ghci> m = fromMaybe 0 $ D.mean "median_house_value" df
+ghci> m
 206855.81690891474
-ghci> df |> D.derive "deviation" (D.col "median_house_value" - D.lit m) |> D.select ["median_house_value", "deviation"] |> D.take 10
+ghci> import DataFrame ((|>))
+ghci> df |> D.derive "deviation" (abs $ D.col "median_house_value" - D.lit m) |> D.select ["median_house_value", "deviation"] |> D.take 10
 -----------------------------------------------
 index | median_house_value | deviation
 ------|--------------------|-------------------
@@ -124,7 +128,7 @@ Read left to right, we begin by calling `derive` which creates a new column comp
 This gives us a list of the deviations. From the small sample it does seem like there are some wild deviations. The first one is greater than the mean! How typical is this? Well to answer that we take the average of all these values.
 
 ```haskell
-ghci> withDeviation = df |> D.derive "deviation" (D.col "median_house_value" - D.lit m) |> "median_house_value" |> D.select ["median_house_value", "deviation"]
+ghci> withDeviation = df |> D.derive "deviation" (abs $ D.col "median_house_value" - D.lit m) |> D.select ["median_house_value", "deviation"]
 ghci> D.mean "deviation" withDeviation
 Just 91170.43994367732
 ```
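
To cross-check the two spread measures this commit corrects without the DataFrame library, here is a minimal sketch in plain Haskell using only `base`. The function names `mean`, `meanAbsDeviation`, and `sampleStdDev` are illustrative assumptions and are not part of the DataFrame API. The sketch works on an in-memory list rather than a column of `housing.csv`.

```haskell
-- Plain-Haskell sketch (base only). All names here are illustrative
-- and do not come from the DataFrame library.

-- Arithmetic mean of a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Mean absolute deviation: the average of |x - mean|.
-- Without the `abs`, the signed deviations would cancel out to zero.
meanAbsDeviation :: [Double] -> Double
meanAbsDeviation xs = mean [abs (x - m) | x <- xs]
  where
    m = mean xs

-- Sample standard deviation: square each deviation, sum the squares,
-- divide by n - 1, then take the square root. Needs at least two values.
sampleStdDev :: [Double] -> Double
sampleStdDev xs = sqrt (sum [(x - m) ^ (2 :: Int) | x <- xs] / (n - 1))
  where
    m = mean xs
    n = fromIntegral (length xs)
```

For the list `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5, so `meanAbsDeviation` gives 1.5 and `sampleStdDev` gives the square root of 32/7, roughly 2.14. Dividing by `n - 1` mirrors the primer's own `n = fromIntegral $ (fst $ D.dimensions df) - 1`.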
@@ -137,7 +141,7 @@ What if we give more weight to the further deviations?
 That's what standard deviation aims to do. Standard deviation considers the spread of outliers. Instead of calculating the absolute difference of each observation from the mean we calculate the square of the difference. This has the effect of exaggerating further outliers.
 
 ```haskell
-ghci> sumOfSqureDifferences = fromMaybe 0 $ D.sum "deviation" withDeviation
+ghci> sumOfSqureDifferences = fromMaybe 0 $ D.sum "deviation^2" $ withDeviation |> D.derive "deviation^2" ((D.col "deviation") ** (D.lit 2))
 ghci> n = fromIntegral $ (fst $ D.dimensions df) - 1
 ghci> sqrt (sumOfSqureDifferences / n)
 115395.6158744
