
Commit e85b08b

committed
chore: More work on tutorial/book.
1 parent b793353 commit e85b08b

4 files changed

Lines changed: 132 additions & 11 deletions

File tree

README.md

Lines changed: 7 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,8 +1,13 @@
11
# DataFrame
22

3-
An intuitive, dynamically-typed DataFrame library.
3+
A fast, safe, and intuitive DataFrame library.
44

5-
A tool for exploratory data analysis.
5+
## Why use Haskell?
6+
7+
* Having types around eliminates many kinds of bugs before you even run the code.
8+
* It's easy to write pipelines.
9+
* The Haskell compiler does a lot of optimization that makes code very fast.
10+
* The syntax is more approachable than that of dataframe libraries in other compiled languages.
611

712
## Installing
813

benchmark/Main.hs

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ haskell = do
2323
ns <- VU.replicateM n (uniformRM range g)
2424
xs <- VU.replicateM n (uniformRM range g)
2525
ys <- VU.replicateM n (uniformRM range g)
26-
let df = D.fromUnamedColumns (map D.fromUnboxedVector [ns, xs, ys])
26+
let df = D.fromUnnamedColumns (map D.fromUnboxedVector [ns, xs, ys])
2727
endGeneration <- getCurrentTime
2828

2929
let generationTime = diffUTCTime endGeneration startGeneration

docs/haskell_for_data_analysis.md

Lines changed: 122 additions & 6 deletions
@@ -1,16 +1,132 @@
11
# Haskell for Data Analysis
22

3-
This section ports/mirrors Wes McKinney's book [Python for Data Analysis](https://wesmckinney.com/book/). Examples and organizations are drawn from there. This tutorial assumes an understanding of Haskell.
3+
This section ports/mirrors Wes McKinney's book [Python for Data Analysis](https://wesmckinney.com/book/). Examples and organization are drawn from there. This tutorial does not assume an understanding of Haskell.
4+
5+
## What is a dataframe?
6+
7+
A DataFrame is like a spreadsheet or a table — it organizes data into rows and columns.
8+
9+
* Each column has a name (like "Name", "Age", or "Price") and usually contains the same type of information (like numbers or text).
10+
11+
* Each row is one entry or record — like a person, a product, or a day’s worth of sales.
12+
13+
Imagine an Excel sheet or Google Sheets file:
14+
15+
| Name | Age | City |
16+
|-------|-------|-----------|
17+
| Alice | 30 | New York |
18+
| Bob | 25 | San Diego |
19+
| Cara | 35 | Austin |
20+
21+
That’s essentially a DataFrame!
22+
23+
DataFrames make it easy to:
24+
* Look at your data
25+
* Filter or sort it (like showing only people over 30)
26+
* Do math on it (like averaging ages)
27+
* Clean it (like removing bad or incomplete data)
28+
29+
They're a key tool for data scientists, analysts, and programmers when working with data. This guide is about how to use dataframes in a language called Haskell.
30+
31+
## Why use Haskell?
32+
33+
* Having types around eliminates many kinds of bugs before you even run the code.
34+
* It's easy to write pipelines.
35+
* The Haskell compiler does a lot of optimization that makes code very fast.
36+
* The syntax is more approachable than that of dataframe libraries in other compiled languages.
37+
38+
## What to install
39+
For most of this guide we will be using a tool called GHCi. This is a program that allows you to write and evaluate Haskell code interactively. In fact, the 'i' in GHCi stands for "interactive". GHCi comes bundled with an installation of GHC, the Haskell compiler. At the time of writing, [ghcup](https://www.haskell.org/ghcup/) is the best way to install Haskell tooling. To get through this guide you will also need a tool called Cabal (which is also installed via ghcup). Cabal is a package manager for Haskell; it lets you download the libraries you need to code along.
40+
41+
After you've installed cabal you'll need to install `dataframe`. To do so run `cabal update && cabal install dataframe` on your command line. To start running GHCi type `cabal repl --build-depends dataframe` in your terminal.
42+
43+
You're now ready to start using and exploring dataframes!
44+
45+
## Getting the data
46+
Data enters a computer program in one of two ways:
47+
* manual entry of the data by a human, or,
48+
* through a file whose data was the output of another computer program or the result of manual entry.
49+
50+
We will show how to do both in dataframes.
51+
52+
### Entering the data manually
53+
54+
I live in Seattle where the weather is a legitimate, non-small-talk topic of conversation for most of the year. At any given point in time I care about what the weather is and what it will be. I'd like to do some simple computation on a week's worth of high and low temperatures. A week of data is small enough that I can enter it myself so I'll do just that.
55+
56+
```haskell
57+
ghci> import qualified DataFrame as D
58+
ghci> let df = D.fromNamedColumns [("Day", D.fromList ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]), ("High Temperature (Celsius)", D.fromList [24, 20, 22, 23, 25, 26, 26]), ("Low Temperature (Celsius)", D.fromList [14, 13, 13, 13, 14, 15, 15])]
59+
ghci> df
60+
--------------------------------------------------------------------------
61+
index | Day | High Temperature (Celsius) | Low Temperature (Celsius)
62+
------|-----------|----------------------------|--------------------------
63+
Int | [Char] | Integer | Integer
64+
------|-----------|----------------------------|--------------------------
65+
0 | Monday | 24 | 14
66+
1 | Tuesday | 20 | 13
67+
2 | Wednesday | 22 | 13
68+
3 | Thursday | 23 | 13
69+
4 | Friday | 25 | 14
70+
5 | Saturday | 26 | 15
71+
6 | Sunday | 26 | 15
72+
```
73+
74+
We use the function `fromNamedColumns` to create a dataframe from manually entered data. The format of the function is `fromNamedColumns [(<name>, <column>), (<name>, <column>),...]`. It has an equivalent for data without column names called `fromUnnamedColumns`.
75+
76+
```haskell
77+
ghci> let df = D.fromUnnamedColumns [D.fromList ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], D.fromList [24, 20, 22, 23, 25, 26, 26], D.fromList [14, 13, 13, 13, 14, 15, 15]]
78+
ghci> df
79+
-------------------------------------
80+
index | 0 | 1 | 2
81+
------|-----------|---------|--------
82+
Int | [Char] | Integer | Integer
83+
------|-----------|---------|--------
84+
0 | Monday | 24 | 14
85+
1 | Tuesday | 20 | 13
86+
2 | Wednesday | 22 | 13
87+
3 | Thursday | 23 | 13
88+
4 | Friday | 25 | 14
89+
5 | Saturday | 26 | 15
90+
6 | Sunday | 26 | 15
91+
```
92+
93+
This function automatically names the columns with the numbers 0 to n - 1, where n is the number of columns. This is generally bad practice (every column should have a descriptive name) but is useful for an initial entry where the column names are unknown or do not exist.
94+
95+
### Getting the data from a file
96+
97+
We can also get data from a CSV file using the `readCsv` function.
98+
99+
```haskell
100+
ghci> df <- D.readCsv "./data/housing.csv"
101+
ghci> D.take 10 df
102+
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
103+
index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity
104+
------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|----------------
105+
Int | Double | Double | Double | Double | Maybe Double | Double | Double | Double | Double | Text
106+
------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|----------------
107+
0 | -122.23 | 37.88 | 41.0 | 880.0 | Just 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY
108+
1 | -122.22 | 37.86 | 21.0 | 7099.0 | Just 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY
109+
2 | -122.24 | 37.85 | 52.0 | 1467.0 | Just 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY
110+
3 | -122.25 | 37.85 | 52.0 | 1274.0 | Just 235.0 | 558.0 | 219.0 | 5.6431000000000004 | 341300.0 | NEAR BAY
111+
4 | -122.25 | 37.85 | 52.0 | 1627.0 | Just 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY
112+
5 | -122.25 | 37.85 | 52.0 | 919.0 | Just 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY
113+
6 | -122.25 | 37.84 | 52.0 | 2535.0 | Just 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY
114+
7 | -122.25 | 37.84 | 52.0 | 3104.0 | Just 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY
115+
8 | -122.26 | 37.84 | 42.0 | 2555.0 | Just 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY
116+
9 | -122.25 | 37.84 | 52.0 | 3549.0 | Just 707.0 | 1551.0 | 714.0 | 3.6912000000000003 | 261100.0 | NEAR BAY
117+
```
118+
119+
We've introduced a new function in the example above. The `take` function, given a number `n` and a dataframe, keeps only the first `n` rows of the dataframe. We use it here to inspect a few rows without printing the whole dataset.
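
To make the behaviour of `take` concrete, here is a minimal sketch that models a dataframe as a list of named columns and truncates every column to its first `n` values. The `Frame` type and the `takeRows` helper are hypothetical stand-ins invented for this illustration; they are not the library's types or its implementation.

```haskell
module Main where

-- Hypothetical stand-in for a dataframe: a list of (column name, values).
type Frame = [(String, [Double])]

-- | Keep only the first n rows by truncating every column to n values.
takeRows :: Int -> Frame -> Frame
takeRows n = map (\(name, values) -> (name, take n values))

-- A few values copied from the housing example above.
housing :: Frame
housing =
  [ ("longitude", [-122.23, -122.22, -122.24, -122.25])
  , ("latitude",  [37.88, 37.86, 37.85, 37.85])
  ]

main :: IO ()
main = mapM_ print (takeRows 2 housing)
```

Asking for more rows than exist simply returns every row, because the Prelude's `take` never fails on a short list.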
4120

5121
## Data preparation
6-
Data in the wild doesn't always come in a form that's easy to work with. A data analysis tool should make preparing and cleaning data easy. There are a number of common issues that data analysis too must handle. We'll go through a few common ones and show how to deal with them in Haskell.
122+
Data in the wild doesn't always come in a form that's easy to work with. A data analysis tool should make preparing and cleaning data easy. There are a number of common issues that a data analysis tool must handle. We'll go through a few common ones and show how to deal with them in Haskell.
7123

8124
### Handling missing data
9-
In Haskell, potentially missing values are represented by a "wrapper" type called [`Maybe`](https://en.wikibooks.org/wiki/Haskell/Understanding_monads/Maybe).
125+
Data is often incomplete, sometimes for legitimate reasons and often because of errors. Handling missing data is a foundational task in data analysis. In Haskell, potentially missing values are represented by a "wrapper" type called [`Maybe`](https://en.wikibooks.org/wiki/Haskell/Understanding_monads/Maybe).
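
For readers new to `Maybe`, the sketch below shows the two usual ways of dealing with missing values on an ordinary Haskell list, using only `Data.Maybe` from the standard library. The helper names `fillMissing` and `dropMissing` are made up for this illustration and are not part of `dataframe`.

```haskell
module Main where

import Data.Maybe (catMaybes, fromMaybe)

-- A column with gaps: Just x is a present value, Nothing is a missing one.
readings :: [Maybe Double]
readings = [Just 6.5, Nothing, Nothing, Just 6.5]

-- | Replace every missing value with a default (hypothetical helper).
fillMissing :: a -> [Maybe a] -> [a]
fillMissing def = map (fromMaybe def)

-- | Discard missing values entirely (hypothetical helper).
dropMissing :: [Maybe a] -> [a]
dropMissing = catMaybes

main :: IO ()
main = do
  print (fillMissing 0 readings) -- [6.5,0.0,0.0,6.5]
  print (dropMissing readings)   -- [6.5,6.5]
```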
10126

11-
```
127+
```haskell
12128
ghci> import qualified DataFrame as D
13-
ghci> let df = D.fromColumnList [D.fromList[Just 1, Just 1, Nothing, Nothing], D.fromList[Just 6.5, Nothing, Nothing, Just 6.5], D.fromList[Just 3.0, Nothing, Nothing, Just 3.0]]
129+
ghci> let df = D.fromUnnamedColumns [D.fromList [Just 1, Just 1, Nothing, Nothing], D.fromList [Just 6.5, Nothing, Nothing, Just 6.5], D.fromList [Just 3.0, Nothing, Nothing, Just 3.0]]
14130
ghci> df
15131
---------------------------------------------------
16132
index | 0 | 1 | 2
@@ -83,5 +199,5 @@ ghci> D.impute @Double "0" 0 df
83199
apply @<Type> arg1 arg2
84200
```
85201

86-
In general, Haskell would usually have a compile-time. But because dataframes are usually run in REPL-like environments which offer immediate feedback to users, `dataframe` is fine turning these into compile-time exceptions.
202+
In general, Haskell would usually report such errors at compile time. But because dataframes are usually used in REPL-like environments, which offer immediate feedback to users, `dataframe` is fine with turning these into runtime exceptions.
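
The sketch below illustrates the general idea of checking a type at runtime rather than at compile time. It hides a column's element type behind an existential wrapper and recovers it with `cast` from `Data.Typeable`. The `AnyColumn` type and the `columnAs` function are assumptions made for this example only; the source does not show how `dataframe` stores columns, and this version returns `Nothing` on a mismatch instead of throwing an exception.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE TypeApplications #-}
module Main where

import Data.Typeable (Typeable, cast)

-- | Hypothetical column whose element type is only known at runtime.
data AnyColumn = forall a. Typeable a => AnyColumn [a]

-- | Try to view the column at the requested element type.
-- The compiler cannot check the request, so a mismatch shows up
-- only when the program runs (here as Nothing).
columnAs :: forall a. Typeable a => AnyColumn -> Maybe [a]
columnAs (AnyColumn xs) = cast xs

main :: IO ()
main = do
  let col = AnyColumn ([1.0, 2.5] :: [Double])
  print (columnAs @Double col) -- Just [1.0,2.5]
  print (columnAs @Int col)    -- Nothing
```

The `@Double` and `@Int` annotations play the same role as the `@<Type>` argument in the usage pattern above: they tell the function which type the caller expects to find.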
87203

src/DataFrame/Operations/Core.hs

Lines changed: 2 additions & 2 deletions
@@ -187,8 +187,8 @@ partiallyParsed _ = 0
187187
fromNamedColumns :: [(T.Text, Column)] -> DataFrame
188188
fromNamedColumns = L.foldl' (\df (name, column) -> insertColumn name column df) empty
189189

190-
fromUnamedColumns :: [Column] -> DataFrame
191-
fromUnamedColumns = fromNamedColumns . zip (map (T.pack . show) [0..])
190+
fromUnnamedColumns :: [Column] -> DataFrame
191+
fromUnnamedColumns = fromNamedColumns . zip (map (T.pack . show) [0..])
192192

193193
-- | O (k * n) Counts the occurrences of each value in a given column.
194194
valueCounts :: forall a. (Columnable a) => T.Text -> DataFrame -> [(a, Int)]
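
The renamed `fromUnnamedColumns` in this hunk builds its column names by zipping the columns with the infinite list `[0..]` rendered as text. A minimal standalone sketch of that naming step follows, using plain `String`s and lists in place of the library's `T.Text` and `Column` types; the `nameColumns` helper is hypothetical.

```haskell
module Main where

-- | Pair each column with its position, rendered as a string, starting at 0.
-- zip stops at the shorter list, so the infinite index list is safe here.
nameColumns :: [col] -> [(String, col)]
nameColumns = zip (map show [(0 :: Int) ..])

main :: IO ()
main = mapM_ print (nameColumns [["Monday", "Tuesday"], ["24", "20"], ["14", "13"]])
```

With three columns the generated names are "0", "1" and "2", matching the header row shown in the tutorial output.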
