|
1 | 1 | # Haskell for Data Analysis |
2 | 2 |
|
3 | | -This section ports/mirrors Wes McKinney's book [Python for Data Analysis](https://wesmckinney.com/book/). Examples and organizations are drawn from there. This tutorial assumes an understanding of Haskell. |
| 3 | +This section ports/mirrors Wes McKinney's book [Python for Data Analysis](https://wesmckinney.com/book/). Examples and the overall organization are drawn from there. This tutorial does not assume an understanding of Haskell. |
| 4 | + |
| 5 | +## What is a dataframe? |
| 6 | + |
| 7 | +A DataFrame is like a spreadsheet or a table — it organizes data into rows and columns. |
| 8 | + |
| 9 | +* Each column has a name (like "Name", "Age", or "Price") and usually contains the same type of information (like numbers or text). |
| 10 | + |
| 11 | +* Each row is one entry or record — like a person, a product, or a day’s worth of sales. |
| 12 | + |
| 13 | +Imagine an Excel sheet or Google Sheets file: |
| 14 | + |
| 15 | +| Name | Age | City | |
| 16 | +|-------|-------|-----------| |
| 17 | +| Alice | 30 | New York | |
| 18 | +| Bob | 25 | San Diego | |
| 19 | +| Cara | 35 | Austin | |
| 20 | + |
| 21 | +That’s essentially a DataFrame! |
| 22 | + |
| 23 | +DataFrames make it easy to: |
| 24 | +* Look at your data |
| 25 | +* Filter or sort it (like showing only people over 30) |
| 26 | +* Do math on it (like averaging ages) |
| 27 | +* Clean it (like removing bad or incomplete data) |
| 28 | + |
| 29 | +They're a key tool for data scientists, analysts, and programmers when working with data. This guide is about how to use dataframes in a language called Haskell. |
| 30 | + |
| 31 | +## Why use Haskell? |
| 32 | + |
| 33 | +* Having types around eliminates many kinds of bugs before you even run the code. |
| 34 | +* It's easy to write pipelines. |
| 35 | +* The Haskell compiler (GHC) performs many optimizations, so compiled code can be very fast. |
| 36 | +* The syntax is more approachable than other compiled languages' dataframes. |
| 37 | + |
| 38 | +## What to install |
| 39 | +For most of this guide we will be using a tool called GHCi, a program that lets you write and evaluate Haskell code interactively (the 'i' in GHCi stands for interactive). GHCi comes bundled with an installation of the Haskell programming language. At the time of writing, [ghcup](https://www.haskell.org/ghcup/) is the best way to install Haskell tooling. To get through this guide you will also need a tool called Cabal, which is also installed via ghcup. Cabal is a package manager for Haskell; it lets you download the libraries you need to code along. |
| 40 | + |
| 41 | +After you've installed cabal you'll need to install `dataframe`. To do so run `cabal update && cabal install dataframe` on your command line. To start running GHCi type `cabal repl --build-depends dataframe` in your terminal. |
| 42 | + |
| 43 | +You're now ready to start using and exploring dataframes! |
| 44 | + |
| 45 | +## Getting the data |
| 46 | +Data enters a computer program in one of two ways: |
| 47 | +* manual entry of the data by a human, or, |
| 48 | +* through a file whose data was the output of another computer program or the result of manual entry. |
| 49 | + |
| 50 | +We will show how to do both in dataframes. |
| 51 | + |
| 52 | +### Entering the data manually |
| 53 | + |
| 54 | +I live in Seattle where the weather is a legitimate, non-small-talk topic of conversation for most of the year. At any given point in time I care about what the weather is and what it will be. I'd like to do some simple computation on a week's worth of high and low temperatures. A week of data is small enough that I can enter it myself so I'll do just that. |
| 55 | + |
| 56 | +```haskell |
| 57 | +ghci> import qualified DataFrame as D |
| 58 | +ghci> let df = D.fromNamedColumns [("Day", D.fromList ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]), ("High Temperature (Celsius)", D.fromList [24, 20, 22, 23, 25, 26, 26]), ("Low Temperature (Celsius)", D.fromList [14, 13, 13, 13, 14, 15, 15])] |
| 59 | +ghci> df |
| 60 | +-------------------------------------------------------------------------- |
| 61 | +index | Day | High Temperature (Celsius) | Low Temperature (Celsius) |
| 62 | +------|-----------|----------------------------|-------------------------- |
| 63 | + Int | [Char] | Integer | Integer |
| 64 | +------|-----------|----------------------------|-------------------------- |
| 65 | +0 | Monday | 24 | 14 |
| 66 | +1 | Tuesday | 20 | 13 |
| 67 | +2 | Wednesday | 22 | 13 |
| 68 | +3 | Thursday | 23 | 13 |
| 69 | +4 | Friday | 25 | 14 |
| 70 | +5 | Saturday | 26 | 15 |
| 71 | +6 | Sunday | 26 | 15 |
| 72 | +``` |
| 73 | + |
| 74 | +We use the function `fromNamedColumns` to create a dataframe from manually entered data. The format of the function is `fromNamedColumns [(<name>, <column>), (<name>, <column>),...]`. It has an equivalent for data without column names called `fromUnnamedColumns`. |
| 75 | + |
| 76 | +```haskell |
| 77 | +ghci> let df = D.fromUnnamedColumns [D.fromList ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"], D.fromList [24, 20, 22, 23, 25, 26, 26], D.fromList [14, 13, 13, 13, 14, 15, 15]] |
| 78 | +ghci> df |
| 79 | +------------------------------------- |
| 80 | +index | 0 | 1 | 2 |
| 81 | +------|-----------|---------|-------- |
| 82 | + Int | [Char] | Integer | Integer |
| 83 | +------|-----------|---------|-------- |
| 84 | +0 | Monday | 24 | 14 |
| 85 | +1 | Tuesday | 20 | 13 |
| 86 | +2 | Wednesday | 22 | 13 |
| 87 | +3 | Thursday | 23 | 13 |
| 88 | +4 | Friday | 25 | 14 |
| 89 | +5 | Saturday | 26 | 15 |
| 90 | +6 | Sunday | 26 | 15 |
| 91 | +``` |
| 92 | + |
| 93 | +This function automatically names columns with consecutive numbers starting from 0. This is generally bad practice (everything should have a descriptive name) but is useful for an initial entry where the columns are unknown or have no name. |
| 94 | + |
| 95 | +### Getting the data from a file |
| 96 | + |
| 97 | +We can also get data from a CSV file using the `readCsv` function. |
| 98 | + |
| 99 | +```haskell |
| 100 | +ghci> df <- D.readCsv "./data/housing.csv" |
| 101 | +ghci> D.take 10 df |
| 102 | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 103 | +index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| 104 | +------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|---------------- |
| 105 | + Int | Double | Double | Double | Double | Maybe Double | Double | Double | Double | Double | Text |
| 106 | +------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|---------------- |
| 107 | +0 | -122.23 | 37.88 | 41.0 | 880.0 | Just 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 108 | +1 | -122.22 | 37.86 | 21.0 | 7099.0 | Just 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 109 | +2 | -122.24 | 37.85 | 52.0 | 1467.0 | Just 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 110 | +3 | -122.25 | 37.85 | 52.0 | 1274.0 | Just 235.0 | 558.0 | 219.0 | 5.6431000000000004 | 341300.0 | NEAR BAY |
| 111 | +4 | -122.25 | 37.85 | 52.0 | 1627.0 | Just 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 112 | +5 | -122.25 | 37.85 | 52.0 | 919.0 | Just 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 113 | +6 | -122.25 | 37.84 | 52.0 | 2535.0 | Just 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 114 | +7 | -122.25 | 37.84 | 52.0 | 3104.0 | Just 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 115 | +8 | -122.26 | 37.84 | 42.0 | 2555.0 | Just 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 116 | +9 | -122.25 | 37.84 | 52.0 | 3549.0 | Just 707.0 | 1551.0 | 714.0 | 3.6912000000000003 | 261100.0 | NEAR BAY |
| 117 | +``` |
| 118 | + |
| 119 | +We've introduced a new function in the example above. Given a number `n` and a dataframe, the `take` function returns only the first `n` rows of the dataframe. We use it here to inspect a few rows without printing the whole dataframe. |
4 | 120 |
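Haskell's standard library already has a `take` for ordinary lists, which is one reason the `DataFrame` module is imported qualified as `D`. The sketch below only shows the list version from the Prelude, as an analogy; whether `D.take` behaves the same way when `n` exceeds the number of rows is not stated in this guide, so treat that part as an assumption. The name `week` is made up for illustration.

```haskell
-- A plain list standing in for the rows of a table (illustrative only).
week :: [String]
week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

main :: IO ()
main = do
  -- Keep only the first two elements.
  print (take 2 week)
  -- Asking for more elements than exist returns the whole list.
  print (take 10 week)
  -- Asking for zero elements returns the empty list.
  print (take 0 week)
```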
|
5 | 121 | ## Data preparation |
6 | | -Data in the wild doesn't always come in a form that's easy to work with. A data analysis tool should make preparing and cleaning data easy. There are a number of common issues that data analysis too must handle. We'll go through a few common ones and show how to deal with them in Haskell. |
| 122 | +Data in the wild doesn't always come in a form that's easy to work with. A data analysis tool should make preparing and cleaning data easy. There are a number of common issues that a data analysis tool must handle. We'll go through a few common ones and show how to deal with them in Haskell. |
7 | 123 |
|
8 | 124 | ### Handling missing data |
9 | | -In Haskell, potentially missing values are represented by a "wrapper" type called [`Maybe`](https://en.wikibooks.org/wiki/Haskell/Understanding_monads/Maybe). |
| 125 | +Data is often incomplete, sometimes for legitimate reasons and often because of errors. Handling missing data is a foundational task in data analysis. In Haskell, potentially missing values are represented by a "wrapper" type called [`Maybe`](https://en.wikibooks.org/wiki/Haskell/Understanding_monads/Maybe). |
10 | 126 |
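For readers new to `Maybe`, here is a minimal sketch of the idea on an ordinary Haskell list rather than a dataframe. It uses only the standard `Data.Maybe` module; the names `readings`, `fillMissing` and `dropMissing` are made up for illustration and are not part of the `dataframe` library.

```haskell
import Data.Maybe (catMaybes, fromMaybe)

-- A column with gaps: Just x is a value that is present,
-- Nothing marks a value that is missing.
readings :: [Maybe Double]
readings = [Just 6.5, Nothing, Nothing, Just 6.5]

-- Replace every missing value with a default.
fillMissing :: a -> [Maybe a] -> [a]
fillMissing def = map (fromMaybe def)

-- Drop the missing values entirely.
dropMissing :: [Maybe a] -> [a]
dropMissing = catMaybes

main :: IO ()
main = do
  print (fillMissing 0 readings)  -- prints [6.5,0.0,0.0,6.5]
  print (dropMissing readings)    -- prints [6.5,6.5]
```

Because the type `[Maybe Double]` says up front that values may be absent, the compiler will not let you treat the column as plain numbers until you have decided what to do with the gaps.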
|
11 | | -``` |
| 127 | +```haskell |
12 | 128 | ghci> import qualified DataFrame as D |
13 | | -ghci> let df = D.fromColumnList [D.fromList[Just 1, Just 1, Nothing, Nothing], D.fromList[Just 6.5, Nothing, Nothing, Just 6.5], D.fromList[Just 3.0, Nothing, Nothing, Just 3.0]] |
| 129 | +ghci> let df = D.fromUnnamedColumns [D.fromList [Just 1, Just 1, Nothing, Nothing], D.fromList [Just 6.5, Nothing, Nothing, Just 6.5], D.fromList [Just 3.0, Nothing, Nothing, Just 3.0]] |
14 | 130 | ghci> df |
15 | 131 | --------------------------------------------------- |
16 | 132 | index | 0 | 1 | 2 |
@@ -83,5 +199,5 @@ ghci> D.impute @Double "0" 0 df |
83 | 199 | apply @<Type> arg1 arg2 |
84 | 200 | ``` |
85 | 201 |
|
86 | | -In general, Haskell would usually have a compile-time. But because dataframes are usually run in REPL-like environments which offer immediate feedback to users, `dataframe` is fine turning these into compile-time exceptions. |
| 202 | +In general, Haskell would usually report such errors at compile time. But because dataframes are usually used in REPL-like environments that give users immediate feedback, `dataframe` is fine with turning these into runtime exceptions. |
87 | 203 |
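The `@Type` syntax is a general Haskell feature called type application, not something specific to `dataframe`. As a rough analogy using only the standard library, the sketch below uses `read` and `readMaybe` to show how a type chosen with `@` can type-check and still fail against the actual data at run time. The helper name `parseAge` is hypothetical.

```haskell
{-# LANGUAGE TypeApplications #-}

import Text.Read (readMaybe)

-- Hypothetical helper: try to parse a string as an Int,
-- returning Nothing instead of throwing when the text does not fit.
parseAge :: String -> Maybe Int
parseAge = readMaybe @Int

main :: IO ()
main = do
  -- The same string read at two different types chosen with @Type.
  print (read @Int "42")     -- prints 42
  print (read @Double "42")  -- prints 42.0
  -- read @Int "6.5" would compile, but throw an exception when evaluated,
  -- because the mismatch is only discovered at run time.
  print (parseAge "6.5")     -- prints Nothing
  print (parseAge "42")      -- prints Just 42
```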
|