DISCLAIMER: This is not an officially supported Google product.
This library implements a table, where the cells are referenced by row and column, and stored independently; with cells being dynamically loaded from and saved to disk as needed. Additionally, the table keeps track of "dependent" cells, so that when a cell gets updated, its dependent cells will get recalculated.
The main method for managing dependencies and formulas for cells is through Columns, classes that contain logic for updating, or "refreshing," cells. This library implements two such Columns: A SimpleFormula to apply operations on values in the same row, and a Waterfall, which is akin to a SUMIF on earlier rows.
The basic tool from this library is a class called Table, which is initialized with a "prefix". A prefix is a name for the table; a lot of files get saved with the table, and the prefix literally prefixes these file names.
The Table class has basic set and get functions for the cells in the table, called set_cell_value and get_cell_value. A cell is identified by date (which acts as a group of rows) and the column (identified by column name) and by key, which subdivides date, identifying a single row. Date and column_name are stored together as a cell address (CellAddr), which represents a block of cells that should get updated together.
Note: Date may be any integer, but is assumed to be in the format YYYYMMDD.
The APIs for set_cell_value and get_cell_value are:
- set_cell_value(cell_addr: CellAddr, key: CellKey, value: Any) -> None
- get_cell_value(cell_addr: CellAddr, key: CellKey) -> Optional[Any]
If for example, we wanted to store the value 123.45 to the key "Checking" on the date Apr 24, 2017, in the deposit column, we could call:
- set_cell_value(CellAddr(20170424, "deposit"), key="Checking", value=1232.45)
As cells get updated, this will change the value of cells that depend on them. This doesn't get done automatically; the user needs to call refresh() on the table. The table's internals keep track of what needs to be updated and in what order, and will update cells accordingly.
You can export the contents of the table to a pandas-style dataframe by calling make_df on the table. This takes two arguments: The columns that you want to export, and the dates that you want to export. (Though if you may leave dates unset, and it will return all dates.)
A table is loaded and saved using open() and close(), which by practice should be called in all case when starting and finishing working with a table. In addition to these functions, the table will dynamically save and load some data in order to avoid ever having too much data held in memory. This should be scalable to arbitrarily-large data sets.
There's a long example below.
To store data, we need to build Columns.
Columns are holders of logic, and sometimes state. They don't hold any data; that's done in the table. At a minimum, columns need a name (a string used to reference them; this was used on the set/get methods for example) and a reference to the table they're being built on.
Our library defines 4 columns:
- FlatColumn
- ProtectedColumn
- SimpleFormula
- Waterfall
The first two are basic. They contain no logic; they're just canvases for storing data. The only difference between these is that ProtectedColumns hold data that wouldn't be known at the start of that date; for example a target variable on a model. In the current implementation, there is no functional difference between FlatColumns and ProtectedColumns; the user might choose to differentiate as a convention.
A SimpleFormula column uses columns from the same row to calculate the value. It takes four arguments at initialization:
- A name for the column,
- The table that we want to add the column to,
- A function, f, which takes as its only argument a dict representing a row in the table, whose keys are the required columns and whose values are the values of those columns in a row. The output of the function is what will be stored in this column for that row.
- Required columns - that is any column that, when refreshed, should trigger a refresh for this column.
A Waterfall column adds up (recent) values from a given column, that match keys in another column (similar to a SUMIF from other software). Actually we can add multiple columns (WaterfallMap values) conditioned on multiple different columns (WaterfallMap keys); and the match condition we set may come from yet another column (output key). The initializer takes six arguments:
- A name for the column,
- The table that we want to add the column to,
- Required columns - that is any column that when refreshed should trigger a refresh for this column,
- Input maps - These are WaterfallMap objects that say which value to add, and which key to store it to.
- An output key, which says which key we lookup when we actually go to save.
- Tail length (in years), which says how much time to use in the running sum.
For an example, we look at some high-level Canadian Football (CFB) data.
I have some data about the final match of the Grey Cup (playoffs) from each year, along with scores, saved to a csv in the data folder.
>>> import pandas as pd
>>> cfb_df = pd.read_csv("data/cfb_test_data.csv")
>>> pd.set_option('display.max_columns', None) # Show all columns
>>> cfb_df.head()
Date Winning Team Winning Points Losing Team \
0 20191201 Winnipeg Blue Bombers 33 Hamilton Tiger-Cats
1 20181201 Calgary Stampeders 27 Ottawa RedBlacks
2 20171201 Toronto Argonauts 27 Calgary Stampeders
3 20161201 Ottawa RedBlacks 39 Calgary Stampeders
4 20151201 Edmonton Eskimos 26 Ottawa RedBlacks
<BLANKLINE>
Losing Points
0 12
1 16
2 24
3 33
4 20
Note: I changed all the date to Nov 1st, so that if I add the last X years, it will do finals from the last X years, rather than X-1 to X+1 depending on the specific dates.
Throughout this example, we pretend that we want to build a model that predicts winners, and build features to that end.
First let's store all the data into a new table.
>>> from cell_link import *
>>> PREFIX = "CFBTEST"
>>> cfb = Table(PREFIX)
>>> cfb.open()
Make three flat columns, Date, Team1, and Team2. These are the data that we know before the game happens; flat means that they aren't based on a formula. And make two protected columns, Points1 and Points2. These are the corresponding points scored. We make these columns protected because they aren't available to make predictions about this game.
>>> date_col = FlatColumn("Date", table=cfb)
>>> team_1 = FlatColumn("Team1", table=cfb)
>>> team_2 = FlatColumn("Team2", table=cfb)
>>> points_1 = ProtectedColumn("Points1", table=cfb)
>>> points_2 = ProtectedColumn("Points2", table=cfb)
Now we store data to the table. As we do we'll randomly assign winners to be team 1 or team 2. This is a modeling decision not related to our CellLink design.
>>> import random
>>> random.seed(1)
>>> key = "SINGLE_KEY"
>>> for _, row in cfb_df.iterrows():
... date = row["Date"]
... team1_map = "Winning" if random.random() < 0.5 else "Losing"
... team2_map = "Winning" if team1_map == "Losing" else "Losing"
... cfb.set_cell_value(CellAddr(date, "Date"), key, value=date)
... cfb.set_cell_value(CellAddr(date, "Team1"), key, row["{} Team".format(
... team1_map)])
... cfb.set_cell_value(CellAddr(date, "Team2"), key, row["{} Team".format(
... team2_map)])
... cfb.set_cell_value(CellAddr(date, "Points1"), key, row["{} Points".format(
... team1_map)])
... cfb.set_cell_value(CellAddr(date, "Points2"), key, row["{} Points".format(
... team2_map)])
Note: As we save data to the table, we use a single key. This is only because we have only one row per date. In general, there can be multiple entries per date, and the key is what tells the program which row to edit.
By setting up our data this way, we've lost track of which team won, and we want to add that back. One way to do this is to build a SimpleFormula column:
>>> def winner_f(row):
... if row["Points1"] > row["Points2"]:
... return "1"
... return "2" # May assume no ties.
>>> winner = SimpleFormula("Winner", table=cfb, f=winner_f, required_columns=[
... "Points1", "Points2"])
>>> cfb.refresh()
We can see the state of the table so far:
>>> cfb.make_df(columns=["Date", "Points1", "Points2", "Winner"],
... dates=range(20141201, 20201201, 10000))
Date Points1 Points2 Winner
0 20141201 20 16 1
1 20151201 26 20 1
2 20161201 39 33 1
3 20171201 24 27 2
4 20181201 16 27 2
5 20191201 33 12 1
The advantage of the cell-link table is that we can update cells, and dependent fields will update in turn.
>>> cfb.set_cell_value(CellAddr(20141201, "Points2"), key, 100)
>>> cfb.refresh()
>>> cfb.make_df(columns=["Date", "Points1", "Points2", "Winner"],
... dates=range(20141201, 20201201, 10000))
Date Points1 Points2 Winner
0 20141201 20 100 2
1 20151201 26 20 1
2 20161201 39 33 1
3 20171201 24 27 2
4 20181201 16 27 2
5 20191201 33 12 1
Now let's build some features. Maybe a predictive feature would be how many times the team has made it to the final game in the last ten years. We can calculate this with the Waterfall column:
First we will need to make a column of ones, because Waterfall always operates by adding another column. For us, we will add 1 if the team is present in prior rows. To accomplish this, we use another row type, ConstColumn, which takes in the const we want to fill in. This will continue to will in 1s as a we add new rows too.
>>> ones = ConstColumn("Ones", table=cfb, const=1)
Now we build Waterfall columns for both Player1 and Player2. For this column, we set:
- Input maps - We sum the Ones column for each row (the number 1), but we add this for both the "Team1" key and the "Team2" key.
- Output key - For Team1 10yr Appearances, we look at Team1 to know which sum (of those we calculated) we should look at.
- Tail length - to count pariticipation in the previous ten years, we set this to 10.
>>> team1_10yr_appearances = Waterfall(
... "Team1 10yr Appearances", table=cfb,
... required_columns=["Team1", "Team2", "Ones"],
... input_maps=[WaterfallMap("Team1", "Ones"),
... WaterfallMap("Team2", "Ones")],
... output_key="Team1",
... tail_length_years=10)
>>> team2_10yr_appearances = Waterfall(
... "Team2 10yr Appearances", table=cfb,
... required_columns=["Team1", "Team2", "Ones"],
... input_maps=[WaterfallMap("Team1", "Ones"),
... WaterfallMap("Team2", "Ones")],
... output_key="Team2",
... tail_length_years=10)
>>> cfb.refresh()
We can now see these columns:
>>> cfb.make_df(columns=["Date", "Team1", "Team2", "Team1 10yr Appearances",
... "Team2 10yr Appearances"],
... dates=range(20041201, 20201201, 10000))
Date Team1 Team2 \
0 20041201 BC Lions Toronto Argonauts
1 20051201 Edmonton Eskimos Montreal Alouettes
2 20061201 BC Lions Montreal Alouettes
3 20071201 Winnipeg Blue Bombers Saskatchewan Roughriders
4 20081201 Calgary Stampeders Montreal Alouettes
5 20091201 Saskatchewan Roughriders Montreal Alouettes
6 20101201 Montreal Alouettes Saskatchewan Roughriders
7 20111201 BC Lions Winnipeg Blue Bombers
8 20121201 Calgary Stampeders Toronto Argonauts
9 20131201 Hamilton Tiger-Cats Saskatchewan Roughriders
10 20141201 Calgary Stampeders Hamilton Tiger-Cats
11 20151201 Edmonton Eskimos Ottawa RedBlacks
12 20161201 Ottawa RedBlacks Calgary Stampeders
13 20171201 Calgary Stampeders Toronto Argonauts
14 20181201 Ottawa RedBlacks Calgary Stampeders
15 20191201 Winnipeg Blue Bombers Hamilton Tiger-Cats
<BLANKLINE>
Team1 10yr Appearances Team2 10yr Appearances
0 2 2
1 3 3
2 2 4
3 1 1
4 3 5
5 1 6
6 7 2
7 2 2
8 1 1
9 0 3
10 2 1
11 1 0
12 1 3
13 4 1
14 2 5
15 1 2
In 2017, Team 1 was the Stampeders, and we say that they've had 4 recent appearances. We can see that these are from years 08, 12, 14, and 16. On the other hand, in 2015, the RedBlacks have no recent appearances. (This is actually their first time in the final round since their founding in 2010.)
Another potentially predictive variable is point differential (PD), defined as points scored (in past games) minus points opponent scored. To calculate this, we first calculate points for (PF) and points against (PA) for the last ten years.
>>> pf1 = Waterfall("PF1", table=cfb,
... required_columns=["Team1", "Team2", "Points1", "Points2"],
... input_maps=[WaterfallMap("Team1", "Points1"),
... WaterfallMap("Team2", "Points2")],
... output_key="Team1", tail_length_years=10)
>>> pf2 = Waterfall("PF2", table=cfb,
... required_columns=["Team1", "Team2", "Points1", "Points2"],
... input_maps=[WaterfallMap("Team1", "Points1"),
... WaterfallMap("Team2", "Points2")],
... output_key="Team2", tail_length_years=10)
>>> pa1 = Waterfall("PA1", table=cfb,
... required_columns=["Team1", "Team2", "Points1", "Points2"],
... input_maps=[WaterfallMap("Team1", "Points2"),
... WaterfallMap("Team2", "Points1")],
... output_key="Team1", tail_length_years=10)
>>> pa2 = Waterfall("PA2", table=cfb,
... required_columns=["Team1", "Team2", "Points1", "Points2"],
... input_maps=[WaterfallMap("Team1", "Points2"),
... WaterfallMap("Team2", "Points1")],
... output_key="Team2", tail_length_years=10)
Now we put SimpleFormulas on top of the new fields.
>>> pd1 = SimpleFormula("PD1", table=cfb, f=lambda row: row["PF1"]-row["PA1"],
... required_columns=["PF1", "PA1"])
>>> pd2 = SimpleFormula("PD2", table=cfb, f=lambda row: row["PF2"]-row["PA2"],
... required_columns=["PF2", "PA2"])
>>> cfb.refresh()
And display:
>>> cfb.make_df(columns=["Date", "Team1", "PD1", "Team2", "PD2"],
... dates=range(20091201, 20201201, 10000))
Date Team1 PD1 Team2 PD2
0 20091201 Saskatchewan Roughriders 4 Montreal Alouettes -27
1 20101201 Montreal Alouettes -26 Saskatchewan Roughriders 3
2 20111201 BC Lions 3 Winnipeg Blue Bombers -12
3 20121201 Calgary Stampeders 8 Toronto Argonauts 8
4 20131201 Hamilton Tiger-Cats 0 Saskatchewan Roughriders 0
5 20141201 Calgary Stampeders -5 Hamilton Tiger-Cats -22
6 20151201 Edmonton Eskimos 3 Ottawa RedBlacks 0
7 20161201 Ottawa RedBlacks -6 Calgary Stampeders -85
8 20171201 Calgary Stampeders -91 Toronto Argonauts 13
9 20181201 Ottawa RedBlacks 0 Calgary Stampeders -94
10 20191201 Winnipeg Blue Bombers -11 Hamilton Tiger-Cats 58
We could go even further and calculate points differential per game (PDPG). Notice we need to handle the denominator = 0.
>>> def pdpg_f(row, player_no):
... appearances = row["Team{} 10yr Appearances".format(player_no)]
... if appearances == 0:
... return None
... return row["PD{}".format(player_no)] / appearances
>>> pdpg1 = SimpleFormula("PDPG1", table=cfb, f=lambda row: pdpg_f(row, "1"),
... required_columns=["Team1 10yr Appearances", "PD1"])
>>> pdpg2 = SimpleFormula("PDPG2", table=cfb, f=lambda row: pdpg_f(row, "2"),
... required_columns=["Team2 10yr Appearances", "PD2"])
>>> cfb.refresh()
>>> cfb.make_df(columns=["Date", "Team1", "PDPG1", "Team2", "PDPG2"],
... dates=range(20091201, 20201201, 10000))
Date Team1 PDPG1 Team2 \
0 20091201 Saskatchewan Roughriders 4.000000 Montreal Alouettes
1 20101201 Montreal Alouettes -3.714286 Saskatchewan Roughriders
2 20111201 BC Lions 1.500000 Winnipeg Blue Bombers
3 20121201 Calgary Stampeders 8.000000 Toronto Argonauts
4 20131201 Hamilton Tiger-Cats NaN Saskatchewan Roughriders
5 20141201 Calgary Stampeders -2.500000 Hamilton Tiger-Cats
6 20151201 Edmonton Eskimos 3.000000 Ottawa RedBlacks
7 20161201 Ottawa RedBlacks -6.000000 Calgary Stampeders
8 20171201 Calgary Stampeders -22.750000 Toronto Argonauts
9 20181201 Ottawa RedBlacks 0.000000 Calgary Stampeders
10 20191201 Winnipeg Blue Bombers -11.000000 Hamilton Tiger-Cats
<BLANKLINE>
PDPG2
0 -4.500000
1 1.500000
2 -6.000000
3 8.000000
4 0.000000
5 -22.000000
6 NaN
7 -28.333333
8 13.000000
9 -18.800000
10 29.000000
Let's say that we found out that there's a data error, and that in 2009 the Alouettes (Team1) actually scored 100,000 points. We can update that field, and this will trigger an update of Points For, Points Against, Points Differential, and PDPG in that order:
>>> cfb.set_cell_value(CellAddr(20091201, "Points1"), key, 100000)
>>> cfb.refresh()
>>> cfb.make_df(columns=["Date", "Team1", "PDPG1", "Team2", "PDPG2"],
... dates=range(20091201, 20201201, 10000))
Date Team1 PDPG1 \
0 20091201 Saskatchewan Roughriders 4.000000
1 20101201 Montreal Alouettes -14285.571429
2 20111201 BC Lions 1.500000
3 20121201 Calgary Stampeders 8.000000
4 20131201 Hamilton Tiger-Cats NaN
5 20141201 Calgary Stampeders -2.500000
6 20151201 Edmonton Eskimos 3.000000
7 20161201 Ottawa RedBlacks -6.000000
8 20171201 Calgary Stampeders -22.750000
9 20181201 Ottawa RedBlacks 0.000000
10 20191201 Winnipeg Blue Bombers -11.000000
<BLANKLINE>
Team2 PDPG2
0 Montreal Alouettes -4.500000
1 Saskatchewan Roughriders 49988.000000
2 Winnipeg Blue Bombers -6.000000
3 Toronto Argonauts 8.000000
4 Saskatchewan Roughriders 33324.333333
5 Hamilton Tiger-Cats -22.000000
6 Ottawa RedBlacks NaN
7 Calgary Stampeders -28.333333
8 Toronto Argonauts 13.000000
9 Calgary Stampeders -18.800000
10 Hamilton Tiger-Cats 29.000000
Note: We see that this negatively affects the Roughriders, who were their opponent that year.
Another nice feature of this is that we can add a new row easily. If we find out that Ottawa RedBlacks are slated to play the Hamilton Tiger-Cats in 2020, then we can add this row:
>>> cfb.set_cell_value(CellAddr(20201201, "Team1"), key, "Ottawa RedBlacks")
>>> cfb.set_cell_value(CellAddr(20201201, "Team2"), key, "Hamilton Tiger-Cats")
>>> cfb.refresh()
This triggers a calculation for all of our fields, we can then see:
>>> cfb.make_df(columns=["Date", "Team1", "Team2", "Team1 10yr Appearances",
... "Team2 10yr Appearances", "PD1", "PD2", "PDPG1", "PDPG2"], dates=[
... 20201201])
Date Team1 Team2 Team1 10yr Appearances \
0 None Ottawa RedBlacks Hamilton Tiger-Cats 3
<BLANKLINE>
Team2 10yr Appearances PD1 PD2 PDPG1 PDPG2
0 3 -11 37 -3.666667 12.333333
Clean-up
>>> cfb.close()
>>> for root,_, files in os.walk("data"):
... for file in files:
... if file.find(PREFIX) != -1:
... os.remove(os.path.join(root, file))