Commit 766aa2b

Cleanup for public release

1 parent ea956f6

15 files changed, +82 -205 lines changed

README.md

+80 -3
@@ -1,11 +1,76 @@
-# pine
+# Pine - a tool for Random Decision Forests

-ensembles of random decision trees
+Ensembles of random decision trees. Make predictions using the machine learning technique.

-![random decision tree ensembles training](decision-ensembles.png)
+See [this Kaggle discussion](https://www.kaggle.com/general/3920) of the term *random forest*.
+
+## usage
+
+[Go](https://golang.org/dl) is required to build this app.
+
+```bash
+cd pine/tree
+make deps
+go build
+./tree # prints help
+```
+
+Training:
+```bash
+./tree -train -data=../test-data/iris.csv -save=../sav.gob
+```
+
+Predicting:
+```bash
+./tree -pred -model=../sav.gob -seed=5.7,3.8,1.7,0.3
+```
+
+All options:
+
+```text
+Usage of ./tree:
+  -charmode skipSize
+        Character prediction mode rather than numeric feature mode. This will create test cases by iterating through the data skipSize at a time, and making the previous `sequenceLength` items have higher weights based on the closeness to the current item being predicted.
+  -data string
+        Training data input file
+  -folds int
+        How many subdivisions of the dataset to make for cross-validation (default 5)
+  -m int
+        Override calculation for feature split size (little m)
+  -max int
+        Stop predicting after this many rounds (-pred only)
+  -model string
+        Load a pretrained model for prediction
+  -pred
+        Make a prediction
+  -profile string
+        [cpu|mem] enable profiling
+  -save string
+        Where to save the model after training
+  -seed string
+        Predict based on this string of data
+  -seqlen int
+        Normally equal to the number of variables during -charmode, override for fewer previous look-behind-memory-variables in every input test case
+  -skipsize int
+        During -charmode, how many items to skip before making another training case (default 3)
+  -subsetpct float
+        Percent of the dataset which should be used to train a tree (always minus 1 fold for cross-validation) (default 0.6)
+  -tojson
+        Convert a model to json
+  -train
+        Train a model
+  -trees int
+        How many decision trees to make per fold of the dataset (default 1)
+```
+
+## experimental character mode
+
+There is an experimental `-charmode` flag that attempts to encode strings of text and make predictions on them, like you would with a neural network.

 ## how it works

+![random decision tree ensembles training](decision-ensembles.png)
+
 The input is a data set: rows of input features x, where the last column is the expected category y.
 Often these are encoded in CSV format. The data should be encoded as float32-parseable values.

@@ -32,3 +97,15 @@ To do it, start by splitting the whole dataset into equal bags (or folds) withou
 For example, say there are 20 samples and we want 4 folds. Each fold will have 5 samples, and no sample appears in more than one fold. The samples are assigned to the folds at random (sampling without replacement).

 Next, loop through all the folds. The fold for the current iteration is the test set, so reserve it for later. Use all the other folds to train a set of decision trees. In our example above, that means on the first fold we would use the last 3 for training, on the second we would use the first fold and the last two, etc. For every training set, construct decision trees that best predict it.
+
+# License
+
+MIT
+
+# Sources
+
+http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics
+
+http://blog.yhat.com/posts/random-forests-in-python.html
+
+https://machinelearningmastery.com/implement-random-forest-scratch-python/
8 files renamed without changes.

rf.py

-192
This file was deleted.
2 files renamed without changes.

tree/datarow.go

+2
@@ -6,8 +6,10 @@ import (
 	"strings"
 )

+// datarow is a single row from a CSV file, where each column is a number
 type datarow []float32

+// parseRow is a utility which turns the file text from a CSV row into a list of numbers
 func parseRow(row string, rowIndex int) (dr datarow) {
 	cols := strings.Split(row, ",")
 	if len(cols) == 0 { // blank lines ignored
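
The hunk ends inside `parseRow`, so the rest of the function is not visible here. As a standalone illustration of turning one CSV row into float32 values, a minimal sketch might look like the following. `parseFloats` is a hypothetical helper, not the repository's `parseRow`, and the trailing `0` label in the sample row is made up for the example. One detail worth noting: `strings.Split` with a non-empty separator always returns at least one element, so this sketch detects blank lines by trimming the row rather than by checking for zero columns.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFloats is a hypothetical helper (not the repository's parseRow).
// It returns nil for blank lines and an error for non-numeric columns.
func parseFloats(row string) ([]float32, error) {
	row = strings.TrimSpace(row)
	if row == "" {
		return nil, nil // blank line: nothing to parse
	}
	cols := strings.Split(row, ",")
	out := make([]float32, 0, len(cols))
	for i, c := range cols {
		v, err := strconv.ParseFloat(strings.TrimSpace(c), 32)
		if err != nil {
			return nil, fmt.Errorf("column %d: %w", i, err)
		}
		out = append(out, float32(v))
	}
	return out, nil
}

func main() {
	// Four features followed by an illustrative category label.
	row, err := parseFloats("5.7,3.8,1.7,0.3,0")
	if err != nil {
		panic(err)
	}
	features, label := row[:len(row)-1], row[len(row)-1]
	fmt.Println(len(features), label) // prints: 4 0
}
```

Returning an error instead of silently skipping bad columns is a design choice of this sketch; the actual tool may handle malformed rows differently.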

tree/tree.go

-10
@@ -19,13 +19,3 @@ type Tree struct {
 	leftSamples []datarow // temp test cases for left group
 	rightSamples []datarow // temp test cases for right group
 }
-
-//func (t *Tree) String() string {
-// return fmt.Sprintf("VariableIndex: %f, ValueIndex: %f, LeftNode: %+v, RightNode: %+v, LeftTerminal: %f, RightTerminal: %f",
-// t.VariableIndex,
-// t.ValueIndex,
-// t.LeftNode,
-// t.RightNode,
-// t.LeftTerminal,
-// t.RightTerminal)
-//}
