Add tutorial for JSON import (#459)
* Treat some model parameters as optional

* Rename matching_categories -> categories_list

* Allow users to use arbitrary node_id

* Test with vector leaf

* String rep of task_type should have 'k' prefix

* cuML RF now uses vector leaf output

* Add tutorial for JSON import

* Add link to tutorial

* Fix typo

* more typo

* Fix formatting

* Add notes about lack of round-trip ability

* Fix typo

* Fix typo
hcho3 authored Mar 25, 2023
1 parent 9d8ca2c commit 0c6b0c5
Showing 9 changed files with 450 additions and 58 deletions.
16 changes: 7 additions & 9 deletions docs/tutorials/builder.rst
@@ -1,13 +1,11 @@
Specifying models using model builder
=====================================

-Since the scope of Treelite is limited to **prediction** only, one must use
-other machine learning packages to **train** decision tree ensemble models. In
-this document, we will show how to import an ensemble model that had been
-trained elsewhere.
-
-**Using XGBoost or LightGBM for training?** Read :doc:`this document <import>`
-instead.
+Treelite supports loading models from major tree libraries, such as XGBoost and
+scikit-learn. However, you may want to use models trained by other tree
+libraries that are not directly supported by Treelite. The model builder is
+useful in this use case. (Alternatively, consider
+:doc:`importing from JSON <json_import>` instead.)

.. contents:: Contents
  :local:
@@ -24,7 +22,7 @@ tree ensembles programmatically. Each tree ensemble is represented as follows:
* Each :py:class:`~treelite.ModelBuilder` object is a **list** of
:py:class:`~treelite.ModelBuilder.Tree` objects.

-A toy example
+Toy example
-------------
Consider the following tree ensemble, consisting of two regression trees:

@@ -85,7 +83,7 @@ Consider the following tree ensemble, consisting of two regression trees:

.. note:: Provision for missing data: default directions

-Decision trees in treelite accomodate `missing data
+Decision trees in Treelite accommodate `missing data
<https://en.wikipedia.org/wiki/Missing_data>`_ by indicating the
**default direction** for every test node. In the diagram above, the
default direction is indicated by label "Missing." For instance, the root node
1 change: 1 addition & 0 deletions docs/tutorials/index.rst
@@ -14,3 +14,4 @@ This page lists tutorials about Treelite.
deploy
deploy_java
builder
json_import
271 changes: 271 additions & 0 deletions docs/tutorials/json_import.rst
@@ -0,0 +1,271 @@
Specifying models using JSON string
===================================

Treelite supports loading models from major tree libraries, such as XGBoost and
scikit-learn. However, you may want to use models trained by other tree
libraries that are not directly supported by Treelite. The JSON importer is
useful in this use case. (Alternatively, consider
:doc:`using the model builder <builder>` instead.)

.. contents:: Contents
  :local:

Toy Example
-----------

Consider the following tree ensemble, consisting of two regression trees:

.. plot::
  :nofigs:

  from graphviz import Source
  source = r"""
    digraph toy1 {
      graph [fontname = "helvetica"];
      node [fontname = "helvetica"];
      edge [fontname = "helvetica"];
      0 [label=<<FONT COLOR="red">0:</FONT> Feature 1 ∈ {1, 2, 4} ?>, shape=box];
      1 [label=<<FONT COLOR="red">1:</FONT> Feature 2 &lt; -3.0 ?>, shape=box];
      2 [label=<<FONT COLOR="red">2:</FONT> +0.6>];
      3 [label=<<FONT COLOR="red">3:</FONT> -0.4>];
      4 [label=<<FONT COLOR="red">4:</FONT> +1.2>];
      0 -> 1 [labeldistance=2.0, labelangle=45, headlabel="Yes/Missing "];
      0 -> 2 [labeldistance=2.0, labelangle=-45, headlabel="No"];
      1 -> 3 [labeldistance=2.0, labelangle=45, headlabel="Yes"];
      1 -> 4 [labeldistance=2.0, labelangle=-45, headlabel=" No/Missing"];
    }
  """
  Source(source, format='png').render('../_static/json_import_toy1', view=False)
  Source(source, format='svg').render('../_static/json_import_toy1', view=False)

.. plot::
  :nofigs:

  from graphviz import Source
  source = r"""
    digraph toy2 {
      graph [fontname = "helvetica"];
      node [fontname = "helvetica"];
      edge [fontname = "helvetica"];
      0 [label=<<FONT COLOR="red">1:</FONT> Feature 0 &lt; 2.5 ?>, shape=box];
      1 [label=<<FONT COLOR="red">2:</FONT> +1.6>];
      2 [label=<<FONT COLOR="red">4:</FONT> Feature 2 &lt; -1.2 ?>, shape=box];
      3 [label=<<FONT COLOR="red">6:</FONT> +0.1>];
      4 [label=<<FONT COLOR="red">8:</FONT> -0.3>];
      0 -> 1 [labeldistance=2.0, labelangle=45, headlabel="Yes"];
      0 -> 2 [labeldistance=2.0, labelangle=-45, headlabel=" No/Missing"];
      2 -> 3 [labeldistance=2.0, labelangle=45, headlabel="Yes/Missing "];
      2 -> 4 [labeldistance=2.0, labelangle=-45, headlabel="No"];
    }
  """
  Source(source, format='png').render('../_static/json_import_toy2', view=False)
  Source(source, format='svg').render('../_static/json_import_toy2', view=False)

.. raw:: html

  <p>
  <img src="../_static/json_import_toy1.svg"
       onerror="this.src='../_static/json_import_toy1.png'; this.onerror=null;">
  <img src="../_static/json_import_toy2.svg"
       onerror="this.src='../_static/json_import_toy2.png'; this.onerror=null;">
  </p>

.. role:: red

where each node is assigned a **unique integer key**, indicated in :red:`red`.
Note that integer keys need to be unique only within the same tree.

You can construct this tree ensemble by calling
:py:meth:`~treelite.Model.import_from_json` with an appropriately formatted
JSON string. We will give you the example code first; in the following section,
we will explain the meaning of each field in the JSON string.

.. note:: :py:meth:`~treelite.Model.dump_as_json` will NOT preserve the JSON string that's passed into :py:meth:`~treelite.Model.import_from_json`

  The operation performed in :py:meth:`~treelite.Model.import_from_json` is strictly one-way,
  so the output of :py:meth:`~treelite.Model.dump_as_json` will differ from the JSON string
  you passed to :py:meth:`~treelite.Model.import_from_json`.

.. code-block:: python
  :linenos:
  :emphasize-lines: 78

  import treelite

  json_str = """
  {
      "num_feature": 3,
      "task_type": "kBinaryClfRegr",
      "average_tree_output": false,
      "task_param": {
          "output_type": "float",
          "grove_per_class": false,
          "num_class": 1,
          "leaf_vector_size": 1
      },
      "model_param": {
          "pred_transform": "identity",
          "global_bias": 0.0
      },
      "trees": [
          {
              "root_id": 0,
              "nodes": [
                  {
                      "node_id": 0,
                      "split_feature_id": 1,
                      "default_left": true,
                      "split_type": "categorical",
                      "categories_list": [1, 2, 4],
                      "categories_list_right_child": false,
                      "left_child": 1,
                      "right_child": 2
                  },
                  {
                      "node_id": 1,
                      "split_feature_id": 2,
                      "default_left": false,
                      "split_type": "numerical",
                      "comparison_op": "<",
                      "threshold": -3.0,
                      "left_child": 3,
                      "right_child": 4
                  },
                  {"node_id": 2, "leaf_value": 0.6},
                  {"node_id": 3, "leaf_value": -0.4},
                  {"node_id": 4, "leaf_value": 1.2}
              ]
          },
          {
              "root_id": 1,
              "nodes": [
                  {
                      "node_id": 1,
                      "split_feature_id": 0,
                      "default_left": false,
                      "split_type": "numerical",
                      "comparison_op": "<",
                      "threshold": 2.5,
                      "left_child": 2,
                      "right_child": 4
                  },
                  {
                      "node_id": 4,
                      "split_feature_id": 2,
                      "default_left": true,
                      "split_type": "numerical",
                      "comparison_op": "<",
                      "threshold": -1.2,
                      "left_child": 6,
                      "right_child": 8
                  },
                  {"node_id": 2, "leaf_value": 1.6},
                  {"node_id": 6, "leaf_value": 0.1},
                  {"node_id": 8, "leaf_value": -0.3}
              ]
          }
      ]
  }
  """
  model = treelite.Model.import_from_json(json_str)

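
Writing the JSON by hand is error-prone for larger models. One possible alternative, sketched here under the assumption that the importer accepts any valid JSON text with the fields shown above, is to assemble the model as Python dictionaries and serialize it with the standard ``json`` module. The helper names ``leaf`` and ``numerical_split`` are hypothetical and exist only for this illustration; the example rebuilds the second toy tree.

```python
import json


def leaf(node_id, value):
    """Build a leaf node object (hypothetical helper, not part of Treelite)."""
    return {"node_id": node_id, "leaf_value": value}


def numerical_split(node_id, feature, threshold, left, right, default_left):
    """Build an internal node with a '<' test (hypothetical helper)."""
    return {
        "node_id": node_id,
        "split_feature_id": feature,
        "default_left": default_left,
        "split_type": "numerical",
        "comparison_op": "<",
        "threshold": threshold,
        "left_child": left,
        "right_child": right,
    }


model_dict = {
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": False,
    "task_param": {
        "output_type": "float",
        "grove_per_class": False,
        "num_class": 1,
        "leaf_vector_size": 1,
    },
    "model_param": {"pred_transform": "identity", "global_bias": 0.0},
    "trees": [
        {
            "root_id": 1,
            "nodes": [
                numerical_split(1, 0, 2.5, 2, 4, default_left=False),
                numerical_split(4, 2, -1.2, 6, 8, default_left=True),
                leaf(2, 1.6),
                leaf(6, 0.1),
                leaf(8, -0.3),
            ],
        }
    ],
}

# json.dumps turns Python's True/False into JSON's true/false.
json_str = json.dumps(model_dict, indent=2)
```

The resulting ``json_str`` can then be passed to :py:meth:`~treelite.Model.import_from_json` in the same way as the hand-written string.
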
Building model components using JSON
------------------------------------

Model metadata
^^^^^^^^^^^^^^
At the top level of the JSON object, we must specify certain metadata of the model.

* ``num_feature``: Number of features (columns) in the training data
* ``average_tree_output``: Whether to average the outputs of trees. Set this to
  ``true`` if the model is a random forest.
* ``task_type`` / ``task_param``: :ref:`Parameters that together define a
  machine learning task <task_param>`.
* ``model_param``: :ref:`Other important parameters in the model <model_param>`.

.. _task_param:

Task Parameters: Define a machine learning task
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``task_type`` parameter is closely related to the content of ``task_param``.
The ``task_param`` object has the following parameters:

* ``output_type``: Type of leaf output. Either ``float`` or ``int``.
* ``grove_per_class``: Boolean indicating a particular organization of multi-class
  classifier.
* ``num_class``: Number of target classes in a multi-class classifier. Set this
  to 1 if the model is a binary classifier or a non-classifier.
* ``leaf_vector_size``: Length of leaf output. A value of 1 indicates scalar output.

The docstring of :cpp:enum:`TaskType` explains the relationship between
``task_type`` and the parameters in ``task_param``:

.. doxygenenum:: TaskType
  :project: treelite
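
The constraints between ``task_type`` and ``task_param`` can be checked before handing the JSON to the importer. The sketch below is a hypothetical helper, not part of Treelite. It encodes only the two sets of constraints quoted in the ``TaskType`` docstring in ``include/treelite/tree.h`` as shown in this commit (``kMultiClfProbDistLeaf`` and ``kMultiClfCategLeaf``) and deliberately rejects any other task type rather than guessing its rules.

```python
def check_task_param(task_type, task_param):
    """Return a list of constraint violations (empty if none were found).

    Hypothetical helper for illustration; covers only the task types whose
    constraints are spelled out in the TaskType docstring quoted in this commit.
    """
    problems = []
    num_class = task_param["num_class"]
    if task_type == "kMultiClfProbDistLeaf":
        # output_type=float, grove_per_class=false, num_class>1,
        # leaf_vector_size=num_class
        expected_output_type = "float"
        expected_leaf_vector_size = num_class
    elif task_type == "kMultiClfCategLeaf":
        # output_type=int, grove_per_class=false, num_class>1,
        # leaf_vector_size=1
        expected_output_type = "int"
        expected_leaf_vector_size = 1
    else:
        raise ValueError(f"No constraints encoded in this sketch for {task_type!r}")

    if task_param["output_type"] != expected_output_type:
        problems.append(f"output_type must be {expected_output_type}")
    if task_param["grove_per_class"]:
        problems.append("grove_per_class must be false")
    if num_class <= 1:
        problems.append("num_class must be greater than 1")
    if task_param["leaf_vector_size"] != expected_leaf_vector_size:
        problems.append(f"leaf_vector_size must be {expected_leaf_vector_size}")
    return problems
```

An empty return value means the parameters satisfy the quoted constraints; it does not guarantee that the importer will accept the model.
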

.. _model_param:

Other Model Parameters
~~~~~~~~~~~~~~~~~~~~~~
The ``model_param`` field contains the parameters described in :doc:`../knobs/model_param`.
You may safely omit a parameter as long as it has a default value.

Tree nodes
^^^^^^^^^^
Each tree object must have a ``root_id`` field to indicate which node is the root node.
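
Because node IDs are arbitrary integers that only need to be unique within a tree, a quick structural check can catch mistakes before import. The following is a minimal sketch, assuming a tree has already been parsed into a Python dictionary with the ``root_id`` and ``nodes`` fields used in this tutorial. The function ``check_tree`` is hypothetical and does not reproduce whatever validation Treelite performs internally.

```python
def check_tree(tree):
    """Return a list of structural problems found in one tree dictionary."""
    errors = []
    ids = [node["node_id"] for node in tree["nodes"]]
    known = set(ids)

    # Keys must be unique within the same tree.
    if len(ids) != len(known):
        errors.append("duplicate node_id within tree")

    # root_id must name one of the nodes.
    if tree["root_id"] not in known:
        errors.append(f"root_id {tree['root_id']} does not match any node")

    # Every child reference of an internal node must name an existing node.
    for node in tree["nodes"]:
        if "leaf_value" in node:
            continue
        for key in ("left_child", "right_child"):
            if node.get(key) not in known:
                errors.append(f"node {node['node_id']}: {key} is missing or unknown")
    return errors
```
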

The ``nodes`` array must contain node objects. Each node object must have a ``node_id`` field.
It will also have other fields, depending on the type of the node. A typical leaf node
looks like this:

.. code-block:: json

  {"node_id": 2, "leaf_value": 0.6}

To output a leaf vector, use a list instead.

.. code-block:: json

  {"node_id": 2, "leaf_value": [0.6, 0.4]}

A typical internal node with numerical test:

.. code-block:: json

  {
      "node_id": 1,
      "split_feature_id": 2,
      "default_left": false,
      "split_type": "numerical",
      "comparison_op": "<",
      "threshold": -3.0,
      "left_child": 3,
      "right_child": 4
  }

A typical internal node with categorical test:

.. code-block:: json

  {
      "node_id": 0,
      "split_feature_id": 1,
      "default_left": true,
      "split_type": "categorical",
      "categories_list": [1, 2, 4],
      "categories_list_right_child": false,
      "left_child": 1,
      "right_child": 2
  }

For the categorical test, the test criterion is in the form of

.. code-block:: none

  [Feature value] \in [categories_list]

where the ``categories_list`` defines a (mathematical) set.
When the test criterion evaluates to true, the prediction function
traverses to the left child node (if ``categories_list_right_child=false``)
or to the right child node (if ``categories_list_right_child=true``).
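
To make the traversal rules concrete, here is a pure-Python sketch that walks the first toy tree by hand. It is only an illustration of the semantics described in this tutorial, not Treelite's prediction code. It assumes that a missing value is given as ``None`` or NaN, that a numerical test that evaluates to true leads to the left child (as the "Yes" edges in the diagram suggest), and that ``"<"`` is the only comparison operator needed. The names ``predict_tree`` and ``TOY_TREE_1`` are hypothetical.

```python
import math
import operator

# Only the operator used in this tutorial is listed (assumption for the sketch).
COMPARISON_OPS = {"<": operator.lt}

TOY_TREE_1 = {
    "root_id": 0,
    "nodes": [
        {"node_id": 0, "split_feature_id": 1, "default_left": True,
         "split_type": "categorical", "categories_list": [1, 2, 4],
         "categories_list_right_child": False,
         "left_child": 1, "right_child": 2},
        {"node_id": 1, "split_feature_id": 2, "default_left": False,
         "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
         "left_child": 3, "right_child": 4},
        {"node_id": 2, "leaf_value": 0.6},
        {"node_id": 3, "leaf_value": -0.4},
        {"node_id": 4, "leaf_value": 1.2},
    ],
}


def predict_tree(tree, row):
    """Follow one row from the root to a leaf and return the leaf output."""
    nodes = {node["node_id"]: node for node in tree["nodes"]}
    node = nodes[tree["root_id"]]
    while "leaf_value" not in node:
        value = row[node["split_feature_id"]]
        if value is None or math.isnan(value):
            # Missing value: follow the default direction.
            go_left = node["default_left"]
        elif node["split_type"] == "numerical":
            compare = COMPARISON_OPS[node["comparison_op"]]
            go_left = compare(value, node["threshold"])
        else:
            # Categorical: a match goes left unless categories_list_right_child
            # is true, in which case a match goes right.
            matched = value in node["categories_list"]
            go_left = matched != node["categories_list_right_child"]
        node = nodes[node["left_child"] if go_left else node["right_child"]]
    return node["leaf_value"]
```

For instance, a row whose feature 1 equals 2 and whose feature 2 equals -5.0 passes both tests and ends at node 3, while a row with every feature missing follows the default directions to node 4.
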
24 changes: 11 additions & 13 deletions include/treelite/tree.h
@@ -102,7 +102,7 @@ class ContiguousArray {
/*!
* \brief Enum type representing the task type.
*
-* The task type places constraints on the parameters of TaskParameter. See the docstring for each
+* The task type places constraints on the parameters of TaskParam. See the docstring for each
* enum constants for more details.
*/
enum class TaskType : uint8_t {
@@ -142,8 +142,8 @@ enum class TaskType : uint8_t {
* are combined via summing or averaging, depending on the value of the [average_tree_output]
* field. In effect, each tree is casting a set of weighted (fractional) votes for the classes.
*
-* An example of kMultiClfProbDistLeaf task type is found in RandomForestClassifier of
-* scikit-learn.
+* Examples of kMultiClfProbDistLeaf task type are found in RandomForestClassifier of
+* scikit-learn and RandomForestClassifier of cuML.
*
* The kMultiClfProbDistLeaf task type implies the following constraints on the task parameters:
* output_type=float, grove_per_class=false, num_class>1, leaf_vector_size=num_class.
@@ -160,8 +160,6 @@ enum class TaskType : uint8_t {
* Models of type kMultiClfCategLeaf can be converted into the kMultiClfProbDistLeaf type, by
* converting the output of every leaf node into the equivalent one-hot-encoded vector.
*
-* An example of kMultiClfCategLeaf task type is found in RandomForestClassifier of cuML.
-*
* The kMultiClfCategLeaf task type implies the following constraints on the task parameters:
* output_type=int, grove_per_class=false, num_class>1, leaf_vector_size=1.
*/
@@ -170,22 +168,22 @@ enum class TaskType : uint8_t {

inline std::string TaskTypeToString(TaskType type) {
switch (type) {
-case TaskType::kBinaryClfRegr: return "BinaryClfRegr";
-case TaskType::kMultiClfGrovePerClass: return "MultiClfGrovePerClass";
-case TaskType::kMultiClfProbDistLeaf: return "MultiClfProbDistLeaf";
-case TaskType::kMultiClfCategLeaf: return "MultiClfCategLeaf";
+case TaskType::kBinaryClfRegr: return "kBinaryClfRegr";
+case TaskType::kMultiClfGrovePerClass: return "kMultiClfGrovePerClass";
+case TaskType::kMultiClfProbDistLeaf: return "kMultiClfProbDistLeaf";
+case TaskType::kMultiClfCategLeaf: return "kMultiClfCategLeaf";
default: return "";
}
}

inline TaskType StringToTaskType(const std::string& str) {
-if (str == "BinaryClfRegr") {
+if (str == "kBinaryClfRegr") {
return TaskType::kBinaryClfRegr;
-} else if (str == "MultiClfGrovePerClass") {
+} else if (str == "kMultiClfGrovePerClass") {
return TaskType::kMultiClfGrovePerClass;
-} else if (str == "MultiClfProbDistLeaf") {
+} else if (str == "kMultiClfProbDistLeaf") {
return TaskType::kMultiClfProbDistLeaf;
-} else if (str == "MultiClfCategLeaf") {
+} else if (str == "kMultiClfCategLeaf") {
return TaskType::kMultiClfCategLeaf;
} else {
TREELITE_LOG(FATAL) << "Unknown task type: " << str;