Add tutorial for JSON import (#459)

* Treat some model parameters as optional
* Rename matching_categories -> categories_list
* Allow users to use arbitrary node_id
* Test with vector leaf
* String rep of task_type should have 'k' prefix
* cuML RF now uses vector leaf output
* Add tutorial for JSON import
* Add link to tutorial
* Fix typo
* more typo
* Fix formatting
* Add notes about lack of round-trip ability
* Fix typo
* Fix typo
dmlc · Mar 25, 2023 · 0c6b0c5 · 0c6b0c5
1 parent 9d8ca2c
commit 0c6b0c5
Showing 9 changed files with 450 additions and 58 deletions.
diff --git a/docs/tutorials/builder.rst b/docs/tutorials/builder.rst
@@ -1,13 +1,11 @@
 Specifying models using model builder
 =====================================
 
-Since the scope of Treelite is limited to **prediction** only, one must use
-other machine learning packages to **train** decision tree ensemble models. In
-this document, we will show how to import an ensemble model that had been
-trained elsewhere.
-
-**Using XGBoost or LightGBM for training?** Read :doc:`this document <import>`
-instead.
+Treelite supports loading models from major tree libraries, such as XGBoost and
+scikit-learn. However, you may want to use models trained by other tree
+libraries that are not directly supported by Treelite. The model builder is
+useful in this use case. (Alternatively, consider
+:doc:`importing from JSON <json_import>` instead.)
 
 .. contents:: Contents
   :local:
@@ -24,7 +22,7 @@ tree ensembles programmatically. Each tree ensemble is represented as follows:
 * Each :py:class:`~treelite.ModelBuilder` object is a **list** of
   :py:class:`~treelite.ModelBuilder.Tree` objects.
 
-A toy example
+Toy example
 -------------
 Consider the following tree ensemble, consisting of two regression trees:
 
@@ -85,7 +83,7 @@ Consider the following tree ensemble, consisting of two regression trees:
 
 .. note:: Provision for missing data: default directions
 
-  Decision trees in treelite accomodate `missing data
+  Decision trees in Treelite accommodate `missing data
   <https://en.wikipedia.org/wiki/Missing_data>`_ by indicating the
   **default direction** for every test node. In the diagram above, the
   default direction is indicated by label "Missing." For instance, the root node

diff --git a/docs/tutorials/index.rst b/docs/tutorials/index.rst
@@ -14,3 +14,4 @@ This page lists tutorials about Treelite.
  deploy
  deploy_java
  builder
+ json_import
diff --git a/docs/tutorials/json_import.rst b/docs/tutorials/json_import.rst
@@ -0,0 +1,271 @@
+Specifying models using JSON string
+===================================
+
+Treelite supports loading models from major tree libraries, such as XGBoost and
+scikit-learn. However, you may want to use models trained by other tree
+libraries that are not directly supported by Treelite. The JSON importer is
+useful in this use case. (Alternatively, consider
+:doc:`using the model builder <builder>` instead.)
+
+.. contents:: Contents
+  :local:
+
+Toy Example
+-----------
+
+Consider the following tree ensemble, consisting of two regression trees:
+
+.. plot::
+  :nofigs:
+
+  from graphviz import Source
+  source = r"""
+    digraph toy1 {
+      graph [fontname = "helvetica"];
+      node [fontname = "helvetica"];
+      edge [fontname = "helvetica"];
+      0 [label=<<FONT COLOR="red">0:</FONT> Feature 1 ∈ {1, 2, 4} ?>, shape=box];
+      1 [label=<<FONT COLOR="red">1:</FONT> Feature 2 &lt; -3.0 ?>, shape=box];
+      2 [label=<<FONT COLOR="red">2:</FONT> +0.6>];
+      3 [label=<<FONT COLOR="red">3:</FONT> -0.4>];
+      4 [label=<<FONT COLOR="red">4:</FONT> +1.2>];
+      0 -> 1 [labeldistance=2.0, labelangle=45, headlabel="Yes/Missing           "];
+      0 -> 2 [labeldistance=2.0, labelangle=-45, headlabel="No"];
+      1 -> 3 [labeldistance=2.0, labelangle=45, headlabel="Yes"];
+      1 -> 4 [labeldistance=2.0, labelangle=-45, headlabel="           No/Missing"];
+    }
+  """
+  Source(source, format='png').render('../_static/json_import_toy1', view=False)
+  Source(source, format='svg').render('../_static/json_import_toy1', view=False)
+
+.. plot::
+  :nofigs:
+
+  from graphviz import Source
+  source = r"""
+    digraph toy2 {
+      graph [fontname = "helvetica"];
+      node [fontname = "helvetica"];
+      edge [fontname = "helvetica"];
+      0 [label=<<FONT COLOR="red">1:</FONT> Feature 0 &lt; 2.5 ?>, shape=box];
+      1 [label=<<FONT COLOR="red">2:</FONT> +1.6>];
+      2 [label=<<FONT COLOR="red">4:</FONT> Feature 2 &lt; -1.2 ?>, shape=box];
+      3 [label=<<FONT COLOR="red">6:</FONT> +0.1>];
+      4 [label=<<FONT COLOR="red">8:</FONT> -0.3>];
+      0 -> 1 [labeldistance=2.0, labelangle=45, headlabel="Yes"];
+      0 -> 2 [labeldistance=2.0, labelangle=-45, headlabel="           No/Missing"];
+      2 -> 3 [labeldistance=2.0, labelangle=45, headlabel="Yes/Missing           "];
+      2 -> 4 [labeldistance=2.0, labelangle=-45, headlabel="No"];
+    }
+  """
+  Source(source, format='png').render('../_static/json_import_toy2', view=False)
+  Source(source, format='svg').render('../_static/json_import_toy2', view=False)
+
+.. raw:: html
+
+  <p>
+  <img src="../_static/json_import_toy1.svg"
+       onerror="this.src='../_static/json_import_toy1.png'; this.onerror=null;">
+  <img src="../_static/json_import_toy2.svg"
+       onerror="this.src='../_static/json_import_toy2.png'; this.onerror=null;">
+  </p>
+
+.. role:: red
+
+where each node is assigned a **unique integer key**, indicated in :red:`red`.
+Note that integer keys need to be unique only within the same tree.
+
+You can construct this tree ensemble by calling
+:py:meth:`~treelite.Model.import_from_json` with an appropriately formatted
+JSON string. We will give you the example code first; in the following section,
+we will explain the meaning of each field in the JSON string.
+
+.. note:: :py:meth:`~treelite.Model.dump_as_json` will NOT preserve the JSON string that's passed into :py:meth:`~treelite.Model.import_from_json`
+
+  The operation performed in :py:meth:`~treelite.Model.import_from_json` is strictly one-way.
+  So the output of :py:meth:`~treelite.Model.dump_as_json` will differ from the JSON string
+  you used in calling :py:meth:`~treelite.Model.import_from_json`.
+
+.. code-block:: python
+  :linenos:
+  :emphasize-lines: 78
+
+  import treelite
+
+  json_str = """
+  {
+      "num_feature": 3,
+      "task_type": "kBinaryClfRegr",
+      "average_tree_output": false,
+      "task_param": {
+          "output_type": "float",
+          "grove_per_class": false,
+          "num_class": 1,
+          "leaf_vector_size": 1
+      },
+      "model_param": {
+          "pred_transform": "identity",
+          "global_bias": 0.0
+      },
+      "trees": [
+          {
+              "root_id": 0,
+              "nodes": [
+                  {
+                      "node_id": 0,
+                      "split_feature_id": 1,
+                      "default_left": true,
+                      "split_type": "categorical",
+                      "categories_list": [1, 2, 4],
+                      "categories_list_right_child": false,
+                      "left_child": 1,
+                      "right_child": 2
+                  },
+                  {
+                      "node_id": 1,
+                      "split_feature_id": 2,
+                      "default_left": false,
+                      "split_type": "numerical",
+                      "comparison_op": "<",
+                      "threshold": -3.0,
+                      "left_child": 3,
+                      "right_child": 4
+                  },
+                  {"node_id": 2, "leaf_value": 0.6},
+                  {"node_id": 3, "leaf_value": -0.4},
+                  {"node_id": 4, "leaf_value": 1.2}
+              ]
+          },
+          {
+              "root_id": 1,
+              "nodes": [
+                  {
+                      "node_id": 1,
+                      "split_feature_id": 0,
+                      "default_left": false,
+                      "split_type": "numerical",
+                      "comparison_op": "<",
+                      "threshold": 2.5,
+                      "left_child": 2,
+                      "right_child": 4
+                  },
+                  {
+                      "node_id": 4,
+                      "split_feature_id": 2,
+                      "default_left": true,
+                      "split_type": "numerical",
+                      "comparison_op": "<",
+                      "threshold": -1.2,
+                      "left_child": 6,
+                      "right_child": 8
+                  },
+                  {"node_id": 2, "leaf_value": 1.6},
+                  {"node_id": 6, "leaf_value": 0.1},
+                  {"node_id": 8, "leaf_value": -0.3}
+              ]
+          }
+      ]
+  }
+  """
+  model = treelite.Model.import_from_json(json_str)
+
+
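The JSON string does not have to be written by hand. As a minimal sketch, the snippet below assembles the second toy tree as Python dictionaries and serializes it with the standard ``json`` module. The helper names ``make_leaf`` and ``make_numerical_split`` are hypothetical and exist only for this illustration; they are not part of Treelite.

```python
import json


def make_leaf(node_id, value):
    """Leaf node: only node_id and leaf_value are needed."""
    return {"node_id": node_id, "leaf_value": value}


def make_numerical_split(node_id, feature, threshold, left, right, default_left):
    """Internal node with a numerical test of the form feature < threshold."""
    return {
        "node_id": node_id,
        "split_feature_id": feature,
        "default_left": default_left,
        "split_type": "numerical",
        "comparison_op": "<",
        "threshold": threshold,
        "left_child": left,
        "right_child": right,
    }


# Second tree of the toy ensemble, using the same node keys as the diagram.
tree2 = {
    "root_id": 1,
    "nodes": [
        make_numerical_split(1, 0, 2.5, 2, 4, default_left=False),
        make_numerical_split(4, 2, -1.2, 6, 8, default_left=True),
        make_leaf(2, 1.6),
        make_leaf(6, 0.1),
        make_leaf(8, -0.3),
    ],
}

model_spec = {
    "num_feature": 3,
    "task_type": "kBinaryClfRegr",
    "average_tree_output": False,
    "task_param": {
        "output_type": "float",
        "grove_per_class": False,
        "num_class": 1,
        "leaf_vector_size": 1,
    },
    "model_param": {"pred_transform": "identity", "global_bias": 0.0},
    "trees": [tree2],
}

# json.dumps turns Python's False/True into JSON's false/true.
json_str = json.dumps(model_spec, indent=4)
```

The resulting ``json_str`` has the same layout as the hand-written string above (with one tree instead of two), so it can be passed to ``import_from_json`` in the same way.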
+Building model components using JSON
+------------------------------------
+
+Model metadata
+^^^^^^^^^^^^^^
+In the beginning, we must specify certain metadata of the model.
+
+* ``num_feature``: Number of features (columns) in the training data
+* ``average_tree_output``: Whether to average the outputs of trees. Set this to
+  ``true`` if the model is a random forest.
+* ``task_type`` / ``task_param``: :ref:`Parameters that together define a
+  machine learning task <task_param>`.
+* ``model_param``: :ref:`Other important parameters in the model <model_param>`.
+
+.. _task_param:
+
+Task Parameters: Define a machine learning task
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The ``task_type`` parameter is closely related to the content of ``task_param``.
+The ``task_param`` object has the following parameters:
+
+* ``output_type``: Type of leaf output. Either ``float`` or ``int``.
+* ``grove_per_class``: Boolean indicating whether a separate subset of trees
+  (a "grove") is designated to compute the prediction for each class of a
+  multi-class classifier.
+* ``num_class``: Number of target classes in a multi-class classifier. Set this
+  to 1 if the model is a binary classifier or a non-classifier.
+* ``leaf_vector_size``: Length of leaf output. A value of 1 indicates scalar output.
+
+The docstring of :cpp:enum:`TaskType` explains the relationship between
+``task_type`` and the parameters in ``task_param``:
+
+.. doxygenenum:: TaskType
+  :project: treelite
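The constraints can be checked before importing. The sketch below encodes only the two task types whose constraints are spelled out in the ``TaskType`` docstring touched by this change (``kMultiClfProbDistLeaf`` and ``kMultiClfCategLeaf``). The function ``check_task_param`` is hypothetical and is not a Treelite API; for any other task type it returns ``None`` rather than guessing.

```python
# Constraints as stated in the TaskType docstring for two of the task types.
# Other task types are deliberately left out of this sketch.
CONSTRAINTS = {
    "kMultiClfProbDistLeaf": lambda p: (
        p["output_type"] == "float"
        and not p["grove_per_class"]
        and p["num_class"] > 1
        and p["leaf_vector_size"] == p["num_class"]
    ),
    "kMultiClfCategLeaf": lambda p: (
        p["output_type"] == "int"
        and not p["grove_per_class"]
        and p["num_class"] > 1
        and p["leaf_vector_size"] == 1
    ),
}


def check_task_param(task_type, task_param):
    """Return True/False for a covered task type, None if it is not covered."""
    rule = CONSTRAINTS.get(task_type)
    if rule is None:
        return None
    return rule(task_param)


# A three-class model whose leaves hold a probability vector of length 3.
prob_dist_param = {
    "output_type": "float",
    "grove_per_class": False,
    "num_class": 3,
    "leaf_vector_size": 3,
}
ok = check_task_param("kMultiClfProbDistLeaf", prob_dist_param)
```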
+
+.. _model_param:
+
+Other Model Parameters
+~~~~~~~~~~~~~~~~~~~~~~
+The ``model_param`` field contains the parameters described in :doc:`../knobs/model_param`.
+You may safely omit a parameter as long as it has a default value.
+
+Tree nodes
+^^^^^^^^^^
+Each tree object must have a ``root_id`` field to indicate which node is the root node.
+
+The ``nodes`` array must contain node objects. Each node object must have a ``node_id`` field.
+It will also have other fields, depending on the type of the node. A typical leaf node
+looks like this:
+
+.. code-block:: json
+
+  {"node_id": 2, "leaf_value": 0.6}
+
+To output a leaf vector, use a list instead.
+
+.. code-block:: json
+
+  {"node_id": 2, "leaf_value": [0.6, 0.4]}
+
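Because node keys are arbitrary integers, a quick structural check can catch mistakes before the string is imported. The sketch below is one possible pre-flight check and does not describe what Treelite itself validates. ``check_tree`` is a hypothetical helper, and the rule that every child key must refer to a node in the same tree is an assumption made for this illustration.

```python
def check_tree(tree):
    """Return a list of problems found in one tree object (empty if none)."""
    problems = []
    ids = [node["node_id"] for node in tree["nodes"]]
    # Keys must be unique within a tree (they may repeat across trees).
    if len(ids) != len(set(ids)):
        problems.append("duplicate node_id")
    known = set(ids)
    if tree["root_id"] not in known:
        problems.append("root_id does not match any node")
    for node in tree["nodes"]:
        if "leaf_value" in node:
            continue  # leaf nodes have no children to check
        for key in ("left_child", "right_child"):
            if node.get(key) not in known:
                problems.append(f"node {node['node_id']}: {key} is missing or unknown")
    return problems


good_tree = {
    "root_id": 1,
    "nodes": [
        {"node_id": 1, "split_feature_id": 0, "default_left": False,
         "split_type": "numerical", "comparison_op": "<", "threshold": 2.5,
         "left_child": 2, "right_child": 4},
        {"node_id": 2, "leaf_value": 1.6},
        {"node_id": 4, "leaf_value": [0.6, 0.4]},
    ],
}
problems = check_tree(good_tree)
```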
+A typical internal node with numerical test:
+
+.. code-block:: json
+
+  {
+      "node_id": 1,
+      "split_feature_id": 2,
+      "default_left": false,
+      "split_type": "numerical",
+      "comparison_op": "<",
+      "threshold": -3.0,
+      "left_child": 3,
+      "right_child": 4
+  }
+
+A typical internal node with categorical test:
+
+.. code-block:: json
+
+  {
+      "node_id": 0,
+      "split_feature_id": 1,
+      "default_left": true,
+      "split_type": "categorical",
+      "categories_list": [1, 2, 4],
+      "categories_list_right_child": false,
+      "left_child": 1,
+      "right_child": 2
+  }
+
+For the categorical test, the test criterion is in the form of
+
+.. code-block:: none
+
+  [Feature value] \in [categories_list]
+
+where the ``categories_list`` defines a (mathematical) set.
+When the test criterion evaluates to true, the prediction function
+traverses to the left child node (if ``categories_list_right_child=false``)
+or to the right child node (if ``categories_list_right_child=true``).
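To make the traversal rules concrete, here is a pure-Python sketch that walks the first toy tree. It is an illustration of the semantics described in this tutorial, not Treelite's prediction code. The names ``goes_left`` and ``predict_tree`` are hypothetical, only the ``<`` operator is handled, and a missing value is assumed to be represented as ``None`` or NaN.

```python
import math


def goes_left(node, value):
    """Decide the branch at one internal node; value may be missing."""
    if value is None or math.isnan(value):
        # Missing data follows the default direction.
        return node["default_left"]
    if node["split_type"] == "numerical":
        assert node["comparison_op"] == "<"  # only "<" is handled in this sketch
        return value < node["threshold"]
    # Categorical test: is the feature value a member of the set?
    matched = value in set(node["categories_list"])
    # A match goes left unless categories_list_right_child is true.
    return matched != node["categories_list_right_child"]


def predict_tree(tree, row):
    """Follow the tests from the root down to a leaf and return its value."""
    nodes = {n["node_id"]: n for n in tree["nodes"]}
    node = nodes[tree["root_id"]]
    while "leaf_value" not in node:
        value = row[node["split_feature_id"]]
        key = "left_child" if goes_left(node, value) else "right_child"
        node = nodes[node[key]]
    return node["leaf_value"]


# First tree of the toy ensemble.
tree1 = {
    "root_id": 0,
    "nodes": [
        {"node_id": 0, "split_feature_id": 1, "default_left": True,
         "split_type": "categorical", "categories_list": [1, 2, 4],
         "categories_list_right_child": False,
         "left_child": 1, "right_child": 2},
        {"node_id": 1, "split_feature_id": 2, "default_left": False,
         "split_type": "numerical", "comparison_op": "<", "threshold": -3.0,
         "left_child": 3, "right_child": 4},
        {"node_id": 2, "leaf_value": 0.6},
        {"node_id": 3, "leaf_value": -0.4},
        {"node_id": 4, "leaf_value": 1.2},
    ],
}

nan = float("nan")
matched_then_yes = predict_tree(tree1, [0.0, 2.0, -5.0])   # -0.4
not_matched = predict_tree(tree1, [0.0, 3.0, 0.0])         # 0.6
all_missing = predict_tree(tree1, [0.0, nan, nan])         # 1.2
```

The last call follows the "Missing" labels in the diagram: the root sends missing values to the left child, and node 1 sends them to the right child.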
diff --git a/include/treelite/tree.h b/include/treelite/tree.h
@@ -102,7 +102,7 @@ class ContiguousArray {
 /*!
  * \brief Enum type representing the task type.
  *
- * The task type places constraints on the parameters of TaskParameter. See the docstring for each
+ * The task type places constraints on the parameters of TaskParam. See the docstring for each
  * enum constants for more details.
  */
 enum class TaskType : uint8_t {
@@ -142,8 +142,8 @@ enum class TaskType : uint8_t {
    * are combined via summing or averaging, depending on the value of the [average_tree_output]
    * field. In effect, each tree is casting a set of weighted (fractional) votes for the classes.
    *
-   * An example of kMultiClfProbDistLeaf task type is found in RandomForestClassifier of
-   * scikit-learn.
+   * Examples of kMultiClfProbDistLeaf task type are found in RandomForestClassifier of
+   * scikit-learn and RandomForestClassifier of cuML.
    *
    * The kMultiClfProbDistLeaf task type implies the following constraints on the task parameters:
    * output_type=float, grove_per_class=false, num_class>1, leaf_vector_size=num_class.
@@ -160,8 +160,6 @@ enum class TaskType : uint8_t {
    * Models of type kMultiClfCategLeaf can be converted into the kMultiClfProbDistLeaf type, by
    * converting the output of every leaf node into the equivalent one-hot-encoded vector.
    *
-   * An example of kMultiClfCategLeaf task type is found in RandomForestClassifier of cuML.
-   *
    * The kMultiClfCategLeaf task type implies the following constraints on the task parameters:
    * output_type=int, grove_per_class=false, num_class>1, leaf_vector_size=1.
    */
@@ -170,22 +168,22 @@ enum class TaskType : uint8_t {
 
 inline std::string TaskTypeToString(TaskType type) {
   switch (type) {
-    case TaskType::kBinaryClfRegr: return "BinaryClfRegr";
-    case TaskType::kMultiClfGrovePerClass: return "MultiClfGrovePerClass";
-    case TaskType::kMultiClfProbDistLeaf: return "MultiClfProbDistLeaf";
-    case TaskType::kMultiClfCategLeaf: return "MultiClfCategLeaf";
+    case TaskType::kBinaryClfRegr: return "kBinaryClfRegr";
+    case TaskType::kMultiClfGrovePerClass: return "kMultiClfGrovePerClass";
+    case TaskType::kMultiClfProbDistLeaf: return "kMultiClfProbDistLeaf";
+    case TaskType::kMultiClfCategLeaf: return "kMultiClfCategLeaf";
     default: return "";
   }
 }
 
 inline TaskType StringToTaskType(const std::string& str) {
-  if (str == "BinaryClfRegr") {
+  if (str == "kBinaryClfRegr") {
     return TaskType::kBinaryClfRegr;
-  } else if (str == "MultiClfGrovePerClass") {
+  } else if (str == "kMultiClfGrovePerClass") {
     return TaskType::kMultiClfGrovePerClass;
-  } else if (str == "MultiClfProbDistLeaf") {
+  } else if (str == "kMultiClfProbDistLeaf") {
     return TaskType::kMultiClfProbDistLeaf;
-  } else if (str == "MultiClfCategLeaf") {
+  } else if (str == "kMultiClfCategLeaf") {
     return TaskType::kMultiClfCategLeaf;
   } else {
     TREELITE_LOG(FATAL) << "Unknown task type: " << str;