[python-package] highlight the path a sample takes through a tree in `plot_tree` and `create_tree_digraph` (fixes #4784) #5119

jmoralez · 2022-04-02T02:58:27Z

This adds an argument x (or maybe we can call it sample) that takes a single sample and highlights the path that sample takes through a tree in the tree plotting functions. The path is highlighted by drawing the node borders and the arrows between them in bold blue. Here's an example for different tree sizes.

Sample script

import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

X, y = make_regression(1_000, n_features=4, n_informative=2, random_state=0)
ds = lgb.Dataset(X, y)

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 6))
leaves = [7, 15, 31, 63]
for axi, num_leaves in zip(ax.flat, leaves):
    bst = lgb.train({'num_leaves': num_leaves, 'verbose': -1}, ds, num_boost_round=5)
    lgb.plot_tree(bst, x=X[-1], ax=axi)

Closes #4784.
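
For anyone skimming, the idea can be sketched without LightGBM at all. The snippet below walks a hand-written nested dict shaped like one tree from `dump_model()` and records which nodes a single row visits; those are the nodes and edges that get highlighted. The toy tree, the `trace_path` helper and the plain `<=` rule are illustrative assumptions, not the code in this PR (missing values and categorical splits are ignored here).

```python
from typing import Any, Dict, List, Sequence


def trace_path(node: Dict[str, Any], row: Sequence[float]) -> List[str]:
    """Return the names of the nodes that `row` visits, root first."""
    path = []
    while 'leaf_index' not in node:
        path.append(f"split{node['split_index']}")
        go_left = row[node['split_feature']] <= node['threshold']
        node = node['left_child'] if go_left else node['right_child']
    path.append(f"leaf{node['leaf_index']}")
    return path


# Hand-written toy tree (hypothetical values), keyed like a dump_model() tree.
toy_tree = {
    'split_index': 0, 'split_feature': 1, 'threshold': 0.5,
    'left_child': {'leaf_index': 0, 'leaf_value': -1.0},
    'right_child': {
        'split_index': 1, 'split_feature': 0, 'threshold': 2.0,
        'left_child': {'leaf_index': 1, 'leaf_value': 0.3},
        'right_child': {'leaf_index': 2, 'leaf_value': 1.2},
    },
}

print(trace_path(toy_tree, [1.5, 0.9]))
```

Every consecutive pair in the returned list is an edge that would be drawn in blue.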

jameslamb

Thanks for this! I think this is a really powerful feature, and I support adding it by adding new arguments to create_tree_digraph() and plot_tree().

I agree with @StrikerRUS 's suggestion to use color on the arrows and edges, I think it looks great.
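
To make the styling concrete, here is a rough sketch that emits Graphviz DOT text by hand and gives the edges on a path `color=blue` and a larger `penwidth`. The `edges_to_dot` helper, the node names and the pen width of 3 are made up for illustration; the PR itself goes through the `graphviz` Python package rather than building strings.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def edges_to_dot(edges: Iterable[Edge], highlighted: Set[Edge]) -> str:
    """Render edges as DOT, drawing highlighted ones in bold blue."""
    lines: List[str] = ['digraph tree {']
    for parent, child in edges:
        attrs = ' [color=blue, penwidth=3]' if (parent, child) in highlighted else ''
        lines.append(f'  {parent} -> {child}{attrs};')
    lines.append('}')
    return '\n'.join(lines)


all_edges = [('split0', 'leaf0'), ('split0', 'split1'), ('split1', 'leaf1'), ('split1', 'leaf2')]
path_edges = {('split0', 'split1'), ('split1', 'leaf1')}
dot_source = edges_to_dot(all_edges, path_edges)
print(dot_source)
```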

A few other recommendations:

- Instead of x, would you consider calling this argument example_case or example_observation?
  - I think the word sample should be avoided, since in other parts of LightGBM that's used to mean "randomly choose".
- Could you please add some tests on this new behavior? See the existing tests in test_plotting.py for reference, for example:
  - LightGBM/tests/python_package_test/test_plotting.py, line 159 in 7820746: def test_create_tree_digraph(breast_cancer_split):
- What will the behavior of this code be if x has more than one row in it? Would you consider adding some validation that raises an exception with an informative error if x has more than one row?
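
The kind of check I have in mind, sketched with plain nested lists so it runs on its own. `check_single_row` is a hypothetical name, and a real version would look at the shape of a numpy array or DataFrame instead of calling `len()` on a list.

```python
from typing import Sequence


def check_single_row(example_case: Sequence[Sequence[float]]) -> Sequence[float]:
    """Return the only row of a 2-D, row-major input or raise ValueError."""
    n_rows = len(example_case)
    if n_rows != 1:
        raise ValueError(f'example_case must have exactly one row, got {n_rows}.')
    return example_case[0]


print(check_single_row([[0.1, 0.2, 0.3]]))

try:
    check_single_row([[0.1, 0.2], [0.3, 0.4]])
except ValueError as err:
    print(err)
```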

And some questions (some of which might be about these plotting functions generally and outside the scope of this PR, I'm not sure)

- Will this work with categorical features? If I remember correctly, decision_type can include "or" rules like ||.
- Will this work with categorical features stored as pandas categorical types?

I think it might not, since the proposed code does a direct comparison to the values in x, so if the data is a pd.Series it isn't passed through the _data_from_pandas() logic:
LightGBM/python-package/lightgbm/basic.py

Line 785 in 7820746

data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]

LightGBM/python-package/lightgbm/basic.py

Lines 549 to 554 in 7820746

    
        for col, category in zip(cat_cols, pandas_categorical):
            if list(data[col].cat.categories) != list(category):
                data[col] = data[col].cat.set_categories(category)
    if len(cat_cols):  # cat_cols is list
        data = data.copy()  # not alter origin DataFrame
        data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
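
To illustrate the concern: the tree stores integer category codes, while a raw pandas row still holds the labels. Below is a pure-Python stand-in for the `cat.codes` plus `replace({-1: np.nan})` step above. The `to_code` helper and the category names are made up for the example.

```python
import math
from typing import List


def to_code(value: str, categories: List[str]) -> float:
    """Map a label to its position in the training categories, NaN if unseen."""
    try:
        return float(categories.index(value))
    except ValueError:
        return math.nan


train_categories = ['B', 'C', 'D', 'E']  # hypothetical categories seen in training
print(to_code('D', train_categories))
print(to_code('Z', train_categories))
```

Comparing the raw label 'D' against a threshold such as '0||2' can never match, which is why the row would need to go through the same encoding before any comparison.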

jameslamb

Thanks for adding additional tests and categorical support! Really nice work!

I left one small suggestion. I'd still like to test this locally a little bit more, will do that tomorrow. But overall I'm really excited about this 🤩

python-package/lightgbm/plotting.py

jameslamb

VERY nice work @jmoralez ! I tested tonight in more depth and found everything worked really well. These plots are awesome 🤩 .

I was looking for an example real-world dataset that had informative categorical features, and found that you can filter the UCI Machine Learning Repository by "contains categorical features": https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=cat&area=&numAtt=&numIns=&type=&sort=nameUp&view=table.

Found that this one worked well for my investigation: https://archive.ics.uci.edu/ml/machine-learning-databases/solar-flare/.

That dataset has ONLY categorical features, so it's useful for testing categorical-specific stuff. I think I'll turn to it in the future when experimenting with lightgbm and other ML libraries.

Example code

import lightgbm as lgb
import pandas as pd
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/solar-flare/flare.data2"
df = pd.read_csv(
    filepath_or_buffer=data_url,
    sep=" ",
    header=1,
    skip_blank_lines=True,
    names=[
        "class_code",
        "largest_spot_size_code",
        "spot_distribution_code",
        "activity",
        "evolution",
        "prev_flare_24h",
        "historically_complex",
        "became_complex_on_this_pass",
        "area",
        "largest_spot_area",
        "c_class_flare_count",
        "m_class_flare_count",
        "x_class_flare_count"
    ]
)

for col in df.columns:
    if pd.api.types.is_object_dtype(df[col]):
        df[col] = df[col].astype("category")

y = (df[["c_class_flare_count"]] > 0).astype("int").values.ravel()
feature_names = [c for c in df.columns if not c.endswith("flare_count")]
X = df[feature_names]

dtrain = lgb.Dataset(
    X,
    y,
    feature_name=feature_names,
    categorical_feature="auto",
    params={
        "min_data_in_bin": 5,
        "min_data_per_group": 1
    }
)
bst = lgb.train(
    params={
        'num_leaves': 7,
        'objective': 'binary',
        'min_data_in_leaf': 1,
        'verbose': 1
    },
    train_set=dtrain,
    num_boost_round=3
)

example_case = X[:1]
pd.set_option('display.width', 1000)
print(example_case)
print("---")
print(X["class_code"].cat.categories)
print("---")
lgb.create_tree_digraph(bst, example_case=example_case, tree_index=1)

Tried with a few records, and can see that the records go the correct way in the split, including the handling of || for categoricals!

In a future PR, could you please add an example using this new function to https://github.com/microsoft/LightGBM/blob/1cc9f9dcee15586aefdd8271a1c04ad010d3c53a/examples/python-guide/plot_example.py?

jmoralez · 2022-06-06T15:56:43Z

> In a future PR, could you please add an example using this new function

Sure.

Thanks for the great suggestions on your review, as always!

StrikerRUS

Looks great!
But I'm afraid that this PR doesn't cover all possible scenarios of params and values in example_case (see inline comment below).

python-package/lightgbm/plotting.py

StrikerRUS · 2022-06-12T21:47:14Z

python-package/lightgbm/plotting.py

+                if root['decision_type'] == '==':
+                    thresholds = {int(x) for x in root['threshold'].split('||')}
+                    if example_case[split_feature] in thresholds:
+                        direction = 'left'
+                    else:
+                        direction = 'right'
+                else:
+                    direction = 'left' if example_case[split_feature] <= root['threshold'] else 'right'


Does this code cover all combinations of use_missing, zero_as_missing params and NaN values in example_case?
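
A quick illustration of why I'm asking: in Python any ordering comparison with NaN is False, so with the plain `<=` above a missing value always ends up on the right, whatever `default_left` says. `naive_direction` below is just the last line of the diff wrapped in a function for demonstration.

```python
import math


def naive_direction(fval: float, threshold: float) -> str:
    """Same rule as the plain `<=` branch: no special handling of NaN."""
    return 'left' if fval <= threshold else 'right'


print(naive_direction(0.2, 0.5))
print(naive_direction(math.nan, 0.5))
```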

LightGBM/src/io/tree.cpp

Lines 520 to 560 in 11110c5

std::string Tree::NumericalDecisionIfElse(int node) const {
  std::stringstream str_buf;
  Common::C_stringstream(str_buf);
  str_buf << std::setprecision(std::numeric_limits<double>::digits10 + 2);
  uint8_t missing_type = GetMissingType(decision_type_[node]);
  bool default_left = GetDecisionType(decision_type_[node], kDefaultLeftMask);
  if (missing_type == MissingType::None
      || (missing_type == MissingType::Zero && default_left && kZeroThreshold < threshold_[node])) {
    str_buf << "if (fval <= " << threshold_[node] << ") {";
  } else if (missing_type == MissingType::Zero) {
    if (default_left) {
      str_buf << "if (fval <= " << threshold_[node] << " || Tree::IsZero(fval)" << " || std::isnan(fval)) {";
    } else {
      str_buf << "if (fval <= " << threshold_[node] << " && !Tree::IsZero(fval)" << " && !std::isnan(fval)) {";
    }
  } else {
    if (default_left) {
      str_buf << "if (fval <= " << threshold_[node] << " || std::isnan(fval)) {";
    } else {
      str_buf << "if (fval <= " << threshold_[node] << " && !std::isnan(fval)) {";
    }
  }
  return str_buf.str();
}

std::string Tree::CategoricalDecisionIfElse(int node) const {
  uint8_t missing_type = GetMissingType(decision_type_[node]);
  std::stringstream str_buf;
  Common::C_stringstream(str_buf);
  if (missing_type == MissingType::NaN) {
    str_buf << "if (std::isnan(fval)) { int_fval = -1; } else { int_fval = static_cast<int>(fval); }";
  } else {
    str_buf << "if (std::isnan(fval)) { int_fval = 0; } else { int_fval = static_cast<int>(fval); }";
  }
  int cat_idx = static_cast<int>(threshold_[node]);
  str_buf << "if (int_fval >= 0 && int_fval < 32 * (";
  str_buf << cat_boundaries_[cat_idx + 1] - cat_boundaries_[cat_idx];
  str_buf << ") && (((cat_threshold[" << cat_boundaries_[cat_idx];
  str_buf << " + int_fval / 32] >> (int_fval & 31)) & 1))) {";
  return str_buf.str();
}

Refer to #2921.

I added the decisions for numerical splits in 4c19af5 and for categorical in 9f65d63

Co-authored-by: Nikita Titov <[email protected]>

python-package/lightgbm/plotting.py

StrikerRUS

Awesome contribution!
Just some minor comments below:

python-package/lightgbm/basic.py

python-package/lightgbm/plotting.py

StrikerRUS · 2022-06-25T20:07:55Z

python-package/lightgbm/plotting.py

+def _determine_direction_for_categorical_split(fval: float, thresholds: str, missing_type: str) -> str:
+    if missing_type == 'None':
+        int_fval = -1 if math.isnan(fval) else int(fval)
+    else:
+        int_fval = 0 if math.isnan(fval) else int(fval)
+    int_thresholds = {int(t) for t in thresholds.split('||')}
+    return 'left' if int_fval in int_thresholds else 'right'


Looks like NaN values are treated here properly, right?
Refer to #4468.

I updated this in 1e6f95d to do the same as #4468. -1 are replaced by nans in

LightGBM/python-package/lightgbm/basic.py

Line 559 in df14e60

data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})

so we only get nan or a non-negative integer.
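
For reference, a sketch of that rule as I understand it from #4468: NaN (and any negative code) goes right, everything else is looked up in the `||`-separated threshold. The function name and signature here are illustrative, and the NaN-goes-right reading is my assumption rather than a copy of the final code.

```python
import math


def categorical_direction(fval: float, thresholds: str) -> str:
    """Pick a child for a categorical split such as '0||2||5'."""
    # Assumption: missing or negative codes never match a category set.
    if math.isnan(fval) or int(fval) < 0:
        return 'right'
    int_thresholds = {int(t) for t in thresholds.split('||')}
    return 'left' if int(fval) in int_thresholds else 'right'


print(categorical_direction(2.0, '0||2||5'))
print(categorical_direction(3.0, '0||2||5'))
print(categorical_direction(math.nan, '0||2||5'))
```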

Thanks for doing this!

Do we need to update the following code as well (in a separate PR)?

LightGBM/src/io/tree.cpp

Lines 549 to 553 in fb37e50

if (missing_type == MissingType::NaN) {
  str_buf << "if (std::isnan(fval)) { int_fval = -1; } else { int_fval = static_cast<int>(fval); }";
} else {
  str_buf << "if (std::isnan(fval)) { int_fval = 0; } else { int_fval = static_cast<int>(fval); }";
}

python-package/lightgbm/plotting.py

tests/python_package_test/test_plotting.py

StrikerRUS · 2022-07-03T16:21:56Z

python-package/lightgbm/plotting.py


+def _determine_direction_for_numeric_split(fval: float, threshold: float, missing_type: str, default_left: bool) -> str:
+    le_threshold = fval <= threshold
+    if missing_type == _MissingType.NONE or (missing_type == _MissingType.ZERO and default_left and ZERO_THRESHOLD < threshold):


You cannot compare a string with an Enum this way. Such a comparison will always return False.

    missing_type = 'Zero'
    missing_type == _MissingType.ZERO  # False

missing_type should either be converted to Enum first or you should compare in the following way: missing_type == _MissingType.ZERO.value.

Also, this incorrect comparison shows that the added tests are unfortunately unreliable 😢 They should fail, but all CI is green right now.
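
A self-contained demonstration of the pitfall and of both fixes. `MissingType` here is a hypothetical stand-in for the `_MissingType` enum in the diff; the `'NaN'` member value in particular is my assumption.

```python
from enum import Enum


class MissingType(Enum):
    """Hypothetical stand-in for the `_MissingType` enum in the diff."""
    NONE = 'None'
    NAN = 'NaN'
    ZERO = 'Zero'


missing_type = 'Zero'  # a plain string, as in the example above

# A str never equals an Enum member, even when the member's value matches.
print(missing_type == MissingType.ZERO)
# Fix 1: compare against the member's value.
print(missing_type == MissingType.ZERO.value)
# Fix 2: convert the string to the Enum first.
print(MissingType(missing_type) == MissingType.ZERO)
```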

I updated the numerical split definition to the one here:

LightGBM/include/LightGBM/tree.h

Lines 335 to 353 in f94050a

inline int NumericalDecision(double fval, int node) const {
  uint8_t missing_type = GetMissingType(decision_type_[node]);
  if (std::isnan(fval) && missing_type != MissingType::NaN) {
    fval = 0.0f;
  }
  if ((missing_type == MissingType::Zero && IsZero(fval))
      || (missing_type == MissingType::NaN && std::isnan(fval))) {
    if (GetDecisionType(decision_type_[node], kDefaultLeftMask)) {
      return left_child_[node];
    } else {
      return right_child_[node];
    }
  }
  if (fval <= threshold_[node]) {
    return left_child_[node];
  } else {
    return right_child_[node];
  }
}

Now the test fails if I comment out the enum conversion. Let me know what you think; it looks different from the previous one.
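
In Python terms, the logic I ported looks roughly like the sketch below. It uses plain strings for `missing_type` to stay self-contained, and the `ZERO_THRESHOLD` value is an assumption meant to mirror `kZeroThreshold`; it is not a copy of what's in plotting.py.

```python
import math

ZERO_THRESHOLD = 1e-35  # assumed to mirror kZeroThreshold on the C++ side


def numeric_direction(fval: float, threshold: float, missing_type: str, default_left: bool) -> str:
    """Follow the same steps as NumericalDecision, returning 'left' or 'right'."""
    # NaN is only kept as NaN when the split treats NaN as the missing value.
    if math.isnan(fval) and missing_type != 'NaN':
        fval = 0.0
    is_zero = -ZERO_THRESHOLD <= fval <= ZERO_THRESHOLD
    # Missing values follow the default direction stored on the split.
    if (missing_type == 'Zero' and is_zero) or (missing_type == 'NaN' and math.isnan(fval)):
        return 'left' if default_left else 'right'
    return 'left' if fval <= threshold else 'right'


print(numeric_direction(math.nan, 0.5, 'NaN', False))
print(numeric_direction(math.nan, 0.5, 'None', False))
print(numeric_direction(0.0, -1.0, 'Zero', True))
```

The third call is the interesting one: a plain `0.0 <= -1.0` would send the row right, but with zero treated as missing the default direction wins.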

Thanks for doing this!

Do we need to sync the code in if/else dump as well (in a separate PR)?

LightGBM/src/io/tree.cpp

Lines 520 to 560 in 11110c5

Yes, I think that code may need to be updated. I'll take a look.

Kindly ping @jmoralez for possible following-up PR.

guolinke · 2022-07-23T14:35:03Z

It seems the Azure pipeline is broken 😢, but I don't have permission to fix it now. @shiyu1994, can you take a look?

python-package/lightgbm/plotting.py

Co-authored-by: Nikita Titov <[email protected]>

shiyu1994 · 2022-07-25T17:03:21Z

@guolinke I'm working on this and will fix this soon. Sorry for the delay.

StrikerRUS

Great work! Thank you very much!

github-actions · 2023-08-19T03:33:47Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jmoralez added 2 commits April 1, 2022 20:42

highlight path in plot_tree

2a7bd05

Merge branch 'master' into plot-path

99aab42

jmoralez requested review from StrikerRUS, hzy46, jameslamb, shiyu1994 and tongwu-sh as code owners April 2, 2022 02:58

lint

9a2dc46

jmoralez changed the title ~~[python-package] highlight the path a sample takes through a tree in plot_tree and create_tree_digraph~~ [python-package] highlight the path a sample takes through a tree in plot_tree and create_tree_digraph (fixes #4784) Apr 2, 2022

jmoralez added the feature label Apr 2, 2022

jameslamb requested changes Apr 2, 2022

jmoralez added the in progress label Apr 4, 2022

rename x to example_case. support categorical features. add test

c897b08

jmoralez requested a review from jameslamb April 26, 2022 02:48

lint

fa188e1

jmoralez added awaiting review and removed in progress labels May 5, 2022

jameslamb requested changes Jun 4, 2022

check for exactly one row. test empty example_case

6c0c233

jameslamb approved these changes Jun 5, 2022

jameslamb removed request for hzy46 and tongwu-sh June 5, 2022 05:45

jameslamb removed the awaiting review label Jun 6, 2022

StrikerRUS reviewed Jun 12, 2022

jmoralez and others added 3 commits June 15, 2022 15:07

Apply suggestions from code review

b54d1d5

Co-authored-by: Nikita Titov <[email protected]>

handle missing values in numeric splits

4c19af5

Merge branch 'plot-path' of github.com:microsoft/LightGBM into plot-path

9616aaf

jmoralez added the in progress label Jun 16, 2022

StrikerRUS reviewed Jun 19, 2022

jmoralez added 2 commits June 24, 2022 00:53

remove literal. add categorical split function

9f65d63

make categorical feature more important. lint

097fb51

StrikerRUS reviewed Jun 25, 2022

StrikerRUS removed the in progress label Jun 26, 2022

jmoralez added 2 commits June 27, 2022 20:13

add enum. update categorical split. apply suggestions

1e6f95d

Merge branch 'master' into plot-path

0ec657f

StrikerRUS reviewed Jul 3, 2022

jmoralez added 2 commits July 21, 2022 22:08

update numeric split decision

720cbac

lint

0f1fe76

StrikerRUS reviewed Jul 24, 2022

Update python-package/lightgbm/plotting.py

de37d16

Co-authored-by: Nikita Titov <[email protected]>

StrikerRUS approved these changes Jul 30, 2022

jmoralez added 2 commits July 30, 2022 22:36

Merge branch 'master' into plot-path

993af64

Merge branch 'master' into plot-path

ae4f6b8

shiyu1994 merged commit 680f4b0 into master Aug 10, 2022

shiyu1994 deleted the plot-path branch August 10, 2022 08:14

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

jameslamb mentioned this pull request Mar 17, 2023

TypeError: __init__() got an unexpected keyword argument 'example_case' #5791

Closed

jameslamb mentioned this pull request Jun 27, 2023

[docs] add versionadded notes for v4.0.0 features #5948

Merged

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
