Why isn't the sum of "value" equal to the number of "samples" in scikit-learn RandomForestClassifier?


Question: Why isn't the sum of "value" equal to the number of "samples" in scikit-learn RandomForestClassifier?

I built a random forest with RandomForestClassifier and plotted the decision trees. What does the field "value" (pointed to by the red arrows) mean? And why doesn't the sum of the two numbers in the brackets equal the number of "samples"? In other examples I have seen, the sum of the two numbers in the brackets does equal the number of "samples". Why not in my case?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
df.loc[df.Label == 'BENIGN', 'Label'] = 0      # .loc avoids chained-assignment issues
df.loc[df.Label == 'DrDoS_LDAP', 'Label'] = 1
Y = df["Label"].values.astype('int')
X = df.drop(labels=["Label"], axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators=20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)

for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15, 15))
    tree.plot_tree(model.estimators_[i], feature_names=X.columns,  # X.columns, not df.columns (df still has 'Label')
                   class_names=['Benign', 'DDoS'])
    plt.savefig('.\\TheForest\\T' + str(i))

[image: one of the plotted decision trees, with the "value" fields marked by red arrows]

Total Answers: 1


Answer 1:

Nice catch.

Although undocumented, this is due to the bootstrap sampling taking place by default in a Random Forest model (see my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? for more on the RF algorithm details and its difference from a mere "bunch" of decision trees).

Let's see an example with the iris data:

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

rf = RandomForestClassifier(max_depth=3)
rf.fit(iris.data, iris.target)

tree.plot_tree(rf.estimators_[0])  # take the first tree

[image: plot of the first tree in the forest]

The result here is similar to what you report: in every node except the lower-right one, sum(value) does not equal samples, as it would for a "simple" decision tree.
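Rather than reading this off the plot, we can inspect the fitted tree's tree_ attribute directly. As a caveat, the exact semantics of value depend on the scikit-learn version (in recent releases it holds per-node class fractions rather than raw weighted counts), so the sketch below compares the two node sample counters instead; the random_state is fixed here only for reproducibility and is not part of the original snippet:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(max_depth=3, random_state=0)
rf.fit(iris.data, iris.target)

t = rf.estimators_[0].tree_
# n_node_samples counts the *unique* training rows reaching each node;
# weighted_n_node_samples sums the bootstrap weights, so duplicates count.
print(t.n_node_samples[0])           # root: fewer than 150 unique rows
print(t.weighted_n_node_samples[0])  # root: 150.0 -- every bootstrap draw
```

The gap between the two root counters is exactly the number of duplicate draws in the bootstrap sample.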

A cautious observer would have noticed something else that seems odd here: while the iris dataset has 150 samples:

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class

and the base node of the tree should include all of them, the samples for the first node are only 89.
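This figure is consistent with what bootstrap sampling predicts: when drawing n samples with replacement, each row is left out with probability (1 - 1/n)^n, i.e. roughly 1/e, so only about 63% of the rows are expected to appear at least once. A quick back-of-the-envelope check (the count for any particular tree will vary around this expectation):

```python
n = 150
# probability that a given row appears at least once in a bootstrap draw
p_in = 1 - (1 - 1 / n) ** n      # ~0.633, close to 1 - 1/e
expected_unique = n * p_in       # ~95 unique rows expected on average
print(round(expected_unique, 1))
```

So about 95 of the 150 rows are expected in the root node on average; an observed count of 89 is well within normal sampling variation.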

Why is that, and what exactly is going on here? To see, let us fit a second RF model, this time without bootstrap sampling (i.e. with bootstrap=False):

rf2 = RandomForestClassifier(max_depth=3, bootstrap=False)  # no bootstrap sampling
rf2.fit(iris.data, iris.target)

tree.plot_tree(rf2.estimators_[0])  # take again the first tree

[image: plot of the first tree of the bootstrap=False forest]

Well, now that we have disabled bootstrap sampling, everything looks "nice": the sum of value in every node equals samples, and the base node contains indeed the whole dataset (150 samples).
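The same check on the tree_ attribute confirms this programmatically (again with a fixed random_state for reproducibility, which the original snippet does not set): without bootstrap sampling, every tree sees each row exactly once, so the unique and weighted sample counts coincide at every node and the root holds the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf2 = RandomForestClassifier(max_depth=3, bootstrap=False, random_state=0)
rf2.fit(iris.data, iris.target)

t2 = rf2.estimators_[0].tree_
# With bootstrap=False there are no duplicate draws, so the two counters agree.
print(t2.n_node_samples[0])                                     # 150
print((t2.n_node_samples == t2.weighted_n_node_samples).all())  # True
```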

So, the behavior you describe is indeed due to bootstrap sampling. Because sampling is done with replacement, each individual decision tree of the ensemble ends up with duplicate samples; these duplicates are not reflected in the samples field of the tree nodes, which displays the number of unique samples. They are, however, reflected in the node's value, which counts every draw, duplicates included.
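A tiny NumPy simulation, independent of scikit-learn (the index range and seed here are arbitrary), illustrates the distinction: a bootstrap draw of n indices always contains n entries in total, but fewer than n unique ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
boot = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement
idx, counts = np.unique(boot, return_counts=True)

print(counts.sum())  # 150  -- total draws, what the class counts in "value" reflect
print(len(idx))      # <150 -- unique rows, what "samples" reports
```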

The situation is completely analogous with that of a RF regression model, as well as with a Bagging Classifier - see respectively: