
Lab3 - Decision trees (10pts)

In this lab you will be analyzing datasets using decision trees from the scikit-learn library - please read more about this library here.

At the end of the lab upload your notebook containing parts A) and B) from below.

Grading:

(0pts) Part A, the code sections are correctly markdown-delimited and run in the notebook. [Note: If part of the code does not run due to the mglearn library, simply read the textbook to understand these concepts and analyses.]

(10pts) Part B that includes appropriate code and markdown answers.

Part A) Decision tree example from the ML textbook

[Note: If part of the code does not run due to the mglearn library, simply read the textbook to understand these concepts and analyses.]

Create a folder Lab3, and inside it, create a Jupyter notebook dec_trees.ipynb

Follow the steps described in the MG textbook, section 2.3.5 Decision Trees (pg. 72-85); you may access it via our library -> O’Reilly. Read through the pages and execute the Python commands in your notebook to learn how to explore and classify data. Make sure to use markdown comments to delimit the various stages of the analysis, e.g.

Building decision trees

Controlling complexity of decision trees

etc.

[Note: the section 2.3.6 Ensembles of Decision Trees (pg. 85-91) is optional; it is useful as it illustrates how you can use a forest of trees to predict a class, hence a more powerful classifier can be built.]

Copy and paste the following import statements into the first cell. Most of them are needed for plots pulled from the mglearn package that comes with the textbook.

from IPython.display import set_matplotlib_formats, display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from cycler import cycler

#set_matplotlib_formats('pdf', 'png')
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['image.cmap'] = "viridis"
plt.rcParams['image.interpolation'] = "none"
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['lines.linewidth'] = 2
plt.rcParams['legend.numpoints'] = 1
plt.rc('axes', prop_cycle=(
    cycler('color', mglearn.plot_helpers.cm_cycle.colors) +
    cycler('linestyle', ['-', '-', "--", (0, (3, 3)), (0, (1.5, 1.5))])))

np.set_printoptions(precision=3, suppress=True)

pd.set_option("display.max_columns", 8)
pd.set_option('display.precision', 2)

In a new cell, import the following packages.

import sklearn
import graphviz

If you are missing some of the packages, install them in DataSpell with conda or pip. Graphviz is a stand-alone application for visualizing graphs, so it must be installed outside of DataSpell. Go to https://www.graphviz.org/download/ to install it.

For Mac, type in any Terminal: brew install graphviz

[Note: brew is a package installer for MacOS]

For Windows, use the graphviz-7.1.0 (64-bit) EXE installer [sha256]

Then, in the DataSpell Terminal, run pip install graphviz to install the Python library that accesses the actual Graphviz visualization tool.

For Windows, if you get an error like “ExecutableNotFound: failed to execute WindowsPath(‘dot’), make sure the Graphviz executables are on your systems’ PATH…”, this error might happen because the Graphviz executables sit on a different path from your conda directory when you use pip install graphviz. So try using: conda install python-graphviz

Part B) Create a decision tree to analyze the wine data set

This is a classification problem, and first you must learn more about the data set. Read the data set documentation from the UCI Repository (see below) to answer these questions for yourself (you do not need to include these answers in your notebook):

What is this data set about?

How many classes of wine are there?

What do we want to learn from this data set?

How many examples are available to learn from? How many features are there?

You may find this data set at the UCI Repository. This data set is so popular that scikit-learn even includes it in its Toy datasets. Navigate to the scikit-learn page describing the data set and look it over; you will find helpful information for your next steps.

1) In your notebook, load the wine data set; you might have something like

from sklearn.datasets import load_wine
wine = load_wine()

2) In individual cells, type the following and answer the questions in markdown

    wine
    Q1) What is this displaying?
    Answ:

    wine.data
    Q2) What is this displaying?
    Answ:

    wine.target
    Q3) What is this displaying?
    Answ:

    wine.target.size
    Q4) What is this displaying?
    Answ:

    wine.data[[1]]
    Q5) What is this displaying?
    Answ:

    wine.data[1]
    Q6) What is this displaying?
    Answ:
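Q5 and Q6 hinge on NumPy indexing. As a quick sanity check (shapes only; this is a sketch, not part of the required answers), note what each expression returns:

```python
from sklearn.datasets import load_wine

wine = load_wine()
# wine.data is a 2D NumPy array: one row per example, one column per feature
print(wine.data.shape)       # (178, 13)
# indexing with a list keeps both dimensions: a matrix with a single row
print(wine.data[[1]].shape)  # (1, 13)
# indexing with a plain integer drops a dimension: a 1D feature vector
print(wine.data[1].shape)    # (13,)
```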

3) In a new cell type

X_train, X_test, y_train, y_test = train_test_split(
wine.data, wine.target, stratify=wine.target, random_state=42)

If you get an error, you might need to either write above

from sklearn.model_selection import train_test_split

or

use sklearn.model_selection.train_test_split instead of train_test_split

Hover with your mouse over the function train_test_split and read its description to familiarize yourself with its parameters.
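One parameter worth understanding is stratify: it keeps the class proportions in the training and test sets roughly the same as in the full data set. A minimal sketch (the printed proportions are illustrative, not required output):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=42)

# class proportions in the full data set vs. the stratified training set
print(np.bincount(wine.target) / wine.target.size)
print(np.bincount(y_train) / y_train.size)
```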

4) In new cells type

    X_train
    Q7) What is this displaying?
    Answ:

    y_train
    Q8) What is this displaying?
    Answ:

5) Next, you would like to use a 20% testing and 80% training split. In a new cell, write another train_test_split command similar to the one above to accomplish this splitting.

6) Create a decision tree with the 20-80 splitting from above and name it wine_tree. In several cells,

  • compute accuracy in test and training data
  • compute and plot the most important features in classification
  • plot the wine_tree

[Note: if you have a lot of trouble using graphviz for plotting trees as shown in the textbook, you could try the plot_tree command from sklearn, similar to]

from sklearn import tree
tree.plot_tree(wine_tree)

[Make sure you also write text that explains the numbers you compute, like “Acc. in training: 87%”]
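If you get stuck, the cells for step 6 might look roughly like the sketch below. The 20-80 split and random_state=42 are assumptions carried over from the earlier steps; the score method of DecisionTreeClassifier reports mean accuracy.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
# assumes the 20-80 split from step 5
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2,
    stratify=wine.target, random_state=42)

wine_tree = DecisionTreeClassifier(random_state=42)
wine_tree.fit(X_train, y_train)

# a fully grown tree typically fits the training data perfectly
print("Acc. in training: {:.0%}".format(wine_tree.score(X_train, y_train)))
print("Acc. in test: {:.0%}".format(wine_tree.score(X_test, y_test)))

# one importance value per feature; the values sum to 1
for name, imp in zip(wine.feature_names, wine_tree.feature_importances_):
    print("{:<30s} {:.3f}".format(name, imp))
```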

7) Look at the leaves with [1,0,0] and [2,0,0] examples; we might want to prune these leaves to reduce potential overfitting. Create a new decision tree (call it pruned_tree) that allows only leaves with 5 or more data examples. [Hint: you might want to look at the parameter min_samples_leaf]
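One way to sketch such a pruned tree (the min_samples_leaf value of 5 comes from the hint above; the split is assumed from step 5):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2,
    stratify=wine.target, random_state=42)

# min_samples_leaf=5 forbids any leaf holding fewer than 5 training examples
pruned_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
pruned_tree.fit(X_train, y_train)
print("Number of leaves:", pruned_tree.get_n_leaves())
```

Comparing pruned_tree.get_n_leaves() with the leaf count of wine_tree is one way to reason about Q10.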

8) In several cells,

  • compute accuracy in test and training data of the pruned_tree
  • compute and plot the most important features in classification
  • plot the pruned_tree

Q9) How many if-then rules are in the pruned_tree? Write down one such complete if-then rule. [Side note: you do not have to submit an answer to the following question, but try it out because it is important for your learning. Try creating your pruned tree with min_samples_leaf = 30 instead of min_samples_leaf = 5. What do you notice? Is your tree pruned even more? What happened to the test accuracy in this newly pruned tree vs. the original tree?]

Q10) How many leaves are pruned in the wine_tree to obtain the pruned_tree?

Q11) In which of the two trees is the accuracy in test data higher? Is the difference in testing data accuracy for those two trees large?

Q12) Related to the question above: if I am a domain expert in wine (an oenologist) and I choose the pruned_tree over the wine_tree, am I wrong? If not, why not? Be brief but clear in your justification.