Machine Learning is the science of building mathematical models automatically from data (a training dataset), without being explicitly programmed, in order to make predictions on new data. It has a large and growing number of applications, from spam filtering and fraud detection to facial recognition and autonomous vehicles. Machine learning algorithms have existed for a long time, but the computational power now available to run complex calculations over huge datasets makes them far more practical than ever before. Recommendation systems like Netflix’s and Amazon’s, and Google’s self-driving car, are some of the most successful machine learning products.

These advances make it possible to produce valuable, real-time predictions without human intervention, which can lead to better decisions, smarter actions and the discovery of new information.

In this post, we will consider a very simple Machine Learning problem and solve it using scikit-learn in Python, in order to show how these algorithms work and which issues typically come up when facing a problem like this. We will use the Anaconda distribution, which already includes scikit-learn and all the dependencies it needs, such as numpy and scipy. We will use IPython (also included in Anaconda) to run blocks of code and see their results immediately.

Cars Evaluation

We will take a very simple dataset to evaluate cars according to attributes like safety, maintenance cost, buying price, etc. This will be tackled as a classification problem: the model will take a car’s attributes as input and assign it to one of several classes representing how acceptable it is as a purchase. The dataset can be obtained here. This dataset is very convenient for an introductory example since it has no missing values and no irrelevant columns, so it is easy to start learning from it. For this problem, a simple Decision Tree will do the job.

These are the attributes of each car:

  1.     buying:   vhigh, high, med, low.
  2.     maint:    vhigh, high, med, low.
  3.     doors:    2, 3, 4, 5more.
  4.     persons:  2, 4, more.
  5.     lug_boot: small, med, big.
  6.     safety:   low, med, high.

And the output classification labels:

  • unacc, acc, good, vgood

The decision tree will build a model that, given a car, looks at one attribute, follows one of two branches depending on its value, and continues with further attributes until it has enough information to classify the car. The key to Decision Trees is learning from the data the tree structure that best classifies each car instance, by choosing at each node the attribute that gives the “most useful information” about the classification. This is formalized with a metric that scores each attribute at each node; one widely used method computes the information gain of every attribute at the node and picks the attribute with the highest value. For more information about how Decision Trees work, a visual explanation can be found here.
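
To make the idea of “most useful information” concrete, here is a minimal sketch of information gain using only the standard library. The function names and the four-row toy sample are invented for illustration and are not part of scikit-learn or of the car dataset. For simplicity the sketch groups samples by each distinct attribute value, whereas a binary tree like the one described above only tests one two-way split at a time.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by grouping the labels by attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups left after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy sample, invented for illustration (not rows from the real dataset).
safety = ["low", "low", "high", "high"]
doors = ["2", "4", "2", "4"]
labels = ["unacc", "unacc", "acc", "acc"]

gain_safety = information_gain(safety, labels)  # safety separates the classes
gain_doors = information_gain(doors, labels)    # doors says nothing about them
print(gain_safety, gain_doors)
```

In this toy sample, safety removes all the uncertainty about the class (a gain of 1 bit) while doors removes none, so a tree built with this criterion would split on safety first.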

Reading the dataset

First of all, we need to read the dataset (a .csv file) into a matrix in Python (here f is the path to the downloaded file):

In [1]: import numpy as np
In [2]: data = np.loadtxt(fname=f,delimiter=',',dtype='str')

In [3]: data

Out[3]: 

array([['vhigh', 'vhigh', '2', ..., 'small', 'med', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'small', 'high', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'med', 'low', 'unacc'],
       ..., 
       ['low', 'low', '5more', ..., 'big', 'low', 'unacc'],
       ['low', 'low', '5more', ..., 'big', 'med', 'good'],
       ['low', 'low', '5more', ..., 'big', 'high', 'vgood']], 
      dtype='|S5')

Now, we will split the data into attributes (X) and the classification labels (Y):

In [4]: X = data[:,0:-1]

In [5]: Y = data[:,-1]

The attributes are categorical (non-numerical; even doors and persons include values such as 5more and more), which would cause trouble for the learning algorithm, so we will do some preprocessing:

In [6]: from sklearn import preprocessing

In [7]: le = preprocessing.LabelEncoder()

In [8]: for i in range(6):
   ...:     X[:,i] = le.fit_transform(X[:,i])
   ...:
In [9]: X
Out[9]: 
array([['3', '3', '0', '0', '2', '2'],
       ['3', '3', '0', '0', '2', '0'],
       ['3', '3', '0', '0', '1', '1'],
       ..., 
       ['1', '1', '3', '2', '0', '1'],
       ['1', '1', '3', '2', '0', '2'],
       ['1', '1', '3', '2', '0', '0']], 
      dtype='|S5')
In [10]: Y = le.fit_transform(Y)

In [11]: Y
Out[11]: array([2, 2, 2, ..., 2, 1, 3])
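
The codes above are not arbitrary: LabelEncoder numbers the distinct values in sorted order. The following sketch reproduces that behaviour in plain Python so the mapping is easy to inspect. The helper encode_labels is a hypothetical stand-in written for this post, not the library's implementation.

```python
def encode_labels(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = encode_labels(["unacc", "acc", "good", "vgood", "unacc"])
print(classes)  # ['acc', 'good', 'unacc', 'vgood']
print(codes)    # [2, 0, 1, 3, 2]
```

This matches the output above, where unacc became 2, good became 1 and vgood became 3. Note that the numeric order is alphabetical, so it does not reflect any natural ranking such as low < med < high.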

So now the categorical attributes have been translated into numerical classes. This is all we need to train the model. In order to train the Decision Tree and then test its accuracy, we need to split the dataset into a training set and a testing set. We will feed the training set into the training algorithm to build the classification model. Then we will use the trained model to classify the samples in the test set and measure the accuracy of the model. We will take 30% of the samples for testing and the rest as the training set.

In [12]: from sklearn import cross_validation
In [13]: X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X,Y,test_size=0.3)
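
As a rough picture of what this split does, the sketch below shuffles the row indices and holds out a fraction of them, using only the standard library. The function split_train_test and its seed parameter are assumptions made for illustration; the library's own rounding and shuffling details may differ. Also note that in recent scikit-learn releases the cross_validation module no longer exists and train_test_split is imported from sklearn.model_selection instead.

```python
import random


def split_train_test(samples, labels, test_size=0.3, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(test_size * len(samples))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Same return order as in the session above: X_train, X_test, Y_train, Y_test.
    return ([samples[i] for i in train_idx], [samples[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


# Toy data, invented for illustration: 10 samples with a label derived from each.
toy_X = list(range(10))
toy_Y = [x % 2 for x in toy_X]
X_tr, X_te, Y_tr, Y_te = split_train_test(toy_X, toy_Y, test_size=0.3)
print(len(X_tr), len(X_te))
```

The important property is that every sample lands in exactly one of the two sets and each label stays attached to its sample.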

 

Training and Testing

Now we are ready to train and test the Decision Tree. We will set a maximum depth of 5 levels and see how it works:

In [14]: from sklearn import tree

In [15]: dt = tree.DecisionTreeClassifier(max_depth=5)

In [16]: dt.fit(X_train, Y_train)
Out[16]: 
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [17]: dt.score(X_test, Y_test)
Out[17]: 0.83236994219653182

The Decision Tree correctly classified 83.24% of the test samples. This is a good result for a first attempt, but since the dataset contains 6 attributes with multiple possible values, it will probably perform better if we increase the maximum depth of the tree. Let’s try 10:

In [18]: dt = tree.DecisionTreeClassifier(max_depth=10)

In [19]: dt.fit(X_train, Y_train)
Out[19]: 
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [20]: dt.score(X_test, Y_test)
Out[20]: 0.95953757225433522

As you can see, the accuracy of the model increased a lot just by letting the tree grow deeper so it can learn more complex patterns. We can inspect the tree structure by exporting it and rendering it with graphviz. The class names must be listed in the order of the encoded labels, which is alphabetical (acc, good, unacc, vgood):

In [20]: tree.export_graphviz(dt,"tree.dot", class_names = ["acc", "good", "unacc", "vgood"])

And in the terminal:

$ dot -Tpng tree.dot -o tree.png

This generates an image of the model. Each node shows the most probable class for the samples that reach it. When a leaf is reached, the model emits a classification label.

If we render this second model and compare it with the first one, we will see that it is much more complex because we let it reach a depth of 10 (and for that reason it is more accurate, being able to learn more complex patterns).

Although we got good accuracy for these models, that result is not very reliable because it was measured on a single 30% sample of the dataset, and a different 30% of the samples might produce a different result. To address this, let’s repeat the measurement over several random splits. We will shuffle the samples, take 10% for testing, train the model on the other 90% (which will generate a new tree structure) and measure its accuracy. Then we repeat this 20 times. This way we get a more reliable estimate simply by repeating the training/testing cycle on many random splits.
 

In [21]: from sklearn import metrics

In [22]: validator = cross_validation.ShuffleSplit(len(X), n_iter=20, test_size=0.1, random_state=0)

In [22]: score = cross_validation.cross_val_score(dt, X, Y, cv=validator)
In [23]: score
Out[23]: 
array([ 0.97109827,  0.97109827,  0.98265896,  0.96531792,  0.97687861,
        0.98265896,  0.97109827,  0.97109827,  0.97109827,  0.98265896,
        0.94219653,  0.97687861,  0.95375723,  0.97687861,  0.97109827,
        0.97687861,  0.95953757,  0.96531792,  0.95375723,  0.96531792])
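
To make explicit what the splitter is doing, here is a minimal standard-library sketch that produces 20 independent shuffles with 10% of the rows held out each time. The generator shuffle_split is a hypothetical helper written for illustration, not the library's code. In recent scikit-learn releases the equivalent class lives in sklearn.model_selection and takes n_splits rather than a sample count and n_iter.

```python
import random


def shuffle_split(n_samples, n_iter=20, test_size=0.1, seed=0):
    """Yield (train_indices, test_indices) pairs from independent shuffles."""
    rng = random.Random(seed)
    n_test = round(test_size * n_samples)
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


# 50 is an arbitrary sample count chosen for the demonstration.
splits = list(shuffle_split(n_samples=50))
for train_idx, test_idx in splits:
    # A real run would retrain the tree on train_idx and score it on test_idx.
    assert not set(train_idx) & set(test_idx)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Unlike k-fold cross-validation, the test sets of different iterations may overlap, because each iteration reshuffles all the rows from scratch.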


We now have 20 accuracy results, one for each iteration. We could just take the mean of the results, but the mean alone says nothing about how much the results vary. We will borrow an idea from statistics and report the mean plus or minus two standard deviations, an approximate 95% confidence interval, as follows:

In [24]: print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
Accuracy: 0.97 (+/- 0.02)

This means we can say, with roughly 95% confidence, that the accuracy of the classifier is between 95% and 99%. This gives us much more reliable information about the classifier’s accuracy than the first test did.
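
The arithmetic behind that range can be checked without numpy. The sketch below copies the 20 scores from the output above and uses the statistics module; pstdev is used because it computes the population standard deviation, which is what numpy's std returns by default.

```python
import statistics

# The 20 accuracy scores reported by cross_val_score above.
scores = [0.97109827, 0.97109827, 0.98265896, 0.96531792, 0.97687861,
          0.98265896, 0.97109827, 0.97109827, 0.97109827, 0.98265896,
          0.94219653, 0.97687861, 0.95375723, 0.97687861, 0.97109827,
          0.97687861, 0.95953757, 0.96531792, 0.95375723, 0.96531792]

mean = statistics.mean(scores)
spread = 2 * statistics.pstdev(scores)  # two population standard deviations
low, high = mean - spread, mean + spread

print("Accuracy: %0.2f (+/- %0.2f)" % (mean, spread))
print("Range: %0.2f to %0.2f" % (low, high))
```

Treating mean ± 2 standard deviations as a 95% range assumes the scores are roughly normally distributed, which is only an approximation with 20 samples.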

You may have noticed that this model could also be written by hand as a sequence of if-else statements combined with a weighted scoring system, and achieve similar classification performance. However, it would be much harder to write and maintain, difficult to generalize, and practically impossible to develop for a more complex problem, where data patterns and classification criteria are hard to discover manually.

Further References

This was a brief, practical guide to applying one simple classification algorithm. For further details on the mathematics and implementation of this and other machine learning models, check the scikit-learn documentation.