In this post, a classifier is constructed that determines which cultivar a specific wine sample belongs to. Each sample consists of a vector of 13 attributes of the wine, that is, $x \in \mathbb{R}^{13}$. The attributes are as follows:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Based on these attributes, the goal is to identify which of three cultivars each sample originated from. The data set is available at the UCI Machine Learning Repository. Three sample rows from the data set are shown below.
    1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
    1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
    1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
    ...
The first column denotes the target class. This data can be read into a matrix using the loadtxt function from numpy.
    import numpy as np
    import matplotlib.pyplot as plt
    # KFold lives in sklearn.model_selection; the older
    # sklearn.cross_validation module was removed in scikit-learn 0.20
    from sklearn.model_selection import KFold
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from matplotlib import colors

    # %% Read data from csv file
    A = np.loadtxt('wine.data', delimiter=',')
    # Get the targets (first column of file)
    y = A[:, 0]
    # Remove targets from input data
    A = A[:, 1:]
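As a quick sanity check of the parsing step, the sketch below runs the same `loadtxt` call on the three sample rows shown above, read from an in-memory string rather than from `wine.data`. The names `sample_rows`, `S`, `labels`, and `features` are illustrative and not part of the original script.

```python
import io
import numpy as np

# The three sample rows shown earlier, held in memory instead of a file
sample_rows = """1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185"""

S = np.loadtxt(io.StringIO(sample_rows), delimiter=',')
labels = S[:, 0]      # first column: cultivar (target class)
features = S[:, 1:]   # remaining 13 columns: attributes

print(S.shape)         # (3, 14)
print(features.shape)  # (3, 13)
print(labels)          # [1. 1. 1.]
```

Each row carries 14 values: the class label followed by the 13 attributes, which is why the slice `[:, 1:]` leaves a 13-column input matrix.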
Linear Discriminant Analysis
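The purpose of linear discriminant analysis (LDA) is to estimate the probability that a sample belongs to a specific class given the sample itself. That is, to estimate $P(Y = k \mid X = x)$ for each $k \in \mathcal{K}$, where $\mathcal{K}$ is the set of class identifiers, $\mathcal{X}$ is the domain, and $x \in \mathcal{X}$ is the specific sample. Applying Bayes' theorem results in:
$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{\sum_{j \in \mathcal{K}} P(X = x \mid Y = j)\, P(Y = j)}$$
$P(Y = k)$ can be estimated by the frequency of class $k$ in the training data. LDA assumes that each class can be modeled as a multivariate Gaussian distribution, with all classes sharing a common covariance matrix $\Sigma$. That is, the class-conditional density is:
$$P(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$
where $\mu_k$ is the mean vector for class $k$, $\Sigma$ is the covariance matrix shared by all classes, and $p$ is the number of attributes ($p = 13$ here). The shared covariance matrix and mean vectors are estimated from the training data.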
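To make the posterior computation concrete, here is a minimal NumPy sketch of the Bayes' theorem step under the shared-covariance Gaussian assumption. It runs on synthetic two-dimensional data rather than the wine set, and the names `posterior`, `priors`, `means`, and `Sigma` are illustrative choices, not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the wine set): 3 classes in 2 dimensions,
# all drawn with the same covariance, as LDA assumes
true_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in true_means])
y = np.repeat([1, 2, 3], 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])        # class frequencies
means = np.array([X[y == k].mean(axis=0) for k in classes])  # class mean vectors

# Pooled (shared) covariance estimate from the class-centered data
resid = X - means[np.searchsorted(classes, y)]
Sigma = resid.T @ resid / (len(X) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def posterior(x):
    """Return the posterior probability of every class for sample x."""
    d = x - means                                      # one row per class
    maha = np.einsum('ij,jk,ik->i', d, Sigma_inv, d)   # squared Mahalanobis distances
    # The Gaussian normalizing constant is identical for every class
    # (shared Sigma), so it cancels when the posterior is normalized
    unnorm = priors * np.exp(-0.5 * maha)
    return unnorm / unnorm.sum()

p = posterior(np.array([3.5, 0.5]))
print(p, classes[np.argmax(p)])
```

Because the covariance matrix is shared, only the Mahalanobis distance to each class mean and the class prior influence the result, which is what makes the decision boundaries linear.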
LDA Dimensionality Reduction
The centroids of the classes lie within an affine subspace of dimension at most $K - 1$, where $K$ is the number of classes. The input data can therefore be transformed into a lower-dimensional space that is optimal in terms of LDA classification. An optimal subspace is defined as one in which the between-class variance is maximized relative to the within-class variance. That is, the overlap between the classes is minimized with respect to the variance of the class centroids and the shared covariance matrix. This can be represented by maximizing the Rayleigh quotient:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
where
$$S_W = \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T$$
is the within-class scatter matrix,
$$S_B = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T$$
is the between-class scatter matrix, $\mu_k$ is the mean vector of class $k$, $N_k$ is the number of samples belonging to class $k$, and $\mu$ is the mean vector of all input vectors. Maximizing $J$ leads to the generalized eigenvalue problem $S_B w = \lambda S_W w$: the maximum is the largest eigenvalue of the matrix $S_W^{-1} S_B$, and the corresponding eigenvector is the solution vector $w$. This computation, along with the dimension reduction, can easily be performed using scikit-learn as follows:
    lda = LinearDiscriminantAnalysis(n_components=2)
    lda.fit(A, y)
    drA = lda.transform(A)
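For intuition, the scatter matrices and the eigenvalue problem described above can also be computed directly with NumPy. The following is a sketch on synthetic data (three classes, four attributes), not a reproduction of what scikit-learn does internally; the names `S_W`, `S_B`, `rayleigh`, `W`, and `Z` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the wine set): 3 classes, 4 attributes
centers = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 3, 1, 0]], dtype=float)
X = np.vstack([rng.normal(c, 1.0, size=(40, 4)) for c in centers])
y = np.repeat([1, 2, 3], 40)

mu = X.mean(axis=0)          # mean vector of all input vectors
p = X.shape[1]
S_W = np.zeros((p, p))       # within-class scatter matrix
S_B = np.zeros((p, p))       # between-class scatter matrix
for k in np.unique(y):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (Xk - mu_k).T @ (Xk - mu_k)
    S_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)

# Eigen-decomposition of S_W^{-1} S_B; solve() avoids forming the inverse
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]          # largest eigenvalue first
evals, evecs = evals.real[order], evecs.real[:, order]

def rayleigh(w):
    """Ratio of between-class to within-class scatter along direction w."""
    return (w @ S_B @ w) / (w @ S_W @ w)

W = evecs[:, :2]   # keep the two leading directions (3 classes - 1)
Z = X @ W          # data projected onto the discriminant directions

print(evals)               # only the first two are non-negligible
print(rayleigh(W[:, 0]))   # matches the largest eigenvalue
```

With three classes the between-class scatter matrix has rank at most two, so at most two eigenvalues are meaningfully different from zero, which is why two components suffice.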
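As there are $K = 3$ classes in this example, the data is transformed from $\mathbb{R}^{13}$ to $\mathbb{R}^{2}$ by preserving the 2 components corresponding to the 2 largest eigenvalues of $S_W^{-1} S_B$, where $S_W$ and $S_B$ are the within-class and between-class scatter matrices. A plot of the transformed data is shown next, with classes denoted by different colors.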
Figure 1: Transformed Data Plot
A transformed data point can be classified by identifying the class centroid to which it is closest in the transformed space, up to an adjustment for the class prior probabilities. The centroids of the input data are shown below (as large black points) along with the transformed data plotted on the Voronoi diagram induced by the centroids. The Voronoi cells correspond to the regions that LDA classifies as belonging to the respective centroid's class.
Figure 2: Transformed Data Plot with Centroids and Voronoi Cells
As can be seen, there is clear separation between the three classes of wine in this case and so the classifier is expected to perform very well.
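The nearest-centroid rule behind the Voronoi picture can be sketched in a few lines of NumPy. The example below uses synthetic two-dimensional points as a stand-in for the transformed data (it does not use the actual `drA` array), and `nearest_centroid` is a hypothetical helper written for this illustration. With equal class sizes, as here, the class priors play no role in the decision.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the 2-D transformed data (NOT the actual drA)
centers = np.array([[-4.0, 1.0], [0.0, -3.0], [4.0, 2.0]])
Z = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y = np.repeat([1, 2, 3], 50)

classes = np.unique(y)
centroids = np.array([Z[y == k].mean(axis=0) for k in classes])

def nearest_centroid(points):
    """Assign each point the class of its closest centroid (its Voronoi cell)."""
    # Pairwise squared Euclidean distances, shape (n_points, n_classes)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

pred = nearest_centroid(Z)
print('Training accuracy:', np.mean(pred == y))
```

The boundary between two Voronoi cells is the perpendicular bisector of the segment joining the two centroids, so every decision boundary is a straight line in the transformed space.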
Cross-validation is used to test the performance of the classifier. The input data set $D$ is split into a training set $D_{trn}$ and a test set $D_{tst}$ such that $D_{trn} \cup D_{tst} = D$ and $D_{trn} \cap D_{tst} = \emptyset$. A larger percentage of the data is allocated for training. This process is repeated $n$ times (here $n = 3$), and the classifier is trained and scored each time. This can be accomplished in Python using scikit-learn as follows:
    # %% Data extracted; perform LDA
    from sklearn.model_selection import KFold

    lda = LinearDiscriminantAnalysis()
    # KFold takes the number of splits; the indices come from split()
    k_fold = KFold(n_splits=3, shuffle=True)
    print('LDA Results: ')
    for trn, tst in k_fold.split(A):
        lda.fit(A[trn], y[trn])
        # Mean classification accuracy on the held-out fold
        outVal = lda.score(A[tst], y[tst])
        print('Score: ' + str(outVal))
Results from the three runs are as follows:
- Run 1: 1.0
- Run 2: 0.983050847458
- Run 3: 0.966101694915
The classifier predicts the correct cultivar for a wine sample with high accuracy, between roughly 96.6% and 100% across the three folds, owing to the well-separated structure of the classes.
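The splitting step of cross-validation can be illustrated without scikit-learn. The sketch below is a simplified stand-in for a shuffled k-fold splitter, not scikit-learn's implementation; `k_fold_indices` and its `seed` parameter are hypothetical names introduced here.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index arrays for k shuffled, disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# 178 is the number of samples in the wine data set
for trn, tst in k_fold_indices(178, 3):
    print(len(trn), len(tst))   # 118 60, then 119 59 twice
```

These fold sizes are consistent with the scores reported above: 0.983050847458 equals 58/59 and 0.966101694915 equals 57/59, that is, one and two misclassified samples in a test fold of 59.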
Hastie, Trevor, et al. “The elements of statistical learning: data mining, inference and prediction.” The Mathematical Intelligencer 27.2 (2005): 83-85.