In this post, a linear regression classifier is constructed to offer a medical diagnosis based on breast cytology. The classifier receives a vector whose 9 components correspond to the following measurements:

- Clump Thickness: 1 – 10
- Uniformity of Cell Size: 1 – 10
- Uniformity of Cell Shape: 1 – 10
- Marginal Adhesion: 1 – 10
- Single Epithelial Cell Size: 1 – 10
- Bare Nuclei: 1 – 10
- Bland Chromatin: 1 – 10
- Normal Nucleoli: 1 – 10
- Mitoses: 1 – 10

Given a vector of measurements, the classifier determines whether the cells are benign or malignant. The data used in this post is courtesy of the UCI Machine Learning Repository.

### Data Setup

The data set contains 699 entries in total. The 16 entries containing missing attributes (denoted by a '?') were discarded, leaving 683 entries. The column containing the patient ID was also discarded.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# %% Read data from csv file
bcFile = open('brstcncr.csv', 'r')
# This will contain the attribute data
bcDat = []
# This will contain the target data
targ = []
for line in bcFile:
    # Skip any line with a missing attribute
    if '?' in line:
        continue
    intDat = [int(x) for x in line.split(',')]
    # Set target value; 2 is benign, 4 is malignant
    if intDat[-1] == 2:
        targ.append(0)
    else:
        targ.append(1)
    # Discard target
    intDat.pop()
    # Discard ID
    intDat.pop(0)
    bcDat.append(intDat)
# Done; close the file
bcFile.close()
```

### Dimensionality Reduction with PCA

To better visualize the data, the input vectors are projected onto a lower dimensional subspace using principal component analysis (PCA). Let $A$ be the $683 \times 9$ matrix whose rows are the input vectors. First, compute the mean vector $\bar{\mathbf{a}}$ of all the input vectors. Consider the mean-centered data matrix

$$B = A - \mathbf{1}\bar{\mathbf{a}}^T.$$

The covariance matrix of the input data is found as follows:

$$C = \frac{1}{n-1} B^T B,$$

where $n$ is the number of records (683). The principal components are the eigenvectors of this matrix $C$, sorted in descending order by their corresponding eigenvalues $\lambda_i$. By preserving the first $k$ principal components, the fraction of the variability in the original data that is preserved in the transformation is:

$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{m}\lambda_i},$$

where $m$ is the total number of eigenvalues of the matrix $C$. In this example, the first principal component explains 69.1% of the variance while the second explains 7.2%. Only the first two components are retained to achieve a dimensionality reduction; in this reduction, 76.3% of the variance of the original data is explained. These two components are treated as basis vectors and are used to transform the original data from $\mathbb{R}^9$ to $\mathbb{R}^2$. All of this can easily be performed using Python and Anaconda as follows:

```python
# %% Data extracted; perform PCA
A = np.array(bcDat)
y = np.array(targ)
pca = PCA(n_components=2)
pca.fit(A)
# Display the amount of variance explained by the first 2 components
print(str(pca.explained_variance_ratio_))
drA = pca.transform(A)
```
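As a sanity check, the explained-variance ratios reported by scikit-learn can be reproduced by eigendecomposing the covariance matrix directly. The sketch below uses a random stand-in matrix rather than the actual data, since it is only meant to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # stand-in for the 683x9 data matrix
B = X - X.mean(axis=0)                 # mean-centered data matrix
C = (B.T @ B) / (X.shape[0] - 1)       # sample covariance matrix
evals, evecs = np.linalg.eigh(C)       # eigh since C is symmetric
order = np.argsort(evals)[::-1]        # sort eigenvalues in descending order
evals, evecs = evals[order], evecs[:, order]

k = 2
ratio = evals[:k].sum() / evals.sum()  # fraction of variance retained
Xk = B @ evecs[:, :k]                  # project onto the first k components
```

Since the trace of $C$ equals the sum of its eigenvalues, the denominator of the ratio can also be read off the diagonal of the covariance matrix.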

### Regression and Classification with Least Squares

Next, a least squares regression is computed on the reduced data. The purpose of the regression is to find a linear function that predicts whether an input is benign (0) or malignant (1). The function is defined as follows:

$$f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2.$$

The classifier determines which class an input belongs to as follows:

$$C(\mathbf{x}) = \begin{cases} 0, & f(\mathbf{x}) < 0.5 \\ 1, & f(\mathbf{x}) \geq 0.5, \end{cases}$$

where 0 and 1 correspond to benign and malignant, respectively. It is convenient to prepend a 1 to each input vector so that the function can simply be represented as $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$. Let $\tilde{A}$ be the reduced data matrix with an additional column of ones concatenated to the left. The predicted values are then computed as $\hat{\mathbf{y}} = \tilde{A}\mathbf{w}$. In this example, the transformed 2D vectors become 3D vectors with the addition of the component containing 1.
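The augmentation and thresholding steps can be sketched with numpy on a small made-up matrix (the array values and the weight vector here are illustrative, not the fitted ones):

```python
import numpy as np

# Made-up 2-D projections standing in for the PCA-reduced data
drA = np.array([[ 1.5, -0.3],
                [ 0.2,  0.8],
                [-1.1,  0.4]])
# Prepend a column of ones so that f(x) = w^T x includes the intercept
At = np.hstack([np.ones((drA.shape[0], 1)), drA])
# Hypothetical weight vector [w0, w1, w2]
w = np.array([0.3, -0.1, 0.2])
yHat = At @ w                        # predicted value for every row
classes = (yHat >= 0.5).astype(int)  # threshold at 0.5: 0 benign, 1 malignant
```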

Using the least squares criterion, it can be shown that the best fit $\hat{\mathbf{y}}$ is given by multiplying $\mathbf{y}$ by the orthogonal projection matrix onto the column space of $\tilde{A}$. Thus,

$$\hat{\mathbf{y}} = \tilde{A}(\tilde{A}^T\tilde{A})^{-1}\tilde{A}^T\mathbf{y}.$$

Since $\hat{\mathbf{y}} = \tilde{A}\mathbf{w}$ by definition, it follows that

$$\mathbf{w} = (\tilde{A}^T\tilde{A})^{-1}\tilde{A}^T\mathbf{y}.$$
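This closed-form solution can be checked numerically. The following sketch uses synthetic stand-in data rather than the actual reduced data set, and compares solving the normal equations directly against numpy's least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                     # stand-in for the reduced data
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # made-up 0/1 targets
At = np.hstack([np.ones((50, 1)), X])            # augment with a column of ones

# Normal equations: w = (A^T A)^{-1} A^T y
w = np.linalg.solve(At.T @ At, At.T @ y)
# np.linalg.lstsq solves the same least squares problem more stably
w2, *_ = np.linalg.lstsq(At, y, rcond=None)
```

In practice `lstsq` (or scikit-learn, as below) is preferred over forming $\tilde{A}^T\tilde{A}$ explicitly, since the normal equations square the condition number of the problem.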

In this example, the components of $\mathbf{w}$ are the intercept and coefficients reported by the code below.

The following Python code computes the vector $\mathbf{w}$ and evaluates the classification accuracy.

```python
# %% PCA done; find the least squares fit
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(drA, y)
print('Intercept: ' + str(lr.intercept_))
print('Coef: ' + str(lr.coef_))
yHat = lr.predict(drA)

# %% Compute classification error
corCl = 0
clsz = []  # indices of misclassified points
for i in range(len(yHat)):
    if (yHat[i] >= 0.5 and y[i] == 1) or (yHat[i] < 0.5 and y[i] == 0):
        corCl = corCl + 1
    else:
        clsz.append(i)
print('R^2: ' + str(lr.score(drA, y)))
print('Correct: ' + str(corCl * 100.0 / len(yHat)) + '%')
```

### Results

A plot of the result is as follows:

**Figure 1: Transformed Data Shown with Least Squares Line**

In Figure 1, the malignant class is colored red while the benign class is colored blue. Points labeled with an ‘x’ are classified incorrectly. The final results are as follows:

- $R^2$: 0.8366
- Percentage Classified Correctly: 95.61%

Source code for the plot is as follows:

```python
# %% Plot the results
colDic = {0: 'blue', 1: 'red'}
colLst = np.array([colDic[i] for i in targ])
plt.figure()
plt.scatter(drA[:, 0], drA[:, 1], color=colLst, marker='.')
# Mark the misclassified points with an 'x'
plt.scatter(drA[clsz, 0], drA[clsz, 1], color=colLst[clsz], marker='x')
# Plot the least squares decision boundary f(x) = 0.5
a = lr.coef_
f = lambda x: (0.5 - lr.intercept_ - x * a[0]) / a[1]
xDat = np.linspace(-10, 5, 4)
yDat = f(xDat)
plt.plot(xDat, yDat)
plt.show()
```

### References

- Wolberg, William H., and Olvi L. Mangasarian. "Multisurface method of pattern separation for medical diagnosis applied to breast cytology." Proceedings of the National Academy of Sciences 87.23 (1990): 9193–9196.
- Hastie, Trevor, et al. “The elements of statistical learning: data mining, inference and prediction.” The Mathematical Intelligencer 27.2 (2005): 83-85.