Breast Cancer Malignancy Classification using PCA and Least Squares with Scikit-Learn

In this post, a linear regression classifier is constructed for the purpose of offering a medical diagnosis regarding breast cytology. The classifier receives a vector \textbf{v} \in \mathbb{R}^9 whose 9 components correspond to the following measurements:

  1. Clump Thickness: 1 – 10
  2. Uniformity of Cell Size: 1 – 10
  3. Uniformity of Cell Shape: 1 – 10
  4. Marginal Adhesion: 1 – 10
  5. Single Epithelial Cell Size: 1 – 10
  6. Bare Nuclei: 1 – 10
  7. Bland Chromatin: 1 – 10
  8. Normal Nucleoli: 1 – 10
  9. Mitoses: 1 – 10

Given a vector of measurements, the classifier determines if the cells are benign or malignant. The data used in this post is courtesy of UCI’s machine learning repository and is available here.

Data Setup

The data set contains 699 entries total. 16 entries contain missing attributes (denoted by a ‘?’) and were discarded, leaving 683 entries. Further, the column containing patient ID was discarded.

import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# %% Read data from csv file
bcFile = open('brstcncr.csv', 'r')
#This will contain the attribute data
bcDat = []
#This will contain the target data
targ = []
for line in bcFile:
   #Skip any line with a missing attribute
   if '?' in line:
   intDat = [int(x) for x in line.split(',')]
   #Set target vector; 2 is benign 4 is malignant
   if(intDat[-1] == 2):
   #Discard target
   #Discard ID
#Done; close the file

Dimensionality Reduction with PCA

To better visualize the data, the input vectors are projected onto a lower dimensional subspace using principal component analysis (PCA). First, compute the mean vector \overline{\textbf{v}} of all the input vectors. Consider the mean-centered data matrix


The co-variance matrix of the input data is found as follows:


where n is the number of records (683). The principal components are the eigenvectors of this matrix \textbf{C}, sorted in descending order by their corresponding eigenvalues \lambda_{i}. By preserving the first m principal components, the amount of variability in the original data that preserved in the transformation is:


where p is the total number of eigenvalues of the matrix \textbf{C}. In this example, the first principal component explains 69.1% while the second explains 7.2%. Only the first two components are maintained to achieve a dimensionality reduction. In this reduction, 76.3% of the variance of the original data is explained. These two components are treated as basis vectors and are used to transform the original data from \mathbb{R}^{9} to \mathbb{R}^{2}. All this can easily be performed using python and Anaconda as follows:

# %% Data extracted; perform PCA
A = np.array(bcDat)
y = np.array(targ)
pca = PCA(n_components=2)
#Display amount of variance explained by first 2 components
drA = pca.transform(A)

Regression and Classification with Least Squares

Next, a least squares regression is computed based on the reduced data. The purpose of the regression is to find a linear function which can predict whether an input is 0 (benign) or malignant (1). The function is defined as follows:


The classifier determines which class an input belongs to as follows:

g(\textbf{x})=\begin{cases} 0 & f(\textbf{x}) < 0.5 \\ 1 & f(\textbf{x}) \geq 0.5 \end{cases},

where 0 and 1 correspond to benign and malignant respectively. It is convenient to concatenate an additional one to the left of each of this input vector so that the function can simply be represented as f(\textbf{x})= \textbf{x} \cdot \textbf{a}. Let \textbf{X} be the matrix \textbf{V} with an additional column of ones concatenated to the left. The predicted values are computed as \hat{\textbf{y}}=\textbf{X} \textbf{a}. In this example, the transformed 2D vectors become 3D vectors with the addition of the component containing 1.

Using the least squares criterion, it can be shown that the best fit is given by multiplying \mathbf{y} by the orthogonal projection matrix of \textbf{X}. Thus,

\hat{\textbf{y}} = \textbf{X}(\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}\textbf{y}= \textbf{X} \textbf{a}.

Since \hat{\textbf{y}} = \textbf{X} \textbf{a} by definition. Thus,

\textbf{a} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}\textbf{y}.

In this example, \textbf{a} = (0.34992679, -0.06200472, -0.01991134)^{T}.

The following python code computes the \textbf{a} vector and creates the classifier function.

# %%PCA done, find least squares line
lr = LinearRegression(), y)
print('Intercept: ' + str(lr.intercept_))
print('Coef: ' + str(lr.coef_))
yHat = lr.predict(drA)
# %%Compute classification error
corCl = 0
clsz = []
for i in range(len(yHat)):
   if (yHat[i] >= 0.5 and y[i] == 1) or (yHat[i] < 0.5 and y[i] == 0):
      corCl = corCl + 1
print('R^2: ' + str(lr.score(drA, y)))
print('Correct: ' + str(corCl*100.0/len(yHat)) + '%')


A plot of the result is as follows:


Figure 1: Transformed Data Shown with Least Squares Line

In Figure 1, the malignant class is colored red while the benign class is colored blue. Points labeled with an ‘x’ are classified incorrectly. The final results are as follows:

  • R^{2}: 0.8366
  • Percentage Classified Correctly: 95.61%

Source code for the plot is as follows:

# %% Plot the results
colDic = {0:'blue', 1:'red'}
colLst = np.array([colDic[i] for i in targ])
plt.scatter(drA[:,0], drA[:,1], color=colLst, marker='.')
plt.scatter(drA[clsz, 0], drA[clsz, 1], color=colLst[clsz], marker='x')
#Plot least squares line
a = lr.coef_
f = lambda x : (0.5 - lr.intercept_ - x * a[0]) / a[1]
xDat = np.linspace(-10, 5, 4)
yDat = f(xDat)
plt.plot(xDat, yDat)


  1. Wolberg, William H., and Olvi L. Mangasarian. “Multisurface method of pattern separation for medical diagnosis applied to breast cytology.” Proceedings of the national academy of sciences 87.23 (1990): 9193-9196.
  2. Hastie, Trevor, et al. “The elements of statistical learning: data mining, inference and prediction.” The Mathematical Intelligencer 27.2 (2005): 83-85.