Artificial neural networks have regained popularity in machine learning circles with recent advances in deep learning. Deep learning techniques trace their origins back to the concept of back-propagation in multi-layer perceptron (MLP) networks, the topic of this post. The complete code from this post is available on GitHub.
Multi-Layer Perceptron Networks for Regression
A MLP network consists of layers of artificial neurons connected by weighted edges. Neurons are denoted for the -th neuron in the -th layer of the MLP from left to right top to bottom. Inputs are fed into the leftmost layer and propagate through the network along weighted edges until reaching the final, or output, layer. An example of a MLP network can be seen below in Figure 1.
Figure 1: A Multi-Layer Perceptron Network
The input to a neuron , also known as the “net” denoted , is the weighted sum of all incoming edges plus an optional bias. Its output is computed by applying the -th layer’s activation function to . By arranging the weights of each layer into a matrices , the output of the -th layer of a MLP can be computed as:
where is vector of nets of the -th layer, is the weight matrix of the -th layer, is the bias vector of the -th layer , and is the activation function for the -th layer. The activation function is applied component-wise to the resulting vector; and the sigmoid function are commonly used. In a network with layers, the final output given an input can be defined as:
Consider a supervised learning scenario where the -th input vector is associated with the -th target vector with input-target pairs in total. Further, assume the network has layers of sizes: . The L2 error of the network for this input at the -th output neuron is then:
where the is for convenience, is the -th component of the target, and is the -th component of the predicted output. The total network error is simply:
Back-propagation uses gradient descent to minimize the network error by updating the weights and biases of the network. First, consider the case of the output layer of the MLP network. After computing the network error for the -th input sample, an update is found for the entry in the -th row and -th column of the -th weight matrix, that is . By the definition of gradient descent:
where is the learning rate. Now, by the multivariate chain rule,
where is the -th neuron’s net in the -th layer of the network, and is the -th neuron’s output in layer of the network. The final step is true since,
Only considering the -th component of this, as above, results in:
Continuing from earlier with the multivariate chain rule,
Consider the case of the output layer, that is layer . Using the definition of the component-wise error defined earlier, the RHS simplifies as follows:
If the activation function for a given layer is then and thus can be easily computed. Thus, the update rule for output neurons is:
where is the -th neuron’s delta value in the output layer ( layer of the MLP) for the -th input vector.
For hidden layers, can no longer be directed computed as the target values are only provided for the output layer. The delta terms for these layers are obtained through back-propagation. That is, the delta terms for layer are obtained using those for layer as can be seen below in Figure 2.
Figure 2: Hidden Node Update Visualization
One difference in the derivation between hidden nodes and output nodes is that, since the layers in MLP networks are fully-connected, hidden neurons contribute to the error in all later neurons. Thus, the total network error term is used instead of the component-wise error . So, similar to above,
Next, replace the total network error by its earlier definition,
The last step follows since the delta values for layer have already been computed. Thus, in summary, the weight update is computed as
for output layers and
for hidden layers. In order to update bias weights, use the appropriate formula above and replace with 1.
Implementation in Python with NumPy
The above formulas can be conveniently implemented using matrix and component-wise multiplications of matrices.
#bprop.py #Author: Nicholas Smith import numpy as np #Array of layer sizes ls = np.array([2, 4, 4, 1]) n = len(ls) #List of weight matrices (each a numpy array) W =  #Initialize weights to small random values for i in range(n - 1): W.append(np.random.randn(ls[i], ls[i + 1]) * 0.1) #List of bias vectors initialized to small random values B =  for i in range(1, n): B.append(np.random.randn(ls[i]) * 0.1) #List of output vectors O =  for i in range(n): O.append(np.zeros([ls[i]])) #List of Delta vectors D =  for i in range(1, n): D.append(np.zeros(ls[i])) #Input vectors (1 row per each) A = np.matrix([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]) #Target Vectors (1 row per each) y = np.matrix([[-0.5], [0.5], [0.5], [-0.5]]) #Activation function (tanh) for each layer #Linear activation for final layer actF =  dF =  for i in range(n - 1): actF.append(lambda (x) : np.tanh(x)) #Derivative of activation function in terms of itself dF.append(lambda (y) : 1 - np.square(y)) #Linear activation for final layer actF.append(lambda (x): x) dF.append(lambda (x) : np.ones(x.shape)) #Learning rate a = 0.5 #Number of iterations numIter = 250 #Loop for each iteration for c in range(numIter): #loop over all input vectors for i in range(len(A)): print(str(i)) #Target vector t = y[i, :] #Feed-forward step O = A[i, :] for j in range(n - 1): O[j + 1] = actF[j](np.dot(O[j], W[j]) + B[j]) print('Out:' + str(O[-1])) #Compute output node delta values D[-1] = np.multiply((t - O[-1]), dF[-1](O[-1])) #Compute hidden node deltas for j in range(n - 2, 0, -1): D[j - 1] = np.multiply(np.dot(D[j], W[j].T), dF[j](O[j])) #Perform weight and bias updates for j in range(n - 1): W[j] = W[j] + a * np.outer(O[j], D[j]) B[j] = B[j] + a * D[j] print('\nFinal weights:') #Display final weights for i in range(n - 1): print('Layer ' + str(i + 1) + ':\n' + str(W[i]) + '\n')
In the above code, the delta values for each layer are computed simultaneously by using matrix multiplication instead of evaluating each component separately. Similarly, the weight updates are computed simultaneously using an outer product between the delta and output vectors.