Stock Market Prediction in Python Part 2

This post is part of a series on artificial neural networks (ANN) in TensorFlow and Python.

  1. Stock Market Prediction Using Multi-Layer Perceptrons With TensorFlow
  2. Stock Market Prediction in Python Part 2
  3. Visualizing Neural Network Performance on High-Dimensional Data
  4. Image Classification Using Convolutional Neural Networks in TensorFlow

This post revisits the problem of predicting stock prices from historical stock data using TensorFlow, which was explored in a previous post. In that post, the stock price was predicted solely from the date. First, the date was converted to a numerical value in LibreOffice, then the resulting integer value was read into a matrix using NumPy. As stated in that post, the method was not meant to be indicative of how actual stock prediction is done. This post aims to improve slightly upon the previous model and to explore new features in TensorFlow and Anaconda Python. The corresponding source code is available here.

Feature Extraction

The latest stock data for Yahoo (now Altaba Inc.) can be found at the following link. Rather than using LibreOffice to parse the date strings, the datetime library in Python can be used. The strptime function parses dates according to a format string. The format string in the code below specifies that the dates are of the form yyyy-mm-dd, which is the calendar date format of ISO 8601.
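As a small illustration of the format string, the sketch below parses a single date and converts it to a timestamp and back. The date string is only an example value, and the numeric timestamp depends on the time zone of the machine, so it is not printed here.

```python
from datetime import datetime

# Parse a yyyy-mm-dd date string into a datetime object
parsed = datetime.strptime('2016-02-01', '%Y-%m-%d')

# Seconds since the epoch; the exact value depends on the local time zone
ts = parsed.timestamp()

# Converting the timestamp back recovers the same calendar date
restored = datetime.fromtimestamp(ts).strftime('%Y-%m-%d')
print(parsed.year, parsed.month, parsed.day, restored)
```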

Code to load the spreadsheet and parse the dates follows.

#Used for numpy arrays
import numpy as np
#Used to read data from CSV file
import pandas as pd
#Used to convert date string to numerical value
from datetime import datetime, timedelta
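#Used to plot data
import matplotlib.pyplot as mpl
#Used to scale the data (default scaler of the StockPredictor class below)
from sklearn.preprocessing import StandardScaler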

#Load data from the CSV file. Note: Some systems are unable
#to give timestamps for dates before 1970. This function may
#fail on such systems.
#
#path:      The path to the file
#return:    A data frame with the parsed timestamps
def ParseData(path):
    #Read the csv file into a dataframe
    df = pd.read_csv(path)
    #Get the date strings from the date column
    dateStr = df['Date'].values
    D = np.zeros(dateStr.shape)
    #Convert all date strings to a numeric value
    for i, j in enumerate(dateStr):
        #Date strings are of the form year-month-day
        D[i] = datetime.strptime(j, '%Y-%m-%d').timestamp()
    #Add the newly parsed column to the dataframe
    df['Timestamp'] = D
    #Remove any unused columns (axis = 1 specifies fields are columns)
    return df.drop('Date', axis = 1)

Note: A quick plot of the data reveals what appears to be a typo in the Feb 01, 2016 row, where the “Low” value is listed as 2016.02. The pyplot module of the matplotlib library provides powerful tools for visualizing data sets. Plotting is useful both for getting an overview of a data set and for catching outliers and typos. The erroneous data point can be removed entirely or changed to a reasonable value as desired. The following code plots the stock data, sets the x-axis labels, and adds a legend.

#Given dataframe from ParseData
#plot it to the screen
#
#df:        Dataframe returned from ParseData
#p:         The position of the predicted data points
def PlotData(df, p = None):
    if(p is None):
        p = np.array([])
    #p contains the indices of predicted data; the rest are actual points
    c = np.array([i for i in range(df.shape[0]) if i not in p])
    #Timestamp data
    ts = df.Timestamp.values
    #Number of x tick marks
    nTicks= 10
    #Left most x value
    s = np.min(ts)
    #Right most x value
    e = np.max(ts)
    #Total range of x values
    r = e - s
    #Add some buffer on both sides
    s -= r / 5
    e += r / 5
    #These will be the tick locations on the x axis
    tickMarks = np.arange(s, e, (e - s) / nTicks)
    #Convert timestamps to strings
    strTs = [datetime.fromtimestamp(i).strftime('%m-%d-%y') for i in tickMarks]
    mpl.figure()
    #Plot the high values for each day
    mpl.plot(ts, df.High.values, color = '#7092A8', linewidth = 1.618, label = 'Actual')
    #Predicted data was also provided
    if(len(p) > 0):
        mpl.plot(ts[p], df.High.values[p], color = '#6F6F6F', linewidth = 1.618, label = 'Predicted')
    #Set the tick marks
    mpl.xticks(tickMarks, strTs, rotation='vertical')
    #Add the label in the upper left
    mpl.legend(loc = 'upper left')
    mpl.show()

A plot of the data set produced by the above code is shown below in Figure 1.

Figure 1: Historical Yahoo Inc Stock Data
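A typo like the one noted above can also be caught with a simple consistency check instead of by eye. The following is a minimal sketch, assuming the High and Low column names from the spreadsheet; the rows are made-up values for illustration, and dropping the row is only one of the two options mentioned earlier.

```python
import pandas as pd

# Hypothetical rows in the spreadsheet's column layout (values are made up)
df = pd.DataFrame({
    'Open':  [29.1, 29.3, 29.0],
    'High':  [29.6, 29.8, 29.4],
    'Low':   [28.9, 2016.02, 28.7],
    'Close': [29.4, 29.1, 29.2],
})

# A daily low can never exceed the daily high, so such rows are suspect
suspect = df[df['Low'] > df['High']]
print(suspect)

# One option: drop the suspect rows entirely
cleaned = df.drop(suspect.index)
```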

In the previous post, only the date, converted to a number, was used as input to the regression model. It is doubtful that the date alone provides much useful information about the stock price of a company. To improve the model, more of the information from the spreadsheet is used. A sample is constructed as the current timestamp together with the past n days of the opening value, closing value, high value, low value, adjusted closing value, volume, and previous timestamp. Thus, if data for the past n = 5 days is used, each row of the data matrix contains 1 + 5 * 7 = 36 features. If data for an earlier day is unavailable, the oldest available values are used instead.

The corresponding target values are the stock opening value, closing value, high value, low value, adjusted closing value, and volume fields. The timestamp obviously does not need to be predicted.
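The layout of one sample can be sketched with plain NumPy. This is an illustration rather than the class code that follows: build_sample is a hypothetical helper, the data is a made-up matrix with seven columns whose last column plays the role of the timestamp, and rows are assumed to be ordered newest first as in the spreadsheet.

```python
import numpy as np

def build_sample(data, i, n_past):
    """Build the feature row for row i of a newest-first data matrix."""
    m, n_cols = data.shape
    row = np.zeros(1 + n_past * n_cols)
    # The first feature is the timestamp of the day being predicted
    row[0] = data[i, -1]
    for j in range(n_past):
        # Older data lives in later rows; clamp to the oldest available row
        ind = min(i + j + 1, m - 1)
        row[1 + j * n_cols : 1 + (j + 1) * n_cols] = data[ind]
    return row

# Made-up data: 8 days, 7 columns (6 stock fields plus the timestamp)
m, n_cols, n_past = 8, 7, 5
data = np.arange(m * n_cols, dtype=float).reshape(m, n_cols)

sample = build_sample(data, 0, n_past)
print(sample.shape)
```

For row 6 of this toy matrix only one older row exists, so that row is repeated for all five past days.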

A Stock Predictor Class

A Python class is constructed that takes a regression model providing the scikit-learn interface and the number of days n as arguments. The class uses the Learn function to fit the model to a dataframe returned from the ParseData function. Stock values can then be predicted for a range of dates using the PredictDate function. Source code follows.

#Gives a list of timestamps from the start date to the end date
#
#startDate:     The start date as a string year-month-day
#endDate:       The end date as a string year-month-day
#weekends:      True if weekends should be included; false otherwise
#return:        A numpy array of timestamps
def DateRange(startDate, endDate, weekends = False):
    #The start and end date
    sd = datetime.strptime(startDate, '%Y-%m-%d')
    ed = datetime.strptime(endDate, '%Y-%m-%d')
    #Invalid start and end dates
    if(sd > ed):
        raise ValueError("The start date cannot be later than the end date.")
    #One day
    day = timedelta(1)
    #The final list of timestamp data
    dates = []
    cd = sd
    while(cd <= ed):
        #If weekends are included or it's a weekday append the current ts
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            dates.append(cd.timestamp())
        #Onto the next day
        cd = cd + day
    return np.array(dates)

#Given a date, returns the timestamp of the previous day
#
#startDate:     The start date as a timestamp
#weekends:      True if weekends should be counted; false otherwise
def DatePrevDay(startDate, weekends = False):
    #One day
    day = timedelta(1)
    cd = datetime.fromtimestamp(startDate)
    while(True):
        cd = cd - day
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            return cd.timestamp()
    #Should never happen
    return None

#A class that predicts stock prices based on historical stock data
class StockPredictor:

    #The (scaled) data frame
    D = None
    #Unscaled timestamp data
    DTS = None
    #The data matrix
    A = None
    #Target value matrix
    y = None
    #Corresponding columns for target values
    targCols = None
    #Number of previous days of data to use
    npd = 1
    #The regressor model
    R = None
    #Object to scale input data
    S = None

    #Constructor
    #rmodel:        The regressor model to use (sklearn)
    #nPastDays:     The number of past days to include
    #               in a sample.
    #scaler:        The scaler object used to scale the data (sklearn)
    def __init__(self, rmodel, nPastDays = 1, scaler = StandardScaler()):
        self.npd = nPastDays
        self.R = rmodel
        self.S = scaler

    #Extracts features from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The data matrix of samples
    def _ExtractFeat(self, D):
        #One row per day of stock data
        m = D.shape[0]
        #Current timestamp plus all 7 columns for each of the past n days
        n = self._GetNumFeatures()
        B = np.zeros([m, n])
        #Preserve order of spreadsheet
        for i in range(m - 1, -1, -1):
            self._GetSample(B[i], i, D)
        #Return the internal numpy array
        return B

    #Extracts the target values from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The data matrix of targets and the
    #           corresponding column names
    def _ExtractTarg(self, D):
        #Timestamp column is not predicted
        tmp = D.drop('Timestamp', axis = 1)
        #Return the internal numpy array
        return tmp.values, tmp.columns

    #Get the number of features in the data matrix
    #
    #n:         The number of previous days to include
    #           self.npd is  used if n is None
    #ret:       The number of features in the data matrix
    def _GetNumFeatures(self, n = None):
        if(n is None):
            n = self.npd
        return n * 7 + 1

    #Get the sample for a specific row in the dataframe.
    #A sample consists of the current timestamp and the data from
    #the past n rows of the dataframe
    #
    #r:         The array to fill with data
    #i:         The index of the row for which to build a sample
    #df:        The dataframe to use
    #return:    r
    def _GetSample(self, r, i, df):
        #First value is the timestamp
        r[0] = df['Timestamp'].values[i]
        #The number of columns in df
        n = df.shape[1]
        #The number of rows (one past the last valid index)
        lim = df.shape[0]
        #Each sample contains the past n days of stock data; for non-existing data
        #repeat last available sample
        #Format of row:
        #Timestamp Volume Open[i] High[i] ... Open[i-1] High[i-1]... etc
        for j in range(0, self.npd):
            #Subsequent rows contain older data in the spreadsheet
            ind = i + j + 1
            #If there is no older data, duplicate the oldest available values
            if(ind >= lim):
                ind = lim - 1
            #Add all columns from row[ind]
            for k, c in enumerate(df.columns):
                #+ 1 is needed as timestamp is at index 0
                r[k + 1 + n * j] = df[c].values[ind]
        return r

    #Attempts to learn the stock market data
    #given a dataframe taken from ParseData
    #
    #D:         A dataframe from ParseData
    def Learn(self, D):
        #Keep track of the currently learned data
        self.D = D.copy()
        #Keep track of old timestamps for indexing
        self.DTS = np.copy(D.Timestamp.values)
        #Scale the data
        self.D[self.D.columns] = self.S.fit_transform(self.D)
        #Get features from the data frame
        self.A = self._ExtractFeat(self.D)
        #Get the target values and their corresponding column names
        self.y, self.targCols = self._ExtractTarg(self.D)
        #Create the regressor model and fit it
        self.R.fit(self.A, self.y)

    #Predict the stock price during a specified time
    #
    #startDate:     The start date as a string in yyyy-mm-dd format
    #endDate:       The end date as a string yyyy-mm-dd format
    #return:        A dataframe containing the predictions or
    #               None if there is no data for the previous day
    def PredictDate(self, startDate, endDate):
        #Create the range of timestamps and reverse them
        ts = DateRange(startDate, endDate)[::-1]
        m = ts.shape[0]
        #Prediction is based on data prior to start date
        #Get timestamp of previous day
        prevts = DatePrevDay(ts[-1])
        #Test if there is enough data to continue
        try:
            ind = np.where(self.DTS == prevts)[0][0]
        except IndexError:
            return None
        #There is enough data to perform prediction; allocate new data frame
        P = pd.DataFrame(np.zeros([m, self.D.shape[1]]), index = range(m), columns = self.D.columns)
        #Add in the timestamp column so that it can be scaled properly
        P['Timestamp'] = ts
        #Scale the timestamp (other fields are 0)
        P[P.columns] = self.S.transform(P)
        #B is to be the data matrix of features
        B = np.zeros([1, self._GetNumFeatures()])
        #Add extra last entries for past existing data
        for i in range(self.npd):
            #If the current index does not exist, repeat the last valid data
            curInd = ind + i
            if(curInd >= self.D.shape[0]):
                curInd = self.D.shape[0] - 1
            #Copy over the past data (already scaled)
            P.loc[m + i] = self.D.loc[curInd]
        #Loop until end date is reached
        for i in range(m - 1, -1, -1):
            #Create one sample
            self._GetSample(B[0], i, P)
            #Predict the row of the dataframe and save it
            pred = self.R.predict(B).ravel()
            #Fill in the remaining fields into the respective columns
            for j, k in zip(self.targCols, pred):
                #DataFrame.set_value was removed in pandas 1.0; use .at instead
                P.at[i, j] = k
        #Discard extra rows needed for prediction
        P = P[0:m]
        #Scale the dataframe back to the original range
        P[P.columns] = self.S.inverse_transform(P)
        return P

The basic idea of the above code is as follows: use the data from today and the past n-1 days (n days in total) to predict the stock data for tomorrow. The PredictDate function can then repeat this process indefinitely into the future by basing subsequent predictions on predicted data. It is reasonable to assume that the predictions become increasingly unreliable the further they extend.
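The feedback loop can be shown in isolation on a toy problem. The sketch below is not the StockPredictor code: it fits a small least-squares model on a made-up one-dimensional series and then forecasts several steps ahead, feeding each prediction back in as input. The names fit_linear and recursive_forecast are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with a bias term; stands in for any regressor."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def recursive_forecast(history, w, n_lags, steps):
    """Predict `steps` values, basing later predictions on earlier ones."""
    window = list(history[-n_lags:])
    out = []
    for _ in range(steps):
        # Features are the last n_lags values plus the bias input
        x = np.append(window[-n_lags:], 1.0)
        nxt = float(x @ w)
        out.append(nxt)
        # The prediction becomes part of the input for the next step
        window.append(nxt)
    return np.array(out)

# Made-up series: a straight line, so the forecast is easy to check
series = np.arange(20, dtype=float)
n_lags = 3
X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
y = series[n_lags:]

w = fit_linear(X, y)
forecast = recursive_forecast(series, w, n_lags, steps=5)
print(np.round(forecast, 3))
```

On a noise-free line the forecast simply continues the line. On real stock data each step inherits the error of the previous one, which is why the later predictions deserve less trust.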

Putting it Together

With the above class in place, the ANNR class and others are used to make stock predictions for Yahoo Inc. Data is loaded from the CSV file, a prediction is made for a user-specified range of dates, and the results are plotted. Sample main code is as follows:

#Grab the data frame
D = ParseData('yahoostock.csv')
#The number of previous days of data used
#when making a prediction
numPastDays = 16
#Number of neurons in the input layer
i = numPastDays * 7 + 1
#Number of neurons in the output layer
o = D.shape[1] - 1
#Number of neurons in the hidden layers
h = int((i + o) / 2)
#The list of layer sizes
layers = [('F', h), ('AF', 'tanh'), ('F', h), ('AF', 'tanh'), ('F', o)]
R = ANNR([i], layers, batchSize = 128, learnRate = 1e-4, maxIter = 1000, tol = 0.01, reg = 0.001, verbose = True)
#Create the stock predictor around the regressor
sp = StockPredictor(R, nPastDays = numPastDays)
#Learn the dataset and then display performance statistics
sp.Learn(D)
sp.TestPerformance()
#Perform prediction for a specified date range
P = sp.PredictDate('2016-11-02', '2016-12-31')
#Keep track of number of predicted results for plot
n = P.shape[0]
#Append the actual results to the predicted results
#(DataFrame.append was removed in pandas 2.0; use concat instead)
D = pd.concat([P, D])
#Predicted results are the first n rows
PlotData(D, range(n + 1))

In addition to the ANNR class from a previous post, the StockPredictor class works with any class that provides the basic sklearn interface: fit, predict, and score. A Python program that provides a basic command-line interface to the primary functionality of this class can be found here.
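To make the interface requirement concrete, here is a minimal sketch of a class exposing fit, predict, and score. MeanRegressor is a hypothetical baseline written for this illustration, not part of sklearn or of the original source, and its score pools the R^2 computation over all outputs as a simplifying assumption.

```python
import numpy as np

class MeanRegressor:
    """Hypothetical baseline: always predicts the column means of the targets."""

    def fit(self, X, y):
        # Remember the mean of each target column
        self.mean_ = np.asarray(y, dtype=float).mean(axis=0)
        return self

    def predict(self, X):
        # Repeat the stored means once per input row
        n = np.asarray(X).shape[0]
        return np.tile(self.mean_, (n, 1))

    def score(self, X, y):
        # R^2 pooled over all outputs (an assumption for this sketch)
        y = np.asarray(y, dtype=float)
        resid = ((y - self.predict(X)) ** 2).sum()
        total = ((y - y.mean(axis=0)) ** 2).sum()
        return 1.0 - resid / total

# Made-up data: 4 samples, 3 features, 2 target columns
X = np.zeros((4, 3))
y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

model = MeanRegressor().fit(X, y)
pred = model.predict(X)
print(pred[0], model.score(X, y))
```

An object like this could in principle be passed as the rmodel argument, since the class only calls fit and predict on it in the listing above.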

Results

Next, the program is used to predict the stock data for Yahoo Inc from November 2nd to December 31st, 2016. An instance of the KNeighborsRegressor class from sklearn was passed to the StockPredictor constructor to produce the prediction shown in Figure 2.

Figure 2: Yahoo Inc Stock Data with Prediction from KNN

Finally, the ANNR class from the previous post is used to perform prediction. The results can be seen below in Figure 3.

Figure 3: Yahoo Inc Stock Data with Prediction from ANNR

It appears the artificial neural network does not have much faith in Yahoo Inc.

Next Steps

The StockPredictor class above takes a slightly less naive approach to stock prediction than the one in the previous post. In future posts, I hope to combine sentiment analysis of textual data sets with stock data to build a more reasonable model. I hope to see you then.

