Stock Market Prediction in Python Part 2

This post is part of a series on artificial neural networks (ANN) in TensorFlow and Python.

  1. Stock Market Prediction Using Multi-Layer Perceptrons With TensorFlow
  2. Stock Market Prediction in Python Part 2
  3. Visualizing Neural Network Performance on High-Dimensional Data
  4. Image Classification Using Convolutional Neural Networks in TensorFlow

This post revisits the problem, explored in a previous post, of predicting stock prices from historical stock data using TensorFlow. In the previous post, the stock price was predicted solely from the date. First, the date was converted to a numerical value in LibreOffice; then the resulting integer value was read into a matrix using NumPy. As stated in that post, the method was not meant to be indicative of how actual stock prediction is done. This post aims to improve slightly upon the previous model and to explore new features in TensorFlow and Anaconda Python. The corresponding source code is available here.

Note: See a later post Visualizing Neural Network Performance on High-Dimensional Data for code to help visualize neural network learning and performance.

Feature Extraction

The latest stock data for Yahoo can be found at the following link. Instead of using LibreOffice to parse the date strings, the datetime library in Python can be used. The strptime function parses dates given a format string. The format string in the code below, '%Y-%m-%d', specifies that the dates are of the form yyyy-mm-dd, the calendar date format defined by ISO 8601.
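
As a quick illustration of the format string (a minimal sketch; the date string is just an example value), strptime turns a yyyy-mm-dd string into a datetime object, and strftime with the same format string turns it back into text:

```python
from datetime import datetime

# Format string for dates of the form yyyy-mm-dd
FMT = '%Y-%m-%d'

# Parse an example date string into a datetime object (midnight, no time zone)
d = datetime.strptime('2016-11-02', FMT)
print(d.year, d.month, d.day)

# Formatting with the same string recovers the original text
s = d.strftime(FMT)
print(s)

# timestamp() converts to seconds since 1970-01-01; the value depends on
# the local time zone of the machine running the code
ts = d.timestamp()
print(ts)
```

Because timestamp() interprets the date in the local time zone, the same CSV file can yield slightly different numeric values on machines in different time zones.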

Code to load the spreadsheet and parse the dates follows.

#Used for numpy arrays
import numpy as np
#Used to read data from CSV file
import pandas as pd
#Used to convert date string to numerical value
from datetime import datetime, timedelta
#Used to plot data
import matplotlib.pyplot as mpl

#Load data from the CSV file. Note: Some systems are unable
#to give timestamps for dates before 1970. This function may
#fail on such systems.
#
#path:      The path to the file
#return:    A data frame with the parsed timestamps
def ParseData(path):
    #Read the csv file into a dataframe
    df = pd.read_csv(path)
    #Get the date strings from the date column
    dateStr = df['Date'].values
    D = np.zeros(dateStr.shape)
    #Convert all date strings to a numeric value
    for i, j in enumerate(dateStr):
        #Date strings are of the form year-month-day
        D[i] = datetime.strptime(j, '%Y-%m-%d').timestamp()
    #Add the newly parsed column to the dataframe
    df['Timestamp'] = D
    #Remove any unused columns (axis = 1 specifies fields are columns)
    return df.drop('Date', axis = 1)
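
The loop in ParseData can also be written without an explicit Python loop. The following is a sketch of one alternative, not the code from the repository: the function name ParseDataVectorized is hypothetical, it takes a dataframe rather than a path so it can be tried without a CSV file, and unlike datetime.timestamp(), which interprets the date in the local time zone, it treats every date as UTC. The resulting values can therefore differ from those of ParseData by the local UTC offset.

```python
import pandas as pd

# Hypothetical vectorized variant of ParseData
# df:       A dataframe with a 'Date' column of year-month-day strings
# return:   A new dataframe with 'Date' replaced by a numeric 'Timestamp'
def ParseDataVectorized(df):
    # Parse all date strings at once
    dates = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Whole seconds since 1970-01-01, treating the dates as UTC
    secs = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    out = df.drop('Date', axis=1)
    out['Timestamp'] = secs.astype(float)
    return out

# Small made-up example
sample = pd.DataFrame({'Date': ['1970-01-02', '2016-11-02'], 'Open': [1.0, 2.0]})
parsed = ParseDataVectorized(sample)
print(parsed)
```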

Note: A quick plot of the data reveals what seems to be a typo in the Feb 01, 2016 data row, where the “Low” value is listed as 2016.02. The pyplot module of the matplotlib library provides powerful tools for visualizing data sets. Plotting a data set is useful both for seeing its overall shape and for catching outliers and typos. The erroneous data point can be removed entirely or modified to a reasonable value as desired. The following code will plot the stock data, set the x-axis labels, and add a legend.

#Given dataframe from ParseData
#plot it to the screen
#
#df:        Dataframe returned from ParseData
#p:         The position of the predicted data points
def PlotData(df, p = None):
    if(p is None):
        p = np.array([])
    #p contains the indices of predicted data; the rest are actual points
    c = np.array([i for i in range(df.shape[0]) if i not in p])
    #Timestamp data
    ts = df.Timestamp.values
    #Number of x tick marks
    nTicks= 10
    #Left most x value
    s = np.min(ts)
    #Right most x value
    e = np.max(ts)
    #Total range of x values
    r = e - s
    #Add some buffer on both sides
    s -= r / 5
    e += r / 5
    #These will be the tick locations on the x axis
    tickMarks = np.arange(s, e, (e - s) / nTicks)
    #Convert timestamps to strings
    strTs = [datetime.fromtimestamp(i).strftime('%m-%d-%y') for i in tickMarks]
    mpl.figure()
    #Plots of the high and low values for the day
    mpl.plot(ts, df.High.values, color = '#7092A8', linewidth = 1.618, label = 'Actual')
    #Predicted data was also provided
    if(len(p) > 0):
        mpl.plot(ts[p], df.High.values[p], color = '#6F6F6F', linewidth = 1.618, label = 'Predicted')
    #Set the tick marks
    mpl.xticks(tickMarks, strTs, rotation='vertical')
    #Add the legend in the upper left
    mpl.legend(loc = 'upper left')
    mpl.show()

A plot of the data set produced by the above code is shown below in Figure 1.

Figure 1: Historical Yahoo Inc Stock Data
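
The typo mentioned above can also be caught programmatically. As a rough sketch (the sample values below are made up, and the column names assume the Yahoo CSV layout read by ParseData), a row whose Low exceeds its High cannot be valid, so such rows can be flagged and dropped:

```python
import pandas as pd

# Made-up rows; the second row mimics the typo (Low listed as 2016.02)
df = pd.DataFrame({
    'High': [29.5, 29.9, 30.1],
    'Low':  [28.9, 2016.02, 29.4],
})

# A daily low can never exceed the daily high
bad = df['Low'] > df['High']
print(df[bad])

# Drop the suspicious rows (alternatively, correct them by hand)
clean = df[~bad].reset_index(drop=True)
print(clean)
```

This check only catches values that are logically impossible; a plot remains useful for spotting outliers that are merely implausible.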

In the previous post, only the numericized date was used as input to the regression model. It is dubious that the date alone provides much useful information about the stock price of a company. To improve the model, more of the information from the spreadsheet is used. A sample is constructed as the current timestamp together with, for each of the past n days, the opening value, closing value, high value, low value, adjusted closing value, volume, and timestamp. Thus, if data for the past n = 5 days is used, each row of the data matrix contains 1 + 5 * 7 = 36 features. If older data is unavailable, the oldest available values are used instead.

The corresponding target values are the stock opening value, closing value, high value, low value, adjusted closing value, and volume fields. The timestamp obviously does not need to be predicted.
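
The layout of one sample can be sketched with plain numpy. This is a simplified illustration rather than the class code that follows: the helper name BuildSample and the toy data are hypothetical, rows are assumed to be ordered newest first (as in the spreadsheet), and the timestamp is assumed to be the last of the 7 columns.

```python
import numpy as np

# Hypothetical helper: builds the feature vector for row i
# data:     Array of shape (days, 7), newest day first, timestamp in last column
# i:        Index of the row for which to build a sample
# n:        Number of past days to include
def BuildSample(data, i, n):
    m, c = data.shape
    r = np.zeros(1 + n * c)
    # First feature is the timestamp of the day being predicted
    r[0] = data[i, -1]
    for j in range(n):
        # Older days are further down; clamp to the oldest available row
        ind = min(i + j + 1, m - 1)
        r[1 + j * c : 1 + (j + 1) * c] = data[ind]
    return r

# Toy data: 3 days, 7 columns, where row k is filled with the value k
data = np.repeat(np.arange(3.0)[:, None], 7, axis=1)
x = BuildSample(data, 0, 5)
print(x.shape)
```

With only two older rows available, the last three of the five 7-value blocks repeat the oldest row, which is the padding rule described above.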

A Stock Predictor Class

A Python class is constructed which takes a regression model providing the scikit-learn interface and the number of past days n as arguments. The Learn function fits the model to a dataframe returned from the ParseData function. The stock values can then be predicted for a range of dates using the PredictDate function. Source code follows.

#Gives a list of timestamps from the start date to the end date
#
#startDate:     The start date as a string year-month-day
#endDate:       The end date as a string year-month-day
#weekends:      True if weekends should be included; false otherwise
#return:        A numpy array of timestamps
def DateRange(startDate, endDate, weekends = False):
    #The start and end date
    sd = datetime.strptime(startDate, '%Y-%m-%d')
    ed = datetime.strptime(endDate, '%Y-%m-%d')
    #Invalid start and end dates
    if(sd > ed):
        raise ValueError("The start date cannot be later than the end date.")
    #One day
    day = timedelta(1)
    #The final list of timestamp data
    dates = []
    cd = sd
    while(cd <= ed):
        #If weekends are included or it's a weekday append the current ts
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            dates.append(cd.timestamp())
        #Onto the next day
        cd = cd + day
    return np.array(dates)

#Given a date, returns the previous day
#
#startDate:     The start date as a timestamp
#weekends:      True if weekends should be counted; false otherwise
def DatePrevDay(startDate, weekends = False):
    #One day
    day = timedelta(1)
    cd = datetime.fromtimestamp(startDate)
    while(True):
        cd = cd - day
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            return cd.timestamp()
    #Should never happen
    return None

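#Used as the default scaler in the StockPredictor constructor
from sklearn.preprocessing import StandardScaler

#A class that predicts stock prices based on historical stock data
class StockPredictor: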

    #The (scaled) data frame
    D = None
    #Unscaled timestamp data
    DTS = None
    #The data matrix
    A = None
    #Target value matrix
    y = None
    #Corresponding columns for target values
    targCols = None
    #Number of previous days of data to use
    npd = 1
    #The regressor model
    R = None
    #Object to scale input data
    S = None

    #Constructor
    #rmodel:        The regressor model to use (sklearn)
    #nPastDays:     The number of past days to include
    #               in a sample.
    #scaler:        The scaler object used to scale the data (sklearn)
    def __init__(self, rmodel, nPastDays = 1, scaler = StandardScaler()):
        self.npd = nPastDays
        self.R = rmodel
        self.S = scaler

    #Extracts features from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The data matrix of samples
    def _ExtractFeat(self, D):
        #One row per day of stock data
        m = D.shape[0]
        #All 7 columns for each of the past n days + the current timestamp
        n = self._GetNumFeatures()
        B = np.zeros([m, n])
        #Preserve order of spreadsheet
        for i in range(m - 1, -1, -1):
            self._GetSample(B[i], i, D)
        #Return the internal numpy array
        return B

    #Extracts the target values from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The matrix of target values and the
    #           corresponding column names
    def _ExtractTarg(self, D):
        #Timestamp column is not predicted
        tmp = D.drop('Timestamp', axis = 1)
        #Return the internal numpy array
        return tmp.values, tmp.columns

    #Get the number of features in the data matrix
    #
    #n:         The number of previous days to include
    #           self.npd is  used if n is None
    #ret:       The number of features in the data matrix
    def _GetNumFeatures(self, n = None):
        if(n is None):
            n = self.npd
        return n * 7 + 1

    #Get the sample for a specific row in the dataframe.
    #A sample consists of the current timestamp and the data from
    #the past n rows of the dataframe
    #
    #r:         The array to fill with data
    #i:         The index of the row for which to build a sample
    #df:        The dataframe to use
    #return:    r
    def _GetSample(self, r, i, df):
        #First value is the timestamp
        r[0] = df['Timestamp'].values[i]
        #The number of columns in df
        n = df.shape[1]
        #The number of rows (one past the last valid index)
        lim = df.shape[0]
        #Each sample contains the past n days of stock data; for non-existing data
        #repeat last available sample
        #Format of row:
        #Timestamp Volume Open[i] High[i] ... Open[i-1] High[i-1]... etc
        for j in range(0, self.npd):
            #Subsequent rows contain older data in the spreadsheet
            ind = i + j + 1
            #If there is no older data, duplicate the oldest available values
            if(ind >= lim):
                ind = lim - 1
            #Add all columns from row[ind]
            for k, c in enumerate(df.columns):
                #+ 1 is needed as timestamp is at index 0
                r[k + 1 + n * j] = df[c].values[ind]
        return r

    #Attempts to learn the stock market data
    #given a dataframe taken from ParseData
    #
    #D:         A dataframe from ParseData
    def Learn(self, D):
        #Keep track of the currently learned data
        self.D = D.copy()
        #Keep track of old timestamps for indexing
        self.DTS = np.copy(D.Timestamp.values)
        #Scale the data
        self.D[self.D.columns] = self.S.fit_transform(self.D)
        #Get features from the data frame
        self.A = self._ExtractFeat(self.D)
        #Get the target values and their corresponding column names
        self.y, self.targCols = self._ExtractTarg(self.D)
        #Create the regressor model and fit it
        self.R.fit(self.A, self.y)

    #Predict the stock price during a specified time
    #
    #startDate:     The start date as a string in yyyy-mm-dd format
    #endDate:       The end date as a string yyyy-mm-dd format
    #return:        A dataframe containing the predictions or
    #               None if there is not enough data
    def PredictDate(self, startDate, endDate):
        #Create the range of timestamps and reverse them
        ts = DateRange(startDate, endDate)[::-1]
        m = ts.shape[0]
        #Prediction is based on data prior to start date
        #Get timestamp of previous day
        prevts = DatePrevDay(ts[-1])
        #Test if there is enough data to continue
        try:
            ind = np.where(self.DTS == prevts)[0][0]
        except IndexError:
            return None
        #There is enough data to perform prediction; allocate new data frame
        P = pd.DataFrame(np.zeros([m, self.D.shape[1]]), index = range(m), columns = self.D.columns)
        #Add in the timestamp column so that it can be scaled properly
        P['Timestamp'] = ts
        #Scale the timestamp (other fields are 0)
        P[P.columns] = self.S.transform(P)
        #B is to be the data matrix of features
        B = np.zeros([1, self._GetNumFeatures()])
        #Add extra last entries for past existing data
        for i in range(self.npd):
            #If the current index does not exist, repeat the last valid data
            curInd = ind + i
            if(curInd >= self.D.shape[0]):
                curInd = self.D.shape[0] - 1
            #Copy over the past data (already scaled)
            P.loc[m + i] = self.D.loc[curInd]
        #Loop until end date is reached
        for i in range(m - 1, -1, -1):
            #Create one sample
            self._GetSample(B[0], i, P)
            #Predict the row of the dataframe and save it
            pred = self.R.predict(B).ravel()
            #Fill in the remaining fields into the respective columns
            #(DataFrame.set_value was removed in pandas 1.0; .at replaces it)
            for j, k in zip(self.targCols, pred):
                P.at[i, j] = k
        #Discard extra rows needed for prediction
        P = P[0:m]
        #Scale the dataframe back to the original range
        P[P.columns] = self.S.inverse_transform(P)
        return P

The basic idea of the above code is as follows: use the data from today and the past n-1 days (n total) to predict the stock data for tomorrow. The PredictDate function can then repeat this process indefinitely into the future by basing subsequent predictions on predicted data. It is reasonable to assume that each subsequent prediction is less reliable than the last, since errors in earlier predictions feed into later ones.
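
The idea of feeding predictions back in as inputs can be sketched independently of the class. The helper RecursiveForecast and the stand-in model Trend below are hypothetical names used only for illustration; a single series of values is used instead of the full 7-column rows.

```python
import numpy as np

# Hypothetical helper illustrating recursive (multi-step) prediction
# predictOne:   Function mapping the last nLags values to the next value
# history:      Known values, oldest first
# nLags:        Number of past values given to the model
# nSteps:       Number of future values to predict
def RecursiveForecast(predictOne, history, nLags, nSteps):
    window = list(history[-nLags:])
    preds = []
    for _ in range(nSteps):
        nxt = predictOne(np.array(window))
        preds.append(nxt)
        #Slide the window: drop the oldest value, append the prediction
        window = window[1:] + [nxt]
    return np.array(preds)

# Stand-in "model": tomorrow equals today plus the most recent change
def Trend(w):
    return w[-1] + (w[-1] - w[-2])

p = RecursiveForecast(Trend, [1.0, 2.0, 3.0], nLags = 2, nSteps = 3)
print(p)
```

After the first step, every input in the window is itself a prediction, which is why errors tend to accumulate the further the forecast extends.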

Putting it Together

With the above class in place, the MLPR class and others are used to make stock predictions for Yahoo Inc. Data is loaded from the CSV file, a prediction is made for a user-specified range of dates, and the results are plotted. Sample main code is as follows:

#Grab the data frame
D = ParseData('yahoostock.csv')
#The number of previous days of data used
#when making a prediction
numPastDays = 16
#Number of neurons in the input layer
i = numPastDays * 7 + 1
#Number of neurons in the output layer
o = D.shape[1] - 1
#Number of neurons in the hidden layers
h = int((i + o) / 2)
#The list of layer sizes
layers = [i, h, h, h, h, h, o]
R = MLPR(layers, maxItr = 1000, tol = 0.40, reg = 0.001, verbose = True)
sp = StockPredictor(R, nPastDays = numPastDays)
#Learn the dataset and then display performance statistics
sp.Learn(D)
sp.TestPerformance()
#Perform prediction for a specified date range
P = sp.PredictDate('2016-11-02', '2016-12-31')
#Keep track of number of predicted results for plot
n = P.shape[0]
#Place the predicted results ahead of the actual results
#(DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
D = pd.concat([P, D])
#Predicted results are the first n rows
PlotData(D, range(n + 1))

In addition to the MLPR class from a previous post, the StockPredictor class works with any class that provides the basic sklearn interface: fit, predict, and score. A Python program which provides a basic command line interface for the primary functionality of this class can be found here.
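
To make the interface requirement concrete, here is a minimal sketch of a class exposing fit, predict, and score. MeanRegressor is a hypothetical baseline written for this illustration, not part of sklearn or of the post's source code, and its score is a simplified R^2 pooled over all target columns rather than sklearn's exact definition.

```python
import numpy as np

# Hypothetical baseline model exposing fit, predict, and score
class MeanRegressor:
    def fit(self, A, y):
        #Remember the mean of each target column
        self.mean_ = np.mean(y, axis = 0)
        return self

    def predict(self, A):
        #Predict the stored mean for every sample
        return np.tile(self.mean_, (np.shape(A)[0], 1))

    def score(self, A, y):
        #Simplified R^2 pooled over all target columns
        y = np.asarray(y, dtype = float)
        res = np.sum((y - self.predict(A)) ** 2)
        tot = np.sum((y - np.mean(y, axis = 0)) ** 2)
        return 1.0 - res / tot if tot > 0 else 0.0

# Made-up data: 3 samples, 1 feature, 2 target columns
A = np.array([[0.0], [1.0], [2.0]])
y = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
R = MeanRegressor().fit(A, y)
print(R.predict(A[:1]))
print(R.score(A, y))
```

In principle an instance of such a class could be passed as the rmodel argument of StockPredictor; a baseline like this is mainly useful as a point of comparison for more capable models.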

Results

Next, the program is employed to predict the stock data for Yahoo Inc from November 2nd to December 31st 2016. A KNeighborsRegressor class from sklearn was provided to the StockPredictor constructor to produce the prediction shown in Figure 2.

Figure 2: Yahoo Inc Stock Data with Prediction from KNN
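
For reference, a minimal sketch of how a KNeighborsRegressor behaves on multi-column targets follows. The data is made up and n_neighbors = 2 is an arbitrary choice for illustration; the settings used to produce Figure 2 are not stated in this post.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: 4 samples with 2 features and 2 target columns
A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0]])

# n_neighbors = 2 is an arbitrary choice for this illustration
R = KNeighborsRegressor(n_neighbors = 2)
R.fit(A, y)

# With the default uniform weights, the prediction is the average of the
# target rows of the 2 nearest training samples
pred = R.predict(np.array([[0.4, 0.4]]))
print(pred)
```

Because each prediction is an average of targets seen during training, a KNN regressor cannot produce values outside the range of its training data.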

Finally, the MLPR class from the previous post is used to perform prediction. The results can be seen below in Figure 3.

Figure 3: Yahoo Inc Stock Data with Prediction from MLPR

It appears the artificial neural network does not have much faith in Yahoo Inc.

Next Steps

The StockPredictor class above takes a slightly less naive approach to stock prediction than that of the previous post. In future posts, I hope to combine sentiment analysis of textual data sets with stock data to build a more reasonable model. I hope to see you then.

N


37 thoughts on “Stock Market Prediction in Python Part 2”

  1. Got an error: sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.


    1. Hi Paul,

      I am still using scikit-learn 0.17.1 in Anaconda Python 4.2, so I didn’t see that warning. I just tested that it works with scikit-learn 0.18 in a new environment though (despite the warning). Make sure you provide command line arguments to the program. Try running it from the command line with (provide dates in yyyy-mm-dd format):

      python stocks.py csv_file start_date end_date

      Hope this helps,

      N


      1. File “stocks.py”, line 102, in
        p, n = Main([‘table2.csv’, ‘2016-11-02’, ‘2016-12-31’, ‘D’])
        TypeError: ‘NoneType’ object is not iterable

        Ops, another error here…


  2. Hello,
    getting this error. If I comment out of the print line on the last line of stockpredictor.py it runs up through the point of printing the actual graph.

    line 341, in
    print(‘C-V:\t’ + str(s1) + ‘\nTst:\t’ + str(s2) + ‘\nTrn:\t’ + str(s3))
    NameError: name ‘s1’ is not defined

    Any help would be appreciated on this. Thanks


    1. Hi Trevor,

      That’s odd. What version of python are you using? Are you using the stocks.py as your “main” program? Also, how are you running the program?

      N


    1. Hi Trevor,

      I’m glad to hear you got it working. As it is, it should work with weekly or monthly data. The only thing would be to modify the PredictDate function to allow for daily, weekly, or monthly predictions. I’ll push an update to the GitHub in a bit.

      N


      1. line 283, in PredictDate
        prevts = DatePrevDay(ts[-1])
        IndexError: index -1 is out of bounds for axis 0 with size 0


      2. File “stocks.py”, line 100, in
        Main(sys.argv[1:])
        File “stocks.py”, line 90, in Main
        n = P.shape[0]
        AttributeError: ‘NoneType’ object has no attribute ‘shape’

        This happened with the monthly


  3. Now is this update assuming that we are still working with a daily dataset or monthly or weekly. I guess what I’m trying to ask, is it unit less?


  4. fixed the issues above. Didn’t think it through. The idea would be to only pull 1 data point for that period, because it would be greater at predicting macroeconomic trends, or does this pull every data point in an attempt to predict weekly or monthly values. I’m newer to python and I’m still catching my footing. More used to Java and C.


    1. What I’m noticing is that changing the D to W or M is stretching the gain of a day over weeks or months. I guess I need to know if this is drawing data points that are one day, one week or one month apart?


      1. It’s in the DatePrevday. It only jumps back timeDelta(1) which has no continuity with timedelta(7) or timedelta(30) on the predicted side. I’m going to add some elifs to accommodate the continuity issue. Your thoughts?


  5. How does the MLP train ? Back-Propogation is the answer! So how much time it takes given an year of data and prediction on daily basis and if possible can you give steps to do execute it ? I am doing it as stocks.py table.csv 2016-11-2 2016-12-31 D.

    And expecting the output something similar to what you posted as an image. but the problem is, it is no where closer to it. What can be the reason ? (Learning has not been enough ?)

    Again Steps are required here or anywhere.


  6. This looks more of a stat based approach as it gives the same predictions everytime the program finishes, which is not at all possible in case of neural networks condition applied* That the MLP is trained enough, which I didn’t see here.


  7. Hi Milind,

    Thanks for your comment. The performance of neural networks is highly dependent upon the hyper-parameters chosen (number of layers, layer sizes, activation function, regularization used, etc). It can take quite a lot of tuning before acceptable results are found for a given dataset.

    As mentioned in the post, the image I posted is using KNN to perform the prediction. Perhaps if you post your settings for the neural network I can give some suggestions.

    N


    1. Hi,

      I don’t see anything wrong with the command; it works okay for me. Make sure you have data going back to at least 2007-12-28, though this wouldn’t cause the error you are getting. Just a guess: Do you use a non-US keyboard layout by any chance? Maybe you are using a different “-” symbol.

      N


  8. If i want to use 75% train data, 15% testdata and 10% prediction (validation)

    How and where should i Put the dates?
    I dont understand this.

    I got data from 2007-03-20 -> 2017-01-01 but I cant figure out how to use this with your code.


    1. Hi,

      Currently the code only supports a holdout period for performing prediction. If you want to have additional data reserved for other uses you will need to make adjustments to the code.

      N


  9. Hey Nick,

    Predicting stock price just using it’s own chart looks simple and underpowered to what neural networks can do. When doing “Deep Learning” I would love to see how deep we can actually go.

    Can you make a part 3 of stock market prediction where you show how to cross examine Yahoo stock with either Forex USD/EUR ticker or GOLD price, or maybe cross examine tech market index or maybe just Dow Jones. And ask neural network if it sees the pattern.


  10. Hey Nick,

    could you help me about how do i see the graph on the console? i have used your code for my understanding. i pasted the first part-ParseData(path) (wherein you have put the csv into the data frame) followed by the second part- ParsePlot(df,p=None) function(wherein you are making the first plot). however, the funtions haven’t been called anywhere. could you tell me if i have the right approach? i am alot confused.
    thankyou!


  11. Hey Nick,

    Could you please clarify on what basis you choose the number on neurons in the input, hidden and output layers. Also, how did you choose the number of hidden layers ? please help
    I would be so grateful if you clarify this as soon as possible.

    Thank you so much


    1. Hi,

      The number of input and output neurons is determined by the problem. If the samples have 11 features for example, there must be 11 input neurons. If the target values have 4 features, there must be 4 output neurons. The number of hidden neurons is not restricted, however. Best practices usually suggest some value between the number of input and number of output neurons, but there are no definite rules. Basically, it comes down to trying out different values and seeing what works best each individual problem.

      N


      1. Thank you so much for your clarification, i also had another question regarding the cross validation part using KFold, i tried getting what is happening but i can’t get along, so if you could explain what is going there and what is the output it would be really great. Thank you so much in advance i really appreciate it.

        Mai EL-zayat


  12. I am also getting different result for the same input the range of dates (input) i guess it is because the weights and biases are randomly generated, correct me please if i am wrong.

    Thank you in advance

