This post revisits the problem of predicting stock prices from historical stock data using TensorFlow, which was explored in a previous post. In that post, the stock price was predicted solely from the date. First, the date was converted to a numerical value in LibreOffice, then the resulting integer value was read into a matrix using numpy. As stated in that post, the method was not meant to be indicative of how actual stock prediction is done. This post aims to slightly improve upon the previous model and to explore new features in TensorFlow and Anaconda Python. The corresponding source code is available here.
Note: See a later post Visualizing Neural Network Performance on High-Dimensional Data for code to help visualize neural network learning and performance.
The latest stock data for Yahoo can be found at the following link. Instead of using LibreOffice to parse the date strings, the datetime library in Python can be used. The strptime function parses dates according to a format string. The format string in the code below specifies that the dates are of the form yyyy-mm-dd, also known as ISO 8601 format.
Code to load the spreadsheet and parse the dates follows.
#Used for numpy arrays
import numpy as np
#Used to read data from CSV file
import pandas as pd
#Used to convert date string to numerical value
from datetime import datetime, timedelta
#Used to plot data
import matplotlib.pyplot as mpl

#Load data from the CSV file. Note: Some systems are unable
#to give timestamps for dates before 1970. This function may
#fail on such systems.
#
#path: The path to the file
#return: A data frame with the parsed timestamps
def ParseData(path):
    #Read the csv file into a dataframe
    df = pd.read_csv(path)
    #Get the date strings from the date column
    dateStr = df['Date'].values
    D = np.zeros(dateStr.shape)
    #Convert all date strings to a numeric value
    for i, j in enumerate(dateStr):
        #Date strings are of the form year-month-day
        D[i] = datetime.strptime(j, '%Y-%m-%d').timestamp()
    #Add the newly parsed column to the dataframe
    df['Timestamp'] = D
    #Remove any unused columns (axis = 1 specifies fields are columns)
    return df.drop('Date', axis = 1)
Note: A quick plot of the data reveals what seems to be a typo in the Feb 01, 2016 data row, with the “Low” value listed as 2016.02. The pyplot module of the matplotlib library provides powerful tools for visualizing data sets. Plotting a data set is useful both for understanding it and for catching outliers and typos. The erroneous data point can be removed entirely or modified to a reasonable value as desired. The following code will plot the stock data, set the x-axis labels, and add a legend.
#Given dataframe from ParseData
#plot it to the screen
#
#df: Dataframe returned from ParseData
#p: The position of the predicted data points
def PlotData(df, p = None):
    if(p is None):
        p = np.array([])
    #p contains the indices of predicted data; the rest are actual points
    c = np.array([i for i in range(df.shape[0]) if i not in p])
    #Timestamp data
    ts = df.Timestamp.values
    #Number of x tick marks
    nTicks = 10
    #Left most x value
    s = np.min(ts)
    #Right most x value
    e = np.max(ts)
    #Total range of x values
    r = e - s
    #Add some buffer on both sides
    s -= r / 5
    e += r / 5
    #These will be the tick locations on the x axis
    tickMarks = np.arange(s, e, (e - s) / nTicks)
    #Convert timestamps to strings
    strTs = [datetime.fromtimestamp(i).strftime('%m-%d-%y') for i in tickMarks]
    mpl.figure()
    #Plot of the high values for the day
    mpl.plot(ts, df.High.values, color = '#7092A8', linewidth = 1.618, label = 'Actual')
    #Predicted data was also provided
    if(len(p) > 0):
        mpl.plot(ts[p], df.High.values[p], color = '#6F6F6F', linewidth = 1.618, label = 'Predicted')
    #Set the tick marks
    mpl.xticks(tickMarks, strTs, rotation='vertical')
    #Add the label in the upper left
    mpl.legend(loc = 'upper left')
    mpl.show()
A plot of the data set produced by the above code is shown below in Figure 1.
Figure 1: Historical Yahoo Inc Stock Data
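The typo in the “Low” column can also be caught without a plot, since a day's low can never exceed its high and the open and close must lie between the two. The following is a minimal sketch of such a check and is not part of the original source code; the function name find_suspect_rows and the sample rows are made up for illustration.

```python
def find_suspect_rows(rows):
    """Return the indices of rows whose price fields are inconsistent."""
    suspects = []
    for i, row in enumerate(rows):
        low, high = row["Low"], row["High"]
        # Open and Close must both lie between Low and High
        in_range = all(low <= row[k] <= high for k in ("Open", "Close"))
        # A low above the high is impossible and points to a data entry error
        if low > high or not in_range:
            suspects.append(i)
    return suspects


# Made-up rows for illustration only; the second row mimics a
# date-like value typed into the Low column.
rows = [
    {"Open": 29.5, "High": 30.1, "Low": 29.2, "Close": 29.9},
    {"Open": 29.3, "High": 29.7, "Low": 2016.02, "Close": 29.6},
    {"Open": 29.8, "High": 30.4, "Low": 29.6, "Close": 30.2},
]
print(find_suspect_rows(rows))  # prints [1]
```

With the dataframe returned by ParseData, the same idea can be expressed as df[df['Low'] > df['High']], assuming the columns are named as in the spreadsheet.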
In the previous post, only the numericized date was used as input to the regression model. It is dubious that the date provides much useful information about the stock price of a company. To improve the model, more of the information from the spreadsheet is used. A sample is constructed as the current timestamp together with the opening value, closing value, high value, low value, adjusted closing value, volume, and timestamp from each of the past n days. Thus, if data for the past n days is used, each row of the data matrix contains 7n + 1 features. If the previous data is unavailable, the oldest available value is used instead.
The corresponding target values are the stock opening value, closing value, high value, low value, adjusted closing value, and volume fields. The timestamp obviously does not need to be predicted.
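To make the layout of a sample concrete, here is a small sketch that builds one sample from plain Python lists. It is an illustration under stated assumptions rather than the code used in this post: rows are assumed to be ordered newest first (as in the spreadsheet), each row is assumed to hold the seven fields with the timestamp last, and the name build_sample is hypothetical.

```python
def build_sample(rows, i, n_past):
    """Build the feature vector for row i from the n_past older rows.

    rows:   list of rows, newest first; the timestamp is the last field
    i:      index of the row for which to build a sample
    n_past: number of past days to include
    """
    last = len(rows) - 1
    # First feature is the timestamp of the current row
    sample = [rows[i][-1]]
    for j in range(1, n_past + 1):
        # Older data sits further down; reuse the oldest row if we run out
        ind = min(i + j, last)
        sample.extend(rows[ind])
    return sample


# Made-up data: 3 days, 7 fields each, timestamp in the last position
rows = [
    [3.0, 3.1, 2.9, 3.0, 3.0, 300.0, 30.0],
    [2.0, 2.1, 1.9, 2.0, 2.0, 200.0, 20.0],
    [1.0, 1.1, 0.9, 1.0, 1.0, 100.0, 10.0],
]
sample = build_sample(rows, 0, n_past=2)
print(len(sample))  # prints 15, i.e. 7 * 2 + 1
```

For the oldest row there is no earlier data, so build_sample(rows, 2, n_past=2) repeats that row twice after its timestamp.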
A Stock Predictor Class
A Python class is constructed which takes the number of past days and a regression model providing the scikit-learn interface as arguments. The class then uses the Learn function to fit the model to a dataframe returned from the ParseData function. Next, the stock values can be predicted for a range of dates using the PredictDate function. Source code follows.
#Used to scale the data
from sklearn.preprocessing import StandardScaler

#Gives a list of timestamps from the start date to the end date
#
#startDate: The start date as a string year-month-day
#endDate: The end date as a string year-month-day
#weekends: True if weekends should be included; false otherwise
#return: A numpy array of timestamps
def DateRange(startDate, endDate, weekends = False):
    #The start and end date
    sd = datetime.strptime(startDate, '%Y-%m-%d')
    ed = datetime.strptime(endDate, '%Y-%m-%d')
    #Invalid start and end dates
    if(sd > ed):
        raise ValueError("The start date cannot be later than the end date.")
    #One day
    day = timedelta(1)
    #The final list of timestamp data
    dates = []
    cd = sd
    while(cd <= ed):
        #If weekends are included or it's a weekday append the current ts
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            dates.append(cd.timestamp())
        #Onto the next day
        cd = cd + day
    return np.array(dates)

#Given a date, returns the previous day
#
#startDate: The start date as a timestamp
#weekends: True if weekends should counted; false otherwise
def DatePrevDay(startDate, weekends = False):
    #One day
    day = timedelta(1)
    cd = datetime.fromtimestamp(startDate)
    while(True):
        cd = cd - day
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            return cd.timestamp()
    #Should never happen
    return None

#A class that predicts stock prices based on historical stock data
class StockPredictor:
    #The (scaled) data frame
    D = None
    #Unscaled timestamp data
    DTS = None
    #The data matrix
    A = None
    #Target value matrix
    y = None
    #Corresponding columns for target values
    targCols = None
    #Number of previous days of data to use
    npd = 1
    #The regressor model
    R = None
    #Object to scale input data
    S = None

    #Constructor
    #rmodel: The regressor model to use (sklearn)
    #nPastDays: The number of past days in each feature
    #scaler: The scaler object used to scale the data (sklearn)
    def __init__(self, rmodel, nPastDays = 1, scaler = StandardScaler()):
        self.npd = nPastDays
        self.R = rmodel
        self.S = scaler

    #Extracts features from stock market data
    #
    #D: A dataframe from ParseData
    #ret: The data matrix of samples
    def _ExtractFeat(self, D):
        #One row per day of stock data
        m = D.shape[0]
        #Open, High, Low, and Close for past n days + timestamp and volume
        n = self._GetNumFeatures()
        B = np.zeros([m, n])
        #Preserve order of spreadsheet
        for i in range(m - 1, -1, -1):
            self._GetSample(B[i], i, D)
        #Return the internal numpy array
        return B

    #Extracts the target values from stock market data
    #
    #D: A dataframe from ParseData
    #ret: The data matrix of targets and the corresponding column names
    def _ExtractTarg(self, D):
        #Timestamp column is not predicted
        tmp = D.drop('Timestamp', axis = 1)
        #Return the internal numpy array
        return tmp.values, tmp.columns

    #Get the number of features in the data matrix
    #
    #n: The number of previous days to include
    #   self.npd is used if n is None
    #ret: The number of features in the data matrix
    def _GetNumFeatures(self, n = None):
        if(n is None):
            n = self.npd
        return n * 7 + 1

    #Get the sample for a specific row in the dataframe.
    #A sample consists of the current timestamp and the data from
    #the past n rows of the dataframe
    #
    #r: The (1-dimensional) array to fill with data
    #i: The index of the row for which to build a sample
    #df: The dataframe to use
    #return: r
    def _GetSample(self, r, i, df):
        #First value is the timestamp
        r[0] = df['Timestamp'].values[i]
        #The number of columns in df
        n = df.shape[1]
        #The number of rows in df; lim - 1 is the last valid index
        lim = df.shape[0]
        #Each sample contains the past n days of stock data; for non-existing data
        #repeat last available sample
        #Format of row:
        #Timestamp Volume Open[i] High[i] ... Open[i-1] High[i-1]... etc
        for j in range(0, self.npd):
            #Subsequent rows contain older data in the spreadsheet
            ind = i + j + 1
            #If there is no older data, duplicate the oldest available values
            if(ind >= lim):
                ind = lim - 1
            #Add all columns from row[ind]
            for k, c in enumerate(df.columns):
                #+ 1 is needed as timestamp is at index 0
                r[k + 1 + n * j] = df[c].values[ind]
        return r

    #Attempts to learn the stock market data
    #given a dataframe taken from ParseData
    #
    #D: A dataframe from ParseData
    def Learn(self, D):
        #Keep track of the currently learned data
        self.D = D.copy()
        #Keep track of old timestamps for indexing
        self.DTS = np.copy(D.Timestamp.values)
        #Scale the data
        self.D[self.D.columns] = self.S.fit_transform(self.D)
        #Get features from the data frame
        self.A = self._ExtractFeat(self.D)
        #Get the target values and their corresponding column names
        self.y, self.targCols = self._ExtractTarg(self.D)
        #Create the regressor model and fit it
        self.R.fit(self.A, self.y)

    #Predict the stock price during a specified time
    #
    #startDate: The start date as a string in yyyy-mm-dd format
    #endDate: The end date as a string yyyy-mm-dd format
    #return: A dataframe containing the predictions or None if
    #        there is not enough data
    def PredictDate(self, startDate, endDate):
        #Create the range of timestamps and reverse them
        ts = DateRange(startDate, endDate)[::-1]
        m = ts.shape[0]
        #Prediction is based on data prior to start date
        #Get timestamp of previous day
        prevts = DatePrevDay(ts[-1])
        #Test if there is enough data to continue
        try:
            ind = np.where(self.DTS == prevts)[0][0]
        except IndexError:
            return None
        #There is enough data to perform prediction; allocate new data frame
        P = pd.DataFrame(np.zeros([m, self.D.shape[1]]), index = range(m), columns = self.D.columns)
        #Add in the timestamp column so that it can be scaled properly
        P['Timestamp'] = ts
        #Scale the timestamp (other fields are 0)
        P[P.columns] = self.S.transform(P)
        #B is to be the data matrix of features
        B = np.zeros([1, self._GetNumFeatures()])
        #Add extra last entries for past existing data
        for i in range(self.npd):
            #If the current index does not exist, repeat the last valid data
            curInd = ind + i
            if(curInd >= self.D.shape[0]):
                curInd = self.D.shape[0] - 1
            #Copy over the past data (already scaled)
            P.loc[m + i] = self.D.loc[curInd]
        #Loop until end date is reached
        for i in range(m - 1, -1, -1):
            #Create one sample (B[0] is a view onto the single row of B)
            self._GetSample(B[0], i, P)
            #Predict the row of the dataframe and save it
            pred = self.R.predict(B).ravel()
            #Fill in the remaining fields into the respective columns
            for j, k in zip(self.targCols, pred):
                P.at[i, j] = k
        #Discard extra rows needed for prediction
        P = P[0:m].copy()
        #Scale the dataframe back to the original range
        P[P.columns] = self.S.inverse_transform(P)
        return P
The basic idea of the above code is as follows: use the data from today and the preceding days (n days in total, where n is the number of past days given to the constructor) to predict the stock data for tomorrow. The PredictDate function can then repeat this process indefinitely into the future by basing subsequent predictions on predicted data. It is reasonable to assume that the subsequent predictions are increasingly unreliable.
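The feedback loop described here, where each prediction becomes input for the next one, can be sketched in isolation. The example below is a simplified illustration on a single series of values and does not reproduce PredictDate; the names recursive_forecast and mean_model are hypothetical, and the "model" is just the mean of the window, standing in for a fitted regressor.

```python
def recursive_forecast(history, predict_next, n_past, steps):
    """Forecast `steps` values, feeding each prediction back as input.

    history:      known values, oldest first
    predict_next: function mapping a window of n_past values to the next value
    n_past:       number of past values the model looks at
    steps:        number of future values to predict
    """
    window = list(history[-n_past:])
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(window)
        forecasts.append(nxt)
        # Slide the window: drop the oldest value, append the prediction
        window = window[1:] + [nxt]
    return forecasts


# Stand-in model for illustration only: predict the mean of the window
def mean_model(window):
    return sum(window) / len(window)


history = [10.0, 12.0, 14.0]
print(recursive_forecast(history, mean_model, n_past=2, steps=3))
# prints [13.0, 13.5, 13.25]
```

After the first step the window contains a predicted value, and after n_past steps it contains only predicted values, which is why errors tend to compound the further out the forecast goes.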
Putting it Together
With the above class in place, the MLPR class and others are used to make stock predictions for Yahoo Inc. Data is loaded from the csv file, prediction is made for a user specified range of dates, and the results are plotted. Sample main code is as follows:
#Note: MLPR is the class from the previous post and TestPerformance
#is part of the full source code; neither is listed in this post

#Grab the data frame
D = ParseData('yahoostock.csv')
#The number of previous days of data used
#when making a prediction
numPastDays = 16
#Number of neurons in the input layer
i = numPastDays * 7 + 1
#Number of neurons in the output layer
o = D.shape[1] - 1
#Number of neurons in the hidden layers
h = int((i + o) / 2)
#The list of layer sizes
layers = [i, h, h, h, h, h, o]
R = MLPR(layers, maxItr = 1000, tol = 0.40, reg = 0.001, verbose = True)
sp = StockPredictor(R, nPastDays = numPastDays)
#Learn the dataset and then display performance statistics
sp.Learn(D)
sp.TestPerformance()
#Perform prediction for a specified date range
P = sp.PredictDate('2016-11-02', '2016-12-31')
#Keep track of number of predicted results for plot
n = P.shape[0]
#Append the predicted results to the actual results
D = pd.concat([P, D])
#Predicted results are the first n rows
PlotData(D, range(n + 1))
In addition to the MLPR class from a previous post, the StockPredictor class works with any class that provides the basic sklearn interface: fit, predict, and score. A python program which provides a basic command line interface for the primary functionality of this class can be found here.
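To illustrate what providing the basic sklearn interface means, here is a minimal sketch of a regressor with only fit, predict, and score. It is a hypothetical baseline written for this illustration, not a class from sklearn or from the source code of this post: it always predicts the column means of the training targets, and it assumes the targets form a 2-dimensional matrix with one column per predicted field.

```python
import numpy as np


class MeanRegressor:
    """Hypothetical baseline: predicts the mean of each target column."""

    def fit(self, X, y):
        # Remember the mean of every target column (y is samples x targets)
        self.mean_ = np.asarray(y, dtype=float).mean(axis=0)
        return self

    def predict(self, X):
        # One copy of the stored means per input sample
        m = np.asarray(X).shape[0]
        return np.tile(self.mean_, (m, 1))

    def score(self, X, y):
        # Coefficient of determination (R^2) pooled over all target columns
        y = np.asarray(y, dtype=float)
        ss_res = ((y - self.predict(X)) ** 2).sum()
        ss_tot = ((y - y.mean(axis=0)) ** 2).sum()
        return 1.0 - ss_res / ss_tot


# Made-up data for illustration only
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
model = MeanRegressor().fit(X, y)
print(model.predict(np.zeros((1, 1))).ravel())
```

Such an object could be passed as the rmodel argument of StockPredictor in place of MLPR. A mean predictor scores 0 on its own training data, which makes it a useful lower bar when comparing against KNN or a neural network.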
Next, the program is employed to predict the stock data for Yahoo Inc from November 2nd to December 31st 2016. A KNeighborsRegressor class from sklearn was provided to the StockPredictor constructor to produce the prediction shown in Figure 2.
Figure 2: Yahoo Inc Stock Data with Prediction from KNN
Finally, the MLPR class from the previous post is used to perform prediction. The results can be seen below in Figure 3.
Figure 3: Yahoo Inc Stock Data with Prediction from MLPR
It appears the artificial neural network does not have much faith in Yahoo Inc.
The StockPredictor class above takes a slightly less naive approach to stock prediction than that of the previous post. In future posts, I hope to apply sentiment analysis techniques to textual data sets and combine the results with stock data to make a more reasonable model. I hope to see you then.