# Stock Market Prediction in Python Part 2

This post revisits the problem of predicting stock prices based on historical stock data using TensorFlow, which was explored in a previous post. In the previous post, the stock price was predicted solely from the date. First, the date was converted to a numerical value in LibreOffice, then the resulting integer value was read into a matrix using numpy. As stated in that post, the method was not meant to be indicative of how actual stock prediction is done. This post aims to slightly improve upon the previous model and explore new features of TensorFlow and Anaconda Python. The corresponding source code is available here.

Note: See a later post Visualizing Neural Network Performance on High-Dimensional Data for code to help visualize neural network learning and performance.

# Feature Extraction

The latest stock data for Yahoo can be found at the following link. Instead of using LibreOffice to parse the date strings, the datetime library in Python can be used. The strptime function parses dates given a special format string. The format string in the code below specifies that the dates are of the form yyyy-mm-dd, also known as ISO 8601 format.
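As a quick illustration (not taken from the post's source code), strptime converts an ISO 8601 string into a datetime object whose fields can be inspected directly, and timestamp() yields a numeric value suitable for a feature matrix:

```python
from datetime import datetime

#Parse an ISO 8601 (yyyy-mm-dd) date string
d = datetime.strptime('2016-02-01', '%Y-%m-%d')
#The parsed components are available as attributes
print(d.year, d.month, d.day)  #2016 2 1
#A numeric value suitable for a feature matrix
ts = d.timestamp()
```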

```python
#Used for numpy arrays
import numpy as np
#Used to read data from CSV file
import pandas as pd
#Used to convert date string to numerical value
from datetime import datetime, timedelta
#Used to plot data
import matplotlib.pyplot as mpl

#Load data from the CSV file. Note: Some systems are unable
#to give timestamps for dates before 1970. This function may
#fail on such systems.
#
#path:      The path to the file
#return:    A data frame with the parsed timestamps
def ParseData(path):
    #Read the csv file into a dataframe
    df = pd.read_csv(path)
    #Get the date strings from the date column
    dateStr = df['Date'].values
    D = np.zeros(dateStr.shape)
    #Convert all date strings to a numeric value
    for i, j in enumerate(dateStr):
        #Date strings are of the form year-month-day
        D[i] = datetime.strptime(j, '%Y-%m-%d').timestamp()
    #Add the newly parsed column to the dataframe
    df['Timestamp'] = D
    #Remove any unused columns (axis = 1 specifies fields are columns)
    return df.drop('Date', axis = 1)
```

Note: A quick plot of the data reveals an apparent typo in the Feb 01, 2016 row, with the “Low” value listed as 2016.02. The pyplot module of matplotlib provides powerful tools for visualizing data sets; plotting is useful both for exploring a data set and for catching outliers and typos. The erroneous data point can be removed entirely or modified to a reasonable value, as desired. The following code plots the stock data, sets the x-axis labels, and adds a legend.
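For reference, a minimal sketch of the cleanup step using pandas (the column name matches the CSV; the threshold of 1000 and the median replacement are illustrative choices, not from the original code):

```python
import pandas as pd

#A toy frame reproducing the typo: 'Low' listed as 2016.02
df = pd.DataFrame({'Low': [29.18, 2016.02, 29.51]})
#Option 1: drop rows with an implausibly large 'Low' value
cleaned = df[df['Low'] < 1000].reset_index(drop = True)
#Option 2: replace the outlier with a nearby plausible value
df.loc[df['Low'] > 1000, 'Low'] = df['Low'].median()
```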

```python
#Given a dataframe from ParseData,
#plot it to the screen
#
#df:        Dataframe returned from ParseData
#p:         The position of the predicted data points
def PlotData(df, p = None):
    if(p is None):
        p = np.array([])
    #p contains the indices of predicted data; the rest are actual points
    c = np.array([i for i in range(df.shape[0]) if i not in p])
    #Timestamp data
    ts = df.Timestamp.values
    #Number of x tick marks
    nTicks = 10
    #Left-most x value
    s = np.min(ts)
    #Right-most x value
    e = np.max(ts)
    #Total range of x values
    r = e - s
    #Add some buffer on both sides
    s -= r / 5
    e += r / 5
    #These will be the tick locations on the x axis
    tickMarks = np.arange(s, e, (e - s) / nTicks)
    #Convert timestamps to strings
    strTs = [datetime.fromtimestamp(i).strftime('%m-%d-%y') for i in tickMarks]
    mpl.figure()
    #Plot of the high values for the day
    mpl.plot(ts, df.High.values, color = '#7092A8', linewidth = 1.618, label = 'Actual')
    #If predicted data was also provided
    if(len(p) > 0):
        mpl.plot(ts[p], df.High.values[p], color = '#6F6F6F', linewidth = 1.618, label = 'Predicted')
    #Set the tick marks
    mpl.xticks(tickMarks, strTs, rotation = 'vertical')
    #Add the legend in the upper left
    mpl.legend(loc = 'upper left')
    mpl.show()
```

A plot of the data set produced by the above code is shown below in Figure 1.

Figure 1: Historical Yahoo Inc Stock Data

In the previous post, only the numericized date was used as input to the regression model. It is dubious that the date alone provides much useful information about the stock price of a company. To improve the model, more of the information from the spreadsheet is used. A sample is constructed as the current timestamp together with the past $n$ days of the opening value, closing value, high value, low value, adjusted closing value, volume, and timestamp. Thus, if data for the past $n=5$ days is used, the data matrix contains $1 + 5 * 7 = 36$ features. If the previous data is unavailable, the oldest available value is used instead.
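The feature count can be checked with a one-line helper (a quick sanity check, not part of the post's class):

```python
#Each past day contributes 7 columns; the current timestamp adds 1
def NumFeatures(nPastDays, nCols = 7):
    return 1 + nPastDays * nCols

print(NumFeatures(5))   #36
print(NumFeatures(16))  #113
```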

The corresponding target values are the stock opening value, closing value, high value, low value, adjusted closing value, and volume fields. The timestamp obviously does not need to be predicted.
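In pandas terms, extracting the targets amounts to dropping the Timestamp column (a minimal sketch with made-up values and a reduced set of columns):

```python
import pandas as pd

#A toy frame standing in for the parsed stock data
df = pd.DataFrame({'Timestamp': [1.0, 2.0], 'Open': [10.0, 11.0], 'Close': [10.5, 11.5]})
#Everything except the timestamp is a target
targets = df.drop('Timestamp', axis = 1)
```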

# A Stock Predictor Class

A Python class is constructed which takes the number of days $n$ and a regression model providing the Scikit-learn interface as arguments. The class learns from a dataframe returned by the ParseData function via its Learn function. Stock values can then be predicted for a range of dates using the PredictDate function. The source code follows.

```python
#Needed for the default scaler in the StockPredictor constructor
from sklearn.preprocessing import StandardScaler

#Gives a list of timestamps from the start date to the end date
#
#startDate:     The start date as a string yyyy-mm-dd
#endDate:       The end date as a string yyyy-mm-dd
#weekends:      True if weekends should be included; false otherwise
#return:        A numpy array of timestamps
def DateRange(startDate, endDate, weekends = False):
    #The start and end date
    sd = datetime.strptime(startDate, '%Y-%m-%d')
    ed = datetime.strptime(endDate, '%Y-%m-%d')
    #Invalid start and end dates
    if(sd > ed):
        raise ValueError("The start date cannot be later than the end date.")
    #One day
    day = timedelta(1)
    #The final list of timestamp data
    dates = []
    cd = sd
    while(cd <= ed):
        #If weekends are included or it's a weekday, append the current ts
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            dates.append(cd.timestamp())
        #Onto the next day
        cd = cd + day
    return np.array(dates)

#Given a date, returns the previous day
#
#startDate:     The start date as a timestamp
#weekends:      True if weekends should be counted; false otherwise
#return:        The timestamp of the previous day
def DatePrevDay(startDate, weekends = False):
    #One day
    day = timedelta(1)
    cd = datetime.fromtimestamp(startDate)
    while(True):
        cd = cd - day
        if(weekends or (cd.date().weekday() != 5 and cd.date().weekday() != 6)):
            return cd.timestamp()

#A class that predicts stock prices based on historical stock data
class StockPredictor:

    #The (scaled) data frame
    D = None
    #Unscaled timestamp data
    DTS = None
    #The data matrix
    A = None
    #Target value matrix
    y = None
    #Corresponding columns for target values
    targCols = None
    #Number of previous days of data to use
    npd = 1
    #The regressor model
    R = None
    #Object to scale input data
    S = None

    #Constructor
    #
    #rmodel:        The regressor model to use (sklearn)
    #nPastDays:     The number of past days of data to include
    #               in each sample
    #scaler:        The scaler object used to scale the data (sklearn)
    def __init__(self, rmodel, nPastDays = 1, scaler = StandardScaler()):
        self.npd = nPastDays
        self.R = rmodel
        self.S = scaler

    #Extracts features from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The data matrix of samples
    def _ExtractFeat(self, D):
        #One row per day of stock data
        m = D.shape[0]
        #The current timestamp plus all columns for the past npd days
        n = self._GetNumFeatures()
        B = np.zeros([m, n])
        for i in range(m - 1, -1, -1):
            self._GetSample(B[i], i, D)
        #Return the internal numpy array
        return B

    #Extracts the target values from stock market data
    #
    #D:         A dataframe from ParseData
    #ret:       The matrix of target values and the
    #           corresponding column names
    def _ExtractTarg(self, D):
        #Timestamp column is not predicted
        tmp = D.drop('Timestamp', axis = 1)
        #Return the internal numpy array
        return tmp.values, tmp.columns

    #Get the number of features in the data matrix
    #
    #n:         The number of previous days to include;
    #           self.npd is used if n is None
    #ret:       The number of features in the data matrix
    def _GetNumFeatures(self, n = None):
        if(n is None):
            n = self.npd
        return n * 7 + 1

    #Get the sample for a specific row in the dataframe.
    #A sample consists of the current timestamp and the data from
    #the past n rows of the dataframe
    #
    #r:         The array to fill with data
    #i:         The index of the row for which to build a sample
    #df:        The dataframe to use
    #return:    r
    def _GetSample(self, r, i, df):
        #First value is the timestamp
        r[0] = df['Timestamp'].values[i]
        #The number of columns in df
        n = df.shape[1]
        #The last valid index
        lim = df.shape[0]
        #Each sample contains the past npd days of stock data; for
        #non-existing data, repeat the last available sample
        #Format of row:
        #Timestamp Volume Open[i] High[i] ... Open[i-1] High[i-1] ... etc
        for j in range(0, self.npd):
            #Subsequent rows contain older data in the spreadsheet
            ind = i + j + 1
            #If there is no older data, duplicate the oldest available values
            if(ind >= lim):
                ind = lim - 1
            for k, c in enumerate(df.columns):
                #+ 1 is needed as timestamp is at index 0
                r[k + 1 + n * j] = df[c].values[ind]
        return r

    #Attempts to learn the stock market data
    #given a dataframe taken from ParseData
    #
    #D:         A dataframe from ParseData
    def Learn(self, D):
        #Keep track of the currently learned data
        self.D = D.copy()
        #Keep track of old timestamps for indexing
        self.DTS = np.copy(D.Timestamp.values)
        #Scale the data
        self.D[self.D.columns] = self.S.fit_transform(self.D)
        #Get features from the data frame
        self.A = self._ExtractFeat(self.D)
        #Get the target values and their corresponding column names
        self.y, self.targCols = self._ExtractTarg(self.D)
        #Fit the regressor model
        self.R.fit(self.A, self.y)

    #Predict the stock price during a specified time
    #
    #startDate:     The start date as a string in yyyy-mm-dd format
    #endDate:       The end date as a string in yyyy-mm-dd format
    #return:        A dataframe containing the predictions, or None
    #               if there is not enough data
    def PredictDate(self, startDate, endDate):
        #Create the range of timestamps and reverse them
        ts = DateRange(startDate, endDate)[::-1]
        m = ts.shape[0]
        #Prediction is based on data prior to start date
        #Get timestamp of previous day
        prevts = DatePrevDay(ts[-1])
        #Test if there is enough data to continue
        try:
            ind = np.where(self.DTS == prevts)[0][0]
        except IndexError:
            return None
        #There is enough data to perform prediction; allocate new data frame
        P = pd.DataFrame(np.zeros([m, self.D.shape[1]]), index = range(m), columns = self.D.columns)
        #Add in the timestamp column so that it can be scaled properly
        P['Timestamp'] = ts
        #Scale the timestamp (other fields are 0)
        P[P.columns] = self.S.transform(P)
        #B is to be the data matrix of features
        B = np.zeros([1, self._GetNumFeatures()])
        #Add extra last entries for past existing data
        for i in range(self.npd):
            #If the current index does not exist, repeat the last valid data
            curInd = ind + i
            if(curInd >= self.D.shape[0]):
                curInd = self.D.shape[0] - 1
            #Copy over the past data (already scaled)
            P.loc[m + i] = self.D.loc[curInd]
        #Loop until end date is reached
        for i in range(m - 1, -1, -1):
            #Create one sample
            self._GetSample(B[0], i, P)
            #Predict the row of the dataframe and save it
            pred = self.R.predict(B).ravel()
            #Fill in the predicted fields into the respective columns
            #(set_value was removed in newer pandas; .at does the same)
            for j, k in zip(self.targCols, pred):
                P.at[i, j] = k
        #Discard extra rows needed for prediction
        P = P[0:m]
        #Scale the dataframe back to the original range
        P[P.columns] = self.S.inverse_transform(P)
        return P
```

The basic idea of the above code is as follows: use the data from today and the past $n-1$ days ($n$ days total) to predict tomorrow's stock data. The PredictDate function can then repeat this process indefinitely into the future by basing subsequent predictions on previously predicted data. It is reasonable to expect that these predictions become increasingly unreliable the further out they go.
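Abstractly, the feedback loop looks like the following sketch (a toy one-step model stands in for the trained regressor; the names are illustrative, not from the class above):

```python
#Repeatedly predict one step ahead, feeding each prediction back
#into the window as if it were observed data
def Forecast(modelStep, history, nSteps):
    window = list(history)
    preds = []
    for _ in range(nSteps):
        nxt = modelStep(window)
        preds.append(nxt)
        window.append(nxt)
    return preds

#Toy "model": predict the mean of the last three values
step = lambda w: sum(w[-3:]) / 3.0
preds = Forecast(step, [1.0, 2.0, 3.0], 2)
```

Since each prediction is built partly on earlier predictions, errors compound, which is why the later portion of a long forecast deserves extra skepticism.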

# Putting it Together

With the above class in place, the MLPR class from the previous post (or any other sklearn-style regressor) can be used to make stock predictions for Yahoo Inc. Data is loaded from the CSV file, a prediction is made for a user-specified range of dates, and the results are plotted. Sample main code follows:

```python
#Grab the data frame
D = ParseData('yahoostock.csv')
#The number of previous days of data used
#when making a prediction
numPastDays = 16
#Number of neurons in the input layer
i = numPastDays * 7 + 1
#Number of neurons in the output layer
o = D.shape[1] - 1
#Number of neurons in the hidden layers
h = int((i + o) / 2)
#The list of layer sizes
layers = [i, h, h, h, h, h, o]
#MLPR is the TensorFlow-based regressor from the previous post
R = MLPR(layers, maxItr = 1000, tol = 0.40, reg = 0.001, verbose = True)
sp = StockPredictor(R, nPastDays = numPastDays)
#Learn the dataset and then display performance statistics
sp.Learn(D)
sp.TestPerformance()
#Perform prediction for a specified date range
P = sp.PredictDate('2016-11-02', '2016-12-31')
#Keep track of number of predicted results for plot
n = P.shape[0]
#Prepend the predicted results to the actual results
#(DataFrame.append is removed in newer pandas; concat does the same)
D = pd.concat([P, D])
#Predicted results are the first n rows
PlotData(D, range(n + 1))
```

In addition to the MLPR class from a previous post, the StockPredictor class works with any class that provides the basic sklearn interface: fit, predict, and score. A Python program providing a basic command-line interface for the primary functionality of this class can be found here.

# Results

Next, the program is used to predict the stock data for Yahoo Inc from November 2nd to December 31st, 2016. A KNeighborsRegressor from sklearn was provided to the StockPredictor constructor to produce the prediction shown in Figure 2.
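For reference, a sketch of plugging in such a regressor (toy data; KNeighborsRegressor handles the multi-output targets that StockPredictor produces):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#Any model exposing fit/predict can be handed to StockPredictor
R = KNeighborsRegressor(n_neighbors = 2)
#Toy multi-output regression data: two targets per sample
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
R.fit(X, y)
#The prediction averages the targets of the two nearest neighbors
pred = R.predict([[1.5]])
```

In the post's pipeline this would simply be `sp = StockPredictor(R, nPastDays = numPastDays)` followed by `sp.Learn` and `sp.PredictDate`.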

Figure 2: Yahoo Inc Stock Data with Prediction from KNN

Finally, the MLPR class from the previous post is used to perform prediction. The results can be seen below in Figure 3.

Figure 3: Yahoo Inc Stock Data with Prediction from MLPR

It appears the artificial neural network does not have much faith in Yahoo Inc.

# Next Steps

The StockPredictor class above takes a slightly less naive approach to stock prediction than the one from the previous post. In future posts, I hope to combine sentiment analysis of textual data sets with stock data to build a more reasonable model. I hope to see you then.

N

## 36 thoughts on “Stock Market Prediction in Python Part 2”

1. Paul |

Got an error: sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.


• Nicholas T Smith |

Hi Paul,

I am still using scikit-learn 0.17.1 in Anaconda Python 4.2, so I didn’t see that warning. I just tested that it works with scikit-learn 0.18 in a new environment though (despite the warning). Make sure you provide command line arguments to the program. Try running it from the command line with (provide dates in yyyy-mm-dd format):

python stocks.py csv_file start_date end_date

Hope this helps,

N


• Paul |

File "stocks.py", line 102, in
p, n = Main(['table2.csv', '2016-11-02', '2016-12-31', 'D'])
TypeError: 'NoneType' object is not iterable

Oops, another error here…


2. Trevor |

Hello,
getting this error. If I comment out the print line on the last line of stockpredictor.py, it runs up through the point of printing the actual graph.

line 341, in
print('C-V:\t' + str(s1) + '\nTst:\t' + str(s2) + '\nTrn:\t' + str(s3))
NameError: name 's1' is not defined

Any help would be appreciated on this. Thanks


• Nicholas T Smith |

Hi Trevor,

That’s odd. What version of python are you using? Are you using the stocks.py as your “main” program? Also, how are you running the program?

N


• Trevor |

It ended up being an indentation problem.


3. Trevor |

Fixed it, never mind. What are your thoughts on making this capable of using weekly or monthly data?


• Nicholas T Smith |

Hi Trevor,

I’m glad to hear you got it working. As it is, it should work with weekly or monthly data. The only thing would be to modify the PredictDate function to allow for daily, weekly, or monthly predictions. I’ll push an update to the GitHub in a bit.

N


• Trevor |

line 283, in PredictDate
prevts = DatePrevDay(ts[-1])
IndexError: index -1 is out of bounds for axis 0 with size 0


• Trevor |

File "stocks.py", line 100, in
Main(sys.argv[1:])
File "stocks.py", line 90, in Main
n = P.shape[0]
AttributeError: 'NoneType' object has no attribute 'shape'

This happened with the monthly data.


4. Trevor |

Now, is this update assuming that we are still working with a daily dataset, or weekly or monthly? I guess what I'm trying to ask is: is it unitless?


5. Trevor |

Do you ever play with neuroph? Hit me up on facebook or something. Trevor Rydberg


6. Trevor |

Fixed the issues above; didn't think it through. The idea would be to only pull one data point per period, because it would be better at predicting macroeconomic trends. Or does this pull every data point in an attempt to predict weekly or monthly values? I'm newer to Python and still catching my footing; I'm more used to Java and C.


• Trevor |

What I’m noticing is that changing the D to W or M is stretching the gain of a day over weeks or months. I guess I need to know if this is drawing data points that are one day, one week or one month apart?


• Trevor |

It’s in DatePrevDay. It only jumps back timedelta(1), which has no continuity with timedelta(7) or timedelta(30) on the predicted side. I’m going to add some elifs to accommodate the continuity issue. Your thoughts?


7. Paul |

just fixed the issues under python 2.7. Maybe the next move is to try LSTM or other RNN? What do you think?


8. How does the MLP train? Back-propagation is the answer! So how much time does it take given a year of data with daily predictions, and if possible, can you give the steps to execute it? I am running it as stocks.py table.csv 2016-11-2 2016-12-31 D.

I was expecting output similar to what you posted as an image, but the problem is it is nowhere close to it. What can be the reason? (Has the learning not been enough?)

Again, steps would be appreciated, here or anywhere.


9. This looks more like a stat-based approach, as it gives the same predictions every time the program finishes, which is not at all possible in the case of neural networks (*condition applied: that the MLP is trained enough, which I didn’t see here).


10. Nicholas T Smith |

Hi Milind,

Thanks for your comment. The performance of neural networks is highly dependent upon the hyper-parameters chosen (number of layers, layer sizes, activation function, regularization used, etc). It can take quite a lot of tuning before acceptable results are found for a given dataset.

As mentioned in the post, the image I posted is using KNN to perform the prediction. Perhaps if you post your settings for the neural network I can give some suggestions.

N


11. Can RandomForest be used instead of KNN for better results? Also, how can I add more parameters to this and make it more customizable using NLP as a frontend?


12. Also if you can explain how exactly you are backpropagating the delta, that would be nice 🙂


• Nicholas T Smith |

Hi,

The neural network code is using tensorflow’s implementation of the Adam method. If you are just referring to back-propagation in general though, in a previous blog post, I detailed how to perform simple back-propagation.

N


13. I don’t know why it always errors.
When I run the command -> python stocks.py data.csv 2007-12-28 2016-11-02 D
I always get this: Error parsing date: 2007-12-28
So is 2007-12-28 not correct???

Please help.
Thanks


• Nicholas T Smith |

Hi,

I don’t see anything wrong with the command; it works okay for me. Make sure you have data going back to at least 2007-12-28, though this wouldn’t cause the error you are getting. Just a guess: Do you use a non-US keyboard layout by any chance? Maybe you are using a different “-” symbol.

N


• Hi,
thanks so much for answering me,
but I found my problem is not the “-” symbol;
the problem was from Python 2.
When I use Python 3, I get no errors.

Thanks


14. cccdicks |

If I want to use 75% training data, 15% test data, and 10% prediction (validation),

how and where should I put the dates?
I don’t understand this.

I have data from 2007-03-20 -> 2017-01-01 but I can’t figure out how to use it with your code.


• Nicholas T Smith |

Hi,

Currently the code only supports a holdout period for performing prediction. If you want to have additional data reserved for other uses you will need to make adjustments to the code.

N


15. Yuriy |

Hey Nick,

Predicting stock price just using it’s own chart looks simple and underpowered to what neural networks can do. When doing “Deep Learning” I would love to see how deep we can actually go.

Can you make a part 3 of stock market prediction where you show how to cross examine Yahoo stock with either Forex USD/EUR ticker or GOLD price, or maybe cross examine tech market index or maybe just Dow Jones. And ask neural network if it sees the pattern.


16. troofy |

Hey Nick,

could you help me with how to see the graph on the console? I have used your code for my understanding. I pasted the first part, ParseData(path) (where you put the csv into the data frame), followed by the second part, PlotData(df, p=None) (where you make the first plot). However, the functions haven’t been called anywhere. Could you tell me if I have the right approach? I am a lot confused.
Thank you!


17. Hey Nick,

Could you please clarify on what basis you choose the number on neurons in the input, hidden and output layers. Also, how did you choose the number of hidden layers ? please help
I would be so grateful if you clarify this as soon as possible.

Thank you so much


• Nicholas T Smith |

Hi,

The number of input and output neurons is determined by the problem. If the samples have 11 features, for example, there must be 11 input neurons. If the target values have 4 features, there must be 4 output neurons. The number of hidden neurons is not restricted, however. Best practices usually suggest some value between the number of input and output neurons, but there are no definite rules. Basically, it comes down to trying out different values and seeing what works best for each individual problem.

N


• Thank you so much for your clarification. I also had another question regarding the cross-validation part using KFold; I tried to follow what is happening but can’t get along, so if you could explain what is going on there and what the output is, it would be really great. Thank you so much in advance; I really appreciate it.

Mai EL-zayat


18. I am also getting different results for the same input range of dates. I guess it is because the weights and biases are randomly generated; please correct me if I am wrong.