Jane Street Stock Prediction

Prateek Nigam
16 min read · Apr 6, 2021

Can AI forecast stock investments?

Photo by Maxim Hopman on Unsplash

Algorithmic trading has transformed the stock market and the industry around it. More than 70% of all trades happening in the US right now are handled by bots. Gone are the days of the crowded exchange floor with traders waving pieces of paper and shouting into phones. Machine learning has many applications, one of which is forecasting time series, and one of the most interesting (or perhaps most profitable) time series to predict is stock prices.

Predicting how the stock market will perform is one of the most difficult tasks. There are so many factors involved in the prediction: physical factors versus psychological ones, rational and irrational behaviour, and so on. All of these aspects combine to make share prices volatile and hard to predict with a high degree of accuracy.

Can we use machine learning as a game changer in this domain? Using features like the latest announcements about an organization, its quarterly revenue results, and so on, machine learning techniques have the potential to unearth patterns and insights we did not see before, and these can be used to make more accurate predictions.

In this solution to a real-world Kaggle problem, we will work with historical data about the stock prices of a publicly listed company. We will implement a mix of machine learning algorithms to predict the future stock price of this company, starting with simple algorithms like averaging and linear regression, and then exploring more advanced techniques.

Table of Contents:

· Exploratory Data Analysis
· Feature Engineering
· Machine Learning Models

Exploratory Data Analysis

Our challenge is to use the historical data, mathematical tools, and numerical instruments at our disposal to build a model that gets as close to certainty as possible. We are given a number of potential trading opportunities, and the model must choose whether to accept or reject each one.

In general, if one can build a highly predictive model that chooses the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to "fair" values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and the difficulty of coming up with a proper mathematical formulation.

Features

This dataset contains an anonymized set of features, feature_{0…129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it.

We divided the data into two groups of columns: the feature columns, containing all 129 anonymized features, and the non-feature columns (date, weight, resp_{1,2,3,4}, resp, and ts_id), which are essential for understanding the data.
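As a minimal sketch (assuming the training file is the competition's train.csv and that the feature columns share the feature_ prefix), this split could be done like this:

import pandas as pd

# Load the competition's training data (path assumed)
train_data = pd.read_csv('train.csv')

# Feature columns share the "feature_" prefix; everything else is a non-feature column
feature_columns = [c for c in train_data.columns if c.startswith('feature')]
non_feature_columns = [c for c in train_data.columns if not c.startswith('feature')]

print(train_data[non_feature_columns].head())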

Head of data for non-feature columns

Weight

Weight Distribution

Trades with weight=0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.

From the distribution of the weight feature we observed that it is highly skewed towards 0. The dataset contains trades with weight = 0, which were intentionally included for completeness even though they do not contribute to the scoring evaluation. These zero-weight trades make up about 17% of the rows, and since they would not contribute to the scoring evaluation, we can drop them.
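A minimal sketch of this filtering step (using the train_data DataFrame loaded earlier):

# Zero-weight trades do not affect scoring, so drop them before modelling
zero_ratio = (train_data['weight'] == 0).mean()
print(f'share of zero-weight trades: {zero_ratio:.2%}')

train_data = train_data[train_data['weight'] > 0].reset_index(drop=True)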

Resp

Resp distribution

The resp feature, together with resp_{1,2,3,4}, represents returns over different time horizons. Referring to https://www.investopedia.com/terms/t/timehorizon.asp, a time horizon is the period of time one expects to hold an investment until the returns are needed. The longer the horizon, the longer the power of compounding has to work, and the more aggressive a financial return can be targeted in a portfolio, and vice versa.
Each trade has an associated weight and resp, which together represent a return on the trade.

We also see from the above distribution that the returns of resp_4 are higher than those of resp_1. This suggests that the returns depend on the time horizon; therefore the resp and weight features can be used to construct the action (target) column.

Target column

The target (action) feature can be defined as 1 when the weight is greater than 0 and its product with the resp feature is positive, and 0 otherwise.
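A minimal sketch of that definition (one reasonable reading of it, using the columns introduced above):

# action = 1 when the trade has positive weight and a positive weighted return
train_data['action'] = ((train_data['weight'] > 0) &
                        (train_data['weight'] * train_data['resp'] > 0)).astype(int)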

ts_id

The ts_id feature represents a time ordering of the trades, which lets us count the trades per day.

Looking at the plot of the number of trades with respect to days, we observe that the day index runs up to around 450–500, the trade counts fluctuate considerably from day to day, and almost all of the days with a large volume of trades occur before or around day 85.

Referring to https://www.investopedia.com/terms/v/volatility.asp, volatility is a measure of fluctuations. Volatile assets are often considered riskier than less volatile assets because the price is expected to be less predictable, and volatility is an important variable for calculating option prices.

As suggested in the linked article, a higher number of trades on a given day simply indicates higher market volatility.
Referring to https://www.kaggle.com/c/jane-street-market-prediction/discussion/201930#1125847, there are days with more than 9k trades, and almost all of them occur before day 85.
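A short sketch of how the trades-per-day plot can be produced (assuming the date column indexes the trading day of each row):

import matplotlib.pyplot as plt

# Count trades per day and plot them to eyeball the volatility pattern
trades_per_day = train_data.groupby('date')['ts_id'].count()
trades_per_day.plot(figsize=(12, 4), title='Number of trades per day')
plt.xlabel('day')
plt.ylabel('trades')
plt.show()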

129 Features

head of train data

Looking at the head (the first few rows) of the features, we observed that feature_0 takes only 1 and -1 as its values, which is worth looking into, while feature_7, 8, 120, and 121 contain NaNs, which suggests there may be many missing values.

It is confirmed that feature_0 has only two values, 1 and -1; it is a binary feature, which might be a useful observation. Meanwhile, we will look for explanations from the other features.

A better option is to plot each feature's distribution in relation to the target, together with its box plot using the target as hue.

Then we can use the plot_feature function in a loop for each feature.
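The exact implementation of plot_feature is not shown in the original post; a minimal seaborn-based sketch of such a helper (plot layout assumed) could look like this:

import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature(df, feature, target='action'):
    # Left: distribution of the feature split by target; right: box plot with target hue
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.kdeplot(data=df, x=feature, hue=target, ax=axes[0])
    sns.boxplot(data=df, x=target, y=feature, ax=axes[1])
    axes[0].set_title(f'{feature} distribution by {target}')
    axes[1].set_title(f'{feature} box plot by {target}')
    plt.tight_layout()
    plt.show()

for col in feature_columns:
    plot_feature(train_data, col)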

Observations from the plots

  • All features are approximately normally distributed, with means mostly around zero
  • Most features appear to be correlated

To have a better understanding we can plot the correlation matrix.
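A minimal sketch of that correlation heatmap (using the feature_columns list from earlier):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the anonymized features, visualized as a heatmap
corr = train_data[feature_columns].corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Feature correlation matrix')
plt.show()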

Feature Engineering

Observing the heat map, we can see that a bunch of features from feature_16 to feature_40 appear to be highly correlated. We can investigate in depth, and if we find high correlation we can drop those features.

We can use this idea and drop all the features which are highly correlated with one another. From the distribution of the mean correlations, we can observe that most values lie below around 0.32, so we could set this as the threshold, but we will take 0.5 so as to not lose too many columns.
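A rough sketch of dropping one feature from each highly correlated pair, using the 0.5 threshold above (the exact procedure used in the original notebook may differ):

import numpy as np

# Keep only the upper triangle so each pair is considered once
corr = train_data[feature_columns].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature correlated above 0.5 with an earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
reduced_features = [c for c in feature_columns if c not in to_drop]
print(f'dropping {len(to_drop)} features, keeping {len(reduced_features)}')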

Variance Engineering

Correlation is an indication of how changes in one variable relate to changes in another. We can plot the correlation matrix to show which variables have a high or low correlation with respect to one another. Correlation can be a significant tool for feature engineering when building machine learning models: predictors which are uncorrelated with the target variable are probably good candidates to remove from the model.

The heatmap of the less correlated features is shown below:

Heatmap

Although these features are uncorrelated with each other, they also do not have a good correlation with the output. This means these features on their own are not good enough to predict the action, but we can still use them, depending on how they compare with the features obtained after dimensionality reduction.

Missing values

In real-world data we have many missing values, and handling them is perhaps one of the biggest challenges faced by analysts, since making the right decision on how to handle them produces robust data models. Let us look at different ways of imputing the missing values: on one side, completely removing rows with missing values results in a robust and highly accurate model, but it has the drawback of losing information and data.

We can compute the mean, median, or mode of the feature and substitute it for the missing values. This is an approximation that can add variance to the data set, but the loss of data is avoided by this technique, which yields better results.

https://www.geeksforgeeks.org/python-visualize-missing-values-nan-values-using-missingno-library/

Following the above-mentioned link, we plotted the missing values in the dataset in 5 different portions, as we have a massive dataset:

import missingno as msno
msno.matrix(train_data[feature_columns[:26]])
msno missing values plot

On complete observation of the missing-value plots, we see that the following features have missing values:

  • 7, 8, 11, 12, 17, 18, 21, 22, 27, 28, 78, 80, 84, 86, 90, 92, 96, 98, 102, 104, 108, 110, 114, 116, 120, 121

Now let us count the missing values in this list of columns.

We can observe from the plot that, out of 2,390,491 rows, approximately 350,000 values are missing from the data set. We cannot simply remove these features, since the ratio of missing values is small, only around 16% of the data, as shown in the calculation below.
Instead, we will fill those values with the median of the particular feature.
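A minimal sketch of this median imputation (counting the missing values first, then filling each feature with its own median):

# Count missing values per feature, then fill them with that feature's median
missing_counts = train_data[feature_columns].isna().sum()
print(missing_counts[missing_counts > 0])

medians = train_data[feature_columns].median()
train_data[feature_columns] = train_data[feature_columns].fillna(medians)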

Dimensionality Reduction

With 129 features, we can find out how many of them contribute enough to the result. Although we are given this many features, only a few can be extracted which explain most of the data. So, moving forward, we will take two approaches to solve the dimensionality problem,

  • Create features
  • Extract the best features

And whichever gives better performance, we will use them for forecasting

SVD

Dimensionality reduction techniques not only reduce the complexity present in the data by decreasing the number of input variables, but they also lead to better model performance when making predictions on new data. The goal of projection methods is to reduce the number of dimensions in the feature space while preserving the most essential structure or relationships between the variables in the data.

Singular Value Decomposition (SVD) is one of the most popular techniques for dimensionality reduction in machine learning. It is a data preparation method that comes from the field of linear algebra and can be used to project a dataset into a lower-dimensional space prior to fitting a model. When the data is sparse, SVD may be the most widely used procedure for dimensionality reduction.

Dimensionality reduction with truncated SVD
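A minimal sketch of this step with scikit-learn's TruncatedSVD, reducing the feature space to the 16 components described below (the component count comes from the text; everything else is assumed):

from sklearn.decomposition import TruncatedSVD

# Project the 129 anonymized features down to 16 components
svd = TruncatedSVD(n_components=16, random_state=42)
svd_features = svd.fit_transform(train_data[feature_columns])
print('explained variance ratio:', svd.explained_variance_ratio_.sum())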

We used SVD to reduce the dimensionality from 129 features to 16. From the plot it is clear that these components are very weakly correlated with each other, but they are also weakly correlated with the target, which is not a good sign.

Another option to try is an autoencoder technique for dimensionality reduction.

Autoencoder

An autoencoder is an unsupervised artificial neural network. It finds a lower-dimensional representation of the data by focusing on the most important features and eliminating noise and redundancy. It uses an encoder-decoder architecture, in which the encoder converts high-dimensional data into lower-dimensional data, and the decoder reconstructs the original high-dimensional data from the lower-dimensional representation.

The procedure starts by compressing the original data into a short code, ignoring the noise. Then the algorithm decompresses that code to generate a reconstruction as close as possible to the original input.

Attached is the model structure we designed, where the bottleNeck_output layer contains the encoded features.
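The exact architecture is only shown as an image in the original post, so the following is just an illustrative Keras sketch with an encoder, a named bottleNeck_output layer, and a mirrored decoder (layer sizes and training settings are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

n_features = len(feature_columns)

# Encoder: compress the inputs down to a 16-dimensional bottleneck
inputs = keras.Input(shape=(n_features,))
x = layers.Dense(64, activation='relu')(inputs)
bottleneck = layers.Dense(16, activation='relu', name='bottleNeck_output')(x)

# Decoder: reconstruct the original features from the bottleneck
x = layers.Dense(64, activation='relu')(bottleneck)
outputs = layers.Dense(n_features, activation='linear')(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(train_data[feature_columns], train_data[feature_columns],
                epochs=10, batch_size=4096, validation_split=0.1)

# The encoder alone produces the encoded features used downstream
encoder = keras.Model(inputs, bottleneck)
encoded_features = encoder.predict(train_data[feature_columns])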

This generates the following output:

Autoencoder features correlation heat map

XGBoost

XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance. The name xgboost, though, actually refers to the engineering goal of pushing the limits of computational resources for boosted tree algorithms, which is the reason why many people use xgboost.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

We used XGBoost to select the best features out of the existing features (rather than creating new ones). The plot below shows the correlation structure of the resulting data.
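A rough sketch of this selection step, using an XGBoost classifier's feature importances to keep the top 16 features (the cut-off and the use of importances are assumptions; the post only states that XGBoost was used to pick the best existing features):

import pandas as pd
from xgboost import XGBClassifier

# Fit a quick XGBoost model on the action target and rank features by importance
xgb_selector = XGBClassifier(n_estimators=100, max_depth=6, n_jobs=-1)
xgb_selector.fit(train_data[feature_columns], train_data['action'])

importances = pd.Series(xgb_selector.feature_importances_, index=feature_columns)
best_features = importances.sort_values(ascending=False).head(16).index.tolist()
print(best_features)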

We can use both sets of features one by one. Visually it appears that the autoencoder features will perform better, as they have less mutual correlation compared to the XGBoost-selected ones, but let's implement both and try them out.

Performing Prediction

To perform the predictions, we created a report generator and a best-parameter finder function:

Report generator
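The report generator itself appears only as an image in the original post; as a sketch, a helper that prints the metrics compared later (confusion matrix, F1 score, AUC, accuracy) could look like this:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

def report(model, X_test, y_test):
    # Summarize a fitted classifier with the metrics used for model comparison
    y_pred = model.predict(X_test)
    print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
    print('f1 score :', f1_score(y_test, y_pred))
    print('auc score:', roc_auc_score(y_test, y_pred))
    print('accuracy :', accuracy_score(y_test, y_pred))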

Preprocessing

Preprocessing the data means splitting it into train and test sets in a 70:30 ratio, then scaling it: we fit and transform the scaler on the train data, while only transforming the test data based on the train fit, to avoid data leakage. Only the feature columns are used to train and validate the model, so we filter the columns accordingly.
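A minimal sketch of that split and scaling (here using the XGBoost-selected features as X; the autoencoder's encoded features could be substituted, and the exact split settings are assumed):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = train_data[best_features]          # or the autoencoder's encoded features
y = train_data['action']

# 70:30 split; the scaler is fit on train only to avoid leakage into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scl = StandardScaler()
X_train_scaled = scl.fit_transform(X_train)
X_test_scaled = scl.transform(X_test)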

Machine Learning Models

Accurate prediction of stock market returns is a very challenging task due to the volatile and non-linear nature of the financial stock markets. With the introduction of artificial intelligence and increased computational capabilities, programmed methods of prediction have proved to be more efficient in predicting stock prices.

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

parameters = {'alpha': [0.01, 0.1, 1, 10, 20, 30], 'penalty': ['l2', 'l1', 'elasticnet']}
lgr = SGDClassifier(loss='log')
lgr_params = RandomizedSearchCV(lgr, param_distributions=parameters, cv=4,
                                verbose=1, n_jobs=-1, scoring='accuracy')
lgr_params.fit(scl.fit_transform(X_train), y_train)

With RandomizedSearchCV we tuned the model for the best parameters and then trained the model on those parameters. The parameters include:

  • alpha as [0.01, 0.1, 1, 10, 20, 30]
  • penalty as [‘l2’,’l1',’elasticnet’]
logistic regression Metric

SVM

SVMs may make classification errors within the training data in order to minimize overall errors across the test data. A major advantage of SVMs is that they find a global optimum.
Support Vector Machines are among the best binary classifiers. They construct a decision boundary such that most points in one class fall on one side of the boundary, while most points in the other class fall on the other side. The ideal hyperplane is the one that maximizes the distance from the plane to any point; this distance is known as the margin. The maximum margin hyperplane (MMH) best separates the data.
Since a perfect separation may not always be possible, the SVM, a relatively recent learning algorithm, has the desirable characteristic of offering control over the decision function.

params = {'alpha': np.random.uniform(0.01, 100, 10), 'penalty': ['l2', 'l1', 'elasticnet'],
          'eta0': np.random.uniform(1, 100, 10)}
svm = SGDClassifier(loss='hinge', n_jobs=-1, shuffle=False)
svm_params = RandomizedSearchCV(svm, param_distributions=params, cv=4,
                                verbose=1, n_jobs=-1, scoring='accuracy')
svm_params.fit(scl.fit_transform(X_train), y_train)

With RandomizedSearchCV we tuned the model for the best parameters and then trained the model on those parameters. The parameters include:

  • alpha as np.random.uniform(0.01, 100,10)
  • penalty as [‘l2’,’l1',’elasticnet’]
  • eta0 as np.random.uniform(1,100,10)
Support Vector Machine metric

Decision Tree

In each iteration, the decision tree algorithm partitions the training dataset using the input attribute. After the selection of an appropriate split, each node further subdivides the training set into smaller subsets, until no further splitting is possible or when a stopping criterion is fulfilled. After the complete creation of the tree, it is pruned using certain pruning rules to reduce classification errors.

This is how a decision tree gets constructed which can be used for making stock price prediction in machine learning.

from sklearn.tree import DecisionTreeClassifier

parameters = {'max_depth': np.arange(3, 10, 2), 'criterion': ['gini', 'entropy']}
dtree = DecisionTreeClassifier()
dtree_search = RandomizedSearchCV(dtree, param_distributions=parameters, cv=3,
                                  n_jobs=-1, verbose=1, scoring='accuracy')
dtree_search.fit(scl.fit_transform(X_train), y_train)

With RandomizedSearchCV we tuned the model for the best parameters and then trained the model on those parameters. The parameters include:

  • max_depth as np.arange(3,10,2)
  • criterion as [“gini”, “entropy”]
Decision Tree Metric

Random Forest

Random Forests use an ensemble of Decision Trees to improve the accuracy of classification.

Random forests use an ensemble of many decision trees to reduce the effects of over-fitting. In a random forest, each tree is grown on a subset of the feature space.

Optuna is a software framework for automating the optimization of these hyperparameters. It automatically finds optimal hyperparameter values using different samplers, such as grid search, random search, Bayesian, and evolutionary algorithms.

We define the search space to our requirements. Once this is done, the objective function should take the trial as its input and be defined as objective(trial, train_data).

Once these lines are added to the code, the optimizer will sample the defined parameter space according to the sampler.
After the optimization is done, results can be accessed as a data frame via study.trials_dataframe().
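A minimal Optuna sketch along those lines, tuning a random forest (the search space, trial count, and scoring are assumptions; extra arguments such as the data are passed to the objective via a lambda):

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial, X, y):
    # Sample hyperparameters from the defined search space
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
    }
    rf = RandomForestClassifier(**params, n_jobs=-1)
    return cross_val_score(rf, X, y, cv=3, scoring='accuracy').mean()

# Optuna samples the search space and maximizes cross-validated accuracy
study = optuna.create_study(direction='maximize')
study.optimize(lambda trial: objective(trial, scl.fit_transform(X_train), y_train), n_trials=20)

print(study.best_params)
print(study.trials_dataframe().head())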

Thus, a result is calculated and plotted as a metric below.

Random Forest Metric

xgboost

Random forests work by finding the best threshold on which the feature space is recursively split, whereas GBDTs fit regressors to the training samples and find the best split of the aggregate of the regressor functions. XGBoost is a fairly recent invention.
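No code is shown for this model in the original post; as a sketch, training an XGBoost classifier on the same split could look like this (all hyperparameters assumed, and the report helper sketched earlier is used to print the metrics):

from xgboost import XGBClassifier

# Train a gradient-boosted tree classifier on the scaled training data
xgb_model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1, n_jobs=-1)
xgb_model.fit(scl.fit_transform(X_train), y_train)

report(xgb_model, scl.transform(X_test), y_test)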

Thus generating result metric as:

XGBOOST metric

Adaboost

We use a machine learning algorithm that is among the most popular and most successful for classification tasks, called AdaBoost.
One of the main properties that makes the application of AdaBoost to financial databases interesting is that it has shown, in many applications (albeit on non-financial databases), robustness against overfitting and produced, in many cases, low test-set error.

from sklearn.ensemble import AdaBoostClassifier

params = {'algorithm': ['SAMME', 'SAMME.R']}
adb = AdaBoostClassifier()
adb_search = RandomizedSearchCV(adb, param_distributions=params, cv=3,
                                n_jobs=-1, verbose=1, scoring='accuracy')
adb_search.fit(scl.fit_transform(X_train), y_train)

AdaBoost can find a stable function which discriminates between upward and downward movements better than randomly made decisions.

AdaBoost Metrics

Comparing ML Models

After all the models were trained, we plotted a comparison of all the models against the accuracy measures we took into consideration.

Confusion Matrix: explains the precision and recall of the prediction.

F1_score: This is a combination of precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall, thus providing a numerical value, to the score.

AUC Score: Tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.

Accuracy: It defines the percentage of correct prediction for the test data.

Model Comparison

Thus we observed that logistic regression performed best.

Future Work

As a future direction, we would like to perform a similar investigation with deep learning classifiers, including techniques like RNNs and neural networks, and extreme learning classifiers, with the help of a feature reduction algorithm based on the parameters used for stock market prediction. Along with this, we would also like to study and implement an economic growth model for stock market prediction and examine how such a model would affect stock market forecasting in comparison to the linear regression model and certain machine learning techniques.

References

  1. appliedaicourse.com
  2. Stock price prediction based on deep neural networks, by Pengfei Yu and Xuesong Yan
    (https://www.researchgate.net/publication/332488706_Stock_price_prediction_based_on_deep_neural_networks)
    As per the paper, the stock market is nonlinear financial data and a time-dependent problem, so a deep neural network, with the advantages of deep learning, performs more satisfactorily than conventional machine learning approaches. The authors use a DNN prediction model based on phase space reconstruction and Long Short-Term Memory to predict stock prices, compare the results against different models including ARIMA, and conclude that the DNN performs better than SVR or other machine learning techniques.
  3. https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learning-deep-learning-techniques-python/
    Since the data is time-series (moving-window) data, they use several different techniques to analyse and predict stock markets. Their output measures how correct the prediction is, whereas our requirement is to trade or pass, classified on the basis of the return the opportunity provides. The solution they provide uses 6 techniques and compares which performs better: Moving Average, Linear Regression, k-Nearest Neighbors, Auto ARIMA, Prophet, and Long Short-Term Memory (LSTM). All these techniques are good performers for sequential data, but LSTM, being a Recurrent Neural Network that weights the whole range of the sequence, performs best.
  4. Deep Learning for Price Movement Prediction Using Convolutional Neural Network and Long Short-Term Memory
    (https://www.hindawi.com/journals/mpe/2020/2746845/)
    This paper differentiates itself from other stock market papers by involving a CNN: the CNN extracts useful features from financial variables, and an LSTM is then used on top of them.
  5. [Solution] 1DCNN for Feature Extraction
    (https://www.kaggle.com/code1110/janestreet-1dcnn-for-feature-extraction-infer)
    In this solution, they use a bottleneck encoder to denoise the data (a bottleneck is used because it also preserves intra-feature relations), then do a purged time-series split of the data and feed it to the second phase, a multi-layer perceptron, to predict the classification probability.

My work

Please find me on:
