Constructing an Enhanced Alpha Model using Momentum, Mean Reversion, and Sentiment Analysis Factors
Utilizing Momentum 1 Year, Mean Reversion 5 Day Sector Neutral Smoothed, and Overnight Sentiment Smoothed Factors to Improve Investment Performance
The objective of this project is to construct an improved Alpha, which is a measure of investment performance that evaluates returns beyond what could be expected from market movements. To achieve this, we will use a combination of three distinct factors: Momentum 1 Year Factor, Mean Reversion 5 Day Sector Neutral Smoothed Factor, and Overnight Sentiment Smoothed Factor.
The Momentum 1 Year Factor is a measure of a stock's recent performance over a one-year period. This factor identifies stocks that have demonstrated strong returns over the past year and involves buying them with the expectation that they will continue to perform well in the future.
The Mean Reversion 5 Day Sector Neutral Smoothed Factor focuses on identifying stocks that have experienced a recent dip in price and aims to capitalize on their tendency to revert back to their mean price. This factor is sector-neutral, which means that it is not biased towards any specific industry or sector, making it easier to use in diversified portfolios. Additionally, this factor is smoothed to reduce noise and make it easier to interpret.
The Overnight Sentiment Smoothed Factor uses overnight (close-to-open) returns as a proxy for investor sentiment, following the hypothesis that information arriving outside of trading hours shows up in the next day's open. Trailing overnight returns are summed over a lookback window and then smoothed to reduce noise.
Combining these three factors into an enhanced Alpha can help investors identify attractive investment opportunities that may have been overlooked by other market participants. It's worth noting that the selection and weighting of these factors will require careful consideration, and they may need to be refined and adjusted over time based on market conditions and performance.
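As a point of reference, the simplest way to combine standardized factors is an equal-weight average of their z-scores. The sketch below is purely illustrative (the factor_df DataFrame and the column names are assumptions); the rest of this project instead learns the combination with a tree-based model.
import pandas as pd

# Purely illustrative sketch: a naive equal-weight combination of already z-scored factors.
# `factor_df` and the column names are assumptions; this project combines the factors with
# a random forest rather than fixed weights.
def equal_weight_alpha(factor_df):
    factor_columns = ['Momentum_1YR',
                      'Mean_Reversion_Sector_Neutral_Smoothed',
                      'Overnight_Sentiment_Smoothed']
    return factor_df[factor_columns].mean(axis=1)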
Load Packages:
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)
Data Pipeline
Data Bundle
We'll be using Zipline to handle our data. We've created an end-of-day data bundle for this project. Run the cell below to register this data bundle in Zipline.
import os
import project_helper  # project-specific helper module provided with this project
from zipline.data import bundles

os.environ['ZIPLINE_ROOT'] = os.path.join(os.getcwd(), '..', '..', 'data', 'project_7_eod')
ingest_func = bundles.csvdir.csvdir_equities(['daily'], project_helper.EOD_BUNDLE_NAME)
bundles.register(project_helper.EOD_BUNDLE_NAME, ingest_func)
print('Data Registered')
Build Pipeline Engine
We'll be using Zipline's pipeline package to access our data for this project. To use it, we must build a pipeline engine. Run the cell below to build the engine.
from zipline.pipeline import Pipeline
from zipline.pipeline.factors import AverageDollarVolume
from zipline.utils.calendars import get_calendar
universe = AverageDollarVolume(window_length=120).top(500)
trading_calendar = get_calendar('NYSE')
bundle_data = bundles.load(project_helper.EOD_BUNDLE_NAME)
engine = project_helper.build_pipeline_engine(bundle_data, trading_calendar)
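The factor window in the next section is anchored to a universe_end_date that is never set in the cells shown here. A minimal sketch, assuming it should be the last session available in the data bundle (the specific date below is an assumption and should be adjusted to match your data):
# Assumed end date for the universe; set this to the last trading session in the EOD bundle.
universe_end_date = pd.Timestamp('2016-01-05', tz='UTC')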
Alpha Factors
It's time to start working on the alpha factors. In this project, we'll use the following factors:
Momentum 1 Year Factor
Mean Reversion 5 Day Sector Neutral Smoothed Factor
Overnight Sentiment Smoothed Factor
from zipline.pipeline.factors import CustomFactor, DailyReturns, Returns, SimpleMovingAverage, AnnualizedVolatility
from zipline.pipeline.data import USEquityPricing
factor_start_date = universe_end_date - pd.DateOffset(years=3, days=2)
sector = project_helper.Sector()
def momentum_1yr(window_length, universe, sector):
    return Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()
def mean_reversion_5day_sector_neutral_smoothed(window_length, universe, sector):
    unsmoothed_factor = -Returns(window_length=window_length, mask=universe) \
        .demean(groupby=sector) \
        .rank() \
        .zscore()
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=window_length) \
        .rank() \
        .zscore()
class CTO(Returns):
    """
    Computes the overnight return, per hypothesis from
    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2554010
    """
    inputs = [USEquityPricing.open, USEquityPricing.close]

    def compute(self, today, assets, out, opens, closes):
        """
        The opens and closes matrix is 2 rows x N assets, with the most recent at the bottom.
        As such, opens[-1] is the most recent open, and closes[0] is the earlier close.
        """
        out[:] = (opens[-1] - closes[0]) / closes[0]
class TrailingOvernightReturns(Returns):
    """
    Sum of trailing 1m O/N returns
    """
    window_safe = True

    def compute(self, today, asset_ids, out, cto):
        out[:] = np.nansum(cto, axis=0)
def overnight_sentiment_smoothed(cto_window_length, trail_overnight_returns_window_length, universe):
    cto_out = CTO(mask=universe, window_length=cto_window_length)
    unsmoothed_factor = TrailingOvernightReturns(inputs=[cto_out], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()
    return SimpleMovingAverage(inputs=[unsmoothed_factor], window_length=trail_overnight_returns_window_length) \
        .rank() \
        .zscore()
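The feature cells in the next section add columns to a pipeline object that is never created in the cells above. A minimal sketch of how it might be built with the three alpha factors attached; the window lengths (252 trading days of momentum, 5-day mean reversion, and a 2-day close-to-open window summed over 10 days) are assumptions implied by the factor names.
from zipline.pipeline import Pipeline

# Build a single pipeline screened to the 500-stock universe and attach the three alpha factors.
# The window lengths here are assumptions implied by the factor names.
pipeline = Pipeline(screen=universe)
pipeline.add(momentum_1yr(252, universe, sector), 'Momentum_1YR')
pipeline.add(mean_reversion_5day_sector_neutral_smoothed(5, universe, sector), 'Mean_Reversion_Sector_Neutral_Smoothed')
pipeline.add(overnight_sentiment_smoothed(2, 10, universe), 'Overnight_Sentiment_Smoothed')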
Features and Labels
Let's create some features that we think will help the model make predictions.
"Universal" Quant Features
To capture the universe, we'll use the following as features:
Stock Volatility 20d, 120d
Stock Dollar Volume 20d, 120d
Sector
pipeline.add(AnnualizedVolatility(window_length=20, mask=universe).rank().zscore(), 'volatility_20d')
pipeline.add(AnnualizedVolatility(window_length=120, mask=universe).rank().zscore(), 'volatility_120d')
pipeline.add(AverageDollarVolume(window_length=20, mask=universe).rank().zscore(), 'adv_20d')
pipeline.add(AverageDollarVolume(window_length=120, mask=universe).rank().zscore(), 'adv_120d')
pipeline.add(sector, 'sector_code')
Regime Features
We are going to try to capture market-wide regimes. To do that, we'll use the following features:
High and low volatility 20d, 120d
High and low dispersion 20d, 120d
class MarketDispersion(CustomFactor):
    inputs = [DailyReturns()]
    window_length = 1
    window_safe = True

    def compute(self, today, assets, out, returns):
        # Returns are days in rows, assets across columns.
        out[:] = np.sqrt(np.nanmean((returns - np.nanmean(returns))**2))
pipeline.add(SimpleMovingAverage(inputs=[MarketDispersion(mask=universe)], window_length=20), 'dispersion_20d')
pipeline.add(SimpleMovingAverage(inputs=[MarketDispersion(mask=universe)], window_length=120), 'dispersion_120d')
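Two pieces that later cells rely on are not added in the cells shown above: the market-wide volatility regime features mentioned in the list, and the trailing 5-day return columns return_5d and return_5d_p that the target is built from in the Shift Target and IID Check sections. A minimal sketch of adding them before the pipeline is run in the next cell; the annualization constant, window lengths, and quantile bin counts are assumptions.
class MarketVolatility(CustomFactor):
    # Market-wide volatility regime feature (a sketch; the annualization constant is an assumption).
    inputs = [DailyReturns()]
    window_length = 1  # overridden in the constructor below
    window_safe = True

    def compute(self, today, assets, out, returns):
        # Cross-sectional mean return per day, then annualized standard deviation over the window.
        mkt_returns = np.nanmean(returns, axis=1)
        out[:] = np.sqrt(252. * np.nanmean((mkt_returns - np.nanmean(mkt_returns)) ** 2))

pipeline.add(MarketVolatility(window_length=20), 'market_vol_20d')
pipeline.add(MarketVolatility(window_length=120), 'market_vol_120d')

# Quantized trailing 5-day returns; 'return_5d' is shifted later to form the prediction target.
pipeline.add(Returns(window_length=5, mask=universe).quantiles(2), 'return_5d')
pipeline.add(Returns(window_length=5, mask=universe).quantiles(25), 'return_5d_p')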
Date Features
Let's make columns for the trees to split on that might capture trader/investor behavior due to calendar anomalies.
all_factors = engine.run_pipeline(pipeline, factor_start_date, universe_end_date)
all_factors['is_January'] = all_factors.index.get_level_values(0).month == 1
all_factors['is_December'] = all_factors.index.get_level_values(0).month == 12
all_factors['weekday'] = all_factors.index.get_level_values(0).weekday
all_factors['quarter'] = all_factors.index.get_level_values(0).quarter
all_factors['qtr_yr'] = all_factors.quarter.astype('str') + '_' + all_factors.index.get_level_values(0).year.astype('str')
all_factors['month_end'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BM'))
all_factors['month_start'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BMS'))
all_factors['qtr_end'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQ'))
all_factors['qtr_start'] = all_factors.index.get_level_values(0).isin(pd.date_range(start=factor_start_date, end=universe_end_date, freq='BQS'))
all_factors.head()
One Hot Encode Sectors
To help the model make use of the sector data, we'll one-hot encode it.
sector_lookup = pd.read_csv(
os.path.join(os.getcwd(), '..', '..', 'data', 'project_7_sector', 'labels.csv'),
index_col='Sector_i')['Sector'].to_dict()
sector_lookup
sector_columns = []
for sector_i, sector_name in sector_lookup.items():
    sector_column = 'sector_{}'.format(sector_name)
    sector_columns.append(sector_column)
    all_factors[sector_column] = (all_factors['sector_code'] == sector_i)
all_factors[sector_columns].head()
Shift Target
For the training target, we'll take each asset's trailing 5-day return and shift it back 5 days, so each row is labeled with the return realized over the following week.
all_factors['target'] = all_factors.groupby(level=1)['return_5d'].shift(-5)
all_factors[['return_5d','target']].reset_index().sort_values(['level_1', 'level_0']).head(10)
IID Check of Target
Let's see if the labels are independent and identically distributed. Because the 5-day return windows of adjacent days overlap, neighboring labels should be strongly correlated; the plot below checks this by computing the rank correlation between the target and the same label shifted by 1 to 4 days.
from scipy.stats import spearmanr

def sp(group, col1_name, col2_name):
    x = group[col1_name]
    y = group[col2_name]
    return spearmanr(x, y)[0]
all_factors['target_p'] = all_factors.groupby(level=1)['return_5d_p'].shift(-5)
all_factors['target_1'] = all_factors.groupby(level=1)['return_5d'].shift(-4)
all_factors['target_2'] = all_factors.groupby(level=1)['return_5d'].shift(-3)
all_factors['target_3'] = all_factors.groupby(level=1)['return_5d'].shift(-2)
all_factors['target_4'] = all_factors.groupby(level=1)['return_5d'].shift(-1)
g = all_factors.dropna().groupby(level=0)
for i in range(4):
    label = 'target_' + str(i + 1)
    ic = g.apply(sp, 'target', label)
    ic.plot(ylim=(-1, 1), label=label)
plt.legend(bbox_to_anchor=(1.04, 1), borderaxespad=0)
plt.title('Rolling Autocorrelation of Labels Shifted 1,2,3,4 Days')
plt.show()
Train/Valid/Test Splits
Now let's split the data into train, validation, and test datasets. Implement the function train_valid_test_split to split the input samples, all_x, and the target values, all_y, into train, validation, and test datasets. The proportion sizes are train_size, valid_size, and test_size respectively.
When splitting, make sure the datasets stay in chronological order: train first, then validation, then test. Say train_size is 0.7, valid_size is 0.2, and test_size is 0.1. The first 70 percent of all_x and all_y would be the train set, the next 20 percent would be the validation set, and the last 10 percent would be the test set. Make sure not to split a day across multiple datasets; each day should be contained within a single dataset.
def train_valid_test_split(all_x, all_y, train_size, valid_size, test_size):
    """
    Generate the train, validation, and test dataset.

    Parameters
    ----------
    all_x : DataFrame
        All the input samples
    all_y : Pandas Series
        All the target values
    train_size : float
        The proportion of the data used for the training dataset
    valid_size : float
        The proportion of the data used for the validation dataset
    test_size : float
        The proportion of the data used for the test dataset

    Returns
    -------
    x_train : DataFrame
        The train input samples
    x_valid : DataFrame
        The validation input samples
    x_test : DataFrame
        The test input samples
    y_train : Pandas Series
        The train target values
    y_valid : Pandas Series
        The validation target values
    y_test : Pandas Series
        The test target values
    """
    assert train_size >= 0 and train_size <= 1.0
    assert valid_size >= 0 and valid_size <= 1.0
    assert test_size >= 0 and test_size <= 1.0
    # Use np.isclose rather than == so that floating point sums such as 0.7 + 0.2 + 0.1 pass.
    assert np.isclose(train_size + valid_size + test_size, 1.0)

    # Split on whole trading days (level 0 of the index) so no day is shared across datasets.
    all_dates = all_x.index.levels[0]
    n_dates = len(all_dates)
    train_end = int(n_dates * train_size)
    valid_end = int(n_dates * (train_size + valid_size))

    train_dates = all_dates[:train_end]
    valid_dates = all_dates[train_end:valid_end]
    test_dates = all_dates[valid_end:]

    x_train = all_x.loc[train_dates[0]:train_dates[-1]]
    x_valid = all_x.loc[valid_dates[0]:valid_dates[-1]]
    x_test = all_x.loc[test_dates[0]:test_dates[-1]]
    y_train = all_y.loc[train_dates[0]:train_dates[-1]]
    y_valid = all_y.loc[valid_dates[0]:valid_dates[-1]]
    y_test = all_y.loc[test_dates[0]:test_dates[-1]]

    return x_train, x_valid, x_test, y_train, y_valid, y_test
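The classifier cells below fit on X_train and y_train and plot importances for a features list, none of which are created in the cells shown here. A minimal sketch of that step, assuming the column names built above and a 60/20/20 split (the proportions and the feature selection are assumptions).
# Assumed feature list: the three alpha factors plus the quant, regime, one-hot sector,
# and date features built above (column names follow the earlier cells).
features = ['Momentum_1YR', 'Mean_Reversion_Sector_Neutral_Smoothed', 'Overnight_Sentiment_Smoothed',
            'volatility_20d', 'volatility_120d', 'adv_20d', 'adv_120d',
            'dispersion_20d', 'dispersion_120d', 'market_vol_20d', 'market_vol_120d',
            'is_January', 'is_December', 'weekday', 'quarter',
            'month_end', 'month_start', 'qtr_end', 'qtr_start'] + sector_columns

temp = all_factors.dropna().copy()
X = temp[features]
y = temp['target']

# Chronological 60/20/20 split (the proportions are an assumption, not a tuned choice).
X_train, X_valid, X_test, y_train, y_valid, y_test = train_valid_test_split(X, y, 0.6, 0.2, 0.2)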
Random Forests
Visualize a Simple Tree
Let's see how a single tree would look using our data.
from IPython.display import display
from sklearn.tree import DecisionTreeClassifier
# This is to get consistent results between each run.
clf_random_state = 0
simple_clf = DecisionTreeClassifier(
    max_depth=3,
    criterion='entropy',
    random_state=clf_random_state)
simple_clf.fit(X_train, y_train)
display(project_helper.plot_tree_classifier(simple_clf, feature_names=features))
project_helper.rank_features_by_importance(simple_clf.feature_importances_, features)
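The single tree above is only for visualization. Since this section is about random forests, here is a hedged sketch of fitting a small ensemble and checking train and validation accuracy; the hyperparameters are assumptions, not tuned values from the project.
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters below are illustrative assumptions; in practice they would be tuned on the
# validation set (number of trees, depth, minimum samples per leaf, and so on).
clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    min_samples_leaf=100,
    criterion='entropy',
    n_jobs=-1,
    random_state=clf_random_state)
clf.fit(X_train, y_train)

print('Train accuracy:      {:.4f}'.format(clf.score(X_train, y_train)))
print('Validation accuracy: {:.4f}'.format(clf.score(X_valid, y_valid)))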
Model Results
Let's look at some additional metrics to see how well a model performs. We've created the function show_sample_results to show the following results of a model:
Sharpe Ratios
Factor Returns
Factor Rank Autocorrelation
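The pricing lookup in the next cell calls get_pricing with a data_portal, neither of which is defined in the cells shown here. A minimal sketch of how they might be built from the ingested bundle, assuming Zipline 1.x's DataPortal API (the constructor arguments and the helper's exact behavior are assumptions).
from zipline.data.data_portal import DataPortal

# Assumed construction of a DataPortal from the ingested bundle (Zipline 1.x API).
data_portal = DataPortal(
    bundle_data.asset_finder,
    trading_calendar=trading_calendar,
    first_trading_day=bundle_data.equity_daily_bar_reader.first_trading_day,
    equity_daily_reader=bundle_data.equity_daily_bar_reader,
    adjustment_reader=bundle_data.adjustment_reader)

def get_pricing(data_portal, trading_calendar, assets, start_date, end_date, field='close'):
    # Pull a daily history window covering all sessions between start_date and end_date.
    bar_count = len(trading_calendar.sessions_in_range(start_date, end_date))
    return data_portal.get_history_window(
        assets=assets,
        end_dt=end_date,
        bar_count=bar_count,
        frequency='1d',
        field=field,
        data_frequency='daily')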
import alphalens as al
all_assets = all_factors.index.levels[1].values.tolist()
all_pricing = get_pricing(
data_portal,
trading_calendar,
all_assets,
factor_start_date,
universe_end_date)
def show_sample_results(data, samples, classifier, factors, pricing=all_pricing):
    # Calculate the Alpha Score
    prob_array = [-1, 1]
    alpha_score = classifier.predict_proba(samples).dot(np.array(prob_array))

    # Add Alpha Score to rest of the factors
    alpha_score_label = 'AI_ALPHA'
    factors_with_alpha = data.loc[samples.index].copy()
    factors_with_alpha[alpha_score_label] = alpha_score

    # Setup data for AlphaLens
    print('Cleaning Data...\n')
    factor_data = project_helper.build_factor_data(factors_with_alpha[factors + [alpha_score_label]], pricing)
    print('\n-----------------------\n')

    # Calculate Factor Returns and Sharpe Ratio
    factor_returns = project_helper.get_factor_returns(factor_data)
    sharpe_ratio = project_helper.sharpe_ratio(factor_returns)

    # Show Results
    print(' Sharpe Ratios')
    print(sharpe_ratio.round(2))
    project_helper.plot_factor_returns(factor_returns)
    project_helper.plot_factor_rank_autocorrelation(factor_data)
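A hedged example of calling it on the training window; the factor name list and the fitted clf from the random-forest sketch above are assumptions for illustration.
# Hypothetical call: evaluate the three alpha factors plus the model's combined AI_ALPHA signal.
factor_names = [
    'Momentum_1YR',
    'Mean_Reversion_Sector_Neutral_Smoothed',
    'Overnight_Sentiment_Smoothed']
show_sample_results(all_factors, X_train, clf, factor_names)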
Hopefully, you're impressed by this outcome. Even though the individual factors' performance varied noticeably across the train, validation, and test sets, the combined AI ALPHA still achieved a favorable result.