I. CFM Volatility Prediction for Financial Markets
Competition page: CFM对金融市场的波动性预测 www.heywhale.com/home/activi…
This task is a classic stock-volatility regression problem, a canonical form of trend prediction in equity markets.
1. Task overview
The US stock market is the most liquid equity market in the world and therefore offers many investment opportunities. However, when assembling a portfolio from financial assets, we need to estimate its future risk, i.e. its volatility.
A stock's past volatility is a good proxy for its future risk, but some intraday patterns remain to be discovered algorithmically. This task asks you to predict stock-price volatility for an investment firm. For each stock and each trading day, the 5-minute-interval volatilities and the corresponding directions of price movement are provided, and the target is the volatility over the 2-hour window that starts 5 minutes later.
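The competition does not state exactly how its volatility values are computed, but a common definition, realized volatility as the square root of the sum of squared log returns over the 5-minute sampling points, gives a feel for the quantity being predicted. A minimal sketch (the function name and prices below are illustrative, not from the competition data):

```python
import numpy as np

def realized_volatility(prices):
    """Realized volatility from a sequence of prices sampled at 5-minute intervals."""
    log_returns = np.diff(np.log(prices))           # log return per 5-minute step
    return np.sqrt(np.sum(log_returns ** 2))        # root of summed squared returns

# Toy price path (made-up numbers)
prices = np.array([100.0, 100.4, 99.9, 100.2, 100.1])
vol = realized_volatility(prices)
```

Under this definition, larger price swings between sampling points directly increase the realized-volatility estimate.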
2. Data description
The dataset comes from an investment firm's stock-market records: the training set has 101,601 rows and the test set has 25,535 rows. To make the predictions more sensitive, the labels are expressed on a percentage scale, i.e. scaled up by a factor of 100. The competition provides two tables, a training set (train_new.csv) and a test set (test_new.csv), along with a sample-submission template.
The tables are described below:
train_new.csv

| Field | Meaning |
| --- | --- |
| ID | unique row identifier |
| date | date |
| product_id | stock identifier |
| volatility1 ~ volatility54 | volatility; consecutive columns are 5 minutes apart, e.g. volatility2 covers the 5-minute interval after volatility1 |
| return1 ~ return54 | direction of price movement; e.g. return2 covers the 5-minute interval after return1 |
| target | volatility over the 2-hour window starting 5 minutes after the volatility fields |

Note: the time ranges of the volatility fields correspond one-to-one with those of the return fields.
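Each row therefore packs a 54-step, 5-minute time series into wide columns. For per-stock time-series analysis it can help to reshape a row into long format; a sketch using a 3-step toy frame (the column values are made up, only the column-naming scheme matches the dataset):

```python
import pandas as pd

# Toy frame mimicking the train_new.csv layout with just 3 of the 54 steps
row = pd.DataFrame({
    "ID": [0],
    "volatility1": [0.66], "volatility2": [0.72], "volatility3": [0.70],
    "return1": [1.0], "return2": [-1.0], "return3": [1.0],
})

# Turn the wide volatilityN / returnN columns into one long series per ID
long = (
    pd.wide_to_long(row, stubnames=["volatility", "return"], i="ID", j="step")
    .reset_index()
    .sort_values("step")
    .reset_index(drop=True)
)
```

After the reshape, each row of `long` holds one 5-minute step with its volatility and direction, which is often more convenient for plotting or rolling computations.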
test_new.csv

| Field | Meaning |
| --- | --- |
| ID | unique row identifier |
| date | date |
| product_id | stock identifier |
| volatility1 ~ volatility54 | volatility; consecutive columns are 5 minutes apart, e.g. volatility2 covers the 5-minute interval after volatility1 |
| return1 ~ return54 | direction of price movement; e.g. return2 covers the 5-minute interval after return1 |
3. Data license
The data are public, released under CC0: Public Domain. The Creative Commons licenses are the most widely adopted copyright licenses internationally, and CC0 is the Creative Commons dedication that places a work in the public domain: the rights holder waives, to the extent permitted by law, all copyright and related or neighboring rights to the work worldwide. Works marked CC0, whether text, images, audio, or video, may be copied, modified, distributed, and adapted without the rights holder's permission and without copyright concerns, including for commercial purposes. In short, the data may be used freely, commercial use included.
II. Data Analysis
1. Unzip the data

```shell
!unzip -qoa data/data126261/CFM对金融市场的波动性预测-数据集.zip
```

2. Import required libraries
```python
import os

import numpy as np
import pandas as pd
from tqdm import tqdm
```
3. Inspect the data
- train: (101601, 112)
- test: (25535, 111)
- submit: (25535, 2)
```python
path = './'
os.listdir(path)
```

['.ssh', 'train_new.csv', 'work', '.viminfo', '.pip', '.virtual_documents', '.conda', '.dataset.download', '.cache', 'submit_sample.csv', '.python_history', 'test_new.csv', 'data', '.condarc', '.systemlogs', '.ipython', '.local', '.node.started', '.bashrc', '.bash_logout', '.jupyter', '.config', '3441338.ipynb', '.homedata.success', '.profile', '.ipynb_checkpoints', '.bash_history']
```python
# Load the data
train = pd.read_csv(path + '/train_new.csv')
test = pd.read_csv(path + '/test_new.csv')
submit = pd.read_csv(path + '/submit_sample.csv')
train.head()
```

| | ID | date | product_id | volatility1 | volatility2 | volatility3 | volatility4 | volatility5 | volatility6 | volatility7 | ... | return46 | return47 | return48 | return49 | return50 | return51 | return52 | return53 | return54 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 1 | 0.662737 | 0.716896 | 0.698601 | 0.480172 | 0.623665 | 0.201876 | 0.327206 | ... | 1.0 | 1.0 | -1.0 | 1.0 | -1.0 | 0.0 | 1.0 | 1.0 | -1.0 | 13.416821 |
| 1 | 1 | 2 | 1 | 1.341973 | 0.361853 | 0.361713 | 0.774088 | 0.609955 | 1.209693 | 0.160228 | ... | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 26.537525 |
| 2 | 2 | 3 | 1 | 1.369799 | 0.409785 | 0.590202 | 0.322052 | 3.293654 | 0.530644 | 0.981945 | ... | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 50.628539 |
| 3 | 3 | 4 | 1 | 0.460805 | 0.144150 | 0.086472 | 0.096159 | 0.220848 | 0.009604 | 0.433608 | ... | 0.0 | 1.0 | 1.0 | -1.0 | -1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 14.199094 |
| 4 | 4 | 5 | 1 | 0.223939 | 0.168005 | 0.224208 | 0.251931 | 0.056002 | 0.069960 | 0.083977 | ... | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | -1.0 | 1.0 | -1.0 | 1.0 | 14.798843 |

5 rows × 112 columns
```python
# Check shapes
train.shape, test.shape, submit.shape
```

((101601, 112), (25535, 111), (25535, 2))
```python
# Number of records per product_id
temp = train.groupby(['product_id'])['date'].count().reset_index(drop=False)
temp.columns = ['product_id', 'lengs']
temp.head()
```

| | product_id | lengs |
| --- | --- | --- |
| 0 | 1 | 338 |
| 1 | 2 | 338 |
| 2 | 3 | 338 |
| 3 | 4 | 338 |
| 4 | 5 | 335 |
```python
# First-difference features between consecutive 5-minute steps
volatility_features = [f'volatility{i}' for i in range(1, 55)]
return_features = [f'return{i}' for i in range(1, 55)]

for f1 in tqdm(volatility_features[:-1]):
    indexs = volatility_features.index(f1)
    f2 = volatility_features[indexs + 1]
    train[f'{f2}_diff_{f1}'] = train[f2] - train[f1]
    test[f'{f2}_diff_{f1}'] = test[f2] - test[f1]

for f1 in tqdm(return_features[:-1]):
    indexs = return_features.index(f1)
    f2 = return_features[indexs + 1]
    train[f'{f2}_diff_{f1}'] = train[f2] - train[f1]
    test[f'{f2}_diff_{f1}'] = test[f2] - test[f1]
```
100%|██████████| 53/53 [00:00<00:00, 561.40it/s] 100%|██████████| 53/53 [00:00<00:00, 99.91it/s]
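The loops above build the first-difference features one column pair at a time. The same features can be computed in a single vectorized step with `np.diff`; a sketch (`add_diff_features` is a hypothetical helper, not part of the original notebook, shown here on a 3-column toy frame):

```python
import numpy as np
import pandas as pd

def add_diff_features(df, prefix, n=54):
    """Add prefix{i+1}_diff_prefix{i} columns via one vectorized np.diff call."""
    cols = [f"{prefix}{i}" for i in range(1, n + 1)]
    diffs = np.diff(df[cols].to_numpy(), axis=1)               # shape: (rows, n-1)
    names = [f"{prefix}{i + 1}_diff_{prefix}{i}" for i in range(1, n)]
    return df.join(pd.DataFrame(diffs, columns=names, index=df.index))

# Toy check with n=3 (made-up values)
toy = pd.DataFrame({"volatility1": [1.0], "volatility2": [1.5], "volatility3": [1.2]})
out = add_diff_features(toy, "volatility", n=3)
```

The vectorized version produces identically named columns and avoids the repeated per-column `list.index` lookups of the loop version.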
III. Training
1. Evaluation metric
```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, scaled by 100."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) * 100
```
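This metric is the plain mean absolute error multiplied by 100, which is why the printed cross-validation scores land near 400 while LightGBM's own `l1` log shows values near 4. A quick sanity check of the scaling (toy numbers):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Plain MAE = (0.5 + 0.0 + 1.0) / 3 = 0.5; the metric reports 0.5 * 100 = 50.0
mae_x100 = np.mean(np.abs(y_true - y_pred)) * 100
```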
2. LightGBM training
```python
from sklearn.model_selection import KFold

import lightgbm as lgb

features = [f for f in train.columns if f not in ['ID', 'target', 'product_id']]
print(len(features), features)

preds = np.zeros(test.shape[0])
oof = np.zeros(train.shape[0])
train_X = train[features]
train_Y = train['target']

params = {
    'learning_rate': 0.12,
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mae',
    'min_child_samples': 46,
    'min_child_weight': 0.01,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.7,
    'bagging_freq': 2,
    'num_leaves': 16,
    'max_depth': 5,
    'n_jobs': -1,
    'seed': 2019,
    'verbosity': -1,
}

folds = KFold(n_splits=5, shuffle=True, random_state=1996)
for fold_, (train_index, test_index) in enumerate(folds.split(train_X, train_Y)):
    print('Fold_{}'.format(fold_))
    train_x, test_x = train_X.iloc[train_index], train_X.iloc[test_index]
    train_y, test_y = train_Y.iloc[train_index], train_Y.iloc[test_index]

    trn_data = lgb.Dataset(train_x, train_y)
    val_data = lgb.Dataset(test_x, test_y)
    num_round = 500
    clf = lgb.train(params, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=50,
                    early_stopping_rounds=50)

    # Out-of-fold predictions for this validation split
    val_train = clf.predict(test_x, num_iteration=clf.best_iteration)
    oof[test_index] = val_train
    scores = mean_absolute_error(oof[test_index], test_y)
    print('===scores===', scores)

    # Accumulate the test-set predictions, averaged over the 5 folds
    val_pred = clf.predict(test[features], num_iteration=clf.best_iteration)
    preds += val_pred / 5

result_scores = mean_absolute_error(oof, train_Y)
print('===result_scores===', result_scores)
```
215 ['date', 'volatility1', 'volatility2', …, 'volatility54', 'return1', …, 'return54', 'volatility2_diff_volatility1', …, 'volatility54_diff_volatility53', 'return2_diff_return1', …, 'return54_diff_return53'] (feature list truncated)
Fold_0
Training until validation scores don't improve for 50 rounds
[50]	training's l1: 4.33785	valid_1's l1: 4.51081
[100]	training's l1: 4.10948	valid_1's l1: 4.32772
[150]	training's l1: 3.98695	valid_1's l1: 4.24831
[200]	training's l1: 3.88911	valid_1's l1: 4.19567
[250]	training's l1: 3.81043	valid_1's l1: 4.16447
[300]	training's l1: 3.74021	valid_1's l1: 4.13661
[350]	training's l1: 3.67801	valid_1's l1: 4.11464
[400]	training's l1: 3.62277	valid_1's l1: 4.09582
Fold_4
Training until validation scores don't improve for 50 rounds
[50]	training's l1: 4.41598	valid_1's l1: 4.45476
[100]	training's l1: 4.15049	valid_1's l1: 4.25855
[150]	training's l1: 4.03215	valid_1's l1: 4.18636
[200]	training's l1: 3.93277	valid_1's l1: 4.12894
[250]	training's l1: 3.84618	valid_1's l1: 4.08396
[300]	training's l1: 3.77718	valid_1's l1: 4.05415
[350]	training's l1: 3.71948	valid_1's l1: 4.03507
[400]	training's l1: 3.66632	valid_1's l1: 4.0148
[450]	training's l1: 3.61854	valid_1's l1: 4.00233
[500]	training's l1: 3.57847	valid_1's l1: 3.99413
Did not meet early stopping. Best iteration is:
[500]	training's l1: 3.57847	valid_1's l1: 3.99413
===scores=== 399.4129462926169
===result_scores=== 404.7503172560462
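The fold loop accumulates out-of-fold (OOF) validation predictions and averages the five per-fold test predictions. The mechanics can be seen in isolation with the LightGBM model swapped for a trivial mean predictor (everything below is a toy stand-in, not the notebook's model or data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X.sum(axis=1)
X_test = rng.normal(size=(5, 3))

n_folds = 5
oof = np.zeros(len(y))           # one OOF prediction per training row
preds = np.zeros(len(X_test))    # test predictions averaged over folds
fold_indices = np.array_split(np.arange(len(y)), n_folds)

for val_idx in fold_indices:
    trn_idx = np.setdiff1d(np.arange(len(y)), val_idx)
    # "Model": predict the training-fold mean (stand-in for lgb.train + predict)
    fold_mean = y[trn_idx].mean()
    oof[val_idx] = fold_mean                 # each row predicted by the fold that held it out
    preds += fold_mean / n_folds             # running average of test predictions
```

Every training row receives exactly one OOF prediction, so the final `result_scores` computed on `oof` is an honest cross-validated estimate, while `preds` is a 5-model ensemble for the test set.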
```python
test['target'] = preds
test.head()
```

| | ID | date | product_id | volatility1 | volatility2 | volatility3 | volatility4 | volatility5 | volatility6 | volatility7 | ... | return46_diff_return45 | return47_diff_return46 | return48_diff_return47 | return49_diff_return48 | return50_diff_return49 | return51_diff_return50 | return52_diff_return51 | return53_diff_return52 | return54_diff_return53 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 339 | 1 | 0.331584 | 0.060210 | 0.120524 | 0.350625 | 0.149978 | 0.150107 | 1.273166 | ... | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | -2.0 | 2.0 | 0.0 | -1.0 | 14.269864 |
| 1 | 1 | 340 | 1 | 0.564908 | 1.129115 | 0.640419 | 0.242404 | 0.207550 | 0.363404 | 0.294009 | ... | 0.0 | 0.0 | 2.0 | -2.0 | 0.0 | 0.0 | 0.0 | 2.0 | -1.0 | 25.299242 |
| 2 | 2 | 341 | 1 | 0.556154 | 0.122724 | 0.196152 | 0.065396 | 0.334392 | 0.106101 | 0.268816 | ... | -2.0 | 0.0 | 2.0 | -2.0 | 0.0 | 2.0 | -2.0 | 0.0 | 0.0 | 10.210525 |
| 3 | 3 | 342 | 1 | 0.122205 | 0.485541 | 0.211874 | 0.311488 | 0.060288 | 0.080384 | 0.320640 | ... | 0.0 | -2.0 | 2.0 | -1.0 | 1.0 | -2.0 | 0.0 | 2.0 | -1.0 | 14.601898 |
| 4 | 4 | 343 | 1 | 0.185981 | 0.104199 | 0.268454 | 0.111784 | 0.096849 | 0.208831 | 0.149069 | ... | 0.0 | 2.0 | -2.0 | 0.0 | 2.0 | -2.0 | 2.0 | -2.0 | 0.0 | 11.692939 |

5 rows × 218 columns
```python
test[['ID', 'target']].to_csv('./submit.csv', index=False)
```