CFM Volatility Prediction for Financial Markets


I. CFM Volatility Prediction for Financial Markets


Competition page: CFM Volatility Prediction for Financial Markets — www.heywhale.com/home/activi…

This is a classic volatility regression task, a canonical trend-prediction problem in the stock market.


1. Task overview


The US stock market is the most liquid equity market on earth and therefore offers many investment opportunities. However, when constructing an equity portfolio from financial assets, we need to estimate its future risk, i.e. its volatility.


A stock's past volatility is a good proxy for its future risk, but some intraday patterns remain to be discovered algorithmically. This competition asks you to predict the volatility of stock prices for an investment firm. For each stock and each day you are given the volatility over consecutive 5-minute intervals together with the corresponding direction of the price move, and you must predict the volatility over the 2-hour window starting 5 minutes later.
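The organizers do not state how each 5-minute volatility value is computed. A common convention for this kind of data (an assumption here, not something confirmed by the competition description) is the realized volatility of the window, i.e. the square root of the sum of squared log returns:

import numpy as np

def realized_volatility(prices):
    # realized volatility of one window, assuming `prices` is a
    # hypothetical sequence of price ticks inside that 5-minute window
    log_returns = np.diff(np.log(prices))
    return np.sqrt(np.sum(log_returns ** 2))

print(realized_volatility([100.0, 100.2, 99.9, 100.1]))  # toy prices, ~0.0041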


2. Data description


The dataset comes from an investment firm's stock-market records. The training set has 101,601 rows and the test set has 25,535. To make the predictions more sensitive, the label is expressed on a percentage scale, i.e. scaled up by a factor of 100. The competition provides a training table (train_new.csv) and a test table (test_new.csv), plus a sample submission template.


The tables are described below:


train_new.csv


Field                         Meaning
ID                            Unique row identifier
date                          Date
product_id                    Stock identifier
volatility1 ~ volatility54    Volatility; adjacent fields (e.g. volatility1 and volatility2) cover consecutive 5-minute intervals
return1 ~ return54            Direction of the price move; adjacent fields (e.g. return1 and return2) cover consecutive 5-minute intervals
target                        Volatility over the 2-hour window starting 5 minutes after the volatility fields (the prediction target)

Note: the time windows of the volatility fields correspond one-to-one with those of the return fields.


test_new.csv


Field                         Meaning
ID                            Unique row identifier
date                          Date
product_id                    Stock identifier
volatility1 ~ volatility54    Volatility; adjacent fields (e.g. volatility1 and volatility2) cover consecutive 5-minute intervals
return1 ~ return54            Direction of the price move; adjacent fields (e.g. return1 and return2) cover consecutive 5-minute intervals

 

3. Data license


This is a public dataset released under CC0: Public Domain. The internationally accepted practice is to use the copyright licenses provided by the Creative Commons organization; one of these, CC0, dedicates a work to the public domain. By applying CC0, the rights holder dedicates the contribution to the public domain and, to the extent permitted by law, waives all copyright and related or neighboring rights in the work worldwide. Works marked "CC0" — text, images, audio, video — can be copied, modified, distributed and adapted without the rights holder's permission and without copyright risk, including for commercial purposes. In short, the data can be used freely, commercial use included.

 

II. Data analysis



1. Unpack the data


!unzip -qoa data/data126261/CFM对金融市场的波动性预测-数据集.zip


2. Import the required libraries


import pandas as pd
import numpy as np
import os
from tqdm import tqdm


3. Inspect the data


  • train: (101601, 112)
  • test: (25535, 111)
  • submit: (25535, 2)
path = './'
os.listdir(path)
['.ssh', 'train_new.csv', 'work', '.viminfo', '.pip', '.virtual_documents', '.conda', '.dataset.download', '.cache', 'submit_sample.csv', '.python_history', 'test_new.csv', 'data', '.condarc', '.systemlogs', '.ipython', '.local', '.node.started', '.bashrc', '.bash_logout', '.jupyter', '.config', '3441338.ipynb', '.homedata.success', '.profile', '.ipynb_checkpoints', '.bash_history']
# load the data
train = pd.read_csv(path + '/train_new.csv')
test  = pd.read_csv(path + '/test_new.csv')
submit = pd.read_csv(path + '/submit_sample.csv')
train.head()

ID date product_id volatility1 volatility2 volatility3 volatility4 volatility5 volatility6 volatility7 ... return46 return47 return48 return49 return50 return51 return52 return53 return54 target
0 0 1 1 0.662737 0.716896 0.698601 0.480172 0.623665 0.201876 0.327206 ... 1.0 1.0 -1.0 1.0 -1.0 0.0 1.0 1.0 -1.0 13.416821
1 1 2 1 1.341973 0.361853 0.361713 0.774088 0.609955 1.209693 0.160228 ... 1.0 1.0 1.0 -1.0 1.0 0.0 1.0 1.0 1.0 26.537525
2 2 3 1 1.369799 0.409785 0.590202 0.322052 3.293654 0.530644 0.981945 ... 1.0 -1.0 -1.0 -1.0 -1.0 1.0 1.0 1.0 -1.0 50.628539
3 3 4 1 0.460805 0.144150 0.086472 0.096159 0.220848 0.009604 0.433608 ... 0.0 1.0 1.0 -1.0 -1.0 1.0 1.0 -1.0 1.0 14.199094
4 4 5 1 0.223939 0.168005 0.224208 0.251931 0.056002 0.069960 0.083977 ... 1.0 1.0 1.0 -1.0 1.0 -1.0 1.0 -1.0 1.0 14.798843

5 rows × 112 columns

# check shapes
train.shape, test.shape, submit.shape
((101601, 112), (25535, 111), (25535, 2))
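Before building features it is worth a quick sanity check on missing values and the target's range. A minimal sketch (not in the original notebook, reusing the train/test frames loaded above):

# quick sanity checks on the loaded data
print(train.isna().sum().sum(), test.isna().sum().sum())  # total missing values
print(train['target'].describe())                         # distribution of the label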
# count rows per product_id (number of dates per product)
temp = train.groupby(['product_id'])['date'].count().reset_index(drop = False)
temp.columns = ['product_id', 'lengs']
temp.head()

product_id lengs
0 1 338
1 2 338
2 3 338
3 4 338
4 5 335
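Most products have around 338 rows (dates). To see how uniform this is across all products, a quick check such as the following could be added (illustrative, reusing the temp frame built above):

print(temp['lengs'].describe())            # spread of per-product row counts
print(temp['lengs'].value_counts().head()) # most common counts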
# first-order differences between adjacent 5-minute windows,
# for both the volatility series and the return-direction series
volatility_features = [f'volatility{i}' for i in range(1, 55)]
return_features = [f'return{i}' for i in range(1, 55)]
for f1 in tqdm(volatility_features[: -1]):
    indexs = volatility_features.index(f1)
    f2 = volatility_features[indexs + 1]
    train[f'{f2}_diff_{f1}'] = train[f2] - train[f1]
    test[f'{f2}_diff_{f1}']  = test[f2] - test[f1]
for f1 in tqdm(return_features[: -1]):
    indexs = return_features.index(f1)
    f2 = return_features[indexs + 1]
    train[f'{f2}_diff_{f1}'] = train[f2] - train[f1]
    test[f'{f2}_diff_{f1}']  = test[f2] - test[f1]
100%|██████████| 53/53 [00:00<00:00, 561.40it/s]
100%|██████████| 53/53 [00:00<00:00, 99.91it/s]
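Besides adjacent differences, row-wise aggregates over the 54 volatility columns (mean, standard deviation, maximum, last value) are a common next step for intraday series like this one. A sketch of such an extension, not part of the original pipeline:

# illustrative extra features: per-row statistics over the volatility columns
for df in (train, test):
    df['vol_mean'] = df[volatility_features].mean(axis=1)
    df['vol_std']  = df[volatility_features].std(axis=1)
    df['vol_max']  = df[volatility_features].max(axis=1)
    df['vol_last'] = df['volatility54']   # most recent 5-minute window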


III. Training


1. Evaluation function


def mean_absolute_error(y_true, y_pred):
    """     
    Mean Absolute Error,  .
    """
    return np.mean(np.abs((np.asarray(y_true) - np.asarray(y_pred)))) * 100
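Note the extra factor of 100 in this metric: the per-fold scores printed later (around 399) are simply LightGBM's validation l1 (around 3.99) scaled up by 100. A toy check:

print(mean_absolute_error([1.0, 2.0], [1.5, 2.5]))  # plain MAE 0.5 -> reported as 50.0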


2. LightGBM training


from sklearn.model_selection import KFold,StratifiedKFold
import lightgbm as lgb
features = [f for f in train.columns if f not in ['ID', 'target', 'product_id']]
print(len(features), features)
preds = np.zeros(test.shape[0])   # fold-averaged predictions on the test set
oof = np.zeros(train.shape[0])    # out-of-fold predictions on the training set
train_X = train[features]
train_Y = train['target']
val_label=[]
params = {'learning_rate': 0.12, 
        'boosting_type': 'gbdt', 
        'objective': 'regression_l1',
        'metric': 'mae',
        'min_child_samples': 46, 
        'min_child_weight': 0.01,
        'feature_fraction': 0.7, 
        'bagging_fraction': 0.7, 
        'bagging_freq': 2, 
        'num_leaves': 16, 
        'max_depth': 5, 
        'n_jobs': -1, 
        'seed': 2019, 
        'verbosity': -1, 
       }
folds = KFold(n_splits=5, shuffle=True, random_state=1996)
for fold_, (train_index, test_index) in enumerate(folds.split(train_X, train_Y)):
    print('Fold_{}'.format(fold_))
    train_x, test_x, train_y, test_y = train_X.iloc[train_index], train_X.iloc[test_index], train_Y.iloc[train_index], train_Y.iloc[test_index]
    trn_data = lgb.Dataset(train_x, train_y)
    val_data = lgb.Dataset(test_x, test_y)
    num_round=500
    clf = lgb.train(params, 
                    trn_data, 
                    num_round, 
                    valid_sets = [trn_data, val_data], 
                    verbose_eval = 50,
                    early_stopping_rounds = 50,
                  )
    val_train = clf.predict(test_x, num_iteration=clf.best_iteration)
    oof[test_index] = val_train
    scores = mean_absolute_error(oof[test_index], test_y)
    print('===scores===', scores)
    val_pred = clf.predict(test[features], num_iteration=clf.best_iteration)
    preds += val_pred / 5   # average the test predictions over the 5 folds
result_scores = mean_absolute_error(oof, train_Y)
print('===result_scores===', result_scores)
215 ['date', 'volatility1', 'volatility2', 'volatility3', 'volatility4', 'volatility5', 'volatility6', 'volatility7', 'volatility8', 'volatility9', 'volatility10', 'volatility11', 'volatility12', 'volatility13', 'volatility14', 'volatility15', 'volatility16', 'volatility17', 'volatility18', 'volatility19', 'volatility20', 'volatility21', 'volatility22', 'volatility23', 'volatility24', 'volatility25', 'volatility26', 'volatility27', 'volatility28', 'volatility29', 'volatility30', 'volatility31', 'volatility32', 'volatility33', 'volatility34', 'volatility35', 'volatility36', 'volatility37', 'volatility38', 'volatility39', 'volatility40', 'volatility41', 'volatility42', 'volatility43', 'volatility44', 'volatility45', 'volatility46', 'volatility47', 'volatility48', 'volatility49', 'volatility50', 'volatility51', 'volatility52', 'volatility53', 'volatility54', 'return1', 'return2', 'return3', 'return4', 'return5', 'return6', 'return7', 'return8', 'return9', 'return10', 'return11', 'return12', 'return13', 'return14', 'return15', 'return16', 'return17', 'return18', 'return19', 'return20', 'return21', 'return22', 'return23', 'return24', 'return25', 'return26', 'return27', 'return28', 'return29', 'return30', 'return31', 'return32', 'return33', 'return34', 'return35', 'return36', 'return37', 'return38', 'return39', 'return40', 'return41', 'return42', 'return43', 'return44', 'return45', 'return46', 'return47', 'return48', 'return49', 'return50', 'return51', 'return52', 'return53', 'return54', 'volatility2_diff_volatility1', 'volatility3_diff_volatility2', 'volatility4_diff_volatility3', 'volatility5_diff_volatility4', 'volatility6_diff_volatility5', 'volatility7_diff_volatility6', 'volatility8_diff_volatility7', 'volatility9_diff_volatility8', 'volatility10_diff_volatility9', 'volatility11_diff_volatility10', 'volatility12_diff_volatility11', 'volatility13_diff_volatility12', 'volatility14_diff_volatility13', 'volatility15_diff_volatility14', 'volatility16_diff_volatility15', 'volatility17_diff_volatility16', 'volatility18_diff_volatility17', 'volatility19_diff_volatility18', 'volatility20_diff_volatility19', 'volatility21_diff_volatility20', 'volatility22_diff_volatility21', 'volatility23_diff_volatility22', 'volatility24_diff_volatility23', 'volatility25_diff_volatility24', 'volatility26_diff_volatility25', 'volatility27_diff_volatility26', 'volatility28_diff_volatility27', 'volatility29_diff_volatility28', 'volatility30_diff_volatility29', 'volatility31_diff_volatility30', 'volatility32_diff_volatility31', 'volatility33_diff_volatility32', 'volatility34_diff_volatility33', 'volatility35_diff_volatility34', 'volatility36_diff_volatility35', 'volatility37_diff_volatility36', 'volatility38_diff_volatility37', 'volatility39_diff_volatility38', 'volatility40_diff_volatility39', 'volatility41_diff_volatility40', 'volatility42_diff_volatility41', 'volatility43_diff_volatility42', 'volatility44_diff_volatility43', 'volatility45_diff_volatility44', 'volatility46_diff_volatility45', 'volatility47_diff_volatility46', 'volatility48_diff_volatility47', 'volatility49_diff_volatility48', 'volatility50_diff_volatility49', 'volatility51_diff_volatility50', 'volatility52_diff_volatility51', 'volatility53_diff_volatility52', 'volatility54_diff_volatility53', 'return2_diff_return1', 'return3_diff_return2', 'return4_diff_return3', 'return5_diff_return4', 'return6_diff_return5', 'return7_diff_return6', 'return8_diff_return7', 'return9_diff_return8', 'return10_diff_return9', 'return11_diff_return10', 
'return12_diff_return11', 'return13_diff_return12', 'return14_diff_return13', 'return15_diff_return14', 'return16_diff_return15', 'return17_diff_return16', 'return18_diff_return17', 'return19_diff_return18', 'return20_diff_return19', 'return21_diff_return20', 'return22_diff_return21', 'return23_diff_return22', 'return24_diff_return23', 'return25_diff_return24', 'return26_diff_return25', 'return27_diff_return26', 'return28_diff_return27', 'return29_diff_return28', 'return30_diff_return29', 'return31_diff_return30', 'return32_diff_return31', 'return33_diff_return32', 'return34_diff_return33', 'return35_diff_return34', 'return36_diff_return35', 'return37_diff_return36', 'return38_diff_return37', 'return39_diff_return38', 'return40_diff_return39', 'return41_diff_return40', 'return42_diff_return41', 'return43_diff_return42', 'return44_diff_return43', 'return45_diff_return44', 'return46_diff_return45', 'return47_diff_return46', 'return48_diff_return47', 'return49_diff_return48', 'return50_diff_return49', 'return51_diff_return50', 'return52_diff_return51', 'return53_diff_return52', 'return54_diff_return53']
Fold_0
Training until validation scores don't improve for 50 rounds
[50]  training's l1: 4.33785  valid_1's l1: 4.51081
[100] training's l1: 4.10948  valid_1's l1: 4.32772
[150] training's l1: 3.98695  valid_1's l1: 4.24831
[200] training's l1: 3.88911  valid_1's l1: 4.19567
[250] training's l1: 3.81043  valid_1's l1: 4.16447
[300] training's l1: 3.74021  valid_1's l1: 4.13661
[350] training's l1: 3.67801  valid_1's l1: 4.11464
[400] training's l1: 3.62277  valid_1's l1: 4.09582
Fold_4
Training until validation scores don't improve for 50 rounds
[50]  training's l1: 4.41598  valid_1's l1: 4.45476
[100] training's l1: 4.15049  valid_1's l1: 4.25855
[150] training's l1: 4.03215  valid_1's l1: 4.18636
[200] training's l1: 3.93277  valid_1's l1: 4.12894
[250] training's l1: 3.84618  valid_1's l1: 4.08396
[300] training's l1: 3.77718  valid_1's l1: 4.05415
[350] training's l1: 3.71948  valid_1's l1: 4.03507
[400] training's l1: 3.66632  valid_1's l1: 4.0148
[450] training's l1: 3.61854  valid_1's l1: 4.00233
[500] training's l1: 3.57847  valid_1's l1: 3.99413
Did not meet early stopping. Best iteration is:
[500] training's l1: 3.57847  valid_1's l1: 3.99413
===scores=== 399.4129462926169
===result_scores=== 404.7503172560462
test['target'] = preds
test.head()

ID date product_id volatility1 volatility2 volatility3 volatility4 volatility5 volatility6 volatility7 ... return46_diff_return45 return47_diff_return46 return48_diff_return47 return49_diff_return48 return50_diff_return49 return51_diff_return50 return52_diff_return51 return53_diff_return52 return54_diff_return53 target
0 0 339 1 0.331584 0.060210 0.120524 0.350625 0.149978 0.150107 1.273166 ... 0.0 2.0 0.0 0.0 0.0 -2.0 2.0 0.0 -1.0 14.269864
1 1 340 1 0.564908 1.129115 0.640419 0.242404 0.207550 0.363404 0.294009 ... 0.0 0.0 2.0 -2.0 0.0 0.0 0.0 2.0 -1.0 25.299242
2 2 341 1 0.556154 0.122724 0.196152 0.065396 0.334392 0.106101 0.268816 ... -2.0 0.0 2.0 -2.0 0.0 2.0 -2.0 0.0 0.0 10.210525
3 3 342 1 0.122205 0.485541 0.211874 0.311488 0.060288 0.080384 0.320640 ... 0.0 -2.0 2.0 -1.0 1.0 -2.0 0.0 2.0 -1.0 14.601898
4 4 343 1 0.185981 0.104199 0.268454 0.111784 0.096849 0.208831 0.149069 ... 0.0 2.0 -2.0 0.0 2.0 -2.0 2.0 -2.0 0.0 11.692939

5 rows × 218 columns

test[['ID', 'target']].to_csv('./submit.csv', index = False)
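Before uploading, it is sensible to confirm that the file matches the provided template. The checks below are illustrative and assume submit_sample.csv uses the same ID order and the columns ID and target:

# sanity-check the submission file against the provided template (assumptions noted above)
sub = pd.read_csv('./submit.csv')
assert sub.shape == submit.shape
assert list(sub.columns) == list(submit.columns)
assert (sub['ID'].values == submit['ID'].values).all()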

