分析股票涨跌幅概率分布特征, 用PolarDB模拟逼真股票数据

+关注继续查看

背景

• 1、股票的日涨跌幅波动范围: [-10%, 10%] (这个应该是国内股市交易限制?)
• 2、日涨跌幅的幅度在[-10%, 10%]范围内符合高斯分布. 本文将介绍这个结论怎么得到的?
• 靠近0的最多, 靠近正负10%的概率逐渐回落.
• 类似这样的图形:

分析过程

1、一键部署PolarDB请参考:

《如何用 PolarDB 证明巴菲特的投资理念》

2、随便下载几只股票的数据: 茅台,ST热电,海立股份

https://zhuanlan.zhihu.com/p/65662875

curl "http://quotes.money.163.com/service/chddata.html?code=0600519&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600519.SH.csv
curl "http://quotes.money.163.com/service/chddata.html?code=0600619&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600619.SH.csv
curl "http://quotes.money.163.com/service/chddata.html?code=0600719&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600719.SH.csv

$iconv -f GBK -t UTF-8 ./0600519.SH.csv > ./1.csv$ iconv -f GBK -t UTF-8 ./0600619.SH.csv > ./2.csv
$iconv -f GBK -t UTF-8 ./0600719.SH.csv > ./3.csv [postgres@d6b4778340d1 ~]$ head -n 5 1.csv 2.csv 3.csv
==> 1.csv <==

2022-09-01,'600519,贵州茅台,1912.15,1880.89
2022-08-31,'600519,贵州茅台,1860.1,1924.0
2022-08-30,'600519,贵州茅台,1882.35,1870.0
2022-08-29,'600519,贵州茅台,1883.0,1878.82

==> 2.csv <==

2022-09-01,'600619,海立股份,6.77,6.67
2022-08-31,'600619,海立股份,7.06,6.77
2022-08-30,'600619,海立股份,7.3,7.19
2022-08-29,'600619,海立股份,7.0,7.26

==> 3.csv <==

2022-09-01,'600719,ST热电,5.01,4.9
2022-08-31,'600719,ST热电,5.34,5.05
2022-08-30,'600719,ST热电,5.38,5.32
2022-08-29,'600719,ST热电,5.33,5.38

2、将数据导入到PolarDB

create table t1 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t1 from '/home/postgres/1.csv' ( format csv, HEADER , quote '"');
delete from t1 where c4 =0 or c5=0 ;

create table t2 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t2 from '/home/postgres/2.csv' ( format csv, HEADER , quote '"');
delete from t2 where c4 =0 or c5=0 ;

create table t3 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t3 from '/home/postgres/3.csv' ( format csv, HEADER , quote '"');
delete from t3 where c4 =0 or c5=0 ;

3、分析涨跌幅的数据分布, 从结果来看, 涨跌幅度符合高斯分布.

select width_bucket(v, -0.1, 0.1, 10), count(*) from (
select (lag(c5) over w - c5)/c5 as v from t1 window w as (order by c1)
) t group by 1 order by 2 desc, 1 asc;

select width_bucket(v, -0.1, 0.1, 10), count(*) from (
select (lag(c5) over w - c5)/c5 as v from t2 window w as (order by c1)
) t group by 1 order by 2 desc, 1 asc;

select width_bucket(v, -0.1, 0.1, 10), count(*) from (
select (lag(c5) over w - c5)/c5 as v from t3 window w as (order by c1)
) t group by 1 order by 2 desc, 1 asc;

width_bucket | count
--------------+-------
6 |  1925
5 |  1813
4 |   528
7 |   459
3 |   130
8 |    91
2 |    23
9 |    21
1 |    20
11 |    11
10 |     5
|     1
(12 rows)

width_bucket | count
--------------+-------
6 |  1624
5 |  1570
4 |   658
7 |   575
8 |   201
3 |   178
1 |    80
11 |    71
9 |    67
2 |    57
10 |    37
|     1
(12 rows)

width_bucket | count
--------------+-------
5 |  1611
6 |  1576
4 |   599
7 |   576
8 |   203
3 |   177
9 |    70
1 |    63
11 |    49
2 |    47
10 |    26
|     1
(12 rows)

模拟过程

1、思路:

• 1、生成涨跌幅数据, 在[-10%, 10%]内按高斯分布.
• 2、用递归语法, 输入一个上市价格, 根据日涨跌幅得到上市后的每日价格.

2、建表, 存放pgbench生成的涨跌幅结果

create table tbl (id serial primary key, v numeric(20,3));

3、使用pgbench生成涨跌幅数据

vi test.sql
\set r random_gaussian(0, 20000, 5)
insert into tbl (v) values ((:r-10000)/100000.0);

pgbench -h 127.0.0.1 -n -r -f ./test.sql -c 1 -j 1 -t 5000

• random_gaussian, 生成 0-20000 的数据, 其中概率高密度分布在中间 10000. 5是random_gaussian的微调参数, 可以调整, 决定了中间的概率集中度.
• 这个随机数减去10000, 刚好得到正负10000 ([-10000, 10000]) 的范围, 再除以100000, 得到正负10% ([-10%, 10%])的范围.

select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
width_bucket | count
--------------+-------
6 |  1755
5 |  1661
7 |   701
4 |   668
8 |   112
3 |    87
9 |    10
2 |     5
10 |     1
(9 rows)

4、假设上市价格为38.101, 使用如下递归SQL, 使用生成的涨跌幅数据生成每交易日价格

create table tbl1 (c1 int, c5 numeric);

with recursive a as (
(select id, (38.101 * (1 + tbl.v))::numeric(20,3) as price from tbl order by id limit 1)
union all
(select tbl.id, (a.price * (1 + tbl.v))::numeric(20,3) from tbl join a on (tbl.id > a.id) where a.* is not null order by tbl.id limit 1)
)
insert into tbl1
select * from a
where a.* is not null;

INSERT 0 5000

5、随便选一个定投起点, 模拟的数据和真实数据一样, 也符合巴菲特的投资理念.

《如何用 PolarDB 证明巴菲特的投资理念》

607 | 37.385 | 32.2845 |              11.35 |  254000 |  294128.47 |     40128.47 |       507
608 | 37.759 | 32.2953 |              12.13 |  254500 |  297556.59 |     43056.59 |       508
609 | 37.835 | 32.3061 |              12.25 |  255000 |  298640.82 |     43640.82 |       509
610 | 37.986 | 32.3172 |              12.53 |  255500 |  300317.28 |     44817.28 |       510
611 | 38.822 | 32.3299 |              14.32 |  256000 |  307406.49 |     51406.49 |       511
612 | 40.025 | 32.3449 |              16.89 |  256500 |  317404.02 |     60904.02 |       512
613 | 39.665 | 32.3592 |              16.03 |  257000 |  315023.62 |     58023.62 |       513
614 | 38.673 | 32.3714 |              13.80 |  257500 |  307626.06 |     50126.06 |       514
615 | 38.712 | 32.3837 |              13.82 |  258000 |  308417.15 |     50417.15 |       515
616 | 39.293 | 32.3971 |              15.03 |  258500 |  313523.25 |     55023.25 |       516
617 | 39.647 | 32.4111 |              15.73 |  259000 |  316822.87 |     57822.87 |       517
618 | 40.123 | 32.4259 |              16.69 |  259500 |  321098.39 |     61598.39 |       518
...
1383 | 37.699 | 31.0363 |               6.10 |  642000 |  779819.74 |    137819.74 |      1283
1384 | 37.963 | 31.0417 |               6.33 |  642500 |  785755.81 |    143255.81 |      1284
1385 | 37.507 | 31.0468 |               5.91 |  643000 |  776795.88 |    133795.88 |      1285
1386 | 37.995 | 31.0522 |               6.34 |  643500 |  787377.68 |    143877.68 |      1286
1387 | 38.869 | 31.0582 |               7.13 |  644000 |  805958.09 |    161958.09 |      1287
1388 | 38.791 | 31.0642 |               7.04 |  644500 |  804809.78 |    160309.78 |      1288
1389 | 40.226 | 31.0713 |               8.34 |  645000 |  835038.75 |    190038.75 |      1289
1390 | 39.341 | 31.0777 |               7.52 |  645500 |  817131.93 |    171631.93 |      1290
1391 | 38.554 | 31.0835 |               6.79 |  646000 |  801256.65 |    155256.65 |      1291

c1  | price  |  round  | revenue_year_ratio | invest  |  v_value   | v_make_money | keep_days
------+--------+---------+--------------------+---------+------------+--------------+-----------
4463 | 56.423 | 37.1193 |               4.35 | 2182000 | 3316734.30 |   1134734.30 |      4363
(1 row)
select
c1, -- 日期
price, -- 当前价
round(cost_avg,4), -- 成本价
round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio, -- 年化收益率
rn * 500 as invest,  -- 截止当前总投入. (假设每个交易日投入500)
round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,  -- 当前持有股票的价值
round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money,  -- 赚了多少钱
c1-start_c1 as keep_days  -- 持有天数
from
(
select
c1,
c5 as price,
avg(c5) over w as cost_avg,
min(c1) over w as start_c1,
row_number() over w as rn
from tbl1
where c1 >= 100
-- 经济越低迷的时候股价越低, 从那时开始投入是比较好的.
-- 如果你的投入周期足够长, 可以从任意时间开始投入, 总会遇到可以收割的时候.
window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by c1;

select
c1, -- 日期
price, -- 当前价
round(cost_avg,4), -- 成本价
round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio, -- 年化收益率
rn * 500 as invest,  -- 截止当前总投入. (假设每个交易日投入500)
round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,  -- 当前持有股票的价值
round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money,  -- 赚了多少钱
c1-start_c1 as keep_days  -- 持有天数
from
(
select
c1,
c5 as price,
avg(c5) over w as cost_avg,
min(c1) over w as start_c1,
row_number() over w as rn
from tbl1
where c1 >= 100
-- 经济越低迷的时候股价越低, 从那时开始投入是比较好的.
-- 如果你的投入周期足够长, 可以从任意时间开始投入, 总会遇到可以收割的时候.
window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 desc limit 1;

3368 | 47.851 | 37.8497 |               6.57 | 734500 |  928582.39 |    194082.39 |      1468
3369 | 48.664 | 37.8571 |               7.09 | 735000 |  944818.44 |    209818.44 |      1469
3370 | 48.956 | 37.8646 |               7.27 | 735500 |  950944.72 |    215444.72 |      1470
3371 | 48.662 | 37.8719 |               7.06 | 736000 |  945693.31 |    209693.31 |      1471
3372 | 49.149 | 37.8796 |               7.37 | 736500 |  955613.33 |    219113.33 |      1472
3373 | 48.608 | 37.8869 |               7.01 | 737000 |  945554.49 |    208554.49 |      1473
3374 | 49.434 | 37.8947 |               7.54 | 737500 |  962075.98 |    224575.98 |      1474
3375 | 48.248 | 37.9017 |               6.75 | 738000 |  939456.96 |    201456.96 |      1475
3376 | 48.875 | 37.9091 |               7.15 | 738500 |  952123.66 |    213623.66 |      1476
3377 | 47.995 | 37.9160 |               6.56 | 739000 |  935445.21 |    196445.21 |      1477
3378 | 47.755 | 37.9226 |               6.40 | 739500 |  931233.85 |    191733.85 |      1478
3379 | 48.567 | 37.9298 |               6.92 | 740000 |  947528.69 |    207528.69 |      1479
3380 | 48.373 | 37.9369 |               6.78 | 740500 |  944205.93 |    203705.93 |      1480
...
3807 | 53.445 | 39.5058 |               6.75 | 954000 | 1290608.01 |    336608.01 |      1907
3808 | 53.659 | 39.5132 |               6.84 | 954500 | 1296211.63 |    341711.63 |      1908
3809 | 54.518 | 39.5211 |               7.25 | 955000 | 1317389.98 |    362389.98 |      1909
3810 | 53.973 | 39.5287 |               6.98 | 955500 | 1304653.62 |    349153.62 |      1910
(169 rows)

pgbench目前支持生成泊松、高斯、指数、随机分布的数据. 有兴趣的小伙伴可以学习一下, 文末提供了参考文档.

参考

1、width_bucket

https://www.postgresql.org/docs/15/functions-math.html

2、gaussian分布, 参数越大, 随机值的概率分布越集中在中间.

https://www.postgresql.org/docs/15/pgbench.html

\set r random_gaussian(0, 20000, 2.5)
\set r random_gaussian(0, 20000, 10)
\set r random_gaussian(0, 20000, 5)
-- 10的分布

postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
width_bucket | count
--------------+-------
6 |  9583
5 |  9511
4 |   459
7 |   447
(4 rows)

-- 5的分布

postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
width_bucket | count
--------------+-------
6 |  6852
5 |  6828
4 |  2766
7 |  2668
8 |   430
3 |   397
2 |    36
9 |    21
10 |     2
(9 rows)

-- 2.5的分布

postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
width_bucket | count
--------------+-------
5 |  3971
6 |  3791
4 |  3035
7 |  3028
3 |  1900
8 |  1872
9 |   898
2 |   855
10 |   330
1 |   322
(10 rows)
random ( lb, ub ) → integer
Computes a uniformly-distributed random integer in [lb, ub].
random(1, 10) → an integer between 1 and 10

random_exponential ( lb, ub, parameter ) → integer
Computes an exponentially-distributed random integer in [lb, ub], see below.
random_exponential(1, 10, 3.0) → an integer between 1 and 10

random_gaussian ( lb, ub, parameter ) → integer
Computes a Gaussian-distributed random integer in [lb, ub], see below.
random_gaussian(1, 10, 2.5) → an integer between 1 and 10

random_zipfian ( lb, ub, parameter ) → integer
Computes a Zipfian-distributed random integer in [lb, ub], see below.
random_zipfian(1, 10, 1.5) → an integer between 1 and 10

create table his (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);

select width_bucket(v,-0.1,0.1,10), count(*) from (
select (lag(c5) over w - c5)/c5 as v from his window w as (order by c1)
) t group by 1 order by 2 desc, 1 asc;

width_bucket | count
--------------+-------
6 |  1925
5 |  1813
4 |   528
7 |   459
3 |   130
8 |    91
2 |    23
9 |    21
1 |    20
11 |    11
10 |     5
|     1
(12 rows)
-- 5的分布
-- random_gaussian(0, 20000, 5)

width_bucket | count
--------------+-------
5 |  1711
6 |  1676
7 |   704
4 |   660
8 |   121
3 |   117
9 |     6
2 |     4
1 |     1
(9 rows)

ECS进阶训练营-DAY 4 打卡 使用PolarDB和ECS搭建门户网站

1189 0

585 0
ECS 7 天实践训练营 - day04 - 使用PolarDB和ECS搭建门户网站

381 0

185 0

ECS 7天实践训练营（进阶路线）--day04--使用ECS和PolarDB搭建门户网站
585 0
7天训练营-使用PolarDB和ECS搭建门户网站

211 0
+关注

AIOps 自适应机器学习异常检测

POLARDB 产品特性和通用业务场景

PolarDB MySQL引擎重磅功能及产品能力盛大发布