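Abstract: The date formats that the MaxCompute platform handles out of the box are zero-padded ones such as 20170725 or 2017/07/25, whereas the dates supplied in this Power AI competition are unpadded, for example 2016/1/1, which is non-standard from MaxCompute's point of view, so the ODPS SQL date functions cannot be applied to them directly. In addition, the weather data supplied for the competition is not numeric, which kept many teams from using it at all (the teams that have published their solutions so far mostly left out the weather condition, wind speed and wind direction fields), even though weather usually has a considerable influence on short-term load forecasting. This article explains in detail how to handle the competition's non-standard dates with MaxCompute ODPS SQL and how to process the weather data with OPEN_MR, and it walks through the whole pipeline on MaxCompute, from data preprocessing and feature extraction to the final predictions, using ODPS SQL, OPEN_MR and PAI commands. It is offered for reference, and comments and corrections are welcome.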
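Alibaba Cloud's MaxCompute platform offers powerful features and open interfaces, which make it convenient to process all kinds of data and to complete analysis and prediction quickly. Apart from the weather processing, which was done earlier in spare moments (less than a day in total), all the code described here was put together in the two days after the data was swapped for the semi-final round, which says a lot about how capable MaxCompute is. In fact, the only interruption is that the OPEN_MR part cannot currently be integrated into ODPS SQL on the competition platform, so the run has to be paused once; everything else runs as a batch with a single click of the "Run" button, going straight from the raw data to the submission. Note that the platform used here is the Tianchi competition platform, a restricted edition of MaxCompute tailored to keep competition data secure; the publicly available MaxCompute platform has fewer restrictions and more features.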
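I. Problem description
The main data source for the competition is the enterprise electricity consumption table Tianchi_power, which covers more than 1,000 enterprises in the high-tech zone of Yangzhong City (the data has been desensitized). It contains the enterprise ID (anonymized), the date and the electricity consumption. The fields are listed in the table below: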
tianchi_power
Column | Type | Meaning | Example
record_date | string | Date | 20150101
user_id | bigint | Enterprise ID | 1
power_consumption | bigint | Electricity consumption | 1031
... | ... | ... | ... |
The weather data table is tianchi_weather_data; its contents are shown in the table below:
Contestant submission table
tianchi_power_answer
Column | Type | Meaning | Example
predict_date | string | Date | 2016/9/1
power_consumption | bigint | Predicted electricity consumption | 1031
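II. Interpreting the problem
This is a short-term load forecasting problem. In 2010 the State Grid Corporation of China issued the enterprise standard Q/GDW 552-2010, 《电网短期超短期负荷预测技术规范》 (a technical specification for short-term and ultra-short-term power grid load forecasting), which covers the relevant terminology, what is to be forecast, the error formulas and the commonly used forecasting algorithms. Because the forecasts in this competition serve a different purpose, the task does not fully follow the forecasting content laid down in the standard (time granularity and forecast horizon), and the error metric is a custom formula as well. The nature of the problem is unchanged, though: it is still short-term load forecasting.
In our earlier work on ultra-short-term power output forecasting for photovoltaic plants, we found that missing values and numerical weather prediction data had the largest impact on accuracy. The State Grid enterprise standard also gives a rough overview of the factors that influence load forecasting:
Since social events and the like cannot be known in advance, in this competition we concentrated on missing values and weather data, focusing the work on three areas: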
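1) encode and transform the official weather data to build a complete set of weather features;
2) build an overfitted model to fill in missing values;
3) build model one on the corrected data to predict the trend and model two on the raw data to predict the consumption level (approximate magnitude), then combine the two models by weighted fusion.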
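The weighted fusion in step 3 can be pictured with a short Python sketch. The text does not state the fusion weights or the exact formula, so the linear blend, the function name `blend_predictions` and the default `trend_weight=0.5` are all illustrative assumptions, not the values used in the competition.

```python
def blend_predictions(trend_pred, level_pred, trend_weight=0.5):
    """Linearly blend two daily forecasts.

    trend_weight is a hypothetical parameter: the share given to the
    trend model (model one); the rest goes to the level model (model two).
    """
    if not 0.0 <= trend_weight <= 1.0:
        raise ValueError("trend_weight must be in [0, 1]")
    if len(trend_pred) != len(level_pred):
        raise ValueError("both forecasts must cover the same days")
    return [
        trend_weight * a + (1.0 - trend_weight) * b
        for a, b in zip(trend_pred, level_pred)
    ]


# Made-up numbers, only to show the call.
model_one = [100.0, 110.0, 120.0]
model_two = [90.0, 100.0, 130.0]
blended = blend_predictions(model_one, model_two, trend_weight=0.5)
```

In practice the weight would be chosen on a hold-out period rather than fixed in advance.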
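III. Data preprocessing
3.1 Handling non-standard dates
Using regexp_extract, the regular-expression string function provided by ODPS SQL, the year, month and day are extracted separately and then converted into a standard date format. The code is as follows:
-- Generate the total daily electricity consumption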
DROP TABLE IF EXISTS t_netivs_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_consumption AS
SELECT
*
,(year*10000+month*100+day) as day_int -- convert to the 20160101 format
,(month*100+day) as month_day
,(year*100+month) as year_month
,((year-2015)*12+month) as month_index
FROM
(
SELECT
*
,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',1) as bigint) as year -- extract the year
,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',2) as bigint) as month -- extract the month
,cast(regexp_extract(record_date,'(.*)/(.*)/(.*)',3) as bigint) as day -- extract the day
FROM
(
SELECT
record_date
,sum(power_consumption) as power_consumption
FROM
odps_tc_257100_f673506e024.tianchi_power2
GROUP BY
record_date
)t2
)t1
;
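With this code, non-standard dates such as 2016/1/1 are easily converted into bigint values such as 20160101. These can then be turned into a datetime with to_date(cast(xxx as string),'yyyymmdd'), after which the built-in ODPS SQL functions can be used to extract date features.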
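For checking the conversion off-platform, the same extraction can be mirrored in a few lines of Python. This is only a sketch of the logic in the SQL above: the function name `parse_record_date` is hypothetical, and the stricter digits-only pattern is a choice made here, not something the original query enforces.

```python
import re
from datetime import date

# Unpadded yyyy/m/d, e.g. "2016/1/1".
_DATE_RE = re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})$")


def parse_record_date(record_date):
    """Return the integer date features derived in the SQL above."""
    match = _DATE_RE.match(record_date.strip())
    if match is None:
        raise ValueError("unexpected date format: %r" % record_date)
    year, month, day = (int(part) for part in match.groups())
    date(year, month, day)  # raises ValueError for impossible dates
    return {
        "day_int": year * 10000 + month * 100 + day,
        "month_day": month * 100 + day,
        "year_month": year * 100 + month,
        "month_index": (year - 2015) * 12 + month,
    }


features = parse_record_date("2016/1/1")
```

Here `features["day_int"]` is 20160101 and `features["month_index"]` is 13, matching the column definitions in the query.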
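3.2 Implementing holidays
Because uploading and downloading data is in principle not allowed during the competition, the proper approach is to handle holidays with case when in ODPS SQL. The code for holidays and date features is given here:
-- Generate extended date features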
DROP TABLE IF EXISTS t_netivs_date_features;
CREATE TABLE IF NOT EXISTS t_netivs_date_features AS
SELECT
day_int
,day_index
,month_index
,year_index
,month
,day
,(month*100+day) as month_day
,(year*100+month) as year_month
,case when (weekday in (5,6) and special_workday == 0) or holiday==1 then 0 else 1 end as workday -- weekday() returns 0 (Monday) to 6 (Sunday)
,weekofyear
,day_to_lastday
,month_day_num
,weekday
,holiday
,special_workday
,special_holiday
,day1_before_special_holiday
,day2_before_special_holiday
,day3_before_special_holiday
,day1_before_holiday
,day2_before_holiday
,day3_before_holiday
,day1_after_special_holiday
,day2_after_special_holiday
,day3_after_special_holiday
,day1_after_holiday
,day2_after_holiday
,day3_after_holiday
FROM
(
SELECT
day_int
,dt
,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'dd')+1 as day_index
,datediff(dt,to_date('2015-01-01','yyyy-mm-dd'),'mm')+1 as month_index
,datepart(dt,'yyyy')-2015+1 as year_index
,datepart(dt,'yyyy') as year
,datepart(dt,'mm') as month
,datepart(dt,'dd') as day
,datepart(lastday(dt),'dd') as month_day_num
,weekofyear(dt) as weekofyear
,datediff(lastday(dt),dt,'dd') as day_to_lastday
,weekday(dt) as weekday
,holiday
,special_workday
,special_holiday
,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_special_holiday
,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_special_holiday
,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_special_holiday
,case when cast(to_char(dateadd(dt,-1,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_before_holiday
,case when cast(to_char(dateadd(dt,-2,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_before_holiday
,case when cast(to_char(dateadd(dt,-3,'dd'),'yyyymmdd') as bigint) in (20150101,20150218,20150404,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_before_holiday
,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day1_after_special_holiday
,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day2_after_special_holiday
,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150101,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as day3_after_special_holiday
,case when cast(to_char(dateadd(dt,1,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day1_after_holiday
,case when cast(to_char(dateadd(dt,2,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day2_after_holiday
,case when cast(to_char(dateadd(dt,3,'dd'),'yyyymmdd') as bigint) in (20150103,20150224,20150406,20150503,20150622,20150905,20150927,20151007,20160101,20160213,20160404,20160502,20160611,20160917,20161007) then 1 else 0 end as day3_after_holiday
FROM
(
SELECT
day_int
,to_date(to_char(day_int),'yyyymmdd') as dt
,case when day_int in (20150101,20150102,20150103,20150218,20150219,20150220,20150221,20150222,20150223,20150224,20150404,20150405,20150406,20150501,20150502,20150503,20150620,20150621,20150622,20150903,20150904,20150905,20150927,20151001,20151002,20151003,20151004,20151005,20151006,20151007,20160101,20160207,20160208,20160209,20160210,20160211,20160212,20160213,20160404,20160501,20160502,20160609,20160610,20160611,20160915,20160916,20160917,20161001,20161002,20161003,20161004,20161005,20161006,20161007) then 1 else 0 end as holiday
,case when day_int in (20150104,20150215,20150228,20150906,20151010,20160206,20160214,20160612,20160918,20161008,20161009) then 1 else 0 end as special_workday
,case when day_int in (20150101,20150218,20150219,20150305,20150405,20150501,20150620,20150903,20150927,20151001,20160101,20160207,20160208,20160404,20160501,20160609,20160915,20161001) then 1 else 0 end as special_holiday
FROM
t_netivs_tianchi_weather_data
)t1
)t2
;
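This code covers essentially all the common operations for normalizing date formats and extracting date features with ODPS SQL on MaxCompute; anyone doing time-series feature analysis and forecasting on MaxCompute can borrow and refine it for their own date features. Analysis of the competition data readily shows that holidays have a very large effect on total daily consumption and that the effect persists: consumption may rise or fall sharply as some holidays approach, and after some holidays it stays higher or lower for several consecutive days. Holidays are therefore handled in some detail here, with features added for 1, 2 and 3 days before and after a holiday.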
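The idea behind the before/after features can be sketched in Python as follows. The function name and the sign convention are assumptions made for this illustration: a date counts as "k days before" when the date k days later is a holiday start, and as "k days after" when the date k days earlier is a holiday end. It is worth checking that convention against the dateadd offsets in the SQL above before reusing either.

```python
from datetime import date, timedelta


def holiday_neighbor_flags(day, holiday_starts, holiday_ends, max_offset=3):
    """Return 0/1 flags for being 1..max_offset days before or after a holiday."""
    flags = {}
    for k in range(1, max_offset + 1):
        offset = timedelta(days=k)
        # k days before: the date k days ahead is the first day of a holiday.
        flags["day%d_before_holiday" % k] = int(day + offset in holiday_starts)
        # k days after: the date k days back is the last day of a holiday.
        flags["day%d_after_holiday" % k] = int(day - offset in holiday_ends)
    return flags


# 2016 National Day holiday: 20161001 and 20161007 both appear in the lists above.
starts = {date(2016, 10, 1)}
ends = {date(2016, 10, 7)}
before = holiday_neighbor_flags(date(2016, 9, 29), starts, ends)
after = holiday_neighbor_flags(date(2016, 10, 8), starts, ends)
```

Keeping the holiday start and end dates in sets makes each lookup constant-time and keeps the date lists in one place instead of repeating them per column.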
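3.3 Processing the weather data
As the weather data above shows, the weather condition, wind speed and wind direction fields are all strings and must be converted to numeric values before they can be used in a machine learning model. Since the set of possible strings is limited, one option is to sort the strings and use each string's rank as its code, feeding that directly to the model as a feature. This is simple and easy to implement with ODPS SQL's built-in row_number function. Its drawback is just as obvious: it makes no use of the relationships between weather types, for example between 大雨 (heavy rain) and 大到暴雨 (heavy rain to rainstorm). We therefore used OPEN_MR to process the weather data in detail, along the following lines: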
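1) find all the distinct values in the table and examine their composition and categories;
2) some entries contain a single weather type, such as 大雨 (heavy rain), 中雨 (moderate rain) or 小雨 (light rain), while others combine two, such as 大到暴雨 (heavy rain to rainstorm) or 多云转阴 (cloudy turning overcast); to unify them, an entry with a single type is represented by the same type twice;
3) for each weather entry, derive a weather type (clear, snow, rain, etc.), a weather level (light rain, moderate rain, heavy rain, rainstorm, etc., numbered from 1) and a weather combination (type plus level).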
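For comparison, the simpler rank-based encoding mentioned above (the row_number approach) amounts to something like the following Python sketch. The helper name `ordinal_encode` is hypothetical; the point is only that the resulting codes depend on sort order and carry no information about how the conditions relate to each other.

```python
def ordinal_encode(values):
    """Map each distinct string to its 1-based rank in sorted order."""
    return {value: rank for rank, value in enumerate(sorted(set(values)), start=1)}


# A few of the weather strings that occur in the data.
observed = ["大雨", "大到暴雨", "晴", "多云", "大雨"]
codes = ordinal_encode(observed)
```

Each distinct string gets a unique integer, but nothing in those integers says that 大雨 and 大到暴雨 are both rain, or which of them is heavier, which is the gap the type and level features are meant to close.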
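The weather data processed this way has the format created by the following ODPS SQL statement, which also serves as the output table of the OPEN_MR job:
-- Output table of the MapReduce job that processes the weather data
-- The December weather data provided online has already been processed together with the rest, so nothing needs to change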
-- DROP TABLE IF EXISTS t_netivs_encode_weather;
CREATE TABLE IF NOT EXISTS t_netivs_encode_weather (
day_int bigint
,temperature_high bigint
,temperature_low bigint
,weather1 bigint
,weather1_level bigint
,weather1_type bigint
,weather2 bigint
,weather2_level bigint
,weather2_type bigint
,wind_direction bigint
,wind_speed double
,wind_speed1 double
,wind_speed2 double
)
;
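To parse the weather data we wrote an OPEN_MR job; its core code is as follows:
The raw data is read in the mapper, processed, and the result is passed on to the reducer. The main flow is:
// Main flow of weather data processing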
public void weather_encode(long day_int, long temperature_high, long temperature_low, String weather, String wind_direction, String wind_speed, Record vals){
m_output_vals = vals;
m_day_int = day_int;
m_temp_high = temperature_high;
m_temp_low = temperature_low;
reset();
weather_parser(weather);
wind_direction_parser(wind_direction);
wind_speed_parse(wind_speed);
// Emit the features
output();
}
The code that converts the weather field into codes is as follows:
// -------------- Re-encode the weather field ---------------------------------------------//
private void weather_parser(String weather){
String weather1,weather2;
// If the last character is ~ the record is probably incomplete, so drop the trailing ~
if(weather.endsWith("~")){
weather = weather.substring(0, weather.length()-1);
}
weather = weather.replace("转", "~");
// Split the field into weather1 and weather2
if(weather.contains("~")){
weather1 = weather.split("~")[0];
weather2 = weather.split("~")[1];
}
else {
weather1= weather;
weather2 = weather;
}
// Start parsing weather1 and weather2
// Possible values: 小雨、小到中雨、中雨、中到大雨、大雨、大到暴雨、暴雨、阵雨、雷雨、雷阵雨、小雪、中雪、大雪、雨夹雪、晴、阴、多云
m_weather1 = get_weather_index(weather1); // re-encode the weather value
m_weather1_level = get_weather_level(weather1);
m_weather1_type = get_weather_type(weather1);
m_weather2 = get_weather_index(weather2);
m_weather2_level = get_weather_level(weather2);
m_weather2_type = get_weather_type(weather2);
}
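The helper functions get_weather_level and get_weather_type are not shown in the listing, so the following Python sketch is only a guess at what such a mapping might look like. Every numeric code, the separate sleet category and the function names are assumptions made for illustration; only the splitting rule (a trailing "~" is dropped, "转" separates two conditions, a single condition is duplicated) follows the Java code above.

```python
# Hypothetical code tables: the numeric values are illustrative only.
TYPE_CODES = {"晴": 1, "多云": 2, "阴": 3}
TYPE_RAIN, TYPE_SNOW, TYPE_SLEET, TYPE_UNKNOWN = 4, 5, 6, 0
LEVELS = {
    "小雨": 1, "小到中雨": 2, "中雨": 3, "中到大雨": 4,
    "大雨": 5, "大到暴雨": 6, "暴雨": 7,
    "小雪": 1, "中雪": 2, "大雪": 3,
}


def split_weather(weather):
    """Return (weather1, weather2); a single condition is duplicated."""
    text = weather.strip()
    if text.endswith("~"):  # incomplete record: drop the trailing "~"
        text = text[:-1]
    text = text.replace("转", "~")
    if "~" in text:
        first, second = text.split("~", 1)
        return first, second or first
    return text, text


def weather_type(name):
    """Coarse category of one condition (assumed grouping)."""
    if name in TYPE_CODES:
        return TYPE_CODES[name]
    if "雨" in name and "雪" in name:
        return TYPE_SLEET
    if "雪" in name:
        return TYPE_SNOW
    if "雨" in name:
        return TYPE_RAIN
    return TYPE_UNKNOWN


def encode_weather(weather):
    first, second = split_weather(weather)
    return {
        "weather1_type": weather_type(first),
        "weather1_level": LEVELS.get(first, 0),  # 0 = no intensity level
        "weather2_type": weather_type(second),
        "weather2_level": LEVELS.get(second, 0),
    }


example = encode_weather("小雨转多云")
```

Unlike a plain rank, the level table keeps 大雨 below 大到暴雨, and both share the rain type, so a tree model can split on either dimension.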
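3.4 Filling missing data with an overfitted model
With the code from the preceding parts, the load data can be converted and the date and weather features extracted quickly. Analysis of total daily consumption in November 2016 shows that the large customer 1416 has two days of missing consumption, which makes the totals for those two days abnormally low. This suggests: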
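1) classify users and handle each class separately;
2) fill in the missing consumption of such large customers to cancel out the effect of one-off events on the consumption trend, build a model on the filled data to predict the daily trend, and combine it with the predictions of a model built on the actual (unfilled) consumption to obtain the final result.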
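Because this model is built to fill in missing data rather than to predict the future, it should deliberately exploit the same user's consumption on both sides of the gap as well as other users' consumption in the same period. In other words, it is an overfitted model that "looks across" the day being predicted in order to fill the gaps better. The complete code for feature extraction and prediction with this overfitted gap-filling model is shown below:
-- After detailed analysis, the rules adopted for filling missing data are:
-- 1. If all 30 days of November are missing, set all historical consumption to 1;
-- 2. Apart from users with 30 missing days in November, users with non_default_power_consumption_median<2500 are not processed;
-- 3. Users with more than 30 missing days in total are not processed;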
DROP TABLE IF EXISTS t_netivs_user_missing_info;
CREATE TABLE IF NOT EXISTS t_netivs_user_missing_info AS
select
case when t11.user_id is not null then t11.user_id else t2.user_id end as user_id
,case when t11.missing_day_cnt is null then 0 else t11.missing_day_cnt end as missing_day_cnt
,case when t11.first_default_day_int is null then 0 else t11.first_default_day_int end as first_default_day_int
,case when t11.last_default_day_int is null then 0 else t11.last_default_day_int end as last_default_day_int
,case when t11.last1month_default_day_cnt is null then 0 else t11.last1month_default_day_cnt end as last1month_default_day_cnt
,case when t11.last2month_default_day_cnt is null then 0 else t11.last2month_default_day_cnt end as last2month_default_day_cnt
,case when t11.last3month_default_day_cnt is null then 0 else t11.last3month_default_day_cnt end as last3month_default_day_cnt
,case when t2.power_consumption_avg is null then 0 else t2.power_consumption_avg end as power_consumption_avg
,case when t2.power_consumption_median is null then 0 else t2.power_consumption_median end as power_consumption_median
,case when t2.power_consumption_max is null then 0 else t2.power_consumption_max end as power_consumption_max
,case when t2.power_consumption_min is null then 0 else t2.power_consumption_min end as power_consumption_min
,case when t2.first_non_default_day_int is null then 0 else t2.first_non_default_day_int end as first_non_default_day_int
,case when t2.last_non_default_day_int is null then 0 else t2.last_non_default_day_int end as last_non_default_day_int
from
(
select
*
from
(
select
user_id
,count(*) as missing_day_cnt
,min(day_int) as first_default_day_int
,max(day_int) as last_default_day_int
,SUM(case when day_int>=20161101 and day_int<20161201 then 1 else 0 end) as last1month_default_day_cnt
,SUM(case when day_int>=20161001 and day_int<20161101 then 1 else 0 end) as last2month_default_day_cnt
,SUM(case when day_int>=20160901 and day_int<20161001 then 1 else 0 end) as last3month_default_day_cnt
from
t_netivs_ext_power
where
power_consumption=1
group by
user_id
)t1
where missing_day_cnt>1
)t11
FULL OUTER JOIN
(
select
user_id
,avg(power_consumption) as power_consumption_avg
,median(power_consumption) as power_consumption_median
,max(power_consumption) as power_consumption_max
,min(power_consumption) as power_consumption_min
,min(day_int) as first_non_default_day_int
,max(day_int) as last_non_default_day_int
from
t_netivs_ext_power
where
power_consumption<>1
group by
user_id
)t2
ON t11.user_id = t2.user_id
;
-- Generate the list of user_ids (and days) to be filled with xgboost
DROP TABLE IF EXISTS t_netivs_xgb_fill_user_day_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_user_day_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_user_day_list AS
SELECT
user_id
,day_int
FROM
t_netivs_ext_power
WHERE
power_consumption =1 and user_id in
(
SELECT
user_id
FROM
t_netivs_user_missing_info
WHERE
power_consumption_median>2500 and missing_day_cnt<30 and missing_day_cnt>0
)
;
-- Generate the list of user_ids (and days) used to train the xgboost model
DROP TABLE IF EXISTS t_netivs_xgb_fill_train_user_list;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_user_list AS
SELECT
user_id
,day_int
FROM
t_netivs_ext_power
WHERE
power_consumption <> 1 and user_id in
(
SELECT
user_id
FROM
t_netivs_user_missing_info
WHERE
power_consumption_median>2500 and missing_day_cnt<30
)
;
-- Generate the list of user_ids whose historical data is to be cleared entirely
DROP TABLE IF EXISTS t_netivs_clear_historical_data_user_list;
CREATE TABLE IF NOT EXISTS t_netivs_clear_historical_data_user_list AS
SELECT
user_id
FROM
t_netivs_user_missing_info
WHERE
last1month_default_day_cnt=30
;
-- Generate the GBDT training set
DROP TABLE IF EXISTS t_netivs_gbdt_fill_consumption_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_consumption_features AS
SELECT
t1.user_id
,t1.day_int
,case when t2.weekly_power_consumption_avg is null then 0 else t2.weekly_power_consumption_avg end as weekly_power_consumption_avg
,case when t2.weekly_power_consumption_median is null then 0 else t2.weekly_power_consumption_median end as weekly_power_consumption_median
,case when t3.monthly_power_consumption_avg is null then 0 else t3.monthly_power_consumption_avg end as monthly_power_consumption_avg
,case when t3.monthly_power_consumption_median is null then 0 else t3.monthly_power_consumption_median end as monthly_power_consumption_median
,case when t4.last_weekly_power_consumption_avg is null then 0 else t4.last_weekly_power_consumption_avg end as last_weekly_power_consumption_avg
,case when t4.last_weekly_power_consumption_median is null then 0 else t4.last_weekly_power_consumption_median end as last_weekly_power_consumption_median
,case when t5.last_monthly_power_consumption_avg is null then 0 else t5.last_monthly_power_consumption_avg end as last_monthly_power_consumption_avg
,case when t5.last_monthly_power_consumption_median is null then 0 else t5.last_monthly_power_consumption_median end as last_monthly_power_consumption_median
FROM
t_netivs_ext_power t1
LEFT OUTER JOIN
(
SELECT
user_id
,weekofyear
,avg(power_consumption) as weekly_power_consumption_avg
,median(power_consumption) as weekly_power_consumption_median
FROM
t_netivs_ext_power
WHERE
power_consumption<>1
GROUP BY
user_id,weekofyear
)t2
ON t1.user_id = t2.user_id and t1.weekofyear = t2.weekofyear
LEFT OUTER JOIN
(
SELECT
user_id
,year_month
,avg(power_consumption) as monthly_power_consumption_avg
,median(power_consumption) as monthly_power_consumption_median
FROM
t_netivs_ext_power
WHERE
power_consumption<>1
GROUP BY
user_id,year_month
)t3
ON t1.user_id = t3.user_id and t1.year_month = t3.year_month
LEFT OUTER JOIN
(
SELECT
user_id
,weekofyear
,avg(power_consumption) as last_weekly_power_consumption_avg
,median(power_consumption) as last_weekly_power_consumption_median
FROM
t_netivs_ext_power
WHERE
power_consumption<>1
GROUP BY
user_id,weekofyear
)t4
ON t1.user_id = t4.user_id and t1.weekofyear = t4.weekofyear+1
LEFT OUTER JOIN
(
SELECT
user_id
,year_month
,avg(power_consumption) as last_monthly_power_consumption_avg
,median(power_consumption) as last_monthly_power_consumption_median
FROM
t_netivs_ext_power
WHERE
power_consumption<>1
GROUP BY
user_id,year_month
)t5
ON t1.user_id = t5.user_id and t1.year_month = t5.year_month+1
;
DROP TABLE IF EXISTS t_netivs_gbdt_fill_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_train_features AS
SELECT
t1.user_id
,t1.day_int
,t2.temperature_high
,t2.temperature_low
,t2.weather1
,t2.weather1_level
,t2.weather1_type
,t2.weather2
,t2.weather2_level
,t2.weather2_type
,t2.wind_direction
,t2.wind_speed
,t2.wind_speed1
,t2.wind_speed2
,t3.day_index
,t3.month_index
,t3.year_index
,t3.month
,t3.day
,t3.workday
,t3.weekday
,t3.holiday
,t3.special_workday
,t3.special_holiday
,t3.day1_before_special_holiday
,t3.day2_before_special_holiday
,t3.day3_before_special_holiday
,t3.day1_before_holiday
,t3.day2_before_holiday
,t3.day3_before_holiday
,t3.day1_after_special_holiday
,t3.day2_after_special_holiday
,t3.day3_after_special_holiday
,t3.day1_after_holiday
,t3.day2_after_holiday
,t3.day3_after_holiday
,t4.weekly_power_consumption_avg
,t4.weekly_power_consumption_median
,t4.monthly_power_consumption_avg
,t4.monthly_power_consumption_median
,t4.last_weekly_power_consumption_avg
,t4.last_weekly_power_consumption_median
,t4.last_monthly_power_consumption_avg
,t4.last_monthly_power_consumption_median
,t5.power_consumption
FROM
t_netivs_gbdt_fill_train_user_list t1
LEFT OUTER JOIN
t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
t_netivs_date_features t3
ON t1.day_int = t3.day_int
LEFT OUTER JOIN
t_netivs_gbdt_fill_consumption_features t4
ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
LEFT OUTER JOIN
t_netivs_ext_power t5
ON t1.user_id = t5.user_id and t1.day_int = t5.day_int
;
-- Generate the test set for the GBDT fill
DROP TABLE IF EXISTS t_netivs_gbdt_fill_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_gbdt_fill_test_features AS
SELECT
t1.user_id
,t1.day_int
,t2.temperature_high
,t2.temperature_low
,t2.weather1
,t2.weather1_level
,t2.weather1_type
,t2.weather2
,t2.weather2_level
,t2.weather2_type
,t2.wind_direction
,t2.wind_speed
,t2.wind_speed1
,t2.wind_speed2
,t3.day_index
,t3.month_index
,t3.year_index
,t3.month
,t3.day
,t3.workday
,t3.weekday
,t3.holiday
,t3.special_workday
,t3.special_holiday
,t3.day1_before_special_holiday
,t3.day2_before_special_holiday
,t3.day3_before_special_holiday
,t3.day1_before_holiday
,t3.day2_before_holiday
,t3.day3_before_holiday
,t3.day1_after_special_holiday
,t3.day2_after_special_holiday
,t3.day3_after_special_holiday
,t3.day1_after_holiday
,t3.day2_after_holiday
,t3.day3_after_holiday
,t4.weekly_power_consumption_avg
,t4.weekly_power_consumption_median
,t4.monthly_power_consumption_avg
,t4.monthly_power_consumption_median
,t4.last_weekly_power_consumption_avg
,t4.last_weekly_power_consumption_median
,t4.last_monthly_power_consumption_avg
,t4.last_monthly_power_consumption_median
FROM
t_netivs_gbdt_fill_user_day_list t1
LEFT OUTER JOIN
t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
t_netivs_date_features t3
ON t1.day_int = t3.day_int
LEFT OUTER JOIN
t_netivs_gbdt_fill_consumption_features t4
ON t1.user_id = t4.user_id and t1.day_int = t4.day_int
;
-- Generate the fill values with xgb
DROP TABLE IF EXISTS t_netivs_xgb_fill_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_xgb_fill_model;
-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="3500"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_gbdt_fill_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.4"
-Dcolsample_bytree="0.6"
-DmodelName="m_xgb_fill_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-Dbase_score="0.11"
-Dmin_child_weight="100"
-DkvDelimiter=":";
-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="user_id,day_int"
-DmodelName="m_xgb_fill_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_xgb_fill_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="user_id,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,day_index,month_index,year_index,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday,wind_speed,wind_speed1,wind_speed2,weekly_power_consumption_avg,weekly_power_consumption_median,monthly_power_consumption_avg,monthly_power_consumption_median,last_weekly_power_consumption_avg,last_weekly_power_consumption_median,last_monthly_power_consumption_avg,last_monthly_power_consumption_median"
-DinputTableName="t_netivs_gbdt_fill_test_features"
-DenableSparse="false";
SELECT * FROM t_netivs_xgb_fill_prediction_result ORDER BY prediction_result desc limit 100;
-- Generate the corrected per-user daily consumption records
-- All user_ids in t_netivs_clear_historical_data_user_list are cleared
DROP TABLE IF EXISTS t_netivs_fixed_ext_power;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_ext_power AS
SELECT
t1.user_id
,t1.day_int
--,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) when t3.user_id is not null then 1 else power_consumption end as bigint) as power_consumption
,cast(case when t2.prediction_result is not null then round(t2.prediction_result,0) else power_consumption end as bigint) as power_consumption
FROM
t_netivs_ext_power t1
LEFT OUTER JOIN
t_netivs_xgb_fill_prediction_result t2
ON t1.user_id = t2.user_id and t1.day_int = t2.day_int
LEFT OUTER JOIN
t_netivs_clear_historical_data_user_list t3
ON t1.user_id = t3.user_id
;
-- Generate the corrected total daily consumption
DROP TABLE IF EXISTS t_netivs_fixed_daily_sum_consumption;
CREATE TABLE IF NOT EXISTS t_netivs_fixed_daily_sum_consumption AS
SELECT
t1.day_int
,t1.power_consumption
,t2.fixed_power_consumption
,t3.day_index
,t3.month_index
,t3.year_index
,t3.month
,t3.day
,t3.month_day
,t3.year_month
,t3.workday
,t3.weekofyear
,t3.day_to_lastday
,t3.weekday
,t3.holiday
,t3.special_workday
,t3.special_holiday
,t3.day1_before_special_holiday
,t3.day2_before_special_holiday
,t3.day3_before_special_holiday
,t3.day1_before_holiday
,t3.day2_before_holiday
,t3.day3_before_holiday
,t3.day1_after_special_holiday
,t3.day2_after_special_holiday
,t3.day3_after_special_holiday
,t3.day1_after_holiday
,t3.day2_after_holiday
,t3.day3_after_holiday
FROM
t_netivs_daily_sum_consumption t1
LEFT OUTER JOIN
(
SELECT
day_int
,SUM(power_consumption) as fixed_power_consumption
FROM
t_netivs_fixed_ext_power
GROUP BY
day_int
)t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
t_netivs_date_features t3
ON t1.day_int = t3.day_int
;
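The user classification behind the gap-filling rules can be summarized in a small Python sketch. It follows the thresholds used in the SQL above (consumption equal to 1 marks a missing day, a median above 2500, fewer than 30 missing days, and 30 missing days in November 2016), but the function names and the returned labels are hypothetical and exist only for this illustration.

```python
from statistics import median

MISSING_MARKER = 1  # the SQL above treats power_consumption = 1 as "missing"


def summarise_user(rows):
    """rows: list of (day_int, power_consumption) pairs for one user."""
    missing_days = [day for day, value in rows if value == MISSING_MARKER]
    if len(missing_days) <= 1:  # mirrors "where missing_day_cnt>1"
        missing_days = []
    valid = [value for _, value in rows if value != MISSING_MARKER]
    return {
        "missing_day_cnt": len(missing_days),
        "last1month_missing_cnt": sum(
            1 for day in missing_days if 20161101 <= day < 20161201
        ),
        "median": median(valid) if valid else 0,
    }


def classify_user(info):
    """Return which treatment a user receives (labels are illustrative)."""
    if info["last1month_missing_cnt"] == 30:
        return "clear_history"
    if info["median"] > 2500 and 0 < info["missing_day_cnt"] < 30:
        return "fill_with_model"
    return "leave_as_is"


# Made-up users covering November 2016 only.
november = [20161100 + d for d in range(1, 31)]
big_user = [(d, 1 if d in (20161110, 20161111) else 3000) for d in november]
idle_user = [(d, 1) for d in november]
small_user = [(d, 1 if d in (20161110, 20161111) else 800) for d in november]
labels = [classify_user(summarise_user(u)) for u in (big_user, idle_user, small_user)]
```

With these inputs the large user with two gaps is routed to the model-based fill, the user with no November data is cleared, and the small user is left untouched.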
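IV. Model building and fusion
For this problem, the approach was to use two models to predict the trend and the consumption level separately and then fuse them, as shown in the figure below:
The feature extraction and model building code for model one is as follows:
-- Extract features of the total daily consumption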
DROP TABLE IF EXISTS t_netivs_daily_sum_features;
CREATE TABLE IF NOT EXISTS t_netivs_daily_sum_features AS
SELECT
t1.day_int
,t1.last_month_same_day_consumption
--,t1.last_year_same_day_consumption
,t2.last_month_power_consumption_avg
,t2.last_month_power_consumption_median
,t2.last_month_power_consumption_stddev
,t2.last_month_weekday1_avg
,t2.last_month_weekday1_median
,t2.last_month_weekday0_avg
,t2.last_month_weekday0_median
,t2.last_month_workday1_avg
,t2.last_month_workday1_median
,t2.last_month_workday0_avg
,t2.last_month_workday0_median
,t2.last_month_last3day_avg
,t2.last_month_last3day_median
,t2.last_month_last7day_avg
,t2.last_month_last7day_median
,t2.last_month_first3day_avg
,t2.last_month_first3day_median
,t2.last_month_first7day_avg
,t2.last_month_first7day_median
,t2.last_month_middle_avg
,t2.last_month_middle_median
FROM
(
SELECT
t11.day_int
,t21.power_consumption as last_month_same_day_consumption
--,t31.power_consumption as last_year_same_day_consumption
FROM
(
SELECT
day_int
,day
,day_to_lastday
,case when day<=15 then cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm'),'yyyymmdd') as bigint)
else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm')),-day_to_lastday,'dd'),'yyyymmdd') as bigint) end
as last_month_same_day
--,cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'yyyy'),'yyyymmdd') as bigint) as last_year_same_day
FROM
t_netivs_date_features
WHERE
day_int>=20150201
)t11
LEFT OUTER JOIN
t_netivs_fixed_daily_sum_consumption t21
ON t11.last_month_same_day = t21.day_int
--LEFT OUTER JOIN
-- t_netivs_fixed_daily_sum_consumption t31
--ON t11.last_year_same_day = t31.day_int
)t1
LEFT OUTER JOIN
(
SELECT
t1.day_int
,t2.last_month_power_consumption_avg
,t2.last_month_power_consumption_median
,t2.last_month_power_consumption_stddev
,t2.last_month_weekday1_avg
,t2.last_month_weekday1_median
,t2.last_month_weekday0_avg
,t2.last_month_weekday0_median
,t2.last_month_workday1_avg
,t2.last_month_workday1_median
,t2.last_month_workday0_avg
,t2.last_month_workday0_median
,t2.last_month_last3day_avg
,t2.last_month_last3day_median
,t2.last_month_last7day_avg
,t2.last_month_last7day_median
,t2.last_month_first3day_avg
,t2.last_month_first3day_median
,t2.last_month_first7day_avg
,t2.last_month_first7day_median
,t2.last_month_middle_avg
,t2.last_month_middle_median
FROM
(
SELECT * FROM t_netivs_date_features WHERE month_index>1
)t1
LEFT OUTER JOIN
(
SELECT
month_index
,avg(fixed_power_consumption) as last_month_power_consumption_avg
,median(fixed_power_consumption) as last_month_power_consumption_median
,stddev(fixed_power_consumption) as last_month_power_consumption_stddev
,avg(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_avg
,median(case when weekday=1 then fixed_power_consumption else null end) as last_month_weekday1_median
,avg(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_avg
,median(case when weekday=0 then fixed_power_consumption else null end) as last_month_weekday0_median
,avg(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_avg
,median(case when workday=1 then fixed_power_consumption else null end) as last_month_workday1_median
,avg(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_avg
,median(case when workday=0 then fixed_power_consumption else null end) as last_month_workday0_median
,avg(case when day_to_lastday<=3 then fixed_power_consumption else null end) as last_month_last3day_avg
,median(case when day_to_lastday<=3 then fixed_power_consumption else null end) as last_month_last3day_median
,avg(case when day_to_lastday<=7 then fixed_power_consumption else null end) as last_month_last7day_avg
,median(case when day_to_lastday<=7 then fixed_power_consumption else null end) as last_month_last7day_median
,avg(case when day<=3 then fixed_power_consumption else null end) as last_month_first3day_avg
,median(case when day<=3 then fixed_power_consumption else null end) as last_month_first3day_median
,avg(case when day<=7 then fixed_power_consumption else null end) as last_month_first7day_avg
,median(case when day<=7 then fixed_power_consumption else null end) as last_month_first7day_median
,avg(case when day>=14 and day_to_lastday>=14 then fixed_power_consumption else null end) as last_month_middle_avg
,median(case when day>=14 and day_to_lastday>=14 then fixed_power_consumption else null end) as last_month_middle_median
FROM
(
SELECT
t1.day_int
,t1.power_consumption
,t1.fixed_power_consumption
,t2.day_index
,t2.month_index
,t2.workday
,t2.day_to_lastday
,t2.day
,t2.weekday
,t2.holiday
FROM
t_netivs_fixed_daily_sum_consumption t1
LEFT OUTER JOIN
t_netivs_date_features t2
ON t1.day_int = t2.day_int
)t2_1
GROUP BY
month_index
)t2
ON t1.month_index = t2.month_index+1
)t2
ON t1.day_int = t2.day_int
;
-- Merge the features
DROP TABLE IF EXISTS t_netivs_all_online_features;
CREATE TABLE IF NOT EXISTS t_netivs_all_online_features AS
SELECT
t1.*
,t2.temperature_high
,t2.temperature_low
,t2.weather1
,t2.weather1_level
,t2.weather1_type
,t2.weather2
,t2.weather2_level
,t2.weather2_type
,t2.wind_direction
,t2.wind_speed
,t2.wind_speed1
,t2.wind_speed2
,t3.day_index
,t3.month_index
,t3.year_index
,t3.month
,t3.day
,t3.workday
,t3.weekday
,t3.holiday
,t3.special_workday
,t3.special_holiday
,t3.day1_before_special_holiday
,t3.day2_before_special_holiday
,t3.day3_before_special_holiday
,t3.day1_before_holiday
,t3.day2_before_holiday
,t3.day3_before_holiday
,t3.day1_after_special_holiday
,t3.day2_after_special_holiday
,t3.day3_after_special_holiday
,t3.day1_after_holiday
,t3.day2_after_holiday
,t3.day3_after_holiday
FROM
t_netivs_daily_sum_features t1
LEFT OUTER JOIN
t_netivs_encode_weather t2
ON t1.day_int = t2.day_int
LEFT OUTER JOIN
t_netivs_date_features t3
ON t1.day_int = t3.day_int
;
-- Generate the training set
DROP TABLE IF EXISTS t_netivs_online_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_train_features AS
SELECT
t1.*
,t2.power_consumption
,t2.fixed_power_consumption
FROM
(
SELECT * FROM t_netivs_all_online_features WHERE day_int<20161201
)t1
LEFT OUTER JOIN
t_netivs_fixed_daily_sum_consumption t2
ON t1.day_int = t2.day_int
;
-- Generate the test set
DROP TABLE IF EXISTS t_netivs_online_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_test_features AS
SELECT * FROM t_netivs_all_online_features WHERE day_int>=20161201
;
-- Train and predict with xgb
DROP TABLE IF EXISTS t_netivs_online_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_xgb_model;
-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="3500"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_online_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.4"
-Dcolsample_bytree="0.6"
-DmodelName="m_online_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-Dbase_score="0.11"
-Dmin_child_weight="100"
-DkvDelimiter=":";
-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="last_month_same_day_consumption,last_month_power_consumption_avg,last_month_power_consumption_median,last_month_power_consumption_stddev,last_month_weekday1_avg,last_month_weekday1_median,last_month_weekday0_avg,last_month_weekday0_median,last_month_workday1_avg,last_month_workday1_median,last_month_workday0_avg,last_month_workday0_median,last_month_last3day_avg,last_month_last3day_median,last_month_last7day_avg,last_month_last7day_median,last_month_first3day_avg,last_month_first3day_median,last_month_first7day_avg,last_month_first7day_median,last_month_middle_avg,last_month_middle_median,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,day,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_test_features"
-DenableSparse="false";
select * from t_netivs_online_xgb_prediction_result ORDER BY day_int limit 100;
select avg(prediction_result) from t_netivs_online_xgb_prediction_result;
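The feature extraction, model building, and model fusion code for model two is as follows:
-- Build features from the same period in history (same day last month and last year)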
DROP TABLE IF EXISTS t_netivs_same_period_feature;
CREATE TABLE IF NOT EXISTS t_netivs_same_period_feature AS
SELECT
t1.day_int
,t1.day
,t1.day_to_lastday
,t2.power_consumption as last_month_same_day_consumption
,t3.power_consumption as last_year_same_day_consumption
,t4.last_month_power_consumption_median
,t4.last_month_weekday1_median
,t4.last_month_weekday0_median
,t4.last_month_workday1_median
,t4.last_month_last3day_median
,t4.last_month_last7day_median
,t4.last_month_first3day_median
,t4.last_month_first7day_median
,t4.last_month_middle_median
,t4.last_month_power_consumption_avg
,t4.last_month_weekday1_avg
,t4.last_month_weekday0_avg
,t4.last_month_workday1_avg
,t4.last_month_last3day_avg
,t4.last_month_last7day_avg
,t4.last_month_first3day_avg
,t4.last_month_first7day_avg
,t4.last_month_middle_avg
,t5.power_consumption
,t5.fixed_power_consumption
,t6.temperature_high
,t6.temperature_low
,t6.weather1
,t6.weather1_level
,t6.weather1_type
,t6.weather2
,t6.weather2_level
,t6.weather2_type
,t6.wind_direction
,t6.wind_speed
,t6.wind_speed1
,t6.wind_speed2
,t7.month
,t7.workday
,t7.weekday
,t7.holiday
,t7.special_workday
,t7.special_holiday
,t7.day1_before_special_holiday
,t7.day2_before_special_holiday
,t7.day3_before_special_holiday
,t7.day1_before_holiday
,t7.day2_before_holiday
,t7.day3_before_holiday
,t7.day1_after_special_holiday
,t7.day2_after_special_holiday
,t7.day3_after_special_holiday
,t7.day1_after_holiday
,t7.day2_after_holiday
,t7.day3_after_holiday
FROM
(
SELECT
day_int
,day
,day_to_lastday
,case when day<=15 then cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm'),'yyyymmdd') as bigint)
else cast(to_char(dateadd(lastday(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'mm')),-day_to_lastday,'dd'),'yyyymmdd') as bigint) end
as last_month_same_day
,cast(to_char(dateadd(to_date(cast(day_int as string),'yyyymmdd'),-1,'yyyy'),'yyyymmdd') as bigint) as last_year_same_day
FROM
t_netivs_date_features
WHERE
day_int>=20160101
)t1
LEFT OUTER JOIN
t_netivs_fixed_daily_sum_consumption t2
ON t1.last_month_same_day = t2.day_int
LEFT OUTER JOIN
t_netivs_fixed_daily_sum_consumption t3
ON t1.last_year_same_day = t3.day_int
LEFT OUTER JOIN
t_netivs_dail_sum_features t4
ON t1.day_int = t4.day_int
LEFT OUTER JOIN
t_netivs_fixed_daily_sum_consumption t5
ON t1.day_int = t5.day_int
LEFT OUTER JOIN
t_netivs_encode_weather t6
ON t1.day_int = t6.day_int
LEFT OUTER JOIN
t_netivs_date_features t7
ON t1.day_int = t7.day_int
;
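The `last_month_same_day` mapping in the query above is easy to misread, so here is a minimal Python sketch of the same rule for checking individual dates locally. Days up to the 15th map to the same day number in the previous month; later days are aligned by their distance to the end of the month. Two things are assumptions, not taken from the article: `day_to_lastday` is taken to be the number of days remaining until the last day of the month (0 on the last day), and month or year shifts are taken to clamp to the end of the target month when the day does not exist. The helper names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def add_months(d, delta):
    """Shift d by delta months, clamping the day to the target month's end (assumed behavior)."""
    idx = d.year * 12 + (d.month - 1) + delta
    year, month0 = divmod(idx, 12)
    month = month0 + 1
    last = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last))


def last_month_same_day(d):
    """First half of the month: same day number; second half: same distance to month end."""
    if d.day <= 15:
        return add_months(d, -1)
    # Assumed definition of day_to_lastday: days remaining until the month's last day.
    day_to_lastday = calendar.monthrange(d.year, d.month)[1] - d.day
    prev = add_months(d, -1)
    prev_last = prev.replace(day=calendar.monthrange(prev.year, prev.month)[1])
    return prev_last - timedelta(days=day_to_lastday)


def last_year_same_day(d):
    """Same calendar date one year earlier (Feb 29 clamps to Feb 28)."""
    return add_months(d, -12)


if __name__ == "__main__":
    for d in (date(2016, 12, 10), date(2016, 12, 20), date(2016, 12, 31)):
        print(d, "->", last_month_same_day(d), last_year_same_day(d))
```

Aligning the second half of the month by distance to month end means the last day of one month always maps to the last day of the previous month, whatever the month lengths are.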
-- Build the training set
DROP TABLE IF EXISTS t_netivs_online_historical_train_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_train_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int<20161201
;
-- Build the test set
DROP TABLE IF EXISTS t_netivs_online_historical_test_features;
CREATE TABLE IF NOT EXISTS t_netivs_online_historical_test_features AS
SELECT * FROM t_netivs_same_period_feature WHERE day_int>=20161201
;
-- Train and predict with XGBoost
DROP TABLE IF EXISTS t_netivs_online_historical_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_historical_xgb_model;
-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="4000"
-DlabelColName="power_consumption"
-DinputTableName="t_netivs_online_historical_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.8"
-Dcolsample_bytree="0.8"
-DmodelName="m_online_historical_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
---Dbase_score="0.11"
-Dmin_child_weight="50"
-DkvDelimiter=":";
-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_historical_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_historical_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_historical_test_features"
-DenableSparse="false";
select * from t_netivs_online_historical_xgb_prediction_result ORDER BY day_int limit 100;
-- Repeat the run with fixed_power_consumption as the label instead of power_consumption
DROP TABLE IF EXISTS t_netivs_online_historical_fixed_xgb_prediction_result;
DROP OFFLINEMODEL IF EXISTS m_online_historical_fixed_xgb_model;
-- train
PAI
-name xgboost
-project algo_public
-Deta="0.01"
---Dobjective="reg:linear"
-Dobjective="reg:linear"
-DitemDelimiter=","
-Dseed="0"
-Dnum_round="4000"
-DlabelColName="fixed_power_consumption"
-DinputTableName="t_netivs_online_historical_train_features"
-DenableSparse="false"
-Dmax_depth="8"
-Dsubsample="0.8"
-Dcolsample_bytree="0.8"
-DmodelName="m_online_historical_fixed_xgb_model"
-Dgamma="0"
-Dlambda="50"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
---Dbase_score="0.11"
-Dmin_child_weight="50"
-DkvDelimiter=":";
-- predict
PAI
-name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DappendColNames="day_int"
-DmodelName="m_online_historical_fixed_xgb_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="t_netivs_online_historical_fixed_xgb_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DfeatureColNames="day,day_to_lastday,last_month_same_day_consumption,last_year_same_day_consumption,last_month_power_consumption_median,last_month_weekday1_median,last_month_weekday0_median,last_month_workday1_median,last_month_last3day_median,last_month_last7day_median,last_month_first3day_median,last_month_first7day_median,last_month_middle_median,last_month_power_consumption_avg,last_month_weekday1_avg,last_month_weekday0_avg,last_month_workday1_avg,last_month_last3day_avg,last_month_last7day_avg,last_month_first3day_avg,last_month_first7day_avg,last_month_middle_avg,power_consumption,fixed_power_consumption,temperature_high,temperature_low,weather1,weather1_level,weather1_type,weather2,weather2_level,weather2_type,wind_direction,wind_speed,wind_speed1,wind_speed2,month,workday,weekday,holiday,special_workday,special_holiday,day1_before_special_holiday,day2_before_special_holiday,day3_before_special_holiday,day1_before_holiday,day2_before_holiday,day3_before_holiday,day1_after_special_holiday,day2_after_special_holiday,day3_after_special_holiday,day1_after_holiday,day2_after_holiday,day3_after_holiday"
-DinputTableName="t_netivs_online_historical_test_features"
-DenableSparse="false";
select * from t_netivs_online_historical_fixed_xgb_prediction_result ORDER BY day_int limit 100;
DROP TABLE IF EXISTS t_netivs_xgb_ensemble_result;
CREATE TABLE IF NOT EXISTS t_netivs_xgb_ensemble_result AS
SELECT
t1.day_int
,t1.prediction_result + t2.prediction_result*0.05 as prediction_result
FROM
t_netivs_online_xgb_prediction_result t1
LEFT OUTER JOIN
t_netivs_online_historical_fixed_xgb_prediction_result t2
ON t1.day_int = t2.day_int
ORDER BY day_int limit 61
;
SELECT avg(prediction_result) FROM t_netivs_xgb_ensemble_result;
SELECT * FROM t_netivs_xgb_ensemble_result ORDER BY day_int limit 100;
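The fusion step above is a one-line formula, so a small Python sketch may help when checking numbers by hand. It mirrors the `LEFT OUTER JOIN` and the expression `t1.prediction_result + t2.prediction_result*0.05`: every day from the first result is kept, and a day missing from the second result yields `None`, just as arithmetic with `NULL` yields `NULL` in SQL. The function and argument names are hypothetical; only the 0.05 factor comes from the query.

```python
def blend(first, second, second_weight=0.05):
    """Combine two {day_int: prediction} dicts as first + second_weight * second.

    Mimics the LEFT OUTER JOIN: all days of `first` are kept, and a day that is
    missing in `second` produces None (the SQL NULL case).
    """
    out = {}
    for day_int in sorted(first):
        other = second.get(day_int)
        if other is None:
            out[day_int] = None
        else:
            out[day_int] = first[day_int] + second_weight * other
    return out


if __name__ == "__main__":
    model_one = {20161201: 100.0, 20161202: 200.0}
    model_two_fixed = {20161201: 40.0}
    print(blend(model_one, model_two_fixed))
```

As written, the formula adds 5% of the second prediction on top of the first one rather than taking a weighted average whose weights sum to 1, so the fused value is always shifted relative to the first prediction.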
INSERT OVERWRITE TABLE tianchi_power_answer
SELECT
concat(to_char(datepart(ds,'yyyy')),'/',to_char(datepart(ds,'mm')),'/',to_char(datepart(ds,'dd'))) as predict_date
,cast(round(power_consumption,0) as bigint) as power_consumption
FROM
(
SELECT
to_date(cast(day_int as string),'yyyymmdd') as ds
,prediction_result as power_consumption
FROM
t_netivs_xgb_ensemble_result
)t1
;
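The final `INSERT` converts `day_int` back into the unpadded date style that the answer table expects (for example `2016/9/1`). A minimal Python sketch of the same conversion, useful for spot-checking a few rows locally; the function name is hypothetical:

```python
from datetime import datetime


def to_answer_date(day_int):
    """Convert an integer such as 20161201 into 'yyyy/m/d' without zero padding."""
    d = datetime.strptime(str(day_int), "%Y%m%d")
    return f"{d.year}/{d.month}/{d.day}"


if __name__ == "__main__":
    for day_int in (20160901, 20161201, 20161231):
        print(day_int, "->", to_answer_date(day_int))
```

Parsing through a real date type, rather than slicing the string, also rejects impossible values such as 20161232.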
5. Summary and Outlook
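This article uses the Electricity AI competition hosted on the Alibaba Cloud Tianchi big data platform (https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.0.0.3f6e7d83UaNT4W&raceId=231602) as an example to walk through the complete workflow of power system load forecasting on Alibaba Cloud MaxCompute, and it gives all of the core code. Code hides no secrets: reading through it makes the capabilities and the flexible open interfaces of MaxCompute easy to see. Because of the restrictions of the competition platform, a number of features that can support development on MaxCompute are not shown here, such as DataV for visualization and the Quick BI business intelligence engine. Combining load forecasting with these products makes it straightforward to build power system applications with a polished interface and rich functionality.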