I. Web Attack Detection and Classification with LightGBM
Competition page: www.datafountain.cn/competition…
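1. Background
A business platform captures more than 200 million Web attacks per month on average, covering common types such as injection attacks and code execution. Traditional threat detection matches rules against the signatures of known attacks, so it cannot detect unknown vulnerabilities or attack techniques. Identifying unknown threats quickly and accurately, and assigning each attack to the correct class, is critical to improving Web attack detection. Using machine learning and deep learning to identify and classify attack payloads has become a promising approach to this problem, and it helps advance the research and application of AI in threat detection and analysis.
2. Task
Teams analyze the provided training set and build an AI model through feature engineering, machine learning, deep learning and similar methods, so that every sample is classified correctly and quickly, steadily improving the model's precision and recall. Once the model is tuned and stable, each team's model is evaluated on an unlabeled test set, with accuracy as the quality metric.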
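3. Data Overview
The training set covers 6 different labels and contains about 35,000 records in total. Its main fields are:
● ID: sample number
● label: attack type number (the column is spelled lable in the CSV files, which is why the code below uses 'lable')
● Others: HTTP protocol content
The submission follows the format of the current test set with one extra field, predict, which holds the model's prediction. Its values have the same meaning as the label field of the training set. The submitted file must be named predict.csv.
Test set format: id, method, user_agent, url, refer and body.
Example: with the predict field added, the sample with id=117 is predicted as class 0. Note that the submitted id values must not be changed.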
4. Credits
This project is adapted from a post by 致Great (discussion.datafountain.cn/articles/de…). It is a learning exercise built on that work; many thanks to the author.
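II. Data Processing
1. Unzipping
Take particular care to specify the Chinese encoding; otherwise the Chinese file names come out garbled.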
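- -c: extract files to the screen, converting characters where appropriate;
- -f: freshen existing files only;
- -l: list the files contained in the archive;
- -p: like -c, extract files to the screen, but without any conversion;
- -t: test the integrity of the archive;
- -u: like -f, but in addition to updating existing files, also extract the files that do not exist yet in the directory;
- -v: show detailed information while running;
- -z: display only the archive comment;
- -a: perform the necessary character conversion on text files;
- -b: do not convert text files;
- -C: match file names in the archive case-insensitively;
- -j: ignore the directory paths stored in the archive;
- -L: convert the file names in the archive to lowercase;
- -M: pipe the output through the more program;
- -n: never overwrite existing files;
- -o: overwrite existing files without prompting;
- -P<password>: supply the password for an encrypted archive;
- -q: run quietly, without printing messages;
- -s: convert spaces in file names to underscores;
- -V: retain VMS file version information;
- -X: restore the original UID/GID of the files when extracting;
- -d<directory>: extract the files into the given directory;
- -x<files>: exclude the given files in the .zip archive from processing;
- -Z: unzip -Z is equivalent to running zipinfo;
- -O: specify the character encoding for archives created on DOS, Windows and OS/2;
- -I: specify the character encoding for archives created on UNIX and other systems.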
!unzip -qoa -O GBK data/data168450/train.zip
!unzip -qoa -O GBK data/data168450/test-726699.zip
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
2. Loading the Data
Read the per-class CSV files and concatenate them row-wise (axis=0) into a single training set.
train=pd.concat([
    pd.read_csv('SQL注入.csv'),
    pd.read_csv('XSS跨站脚本.csv'),
    pd.read_csv('命令执行.csv'),
    pd.read_csv('白.csv'),
    pd.read_csv('目录遍历.csv'),
    pd.read_csv('远程代码执行.csv'),
],axis=0).reset_index(drop=True)
test=pd.read_csv('test.csv')
train.head()
 | id | method | user_agent | url | refer | body | lable |
0 | 17902 | GET | Dalvik/2.1.0 (Linux; U; Android 11; SM-G9860 B... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=ed... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 |
1 | 190 | GET | Dalvik/2.1.0 (Linux; U; Android 11; Mi 10 Buil... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d6... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 |
2 | 8799 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /(select%20extractvalue(xmltype('%3c%3fxml%20v... | NaN | GET /(select%20extractvalue(xmltype('%3c%3fxml... | 1 |
3 | 8788 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /ftp/quarantine/?(select%20load_file('%5c%5c%5... | NaN | GET /ftp/quarantine/?(select%20load_file('%5c%... | 1 |
4 | 16030 | GET | Dalvik/2.1.0 (Linux; U; Android 9; MI 9 SE MIU... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | NaN | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | 1 |
test.head()
 | id | method | user_agent | url | refer | body |
0 | 0 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /demo/aisec/upload.php?act='%7C%7C(select+1+fr... | http://demo.aisec.cn/demo/aisec/upload.php?t=0... | GET /demo/aisec/upload.php?act='%7C%7C(select+... |
1 | 1 | GET | Dalvik/2.1.0 (Linux; U; Android 11; M2102J2SC ... | /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=5f... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=1&openudi... |
2 | 2 | GET | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... | /create_user/?username=%3Cscript%3Ealert(docum... | NaN | NaN |
3 | 3 | GET | NaN | /mmsns/WeDwicXmkOl4kjKsBycicI0H3q41r6syFFvu46h... | NaN | NaN |
4 | 4 | PUT | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... | /naizau.jsp/ | NaN | GET /login HTTP/1.1 Host: 111.160.211.18:8088 ... |
train.to_csv("train_all.csv")
test.to_csv("test_all.csv")
3. Label Distribution
print(train['lable'].value_counts())
train['lable'].value_counts().plot(kind='bar')
1    14038
2     9939
0     6489
3     1397
4      697
5      659
Name: lable, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3645bf890>
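3. Merging the Data
reset_index rebuilds the index, which is often out of order after a DataFrame has been processed. After a concatenation in particular, index labels can repeat, so on a large dataset it is easy to overlook that selecting one label returns several rows that share it.
- drop=True discards the original index and replaces it with a fresh one.
- drop=False keeps the original index as a column and adds the fresh index.
The only difference between the two is whether the original index is dropped.
# Merge the training and test sets and reset the index
data=pd.concat([train,test],axis=0).reset_index(drop=True)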
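The effect of reset_index is easier to see on a tiny example. The following is a minimal sketch on made-up data; the frames train_demo and test_demo and their values are hypothetical stand-ins, not the competition data.

```python
import pandas as pd

# Toy frames standing in for the real train/test sets (hypothetical data).
train_demo = pd.DataFrame({"id": [10, 11], "method": ["GET", "POST"]})
test_demo = pd.DataFrame({"id": [20, 21], "method": ["PUT", "GET"]})

# Without reset_index each frame keeps its own 0..n-1 labels,
# so the stacked frame contains every label twice.
stacked = pd.concat([train_demo, test_demo], axis=0)
rows_for_label_0 = stacked.loc[0]  # a DataFrame with two rows, not a single row

# drop=True throws the old labels away and renumbers the rows.
clean = stacked.reset_index(drop=True)

# drop=False keeps the old labels in a new column named "index".
kept = stacked.reset_index(drop=False)

print(len(rows_for_label_0))
print(list(clean.index))
print(list(kept.columns))
```

With drop=True, a lookup such as clean.loc[0] returns exactly one row again, which is what the later slicing into train and test relies on.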
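4. Handling Missing Values
It is better to replace missing values across the whole table in one step than to treat individual columns.
data=data.fillna('NAN')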
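One side effect of filling the whole table is worth keeping in mind: the test rows have no label, so their lable cells are filled with the string 'NAN' as well. The sketch below illustrates this on a hypothetical three-row table (the name demo and its values are made up); it is an illustration of the behaviour, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the merged table: the last row plays the role
# of a test sample, so it has no label.
demo = pd.DataFrame({
    "refer": ["http://a.example/", np.nan, np.nan],
    "lable": [1.0, 0.0, np.nan],
})

filled = demo.fillna("NAN")

print(filled["refer"].tolist())
# The label column now mixes numbers with the string 'NAN'.
print(filled["lable"].tolist())

# The labels of the training rows are still numeric and can be cast back to int.
train_labels = filled["lable"].iloc[:2].astype(int)
print(train_labels.tolist())
```

This is presumably why the notebook converts the label column back to int before training.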
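4. Feature Extraction
4.1 About lightgbm
Note that lightgbm needs to be upgraded to 3.3.2; the older 3.1 release does not provide log_evaluation. Documentation: lightgbm.readthedocs.io/en/latest/
4.2 About user_agents
user_agents is a Python library that parses (browser/HTTP) user agent strings, giving a simple way to identify the device behind a request. It relies on the ua-parser library for the actual parsing of the raw user agent string.
!pip install -U -q lightgbm
!pip install -q user_agents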
!pip list |grep lightgbm
lightgbm 3.3.2
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from user_agents import parse
4.3 Parsing the User Agent
Example usage:
from user_agents import parse
user_string = "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
user_agent = parse(user_string)
# Browser attributes of the user agent
print(user_agent.browser)
print(user_agent.browser.family)
print(user_agent.browser.version)
print(user_agent.browser.version_string)
# Operating system attributes of the user agent
print(user_agent.os)
print(user_agent.os.family)
print(user_agent.os.version)
print(user_agent.os.version_string)
# Device attributes of the user agent
print(user_agent.device)
print(user_agent.device.family)
print(user_agent.device.brand)
print(user_agent.device.model)
# A readable string summary
print(str(user_agent))
ua_cols=['browser_family', 'os_family', 'device_family','device_brand','device_model']
# Parse the user_agent of one row and return the five fields listed in ua_cols
def get_ua(row):
    user_agent = parse(row['user_agent'])
    browser_family=str(user_agent.browser.family)
    os_family=str(user_agent.os.family)
    device_family=str(user_agent.device.family)
    device_brand=str(user_agent.device.brand)
    device_model=str(user_agent.device.model)
    return browser_family,os_family,device_family,device_brand,device_model
data[ua_cols] = data.apply(get_ua, axis=1, result_type="expand")
4.4 Text Vectorization with TfidfVectorizer
TfidfVectorizer turns raw text into a tf-idf feature matrix, which is the basis for downstream applications such as text similarity, topic models (for example LSI) and search ranking.
Typical usage:
from sklearn.feature_extraction.text import TfidfVectorizer
document = ["I have a pen.", "I have an apple."]
tfidf_model = TfidfVectorizer().fit(document)
# The tf-idf matrix in sparse representation
sparse_result = tfidf_model.transform(document)
print(sparse_result)
# For document 0, the term with vocabulary index 3 has a TF-IDF of 0.8148
# (0, 3) 0.814802474667
# (0, 2) 0.579738671538
# (1, 2) 0.449436416524
# (1, 1) 0.631667201738
# (1, 0) 0.631667201738
print(sparse_result.todense())  # convert to a dense matrix, which is easier to read
# [[ 0. 0. 0.57973867 0.81480247]
# [ 0.6316672 0.6316672 0.44943642 0. ]]
print(tfidf_model.vocabulary_)  # mapping from term to column
# {'have': 2, 'pen': 3, 'an': 0, 'apple': 1}
# Character n-grams (2 to 5 characters, within word boundaries) of the user agent,
# reduced to 16 dimensions with truncated SVD
texts=data['user_agent'].values.tolist()
n_components = 16
tf = TfidfVectorizer(min_df= 3, max_df=0.5,analyzer = 'char_wb', ngram_range = (2,5))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components, random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'user_agent_name_tfidf_{i}' for i in range(n_components)]
# Append the 16 SVD columns and list the columns to be label-encoded later
data=pd.concat([data,df_tfidf],axis=1)
cate_cols=['method','user_agent','url','refer','body']+ua_cols
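To make the analyzer = 'char_wb' setting more concrete, here is a minimal sketch of what the vectorizer extracts and how TruncatedSVD shrinks the result. The user agent strings, the names tf_demo and svd_demo, and the smaller ngram_range and n_components are assumptions chosen for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical user agent strings, not taken from the competition data.
uas = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko",
    "Dalvik/2.1.0 (Linux; U; Android 11)",
    "Dalvik/2.1.0 (Linux; U; Android 9)",
    "curl/7.68.0",
]

tf_demo = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))

# build_analyzer() returns the function that splits a string into n-grams.
# With char_wb, each word is padded with a space on both sides and
# n-grams never span two words.
grams = tf_demo.build_analyzer()("ab cd")
print(grams)

# Sparse tf-idf matrix: one row per string, one column per distinct n-gram.
X_demo = tf_demo.fit_transform(uas)

# TruncatedSVD works directly on the sparse matrix and keeps 2 dense columns.
svd_demo = TruncatedSVD(n_components=2, random_state=42)
X_small = svd_demo.fit_transform(X_demo)
print(X_demo.shape, X_small.shape)
```

Character n-grams suit user agent strings because version numbers and device codes rarely repeat as whole words but often share fragments.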
data.head()
 | id | method | user_agent | url | refer | body | lable | browser_family | os_family | device_family | ... | user_agent_name_tfidf_6 | user_agent_name_tfidf_7 | user_agent_name_tfidf_8 | user_agent_name_tfidf_9 | user_agent_name_tfidf_10 | user_agent_name_tfidf_11 | user_agent_name_tfidf_12 | user_agent_name_tfidf_13 | user_agent_name_tfidf_14 | user_agent_name_tfidf_15 |
0 | 17902 | GET | Dalvik/2.1.0 (Linux; U; Android 11; SM-G9860 B... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=ed... | NAN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 | Android | Android | Samsung SM-G9860 | ... | -0.001757 | 0.006566 | -0.016728 | 0.071496 | 0.105247 | -0.373983 | 0.515993 | -0.172456 | -0.015012 | 0.014517 |
1 | 190 | GET | Dalvik/2.1.0 (Linux; U; Android 11; Mi 10 Buil... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d6... | NAN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 | Android | Android | XiaoMi Mi 10 | ... | 0.002254 | -0.028675 | -0.014554 | -0.330868 | -0.045324 | -0.079707 | -0.197802 | 0.035036 | 0.018985 | -0.014604 |
2 | 8799 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /(select%20extractvalue(xmltype('%3c%3fxml%20v... | NAN | GET /(select%20extractvalue(xmltype('%3c%3fxml... | 1 | Chrome | Windows | Other | ... | -0.329514 | -0.249199 | -0.024391 | 0.006098 | -0.049402 | 0.008570 | -0.005371 | -0.042493 | 0.000034 | 0.002854 |
3 | 8788 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /ftp/quarantine/?(select%20load_file('%5c%5c%5... | NAN | GET /ftp/quarantine/?(select%20load_file('%5c%... | 1 | Chrome | Windows | Other | ... | -0.329514 | -0.249199 | -0.024391 | 0.006098 | -0.049402 | 0.008570 | -0.005371 | -0.042493 | 0.000034 | 0.002854 |
4 | 16030 | GET | Dalvik/2.1.0 (Linux; U; Android 9; MI 9 SE MIU... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | NAN | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | 1 | Android | Android | XiaoMi MI 9 SE | ... | -0.005911 | -0.002689 | 0.015273 | 0.085995 | -0.011761 | -0.006246 | -0.009553 | -0.012806 | -0.001621 | 0.002927 |
5 rows × 28 columns
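III. Model Training and Prediction
1. LabelEncoder
LabelEncoder maps discrete values to integers between 0 and n − 1, where n is the number of distinct values the feature takes.
- Hint: if LabelEncoder().fit_transform fails with "argument must be a string or number":
- handle the missing values before running LabelEncoder().
There are two main steps:
- First construct the encoder and pass the data to be encoded to fit, which builds the value-to-code mapping internally.
- Then apply the encoder to the data to be converted with transform.
for col in tqdm(cate_cols):
    lbl = LabelEncoder()
    lbl.fit(data[col])
    data[col] = lbl.transform(data[col])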
100%|██████████| 10/10 [00:00<00:00, 29.83it/s]
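The two steps, fit and then transform, can be sketched on a handful of values. The list methods below is made up for illustration; 'NAN' stands in for a missing value that has already been filled.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values; 'NAN' represents a filled missing value.
methods = ["GET", "POST", "GET", "PUT", "NAN"]

enc = LabelEncoder()
enc.fit(methods)                # step 1: learn the value -> code mapping
codes = enc.transform(methods)  # step 2: apply the mapping

print(list(enc.classes_))                   # classes are stored in sorted order
print(codes.tolist())                       # [0, 2, 0, 3, 1]
print(enc.inverse_transform([3]).tolist())  # ['PUT']
```

The codes follow the sorted order of the values, so they carry no meaning beyond identity. Tree models such as LightGBM can still split on them, which is how the notebook uses them.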
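# Split back into train and test; copy() avoids pandas warnings when columns are modified later
train, test = data[:len(train)].copy(), data[len(train):].copy()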
2. Model Training
def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['lable', 'url', 'url_count']]
    print('Current num of features:', len(feats))

    oof_probs = np.zeros((train.shape[0], 6))  # out-of-fold class probabilities
    output_preds = 0                           # averaged test-set probabilities
    offline_score = []                         # best validation multi_error per fold
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 6,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': 4,
        'max_depth': 7
    }

    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X, label=train_y)
            dval = lgb.Dataset(test_X, label=test_y)
            # Named model so that the local variable does not shadow the function name
            model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=20000,
                valid_sets=[dval],
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            # += so that the predictions are averaged correctly if more seeds are added
            oof_probs[test_index] += model.predict(test_X, num_iteration=model.best_iteration) / len(seeds)
            offline_score.append(model.best_score['valid_0']['multi_error'])
            output_preds += model.predict(test[feats],
                                          num_iteration=model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # Note: the values printed here are multi_error (lower is better), not AUC, despite the label
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))

    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df
# Exclude columns that must not be used as features
# id and the like definitely have to go
feature_names = list(
    filter(
        lambda x: x not in ['id','lable','url', 'url_count'],
        train.columns))
feature_names
['method', 'user_agent', 'refer', 'body', 'browser_family', 'os_family', 'device_family', 'device_brand', 'device_model', 'user_agent_name_tfidf_0', 'user_agent_name_tfidf_1', 'user_agent_name_tfidf_2', 'user_agent_name_tfidf_3', 'user_agent_name_tfidf_4', 'user_agent_name_tfidf_5', 'user_agent_name_tfidf_6', 'user_agent_name_tfidf_7', 'user_agent_name_tfidf_8', 'user_agent_name_tfidf_9', 'user_agent_name_tfidf_10', 'user_agent_name_tfidf_11', 'user_agent_name_tfidf_12', 'user_agent_name_tfidf_13', 'user_agent_name_tfidf_14', 'user_agent_name_tfidf_15']
train['lable'].unique()
array([1, 5, 4, 0, 2, 3])
# Convert the label to int
train['lable']=train['lable'].apply(lambda i:int(i))
train['lable'].unique()
array([1, 5, 4, 0, 2, 3])
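# The printed message means "start model training"
print('开始模型训练train')
# The source cuts this call off after target=...; the test and k arguments below are
# reconstructed (the log shows 5 folds), so treat them as an assumption
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(train=train[feature_names],
                                                                  target=train['lable'],
                                                                  test=test[feature_names],
                                                                  k=5)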
开始模型训练train
Current num of features: 25
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.106713
[200] valid_0's multi_error: 0.102197
[300] valid_0's multi_error: 0.0996388
[400] valid_0's multi_error: 0.0979831
[500] valid_0's multi_error: 0.0963275
[600] valid_0's multi_error: 0.0945214
[700] valid_0's multi_error: 0.0943709
Early stopping, best iteration is:
[665] valid_0's multi_error: 0.0942203
[0.09422034918723661]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.107164
[200] valid_0's multi_error: 0.104756
[300] valid_0's multi_error: 0.102498
[400] valid_0's multi_error: 0.0990367
[500] valid_0's multi_error: 0.10009
Early stopping, best iteration is:
[410] valid_0's multi_error: 0.0987357
[0.09422034918723661, 0.09873570138470801]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.105057
[200] valid_0's multi_error: 0.100993
[300] valid_0's multi_error: 0.0975316
[400] valid_0's multi_error: 0.0957255
Early stopping, best iteration is:
[388] valid_0's multi_error: 0.0954244
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.110777
[200] valid_0's multi_error: 0.104907
[300] valid_0's multi_error: 0.1028
[400] valid_0's multi_error: 0.101445
[500] valid_0's multi_error: 0.100542
[600] valid_0's multi_error: 0.101746
Early stopping, best iteration is:
[518] valid_0's multi_error: 0.0997893
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231, 0.099789283564118]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.0999548
[200] valid_0's multi_error: 0.0948367
[300] valid_0's multi_error: 0.0928797
[400] valid_0's multi_error: 0.0924281
[500] valid_0's multi_error: 0.0913744
Early stopping, best iteration is:
[496] valid_0's multi_error: 0.0912238
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231, 0.099789283564118, 0.09122384464850218]
OOF-MEAN-AUC:0.095879, OOF-STD-AUC:0.003101
feature importance:
feature
body                        241348.874669
browser_family              136926.749047
refer                        91150.440689
user_agent_name_tfidf_2      57131.292926
user_agent_name_tfidf_6      54907.824831
user_agent_name_tfidf_4      54479.337116
user_agent                   41806.222855
user_agent_name_tfidf_1      37807.530563
user_agent_name_tfidf_5      37042.289503
user_agent_name_tfidf_0      35681.990519
user_agent_name_tfidf_7      31312.516073
method                       22488.125961
user_agent_name_tfidf_12     20017.676556
user_agent_name_tfidf_15     19832.515956
user_agent_name_tfidf_14     17805.419354
user_agent_name_tfidf_10     17644.252264
user_agent_name_tfidf_3      17013.031725
user_agent_name_tfidf_9      16673.376316
user_agent_name_tfidf_13     16357.774028
user_agent_name_tfidf_8      15505.476155
user_agent_name_tfidf_11     11311.683459
os_family                     3066.650211
device_model                  2139.934527
device_family                 1887.317088
device_brand                  1406.674794
Name: importance, dtype: float64
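The figures labelled OOF-MEAN-AUC in the log are in fact averages of the multi_error metric, that is, the share of misclassified validation samples. A short check with the five per-fold values copied from the log shows how they relate to accuracy; the variable names here are illustrative.

```python
import numpy as np

# Best validation multi_error of each fold, copied from the training log.
fold_errors = [
    0.09422034918723661,
    0.09873570138470801,
    0.09542444310656231,
    0.099789283564118,
    0.09122384464850218,
]

mean_error = np.mean(fold_errors)
std_error = np.std(fold_errors)

# multi_error is the misclassification rate, so accuracy is its complement.
approx_accuracy = 1.0 - mean_error

print('mean multi_error: %.6f, std: %.6f' % (mean_error, std_error))
print('approximate accuracy: %.6f' % approx_accuracy)
```

The result, roughly 0.9041, should agree closely with an accuracy computed on the out-of-fold predictions; small differences can arise because the folds are not exactly the same size.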
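IV. Prediction and Submission
1. Prediction
# Read the submission template
sub=pd.read_csv('data/data168450/submit_example.csv')
# Take the class with the highest probability as the label
sub['predict']=np.argmax(lgb_preds,axis=1)
# Count the predictions per class
sub['predict'].value_counts()
1    1281
2     923
0     840
3     588
5     298
4      70
Name: predict, dtype: int64
# Save (the task description asks for the file name predict.csv, so rename before uploading if needed)
sub.to_csv('sub.csv',index=None)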
accuracy_score(train['lable'],np.argmax(lgb_oof,axis=1))
0.9041211354947469
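Overall accuracy hides how the model does on the rare classes, and the task description also mentions precision and recall. The following sketch shows one way to get per-class figures from out-of-fold probabilities. The arrays oof_demo and y_true are hypothetical toy values; in the notebook they would correspond to lgb_oof and train['lable'].

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical out-of-fold probabilities for 4 samples and 3 classes.
oof_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
y_true = np.array([0, 1, 2, 2])

# Same decision rule as in the notebook: the most probable class wins.
y_pred = np.argmax(oof_demo, axis=1)

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)

print('accuracy:', acc)
for cls in range(3):
    print(cls, precision[cls], recall[cls], support[cls])
```

In this toy case the third sample is misclassified as class 1, which lowers the precision of class 1 and the recall of class 2 while the overall accuracy stays at 0.75.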
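2. Submission
3. Takeaways
- I used to discretize string columns by hand, which was slow; this time I learned to use LabelEncoder.
- For turning text content into vectors, this time I learned to use TfidfVectorizer.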