Machine learning sample labeling: illustrative code

Overview:

Goal: label samples according to the distribution of each field (for example, the top-10 srcIP and dstIP values) together with other features, and finally sort them into the classes black / white / ddos / mddos / cdn / unknown.
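The per-domain distribution check described above can be sketched with pandas. This is a toy example: the IPs and domains are made up, and only the column names follow the metadata schema used in the code below.

```python
import pandas as pd

# Toy flows (hypothetical IPs) standing in for the parsed DNS metadata.
flows = pd.DataFrame({
    "mdomain":  ["a.com"] * 4 + ["b.net"] * 2,
    "sourceIP": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "3.3.3.3"],
})

# Per-domain top-10 source IPs: the distribution a human inspects before labeling.
for domain, group in flows.groupby("mdomain"):
    top10 = group["sourceIP"].value_counts().head(10)
    print(domain)
    print(top10)
```

The same pattern applies to `destIP` or any other column: group by the registered domain, then rank values within the group.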

Sample session:

-------------choose one--------------
sub domain: DNSQueryName(N)
ip: srcip(S) or dstip(D)
length: DNSRequestLength(R1) or DNSReplyLength(R2)
length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)
port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)
code: DNSReplyCode(C2) or DNSRequestRRType(C1)
other: DNSRRClass(RR) or DNSReplyIPv4(V)
-------------label or quit------------
black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)
next(Q) or exit(E)?
***************************************
domain: workgroup. flow count: 206
***************************************
------------srcip-----------------
count                 206
unique                  9
top       162.105.129.122
freq                  150
Name: sourceIP, dtype: object
--------------destip---------------
count             206
unique             12
top       199.7.83.42
freq               82
Name: destIP, dtype: object
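The count/unique/top/freq rows above are what pandas' `describe()` prints for a string (object-dtype) column. A minimal reproduction with made-up values:

```python
import pandas as pd

# Hypothetical sourceIP column: three repeats of one address plus one other.
s = pd.Series(["162.105.129.122"] * 3 + ["10.0.0.1"], name="sourceIP")
desc = s.describe()  # object dtype -> count / unique / top / freq
print(desc)
```

`count` is the number of non-null rows, `unique` the number of distinct values, and `top`/`freq` the most frequent value and how often it appears.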

 

Code:

import sys
import json
import os
import pandas as pd
import tldextract


# Mapping from 1-based metadata column numbers to field names.
metadata_field = '''
3 = sourceIP
4 = destIP
5 = sourcePort
6 = destPort
7 = protocol
12 = flowStartSeconds
13 = flowEndSecond
54 = DNSReplyCode
55 = DNSQueryName
56 = DNSRequestRRType
57 = DNSRRClass
58 = DNSDelay
59 = DNSReplyTTL
60 = DNSReplyIPv4
61 = DNSReplyIPv6
62 = DNSReplyRRType
77 = DNSReplyName
81 = payload
88 = DNSRequestLength
89 = DNSRequestErrLength
90 = DNSReplyLength
91 = DNSReplyErrLength
'''

metadata_field_num = []   # 0-based column indexes
metadata_field_info = []  # matching column names
for l in metadata_field.split("\n"):
    if len(l) == 0:
        continue
    num, info = l.split(" = ")
    metadata_field_num.append(int(num) - 1)
    metadata_field_info.append(info)
print(metadata_field_num)
print(metadata_field_info)

# Shortcut key -> column name for the interactive distribution menu.
DIST_DICT = {
    "R1": "DNSRequestLength",
    "R2": "DNSReplyLength",
    "R3": "DNSRequestErrLength",
    "R4": "DNSReplyErrLength",
    "P1": "sourcePort",
    "P2": "destPort",
    "T": "DNSReplyTTL",
    "C2": "DNSReplyCode",
    "C1": "DNSRequestRRType",
    "RR": "DNSRRClass",
    "V": "DNSReplyIPv4",
    "S": "sourceIP",
    "D": "destIP",
    "N": "DNSQueryName",
}

# Label key -> destination directory for the labeled flows.
LABEL_DIRS = {
    "B": "labeled_black",
    "W": "labeled_white",
    "L": "labeled_white_like",
    "CDN": "labeled_cdn",
    "DDOS": "labeled_ddos",
    "M": "labeled_mddos",
    "U": "labeled_unknown",
}


def extract_domain(domain):
    """Reduce a query name to its registered domain (domain + public suffix)."""
    try:
        ext = tldextract.extract(domain)
        if ext.domain == "":
            return ext.suffix
        return ".".join(p for p in (ext.domain, ext.suffix) if p)
    except Exception as e:
        print("extract_domain error:", e)
        return "unknown"


def parse_metadata(path):
    """Load one metadata file and return its flows grouped by registered domain."""
    df = pd.read_csv(path, sep="^", header=None)
    dns_df = df.iloc[:, metadata_field_num].copy()
    dns_df.columns = metadata_field_info
    dns_df["mdomain"] = dns_df["DNSQueryName"].apply(extract_domain)
    return dns_df.groupby("mdomain")


def get_data_dist(df, col="sourceIP"):
    """Print the value distribution of one column, plus its top 10."""
    size = df.groupby(col).size()
    print(size)
    print("-----------top 10-------------")
    print(size.nlargest(10))


def get_ipv4_dist(df, col="DNSReplyLength"):
    """Distribution of returned IPv4 addresses, restricted to non-empty replies."""
    df2 = df[df[col] > 0]
    print("filter before length:", len(df), "filter after length:", len(df2))
    size = df2.groupby("DNSReplyIPv4").size()
    print(size)
    print("-----------top 10-------------")
    print(size.nlargest(10))


def move_to(srcpath, domain, dst_path):
    """Copy every flow line whose query name maps to `domain` into dst_path."""
    with open(dst_path, "w") as w, open(srcpath) as r:
        for line in r:
            if extract_domain(line.split("^")[55 - 1]) == domain:
                w.write(line)


def save_history(history_op):
    with open("history_op.json", "w") as f:
        json.dump(history_op, f)
        print("saved history_op.json")


def main():
    history_op = {}
    if os.path.exists("history_op.json"):
        with open("history_op.json") as h:
            history_op = json.load(h)
            print(history_op)
    for day in range(24, 27):
        for hour in range(0, 24):
            path = ("/home/bonelee/latest_metadata_sample/sampled/unknown_sample/"
                    "debugdogcom-medata_wanted-2017-09-%d-%d.txt" % (day, hour))
            if not os.path.exists(path) or os.path.getsize(path) == 0:
                print(path, "passed, file not exists or empty file.")
                continue
            print(path, "running...")
            try:
                domains_info = parse_metadata(path)
            except IOError as e:
                print(e)
                continue
            for domain, group in domains_info:
                print("***************************************")
                print("domain:", domain, "flow count:", len(group))
                print("***************************************")
                print("------------srcip-----------------")
                print(group["sourceIP"].describe())
                print("--------------destip---------------")
                print(group["destIP"].describe())
                print("----------------------------------------")
                print("ipv4 address return dist:")
                get_ipv4_dist(group)
                print("----------------------------------------")

                has_judged = False
                need_break = False
                while True:
                    print("-------------choose one--------------")
                    print("sub domain: DNSQueryName(N)")
                    print("ip: srcip(S) or dstip(D)")
                    print("length: DNSRequestLength(R1) or DNSReplyLength(R2)")
                    print("length too: DNSRequestErrLength(R3) or DNSReplyErrLength(R4)")
                    print("port: sourcePort(P1) or destPort(P2) or DNSReplyTTL(T)")
                    print("code: DNSReplyCode(C2) or DNSRequestRRType(C1)")
                    print("other: DNSRRClass(RR) or DNSReplyIPv4(V)")
                    print("-------------label or quit------------")
                    print("black(B) or white(W) or cdn(CDN) or ddos(DDOS) or mddos(M) or unknown(U) or white-like(L)")
                    print("next(Q) or exit(E)?")
                    domain = domain.lower()
                    # Heuristic pre-labels: suspicious new-gTLD names go to unknown,
                    # intranet-style names to ddos, CDN names to cdn.
                    if domain.endswith(("win", "site", "vip")):
                        check = "U"
                        need_break = True
                    elif any(s in domain for s in ("lan", "local", "dhcp", "workgroup", "home")):
                        check = "DDOS"
                        need_break = True
                    elif "cdn" in domain:
                        check = "CDN"
                        need_break = True
                    elif domain in history_op and not has_judged:
                        print("found history op:", history_op[domain])
                        if not input("OK(Enter for Y)?"):
                            check = history_op[domain]
                            need_break = True
                        else:
                            check = input("Input:")
                    else:
                        check = input("Input:")
                    has_judged = True
                    if check == "Q":
                        print(path, "next OK!")
                        break
                    elif check == "E":
                        print(path, "Exit!")
                        save_history(history_op)
                        sys.exit()
                    elif check in LABEL_DIRS:
                        dst = ("/home/bonelee/latest_metadata_sample/%s/2017-8-%d-%d-%s.txt"
                               % (LABEL_DIRS[check], day, hour, domain))
                        move_to(path, domain, dst)
                        history_op[domain] = check
                        print("Saved OK!")
                        if need_break:
                            break
                    elif check in DIST_DICT:
                        get_data_dist(group, DIST_DICT[check])
                    else:
                        print("unknown input! Choose one of the options above:")
            print("*******************************")
            print(path, "check over...")
            print("*******************************")
    # Persist labels even when the file loop finishes without an explicit exit.
    save_history(history_op)


if __name__ == "__main__":
    main()
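The history_op.json cache used by main() lets a domain labeled in an earlier session be offered as the default next time. A stdlib-only sketch of that load/save round trip (the demo path and domain are made up):

```python
import json
import os
import tempfile


def load_history(path):
    # Return the cached label dict, or an empty one on first run.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_history(history, path):
    with open(path, "w") as f:
        json.dump(history, f)


demo_path = os.path.join(tempfile.gettempdir(), "history_op_demo.json")
history = load_history(demo_path)
history["example.com"] = "W"  # label: white
save_history(history, demo_path)
print(load_history(demo_path))
```

In the labeler itself the cache is only written on explicit exit, so saving once more when the loop finishes normally (as the rewritten main() does) avoids losing a session's labels.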
This article was reposted from the cnblogs blog of 张昺华-sky. Original link: http://www.cnblogs.com/bonelee/p/7608165.html. For reprints, please contact the original author.


