1. Benchmark Results
Conclusion 1: On a 4-node cluster of four Alibaba Cloud servers (64 cores / 256 GB each), with Hadoop 3.2.2 as the distributed storage layer, Cloudwave 4.0 is roughly 37% faster overall than StarRocks 3.0 on the SSB1000 international standard benchmark (7.602 s vs. 10.397 s total).
Conclusion 2: In the multi-table join scenario, Cloudwave 4.0's response time is close to zero (12–14 ms).
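The "roughly 37% faster" figure in conclusion 1 is just the relative difference of the two total response times reported below; a quick sketch of the arithmetic:

```shell
# Relative speedup of Cloudwave over StarRocks on the 13-SQL total:
# (t_starrocks - t_cloudwave) / t_cloudwave
awk 'BEGIN { printf "%.1f%% faster\n", 100 * (10.397 - 7.602) / 7.602 }'
# → 36.8% faster
```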
[Appendix] Results for the 13 standard benchmark SQLs:

| Database | Dataset | Response time (s) | Peak CPU usage | Storage compression ratio | Data load time |
|---|---|---|---|---|---|
| Cloudwave 4.0 | ssb1000 | 7.602 | 90% (5763%/6400%) | 59% (360 GB/606 GB) | 58 min |
| StarRocks 3.0 | ssb1000 | 10.397 | 66.6% (4266%/6400%) | 169% (1024 GB/606 GB) | 112 min |
[Appendix] Results for the 2 extended test SQLs:

| Database | Dataset | Ext. SQL1 response time (s) | Ext. SQL1 peak CPU | Ext. SQL2 response time (s) | Ext. SQL2 peak CPU |
|---|---|---|---|---|---|
| Cloudwave 4.0 | ssb1000 | 0.012 | 0.0935% (6%/6400%) | 0.014 | 0.118% (7.6%/6400%) |
| StarRocks 3.0 | ssb1000 | 2.79 | 78.7% (5037%/6400%) | 4.8 | 90.5% (5797%/6400%) |
2. Test Environment
- Hardware: 4 Alibaba Cloud servers (64 cores / 256 GB RAM each) forming a 4-node cluster, with ESSD PL1 cloud disks
- Software: JDK 19 (the version officially recommended for Cloudwave 4.0, whose fully vectorized engine is built on the JDK 19 Vector API), JDK 8 (the version recommended for installing StarRocks, mainly for the FE, and the least trouble-prone choice), MySQL 8 (as the StarRocks client), Hadoop 3.2.2 (shared distributed storage for both Cloudwave and StarRocks, replication factor = 2)
- Versions: Cloudwave 4.0 (latest release, May 2023), StarRocks 3.0 (latest release, April 2023)
- Test dataset: ssb1000
| Table | Rows | Description |
|---|---|---|
| lineorder | 6 billion | SSB orders (fact) table |
| customer | 30 million | SSB customer table |
| part | 2 million | SSB part table |
| supplier | 2 million | SSB supplier table |
| dates | 2,556 | date dimension table |
(Screenshots omitted: hardware environment, JDK version, MySQL version, Hadoop version, StarRocks version.)
3. Test Method
Standard benchmark
- Run the test script for 19 rounds, each round executing the 13 standard benchmark SQLs. Discard round 1 (on both systems the first query is I/O-bound; this benchmark compares the engines' algorithms under identical compute resources, so the cold-cache round is excluded) and average the remaining 18 rounds to obtain each SQL's mean response time;
- Record peak CPU usage
- Record the storage compression ratio
- Record the data load time
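The averaging step above can be sketched as a one-liner (illustrative only; the input numbers below are hypothetical, one total time per round with round 1 first):

```shell
# Drop round 1 (cold cache) and average the remaining rounds.
printf '%s\n' 42.57 10.4 10.3 10.5 |
  awk 'NR > 1 { sum += $1; n++ } END { printf "%.2f s\n", sum / n }'
# → 10.40 s
```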
Multi-table join test
- Method: run the SQL test script for 19 rounds, each round executing 1 extended multi-table join SQL. Discard round 1 (excluded for the same cold-cache I/O reason as above) and average the remaining 18 rounds to obtain the SQL's mean response time
- Record peak CPU usage
- Record total elapsed time
- Extended join test SQL1: select count(*) from lineorder,customer where lo_custkey = c_custkey;
- Extended join test SQL2: select count(*) from lineorder,customer,supplier where lo_custkey = c_custkey and lo_suppkey = s_suppkey;
4. Running the Test [cloudwave]
- Check the storage space reserved for Hadoop
./sync_scripts.sh 'df -h' | grep home
- Format the Hadoop namenode
hdfs namenode -format
- Start HDFS and check the daemons
start-dfs.sh
./sync_scripts.sh 'jps'
- Create the ssb1000 upload directory and upload the data
hdfs dfs -mkdir /cloudwave
hdfs dfs -mkdir /cloudwave/uploads
hdfs dfs -put ssb1000 /cloudwave/uploads/
- Check the upload result
- The ssb1000 dataset occupies 606 GB of storage
hdfs dfs -du -h /
du -sh /home/cloudwave/ssb-poc-0.9.3/ssb-poc/output/data_dir/ssb1000
Start the Cloudwave database and load the ssb1000 data
- Start the database
./start-all-server.sh
- Load the data
./cplus_go.bin -s 'loaddata ssb1000'
- The load completed in 3493 seconds, i.e. 58 minutes.
- The hdfs listing shows that Cloudwave compresses the data: the raw ssb1000 dataset is 606 GB, and after loading into Cloudwave it shrinks to 360 GB (the 720 GB reported is the total across the two HDFS replicas), a compression ratio of 59%.
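The 59% ratio is simply stored size over raw size, after halving the HDFS total to account for the two replicas; a sketch with the numbers from this run:

```shell
# Compression ratio = (per-replica stored size) / (raw dataset size)
awk -v t=720 -v k=2 -v r=606 \
    'BEGIN { printf "%.0f%%\n", 100 * (t / k) / r }'
# t = total GB across HDFS replicas, k = replica count, r = raw GB
# → 59%
```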
- [cloudwave] Run the 13 standard benchmark SQLs
- Run the test script
./test_ssb.sh
I observed a peak CPU usage of 90% (5763%/6400%) for the 4-node Cloudwave cluster on ssb1000
- Run the analysis script
./analysis.sh cloudwave "$(ls n*txt)" +
The 13 standard SQLs average 7.6 s in total
- [cloudwave] Run the 2 multi-table join SQLs
- Run the test script
./test_ex.sh
I observed a peak CPU usage of 0.0935% (6%/6400%) for extended SQL1 on the 4-node Cloudwave cluster
- Run the analysis script
./analysis.sh cloudwave "$(ls n*txt)" +
Extended SQL1 takes 12 ms
- Replace the SQL in sql_ex.sql with extended SQL2, then run the test script
./test_ex.sh
I observed a peak CPU usage of 0.118% (7.6%/6400%) for extended SQL2 on the 4-node Cloudwave cluster
- Run the analysis script
./analysis.sh cloudwave "$(ls n*txt)" +
Extended SQL2 takes 14 ms
5. Comparison Test [starrocks]
- Clear HDFS
- [starrocks] Start the StarRocks 3.0 FE
./fe/bin/start_fe.sh --daemon
- [starrocks] Add the StarRocks 3.0 BEs
mysql -uroot -h127.0.0.1 -P9030
ALTER SYSTEM ADD BACKEND "172.17.161.33:9050";
ALTER SYSTEM ADD BACKEND "172.17.161.32:9050";
ALTER SYSTEM ADD BACKEND "172.17.161.31:9050";
ALTER SYSTEM ADD BACKEND "172.17.161.30:9050";
- [starrocks] Start the StarRocks 3.0 BEs
./sync_scripts.sh "cd $(pwd)/be/bin && ./start_be.sh --daemon &&ps -ef | grep starrocks_be"
- [starrocks] Verify the cluster status; all 4 nodes should report Alive=true.
- [starrocks] Create the tables
- [starrocks] Load the ssb1000 data and measure the load time
- The load command was launched at 8:58.
date && ./bin/stream_load.sh data_dir/ssb30 && date
- During the load, StarRocks raised the replica count of the files it created to 3, even though my HDFS default replication factor was set to 2.
- The load finished at 10:50, for a total of 112 minutes.
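The 112-minute figure follows from the two timestamps (GNU date arithmetic; assumes both times fall on the same day):

```shell
start=$(date -d '08:58' +%s)   # load started
end=$(date -d '10:50' +%s)     # load finished
echo "$(( (end - start) / 60 )) minutes"
# → 112 minutes
```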
- [starrocks] Check the ssb1000 compression ratio: the raw dataset is 606 GB, yet after loading into StarRocks it surprisingly occupies 1 TB of distributed storage (where did the compression go?).
- [starrocks] Run the benchmark
- Run the test script
./test_ssb.sh
I observed a peak CPU usage of 66.6% (4266%/6400%) for the 4-node StarRocks cluster on ssb1000
- [starrocks] Analyze the results
- Run the analysis script
./analysis.sh starrocks "$(ls n*txt)" +
Excluding the first round (42.57 s), the average is 10.39 s
- [starrocks] Run the 2 multi-table join SQLs
- Run the test script
./test_ex.sh
I observed a peak CPU usage of 78.7% (5037%/6400%) for extended SQL1 on the 4-node StarRocks cluster
- Run the analysis script
./analysis.sh starrocks "$(ls n*txt)" +
Extended SQL1 takes 2.79 s
- Replace the SQL in sql_ex.sql with extended SQL2, then run the test script
./test_ex.sh
I observed a peak CPU usage of 90.5% (5797%/6400%) for extended SQL2 on the 4-node StarRocks cluster
- Run the analysis script
./analysis.sh starrocks "$(ls n*txt)" +
Extended SQL2 takes 4.8 s
6. Appendix
- Cloudwave test script
#!/bin/bash
# Program:
#     test ssb
# History:
#     2023/03/17 junfenghe.cloud@qq.com version:0.0.1

rm -rf ./n*txt
for ((i=1; i<20; i++))
do
    cat sql_ssb.sql | ./cplus.sh > n${i}.txt
done
- StarRocks test script
#!/bin/bash
# Program:
#     test ssb
# History:
#     2023/03/17 junfenghe.cloud@qq.com version:0.0.1

rm -rf ./n*txt
for ((i=1; i<20; i++))
do
    cat sql_ssb.sql | mysql -uroot -P 9030 -h 127.0.0.1 -v -vv -vvv > n${i}.txt
done
- Analysis script
#!/bin/bash
# Program:
#     analysis cloudwave/starrocks logs of base compute
# History:
#     2023/02/20 junfenghe.cloud@qq.com version:0.0.1

PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:~/bin
export PATH

suff="(s)#####"

if [ -z "${1}" ]
then
    echo "Please input database's name"
    exit 1
fi

if [ -z "${2}" ]
then
    echo "Please input times of scanner"
    exit 1
fi

if [ -n "${3}" ]
then
    suff=${3}
fi

for current in ${2}
do
    result_time=""
    if [ "${1}" == "starrocks" ]
    then
        for time in $( cat ${current} | grep sec | awk -F '(' '{print $2}' | awk -F ' ' '{print $1}' )
        do
            result_time="${result_time}${time}${suff}"
        done
    elif [ "${1}" == "cloudwave" ]
    then
        for time in $( cat ${current} | grep Elapsed | awk '{print $2}' | sed 's/:/*60+/g' | sed 's/+00\*60//g ; s/+0\*60//g ; s/^0\*60+//g' )
        do
            result_time="${result_time}${time}${suff}"
        done
    fi
    echo ${result_time%${suff}*}
done

exit 0
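For reference, the cloudwave branch of the analysis script converts an Elapsed field such as 0:07.602 into plain seconds by rewriting the colon as *60+ and stripping a zero-minute prefix (times of a minute or more remain as m*60+s expressions). A worked example of that sed pipeline:

```shell
# 0:07.602  →  0*60+07.602  →  07.602
echo '0:07.602' |
  sed 's/:/*60+/g' |
  sed 's/+00\*60//g ; s/+0\*60//g ; s/^0\*60+//g'
# → 07.602
```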
sql_ssb.sql
use ssb1000; select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25; select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; select sum(lo_revenue) as revenue from lineorder,dates where lo_orderdate = d_datekey and d_weeknuminyear = 6 and d_year = 1994 and lo_discount between 5 and 7 and lo_quantity between 26 and 35; select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder ,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_category = 'MFGR#12' and s_region = 'AMERICA' group by d_year, p_brand order by d_year, p_brand; select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA' group by d_year, p_brand order by d_year, p_brand; select sum(lo_revenue) as lo_revenue, d_year, p_brand from lineorder,dates,part,supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand = 'MFGR#2239' and s_region = 'EUROPE' group by d_year, p_brand order by d_year, p_brand; select c_nation, s_nation, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and c_region = 'ASIA' and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, lo_revenue desc; select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES' and d_year >= 
1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, lo_revenue desc; select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, lo_revenue desc; select c_city, s_city, d_year, sum(lo_revenue) as lo_revenue from lineorder,dates,customer,supplier where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and (c_city='UNITED KI1' or c_city='UNITED KI5') and (s_city='UNITED KI1' or s_city='UNITED KI5') and d_yearmonth = 'Dec1997' group by c_city, s_city, d_year order by d_year asc, lo_revenue desc; select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') group by d_year, c_nation order by d_year, c_nation; select d_year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA'and s_region = 'AMERICA' and (d_year = 1997 or d_year = 1998) and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') group by d_year, s_nation, p_category order by d_year, s_nation, p_category; select d_year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit from lineorder,dates,customer,supplier,part where lo_orderdate = d_datekey and lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_partkey = p_partkey and c_region = 'AMERICA'and s_nation = 'UNITED STATES' and (d_year = 1997 
or d_year = 1998) and p_category = 'MFGR#14' group by d_year, s_city, p_brand order by d_year, s_city, p_brand;