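1. Background
When a job in an EMR cluster writes data to OSS, the data is first buffered locally and then uploaded to OSS in a single pass. EMR supports two buffering strategies: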
- disk
- off-heap
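The two buffering strategies suit slightly different scenarios:
- The local-disk buffering strategy works in any scenario and can handle large file uploads.
- The off-heap memory buffering strategy outperforms disk buffering but is constrained by available memory. In the implementation, off-heap allocation is capped within a fixed range; when data is produced faster than it can be uploaded, the output stream blocks until in-flight upload tasks complete.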
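The blocking behavior of the off-heap strategy can be pictured with a minimal sketch. This is not the EMR implementation: the class `BoundedDirectBufferPool` and all of its methods are hypothetical names chosen for illustration, and the only assumption is that a fixed number of direct buffers is shared between the writer and the uploaders.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration: a fixed-size pool of direct (off-heap) buffers. */
final class BoundedDirectBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    BoundedDirectBufferPool(int bufferCount, int bufferSizeBytes) {
        free = new ArrayBlockingQueue<>(bufferCount);
        for (int i = 0; i < bufferCount; i++) {
            // allocateDirect reserves memory outside the Java heap
            free.add(ByteBuffer.allocateDirect(bufferSizeBytes));
        }
    }

    /** Blocks the writer until an uploader has returned a buffer. */
    ByteBuffer acquire() throws InterruptedException {
        return free.take();
    }

    /** Non-blocking variant: returns null when every buffer is in use. */
    ByteBuffer tryAcquire() {
        return free.poll();
    }

    /** Called after an upload finishes, making the buffer reusable. */
    void release(ByteBuffer buffer) {
        buffer.clear();
        free.add(buffer);
    }

    /** Number of buffers currently free. */
    int available() {
        return free.size();
    }
}
```

A writer that calls `acquire()` while all buffers are held by pending uploads simply waits, which is the same back-pressure effect described above: total off-heap usage stays bounded at the cost of stalling the producer.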
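Potential issues:
- Jobs submitted to YARN: with the off-heap strategy, the container may exceed its memory limit and be killed by YARN. Set the memory parameters with particular care, otherwise job stability will suffer.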
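One way to reason about that risk is a rough budget check before choosing the off-heap strategy. The sketch below is an assumption, not a documented EMR formula: it supposes that the worst-case off-heap demand is roughly block size multiplied by upload concurrency, and the class `OffHeapBudget` and its methods are hypothetical.

```java
/** Hypothetical helper: rough worst-case estimate of off-heap buffer demand. */
final class OffHeapBudget {
    private OffHeapBudget() {}

    /** Assumed model: every concurrent upload holds one full block in off-heap memory. */
    static long worstCaseBufferBytes(long blockSizeBytes, int concurrency) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(blockSizeBytes, (long) concurrency);
    }

    /** True if the estimate fits in the off-heap headroom left in the container. */
    static boolean fits(long blockSizeBytes, int concurrency, long offHeapHeadroomBytes) {
        return worstCaseBufferBytes(blockSizeBytes, concurrency) <= offHeapHeadroomBytes;
    }
}
```

Under this assumed model, a 128 MB block size with a concurrency of 5 would need about 640 MB of off-heap memory. If the container's non-heap headroom or the JVM's direct-memory limit is smaller than the estimate, the disk strategy is the safer choice.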
2. Usage
Set "fs.oss.upload.bufferType" in the job parameters. Valid values are "disk" and "off-heap". For example:
1. hadoop fs -Dfs.oss.upload.bufferType=disk -put a.txt oss://xxx/xxx/
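2. Hadoop job:
import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();
// Buffer OSS uploads in off-heap memory instead of on local disk
conf.set("fs.oss.upload.bufferType", "off-heap");
...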
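3. Spark job:
import org.apache.spark.SparkConf
val conf = new SparkConf()
// The "spark.hadoop." prefix passes the setting through to the Hadoop configuration
conf.set("spark.hadoop.fs.oss.upload.bufferType", "off-heap")
...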
3. Benchmark
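Test environment: VPC network, SSD cloud disk / ultra cloud disk, MN4 instance type with 4 cores and 16 GB of memory. The figures measure pure data-write time.
File size | Block size | Concurrency | Disk buffer (SSD cloud disk) | Disk buffer (ultra cloud disk) | Off-heap buffer | Improvement vs. SSD cloud disk (%) | Improvement vs. ultra cloud disk (%) |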
---|---|---|---|---|---|---|---|
1024MB | 256KB | 5 | 23009ms | 20773ms | 18661ms | +18.8% | +10.2% |
1024MB | 1MB | 5 | 11310ms | 18524ms | 10233ms | +9.5% | +44.8% |
1024MB | 4MB | 5 | 10318ms | 18001ms | 10191ms | +1.5% | +43.4% |
1024MB | 16MB | 5 | 10212ms | 17796ms | 10184ms | +0.3% | +42.8% |
1024MB | 64MB | 5 | 10945ms | 18612ms | 10216ms | +6.7% | +45.1% |
1024MB | 128MB | 5 | 13240ms | 20181ms | OOM: Direct buffer memory | N/A | N/A |
256MB | 256KB | 5 | 4511ms | 4968ms | 4636ms | -2.7% | +6.7% |
256MB | 1MB | 5 | 2417ms | 4474ms | 2381ms | +1.5% | +46.8% |
256MB | 4MB | 5 | 2417ms | 4386ms | 2433ms | -0.7% | +44.3% |
256MB | 16MB | 5 | 2433ms | 4337ms | 2465ms | -1.3% | +43.2% |
256MB | 64MB | 5 | 3232ms | 5273ms | 2411ms | +33.7% | +54.3% |
256MB | 128MB | 5 | 4392ms | 6197ms | 3118ms | +29.0% | +49.7% |
64MB | 256KB | 5 | 1252ms | 1337ms | 1252ms | +0% | +6.4% |
64MB | 1MB | 5 | 611ms | 1117ms | 577ms | +5.6% | +48.3% |
64MB | 4MB | 5 | 567ms | 1084ms | 559ms | +1.4% | +48.4% |
64MB | 16MB | 5 | 597ms | 1108ms | 624ms | -4.5% | +43.7% |
64MB | 64MB | 5 | 1569ms | 1491ms | 1499ms | +4.5% | -0.5% |
64MB | 128MB | 5 | 1459ms | 1730ms | 1412ms | +3.2% | +18.4% |
16MB | 256KB | 5 | 459ms | 417ms | 383ms | +16.6% | +8.2% |
16MB | 1MB | 5 | 221ms | 307ms | 220ms | +0% | +28.3% |
16MB | 4MB | 5 | 254ms | 327ms | 198ms | +22.0% | +39.4% |
16MB | 16MB | 5 | 431ms | 398ms | 418ms | +3.0% | -5% |
16MB | 64MB | 5 | 412ms | 425ms | 400ms | +2.9% | +5.9% |
16MB | 128MB | 5 | 418ms | 405ms | 443ms | -5.9% | -9.3% |
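
The improvement columns appear to be the time saved relative to the disk-buffer time, that is, (disk - off-heap) / disk. This formula is inferred from the table values rather than stated by the source, and the class `Improvement` below is a hypothetical helper for reproducing a figure.

```java
import java.util.Locale;

/** Hypothetical helper: relative improvement of off-heap over disk buffering. */
final class Improvement {
    private Improvement() {}

    /** Percentage of disk-buffer time saved by the off-heap buffer; negative means slower. */
    static double percent(long diskMillis, long offHeapMillis) {
        return (diskMillis - offHeapMillis) * 100.0 / diskMillis;
    }

    /** Formats with an explicit sign and one decimal place, matching the table style. */
    static String format(double percent) {
        return String.format(Locale.ROOT, "%+.1f%%", percent);
    }
}
```

For the 1024MB / 1MB row, `Improvement.percent(11310, 10233)` is about 9.5 and `Improvement.percent(18524, 10233)` is about 44.8, which matches the two improvement columns for that row.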