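Study notes for the Developer Academy course [Big Data Real-Time Computing Framework Spark Quick Start: Spark Operator Operations and Summary_3]. The notes follow the course closely to help readers learn the material quickly.
Course URL: https://developer.aliyun.com/learning/course/100/detail/1693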
Spark Operator Operations and Summary_3
Contents:
1. JoinOperator code
2. Choosing a storage level
1. JoinOperator code
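import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// The statements below sit inside a method; `sc` is assumed to be a
// JavaSparkContext created earlier (that part of the listing is not shown).

// Simulated input collections: (id, name) and (id, score)
List<Tuple2<Integer, String>> nameList = Arrays.asList(
        new Tuple2<Integer, String>(1, "xuruyun"),
        new Tuple2<Integer, String>(2, "liangyongqi"),
        new Tuple2<Integer, String>(3, "wangfei"),
        new Tuple2<Integer, String>(3, "annie"));

List<Tuple2<Integer, Integer>> scoreList = Arrays.asList(
        new Tuple2<Integer, Integer>(1, 150),
        new Tuple2<Integer, Integer>(2, 100),
        new Tuple2<Integer, Integer>(3, 80),
        new Tuple2<Integer, Integer>(3, 90));

// Turn each local list into a key-value RDD keyed by id
JavaPairRDD<Integer, String> nameRDD = sc.parallelizePairs(nameList);
JavaPairRDD<Integer, Integer> scoreRDD = sc.parallelizePairs(scoreList);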
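The listing stops before the join itself. In Spark the two pair RDDs would typically be combined with `nameRDD.join(scoreRDD)`, which pairs every name with every score that shares the same key. The sketch below is not Spark code: it is a plain-Java illustration, with no Spark dependency, of what an inner join by key produces for the sample data above. The class `JoinSketch` and its helper methods are hypothetical names introduced only for this example, and the output order shown here comes from the nested loops; Spark does not guarantee any particular order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Hypothetical illustration class: mimics inner-join-by-key semantics in plain Java.
class JoinSketch {

    // Small helper so the sample data reads like the Tuple2 lists in the notes.
    static <K, V> Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    static List<Entry<Integer, String>> sampleNames() {
        return Arrays.asList(
                pair(1, "xuruyun"),
                pair(2, "liangyongqi"),
                pair(3, "wangfei"),
                pair(3, "annie"));
    }

    static List<Entry<Integer, Integer>> sampleScores() {
        return Arrays.asList(
                pair(1, 150),
                pair(2, 100),
                pair(3, 80),
                pair(3, 90));
    }

    // Emits one "key:name:score" string for every (name, score) pair with equal keys.
    static List<String> innerJoin(List<Entry<Integer, String>> names,
                                  List<Entry<Integer, Integer>> scores) {
        List<String> joined = new ArrayList<>();
        for (Entry<Integer, String> n : names) {
            for (Entry<Integer, Integer> s : scores) {
                if (n.getKey().equals(s.getKey())) {
                    joined.add(n.getKey() + ":" + n.getValue() + ":" + s.getValue());
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        for (String row : innerJoin(sampleNames(), sampleScores())) {
            System.out.println(row);
        }
    }
}
```

Key 3 appears twice on each side, so it contributes 2 x 2 = 4 rows, and the join yields 6 rows in total for this data.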
2. Choosing a storage level
Which Storage Level to Choose?
Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise,recomputing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:
It allows multiple executors to share the same pool of memory in Tachyon.
It significantly reduces garbage collection costs.
Cached data is not lost if individual executors crash.
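The selection process above can be read as a small decision procedure. The sketch below encodes one interpretation of it in plain Java, with no Spark dependency. The class `StorageLevelChooser`, its method, and its four boolean parameters are hypothetical names introduced for illustration; the returned strings are names of Spark storage levels, and the corresponding level would then be passed to `persist(...)` on the RDD. Choosing MEMORY_AND_DISK_SER for the spill-to-disk case is an assumption of this sketch, and the experimental OFF_HEAP mode is left out.

```java
// Hypothetical helper: one reading of the storage-level guidance as a decision procedure.
class StorageLevelChooser {

    /**
     * @param fitsDeserialized     the RDD fits comfortably in memory as plain objects
     * @param fitsSerialized       the RDD fits in memory once serialized
     * @param expensiveToRecompute recomputing a partition is costly, or the lineage
     *                             filters away a large amount of data
     * @param needsFastRecovery    tasks must keep running without waiting for recomputation
     * @return the name of the suggested storage level
     */
    static String choose(boolean fitsDeserialized,
                         boolean fitsSerialized,
                         boolean expensiveToRecompute,
                         boolean needsFastRecovery) {
        String level;
        if (fitsDeserialized) {
            // Default level: the most CPU-efficient option.
            level = "MEMORY_ONLY";
        } else if (fitsSerialized || !expensiveToRecompute) {
            // More space-efficient; partitions that do not fit are simply recomputed.
            level = "MEMORY_ONLY_SER";
        } else {
            // Spill to disk only because recomputation would be expensive
            // (assumption of this sketch: the serialized disk-backed level).
            level = "MEMORY_AND_DISK_SER";
        }
        // Replicated variants carry the "_2" suffix.
        return needsFastRecovery ? level + "_2" : level;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false, false));   // MEMORY_ONLY
        System.out.println(choose(false, true, false, false));  // MEMORY_ONLY_SER
        System.out.println(choose(false, false, true, false));  // MEMORY_AND_DISK_SER
        System.out.println(choose(false, true, false, true));   // MEMORY_ONLY_SER_2
    }
}
```

The checks are ordered the same way as the guidance: stay with the default if it fits, fall back to serialized storage next, and treat disk and replication as opt-in choices driven by recomputation cost and recovery needs.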