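Study notes for the Developer Academy course [Quick Start with Spark, the Big Data Real-Time Computing Framework: Runtime_Program Scheduling_1]. The notes follow the course closely so that readers can pick up the material quickly.
Course URL: https://developer.aliyun.com/learning/course/100/detail/1650
Runtime_Program Scheduling_1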
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
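Most of the properties listed above can be inspected on a live RDD. The following is a minimal sketch, assuming a local master (`local[2]`) and a small in-memory dataset; the object name `RddPropertiesSketch`, the helper `hashedPairs`, and the sample data are hypothetical and only serve the illustration. The compute function is not called directly here, since Spark invokes it itself when tasks run.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddPropertiesSketch {
  // Builds a small hash-partitioned key-value RDD from hypothetical sample data.
  def hashedPairs(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
      .partitionBy(new HashPartitioner(4))

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads instead of on a cluster.
    val conf = new SparkConf().setAppName("rdd-properties-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = hashedPairs(sc)
      // Property 1: the list of partitions.
      println(s"partitions:          ${rdd.partitions.length}")
      // Property 3: the dependencies on parent RDDs.
      println(s"dependencies:        ${rdd.dependencies}")
      // Property 4: the optional Partitioner of a key-value RDD.
      println(s"partitioner:         ${rdd.partitioner}")
      // Property 5: the optional preferred locations of one split.
      println(s"preferred locations: ${rdd.preferredLocations(rdd.partitions(0))}")
    } finally {
      sc.stop()
    }
  }
}
```

For an RDD built from an in-memory collection the preferred locations are typically empty; they become meaningful for sources such as HDFS files, where block locations are known.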
Spark runtime
Flow overview
Distributed file system (File system): loads the dataset
Transformations: operations on RDDs, evaluated lazily
Actions: trigger execution
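The difference between lazy transformations and actions can be observed with an accumulator. This is a minimal sketch, assuming a local master and hypothetical sample lines; the names `LazyEvaluationSketch`, `observe`, and `linesSeen` are made up for the illustration. The accumulator is updated inside `map`, so it stays at zero until an action forces the job to run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the accumulator value observed before and after the action.
  def observe(sc: SparkContext): (Long, Long) = {
    val seen = sc.longAccumulator("linesSeen")
    val lines = sc.parallelize(Seq("ERROR a", "INFO b", "ERROR c"))

    // Transformation: only recorded in the lineage, nothing is executed yet.
    val tagged = lines.map { line => seen.add(1); line.toUpperCase }
    val before = seen.value.longValue // still 0, because no job has run

    // Action: triggers the job, which runs the map function on every line.
    tagged.count()
    val after = seen.value.longValue

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads instead of on a cluster.
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = observe(sc)
      println(s"lines seen before the action: $before, after the action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

Accumulator updates made inside transformations may be applied more than once if a task is re-executed, so this pattern is suited to demonstrations and debugging rather than exact counting.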
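Code example
val lines = sc.textFile("hdfs://...")
Loads the file as an RDD
val errors = lines.filter(_.startsWith("ERROR"))
Transformation
errors.persist()
Marks the RDD for caching; it is materialized when the first action runs
val mysql_errors = errors.filter(_.contains("MySQL")).count
Action: triggers execution
val http_errors = errors.filter(_.contains("Http")).count
Action: triggers execution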
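The example above can be tried without an HDFS cluster. The following is a minimal sketch, assuming a local master and a handful of hypothetical log lines in place of the HDFS file; the names `LogMiningSketch`, `sampleLines`, and `countErrors` are not from the course. The two `count` calls reuse the persisted `errors` RDD, so the `startsWith("ERROR")` filter does not need to be recomputed for the second action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMiningSketch {
  // Hypothetical sample log lines standing in for the HDFS file in the notes.
  val sampleLines: Seq[String] = Seq(
    "ERROR MySQL connection refused",
    "INFO service started",
    "ERROR Http 500 returned by upstream",
    "ERROR MySQL query timed out"
  )

  // Returns (number of MySQL errors, number of Http errors).
  def countErrors(sc: SparkContext): (Long, Long) = {
    val lines = sc.parallelize(sampleLines)           // load the data as an RDD
    val errors = lines.filter(_.startsWith("ERROR"))  // transformation, lazy
    errors.persist()                                  // mark for caching

    // Each count() is an action; the first one also fills the cache.
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val httpErrors = errors.filter(_.contains("Http")).count()

    errors.unpersist()                                // release the cached blocks
    (mysqlErrors, httpErrors)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two worker threads instead of on a cluster.
    val conf = new SparkConf().setAppName("log-mining-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (mysqlErrors, httpErrors) = countErrors(sc)
      println(s"MySQL errors: $mysqlErrors, Http errors: $httpErrors")
    } finally {
      sc.stop()
    }
  }
}
```

With the sample lines above, the program reports 2 MySQL errors and 1 Http error.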