A First Look at SparkSQL (1)

Summary: Quickly learn "A First Look at SparkSQL (1)".

These study notes accompany the Developer Academy course "Quick Start with the Big Data Real-Time Computing Framework Spark: A First Look at SparkSQL (1)". They follow the course closely so that users can pick up the material quickly.

Course address: https://developer.aliyun.com/learning/course/100/detail/1698


A First Look at SparkSQL (1)


Overview:


1. Shared Variables

2. Broadcast Variables

3. Accumulators


1. Shared Variables


Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.



2. Broadcast Variables


Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this.

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast, in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
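A minimal sketch of this usage in Java (the guide's own "code below" is not reproduced in these notes; the class name, local master setting, and sample data here are only illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BroadcastDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Wrap a read-only value; Spark ships it to each node at most once.
        Broadcast<int[]> broadcastVar = sc.broadcast(new int[]{1, 2, 3});

        // Inside tasks, read the data through value() instead of capturing the raw array.
        sc.parallelize(Arrays.asList(10, 20, 30))
          .map(x -> x + broadcastVar.value()[0])
          .collect()
          .forEach(System.out::println);

        sc.stop();
    }
}
```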



3. Accumulators


Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
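The paragraph below refers to "this code", meaning the guide's built-in integer accumulator example; a minimal sketch of that usage in Java follows (the class name and input path are placeholders):

```java
import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AccumulatorDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AccumulatorDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Built-in integer accumulator, created on the driver from an initial value.
        Accumulator<Integer> blankLines = sc.accumulator(0);

        // Tasks may only add to the accumulator; they cannot read it.
        sc.textFile("data.txt")   // placeholder input path
          .foreach(line -> {
              if (line.isEmpty()) {
                  blankLines.add(1);
              }
          });

        // Only the driver can read the result, via value().
        System.out.println("blank lines = " + blankLines.value());
        sc.stop();
    }
}
```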

While this code used the built-in support for accumulators of type Integer, programmers can also create their own types by subclassing AccumulatorParam. The AccumulatorParam interface has two methods: zero for providing a "zero value" for your data type, and addInPlace for adding two values together. For example, supposing we had a Vector class representing mathematical vectors, we could write:


```java
// AccumulatorParam implementation for the hypothetical mathematical Vector class mentioned above.
class VectorAccumulatorParam implements AccumulatorParam<Vector> {
    // Provide a "zero value" for the data type, with the same size as the given initial value.
    public Vector zero(Vector initialValue) {
        return Vector.zeros(initialValue.size());
    }
    // Add two values together, merging partial results from different tasks.
    public Vector addInPlace(Vector v1, Vector v2) {
        v1.addInPlace(v2);
        return v1;
    }
}
```
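The driver could then create an accumulator of this type by passing the custom parameter object, for example (assuming sc is an existing JavaSparkContext and the Vector constructor is part of the hypothetical class):

```java
// Accumulator backed by the custom VectorAccumulatorParam defined above.
Accumulator<Vector> vecAccum = sc.accumulator(new Vector(/* initial contents */), new VectorAccumulatorParam());
```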
