A First Look at SparkSQL (1)

Summary: quick study notes for the lesson "A First Look at SparkSQL (1)".

These notes accompany the Developer Academy course "Big Data Real-Time Computing Framework Spark Quick Start: A First Look at SparkSQL (1)" (SparkSQL 初识_1). They follow the course closely so that readers can pick up the material quickly.

Course address: https://developer.aliyun.com/learning/course/100/detail/1698


A First Look at SparkSQL (1)


Contents:


1. Shared Variables

2. Broadcast Variables

3. Accumulators


1. Shared Variables


Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Translation: Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) runs on a remote cluster node, it operates on separate copies of all the variables used in that function. The variables are copied to each machine, and updates made to them on a remote machine are not propagated back to the driver program. Supporting general read-write shared variables across tasks would be inefficient, so Spark instead provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
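This first point is easy to trip over in practice. Below is a minimal sketch (not from the course; it assumes the Spark 1.x Java API, and the class name and local-mode context are illustrative) showing that an ordinary variable captured by a task closure is copied to each executor, so mutating it there never changes the driver's copy:

import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;

public class ClosureCopyDemo {
  public static void main(String[] args) {
    // Hypothetical setup: a local-mode JavaSparkContext for illustration.
    JavaSparkContext sc = new JavaSparkContext("local", "ClosureCopyDemo");

    int[] counter = {0};  // an ordinary local variable, not a shared variable

    // Each task receives its own deserialized copy of the closure,
    // including its own copy of counter.
    sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> counter[0] += x);

    // On a cluster, counter[0] is still 0 here: the executors updated their
    // own copies, and nothing was propagated back to the driver. (In local
    // mode tasks may share the driver's JVM, so the result can differ;
    // do not rely on it.)
    System.out.println(counter[0]);

    sc.stop();
  }
}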


2. Broadcast Variables


Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this.
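A minimal sketch of this, modeled on the broadcast example in the Spark 1.x Java API (the class name and the local-mode context are illustrative assumptions, not part of the course):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastDemo {
  public static void main(String[] args) {
    // Hypothetical setup: a local-mode JavaSparkContext for illustration.
    JavaSparkContext sc = new JavaSparkContext("local", "BroadcastDemo");

    // Wrap the value in a broadcast variable; Spark ships it to each
    // node once rather than once per task.
    Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

    // Read the wrapped value through value().
    int[] data = broadcastVar.value();  // [1, 2, 3]
    System.out.println(data.length);

    sc.stop();
  }
}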

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast, in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).

Translation: Broadcast Variables

Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient way. Spark also tries to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a series of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by the tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before each task runs. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.

A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster, so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast, to ensure that all nodes get the same value of the broadcast variable (for example, if the variable is later shipped to a new node).


3. Accumulators


Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
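A minimal sketch of this pattern, again using the Spark 1.x Java API (class name and local-mode context are illustrative assumptions):

import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaSparkContext;

public class AccumulatorDemo {
  public static void main(String[] args) {
    // Hypothetical setup: a local-mode JavaSparkContext for illustration.
    JavaSparkContext sc = new JavaSparkContext("local", "AccumulatorDemo");

    // Create an accumulator with initial value 0.
    Accumulator<Integer> accum = sc.accumulator(0);

    // Tasks may only add to the accumulator; they cannot read it.
    sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));

    // Only the driver reads the accumulated result.
    System.out.println(accum.value());  // 10

    sc.stop();
  }
}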

While this code used the built-in support for accumulators of type Integer, programmers can also create their own types by subclassing AccumulatorParam. The AccumulatorParam interface has two methods: zero, for providing a "zero value" for your data type, and addInPlace, for adding two values together. For example, supposing we had a Vector class representing mathematical vectors, we could write:

Translation: Accumulators

Accumulators are variables that are only "added" to through an associative operation, and can therefore be supported efficiently in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they are displayed in Spark's UI, which can be useful for understanding the progress of running stages (note: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python); however, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.

This code uses the built-in support for accumulators of type Integer, but programmers can also create their own types by subclassing AccumulatorParam. The AccumulatorParam interface has two methods: zero, which provides a "zero value" for your data type, and addInPlace, which adds two values together. For example, supposing we had a Vector class representing mathematical vectors, we could write:

import org.apache.spark.AccumulatorParam;

// Vector is the hypothetical math-vector class described above.
class VectorAccumulatorParam implements AccumulatorParam<Vector> {
  // Provide a "zero value" for the data type.
  public Vector zero(Vector initialValue) {
    return Vector.zeros(initialValue.size());
  }
  // Add two values of the type together.
  public Vector addInPlace(Vector v1, Vector v2) {
    v1.addInPlace(v2);
    return v1;
  }
}
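Following the same 1.x-era Java API, an accumulator of this custom type would then be created by passing the param object to the accumulator call (sc and initialVector are hypothetical: an existing JavaSparkContext and a starting Vector value):

Accumulator<Vector> vecAccum = sc.accumulator(initialVector, new VectorAccumulatorParam());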
