Spark-Zeppelin Big Data Visualization and Analysis


Official website introduction

Multi-purpose Notebook

The Notebook is the place for all your needs

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration


Multiple language backend

Zeppelin interpreter concept allows any language/data-processing-backend to be plugged into Zeppelin. Currently Zeppelin supports many interpreters such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell.

Adding new language-backend is really simple. Learn how to write a zeppelin interpreter.


Apache Spark integration

Zeppelin provides built-in Apache Spark integration. You don't need to build a separate module, plugin or library for it.

Zeppelin's Spark integration provides

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from local filesystem or maven repository. Learn more about dependency loader.
  • Canceling job and displaying its progress
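
As a small illustration of the first two points above, here is a minimal sketch of a notebook paragraph (assuming the default %spark interpreter): the injected sc and sqlContext can be used directly, with no setup code.

// sc and sqlContext are already injected by Zeppelin; no SparkContext creation needed
val nums = sc.parallelize(1 to 100)
println("sum = " + nums.sum())

// Extra jars can be loaded at runtime in a %dep paragraph (run before the
// Spark interpreter starts), e.g. z.load("groupId:artifactId:version")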


Data visualization

Some basic charts are already included in Zeppelin. Visualizations are not limited to SparkSQL's query, any output from any language backend can be recognized and visualized.

Pivot chart

With a simple drag and drop, Zeppelin aggregates the values and displays them in a pivot chart. You can easily create charts with multiple aggregated values, including sum, count, average, min and max.

Learn more about Zeppelin's Display system. (text, html, table, angular)
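
For example, the display system can be driven directly from code: any interpreter output that begins with %table (tab-separated columns, newline-separated rows) is rendered as an interactive table with the built-in chart and pivot controls. A minimal sketch in a Scala paragraph:

// Output starting with %table is rendered as an interactive table/chart
println("%table interpreter\tparagraphs\nspark\t10\nsql\t5\nmarkdown\t2")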


Dynamic forms

Zeppelin can dynamically create some input forms into your notebook.

Learn more about Dynamic Forms.
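
For instance, with the Spark interpreter the ZeppelinContext (z) can create a form programmatically. The sketch below assumes z.input, which renders a text input box in the paragraph and returns its value:

// Renders an input form labeled "maxAge" with default value 30
val maxAge = z.input("maxAge", "30").toString.toInt
println("Max age: " + maxAge)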


Collaboration

Notebook URL can be shared among collaborators. Zeppelin can then broadcast any changes in real time, just like collaboration in Google Docs.


Publish

Zeppelin provides a URL to display the result only; that page does not include Zeppelin's menu and buttons. This way, you can easily embed it as an iframe inside your website.


100% Opensource

Apache Zeppelin (incubating) is Apache2 Licensed software. Please check out the source repository and How to contribute.

Zeppelin has a very active development community. Join the Mailing list and report issues on our Issue tracker.


Undergoing Incubation

Apache Zeppelin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Installation

From binary package

Download latest binary package from Download.

Build from source

Check instructions in README to build from source.

Configure

Configuration can be done through both environment variables (conf/zeppelin-env.sh) and Java properties (conf/zeppelin-site.xml). If both are defined, the environment variable is used.
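For example, setting ZEPPELIN_PORT=9090 in conf/zeppelin-env.sh overrides zeppelin.server.port in conf/zeppelin-site.xml, and the server listens on port 9090 instead of the default 8080.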

zeppelin-env.sh | zeppelin-site.xml | Default value | Description
----------------------------------------------------------------
ZEPPELIN_PORT | zeppelin.server.port | 8080 | Zeppelin server port
ZEPPELIN_MEM | N/A | -Xmx1024m -XX:MaxPermSize=512m | JVM mem options
ZEPPELIN_INTP_MEM | N/A | ZEPPELIN_MEM | JVM mem options for the interpreter process
ZEPPELIN_JAVA_OPTS | N/A | | JVM options
ZEPPELIN_ALLOWED_ORIGINS | zeppelin.server.allowed.origins | * | Comma-separated list of allowed origins for REST and websockets, e.g. http://localhost:8080
ZEPPELIN_SERVER_CONTEXT_PATH | zeppelin.server.context.path | / | Context path of the web application
ZEPPELIN_SSL | zeppelin.ssl | false |
ZEPPELIN_SSL_CLIENT_AUTH | zeppelin.ssl.client.auth | false |
ZEPPELIN_SSL_KEYSTORE_PATH | zeppelin.ssl.keystore.path | keystore |
ZEPPELIN_SSL_KEYSTORE_TYPE | zeppelin.ssl.keystore.type | JKS |
ZEPPELIN_SSL_KEYSTORE_PASSWORD | zeppelin.ssl.keystore.password | |
ZEPPELIN_SSL_KEY_MANAGER_PASSWORD | zeppelin.ssl.key.manager.password | |
ZEPPELIN_SSL_TRUSTSTORE_PATH | zeppelin.ssl.truststore.path | |
ZEPPELIN_SSL_TRUSTSTORE_TYPE | zeppelin.ssl.truststore.type | |
ZEPPELIN_SSL_TRUSTSTORE_PASSWORD | zeppelin.ssl.truststore.password | |
ZEPPELIN_NOTEBOOK_HOMESCREEN | zeppelin.notebook.homescreen | | ID of the notebook to be displayed on the homescreen, e.g. 2A94M5J1Z
ZEPPELIN_NOTEBOOK_HOMESCREEN_HIDE | zeppelin.notebook.homescreen.hide | false | Hide the homescreen notebook from the list when this value is set to "true"
ZEPPELIN_WAR_TEMPDIR | zeppelin.war.tempdir | webapps | Location of the Jetty temporary directory
ZEPPELIN_NOTEBOOK_DIR | zeppelin.notebook.dir | notebook | Where notebook files are saved
ZEPPELIN_NOTEBOOK_S3_BUCKET | zeppelin.notebook.s3.bucket | zeppelin | Bucket where notebooks are saved
ZEPPELIN_NOTEBOOK_S3_USER | zeppelin.notebook.s3.user | user | User in the bucket where notebooks are saved, e.g. bucket/user/notebook/2A94M5J1Z/note.json
ZEPPELIN_NOTEBOOK_STORAGE | zeppelin.notebook.storage | org.apache.zeppelin.notebook.repo.VFSNotebookRepo | Comma-separated list of notebook storage backends
ZEPPELIN_INTERPRETERS | zeppelin.interpreters | org.apache.zeppelin.spark.SparkInterpreter, org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkSqlInterpreter, org.apache.zeppelin.spark.DepInterpreter, org.apache.zeppelin.markdown.Markdown, org.apache.zeppelin.shell.ShellInterpreter, org.apache.zeppelin.hive.HiveInterpreter, ... | Comma-separated interpreter configurations [Class]. The first interpreter becomes the default
ZEPPELIN_INTERPRETER_DIR | zeppelin.interpreter.dir | interpreter | Zeppelin interpreter directory


You'll also need to configure individual interpreters. Information can be found in the 'Interpreter' section of this documentation.

For example, Spark.


Start/Stop

Start Zeppelin

bin/zeppelin-daemon.sh start

After successful start, visit http://localhost:8080 with your web browser.

Stop Zeppelin

bin/zeppelin-daemon.sh stop

Practical example:

Zeppelin Tutorial

We will assume you have Zeppelin installed already. If that's not the case, see Install.

Zeppelin's current main backend processing engine is Apache Spark. If you're new to the system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.


Tutorial with Local File

Data Refine

Before you start Zeppelin tutorial, you will need to download bank.zip.

First, to transform the data from CSV format into an RDD of Bank objects, run the following script. This will also remove the header using the filter function.

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class  
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and register it as a temporary table
bank.toDF().registerTempTable("bank")
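
Before querying, a quick sanity check on the parsed RDD can help catch parsing problems early (a small optional addition, not part of the original tutorial):

// Print a few parsed Bank records and the total row count
bank.take(3).foreach(println)
println("rows: " + bank.count())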


Data Retrieval

Suppose we want to see the age distribution from bank. To do this, run:

%sql select age, count(1) from bank where age < 30 group by age order by age

You can make an input box for setting the age condition by replacing 30 with ${maxAge=30}.

%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see the age distribution for a certain marital status, and add a combo box to select the marital status. Run:

%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age


Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get API keys, you should fill in the credential-related values (apiKey, apiSecret, accessToken, accessTokenSecret) with your API keys in the following script.

This will create an RDD of Tweet objects and register the stream data as a table:

import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach{ case(key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
  Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
  // Below line works only in spark 1.3.0.
  // For spark 1.1.x and spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()


Data Retrieval

For each of the following scripts, every time you click the run button you will see a different result, since it is based on real-time data.

Let's begin by extracting at most 10 tweets which contain the word "girl".

%sql select * from tweets where text like '%girl%' limit 10

This time, suppose we want to see how many tweets have been created per second during the last 60 seconds. To do this, run:

%sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can make a user-defined function and use it in Spark SQL. Let's try it by making a function named sentiment. This function will return one of three attitudes (positive, negative, neutral) towards the parameter.

def sentiment(s:String) : String = {
    val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
    val negative = Array("hate", "bad", "stupid", "is")

    var st = 0;

    val words = s.split(" ")    
    positive.foreach(p =>
        words.foreach(w =>
            if(p==w) st = st+1
        )
    )

    negative.foreach(p=>
        words.foreach(w=>
            if(p==w) st = st-1
        )
    )
    if(st>0)
        "positivie"
    else if(st<0)
        "negative"
    else
        "neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

To check how people think about girls using the sentiment function we've made above, run this:

%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
