Denodo For Big Data


In order to build Virtual Data Services, the user follows three simple steps:

  1. Connect & Virtualize Any Source. Quickly access disparate structured and unstructured data sources using the included connectors. Introspect their metadata and expose them as normalized source views in the data virtualization layer.
  2. Combine & Integrate into Business Data Views. Combine, integrate, transform and cleanse source views into canonical, model-driven business views of data - entirely in a GUI or through documented scripting.
  3. Publish & Secure Data Services. Any of the virtual data views can be secured and published as SQL views or in a dozen other data service formats (HTML, JSON, XML, etc.).

Discover and Consume Integrated Information
• Global Metadata / Data discovery. Global information search capability allows any user or application to discover, search, browse and eventually query both metadata and data through virtual data services to retrieve information.
• Hybrid Query Optimization. The best DV platforms utilize a combination of real-time query optimization and rewriting, intelligent caching, and selective data movement to achieve superior response and performance against both on-demand pull and scheduled batch push data requests.
• Integrated Business Information. Data virtualization delivers integrated information while hiding complexities of accessing disparate data. Users and applications get what they want, in the format they want, with real-time high performance.
Perform Data Governance and DV Management
• Data Governance. The DV layer serves as a flexible and unified layer to expose business metadata to users. At the same time, it helps users understand the underlying data layers through data profiling, data lineage, change impact analysis and other tools, and it exposes the need for data normalization and quality fixes in the underlying sources. Thus DV can be the "single point of reference" for governing information.
• Security and Service Level Policy. All data views from source level to canonical business views to data services can be secured and authenticated to users, groups and roles at highly granular view-row-column level. Further custom security and access policies can throttle or manage service levels to protect source systems from overuse.
• Monitoring and Management. Leading DV platforms include several monitors, dashboards, audit logs, and management consoles to ensure smooth operation of the DV solution. They also provide tools for managing clusters, high availability and users/roles, and for migrating virtual data between development, test and production environments.

In this Sandbox, you will get to see how the Denodo Platform can integrate and combine data from different analytical data sources and present them to users as a single, integrated data set. Users can then access the data using their preferred tools, such as the Zeppelin workbench (the tool that we’ll use in this sandbox).

Getting Started
To follow these steps, you will use the Denodo administration tool. This is the client that Denodo administrators and developers use to define the data models and configure the system. It’s similar to tools like SQL Developer, Toad or SQL Server Management Studio. If you are familiar with relational databases, it won’t take you long to learn how to use it. If you accidentally close this window, there is an "Admin Tool" icon on the desktop that you can use to open it again.
You can log in using the following credentials:
Login: admin
Password: admin
Once connected, you will see a panel on the left-hand side, the Server Explorer, with the existing Denodo virtual databases, data sources and views that are already available in the server. If you double-click on any of the objects, it will open in a new tab in the workspace area.

In particular, we are going to use two databases in this sandbox:
sample_completed: this database contains fully configured data sources and example views and reports. Feel free to browse through this database and review and execute the existing views, and use them as examples.
my_test_drive: this is the database where you will create your own models. It only contains the pre-configured connections to the data sources. You will have to build the rest.
Analytical Queries with Data Virtualization
Working on a real example is the best way to learn how something works, isn’t it? This section will show you how to work with Denodo to run typical analytical queries in a distributed multi-source scenario.

Before you get started, under the sample_completed database, let's focus on the Application Layer folder. In particular, look at the Historical Reports folder, and open the view named same_store_sales_by_year.

This view represents the sales trends from year to year for each store. This is a key metric for retail companies as it shows how stores are faring - growing or declining - year over year.

SELECT bv_store.s_store_sk AS store_id,
    bv_store.s_city AS city,
    bv_store.s_state AS state,
    bv_date_dim.d_year AS year,
    SUM(store_sales.ss_net_paid_inc_tax) AS total_sales
FROM
    store_sales AS store_sales
    JOIN bv_date_dim AS bv_date_dim
        ON store_sales.ss_sold_date_sk = bv_date_dim.d_date_sk
    JOIN bv_store AS bv_store
        ON store_sales.ss_store_sk = bv_store.s_store_sk
GROUP BY bv_store.s_store_sk, bv_store.s_city, bv_store.s_state, bv_date_dim.d_year
ORDER BY store_id ASC, year ASC

Notice what this query involves:
• Three different data sources: Redshift, Aurora and Impala.
• Large data volumes: more than 288 million rows go into this calculation.
Run the query
To run this query, simply click on the Execution wizard button on the top right side of the view panel. Once the panel opens in the lower half of the screen, click on the Execute button to launch the query. It will run a SELECT * FROM same_store_sales_by_year.

How is that possible?
How can a query that brings hundreds of millions of rows through the network, from three different remote sources, take less than 15 seconds?
What you see here is Denodo’s optimizer in action. In particular, you are seeing the result of three different techniques, carefully orchestrated by the optimizer:

  1. Cost based optimization (CBO): Denodo uses table statistics and other meta-information (indexes, primary keys, etc.) to estimate data volumes and the associated cost for different possible execution plans. It will choose the plan with the lowest estimated cost.
  2. Query push-down: Instead of bringing entire tables, the engine generates SQL queries for each data source to retrieve only the data required for this calculation. This way, it takes advantage of the execution engine of the data source.
  3. Query rewriting: Denodo’s engine rewrites the execution plan into an equivalent one that is much more efficient in a federated scenario. In this case, Denodo applies a technique called partial aggregation pushdown, which splits the final aggregation (by store, city and year) into two steps, so that part of the aggregation can be pushed down to the sources, as the sketch below illustrates. This technique dramatically reduces the data transferred through the network.
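To make the rewriting concrete, here is a sketch of the two steps in plain SQL. This is an illustration of the technique, not the literal SQL that Denodo generates; partial_sales and partial_results are hypothetical names for the intermediate result set.

-- Step 1: pushed down to each sales source (e.g. Impala, Redshift).
-- Each source pre-aggregates by the join keys, shrinking the transferred
-- rows from hundreds of millions to one per store and day.
SELECT ss_store_sk,
       ss_sold_date_sk,
       SUM(ss_net_paid_inc_tax) AS partial_sales
FROM store_sales
GROUP BY ss_store_sk, ss_sold_date_sk

-- Step 2: executed by Denodo over the much smaller intermediate set.
-- The partial sums are joined with the dimensions and re-aggregated
-- to the final grouping (store, city, state, year). SUM is decomposable,
-- so summing the partial sums yields exactly the original totals.
SELECT s.s_store_sk AS store_id,
       s.s_city AS city,
       s.s_state AS state,
       d.d_year AS year,
       SUM(p.partial_sales) AS total_sales
FROM partial_results AS p
    JOIN bv_date_dim AS d ON p.ss_sold_date_sk = d.d_date_sk
    JOIN bv_store AS s ON p.ss_store_sk = s.s_store_sk
GROUP BY s.s_store_sk, s.s_city, s.s_state, d.d_year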

Just try disabling the optimizer in the Denodo server settings and running the query again. To do so, click on Administration > Server Configuration > Queries Optimization. Try disabling the different optimizer settings and run the query again after the change. Without the static optimizer (which is in charge of the query rewriting), the execution takes around 13 minutes to complete. This is what other federation engines do!

Building a Model Step-by-Step: Connecting to the Sources
Now that you have seen a pre-built query in action, let’s build one from scratch. For this section, you will use the database my_test_drive instead. Remember that you can still go back to sample_completed and use those views as a reference if you get stuck.

In my_test_drive you have the same folder structure that we saw in the previous section. Try to keep things organized during development; it will make your life easier. You can create new folders, rename your views, and drag & drop them into a different folder when needed.

For this sandbox, you will need the following base views:

ds_aurora_application

    Database Customer_MDM
        customer
        customer_address
    Database PIM
        store

ds_impala_historical_data

    Schema tpcds
        tpcds_store_sales
        catalog_returns

ds_redshift_edw

    Database tpcds
        Schema tpcds_schema
            store_sales
            DATE_DIM

To open the shell, go to Tools > VQL Shell in the menu bar. Here you can run any query you want. VQL stands for Virtual Query Language, and it’s just Denodo’s flavor of standard SQL. Just make sure you select the right database in the Database drop-down on the top-left of this panel. For example, try counting the rows of the historical sales base view:
select count(*) from my_test_drive.historical_store_sales ==> 219474321
Denodo’s engine generates a SQL query with the corresponding operations in the dialect of the target database.
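For instance, for the count above, the query shipped to the Impala source would look roughly like the following. This is an illustrative sketch, not the exact SQL Denodo generates, which depends on the source adapter and the base view definition:

SELECT COUNT(*) FROM tpcds.tpcds_store_sales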

You need to gather statistics to feed the cost-based optimizer. Denodo provides a Statistics Manager to help you with this. To open it, go to Tools > Manage Statistics in the top menu. Select the right database (my_test_drive) in the drop-down list and check all of your base views. Then click on Gather statistics.

Some databases have no statistics, or the statistics are incomplete. In those cases, Denodo can calculate them by sending a few queries to the underlying tables.
That is the case for historical_store_sales from Impala. To do this, click on the view, go to Options > Statistics, and enable the option "Complete missing statistics executing SELECT queries". The table will fill up with data after a few seconds. Click the Save button to keep these valuable statistics!
You can run a query like this in two different ways:
Define a composite data model (a derived view, in Denodo’s lingo) using the graphical wizards for data modeling.
Write your own SQL. This is the preferred method of some power users with a lot of SQL experience.

Create the UNION of all sales
Let’s now go back to the my_test_drive database to create your own view. You can access the modeling wizards by right-clicking in the server explorer tree.
Right-click on the folder name and choose New > Union to open the modeling wizard for UNION views. The wizard starts empty; you will have to drag and drop the two base views involved in this model into the workspace:
historical_store_sales (Impala)
ttm_store_sales (Redshift)
As you drag the second view, you will see that Denodo automatically links the fields of the two views. Denodo does this when the name and data type match on both sides. Since these two tables have exactly the same structure, there is no extra manual modeling; the sketch below shows the equivalent SQL.
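For reference, the view produced by this wizard is equivalent to a plain SQL union of the two sources. A minimal sketch (whether duplicate rows are removed depends on the union type selected in the wizard):

SELECT * FROM historical_store_sales
UNION
SELECT * FROM ttm_store_sales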
In the Output tab you can change the name of the view. You can also change the name of the columns to make them more user friendly, or add a description to each one of them. In the Metadata tab, you can add a description for the view.
Once you are done, click on the Save button on top of this panel. You have just created your first derived view!

JOIN the dimensions
Right-click on the folder name and choose New > Join. Like before, you will have to drag and drop the three base views involved in this model into the workspace:

store_sales, created in the previous step
date_dim, from Redshift
store, from Aurora

You will have to drag and drop the JOIN conditions, using the arrow that appears on the workspace when you drag a field name. Use the following conditions:
store_sales.ss_sold_date_sk = date_dim.d_date_sk
store_sales.ss_store_sk = store.s_store_sk
On top of the JOIN, we have to define the aggregation. You don’t need a new wizard; you can define the aggregation settings in the Group By tab. Enable the Use Group By checkbox, and add the following fields to the aggregation:
store.s_store_sk
store.s_city
store.s_state
date_dim.d_year

You just need some final polishing. In the Output tab, name your view same_store_sales_by_year. You can change the names of the fields and add descriptions. For example, you may want to use store_id instead of s_store_sk, since it’s a more understandable name for non-technical users.
You can also add new fields and calculated metrics. In this case, click on New Aggr. Field and use the following expression:
Field name: total_sales
Field expression: SUM(store_sales.ss_net_paid_inc_tax)
You can also add ORDER BY settings using the option in the top-right corner of this panel. It is an auto-complete field, so matching fields show up as you start typing. Use the green plus sign to add conditions. For this exercise, order by store_id (s_store_sk) and year (d_year).
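Putting it all together, the derived view you have just modeled corresponds to the same_store_sales_by_year query you saw earlier in sample_completed. In SQL terms, it is roughly the following (the VQL the wizard generates may differ slightly in naming and formatting):

SELECT store.s_store_sk AS store_id,
    store.s_city AS city,
    store.s_state AS state,
    date_dim.d_year AS year,
    SUM(store_sales.ss_net_paid_inc_tax) AS total_sales
FROM store_sales
    JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
    JOIN store ON store_sales.ss_store_sk = store.s_store_sk
GROUP BY store.s_store_sk, store.s_city, store.s_state, date_dim.d_year
ORDER BY store_id ASC, year ASC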
You should start by making sure that you built the right thing. To verify it, open the execution panel and get some sample data. Like in the initial example, execution should take less than 15 seconds, and return 1,206 rows. Results should look like this:
You can run this check from the VQL Shell (Tools > VQL Shell):
select * from same_store_sales_by_year CONTEXT('i18n'='us_pst','cache_wait_for_load'='true') TRACE

Using external clients
You can also connect external clients to Denodo, like Tableau or Power BI, in order to create dashboards and charts. As part of this sandbox, we have included an instance of Apache Zeppelin, a very popular web-based notebook that allows you to run interactive queries against Denodo (via JDBC) and draw charts. A tab with Zeppelin should already be open in this browser; otherwise, just head to http://localhost:7777/

To link a paragraph to the data source, you have to specify an interpreter at the beginning. In the case of Denodo, you use %vdp. VDP stands for Virtual DataPort, which is the name of the Denodo server component. From the Zeppelin home screen you can access an example notebook we have created for this exercise with some templates.
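For example, a paragraph that queries the view built in this exercise could look like this (a minimal sketch, assuming the %vdp interpreter is bound as in this sandbox):

%vdp
select * from same_store_sales_by_year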
To be continued. Thank you!
