Kylin - Analyzing Data


I. Create a Project

  1. Go to the Query page in the top menu bar, then click Manage Projects.

  2. Click the + Project button to add a new project.

  3. Enter a project name, e.g., “Tutorial”, with an optional description, then click the submit button to send the request.

  4. On success, the new project will appear in the table.
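
Everything the web UI does goes through Kylin's REST API, so project creation can also be scripted. Below is a minimal Python sketch, assuming the default ADMIN/KYLIN credentials and a /kylin/api/projects endpoint; the payload shape varies across Kylin versions, so verify it against the REST API docs of your version.

```python
import requests

KYLIN = "http://localhost:7070/kylin/api"  # assumed default host and port
AUTH = ("ADMIN", "KYLIN")                  # assumed default credentials

# Create a project named "Tutorial" -- the scripted equivalent of steps 1-3.
resp = requests.post(
    f"{KYLIN}/projects",
    json={"name": "Tutorial", "description": "Kylin tutorial project"},
    auth=AUTH,
)
resp.raise_for_status()
```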

II. Sync up Hive Table

  1. Click Model in the top bar and then the Data Source tab on the left; it lists all the tables loaded into Kylin. Click the Load Hive Table button.

  2. Enter the Hive table names, separated by commas, and then click Sync to send the request.

  3. [Optional] If you want to browse the Hive database to pick tables, click the Load Hive Table From Tree button.

  4. [Optional] Expand the database node, click to select the table to load, and then click Sync.

  5. A success message will pop up, and the newly loaded table is added to the Tables section on the left. Clicking the table name expands its columns.

  6. In the background, Kylin will run a MapReduce job to calculate the approximate cardinality of the newly synced table. After the job finishes, refresh the web page and click the table name; the cardinality will be shown in the table info.
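
Table loading can be scripted the same way; Kylin's REST API docs describe a Load Hive Tables endpoint (POST /kylin/api/tables/{tables}/{project}). A sketch, with sample-cube table names used purely as illustration and the calculate flag assumed to trigger the cardinality job:

```python
import requests

KYLIN = "http://localhost:7070/kylin/api"
AUTH = ("ADMIN", "KYLIN")

# Comma-separated Hive table names, exactly as in the "Load Hive Table" dialog.
tables = "DEFAULT.KYLIN_SALES,DEFAULT.KYLIN_CAL_DT"

resp = requests.post(
    f"{KYLIN}/tables/{tables}/Tutorial",
    json={"calculate": True},  # assumed flag that triggers the cardinality job
    auth=AUTH,
)
resp.raise_for_status()
```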

III. Create Data Model

Before creating a cube, you need to define a data model. The data model defines the star schema. One data model can be reused by multiple cubes.

  1. Click Model in the top bar, then the Models tab. Click the +New button and select New Model from the drop-down list.

  2. Enter a name for the model, with an optional description.

  3. In the Fact Table box, select the fact table of this data model.

  4. [Optional] Click the Add Lookup Table button to add a lookup table. Select the table name and join type (inner or left).

  5. [Optional] Click the New Join Condition button, select the FK column of the fact table on the left, and select the PK column of the lookup table on the right. Repeat this if there is more than one join column.

  6. Click “OK”, and repeat steps 4 and 5 to add more lookup tables if needed. When finished, click “Next”.

  7. The “Dimensions” page lets you select the columns that can be used as dimensions in the cubes built on this model. Click the Columns cell of a table and pick the columns from the drop-down list.

  8. Click “Next” to go to the “Measures” page and select the columns that will be used as measures/metrics. Measure columns can only come from the fact table.

  9. Click “Next” to go to the “Settings” page. If the data in the fact table grows by day, select the corresponding date column as the Partition Date Column and select the date format; otherwise leave it blank.

  10. [Optional] Select Cube Size, an indicator of the expected scale of the cube; by default it is MEDIUM.

  11. [Optional] If some records should be excluded from the cube, such as dirty data, you can enter the condition in Filter.

  12. Click Save and then select Yes to save the data model. Once created, the data model appears in the Models list on the left.
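
For orientation, every saved model is persisted as a JSON description. The sketch below shows roughly how steps 3 through 11 map onto that description; field names follow Kylin's model JSON but the exact schema is version-dependent, and all table and column names are illustrative (borrowed from the Kylin sample data).

```python
# Hypothetical model description mirroring steps 3-11 above; field names
# approximate Kylin's model JSON and vary by version.
model = {
    "name": "tutorial_model",
    "fact_table": "DEFAULT.KYLIN_SALES",            # step 3
    "lookups": [{                                   # steps 4-5
        "table": "DEFAULT.KYLIN_CAL_DT",
        "join": {
            "type": "inner",                        # or "left"
            "foreign_key": ["PART_DT"],             # FK column on the fact table
            "primary_key": ["CAL_DT"],              # PK column on the lookup table
        },
    }],
    "dimensions": [                                 # step 7
        {"table": "KYLIN_SALES", "columns": ["LSTG_FORMAT_NAME"]},
        {"table": "KYLIN_CAL_DT", "columns": ["CAL_DT"]},
    ],
    "metrics": ["PRICE", "ITEM_COUNT"],             # step 8: fact-table columns only
    "partition_desc": {                             # step 9
        "partition_date_column": "DEFAULT.KYLIN_SALES.PART_DT",
        "partition_date_format": "yyyy-MM-dd",
    },
    "capacity": "MEDIUM",                           # step 10: cube size hint
    "filter_condition": "PRICE > 0",                # step 11: exclude dirty data
}
```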

IV. Create Cube

After the data model is created, you can start creating a cube.

Click Model in the top bar, then the Models tab. Click the +New button and select New Cube from the drop-down list.

Step 1. Cube Info

Select the data model and enter a cube name; click Next to go to the next step.

You can use letters, numbers and ‘_’ to name your cube (blank spaces are not allowed). Notification List is a list of email addresses that will be notified on cube job success/failure.
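
If you script cube creation, the naming rule is easy to enforce up front; a trivial sketch:

```python
import re

def is_valid_cube_name(name: str) -> bool:
    # Letters, numbers and '_' only; blank spaces are not allowed.
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None

assert is_valid_cube_name("tutorial_cube")
assert not is_valid_cube_name("tutorial cube")
```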

Step 2. Dimensions

  1. Click Add Dimension; a drop-down with two options appears: “Normal” and “Derived”. “Normal” adds an independent dimension column, while “Derived” adds a derived dimension column on a lookup table (see the sketch after this list). Read more in How to optimize cubes.

  2. Click “Normal” and then select a dimension column, give it a meaningful name.

  3. [Optional] Click “Derived” and then pick one or more columns on a lookup table; give them a meaningful name.

  4. Repeat steps 2 and 3 to add all dimension columns; you can do this in batch for “Normal” dimensions with the Auto Generator button.

  5. Click “Next” after selecting all dimensions.
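
To make the normal/derived distinction concrete, here is a hedged sketch of how the two kinds of dimension roughly appear in a cube's JSON description; field names approximate Kylin's cube JSON and vary by version, and the table/column names are illustrative.

```python
# A normal dimension: one independent column, e.g. from the fact table.
normal_dim = {
    "name": "LSTG_FORMAT_NAME",
    "table": "DEFAULT.KYLIN_SALES",
    "column": "LSTG_FORMAT_NAME",
    "derived": None,
}

# A derived dimension: lookup-table columns derived from the lookup table's
# primary key, so they add no extra cuboid combinations of their own.
derived_dim = {
    "name": "CATEG_NAMES",
    "table": "DEFAULT.KYLIN_CATEGORY_GROUPINGS",
    "column": None,
    "derived": ["META_CATEG_NAME", "CATEG_LVL2_NAME"],
}
```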

Step 3. Measures

  1. Click the +Measure button to add a new measure.

  2. There are six measure types, grouped by expression: SUM, MAX, MIN, COUNT, COUNT_DISTINCT and TOP_N. Select the return type carefully for COUNT_DISTINCT and TOP_N, as it impacts the cube size.

    • SUM

    • MIN

    • MAX

    • COUNT

    • COUNT_DISTINCT
      This measure has two implementations:
      a) an approximate implementation with HyperLogLog: select an acceptable error rate; a lower error rate takes more storage.
      b) a precise implementation with bitmap (see its limitations in https://issues.apache.org/jira/browse/KYLIN-1186).

    Please note: distinct count is a very heavy data type; it is slower to build and query compared to other measures.

    • TOP_N
      The approximate TopN measure pre-calculates the top records in each dimension combination, which gives much better query performance than computing them without pre-calculation. Two parameters need to be specified here: the first is the column used as the metric for top records (aggregated with SUM and then sorted in descending order); the second is the literal ID that identifies the record, like seller_id.

    Properly select the return type, depending on how many top records you want to inspect: top 10, top 100 or top 1000.
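
Putting the measure types together, a hedged sketch of how they roughly look in the cube JSON. The hllc(N) and topn(N) return types follow the convention used in Kylin's docs (a larger N means lower error or more top records, at the cost of storage); verify exact field names against your version.

```python
# Hypothetical measure entries; names and shapes approximate Kylin's cube JSON.
measures = [
    {"name": "GMV_SUM", "function": {
        "expression": "SUM",
        "parameter": {"type": "column", "value": "PRICE"},
        "returntype": "decimal(19,4)"}},
    {"name": "SELLER_CNT_HLL", "function": {
        "expression": "COUNT_DISTINCT",
        "parameter": {"type": "column", "value": "SELLER_ID"},
        # hllc(10): larger error, less storage; hllc(16): lower error, more
        # storage; "bitmap" would select the precise implementation instead.
        "returntype": "hllc(10)"}},
    {"name": "TOP_SELLER", "function": {
        "expression": "TOP_N",
        # First parameter: the metric column (SUM + descending order);
        # nested parameter: the literal ID column identifying the record.
        "parameter": {"type": "column", "value": "PRICE",
                      "next_parameter": {"type": "column", "value": "SELLER_ID"}},
        "returntype": "topn(100)"}},
]
```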

Step 4. Refresh Setting

This step is designed for incremental cube builds.

Auto Merge Time Ranges (days): merge small segments into medium and large segments automatically. If you don’t want auto merge, remove the two default ranges.

Retention Range (days): only keep segments whose data falls within the given number of past days; older segments will be automatically dropped from the head. 0 means this feature is disabled.

Partition Start Date: the start date of this cube.
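
For reference, the three settings above roughly correspond to the following fields in the cube JSON (a sketch; exact field names are version-dependent, and the time values are epoch milliseconds):

```python
# Refresh settings as they roughly appear in cube JSON (values in ms).
refresh_setting = {
    "auto_merge_time_ranges": [
        7 * 24 * 3600 * 1000,    # merge small segments into 7-day segments...
        28 * 24 * 3600 * 1000,   # ...then 7-day segments into 28-day segments
    ],
    "retention_range": 0,        # 0 disables retention
    "partition_date_start": 0,   # cube start date as epoch ms
}
```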

Step 5. Advanced Setting

Aggregation Groups: by default Kylin puts all dimensions into one aggregation group; you can create multiple aggregation groups if you know your query patterns well. For the concepts of “Mandatory Dimensions”, “Hierarchy Dimensions” and “Joint Dimensions”, read this blog: New Aggregation Group
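
One way to sanity-check an aggregation group design is to count the cuboids it produces. A sketch using the commonly cited rule from the New Aggregation Group blog (mandatory dimensions contribute no factor, each hierarchy of h dimensions contributes h + 1, each joint behaves like a single dimension, and every remaining plain dimension doubles the count); treat it as an estimate, not the exact planner logic:

```python
def cuboid_count(normal: int, hierarchies=(), joints: int = 0) -> int:
    """Estimate the cuboids in one aggregation group.

    Mandatory dimensions contribute a factor of 1, each hierarchy of h
    dimensions contributes (h + 1), each joint acts as a single dimension,
    and each plain dimension doubles the count.
    """
    count = 2 ** (normal + joints)
    for h in hierarchies:
        count *= h + 1
    return count

print(cuboid_count(10))                            # 10 plain dims -> 1024 cuboids
print(cuboid_count(3, hierarchies=[3], joints=1))  # 2**4 * (3 + 1) = 64
```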

Rowkeys: the rowkeys are composed of the dimensions' encoded values. “Dictionary” is the default encoding method; if a dimension is not suitable for dictionary encoding (e.g., cardinality > 10 million), select “false” and then enter a fixed length for that dimension, usually the maximum length of that column; values longer than that size will be truncated. Please note that without dictionary encoding, the cube size might be much bigger.
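
To see why the fixed length matters, a toy sketch of fixed-length encoding: values longer than the configured length are cut off, so two values sharing the same first N bytes become indistinguishable in the rowkey. The padding byte is an assumption chosen for illustration, not Kylin's actual codec.

```python
def fixed_length_encode(value: str, length: int) -> bytes:
    # Truncate values longer than the configured length; pad shorter ones.
    data = value.encode("utf-8")[:length]
    return data.ljust(length, b"\x00")  # padding byte assumed for illustration

print(fixed_length_encode("HELLO_KYLIN", 8))  # b'HELLO_KY' -- truncated
print(fixed_length_encode("HI", 8))           # b'HI\x00\x00\x00\x00\x00\x00'
```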

You can drag & drop a dimension column to adjust its position in the rowkey. Put mandatory dimensions at the beginning, followed by the dimensions heavily involved in filters (the WHERE condition). Put high-cardinality dimensions ahead of low-cardinality dimensions.

Step 6. Overview & Save

You can review the cube definition and go back to previous steps to make modifications. Click the Save button to complete the cube creation.
