Kylin: Basic Concepts

CUBE

  • Table - The definition of a Hive table used as the source of a cube; it must be synced before cubes are built.

  • Data Model - Describes a star-schema data model: it defines the fact table, the lookup tables, and the filter condition.

  • Cube Descriptor - Describes the definition and settings of a cube instance: which data model to use, which dimensions and measures to include, how to partition the cube into segments, how to handle auto-merge, and so on.

  • Cube Instance - An instance of a cube, built from one cube descriptor and consisting of one or more cube segments according to the partition settings.

  • Partition - The user can define a DATE/STRING column as the partition column in the cube descriptor, which separates one cube into several segments covering different date periods.

  • Cube Segment - The actual carrier of cube data; each segment maps to one HTable in HBase. One build job creates one new segment for the cube instance. When the data for a given period changes, only the related segments need to be refreshed, which avoids rebuilding the whole cube.

  • Aggregation Group - Each aggregation group is a subset of the dimensions, and cuboids are built from the combinations inside it. Its purpose is to prune cuboids for optimization.

DIMENSION & MEASURE

  • Mandatory - This dimension type is used for cuboid pruning: if a dimension is specified as “mandatory”, the combinations that do not contain it are pruned.
  • Hierarchy - This dimension type is used for cuboid pruning: if dimensions A, B, and C form a “hierarchy” relation, only the combinations containing A, AB, or ABC are kept.
  • Derived - On lookup tables, some dimensions can be generated from the table's PK, so there is a specific mapping between them and the FK in the fact table. Such dimensions are DERIVED and do not participate in cuboid generation.
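The two pruning rules above can be illustrated with a small sketch. This is not Kylin's implementation; the function `cuboids` and its parameters are hypothetical names chosen for illustration. It also assumes that a combination containing none of the hierarchy dimensions passes the hierarchy rule, which only matters when other dimensions exist.

```python
from itertools import combinations


def cuboids(dimensions, mandatory=(), hierarchy=()):
    """Enumerate the dimension combinations that survive pruning.

    mandatory: dimensions that every kept combination must contain.
    hierarchy: dimensions ordered from top level to bottom level; a kept
               combination may only contain a prefix of this sequence.
    """
    mandatory = set(mandatory)
    kept = []
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            chosen = set(combo)
            # Mandatory rule: prune combinations missing a mandatory dimension.
            if not mandatory <= chosen:
                continue
            # Hierarchy rule: the hierarchy dimensions present in the
            # combination must be exactly the first k levels (A, AB, ABC).
            present = [d for d in hierarchy if d in chosen]
            if present != list(hierarchy[:len(present)]):
                continue
            kept.append(combo)
    return kept


# Three dimensions give 7 non-empty combinations without pruning.
print(len(cuboids(["A", "B", "C"])))                      # 7
# With A > B > C declared as a hierarchy, only A, AB and ABC remain.
print(cuboids(["A", "B", "C"], hierarchy=("A", "B", "C")))
# [('A',), ('A', 'B'), ('A', 'B', 'C')]
```

With three dimensions the saving is small, but the number of unpruned combinations doubles with every added dimension, which is why these rules matter.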

  • Count Distinct (HyperLogLog) - Computing COUNT DISTINCT directly is expensive, so an approximate algorithm, HyperLogLog, is used; it keeps the error rate low.
  • Count Distinct (Precise) - Precise COUNT DISTINCT is pre-calculated based on RoaringBitmap; currently only int and bigint columns are supported.
  • Top N - With this measure type, the user can easily get, for example, a specified number of top sellers or buyers.
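To give a feel for how an approximate COUNT DISTINCT works, here is a minimal HyperLogLog sketch. It is a textbook-style illustration, not Kylin's code: the class name, the SHA-1 based hash, and the default precision `p=12` are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for user_id in range(50000):
    hll.add(user_id)
print(round(hll.count()))   # close to 50000; the exact value depends on the hash
```

Each register stores only a small integer, so the sketch occupies a few kilobytes no matter how many values are added, and two sketches can be combined by taking the register-wise maximum. That mergeability is what makes this kind of measure suitable for pre-aggregation.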

CUBE ACTIONS

  • BUILD - Given an interval of the partition column, this action builds a new cube segment.
  • REFRESH - This action rebuilds the cube segment for a given partition period; it is used when the source table has grown.
  • MERGE - This action merges multiple contiguous cube segments into a single one. It can be automated with the auto-merge settings in the cube descriptor.
  • PURGE - Clears the segments under a cube instance. This only updates metadata and does not delete cube data from HBase.
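The four actions can be pictured as operations on a list of date-ranged segments. The sketch below is a toy in-memory model under stated assumptions: `Segment`, `Cube`, the half-open `[start, end)` intervals, and the `version` counter are hypothetical and do not mirror Kylin's metadata classes.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Segment:
    start: date        # inclusive
    end: date          # exclusive
    version: int = 1   # bumped on every refresh (illustrative)


@dataclass
class Cube:
    segments: list = field(default_factory=list)

    def build(self, start, end):
        """BUILD: add a new segment for an interval of the partition column."""
        if any(s.start < end and start < s.end for s in self.segments):
            raise ValueError("interval overlaps an existing segment")
        self.segments.append(Segment(start, end))
        self.segments.sort(key=lambda s: s.start)

    def refresh(self, start, end):
        """REFRESH: rebuild the segment that covers exactly this interval."""
        for i, s in enumerate(self.segments):
            if (s.start, s.end) == (start, end):
                self.segments[i] = Segment(start, end, s.version + 1)
                return
        raise ValueError("no segment covers exactly this interval")

    def merge(self, start, end):
        """MERGE: replace contiguous segments spanning [start, end) with one."""
        picked = [s for s in self.segments if start <= s.start and s.end <= end]
        contiguous = all(a.end == b.start for a, b in zip(picked, picked[1:]))
        if (len(picked) < 2 or not contiguous
                or picked[0].start != start or picked[-1].end != end):
            raise ValueError("need two or more contiguous segments covering the interval")
        self.segments = [s for s in self.segments if s not in picked]
        self.segments.append(Segment(start, end))
        self.segments.sort(key=lambda s: s.start)

    def purge(self):
        """PURGE: drop all segment metadata (stored data is not modelled here)."""
        self.segments.clear()


cube = Cube()
cube.build(date(2024, 1, 1), date(2024, 2, 1))
cube.build(date(2024, 2, 1), date(2024, 3, 1))
cube.refresh(date(2024, 1, 1), date(2024, 2, 1))   # only January is rebuilt
cube.merge(date(2024, 1, 1), date(2024, 3, 1))     # two segments become one
print(cube.segments)
```

The model makes the point of the REFRESH action visible: when one period changes, only the segment for that period is rebuilt while the others are left alone.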

JOB STATUS

  • NEW - The job has just been created.
  • PENDING - The job has been paused by the job scheduler and is waiting for resources.
  • RUNNING - The job is in progress.
  • FINISHED - The job has finished successfully.
  • ERROR - The job was aborted with errors.
  • DISCARDED - The job was cancelled by the end user.

JOB ACTION

  • RESUME - Once a job is in ERROR status, this action tries to resume it from the latest successful point.
  • DISCARD - Whatever the status of a job, the user can end it and release its resources with the DISCARD action.
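The statuses and actions above can be sketched as a small state machine. This is an illustrative model only: the `Job` class, its list of step callables, and the `done` counter are hypothetical, and the PENDING status is listed but not exercised because the sketch has no scheduler.

```python
from enum import Enum


class JobStatus(Enum):
    NEW = "NEW"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    ERROR = "ERROR"
    DISCARDED = "DISCARDED"


class Job:
    def __init__(self, steps):
        self.steps = list(steps)   # callables executed in order
        self.done = 0              # number of steps completed successfully
        self.status = JobStatus.NEW

    def run(self):
        self.status = JobStatus.RUNNING
        # Skip steps that already succeeded in an earlier attempt.
        for step in self.steps[self.done:]:
            try:
                step()
            except Exception:
                self.status = JobStatus.ERROR
                return
            self.done += 1
        self.status = JobStatus.FINISHED

    def resume(self):
        """RESUME: continue an ERROR job from the latest successful point."""
        if self.status is not JobStatus.ERROR:
            raise ValueError("RESUME applies only to jobs in ERROR status")
        self.run()

    def discard(self):
        """DISCARD: allowed in any status, as described above."""
        self.status = JobStatus.DISCARDED
```

Because `done` survives a failure, a resumed job re-runs only the failed step and the steps after it rather than starting over.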