5. UI Features

The Atlas UI is feature-rich: it supports metadata management and displays data lineage.

Main interface

Search

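Basic search

Basic search lets you query by entity type name and by associated classification/tag, and supports filtering on entity attributes as well as classification/tag attributes.

Attribute-based filters on multiple attributes can be combined with AND/OR conditions.

Supported filter operators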

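LT (symbols: <, lt): for numeric and date attributes
GT (symbols: >, gt): for numeric and date attributes
LTE (symbols: <=, lte): for numeric and date attributes
GTE (symbols: >=, gte): for numeric and date attributes
EQ (symbols: eq, =): for numeric, date, and string attributes
NEQ (symbols: neq, !=): for numeric, date, and string attributes
LIKE (symbols: like, LIKE): for string attributes
STARTS_WITH (symbols: startsWith, STARTSWITH): for string attributes
ENDS_WITH (symbols: endsWith, ENDSWITH): for string attributes
CONTAINS (symbols: contains, CONTAINS): for string attributes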
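As a rough illustration of how these operators combine, the sketch below builds a JSON body for a basic search with two attribute filters joined by AND. The field names (typeName, entityFilters, condition, criterion, attributeName, operator, attributeValue) are assumed from the Atlas v2 search API and should be checked against your Atlas version; the type name hive_table and the attribute values are made up for the example.

```python
import json

# Operator symbols taken from the list above.
NUMERIC_OR_DATE_OPS = {"lt", "gt", "lte", "gte"}
ANY_OPS = {"eq", "neq"}
STRING_OPS = {"like", "startsWith", "endsWith", "contains"}


def criterion(attribute, operator, value):
    """Build one attribute filter (field names are assumed, see the lead-in)."""
    if operator not in NUMERIC_OR_DATE_OPS | ANY_OPS | STRING_OPS:
        raise ValueError(f"unsupported operator: {operator}")
    return {
        "attributeName": attribute,
        "operator": operator,
        "attributeValue": str(value),
    }


def basic_search_body(type_name, filters, condition="AND",
                      classification=None, limit=25):
    """Combine several attribute filters with AND/OR into one request body."""
    if condition not in ("AND", "OR"):
        raise ValueError("condition must be AND or OR")
    body = {
        "typeName": type_name,
        "excludeDeletedEntities": True,
        "limit": limit,
        "entityFilters": {"condition": condition, "criterion": list(filters)},
    }
    if classification is not None:
        body["classification"] = classification
    return body


# Hypothetical example values: tables named ods_* that are owned by admin.
body = basic_search_body(
    "hive_table",
    [criterion("name", "startsWith", "ods_"), criterion("owner", "eq", "admin")],
    classification="PII",
)
print(json.dumps(body, indent=2))
```

A body like this would be sent with POST to the search/basic endpoint; the helper only builds the payload and does not contact a server.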

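Advanced search

Advanced search in Atlas is also known as DSL-based search.

The domain-specific language (DSL) is a simple language whose syntax is modeled on the Structured Query Language (SQL) popular with relational databases.

For the full syntax, see the Atlas DSL Grammar on GitHub (ANTLR G4 format).

Example: to retrieve entities of type Table whose name is either time_dim or customer_dim: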

from Table where name = 'time_dim' or name = 'customer_dim'
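The same query can also be issued over REST. The sketch below only builds the URL for the search/dsl endpoint with the query properly URL-encoded; the helper name dsl_search_url is invented for this example, and hostname:21000 is the placeholder address used elsewhere in this article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def dsl_search_url(base_url, query, limit=25, offset=0):
    """Return the GET URL for a DSL search (helper name is hypothetical)."""
    params = {"query": query, "limit": limit, "offset": offset}
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{urlencode(params)}"


dsl = "from Table where name = 'time_dim' or name = 'customer_dim'"
url = dsl_search_url("http://hostname:21000", dsl)
print(url)

# Decoding the query string gives back the original DSL text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == dsl)
```

Encoding matters here because the DSL contains spaces, quotes, and equals signs that are not safe in a raw query string.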

Classification

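Classification propagation lets a classification associated with an entity be automatically associated with that entity's related entities. This is very useful when datasets derive their data from other datasets.

Adding a classification to an entity

After the classification "PII" is added to the "hdfs_path" entity, it is propagated to all affected entities along the lineage path, including the "employees" table and the views "us_employees" and "uk_employees".

Updating a classification associated with an entity

Any update to a classification associated with an entity is also visible on every entity the classification has been propagated to.

Put simply, this feature lets you track where the data actually flows.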
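To make the idea concrete, here is a toy model of propagation as a breadth-first walk over a downstream lineage graph. It is only an illustration of the concept under simplifying assumptions (a plain dictionary as the lineage graph, no processes, no propagation flags), not how Atlas implements it internally.

```python
from collections import deque


def propagate(lineage, start, classification, tags=None):
    """Tag `start` and every entity reachable downstream from it."""
    tags = {} if tags is None else tags
    queue = deque([start])
    seen = {start}
    while queue:
        entity = queue.popleft()
        tags.setdefault(entity, set()).add(classification)
        for child in lineage.get(entity, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return tags


# Downstream edges, mirroring the example in the text.
lineage = {
    "hdfs_path": ["employees"],
    "employees": ["us_employees", "uk_employees"],
}

tags = propagate(lineage, "hdfs_path", "PII")
for name in sorted(tags):
    print(name, sorted(tags[name]))
```

Tagging "employees" instead of "hdfs_path" would leave "hdfs_path" untouched, since propagation in this sketch follows the direction of the data flow only.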

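Glossary

A glossary provides business users with appropriate vocabulary. It allows terms (words) to be related to each other and categorized so that they can be understood in different contexts. These terms can then be mapped to assets such as databases, tables, and columns. This helps abstract away the technical jargon tied to the repository and lets users discover and work with data in the vocabulary they are more familiar with.

Clicking a term name in the glossary UI shows the term's details. Each tab on the details page presents a different aspect of the term.

When the toggle is set to categories, the panel lists all glossaries along with their category hierarchies, together with the interactions available in this view.

If a term has a classification, any entity the term is assigned to inherits that classification.

The glossary thus links data assets to business systems.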

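6. Hive Data Lineage

Hive2 compatibility issues

Atlas has compatibility issues with Hive. This article deploys Atlas 2.1.0 for compatibility with CDH 6.3.2, whose Hive version is 2.1.1. Issues with other versions are outside the scope of this document.

To be compatible with Hive 2.1.1, the source code must be modified and recompiled.

Module to modify: apache-atlas-sources-2.1.0\addons\hive-bridge

① org/apache/atlas/hive/bridge/HiveMetaStoreBridge.java, line 577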

String catalogName = hiveDB.getCatalogName() != null ? hiveDB.getCatalogName().toLowerCase() : null;

Change it to:

String catalogName = null;

② org/apache/atlas/hive/hook/AtlasHiveHookContext.java, line 81

this.metastoreHandler = (listenerEvent != null) ? metastoreEvent.getIHMSHandler() : null;

Change it to:

this.metastoreHandler = null;

Integrating Hive

Add the atlas-application.properties configuration file to atlas-plugin-classloader-2.1.0.jar

# Run zip from this directory so the file lands at the top level of the jar
cd /usr/local/src/atlas/apache-atlas-2.1.0/conf
zip -u /usr/local/src/atlas/apache-atlas-2.1.0/hook/hive/atlas-plugin-classloader-2.1.0.jar atlas-application.properties
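Because the file must sit at the top level of the jar, a quick check after the zip step can save a debugging round. The sketch below uses Python's zipfile module to test for a root-level entry; the function name properties_at_jar_root is made up, and the demo builds two throwaway jars in a temporary directory rather than touching a real installation.

```python
import os
import tempfile
import zipfile


def properties_at_jar_root(jar_path, name="atlas-application.properties"):
    """Return True if `name` is stored at the top level of the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return name in jar.namelist()


# Demo with two throwaway jars: one packed correctly, one packed from
# the wrong directory so that the file ends up under conf/.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jar")
    with zipfile.ZipFile(good, "w") as jar:
        jar.writestr("atlas-application.properties",
                     "atlas.rest.address=http://hostname:21000\n")

    bad = os.path.join(tmp, "bad.jar")
    with zipfile.ZipFile(bad, "w") as jar:
        jar.writestr("conf/atlas-application.properties", "")

    good_result = properties_at_jar_root(good)
    bad_result = properties_at_jar_root(bad)

print(good_result, bad_result)
```

Against a real deployment, the function would be called with the path of the atlas-plugin-classloader jar under hook/hive.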

Modify hive-site.xml

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>

Modify the Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh

HIVE_AUX_JARS_PATH=/usr/local/src/atlas/apache-atlas-2.1.0/hook/hive

Update HIVE_AUX_JARS_PATH

Modify the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
    <name>hive.reloadable.aux.jars.path</name>
    <value>/usr/local/src/atlas/apache-atlas-2.1.0/hook/hive</value>
</property>

Modify the HiveServer2 Environment Advanced Configuration Snippet (Safety Valve)

HIVE_AUX_JARS_PATH=/usr/local/src/atlas/apache-atlas-2.1.0/hook/hive

The configured Atlas package must be distributed to every Hive node, after which the cluster is restarted.
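Before restarting, it can be worth confirming that the hook really made it into the effective hive-site.xml on a node. The sketch below parses a hive-site.xml document and checks that hive.exec.post.hooks contains the Atlas hook class. It assumes a complete file with a <configuration> root element (the safety valve snippets above are fragments of such a file); the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

HOOK_CLASS = "org.apache.atlas.hive.hook.HiveHook"


def read_properties(xml_text):
    """Map property name -> value for every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}


def hook_enabled(xml_text):
    """True if the Atlas hook is among the comma-separated post hooks."""
    hooks = read_properties(xml_text).get("hive.exec.post.hooks") or ""
    return HOOK_CLASS in [h.strip() for h in hooks.split(",")]


sample = """<configuration>
  <property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
  </property>
  <property>
    <name>hive.reloadable.aux.jars.path</name>
    <value>/usr/local/src/atlas/apache-atlas-2.1.0/hook/hive</value>
  </property>
</configuration>"""

print(hook_enabled(sample))
print(read_properties(sample)["hive.reloadable.aux.jars.path"])
```

In practice the text would be read from the hive-site.xml that the Hive process actually loads, whose location depends on the deployment.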

Importing Hive metadata

Run the Atlas script

./bin/import-hive.sh
# Enter username: admin; enter password: admin

Log in to Atlas to view the metadata.
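Besides the web UI, the import can be spot-checked over REST. The sketch below prepares a basic search request for hive_table entities with HTTP Basic authentication, using the admin/admin credentials mentioned above and the placeholder address hostname:21000. It only builds the request object; sending it is left as a comment because it needs a running Atlas server.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def basic_search_request(base_url, type_name, user, password, limit=10):
    """Prepare (but do not send) a GET request to the basic search endpoint."""
    query = urlencode({
        "typeName": type_name,
        "excludeDeletedEntities": "true",
        "limit": limit,
    })
    request = Request(f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{query}")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request


request = basic_search_request("http://hostname:21000", "hive_table",
                               "admin", "admin")
print(request.get_method(), request.full_url)

# To send it against a live server:
#   import json, urllib.request
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

If the import worked, the response should list the imported Hive tables as entity headers.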

Testing the real-time Hive hook

Run a simple Hive script for hera resource statistics

use sucx_test
;
-- Number of devices upgraded yesterday
create table if not exists qs_tzl_ProductTag_result(
pid string
,category string
,category_code string
,tag_name string
,tag_value string
,other string
,update_time string
)
partitioned by (tag_id string)
STORED AS PARQUET TBLPROPERTIES('parquet.compression'='SNAPPY')
;
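insert overwrite table qs_tzl_ProductTag_result partition(tag_id='3014')
select
       T1.product_id as pid
       ,T2.category
       -- columns not derived in this example are filled with NULL so that
       -- the select list matches the seven columns of the target table
       ,cast(null as string) as category_code
       ,cast(null as string) as tag_name
       ,cast(null as string) as tag_value
       ,cast(null as string) as other
       ,cast(from_unixtime(unix_timestamp()) as string) as update_time
from (select
             product_id
      from COM_PRODUCT_UPGRADE_STAT_D where p_day='20200901'
     ) T1
-- id must be selected here because the join condition refers to T2.id
left join (select id, category
        from bi_ods.ods_smart_product where dt='20200901'
        ) T2
on T1.product_id=T2.id
;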

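After it runs, the table-level lineage of qs_tzl_ProductTag_result can be viewed in Atlas.

The column-level lineage of pid can be viewed in the same way.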

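7. Spark Data Lineage

Packaging spark-atlas-connector

Atlas does not officially support parsing Spark SQL, so a third-party package is needed.

Repository: https://github.com/hortonworks-spark/spark-atlas-connector

1. After git clone, build the package locally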

mvn package -DskipTests

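2. After packaging, the directory spark-atlas-connector/spark-atlas-connector-assembly/target contains a jar named spark-atlas-connector-assembly-${version}.jar; upload that jar to the server. Note: do not upload the jar under spark-atlas-connector/spark-atlas-connector/target, which lacks the required dependencies.

3. Put spark-atlas-connector-assembly-${version}.jar in a fixed directory, for example /opt/resource

Testing the Spark hook

First start the spark-sql client. The --jars path should point to the assembly jar placed in /opt/resource in the previous step.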

spark-sql --master yarn  \
--jars /opt/resource/spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar  \
--files /opt/resource/atlas-application.properties \
--conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
--conf spark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker
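The same options are needed for every Spark entry point (spark-sql, spark-shell, spark-submit), so it can help to generate the argument list in one place. The sketch below is a small, hypothetical helper that assembles the command shown above; the jar and properties paths are the ones used in this article and should be adjusted to whatever was actually built.

```python
import shlex

LISTENER = "com.hortonworks.spark.atlas.SparkAtlasEventTracker"
STREAMING_LISTENER = (
    "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker"
)


def spark_sql_command(jar, properties_file, master="yarn", program="spark-sql"):
    """Return the argument list that enables the Atlas listeners."""
    return [
        program, "--master", master,
        "--jars", jar,
        "--files", properties_file,
        "--conf", f"spark.extraListeners={LISTENER}",
        "--conf", f"spark.sql.queryExecutionListeners={LISTENER}",
        "--conf",
        f"spark.sql.streaming.streamingQueryListeners={STREAMING_LISTENER}",
    ]


command = spark_sql_command(
    "/opt/resource/spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar",
    "/opt/resource/atlas-application.properties",
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting problems.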

Run one of hera's resource statistics jobs
CREATE EXTERNAL TABLE  IF NOT EXISTS sucx_hera.ads_hera_task_mem_top_10(
    `job_id` BIGINT COMMENT 'job ID',
    `user` STRING COMMENT 'watcher',
    `applicationId` STRING COMMENT 'YARN application id',
    `memorySeconds` BIGINT COMMENT 'total memory-seconds used',
    `startedTime` BIGINT COMMENT 'start time',
    `finishedTime` BIGINT COMMENT 'finish time',
    `elapsedTime` BIGINT COMMENT 'elapsed time',
    `vcoreSeconds` BIGINT COMMENT 'total vcore-seconds used')
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
    'field.delim'='\t',
    'serialization.format'='\t')
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    'cosn://sucx-big-data/bi//sucx_hera/ads_hera_task_mem_top_10';
insert overwrite table sucx_hera.ads_hera_task_mem_top_10
select
 job_id,user,applicationId,memorySeconds,startedTime,finishedTime,elapsedTime,vcoreSeconds
from
(SELECT
    top.job_id,
    row_number() over(distribute by top.applicationId ORDER BY sso.id) as num,
    case when sso.name is null then operator
    else sso.name end as user,
    top.applicationId,
    top.memorySeconds,
    top.startedTime,
    top.finishedTime,
    top.elapsedTime,
    top.vcoreSeconds
FROM (
select * from sucx_hera.dws_hera_task_mem_top_10 where dt = '20200901' ) top
left join bi_ods.ods_hera_job_monitor monitor
on  monitor.dt='20200901' and top.job_id=monitor.job_id
left join bi_ods.ods_hera_sso sso
on sso.dt='20200901' and find_in_set(sso.id,monitor.user_ids) >0 order by job_id ) temp
where temp.num = 1

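After it runs, the table-level lineage of ads_hera_task_mem_top_10 can be viewed in Atlas.

Note that this package does not support column-level lineage for Spark.

If column-level lineage is needed for Spark, one option is to convert the Spark code to Hive and run it once; the other is to build the support in-house.

8. Custom Development on Atlas

Atlas is good, but it still falls short in many scenarios, and then some changes become necessary. Custom development can take two forms: building on the Atlas API, or modifying the source code.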

REST API

http://atlas.apache.org/api/v2/index.html

DiscoveryREST

http://hostname:21000/api/atlas/v2/search/basic?classification=class1
Supported parameters: query, typeName, classification, excludeDeletedEntities, limit, offset
Code: https://github.com/apache/atlas/blob/6bacbe946bbc5ca72118304770d5ad920695bd52/webapp/src/main/java/org/apache/atlas/web/rest/DiscoveryREST.java
# Query all Table entities
http://hostname:21000/api/atlas/v2/search/dsl?typeName=Table
# Query Table entities whose owner starts with John
http://hostname:21000/api/atlas/v2/search/attribute?typeName=Table&attrName=owner&attrValuePrefix=John
# Fetch the saved search named Table; the response includes guid, ownerName, searchParameters, etc.
http://hostname:21000/api/atlas/v2/search/saved/Table
# List all saved searches of the current user (identical to the previous response when Table is the only saved search)
http://hostname:21000/api/atlas/v2/search/saved
# Execute the saved search named Table and return the matching entities
http://hostname:21000/api/atlas/v2/search/saved/execute/Table
# Execute the saved search whose guid is e283d8c1-ae19-4f4b-80c0-38031788383b and return the matching entities
http://hostname:21000/api/atlas/v2/search/saved/execute/guid/e283d8c1-ae19-4f4b-80c0-38031788383b
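Since the search endpoints accept limit and offset, large result sets are usually read page by page. The sketch below shows one way to do that with a generator. The fetch_page callable is a stand-in supplied by the caller (for example a function that calls search/basic and parses the JSON), and the assumption that results arrive under an "entities" key should be verified against your Atlas version.

```python
def iterate_entities(fetch_page, page_size=100):
    """Yield entities page by page.

    fetch_page(limit, offset) must return one parsed search response (a dict).
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        entities = page.get("entities") or []
        yield from entities
        if len(entities) < page_size:
            break  # a short page means there is nothing left to read
        offset += page_size


# Stand-in for a real HTTP call: serves slices of an in-memory list.
catalog = [{"guid": f"g{i}"} for i in range(5)]


def fake_fetch(limit, offset):
    return {"entities": catalog[offset:offset + limit]}


guids = [entity["guid"] for entity in iterate_entities(fake_fetch, page_size=2)]
print(guids)
```

Keeping the HTTP call behind fetch_page makes the paging logic easy to test without a server.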

LineageREST

# Query the lineage of the entity whose guid is a95cb57f-4643-4edf-b5a5-0f1de2e0f076
http://hostname:21000/api/atlas/v2/lineage/a95cb57f-4643-4edf-b5a5-0f1de2e0f076
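The lineage response is a graph: a map from guid to entity header plus a list of edges. The sketch below walks the edges backwards from the base entity and collects the upstream tables. The field names (baseEntityGuid, guidEntityMap, relations, fromEntityId, toEntityId, typeName, displayText) are assumed from the v2 lineage response format, and the sample data, including its short guids, is invented for illustration.

```python
def upstream_tables(lineage, table_type="hive_table"):
    """Return display names of all upstream entities of the given type."""
    parents = {}
    for relation in lineage.get("relations", []):
        parents.setdefault(relation["toEntityId"], []).append(
            relation["fromEntityId"])

    entity_map = lineage.get("guidEntityMap", {})
    found, seen = set(), set()
    stack = [lineage["baseEntityGuid"]]
    while stack:
        guid = stack.pop()
        for parent in parents.get(guid, []):
            if parent in seen:
                continue
            seen.add(parent)
            stack.append(parent)
            header = entity_map.get(parent, {})
            if header.get("typeName") == table_type:
                found.add(header.get("displayText", parent))
    return sorted(found)


# Invented sample: two source tables feed one process that writes t3.
sample = {
    "baseEntityGuid": "t3",
    "guidEntityMap": {
        "t1": {"typeName": "hive_table", "displayText": "com_product_upgrade_stat_d"},
        "t2": {"typeName": "hive_table", "displayText": "ods_smart_product"},
        "p1": {"typeName": "hive_process", "displayText": "insert overwrite"},
        "t3": {"typeName": "hive_table", "displayText": "qs_tzl_producttag_result"},
    },
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "t2", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t3"},
    ],
}

print(upstream_tables(sample))
```

Tables are connected through process entities, which is why the walk continues through nodes that are not tables themselves.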

EntityREST

# Query the entity whose guid is 48f29229-47a9-4b05-b053-91e6484f42a1
http://hostname:21000/api/atlas/v2/entity/guid/48f29229-47a9-4b05-b053-91e6484f42a1
# Query the audit records of the entity whose guid is 48f29229-47a9-4b05-b053-91e6484f42a1
http://hostname:21000/api/atlas/v2/entity/48f29229-47a9-4b05-b053-91e6484f42a1/audit
# Query the classifications of the entity whose guid is 48f29229-47a9-4b05-b053-91e6484f42a1
http://hostname:21000/api/atlas/v2/entity/guid/48f29229-47a9-4b05-b053-91e6484f42a1/classifications
# Look up an entity by a unique attribute of its entity type
# Example: name is a unique attribute of DB; find the DB with name=Logging
http://hostname:21000/api/atlas/v2/entity/uniqueAttribute/type/DB?attr:name=Logging
# Fetch the classification Dimension on the entity, i.e. check whether the entity carries it
http://hostname:21000/api/atlas/v2/entity/guid/48f29229-47a9-4b05-b053-91e6484f42a1/classification/Dimension
# Query entities in bulk
http://hostname:21000/api/atlas/v2/entity/bulk?guid=e667f337-8dcc-468b-a5d0-96473f8ede26&guid=a95cb57f-4643-4edf-b5a5-0f1de2e0f076
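Two of these URLs have a shape that is easy to get wrong by hand: the bulk endpoint repeats the guid parameter, and the unique-attribute lookup prefixes each attribute with attr:. The sketch below builds both with urllib; the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

API = "/api/atlas/v2/entity"


def bulk_entities_url(base_url, guids):
    """One guid=... pair per entity, in the order given."""
    query = urlencode([("guid", guid) for guid in guids])
    return f"{base_url.rstrip('/')}{API}/bulk?{query}"


def unique_attribute_url(base_url, type_name, **attributes):
    """Each unique attribute becomes an attr:<name>=<value> parameter."""
    pairs = {f"attr:{name}": value for name, value in attributes.items()}
    query = urlencode(pairs, safe=":")  # keep the colon readable
    return (f"{base_url.rstrip('/')}{API}/uniqueAttribute/type/"
            f"{quote(type_name)}?{query}")


bulk_url = bulk_entities_url("http://hostname:21000", [
    "e667f337-8dcc-468b-a5d0-96473f8ede26",
    "a95cb57f-4643-4edf-b5a5-0f1de2e0f076",
])
lookup_url = unique_attribute_url("http://hostname:21000", "DB", name="Logging")
print(bulk_url)
print(lookup_url)
```

Both helpers reproduce the example URLs listed above, so they can be checked directly against them.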

RelationshipREST

# Query the relationship whose guid is 726c0120-19d2-4978-b38d-b03124033f41
# Note: a relationship can be thought of as one edge of the lineage graph
http://hostname:21000/api/atlas/v2/relationship/guid/726c0120-19d2-4978-b38d-b03124033f41

TypesREST

http://hostname:21000/api/atlas/v2/types/typedef/guid/e0ca4c40-6165-4cec-b489-2b8e5fc7112b
http://hostname:21000/api/atlas/v2/types/typedef/name/Table
http://hostname:21000/api/atlas/v2/types/typedefs/headers
http://hostname:21000/api/atlas/v2/types/typedefs
http://hostname:21000/api/atlas/v2/types/enumdef/name/hive_principal_type
http://hostname:21000/api/atlas/v2/types/enumdef/guid/ee30446a-92e1-4bbc-aa0a-66ac21970d88
http://hostname:21000/api/atlas/v2/types/structdef/name/hive_order
http://hostname:21000/api/atlas/v2/types/structdef/guid/0b602605-8c88-4b60-a177-c1c671265294
http://hostname:21000/api/atlas/v2/types/classificationdef/name/PII
http://hostname:21000/api/atlas/v2/types/classificationdef/guid/3992eef8-fd74-4ae7-8b35-fead363b2122
http://hostname:21000/api/atlas/v2/types/entitydef/name/DataSet
http://hostname:21000/api/atlas/v2/types/entitydef/guid/5dca20ce-7d57-4bc3-a3f1-93fa622c32b1
http://hostname:21000/api/atlas/v2/types/relationshipdef/name/process_dataset_outputs
http://hostname:21000/api/atlas/v2/types/relationshipdef/guid/5d76dae0-6bad-4d19-b9b2-cb0cdc3b53d5

GlossaryREST

http://hostname:21000/api/atlas/v2/glossary
http://hostname:21000/api/atlas/v2/glossary/302dc8f8-3bc8-4878-9ba2-0c585ca6de3d

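Local development environment

Start HBase and Solr

Atlas depends on HBase and Solr. For easier debugging, unpack the previously built tarball that bundles embedded HBase and Solr.

Supplementary files

Create a folder and copy the files Atlas needs into it.

Copy the files under distro/target/conf in the source tree built with embedded HBase and Solr into the conf directory of that folder. Copy everything under addons/models in the source tree as well. Copy the entire contents of atlas/webapp/target in the source tree into the webapp directory.

Starting from source

Import the cloned source into IDEA and configure the startup parameters: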

-Datlas.home=/opt/deploy
-Datlas.conf=/opt/deploy/conf
-Datlas.data=/opt/deploy/data

Data Governance, a Powerful Tool for Metadata Management: Atlas Beginner's Guide (Part 2)