Iceberg metadata files and how to inspect them

Overview: understanding Iceberg's file structure

Official documentation: https://iceberg.apache.org/spec/#manifests

1: Understanding the Iceberg file structure

(Figure: Iceberg file layout: metadata file, manifest list, manifest files, data files)

2: First, create an Iceberg table

CREATE TABLE sample7 (
  `id` BIGINT,
  `data` STRING,
  `name` STRING,
  `category` STRING)
USING iceberg
PARTITIONED BY (category, name)
TBLPROPERTIES ("format-version"="2")


Insert three rows (two partitions: [xc1, pt2] and [xc3, pt3]):

spark-sql> select * from sample7;
22/08/09 20:17:18 WARN conf.HiveConf: HiveConf of name hive.merge.mapfile does not exist
22/08/09 20:17:18 WARN conf.HiveConf: HiveConf hive.mapjoin.smalltable.filesize expects LONG type value
1       1       xc1     pt2
2       2       xc1     pt2
2       2       xc3     pt3


3: How to inspect Iceberg file information (the examples below use a second table, sample8)

 1: show create table sample8;

This returns the following:

CREATE TABLE spark_catalog.zdk.sample8 (
  `id` BIGINT,
  `data` STRING,
  `name` STRING,
  `category` STRING)
USING iceberg
PARTITIONED BY (category)
LOCATION 'hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8'
TBLPROPERTIES(
  'current-snapshot-id' = '3329158953122539010',
  'format' = 'iceberg/parquet')
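Each commit writes a new metadata file under `<table location>/metadata`. For path-based (Hadoop) tables these files follow the vN.metadata.json naming, and the highest N is current; catalogs such as Hive track the current metadata_location themselves, so that property is authoritative when present. A minimal sketch of picking the newest versioned file, assuming the default naming (the file list is illustrative):

```python
import re

def latest_metadata_file(filenames):
    """Pick the newest vN.metadata.json by its version number N.

    Assumes the default path-based naming scheme; `filenames` is any
    iterable of base names, e.g. from `hdfs dfs -ls <table>/metadata`.
    """
    pattern = re.compile(r"^v(\d+)\.metadata\.json$")
    versioned = [(int(m.group(1)), name)
                 for name in filenames
                 if (m := pattern.match(name)) is not None]
    return max(versioned)[1] if versioned else None

files = ["v1.metadata.json", "v2.metadata.json", "v6.metadata.json",
         "snap-3329158953122539010-1-f4d2b1bd-22d3-4957-a5fe-f84118663407.avro"]
print(latest_metadata_file(files))  # v6.metadata.json
```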

2: Get the location of metadata.json from the metadata_location table property;

  * metadata.json contains the information for every snapshot;
  * read the value of current-snapshot-id from the file, e.g. 3329158953122539010;
  * look up the snapshot entry with that snapshot-id in the same file:

{
    "sequence-number" : 6,
    "snapshot-id" : 3329158953122539010,
    "parent-snapshot-id" : 6402031869467028731,
    "timestamp-ms" : 1660186257707,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1660183319013",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "1120",
      "changed-partition-count" : "1",
      "total-records" : "8",
      "total-files-size" : "7997",
      "total-data-files" : "7",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/snap-3329158953122539010-1-f4d2b1bd-22d3-4957-a5fe-f84118663407.avro",
    "schema-id" : 0
  } 

 * Get the manifest-list location: hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/snap-3329158953122539010-1-f4d2b1bd-22d3-4957-a5fe-f84118663407.avro
 * Download the Avro command-line tool: avro-tools-1.9.2.jar
 * Run: java -jar avro-tools-1.9.2.jar tojson snap-3329158953122539010-1-f4d2b1bd-22d3-4957-a5fe-f84118663407.avro

{"manifest_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/9cfc5807-02d1-4cb4-a705-863982a58b05-m1.avro","manifest_length":7173,"partition_spec_id":1,"content":0,"sequence_number":5,"min_sequence_number":3,"added_snapshot_id":6402031869467028731,"added_data_files_count":0,"existing_data_files_count":2,"deleted_data_files_count":0,"added_rows_count":0,"existing_rows_count":2,"deleted_rows_count":0,"partitions":{"array":[{"contains_null":false,"contains_nan":{"boolean":false},"lower_bound":{"bytes":"pt3"},"upper_bound":{"bytes":"pt3"}},{"contains_null":false,"contains_nan":{"boolean":false},"lower_bound":{"bytes":"xc3"},"upper_bound":{"bytes":"xc3"}}]}}
{"manifest_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/f4d2b1bd-22d3-4957-a5fe-f84118663407-m0.avro","manifest_length":6888,"partition_spec_id":0,"content":0,"sequence_number":6,"min_sequence_number":6,"added_snapshot_id":3329158953122539010,"added_data_files_count":1,"existing_data_files_count":0,"deleted_data_files_count":0,"added_rows_count":1,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[{"contains_null":false,"contains_nan":{"boolean":false},"lower_bound":{"bytes":"pt3"},"upper_bound":{"bytes":"pt3"}}]}}
{"manifest_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/9cfc5807-02d1-4cb4-a705-863982a58b05-m0.avro","manifest_length":6890,"partition_spec_id":0,"content":0,"sequence_number":5,"min_sequence_number":5,"added_snapshot_id":6402031869467028731,"added_data_files_count":1,"existing_data_files_count":0,"deleted_data_files_count":0,"added_rows_count":1,"existing_rows_count":0,"deleted_rows_count":0,"partitions":{"array":[{"contains_null":false,"contains_nan":{"boolean":false},"lower_bound":{"bytes":"pt3"},"upper_bound":{"bytes":"pt3"}}]}}
{"manifest_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/a56dfef2-4592-4d1d-a642-e16f4467dac2-m1.avro","manifest_length":7020,"partition_spec_id":0,"content":0,"sequence_number":3,"min_sequence_number":1,"added_snapshot_id":7048362733565878395,"added_data_files_count":0,"existing_data_files_count":3,"deleted_data_files_count":0,"added_rows_count":0,"existing_rows_count":4,"deleted_rows_count":0,"partitions":{"array":[{"contains_null":false,"contains_nan":{"boolean":false},"lower_bound":{"bytes":"pt2"},"upper_bound":{"bytes":"pt3"}}]}}
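The chain so far, metadata.json to current snapshot to manifest list, is just a dictionary lookup once the JSON is loaded. A minimal sketch, with metadata.json reduced to the fields needed (the snapshot entry is the current one shown above):

```python
import json

# metadata.json trimmed to the fields used here; the snapshot entry
# matches the current snapshot shown earlier.
metadata = json.loads("""
{
  "current-snapshot-id": 3329158953122539010,
  "snapshots": [
    {"snapshot-id": 3329158953122539010,
     "sequence-number": 6,
     "manifest-list": "hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/snap-3329158953122539010-1-f4d2b1bd-22d3-4957-a5fe-f84118663407.avro"}
  ]
}
""")

def current_manifest_list(meta):
    """Return the manifest-list path of the table's current snapshot."""
    current_id = meta["current-snapshot-id"]
    for snap in meta["snapshots"]:
        if snap["snapshot-id"] == current_id:
            return snap["manifest-list"]
    raise KeyError(f"snapshot {current_id} not found in metadata.json")

print(current_manifest_list(metadata))
```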

The manifest list above contains four manifest entries. Take one of the manifest_path values, hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/metadata/9cfc5807-02d1-4cb4-a705-863982a58b05-m1.avro, and dump it with avro-tools in the same way:

{"status":0,"snapshot_id":{"long":32144809447347390},"sequence_number":{"long":4},"data_file":{"content":0,"file_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/data/category=pt3/name=xc3/00000-0-9f003a4c-d5c0-4b56-9b8f-aab4f1d3a42d-00001.parquet","file_format":"PARQUET","partition":{"category":{"string":"pt3"},"name":{"string":"xc3"}},"record_count":1,"file_size_in_bytes":1120,"column_sizes":{"array":[{"key":1,"value":46},{"key":2,"value":48},{"key":3,"value":50},{"key":4,"value":50}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1},{"key":3,"value":1},{"key":4,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0005\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"6"},{"key":3,"value":"xc3"},{"key":4,"value":"pt3"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0005\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"6"},{"key":3,"value":"xc3"},{"key":4,"value":"pt3"}]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0}}}
{"status":0,"snapshot_id":{"long":7048362733565878395},"sequence_number":{"long":3},"data_file":{"content":0,"file_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/data/category=pt3/name=xc3/00000-3-817aac21-1f4a-49ae-97aa-337b3c77a0f8-00001.parquet","file_format":"PARQUET","partition":{"category":{"string":"pt3"},"name":{"string":"xc3"}},"record_count":1,"file_size_in_bytes":1119,"column_sizes":{"array":[{"key":1,"value":45},{"key":2,"value":48},{"key":3,"value":50},{"key":4,"value":50}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1},{"key":3,"value":1},{"key":4,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":0}]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[{"key":1,"value":"\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"5"},{"key":3,"value":"xc3"},{"key":4,"value":"pt3"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},{"key":2,"value":"5"},{"key":3,"value":"xc3"},{"key":4,"value":"pt3"}]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0}}}

This manifest file contains two data-file entries; the path of each data file is recorded in the data_file field, e.g.:

hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/data/category=pt3/name=xc3/00000-0-9f003a4c-d5c0-4b56-9b8f-aab4f1d3a42d-00001.parquet
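Since avro-tools emits one JSON object per line, a short script can collect the file paths and per-file counts. A sketch using the two entries above, trimmed to the fields needed:

```python
import json

# The two manifest entries printed above, reduced to a few fields.
raw = [
    '{"status":0,"data_file":{"file_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/data/category=pt3/name=xc3/00000-0-9f003a4c-d5c0-4b56-9b8f-aab4f1d3a42d-00001.parquet","record_count":1,"file_size_in_bytes":1120}}',
    '{"status":0,"data_file":{"file_path":"hdfs://ns1/dtInsight/hive/warehouse/zdk.db/sample8/data/category=pt3/name=xc3/00000-3-817aac21-1f4a-49ae-97aa-337b3c77a0f8-00001.parquet","record_count":1,"file_size_in_bytes":1119}}',
]

total_records = 0
total_bytes = 0
for line in raw:
    entry = json.loads(line)
    df = entry["data_file"]
    # status in a manifest entry: 0 = EXISTING, 1 = ADDED, 2 = DELETED
    total_records += df["record_count"]
    total_bytes += df["file_size_in_bytes"]
    print(entry["status"], df["record_count"], df["file_path"])

print(total_records, "rows,", total_bytes, "bytes")  # 2 rows, 2239 bytes
```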

The manifest file also records per-column statistics for each Parquet file: column sizes, value counts, null counts, and lower/upper bounds.

(Figure: column statistics recorded in the manifest file)

How Iceberg uses the lower_bounds and upper_bounds column statistics above will be analyzed in a later post.
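Even before that analysis, the bounds are easy to decode: the table spec's single-value serialization stores a BIGINT bound as 8 little-endian bytes, which is why the id column's bound prints as "\u0005\u0000..." above. A minimal sketch of decoding a bound and of the min/max check that lets a scan skip files:

```python
import struct

def decode_long_bound(raw: bytes) -> int:
    """Decode a BIGINT bound: 8 little-endian bytes per the Iceberg
    table spec's single-value serialization."""
    return struct.unpack("<q", raw)[0]

# The id column's bounds from the first manifest entry above,
# printed by avro-tools as "\u0005\u0000\u0000\u0000\u0000\u0000\u0000\u0000".
id_lower = decode_long_bound(b"\x05\x00\x00\x00\x00\x00\x00\x00")
id_upper = decode_long_bound(b"\x05\x00\x00\x00\x00\x00\x00\x00")
print(id_lower, id_upper)  # 5 5

def can_skip_file(value: int, lower: int, upper: int) -> bool:
    """For a predicate like `id = value`, the data file can be skipped
    when the value falls outside the file's [lower, upper] range."""
    return value < lower or value > upper

print(can_skip_file(3, id_lower, id_upper))  # True: file pruned for id = 3
print(can_skip_file(5, id_lower, id_upper))  # False: file must be scanned
```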
