开发者社区> 优惠券活动> 正文

Hive语言手册-ORC

简介: Hive语言手册-ORC LanguageManual ORC ORC File Format Version Introduced in Hive version 0.11.0. Optimized Row Columnar(ORC)文件格式提供了存储Hive数据的高效方法。
+关注继续查看

Hive语言手册-ORC

LanguageManual ORC

ORC File Format

Version

Introduced in Hive version 0.11.0.

Optimized Row Columnar(ORC)文件格式提供了存储Hive数据的高效方法。它的设计是为了克服其他Hive文件格式的限制。使用ORC文件可以提高Hive在读取、写入和处理数据时的性能。

与RCFile格式相比,ORC文件格式有很多优点:

  • 单个文件作为每个任务的输出,它减少了NameNode的负载。
  • Hive类型支持,包括datetime、decimal和复杂类型(struct、list、map和union)
  • 在文件中存储轻量级索引
    • 跳过不通过谓词过滤的行组
    • 寻找一个给定的行
  • 基于数据类型的块模式压缩
    • 整数列的运行长度编码
    • 对字符串列的字典编码
  • 使用单独的记录阅读器并发读取相同的文件
  • 能够不扫描标记分割文件
  • 限制读写所需的内存
  • 使用协议缓冲区存储的元数据,允许添加和删除字段

文件结构

ORC文件包含称为stripes的行数据组,以及文件页脚中的辅助信息。在文件的末尾,一个postscript 保存了压缩参数和压缩页脚的大小。

默认的脚本大小为250MB。大的stripes可以大块的高效的从HDFS上读取。

文件页脚包含文件中的stripes 列表、每个stripes 的行数和每列的数据类型。它还包含列级的聚合计数、最小值、最大值和总和。

该图说明了ORC文件结构:

Stripe Structure

As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations. Row data is used in table scans.

Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block.  Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads, despite large stripe sizes. By default every 10,000 rows can be skipped.

With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.

A complete specification of the format is given in the ORC specification.

HiveQL Syntax

File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:

Key

Default

Notes

orc.compress ZLIB high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size 262,144 number of bytes in each compression chunk
orc.stripe.size 67,108,864 number of bytes in each stripe
orc.row.index.stride 10,000 number of rows between index entries (must be >= 1000)
orc.create.index true whether to create row indexes
orc.bloom.filter.columns "" comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp 0.05 false positive probability for bloom filter (must >0.0 and <1.0)

For example, creating an ORC stored table without compression:

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="NONE");

Version 0.14.0+: CONCATENATE

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE can be used to merge small ORC files into a larger file, starting in Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data.

Serialization and Compression

The serialization of column data in an ORC file depends on whether the data type is integer or string.

Integer Column Serialization

Integer columns are serialized in two streams.

  1. present bit stream: is the value non-null?
  2. data stream: a stream of integers

Integer data is serialized in a way that takes advantage of the common distribution of numbers:

  • Integers are encoded using a variable-width encoding that has fewer bytes for small integers.
  • Repeated values are run-length encoded.
  • Values that differ by a constant in the range (-128 to 127) are run-length encoded.

The variable-width encoding is based on Google's protocol buffers and uses the high bit to represent whether this byte is not the last and the lower 7 bits to encode data. To encode negative numbers, a zigzag encoding is used where 0, -1, 1, -2, and 2 map into 0, 1, 2, 3, 4, and 5 respectively.

Each set of numbers is encoded this way:

  • If the first byte (b0) is negative:
    • -b0 variable-length integers follow.
  • If the first byte (b0) is positive:
    • it represents b0 + 3 repeated integers
    • the second byte (-128 to +127) is added between each repetition
    • 1 variable-length integer.

In run-length encoding, the first byte specifies run length and whether the values are literals or duplicates. Duplicates can step by -128 to +128. Run-length encoding uses protobuf style variable-length integers.

String Column Serialization

Serialization of string columns uses a dictionary to form unique column values. The dictionary is sorted to speed up predicate filtering and improve compression ratios.

String columns are serialized in four streams.

  1. present bit stream: is the value non-null?
  2. dictionary data: the bytes for the strings
  3. dictionary length: the length of each entry
  4. row data: the row values

Both the dictionary length and the row values are run-length encoded streams of integers.

Compression

Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.

The codec can be Snappy, Zlib, or none.

ORC File Dump Utility

The ORC file dump utility analyzes ORC files.  To invoke it, use this command:

// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
 
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] 
    [--backup-path <new-path>] <location-of-orc-file-or-directory>

Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).

Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).

Specifying -t in the command will print the timezone id of the writer.

Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.

Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.

Specifying --skip-dump along with --recover will perform recovery without dumping metadata.

Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).

<location-of-orc-file> is the URI of the ORC file.

<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.

ORC Configuration Parameters

The ORC configuration parameters are described in Hive Configuration Properties – ORC File Format.

ORC Format Specification

The ORC specification has moved to ORC project

原文地址http://www.bieryun.com/2447.html

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

相关文章
阿里云服务器怎么设置密码?怎么停机?怎么重启服务器?
如果在创建实例时没有设置密码,或者密码丢失,您可以在控制台上重新设置实例的登录密码。本文仅描述如何在 ECS 管理控制台上修改实例登录密码。
23538 0
阿里云服务器ECS远程登录用户名密码查询方法
阿里云服务器ECS远程连接登录输入用户名和密码,阿里云没有默认密码,如果购买时没设置需要先重置实例密码,Windows用户名是administrator,Linux账号是root,阿小云来详细说下阿里云服务器远程登录连接用户名和密码查询方法
22248 0
阿里云服务器安全组设置内网互通的方法
虽然0.0.0.0/0使用非常方便,但是发现很多同学使用它来做内网互通,这是有安全风险的,实例有可能会在经典网络被内网IP访问到。下面介绍一下四种安全的内网互联设置方法。 购买前请先:领取阿里云幸运券,有很多优惠,可到下文中领取。
22096 0
如何设置阿里云服务器安全组?阿里云安全组规则详细解说
阿里云安全组设置详细图文教程(收藏起来) 阿里云服务器安全组设置规则分享,阿里云服务器安全组如何放行端口设置教程。阿里云会要求客户设置安全组,如果不设置,阿里云会指定默认的安全组。那么,这个安全组是什么呢?顾名思义,就是为了服务器安全设置的。安全组其实就是一个虚拟的防火墙,可以让用户从端口、IP的维度来筛选对应服务器的访问者,从而形成一个云上的安全域。
18900 0
windows server 2008阿里云ECS服务器安全设置
最近我们Sinesafe安全公司在为客户使用阿里云ecs服务器做安全的过程中,发现服务器基础安全性都没有做。为了为站长们提供更加有效的安全基础解决方案,我们Sinesafe将对阿里云服务器win2008 系统进行基础安全部署实战过程! 比较重要的几部分 1.
11978 0
阿里云服务器如何登录?阿里云服务器的三种登录方法
购买阿里云ECS云服务器后如何登录?场景不同,云吞铺子总结大概有三种登录方式: 登录到ECS云服务器控制台 在ECS云服务器控制台用户可以更改密码、更换系统盘、创建快照、配置安全组等操作如何登录ECS云服务器控制台? 1、先登录到阿里云ECS服务器控制台 2、点击顶部的“控制台” 3、通过左侧栏,切换到“云服务器ECS”即可,如下图所示 通过ECS控制台的远程连接来登录到云服务器 阿里云ECS云服务器自带远程连接功能,使用该功能可以登录到云服务器,简单且方便,如下图:点击“远程连接”,第一次连接会自动生成6位数字密码,输入密码即可登录到云服务器上。
36367 0
+关注
优惠券活动
阿里云优惠码阿里云推荐券bieryun.com
393
文章
9
问答
文章排行榜
最热
最新
相关电子书
更多
JS零基础入门教程(上册)
立即下载
性能优化方法论
立即下载
手把手学习日志服务SLS,云启实验室实战指南
立即下载