A Detailed Guide to Configuring Partition Types in Apache Hudi (Part 2)

The table that Hudi creates in Hive on sync is as follows:
CREATE EXTERNAL TABLE `dateformatsinglepartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `location` string,
  `name` string,
  `sex` string,
  `ts` bigint)
PARTITIONED BY (
  `date` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/dateFormatSinglePartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816155107',
  'transient_lastDdlTime'='1597564276')
Querying the table dateformatsinglepartitiondemo gives:

[Figure: query result for the table dateformatsinglepartitiondemo]
 
2.2 Multiple partitions
Multiple partitioning covers the case where several fields together serve as the partition fields, such as the location and sex fields used above. The core configuration items are as follows:
  • Set DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY() to location,sex;
  • Set hoodie.datasource.hive_sync.partition_fields to location,sex, the same as the partition fields written to Hudi;
  • Set DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY() to org.apache.hudi.keygen.ComplexKeyGenerator;
  • Set hoodie.datasource.hive_sync.partition_extractor_class to org.apache.hudi.hive.MultiPartKeysValueExtractor;
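As a minimal sketch, the four settings above can be collected into a single options map. The string keys are assumed to be the property names behind the two DataSourceWriteOptions constants, and the helper name multi_partition_options is hypothetical, introduced only for illustration.

```python
def multi_partition_options(partition_fields):
    """Build write and Hive-sync options for a multi-field partition layout.

    The keys for the two DataSourceWriteOptions constants are assumed to be
    the plain property names shown below.
    """
    fields = ",".join(partition_fields)
    return {
        # Partition fields used when writing to Hudi
        "hoodie.datasource.write.partitionpath.field": fields,
        # Must match the partition fields written to Hudi
        "hoodie.datasource.hive_sync.partition_fields": fields,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    }


options = multi_partition_options(["location", "sex"])
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

With PySpark, a map like this would typically be handed to the writer through df.write.format("hudi").options(**options), together with the table name, record key and other required settings that this section does not cover.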
The table that Hudi creates in Hive on sync is as follows:
CREATE EXTERNAL TABLE `multipartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `name` string,
  `ts` bigint)
PARTITIONED BY (
  `location` string,
  `sex` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/multiPartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816160557',
  'transient_lastDdlTime'='1597565166')
Querying the table multipartitiondemo gives:
[Figure: query result for the table multipartitiondemo]
 
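2.3 No partitions
The non-partitioned case means there is no partition field, so the dataset written to Hudi is not partitioned. The core configuration is as follows:
  • Set DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY() to an empty string;
  • Set hoodie.datasource.hive_sync.partition_fields to an empty string, the same as the partition fields written to Hudi;
  • Set DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY() to org.apache.hudi.keygen.NonpartitionedKeyGenerator;
  • Set hoodie.datasource.hive_sync.partition_extractor_class to org.apache.hudi.hive.NonPartitionedExtractor;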
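The non-partitioned settings above can be sketched the same way. The string keys are again assumed to be the property names behind the DataSourceWriteOptions constants, and the variable name non_partitioned_options is hypothetical.

```python
# Options for a dataset without partition fields (a sketch; the keys for the
# DataSourceWriteOptions constants are assumed property names).
non_partitioned_options = {
    # Both partition-field settings are empty strings and must agree
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.hive_sync.partition_fields": "",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.NonPartitionedExtractor",
}

# Sanity check: the write-side and sync-side partition fields stay in step.
fields_match = (
    non_partitioned_options["hoodie.datasource.write.partitionpath.field"]
    == non_partitioned_options["hoodie.datasource.hive_sync.partition_fields"]
)
print("partition fields consistent:", fields_match)
```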
The table that Hudi creates in Hive on sync is as follows:
CREATE EXTERNAL TABLE `nonpartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `location` string,
  `name` string,
  `sex` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/nonPartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816161558',
  'transient_lastDdlTime'='1597565767')
Querying the table nonpartitiondemo gives:
[Figure: query result for the table nonpartitiondemo]
 
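2.4 Hive-style partitions
Besides the common partitioning schemes above, there is also a Hive-style partition format, for example location=beijing/sex=male. With location and sex as the partition fields, the core configuration is as follows:
  • Set DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY() to location,sex;
  • Set hoodie.datasource.hive_sync.partition_fields to location,sex, the same as the partition fields written to Hudi;
  • Set DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY() to org.apache.hudi.keygen.ComplexKeyGenerator;
  • Set hoodie.datasource.hive_sync.partition_extractor_class to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor. This extractor is designed for yyyy/mm/dd-style paths, so for a location=.../sex=... layout org.apache.hudi.hive.MultiPartKeysValueExtractor (as in section 2.2) is likely the appropriate choice;
  • Set DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY() to true;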
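A hedged sketch of the Hive-style settings follows. Two assumptions are made: the string keys are taken to be the property names behind the DataSourceWriteOptions constants, and the sketch uses MultiPartKeysValueExtractor instead of the extractor named in the list, on the assumption that a two-level key=value path does not fit a yyyy/mm/dd-oriented extractor. The variable name hive_style_options is hypothetical.

```python
partition_fields = "location,sex"

# Hive-style partitioning options (a sketch under the assumptions stated in
# the lead-in; not a verified end-to-end configuration).
hive_style_options = {
    "hoodie.datasource.write.partitionpath.field": partition_fields,
    "hoodie.datasource.hive_sync.partition_fields": partition_fields,
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    # Assumption: a multi-key extractor is used for location=.../sex=... paths
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    # Writes directories as field=value instead of bare values
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

for key in sorted(hive_style_options):
    print(f"{key}={hive_style_options[key]}")
```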
The resulting Hudi dataset directory structure has the following format:
/location=beijing/sex=male
The table that Hudi creates in Hive on sync is as follows:
CREATE EXTERNAL TABLE `hivestylepartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `name` string,
  `ts` bigint)
PARTITIONED BY (
  `location` string,
  `sex` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/hiveStylePartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816172710',
  'transient_lastDdlTime'='1597570039')
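To make the Hive-style directory layout concrete, the sketch below composes a field=value path from a record and then recovers the partition values from it. It only illustrates the idea and is not Hudi's implementation. The function names hive_style_path and partition_values and the sample record are hypothetical.

```python
def hive_style_path(record, partition_fields):
    """Join the partition fields of a record as field=value path segments."""
    return "/".join(f"{field}={record[field]}" for field in partition_fields)


def partition_values(path):
    """Recover partition values from a path, with or without field= prefixes."""
    values = []
    for segment in path.strip("/").split("/"):
        # Keep the text after the first "=", or the whole segment if absent.
        _, sep, value = segment.partition("=")
        values.append(value if sep else segment)
    return values


# Hypothetical record using field names from the table schema above
record = {"name": "example", "age": 20, "location": "beijing", "sex": "male"}

path = hive_style_path(record, ["location", "sex"])
print(path)
print(partition_values("/" + path))
```

The two values recovered from the path line up, in order, with the location and sex columns in the PARTITIONED BY clause.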

