Hive external tables are very convenient to use, and persisting Kafka data to HDFS via a Flume sink is a mature approach. Note: msck repair may fail on external tables — here the partition directories are named 20221226 rather than ddate=2022-12-26, so msck likely cannot discover them and the partitions have to be added explicitly.

The error "No LZO codec found, cannot run." usually means that the io.compression.codecs property in HDFS's core-site.xml does not list the LZO codecs. Before editing the configuration, first check that hadoop-lzo-*.jar has been deployed to the Hadoop cluster and added to HADOOP_CLASSPATH. The configuration fix is to append com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec (see the core-site.xml snippet at the end). After restarting the Hadoop services, the query returns normally.

Data preparation
hdfs dfs -du -h /data/test/yxz_order_info/
51.8 M 155.5 M /data/test/yxz_order_info/20221226
50.0 M 149.9 M /data/test/yxz_order_info/20221227
50.6 M 151.9 M /data/test/yxz_order_info/20221228
50.8 M 152.5 M /data/test/yxz_order_info/20221229
53.0 M 158.9 M /data/test/yxz_order_info/20221230
58.0 M 174.0 M /data/test/yxz_order_info/20221231
61.5 M 184.6 M /data/test/yxz_order_info/20230101
59.0 M 177.0 M /data/test/yxz_order_info/20230102
Create the table
CREATE EXTERNAL TABLE `newlog.yxz_order_info`(
`task_id` bigint,
`order_id` bigint,
`transporter_id` int,
`accept_request_time` bigint,
`is_success` int,
`fail_reason_type` int,
`fail_reason` string,
`suc_type` int)
PARTITIONED BY (
`ddate` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nn3/data/test/yxz_order_info';
Add partitions
alter table newlog.yxz_order_info add partition(ddate='2022-12-26') location '/data/test/yxz_order_info/20221226';
# add the remaining partitions in the same way (omitted)
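Rather than typing each statement, the remaining partitions can be generated with a small shell loop. This is a sketch that assumes the yyyymmdd directory layout shown above; the generated DDL can be piped into the hive CLI.

```shell
# Build one ADD PARTITION statement per dated directory (a sketch; assumes
# the yyyymmdd directory layout under /data/test/yxz_order_info shown above).
stmts=""
for d in 20221226 20221227 20221228 20221229 20221230 20221231 20230101 20230102; do
  # 20221226 -> 2022-12-26, to match the ddate partition value format
  ddate=$(echo "$d" | sed 's/^\(....\)\(..\)\(..\)$/\1-\2-\3/')
  stmts="${stmts}alter table newlog.yxz_order_info add if not exists partition(ddate='${ddate}') location '/data/test/yxz_order_info/${d}';
"
done
printf '%s' "$stmts"    # pipe this into `hive` to execute the statements
```

`if not exists` makes the script safe to re-run; afterwards, `show partitions newlog.yxz_order_info;` confirms all eight partitions are registered.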
Query
hive> select * from newlog.yxz_order_info limit 10;
OK
Failed with exception java.io.IOException:java.io.IOException: No LZO codec found, cannot run.
Time taken: 0.078 seconds
core-site.xml (io.compression.codecs, with the two LZO codecs appended):
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
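Before (or after) editing the configuration, it is worth confirming that the hadoop-lzo jar is actually visible on the classpath. The sketch below splits a classpath string and searches for the jar; the hard-coded path and jar version are example values only — on a real cluster, obtain the string with `classpath=$(hadoop classpath)`.

```shell
# Split a Hadoop classpath into one entry per line and look for the lzo jar.
# The classpath string and jar version below are EXAMPLE values; in a real
# cluster use:  classpath=$(hadoop classpath)
classpath="/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/hadoop-lzo-0.4.20.jar:/opt/hadoop/share/hadoop/common/*"
lzo_jar=$(printf '%s' "$classpath" | tr ':' '\n' | grep -i 'hadoop-lzo')
printf '%s\n' "${lzo_jar:-hadoop-lzo jar NOT on the classpath}"
```

If the jar is missing, adding codecs to io.compression.codecs alone will not help — the codec classes must be loadable before the configuration change takes effect.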