Troubleshooting and Fixing HBase NotServingRegionException for Region Data
0 Problem
ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region phm_default_lightunit,,1606205408615.397792fb6a31a2a183c3031d173c61d2. is not online on bd--4.jx.com,16020,1620637191420
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3077)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1015)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2347)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
Failed with exception java.io.IOException:org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Tue May 11 13:34:26 CST 2021, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68229: org.apache.hadoop.hbase.NotServingRegionException: Region phm_default_lightunit,,1606205408615.397792fb6a31a2a183c3031d173c61d2. is not online on bd--4.jx.com,16020,1620637191420
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3077)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1015)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2347)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
row '' on table 'phm_default_lightunit' at region=phm_default_lightunit,,1606205408615.397792fb6a31a2a183c3031d173c61d2., hostname=bd--4.jx.com,16020,1620629554687, seqNum=759396
Region troubleshooting
Check the table status:
hbase hbck -summary phm_default_lightunit
The result is as follows:
Summary:
Table phm_default_lightunit is okay.
Number of regions: 0
Deployed on:
Table hbase:meta is okay.
Number of regions: 1
Deployed on: bd--3.jx.com,16020,1620629563497
2 inconsistencies detected.
Status: INCONSISTENT
hbck detected 2 inconsistencies.
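When the check is repeated during a repair, it can help to reduce the summary to the two values that matter. The sketch below is only an illustration: the function name `hbck_summary` is hypothetical, and it assumes the output contains the "N inconsistencies detected." and "Status: ..." lines exactly as shown above.

```shell
#!/usr/bin/env bash
# Read `hbase hbck -summary <table>` output on stdin and print
# "<inconsistency count> <status>". A "?" means the line was not found.
hbck_summary() {
  awk '
    /inconsistencies detected/ { count = $1 }    # e.g. "2 inconsistencies detected."
    /^[[:space:]]*Status:/     { status = $2 }   # e.g. "Status: INCONSISTENT"
    END {
      if (count == "")  count = "?"
      if (status == "") status = "?"
      print count, status
    }
  '
}
```

Possible usage: hbase hbck -summary phm_default_lightunit | hbck_summary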
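In addition, the HBase web UI showed regions of this table whose metrics were all 0, which is abnormal.
This confirms that the table's region data is in an abnormal state. The affected regions reported in the logs are: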
- ns:phm_default_lightunit,1b1e73ed71329daa72ffb28f42947ee7d40c9cb7872063682cb511fd16dbbfd1201810277ccbfc2577361956656a23f805afb859,1549491783878.1513a155f24ac2df20b4797077ed6ae0.
- ns:phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.
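These entries are rowkeys in the hbase:meta table. Following the rowkey layout of the meta table, take the last segment of the rowkey (the encoded region name) and check on HDFS whether the region's data directory exists: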
hadoop fs -ls /user/hbase/data/phm_default_lightunit/1513a155f24ac2df20b4797077ed6ae0
ls: `/user/hbase/data/phm_default_lightunit/1513a155f24ac2df20b4797077ed6ae0`: No such file or directory
As shown, the region's data directory no longer exists on HDFS.
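Copying the encoded name out of a long rowkey by hand is error-prone, so the step can be scripted. This is a minimal sketch: `encoded_name` and `region_dir` are hypothetical helper names, and the data directory passed to `region_dir` is an assumption that must match your own cluster layout.

```shell
#!/usr/bin/env bash
# Print the encoded region name: the last dot-delimited segment of a meta
# rowkey of the form "<table>,<startkey>,<timestamp>.<encoded>."
encoded_name() {
  local rowkey="${1%.}"            # drop the trailing dot
  printf '%s\n' "${rowkey##*.}"    # keep what follows the last remaining dot
}

# Print the HDFS directory of a region.
# $1 = data directory (assumption: cluster specific), $2 = table, $3 = rowkey
region_dir() {
  local data_dir="$1" table="$2" rowkey="$3"
  printf '%s/%s/%s\n' "$data_dir" "$table" "$(encoded_name "$rowkey")"
}
```

Possible usage, with the rowkey stored in a shell variable: hadoop fs -ls "$(region_dir /user/hbase/data phm_default_lightunit "$rowkey")"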
In the hbase shell, query the meta table for the value stored under the abnormal rowkey:
get 'hbase:meta','phm_default_lightunit,1b1e73ed71329daa72ffb28f42947ee7d40c9cb7872063682cb511fd16dbbfd1201810277ccbfc2577361956656a23f805afb859,1549491783878.1513a155f24ac2df20b4797077ed6ae0.'
# Result:
info:regioninfo timestamp=1566891946444, value={ENCODED => 96dd94004c9dd9fce3f4eb80c885ad85, NAME => 'phm_default_lightunit,0778ba8d2889fe7343ffc120c4ae83da0778ba8d2889fe7343ffc120c4ae83da,1559425462970.96dd94004c9dd9fce3f4eb80c885ad85.', STARTKEY => '0778ba8d2889fe7343ffc120c4ae83da0778ba8d2889fe7343ffc120c4ae83da', ENDKEY => '085787e9323caa99fc45325a351797fdd4849891167b36bfc6241197b5a58cb1201802142fd6fdc3bf0b5b9cb8faf80d04b71d1a'}
info:seqnumDuringOpen timestamp=1566891946444, value=\x00\x00\x00\x00\x00\x00\x02'
info:server timestamp=1566891946444, value=node85-104:16020
info:serverstartcode timestamp=1566891946444, value=1563956428836
info:sn timestamp=1566891945886, value=node85-104,16020,1563956428836
info:state timestamp=1566891946444, value=OPEN
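The metadata itself looks normal. At this point the cause can be confirmed:
The metadata reports the region as open and serving requests, but when the client goes to the hosting node to read data, the region's data directory does not exist, so the exception is thrown.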
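The same comparison between meta entries and HDFS directories can be run over several rowkeys at once. The sketch below is an illustration only: `check_region_dirs` is a hypothetical name, the data directory argument is an assumption about the cluster layout, and the only external command used is `hdfs dfs -test -d`.

```shell
#!/usr/bin/env bash
# For each meta rowkey read from stdin, report whether the region's
# directory exists under the given table data directory.
# $1 = directory holding the table's region directories (assumption).
check_region_dirs() {
  local data_dir="$1" rowkey encoded
  while IFS= read -r rowkey; do
    [ -n "$rowkey" ] || continue             # skip blank lines
    encoded="${rowkey%.}"                    # drop the trailing dot
    encoded="${encoded##*.}"                 # keep the encoded name
    # </dev/null keeps the command from consuming the rowkeys on stdin
    if hdfs dfs -test -d "$data_dir/$encoded" </dev/null; then
      echo "OK      $encoded"
    else
      echo "MISSING $encoded"
    fi
  done
}
```

A region reported as MISSING while its info:state is OPEN matches the situation described above.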
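Appendix: meta table structure
The hbase:meta table is structured as follows:
- rowkey: ${table name},${start key},${region timestamp}.${encoded name}.
- info:state: the region state; OPEN under normal conditions
- info:serverstartcode: the 13-digit (millisecond) timestamp at which the RegionServer started
- info:server: address and port of the hosting RegionServer, e.g. node85-47:16020
- info:sn: the server name, composed of server and serverstartcode, e.g. node85-47,16020,1549491783878
- info:seqnumDuringOpen: the region's sequence ID at the time it was opened, stored as a binary string
- info:regioninfo: detailed region information such as ENCODED, NAME, STARTKEY and ENDKEY
Of these, regioninfo is the most important:
- ENCODED: a 32-character MD5 string derived from ${table name},${start key},${region timestamp}. It is the unique identifier under which the region's data is stored on HDFS, so this value from the meta table locates the concrete HDFS path. The ${encoded name} at the end of the rowkey is the ENCODED value and forms part of the rowkey.
- NAME: identical to the rowkey
- STARTKEY: the start key of the region
- ENDKEY: the end key of the region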
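The rowkey layout described above can be taken apart with plain shell parameter expansion. This is a sketch under the assumption that the table name contains no comma and that the timestamp follows the last comma; `parse_region_name` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Split a meta rowkey "<table>,<startkey>,<timestamp>.<encoded>." into fields.
parse_region_name() {
  local name="${1%.}"              # strip the trailing dot
  local encoded="${name##*.}"      # text after the last dot
  local head="${name%.*}"          # "<table>,<startkey>,<timestamp>"
  local table="${head%%,*}"        # text before the first comma
  local ts="${head##*,}"           # text after the last comma
  local start="${head#*,}"         # drop "<table>,"
  start="${start%,*}"              # drop ",<timestamp>"
  printf 'table=%s\nstartkey=%s\ntimestamp=%s\nencoded=%s\n' \
    "$table" "$start" "$ts" "$encoded"
}
```

An empty startkey in the output means the region is the first region of the table, as with the region named in the exception at the top of this post.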
Repair process
Using the metadata repair tool
First try repairing directly with hbck:
hbase hbck -repair phm_default_lightunit
hbase hbck -fixMeta
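If the commands above do not resolve the problem, continue with the steps below:
Use the HDFS tooling to check for corrupt file blocks: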
hdfs fsck /hbase
# Result: healthy
Manually repairing the metadata
Back up the existing region information:
get 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.'
# Save the output above as a backup
# Delete the region's information
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:regioninfo'
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:server'
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:serverstartcode'
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:sn'
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:state'
delete 'hbase:meta','phm_default_lightunit,1b37fe2afa75581bf5f7574336f1cdfbf0f902966db54bf38aa505665828929020181207edad0e69052f705b07331e32c00b5aae,1549491783878.d4a172571b229e74109df6c03a959588.','info:seqnumDuringOpen'
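Typing six delete statements with a rowkey this long invites copy errors, so the statements can be generated instead. The sketch below only prints the commands and does not execute anything; `meta_delete_cmds` is a hypothetical helper name, and the column list is the one used in the manual steps above.

```shell
#!/usr/bin/env bash
# Print the hbase shell statements that remove one region row from
# hbase:meta, one delete per info: column. Nothing is executed here.
meta_delete_cmds() {
  local rowkey="$1" col
  for col in regioninfo server serverstartcode sn state seqnumDuringOpen; do
    printf "delete 'hbase:meta','%s','info:%s'\n" "$rowkey" "$col"
  done
}
```

After reviewing the printed statements and saving the backup, they could be piped into the shell, for example: meta_delete_cmds "$rowkey" | hbase shell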
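After refreshing the HBase web UI, region d4a172571b229e74109df6c03a959588 is gone.
Running hbck -summary again shows the number of inconsistencies dropping from 2 to 1.
Then try:
- a manual scan in the hbase shell
- querying the value again through the API
Both return results successfully.
After also deleting the meta data for region 1513a155f24ac2df20b4797077ed6ae0, re-running the test samples returns correct results.
Problem analysis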
- d4a172571b229e74109df6c03a959588
- 1513a155f24ac2df20b4797077ed6ae0
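When the two regions above were split, the daughter regions' data had been written and registered in the meta table, and the daughters were online and serving normally. The parent regions' data had been deleted, but the corresponding entries in the meta table were never updated (the cause is still under investigation).
As a result, queries for that data were still routed to the parent region, whose data had already been deleted, so they could not succeed.
The following command lists all region entries for this table in the meta table, which helps determine whether another region's key range covers the data of the problematic region: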
echo "scan 'hbase:meta',{FILTER => org.apache.hadoop.hbase.filter.PrefixFilter.new(org.apache.hadoop.hbase.util.Bytes.toBytes('phm_default_lightunit'))}" | hbase shell | awk -F ' ' '{print $1}' | grep phm_default_lightunit| grep -v bak | sort | uniq
Its output shows whether the key range of the missing region is already being served by other regions.
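One way to scan that list for leftover parents: after a split, a parent region normally shares its start key with one of its daughters, so a start key that appears in more than one region name is worth a closer look. The sketch below groups region names by "table,startkey"; `find_shared_start_keys` is a hypothetical name, and the input is assumed to be one region name per line, as produced by the pipeline above.

```shell
#!/usr/bin/env bash
# Read region names "<table>,<startkey>,<timestamp>.<encoded>." on stdin
# and print every "table,startkey" that is used by more than one region.
find_shared_start_keys() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      key = $0
      sub(/\.[^.]*\.$/, "", key)          # drop ".<encoded>."
      sub(/,[^,]*$/, "", key)             # drop ",<timestamp>"
      count[key]++
      regions[key] = regions[key] "  " $0 "\n"
    }
    END {
      for (key in count)
        if (count[key] > 1)
          printf "start key shared by %d regions: %s\n%s", count[key], key, regions[key]
    }
  '
}
```

A shared start key is only a hint: the listed regions still have to be checked against HDFS and their info:regioninfo before anything is deleted.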
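A NotServingRegionException most likely indicates that something went wrong during a region transition. The hbase:meta table and the region information it records help pinpoint the problem, so knowing the meta table structure and its storage rules is a very effective aid.
The result after the repair: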
hbase(main):005:0> scan "phm_default_lightunit",{LIMIT => 3}
ROW COLUMN+CELL
1207924082503057408{}1207939433961881600{}1 column=d:bulb_focpla_resis, timestamp=1606374487519, value=0.334972
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:gw_id, timestamp=1606374487519, value=1207924082503057408
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:msg_label, timestamp=1606374487519, value=0
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:msg_time, timestamp=1606374487519, value=1606374551432
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:row_key, timestamp=1606374487519, value=1207924082503057408{}1207939433961881600{}1606374551432
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:sensor_id, timestamp=1606374487519, value=1207939433961881600
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:sky_light_state, timestamp=1606374487519, value=0
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_d_out_vol, timestamp=1606374487519, value=0.836828
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_l_out_vol, timestamp=1606374487519, value=0.818128
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_ic, timestamp=1606374487519, value=0.496074
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_lm, timestamp=1606374487519, value=0
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_state, timestamp=1606374487519, value=0
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:slight_type, timestamp=1606374487519, value=0
606374551432
1207924082503057408{}1207939433961881600{}1 column=d:bulb_focpla_resis, timestamp=1606374531009, value=0.190709
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:gw_id, timestamp=1606374531009, value=1207924082503057408
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:msg_label, timestamp=1606374531009, value=0
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:msg_time, timestamp=1606374531009, value=1606374594921
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:row_key, timestamp=1606374531009, value=1207924082503057408{}1207939433961881600{}1606374594921
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:sensor_id, timestamp=1606374531009, value=1207939433961881600
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:sky_light_state, timestamp=1606374531009, value=0
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_d_out_vol, timestamp=1606374531009, value=0.244069
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_l_out_vol, timestamp=1606374531009, value=0.912391
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_ic, timestamp=1606374531009, value=0.566164
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_lm, timestamp=1606374531009, value=0
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_state, timestamp=1606374531009, value=0
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:slight_type, timestamp=1606374531009, value=0
606374594921
1207924082503057408{}1207939433961881600{}1 column=d:bulb_focpla_resis, timestamp=1606374574792, value=0.284280
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:gw_id, timestamp=1606374574792, value=1207924082503057408
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:msg_label, timestamp=1606374574792, value=0
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:msg_time, timestamp=1606374574792, value=1606374638704
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:row_key, timestamp=1606374574792, value=1207924082503057408{}1207939433961881600{}1606374638704
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:sensor_id, timestamp=1606374574792, value=1207939433961881600
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:sky_light_state, timestamp=1606374574792, value=0
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_d_out_vol, timestamp=1606374574792, value=0.879013
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_l_out_vol, timestamp=1606374574792, value=0.986644
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_ic, timestamp=1606374574792, value=0.521261
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_out_lm, timestamp=1606374574792, value=0
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_state, timestamp=1606374574792, value=0
606374638704
1207924082503057408{}1207939433961881600{}1 column=d:slight_type, timestamp=1606374574792, value=0
606374638704
3 row(s) in 0.2720 seconds