The HBase slave cluster has 8 RegionServer machines and had been running stably for more than 5 months. On August 15 we found that 4 DataNode processes in the cluster had died; the cause was an OutOfMemoryError (Spark is also deployed on those machines, and Spark had been given -Xmx 32g). We recovered the slave cluster and backfilled the missing data. The write load on this cluster is fairly heavy, and after a few more days of running we found that data could no longer be written into the slave cluster.
①. RegionServer side
RegionServer symptom 1:
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region table_version,hour_search_860010-1118000000_2014010418,1403685954922.640fc829f767a4e33e296fc4f4cca4a4. after a delay of 13125
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_hotstatic,860010-0507010000_2014071711_0_entry_00000008749,1406860400351.bcb13556daad6bda72b3c84df5ec912e. after a delay of 10066
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_screen,860010-2288050100_2014030419_0_00000000920,1402321410433.da4ff8fe84325e7da075b0fba1f3c3c9. after a delay of 11767
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_hotstatic,860010-1119060300_2014040422_0_bounce_ratio_00000000867,1402022490696.4fcfd303cff4211de61ff55f77d46317. after a delay of 10256
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_url,860010-0204020100_2014010607_0_8c54e33efae9da957548659c5b96f04e,1403329534827.b1c3733f5a8deade785bd71ee8660268. after a delay of 16628
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_hotstatic,860010-0335010000_2014041011_0_exit_00000000000,1399606854480.b1f83e693e0fdb18e168943d282cb6b0. after a delay of 18889
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_main,860010-2014041100_2014060513,1402472695828.c3cd5c3a1fcc01e0493a8043e376e948. after a delay of 21727
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_screen,,1396924866983.e3f0096984896efa77348dc4f89a9f3c. after a delay of 17782
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_area,860010-2316230100_2014031222_0_pv_00000000005,1395829898129.c426c025521dd8facd291f1a8ba15f13. after a delay of 6147
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_stay,860010-0604100000_2014031918_0_00000000006,1395349588239.e592ebe99f412b565381f6649bbf857f. after a delay of 16294
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_hotstatic,860010-0307010000_2014070100_0_entry_00000001023,1405881888126.055c3c19009c6822e00def0b7431d0d8. after a delay of 20105
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_hotstatic,860010-0506000000_2014072817_0_bounce_ratio_00000047803,1407729791396.22b0d3234c1173859992d231d2f2d427. after a delay of 7105
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_stay,860010-2328010100_2014010616_0_00000000011,1401896532036.547015d92a9021e31bac69909979f4ac. after a delay of 5485
2014-08-21 15:03:31,011 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region hour_flash,860010-0521010000_2014030620_0_00000000007,1407471178069.aa4f5e7e7f8e3dd150666ae1205ebbcf. after a delay of 11484
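These periodicFlusher lines come from the RegionServer's periodic memstore flusher: it requests a flush for any region whose oldest in-memory edit has sat unflushed for longer than the configured interval, and the "after a delay of N" part is only a randomized jitter so the flushes do not all fire at the same instant. The messages are INFO level and not errors in themselves; here they mainly show that a lot of regions had edits stuck in memory waiting to be flushed. A minimal hbase-site.xml sketch of the knob this thread reads (to the best of my knowledge this is the relevant property in the 0.94-era code, shown with its default of one hour; verify against your HBase version):

<property>
  <name>hbase.regionserver.optionalcacheflushinterval</name>
  <!-- maximum time (ms) an edit may sit in a memstore before the
       periodicFlusher requests a flush; 3600000 (1 hour) is the default -->
  <value>3600000</value>
</property>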
RegionServer symptom 2:
2014-08-21 10:30:43,384 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=79, maxlogs=32; forcing flush of 1 regions(s): 12663e173854886463edfe8c6495dca0
2014-08-21 10:31:53,456 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=65, maxlogs=32; forcing flush of 9 regions(s): 192e3fcd5afce28ea2abc8bbb895163d, 2149c6216b259083a6743c61ec7f62b1, 214aac4a7f31cfc346889aabdbdbadd3, 2248c5c76b0fd55fe11d428a77330e6b, 2f5d56a3c17fd8e4f6f6f62d0fbcda69, 2ff390bdbb79cb8dc8ba05b4e56c26ea, 398376b87a43d83d84e96169dadb7865, b5431ef4a70fb2a244d83ae3316506f9, f34c16e000e648988bc00692bc6c7cea
2014-08-21 10:33:25,657 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=66, maxlogs=32; forcing flush of 4 regions(s): 192e3fcd5afce28ea2abc8bbb895163d, 2f5d56a3c17fd8e4f6f6f62d0fbcda69, b5431ef4a70fb2a244d83ae3316506f9, f34c16e000e648988bc00692bc6c7cea
2014-08-21 10:33:55,418 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=60, maxlogs=32; forcing flush of 4 regions(s): 352e2b4a2a42438d5ecb735de1c9e9f4, 5d08d2713d809334514be9ec7e2512cb, 981285a02ae3af797b10e621e76eccf8, f9a55c4661a1ee2f16e3c1e6ec978595
2014-08-21 10:35:02,013 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=51, maxlogs=32; forcing flush of 3 regions(s): a6064be87ca7005a4e4ab607501d9f5a, cc84289443f2478105bd8078df2bccd3, f533780eb2913bf8819cecea52bbeb43
2014-08-21 10:39:05,129 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): 5b0d0af8b9b684237373e941238bdfa2
2014-08-21 11:34:41,619 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s): 2149c6216b259083a6743c61ec7f62b1
2014-08-21 11:36:53,437 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 1 regions(s): eec50ffaa2639f7c0fbd7ac727c16f16
2014-08-21 11:37:46,667 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=34, maxlogs=32; forcing flush of 1 regions(s): eec50ffaa2639f7c0fbd7ac727c16f16
2014-08-21 11:38:09,366 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): eec50ffaa2639f7c0fbd7ac727c16f16
2014-08-21 11:38:57,140 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 15 regions(s): 0c223074833c6a3e2835feb5f9640298, 0f461ff6911b932c013e8d5f57d110d9, 2846b752106aa8079f49e784666c17a8, 53e7a57b2028e32e90040071014b13be, 5f2053770878cfc4ae4e1849f3e128b8, 66fd00187ab38d3253fd2b440ea1a082, 6e3c2282edaebdb1bda15d49fe22df6f, 7e45f8f49ff6b697dc36d988f15a1643, a625182cd59e5ae87ead3113b3a89aaa, b77403d41440cda21e92e4d20d1dc4bc, ba2bdc3cdc3a748c5fbc4d19cdda1bbf, bab28f8f990d3aed73a982964f5731f9, e8c5bd8150ee49d0ba13ee77633d1936, f5064874556aca3c45a67463b2ad37d5, f9961ca861361ab0913f6e05571d45b5
2014-08-21 11:40:02,163 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=36, maxlogs=32; forcing flush of 15 regions(s): 0c223074833c6a3e2835feb5f9640298, 0f461ff6911b932c013e8d5f57d110d9, 2846b752106aa8079f49e784666c17a8, 53e7a57b2028e32e90040071014b13be, 5f2053770878cfc4ae4e1849f3e128b8, 66fd00187ab38d3253fd2b440ea1a082, 6e3c2282edaebdb1bda15d49fe22df6f, 7e45f8f49ff6b697dc36d988f15a1643, a625182cd59e5ae87ead3113b3a89aaa, b77403d41440cda21e92e4d20d1dc4bc, ba2bdc3cdc3a748c5fbc4d19cdda1bbf, bab28f8f990d3aed73a982964f5731f9, e8c5bd8150ee49d0ba13ee77633d1936, f5064874556aca3c45a67463b2ad37d5, f9961ca861361ab0913f6e05571d45b5
2014-08-21 11:40:47,301 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=37, maxlogs=32; forcing flush of 14 regions(s): 0c223074833c6a3e2835feb5f9640298, 0f461ff6911b932c013e8d5f57d110d9, 2846b752106aa8079f49e784666c17a8, 53e7a57b2028e32e90040071014b13be, 5f2053770878cfc4ae4e1849f3e128b8, 66fd00187ab38d3253fd2b440ea1a082, 6e3c2282edaebdb1bda15d49fe22df6f, a625182cd59e5ae87ead3113b3a89aaa, b77403d41440cda21e92e4d20d1dc4bc, ba2bdc3cdc3a748c5fbc4d19cdda1bbf, bab28f8f990d3aed73a982964f5731f9, e8c5bd8150ee49d0ba13ee77633d1936, f5064874556aca3c45a67463b2ad37d5, f9961ca861361ab0913f6e05571d45b5
2014-08-21 11:41:23,446 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=37, maxlogs=32; forcing flush of 17 regions(s): 12663e173854886463edfe8c6495dca0, 25bc0f41f28710d047c7e3775f388e39, 2f5d56a3c17fd8e4f6f6f62d0fbcda69, 3619ffc85d19102863eafe36e6d3acf8, 3b4f4f57abec73084a22bd7392247d86, 42e4757fce922723831d29326540b177, 6c53f4fb301af91f54f0d1590a7c856f, a2e173875e2287bd9ac74b9cdd289fde, c02ca04051d2684b3138662803892dd3, cd6158fa98bf85d39118e450c454e93a, d75e31ed4e06b867652a70160cd90c71, e024920c26c08afe5004f5ae51f63d35, f34c16e000e648988bc00692bc6c7cea, f378e07ac843beb2becc57e79af0362a, f49dba00bbb0c359935146ffa52bdc70, f9a55c4661a1ee2f16e3c1e6ec978595, ff82c095987dc2f6becc66cd777c7970
2014-08-21 11:42:02,502 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=38, maxlogs=32; forcing flush of 17 regions(s): 12663e173854886463edfe8c6495dca0, 25bc0f41f28710d047c7e3775f388e39, 2f5d56a3c17fd8e4f6f6f62d0fbcda69, 3619ffc85d19102863eafe36e6d3acf8, 3b4f4f57abec73084a22bd7392247d86, 42e4757fce922723831d29326540b177, 6c53f4fb301af91f54f0d1590a7c856f, a2e173875e2287bd9ac74b9cdd289fde, c02ca04051d2684b3138662803892dd3, cd6158fa98bf85d39118e450c454e93a, d75e31ed4e06b867652a70160cd90c71, e024920c26c08afe5004f5ae51f63d35, f34c16e000e648988bc00692bc6c7cea, f378e07ac843beb2becc57e79af0362a, f49dba00bbb0c359935146ffa52bdc70, f9a55c4661a1ee2f16e3c1e6ec978595, ff82c095987dc2f6becc66cd777c7970
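"Too many hlogs" means the number of WAL (HLog) files held open by this RegionServer has gone past hbase.regionserver.maxlogs (32 here), so the server force-flushes the regions whose unflushed edits are pinning the oldest WAL files, allowing those files to be archived. The fact that logs= keeps climbing to 60-79 while forced flushes are requested over and over suggests the flushes themselves were not completing fast enough, which matches the HDFS write problems shown below. A sketch of the limit as it would appear in hbase-site.xml (the value 32 is taken straight from the log lines; raising it only buys headroom, it does not fix slow flushes):

<property>
  <name>hbase.regionserver.maxlogs</name>
  <!-- once more than this many WAL files accumulate, HBase forces
       memstore flushes so the oldest WALs can be rolled and archived -->
  <value>32</value>
</property>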
RegionServer symptom 3 (this one has already been fixed by configuring the same dfs.socket.timeout=900000 on both the HDFS side and the HBase side; a config sketch follows the log excerpt below):
2014-08-23 11:19:17,598 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-6884116396095947381_111959717java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.114:53194 remote=/10.130.136.114:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readLong(DataInputStream.java:416)
at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3127)
2014-08-23 11:19:17,599 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-4289533060700867612_111959745 bad datanode[0] 10.130.136.114:50010
2014-08-23 11:19:17,599 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6884116396095947381_111959717 bad datanode[0] 10.130.136.114:50010
2014-08-23 11:19:17,599 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-4289533060700867612_111959745 in pipeline 10.130.136.114:50010, 10.130.136.115:50010: bad datanode 10.130.136.114:50010
2014-08-23 11:19:17,599 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6884116396095947381_111959717 in pipeline 10.130.136.114:50010, 10.130.136.115:50010: bad datanode 10.130.136.114:50010
2014-08-23 11:22:27,624 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=681.33 MB, free=3.32 GB, max=3.99 GB, blocks=10035, accesses=44791415, hits=40264747, hitRatio=89.89%, , cachingAccesses=40274782, cachingHits=40264747, cachingHitsRatio=99.97%, , evictions=0, evicted=0, evictedPerRun=NaN
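The 66000 millis read timeout here is the HDFS client read timeout, i.e. dfs.socket.timeout (60 s by default) plus a small per-datanode extension for the two nodes in the pipeline. As noted in the heading above, raising it to 900000 on both the HDFS side and the HBase side made this particular symptom go away. A sketch of the setting as it would appear in hdfs-site.xml on the DataNodes, and likewise in the hdfs-site.xml/hbase-site.xml loaded by the HBase processes (property name as used by this Hadoop 1.x release):

<property>
  <name>dfs.socket.timeout</name>
  <!-- read timeout (ms) for DFS client/DataNode sockets;
       default 60000, raised to 15 minutes here -->
  <value>900000</value>
</property>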
②. DataNode side
At the same time, many exceptions showed up in the HDFS DataNode logs:
DataNode exception 1:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.114:50010 remote=/10.130.136.114:59516]
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.114:50010 remote=/10.130.136.114:59524]
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.114:50010 remote=/10.130.136.114:59520]
2014-08-23 21:26:25,292 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-3011273698174656346_113017023 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_-3011273698174656346_113017023 is valid, and cannot be written to.
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_-3011273698174656346_113017023 is valid, and cannot be written to.
DataNode exception 2:
2014-08-23 23:06:56,413 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.130.136.114:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.130.136.119:50010
2014-08-23 23:06:56,895 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.130.136.114:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.130.136.119:50010
2014-08-23 23:06:57,399 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.130.136.114:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.130.136.119:50010
2014-08-23 23:06:57,548 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.130.136.114:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.130.136.119:50010
2014-08-23 23:06:57,935 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.130.136.114:50010 java.io.IOException: Bad connect ack with firstBadLink as 10.130.136.119:50010
DataNode exception 3:
2014-08-24 22:15:21,714 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
java.io.IOException: Error in deleting blocks.
at org.apache.hadoop.hdfs.server.datanode.FSDataset.invalidate(FSDataset.java:1967)
at org.apache.hadoop.hdfs.server.datanode.DataNode.processCommand(DataNode.java:1181)
at org.apache.hadoop.hdfs.server.datanode.DataNode.processCommand(DataNode.java:1143)
at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:980)
at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1527)
at java.lang.Thread.run(Thread.java:724)
DataNode exception 4:
2014-08-24 16:45:35,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_2324951138767077684_113876340 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2324951138767077684_113876340 is valid, and cannot be written to.
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2324951138767077684_113876340 is valid, and cannot be written to.
2014-08-24 16:45:42,861 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_2305069720503912789_113876452 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2305069720503912789_113876452 is valid, and cannot be written to.
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2305069720503912789_113876452 is valid, and cannot be written to.
2014-08-24 16:45:43,713 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-318311590422520941_113876153 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_-318311590422520941_113876153 is valid, and cannot be written to.
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.118:50010 remote=/10.130.136.116:34363]
(Note: after setting dfs.datanode.socket.write.timeout=1800000, the same exception is still thrown, only now as "1800000 millis timeout while waiting for channel to be ready for write"; see the config sketch after this log excerpt.)
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.130.136.118:50010 remote=/10.130.136.118:55147]
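The 480000 millis write timeout in exceptions 1 and 4 is the DataNode data-transfer write timeout, dfs.datanode.socket.write.timeout (8 minutes by default). As the note above records, raising it to 1800000 only changed the number printed in the exception, which suggests the receiving side was not going to drain the data no matter how long the sender waited. For reference, a sketch of the change that was tried, in hdfs-site.xml:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- write timeout (ms) for DataNode data-transfer sockets;
       default 480000 (8 min), raised to 30 min as an experiment -->
  <value>1800000</value>
</property>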
③. NameNode side
The NameNode log is now full of entries like the following (the volume of INFO-and-above logging has grown to more than 400 GB per day, where it used to be very small):
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-707612696772368160 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_8944996150588918994_62583982 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_8944996150588918994 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_962585261283706817_105572114 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_962585261283706817 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-1886285939257877420_33867512 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-1886285939257877420 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-405662021725661377_23563134 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-405662021725661377 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-6831374360596453862_49890202 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-6831374360596453862 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-1458260851950313618_92180801 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-1458260851950313618 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_2754038012732967699_52183933 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_2754038012732967699 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-1651824977329564981_102396163 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-1651824977329564981 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_-8075220412997159517_101639855 on 10.130.136.116:50010 size 496 does not belong to any file.
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-8075220412997159517 to 10.130.136.116:50010
2014-08-25 11:30:01,418 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.processReport: block blk_2245696672665686485_98393215 on 10.130.136.116:50010 size 496 does not belong to any file.