• Notes on a serious Ceph cluster failure (repost)


    Problem: cluster status after one disk failed; the PG states look wrong
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_WARN
                64 pgs degraded
                64 pgs stuck degraded
                64 pgs stuck unclean
                64 pgs stuck undersized
                64 pgs undersized
                recovery 269/819 objects degraded (32.845%)
         monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
                election epoch 6, quorum 0 ceph-1
         osdmap e38: 3 osds: 2 up, 2 in; 64 remapped pgs
                flags sortbitwise,require_jewel_osds
          pgmap v14328: 72 pgs, 2 pools, 420 bytes data, 275 objects
                217 MB used, 40720 MB / 40937 MB avail
                269/819 objects degraded (32.845%)
                      64 active+undersized+degraded
                       8 active+clean

    [root@ceph-1 ~]# ceph osd tree
    ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 0.05846 root default                                      
    -2 0.01949     host ceph-1                                   
     0 0.01949         osd.0        up  1.00000          1.00000
    -3 0.01949     host ceph-2                                   
     1 0.01949         osd.1        up  1.00000          1.00000
    -4 0.01949     host ceph-3                                   
     2 0.01949         osd.2      down        0          1.00000

    Mark osd.2 as out
    root@ceph-1:~# ceph osd out osd.2
    osd.2 is already out.

    Remove it from the cluster
    root@ceph-1:~# ceph osd rm osd.2
    removed osd.2

    Remove it from the CRUSH map
    root@ceph-1:~# ceph osd crush rm osd.2
    removed item id 2 name 'osd.2' from crush map

    Delete the auth entry for osd.2
    root@ceph02:~# ceph auth del osd.2
    updated
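
The four commands above can be wrapped in one helper. This is a minimal sketch, not the author's procedure: the function name `remove_osd` and the `RUN` variable are made up for illustration, and the ordering (stop the daemon first, `ceph osd rm` last) follows the manual-removal order usually given in the Ceph documentation rather than the order used in this post. By default it only prints the commands.

```shell
# Hypothetical helper: print (or run) the removal sequence for one OSD id.
# RUN defaults to "echo", so nothing is executed unless RUN is set to "".
remove_osd() {
    id="$1"
    run="${RUN-echo}"
    $run ceph osd out "osd.${id}"
    # Must be run on the host that owns the OSD; assumes systemd units.
    $run systemctl stop "ceph-osd@${id}"
    $run ceph osd crush remove "osd.${id}"
    $run ceph auth del "osd.${id}"
    $run ceph osd rm "osd.${id}"
}
```

Calling `remove_osd 2` prints the five commands for review; running them for real would mean setting `RUN=` and executing the `systemctl` step on the OSD's own host.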

    umount fails
    [root@ceph-3 ~]# umount /dev/vdb1
    umount: /var/lib/ceph/osd/ceph-2: target is busy.
            (In some cases useful info about processes that use
             the device is found by lsof(8) or fuser(1))

    Kill the process owned by the ceph user that is holding the mount
    [root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
                         USER        PID ACCESS COMMAND
    /var/lib/ceph/osd/ceph-2:
                         root     kernel mount /var/lib/ceph/osd/ceph-2
                         ceph       1517 F.... ceph-osd
    [root@ceph-3 ~]# kill -9 1517
    [root@ceph-3 ~]# fuser -mv /var/lib/ceph/osd/ceph-2
                         USER        PID ACCESS COMMAND
    /var/lib/ceph/osd/ceph-2:
                         root     kernel mount /var/lib/ceph/osd/ceph-2
    [root@ceph-3 ~]# umount /var/lib/ceph/osd/ceph-2
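
`kill -9` frees the mount, but since the activate log further down shows `--mark-init systemd`, stopping the OSD's systemd unit is probably the gentler route. A small sketch, assuming the default cluster name `ceph` in the mount path; the function name is made up for illustration:

```shell
# Derive the OSD id from a mount point such as /var/lib/ceph/osd/ceph-2.
# Assumes the default cluster name "ceph" in the directory name.
osd_id_from_mount() {
    name="${1%/}"          # drop a trailing slash, if any
    name="${name##*/}"     # keep the last path component, e.g. ceph-2
    case "$name" in
        ceph-*[!0-9]*|ceph-) return 1 ;;   # not of the form ceph-<number>
        ceph-*) echo "${name#ceph-}" ;;
        *) return 1 ;;
    esac
}
```

With that, `systemctl stop "ceph-osd@$(osd_id_from_mount /var/lib/ceph/osd/ceph-2)"` followed by `umount /var/lib/ceph/osd/ceph-2` would replace the `fuser` and `kill -9` steps.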

    Prepare the disk again
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd prepare ceph-3:/dev/vdb1

    Activate the OSD disks or partitions on all nodes
    [root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:/dev/vdb1 ceph-2:/dev/vdb1 ceph-3:/dev/vdb1

    It fails...
    [ceph-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
    [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1

    In frustration, I powered the node off and started it again
    [root@ceph-3 ~]# init 0
    Connection to 192.168.101.13 closed by remote host.
    Connection to 192.168.101.13 closed.

    After the restart the OSD is back, but the PG problem still seems unresolved
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_WARN
                64 pgs degraded
                64 pgs stuck degraded
                64 pgs stuck unclean
                64 pgs stuck undersized
                64 pgs undersized
                recovery 269/819 objects degraded (32.845%)
         monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
                election epoch 6, quorum 0 ceph-1
         osdmap e53: 3 osds: 3 up, 3 in
                flags sortbitwise,require_jewel_osds
          pgmap v14368: 72 pgs, 2 pools, 420 bytes data, 275 objects
                5446 MB used, 55960 MB / 61406 MB avail
                269/819 objects degraded (32.845%)
                      64 active+undersized+degraded
                       8 active+clean
    [root@ceph-1 ~]# ceph osd tree
    ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 0.03897 root default                                      
    -2 0.01949     host ceph-1                                   
     0 0.01949         osd.0        up  1.00000          1.00000
    -3 0.01949     host ceph-2                                   
     1 0.01949         osd.1        up  1.00000          1.00000
    -4       0     host ceph-3                                   
     2       0 osd.2                up  1.00000          1.00000
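
In the tree above osd.2 is up, but it and host ceph-3 both carry a CRUSH weight of 0. CRUSH places no data on a zero-weight item, so this is one plausible reason the 64 PGs stay undersized even with three OSDs up. A sketch for spotting such OSDs in `ceph osd tree` output (the function name is made up; the column layout is assumed to match the Jewel output shown here):

```shell
# Read `ceph osd tree` output on stdin and print OSDs whose CRUSH weight is 0.
# Assumes columns: ID WEIGHT NAME ... as in the Jewel-era listing above.
zero_weight_osds() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^osd\.[0-9]+$/ && $2 + 0 == 0 { print $3 }'
}
```

Usage would be `ceph osd tree | zero_weight_osds`. Depending on whether the OSD is still present in the CRUSH map, the follow-up is likely either `ceph osd crush reweight` or `ceph osd crush add` with the host bucket; which one applies here was not verified.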

    Added one disk each to ceph-1 and ceph-2, then created OSDs on them
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-1:/dev/vdd ceph-2:/dev/vdd

    Checking the cluster status shows that the PG count seems too low
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_WARN
                14 pgs degraded
                14 pgs stuck degraded
                64 pgs stuck unclean
                14 pgs stuck undersized
                14 pgs undersized
                recovery 188/819 objects degraded (22.955%)
                recovery 200/819 objects misplaced (24.420%)
                too few PGs per OSD (28 < min 30)
         monmap e1: 1 mons at {ceph-1=192.168.101.11:6789/0}
                election epoch 6, quorum 0 ceph-1
         osdmap e63: 5 osds: 5 up, 5 in; 50 remapped pgs
                flags sortbitwise,require_jewel_osds
          pgmap v14408: 72 pgs, 2 pools, 420 bytes data, 275 objects
                5663 MB used, 104 GB / 109 GB avail
                188/819 objects degraded (22.955%)
                200/819 objects misplaced (24.420%)
                      26 active+remapped
                      24 active
                      14 active+undersized+degraded
                       8 active+clean
    Increase pg_num and pgp_num
    [root@ceph-1 ~]# ceph osd pool set rbd pg_num 128
    [root@ceph-1 ~]# ceph osd pool set rbd pgp_num 128
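
The warning `too few PGs per OSD (28 < min 30)` can be reproduced with simple integer arithmetic. The sketch below assumes each of the 72 PGs is currently mapped to 2 OSDs; that mapping count is an assumption that happens to match the reported figure, not something the output states directly.

```shell
# Rough PG-per-OSD estimate:
# (number of PGs x OSDs each PG maps to) / number of OSDs, rounded down.
pgs_per_osd() {
    echo $(( $1 * $2 / $3 ))
}
```

`pgs_per_osd 72 2 5` gives 28, below the minimum of 30. Raising the rbd pool from 64 to 128 PGs brings the total to 136, and `pgs_per_osd 136 2 5` gives 54 under the same assumption.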

    And the status turns into an error...
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_ERR
                118 pgs are stuck inactive for more than 300 seconds
                118 pgs peering
                118 pgs stuck inactive
                128 pgs stuck unclean
                recovery 16/657 objects misplaced (2.435%)
         monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 8, quorum 0,1 ceph-1,ceph-3
         osdmap e74: 5 osds: 5 up, 5 in; 55 remapped pgs
                flags sortbitwise,require_jewel_osds
          pgmap v14459: 136 pgs, 2 pools, 356 bytes data, 221 objects
                5665 MB used, 104 GB / 109 GB avail
                16/657 objects misplaced (2.435%)
                      73 peering
                      45 remapped+peering
                      10 active+remapped
                       8 active+clean
    [root@ceph-1 ~]# less /etc/ceph/ceph.co

    So I rebooted the three OSD machines, and after the reboot another OSD was down
    [root@ceph-1 ~]# ceph -s
    2018-07-25 15:18:17.207665 7fb4ec2ee700  0 -- :/1038496581 >> 192.168.101.12:6789/0 pipe(0x7fb4e8063fa0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fb4e805c610).fault
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_WARN
                16 pgs degraded
                59 pgs stuck unclean
                16 pgs undersized
                recovery 134/819 objects degraded (16.361%)
                recovery 88/819 objects misplaced (10.745%)
                1/5 in osds are down
         monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 12, quorum 0,1 ceph-1,ceph-3
         osdmap e95: 5 osds: 4 up, 5 in; 43 remapped pgs
                flags sortbitwise,require_jewel_osds
          pgmap v14529: 136 pgs, 2 pools, 420 bytes data, 275 objects
                5668 MB used, 104 GB / 109 GB avail
                134/819 objects degraded (16.361%)
                88/819 objects misplaced (10.745%)
                      77 active+clean
                      39 active+remapped
                      16 active+undersized+degraded
                       4 active

    [root@ceph-1 ~]# ceph osd tree
    2018-07-25 15:22:25.573039 7fe5ff87c700  0 -- :/3787750993 >> 192.168.101.12:6789/0 pipe(0x7fe604063fd0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fe60405c640).fault
    ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 0.10725 root default                                      
    -2 0.04388     host ceph-1                                   
     0 0.01949         osd.0        up  1.00000          1.00000
     3 0.02440         osd.3        up  1.00000          1.00000
    -3 0.04388     host ceph-2                                   
     1 0.01949         osd.1      down        0          1.00000
     4 0.02440         osd.4        up  1.00000          1.00000
    -4 0.01949     host ceph-3                                   
     2 0.01949         osd.2        up  1.00000          1.00000

    After running out, rm, crush rm and auth del on the bad disk, the cluster is healthy
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_OK
         monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 12, quorum 0,1 ceph-1,ceph-3
         osdmap e102: 4 osds: 4 up, 4 in
                flags sortbitwise,require_jewel_osds
          pgmap v14597: 136 pgs, 2 pools, 356 bytes data, 270 objects
                5559 MB used, 86551 MB / 92110 MB avail
                     136 active+clean

    Replaced the bad disk and added the new disk back into the Ceph cluster (capacity expansion works the same way)
    [root@ceph-deploy my-cluster]# ceph-deploy disk list ceph-2
    [root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-2:vdb
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf osd create ceph-2:vdb:/dev/vdc1

    Right now it shows an error
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_ERR
                13 pgs are stuck inactive for more than 300 seconds
                50 pgs degraded
                2 pgs peering
                1 pgs recovering
                17 pgs recovery_wait
                13 pgs stuck inactive
                23 pgs stuck unclean
                recovery 67/798 objects degraded (8.396%)
         monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 12, quorum 0,1 ceph-1,ceph-3
         osdmap e110: 5 osds: 5 up, 5 in
                flags sortbitwise,require_jewel_osds
          pgmap v14633: 136 pgs, 2 pools, 356 bytes data, 268 objects
                5669 MB used, 104 GB / 109 GB avail
                67/798 objects degraded (8.396%)
                      79 active+clean
                      32 activating+degraded
                      17 active+recovery_wait+degraded
                       5 activating
                       2 peering
                       1 active+recovering+degraded
      client io 0 B/s wr, 0 op/s rd, 5 op/s wr

    A little later everything is back to normal
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_OK
         monmap e2: 2 mons at {ceph-1=192.168.101.11:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 12, quorum 0,1 ceph-1,ceph-3
         osdmap e110: 5 osds: 5 up, 5 in
                flags sortbitwise,require_jewel_osds
          pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
                5669 MB used, 104 GB / 109 GB avail
                     136 active+clean
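
The HEALTH_ERR right after adding the disk was only recovery in progress. Instead of re-running `ceph -s` by hand, one could wait until every PG reports `active+clean`. A sketch that parses the Jewel-style status text shown in this post (the function name is made up, and newer releases format the output differently):

```shell
# Read `ceph -s` output on stdin; succeed only if the active+clean count
# equals the total PG count from the pgmap line.
pgs_all_clean() {
    awk '
        /pgmap/ { for (i = 1; i <= NF; i++) if ($i == "pgs,") total = $(i - 1) }
        $2 == "active+clean" { clean = $1 }
        END { exit !(total > 0 && total == clean) }
    '
}
```

A polling loop could then look like `until ceph -s | pgs_all_clean; do sleep 10; done`.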


    Problem: adding a mon fails
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create ceph-2
    [ceph-2][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
    [ceph-2][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors

    [root@ceph-2 ~]# less /var/log/ceph/ceph-mon.ceph-2.log
    2018-07-25 15:52:02.566212 7efeec7d9780 -1 no public_addr or public_network specified, and mon.ceph-2 not present in monmap or ceph.conf

    Cause: public_network is not configured in ceph.conf
    [global]
    fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
    mon_initial_members = ceph-1,ceph-2,ceph-3
    mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    osd pool default size = 2

    Edit ceph.conf
    [root@ceph-deploy my-cluster]# vi ceph.conf
    [global]
    fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
    mon_initial_members = ceph-1,ceph-2,ceph-3
    mon_host = 192.168.101.11,192.168.101.12,192.168.101.13
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    osd pool default size = 2
    public_network = 192.168.122.0/24
    cluster_network = 192.168.101.0/24

    Push the new config file to every node
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3

    Add ceph-2 as a mon
    [root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2

    After it was added, ceph-2's IP in the mon map differs from the others; according to the config file, ceph-1 and ceph-3 should be on the 122 subnet as well
    [root@ceph-1 ~]# ceph -s
        cluster 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
         health HEALTH_OK
         monmap e3: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.122.12:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 14, quorum 0,1,2 ceph-1,ceph-3,ceph-2
         osdmap e110: 5 osds: 5 up, 5 in
                flags sortbitwise,require_jewel_osds
          pgmap v14666: 136 pgs, 2 pools, 356 bytes data, 267 objects
                5669 MB used, 104 GB / 109 GB avail
                     136 active+clean
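
The monmap above mixes two subnets: ceph-1 and ceph-3 were registered with 192.168.101.x addresses, while the newly added ceph-2 picked an address from `public_network`. A quick sanity check before pushing a config is to test each `mon_host` address against `public_network`. A sketch for IPv4 only, with made-up function names:

```shell
# Pack a dotted-quad IPv4 address into a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on "." into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside network $2 (written as a.b.c.d/bits).
ip_in_cidr() {
    bits="${2#*/}"
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}
```

For the values in this post, `ip_in_cidr 192.168.101.11 192.168.122.0/24` fails and `ip_in_cidr 192.168.122.12 192.168.122.0/24` succeeds, which matches the split seen in the monmap.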

    So I changed the mon IPs in ceph.conf to the 122 subnet
    [root@ceph-deploy my-cluster]# vi ceph.conf
    [global]
    fsid = 72f44b06-b8d3-44cc-bb8b-2048f5b4acfe
    mon_initial_members = ceph-1,ceph-2,ceph-3
    mon_host = 192.168.122.11,192.168.122.12,192.168.122.13
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    osd pool default size = 2
    public_network = 192.168.122.0/24
    cluster_network = 192.168.101.0/24

    Push again
    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf config push ceph-1 ceph-2 ceph-3

    Remove two mons
    [root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-1 ceph-3

    And then the whole cluster went bad
    [root@ceph-1 ~]# ceph -s
    2018-07-25 16:35:21.723736 7f47dedfb700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
    2018-07-25 16:35:27.723930 7f47dedfb700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
    2018-07-25 16:35:33.725130 7f47deffd700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby

    [root@ceph-1 ~]# ceph osd tree
    2018-07-25 16:35:21.723736 7f47dedfb700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8001f90).fault with nothing to send, going to standby
    2018-07-25 16:35:27.723930 7f47dedfb700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.11:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c8002410).fault with nothing to send, going to standby
    2018-07-25 16:35:33.725130 7f47deffd700  0 -- 192.168.122.11:0/4277586904 >> 192.168.122.13:6789/0 pipe(0x7f47c8005330 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f47c80046e0).fault with nothing to send, going to standby
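
Two things plausibly combine here. The client now dials the 192.168.122.x addresses from the edited `mon_host`, while the earlier monmap had ceph-1 and ceph-3 listening on 192.168.101.x; and destroying two of three monitors in one go leaves fewer than a majority. The arithmetic behind the second point, as a tiny sketch (the function name is made up):

```shell
# Smallest number of monitors that forms a strict majority of n monitors.
mon_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}
```

With 3 monitors the quorum size is 2, so only one can be lost at a time. Removing monitors one by one and checking `ceph quorum_status` in between would be the more cautious approach.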

    They do not seem to go back in either
    [root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1 ceph-3
    [ceph-1][WARNIN] 2018-07-25 16:37:52.760218 7f06739b9700  0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0668005c20).fault with nothing to send, going to standby
    [ceph-1][WARNIN] 2018-07-25 16:37:55.760830 7f06738b8700  0 -- 192.168.122.11:0/2929495808 >> 192.168.122.13:6789/0 pipe(0x7f066800d5e0 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800e8a0).fault with nothing to send, going to standby
    [ceph-1][WARNIN] 2018-07-25 16:37:58.760748 7f06739b9700  0 -- 192.168.122.11:0/2929495808 >> 192.168.122.11:6789/0 pipe(0x7f0668000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f066800be40).fault with nothing to send, going to standby

    Not afraid of making it worse, I removed the last mon too
    [root@ceph-deploy my-cluster]# ceph-deploy mon destroy ceph-2

    [root@ceph-deploy my-cluster]# ceph-deploy new ceph-1 ceph-2 ceph-3

    [root@ceph-deploy my-cluster]# ceph-deploy --overwrite-conf mon create-initial
    [ceph-1][ERROR ] "ceph auth get-or-create for keytype admin returned 22
    [ceph-1][DEBUG ] Error EINVAL: unknown cap type 'mgr'
    [ceph-1][ERROR ] Failed to return 'admin' key from host ceph-1
    [ceph-2][ERROR ] "ceph auth get-or-create for keytype admin returned 22
    [ceph-2][DEBUG ] Error EINVAL: unknown cap type 'mgr'
    [ceph-2][ERROR ] Failed to return 'admin' key from host ceph-2
    [ceph-3][ERROR ] "ceph auth get-or-create for keytype admin returned 22
    [ceph-3][DEBUG ] Error EINVAL: unknown cap type 'mgr'
    [ceph-3][ERROR ] Failed to return 'admin' key from host ceph-3
    [ceph_deploy.gatherkeys][ERROR ] Failed to connect to host:ceph-1, ceph-2, ceph-3
    [ceph_deploy.gatherkeys][INFO  ] Destroy temp directory /tmp/tmpnPWk4d
    [ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon

    [root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-1
    [ceph-1][INFO  ] monitor: mon.ceph-1 is running

    [root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-2
    [ceph-2][INFO  ] monitor: mon.ceph-2 is running

    [root@ceph-deploy my-cluster]# ceph-deploy mon add ceph-3
    [ceph-3][INFO  ] monitor: mon.ceph-3 is running

    [root@ceph-1 ceph-ceph-1]# ceph -s
    2018-07-25 20:42:07.965513 7f1482a91700  0 librados: client.admin authentication error (1) Operation not permitted
    Error connecting to cluster: PermissionError

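    Normally, running ceph -s amounts to starting a client that connects to the Ceph cluster, and by default this client logs in with the client.admin user and its key. So a plain ceph -s is equivalent to ceph -s --name client.admin --keyring /etc/ceph/ceph.client.admin.keyring. Note that every Ceph command run from the command line starts a client, talks to the cluster, and then closes the client. Here is a very common error that is easy to hit when you are new to Ceph: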

    [root@blog ~]# ceph -s
    2017-08-03 02:22:27.352516 7fbd157b7700  0 librados: client.admin authentication error (1) Operation not permitted
    Error connecting to cluster: PermissionError

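    The error is easy to read: the operation is not permitted, that is, authentication failed. Since we are using the default client.admin user and its key, the key on disk does not match what the Ceph cluster has recorded. In other words, /etc/ceph/ceph.client.admin.keyring was most likely left over from a previous cluster or contains a wrong key. In that case, run ceph auth list as the mon. user to see the correct key: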

    [root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
    Error ENOENT: failed to find client.admin in keyring
    [root@ceph-1 ceph]#


    Take a quick look at the cluster as the mon. user
    [root@ceph-1 ceph]# ceph -s --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
        cluster 053670e9-9b12-4297-aa04-41c430091f90
         health HEALTH_ERR
                64 pgs are stuck inactive for more than 300 seconds
                64 pgs stuck inactive
                64 pgs stuck unclean
                no osds
         monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 8, quorum 0,1,2 ceph-1,ceph-2,ceph-3
         osdmap e1: 0 osds: 0 up, 0 in
                flags sortbitwise,require_jewel_osds
          pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
                0 kB used, 0 kB / 0 kB avail
                      64 creating

    Get the key for client.admin
    [root@ceph-1 ceph]# ceph auth get client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
    Error ENOENT: failed to find client.admin in keyring

    Add the client.admin user
    [root@ceph-1 ceph]# ceph auth add client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring

    Get the key for client.admin again
    [root@ceph-1 ceph]# ceph auth get client.admin  --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
    exported keyring for client.admin
    [client.admin]
        key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==

    Update the local client.admin key
    [root@ceph-1 ceph]# vi ceph.client.admin.keyring
    [client.admin]
    #       key = AQAnPVBbJJWsMhAAKEqaHkWdwEWndOvqDjtjXA==
            key = AQAIf1hbmuPXBxAA5Q3g/Jz8gerf+S6znEHLBQ==
            caps mds = "allow *"
            caps mon = "allow *"
            caps osd = "allow *"
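
Rather than comparing keys by eye, the active key can be pulled out of a keyring file and compared with what the monitors report. A sketch, assuming the INI-like keyring layout shown above; the function name is made up:

```shell
# Read a keyring file on stdin and print the key of the given entity.
# Commented-out "# key = ..." lines are ignored.
keyring_key() {
    awk -v section="[$1]" '
        $1 == section { in_section = 1; next }
        $1 ~ /^\[/    { in_section = 0 }
        in_section && $1 == "key" && $2 == "=" { print $3 }
    '
}
```

For example, the output of `keyring_key client.admin < /etc/ceph/ceph.client.admin.keyring` could be compared against `ceph auth get-key client.admin --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring`.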

    Check the cluster status
    [root@ceph-1 ceph]# ceph -s
    2018-07-25 21:50:40.512039 7f0ca92d0700  0 librados: client.admin authentication error (13) Permission denied

    Grant caps to the client.admin user
    [root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw'
    2018-07-25 21:57:45.263271 7f68398ea700  0 librados: client.admin authentication error (13) Permission denied

    The ceph.client.admin.keyring newly generated by the earlier mon create-initial was missing read permission
    [root@ceph-1 ceph]# chmod +r /etc/ceph/ceph.client.admin.keyring


    [root@ceph-1 ceph]# ceph -s
    2018-07-25 22:06:17.167512 7f449b116700  0 librados: client.admin authentication error (13) Permission denied

    Grant caps to client.admin again
    [root@ceph-1 ceph]# ceph auth add client.admin mon 'allow r' osd 'allow rw' --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring
    Error EINVAL: entity client.admin exists but caps do not match

    After a lot of searching I finally found a method on Google; once the client.admin caps were restored, the cluster showed that all OSDs were gone
    [root@ceph-1 ~]# cd /var/lib/ceph/mon
    [root@ceph-1 mon]# ls
    ceph-ceph-1
    [root@ceph-1 mon]# cd ceph-ceph-1/
    [root@ceph-1 ceph-ceph-1]# ls
    done  keyring  store.db  systemd
    [root@ceph-1 ceph-ceph-1]# ceph -n mon. --keyring keyring  auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
    updated caps for client.admin
    [root@ceph-1 ceph-ceph-1]# ceph -s
        cluster 053670e9-9b12-4297-aa04-41c430091f90
         health HEALTH_ERR
                64 pgs are stuck inactive for more than 300 seconds
                64 pgs stuck inactive
                64 pgs stuck unclean
                no osds
         monmap e1: 3 mons at {ceph-1=192.168.101.11:6789/0,ceph-2=192.168.101.12:6789/0,ceph-3=192.168.101.13:6789/0}
                election epoch 16, quorum 0,1,2 ceph-1,ceph-2,ceph-3
         osdmap e1: 0 osds: 0 up, 0 in
                flags sortbitwise,require_jewel_osds
          pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
                0 kB used, 0 kB / 0 kB avail
                      64 creating

    [root@ceph-1 ceph-ceph-1]# ceph osd tree
    ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1      0 root default          

    lsblk on every node shows that all mount points have been unmounted automatically; I took the chance to adjust the disk sizes and make them all 20G
    [root@ceph-1 ceph-ceph-1]# lsblk
    NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    sr0              11:0    1 1024M  0 rom  
    vda             252:0    0  100G  0 disk
    ├─vda1          252:1    0    1G  0 part /boot
    └─vda2          252:2    0   99G  0 part
      ├─centos-root 253:0    0   50G  0 lvm  /
      ├─centos-swap 253:1    0    2G  0 lvm  [SWAP]
      └─centos-home 253:2    0   47G  0 lvm  /home
    vdb             252:16   0   20G  0 disk
    └─vdb1          252:17   0   20G  0 part
    vdc             252:32   0   20G  0 disk
    └─vdc1          252:33   0    5G  0 part
    vdd             252:48   0   30G  0 disk
    ├─vdd1          252:49   0   25G  0 part
    └─vdd2          252:50   0    5G  0 part

    Reformat the disks
    [root@ceph-deploy my-cluster]# ceph-deploy disk zap ceph-1:vdb ceph-2:vdb ceph-3:vdb
    [root@ceph-deploy my-cluster]# ceph-deploy osd prepare ceph-1:vdb:vdc ceph-2:vdb:vdc ceph-3:vdb:vdc

    Activate the OSD; it looks like the failure is caused by OSD authentication
    [root@ceph-deploy my-cluster]# ceph-deploy osd activate ceph-1:vdb1:vdc
    [ceph-1][WARNIN] ceph_disk.main.Error: Error: ceph osd create failed: Command '/usr/bin/ceph' returned non-zero exit status 1: 2018-07-26 10:34:36.851527 7f678c625700  0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
    [ceph-1][WARNIN] Error connecting to cluster: PermissionError
    [ceph-1][WARNIN]
    [ceph-1][ERROR ] RuntimeError: command returned non-zero exit status: 1
    [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /dev/vdb1
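
One clue in the transcripts: the cluster id printed by `ceph -s` changed from 72f44b06-... to 053670e9-... after `ceph-deploy new` and `mon create-initial`, and the key gathering step of `mon create-initial` failed. That suggests the monitors now belong to a freshly created cluster, so a bootstrap-osd keyring left on the nodes from before would no longer match. This is a guess, not something verified in this post. A sketch for comparing cluster ids between two saved `ceph -s` dumps (function names are made up; Jewel-style output assumed):

```shell
# Extract the cluster id from `ceph -s` output (Jewel-style layout) on stdin.
status_fsid() {
    awk '$1 == "cluster" { print $2; exit }'
}

# Succeed if two saved status dumps (given as file paths) report the same id.
same_cluster() {
    [ "$(status_fsid < "$1")" = "$(status_fsid < "$2")" ]
}
```

Whether `client.bootstrap-osd` exists at all in the new cluster could be checked with `ceph auth get client.bootstrap-osd --name mon. --keyring /var/lib/ceph/mon/ceph-ceph-1/keyring`, the same form used for client.admin earlier.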

    I'll stop here for now and leave this cluster as it is until I understand cephx properly

    To reinstall, see here
    ceph-deploy purgedata {ceph-node} [{ceph-node}]  ## wipe the data
    ceph-deploy forgetkeys                ## delete the previously generated keys
    ceph-deploy purge {ceph-node} [{ceph-node}]     ## uninstall the Ceph packages
    If you execute purge, you must re-install Ceph.


    ceph-deploy new {initial-monitor-node(s)}   
    ceph-deploy install {ceph-node} [{ceph-node} ...]
    ceph-deploy mon create-initial
    ceph-deploy disk list {node-name [node-name]...}
    ceph-deploy disk zap osdserver1:sda
    ceph-deploy osd prepare ceph-osd1:/dev/sda ceph-osd1:/dev/sdb
    ceph-deploy osd activate ceph-osd1:/dev/sda1 ceph-osd1:/dev/sdb1
    ceph-deploy admin {admin-node} {ceph-node}
    chmod +r /etc/ceph/ceph.client.admin.keyring

  • Original article: https://www.cnblogs.com/wangbin/p/11661726.html