• mysql架构~ Orchestrator检测+切换


    零 检测

       1 Orchestrator在集群的主库和所有从库各启动三个线程,每隔InstancePollSeconds重建连接,监控主库和所有从库复制是否正常;

         orch会每隔InstancePollSeconds(默认5s)时间去被监控的实例上拉取实例状态,并将这些状态信息存入orch的元数据库的orchestrator.database_instance表中,然后orch会每隔InstancePollSeconds秒从元数据库中获取每个instance的状态,展示在web界面上。

       2  只有当本身无法连接主库和集群其他成员都无法连接主库两者条件都成立时才进行切换、

           其他从库成员无法连接主库的判断依据是io_thread状态,当两次尝试连接主库依然无法连接时会被判断为发生故障

       3 调用用OnFailureDetectionProcesses钩子函数

       4 配置的MySQLConnectionLifetimeSeconds不小于InstancePollSeconds,避免间隔时间内数据库没有线程监控;

    一 切换场景
      1 有延时进行强制切换-大数据
      2 有延时进行等待应用relay-log才进行切换-重要场景
      3 有延时不进行切换立刻退出,人工修复数据
        切换判断
       0 FailMasterPromotionOnLagMinutes:默认值0(0代表禁用)。如果要升级的从机延迟过大,则使故障转移失败。例如:从机故障了5个小时,然后主机故障。此时可能需要停止故障转移,需要恢复中间丢失的5个小时的同步数据丢失。要使用此配置,必须设置ReplicationLagQuery 和使用心跳检测机制,如pt-heartbeat 。复制中断时 ,MySQL 内置的show slave status 的 Seconds_behind_master ( 8.0之前的版本)  并不会报告复制延迟 推荐设置为0
       1 FailMasterPromotionIfSQLThreadNotUpToDate:如果所有的从机都有复制延迟。即使最接近的从机也有没有应用的中继日志。此时故障切换执行reset slave all 会清除从机上的所有中继日志。默认false 两个参数不能兼得
       2 DelayMasterPromotionIfSQLThreadNotUpToDate:如果所有的从机都有复制延迟。即使最接近的从机也有没有应用的中继日志。设置为 true 时,orchestrator 会等待SQL线程赶上,然后再进行故障转移,将次从机提升为主 默认false 两个参数不能兼得 推荐设置为true
    切换后续处理
         ApplyMySQLPromotionAfterMasterFailover:默认值true。当此参数为 true ,  将在提升的新主节点上执行reset slave all (清除所有同步信息) 和 set global read_only = 0 (关闭只读模式)。当参数为 true 时, 会覆盖MasterFailoverDetachSlaveMasterHost。
    DetachLostReplicasAfterMasterFailover:某些从机可能在恢复过程中发生故障。当此参数为 true 时,orchestrator 将通过deatch-replica 命令强行中断其复制
        MasterFailoverLostInstancesDowntimeMinutes:主故障转移后丢失的任何服务器停机的分钟数(包括失败的主和丢失的从) 推荐设置10
    二 内部语句更新
       1 周期获取元数据插入
    INSERT INTO database_instance (hostname, port, last_checked, last_attempted_check, last_check_partial_success, uptime, server_id, server_uuid, version, major_version, version_comment, binlog_server, read_only, binlog_format, binlog_row_image, log_bin, log_slave_updates, binary_log_file, binary_log_pos, master_host, master_port, slave_sql_running, slave_io_running, replication_sql_thread_state, replication_io_thread_state, has_replication_filters, supports_oracle_gtid, oracle_gtid, master_uuid, ancestry_uuid, executed_gtid_set, gtid_mode, gtid_purged, gtid_errant, mariadb_gtid, pseudo_gtid, master_log_file, read_master_log_pos, relay_master_log_file, exec_master_log_pos, relay_log_file, relay_log_pos, last_sql_error, last_io_error, seconds_behind_master, slave_lag_seconds, sql_delay, num_slave_hosts, slave_hosts, cluster_name, suggested_cluster_alias, data_center, region, physical_environment, replication_depth, is_co_master, replication_credentials_available, has_replication_credentials, allow_tls, semi_sync_enforced, semi_sync_master_enabled, semi_sync_replica_enabled, instance_alias, last_discovery_latency, last_seen)
        当获取元数据失败无法插入语句时 
      2  更新 update database_instance set last_checked = NOW(), last_check_partial_success = 0 where hostname = '10.10.30.170' and port = 3306 更新时间戳和失败状态
                 update database_instance set last_attempted_check = NOW() where hostname = '10.10.30.170' and port = 3306 更新失败时间戳+1S  一种保护机制
      3  核心表 database_instance
     三 判断是否切换查询语句
    // instance每隔InstancePollSeconds时间通过如下方式判断某个instance是否正常
    1 select ifnull(last_checked <= last_seen, 0) as is_last_check_valid from database_instance where hostname = '10.10.30.170' and port = 3306 order by hostname, port;
    首先,orch会每秒通过类似于如下方式判断master instance是否正常
    2 SELECT MIN(master_instance.last_checked <= master_instance.last_seen and master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second) = 1 AS is_last_check_valid from database_instance master_instance GROUP BY master_instance.hostname, master_instance.port;
    3 第二步 从库进行检测
    SELECT master_instance.hostname, master_instance.port, master_instance.read_only AS read_only, MIN(master_instance.data_center) AS data_center, MIN(master_instance.region) AS region, MIN(master_instance.physical_environment) AS physical_environment, MIN(master_instance.master_host) AS master_host, MIN(master_instance.master_port) AS master_port, MIN(master_instance.cluster_name) AS cluster_name, MIN(IFNULL(cluster_alias.alias, master_instance.cluster_name)) AS cluster_alias, MIN( master_instance.last_checked <= master_instance.last_seen and master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second ) = 1 AS is_last_check_valid, MIN(master_instance.last_check_partial_success) as last_check_partial_success, MIN(master_instance.master_host IN ('' , '_') OR master_instance.master_port = 0 OR substr(master_instance.master_host, 1, 2) = '//') AS is_master, MIN(master_instance.is_co_master) AS is_co_master, MIN(CONCAT(master_instance.hostname, ':', master_instance.port) = master_instance.cluster_name) AS is_cluster_master, MIN(master_instance.gtid_mode) AS gtid_mode, COUNT(replica_instance.server_id) AS count_replicas, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen), 0) AS count_valid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running != 0 AND replica_instance.slave_sql_running != 0), 0) AS count_valid_replicating_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running = 0 AND replica_instance.last_io_error like '%error %connecting to master%' AND replica_instance.slave_sql_running = 1), 0) AS count_replicas_failing_to_connect_to_master, MIN(master_instance.replication_depth) AS replication_depth, GROUP_CONCAT(concat(replica_instance.Hostname, ':', replica_instance.Port)) as slave_hosts, MIN( master_instance.slave_sql_running = 1 AND master_instance.slave_io_running = 0 AND master_instance.last_io_error like '%error %connecting to master%' ) AS is_failing_to_connect_to_master, MIN( master_downtime.downtime_active is not null and ifnull(master_downtime.end_timestamp, now()) > now() ) AS is_downtimed, MIN( IFNULL(master_downtime.end_timestamp, '') ) AS downtime_end_timestamp, MIN( IFNULL(unix_timestamp() - unix_timestamp(master_downtime.end_timestamp), 0) ) AS downtime_remaining_seconds, MIN( master_instance.binlog_server ) AS is_binlog_server, MIN( master_instance.pseudo_gtid ) AS is_pseudo_gtid, MIN( master_instance.supports_oracle_gtid ) AS supports_oracle_gtid, SUM( replica_instance.oracle_gtid ) AS count_oracle_gtid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.oracle_gtid != 0), 0) AS count_valid_oracle_gtid_slaves, SUM( replica_instance.binlog_server ) AS count_binlog_server_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.binlog_server != 0), 0) AS count_valid_binlog_server_slaves, MIN( master_instance.mariadb_gtid ) AS is_mariadb_gtid, SUM( replica_instance.mariadb_gtid ) AS count_mariadb_gtid_slaves, IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.mariadb_gtid != 0), 0) AS count_valid_mariadb_gtid_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates), 0) AS count_logging_replicas, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'STATEMENT'), 0) AS count_statement_based_loggin_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'MIXED'), 0) AS count_mixed_based_loggin_slaves, IFNULL(SUM(replica_instance.log_bin AND replica_instance.log_slave_updates AND replica_instance.binlog_format = 'ROW'), 0) AS count_row_based_loggin_slaves, IFNULL(SUM(replica_instance.sql_delay > 0), 0) AS count_delayed_replicas, IFNULL(SUM(replica_instance.slave_lag_seconds > 10), 0) AS count_lagging_replicas, IFNULL(MIN(replica_instance.gtid_mode), '') AS min_replica_gtid_mode, IFNULL(MAX(replica_instance.gtid_mode), '') AS max_replica_gtid_mode, IFNULL(MAX( case when replica_downtime.downtime_active is not null and ifnull(replica_downtime.end_timestamp, now()) > now() then '' else replica_instance.gtid_errant end ), '') AS max_replica_gtid_errant, IFNULL(SUM( replica_downtime.downtime_active is not null and ifnull(replica_downtime.end_timestamp, now()) > now()), 0) AS count_downtimed_replicas, COUNT(DISTINCT case when replica_instance.log_bin AND replica_instance.log_slave_updates then replica_instance.major_version else NULL end ) AS count_distinct_logging_major_versions FROM database_instance master_instance LEFT JOIN hostname_resolve ON (master_instance.hostname = hostname_resolve.hostname) LEFT JOIN database_instance replica_instance ON (COALESCE(hostname_resolve.resolved_hostname, master_instance.hostname) = replica_instance.master_host AND master_instance.port = replica_instance.master_port) LEFT JOIN database_instance_maintenance ON (master_instance.hostname = database_instance_maintenance.hostname AND master_instance.port = database_instance_maintenance.port AND database_instance_maintenance.maintenance_active = 1) LEFT JOIN database_instance_downtime as master_downtime ON (master_instance.hostname = master_downtime.hostname AND master_instance.port = master_downtime.port AND master_downtime.downtime_active = 1) LEFT JOIN database_instance_downtime as replica_downtime ON (replica_instance.hostname = replica_downtime.hostname AND replica_instance.port = replica_downtime.port AND replica_downtime.downtime_active = 1) LEFT JOIN cluster_alias ON (cluster_alias.cluster_name = master_instance.cluster_name) WHERE database_instance_maintenance.database_instance_maintenance_id IS NULL AND '' IN ('', master_instance.cluster_name) GROUP BY master_instance.hostname, master_instance.port HAVING (MIN( master_instance.last_checked <= master_instance.last_seen and master_instance.last_attempted_check <= master_instance.last_seen + interval 6 second ) = 1 ) = 0 OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running = 0 AND replica_instance.last_io_error like '%error %connecting to master%' AND replica_instance.slave_sql_running = 1), 0) > 0) OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen), 0) < COUNT(replica_instance.server_id) ) OR (IFNULL(SUM(replica_instance.last_checked <= replica_instance.last_seen AND replica_instance.slave_io_running != 0 AND replica_instance.slave_sql_running != 0), 0) < COUNT(replica_instance.server_id) ) OR (MIN( master_instance.slave_sql_running = 1 AND master_instance.slave_io_running = 0 AND master_instance.last_io_error like '%error %connecting to master%' ) ) OR (COUNT(replica_instance.server_id) > 0) ORDER BY is_master DESC , is_cluster_master DESC, count_replicas DESCG

     

     

  • 相关阅读:
    sql
    vs 2010创建Windows服务定时timer程序
    C#(.net)实现用apache activemq传递SQLite的数据
    ASP.net与SQLite数据库通过js和ashx交互(连接和操作)
    Object
    Thread
    多线程知识
    HTTPS详解
    TCP协议
    [BJDCTF 2nd]假猪套天下第一 && [BJDCTF2020]Easy MD5
  • 原文地址:https://www.cnblogs.com/danhuangpai/p/13973412.html
Copyright © 2020-2023  润新知