OpenStack nova-scheduler Source Code Analysis: Filters/Weighting

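Preface

This article records how nova-scheduler works as the scheduler when OpenStack creates Instances, and how that is implemented in code.
In OpenStack, multiple Instances share the same Host rather than each occupying one exclusively. A scheduler is therefore needed as the set of rules that coordinates and manages resource allocation among Instances.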

Scheduler

Scheduler: the mechanism that decides which Host an Instance runs on.
Nova currently implements the following schedulers:

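ChanceScheduler (random scheduler): randomly picks a Host Node from all Hosts whose nova-compute service is running normally, and creates the Instance there.
FilterScheduler (filter scheduler): picks the best Host Node for the Instance according to the configured filter conditions and weights.
CachingScheduler (caching scheduler): a variant of FilterScheduler. On top of FilterScheduler it caches Host resource information in local memory and refreshes it from the database through a periodic background task.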

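To make extension easier, Nova extracts the interfaces every scheduler must implement into nova.scheduler.driver.Scheduler. Any class that inherits from it and implements those interfaces can serve as a custom scheduler.

Note: different schedulers cannot coexist. The scheduler to use must be specified through an option in /etc/nova/nova.conf; the default is FilterScheduler.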

vim /etc/nova/nova.conf

scheduler_driver = nova.scheduler.filter_scheduler.FilterScheduler
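
The value of scheduler_driver is a dotted path to a class. As a rough sketch of how such a path can be turned into an object, the snippet below uses the standard library's importlib. The helper name import_object_sketch is hypothetical; it only stands in for the import helper that Nova itself calls, and a standard-library class is loaded so the example runs without Nova installed.

```python
import importlib


def import_object_sketch(dotted_path, *args, **kwargs):
    """Import 'pkg.module.ClassName' and return an instance of the class.

    Hypothetical helper for illustration; not Nova's own code.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(*args, **kwargs)


# A standard-library class is used here so the sketch runs without Nova.
counter = import_object_sketch("collections.Counter", "aab")
print(counter.most_common(1))
```

Because the class is looked up by name at runtime, switching schedulers only requires changing the configured string.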

FilterScheduler workflow


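FilterScheduler first uses the configured Filters to keep only the Hosts that meet the conditions, e.g. memory usage below 2%. It then computes a weight (Weighting) for each Host in the resulting list, sorts the list by weight, and takes the best Host.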
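
The filter-then-weigh flow can be sketched in a few lines. This is a simplified illustration, not Nova's code: hosts are plain dicts, and the field names free_ram_mb and num_instances as well as the threshold are assumptions chosen for the example.

```python
def schedule(hosts, filters, weighers):
    """Return hosts that pass every filter, best-weighted first."""
    candidates = [h for h in hosts if all(f(h) for f in filters)]
    return sorted(
        candidates,
        key=lambda h: sum(w(h) for w in weighers),
        reverse=True,
    )


hosts = [
    {"name": "node1", "free_ram_mb": 2048, "num_instances": 4},
    {"name": "node2", "free_ram_mb": 8192, "num_instances": 1},
    {"name": "node3", "free_ram_mb": 512, "num_instances": 0},
]
filters = [lambda h: h["free_ram_mb"] >= 1024]   # drop hosts short on RAM
weighers = [lambda h: h["free_ram_mb"]]          # prefer more free RAM

ranked = schedule(hosts, filters, weighers)
print([h["name"] for h in ranked])
```

Filtering is a yes/no decision per host, while weighing only orders the hosts that survived, which is why the two steps are kept separate.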

Filters

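Filtering first removes the Hosts that cannot satisfy the Instance's requirements, based on each Host's currently available resources, and then applies the Filters named in the configuration file to remove the Hosts that fail their conditions. The result of filtering is a list of Hosts.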

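To do this, nova-scheduler needs the latest resource usage of every Host from the database. Collecting and storing this data is handled by the database synchronization mechanism defined in nova-compute. However, nova-compute updates the database only periodically, while nova-scheduler needs current data when it chooses the best Host. nova-scheduler therefore uses nova.scheduler.host_manager:HostState to maintain its own copy. This copy lives only in the memory of the current process and includes the changes to Host resources since the last database update, so it represents the latest Host resource data. To keep the copy current, every time an Instance is created nova-scheduler updates it and subtracts the resources used by the virtual machine from the Host's available resources.
Note: the data maintained by nova-scheduler is never synchronized back to the database. Data only flows from the database to nova-scheduler, so nova-scheduler has no database write path.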
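
The in-memory bookkeeping described above can be illustrated with a small stand-in class. HostStateSketch and its consume() method are hypothetical names for this example; the attribute names follow the HostState fields quoted later in the article, and the subtraction logic is an assumption about what "consuming" resources means, not a copy of Nova's implementation.

```python
class HostStateSketch:
    """In-memory view of one host's free resources (illustrative only)."""

    def __init__(self, host, free_ram_mb, free_disk_mb, vcpus_total):
        self.host = host
        self.free_ram_mb = free_ram_mb
        self.free_disk_mb = free_disk_mb
        self.vcpus_total = vcpus_total
        self.vcpus_used = 0
        self.num_instances = 0

    def consume(self, memory_mb, disk_gb, vcpus):
        """Subtract a newly placed instance's resources from the free pool."""
        self.free_ram_mb -= memory_mb
        self.free_disk_mb -= disk_gb * 1024
        self.vcpus_used += vcpus
        self.num_instances += 1


state = HostStateSketch("node1", free_ram_mb=8192, free_disk_mb=102400,
                        vcpus_total=8)
state.consume(memory_mb=2048, disk_gb=20, vcpus=2)
print(state.free_ram_mb, state.free_disk_mb, state.vcpus_used)
```

Nothing here touches a database: the object is adjusted locally so that the next scheduling decision already sees the reduced capacity.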

Filter types

AllHostsFilter: performs no filtering
RamFilter: filters by available memory
ComputeFilter: selects all Hosts that are in the Active state
TrustedFilter: selects all trusted Hosts
PciPassthroughFilter: selects Hosts that provide PCI SR-IOV support

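All Filter implementations live in the nova/scheduler/filters directory, and every Filter inherits from nova.scheduler.filters.BaseHostFilter. To write a custom Filter, inherit from this class and implement one function, host_passes(), which returns only True or False.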
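
A minimal sketch of what a custom Filter looks like follows. To keep it runnable without Nova, BaseHostFilterSketch is a stand-in for the real base class, MinFreeRamFilter is a hypothetical filter, hosts are plain dicts, and the memory_mb key in filter_properties is an assumed name.

```python
class BaseHostFilterSketch:
    """Stand-in for the base class; only the parts discussed in the text."""

    def host_passes(self, host_state, filter_properties):
        raise NotImplementedError

    def filter_all(self, host_states, filter_properties):
        # Yield only the hosts for which host_passes() returns True.
        for host_state in host_states:
            if self.host_passes(host_state, filter_properties):
                yield host_state


class MinFreeRamFilter(BaseHostFilterSketch):
    """Hypothetical filter: keep hosts with enough free RAM for the request."""

    def host_passes(self, host_state, filter_properties):
        requested = filter_properties["memory_mb"]  # assumed key name
        return host_state["free_ram_mb"] >= requested


hosts = [{"host": "node1", "free_ram_mb": 1024},
         {"host": "node2", "free_ram_mb": 4096}]
passed = list(MinFreeRamFilter().filter_all(hosts, {"memory_mb": 2048}))
print([h["host"] for h in passed])
```

The subclass only answers "does this one host pass?"; iterating over all hosts is left to the base class.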

Specify the Filters in the configuration file:

scheduler_available_filters=
scheduler_default_filters=

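Weighting

Weighting computes a weight for every Host that passed the Filters and sorts the Hosts by that weight to find the best one. Computing a Host's weight means calling each of the configured Weigher modules to obtain a weight value for every Host.

All Weigher implementations live in the nova/scheduler/weights directory.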
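
As a rough sketch of the shape of a Weigher, the example below uses the method names weigh_objects() and weight_multiplier() that appear in the handler code quoted later in this article. FreeRamWeigherSketch and WeighedHost are hypothetical stand-ins, hosts are plain dicts, and normalization of the raw weights is left out to keep the example short.

```python
class WeighedHost:
    """Pairs a host object with its accumulated weight."""

    def __init__(self, obj, weight):
        self.obj = obj
        self.weight = weight


class FreeRamWeigherSketch:
    """Hypothetical weigher: more free RAM gives a higher raw weight."""

    def weight_multiplier(self):
        # Assumed value; a negative multiplier would invert the preference.
        return 1.0

    def weigh_objects(self, weighed_hosts, weight_properties):
        return [float(wh.obj["free_ram_mb"]) for wh in weighed_hosts]


hosts = [{"host": "node1", "free_ram_mb": 1024},
         {"host": "node2", "free_ram_mb": 4096}]
weighed = [WeighedHost(h, 0.0) for h in hosts]
weigher = FreeRamWeigherSketch()
for wh, raw in zip(weighed, weigher.weigh_objects(weighed, {})):
    wh.weight += weigher.weight_multiplier() * raw
best = max(weighed, key=lambda wh: wh.weight)
print(best.obj["host"])
```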

Source code implementation

Key files and their roles

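/nova/scheduler/driver.py: the most important item in this file is the Scheduler class, the base class that every scheduler implementation inherits from. It contains all the interfaces a scheduler must implement.
/nova/scheduler/manager.py: mainly implements the SchedulerManager class, which defines the Host management operations, e.g. delete_instance_info, which deletes an Instance from a Host.
/nova/scheduler/host_manager.py: implements two classes that describe scheduler-related Host operations. HostState maintains an up-to-date copy of Host resource data. HostManager provides the scheduler-related operations, e.g. _choose_host_filters/get_filtered_hosts/get_weighed_hosts.
/nova/scheduler/chance.py: contains only the ChanceScheduler class (random scheduler), which inherits from Scheduler and implements a scheduler that picks a Host Node at random.
/nova/scheduler/client: the entry point for client callers.
/nova/scheduler/filter_scheduler.py: contains only the FilterScheduler class (filter scheduler), which inherits from Scheduler and implements a scheduler that picks a Host Node according to the configured filter conditions.
/nova/scheduler/filters and /nova/scheduler/weights: these two directories hold the Filter and Weigher implementations respectively.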

Stage 1: nova-scheduler receives the build_instances RPC call


nova-conductor ==> RPC scheduler_client.select_destinations() ==> nova-scheduler

#nova.conductor.manager.ComputeTaskManager:build_instances()

    def build_instances(self, context, instances, image, filter_properties,
            admin_password, injected_files, requested_networks,
            security_groups, block_device_mapping=None, legacy_bdm=True):
        # TODO(ndipanov): Remove block_device_mapping and legacy_bdm in version
        #                 2.0 of the RPC API.

        # Build the request spec describing the Instances to create
        request_spec = scheduler_utils.build_request_spec(context, image,
                                                          instances)

        # TODO(danms): Remove this in version 2.0 of the RPC API
        if (requested_networks and
                not isinstance(requested_networks,
                               objects.NetworkRequestList)):
            # Convert the requested network information into a NetworkRequestList
            requested_networks = objects.NetworkRequestList(
                objects=[objects.NetworkRequest.from_tuple(t)
                         for t in requested_networks])
        # TODO(melwitt): Remove this in version 2.0 of the RPC API

        # Get the flavor information
        flavor = filter_properties.get('instance_type')
        if flavor and not isinstance(flavor, objects.Flavor):
            # Code downstream may expect extra_specs to be populated since it
            # is receiving an object, so lookup the flavor to ensure this.
            flavor = objects.Flavor.get_by_id(context, flavor['id'])
            filter_properties = dict(filter_properties, instance_type=flavor)

        try:
            scheduler_utils.setup_instance_group(context, request_spec,
                                                 filter_properties)
            # check retry policy. Rather ugly use of instances[0]...
            # but if we've exceeded max retries... then we really only
            # have a single instance.
            scheduler_utils.populate_retry(filter_properties,
                instances[0].uuid)

            # Get the list of Hosts
            hosts = self.scheduler_client.select_destinations(context,
                    request_spec, filter_properties)

        except Exception as exc:
            updates = {'vm_state': vm_states.ERROR, 'task_state': None}
            for instance in instances:
                self._set_vm_state_and_notify(
                    context, instance.uuid, 'build_instances', updates,
                    exc, request_spec)
            return

        for (instance, host) in itertools.izip(instances, hosts):
            try:
                instance.refresh()
            except (exception.InstanceNotFound,
                    exception.InstanceInfoCacheNotFound):
                LOG.debug('Instance deleted during build', instance=instance)
                continue
            local_filter_props = copy.deepcopy(filter_properties)
            scheduler_utils.populate_filter_properties(local_filter_props,
                host)
            # The block_device_mapping passed from the api doesn't contain
            # instance specific information
            bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
                    context, instance.uuid)


            self.compute_rpcapi.build_and_run_instance(context,
                    instance=instance, host=host['host'], image=image,
                    request_spec=request_spec,
                    filter_properties=local_filter_props,
                    admin_password=admin_password,
                    injected_files=injected_files,
                    requested_networks=requested_networks,
                    security_groups=security_groups,
                    block_device_mapping=bdms, node=host['nodename'],
                    limits=host['limits'])

While calling nova-scheduler to obtain the Hosts that can run the Instance, nova-conductor also prepares related information such as requested_networks and flavor.

The code block that obtains the Hosts list:

            # Get the list of Hosts
            hosts = self.scheduler_client.select_destinations(context,
                    request_spec, filter_properties)

The chain of function calls that leads to the Hosts list is listed below:

# nova.scheduler.client.query.SchedulerQueryClient:select_destinations()

from nova.scheduler import rpcapi as scheduler_rpcapi

class SchedulerQueryClient(object):
    """Client class for querying to the scheduler."""

    def __init__(self):
        self.scheduler_rpcapi = scheduler_rpcapi.SchedulerAPI()

    def select_destinations(self, context, request_spec, filter_properties):
        """Returns destinations(s) best suited for this request_spec and
        filter_properties.

        The result should be a list of dicts with 'host', 'nodename' and
        'limits' as keys.
        """
        # Forward the request to nova-scheduler through the RPC API
        return self.scheduler_rpcapi.select_destinations(
            context, request_spec, filter_properties)


# nova.scheduler.rpcapi.SchedulerAPI:select_destinations()

    def select_destinations(self, ctxt, request_spec, filter_properties):
        cctxt = self.client.prepare(version='4.0')
        return cctxt.call(ctxt, 'select_destinations',
            request_spec=request_spec, filter_properties=filter_properties)

Stage 2: from scheduler.rpcapi.SchedulerAPI to scheduler.manager.SchedulerManager

The interface functions in rpcapi.py have their actual implementations in manager.py.
So the call continues in nova.scheduler.manager.SchedulerManager:select_destinations()

# nova.scheduler.manager.SchedulerManager:select_destinations()
class SchedulerManager(manager.Manager):
    """Chooses a host to run instances on."""

    target = messaging.Target(version='4.2')

    def __init__(self, scheduler_driver=None, *args, **kwargs):
        if not scheduler_driver:
            scheduler_driver = CONF.scheduler_driver
        # The driver is an instance of the class named by the configuration option, e.g. nova.scheduler.filter_scheduler.FilterScheduler
        self.driver = importutils.import_object(scheduler_driver)
        super(SchedulerManager, self).__init__(service_name='scheduler',
                                               *args, **kwargs)


    def select_destinations(self, context, request_spec, filter_properties):
        """Returns destinations(s) best suited for this request_spec and
        filter_properties.

        The result should be a list of dicts with 'host', 'nodename' and
        'limits' as keys.
        """
        dests = self.driver.select_destinations(context, request_spec,
            filter_properties)
        return jsonutils.to_primitive(dests)

Stage 3: from scheduler.manager.SchedulerManager to the FilterScheduler scheduler

vim /etc/nova/nova.conf

scheduler_driver = nova.scheduler.filter_scheduler.FilterScheduler

From the value of the scheduler_driver configuration option we know that nova.scheduler.manager.SchedulerManager:driver
is an instance of nova.scheduler.filter_scheduler.FilterScheduler.
So the call continues in nova.scheduler.filter_scheduler.FilterScheduler:select_destinations().

# nova.scheduler.filter_scheduler.FilterScheduler:select_destinations()

class FilterScheduler(driver.Scheduler):
    """Scheduler that can be used for filtering and weighing."""
    def __init__(self, *args, **kwargs):
        super(FilterScheduler, self).__init__(*args, **kwargs)
        self.options = scheduler_options.SchedulerOptions()
        self.notifier = rpc.get_notifier('scheduler')

    def select_destinations(self, context, request_spec, filter_properties):
        """Selects a filtered set of hosts and nodes."""
        self.notifier.info(context, 'scheduler.select_destinations.start',
                           dict(request_spec=request_spec))

        # Number of Instances to create
        num_instances = request_spec['num_instances']

        # Get the list of hosts selected by filtering and weighing (see the scheduler principles above)
        # nova.scheduler.filter_scheduler.FilterScheduler:_schedule() ==> return selected_hosts
        selected_hosts = self._schedule(context, request_spec,
                                        filter_properties)

        # Couldn't fulfill the request_spec
        # When more Instances are requested than there are suitable hosts, no Instance is created
        # and NoValidHost is raised with 'There are not enough hosts available.'
        if len(selected_hosts) < num_instances:
            # NOTE(Rui Chen): If multiple creates failed, set the updated time
            # of selected HostState to None so that these HostStates are
            # refreshed according to database in next schedule, and release
            # the resource consumed by instance in the process of selecting
            # host.
            for host in selected_hosts:
                host.obj.updated = None

            # Log the details but don't put those into the reason since
            # we don't want to give away too much information about our
            # actual environment.
            LOG.debug('There are %(hosts)d hosts available but '
                      '%(num_instances)d instances requested to build.',
                      {'hosts': len(selected_hosts),
                       'num_instances': num_instances})

            reason = _('There are not enough hosts available.')
            raise exception.NoValidHost(reason=reason)

        dests = [dict(host=host.obj.host, nodename=host.obj.nodename,
                      limits=host.obj.limits) for host in selected_hosts]

        self.notifier.info(context, 'scheduler.select_destinations.end',
                           dict(request_spec=request_spec))
        return dests


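    def _schedule(self, context, request_spec, filter_properties):
        # Excerpt: elevated, instance_properties and update_group_hosts are
        # set earlier in the original function; that part is omitted here.
        # Get the state of all Hosts
        hosts = self._get_all_host_states(elevated)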

        selected_hosts = []

        # Get the number of Instances to create
        num_instances = request_spec.get('num_instances', 1)

        # Loop num_instances times, choosing a suitable host for each Instance
        for num in range(num_instances):
            # Filter local hosts based on requirements ...

            # Inside the loop, the two key operations of _schedule are get_filtered_hosts() and get_weighed_hosts()
            hosts = self.host_manager.get_filtered_hosts(hosts,
                    filter_properties, index=num)
            if not hosts:
                # Can't get any more locally.
                break

            LOG.debug("Filtered %(hosts)s", {'hosts': hosts})

            weighed_hosts = self.host_manager.get_weighed_hosts(hosts,
                    filter_properties)

            LOG.debug("Weighed %(hosts)s", {'hosts': weighed_hosts})

            scheduler_host_subset_size = CONF.scheduler_host_subset_size

            # The two ifs below clamp the subset size to [1, len(weighed_hosts)]
            # so the slice passed to random.choice is never out of range
            if scheduler_host_subset_size > len(weighed_hosts):
                scheduler_host_subset_size = len(weighed_hosts)
            if scheduler_host_subset_size < 1:
                scheduler_host_subset_size = 1

            # Pick randomly among the top weighed hosts
            chosen_host = random.choice(
                weighed_hosts[0:scheduler_host_subset_size])
            LOG.debug("Selected host: %(host)s", {'host': chosen_host})
            selected_hosts.append(chosen_host)

            # Now consume the resources so the filter/weights
            # will change for the next instance.
            chosen_host.obj.consume_from_instance(instance_properties)
            if update_group_hosts is True:
                if isinstance(filter_properties['group_hosts'], list):
                    filter_properties['group_hosts'] = set(
                        filter_properties['group_hosts'])
                filter_properties['group_hosts'].add(chosen_host.obj.host)
        # After a host has been chosen for every instance, return the list of selected hosts
        return selected_hosts

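The function above relies on three key operations:

_get_all_host_states: gets the state of all Hosts and returns the Hosts that pass a preliminary check.
get_filtered_hosts: applies the Filters to the hosts returned by the first function, filtering them again.
get_weighed_hosts: ranks the filtered Hosts by weight so that the best Host can be chosen.

These three functions are covered in more detail below.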
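
The last step of each loop iteration in _schedule(), choosing randomly among the top few weighed hosts, can be sketched on its own. pick_host and its rng parameter are hypothetical names for this example; the clamping mirrors the two if statements in the excerpt above, and an empty host list is not handled because the original loop breaks before reaching this point.

```python
import random


def pick_host(weighed_hosts, subset_size, rng=random):
    """Pick randomly among the top `subset_size` hosts of a sorted list."""
    # Clamp to [1, len(weighed_hosts)] so the slice is never empty.
    subset_size = min(subset_size, len(weighed_hosts))
    subset_size = max(subset_size, 1)
    return rng.choice(weighed_hosts[0:subset_size])


ranked = ["node2", "node1", "node3"]      # already sorted best-first
print(pick_host(ranked, subset_size=1))   # size 1 always yields the top host
chosen = pick_host(ranked, subset_size=10, rng=random.Random(0))
print(chosen in ranked)
```

A subset size of 1 always picks the single best host, while a larger value spreads concurrent requests across several good hosts at the cost of not always picking the top one.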

First, in host_manager.get_filtered_hosts(), host_manager is a member variable of nova.scheduler.driver.Scheduler, as shown below:

# nova.scheduler.driver.Scheduler:__init__()

# nova.scheduler.filter_scheduler.FilterScheduler inherits from nova.scheduler.driver.Scheduler
class Scheduler(object):
    """The base class that all Scheduler classes should inherit from."""

    def __init__(self):
        # host_manager is imported dynamically according to the configuration file
        self.host_manager = importutils.import_object(
                CONF.scheduler_host_manager)
        self.servicegroup_api = servicegroup.API()

Also note that _get_all_host_states(), which scheduler.filter_scheduler.FilterScheduler:_schedule() uses to obtain Host state, is backed by the following HostManager method:

# nova.scheduler.host_manager.HostManager:get_all_host_states()

    def get_all_host_states(self, context):

        service_refs = {service.host: service
                        for service in objects.ServiceList.get_by_binary(
                            context, 'nova-compute')}

        # Get the Compute Node resources
        compute_nodes = objects.ComputeNodeList.get_all(context)
        # nova.objects.__init__()
        #     ==> nova.objects.compute_node.ComputeNodeList:get_all
        seen_nodes = set()
        for compute in compute_nodes:
            service = service_refs.get(compute.host)

            if not service:
                LOG.warning(_LW(
                    "No compute service record found for host %(host)s"),
                    {'host': compute.host})
                continue
            host = compute.host
            node = compute.hypervisor_hostname
            state_key = (host, node)
            host_state = self.host_state_map.get(state_key)

            # Update the host information
            if host_state:
                host_state.update_from_compute_node(compute)
            else:
                host_state = self.host_state_cls(host, node, compute=compute)
                self.host_state_map[state_key] = host_state
            # We force to update the aggregates info each time a new request
            # comes in, because some changes on the aggregates could have been
            # happening after setting this field for the first time
            host_state.aggregates = [self.aggs_by_id[agg_id] for agg_id in
                                     self.host_aggregates_map[
                                         host_state.host]]
            host_state.update_service(dict(service))
            self._add_instance_info(context, compute, host_state)
            seen_nodes.add(state_key)

        # remove compute nodes from host_state_map if they are not active
        dead_nodes = set(self.host_state_map.keys()) - seen_nodes
        for state_key in dead_nodes:
            host, node = state_key
            LOG.info(_LI("Removing dead compute node %(host)s:%(node)s "
                         "from scheduler"), {'host': host, 'node': node})
            del self.host_state_map[state_key]

        return six.itervalues(self.host_state_map)
# get_all_host_states mainly refreshes the cached host states and removes nodes that are no longer active
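
The refresh-and-prune pattern in get_all_host_states() can be reduced to a small standalone sketch. refresh_host_states is a hypothetical function, compute nodes are plain dicts instead of ComputeNode objects, and only free_ram_mb is tracked; these are simplifications for illustration.

```python
def refresh_host_states(host_state_map, compute_nodes, service_hosts):
    """Update cached states from DB rows and drop nodes no longer reported."""
    seen = set()
    for node in compute_nodes:
        if node["host"] not in service_hosts:
            continue  # no nova-compute service record: skip this node
        key = (node["host"], node["hypervisor_hostname"])
        state = host_state_map.setdefault(key, {})
        state.update(free_ram_mb=node["free_ram_mb"])
        seen.add(key)
    # Anything cached but not seen in this pass is treated as dead.
    for key in set(host_state_map) - seen:
        del host_state_map[key]
    return host_state_map


cache = {("old", "old"): {"free_ram_mb": 100}}
rows = [{"host": "node1", "hypervisor_hostname": "node1", "free_ram_mb": 4096},
        {"host": "ghost", "hypervisor_hostname": "ghost", "free_ram_mb": 1}]
refresh_host_states(cache, rows, service_hosts={"node1"})
print(sorted(cache))
```

Keying the map by the (host, node) pair matches the state_key used in the excerpt, so one host exposing several hypervisor nodes gets one cached entry per node.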

Next, look at objects.ComputeNodeList.get_all(context), the function that fetches Compute Node resource information.

# nova.objects.compute_node.ComputeNodeList:get_all()

    @base.remotable_classmethod
    def get_all(cls, context):
        # This calls nova.db.api.compute_node_get_all()
        db_computes = db.compute_node_get_all(context)


        return base.obj_make_list(context, cls(context), objects.ComputeNode,
                                  db_computes)



#nova.db.api:compute_node_get_all()

def compute_node_get_all(context):
    """Get all computeNodes.

    :param context: The security context

    :returns: List of dictionaries each containing compute node properties
    """
    return IMPL.compute_node_get_all(context)

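This shows that in the Liberty release, nova-scheduler can still access the database.

The question is: how does nova-scheduler update host information, and can it write to the database directly?
The answer is no. nova-scheduler cannot write to the database, but it can read Host resource data from the database and cache it in process memory, as shown below: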

# nova.scheduler.host_manager.HostState:__init__()
class HostState(object):
    """Mutable and immutable information tracked for a host.
    This is an attempt to remove the ad-hoc data structures
    previously used and lock down access.
    """

    def __init__(self, host, node, compute=None):
        self.host = host
        self.nodename = node

        # Mutable available resources.
        # These will change as resources are virtually "consumed".
        self.total_usable_ram_mb = 0
        self.total_usable_disk_gb = 0
        self.disk_mb_used = 0
        self.free_ram_mb = 0
        self.free_disk_mb = 0
        self.vcpus_total = 0
        self.vcpus_used = 0
        self.pci_stats = None
        self.numa_topology = None

        # Additional host information from the compute node stats:
        self.num_instances = 0
        self.num_io_ops = 0

        # Other information
        self.host_ip = None
        self.hypervisor_type = None
        self.hypervisor_version = None
        self.hypervisor_hostname = None
        self.cpu_info = None
        self.supported_instances = None

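nova-scheduler has no functions that write to the database, but it does cache database data in process memory. This lets nova-scheduler work with the latest Host resource information while reducing the I/O load on the database.

Stage 4: from the FilterScheduler scheduler to the Filters

The code above calls the filtering function get_filtered_hosts(), implemented as follows: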

# nova.scheduler.host_manager.HostManager:get_filtered_hosts()
    def get_filtered_hosts(self, hosts, filter_properties,
            filter_class_names=None, index=0):
        """Filter hosts and return only ones passing all filters."""
        # Several local helper functions are defined here; they are omitted for now
        def _strip_ignore_hosts(host_map, hosts_to_ignore):
            ignored_hosts = []
            for host in hosts_to_ignore:

        # ...
        # Return the validated, usable filter classes
        filter_classes = self._choose_host_filters(filter_class_names)
        # ...
            # Call get_filtered_objects
            return self.filter_handler.get_filtered_objects(filters,
                        hosts, filter_properties, index)



# Continue into get_filtered_objects()
    def get_filtered_objects(self, filters, objs, filter_properties, index=0):
        list_objs = list(objs)
        LOG.debug("Starting with %d host(s)", len(list_objs))
        part_filter_results = []
        full_filter_results = []
        log_msg = "%(cls_name)s: (start: %(start)s, end: %(end)s)"
        for filter_ in filters:
            if filter_.run_filter_for_index(index):
                cls_name = filter_.__class__.__name__
                start_count = len(list_objs)
                # The key statement
                objs = filter_.filter_all(list_objs, filter_properties)
                if objs is None:
                    LOG.debug("Filter %s says to stop filtering", cls_name)
                    return
                list_objs = list(objs)
                end_count = len(list_objs)
                part_filter_results.append(log_msg % {"cls_name": cls_name,
                        "start": start_count, "end": end_count})
                if list_objs:
                    remaining = [(getattr(obj, "host", obj),
                                  getattr(obj, "nodename", ""))
                                 for obj in list_objs]
                    full_filter_results.append((cls_name, remaining))

        return list_objs



# The objs value above comes from filter_.filter_all(list_objs, filter_properties)
def filter_all(self, filter_obj_list, filter_properties):
        for obj in filter_obj_list:
            if self._filter_one(obj, filter_properties):
                # The object passes the filter, so yield it
                yield obj



# Which in turn calls _filter_one()
def _filter_one(self, obj, filter_properties):

        # Return True if the object passes this Filter, otherwise False

        return self.host_passes(obj, filter_properties)

After this chain of calls, the filtering work of the Filters is complete.
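
One detail worth seeing in isolation is that filter_all() is a generator, so the handler materializes its result with list() before passing it to the next filter and counting what is left. The sketch below shows that pattern with hypothetical names (run_filters, plain predicate functions, dict hosts); the filter labels are only illustrative strings.

```python
def filter_all(predicate, hosts):
    """Lazily yield the hosts accepted by one filter."""
    for host in hosts:
        if predicate(host):
            yield host


def run_filters(filters, hosts):
    """Apply filters in order and record (name, start, end) counts."""
    remaining = list(hosts)
    report = []
    for name, predicate in filters:
        start = len(remaining)
        # list() forces the generator so the count below is meaningful.
        remaining = list(filter_all(predicate, remaining))
        report.append((name, start, len(remaining)))
    return remaining, report


hosts = [{"host": "node1", "ram": 1024, "enabled": True},
         {"host": "node2", "ram": 4096, "enabled": True},
         {"host": "node3", "ram": 8192, "enabled": False}]
filters = [("EnabledFilter", lambda h: h["enabled"]),
           ("RamFilter", lambda h: h["ram"] >= 2048)]
remaining, report = run_filters(filters, hosts)
print([h["host"] for h in remaining])
print(report)
```

The per-filter start and end counts correspond to the log_msg entries collected in get_filtered_objects(), which help show which filter eliminated the hosts.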

Stage 5: from the Filters to weight calculation and sorting

# nova.scheduler.host_manager.HostManager:get_weighed_hosts()
    def get_weighed_hosts(self, hosts, weight_properties):
        """Weigh the hosts."""
        return self.weight_handler.get_weighed_objects(self.weighers,
                hosts, weight_properties)


# nova.weights.BaseWeightHandler:get_weighed_objects()
class BaseWeightHandler(loadables.BaseLoader):
    object_class = WeighedObject

    def get_weighed_objects(self, weighers, obj_list, weighing_properties):
        """Return a sorted (descending), normalized list of WeighedObjects."""
        weighed_objs = [self.object_class(obj, 0.0) for obj in obj_list]

        if len(weighed_objs) <= 1:
            return weighed_objs

        for weigher in weighers:
            weights = weigher.weigh_objects(weighed_objs, weighing_properties)

            # Normalize the weights
            weights = normalize(weights,
                                minval=weigher.minval,
                                maxval=weigher.maxval)

            for i, weight in enumerate(weights):
                obj = weighed_objs[i]
                obj.weight += weigher.weight_multiplier() * weight

        # Sort by weight, highest first
        return sorted(weighed_objs, key=lambda x: x.weight, reverse=True)
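
To make the arithmetic in get_weighed_objects() concrete, here is a small worked example. The body of normalize() is not shown in the excerpt, so the version below assumes plain min-max scaling into [0, 1]; the two raw weight lists and the multipliers 1.0 and -1.0 are made-up values for illustration.

```python
def normalize(weights, minval=None, maxval=None):
    """Min-max scale weights into [0, 1] (assumed behaviour, see lead-in)."""
    weights = list(weights)
    if not weights:
        return []
    lo = float(min(weights) if minval is None else minval)
    hi = float(max(weights) if maxval is None else maxval)
    if lo == hi:
        return [0.0] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]


free_ram = [1024, 4096, 2560]   # raw weights from a RAM-style weigher
io_ops = [6, 0, 3]              # raw weights from an I/O-load-style weigher

totals = [0.0] * 3
for raw, multiplier in ((free_ram, 1.0), (io_ops, -1.0)):
    for i, w in enumerate(normalize(raw)):
        totals[i] += multiplier * w

# Host indexes sorted by total weight, best first.
order = sorted(range(3), key=lambda i: totals[i], reverse=True)
print(totals, order)
```

Normalizing before applying the multiplier puts weighers with very different raw scales (megabytes versus operation counts) on the same footing, so the multiplier alone decides how much each weigher matters.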

Original article: https://www.cnblogs.com/jmilkfan-fanguiju/p/11825100.html