http://incubator.apache.org/mesos/research.html, Mesos
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
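Why do we need Mesos?
There are more and more compute frameworks, each with its own use cases, strengths, and weaknesses: Hadoop, MPI, Pregel, Spark, and so on.
So you often need to run several frameworks to cover different needs. The problem is that putting each framework on its own cluster is very inconvenient.
First, that many clusters waste a lot of resources. Second, the big data being processed has to be copied back and forth between clusters, which is painful.
Mesos offers a solution: it lets different frameworks share a single cluster.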
Existing solutions for sharing a cluster:
1. Statically partition the cluster and run one framework per partition, i.e. split the cluster into partitions that do not interfere with each other
2. Allocate a set of VMs to each framework, i.e. use virtualization
Unfortunately, these solutions achieve neither high utilization nor efficient data sharing.
The main problem is the mismatch between the allocation granularities of these solutions and of existing frameworks.
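Neither approach does well on utilization or data sharing, because both share resources at too coarse a granularity, which does not match how existing compute frameworks allocate resources.
Hadoop, for example, allocates resources down to the slot level, and one instance (a single node) can contain multiple slots.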
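To make the utilization argument concrete, here is a minimal sketch that compares static partitioning with fine-grained sharing. The cluster size and the per-framework demands are made-up numbers for illustration, and the two functions are hypothetical helpers, not anything from Mesos.

```python
def static_partition_used(capacity, demands):
    """CPUs in use when the cluster is split evenly, one partition per framework."""
    share = capacity // len(demands)
    # A framework can never use more than its own partition, even if others idle.
    return sum(min(d, share) for d in demands.values())


def fine_grained_used(capacity, demands):
    """CPUs in use when any free CPU can go to any framework that needs it."""
    return min(capacity, sum(demands.values()))


demands = {"hadoop": 70, "mpi": 10}  # hypothetical CPU demand per framework
capacity = 100                       # hypothetical cluster size in CPUs

print(static_partition_used(capacity, demands))  # 60
print(fine_grained_used(capacity, demands))      # 80
```

With static partitions, Hadoop is capped at its 50-CPU partition while 40 CPUs in the MPI partition sit idle. With fine-grained sharing, those idle CPUs can serve Hadoop's extra demand.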
In this paper, we propose Mesos, a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources.
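The key feature of Mesos is fine-grained resource sharing between different compute frameworks, which it achieves by providing a common interface for accessing cluster resources.
Mesos Architecture Design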
Design Philosophy
Because cluster frameworks are both highly diverse and rapidly evolving,
our overriding design philosophy has been to define a minimal interface that enables efficient resource sharing across frameworks,
and otherwise push control of task scheduling and execution to the frameworks.
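The design philosophy in one sentence: keep it simple.
First, to cope with the huge differences between frameworks, Mesos provides only a minimal, simple set of interfaces for sharing resources.
Second, each framework is responsible for scheduling and executing its own tasks.
Architecture Overview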
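The split of responsibilities can be sketched as two small abstract classes. This is only an illustration: the class and method names (Scheduler.resource_offer, Executor.launch_task, EchoScheduler, EchoExecutor) and the dictionary layout of an offer are assumptions made for this sketch, not the actual Mesos API.

```python
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Framework side: decides which offered resources to use."""

    @abstractmethod
    def resource_offer(self, offer):
        """Return the list of tasks to launch on the offer (empty to decline)."""


class Executor(ABC):
    """Runs on a slave node and executes the framework's tasks."""

    @abstractmethod
    def launch_task(self, task):
        """Run one task and return its result."""


class EchoScheduler(Scheduler):
    """Toy scheduler: one pending task per offered CPU."""

    def __init__(self, pending):
        self.pending = list(pending)

    def resource_offer(self, offer):
        n = min(offer["cpus"], len(self.pending))
        accepted, self.pending = self.pending[:n], self.pending[n:]
        return accepted


class EchoExecutor(Executor):
    """Toy executor: 'running' a task just produces a string."""

    def launch_task(self, task):
        return f"ran {task}"


scheduler = EchoScheduler(["t1", "t2", "t3"])
executor = EchoExecutor()
tasks = scheduler.resource_offer({"slave": "s1", "cpus": 2})
results = [executor.launch_task(t) for t in tasks]
print(results)  # ['ran t1', 'ran t2']
```

The point of the sketch is that the sharing layer only needs to know about offers and task launches. How a framework picks tasks, and how its executor runs them, stays entirely inside the framework.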
Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves.
The master implements fine-grained sharing across frameworks using resource offers.
Each resource offer is a list of free resources on multiple slaves.
The master decides how many resources to offer to each framework according to an organizational policy, such as fair sharing or priority.
Each framework running on Mesos consists of two components:
a scheduler that registers with the master to be offered resources
an executor process that is launched on slave nodes to run the framework’s tasks
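Mesos is built around a master. The master manages the slaves, and the policy for dividing resources among frameworks, such as fair sharing or priority, is configured on it.
A master-based design has to deal with the single point of failure. Here ZooKeeper is used to manage the masters and ensure failover.
A 'resource offer' is simply a list of available resources.
Each Mesos slave reports its free resources to the Mesos master, and the master sends resource offers describing them to the framework schedulers.
Each scheduler decides, based on its own needs, whether to assign tasks to the offered slaves. Task execution is handled by the framework's executor, and Mesos does not need to know how it works.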
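As a rough illustration of what "fair sharing or priority" could mean when the master picks the next framework to offer resources to, here is a small sketch. The function names, the data layout, and the assumption that every framework accepts every CPU it is offered are all mine. The real allocation module is more involved.

```python
def pick_fair(allocated):
    """Fair sharing: offer to the framework currently holding the fewest CPUs."""
    return min(allocated, key=lambda name: (allocated[name], name))


def pick_priority(priorities):
    """Priority: offer to the framework with the highest priority value."""
    return max(priorities, key=lambda name: priorities[name])


def offer_round(free_cpus, allocated):
    """Hand out free CPUs one at a time under fair sharing.

    Assumes, for simplicity, that every framework accepts what it is offered.
    """
    allocated = dict(allocated)
    for _ in range(free_cpus):
        allocated[pick_fair(allocated)] += 1
    return allocated


allocated = {"hadoop": 6, "mpi": 2, "spark": 4}  # hypothetical CPUs held now
print(pick_fair(allocated))                       # mpi
print(pick_priority({"hadoop": 1, "mpi": 2, "spark": 5}))  # spark
print(offer_round(4, allocated))  # {'hadoop': 6, 'mpi': 5, 'spark': 5}
```

Under fair sharing the framework that is furthest behind keeps getting offers until the others are no longer ahead of it. Under priority the same framework wins every time until it stops accepting.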
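A concrete example:
1. Slave s1 reports its free resources to the master: 4 CPUs, 4 GB RAM
2. The master sends a resource offer for them to the registered schedulers
3. The scheduler of framework1 checks its list of pending tasks, finds that task1 and task2 can run on s1, and sends a reply telling the master
4. The master notifies framework1's executor on s1, which runs the tasks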
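The four steps can be walked through in a few lines of code. This is a toy simulation, not Mesos code: the Task class, the schedule function, the per-task resource sizes, and the extra task3 are assumptions added so that there is something to accept and something to reject.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


def schedule(offer, pending):
    """Framework scheduler: pick pending tasks that fit in the offer, in order."""
    cpus, mem = offer["cpus"], offer["mem_gb"]
    chosen = []
    for task in pending:
        if task.cpus <= cpus and task.mem_gb <= mem:
            chosen.append(task)
            cpus -= task.cpus
            mem -= task.mem_gb
    return chosen


# Step 1: slave s1 reports its free resources to the master.
free = {"s1": {"cpus": 4, "mem_gb": 4}}

# Step 2: the master turns them into an offer for the registered scheduler.
offer = {"slave": "s1", **free["s1"]}

# Step 3: the scheduler replies with the tasks it wants to run on s1.
pending = [Task("task1", 2, 1), Task("task2", 1, 2), Task("task3", 2, 2)]
accepted = schedule(offer, pending)

# Step 4: the master forwards the tasks to s1 and deducts their resources.
for task in accepted:
    free["s1"]["cpus"] -= task.cpus
    free["s1"]["mem_gb"] -= task.mem_gb

print([t.name for t in accepted])  # ['task1', 'task2']
print(free["s1"])                  # {'cpus': 1, 'mem_gb': 1}
```

After the two tasks are placed, s1 still has 1 CPU and 1 GB free, which the master could include in a later offer to another framework.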
Isolation
Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms.
We currently isolate resources using OS container technologies, specifically Linux Containers and Solaris Projects.
These technologies can limit the CPU, memory, network bandwidth, and (in new Linux kernels) I/O usage of a process tree.
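To give a feel for the kind of limits a container mechanism enforces, the sketch below maps an executor's CPU and memory share to the values one would write into Linux cgroup v2 control files (cpu.max and memory.max). This is an assumption-laden illustration only: cgroup v2 is newer than the paper, the function name is hypothetical, nothing is actually written to the filesystem, and this is not how Mesos itself sets limits.

```python
def cgroup_v2_limits(cpus, mem_gb, period_us=100_000):
    """Map an executor's share to cgroup v2 control-file values.

    cpu.max takes "<quota> <period>" in microseconds; memory.max takes bytes.
    """
    quota_us = int(cpus * period_us)
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "memory.max": str(int(mem_gb * 1024 ** 3)),
    }


# Hypothetical executor that was given 2 CPUs and 1 GB of RAM.
print(cgroup_v2_limits(2, 1))
# {'cpu.max': '200000 100000', 'memory.max': '1073741824'}
```

A quota of 200000 microseconds per 100000 microsecond period lets the process tree use up to two CPUs' worth of time, and the memory value is simply 1 GB expressed in bytes.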