Reposted from: https://blog.csdn.net/tgxallen/article/details/78086360
Reading the source is the most direct and effective way to understand a technology. I once built a server program on Linux epoll, but only at the level of knowing how to use it, with little grasp of the underlying principles and implementation details. Recently I have been reading the Linux epoll source, so this post walks through the epoll implementation in detail. If anything is inaccurate or missing, corrections are welcome.
This post covers:
- Key data structures in the epoll implementation
- Source analysis of the key epoll functions
- How ET and LT modes work in epoll
- The epoll thundering herd and the accept thundering herd
- epoll vs. IOCP
1. Key data structures in the epoll implementation
Three data structures are enough to explain roughly how epoll works: a red-black tree, a linked list, and a queue, corresponding to the eventpoll, epitem, and list_head types in the implementation. eventpoll is the structure behind the epfd returned by epoll_create, and it runs through the entire epoll workflow. An epitem represents each event we are interested in; eventpoll manages its epitems in a red-black tree, and when the event an epitem watches fires, the kernel moves that epitem onto the eventpoll ready queue. eventpoll and epitem are analyzed in detail below.
- eventpoll
In the Linux kernel everything is a file, and epoll is no exception: internally it first creates a file on an anonymous inode file system and binds the epoll fd to that file node, which is used only by epoll. This will show up in the source analysis later; it matters here because the eventpoll structure is stored in that file's private_data field, and later code locates the eventpoll through private_data. eventpoll is the central structure of epoll.
```c
struct eventpoll {
	/* Protect the access to this structure */
	spinlock_t lock;

	/*
	 * This mutex is used to ensure that files are not removed
	 * while epoll is using them. This is held during the event
	 * collection loop, the file cleanup path, the epoll file exit
	 * code and the ctl operations.
	 */
	struct mutex mtx;

	/*
	 * Wait queue used by sys_epoll_wait(). When we call
	 * epoll_wait() the task traps into the kernel and sleeps
	 * here; when a monitored event fires, the kernel puts the
	 * corresponding item onto rdllist and wakes this queue.
	 */
	wait_queue_head_t wq;

	/* Wait queue used by file->poll() */
	wait_queue_head_t poll_wait;

	/*
	 * List of ready file descriptors. When we call epoll_wait(),
	 * the kernel checks this list for events that are already
	 * ready.
	 */
	struct list_head rdllist;

	/* RB tree root used to store monitored fd structs */
	struct rb_root_cached rbr;

	/*
	 * Single linked list that collects events that become ready
	 * while the kernel is transferring the already-ready events
	 * to user space.
	 */
	struct epitem *ovflist;

	/* wakeup_source used when ep_scan_ready_list is running */
	struct wakeup_source *ws;

	/* The user that created the eventpoll descriptor */
	struct user_struct *user;

	struct file *file;

	/* used to optimize loop detection check */
	int visited;
	struct list_head visited_list_link;

	/* used to track busy poll napi_id */
	unsigned int napi_id;
};
```
- epitem
As mentioned above, an epitem corresponds to each event registered by the user. When the user calls epoll_ctl(epfd, EPOLL_CTL_ADD, ...) with an epoll_event built in user space, the kernel calls ep_insert, which creates an epitem structure to hold the event.
```c
struct epitem {
	union {
		/*
		 * RB tree node that links this epitem into the rbr
		 * red-black tree of the eventpoll it is attached to.
		 */
		struct rb_node rbn;
		/* Used to free the struct epitem */
		struct rcu_head rcu;
	};

	/* Like rbn, but links this item into the eventpoll ready list (rdllist) */
	struct list_head rdllink;

	/*
	 * Works together "struct eventpoll"->ovflist in keeping the
	 * single linked chain of items.
	 */
	struct epitem *next;

	/* The target file descriptor / file pair this item refers to */
	struct epoll_filefd ffd;

	/* Number of active wait queue attached to poll operations */
	int nwait;

	/* List containing poll wait queues */
	struct list_head pwqlist;

	/* The eventpoll this item is linked to */
	struct eventpoll *ep;

	/* List header used to link this item to the "struct file" items list */
	struct list_head fllink;

	/* wakeup_source used when EPOLLWAKEUP is set */
	struct wakeup_source __rcu *ws;

	/*
	 * Holds the epoll_event built by the user: when the user
	 * registers an event via epoll_ctl(), the kernel creates an
	 * epitem and stores the user's epoll_event here.
	 */
	struct epoll_event event;
};
```
One more point worth mentioning: the kernel creates dedicated slab caches for epitem and eppoll_entry objects, which is probably one of the reasons epoll performs so well.
2. Source analysis of the key epoll functions
The user-space API of epoll is simple: epoll_create, epoll_ctl, and epoll_wait cover essentially everything epoll programming needs. A single call is all it takes to create an epoll descriptor in user space, but the kernel has quite a bit of work to do underneath. This section walks through how these three functions, plus a few related helpers, are implemented in the kernel.
- epoll_create
int epoll_create(int size); In recent Linux versions the size argument no longer matters beyond having to be greater than 0. It originally specified how many events could at most be registered on the descriptor returned by epoll_create; since Linux 2.6.8 it has been ignored. epoll_create itself does nothing else in the kernel and simply forwards to another system call, sys_epoll_create1.
```c
SYSCALL_DEFINE1(epoll_create, int, size)
{
	if (size <= 0)
		return -EINVAL;

	return sys_epoll_create1(0);
}
```
sys_epoll_create1 does all the kernel-side work of epoll_create. As described in the previous section, eventpoll is the central data structure throughout epoll, so unsurprisingly one task of sys_epoll_create1 is to create an eventpoll structure for all subsequent epoll operations to use.
```c
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
	int error, fd;
	struct eventpoll *ep = NULL;
	struct file *file;

	/* Check the EPOLL_* constant for consistency. */
	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

	if (flags & ~EPOLL_CLOEXEC)
		return -EINVAL;
	/*
	 * Create the internal data structure ("struct eventpoll").
	 */
	error = ep_alloc(&ep);
	if (error < 0)
		return error;
	/*
	 * Grab an unused file descriptor from the system.
	 */
	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
	if (fd < 0) {
		error = fd;
		goto out_free_ep;
	}
	/*
	 * Get a file instance named "[eventpoll]" from the anonymous
	 * inode file system.
	 */
	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
				 O_RDWR | (flags & O_CLOEXEC));
	if (IS_ERR(file)) {
		error = PTR_ERR(file);
		goto out_free_fd;
	}
	ep->file = file;
	fd_install(fd, file);
	return fd;

out_free_fd:
	put_unused_fd(fd);
out_free_ep:
	ep_free(ep);
	return error;
}
```
What sys_epoll_create1 does is clear: it first allocates an eventpoll structure, then grabs an unused descriptor from the system (this is the epfd that epoll_create returns), then obtains a file instance named "[eventpoll]" from the anonymous inode file system. It links that file instance to the eventpoll structure, and finally installs the descriptor into the process's file descriptor table.
- epoll_ctl
Once the epfd exists, we can call epoll_ctl to register the descriptors we are interested in with it.
```c
#define EPOLLPRI	0x00000002
#define EPOLLOUT	0x00000004	/* writable */
#define EPOLLERR	0x00000008
#define EPOLLHUP	0x00000010
#define EPOLLRDNORM	0x00000040
#define EPOLLRDBAND	0x00000080
#define EPOLLWRNORM	0x00000100
#define EPOLLWRBAND	0x00000200
#define EPOLLMSG	0x00000400
#define EPOLLRDHUP	0x00002000
/* Exclusive wakeup mode for the attached fd; added in Linux 4.5,
 * related to the thundering-herd problem discussed later */
#define EPOLLEXCLUSIVE	(1U << 28)
#define EPOLLWAKEUP	(1U << 29)
#define EPOLLONESHOT	(1U << 30)
/* Edge-triggered mode, as opposed to level-triggered (LT); discussed later */
#define EPOLLET		(1U << 31)
```
Depending on the op argument, epoll_ctl dispatches to the corresponding handler:
EPOLL_CTL_ADD --> ep_insert: creates an epitem bound to the event, adds it to the eventpoll red-black tree, and registers a callback for it; all readiness notifications inside epoll are delivered through this callback.
EPOLL_CTL_DEL --> ep_remove: removes the epitem for this event from the eventpoll RB tree and releases the associated resources.
EPOLL_CTL_MOD --> ep_modify: modifies the event mask stored in the epitem.
```c
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
		struct epoll_event __user *, event)
{
	int error;
	int full_check = 0;
	struct fd f, tf;
	struct eventpoll *ep;
	struct epitem *epi;
	struct epoll_event epds;
	struct eventpoll *tep = NULL;

	error = -EFAULT;
	if (ep_op_has_event(op) &&
	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
		goto error_return;

	error = -EBADF;
	f = fdget(epfd);
	if (!f.file)
		goto error_return;

	/* Get the "struct file *" for the target file */
	tf = fdget(fd);
	if (!tf.file)
		goto error_fput;

	/* The target file descriptor must support poll */
	error = -EPERM;
	if (!tf.file->f_op->poll)
		goto error_tgt_fput;

	/* Check if EPOLLWAKEUP is allowed */
	if (ep_op_has_event(op))
		ep_take_care_of_epollwakeup(&epds);

	/*
	 * We have to check that the file structure underneath the file descriptor
	 * the user passed to us _is_ an eventpoll file. And also we do not permit
	 * adding an epoll file descriptor inside itself.
	 */
	error = -EINVAL;
	if (f.file == tf.file || !is_file_epoll(f.file))
		goto error_tgt_fput;

	/*
	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
	 * Also, we do not currently support nested exclusive wakeups.
	 */
	if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
		if (op == EPOLL_CTL_MOD)
			goto error_tgt_fput;
		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
			goto error_tgt_fput;
	}

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = f.file->private_data;

	/*
	 * When we insert an epoll file descriptor inside another epoll file
	 * descriptor, there is the chance of creating closed loops, which are
	 * better handled here than in more critical paths. While we are
	 * checking for loops we also determine the list of files reachable
	 * and hang them on the tfile_check_list, so we can check that we
	 * haven't created too many possible wakeup paths.
	 *
	 * We do not need to take the global 'epmutex' on EPOLL_CTL_ADD when
	 * the epoll file descriptor is attaching directly to a wakeup source,
	 * unless the epoll file descriptor is nested. The purpose of taking the
	 * 'epmutex' on add is to prevent complex topologies such as loops and
	 * deep wakeup paths from forming in parallel through multiple
	 * EPOLL_CTL_ADD operations.
	 */
	mutex_lock_nested(&ep->mtx, 0);
	if (op == EPOLL_CTL_ADD) {
		if (!list_empty(&f.file->f_ep_links) ||
						is_file_epoll(tf.file)) {
			full_check = 1;
			mutex_unlock(&ep->mtx);
			mutex_lock(&epmutex);
			if (is_file_epoll(tf.file)) {
				error = -ELOOP;
				if (ep_loop_check(ep, tf.file) != 0) {
					clear_tfile_check_list();
					goto error_tgt_fput;
				}
			} else
				list_add(&tf.file->f_tfile_llink,
							&tfile_check_list);
			mutex_lock_nested(&ep->mtx, 0);
			if (is_file_epoll(tf.file)) {
				tep = tf.file->private_data;
				mutex_lock_nested(&tep->mtx, 1);
			}
		}
	}

	/*
	 * Try to lookup the file inside our RB tree. Since we grabbed "mtx"
	 * above, we can be sure to be able to use the item looked up by
	 * ep_find() till we release the mutex.
	 */
	epi = ep_find(ep, tf.file, fd);

	error = -EINVAL;
	switch (op) {
	case EPOLL_CTL_ADD:
		if (!epi) {
			epds.events |= POLLERR | POLLHUP;
			error = ep_insert(ep, &epds, tf.file, fd, full_check);
		} else
			error = -EEXIST;
		if (full_check)
			clear_tfile_check_list();
		break;
	case EPOLL_CTL_DEL:
		if (epi)
			error = ep_remove(ep, epi);
		else
			error = -ENOENT;
		break;
	case EPOLL_CTL_MOD:
		if (epi) {
			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
				epds.events |= POLLERR | POLLHUP;
				error = ep_modify(ep, epi, &epds);
			}
		} else
			error = -ENOENT;
		break;
	}
	if (tep != NULL)
		mutex_unlock(&tep->mtx);
	mutex_unlock(&ep->mtx);

error_tgt_fput:
	if (full_check)
		mutex_unlock(&epmutex);

	fdput(tf);
error_fput:
	fdput(f);
error_return:

	return error;
}
```
- epoll_wait
After registering the interesting file descriptors with the epfd, we can call epoll_wait and let the kernel call us back when events occur. timeout < 0 blocks indefinitely; timeout == 0 checks for ready events and returns immediately; timeout > 0 waits up to that long and returns even if no event becomes ready.
As described in the first section on epoll's key structures, epoll_wait's main job is to check whether the eventpoll ready queue rdllist holds any events, and that check is implemented by the ep_poll function. Items land on rdllist through ep_poll_callback: when a monitored event fires, the callback moves the corresponding epitem onto rdllist and wakes up the task sleeping in epoll_wait, which then copies the ready events back to user space. One detail in ep_poll is worth noting: when no event is ready, the kernel calls __add_wait_queue_exclusive(&ep->wq, &wait) to put the current task on the wait queue. This uses the standard Linux wait_queue; the exclusive variants append the waiter to the tail of the queue, and a wakeup stops after waking the first waiter that has WQ_FLAG_EXCLUSIVE set.
```c
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
		int, maxevents, int, timeout)
{
	int error;
	struct fd f;
	struct eventpoll *ep;

	/* The maximum number of event must be greater than zero */
	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
		return -EINVAL;

	/* Verify that the area passed by the user is writeable */
	if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event)))
		return -EFAULT;

	/* Get the "struct file *" for the eventpoll file */
	f = fdget(epfd);
	if (!f.file)
		return -EBADF;

	/*
	 * We have to check that the file structure underneath the fd
	 * the user passed to us _is_ an eventpoll file.
	 */
	error = -EINVAL;
	if (!is_file_epoll(f.file))
		goto error_fput;

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = f.file->private_data;

	/* Time to fish for events ... */
	error = ep_poll(ep, events, maxevents, timeout);

error_fput:
	fdput(f);
	return error;
}
```
```c
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
		   int maxevents, long timeout)
{
	int res = 0, eavail, timed_out = 0;
	unsigned long flags;
	u64 slack = 0;
	wait_queue_entry_t wait;
	ktime_t expires, *to = NULL;

	if (timeout > 0) {
		struct timespec64 end_time = ep_set_mstimeout(timeout);

		slack = select_estimate_accuracy(&end_time);
		to = &expires;
		*to = timespec64_to_ktime(end_time);
	} else if (timeout == 0) {
		/*
		 * Avoid the unnecessary trip to the wait queue loop, if the
		 * caller specified a non blocking operation.
		 */
		timed_out = 1;
		spin_lock_irqsave(&ep->lock, flags);
		goto check_events;
	}

fetch_events:

	if (!ep_events_available(ep))
		ep_busy_loop(ep, timed_out);

	spin_lock_irqsave(&ep->lock, flags);

	if (!ep_events_available(ep)) {
		/*
		 * Busy poll timed out. Drop NAPI ID for now, we can add
		 * it back in when we have moved a socket with a valid NAPI
		 * ID onto the ready list.
		 */
		ep_reset_busy_poll_napi_id(ep);

		/*
		 * We don't have any available event to return to the caller.
		 * We need to sleep here, and we will be woken up by
		 * ep_poll_callback() when events become available.
		 */
		init_waitqueue_entry(&wait, current);
		__add_wait_queue_exclusive(&ep->wq, &wait);

		for (;;) {
			/*
			 * We don't want to sleep if the ep_poll_callback() sends us
			 * a wakeup in between. That's why we set the task state
			 * to TASK_INTERRUPTIBLE before doing the checks.
			 */
			set_current_state(TASK_INTERRUPTIBLE);
			/*
			 * Always short-circuit for fatal signals to allow
			 * threads to make a timely exit without the chance of
			 * finding more events available and fetching
			 * repeatedly.
			 */
			if (fatal_signal_pending(current)) {
				res = -EINTR;
				break;
			}
			if (ep_events_available(ep) || timed_out)
				break;
			if (signal_pending(current)) {
				res = -EINTR;
				break;
			}

			spin_unlock_irqrestore(&ep->lock, flags);
			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
				timed_out = 1;

			spin_lock_irqsave(&ep->lock, flags);
		}

		__remove_wait_queue(&ep->wq, &wait);
		__set_current_state(TASK_RUNNING);
	}

check_events:
	/* Is it worth to try to dig for events ? */
	eavail = ep_events_available(ep);

	spin_unlock_irqrestore(&ep->lock, flags);

	/*
	 * Try to transfer events to user space. In case we get 0 events and
	 * there's still timeout left over, we go trying again in search of
	 * more luck.
	 */
	if (!res && eavail &&
	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
		/* ep_send_events() walks the ready queue via ep_scan_ready_list() */
		goto fetch_events;

	return res;
}
```
3. Level-triggered (LT) vs. edge-triggered (ET) mode in epoll
Level triggering means: when an event becomes ready at some moment, the kernel notifies us; the demultiplexer reports that the event is currently ready and can be handled, but we may choose to handle it now or not. In LT mode, if we skip the event this time, the next return from the wait function (select, epoll_wait, ...) will still report it. That is exactly where ET differs: in ET mode, once the event has been reported and the user does not handle it, subsequent epoll_wait calls will not report it again. So in ET mode the user must fully handle each event when it is delivered; miss it and the chance is gone. This is also an important reason why ET mode is more efficient than LT mode; it just requires extra care when handling events.
That covers the principle; how do the two modes differ in the actual source?
The answer is in ep_send_events_proc: after an event has been copied to user space, a level-triggered epitem is simply put back on the ready list.

```c
if (epi->event.events & EPOLLONESHOT)
	epi->event.events &= EP_PRIVATE_BITS;
else if (!(epi->event.events & EPOLLET)) {
	/*
	 * If this file has been added with Level
	 * Trigger mode, we need to insert back inside
	 * the ready list, so that the next call to
	 * epoll_wait() will check again the events
	 * availability. At this point, no one can insert
	 * into ep->rdllist besides us. The epoll_ctl()
	 * callers are locked out by
	 * ep_scan_ready_list() holding "mtx" and the
	 * poll callback will queue them in ep->ovflist.
	 */
	list_add_tail(&epi->rdllink, &ep->rdllist);
	ep_pm_stay_awake(epi);
}
```