• DPDK KNI 接口3 源码学习


    图1. kni结构图

    从结构图中可以看到KNI需要内核模块的支持,即rte_kni.ko

    当rte_kni模块加载时,创建/dev/kni设备节点(rte_kni模块创建kni杂项设备,文件系统节点/dev/kni需要手动或者通过udev机制创建),藉此节点,DPDK KNI应用可控制和与内核rte_kni模块交互。

    在内核模块rte_kni加载时,可指定一些可选的参数以控制其行为:

    # modinfo rte_kni.ko
    lo_mode:        KNI loopback mode (default=lo_mode_none):
                    lo_mode_none        Kernel loopback disabled
                    lo_mode_fifo        Enable kernel loopback with fifo
                    lo_mode_fifo_skb    Enable kernel loopback with fifo and skb buffer
     
    kthread_mode:   Kernel thread mode (default=single):
                    single    Single kernel thread mode enabled.
                    multiple  Multiple kernel thread mode enabled.
     
    carrier:        Default carrier state for KNI interface (default=off):
                    off   Interfaces will be created with carrier state set to off.
    on Interfaces will be created with carrier state set to on.

    典型的情况是,在加载rte_kni模块时不指定任何参数,DPDK应用可由内核网络协议栈获取和向其发送报文。不指定任何参数,意味着仅创建一个内核线程处理所有的KNI虚拟设备在内核侧的报文接收,并且禁用回环模式,KNI接口的默认链路状态为关闭off

    回环模式

    以测试为目的,在加载rte_kni模块式可指定lo_mode参数:

    # insmod kmod/rte_kni.ko lo_mode=lo_mode_fifo
    lo_mode_fifo回环模式将在内核空间中操作FIFO环队列,由函数kni_fifo_get(kni->rx_q,...)和kni_fifo_put(kni->tx_q,...)实现从rx_q接收队列读取报文,再写入发送队列tx_q来实现回环操作。


    # insmod kmod/rte_kni.ko lo_mode=lo_mode_fifo_skb

    lo_mode_fifo_skb回环模式在以上lo_mode_fifo的基础之上,增加了sk_buff缓存的相关拷贝操作。具体包括将rx_q接收队列的数据拷贝到分配的接收skb缓存中。以及分配发送skb缓存,将之前由rx_q队列接收数据拷贝到发送skb缓存中,使用函数kni_net_tx(skb, dev)发送skb缓存数据。最终将数据报文拷贝到mbuf结构中,使用kni_fifo_put函数加入到tx_q发送队列。可见此回环测试模式,更加接近真实的使用场景

    如果没有指定lo_mode参数,回环模式将禁用。

    默认链路状态

    内核模块rte_kni创建的KNI虚拟接口的链路状态,可通过模块加装时的carrier选项控制。

    如果指定了carrier=off,当接口管理使能时,内核模块将接口的链路状态设置为关闭。DPDK应用可通过函数rte_kni_update_link设置KNI虚拟接口的链路状态。这对于需要KNI虚拟接口状态与对应的物理接口实际状态一致的应用是有用的。

    如果指定了carrier=on,当接口管理使能时,内核模块将自动设置接口的链路状态为启用。这对于仅将KNI接口作为纯虚拟接口,而不对应任何物理硬件;或者并不想通过rte_kni_update_link函数显示设置接口链路状态的DPDK应用是有用的。对于物理口为连接任何链路而进行的回环模式测试也是有用的。

    以下,设置默认的链路状态为启用:

    # insmod kmod/rte_kni.ko carrier=on

    以下,设置默认的链路状态为关闭:

    # insmod kmod/rte_kni.ko carrier=off

    如果carrier参数没有指定,KNI虚拟接口的默认链路状态为关闭。

    在任何KNI虚拟接口创建之前,rte_kni内核模块必须加装到内核,并且经由rte_kni_init函数配置(获取/dev/kni设备节点文件句柄)

    int rte_kni_init(unsigned int max_kni_ifaces __rte_unused)
    {
        /* Check FD and open */
        if (kni_fd < 0) {
            kni_fd = open("/dev/" KNI_DEVICE, O_RDWR);
            if (kni_fd < 0) {
                RTE_LOG(ERR, KNI,
                    "Can not open /dev/%s
    ", KNI_DEVICE);
                return -1;
            }
    }

    模块初始化函数kni_init也非常简单。除了解析上面的参数配置外,比较重要的就是注册misc设备和配置lo_mode。

     1 static int __init
     2 kni_init(void)
     3 {
     4     int rc;
     5 
     6     if (kni_parse_kthread_mode() < 0) {
     7         pr_err("Invalid parameter for kthread_mode
    ");
     8         return -EINVAL;
     9     }
    10 
    11     if (multiple_kthread_on == 0)
    12         pr_debug("Single kernel thread for all KNI devices
    ");
    13     else
    14         pr_debug("Multiple kernel thread mode enabled
    ");
    15 
    16     if (kni_parse_carrier_state() < 0) {   //carrier可配置为off和on,默认为off
    17         pr_err("Invalid parameter for carrier
    ");
    18         return -EINVAL;
    19     }
    20 
    21     if (kni_dflt_carrier == 0)
    22         pr_debug("Default carrier state set to off.
    ");
    23     else
    24         pr_debug("Default carrier state set to on.
    ");
    25 
    26 #ifdef HAVE_SIMPLIFIED_PERNET_OPERATIONS
    27     rc = register_pernet_subsys(&kni_net_ops);
    28 #else
    29     rc = register_pernet_gen_subsys(&kni_net_id, &kni_net_ops);
    30 #endif
    31     if (rc)
    32         return -EPERM;
    33 
    34     rc = misc_register(&kni_misc);
    35     if (rc != 0) {
    36         pr_err("Misc registration failed
    ");
    37         goto out;
    38     }
    39 
    40     /* Configure the lo mode according to the input parameter */
    41     kni_net_config_lo_mode(lo_mode);
    42 
    43     return 0;
    44 
    45 out:
    46 #ifdef HAVE_SIMPLIFIED_PERNET_OPERATIONS
    47     unregister_pernet_subsys(&kni_net_ops);
    48 #else
    49     unregister_pernet_gen_subsys(kni_net_id, &kni_net_ops);
    50 #endif
    51     return rc;
    52 }

    kni_net_config_lo_mode:

     1 void
     2 kni_net_config_lo_mode(char *lo_str)
     3 {
     4     if (!lo_str) {
     5         pr_debug("loopback disabled");
     6         return;
     7     }
     8 
     9     if (!strcmp(lo_str, "lo_mode_none"))
    10         pr_debug("loopback disabled");
    11     else if (!strcmp(lo_str, "lo_mode_fifo")) {
    12         pr_debug("loopback mode=lo_mode_fifo enabled");
    13         kni_net_rx_func = kni_net_rx_lo_fifo;
    14     } else if (!strcmp(lo_str, "lo_mode_fifo_skb")) {
    15         pr_debug("loopback mode=lo_mode_fifo_skb enabled");
    16         kni_net_rx_func = kni_net_rx_lo_fifo_skb;
    17     } else {
    18         pr_debug("Unknown loopback parameter, disabled");
    19     }
    20 }

    lo_mode可配置为lo_mode_none,lo_mode_fifo,和lo_mode_fifo_skb,默认为lo_mode_none。另外两个在实际产品中基本不会用到

    配置lo_mode,函数指针kni_net_rx_func指向不同的函数,默认是kni_net_rx_func

    通过register_pernet_subsys或者register_pernet_gen_subsys,注册了kni_net_ops,保证每个namespace都会调用kni_init_net进行初始化

    注册为misc设备后,其工作机制由注册的miscdevice决定,即

     1 static const struct file_operations kni_fops = {
     2     .owner = THIS_MODULE,
     3     .open = kni_open,    //检查保证一个namespace只能打开kni一次,打开后将kni基于namespace的私有数据赋值给打开的文件file->private_data,以便后面使用
     4     .release = kni_release,
     5     .unlocked_ioctl = (void *)kni_ioctl,
     6     .compat_ioctl = (void *)kni_compat_ioctl,
     7 };
     8 
     9 static struct miscdevice kni_misc = {
    10     .minor = MISC_DYNAMIC_MINOR,
    11     .name = KNI_DEVICE,
    12     .fops = &kni_fops,
    13 };

    如何使用kni设备呢?

     1 static int
     2 kni_ioctl(struct inode *inode, uint32_t ioctl_num, unsigned long ioctl_param)
     3 {
     4     int ret = -EINVAL;
     5     struct net *net = current->nsproxy->net_ns;
     6 
     7     pr_debug("IOCTL num=0x%0x param=0x%0lx
    ", ioctl_num, ioctl_param);
     8 
     9     /*
    10      * Switch according to the ioctl called
    11      */
    12     switch (_IOC_NR(ioctl_num)) {
    13     case _IOC_NR(RTE_KNI_IOCTL_TEST):
    14         /* For test only, not used */
    15         break;
    16     case _IOC_NR(RTE_KNI_IOCTL_CREATE):
    17         ret = kni_ioctl_create(net, ioctl_num, ioctl_param);
    18         break;
    19     case _IOC_NR(RTE_KNI_IOCTL_RELEASE):
    20         ret = kni_ioctl_release(net, ioctl_num, ioctl_param);
    21         break;
    22     default:
    23         pr_debug("IOCTL default
    ");
    24         break;
    25     }
    26 
    27     return ret;
    28 }

    RTE_KNI_IOCTL_CREATE和RTE_KNI_IOCTL_RELEASE,分别对应DPDK用户态的rte_kni_alloc和rte_kni_release,即申请kni interface和释放kni interface

    rte_kni_alloc:

      1 struct rte_kni *
      2 rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
      3           const struct rte_kni_conf *conf,
      4           struct rte_kni_ops *ops)
      5 {
      6     int ret;
      7     struct rte_kni_device_info dev_info;
      8     struct rte_kni *kni;
      9     struct rte_tailq_entry *te;
     10     struct rte_kni_list *kni_list;
     11 
     12     if (!pktmbuf_pool || !conf || !conf->name[0])
     13         return NULL;
     14 
     15     /* Check if KNI subsystem has been initialized */
     16     if (kni_fd < 0) {
     17         RTE_LOG(ERR, KNI, "KNI subsystem has not been initialized. Invoke rte_kni_init() first
    ");
     18         return NULL;
     19     }
     20 
     21     rte_mcfg_tailq_write_lock();
     22 
     23     kni = __rte_kni_get(conf->name);
     24     if (kni != NULL) {
     25         RTE_LOG(ERR, KNI, "KNI already exists
    ");
     26         goto unlock;
     27     }
     28 
     29     te = rte_zmalloc("KNI_TAILQ_ENTRY", sizeof(*te), 0);
     30     if (te == NULL) {
     31         RTE_LOG(ERR, KNI, "Failed to allocate tailq entry
    ");
     32         goto unlock;
     33     }
     34 
     35     kni = rte_zmalloc("KNI", sizeof(struct rte_kni), RTE_CACHE_LINE_SIZE);
     36     if (kni == NULL) {
     37         RTE_LOG(ERR, KNI, "KNI memory allocation failed
    ");
     38         goto kni_fail;
     39     }
     40 
     41     strlcpy(kni->name, conf->name, RTE_KNI_NAMESIZE);
     42 
     43     if (ops)
     44         memcpy(&kni->ops, ops, sizeof(struct rte_kni_ops));
     45     else
     46         kni->ops.port_id = UINT16_MAX;
     47 
     48     memset(&dev_info, 0, sizeof(dev_info));
     49     dev_info.core_id = conf->core_id;
     50     dev_info.force_bind = conf->force_bind;
     51     dev_info.group_id = conf->group_id;
     52     dev_info.mbuf_size = conf->mbuf_size;
     53     dev_info.mtu = conf->mtu;
     54     dev_info.min_mtu = conf->min_mtu;
     55     dev_info.max_mtu = conf->max_mtu;
     56 
     57     memcpy(dev_info.mac_addr, conf->mac_addr, RTE_ETHER_ADDR_LEN);
     58 
     59     strlcpy(dev_info.name, conf->name, RTE_KNI_NAMESIZE);
     60 
     61     ret = kni_reserve_mz(kni);//kni_reserve_mz申请连续的物理内存,并用其作为各个ring
     62     if (ret < 0)
     63         goto mz_fail;
     64 
     65     /* TX RING */
     66     kni->tx_q = kni->m_tx_q->addr;
     67     kni_fifo_init(kni->tx_q, KNI_FIFO_COUNT_MAX);
     68     dev_info.tx_phys = kni->m_tx_q->phys_addr;
     69 
     70     /* RX RING */
     71     kni->rx_q = kni->m_rx_q->addr;
     72     kni_fifo_init(kni->rx_q, KNI_FIFO_COUNT_MAX);
     73     dev_info.rx_phys = kni->m_rx_q->phys_addr;
     74 
     75     /* ALLOC RING */
     76     kni->alloc_q = kni->m_alloc_q->addr;
     77     kni_fifo_init(kni->alloc_q, KNI_FIFO_COUNT_MAX);
     78     dev_info.alloc_phys = kni->m_alloc_q->phys_addr;
     79 
     80     /* FREE RING */
     81     kni->free_q = kni->m_free_q->addr;
     82     kni_fifo_init(kni->free_q, KNI_FIFO_COUNT_MAX);
     83     dev_info.free_phys = kni->m_free_q->phys_addr;
     84 
     85     /* Request RING */
     86     kni->req_q = kni->m_req_q->addr;
     87     kni_fifo_init(kni->req_q, KNI_FIFO_COUNT_MAX);
     88     dev_info.req_phys = kni->m_req_q->phys_addr;
     89 
     90     /* Response RING */
     91     kni->resp_q = kni->m_resp_q->addr;
     92     kni_fifo_init(kni->resp_q, KNI_FIFO_COUNT_MAX);
     93     dev_info.resp_phys = kni->m_resp_q->phys_addr;
     94 
     95     /* Req/Resp sync mem area */
     96     kni->sync_addr = kni->m_sync_addr->addr;
     97     dev_info.sync_va = kni->m_sync_addr->addr;
     98     dev_info.sync_phys = kni->m_sync_addr->phys_addr;
     99 
    100     kni->pktmbuf_pool = pktmbuf_pool;
    101     kni->group_id = conf->group_id;
    102     kni->mbuf_size = conf->mbuf_size;
    103 
    104     dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
    105 
    106     ret = ioctl(kni_fd, RTE_KNI_IOCTL_CREATE, &dev_info); //使用IOCTL创建内核对应的虚拟网口
    107     if (ret < 0)
    108         goto ioctl_fail;
    109 
    110     te->data = kni;
    111 
    112     kni_list = RTE_TAILQ_CAST(rte_kni_tailq.head, rte_kni_list);
    113     TAILQ_INSERT_TAIL(kni_list, te, next);
    114 
    115     rte_mcfg_tailq_write_unlock();
    116 
    117     /* Allocate mbufs and then put them into alloc_q */
    118     kni_allocate_mbufs(kni);  //分配队列内存资源
    119 
    120     return kni;
    121 
    122 ioctl_fail:
    123     kni_release_mz(kni);
    124 mz_fail:
    125     rte_free(kni);
    126 kni_fail:
    127     rte_free(te);
    128 unlock:
    129     rte_mcfg_tailq_write_unlock();
    130 
    131     return NULL;
    132 }

    分配每个kni口对应的结构,并管理起来:

     1 /**
     2  * KNI context
     3  */
     4 struct rte_kni {
     5     char name[RTE_KNI_NAMESIZE];        /**< KNI interface name */
     6     uint16_t group_id;                  /**< Group ID of KNI devices */
     7     uint32_t slot_id;                   /**< KNI pool slot ID */
     8     struct rte_mempool *pktmbuf_pool;   /**< pkt mbuf mempool */
     9     unsigned int mbuf_size;                 /**< mbuf size */
    10 
    11     const struct rte_memzone *m_tx_q;   /**< TX queue memzone */
    12     const struct rte_memzone *m_rx_q;   /**< RX queue memzone */
    13     const struct rte_memzone *m_alloc_q;/**< Alloc queue memzone */
    14     const struct rte_memzone *m_free_q; /**< Free queue memzone */
    15 
    16     struct rte_kni_fifo *tx_q;          /**< TX queue */
    17     struct rte_kni_fifo *rx_q;          /**< RX queue */
    18     struct rte_kni_fifo *alloc_q;       /**< Allocated mbufs queue */
    19     struct rte_kni_fifo *free_q;        /**< To be freed mbufs queue */
    20 
    21     const struct rte_memzone *m_req_q;  /**< Request queue memzone */
    22     const struct rte_memzone *m_resp_q; /**< Response queue memzone */
    23     const struct rte_memzone *m_sync_addr;/**< Sync addr memzone */
    24 
    25     /* For request & response */
    26     struct rte_kni_fifo *req_q;         /**< Request queue */
    27     struct rte_kni_fifo *resp_q;        /**< Response queue */
    28     void *sync_addr;                   /**< Req/Resp Mem address */
    29 
    30     struct rte_kni_ops ops;             /**< operations for request */
    31 };

    接着根据rte_kni_conf信息配置rte_kni_device_info:

     1 /*
     2  * Struct used to create a KNI device. Passed to the kernel in IOCTL call
     3  */
     4 
     5 struct rte_kni_device_info {
     6     char name[RTE_KNI_NAMESIZE];  /**< Network device name for KNI */
     7 
     8     phys_addr_t tx_phys;
     9     phys_addr_t rx_phys;
    10     phys_addr_t alloc_phys;
    11     phys_addr_t free_phys;
    12 
    13     /* Used by Ethtool */
    14     phys_addr_t req_phys;
    15     phys_addr_t resp_phys;
    16     phys_addr_t sync_phys;
    17     void * sync_va;
    18 
    19     /* mbuf mempool */
    20     void * mbuf_va;
    21     phys_addr_t mbuf_phys;
    22 
    23     uint16_t group_id;            /**< Group ID */
    24     uint32_t core_id;             /**< core ID to bind for kernel thread */
    25 
    26     __extension__
    27     uint8_t force_bind : 1;       /**< Flag for kernel thread binding */
    28 
    29     /* mbuf size */
    30     unsigned mbuf_size;
    31     unsigned int mtu;
    32     unsigned int min_mtu;
    33     unsigned int max_mtu;
    34     uint8_t mac_addr[6];
    35     uint8_t iova_mode;
    36 };

    kni的资源初始化完成后使用IOCTL创建内核对应的虚拟网口:

    ioctl(kni_fd, RTE_KNI_IOCTL_CREATE, &dev_info);

    驱动执行kni_ioctl_create:

      1 static int
      2 kni_ioctl_create(struct net *net, uint32_t ioctl_num,
      3         unsigned long ioctl_param)
      4 {
      5     struct kni_net *knet = net_generic(net, kni_net_id);
      6     int ret;
      7     struct rte_kni_device_info dev_info;
      8     struct net_device *net_dev = NULL;
      9     struct kni_dev *kni, *dev, *n;
     10 
     11     pr_info("Creating kni...
    ");
     12     /* Check the buffer size, to avoid warning */
     13     if (_IOC_SIZE(ioctl_num) > sizeof(dev_info))
     14         return -EINVAL;
     15 
     16     /* Copy kni info from user space */
     17     if (copy_from_user(&dev_info, (void *)ioctl_param, sizeof(dev_info)))
     18         return -EFAULT;
     19 
     20     /* Check if name is zero-ended */
     21     if (strnlen(dev_info.name, sizeof(dev_info.name)) == sizeof(dev_info.name)) {
     22         pr_err("kni.name not zero-terminated");
     23         return -EINVAL;
     24     }
     25 
     26     /**
     27      * Check if the cpu core id is valid for binding.
     28      */
     29     if (dev_info.force_bind && !cpu_online(dev_info.core_id)) {
     30         pr_err("cpu %u is not online
    ", dev_info.core_id);
     31         return -EINVAL;
     32     }
     33 
     34     /* Check if it has been created */
     35     down_read(&knet->kni_list_lock);
     36     list_for_each_entry_safe(dev, n, &knet->kni_list_head, list) {
     37         if (kni_check_param(dev, &dev_info) < 0) {
     38             up_read(&knet->kni_list_lock);
     39             return -EINVAL;
     40         }
     41     }
     42     up_read(&knet->kni_list_lock);
     43 
     44     net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
     45 #ifdef NET_NAME_USER
     46                             NET_NAME_USER,
     47 #endif
     48                             kni_net_init);
     49     if (net_dev == NULL) {
     50         pr_err("error allocating device "%s"
    ", dev_info.name);
     51         return -EBUSY;
     52     }
     53 
     54     dev_net_set(net_dev, net);
     55 
     56     kni = netdev_priv(net_dev);
     57 
     58     kni->net_dev = net_dev;
     59     kni->core_id = dev_info.core_id;
     60     strncpy(kni->name, dev_info.name, RTE_KNI_NAMESIZE);
     61 
     62     /* Translate user space info into kernel space info */
     63     if (dev_info.iova_mode) {
     64 #ifdef HAVE_IOVA_TO_KVA_MAPPING_SUPPORT
     65         kni->tx_q = iova_to_kva(current, dev_info.tx_phys);
     66         kni->rx_q = iova_to_kva(current, dev_info.rx_phys);
     67         kni->alloc_q = iova_to_kva(current, dev_info.alloc_phys);
     68         kni->free_q = iova_to_kva(current, dev_info.free_phys);
     69 
     70         kni->req_q = iova_to_kva(current, dev_info.req_phys);
     71         kni->resp_q = iova_to_kva(current, dev_info.resp_phys);
     72         kni->sync_va = dev_info.sync_va;
     73         kni->sync_kva = iova_to_kva(current, dev_info.sync_phys);
     74         kni->usr_tsk = current;
     75         kni->iova_mode = 1;
     76 #else
     77         pr_err("KNI module does not support IOVA to VA translation
    ");
     78         return -EINVAL;
     79 #endif
     80     } else {
     81 
     82         kni->tx_q = phys_to_virt(dev_info.tx_phys);
     83         kni->rx_q = phys_to_virt(dev_info.rx_phys);
     84         kni->alloc_q = phys_to_virt(dev_info.alloc_phys);
     85         kni->free_q = phys_to_virt(dev_info.free_phys);
     86 
     87         kni->req_q = phys_to_virt(dev_info.req_phys);
     88         kni->resp_q = phys_to_virt(dev_info.resp_phys);
     89         kni->sync_va = dev_info.sync_va;
     90         kni->sync_kva = phys_to_virt(dev_info.sync_phys);
     91         kni->iova_mode = 0;
     92     }
     93 
     94     kni->mbuf_size = dev_info.mbuf_size;
     95 
     96     pr_debug("tx_phys:      0x%016llx, tx_q addr:      0x%p
    ",
     97         (unsigned long long) dev_info.tx_phys, kni->tx_q);
     98     pr_debug("rx_phys:      0x%016llx, rx_q addr:      0x%p
    ",
     99         (unsigned long long) dev_info.rx_phys, kni->rx_q);
    100     pr_debug("alloc_phys:   0x%016llx, alloc_q addr:   0x%p
    ",
    101         (unsigned long long) dev_info.alloc_phys, kni->alloc_q);
    102     pr_debug("free_phys:    0x%016llx, free_q addr:    0x%p
    ",
    103         (unsigned long long) dev_info.free_phys, kni->free_q);
    104     pr_debug("req_phys:     0x%016llx, req_q addr:     0x%p
    ",
    105         (unsigned long long) dev_info.req_phys, kni->req_q);
    106     pr_debug("resp_phys:    0x%016llx, resp_q addr:    0x%p
    ",
    107         (unsigned long long) dev_info.resp_phys, kni->resp_q);
    108     pr_debug("mbuf_size:    %u
    ", kni->mbuf_size);
    109 
    110     /* if user has provided a valid mac address */
    111     if (is_valid_ether_addr(dev_info.mac_addr))
    112         memcpy(net_dev->dev_addr, dev_info.mac_addr, ETH_ALEN);
    113     else
    114         /*
    115          * Generate random mac address. eth_random_addr() is the
    116          * newer version of generating mac address in kernel.
    117          */
    118         random_ether_addr(net_dev->dev_addr);
    119 
    120     if (dev_info.mtu)
    121         net_dev->mtu = dev_info.mtu;
    122 #ifdef HAVE_MAX_MTU_PARAM
    123     net_dev->max_mtu = net_dev->mtu;
    124 
    125     if (dev_info.min_mtu)
    126         net_dev->min_mtu = dev_info.min_mtu;
    127 
    128     if (dev_info.max_mtu)
    129         net_dev->max_mtu = dev_info.max_mtu;
    130 #endif
    131 
    132     ret = register_netdev(net_dev); //注册netdev
    133     if (ret) {
    134         pr_err("error %i registering device "%s"
    ",
    135                     ret, dev_info.name);
    136         kni->net_dev = NULL;
    137         kni_dev_remove(kni);
    138         free_netdev(net_dev);
    139         return -ENODEV;
    140     }
    141 
    142     netif_carrier_off(net_dev);
    143 
    144     ret = kni_run_thread(knet, kni, dev_info.force_bind); //启动内核接收线
    145     if (ret != 0)
    146         return ret;
    147 
    148     down_write(&knet->kni_list_lock);
    149     list_add(&kni->list, &knet->kni_list_head);
    150     up_write(&knet->kni_list_lock);
    151 
    152     return 0;
    153 }

    通过phys_to_virt将ring的物理地址转成虚拟地址使用,这样就保证了KNI的用户态和内核态使用同一片物理地址,从而做到零拷贝

     1 static int
     2 kni_run_thread(struct kni_net *knet, struct kni_dev *kni, uint8_t force_bind)
     3 {
     4     /**
     5      * Create a new kernel thread for multiple mode, set its core affinity,
     6      * and finally wake it up.
     7      */
     8     if (multiple_kthread_on) {
     9         kni->pthread = kthread_create(kni_thread_multiple,
    10             (void *)kni, "kni_%s", kni->name);
    11         if (IS_ERR(kni->pthread)) {
    12             kni_dev_remove(kni);
    13             return -ECANCELED;
    14         }
    15 
    16         if (force_bind)
    17             kthread_bind(kni->pthread, kni->core_id);
    18         wake_up_process(kni->pthread);
    19     } else {
    20         mutex_lock(&knet->kni_kthread_lock);
    21 
    22         if (knet->kni_kthread == NULL) {
    23             knet->kni_kthread = kthread_create(kni_thread_single,
    24                 (void *)knet, "kni_single");
    25             if (IS_ERR(knet->kni_kthread)) {
    26                 mutex_unlock(&knet->kni_kthread_lock);
    27                 kni_dev_remove(kni);
    28                 return -ECANCELED;
    29             }
    30 
    31             if (force_bind)
    32                 kthread_bind(knet->kni_kthread, kni->core_id);
    33             wake_up_process(knet->kni_kthread);
    34         }
    35 
    36         mutex_unlock(&knet->kni_kthread_lock);
    37     }
    38 
    39     return 0;
    40 }

    如果KNI模块的参数指定了多线程模式,每创建一个kni设备,就创建一个内核线程。如果为单线程模式,则检查是否已经启动了kni_thread。没有的话,创建唯一的kni内核thread kni_single,有的话,则什么都不做。

    用户态与内核态报文交互的方式: 

    从ingress方向看,rte_eth_rx_burst时mbuf mempool中分配内存,通过rx_q发送给KNI,KNI线程将mbuf从rx_q中出队,将其转换成skb后,就将原mbuf通过free_q归还给用户rx thread,并由rx_thread释放,可以看出ingress方向的mbuf完全由用户态rx_thread自行管理。

    从egress方向看,当KNI口发包时,从 mbuf cache中取得一个mbuf,将skb内容copy到mbuf中,进入tx_q队列,tx_thread将该mbuf出队并完成发送,因为发送后该mbuf会被释放。所以要重新alloc一个mbuf通过alloc_q归还给kernel。这部分mbuf是上面用户态填充的alloc_q。

    这是DPDK app向KNI设备写入数据,也就是发给内核的情况。当内核从KNI设备发送数据时,按照内核的流程处理,最终会调用到net_device_ops->ndo_start_xmit。对于KNI驱动来说,即kni_net_tx

     1 /*
     2  * Transmit a packet (called by the kernel)
     3  */
     4 static int
     5 kni_net_tx(struct sk_buff *skb, struct net_device *dev)
     6 {
     7     int len = 0;
     8     uint32_t ret;
     9     struct kni_dev *kni = netdev_priv(dev);
    10     struct rte_kni_mbuf *pkt_kva = NULL;
    11     void *pkt_pa = NULL;
    12     void *pkt_va = NULL;
    13 
    14     /* save the timestamp */
    15 #ifdef HAVE_TRANS_START_HELPER
    16     netif_trans_update(dev);
    17 #else
    18     dev->trans_start = jiffies;
    19 #endif
    20 
    21     /* Check if the length of skb is less than mbuf size */
    22     if (skb->len > kni->mbuf_size)
    23         goto drop;
    24 
    25     /**
    26      * Check if it has at least one free entry in tx_q and
    27      * one entry in alloc_q.
    28      */
    29     if (kni_fifo_free_count(kni->tx_q) == 0 ||
    30             kni_fifo_count(kni->alloc_q) == 0) {
    31         /**
    32          * If no free entry in tx_q or no entry in alloc_q,
    33          * drops skb and goes out.
    34          */
    35         goto drop;
    36     }
    37 
    38     /* dequeue a mbuf from alloc_q */
    39     ret = kni_fifo_get(kni->alloc_q, &pkt_pa, 1);
    40     if (likely(ret == 1)) {
    41         void *data_kva;
    42 
    43         pkt_kva = get_kva(kni, pkt_pa);
    44         data_kva = get_data_kva(kni, pkt_kva);
    45         pkt_va = pa2va(pkt_pa, pkt_kva);
    46 
    47         len = skb->len;
    48         memcpy(data_kva, skb->data, len);
    49         if (unlikely(len < ETH_ZLEN)) {
    50             memset(data_kva + len, 0, ETH_ZLEN - len);
    51             len = ETH_ZLEN;
    52         }
    53         pkt_kva->pkt_len = len;
    54         pkt_kva->data_len = len;
    55 
    56         /* enqueue mbuf into tx_q */
    57         ret = kni_fifo_put(kni->tx_q, &pkt_va, 1);
    58         if (unlikely(ret != 1)) {
    59             /* Failing should not happen */
    60             pr_err("Fail to enqueue mbuf into tx_q
    ");
    61             goto drop;
    62         }
    63     } else {
    64         /* Failing should not happen */
    65         pr_err("Fail to dequeue mbuf from alloc_q
    ");
    66         goto drop;
    67     }
    68 
    69     /* Free skb and update statistics */
    70     dev_kfree_skb(skb);
    71     dev->stats.tx_bytes += len;
    72     dev->stats.tx_packets++;
    73 
    74     return NETDEV_TX_OK;
    75 
    76 drop:
    77     /* Free skb and update statistics */
    78     dev_kfree_skb(skb);
    79     dev->stats.tx_dropped++;
    80 
    81     return NETDEV_TX_OK;
    82 }

    1.对skb报文长度做检查,不能超过mbuf的大小。然后检查发送队列tx_q是否还有空位,“内存队列”是否有剩余的mbuf

    2.从alloc_q取出一个内存块,将其转换为虚拟地址,然后将skb的数据复制过去,最后将其追加到发送队列tx_q中

    3.发送完成后,就直接释放skb并更新统计计数

    DPDK提供了两个API rte_kni_rx_burst和rte_kni_tx_burst,用于从KNI接收报文和向KNI发送报文

     1 unsigned
     2 rte_kni_rx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
     3 {
     4     unsigned int ret = kni_fifo_get(kni->tx_q, (void **)mbufs, num);
     5 
     6     /* If buffers removed, allocate mbufs and then put them into alloc_q */
     7     if (ret)
     8         kni_allocate_mbufs(kni);
     9 
    10     return ret;
    11 }
    12 
    13 
    14 unsigned
    15 rte_kni_tx_burst(struct rte_kni *kni, struct rte_mbuf **mbufs, unsigned int num)
    16 {
    17     num = RTE_MIN(kni_fifo_free_count(kni->rx_q), num);
    18     void *phy_mbufs[num];
    19     unsigned int ret;
    20     unsigned int i;
    21 
    22     for (i = 0; i < num; i++)
    23         phy_mbufs[i] = va2pa_all(mbufs[i]);
    24 
    25     ret = kni_fifo_put(kni->rx_q, phy_mbufs, num);
    26 
    27     /* Get mbufs from free_q and then free them */
    28     kni_free_mbufs(kni);
    29 
    30 
    31 }

    1.接收报文时,从kni->tx_q直接取走所有报文。前面内核用KNI发送报文时,填充的就是这个fifo。当取走了报文后,DPDK应用层的调用kni_allocate_mbufs,负责给tx_q填充空闲mbuf,供内核使用。

    2.发送报文时,先将要发送给KNI的报文地址转换为物理地址,然后enqueue到kni->rx_q中(内核的KNI实现也是从这个fifo中读取报文),最后调用kni_free_mbufs释放掉内核处理完的mbuf报文

  • 相关阅读:
    关于IDEA 单元测试时 【empty test suite】
    js小数点问题
    梅林路由器阿里云ddnsipv6脚本(含socat转发)
    [Mysql] 页结构
    [Thread] 多线程顺序执行
    MySQL、MySQL Workbench [8.0.28] 最新安装包与安装教程【2022年3月】
    微信h5 视频自动播放解决方案
    美团前端暑期实习凉经
    通过图片地址获取图片的base64,再通过base64获取二进制数据
    C# Enum,Int,String的互相转换 枚举转换
  • 原文地址:https://www.cnblogs.com/mysky007/p/12305257.html
Copyright © 2020-2023  润新知