• spark docker java kubernetes: getting the CPU core/thread count


    Upgraded the service from spark2.3.0-hadoop2.8 to spark2.4.0-hadoop3.0.

    A day later, the Spark Streaming Kafka consumers began building up a backlog of unconsumed data.

    The service is not deployed on YARN in the traditional way, but on Kubernetes (1.13.2): https://spark.apache.org/docs/latest/running-on-kubernetes.html
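
    For reference, submission in this mode looks roughly like the following (a hypothetical invocation with placeholder names, following the Spark on Kubernetes docs, not the actual deployment command used here):

    bin/spark-submit \
        --master k8s://https://<k8s-apiserver-host>:<port> \
        --deploy-mode cluster \
        --name topic-consumer \
        --class com.example.ConsumeTopic \
        --conf spark.executor.instances=2 \
        --conf spark.kubernetes.container.image=<spark-image> \
        local:///path/to/app.jar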

    Because the cluster had recently undergone some major changes, I assumed the backlog was caused by a cluster I/O bottleneck and made several I/O-targeted optimizations, but they had little effect.

    I kept watching the service logs and the load on the servers.

    Then I suddenly noticed something odd: CPU usage of the Spark-related services stayed between 100% and 200%, sitting at 100% for long stretches.

    The machines in the cluster have 32 cores, so 100% CPU usage means roughly a single core in use; something was clearly wrong.

    My guess: the backlog was probably not an I/O bottleneck but a compute bottleneck (the service runs compute-intensive operations such as word segmentation, classification, and clustering).

    The program optimizes itself internally based on the number of CPU cores.

    The method used to read the core count from the environment:

    def GetCpuCoreNum(): Int = {
      Runtime.getRuntime.availableProcessors
    }
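
    To illustrate why this matters, here is a minimal sketch (not the original service code; names are made up) of the kind of core-count-based optimization meant above: a worker pool sized from availableProcessors degrades to one thread when the JVM reports a single core.

    import java.util.concurrent.Executors

    object CpuAwareCompute {
      def GetCpuCoreNum(): Int = Runtime.getRuntime.availableProcessors

      // Pool sized from the reported core count: when the JVM sees only
      // one processor, all "parallel" work runs on a single thread.
      private val pool = Executors.newFixedThreadPool(GetCpuCoreNum())

      def main(args: Array[String]): Unit = {
        println(s"Cpu core Num ${GetCpuCoreNum()}")
        pool.shutdown()
      }
    }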

    Printing the core count:

    spark 2.4.0

    root@consume-topic-qk-nwd-7d84585f5-kh7z5:/usr/spark-2.4.0# java -version
    java version "1.8.0_202"
    Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
    Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
    
    [cuidapeng@wx-k8s-4 ~]$ kb logs consume-topic-qk-nwd-7d84585f5-kh7z5 |more
    2019-03-04 15:21:59 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Cpu core Num 1
    2019-03-04 15:22:00 INFO SparkContext:54 - Running Spark version 2.4.0
    2019-03-04 15:22:00 INFO SparkContext:54 - Submitted application: topic-quick
    2019-03-04 15:22:00 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-04 15:22:00 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-04 15:22:00 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-04 15:22:00 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-04 15:22:00 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-04 15:22:00 INFO Utils:54 - Successfully started service 'sparkDriver' on port 33016.
    2019-03-04 15:22:00 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-04 15:22:01 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-04 15:22:01 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-04 15:22:01 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-04 15:22:01 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-dc0c496e-e5ab-4d07-a518-440f2336f65c
    2019-03-04 15:22:01 INFO MemoryStore:54 - MemoryStore started with capacity 4.5 GB
    2019-03-04 15:22:01 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-04 15:22:01 INFO log:192 - Logging initialized @2888ms

    Cpu core Num 1: the service had dropped to single-core computation, and this was the cause of the backlog.

    The guess was right, so I rolled the version back to 2.3.0.

    Rolled back to spark 2.3.0:

    root@consume-topic-dt-nwd-67b7fd6dd5-jztpb:/usr/spark-2.3.0# java -version
    java version "1.8.0_131"
    Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
    
    2019-03-04 15:16:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Cpu core Num 32
    2019-03-04 15:16:23 INFO SparkContext:54 - Running Spark version 2.3.0
    2019-03-04 15:16:23 INFO SparkContext:54 - Submitted application: topic-dt
    2019-03-04 15:16:23 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-04 15:16:23 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-04 15:16:23 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-04 15:16:23 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-04 15:16:23 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-04 15:16:23 INFO Utils:54 - Successfully started service 'sparkDriver' on port 40616.
    2019-03-04 15:16:23 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-04 15:16:23 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-04 15:16:23 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-04 15:16:23 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-04 15:16:23 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-5dbf1194-477a-4001-8738-3da01b5a3f01
    2019-03-04 15:16:23 INFO MemoryStore:54 - MemoryStore started with capacity 6.2 GB
    2019-03-04 15:16:23 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-04 15:16:24 INFO log:192 - Logging initialized @2867ms

    Cpu core Num 32: 32 is the core count of the physical host.

    So the backlog was not caused by I/O but by the runtime seeing fewer available cores: after upgrading Spark to 2.4.0, the service went from executing on 32 cores in parallel to executing on a single core.

    Strictly speaking this is not a Spark problem but a JDK problem.

    Long ago we had a requirement to cap core resources inside Docker, which needed the JDK to report the Docker-limited core count rather than the host's. My recollection is that this had been raised with the JDK project and was slated for JDK 9/10 but could not be done on JDK 8, so we dropped the plan of limiting cores through Docker and instead constrained compute resources by spreading services across the scheduler.

    I never expected this from JDK 8, and it turned out to be the pit I fell into here.

    Docker's CPU-control options:

    Usage:  docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

    Run a command in a new container

    Options:
          --cpu-period int          Limit CPU CFS (Completely Fair Scheduler) period
          --cpu-quota int           Limit CPU CFS (Completely Fair Scheduler) quota
          --cpu-rt-period int       Limit CPU real-time period in microseconds
          --cpu-rt-runtime int      Limit CPU real-time runtime in microseconds
      -c, --cpu-shares int          CPU shares (relative weight)
          --cpus decimal            Number of CPUs
          --cpuset-cpus string      CPUs in which to allow execution (0-3, 0,1)
          --cpuset-mems string      MEMs in which to allow execution (0-3, 0,1)
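
    For illustration, a hypothetical invocation that caps a container at 12 CPUs (the image name is a placeholder, not the real one):

    docker run --cpus=12 my-spark-image:2.4.0

    Under the hood --cpus is implemented with the CFS period/quota pair above, and that quota is exactly what a container-aware JDK reads back.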

    Another point: the service is scheduled by Kubernetes, which layers its own resource management on top of Docker.

    Kubernetes controls CPU in two ways:
    one based on whole cores: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
    one based on percentages: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
    Assigning CPU resources manually:

            resources:
               requests:
                 cpu: 12
                 memory: "24Gi"
               limits:
                 cpu: 12
                 memory: "24Gi"
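
    For context, a minimal sketch of where that resources block sits in a pod spec (all names and the image are hypothetical):

        apiVersion: v1
        kind: Pod
        metadata:
          name: consume-topic-qk
        spec:
          containers:
          - name: spark-consumer
            image: my-spark-image:2.4.0
            resources:
              requests:
                cpu: 12
                memory: "24Gi"
              limits:
                cpu: 12
                memory: "24Gi"

    The cpu limit is enforced through the container's CFS quota, so it is the number a container-aware JDK will report.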

    Update the service:

    2019-03-04 16:24:57 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Cpu core Num 12
    2019-03-04 16:24:57 INFO SparkContext:54 - Running Spark version 2.4.0
    2019-03-04 16:24:58 INFO SparkContext:54 - Submitted application: topic-dt
    2019-03-04 16:24:58 INFO SecurityManager:54 - Changing view acls to: root
    2019-03-04 16:24:58 INFO SecurityManager:54 - Changing modify acls to: root
    2019-03-04 16:24:58 INFO SecurityManager:54 - Changing view acls groups to:
    2019-03-04 16:24:58 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-03-04 16:24:58 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
    2019-03-04 16:24:58 INFO Utils:54 - Successfully started service 'sparkDriver' on port 36429.
    2019-03-04 16:24:58 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-03-04 16:24:58 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-03-04 16:24:58 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-03-04 16:24:58 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-03-04 16:24:58 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-764f35a8-ea7f-4057-8123-22cbbe2d9a39
    2019-03-04 16:24:58 INFO MemoryStore:54 - MemoryStore started with capacity 6.2 GB
    2019-03-04 16:24:58 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-03-04 16:24:58 INFO log:192 - Logging initialized @2855ms

    Cpu core Num 12: the setting took effect.

    There is a compatibility issue between kubernetes (docker) and spark (jdk) over the visible core count:

    jdk 1.8.0_131 inside Docker reports the core count of the host machine.

    jdk 1.8.0_202 inside Docker reports the core count Docker is limited to; if Kubernetes specifies no resources, the default it sees is 1. (Container-aware CPU detection was backported to JDK 8 in 8u191 via UseContainerSupport, which is presumably why _131 and _202 behave differently; with no CPU limit set, the pod's small default cpu.shares value leads the JVM to derive a single core.)

    So when upgrading to spark2.4.0-hadoop3.0 (jdk 1.8.0_202), have Kubernetes specify the core count at the same time; alternatively, switch back to an older JDK, though that means rebuilding the Docker image.
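
    To see where a container-aware JDK gets its number, here is a minimal Scala sketch (assuming the cgroup v1 paths used by Docker 17.03 / Kubernetes 1.13; not production code) that recomputes the effective core limit from the CFS quota and compares it with what the JVM reports. On 8u191+ builds the detection can also be overridden with -XX:ActiveProcessorCount=<n> or switched off with -XX:-UseContainerSupport.

    import scala.io.Source

    object CgroupCores {
      // Read a cgroup v1 value; None if the file is missing
      // (e.g. when running outside a container).
      private def readLong(path: String): Option[Long] =
        try {
          val src = Source.fromFile(path)
          try Some(src.mkString.trim.toLong) finally src.close()
        } catch { case _: Exception => None }

      def main(args: Array[String]): Unit = {
        val quota  = readLong("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")   // -1 means unlimited
        val period = readLong("/sys/fs/cgroup/cpu/cpu.cfs_period_us")  // typically 100000

        val cfsCores = for {
          q <- quota if q > 0
          p <- period if p > 0
        } yield math.max(1, math.ceil(q.toDouble / p).toInt)

        println(s"JVM availableProcessors = ${Runtime.getRuntime.availableProcessors}")
        println(s"cgroup CFS core limit   = ${cfsCores.getOrElse("unlimited")}")
      }
    }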

    Node status after specifying the core count (kubectl describe node output):

    Name:               wx-k8s-8
    Roles:              <none>
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
    Annotations:        flannel.alpha.coreos.com/backend-type: vxlan
                        flannel.alpha.coreos.com/kube-subnet-manager: true
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        node.alpha.kubernetes.io/ttl: 0
                        volumes.kubernetes.io/controller-managed-attach-detach: true
    CreationTimestamp:  Thu, 24 Jan 2019 14:11:15 +0800
    Taints:             <none>
    Unschedulable:      false
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      MemoryPressure   False   Mon, 04 Mar 2019 17:27:16 +0800   Thu, 24 Jan 2019 14:11:15 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Mon, 04 Mar 2019 17:27:16 +0800   Thu, 24 Jan 2019 14:11:15 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure      False   Mon, 04 Mar 2019 17:27:16 +0800   Thu, 24 Jan 2019 14:11:15 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready            True    Mon, 04 Mar 2019 17:27:16 +0800   Thu, 24 Jan 2019 14:24:48 +0800   KubeletReady                 kubelet is posting ready status
    Addresses:
    Capacity:
     cpu:                32
     ephemeral-storage:  1951511544Ki
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             65758072Ki
     pods:               110
    Allocatable:
     cpu:                32
     ephemeral-storage:  1798513035973
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             65655672Ki
     pods:               110
    System Info:
     Container Runtime Version:  docker://17.3.2
     Kubelet Version:            v1.13.2
     Kube-Proxy Version:         v1.13.2
    PodCIDR:                     10.244.7.0/24
    Non-terminated Pods:         (15 in total)
      Namespace                  Name                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
      ---------                  ----                                               ------------  ----------  ---------------  -------------  ---
      kube-system                kube-flannel-ds-l594f                              100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      11d
      kube-system                kube-proxy-vckxf                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         39d
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests       Limits
      --------           --------       ------
      cpu                20100m (62%)   100m (0%)
      memory             45106Mi (70%)  61490Mi (95%)
      ephemeral-storage  0 (0%)         0 (0%)
    Events:              <none>
  • Original post: https://www.cnblogs.com/zihunqingxin/p/10829273.html