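SkyWalking is an open-source distributed tracing system that originated in China. So how do tracing, monitoring systems, and logging systems differ? In essence, tracing is itself a form of monitoring, and both tracing and monitoring are ultimately built on logs.
SkyWalking documentation in Chinese: https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/
What sets tracing apart from everyday monitoring is that we can act on its results more proactively. Take Prometheus as an example: Prometheus collects the data, Grafana displays it, and alerts fire according to the rules we define. But we rarely study the Prometheus graphs on our own initiative and conclude that something is about to break and must be handled in advance. Instead, an alert fires, we look at the situation, then we check the logs, and we rely on experience to fix the problem and optimize afterwards. In routine operations this is reactive behavior, the proverbial "mending the fold after the sheep are lost".
Once tracing is enabled, we can see which call chains are used most often, which functions or methods run slowly, and which connection to XXX has high latency. We can then schedule the optimizations with the best payoff while the business is not actually broken, only a little slow. Of course, sometimes a service is already slow in use and we analyze it only then, which is no different from ordinary reactive monitoring. In routine operations, though, what we expect from tracing is the former: proactive behavior, "repairing the roof before it rains".
What about the logging system? It collects a large volume of logs, while monitoring and tracing each collect and aggregate the logs they care about, derive the values and targets they need, and present them in different ways. So the logging system is the lowest layer. When a monitoring alert fires, looking at the graph alone is not enough; I have to read the logs from that moment to learn why the system or the business fluctuated. Tracing is the same: when a function runs slowly, I have to look at its logic and at what its processing flow went through before I can tune it.
Among APM tools, SkyWalking and Pinpoint currently instrument applications with no code changes at all. That suits ops people well: Zipkin-style tools require changes to the code, whoever makes those changes takes on the risk and the responsibility, and we would rather not take the blame for runtime problems lightly. For a detailed comparison, see this article: https://www.jianshu.com/p/626cae6c0522
We install SkyWalking by running it inside k8s. The official guide uses helm; here I have exported the YAML and adjusted it.
elasticsearch: SkyWalking supports many storage backends: https://skyapm.github.io/document-cn-translation-of-skywalking/zh/8.0.0/setup/backend/backend-storage.html . Your Elasticsearch does not have to run in a container, so this step is optional; if it does run in a container, remember to allocate storage for persistence. With only one node, the manifest below fails to come back up after a restart, because the cluster cannot reach green status and so fails the readiness check. For standalone testing, just comment out the health check section.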
apiVersion: v1
kind: Service
metadata:
  name: skywalking-elasticsearch
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  ports:
  - name: http
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: transport
    port: 9300
    protocol: TCP
    targetPort: 9300
  selector:
    app: skywalking-elasticsearch
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-elasticsearch-headless
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  ports:
  - name: http
    port: 9200
    protocol: TCP
    targetPort: 9200
  - name: transport
    port: 9300
    protocol: TCP
    targetPort: 9300
  selector:
    app: skywalking-elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: skywalking-elasticsearch
  namespace: default
  labels:
    app: skywalking-elasticsearch
spec:
  replicas: 1
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: skywalking-elasticsearch
  serviceName: skywalking-elasticsearch-headless
  template:
    metadata:
      name: skywalking-elasticsearch
      labels:
        app: skywalking-elasticsearch
    spec:
      # affinity:
      #   podAntiAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #     - labelSelector:
      #         matchExpressions:
      #         - key: app
      #           operator: In
      #           values:
      #           - skywalking-elasticsearch
      #       topologyKey: kubernetes.io/hostname
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
        imagePullPolicy: IfNotPresent
        name: configure-sysctl
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      containers:
      - env:
        - name: node.name
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: skywalking-elasticsearch-0
        - name: discovery.seed_hosts
          value: skywalking-elasticsearch-headless
        - name: cluster.name
          value: skywalking-elasticsearch
        - name: network.host
          value: 0.0.0.0
        - name: ES_JAVA_OPTS
          value: -Xmx1g -Xms1g
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        name: skywalking-elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.5.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file

              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }

              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                  http "/_cluster/health?timeout=0s"
              else
                  echo 'Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - name: skywalking-elasticsearch
          mountPath: /usr/share/elasticsearch/data
      terminationGracePeriodSeconds: 120
  volumeClaimTemplates:
  - metadata:
      name: skywalking-elasticsearch
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: yizhuang-nfs
      resources:
        requests:
          storage: 100Gi
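Before touching the readiness probe, it can help to see which status the cluster actually reports. A single-node Elasticsearch reports yellow rather than green whenever an index has replica shards that cannot be assigned, which is what the probe's wait_for_status=green trips over. The helper below is a minimal sketch: the function name health_status is my own, and it only pulls the status field out of a _cluster/health response read from stdin.

```shell
# health_status: read an Elasticsearch _cluster/health JSON document on stdin
# and print the value of its "status" field (green, yellow or red).
# The function name is hypothetical; only sed is required.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Example use from a pod inside the cluster (Service name taken from the manifest above):
#   curl -s http://skywalking-elasticsearch:9200/_cluster/health | health_status
```

If the answer is yellow on a test node, relaxing the probe to wait_for_status=yellow may be an alternative to commenting it out; treat that as an assumption to verify in your own environment.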
job: initializes the storage structures in ES. If ES has already been initialized, there is no need to run it again.
apiVersion: batch/v1
kind: Job
metadata:
  name: skywalking-job
  namespace: default
  labels:
    app: skywalking-job
spec:
  template:
    metadata:
      name: skywalking-job
      labels:
        app: skywalking-job
    spec:
      initContainers:
      - command:
        - sh
        - -c
        - for i in $(seq 1 60); do nc -z -w3 skywalking-elasticsearch 9200 && exit 0 || sleep 5; done; exit 1
        image: busybox:1.30
        imagePullPolicy: IfNotPresent
        name: wait-for-elasticsearch
      containers:
      - env:
        # -Dmode=init makes the OAP server initialize the data structures in the Elasticsearch cluster
        - name: JAVA_OPTS
          value: -Xmx2g -Xms2g -Dmode=init
        - name: SW_STORAGE
          value: elasticsearch7
        - name: SW_STORAGE_ES_CLUSTER_NODES
          value: skywalking-elasticsearch:9200
        name: skywalking-job
        image: apache/skywalking-oap-server:8.1.0
        imagePullPolicy: IfNotPresent
      # A Job's restartPolicy must be Never or OnFailure; Never is used here
      restartPolicy: Never
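Since the OAP deployment starts with -Dmode=no-init, the init job should have finished before OAP is rolled out. A minimal sketch of applying the job and blocking until it completes; the function name, the manifest file name passed in, and the 600s timeout are my own assumptions, while the job name and namespace come from the manifest above.

```shell
# run_init_job MANIFEST: apply the Job manifest, then wait until the Job
# reports the "complete" condition. Returns non-zero if either step fails.
# MANIFEST is whatever file you saved the Job YAML to (hypothetical name).
run_init_job() {
  manifest="$1"
  kubectl apply -f "$manifest" || return 1
  # 600s is an assumed upper bound; the init container alone may wait
  # several minutes for Elasticsearch to answer on port 9200.
  kubectl -n default wait --for=condition=complete job/skywalking-job --timeout=600s
}

# Example: run_init_job skywalking-job.yaml
```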
oap: the SkyWalking service itself
apiVersion: v1
kind: ServiceAccount
metadata:
  name: skywalking-oap
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: skywalking-oap
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - configmaps
  verbs:
  - get
  - watch
  - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap
subjects:
- kind: ServiceAccount
  name: skywalking-oap
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap
  namespace: default
  labels:
    app: skywalking-oap
spec:
  ports:
  - name: rest
    port: 12800
    protocol: TCP
    targetPort: 12800
  - name: grpc
    port: 11800
    protocol: TCP
    targetPort: 11800
  selector:
    app: skywalking-oap
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap
  namespace: default
  labels:
    app: skywalking-oap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-oap
  template:
    metadata:
      labels:
        app: skywalking-oap
    spec:
      serviceAccount: skywalking-oap
      serviceAccountName: skywalking-oap
      # affinity:
      #   podAntiAffinity:
      #     preferredDuringSchedulingIgnoredDuringExecution:
      #     - podAffinityTerm:
      #         labelSelector:
      #           matchLabels:
      #             app: skywalking-oap
      #         topologyKey: kubernetes.io/hostname
      #       weight: 1
      initContainers:
      - command:
        - sh
        - -c
        - for i in $(seq 1 60); do nc -z -w3 skywalking-elasticsearch 9200 && exit 0 || sleep 5; done; exit 1
        image: busybox:1.30
        imagePullPolicy: IfNotPresent
        name: wait-for-elasticsearch
      containers:
      - env:
        - name: JAVA_OPTS
          value: -Dmode=no-init -Xmx2g -Xms512m
        # Cluster coordination mode: running inside Kubernetes
        - name: SW_CLUSTER
          value: kubernetes
        - name: SW_CLUSTER_K8S_NAMESPACE
          value: default
        - name: SW_CLUSTER_K8S_LABEL
          value: app=skywalking-oap
        - name: SKYWALKING_COLLECTOR_UID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.uid
        - name: SW_STORAGE
          value: elasticsearch7
        - name: SW_STORAGE_ES_CLUSTER_NODES
          value: skywalking-elasticsearch:9200
        # How many days of data each ES index holds
        - name: SW_STORAGE_DAY_STEP
          value: "1"
        - name: SW_STORAGE_ES_FLUSH_INTERVAL
          value: "60"
        # Record data TTL. Note: to keep 30 days of data, set the TTL to DAY_STEP + 30 = 31
        - name: SW_CORE_RECORD_DATA_TTL
          value: "4"
        # Metrics data TTL, same rule as above
        - name: SW_CORE_METRICS_DATA_TTL
          value: "4"
        # Sampling rate; 10000 means 100%. Lower it in production
        - name: SW_TRACE_SAMPLE_RATE
          value: "10000"
        name: skywalking-oap
        image: apache/skywalking-oap-server:8.1.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 11800
          name: grpc
          protocol: TCP
        - containerPort: 12800
          name: rest
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: 12800
          timeoutSeconds: 1
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 15
          periodSeconds: 20
          successThreshold: 1
          tcpSocket:
            port: 12800
          timeoutSeconds: 1
        resources:
          requests:
            memory: 512Mi
            cpu: 30m
          limits:
            memory: 2Gi
            cpu: 500m
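The TTL and sampling values in the manifest are easy to get wrong, so here is the arithmetic from its comments spelled out. This is only a sketch of the rules as stated there (TTL = DAY_STEP + days to keep; sample rate in units of 1/10000); the function names are my own.

```shell
# ttl_days RETENTION_DAYS DAY_STEP
# Value for SW_CORE_RECORD_DATA_TTL / SW_CORE_METRICS_DATA_TTL following the
# rule in the manifest comment: TTL = DAY_STEP + days of data you want to keep.
ttl_days() {
  echo $(( $1 + $2 ))
}

# sample_rate PERCENT
# Value for SW_TRACE_SAMPLE_RATE: 10000 means 100%, so one percent is 100.
# PERCENT must be a whole number between 0 and 100.
sample_rate() {
  echo $(( $1 * 100 ))
}

# Examples:
#   ttl_days 30 1     keeps 30 days with DAY_STEP=1
#   sample_rate 10    samples roughly one trace in ten
```

By the same rule, the TTL of 4 with DAY_STEP of 1 used in the manifest keeps about three days of data.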
ui: renders the dashboards and charts
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: skywalking-dev-xxx-com
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - skywalking-dev.xxx.com
    port:
      number: 80
      name: http
      protocol: HTTP
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: skywalking-dev-xxx-com
  namespace: default
spec:
  hosts:
  - skywalking-dev.xxx.com
  gateways:
  - skywalking-dev-xxx-com
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: skywalking-ui
        port:
          number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui
  namespace: default
  labels:
    app: skywalking-ui
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: skywalking-ui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: default
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      imagePullSecrets:
      - name: aliyun-registry
      containers:
      - env:
        - name: SW_OAP_ADDRESS
          value: skywalking-oap:12800
        image: apache/skywalking-ui:8.1.0
        imagePullPolicy: IfNotPresent
        name: skywalking-ui
        ports:
        - containerPort: 8080
          name: page
          protocol: TCP
        resources:
          requests:
            memory: 512Mi
            cpu: 30m
          limits:
            memory: 1Gi
            cpu: 500m
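The Gateway and VirtualService above assume Istio is installed. If it is not, or you just want a quick look from your workstation, a port-forward to the skywalking-ui Service is a simple fallback. A minimal sketch; the function name and the default local port 8080 are my own choices.

```shell
# forward_ui [LOCAL_PORT]
# Forward a local port (default 8080) to port 80 of the skywalking-ui Service
# in the default namespace. Runs in the foreground until interrupted.
forward_ui() {
  local_port="${1:-8080}"
  kubectl -n default port-forward svc/skywalking-ui "${local_port}:80"
}

# Example: forward_ui 8080, then open http://127.0.0.1:8080 in a browser.
```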
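At this point the UI is reachable. No client has been connected yet, so there is no data, but the server side is done.
Next comes connecting the clients. SkyWalking supports many kinds of clients, though the most common case is a Java application. We only need to download the corresponding package from http://skywalking.apache.org/downloads/ . The client version should match the server version: my server is 8.1.0, so the link I download is https://downloads.apache.org/skywalking/8.1.0/apache-skywalking-apm-8.1.0.tar.gz . After downloading and extracting it, the directory structure looks like this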
agent/
├── activations
│   ├── apm-toolkit-log4j-1.x-activation-8.1.0.jar
│   ├── apm-toolkit-log4j-2.x-activation-8.1.0.jar
│   ├── apm-toolkit-logback-1.x-activation-8.1.0.jar
│   ├── apm-toolkit-meter-activation-8.1.0.jar
│   ├── apm-toolkit-opentracing-activation-8.1.0.jar
│   └── apm-toolkit-trace-activation-8.1.0.jar
├── bootstrap-plugins
│   ├── apm-jdk-http-plugin-8.1.0.jar
│   └── apm-jdk-threading-plugin-8.1.0.jar
├── config
│   └── agent.config    # the agent configuration file; a few settings need changing
├── logs
├── optional-plugins
│   ├── apm-customize-enhance-plugin-8.1.0.jar
│   ├── apm-gson-2.x-plugin-8.1.0.jar
│   ├── apm-kotlin-coroutine-plugin-8.1.0.jar
│   ├── apm-spring-annotation-plugin-8.1.0.jar
│   ├── apm-spring-cloud-gateway-2.0.x-plugin-8.1.0.jar
│   ├── apm-spring-cloud-gateway-2.1.x-plugin-8.1.0.jar
│   ├── apm-spring-tx-plugin-8.1.0.jar
│   ├── apm-trace-ignore-plugin-8.1.0.jar
│   └── apm-zookeeper-3.4.x-plugin-8.1.0.jar
├── optional-reporter-plugins
│   └── kafka-reporter-plugin-8.1.0.jar
├── plugins
│   ├── apm-activemq-5.x-plugin-8.1.0.jar
│   ├── apm-armeria-0.84.x-plugin-8.1.0.jar
│   ├── apm-armeria-0.85.x-plugin-8.1.0.jar
│   ├── apm-avro-plugin-8.1.0.jar
│   ├── apm-canal-1.x-plugin-8.1.0.jar
│   ├── apm-cassandra-java-driver-3.x-plugin-8.1.0.jar
│   ├── apm-dubbo-2.7.x-plugin-8.1.0.jar
│   ├── apm-dubbo-plugin-8.1.0.jar
│   ├── apm-ehcache-2.x-plugin-8.1.0.jar
│   ├── apm-elastic-job-2.x-plugin-8.1.0.jar
│   ├── apm-elasticsearch-5.x-plugin-8.1.0.jar
│   ├── apm-elasticsearch-6.x-plugin-8.1.0.jar
│   ├── apm-feign-default-http-9.x-plugin-8.1.0.jar
│   ├── apm-finagle-6.25.x-plugin-8.1.0.jar
│   ├── apm-grpc-1.x-plugin-8.1.0.jar
│   ├── apm-h2-1.x-plugin-8.1.0.jar
│   ├── apm-httpasyncclient-4.x-plugin-8.1.0.jar
│   ├── apm-httpclient-3.x-plugin-8.1.0.jar
│   ├── apm-httpClient-4.x-plugin-8.1.0.jar
│   ├── apm-hystrix-1.x-plugin-8.1.0.jar
│   ├── apm-influxdb-2.x-plugin-8.1.0.jar
│   ├── apm-jdbc-commons-8.1.0.jar
│   ├── apm-jedis-2.x-plugin-8.1.0.jar
│   ├── apm-jetty-client-9.0-plugin-8.1.0.jar
│   ├── apm-jetty-client-9.x-plugin-8.1.0.jar
│   ├── apm-jetty-server-9.x-plugin-8.1.0.jar
│   ├── apm-kafka-plugin-8.1.0.jar
│   ├── apm-lettuce-5.x-plugin-8.1.0.jar
│   ├── apm-light4j-plugin-8.1.0.jar
│   ├── apm-mariadb-2.x-plugin-8.1.0.jar
│   ├── apm-mongodb-2.x-plugin-8.1.0.jar
│   ├── apm-mongodb-3.x-plugin-8.1.0.jar
│   ├── apm-mysql-5.x-plugin-8.1.0.jar
│   ├── apm-mysql-6.x-plugin-8.1.0.jar
│   ├── apm-mysql-8.x-plugin-8.1.0.jar
│   ├── apm-mysql-commons-8.1.0.jar
│   ├── apm-netty-socketio-plugin-8.1.0.jar
│   ├── apm-nutz-http-1.x-plugin-8.1.0.jar
│   ├── apm-nutz-mvc-annotation-1.x-plugin-8.1.0.jar
│   ├── apm-okhttp-3.x-plugin-8.1.0.jar
│   ├── apm-play-2.x-plugin-8.1.0.jar
│   ├── apm-postgresql-8.x-plugin-8.1.0.jar
│   ├── apm-pulsar-plugin-8.1.0.jar
│   ├── apm-quasar-plugin-8.1.0.jar
│   ├── apm-rabbitmq-5.x-plugin-8.1.0.jar
│   ├── apm-redisson-3.x-plugin-8.1.0.jar
│   ├── apm-resttemplate-4.3.x-plugin-8.1.0.jar
│   ├── apm-rocketmq-3.x-plugin-8.1.0.jar
│   ├── apm-rocketmq-4.x-plugin-8.1.0.jar
│   ├── apm-servicecomb-java-chassis-0.x-plugin-8.1.0.jar
│   ├── apm-servicecomb-java-chassis-1.x-plugin-8.1.0.jar
│   ├── apm-sharding-jdbc-1.5.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-3.x-plugin-8.1.0.jar
│   ├── apm-shardingsphere-4.0.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.1.0-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.x-plugin-8.1.0.jar
│   ├── apm-sharding-sphere-4.x-rc3-plugin-8.1.0.jar
│   ├── apm-solrj-7.x-plugin-8.1.0.jar
│   ├── apm-spring-async-annotation-plugin-8.1.0.jar
│   ├── apm-spring-cloud-feign-1.x-plugin-8.1.0.jar
│   ├── apm-spring-cloud-feign-2.x-plugin-8.1.0.jar
│   ├── apm-spring-concurrent-util-4.x-plugin-8.1.0.jar
│   ├── apm-spring-core-patch-8.1.0.jar
│   ├── apm-springmvc-annotation-3.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-4.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-5.x-plugin-8.1.0.jar
│   ├── apm-springmvc-annotation-commons-8.1.0.jar
│   ├── apm-spring-webflux-5.x-plugin-8.1.0.jar
│   ├── apm-spymemcached-2.x-plugin-8.1.0.jar
│   ├── apm-struts2-2.x-plugin-8.1.0.jar
│   ├── apm-undertow-2.x-plugin-8.1.0.jar
│   ├── apm-vertx-core-3.x-plugin-8.1.0.jar
│   ├── apm-xmemcached-2.x-plugin-8.1.0.jar
│   ├── baidu-brpc-plugin-8.1.0.jar
│   ├── dubbo-2.7.x-conflict-patch-8.1.0.jar
│   ├── dubbo-conflict-patch-8.1.0.jar
│   ├── graphql-12.x-plugin-8.1.0.jar
│   ├── graphql-8.x-plugin-8.1.0.jar
│   ├── graphql-9.x-plugin-8.1.0.jar
│   ├── motan-plugin-8.1.0.jar
│   ├── resteasy-server-3.x-plugin-8.1.0.jar
│   ├── sofa-rpc-plugin-8.1.0.jar
│   ├── spring-commons-8.1.0.jar
│   └── tomcat-7.x-8.x-plugin-8.1.0.jar
└── skywalking-agent.jar    # the agent (probe) jar for this version
We edit agent/config/agent.config; the result is as follows
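[root@devops-bj-yz-dx1 conf.d]# grep '^[a-z]' agent/config/agent.config
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:skywalking-oap.default:11800}
logging.file_name=${SW_LOGGING_FILE_NAME:skywalking-api.log}
logging.level=${SW_LOGGING_LEVEL:ERROR}
agent.service_name: our workloads all run in containers built from images, so the default is left unchanged here.
collector.backend_service: the address and port of the server side. This one matters. Per our k8s YAML, the server's Service is named skywalking-oap, lives in the default namespace, and listens on port 11800.
logging.file_name: the log file name; a matter of personal preference.
logging.level: the log level; the default is INFO.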
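Every value in agent.config has the form ${ENV_NAME:default}. As I understand it, the agent uses the environment variable when it is set and falls back to the text after the colon otherwise, which is why the same image can serve several services. The function below is only a simplified model of that lookup written in shell; it is not the agent's own code, and the function name is my own.

```shell
# resolve_placeholder ENV_NAME DEFAULT
# Simplified model of a ${ENV_NAME:DEFAULT} entry in agent.config:
# print the environment variable's value if it is set and non-empty,
# otherwise print DEFAULT.
resolve_placeholder() {
  eval "val=\${$1:-}"
  if [ -n "$val" ]; then
    echo "$val"
  else
    echo "$2"
  fi
}

# Example:
#   resolve_placeholder SW_AGENT_COLLECTOR_BACKEND_SERVICES skywalking-oap.default:11800
```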
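What remains is to put the agent into the image at build time. Adding COPY agent /root/agent to the Dockerfile places the directory under /root/ in the container. Then we start our Java pods. A service usually runs as several pods, but they all represent the same service, so pods of the same kind should share one ApplicationName. SkyWalking then aggregates the data from same-named applications, and you can still drill down into an individual pod. For example, the pods www-baidu-com-xxxxx-xxxxx and www-baidu-com-yyyy-yyyy should both use the same name, www-baidu-com or baidu, depending on your company's naming convention.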
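If you would rather derive the shared name than hard-code it, the pod name itself can be trimmed. This is a minimal sketch that assumes the pods are created by a Deployment, so their names end in two generated suffixes (ReplicaSet hash and pod hash); the function name is my own.

```shell
# service_name_from_pod POD_NAME
# Strip the last two dash-separated segments of a Deployment-managed pod name,
# e.g. www-baidu-com-xxxxx-xxxxx becomes www-baidu-com.
service_name_from_pod() {
  name="${1%-*}"      # drop the per-pod suffix
  name="${name%-*}"   # drop the ReplicaSet suffix
  echo "$name"
}

# Example inside a container, where HOSTNAME normally equals the pod name:
#   service_name_from_pod "$HOSTNAME"
```

Pods created by a StatefulSet or a bare Pod manifest follow different naming patterns, so this trimming does not apply to them.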
The last thing to do is to define the Java startup parameters as an env entry in the Java application's k8s YAML
env:
- name: JAVA_OPTS
  # The -javaagent path must match the path used in our Dockerfile, and the ApplicationName that follows is the project's name, i.e. the www-baidu-com from the example above
  value: "-server -Xms123m -Xmx456m -Xss789k -XX:+UseG1GC -Dfile.encoding=UTF-8 -Dserver.port=6666 -javaagent:/root/agent/skywalking-agent.jar -Dskywalking.agent.service_name=ApplicationName"
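Once the Java pod starts, it begins sending data to the server, and after a short wait the data shows up in the UI. One note: if the server misbehaves or goes down, the business application itself is unaffected; the agent only logs errors about failing to send SkyWalking data, and everything returns to normal once the server recovers. Also make sure the time range in the lower-right corner of the page is set correctly, otherwise you may see no data.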