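A First Look at End-to-End Tracing with OpenTelemetry: Instrumentation and Jaeger

Preface

One day a developer from a business team came over with a question.

  • Developer: My service is returning 504s, but I can't tell which step is failing. Each request touches 4 microservices, 2 databases, 1 Redis, and 1 message queue...
  • Long-suffering ops engineer: Stop, stop, say no more. We don't have distributed tracing yet, so all I can do is check the services with you by hand, one at a time.
  • First I asked him to walk me through the business logic and the access pattern, and 10 minutes went by. Then we went through every service and the resources it depends on, level by level, and finally located the fault half an hour later...

That is far too slow, so a tracing project went onto the agenda with a single goal: locate problems quickly and improve stability. OpenTelemetry is the current industry favorite for tracing, so this ops engineer decided to dig into it.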

Environment

Component           Version
Operating system    Ubuntu 22.04.4 LTS
opentelemetry-sdk   1.35.0

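Installation

First, a quick outline of how OpenTelemetry collects data. We will get something running before discussing the details.

  • Instrumentation points embedded in the code collect the data; this is the job of opentelemetry-sdk.
  • The data is then sent over a standard protocol to a backend that displays it; here that is the Jaeger UI.

Install the OpenTelemetry SDK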

pip3 install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api 

Install Jaeger to display the data

docker pull docker.m.daocloud.io/jaegertracing/all-in-one:latest

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.m.daocloud.io/jaegertracing/all-in-one:latest

Once the container is running, open http://127.0.0.1:16686


First example

Web service

First, prepare a web service. We use tornado here; install it with: pip3 install tornado

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Check that the service responds normally; curl http://localhost:10000 should return hello world.

Add instrumentation

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Register a global tracer provider; the service shows up in Jaeger as "s1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s1"}))
)
tracer = trace.get_tracer(__name__)
# Batch finished spans and send them to Jaeger's OTLP/HTTP port (4318).
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        views()
        self.finish('hello world')


def views():
    # A span is only handed to the exporter after end() is called.
    span = tracer.start_span("s1-span")
    span.end()


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Call curl http://localhost:10000 again, then open the Jaeger UI.

The data is there: the span we just added has been reported to Jaeger.

Span attributes

Add some attributes to enrich the span data

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    span.end()


Add tracing for database access

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    span.end()


def get_db(parent_ctx):
    span = tracer.start_span("s1-span-db", context=parent_ctx)
    span.end()


Add cross-service tracing

Add a second web service: s2.py

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s2"}))
)
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        # Rebuild the caller's trace context from the incoming request headers.
        ctx = TraceContextTextMapPropagator().extract(self.request.headers)
        # Start the span under that context so it joins the caller's trace.
        span = tracer.start_span("s2-span", context=ctx)
        span.end()
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(20000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt as e:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Modify s1.py

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests  # third-party package: pip3 install requests


def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    # Write the trace context into the outgoing headers so s2 can continue the trace.
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get("http://localhost:20000", headers=headers)
    span.end()
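What actually travels between s1 and s2 is a single HTTP header. TraceContextTextMapPropagator follows the W3C Trace Context format, whose traceparent header has the shape version-traceid-spanid-flags. The following standard-library sketch builds and parses such a header by hand; it is an illustration of the wire format, not the library's implementation, and the helper names are made up for this example.

```python
import re

# W3C Trace Context: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Hypothetical helper: format ids as a version-00 traceparent value."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Hypothetical helper: return the ids as a dict, or None if malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    _version, trace_hex, span_hex, flags_hex = match.groups()
    trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
    if trace_id == 0 or span_id == 0:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags_hex, 16) & 0x01),
    }


headers = {
    "traceparent": build_traceparent(
        0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7
    )
}
print(headers["traceparent"])
# -> 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
parsed = parse_traceparent(headers["traceparent"])
```

Printing the headers dict in s1 right after inject is a quick way to confirm that a value of this shape is really being sent.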


Moving to Kubernetes

Jaeger

Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: docker.m.daocloud.io/jaegertracing/all-in-one:latest
        imagePullPolicy: Always
        name: jaeger
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger-service
  name: jaeger-service
  namespace: default
spec:
  ports:
  - name: port-4317
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: port-4318
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: port-16686
    port: 16686
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger
  type: NodePort

s2

1) Build the image

Inside the k8s cluster Jaeger is reached through its Service, so s2.py needs a small change


s2.py

...
import os

JAEGER_ADDR=os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
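One thing to keep in mind: os.environ.get returns None when the variable is not set, so a pod that is missing JAEGER_ADDR would not fail loudly at that line. A minimal sketch of a fallback follows; the helper name and the choice of the local address as the default are assumptions for illustration, not part of the original s2.py.

```python
import os

# Assumed default: the local all-in-one address used earlier in the article.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4318/v1/traces"


def otlp_endpoint(environ=os.environ) -> str:
    """Hypothetical helper: JAEGER_ADDR if set and non-empty, else the default."""
    value = environ.get("JAEGER_ADDR", "").strip()
    return value or DEFAULT_OTLP_ENDPOINT


# Plain dicts stand in for the process environment in this demo.
local = otlp_endpoint({})
in_cluster = otlp_endpoint({"JAEGER_ADDR": "http://jaeger-service:4318/v1/traces"})
print(local)
print(in_cluster)
```

The same script can then run unchanged on a laptop and in the cluster. The OpenTelemetry exporters also understand the standard OTEL_EXPORTER_OTLP_TRACES_ENDPOINT variable, which may be worth considering instead of a custom name.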

Dockerfile

FROM python:3.8

WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s2.py /opt
CMD python3 s2.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s2
  name: s2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s2
  template:
    metadata:
      labels:
        app: s2
    spec:
      containers:
      - env:
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s2:v1
        imagePullPolicy: Always
        name: s2
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s2-service
  name: s2-service
  namespace: default
spec:
  ports:
  - name: s2-port
    port: 20000
    protocol: TCP
    targetPort: 20000
  selector:
    app: s2
  type: NodePort

s1

1) Build the image

Inside the k8s cluster both s2 and Jaeger are reached through their Services, so s1.py needs a small change

s1.py

...
import os

S2_ADDR=os.environ.get('S2_ADDR')
JAEGER_ADDR=os.environ.get('JAEGER_ADDR')

...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get(S2_ADDR, headers=headers)
    span.end()

...

Dockerfile:

FROM python:3.8

WORKDIR /opt
# s1.py imports requests, so it has to be installed in the image as well
RUN pip3 install tornado requests opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s1.py /opt
CMD python3 s1.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s1
  name: s1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s1
  template:
    metadata:
      labels:
        app: s1
    spec:
      containers:
      - env:
        - name: S2_ADDR
          value: http://s2-service:20000
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s1:v1
        imagePullPolicy: Always
        name: s1
      dnsPolicy: ClusterFirst
      restartPolicy: Always

---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: s1-service
  name: s1-service
  namespace: default
spec:
  ports:
  - name: s1-port
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    app: s1
  type: NodePort

Check the result

▶ kubectl get pod -owide
NAME                     READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
jaeger-6669cd7c4-4pl5j   1/1     Running   0          7m31s   10.244.0.236   minikube   <none>           <none>
s1-5c569c5b4b-lctzq      1/1     Running   0          73s     10.244.0.237   minikube   <none>           <none>
s2-5bb648dcdf-mlnbj      1/1     Running   0          61s     10.244.0.238   minikube   <none>           <none>

▶ kubectl get svc
NAME             TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
jaeger-service   NodePort   10.106.13.217    <none>        4317:31891/TCP,4318:31997/TCP,16686:31002/TCP   5m49s
s1-service       NodePort   10.102.25.195    <none>        10000:32376/TCP                                 4m23s
s2-service       NodePort   10.103.114.198   <none>        20000:30032/TCP                                 3m40s

Run a test:

  • Call the s1 service

    ▶ curl http://192.168.49.2:32376
    hello world
  • View the trace in the Jaeger UI at: http://192.168.49.2:31002/

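Summary

In this first example we collected the trace of a business service, that is, the full path a request travels, including database reads, cross-service calls, and so on.

Throughout the trace, trace_id and span_id play the decisive role: the former uniquely identifies the request's path and ties all the steps together, while the latter identifies each individual operation along that path.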
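To make the roles of the two ids concrete, here is a small standard-library sketch of how a backend could turn a flat list of spans into the tree shown in a trace view. The span records are made-up sample data with shortened ids, and render_trace is a hypothetical helper; Jaeger's real storage and rendering are of course more involved.

```python
from collections import defaultdict

# Hypothetical exported spans; ids are shortened for readability.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "s1-span"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "s1-span-db"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "s2-span"},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "name": "other-request"},
]


def render_trace(spans, trace_id):
    """Group spans by trace_id, then nest them by parent span_id."""
    children = defaultdict(list)
    for span in spans:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree = render_trace(spans, "t1")
print("\n".join(tree))
```

The trace_id selects which spans belong to one request, and each span's parent id decides where it hangs in the tree.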


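  • Collect: instrumentation embedded in the code captures the flows that matter most, such as database read/write latency and downstream service latency
  • Process: opentelemetry-sdk processes the data: filtering, buffering, batching
  • Export: the processed data is sent over a standard protocol (OTLP, carried over gRPC or HTTP) to a backend system such as Jaeger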



That's all for this post.
My knowledge is limited, so if I have slipped up anywhere, corrections are very welcome...
