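Preface
One day a developer from a business team came by with a question:
- Developer: My service is returning 504s, but I can't tell which step is failing. Each request touches 4 microservices, 2 databases, 1 Redis, 1 message queue...
- Long-suffering ops: Stop, stop, say no more. We don't support distributed tracing yet, so all I can do is check the services one by one by hand.
- I first asked him to walk me through the business logic and how the service is called; ten minutes gone. Then I checked each service and the resources it depends on, level by level, and finally located the fault half an hour later...
That is far too slow, so a tracing project went on the agenda with a single goal: locate problems quickly and improve stability. OpenTelemetry is the current industry favorite for tracing, so this ops engineer decided to dig into it.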
Environment
| Component | Version |
|---|---|
| Operating system | Ubuntu 22.04.4 LTS |
| opentelemetry-sdk | 1.35.0 |
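Installation
First, a quick look at how OpenTelemetry collects data; then we'll get something running before discussing the details.
- OpenTelemetry gathers data through instrumentation points embedded in the code; this is the job of opentelemetry-sdk
- The data is then uploaded over a standard protocol to a backend that displays it; here that is Jaeger UI
Install the OpenTelemetry SDK
```bash
pip3 install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api
```
Install Jaeger UI for data display
```bash
docker pull docker.m.daocloud.io/jaegertracing/all-in-one:latest
docker run -d --name jaeger -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 -p 4318:4318 docker.m.daocloud.io/jaegertracing/all-in-one:latest
```
Once the container is up, open: http://127.0.0.1:16686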

First example
Web service
Start with a simple web service. Here we use tornado; install it with: pip3 install tornado
```python
import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()
```
Check that the service responds normally (for example with curl http://localhost:10000):

Add instrumentation
```python
import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Register a tracer provider whose resource names this service "s1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s1"}))
)
tracer = trace.get_tracer(__name__)

# Export finished spans in batches to Jaeger's OTLP/HTTP endpoint.
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        views()
        self.finish('hello world')


def views():
    # Instrumentation point: start a span and end it right away.
    span = tracer.start_span("s1-span")
    span.end()


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()
```
Request the service again with curl http://localhost:10000 , then open Jaeger UI to take a look


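The data is there: the span we just added has been reported to Jaeger UI.
Span attributes
Let's enrich the span with some attributes
```python
def views():
    span = tracer.start_span("s1-span")
    # Attributes are key/value pairs attached to the span.
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    span.end()
```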

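Add database access tracing
```python
def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    # Build a context that carries this span so it can act as the parent.
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    span.end()


def get_db(parent_ctx):
    # Child span: same trace as the parent, with its own span id.
    span = tracer.start_span("s1-span-db", context=parent_ctx)
    span.end()
```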

Add cross-service tracing
Add a second web service: s2.py
```python
import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s2"}))
)
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        # Rebuild the caller's trace context from the incoming request headers.
        ctx = TraceContextTextMapPropagator().extract(self.request.headers)
        span = tracer.start_span("s2-span", context=ctx)
        span.end()
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(20000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()
```
Modify s1.py
```python
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests


def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    # Write the trace context into the outgoing headers so s2 can continue the trace.
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get("http://localhost:20000", headers=headers)
    span.end()
```
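What inject writes into the headers, and what extract reads back on the s2 side, is the W3C Trace Context traceparent header: a version, a 32-hex-digit trace id, a 16-hex-digit parent span id, and a flags byte. To make that concrete, here is a standard-library-only sketch of the format. The helper names build_traceparent and parse_traceparent are made up for illustration; in the real services the propagator does this work.

```python
import re

# Version "00", then trace id (32 hex), parent span id (16 hex), flags (2 hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format ids the way they travel in the traceparent HTTP header."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = (int(group, 16) for group in match.groups())
    if trace_id == 0 or span_id == 0:
        # All-zero ids are invalid in W3C Trace Context.
        return None
    return trace_id, span_id, bool(flags & 0x01)


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
```

Because s2 starts its span from the extracted context, it reuses the trace id from this header and records the header's span id as its parent, which is why both services show up in one trace in Jaeger.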

Moving to k8s
jaeger
Manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: docker.m.daocloud.io/jaegertracing/all-in-one:latest
        imagePullPolicy: Always
        name: jaeger
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger-service
  name: jaeger-service
  namespace: default
spec:
  ports:
  - name: port-4317
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: port-4318
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: port-16686
    port: 16686
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger
  type: NodePort
```
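s2
1) Build the image
Inside the k8s cluster, Jaeger is reached through its Service, so s2.py needs a small change
s2.py
```python
...
import os

# The OTLP endpoint now comes from the environment instead of being hard-coded.
JAEGER_ADDR = os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
```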
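Note that os.environ.get returns None when JAEGER_ADDR is not set, so the same file behaves differently on a laptop and in the cluster. One possible way to keep both working is to fall back to the local endpoint used earlier. This is a sketch only: the helper otlp_endpoint and the constant DEFAULT_OTLP_ENDPOINT are hypothetical names, not part of the original code.

```python
import os

# Assumption: fall back to the local Jaeger endpoint from the docker setup.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4318/v1/traces"


def otlp_endpoint(environ=os.environ) -> str:
    """Return JAEGER_ADDR if it is set and non-blank, else the local default."""
    value = environ.get("JAEGER_ADDR", "").strip()
    return value or DEFAULT_OTLP_ENDPOINT
```

The result would then be passed as OTLPSpanExporter(endpoint=otlp_endpoint()), so the manifest's JAEGER_ADDR value wins in k8s while a plain python3 s2.py still reports to the local container.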
Dockerfile
```dockerfile
FROM python:3.8
WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s2.py /opt
CMD python3 s2.py
```
2) Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s2
  name: s2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s2
  template:
    metadata:
      labels:
        app: s2
    spec:
      containers:
      - env:
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s2:v1
        imagePullPolicy: Always
        name: s2
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: s2-service
  name: s2-service
  namespace: default
spec:
  ports:
  - name: s2-port
    port: 20000
    protocol: TCP
    targetPort: 20000
  selector:
    app: s2
  type: NodePort
```
s1
1) Build the image
Inside the k8s cluster, s2 and Jaeger are both reached through their Services, so s1.py needs a small change
s1.py
```python
...
import os

# Both downstream addresses now come from the environment.
S2_ADDR = os.environ.get('S2_ADDR')
JAEGER_ADDR = os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get(S2_ADDR, headers=headers)
    span.end()
...
```
Dockerfile:
```dockerfile
FROM python:3.8
WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s1.py /opt
CMD python3 s1.py
```
2) Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s1
  name: s1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s1
  template:
    metadata:
      labels:
        app: s1
    spec:
      containers:
      - env:
        - name: S2_ADDR
          value: http://s2-service:20000
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s1:v1
        imagePullPolicy: Always
        name: s1
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: s1-service
  name: s1-service
  namespace: default
spec:
  ports:
  - name: s1-port
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    app: s1
  type: NodePort
```
Check the result
```
▶ kubectl get pod -owide
NAME                     READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
jaeger-6669cd7c4-4pl5j   1/1     Running   0          7m31s   10.244.0.236   minikube   <none>           <none>
s1-5c569c5b4b-lctzq      1/1     Running   0          73s     10.244.0.237   minikube   <none>           <none>
s2-5bb648dcdf-mlnbj      1/1     Running   0          61s     10.244.0.238   minikube   <none>           <none>

▶ kubectl get svc
NAME             TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
jaeger-service   NodePort   10.106.13.217    <none>        4317:31891/TCP,4318:31997/TCP,16686:31002/TCP   5m49s
s1-service       NodePort   10.102.25.195    <none>        10000:32376/TCP                                 4m23s
s2-service       NodePort   10.103.114.198   <none>        20000:30032/TCP                                 3m40s
```
Run a test:
- Request the s1 service
```
▶ curl http://192.168.49.2:32376
hello world
```
- View the traces in Jaeger UI at: http://192.168.49.2:31002/

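Summary
In this first example we mainly collected trace records for the business services, that is, the full path a request travels, including database reads, cross-service calls, and so on.
Throughout the trace, trace_id and span_id do the essential work: the former uniquely identifies the request chain and ties all of its steps together, while the latter identifies each individual operation on that chain.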
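To illustrate how these two ids are enough to rebuild a request's path, here is a small standard-library sketch. The span records and short ids are made-up sample data, and render_trace is a hypothetical helper; a backend such as Jaeger does something similar, at much larger scale, with the real ids.

```python
from collections import defaultdict

# Hypothetical flattened span records: (trace_id, span_id, parent_span_id, name).
spans = [
    ("t1", "a", None, "s1-span"),
    ("t1", "b", "a", "s1-span-db"),
    ("t1", "c", "a", "s2-span"),
    ("t2", "d", None, "s1-span"),
]


def render_trace(records, trace_id):
    """Group spans by trace_id, then indent each span under its parent."""
    children = defaultdict(list)
    for tid, span_id, parent_id, name in records:
        if tid == trace_id:
            children[parent_id].append((span_id, name))

    lines = []

    def walk(parent_id, depth):
        for span_id, name in children.get(parent_id, []):
            lines.append("  " * depth + name)
            walk(span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree = render_trace(spans, "t1")
```

The trace id selects which spans belong to one request; the span id and parent span id give the nesting, so s1-span-db and s2-span both hang under s1-span.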

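- Collect: instrument the code to capture the flows that matter most, such as database read/write latency and downstream service latency
- Process: opentelemetry-sdk processes the data: filtering, buffering, batching
- Export: send the processed data to a backend such as Jaeger over a standard protocol (OTLP, carried over gRPC or HTTP)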
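The process and export stages can be pictured with a toy buffer like the one below. It is a deliberately simplified stand-in, not the SDK's implementation: the class BatchBuffer and its parameters are invented for illustration, and the real BatchSpanProcessor additionally flushes on a timer and bounds its queue.

```python
class BatchBuffer:
    """Toy model of "buffer finished spans, export them in batches"."""

    def __init__(self, export, max_batch_size=3):
        self._export = export          # callable that receives a list of spans
        self._max = max_batch_size
        self._pending = []

    def on_end(self, span):
        # Called when a span finishes: buffer it and export once the batch is full.
        self._pending.append(span)
        if len(self._pending) >= self._max:
            self.flush()

    def flush(self):
        # Hand over whatever is buffered, e.g. at shutdown.
        if self._pending:
            batch, self._pending = self._pending, []
            self._export(batch)


sent = []  # stands in for the network call to the backend
buffer = BatchBuffer(export=sent.append, max_batch_size=2)
for name in ["s1-span-db", "s2-span", "s1-span"]:
    buffer.on_end({"name": name})
buffer.flush()
```

Batching trades a little latency for far fewer requests to the backend, which is also why a span may take a few seconds to appear in Jaeger UI after the request completes.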


That's the end of this article.
My knowledge is limited; if I've spilled anything along the way, corrections are very welcome...
