Grafana 系列文章（十四）：Helm 安装Loki

技术分享 3年前 (2023-02-12) 0 999+

前言

写或者翻译这么多篇 Loki 相关的文章了, 发现还没写怎么安装 😓

现在开始介绍如何使用 Helm 安装 Loki.

前提

有 Helm, 并且添加 Grafana 的官方源:

helm repo add grafana https://grafana.github.io/helm-charts helm repo update

🐾Warning:

网络受限, 需要保证网络通畅.

部署

架构

Promtail(收集) + Loki(存储及处理) + Grafana(展示)

Promtail

启用 Prometheus Operator Service Monitor 做监控
增加external_labels - cluster, 以识别是哪个 K8S 集群;
pipeline_stages 改为 cri, 以对 cri 日志做处理(因为我的集群用的 Container Runtime 是 CRI, 而 Loki Helm 默认配置是 docker)
增加对 systemd-journal 的日志收集:

promtail:   config:     snippets:       pipelineStages:         - cri: {}    extraArgs:      - -client.external-labels=cluster=ctyun   # systemd-journal 额外配置:   # Add additional scrape config   extraScrapeConfigs:     - job_name: journal       journal:         path: /var/log/journal         max_age: 12h         labels:           job: systemd-journal       relabel_configs:         - source_labels: ['__journal__systemd_unit']           target_label: 'unit'         - source_labels: ['__journal__hostname']           target_label: 'hostname'    # Mount journal directory into Promtail pods   extraVolumes:     - name: journal       hostPath:         path: /var/log/journal    extraVolumeMounts:     - name: journal       mountPath: /var/log/journal       readOnly: true

Loki

启用持久化存储
启用 Prometheus Operator Service Monitor 做监控
1. 并配置 Loki 相关 Prometheus Rule 做告警
因为个人集群日志量较小, 适当调大 ingester 相关配置

Grafana

启用持久化存储
启用 Prometheus Operator Service Monitor 做监控
sidecar 都配置上, 方便动态更新 dashboards/datasources/plugins/notifiers;

Helm 安装

通过如下命令安装:

helm upgrade --install loki --namespace=loki --create-namespace grafana/loki-stack -f values.yaml

自定义 values.yaml 如下:

loki:   enabled: true   persistence:     enabled: true     storageClassName: local-path     size: 20Gi   serviceScheme: https   user: admin   password: changit!   config:     ingester:       chunk_idle_period: 1h       max_chunk_age: 4h     compactor:       retention_enabled: true   serviceMonitor:     enabled: true     prometheusRule:       enabled: true       rules:         #  Some examples from https://awesome-prometheus-alerts.grep.to/rules.html#loki         - alert: LokiProcessTooManyRestarts           expr: changes(process_start_time_seconds{job=~"loki"}[15m]) > 2           for: 0m           labels:             severity: warning           annotations:             summary: Loki process too many restarts (instance {{ $labels.instance }})             description: "A loki process had too many restarts (target {{ $labels.instance }})n  VALUE = {{ $value }}n  LABELS = {{ $labels }}"         - alert: LokiRequestErrors           expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10           for: 15m           labels:             severity: critical           annotations:             summary: Loki request errors (instance {{ $labels.instance }})             description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errorsn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"         - alert: LokiRequestPanic           expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0           for: 5m           labels:             severity: critical           annotations:             summary: Loki request panic (instance {{ $labels.instance }})             description: "The {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% increase of panicsn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"         - alert: LokiRequestLatency           expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1           for: 5m           labels:             severity: critical           annotations:             summary: Loki request latency (instance {{ $labels.instance }})             description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latencyn  VALUE = {{ $value }}n  LABELS = {{ $labels }}"  promtail:   enabled: true   config:     snippets:       pipelineStages:         - cri: {}     extraArgs:     - -client.external-labels=cluster=ctyun           serviceMonitor:     # -- If enabled, ServiceMonitor resources for Prometheus Operator are created     enabled: true    # systemd-journal 额外配置:   # Add additional scrape config   extraScrapeConfigs:     - job_name: journal       journal:         path: /var/log/journal         max_age: 12h         labels:           job: systemd-journal       relabel_configs:         - source_labels: ['__journal__systemd_unit']           target_label: 'unit'         - source_labels: ['__journal__hostname']           target_label: 'hostname'    # Mount journal directory into Promtail pods   extraVolumes:     - name: journal       hostPath:         path: /var/log/journal    extraVolumeMounts:     - name: journal       mountPath: /var/log/journal       readOnly: true  fluent-bit:   enabled: false  grafana:   enabled: true   adminUser: caseycui   adminPassword: changit!   ## Sidecars that collect the configmaps with specified label and stores the included files them into the respective folders   ## Requires at least Grafana 5 to work and can't be used together with parameters dashboardProviders, datasources and dashboards   sidecar:     image:       repository: quay.io/kiwigrid/k8s-sidecar       tag: 1.15.6       sha: ''     dashboards:       enabled: true       SCProvider: true       label: grafana_dashboard     datasources:       enabled: true       # label that the configmaps with datasources are marked with       label: grafana_datasource     plugins:       enabled: true       # label that the configmaps with plugins are marked with       label: grafana_plugin     notifiers:       enabled: true       # label that the configmaps with notifiers are marked with       label: grafana_notifier   image:     tag: 8.3.5   persistence:     enabled: true     size: 2Gi     storageClassName: local-path   serviceMonitor:     enabled: true   imageRenderer:     enabled: disable  filebeat:   enabled: false  logstash:   enabled: false

安装后的资源拓扑如下:

Day 2 配置(按需)

Grafana 增加 Dashboards

在同一个 NS 下, 创建如下 ConfigMap: (只要打上grafana_dashboard 这个 label 就会被 Grafana 的 sidecar 自动导入)

apiVersion: v1 kind: ConfigMap metadata:   name: sample-grafana-dashboard   labels:      grafana_dashboard: "1" data:   k8s-dashboard.json: |-   [...]

Grafana 增加 DataSource

在同一个 NS 下, 创建如下 ConfigMap: (只要打上grafana_datasource 这个 label 就会被 Grafana 的 sidecar 自动导入)

apiVersion: v1 kind: ConfigMap metadata:   name: loki-loki-stack   labels:     grafana_datasource: '1' data:   loki-stack-datasource.yaml: |-     apiVersion: 1     datasources:     - name: Loki       type: loki       access: proxy       url: http://loki:3100       version: 1

Traefik 配置 Grafana IngressRoute

因为我是用的 Traefik 2, 通过 CRD IngressRoute 配置 Ingress, 配置如下:

apiVersion: traefik.containo.us/v1alpha1 kind: IngressRoute metadata:   name: grafana spec:   entryPoints:     - web     - websecure   routes:     - kind: Rule       match: Host(`grafana.ewhisper.cn`)       middlewares:         - name: hsts-header           namespace: kube-system         - name: redirectshttps           namespace: kube-system       services:         - name: loki-grafana           namespace: monitoring           port: 80   tls: {}