Velero Series (Part 4): Hands-On Production Migration with Velero

Overview

Purpose

Using the velero tool, the overall goal is:

  • Migrate a specific namespace between clusters B and A;

The specific objectives are:

  1. Set up velero (including restic) on clusters B and A
  2. Back up the specific namespace caseycui2020 on cluster B:
    1. Back up resources such as deployments, configmaps, etc.;
      1. Before backing up, exclude the YAML of certain secrets.
    2. Back up volume data (via restic);
      1. Use the "opt-in" approach to back up only specific pod volumes
  3. Migrate the specific namespace caseycui2020 to cluster A:
    1. Migrate resources, using include filters to migrate only specific resources;
    2. Migrate volume data (via restic).

Installation

  1. Create a Velero-specific credentials file (credentials-velero) in your local directory:

    XSKY object storage is used here (the company's NetApp object storage is not compatible):

    [default]
    aws_access_key_id = xxxxxxxxxxxxxxxxxxxxxxxx
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  2. (OpenShift) The namespace velero needs to be created first: oc new-project velero

  3. By default, a user-created OpenShift namespace does not schedule pods on all nodes in the cluster.

    To schedule the namespace's pods on all nodes, an annotation is needed:

    oc annotate namespace velero openshift.io/node-selector="" 

    This should be done before installing Velero.

  4. Start the server and the storage services. In the Velero directory, run:

    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.0.0 \
      --bucket velero \
      --secret-file ./credentials-velero \
      --use-restic \
      --use-volume-snapshots=true \
      --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4" \
      --snapshot-location-config region="default"

    The resources created include:

    CustomResourceDefinition/backups.velero.io: attempting to create resource
    CustomResourceDefinition/backups.velero.io: created
    CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
    CustomResourceDefinition/backupstoragelocations.velero.io: created
    CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
    CustomResourceDefinition/deletebackuprequests.velero.io: created
    CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
    CustomResourceDefinition/downloadrequests.velero.io: created
    CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
    CustomResourceDefinition/podvolumebackups.velero.io: created
    CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
    CustomResourceDefinition/podvolumerestores.velero.io: created
    CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource
    CustomResourceDefinition/resticrepositories.velero.io: created
    CustomResourceDefinition/restores.velero.io: attempting to create resource
    CustomResourceDefinition/restores.velero.io: created
    CustomResourceDefinition/schedules.velero.io: attempting to create resource
    CustomResourceDefinition/schedules.velero.io: created
    CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
    CustomResourceDefinition/serverstatusrequests.velero.io: created
    CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
    CustomResourceDefinition/volumesnapshotlocations.velero.io: created
    Waiting for resources to be ready in cluster...
    Namespace/velero: attempting to create resource
    Namespace/velero: created
    ClusterRoleBinding/velero: attempting to create resource
    ClusterRoleBinding/velero: created
    ServiceAccount/velero: attempting to create resource
    ServiceAccount/velero: created
    Secret/cloud-credentials: attempting to create resource
    Secret/cloud-credentials: created
    BackupStorageLocation/default: attempting to create resource
    BackupStorageLocation/default: created
    VolumeSnapshotLocation/default: attempting to create resource
    VolumeSnapshotLocation/default: created
    Deployment/velero: attempting to create resource
    Deployment/velero: created
    DaemonSet/restic: attempting to create resource
    DaemonSet/restic: created
    Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
  5. (OpenShift) Add the velero ServiceAccount to the privileged SCC:

    $ oc adm policy add-scc-to-user privileged -z velero -n velero 
  6. (OpenShift) For OpenShift versions >= 4.1, modify the restic DaemonSet YAML to request privileged mode (a quick verification sketch is given after this list):

    @@ -67,3 +67,5 @@ spec:
               value: /credentials/cloud
             - name: VELERO_SCRATCH_DIR
               value: /scratch
    +          securityContext:
    +            privileged: true

    Or:

    oc patch ds/restic \
      --namespace velero \
      --type json \
      -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'
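After the installation and the privileged patch, a quick sanity check can be run. This is a minimal sketch, assuming the default velero namespace; the jsonpath query is just one way of confirming the securityContext:

# Check that the velero deployment and the restic DaemonSet pods are running
oc -n velero get pods

# Confirm that the restic containers now request privileged mode
oc -n velero get ds restic -o jsonpath='{.spec.template.spec.containers[0].securityContext}'

# Check the client and server versions
velero version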

Backup - Cluster B

Back up specific resources at the cluster level

velero backup create <backup-name> --include-cluster-resources=true  --include-resources deployments,configmaps 

View the backup

velero backup describe YOUR_BACKUP_NAME 

Back up the specific namespace caseycui2020

Exclude specific resources

Resources labeled velero.io/exclude-from-backup=true are not included in the backup, even if they also carry a matching selector label.

In this way, resources that do not need to be backed up, such as certain secrets, are excluded with the velero.io/exclude-from-backup=true label.

Some of the secrets excluded in this way are, for example:

builder-dockercfg-jbnzr
default-token-lshh8
pipeline-token-xt645
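As a minimal sketch, such a label can be applied and verified like this, using one of the secret names above as an example (the secret names in your namespace will differ):

# Label a secret so that Velero skips it during backup
oc -n caseycui2020 label secret builder-dockercfg-jbnzr velero.io/exclude-from-backup=true

# List the secrets that carry the exclusion label
oc -n caseycui2020 get secrets -l velero.io/exclude-from-backup=true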

Back up pod volumes with restic

🐾 Note:

In this namespace, the following two pod volumes also need to be backed up, although they are not yet in formal use:

  • mycoreapphttptask-callback
  • mycoreapphttptaskservice-callback

The backup is made selectively, using the "opt-in" approach.

  1. Run the following command for each pod that contains a volume to be backed up:

    oc -n caseycui2020 annotate pod/<mybackendapp-pod-name> backup.velero.io/backup-volumes=jmx-exporter-agent,pinpoint-agent,my-mybackendapp-claim
    oc -n caseycui2020 annotate pod/<elitegetrecservice-pod-name> backup.velero.io/backup-volumes=uploadfile

    Here, the volume names are the names of the volumes in the pod spec.

    For example, for the following pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sample
      namespace: foo
    spec:
      containers:
      - image: k8s.gcr.io/test-webserver
        name: test-webserver
        volumeMounts:
        - name: pvc-volume
          mountPath: /volume-1
        - name: emptydir-volume
          mountPath: /volume-2
      volumes:
      - name: pvc-volume
        persistentVolumeClaim:
          claimName: test-volume-claim
      - name: emptydir-volume
        emptyDir: {}

    You should run:

    kubectl -n foo annotate pod/sample backup.velero.io/backup-volumes=pvc-volume,emptydir-volume 

    If you use a controller to manage your pods, you can also put this annotation in the pod template spec; a patch sketch is given below.
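For example, the annotation could be added to a DeploymentConfig's pod template with a patch roughly like the following; the DeploymentConfig name is a placeholder, the volume list is taken from the commands above, and note that changing the template may trigger a new rollout:

# Add the backup annotation to the pod template so that every new pod is annotated automatically
oc -n caseycui2020 patch dc/<mybackendapp-dc-name> --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"jmx-exporter-agent,pinpoint-agent,my-mybackendapp-claim"}}}}}'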

Backup and verification

Back up the namespace and its objects, together with the pod volumes that carry the relevant annotation:

# production namespace
velero backup create caseycui2020 --include-namespaces caseycui2020 

View the backup

velero backup describe YOUR_BACKUP_NAME
velero backup logs caseycui2020
oc -n velero get podvolumebackups -l velero.io/backup-name=caseycui2020 -o yaml

The output of describe is as follows:

Name:         caseycui2020
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.3+2cf11e2
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18+

Phase:  Completed

Errors:    0
Warnings:  0

Namespaces:
  Included:  caseycui2020
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2020-10-21 09:28:16 +0800 CST
Completed:  2020-10-21 09:29:17 +0800 CST

Expiration:  2020-11-20 09:28:16 +0800 CST

Total items to be backed up:  591
Items backed up:              591

Velero-Native Snapshots: <none included>

Restic Backups (specify --details for more information):
  Completed:  3
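To see the per-volume restic information hinted at in the last line above, the same backup can be described with --details, for example:

# Show the individual restic pod volume backups included in this backup
velero backup describe caseycui2020 --details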

Scheduled backups

Create regularly scheduled backups based on a cron expression:

velero schedule create caseycui2020-b-daily --schedule="0 3 * * *" --include-namespaces caseycui2020 

Alternatively, you can use some non-standard shorthand cron expressions:

velero schedule create test-daily --schedule="@every 24h" --include-namespaces caseycui2020 

See the cron package documentation for more usage examples.
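Once a schedule exists, the schedule and the backups it produces can be checked roughly as follows; the velero.io/schedule-name label is the label Velero normally puts on scheduled backups, so treat it as an assumption here:

# List the configured schedules
velero schedule get

# List the backups created by a specific schedule
velero backup get -l velero.io/schedule-name=caseycui2020-b-daily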

Cluster migration - to cluster A

Using Backups and Restores

As long as you point each Velero instance at the same cloud object storage location, Velero can help you port resources from one cluster to another. This scenario assumes that your clusters are hosted by the same cloud provider. Note that Velero itself does not support migrating persistent volume snapshots across cloud providers. If you want to migrate volume data between cloud platforms, enable restic, which backs up volume contents at the filesystem level.

  1. (Cluster B) Assuming you have not already checkpointed your data with a Velero schedule operation, you first need to back up the entire cluster (replacing <BACKUP-NAME> as appropriate):

    velero backup create <BACKUP-NAME> 

    The default backup retention period, expressed as a TTL, is 30 days (720 hours); you can change it as needed with the --ttl <DURATION> flag. For more information on backup expiration, see how Velero works.

  2. (Cluster A) Configure the BackupStorageLocations and VolumeSnapshotLocations to point to the location used by cluster B, using velero backup-location create and velero snapshot-location create. Make sure the BackupStorageLocations is configured as read-only by passing the --access-mode=ReadOnly flag to velero backup-location create (since I only have one bucket, I did not configure read-only; a hedged sketch is given after this list). Below is the installation on cluster A, where the BackupStorageLocations and VolumeSnapshotLocations were already configured at install time.

    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.0.0 \
      --bucket velero \
      --secret-file ./credentials-velero \
      --use-restic \
      --use-volume-snapshots=true \
      --backup-location-config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn",insecureSkipTLSVerify="true",signatureVersion="4" \
      --snapshot-location-config region="default"
  3. (Cluster A) Make sure the Velero Backup object has been created. Velero resources are synchronized with the backup files in cloud storage.

    velero backup describe <BACKUP-NAME> 

    Note: the default sync interval is 1 minute, so make sure to wait before checking. You can configure this interval with the Velero server's --backup-sync-period flag.

  4. (Cluster A) Once you have confirmed that the right backup (<BACKUP-NAME>) now exists, you can restore everything with the following (since the backup contains only the caseycui2020 namespace, there is no need to filter with --include-namespaces caseycui2020):

    velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,persistentvolumeclaims,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindings.authorization.openshift.io,rolebindings.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io 

    Because verification later showed problems restoring persistentvolumeclaims, in subsequent runs the PVCs were left out first, to be dealt with later:

    velero restore create --from-backup caseycui2020 --include-resources buildconfigs.build.openshift.io,configmaps,deploymentconfigs.apps.openshift.io,imagestreams.image.openshift.io,imagestreamtags.image.openshift.io,imagetags.image.openshift.io,limitranges,namespaces,networkpolicies.networking.k8s.io,prometheusrules.monitoring.coreos.com,resourcequotas,rolebindings.authorization.openshift.io,rolebindings.rbac.authorization.k8s.io,routes.route.openshift.io,secrets,servicemonitors.monitoring.coreos.com,services,templateinstances.template.openshift.io 
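As mentioned in step 2, when the bucket is shared between clusters it is safer to register the backup location on cluster A as read-only. A minimal sketch, where the location name velero-b is only an illustrative name and the S3 options mirror the install command above:

# Register the shared bucket as a read-only backup location on cluster A
velero backup-location create velero-b \
  --provider aws \
  --bucket velero \
  --access-mode=ReadOnly \
  --config region="default",s3ForcePathStyle="true",s3Url="http://glacier.ewhisper.cn"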

Verify the two clusters

Check that the second cluster is running as expected:

  1. (Cluster A) Run:

    velero restore get 

    The output is as follows:

    NAME                          BACKUP         STATUS            STARTED   COMPLETED   ERRORS   WARNINGS   CREATED                         SELECTOR
    caseycui2020-20201021102342   caseycui2020   Failed            <nil>     <nil>       0        0          2020-10-21 10:24:14 +0800 CST   <none>
    caseycui2020-20201021103040   caseycui2020   PartiallyFailed   <nil>     <nil>       46       34         2020-10-21 10:31:12 +0800 CST   <none>
    caseycui2020-20201021105848   caseycui2020   InProgress        <nil>     <nil>       0        0          2020-10-21 10:59:20 +0800 CST   <none>
  2. Then run:

    velero restore describe <RESTORE-NAME-FROM-GET-COMMAND>
    oc -n velero get podvolumerestores -l velero.io/restore-name=YOUR_RESTORE_NAME -o yaml

    The output is as follows:

    Name:         caseycui2020-20201021102342
    Namespace:    velero
    Labels:       <none>
    Annotations:  <none>

    Phase:  InProgress

    Started:    <n/a>
    Completed:  <n/a>

    Backup:  caseycui2020

    Namespaces:
      Included:  all namespaces found in the backup
      Excluded:  <none>

    Resources:
      Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindings.authorization.openshift.io, rolebindings.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
      Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
      Cluster-scoped:  auto

    Namespace mappings:  <none>

    Label selector:  <none>

    Restore PVs:  auto

If you run into problems, make sure that Velero is running in the same namespace in both clusters.

The problem I ran into is that OpenShift uses imagestreams and imagetags, the corresponding images could not be pulled over, and so the containers did not start.

Since the containers did not start, the pod volumes were not restored successfully either.

Name:         caseycui2020-20201021110424
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs caseycui2020-20201021110424' for more information)

Started:    <n/a>
Completed:  <n/a>

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  could not restore, imagetags.image.openshift.io "mybackendapp:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                   could not restore, imagetags.image.openshift.io "mybackendappno:1.0.0" already exists. Warning: the in-cluster version is different than the backed-up version.
                   ...

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    caseycui2020:  error restoring imagestreams.image.openshift.io/caseycui2020/mybackendapp: ImageStream.image.openshift.io "mybackendapp" is invalid: []: Internal error: imagestreams "mybackendapp" is invalid: spec.tags[latest].from.name: Invalid value: "mybackendapp@sha256:6c5ab553a97c74ad602d2427a326124621c163676df91f7040b035fa64b533c7": error generating tag event: imagestreamimage.image.openshift.io ......

Backup:  caseycui2020

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        buildconfigs.build.openshift.io, configmaps, deploymentconfigs.apps.openshift.io, imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, imagetags.image.openshift.io, limitranges, namespaces, networkpolicies.networking.k8s.io, persistentvolumeclaims, prometheusrules.monitoring.coreos.com, resourcequotas, rolebindings.authorization.openshift.io, rolebindings.rbac.authorization.k8s.io, routes.route.openshift.io, secrets, servicemonitors.monitoring.coreos.com, services, templateinstances.template.openshift.io
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Summary of migration issues

The issues so far are summarized as follows:

  1. The images in imagestreams.image.openshift.io, imagestreamtags.image.openshift.io, and imagetags.image.openshift.io were not imported successfully; to be precise, the latest tag was not imported successfully. It also takes some time for imagestreamtags.image.openshift.io to take effect.

  2. The persistentvolumeclaims report an error after migration, as follows:

    phase: Lost 

    The reason is that the StorageClass configuration differs between clusters A and B, so a PVC from cluster B cannot be bound directly on cluster A. Moreover, a PVC cannot be modified in place once created; it has to be deleted and recreated.

  3. Route hostnames: some hostnames are specific to cluster A or B. For example, jenkins-caseycui2020.b.caas.ewhisper.cn has to be adjusted to jenkins-caseycui2020.a.caas.ewhisper.cn after migration to cluster A.

  4. The podVolume data was not migrated.

The latest tag was not imported successfully

Import it manually; the command is as follows (1.0.1 is the latest version in the ImageStream):

oc tag xxl-job-admin:1.0.1 xxl-job-admin:latest 
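To confirm that the tag now resolves, the imagestreamtag can be inspected, for example:

# Verify that the latest tag now points at the 1.0.1 image
oc -n caseycui2020 get imagestreamtag xxl-job-admin:latest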

The PVC phase Lost issue

If the PVC is created manually, its YAML needs to be adjusted. The PVC before and after adjustment is shown below.

Original YAML from cluster B:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: csi.trident.netapp.io
  selfLink: /api/v1/namespaces/caseycui2020/persistentvolumeclaims/jenkins
  resourceVersion: '77304786'
  name: jenkins
  uid: ffcabc42-845d-4cdf-8c7c-56e97cb5ea82
  creationTimestamp: '2020-10-21T03:05:46Z'
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:phase': {}
    - manager: velero-server
      operation: Update
      apiVersion: v1
      time: '2020-10-21T03:05:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:pv.kubernetes.io/bind-completed': {}
            'f:pv.kubernetes.io/bound-by-controller': {}
            'f:volume.beta.kubernetes.io/storage-provisioner': {}
          'f:labels':
            .: {}
            'f:app': {}
            'f:template': {}
            'f:template.openshift.io/template-instance-owner': {}
            'f:velero.io/backup-name': {}
            'f:velero.io/restore-name': {}
        'f:spec':
          'f:accessModes': {}
          'f:resources':
            'f:requests':
              .: {}
              'f:storage': {}
          'f:storageClassName': {}
          'f:volumeMode': {}
          'f:volumeName': {}
  namespace: caseycui2020
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
    velero.io/backup-name: caseycui2020
    velero.io/restore-name: caseycui2020-20201021110424
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  volumeName: pvc-414efafd-8b22-48da-8c20-6025a8e671ca
  storageClassName: nas-data
  volumeMode: Filesystem
status:
  phase: Lost

After adjustment:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jenkins
  namespace: caseycui2020
  labels:
    app: jenkins-persistent
    template: jenkins-persistent-monitored
    template.openshift.io/template-instance-owner: 5a0b28c3-c760-451b-b92f-a781406d9e91
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: nas-data
  volumeMode: Filesystem
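A minimal sketch of replacing the Lost PVC with the adjusted manifest, assuming it has been saved as jenkins-pvc.yaml (the file name is an assumption; make sure nothing still uses the old PVC before deleting it):

# Remove the PVC that was restored in the Lost phase
oc -n caseycui2020 delete pvc jenkins

# Recreate it from the adjusted manifest so it can bind against cluster A's StorageClass
oc -n caseycui2020 apply -f jenkins-pvc.yaml

# Watch it until the phase becomes Bound
oc -n caseycui2020 get pvc jenkins -w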

The podVolume data was not migrated

It can be migrated manually; the commands are as follows:

# Log in to cluster B
# First pull the /opt/prometheus data out of cluster B into the current folder
oc rsync xxl-job-admin-5-9sgf7:/opt/prometheus .
# The rsync command above creates a prometheus directory
cd prometheus
# Log in to cluster A
# Then copy the data in (make sure the pod has started before copying)
# (you can remove `JAVA_OPTS` first)
oc rsync ./ xxl-job-admin-2-6k8df:/opt/prometheus/
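A rough way to confirm the copy afterwards, reusing the pod name from the example above (pod names and paths will differ in your environment, and the caseycui2020 project is assumed to be selected):

# Compare the contents on cluster A with what was pulled from cluster B
oc exec xxl-job-admin-2-6k8df -- ls -l /opt/prometheus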

Summary

This article was written fairly early on; OpenShift has since released its own migration tooling built on top of Velero, and migrations can be done directly with the tools it provides.

In addition, OpenShift clusters impose many restrictions, and there are many resources that are specific to OpenShift, so in practice the differences from standard Kubernetes are considerable and need careful attention.

Although this attempt failed, the approach is still worth learning from.


When three walk together, one of them can teach me; knowledge shared belongs to all. This article was written by the EWhisper.cn technical blog (东风微鸣).
