The 3 AM Wake-Up Call#
There are two types of DevOps engineers: those who’ve lost critical infrastructure data, and those who will.
I learned this distinction at 3:47 AM on a Tuesday.
The phone buzzed. Monitoring alert. Then another. Then the call from the on-call engineer.
“ArgoCD is down. All applications showing as Unknown.”
Heart rate: elevated. Coffee: brewing. Laptop: opening.
I logged into the cluster. The ArgoCD namespace was there. Pods were running. But something was fundamentally wrong.
| kubectl get applications -n argocd
|
Empty.
Wait. Maybe I’m in the wrong context?
| kubectl config current-context
# production-cluster
|
No, this is the right cluster.
Nothing.
| # Maybe the CRD itself is gone?
kubectl get crd applications.argoproj.io
# NAME CREATED AT
# applications.argoproj.io 2025-08-15T10:23:41Z
|
The CRD exists. But no applications.
| # Check all namespaces, just in case
kubectl get applications --all-namespaces
# No resources found
|
150+ applications managing deployments across three Kubernetes clusters. Gone.
Not “out of sync.” Not “failed to sync.” Just… gone.
What happened?
A new engineer on Team Mavericks, trying to “clean up unused resources” in the staging namespace, had accidentally run a cleanup script against the production ArgoCD namespace. Our RBAC was too permissive, so the script had every permission it needed and executed successfully. More than 150 application definitions were gone in seconds.
Git repositories were fine. The actual deployed applications were still running in their respective clusters. But ArgoCD—our GitOps control plane—had lost all memory of what it was managing.
The recovery took 6 hours. It should have taken 15 minutes.
We had backups. But they were ad-hoc CLI exports from two weeks ago. Applications had changed. New deployments had happened. We spent hours reconstructing state from Git commit history, deployment scripts, and tribal knowledge.
That night taught me the difference between “having backups” and “having a disaster recovery strategy.”
This article is everything I wish I’d known before that wake-up call.
Understanding ArgoCD’s Storage Architecture#
Before we talk about backup, we need to understand what we’re backing up—and more importantly, where ArgoCD actually stores its data.
The Design Philosophy: Kubernetes-Native Storage#
ArgoCD doesn’t use a traditional database. There’s no PostgreSQL, no MongoDB, no external data store to configure and maintain.
Everything is stored as Kubernetes resources.
This is brilliant and terrifying at the same time.
Brilliant because:
- No external dependencies to manage
- Backup/restore uses standard Kubernetes tools
- High availability is Kubernetes-native
- Disaster recovery is conceptually simple
Terrifying because:
- If your Kubernetes cluster’s etcd is corrupted, ArgoCD’s state is corrupted
- Accidental deletion of resources = accidental deletion of ArgoCD config
- Cluster migration requires careful planning
What Gets Stored Where#
When you installed ArgoCD using:
| kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
|
That manifest created several types of resources. Let’s understand what each stores:
1. Application Definitions (Custom Resource Definitions)#
What: Your application configurations—what to deploy, where to deploy it, from which Git repo.
Stored as: Application custom resources (instances of the applications.argoproj.io CRD) in the argocd namespace.
Example:
| # Using kubectl (checks CRDs directly)
kubectl get applications -n argocd
NAME SYNC STATUS HEALTH STATUS
do-analytics-prod-apps.analyticsui.prod Synced Healthy
aws-myapp-staging-apps.backend.staging OutOfSync Progressing
|
| # Using ArgoCD CLI (same result, prettier output)
argocd app list
NAME CLUSTER NAMESPACE PROJECT STATUS HEALTH SYNCPOLICY
do-analytics-prod-apps.analyticsui.prod https://67860677-ba01... analytics-prod-apps do-analytics-prod... Synced Healthy Auto
aws-myapp-staging.backend https://3CCD9E5C2236A7E0... myapp-staging aws-myapp-staging OutOfSync Degraded Manual
|
View the actual data:
| # Using kubectl (raw YAML)
kubectl get application do-analytics-prod-apps.analyticsui.prod -n argocd -o yaml
# Using ArgoCD CLI (formatted, easier to read)
argocd app get do-analytics-prod-apps.analyticsui.prod
|
| apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: do-analytics-prod-apps.analyticsui.prod
  namespace: argocd
spec:
  project: do-analytics-prod-apps
  source:
    repoURL: https://gitlab.com/company/gitops/manifests.git
    path: manifests/analytics-platform/analyticsui
    targetRevision: prod
  destination:
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
    namespace: analytics-prod-apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
|
This is your entire application configuration. Delete this resource, and ArgoCD forgets that application exists.
2. Project Definitions (AppProject CRDs)#
What: RBAC boundaries, source/destination restrictions, project-level settings.
Stored as: AppProject custom resources in the argocd namespace.
View them:
| # Using kubectl
kubectl get appprojects -n argocd
# Using ArgoCD CLI
argocd proj list
|
Example AppProject:
| apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: do-analytics-prod-apps
  namespace: argocd
spec:
  destinations:
    - namespace: analytics-prod-apps
      server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
  sourceRepos:
    - https://gitlab.com/company/gitops/manifests.git
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
|
Why this matters: Projects control what can be deployed where. Lose this, and you lose your security boundaries.
3. Repository Credentials (Secrets)#
What: Git repository access credentials (passwords, SSH keys, tokens).
Stored as: Kubernetes Secrets with label argocd.argoproj.io/secret-type=repository.
View them (credentials are base64-encoded):
| kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository
|
Example:
| apiVersion: v1
kind: Secret
metadata:
  name: repo-123456789
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
type: Opaque
data:
  password: <base64-encoded-password>
  url: aHR0cHM6Ly9naXRsYWIuY29tL2NvbXBhbnkvcmVwby5naXQ=
  username: <base64-encoded-username>
|
Critical: If you lose these secrets, ArgoCD can’t pull manifests from Git.
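Base64 is an encoding, not encryption, so anyone who can read these Secrets (or a backup of them) can recover the credentials. A minimal sketch that decodes the `url` value from the example above; `decode_field` is a hypothetical helper name, not part of kubectl or ArgoCD:

```shell
#!/usr/bin/env bash
# decode_field VALUE: decode one base64-encoded Secret field.
decode_field() {
  printf '%s' "$1" | base64 --decode
}

# The url field from the example Secret above
decode_field 'aHR0cHM6Ly9naXRsYWIuY29tL2NvbXBhbnkvcmVwby5naXQ='
echo
```

The same applies to any file that contains these Secrets, which is worth keeping in mind when deciding where backups are stored.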
4. Cluster Credentials (Secrets)#
What: Authentication credentials for external Kubernetes clusters that ArgoCD manages.
Stored as: Kubernetes Secrets with label argocd.argoproj.io/secret-type=cluster.
View them:
| kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster
|
When you ran:
| argocd cluster add do-production-cluster
|
ArgoCD created a ServiceAccount in the target cluster and stored the credentials as a Secret in the ArgoCD namespace.
Example structure:
| apiVersion: v1
kind: Secret
metadata:
  name: cluster-do-production-67860677
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
data:
  name: ZG8tcHJvZHVjdGlvbi1jbHVzdGVy
  server: aHR0cHM6Ly82Nzg2MDY3Ny1iYTAxLTRjNjQtYWVhNC05ODFmZGU5YjVmZDYuazhzLm9uZGlnaXRhbG9jZWFuLmNvbQ==
  config: <base64-encoded-kubeconfig-with-token>
|
Without these, ArgoCD loses access to all external clusters it manages.
5. ArgoCD Configuration (ConfigMaps)#
What: ArgoCD server settings, UI customizations, notification configs, etc.
Stored as: ConfigMaps in the argocd namespace.
Key ConfigMaps:
argocd-cm - Main configuration:
| kubectl get configmap argocd-cm -n argocd -o yaml
|
Example contents:
| apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  url: https://argocd.example.com
  dex.config: |
    connectors:
    # SSO configuration
  repositories: |
    - url: https://gitlab.com/company/gitops
|
argocd-rbac-cm - RBAC policies:
| kubectl get configmap argocd-rbac-cm -n argocd -o yaml
|
Example:
| apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    p, role:developers, applications, sync, default/*, allow
    p, role:developers, applications, get, default/*, allow
    p, role:ops, applications, *, */*, allow
    g, engineering-team, role:developers
    g, ops-team, role:ops
  policy.default: role:readonly
|
argocd-cmd-params-cm - Server startup parameters:
| kubectl get configmap argocd-cmd-params-cm -n argocd -o yaml
|
Lose these, and your ArgoCD reverts to default configuration (losing SSO, RBAC policies, custom settings).
6. TLS Certificates and SSH Keys (Secrets)#
What: TLS certificates for ArgoCD server, SSH keys for Git repository access.
Stored as: Various Secrets in the argocd namespace.
Examples:
- argocd-server-tls - HTTPS certificate for ArgoCD UI
- argocd-repo-server-tls - Internal TLS for repo server
- SSH private keys for Git access (created when you add repos via SSH)
7. Initial Admin Password (Secret)#
What: The initial admin password generated during installation.
Stored as: Secret argocd-initial-admin-secret.
You retrieved this during installation:
| kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
|
Note: This secret is typically deleted after you change the admin password, but it’s worth knowing about for fresh installations.
What ArgoCD Does NOT Store#
Important distinctions:
❌ Application manifests - These live in Git, not in ArgoCD. ArgoCD pulls them on-demand.
❌ Deployed resources - The actual pods, services, deployments live in target clusters, not in ArgoCD.
❌ Git repository history - ArgoCD references Git, doesn’t clone it permanently.
❌ Application state/logs - ArgoCD tracks sync status, but not runtime logs or metrics.
This is why GitOps works: The source of truth is Git. ArgoCD is just the synchronization engine. If you lose ArgoCD, your applications keep running. You just lose the automation layer.
The Installation Decision That Matters#
When you installed ArgoCD with:
| kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
|
You made a critical choice: non-HA, single-replica deployment.
Look at what this manifest creates:
| # argocd-application-controller
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 1 # <-- Single replica
  serviceName: argocd-application-controller
|
| # argocd-server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  replicas: 1 # <-- Single replica
|
| # argocd-repo-server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 1 # <-- Single replica
|
What this means:
- ✅ Simple, easy to install
- ✅ Low resource usage
- ✅ Perfect for development/staging
- ❌ Single point of failure
- ❌ Not production-ready
- ❌ Downtime during upgrades
Alternative: HA Installation
For production, you’d use Helm with HA configuration:
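| # Add the Argo Helm chart repository (needed once per machine)
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  --set server.replicas=3 \
  --set repoServer.replicas=2 \
  --set controller.replicas=1 \
  --set redis-ha.enabled=true \
  --set redis-ha.replicas=3
|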
This changes:
- ✅ Multiple replicas of API server (survive pod failures)
- ✅ Multiple repo servers (load distribution)
- ✅ Redis HA with Sentinel (no single point of failure)
- ✅ Survives node failures
- ✅ Zero-downtime upgrades possible
Storage-wise, this doesn’t change WHERE data is stored (still Kubernetes CRDs/ConfigMaps/Secrets), but it changes availability and resilience.
We’ll cover HA setup in detail later in this article.
The Data Flow: Where Things Happen#
Understanding the data flow helps with backup planning:
| ┌─────────────────────────────────────────────────────────────┐
│ ArgoCD Installation (kubectl apply) │
│ │
│ Creates in Kubernetes etcd: │
│ ├── CRDs (Application, AppProject) │
│ ├── ConfigMaps (argocd-cm, argocd-rbac-cm) │
│ ├── Secrets (TLS certs, repo credentials) │
│ └── Deployments/StatefulSets (ArgoCD components) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ You create Application via CLI or UI │
│ │
│ argocd app create myapp \ │
│ --repo https://gitlab.com/company/repo.git \ │
│ --path manifests/myapp \ │
│ --dest-server https://cluster.com \ │
│ --dest-namespace myapp │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ ArgoCD stores this as Application CRD │
│ │
│ apiVersion: argoproj.io/v1alpha1 │
│ kind: Application │
│ metadata: │
│ name: myapp │
│ spec: │
│ source: │
│ repoURL: https://gitlab.com/company/repo.git │
│ path: manifests/myapp │
│ │
│ Stored in: Kubernetes etcd → Part of cluster state │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ ArgoCD Application Controller watches this CRD │
│ │
│ Every 3 minutes (default): │
│ 1. Reads Application CRD from Kubernetes │
│ 2. Pulls manifests from Git │
│ 3. Compares Git state vs Cluster state │
│ 4. Syncs if needed │
└─────────────────────────────────────────────────────────────┘
|
Key insight for backup: The Application CRD is the bridge between Git (source of truth for manifests) and Kubernetes (source of truth for what’s deployed). Lose the CRD, lose the bridge.
The Backup Implication#
Now that we know where everything lives, the backup strategy becomes clear:
What we must back up:
- Application resources (kubectl get applications -n argocd)
- AppProject resources (kubectl get appprojects -n argocd)
- Repository credential Secrets
- Cluster credential Secrets
- ConfigMaps (argocd-cm, argocd-rbac-cm, argocd-cmd-params-cm)
- TLS Secrets
What we don’t need to back up:
- Application manifests (they’re in Git)
- Deployed resources (they’re in target clusters)
- ArgoCD component Deployments/StatefulSets (recreated from install manifest)
Let’s implement this.
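The checklist above can also be turned into a small completeness check to run against any backup directory. This is a sketch only: the file names are assumptions chosen for illustration (one file per item on the list), and `missing_backup_files` is a hypothetical helper. TLS Secrets are left out because not every installation has them.

```shell
#!/usr/bin/env bash
# Files a complete backup directory is expected to contain.
# These names are illustrative assumptions, one per checklist item.
REQUIRED_FILES=(
  applications.yaml
  projects.yaml
  repositories.yaml
  clusters.yaml
  argocd-cm.yaml
  argocd-rbac-cm.yaml
  argocd-cmd-params-cm.yaml
)

# missing_backup_files DIR: print each required file that is absent or empty.
# Returns non-zero if anything is missing.
missing_backup_files() {
  local dir="$1" missing=0 f
  for f in "${REQUIRED_FILES[@]}"; do
    if [ ! -s "$dir/$f" ]; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}
```

A backup job could call `missing_backup_files "$BACKUP_DIR"` as its last step and fail loudly when anything is printed.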
The Three-Tier Backup Strategy#
Based on that 3 AM incident and months of refining our approach at DevOps Den, here’s the backup strategy that actually works in production.
Tier 1: Automated CLI Exports (Daily Quick Backups)#
Purpose: Fast recovery from accidental deletion, quick rollback capability.
Frequency: Daily, automated via cronjob.
Recovery Time: 15 minutes.
Storage: Git repository (version controlled backups).
Option A: The Quick Method (Recommended for Getting Started)#
ArgoCD provides a built-in argocd admin export command that backs up everything in one shot:
| # Export all ArgoCD resources to a file
argocd admin export -n argocd > argocd-backup-$(date +%Y%m%d).yaml
# Or using Docker (if argocd CLI not installed locally)
docker run -v ~/.kube:/home/argocd/.kube --rm \
quay.io/argoproj/argocd:v2.13.2 \
argocd admin export -n argocd > argocd-backup-$(date +%Y%m%d).yaml
|
What it backs up:
- All Applications
- All AppProjects
- All repository credentials (Secrets)
- All cluster credentials (Secrets)
- All ConfigMaps (argocd-cm, argocd-rbac-cm, etc.)
For multi-namespace ArgoCD setups (if your applications are in multiple namespaces):
| argocd admin export -n argocd \
--application-namespaces="team-mavericks,team-infrastructure,team-platform" \
> argocd-backup-$(date +%Y%m%d).yaml
|
Restore from backup:
| # Import the backup
argocd admin import argocd-backup-20260120.yaml
# Or using Docker
docker run -i -v ~/.kube:/home/argocd/.kube --rm \
quay.io/argoproj/argocd:v2.13.2 \
argocd admin import - < argocd-backup-20260120.yaml
|
Why this is great for getting started:
- ✅ Single command
- ✅ Officially supported by ArgoCD
- ✅ Everything in one file
- ✅ Simple to automate
Limitation: Single monolithic file. For production, you may want more granular backups (next approach).
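Because everything lands in one file, it is worth checking what the export actually contains before trusting it. A rough sketch, assuming the export is a multi-document YAML file in which each resource has an unindented `kind:` line; `count_kinds` is a hypothetical helper, not an ArgoCD command:

```shell
#!/usr/bin/env bash
# count_kinds FILE: tally top-level "kind:" lines in a multi-document YAML file.
# Indented "kind:" keys (nested fields) are deliberately ignored.
count_kinds() {
  grep -E '^kind: ' "$1" | sort | uniq -c | awk '{print $3 "=" $1}'
}
```

Running `count_kinds argocd-backup-20260120.yaml` prints one `Kind=count` line per resource type, which makes an export with zero Applications easy to spot.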
Option B: The Production Script (Granular Backups)#
Create backup-argocd.sh:
| #!/bin/bash
set -euo pipefail
# Configuration
BACKUP_DIR="/tmp/argocd-backup-$(date +%Y%m%d-%H%M%S)"
GIT_REPO="[email protected]:company/argocd-backups.git"
RETENTION_DAYS=30
# Create backup directory
mkdir -p "$BACKUP_DIR"
echo "Starting ArgoCD backup at $(date)"
# 1. Export ArgoCD CRDs (the schema definitions themselves)
echo "Backing up ArgoCD CRDs..."
kubectl get crd applications.argoproj.io -o yaml > "$BACKUP_DIR/crd-applications.yaml"
kubectl get crd appprojects.argoproj.io -o yaml > "$BACKUP_DIR/crd-appprojects.yaml"
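# 2. Export all applications
# (kubectl writes a "kind: List" document, which the restore step can feed straight to kubectl apply)
echo "Backing up applications..."
kubectl get applications.argoproj.io -n argocd -o yaml > "$BACKUP_DIR/applications.yaml"
# Count for verification (--no-headers keeps the header row out of the count)
APP_COUNT=$(kubectl get applications.argoproj.io -n argocd --no-headers | wc -l)
echo "Backed up $APP_COUNT applications"
# 3. Export all projects
echo "Backing up projects..."
kubectl get appprojects.argoproj.io -n argocd -o yaml > "$BACKUP_DIR/projects.yaml"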
# 4. Export repository credentials (requires kubectl, as ArgoCD CLI doesn't expose this)
echo "Backing up repository credentials..."
kubectl get secrets -n argocd \
-l argocd.argoproj.io/secret-type=repository \
-o yaml > "$BACKUP_DIR/repositories.yaml"
# 5. Export cluster credentials
echo "Backing up cluster credentials..."
kubectl get secrets -n argocd \
-l argocd.argoproj.io/secret-type=cluster \
-o yaml > "$BACKUP_DIR/clusters.yaml"
# 6. Export ArgoCD configuration
echo "Backing up ArgoCD configuration..."
kubectl get configmap argocd-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-cm.yaml"
kubectl get configmap argocd-rbac-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-rbac-cm.yaml"
kubectl get configmap argocd-cmd-params-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-cmd-params-cm.yaml"
# 7. Export TLS secrets
echo "Backing up TLS certificates..."
kubectl get secret argocd-server-tls -n argocd -o yaml > "$BACKUP_DIR/argocd-server-tls.yaml" 2>/dev/null || echo "No server TLS secret found"
# 8. Create a manifest list for easy verification
echo "Creating manifest..."
cat > "$BACKUP_DIR/MANIFEST.txt" <<EOF
ArgoCD Backup - $(date)
================================================
Applications: $APP_COUNT
Projects: $(kubectl get appprojects.argoproj.io -n argocd --no-headers | wc -l)
Repositories: $(kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository --no-headers | wc -l)
Clusters: $(kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster --no-headers | wc -l)
Files:
$(ls -lh "$BACKUP_DIR")
Backup completed at: $(date)
EOF
cat "$BACKUP_DIR/MANIFEST.txt"
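# 9. Push to Git for version control
# Commit on top of the existing history so earlier backups stay reachable via git log
# (assumes the backup repository already has at least one commit)
echo "Pushing backup to Git..."
REPO_DIR="$(mktemp -d)"
git clone "$GIT_REPO" "$REPO_DIR"
cp "$BACKUP_DIR"/* "$REPO_DIR"/
cd "$REPO_DIR"
git add -A
git commit -m "ArgoCD backup $(date +%Y-%m-%d_%H:%M:%S)"
git push origin HEAD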
# 10. Cleanup old backups (keep last 30 days)
echo "Cleaning up old backups..."
find /tmp/argocd-backup-* -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \; 2>/dev/null || true
echo "Backup completed successfully at $(date)"
|
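One failure mode a scheduled backup does not catch by itself: if it runs after an accidental deletion, it will faithfully record an empty ArgoCD. A sketch of a guard that compares the current application count with the previous run; `check_count_drop`, `PREVIOUS_COUNT`, and the 20% threshold are assumptions for illustration:

```shell
#!/usr/bin/env bash
# check_count_drop CURRENT PREVIOUS MAX_DROP_PERCENT
# Succeeds if CURRENT has not dropped more than MAX_DROP_PERCENT below PREVIOUS.
check_count_drop() {
  local current="$1" previous="$2" max_drop="$3"
  # Nothing to compare against on the first run
  if [ "$previous" -eq 0 ]; then
    return 0
  fi
  local drop=$(( (previous - current) * 100 / previous ))
  [ "$drop" -le "$max_drop" ]
}
```

In the script, this could run right after the count is taken, for example `check_count_drop "$APP_COUNT" "$PREVIOUS_COUNT" 20 || { echo "Application count dropped sharply, refusing to push" >&2; exit 1; }`, with `PREVIOUS_COUNT` read from the last MANIFEST.txt.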
Deploy as Kubernetes CronJob#
Create argocd-backup-cronjob.yaml:
| apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-backup
  namespace: argocd
---
# Note: this namespaced Role covers Secrets, ConfigMaps, Applications and AppProjects.
# Reading the CRDs themselves (cluster-scoped) additionally needs a ClusterRole
# with get on customresourcedefinitions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argocd-backup
  namespace: argocd
rules:
  - apiGroups: [""]
    resources: ["secrets", "configmaps"]
    verbs: ["get", "list"]
  - apiGroups: ["argoproj.io"]
    resources: ["applications", "appprojects"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-backup
  namespace: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-backup
subjects:
  - kind: ServiceAccount
    name: argocd-backup
    namespace: argocd
---
apiVersion: v1
kind: Secret
metadata:
  name: argocd-backup-git-ssh
  namespace: argocd
type: Opaque
data:
  id_rsa: <base64-encoded-ssh-private-key>
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-backup
  namespace: argocd
spec:
  schedule: "0 2 * * *" # 2 AM daily
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: argocd-backup
          containers:
            - name: backup
              # Assumes an image that provides kubectl, git and the argocd CLI.
              # The stock ArgoCD image runs as a non-root user, so the apt-get
              # step below only works in an image that runs as root.
              image: argoproj/argocd:v2.9.0
              command:
                - /bin/bash
                - -c
                - |
                  # Install git
                  apt-get update && apt-get install -y git
                  # Configure git
                  git config --global user.email "[email protected]"
                  git config --global user.name "ArgoCD Backup"
                  # Setup SSH for git
                  mkdir -p ~/.ssh
                  cp /ssh-key/id_rsa ~/.ssh/id_rsa
                  chmod 600 ~/.ssh/id_rsa
                  ssh-keyscan gitlab.com >> ~/.ssh/known_hosts
                  # Run backup script
                  /scripts/backup-argocd.sh
              volumeMounts:
                - name: backup-script
                  mountPath: /scripts
                - name: ssh-key
                  mountPath: /ssh-key
                  readOnly: true
          volumes:
            - name: backup-script
              configMap:
                name: argocd-backup-script
                defaultMode: 0755
            - name: ssh-key
              secret:
                secretName: argocd-backup-git-ssh
                defaultMode: 0600
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-backup-script
  namespace: argocd
data:
  backup-argocd.sh: |
    #!/bin/bash
    # (Insert the backup script content here)
|
Deploy it:
| kubectl apply -f argocd-backup-cronjob.yaml
|
Verify it’s working:
| # Check cronjob schedule
kubectl get cronjob -n argocd
# Manually trigger for testing
kubectl create job --from=cronjob/argocd-backup argocd-backup-manual -n argocd
# Watch the job
kubectl logs -f job/argocd-backup-manual -n argocd
|
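A CronJob that silently stops running is how "backups from two weeks ago" happen, so it helps to check the age of the newest backup as well. A minimal sketch; `backup_is_stale` is a hypothetical helper and the 26-hour limit in the usage note is an assumption (a daily schedule plus some slack):

```shell
#!/usr/bin/env bash
# backup_is_stale LAST_EPOCH NOW_EPOCH MAX_AGE_HOURS
# Succeeds (exit 0) when the last backup is older than MAX_AGE_HOURS.
backup_is_stale() {
  local last="$1" now="$2" max_hours="$3"
  [ $(( now - last )) -gt $(( max_hours * 3600 )) ]
}
```

With the backups in Git, the timestamp of the newest commit can be read with `git -C argocd-backups log -1 --format=%ct`, and a monitoring job could run `backup_is_stale "$last" "$(date +%s)" 26 && echo "ArgoCD backup is stale"`.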
The Restore Procedure (Tier 1)#
When disaster strikes:
| # 1. Clone the backup repository
git clone [email protected]:company/argocd-backups.git
cd argocd-backups
# 2. List available backups
git log --oneline
# 3. Checkout the desired backup
git checkout <commit-hash>
# 4. Restore CRDs first (if they were deleted)
kubectl apply -f crd-applications.yaml
kubectl apply -f crd-appprojects.yaml
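# Wait for CRDs to be established (more reliable than a fixed sleep)
kubectl wait --for condition=established --timeout=60s \
  crd/applications.argoproj.io crd/appprojects.argoproj.io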
# 5. Restore projects (dependencies for applications)
kubectl apply -f projects.yaml
# 6. Restore repository credentials
kubectl apply -f repositories.yaml
# 7. Restore cluster credentials
kubectl apply -f clusters.yaml
# 8. Restore configuration
kubectl apply -f argocd-cm.yaml
kubectl apply -f argocd-rbac-cm.yaml
kubectl apply -f argocd-cmd-params-cm.yaml
# 9. Restore applications
kubectl apply -f applications.yaml
# 10. Restart ArgoCD components to pick up new config
kubectl rollout restart deployment argocd-server -n argocd
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart statefulset argocd-application-controller -n argocd
# 11. Verify
argocd app list
|
Recovery time: 10-15 minutes from start to finish.
What this would have saved us: If we’d had this in place during the 3 AM incident, we would’ve been back online in 15 minutes instead of 6 hours.
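The order of the restore steps matters: CRDs first, then projects, credentials and configuration, and applications last. A sketch that encodes that order and skips files a given backup does not contain; `restore_plan` is a hypothetical helper and the file names are the ones the backup script writes:

```shell
#!/usr/bin/env bash
# Restore order: each file only depends on files listed before it.
RESTORE_ORDER=(
  crd-applications.yaml
  crd-appprojects.yaml
  projects.yaml
  repositories.yaml
  clusters.yaml
  argocd-cm.yaml
  argocd-rbac-cm.yaml
  argocd-cmd-params-cm.yaml
  applications.yaml
)

# restore_plan DIR: print the files present in DIR, in restore order.
restore_plan() {
  local dir="$1" f
  for f in "${RESTORE_ORDER[@]}"; do
    if [ -f "$dir/$f" ]; then
      echo "$f"
    fi
  done
}
```

From inside the checked-out backup, `restore_plan . | while read -r f; do kubectl apply -f "$f"; done` applies everything in a safe order.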
Tier 2: Velero (Complete Disaster Recovery)#
Purpose: Full namespace backup, cross-cluster migration, complete disaster recovery.
Frequency: Daily automated, plus manual before major changes.
Recovery Time: 30-60 minutes (depending on data size).
Storage: Object storage (S3, GCS, Azure Blob).
Why Velero?#
Velero backs up entire Kubernetes namespaces, including:
- All CRDs (Applications, AppProjects)
- All Secrets (repo credentials, cluster credentials, TLS certs)
- All ConfigMaps (ArgoCD settings)
- All PersistentVolumes (if applicable)
- Resource relationships and dependencies
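Velero captures everything in one backup: the whole namespace is collected by a single operation and restored together. It reads resources through the Kubernetes API rather than freezing etcd, so treat the result as close to, but not strictly, a point-in-time snapshot.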
Installing Velero#
Prerequisites: Object storage bucket (S3/GCS/Azure Blob).
For AWS S3:
| # Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
# Create S3 bucket
aws s3 mb s3://argocd-backups-velero --region ap-south-1
# Create IAM user for Velero
aws iam create-user --user-name velero
# Attach policy (you need a policy with S3 access)
aws iam attach-user-policy --user-name velero --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# Create access key
aws iam create-access-key --user-name velero
# Create credentials file
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<ACCESS_KEY>
aws_secret_access_key=<SECRET_KEY>
EOF
# Install Velero in the cluster
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket argocd-backups-velero \
--backup-location-config region=ap-south-1 \
--snapshot-location-config region=ap-south-1 \
--secret-file ./credentials-velero
|
Verify installation:
| kubectl get pods -n velero
kubectl logs deployment/velero -n velero
|
Create Backup Schedule for ArgoCD#
| # Daily backup at 2 AM, retain for 30 days
velero schedule create argocd-daily \
--schedule="0 2 * * *" \
--include-namespaces argocd \
--ttl 720h
# Verify schedule
velero schedule get
|
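The --ttl value is a duration in hours, so 30 days of retention is 30 × 24 = 720h. A tiny sketch for deriving it from a retention period in days; `days_to_ttl` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# days_to_ttl DAYS: convert a retention period in days to an hours-based TTL string.
days_to_ttl() {
  echo "$(( $1 * 24 ))h"
}
```

For example, `--ttl "$(days_to_ttl 30)"` is equivalent to the `--ttl 720h` used above.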
Manual Backup Before Major Changes#
Before cluster migration, ArgoCD upgrades, or major configuration changes:
| # Create named backup
velero backup create argocd-pre-migration-$(date +%Y%m%d) \
--include-namespaces argocd \
--wait
# Verify backup completed
velero backup describe argocd-pre-migration-$(date +%Y%m%d)
# Check backup logs
velero backup logs argocd-pre-migration-$(date +%Y%m%d)
|
The Velero Restore Procedure#
Scenario 1: Restore to Same Cluster
| # List available backups
velero backup get
# Restore from backup
velero restore create argocd-restore-$(date +%Y%m%d) \
--from-backup argocd-daily-20260120020000 \
--wait
# Monitor restore
velero restore describe argocd-restore-$(date +%Y%m%d)
velero restore logs argocd-restore-$(date +%Y%m%d)
# Verify
kubectl get applications -n argocd
argocd app list
|
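Scheduled backups are named after the schedule plus a 14-digit timestamp, as in argocd-daily-20260120020000 above, so the newest one can be picked by sorting the names. A sketch under that naming assumption; `latest_backup` is a hypothetical helper that reads backup names on stdin:

```shell
#!/usr/bin/env bash
# latest_backup PREFIX: from backup names on stdin, print the newest one
# matching PREFIX-<14-digit timestamp>. Other names are ignored.
latest_backup() {
  local prefix="$1"
  grep -E "^${prefix}-[0-9]{14}$" | sort | tail -n 1
}
```

One way to feed it is the first column of the backup listing: `velero backup get | awk 'NR>1 {print $1}' | latest_backup argocd-daily`.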
Scenario 2: Restore to New Cluster (Complete DR)
| # 1. Install Velero in new cluster (pointing to same S3 bucket)
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket argocd-backups-velero \
--backup-location-config region=ap-south-1 \
--snapshot-location-config region=ap-south-1 \
--secret-file ./credentials-velero
# 2. Verify Velero can see existing backups
velero backup get
# 3. Restore ArgoCD from backup
velero restore create argocd-dr-restore \
--from-backup argocd-daily-20260120020000 \
--wait
# 4. Verify all components
kubectl get all -n argocd
kubectl get applications -n argocd
kubectl get appprojects -n argocd
# 5. Update ArgoCD server URL if needed
kubectl patch configmap argocd-cm -n argocd \
--type merge \
-p '{"data":{"url":"https://new-argocd.example.com"}}'
# 6. Restart ArgoCD
kubectl rollout restart deployment argocd-server -n argocd
|
Recovery time: 30-60 minutes depending on backup size and network speed.
Tier 3: GitOps for ArgoCD Itself#
Purpose: ArgoCD managing its own configuration, self-healing setup, infrastructure as code.
Philosophy: If ArgoCD is the GitOps tool, why not use GitOps to manage ArgoCD itself?
This is what we implemented at DevOps Den after the 3 AM incident. Now all our ArgoCD configuration lives in Git, managed by ArgoCD itself using the app-of-apps pattern.
Benefits:
- Version-controlled ArgoCD configuration (every change tracked in Git)
- Self-healing (ArgoCD syncs its own config from Git)
- Reproducible across environments (dev, staging, prod)
- No manual backup needed (Git is the backup)
- Audit trail (who changed what, when, and why)
The Structure#
Create a Git repository for ArgoCD configuration:
| argocd-bootstrap/
├── argocd-install/
│ ├── namespace.yaml
│ └── install.yaml
├── projects/
│ ├── project-analytics.yaml
│ ├── project-myapp.yaml
│ └── project-compliance.yaml
├── applications/
│ ├── do-analytics-prod-analyticsui.yaml
│ ├── do-myapp-staging-backend.yaml
│ └── aws-myapp-prod-frontend.yaml
├── repositories/
│ ├── repo-gitlab-company.yaml # Using sealed-secrets
│ └── repo-github-oss.yaml
├── clusters/
│ ├── cluster-do-prod.yaml # Using sealed-secrets
│ ├── cluster-aws-prod.yaml
│ └── cluster-e2e-test.yaml
├── config/
│ ├── argocd-cm.yaml
│ ├── argocd-rbac-cm.yaml
│ └── argocd-cmd-params-cm.yaml
└── app-of-apps.yaml
|
Example Files#
projects/project-analytics.yaml:
| apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: do-analytics-prod-apps
  namespace: argocd
spec:
  description: Analytics platform - Production
  sourceRepos:
    - https://gitlab.com/company/gitops/manifests.git
  destinations:
    - namespace: analytics-prod-apps
      server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
|
applications/do-analytics-prod-analyticsui.yaml:
| apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: do-analytics-prod-apps.analyticsui.prod
  namespace: argocd
spec:
  project: do-analytics-prod-apps
  source:
    repoURL: https://gitlab.com/company/gitops/manifests.git
    path: manifests/analytics-platform/analyticsui
    targetRevision: prod
  destination:
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
    namespace: analytics-prod-apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
|
Handling Secrets: Use Sealed Secrets for repository/cluster credentials:
| # Install sealed-secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
# Create a secret for repository credentials (in the argocd namespace, where ArgoCD looks for it)
kubectl create secret generic repo-gitlab-company \
  --namespace argocd \
  --from-literal=url=https://gitlab.com/company/gitops/manifests.git \
  --from-literal=username=git \
  --from-literal=password=your-token \
  --dry-run=client -o yaml > repo-secret.yaml
# Add the label ArgoCD uses to recognise repository secrets
kubectl label --local -f repo-secret.yaml \
  argocd.argoproj.io/secret-type=repository \
  -o yaml > repo-secret-labeled.yaml
# Seal it
kubeseal -o yaml < repo-secret-labeled.yaml > repositories/repo-gitlab-company.yaml
# Remove the plaintext copies, then commit the sealed secret to Git
rm repo-secret.yaml repo-secret-labeled.yaml
git add repositories/repo-gitlab-company.yaml
git commit -m "Add GitLab repository credentials (sealed)"
|
The sealed secret looks like:
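| apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: repo-gitlab-company
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    url: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    username: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
  # The unsealed Secret takes its metadata from this template,
  # so the ArgoCD label has to appear here as well
  template:
    metadata:
      name: repo-gitlab-company
      namespace: argocd
      labels:
        argocd.argoproj.io/secret-type: repository
|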
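Safe to commit to Git: it can only be decrypted by the sealed-secrets controller in that specific cluster. For disaster recovery, that means the controller’s sealing key must be backed up as well, or a replacement cluster will not be able to decrypt these files.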
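One way to keep a copy of the controller's sealing key is to export it with kubectl. This is a minimal sketch that assumes the controller runs in kube-system (where the controller.yaml manifest installs it) and relies on the label the sealed-secrets project documents for its key Secrets; `backup_sealing_key` and the output path are hypothetical names:

```shell
#!/usr/bin/env bash
# backup_sealing_key OUTFILE: export the sealed-secrets key Secret(s) to OUTFILE.
# Assumes the controller namespace is kube-system.
backup_sealing_key() {
  local out="$1"
  kubectl get secret -n kube-system \
    -l sealedsecrets.bitnami.com/sealed-secrets-key \
    -o yaml > "$out"
  # The file contains private key material: restrict access to the owner
  chmod 600 "$out"
}
```

The exported file should live somewhere other than the Git repository holding the sealed secrets. On a replacement cluster, applying it before the controller starts (or restarting the controller afterwards) lets the existing SealedSecret files decrypt again.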
The App-of-Apps Pattern#
app-of-apps.yaml:
| apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.com/company/argocd-bootstrap.git
    targetRevision: main
    path: .
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
|
Deploy the App-of-Apps:
| # After fresh ArgoCD installation
kubectl apply -f app-of-apps.yaml
|
Now ArgoCD manages itself:
- The argocd-config application watches the Git repository
- Any changes to projects, applications, config → Git commit → ArgoCD auto-syncs
- ArgoCD’s configuration is now version-controlled and self-healing
Benefits:
- ✅ All ArgoCD config in Git (version controlled, auditable)
- ✅ Self-healing (manual changes get reverted)
- ✅ Reproducible (deploy to new cluster = kubectl apply -f app-of-apps.yaml)
- ✅ No manual backups needed (Git is the backup)
The DR scenario becomes:
| # Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Deploy app-of-apps
kubectl apply -f app-of-apps.yaml
# Wait 2 minutes, ArgoCD recreates everything from Git
argocd app list
|
Recovery time: 5 minutes.
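The "wait 2 minutes" step can be made explicit by polling until every Application reports Synced and Healthy. This is a sketch, assuming kubectl access to the argocd namespace. `count_unready`, `list_app_status` and `wait_for_apps` are hypothetical helper names, and the fields read are `.status.sync.status` and `.status.health.status` on the Application resource:

```shell
# count_unready: reads "NAME SYNC HEALTH" lines on stdin and prints how
# many are not both Synced and Healthy.
count_unready() {
  awk 'NF > 0 && !($2 == "Synced" && $3 == "Healthy") { n++ } END { print n + 0 }'
}

# list_app_status: one line per Application with its sync and health status.
list_app_status() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
}

# wait_for_apps TIMEOUT_SECONDS: polls every 10 seconds until all
# Applications are Synced and Healthy, or gives up after the timeout.
wait_for_apps() {
  deadline=$(( $(date +%s) + $1 ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    status=$(list_app_status) || return 1
    if [ -n "$status" ] && [ "$(printf '%s\n' "$status" | count_unready)" -eq 0 ]; then
      echo "all applications Synced and Healthy"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for applications" >&2
  return 1
}
```

Calling `wait_for_apps 300` after applying app-of-apps.yaml turns the recovery-time estimate into something a script can enforce. An empty list counts as "not ready yet", so the loop keeps waiting while ArgoCD is still creating the Applications.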
High Availability: Making ArgoCD Bulletproof#
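Now that we know how to back up, let’s make restores a rare event by making ArgoCD highly available. HA protects against node and pod failures, not against deleted data, so the backups above still matter.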
The Problem with Single-Replica Deployment#
Recall the installation command:
| kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
|
This creates:
- 1 replica of argocd-server (API and UI)
- 1 replica of argocd-repo-server (Git repository interaction)
- 1 replica of argocd-application-controller (sync orchestration)
- 1 replica of Redis (caching)
What breaks:
- Node failure → ArgoCD goes down until pod reschedules
- Pod crash → 30-60 seconds of downtime
- Rolling updates → downtime during upgrade
- High load → single pod can’t scale
HA Architecture#
Production-grade ArgoCD needs:
| ┌─────────────────────────────────────────────────────────────┐
│ argocd-server (3 replicas)                                  │
│ ├── Pod 1 (node-1)                                          │
│ ├── Pod 2 (node-2)  ← Load balanced                         │
│ └── Pod 3 (node-3)                                          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ argocd-repo-server (2 replicas)                             │
│ ├── Pod 1 (node-1)  ← Git cloning, manifest generation      │
│ └── Pod 2 (node-2)                                          │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ argocd-application-controller (1 replica)                   │
│ └── Pod 1 (StatefulSet)  ← Watches apps, reconciles state   │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ Redis HA (3 replicas with Sentinel)                         │
│ ├── Redis Pod 1 (Master)                                    │
│ ├── Redis Pod 2 (Replica)  ← Automatic failover             │
│ └── Redis Pod 3 (Replica)                                   │
│                                                             │
│ Sentinel (3 replicas) - monitors Redis, elects new master   │
└─────────────────────────────────────────────────────────────┘
|
Why this survives failures:
- API server (3 replicas) → 2 can fail, 1 keeps serving
- Repo server (2 replicas) → 1 can fail, Git operations continue
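- Application controller (1 replica) → Not redundant: the StatefulSet recreates the pod after a failure, and reconciliation pauses until it is back. Extra replicas shard clusters between them rather than acting as standbys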
- Redis HA → Master fails, Sentinel promotes replica
Installing HA ArgoCD#
Using Helm (recommended for HA):
| # Add ArgoCD Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
# Install with HA configuration
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  --set server.replicas=3 \
  --set repoServer.replicas=2 \
  --set controller.replicas=1 \
  --set redis-ha.enabled=true \
  --set redis-ha.replicas=3 \
  --set redis.enabled=false
|
Verify HA deployment:
| kubectl get pods -n argocd
# Should see:
# argocd-server-xxx-1
# argocd-server-xxx-2
# argocd-server-xxx-3
# argocd-repo-server-xxx-1
# argocd-repo-server-xxx-2
# argocd-application-controller-0
# argocd-redis-ha-server-0
# argocd-redis-ha-server-1
# argocd-redis-ha-server-2
# argocd-redis-ha-haproxy-xxx-1
# argocd-redis-ha-haproxy-xxx-2
# argocd-redis-ha-haproxy-xxx-3
|
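Eyeballing the pod list works for a first install, but the same check can be scripted. This is a small sketch that reads the READY column of `kubectl get` output. `not_fully_ready` is a hypothetical helper name, and it assumes the default column layout where the second column is `ready/desired`:

```shell
# not_fully_ready: reads `kubectl get deploy,statefulset --no-headers`
# style lines on stdin (NAME READY ...) and prints every workload whose
# READY column (ready/desired) shows fewer ready pods than desired.
not_fully_ready() {
  awk '{
    split($2, r, "/")
    if (r[1] + 0 < r[2] + 0) print $1 " " $2
  }'
}

# Example (assumes the argocd namespace used throughout this post):
# kubectl get deployments,statefulsets -n argocd --no-headers | not_fully_ready
```

Empty output means every Deployment and StatefulSet has all of its replicas ready.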
HA with values.yaml (Recommended for Production)#
Create argocd-ha-values.yaml:
| # Server HA
server:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-server
          topologyKey: kubernetes.io/hostname
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
# Repo server HA
repoServer:
  replicas: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-repo-server
          topologyKey: kubernetes.io/hostname
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
# Application controller
controller:
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
# Redis HA
redis:
  enabled: false
redis-ha:
  enabled: true
  replicas: 3
  haproxy:
    replicas: 3
    # The redis-ha subchart takes haproxy.affinity as a string, not a map;
    # hardAntiAffinity gives the same one-pod-per-node spread
    hardAntiAffinity: true
# High availability settings
configs:
  cm:
    timeout.reconciliation: 180s
    statusbadge.enabled: "true"
  params:
    controller.operation.processors: "10"
    controller.status.processors: "20"
    controller.self.heal.timeout.seconds: "5"
    server.insecure: "false"
|
Install with values file:
| helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  -f argocd-ha-values.yaml
|
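The anti-affinity rules in the values file are meant to put each replica on its own node. One way to confirm that is to list pods with their nodes and flag any node hosting more than one replica. This is a sketch: `shared_nodes` is a hypothetical helper, and it expects "POD NODE" pairs on stdin:

```shell
# shared_nodes: reads "POD NODE" lines on stdin and prints every node
# that hosts more than one of the listed pods, with the pod count.
shared_nodes() {
  awk 'NF >= 2 { count[$2]++ }
       END { for (n in count) if (count[n] > 1) print n " " count[n] }' | sort
}

# Example for the argocd-server replicas:
# kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server \
#   --no-headers -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' \
#   | shared_nodes
```

No output means the replicas are spread one per node. Any line printed names a node whose failure would take out several replicas at once.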
Testing HA: Chaos Engineering#
Test 1: Kill argocd-server pod:
| # Identify a server pod
kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server
# Delete one
kubectl delete pod argocd-server-xxx-1 -n argocd
# Access UI - should continue working (other 2 pods serving)
|
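"Should continue working" is easier to trust with numbers behind it. The sketch below runs a probe command repeatedly while you delete the pod and reports how many attempts failed. `probe_loop` is a hypothetical helper and argocd.example.com is a placeholder hostname; the probe itself is whatever command you pass in:

```shell
# probe_loop COUNT INTERVAL CMD...
# Runs CMD COUNT times, sleeping INTERVAL seconds between attempts, and
# prints how many attempts failed. Returns non-zero if any attempt failed.
probe_loop() {
  count=$1
  interval=$2
  shift 2
  failures=0
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
  echo "failures=$failures/$count"
  [ "$failures" -eq 0 ]
}

# Example: probe the server's /healthz endpoint once a second for a minute.
# probe_loop 60 1 curl -fsS --max-time 2 https://argocd.example.com/healthz
```

Start the loop in one terminal, delete the pod in another, and compare the failure count against your expectations. The same loop works for the node drain and Redis failover tests.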
Test 2: Drain a node:
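| # Drain node where argocd-server pods run
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# Remaining replicas keep serving. With required anti-affinity, the evicted
# pod stays Pending unless another node without a replica is available
# Put the node back when done
kubectl uncordon node-1
|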
Test 3: Redis failover:
| # Kill the Redis master (assumed here to be server-0)
kubectl delete pod argocd-redis-ha-server-0 -n argocd
# Watch Sentinel promote a replica
kubectl logs -f argocd-redis-ha-server-1 -n argocd -c sentinel
# ArgoCD continues operating with new Redis master
|
HA Monitoring#
Key metrics to track:
- Pod availability:
| kubectl get pods -n argocd -o wide
# All pods should be Running and Ready
|
- Redis HA status:
| # Check Redis Sentinel status
kubectl exec -it argocd-redis-ha-server-0 -n argocd -c redis -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
# Should return current master IP
|
- Application sync status:
| argocd app list
# Monitor for OutOfSync or degraded apps
|
Prometheus metrics:
| # The application controller exposes Prometheus metrics on port 8082
# (argocd-server uses 8083 and argocd-repo-server uses 8084)
apiVersion: v1
kind: Service
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  ports:
    - name: metrics
      port: 8082
      targetPort: 8082
  selector:
    app.kubernetes.io/name: argocd-application-controller
|
Grafana dashboards: Import ArgoCD official dashboard (ID 14584) from Grafana.com.
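Before a dashboard exists, the raw metrics can already answer "how many apps are out of sync?". This sketch assumes the controller's `argocd_app_info` series carries a `sync_status` label. `count_by_sync_status` is a hypothetical helper that reads Prometheus text format on stdin:

```shell
# count_by_sync_status STATUS
# Reads Prometheus text-format metrics on stdin and counts the
# argocd_app_info series whose sync_status label equals STATUS.
count_by_sync_status() {
  awk -v s="sync_status=\"$1\"" \
    'index($0, "argocd_app_info{") == 1 && index($0, s) > 0 { n++ }
     END { print n + 0 }'
}

# Example, using the argocd-metrics Service on port 8082:
# kubectl port-forward -n argocd svc/argocd-metrics 8082:8082 &
# curl -s http://localhost:8082/metrics | count_by_sync_status OutOfSync
```

The same count is a natural starting point for an alert threshold once Prometheus is scraping the endpoint.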
Disaster Recovery Scenarios and Runbooks#
Let’s walk through real disaster scenarios and exact recovery procedures.
Scenario 1: Accidental Application Deletion#
Symptom: Someone ran kubectl delete application myapp -n argocd.
Impact: ArgoCD forgets about the application, but deployed resources keep running in target cluster.
Recovery (Tier 1 - CLI backup):
| # 1. Clone backup repo
git clone git@gitlab.com:company/argocd-backups.git
cd argocd-backups
# 2. Find the deleted application
grep "name: myapp" applications.yaml
# 3. Extract just that application and re-create it
yq eval 'select(.metadata.name == "myapp")' applications.yaml | kubectl apply -f -
# 4. Verify
argocd app get myapp
argocd app sync myapp
|
Recovery time: 2 minutes.
Scenario 2: Complete ArgoCD Namespace Deletion#
Symptom: kubectl delete namespace argocd (oops).
Impact: Total loss of ArgoCD. Applications keep running in target clusters, but no automation.
Recovery (Tier 2 - Velero):
| # 1. Reinstall ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s
# 2. Install Velero (if not already installed)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero
# 3. List available backups
velero backup get
# 4. Restore from the latest backup taken before the deletion
#    (use a name from step 3; scheduled backups are named <schedule>-<timestamp>)
velero restore create argocd-emergency-restore \
  --from-backup argocd-daily-<timestamp> \
  --wait
# 5. Verify restore
kubectl get applications -n argocd
argocd app list
# 6. Sync all applications to recover from any drift
argocd app sync $(argocd app list -o name)
|
Recovery time: 20-30 minutes.
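Picking the backup name by hand is error-prone at 3 AM. This sketch selects the newest Completed backup for a schedule from `velero backup get` output. `latest_completed_backup` is a hypothetical helper, and it assumes the first two columns are NAME and STATUS and that backup names end in a sortable timestamp:

```shell
# latest_completed_backup PREFIX
# Reads `velero backup get` output on stdin and prints the newest backup
# whose name starts with PREFIX and whose STATUS is Completed.
# Relies on schedule naming (<schedule>-<YYYYMMDDhhmmss>), so a plain
# lexical sort is also a chronological sort.
latest_completed_backup() {
  awk -v p="$1" '$2 == "Completed" && index($1, p) == 1 { print $1 }' \
    | sort | tail -n 1
}

# Example:
# backup=$(velero backup get | latest_completed_backup argocd-daily-)
# velero restore create argocd-emergency-restore --from-backup "$backup" --wait
```

Check the timestamp before restoring: if the deletion happened before the most recent backup ran, the newest backup may already reflect the empty namespace, and the one before it is the one you want.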
Scenario 3: Cluster Migration (Complete Infrastructure Change)#
Symptom: Migrating from DigitalOcean to AWS EKS. Need to move ArgoCD.
Recovery (Tier 3 - GitOps approach):
Prerequisites: ArgoCD config is in Git (argocd-bootstrap repo).
| # 1. New cluster is provisioned (AWS EKS)
aws eks update-kubeconfig --name production-cluster --region ap-south-1
# 2. Install ArgoCD in new cluster
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# 3. Install sealed-secrets controller (for decrypting repo/cluster credentials)
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
# 4. Copy sealed-secrets master key from old cluster
# (old-cluster is a placeholder for the old cluster's kubectl context name)
kubectl --context old-cluster get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > sealed-secrets-key.yaml
kubectl apply -f sealed-secrets-key.yaml
# Restart the controller so it loads the imported key
kubectl delete pod -n kube-system -l name=sealed-secrets-controller
# This allows new cluster to decrypt the same sealed secrets
# 5. Deploy app-of-apps
kubectl apply -f https://gitlab.com/company/argocd-bootstrap/-/raw/main/app-of-apps.yaml
# 6. Wait for ArgoCD to self-configure
kubectl get applications -n argocd --watch
# Within 2-3 minutes, all projects, applications, repos, clusters restored
# 7. Verify
argocd app list
argocd app sync $(argocd app list -o name)
|
Recovery time: 10 minutes.
Key advantage: Zero manual reconstruction. Git is the source of truth.
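"Zero manual reconstruction" is worth verifying before the old cluster is switched off. The sketch below compares Application names exported from both clusters and prints any that did not make it across. `missing_apps` is a hypothetical helper, and old-cluster / new-cluster are placeholder kubectl context names:

```shell
# missing_apps OLD_LIST NEW_LIST
# Each file holds one Application name per line. Prints the names that
# appear in OLD_LIST but not in NEW_LIST.
missing_apps() {
  old_sorted=$(mktemp) || return 1
  new_sorted=$(mktemp) || return 1
  sort -u "$1" > "$old_sorted"
  sort -u "$2" > "$new_sorted"
  comm -23 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}

# Example:
# kubectl --context old-cluster get applications -n argocd -o name > old.txt
# kubectl --context new-cluster get applications -n argocd -o name > new.txt
# missing_apps old.txt new.txt
```

Anything this prints was created outside Git on the old cluster and needs to be added to the bootstrap repository.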
Scenario 4: etcd Corruption (Kubernetes Cluster Disaster)#
Symptom: Kubernetes etcd corrupted. Cluster state lost.
Impact: Total cluster failure. All ArgoCD data in etcd is gone.
Recovery (Combination of Velero + GitOps):
| # 1. Rebuild Kubernetes cluster (new control plane)
# (Cloud provider specific - EKS console, DO dashboard, kubeadm, etc.)
# 2. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# 3. Option A: Restore from Velero (if Velero backups were in object storage)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero
velero restore create disaster-recovery \
  --from-backup argocd-daily-<latest> \
  --wait
# 3. Option B: Restore from Git (if using GitOps approach)
# (needs the sealed-secrets controller and its backed-up key, as in Scenario 3)
kubectl apply -f https://gitlab.com/company/argocd-bootstrap/-/raw/main/app-of-apps.yaml
# 4. Verify
argocd app list
argocd app sync $(argocd app list -o name)
|
Recovery time: 30-60 minutes (most time spent rebuilding Kubernetes cluster).
Production Best Practices Checklist#
Based on months of running ArgoCD in production, here’s the checklist:
Backup & DR#
High Availability#
Security#
Monitoring & Alerting#
Operational Excellence#
Lessons from the Trenches#
Lesson 1: Backup is Only Half the Story#
We had backups. But we’d never tested restore. When disaster struck, we discovered:
- Backups were incomplete (missing cluster credentials)
- Restore procedure wasn’t documented
- Team didn’t know how to restore
Takeaway: Test your restore procedure quarterly. Schedule it. Put it on the calendar. Actually restore to a test cluster.
Lesson 2: Manual Backups Don’t Scale#
Early on, we’d manually run argocd app list -o yaml > backup.yaml before major changes.
We forgot. A lot.
Takeaway: Automate everything. CronJob, Velero schedule, Git-based config. If it requires a human to remember, it will fail.
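Automation can also watch the automation. This sketch flags a backup file that has not been refreshed recently, which is the failure mode we kept hitting. `backup_is_stale` is a hypothetical helper, and it relies on `find -mmin`, which GNU, BSD and BusyBox find support but POSIX does not require:

```shell
# backup_is_stale FILE MAX_AGE_MINUTES
# Succeeds (exit 0) if FILE is missing or was last modified more than
# MAX_AGE_MINUTES ago; fails (exit 1) if the file is fresh.
backup_is_stale() {
  [ -f "$1" ] || return 0
  [ -n "$(find "$1" -mmin +"$2")" ]
}

# Example: complain if the export is older than a day (1440 minutes).
# if backup_is_stale backup.yaml 1440; then
#   echo "ArgoCD backup is stale" >&2
# fi
```

Run from a scheduled job, a check like this turns "we forgot" into an alert instead of a surprise during the next incident.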
Lesson 3: HA Isn’t Optional for Production#
At DevOps Den, we ran single-replica ArgoCD for months. “It’s just automation, we can redeploy.”
Then ArgoCD went down during a critical deployment window. A customer-facing bug needed an immediate hotfix. Team Mavericks had the fix ready, but we couldn’t deploy it because the GitOps control plane was offline.
We had to manually kubectl apply the fix (breaking our GitOps workflow) and then reconcile it in ArgoCD later.
Takeaway: If your deployments depend on it, it needs to be HA. ArgoCD is infrastructure, treat it as such.
Lesson 4: GitOps for ArgoCD is the Best Long-Term Strategy#
CLI backups are great for quick recovery. Velero is excellent for disaster recovery.
But GitOps for ArgoCD config is the ultimate solution because:
- You get versioning (Git history)
- You get auditing (who changed what when)
- You get self-healing (drift correction)
- You get reproducibility (deploy anywhere)
Takeaway: Invest in GitOps for ArgoCD early. It pays dividends forever.
Lesson 5: Know What You’re NOT Backing Up#
We once “restored” ArgoCD from backup and wondered why applications weren’t syncing.
Turns out: We had backed up the Application custom resources (ArgoCD’s record of what to deploy), but the deployed resources were still in the target clusters. ArgoCD saw drift everywhere and started auto-healing (re-syncing from Git, which had older versions).
Takeaway: Understand the scope of your backup. ArgoCD config ≠ Application state. Git is source of truth for manifests, Kubernetes is source of truth for deployed resources.
What’s Next?#
You’ve implemented backup, disaster recovery, and high availability for ArgoCD. Your GitOps control plane is now production-grade.
Next steps:
- Test your DR plan - Schedule a restore drill this month
- Implement monitoring - Set up Prometheus + Grafana dashboards
- GitOps everything - Move all ArgoCD config to Git (app-of-apps)
- Automate validation - CI pipeline to validate ArgoCD manifests before merge
- Security hardening - SSO integration, RBAC tightening, network policies
Final Thoughts#
That 3 AM wake-up call was painful. Six hours of manual reconstruction, team frustration, missed SLAs.
But it taught me something critical: Infrastructure automation needs the same rigor as application code.
ArgoCD is your GitOps control plane. It orchestrates hundreds of deployments. It’s the bridge between Git (your source of truth) and Kubernetes (your runtime). Treat it with the importance it deserves.
Backup strategy: Multiple tiers, automated, tested regularly.
HA deployment: Multi-replica, across nodes, with Redis failover.
GitOps approach: ArgoCD managing itself, configuration as code.
These aren’t optional for production. They’re the baseline.
Now when my phone buzzes at 3 AM, I know:
- ✅ Automated backups ran last night
- ✅ Velero has a full snapshot
- ✅ GitOps config can restore everything in 10 minutes
- ✅ HA setup means ArgoCD is still running anyway
Recovery went from 6 hours to 10 minutes.
That’s the power of treating your automation infrastructure with production-grade discipline.
Need help implementing production-grade ArgoCD?
I help teams design and implement bulletproof GitOps infrastructure. Services include:
- ArgoCD Disaster Recovery Audit - Assessment of your backup strategy and recovery procedures
- Production Hardening - HA setup, backup automation, monitoring integration
- Disaster Recovery Planning - Runbooks, testing procedures, team training
- GitOps Architecture Design - Repository structure, multi-cluster patterns, security
Schedule a consultation or reach out at www.uk4.in.
Kudos to every DevOps engineer who’s been woken up at 3 AM. You’re not alone.
Built with resilience, automated with discipline, deployed with confidence—powered by GitOps and battle-tested DR strategies.