ArgoCD Backup and Disaster Recovery: Never Lose GitOps State

The 3 AM Wake-Up Call

There are two types of DevOps engineers: those who’ve lost critical infrastructure data, and those who will.

I learned this distinction at 3:47 AM on a Tuesday.

The phone buzzed. Monitoring alert. Then another. Then the call from the on-call engineer.

“ArgoCD is down. All applications showing as Unknown.”

Heart rate: elevated. Coffee: brewing. Laptop: opening.

I logged into the cluster. The ArgoCD namespace was there. Pods were running. But something was fundamentally wrong.

kubectl get applications -n argocd

Empty.

Wait. Maybe I’m in the wrong context?

kubectl config current-context
# production-cluster

No, this is the right cluster.

argocd app list

Nothing.

# Maybe the CRD itself is gone?
kubectl get crd applications.argoproj.io
# NAME                        CREATED AT
# applications.argoproj.io    2025-08-15T10:23:41Z

The CRD exists. But no applications.

# Check all namespaces, just in case
kubectl get applications --all-namespaces
# No resources found

150+ applications managing deployments across three Kubernetes clusters. Gone.

Not “out of sync.” Not “failed to sync.” Just… gone.

What happened?

A new engineer on Team Mavericks, trying to “clean up unused resources” in the staging namespace, had accidentally run a cleanup script against the production ArgoCD namespace. The script had all the RBAC permissions it needed (we were too permissive), so it executed successfully. More than 150 application definitions were gone in seconds.

Git repositories were fine. The actual deployed applications were still running in their respective clusters. But ArgoCD—our GitOps control plane—had lost all memory of what it was managing.

The recovery took 6 hours. It should have taken 15 minutes.

We had backups. But they were ad-hoc CLI exports from two weeks ago. Applications had changed. New deployments had happened. We spent hours reconstructing state from Git commit history, deployment scripts, and tribal knowledge.

That night taught me the difference between “having backups” and “having a disaster recovery strategy.”

This article is everything I wish I’d known before that wake-up call.

Understanding ArgoCD’s Storage Architecture

Before we talk about backup, we need to understand what we’re backing up—and more importantly, where ArgoCD actually stores its data.

The Design Philosophy: Kubernetes-Native Storage

ArgoCD doesn’t use a traditional database. There’s no PostgreSQL, no MongoDB, no external data store to configure and maintain.

Everything is stored as Kubernetes resources.

This is brilliant and terrifying at the same time.

Brilliant because:

No external dependencies to manage
Backup/restore uses standard Kubernetes tools
High availability is Kubernetes-native
Disaster recovery is conceptually simple

Terrifying because:

If your Kubernetes cluster’s etcd is corrupted, ArgoCD’s state is corrupted
Accidental deletion of resources = accidental deletion of ArgoCD config
Cluster migration requires careful planning

What Gets Stored Where

When you installed ArgoCD using:

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

That manifest created several types of resources. Let’s understand what each stores:

1. Application Definitions (Custom Resources)

What: Your application configurations: what to deploy, where to deploy it, and from which Git repo.

Stored as: Application custom resources (instances of the applications.argoproj.io CRD) in the argocd namespace.

Example:

# Using kubectl (reads the Application resources directly)
kubectl get applications -n argocd

NAME                                     SYNC STATUS   HEALTH STATUS
do-analytics-prod-apps.analyticsui.prod  Synced        Healthy
aws-myapp-staging-apps.backend.staging   OutOfSync     Progressing

# Using ArgoCD CLI (same result, prettier output)
argocd app list

NAME                                     CLUSTER                          NAMESPACE           PROJECT              STATUS     HEALTH   SYNCPOLICY
do-analytics-prod-apps.analyticsui.prod  https://67860677-ba01...         analytics-prod-apps do-analytics-prod... Synced     Healthy  Auto
aws-myapp-staging.backend               https://3CCD9E5C2236A7E0...      myapp-staging       aws-myapp-staging    OutOfSync  Degraded Manual

View the actual data:

# Using kubectl (raw YAML)
kubectl get application do-analytics-prod-apps.analyticsui.prod -n argocd -o yaml

# Using ArgoCD CLI (formatted, easier to read)
argocd app get do-analytics-prod-apps.analyticsui.prod

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: do-analytics-prod-apps.analyticsui.prod
  namespace: argocd
spec:
  project: do-analytics-prod-apps
  source:
    repoURL: https://gitlab.com/company/gitops/manifests.git
    path: manifests/analytics-platform/analyticsui
    targetRevision: prod
  destination:
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
    namespace: analytics-prod-apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

This is your entire application configuration. Delete this resource, and ArgoCD forgets that the application exists.

2. Project Definitions (AppProject Custom Resources)

What: RBAC boundaries, source/destination restrictions, project-level settings.

Stored as: AppProject custom resources in the argocd namespace.

View them:

# Using kubectl
kubectl get appprojects -n argocd

# Using ArgoCD CLI
argocd proj list

Example AppProject:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: do-analytics-prod-apps
  namespace: argocd
spec:
  destinations:
  - namespace: analytics-prod-apps
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
  sourceRepos:
  - https://gitlab.com/company/gitops/manifests.git
  clusterResourceWhitelist:
  - group: '*'
    kind: '*'

Why this matters: Projects control what can be deployed where. Lose this, and you lose your security boundaries.

3. Repository Credentials (Secrets)

What: Git repository access credentials (passwords, SSH keys, tokens).

Stored as: Kubernetes Secrets with label argocd.argoproj.io/secret-type=repository.

View them (credentials are base64-encoded):

kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository

Example:

apiVersion: v1
kind: Secret
metadata:
  name: repo-123456789
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
type: Opaque
data:
  password: <base64-encoded-password>
  url: aHR0cHM6Ly9naXRsYWIuY29tL2NvbXBhbnkvcmVwby5naXQ=
  username: <base64-encoded-username>

Critical: If you lose these secrets, ArgoCD can’t pull manifests from Git.
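It is worth seeing how thin that base64 layer is. The sketch below decodes the url value from the example Secret above; the helper name decode_secret_field is hypothetical, and the same approach would apply to the username and password fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one base64-encoded Secret field value.
decode_secret_field() {
  printf '%s' "$1" | base64 -d
}

# The url value copied from the example Secret above.
decode_secret_field 'aHR0cHM6Ly9naXRsYWIuY29tL2NvbXBhbnkvcmVwby5naXQ='
echo

# Against a live cluster, the encoded value could be read with:
#   kubectl get secret repo-123456789 -n argocd -o jsonpath='{.data.url}'
```

Base64 is an encoding, not encryption, so any backup that contains these Secrets should be protected as carefully as the credentials themselves.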

4. Cluster Credentials (Secrets)

What: Authentication credentials for external Kubernetes clusters that ArgoCD manages.

Stored as: Kubernetes Secrets with label argocd.argoproj.io/secret-type=cluster.

View them:

kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster

When you ran:

argocd cluster add do-production-cluster

ArgoCD created a ServiceAccount in the target cluster and stored the credentials as a Secret in the ArgoCD namespace.

Example structure:

apiVersion: v1
kind: Secret
metadata:
  name: cluster-do-production-67860677
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
data:
  name: ZG8tcHJvZHVjdGlvbi1jbHVzdGVy
  server: aHR0cHM6Ly82Nzg2MDY3Ny1iYTAxLTRjNjQtYWVhNC05ODFmZGU5YjVmZDYuazhzLm9uZGlnaXRhbG9jZWFuLmNvbQ==
  config: <base64-encoded-kubeconfig-with-token>

Without these, ArgoCD loses access to all external clusters it manages.

5. ArgoCD Configuration (ConfigMaps)

What: ArgoCD server settings, UI customizations, notification configs, etc.

Stored as: ConfigMaps in the argocd namespace.

Key ConfigMaps:

argocd-cm - Main configuration:

kubectl get configmap argocd-cm -n argocd -o yaml

Example contents:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  url: https://argocd.example.com
  dex.config: |
    connectors:
    # SSO configuration
  repositories: |
    - url: https://gitlab.com/company/gitops

argocd-rbac-cm - RBAC policies:

kubectl get configmap argocd-rbac-cm -n argocd -o yaml

Example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    p, role:developers, applications, sync, default/*, allow
    p, role:developers, applications, get, default/*, allow
    p, role:ops, applications, *, */*, allow
    g, engineering-team, role:developers
    g, ops-team, role:ops
  policy.default: role:readonly

argocd-cmd-params-cm - Server startup parameters:

kubectl get configmap argocd-cmd-params-cm -n argocd -o yaml

Lose these, and your ArgoCD reverts to default configuration (losing SSO, RBAC policies, custom settings).

6. TLS Certificates and SSH Keys (Secrets)

What: TLS certificates for ArgoCD server, SSH keys for Git repository access.

Stored as: Various Secrets in the argocd namespace.

Examples:

argocd-server-tls - HTTPS certificate for ArgoCD UI
argocd-repo-server-tls - Internal TLS for repo server
SSH private keys for Git access (created when you add repos via SSH)

7. Initial Admin Password (Secret)

What: The initial admin password generated during installation.

Stored as: Secret argocd-initial-admin-secret.

You retrieved this during installation:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Note: This secret is typically deleted after you change the admin password, but it’s worth knowing about for fresh installations.

What ArgoCD Does NOT Store

Important distinctions:

❌ Application manifests - These live in Git, not in ArgoCD. ArgoCD pulls them on-demand.

❌ Deployed resources - The actual pods, services, deployments live in target clusters, not in ArgoCD.

❌ Git repository history - ArgoCD references Git, doesn’t clone it permanently.

❌ Application state/logs - ArgoCD tracks sync status, but not runtime logs or metrics.

This is why GitOps works: The source of truth is Git. ArgoCD is just the synchronization engine. If you lose ArgoCD, your applications keep running. You just lose the automation layer.

The Installation Decision That Matters

When you installed ArgoCD with:

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

You made a critical choice: non-HA, single-replica deployment.

Look at what this manifest creates:

# argocd-application-controller
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 1  # <-- Single replica
  serviceName: argocd-application-controller

# argocd-server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  replicas: 1  # <-- Single replica

# argocd-repo-server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 1  # <-- Single replica

What this means:

✅ Simple, easy to install
✅ Low resource usage
✅ Perfect for development/staging
❌ Single point of failure
❌ Not production-ready
❌ Downtime during upgrades

Alternative: HA Installation

For production, you’d use Helm with HA configuration:

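# Add the Argo Helm chart repository (needed once before installing)
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  --set server.replicas=3 \
  --set repoServer.replicas=2 \
  --set controller.replicas=1 \
  --set redis-ha.enabled=true \
  --set redis-ha.replicas=3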

This changes:

✅ Multiple replicas of API server (survive pod failures)
✅ Multiple repo servers (load distribution)
✅ Redis HA with Sentinel (no single point of failure)
✅ Survives node failures
✅ Zero-downtime upgrades possible

Storage-wise, this doesn’t change WHERE data is stored (still Kubernetes CRDs/ConfigMaps/Secrets), but it changes availability and resilience.

We’ll cover HA setup in detail later in this article.

The Data Flow: Where Things Happen

Understanding the data flow helps with backup planning:

┌─────────────────────────────────────────────────────────────┐
│ ArgoCD Installation (kubectl apply)                         │
│                                                              │
│  Creates in Kubernetes etcd:                                │
│  ├── CRDs (Application, AppProject)                         │
│  ├── ConfigMaps (argocd-cm, argocd-rbac-cm)                 │
│  ├── Secrets (TLS certs, repo credentials)                  │
│  └── Deployments/StatefulSets (ArgoCD components)           │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ You create Application via CLI or UI                        │
│                                                              │
│  argocd app create myapp \                                  │
│    --repo https://gitlab.com/company/repo.git \             │
│    --path manifests/myapp \                                 │
│    --dest-server https://cluster.com \                      │
│    --dest-namespace myapp                                   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ ArgoCD stores this as Application CRD                       │
│                                                              │
│  apiVersion: argoproj.io/v1alpha1                           │
│  kind: Application                                          │
│  metadata:                                                  │
│    name: myapp                                              │
│  spec:                                                      │
│    source:                                                  │
│      repoURL: https://gitlab.com/company/repo.git           │
│      path: manifests/myapp                                  │
│                                                              │
│  Stored in: Kubernetes etcd → Part of cluster state        │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ ArgoCD Application Controller watches this CRD              │
│                                                              │
│  Every 3 minutes (default):                                 │
│  1. Reads Application CRD from Kubernetes                   │
│  2. Pulls manifests from Git                                │
│  3. Compares Git state vs Cluster state                     │
│  4. Syncs if needed                                         │
└─────────────────────────────────────────────────────────────┘

Key insight for backup: The Application CRD is the bridge between Git (source of truth for manifests) and Kubernetes (source of truth for what’s deployed). Lose the CRD, lose the bridge.

The Backup Implication

Now that we know where everything lives, the backup strategy becomes clear:

What we must backup:

Application CRDs (kubectl get applications -n argocd)
AppProject CRDs (kubectl get appprojects -n argocd)
Repository credential Secrets
Cluster credential Secrets
ConfigMaps (argocd-cm, argocd-rbac-cm, argocd-cmd-params-cm)
TLS Secrets

What we don’t need to backup:

Application manifests (they’re in Git)
Deployed resources (they’re in target clusters)
ArgoCD component Deployments/StatefulSets (recreated from install manifest)

Let’s implement this.

The Three-Tier Backup Strategy

Based on that 3 AM incident and months of refining our approach at DevOps Den, here’s the backup strategy that actually works in production.

Tier 1: Automated CLI Exports (Daily Quick Backups)

Purpose: Fast recovery from accidental deletion, quick rollback capability.

Frequency: Daily, automated via cronjob.

Recovery Time: 15 minutes.

Storage: Git repository (version controlled backups).

Option A: The Quick Method (Recommended for Getting Started)

ArgoCD provides a built-in argocd admin export command that backs up everything in one shot:

# Export all ArgoCD resources to a file
argocd admin export -n argocd > argocd-backup-$(date +%Y%m%d).yaml

# Or using Docker (if argocd CLI not installed locally)
docker run -v ~/.kube:/home/argocd/.kube --rm \
  quay.io/argoproj/argocd:v2.13.2 \
  argocd admin export -n argocd > argocd-backup-$(date +%Y%m%d).yaml

What it backs up:

All Applications
All AppProjects
All repository credentials (Secrets)
All cluster credentials (Secrets)
All ConfigMaps (argocd-cm, argocd-rbac-cm, etc.)

For multi-namespace ArgoCD setups (if your applications are in multiple namespaces):

argocd admin export -n argocd \
  --application-namespaces="team-mavericks,team-infrastructure,team-platform" \
  > argocd-backup-$(date +%Y%m%d).yaml

Restore from backup:

# Import the backup (pass the same namespace that was used for the export)
argocd admin import -n argocd argocd-backup-20260120.yaml

# Or using Docker
docker run -i -v ~/.kube:/home/argocd/.kube --rm \
  quay.io/argoproj/argocd:v2.13.2 \
  argocd admin import -n argocd - < argocd-backup-20260120.yaml

Why this is great for getting started:

✅ Single command
✅ Officially supported by ArgoCD
✅ Everything in one file
✅ Simple to automate

Limitation: Single monolithic file. For production, you may want more granular backups (next approach).
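One downside of a single file is that an incomplete export is easy to miss. A quick sanity check, sketched below, counts the objects of each kind in the file. It assumes the export is a multi-document YAML stream with one top-level kind: line per object; the helper name count_kind and the sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count objects of a given kind in an export file.
# Only top-level "kind:" lines are matched, so nested fields are ignored.
count_kind() {
  file="$1"
  kind="$2"
  grep -c "^kind: ${kind}\$" "$file" || true
}

# Small stand-in for a real export file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-one
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-two
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
EOF

echo "Applications: $(count_kind "$sample" Application)"
echo "AppProjects:  $(count_kind "$sample" AppProject)"
```

Comparing these numbers with what the cluster reports at export time gives a cheap signal that the backup is complete.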

Option B: The Production Script (Granular Backups)

Create backup-argocd.sh:

#!/bin/bash
set -euo pipefail

# Configuration
BACKUP_DIR="/tmp/argocd-backup-$(date +%Y%m%d-%H%M%S)"
GIT_REPO="git@gitlab.com:company/argocd-backups.git"
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR"

echo "Starting ArgoCD backup at $(date)"

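# 1. Export ArgoCD CRDs (the schema definitions themselves)
# Note: CRDs are cluster-scoped, so the identity running this script needs
# cluster-level read access to customresourcedefinitions.
echo "Backing up ArgoCD CRDs..."
kubectl get crd applications.argoproj.io -o yaml > "$BACKUP_DIR/crd-applications.yaml"
kubectl get crd appprojects.argoproj.io -o yaml > "$BACKUP_DIR/crd-appprojects.yaml"

# 2. Export all applications
# kubectl is used here so the output is a Kubernetes List that "kubectl apply"
# can restore, and so the script does not depend on an argocd login session.
echo "Backing up applications..."
kubectl get applications.argoproj.io -n argocd -o yaml > "$BACKUP_DIR/applications.yaml"

# Count for verification (--no-headers keeps the header row out of the count)
APP_COUNT=$(kubectl get applications.argoproj.io -n argocd --no-headers 2>/dev/null | wc -l)
echo "Backed up $APP_COUNT applications"

# 3. Export all projects
echo "Backing up projects..."
kubectl get appprojects.argoproj.io -n argocd -o yaml > "$BACKUP_DIR/projects.yaml"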

# 4. Export repository credentials (requires kubectl, as ArgoCD CLI doesn't expose this)
echo "Backing up repository credentials..."
kubectl get secrets -n argocd \
  -l argocd.argoproj.io/secret-type=repository \
  -o yaml > "$BACKUP_DIR/repositories.yaml"

# 5. Export cluster credentials
echo "Backing up cluster credentials..."
kubectl get secrets -n argocd \
  -l argocd.argoproj.io/secret-type=cluster \
  -o yaml > "$BACKUP_DIR/clusters.yaml"

# 6. Export ArgoCD configuration
echo "Backing up ArgoCD configuration..."
kubectl get configmap argocd-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-cm.yaml"
kubectl get configmap argocd-rbac-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-rbac-cm.yaml"
kubectl get configmap argocd-cmd-params-cm -n argocd -o yaml > "$BACKUP_DIR/argocd-cmd-params-cm.yaml"

# 7. Export TLS secrets
echo "Backing up TLS certificates..."
kubectl get secret argocd-server-tls -n argocd -o yaml > "$BACKUP_DIR/argocd-server-tls.yaml" 2>/dev/null || echo "No server TLS secret found"

# 8. Create a manifest list for easy verification
echo "Creating manifest..."
cat > "$BACKUP_DIR/MANIFEST.txt" <<EOF
ArgoCD Backup - $(date)
================================================

Applications: $APP_COUNT
Projects: $(kubectl get appprojects.argoproj.io -n argocd --no-headers 2>/dev/null | wc -l)
Repositories: $(kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository --no-headers | wc -l)
Clusters: $(kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster --no-headers | wc -l)

Files:
$(ls -lh "$BACKUP_DIR")

Backup completed at: $(date)
EOF

cat "$BACKUP_DIR/MANIFEST.txt"

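# 9. Push to Git for version control
# Clone the existing repository and add a commit so earlier backups stay in
# the history; a fresh "git init" plus a forced push would overwrite them.
# Caution: the exported Secrets are only base64-encoded. Keep this repository
# private, or encrypt the secret files before committing them.
echo "Pushing backup to Git..."
REPO_DIR="$(mktemp -d)"
git clone "$GIT_REPO" "$REPO_DIR"
cp "$BACKUP_DIR"/* "$REPO_DIR"/
cd "$REPO_DIR"
git add -A
git commit -m "ArgoCD backup $(date +%Y-%m-%d_%H:%M:%S)"
git push origin HEAD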

# 10. Cleanup old local backups (older than RETENTION_DAYS)
echo "Cleaning up old backups..."
find /tmp -maxdepth 1 -type d -name 'argocd-backup-*' -mtime +"$RETENTION_DAYS" -exec rm -rf {} + 2>/dev/null || true

echo "Backup completed successfully at $(date)"
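A backup that silently wrote empty files is worse than no backup, because it looks fine until you need it. The sketch below is one way to check the backup directory before anything is pushed. The helper name verify_backup_dir is hypothetical; the file names are the ones the script above writes, and the TLS file is left out because that Secret is optional.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail if any expected backup file is missing or empty.
verify_backup_dir() {
  dir="$1"
  status=0
  for f in applications.yaml projects.yaml repositories.yaml clusters.yaml \
           argocd-cm.yaml argocd-rbac-cm.yaml argocd-cmd-params-cm.yaml; do
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING OR EMPTY: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Possible use inside the backup script, before the Git step:
#   verify_backup_dir "$BACKUP_DIR" || { echo "Backup incomplete, aborting" >&2; exit 1; }
```

Failing loudly here means a broken backup shows up as a failed job rather than as a surprise during a restore.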

Deploy as Kubernetes CronJob

Create argocd-backup-cronjob.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-backup
  namespace: argocd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argocd-backup
  namespace: argocd
rules:
- apiGroups: [""]
  resources: ["secrets", "configmaps"]
  verbs: ["get", "list"]
- apiGroups: ["argoproj.io"]
  resources: ["applications", "appprojects"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-backup
  namespace: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-backup
subjects:
- kind: ServiceAccount
  name: argocd-backup
  namespace: argocd
---
apiVersion: v1
kind: Secret
metadata:
  name: argocd-backup-git-ssh
  namespace: argocd
type: Opaque
data:
  id_rsa: <base64-encoded-ssh-private-key>
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-backup
  namespace: argocd
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: argocd-backup
          containers:
          - name: backup
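            image: quay.io/argoproj/argocd:v2.13.2
            command:
            - /bin/bash
            - -c
            - |
              # The argocd image runs as a non-root user, so apt-get cannot be used here.
              # It already ships git; confirm that kubectl is available in the image you
              # use, or build a small custom image that adds it.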

              # Configure git
              git config --global user.email "[email protected]"
              git config --global user.name "ArgoCD Backup"

              # Setup SSH for git
              mkdir -p ~/.ssh
              cp /ssh-key/id_rsa ~/.ssh/id_rsa
              chmod 600 ~/.ssh/id_rsa
              ssh-keyscan gitlab.com >> ~/.ssh/known_hosts

              # Run backup script
              /scripts/backup-argocd.sh
            volumeMounts:
            - name: backup-script
              mountPath: /scripts
            - name: ssh-key
              mountPath: /ssh-key
              readOnly: true
          volumes:
          - name: backup-script
            configMap:
              name: argocd-backup-script
              defaultMode: 0755
          - name: ssh-key
            secret:
              secretName: argocd-backup-git-ssh
              defaultMode: 0600
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-backup-script
  namespace: argocd
data:
  backup-argocd.sh: |
    #!/bin/bash
    # (Insert the backup script content here)

Deploy it:

kubectl apply -f argocd-backup-cronjob.yaml

Verify it’s working:

# Check cronjob schedule
kubectl get cronjob -n argocd

# Manually trigger for testing
kubectl create job --from=cronjob/argocd-backup argocd-backup-manual -n argocd

# Watch the job
kubectl logs -f job/argocd-backup-manual -n argocd
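Checking once is not the same as knowing the job still runs next month. A staleness check is one option: compare the time of the last successful run with the current time and alert when the gap grows too large. The sketch below shows only the arithmetic. It assumes GNU date (the -d flag), and the helper name hours_between is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole hours between two RFC 3339 timestamps.
# Assumes GNU date; BSD/macOS date uses different flags.
hours_between() {
  start="$(date -u -d "$1" +%s)"
  end="$(date -u -d "$2" +%s)"
  echo $(( (end - start) / 3600 ))
}

# Fixed timestamps for illustration: last success at 02:00 UTC,
# checked at 14:00 UTC the following day.
hours_between "2026-01-20T02:00:00Z" "2026-01-21T14:00:00Z"

# In a cluster, the first timestamp could come from the CronJob status:
#   kubectl get cronjob argocd-backup -n argocd -o jsonpath='{.status.lastSuccessfulTime}'
# and the second from:
#   date -u +%Y-%m-%dT%H:%M:%SZ
```

With a daily schedule, anything much above 24 hours suggests a missed run.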

The Restore Procedure (Tier 1)

When disaster strikes:

# 1. Clone the backup repository
git clone git@gitlab.com:company/argocd-backups.git
cd argocd-backups

# 2. List available backups
git log --oneline

# 3. Checkout the desired backup
git checkout <commit-hash>

# 4. Restore CRDs first (if they were deleted)
kubectl apply -f crd-applications.yaml
kubectl apply -f crd-appprojects.yaml

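# Wait for CRDs to be established
kubectl wait --for condition=established --timeout=60s \
  crd/applications.argoproj.io crd/appprojects.argoproj.io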

# 5. Restore projects (dependencies for applications)
kubectl apply -f projects.yaml

# 6. Restore repository credentials
kubectl apply -f repositories.yaml

# 7. Restore cluster credentials
kubectl apply -f clusters.yaml

# 8. Restore configuration
kubectl apply -f argocd-cm.yaml
kubectl apply -f argocd-rbac-cm.yaml
kubectl apply -f argocd-cmd-params-cm.yaml

# 9. Restore applications
kubectl apply -f applications.yaml

# 10. Restart ArgoCD components to pick up new config
kubectl rollout restart deployment argocd-server -n argocd
kubectl rollout restart deployment argocd-repo-server -n argocd
kubectl rollout restart statefulset argocd-application-controller -n argocd

# 11. Verify
argocd app list

Recovery time: 10-15 minutes from start to finish.

What this saved us: During the 3 AM incident, if we’d had this in place, we would’ve been back online in 15 minutes instead of 6 hours.
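"argocd app list shows something" is a weak definition of recovered. A stricter check compares the application count recorded in MANIFEST.txt with what the cluster reports after the restore. The sketch below assumes the manifest contains a line of the form "Applications: N" and that N excludes any header row; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the "Applications: N" line from a manifest file.
expected_app_count() {
  awk -F': *' '$1 == "Applications" { print $2; exit }' "$1"
}

# Hypothetical helper: compare the recorded count with a live count.
check_restore() {
  manifest="$1"
  live="$2"
  expected="$(expected_app_count "$manifest")"
  if [ "$expected" = "$live" ]; then
    echo "OK: $live applications restored"
  else
    echo "MISMATCH: backup recorded $expected, cluster has $live" >&2
    return 1
  fi
}

# Possible use after step 11:
#   check_restore MANIFEST.txt \
#     "$(kubectl get applications -n argocd --no-headers | wc -l | tr -d ' ')"
```

A mismatch does not always mean a failed restore, since applications may have been added after the backup, but it tells you where to look.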

Tier 2: Velero (Complete Disaster Recovery)

Purpose: Full namespace backup, cross-cluster migration, complete disaster recovery.

Frequency: Daily automated, plus manual before major changes.

Recovery Time: 30-60 minutes (depending on data size).

Storage: Object storage (S3, GCS, Azure Blob).

Why Velero?

Velero backs up entire Kubernetes namespaces, including:

All CRDs (Applications, AppProjects)
All Secrets (repo credentials, cluster credentials, TLS certs)
All ConfigMaps (ArgoCD settings)
All PersistentVolumes (if applicable)
Resource relationships and dependencies

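Velero captures the whole namespace in a single backup operation. Its backups are not strictly atomic, though: objects created or changed while a backup is running may not be included, so avoid taking a backup in the middle of large configuration changes.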

Installing Velero

Prerequisites: Object storage bucket (S3/GCS/Azure Blob).

For AWS S3:

# Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/

# Create S3 bucket
aws s3 mb s3://argocd-backups-velero --region ap-south-1

# Create IAM user for Velero
aws iam create-user --user-name velero

# Attach policy (AmazonS3FullAccess is used here for brevity; for production, prefer a policy scoped to the backup bucket)
aws iam attach-user-policy --user-name velero --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Create access key
aws iam create-access-key --user-name velero

# Create credentials file
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<ACCESS_KEY>
aws_secret_access_key=<SECRET_KEY>
EOF

# Install Velero in the cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero

Verify installation:

kubectl get pods -n velero
kubectl logs deployment/velero -n velero

Create Backup Schedule for ArgoCD

# Daily backup at 2 AM, retain for 30 days
velero schedule create argocd-daily \
  --schedule="0 2 * * *" \
  --include-namespaces argocd \
  --ttl 720h

# Verify schedule
velero schedule get
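The --ttl value is a duration in hours, so "retain for 30 days" becomes 720h. If you think about retention in days, a tiny helper keeps the conversion out of your head; the name ttl_for_days is made up for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a retention period in days
# into an hour-based duration suitable for Velero's --ttl flag.
ttl_for_days() {
  echo "$(( $1 * 24 ))h"
}

ttl_for_days 30   # prints 720h
ttl_for_days 7    # prints 168h

# Possible use:
#   velero schedule create argocd-daily \
#     --schedule="0 2 * * *" \
#     --include-namespaces argocd \
#     --ttl "$(ttl_for_days 30)"
```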

Manual Backup Before Major Changes

Before cluster migration, ArgoCD upgrades, or major configuration changes:

# Create named backup (store the name once so later commands still match after midnight)
BACKUP_NAME="argocd-pre-migration-$(date +%Y%m%d)"
velero backup create "$BACKUP_NAME" \
  --include-namespaces argocd \
  --wait

# Verify backup completed
velero backup describe "$BACKUP_NAME"

# Check backup logs
velero backup logs "$BACKUP_NAME"

The Velero Restore Procedure

Scenario 1: Restore to Same Cluster

# List available backups
velero backup get

# Restore from backup
RESTORE_NAME="argocd-restore-$(date +%Y%m%d)"
velero restore create "$RESTORE_NAME" \
  --from-backup argocd-daily-20260120020000 \
  --wait

# Monitor restore
velero restore describe "$RESTORE_NAME"
velero restore logs "$RESTORE_NAME"

# Verify
kubectl get applications -n argocd
argocd app list

Scenario 2: Restore to New Cluster (Complete DR)

# 1. Install Velero in new cluster (pointing to same S3 bucket)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero

# 2. Verify Velero can see existing backups
velero backup get

# 3. Restore ArgoCD from backup
velero restore create argocd-dr-restore \
  --from-backup argocd-daily-20260120020000 \
  --wait

# 4. Verify all components
kubectl get all -n argocd
kubectl get applications -n argocd
kubectl get appprojects -n argocd

# 5. Update ArgoCD server URL if needed
kubectl patch configmap argocd-cm -n argocd \
  --type merge \
  -p '{"data":{"url":"https://new-argocd.example.com"}}'

# 6. Restart ArgoCD
kubectl rollout restart deployment argocd-server -n argocd

Recovery time: 30-60 minutes depending on backup size and network speed.
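During an incident, nobody wants to squint at a list of backup names to find the newest one. Scheduled backups follow the <schedule>-<YYYYMMDDhhmmss> pattern used in the examples above, so sorting the names also sorts them by time. The sketch below picks the latest backup for a schedule from a list of names on standard input; the helper name latest_backup and the sample names are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest backup name for a schedule.
# Names that do not match <schedule>-<14 digits> (manual backups) are ignored.
latest_backup() {
  schedule="$1"
  grep -E "^${schedule}-[0-9]{14}\$" | sort | tail -n 1
}

printf '%s\n' \
  argocd-daily-20260118020000 \
  argocd-daily-20260120020000 \
  argocd-pre-migration-20260119 \
  argocd-daily-20260119020000 | latest_backup argocd-daily

# Against a cluster, the names could come from the Backup resources, assuming
# Velero is installed in the velero namespace:
#   kubectl get backups.velero.io -n velero --no-headers \
#     -o custom-columns=NAME:.metadata.name | latest_backup argocd-daily
```

The newest backup is not automatically the right one. If the damage happened before the last scheduled run, pick an earlier name instead.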

Tier 3: GitOps for ArgoCD (The Meta Approach)

Purpose: ArgoCD managing its own configuration, self-healing setup, infrastructure as code.

Philosophy: If ArgoCD is the GitOps tool, why not use GitOps to manage ArgoCD itself?

This is what we implemented at DevOps Den after the 3 AM incident. Now all our ArgoCD configuration lives in Git, managed by ArgoCD itself using the app-of-apps pattern.

Benefits:

Version-controlled ArgoCD configuration (every change tracked in Git)
Self-healing (ArgoCD syncs its own config from Git)
Reproducible across environments (dev, staging, prod)
No manual backup needed (Git is the backup)
Audit trail (who changed what, when, and why)

The Structure

Create a Git repository for ArgoCD configuration:

argocd-bootstrap/
├── argocd-install/
│   ├── namespace.yaml
│   └── install.yaml
├── projects/
│   ├── project-analytics.yaml
│   ├── project-myapp.yaml
│   └── project-compliance.yaml
├── applications/
│   ├── do-analytics-prod-analyticsui.yaml
│   ├── do-myapp-staging-backend.yaml
│   └── aws-myapp-prod-frontend.yaml
├── repositories/
│   ├── repo-gitlab-company.yaml  # Using sealed-secrets
│   └── repo-github-oss.yaml
├── clusters/
│   ├── cluster-do-prod.yaml      # Using sealed-secrets
│   ├── cluster-aws-prod.yaml
│   └── cluster-e2e-test.yaml
├── config/
│   ├── argocd-cm.yaml
│   ├── argocd-rbac-cm.yaml
│   └── argocd-cmd-params-cm.yaml
└── app-of-apps.yaml

Example Files

projects/project-analytics.yaml:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: do-analytics-prod-apps
  namespace: argocd
spec:
  description: Analytics platform - Production
  sourceRepos:
  - https://gitlab.com/company/gitops/manifests.git
  destinations:
  - namespace: analytics-prod-apps
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
  clusterResourceWhitelist:
  - group: '*'
    kind: '*'

applications/do-analytics-prod-analyticsui.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: do-analytics-prod-apps.analyticsui.prod
  namespace: argocd
spec:
  project: do-analytics-prod-apps
  source:
    repoURL: https://gitlab.com/company/gitops/manifests.git
    path: manifests/analytics-platform/analyticsui
    targetRevision: prod
  destination:
    server: https://67860677-ba01-4c64-aea4-981fde9b5fc6.k8s.ondigitalocean.com
    namespace: analytics-prod-apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Handling Secrets: Use Sealed Secrets for repository/cluster credentials:

# Install sealed-secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

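# Create a secret for repository credentials.
# kubeseal binds the sealed secret to its name and namespace by default, and ArgoCD
# finds repositories by the secret-type label, so set both before sealing.
kubectl create secret generic repo-gitlab-company \
  --namespace argocd \
  --from-literal=url=https://gitlab.com/company/gitops/manifests.git \
  --from-literal=username=git \
  --from-literal=password=your-token \
  --dry-run=client -o yaml \
  | kubectl label --local -f - argocd.argoproj.io/secret-type=repository -o yaml \
  > repo-secret.yaml

# Seal it, then delete the plain-text copy so it never reaches Git
kubeseal -o yaml < repo-secret.yaml > repositories/repo-gitlab-company.yaml
rm repo-secret.yaml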

# Now you can commit the sealed secret to Git
git add repositories/repo-gitlab-company.yaml
git commit -m "Add GitLab repository credentials (sealed)"

The sealed secret looks like:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: repo-gitlab-company
  namespace: argocd
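spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    url: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
    username: AgBy3i4OJSWK+PiTySYZZA9rO43cGDEq...
  # Labels must sit under spec.template to reach the generated Secret;
  # labels on the SealedSecret's own metadata are not copied over.
  template:
    metadata:
      name: repo-gitlab-company
      namespace: argocd
      labels:
        argocd.argoproj.io/secret-type: repository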

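Safe to commit to Git: only a sealed-secrets controller holding the matching private key can decrypt it. Back up that key separately (outside the cluster), because without it the sealed secrets in Git cannot be decrypted on a rebuilt cluster.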
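
Since the whole point is that no plain-text credentials reach Git, a small guard in a pre-commit hook or CI job can catch an unsealed Secret before it is pushed. This is a rough sketch under the assumption that every manifest declares its `kind:` at the start of a line; `find_plain_secrets` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch of a pre-commit guard: list YAML files under a directory that declare
# a plain "kind: Secret". Files with "kind: SealedSecret" do not match.
# find_plain_secrets is a hypothetical helper name.
find_plain_secrets() {
  local dir="$1"
  # -r recurse, -l print file names only, -E extended regex anchored to the whole line
  { grep -rlE --include='*.yaml' '^kind:[[:space:]]*Secret[[:space:]]*$' "$dir" 2>/dev/null || true; } | sort
}
```

For example, `[ -z "$(find_plain_secrets repositories)" ] || exit 1` would fail a commit that contains an unsealed repository credential.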

The App-of-Apps Pattern

app-of-apps.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.com/company/argocd-bootstrap.git
    targetRevision: main
    path: .
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

Deploy the App-of-Apps:

# After fresh ArgoCD installation
kubectl apply -f app-of-apps.yaml

Now ArgoCD manages itself:

The argocd-config application watches the Git repository
Any changes to projects, applications, config → Git commit → ArgoCD auto-syncs
ArgoCD’s configuration is now version-controlled and self-healing

Benefits:

✅ All ArgoCD config in Git (version controlled, auditable)
✅ Self-healing (manual changes get reverted)
✅ Reproducible (deploy to new cluster = kubectl apply -f app-of-apps.yaml)
✅ No manual backups needed (Git is the backup)

The DR scenario becomes:

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Deploy app-of-apps
kubectl apply -f app-of-apps.yaml

# Wait 2 minutes, ArgoCD recreates everything from Git
argocd app list

Recovery time: 5 minutes.
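
Rather than waiting a fixed two minutes, a runbook can poll until the expected number of Application resources is back. A minimal sketch, assuming the applications live in the `argocd` namespace; `wait_for_apps` and its arguments are hypothetical names for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll until at least EXPECTED Application resources exist again.
# Usage: wait_for_apps EXPECTED [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# wait_for_apps is a hypothetical helper; it only wraps "kubectl get applications".
wait_for_apps() {
  local expected="$1" timeout="${2:-300}" interval="${3:-10}" waited=0 count
  while :; do
    # One output line per Application; "No resources found" goes to stderr
    count="$(kubectl get applications -n argocd --no-headers 2>/dev/null | wc -l | tr -d ' ')"
    if [ "$count" -ge "$expected" ]; then
      echo "found $count applications"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out with $count of $expected applications"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

In our case that would be something like `wait_for_apps 150 600`, which also gives the drill a measurable recovery time instead of a guess.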

High Availability: Making ArgoCD Bulletproof

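Now that we know how to back up, let’s reduce how often we need to restore by making ArgoCD highly available. HA guards against pod and node failures; it does not replace backups, because extra replicas cannot undo a deletion like the one in the 3 AM incident.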

The Problem with Single-Replica Deployment

Recall the installation command:

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

This creates:

1 replica of argocd-server (API and UI)
1 replica of argocd-repo-server (Git repository interaction)
1 replica of argocd-application-controller (sync orchestration)
1 replica of Redis (caching)

What breaks:

Node failure → ArgoCD goes down until pod reschedules
Pod crash → 30-60 seconds of downtime
Rolling updates → downtime during upgrade
High load → single pod can’t scale

HA Architecture

Production-grade ArgoCD needs:

┌─────────────────────────────────────────────────────────────┐
│ argocd-server (3 replicas)                                  │
│ ├── Pod 1 (node-1)                                          │
│ ├── Pod 2 (node-2)    ← Load balanced                       │
│ └── Pod 3 (node-3)                                          │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ argocd-repo-server (2 replicas)                             │
│ ├── Pod 1 (node-1)    ← Git cloning, manifest generation    │
│ └── Pod 2 (node-2)                                          │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ argocd-application-controller (1 replica)                   │
│ └── Pod 1 (StatefulSet)  ← Watches and reconciles apps      │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Redis HA (3 replicas with Sentinel)                         │
│ ├── Redis Pod 1 (Master)                                    │
│ ├── Redis Pod 2 (Replica)  ← Automatic failover             │
│ └── Redis Pod 3 (Replica)                                   │
│                                                              │
│ Sentinel (3 replicas) - monitors Redis, elects new master   │
└─────────────────────────────────────────────────────────────┘

Why this survives failures:

API server (3 replicas) → 2 can fail, 1 keeps serving
Repo server (2 replicas) → 1 can fail, Git operations continue
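Application controller (1 replica) → StatefulSet recreates the pod after a failure; reconciliation pauses until it is back (ArgoCD scales the controller by sharding clusters across replicas, not by leader election)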
Redis HA → Master fails, Sentinel promotes replica

Installing HA ArgoCD

Using Helm (recommended for HA):

# Add ArgoCD Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install with HA configuration
helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  --set server.replicas=3 \
  --set repoServer.replicas=2 \
  --set controller.replicas=1 \
  --set redis-ha.enabled=true \
  --set redis-ha.replicas=3 \
  --set redis.enabled=false

Verify HA deployment:

kubectl get pods -n argocd

# Should see:
# argocd-server-xxx-1
# argocd-server-xxx-2
# argocd-server-xxx-3
# argocd-repo-server-xxx-1
# argocd-repo-server-xxx-2
# argocd-application-controller-0
# argocd-redis-ha-server-0
# argocd-redis-ha-server-1
# argocd-redis-ha-server-2
# argocd-redis-ha-haproxy-xxx-1
# argocd-redis-ha-haproxy-xxx-2
# argocd-redis-ha-haproxy-xxx-3
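
Eyeballing the pod list works, but the same check can be scripted so it runs after every upgrade. The sketch below assumes the Helm release is named `argocd`, so the workloads are called `argocd-server`, `argocd-repo-server`, and `argocd-application-controller`; `ready_replicas` and `check_min_ready` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: compare ready replica counts against the HA targets.
# Workload names assume a Helm release called "argocd" (an assumption).

# ready_replicas KIND NAME: prints the ready replica count (0 if unset)
ready_replicas() {
  local n
  n="$(kubectl get "$1" "$2" -n argocd -o jsonpath='{.status.readyReplicas}' 2>/dev/null || true)"
  echo "${n:-0}"
}

# check_min_ready KIND NAME MIN: prints OK/FAIL and returns non-zero on FAIL
check_min_ready() {
  local ready
  ready="$(ready_replicas "$1" "$2")"
  if [ "$ready" -ge "$3" ]; then
    echo "OK   $1/$2 ready=$ready (want >= $3)"
  else
    echo "FAIL $1/$2 ready=$ready (want >= $3)"
    return 1
  fi
}
```

Typical calls would be `check_min_ready deployment argocd-server 3`, `check_min_ready deployment argocd-repo-server 2`, and `check_min_ready statefulset argocd-application-controller 1`.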

HA with values.yaml (Recommended for Production)

Create argocd-ha-values.yaml:

# Server HA
server:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: argocd-server
        topologyKey: kubernetes.io/hostname
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

# Repo server HA
repoServer:
  replicas: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: argocd-repo-server
        topologyKey: kubernetes.io/hostname
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

# Application controller
controller:
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi

# Redis HA
redis:
  enabled: false

redis-ha:
  enabled: true
  replicas: 3
  haproxy:
    replicas: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-redis-ha-haproxy
          topologyKey: kubernetes.io/hostname

# High availability settings
configs:
  cm:
    timeout.reconciliation: 180s
    statusbadge.enabled: "true"
  params:
    controller.operation.processors: "10"
    controller.status.processors: "20"
    controller.self.heal.timeout.seconds: "5"
    server.insecure: "false"

Install with values file:

helm install argocd argo/argo-cd \
  --namespace argocd \
  --create-namespace \
  -f argocd-ha-values.yaml

Testing HA: Chaos Engineering

Test 1: Kill argocd-server pod:

# Identify a server pod
kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server

# Delete one
kubectl delete pod argocd-server-xxx-1 -n argocd

# Access UI - should continue working (other 2 pods serving)

Test 2: Drain a node:

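# Drain a node where argocd-server pods run
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Replicas on the other nodes keep serving while evicted pods reschedule.
# With required anti-affinity, an evicted pod stays Pending unless a spare node is free.

# Return the node to service when the test is done
kubectl uncordon node-1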

Test 3: Redis failover:

# Kill Redis master
kubectl delete pod argocd-redis-ha-server-0 -n argocd

# Watch Sentinel promote a replica
kubectl logs -f argocd-redis-ha-server-1 -n argocd -c sentinel

# ArgoCD continues operating with new Redis master
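
"Should continue working" is easier to trust when it is measured. One option is to run a probe loop in a second terminal while the chaos tests above are in progress. This is a sketch, assuming the server's `/healthz` endpoint is reachable at a URL you supply; `probe_health` is a hypothetical helper and `argocd.example.com` is a placeholder host.

```shell
#!/usr/bin/env bash
# Sketch: probe an HTTP endpoint repeatedly and count failures.
# Usage: probe_health URL ATTEMPTS [INTERVAL_SECONDS]
# probe_health is a hypothetical helper name.
probe_health() {
  local url="$1" attempts="$2" interval="${3:-1}" failures=0 i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f fail on HTTP errors, -s silent, -k accept a self-signed certificate, -m 2 two-second timeout
    if ! curl -fsk -m 2 -o /dev/null "$url"; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failures of $attempts probes failed"
  [ "$failures" -eq 0 ]
}
```

For example, start `probe_health https://argocd.example.com/healthz 60` and then delete a server pod; any non-zero failure count is downtime that the HA setup did not absorb.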

HA Monitoring

Key metrics to track:

Pod availability:

kubectl get pods -n argocd -o wide

# All pods should be Running and Ready

Redis HA status:

# Check Redis Sentinel status
kubectl exec -it argocd-redis-ha-server-0 -n argocd -c redis -- redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

# Should return current master IP

Application sync status:

argocd app list

# Monitor for OutOfSync or degraded apps
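
With 150+ applications, scanning the full list by eye gets old quickly. A small filter that prints only the applications needing attention is one way to script this check. The sketch below reads status straight from the Application resources; `list_app_status` and `filter_unhealthy` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: show only applications that are not both Synced and Healthy.

# list_app_status prints "NAME SYNC HEALTH" for each Application in the argocd namespace
list_app_status() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
}

# filter_unhealthy reads that format on stdin and prints the rows needing attention
filter_unhealthy() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 ": sync=" $2 ", health=" $3 }'
}
```

Usage is `list_app_status | filter_unhealthy`; empty output means everything is Synced and Healthy.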

Prometheus metrics:

# The application controller exposes Prometheus metrics on port 8082
# (argocd-server uses 8083, argocd-repo-server uses 8084)
apiVersion: v1
kind: Service
metadata:
  name: argocd-metrics
  namespace: argocd
spec:
  ports:
  - name: metrics
    port: 8082
    targetPort: 8082
  selector:
    app.kubernetes.io/name: argocd-application-controller

Grafana dashboards: Import ArgoCD official dashboard (ID 14584) from Grafana.com.
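
Before wiring up dashboards, a quick way to confirm the metrics are flowing is to read them by hand. The application controller serves an `argocd_app_info` series per application on port 8082; the sketch below counts the ones reporting OutOfSync. The `sync_status` label name is taken from the controller's metric output as I know it, so treat it as an assumption to verify against your version; `count_out_of_sync` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count applications reported as OutOfSync in Prometheus text format.
# Reads metrics on stdin; the sync_status label is an assumption to verify.
count_out_of_sync() {
  # grep -c prints the number of matching lines (and exits 1 when that number is 0)
  grep -c '^argocd_app_info{.*sync_status="OutOfSync"' || true
}
```

One way to feed it: run `kubectl port-forward -n argocd pod/argocd-application-controller-0 8082:8082` in one terminal, then `curl -s localhost:8082/metrics | count_out_of_sync` in another.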

Disaster Recovery Scenarios and Runbooks

Let’s walk through real disaster scenarios and exact recovery procedures.

Scenario 1: Accidental Application Deletion

Symptom: Someone ran kubectl delete application myapp -n argocd.

Impact: ArgoCD forgets about the application, but deployed resources keep running in target cluster.

Recovery (Tier 1 - CLI backup):

# 1. Clone backup repo
git clone git@gitlab.com:company/argocd-backups.git
cd argocd-backups

# 2. Find the deleted application
grep -n "name: myapp" applications.yaml

# 3. Extract just that application and re-apply it
#    (if the export is a single "kind: List", select from .items[] instead)
yq eval 'select(.metadata.name == "myapp")' applications.yaml | kubectl apply -f -

# 4. Verify
argocd app get myapp
argocd app sync myapp

Recovery time: 2 minutes.

Scenario 2: Complete ArgoCD Namespace Deletion

Symptom: kubectl delete namespace argocd (oops).

Impact: Total loss of ArgoCD. Applications keep running in target clusters, but no automation.

Recovery (Tier 2 - Velero):

# 1. Reinstall ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for pods to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s

# 2. Install Velero (if not already installed)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero

# 3. List available backups
velero backup get

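# 4. Restore from the latest backup (use the exact name listed in step 3;
#    a date-derived name breaks if today's scheduled backup has not run yet)
velero restore create argocd-emergency-restore \
  --from-backup <backup-name-from-step-3> \
  --wait

# 5. Verify restore
kubectl get applications -n argocd
argocd app list

# 6. Sync all applications to recover from any drift
#    (argocd app sync takes application names, so feed it the list)
argocd app list -o name | xargs -n 1 argocd app sync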

Recovery time: 20-30 minutes.
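
At 3 AM, picking the right backup name out of a long table is an easy place to slip. The choice can be scripted by filtering the `velero backup get` table for completed backups with the schedule's prefix. This is a sketch that assumes the default table layout (name in the first column, status in the second) and the `argocd-daily-` prefix used above; `latest_completed_backup` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: pick the newest Completed backup for a schedule prefix.
# Reads the table printed by "velero backup get" on stdin.
# Usage: velero backup get | latest_completed_backup argocd-daily-
latest_completed_backup() {
  # Keep rows whose name starts with the prefix and whose status is Completed.
  # Names end in a YYYYMMDDHHMMSS timestamp, so a plain sort orders them by time.
  awk -v prefix="$1" 'index($1, prefix) == 1 && $2 == "Completed" { print $1 }' | sort | tail -n 1
}
```

The result can then be passed to `velero restore create --from-backup`. Backups in other states, such as PartiallyFailed, are skipped on purpose.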

Scenario 3: Cluster Migration (Complete Infrastructure Change)

Symptom: Migrating from DigitalOcean to AWS EKS. Need to move ArgoCD.

Recovery (Tier 3 - GitOps approach):

Prerequisites: ArgoCD config is in Git (argocd-bootstrap repo).

# 1. New cluster is provisioned (AWS EKS)
aws eks update-kubeconfig --name production-cluster --region ap-south-1

# 2. Install ArgoCD in new cluster
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 3. Install sealed-secrets controller (for decrypting repo/cluster credentials)
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

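# 4. Copy the sealed-secrets key from the old cluster
#    (<old-cluster-context> is a placeholder for the old cluster's kubeconfig context;
#    the key Secrets carry the sealedsecrets.bitnami.com/sealed-secrets-key label)
kubectl --context <old-cluster-context> get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > sealed-secrets-key.yaml
kubectl apply -f sealed-secrets-key.yaml

# Restart the controller so it loads the imported key and can decrypt the same sealed secrets
kubectl delete pod -n kube-system -l name=sealed-secrets-controller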

# 5. Deploy app-of-apps
kubectl apply -f https://raw.githubusercontent.com/company/argocd-bootstrap/main/app-of-apps.yaml

# 6. Wait for ArgoCD to self-configure
kubectl get applications -n argocd --watch

# Within 2-3 minutes, all projects, applications, repos, clusters restored

# 7. Verify
argocd app list
argocd app list -o name | xargs -n 1 argocd app sync

Recovery time: 10 minutes.

Key advantage: Zero manual reconstruction. Git is the source of truth.
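
"Zero manual reconstruction" is worth verifying before the old cluster is decommissioned. One simple check is to export the application names from both clusters and diff them. A minimal sketch, assuming each list was saved with something like `kubectl get applications -n argocd -o name > old.txt` against the respective cluster; `missing_apps` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print application names present in the old cluster's list but absent from the new one.
# Usage: missing_apps OLD_LIST_FILE NEW_LIST_FILE
missing_apps() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from the new list
  grep -Fxv -f "$2" "$1" || true
}
```

Empty output from `missing_apps old.txt new.txt` means every application from the old cluster exists in the new one.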

Scenario 4: etcd Corruption (Kubernetes Cluster Disaster)

Symptom: Kubernetes etcd corrupted. Cluster state lost.

Impact: Total cluster failure. All ArgoCD data in etcd is gone.

Recovery (Combination of Velero + GitOps):

# 1. Rebuild Kubernetes cluster (new control plane)
# (Cloud provider specific - EKS console, DO dashboard, kubeadm, etc.)

# 2. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 3. Option A: Restore from Velero (if Velero backups were in object storage)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket argocd-backups-velero \
  --backup-location-config region=ap-south-1 \
  --snapshot-location-config region=ap-south-1 \
  --secret-file ./credentials-velero

velero restore create disaster-recovery \
  --from-backup argocd-daily-<latest> \
  --wait

# 3. Option B: Restore from Git (if using GitOps approach)
kubectl apply -f https://raw.githubusercontent.com/company/argocd-bootstrap/main/app-of-apps.yaml

# 4. Verify
argocd app list
argocd app list -o name | xargs -n 1 argocd app sync

Recovery time: 30-60 minutes (most time spent rebuilding Kubernetes cluster).

Production Best Practices Checklist

Based on months of running ArgoCD in production, here’s the checklist:

Backup & DR

Daily automated backups (CLI export to Git)
Velero scheduled backups (daily, 30-day retention)
Manual backup before changes (migrations, upgrades, major config)
GitOps for ArgoCD config (app-of-apps pattern)
Quarterly restore testing (verify backups actually work)
Documented runbooks (DR procedures written down)
Off-cluster backup storage (S3/GCS, not in same cluster)

High Availability

Multi-replica deployments (3x server, 2x repo-server)
Redis HA with Sentinel (automatic failover)
Pod anti-affinity (spread across nodes)
Resource limits configured (prevent resource exhaustion)
Horizontal Pod Autoscaling (scale under load)

Security

Sealed secrets for credentials (never plain text in Git)
RBAC policies configured (least privilege access)
TLS for ArgoCD server (encrypted communication)
Ingress with authentication (SSO via Dex/OIDC)
Regular security updates (patch ArgoCD versions)

Monitoring & Alerting

Prometheus metrics enabled (performance visibility)
Grafana dashboards (ArgoCD operational metrics)
Alerts on sync failures (email/webhook notifications)
Alerts on application OutOfSync (drift detection)
Redis HA monitoring (Sentinel status checks)

Operational Excellence

ArgoCD version pinned (avoid unexpected upgrades)
Upgrade tested in staging (before production)
Application naming convention (consistent, predictable)
Project structure documented (team onboarding)
Git repository structure (clear, scalable)

Lessons from the Trenches

Lesson 1: Backup is Only Half the Story

We had backups. But we’d never tested restore. When disaster struck, we discovered:

Backups were incomplete (missing cluster credentials)
Restore procedure wasn’t documented
Team didn’t know how to restore

Takeaway: Test your restore procedure quarterly. Schedule it. Put it on the calendar. Actually restore to a test cluster.

Lesson 2: Manual Backups Don’t Scale

Early on, we’d manually run argocd app list -o yaml > backup.yaml before major changes.

We forgot. A lot.

Takeaway: Automate everything. CronJob, Velero schedule, Git-based config. If it requires a human to remember, it will fail.
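
Automation can fail silently too, so it is worth checking that the automated backup actually ran. The sketch below flags a backup file that is older than expected; the file path and the 26-hour threshold in the usage note are illustrative assumptions, and `backup_is_fresh` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report whether a backup file was modified within the last MAX_AGE_HOURS.
# Usage: backup_is_fresh FILE MAX_AGE_HOURS
backup_is_fresh() {
  local file="$1" max_minutes=$(( $2 * 60 ))
  if [ ! -f "$file" ]; then
    echo "missing: $file"
    return 1
  fi
  # find prints the path only if it was modified less than max_minutes ago
  if [ -n "$(find "$file" -mmin "-$max_minutes")" ]; then
    echo "fresh: $file"
  else
    echo "stale: $file"
    return 1
  fi
}
```

For a daily export, `backup_is_fresh argocd-backups/applications.yaml 26` leaves a two-hour margin; wiring the non-zero exit status into an alert closes the loop.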

Lesson 3: HA Isn’t Optional for Production

At DevOps Den, we ran single-replica ArgoCD for months. “It’s just automation, we can redeploy.”

Then ArgoCD went down during a critical deployment window. A customer-facing bug needed an immediate hotfix. Team Mavericks had the fix ready, but we couldn’t deploy it because the GitOps control plane was offline.

We had to manually kubectl apply the fix (breaking our GitOps workflow) and then reconcile it in ArgoCD later.

Takeaway: If your deployments depend on it, it needs to be HA. ArgoCD is infrastructure, treat it as such.

Lesson 4: GitOps for ArgoCD is the Best Long-Term Strategy

CLI backups are great for quick recovery. Velero is excellent for disaster recovery.

But GitOps for ArgoCD config is the ultimate solution because:

You get versioning (Git history)
You get auditing (who changed what when)
You get self-healing (drift correction)
You get reproducibility (deploy anywhere)

Takeaway: Invest in GitOps for ArgoCD early. It pays dividends forever.

Lesson 5: Know What You’re NOT Backing Up

We once “restored” ArgoCD from backup and wondered why applications weren’t syncing.

Turns out: We backed up the Application resources (the custom resources, not the CRD definitions), but the deployed resources were still in target clusters. ArgoCD saw drift everywhere and started auto-healing (re-syncing from Git, which had older versions).

Takeaway: Understand the scope of your backup. ArgoCD config ≠ Application state. Git is source of truth for manifests, Kubernetes is source of truth for deployed resources.

What’s Next?

You’ve implemented backup, disaster recovery, and high availability for ArgoCD. Your GitOps control plane is now production-grade.

Next steps:

Test your DR plan - Schedule a restore drill this month
Implement monitoring - Set up Prometheus + Grafana dashboards
GitOps everything - Move all ArgoCD config to Git (app-of-apps)
Automate validation - CI pipeline to validate ArgoCD manifests before merge
Security hardening - SSO integration, RBAC tightening, network policies

Related articles:

ArgoCD in the Real World: From First Install to Multi-Cluster GitOps - If you need to set up ArgoCD from scratch or understand multi-cluster patterns
GitLab SaaS to Self-Hosted Migration - Another infrastructure migration story with lessons on planning and execution

Final Thoughts

That 3 AM wake-up call was painful. Six hours of manual reconstruction, team frustration, missed SLAs.

But it taught me something critical: Infrastructure automation needs the same rigor as application code.

ArgoCD is your GitOps control plane. It orchestrates hundreds of deployments. It’s the bridge between Git (your source of truth) and Kubernetes (your runtime). Treat it with the importance it deserves.

Backup strategy: Multiple tiers, automated, tested regularly.

HA deployment: Multi-replica, across nodes, with Redis failover.

GitOps approach: ArgoCD managing itself, configuration as code.

These aren’t optional for production. They’re the baseline.

Now when my phone buzzes at 3 AM, I know:

✅ Automated backups ran last night
✅ Velero has a full snapshot
✅ GitOps config can restore everything in 10 minutes
✅ HA setup means ArgoCD is still running anyway

Recovery went from 6 hours to 10 minutes.

That’s the power of treating your automation infrastructure with production-grade discipline.

Need help implementing production-grade ArgoCD?

I help teams design and implement bulletproof GitOps infrastructure. Services include:

ArgoCD Disaster Recovery Audit - Assessment of your backup strategy and recovery procedures
Production Hardening - HA setup, backup automation, monitoring integration
Disaster Recovery Planning - Runbooks, testing procedures, team training
GitOps Architecture Design - Repository structure, multi-cluster patterns, security

Schedule a consultation or reach out at www.uk4.in.

Kudos to every DevOps engineer who’s been woken up at 3 AM. You’re not alone.

Built with resilience, automated with discipline, deployed with confidence—powered by GitOps and battle-tested DR strategies.
