How I Traced a 20% CPU Drain to a Silent Dokploy Loop
A step-by-step war story of diagnosing unexpected server CPU consumption — from raw process inspection to a Docker container running a tight metadata retry loop.
The Problem
My production server — a 6-core Contabo VPS running a hotspot billing platform — started feeling sluggish. No obvious trigger, no recent deployments, no traffic spikes. Just a quiet, persistent drag that showed up in response times and made the server feel like it was always working on something I couldn't see.
Before reaching for any configuration changes, I wanted evidence. What was actually consuming my CPU cycles?
This is the full diagnostic trail — every command, every finding, and the root cause that turned out to be both surprising and completely fixable.
Step 1: Get a Process Snapshot
The first tool isn't htop or a fancy dashboard. It's a single command that gives you a ranked list of CPU consumers in one shot:
ps aux --sort=-%cpu | head -20
The output immediately told me a few things. Ignoring ps itself (which briefly spikes while running), the real entries of interest were:
USER PID %CPU %MEM ... COMMAND
root 3536 4.2 7.6 ... node -r dotenv/config dist/server.mjs
root 878 3.4 4.2 ... /usr/bin/dockerd
root 2129 2.1 0.8 ... traefik traefik
root 3044959 2.1 1.1 ... ./server
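CPU percentages in the single digits don't look alarming at first glance. But the TIME column (cumulative CPU time consumed since the process started) told a different story. The Node.js process showed 316:38 there, which ps aux prints as minutes:seconds: about 5.3 hours of CPU time since the process started on May 22, roughly 5 days earlier. ps also computes %CPU as CPU time divided by elapsed time, so the 4.2% is a lifetime average, not a momentary spike. Sustained, silent consumption.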
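Reading TIME correctly matters, because ps changes the layout depending on how it is invoked. Here is a minimal sketch, assuming the two layouts procps documents (MMM:SS for ps aux, [DD-]HH:MM:SS for ps -o time). The name cpu_time_to_seconds is a hypothetical helper of mine, not part of any tool:

```shell
# Convert a ps TIME value to seconds (hypothetical helper).
# Accepts MMM:SS (as printed by `ps aux`) and [DD-]HH:MM:SS (as printed by `ps -o time`).
cpu_time_to_seconds() {
  printf '%s\n' "$1" | awk '{
    t = $0
    days = 0
    # An optional "DD-" prefix carries whole days.
    if (index(t, "-")) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, p, ":")
    if (n == 3) secs = p[1] * 3600 + p[2] * 60 + p[3]   # HH:MM:SS
    else        secs = p[1] * 60 + p[2]                 # MMM:SS
    print days * 86400 + secs
  }'
}
```

For example, cpu_time_to_seconds 90:15 treats the value as 90 minutes and 15 seconds. Dividing the result by the process's elapsed time (ps -o etimes= -p <PID>) gives the average CPU share over its lifetime.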
Step 2: Confirm It's Genuine CPU, Not I/O Wait
High CPU numbers can be misleading. If the kernel is waiting on slow disk or database operations, it shows up as CPU usage in some views even though your code isn't actually running. Before investigating the process itself, rule this out:
top -bn1 | grep "Cpu(s)"
Look specifically at the wa (I/O wait) value. If it's above 10%, your bottleneck is the storage layer, not your code. In my case, wa was near zero: this was genuine CPU consumption, not a disguised I/O problem.
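If you want to script this check, the wa value can be pulled out of the summary line. This is a rough sketch that assumes the comma-separated "%Cpu(s): ... us, ... sy, ... wa, ..." layout printed by procps top in an English locale. iowait_from_top is a hypothetical name:

```shell
# Print the I/O wait percentage from top's summary line read on stdin.
# Assumes fields are comma-separated and the value is labelled "wa".
iowait_from_top() {
  awk -F',' '/Cpu\(s\)/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /wa[ \t]*$/) {
        v = $i
        sub(/^[ \t]*/, "", v)         # drop leading padding
        sub(/[ \t]*wa[ \t]*$/, "", v) # drop the "wa" label
        print v
      }
    }
  }'
}
```

Usage would be top -bn1 | iowait_from_top, comparing the printed number against the 10% rule of thumb above.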
Step 3: Measure the Pattern, Not Just the Snapshot
A single ps snapshot can be misleading. A process might spike briefly for a legitimate reason. pidstat gives you a time-series view of what a specific process is actually doing:
pidstat -p 3536 2 5
This samples PID 3536 every 2 seconds, 5 times. The output revealed something important:
%usr %system %guest %wait %CPU CPU Command
2.50 5.00 0.00 0.00 7.50 4 MainThread
5.00 7.50 0.00 0.00 12.50 4 MainThread
2.00 4.00 0.00 0.00 6.00 1 MainThread
...
Average: 3.40 5.60 0.00 0.10 9.00 - MainThread
The %system value (5.60%) was consistently higher than %usr (3.40%). That's the red flag. In normal application workloads, user-space code dominates. When system time is higher, the process is spending most of its time inside the kernel — making syscalls. That pattern points to: heavy network I/O, frequent file operations, or memory pressure triggering repeated allocations.
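The user-versus-system comparison can be automated from the Average row. This is a sketch only: it locates %usr and %system from the header and counts columns from the right, because the timestamp prefix on the header has a different width than "Average:". It assumes the command name contains no spaces, and usr_vs_system is a hypothetical name:

```shell
# Read pidstat output on stdin and report whether kernel time dominates.
usr_vs_system() {
  awk '
    /%usr/ {
      # Remember each column as an offset from the last field.
      for (i = 1; i <= NF; i++) {
        if ($i == "%usr")    u = NF - i
        if ($i == "%system") s = NF - i
      }
      found = 1
      next
    }
    /^Average:/ && found {
      usr = $(NF - u); sys = $(NF - s)
      verdict = (sys + 0 > usr + 0) ? "kernel-heavy" : "user-heavy"
      printf "usr=%s sys=%s %s\n", usr, sys, verdict
    }
  '
}
```

Piping pidstat -p <PID> 2 5 into usr_vs_system prints one summary line, so the check can go into a cron job or a quick loop over several PIDs.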
Step 4: Identify the Container
The process was running from /app — a path that doesn't exist on the host filesystem. That immediately told me it was running inside a Docker container. To find which one:
cat /proc/3536/cgroup | grep docker
Output:
0::/system.slice/docker-ef046589f0df78850cdf86df7a228c91d81b3a338e55b07dce6f77fe857c81e3.scope
Container ID: ef046589. Now I could inspect it directly:
docker inspect ef046589 --format '{{.Name}} | Image: {{.Config.Image}} | Cmd: {{range .Config.Cmd}}{{.}} {{end}}'
Result:
/dokploy.1.g2l35e00g8nbbs55zj47ap3ff | Image: dokploy/dokploy:v0.28.8 | Cmd: sh -c pnpm run wait-for-postgres && exec pnpm start
The culprit was Dokploy — my self-hosted deployment platform.
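The PID-to-container lookup is easy to wrap in a helper. The sketch below assumes the container ID appears in the cgroup path either as docker-<id>.scope (systemd driver, as on my server) or as /docker/<id> (cgroupfs driver). Other runtimes use different paths, and both function names are hypothetical:

```shell
# Extract a 64-character Docker container ID from cgroup lines on stdin.
extract_container_id() {
  sed -n 's#.*docker[-/]\([0-9a-f]\{64\}\).*#\1#p'
}

# Convenience wrapper: look up the container ID that owns a host PID.
container_id_for_pid() {
  extract_container_id < "/proc/$1/cgroup"
}
```

container_id_for_pid 3536 would print the full ID, which docker inspect accepts directly.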
Step 5: See the Full Picture with Docker Stats
With the container identified, I ran docker stats --no-stream to see every container's resource usage side by side:
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT
ef046589 dokploy.1.g2l35... 19.72% 1.332GiB / 11.68GiB
f7586f0af9b3 dokploy-redis.1.hfn... 3.63% 12.4MiB / 11.68GiB
0a236ab969f9 dokploy-postgres.1.vx2... 0.01% 66.72MiB / 11.68GiB
510e1f355e75 connectlocal-hotspot-api... 0.98% 77.89MiB / 11.68GiB
Two numbers stood out immediately:
- Dokploy: 19.72% CPU, 1.332 GiB RAM — for a deployment manager with nothing actively deploying
- dokploy-redis: 3.63% CPU — Redis should be near-zero when idle
High Redis CPU alongside high application CPU is a classic indicator of a queue processing loop — the app is pushing jobs to Redis faster than it can process them, or it's stuck retrying the same failed job repeatedly.
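One thing to keep in mind when reading these numbers: docker stats reports CPU relative to a single core, so 100% means one full core, not the whole machine. A small sketch for converting to a share of total host capacity, assuming input lines of the form "name percent%" (per_host_share is a hypothetical name):

```shell
# Convert per-core CPU percentages ("name 19.72%") read on stdin
# into a share of total host capacity, given the core count as $1.
per_host_share() {
  awk -v cores="$1" '{
    pct = $2
    sub(/%/, "", pct)                        # strip the percent sign
    printf "%s %.2f%%\n", $1, pct / cores
  }'
}
```

A typical invocation would be docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | per_host_share "$(nproc)".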
Step 6: Read the Logs
docker logs ef046589 --tail 80 --timestamps 2>&1 | grep -iE "error|fail|retry"
The output was unambiguous:
2026-05-26T05:28:24.579Z Failed to proxy http://169.254.169.254/latest/meta-data/ Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:24.579Z Failed to proxy http://169.254.169.254/latest/meta-data/iam/security-credentials/ Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:30.245Z Failed to proxy http://169.254.169.254/latest/meta-data/user-data Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:30.830Z Failed to proxy http://169.254.169.254/latest/meta-data/instance-id Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:31.414Z Failed to proxy http://169.254.169.254/latest/meta-data/ami-id Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T12:03:01.018Z Failed to proxy http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true Error: connect ECONNREFUSED 127.0.0.1:80
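With more than a screenful of these lines, a per-host tally shows the pattern faster than scrolling. A minimal sketch, assuming the "Failed to proxy <url>" wording seen in the log lines above (count_proxy_failures is a hypothetical name):

```shell
# Count "Failed to proxy" log lines per target host, most frequent first.
count_proxy_failures() {
  sed -n 's#.*Failed to proxy https\{0,1\}://\([^/ ]*\).*#\1#p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'   # normalise uniq's padded output
}
```

Feed it the raw logs with docker logs <container-id> --timestamps 2>&1 | count_proxy_failures.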
Root Cause: The Cloud Metadata Detection Loop
169.254.169.254 is a link-local IP address reserved for cloud instance metadata services. AWS EC2, Google Cloud, DigitalOcean, and other cloud providers expose server information — public IP, instance type, IAM credentials — at this address from within virtual machines.
Dokploy v0.28.8 uses this endpoint to auto-detect:
- Which cloud provider the server is running on
- The server's public IP address
On AWS or GCP, this works perfectly. On a Contabo VPS (and most bare-metal or non-hyperscaler providers), this endpoint simply doesn't exist. The connection is refused immediately.
The bug: Dokploy didn't treat a missing metadata endpoint as a definitive "not a cloud server" signal and move on. Instead, it retried — continuously. Every few seconds, it probed AWS metadata endpoints, got ECONNREFUSED, logged the error, and tried again. Each attempt generated syscalls (the high %system we saw earlier), hammered the Redis queue, and burned CPU — indefinitely, silently, for days.
This is a known issue in Dokploy versions prior to v0.29.3, which introduced smarter cloud provider detection and significantly reduced healthcheck frequency.
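To make the "probe once, then stop" idea concrete, here is a sketch of how a script could cache a negative result. This is not Dokploy's code. The function names, the cache file argument, and the "metadata-present"/"none" labels are all assumptions of mine for illustration:

```shell
# Probe the link-local metadata address once, with short timeouts.
# Returns non-zero on refusal, timeout, or an HTTP error.
probe_metadata() {
  curl -sf -o /dev/null --connect-timeout 2 --max-time 3 \
    "http://169.254.169.254/latest/meta-data/"
}

# Detect once and cache the answer in the file given as $1.
# A failed probe is stored as a permanent "none", so it is never retried.
detect_cloud_once() {
  cache="$1"
  if [ ! -s "$cache" ]; then
    if probe_metadata; then
      echo "metadata-present" > "$cache"
    else
      echo "none" > "$cache"
    fi
  fi
  cat "$cache"
}
```

The design choice is that ECONNREFUSED is treated as an answer, not an error: the second and every later call reads the cache file and makes no network request at all.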
The Fix: Upgrade Dokploy
Step 1 — Apply the mandatory security patch
Dokploy v0.29.3 introduced a required auth secret migration that must be run before upgrading:
curl -sSL https://dokploy.com/security/0.29.3.sh | bash
This generates a unique authentication secret for your installation and migrates any existing 2FA data into Docker Secrets. The cd: can't cd to /app warning you may see is harmless: it's a host-path issue in the script, and the secret has already been saved by that point.
Step 2 — Identify your Swarm service
docker stack ls
docker service ls | grep dokploy
In my case, the service name was simply dokploy:
mb9d8fvt7r43 dokploy replicated 1/1 dokploy/dokploy:v0.28.8
Step 3 — Update the image
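docker service update --image dokploy/dokploy:v0.29.5 dokploy
Docker Swarm performs a rolling update, replacing the old container with the new one. With a single replica the Dokploy dashboard itself may be briefly unavailable during the swap, but the applications it deployed keep running. After about 60 seconds: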
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service dokploy converged
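After the update converges, it is worth confirming which image the service spec now points at. A small sketch: service_image wraps a real docker service inspect template, and image_tag is a hypothetical pure-shell helper that assumes the reference carries an explicit tag (it would misread a registry port in an untagged reference):

```shell
# Print the image reference recorded in a Swarm service spec.
service_image() {
  docker service inspect "$1" --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
}

# Extract the tag from an image reference such as repo/name:tag@sha256:...
image_tag() {
  ref="${1%%@*}"              # drop an optional digest suffix
  printf '%s\n' "${ref##*:}"  # keep what follows the last colon
}
```

image_tag "$(service_image dokploy)" should then print the tag you just deployed. If the new version misbehaves, docker service rollback dokploy reverts to the previous spec.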
Results
docker stats --no-stream | grep -E "NAME|dokploy"
CONTAINER ID NAME CPU % MEM USAGE / LIMIT
85a23339a372 dokploy.1.perhu... 0.28% 801.4MiB / 11.68GiB
f7586f0af9b3 dokploy-redis.1.hfn... 0.76% 12.51MiB / 11.68GiB
| Metric | Before | After | Change |
|---|---|---|---|
| Dokploy CPU | 19.72% | 0.28% | -98.6% |
| Dokploy RAM | 1.332 GiB | 801 MiB | -41% |
| Redis CPU | 3.63% | 0.76% | -79% |
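A single service update recovered nearly 20% of a CPU core. Because docker stats reports CPU relative to one core, on this 6-core server that works out to roughly 3% of total capacity, but it was load that had been running around the clock for no reason.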
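The Change figures are plain relative differences, (after - before) / before, with both values in the same unit. A short sketch for reproducing them (pct_change is a hypothetical helper name; convert GiB to MiB before comparing memory):

```shell
# Print the relative change from $1 (before) to $2 (after) as a percentage.
pct_change() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    printf "%.1f%%\n", (after - before) / before * 100
  }'
}
```

For the CPU row that is pct_change 19.72 0.28. For the RAM row, 1.332 GiB is about 1363.97 MiB, so the call is pct_change 1363.97 801.4.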
The Diagnostic Toolkit
Here's the complete set of commands used in this investigation, in order:
# 1. Ranked snapshot of CPU consumers
ps aux --sort=-%cpu | head -20
# 2. Distinguish CPU vs I/O wait
top -bn1 | grep "Cpu(s)"
# 3. Time-series CPU breakdown for a specific PID (user vs system)
pidstat -p <PID> 2 5
# 4. Find which Docker container owns a PID
cat /proc/<PID>/cgroup | grep docker
# 5. Inspect the container
docker inspect <container-id> --format '{{.Name}} | Image: {{.Config.Image}}'
# 6. Cross-container resource comparison
docker stats --no-stream
# 7. Tail logs with error filtering
docker logs <container-id> --tail 80 --timestamps 2>&1 | grep -iE "error|fail|retry"
Key Takeaways
Watch cumulative CPU time, not just the snapshot percentage. A process at 4% right now might have been burning that 4% around the clock for days, and the accumulated TIME value (minutes:seconds in ps aux) is the real signal of a sustained problem.
High %system > %usr is a red flag. Normal applications spend more time in user space than in the kernel. When %system dominates, look for tight retry loops, excessive file I/O, or network call patterns.
Redis CPU is a proxy for queue activity. If your Redis instance is using unexpected CPU and you have no active jobs, something is hammering the queue in a loop.
Metadata endpoints are cloud-specific. 169.254.169.254 only exists on AWS, GCP, and similar hyperscalers. If your software probes it and you're on a bare-metal or non-major-cloud provider, make sure failure is handled as a permanent negative — not a transient retry.
Keep your deployment tooling up to date. Dokploy v0.28.8 had this bug for months. A single docker service update command eliminated a continuous load of about 20% of one core.
All diagnostics were performed on a live production server without any downtime. The investigation took under 30 minutes from first ps aux to confirmed fix.