How I Traced a 20% CPU Drain to a Silent Dokploy Loop

The Problem

My production server — a 6-core Contabo VPS running a hotspot billing platform — started feeling sluggish. No obvious trigger, no recent deployments, no traffic spikes. Just a quiet, persistent drag that showed up in response times and made the server feel like it was always working on something I couldn't see.

Before reaching for any configuration changes, I wanted evidence. What was actually consuming my CPU cycles?

This is the full diagnostic trail — every command, every finding, and the root cause that turned out to be both surprising and completely fixable.

Step 1: Get a Process Snapshot

The first tool isn't htop or a fancy dashboard. It's a single command that gives you a ranked list of CPU consumers in one shot:

ps aux --sort=-%cpu | head -20

The output immediately told me a few things. Ignoring ps itself (which briefly spikes while running), the real entries of interest were:

USER   PID   %CPU %MEM  ...  COMMAND
root   3536   4.2  7.6  ...  node -r dotenv/config dist/server.mjs
root    878   3.4  4.2  ...  /usr/bin/dockerd
root   2129   2.1  0.8  ...  traefik traefik
root   3044959 2.1 1.1  ...  ./server

Single-digit CPU percentages don't look alarming at first glance. But the TIME column, the cumulative CPU time consumed since the process started, told a different story. ps aux prints TIME as minutes:seconds, and the Node.js process showed 316:38: more than five hours of CPU time accumulated since May 22, roughly 5 days earlier. For a deployment manager that should be idle most of the time, that is sustained, silent consumption.
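One caveat when reading TIME: its format depends on how ps is invoked. With procps, ps aux prints minutes:seconds, while ps -o time prints [DD-]hh:mm:ss. The helper below is a minimal sketch that normalizes either form to seconds so values can be compared; the function name cpu_time_to_seconds is my own and not part of any tool.

```shell
# Hypothetical helper: convert a ps TIME value to seconds.
# Accepts "MMM:SS" (ps aux) or "[DD-]hh:mm:ss" (ps -o time).
cpu_time_to_seconds() {
  awk -v t="$1" 'BEGIN {
    days = 0
    # Optional "DD-" prefix
    if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
    n = split(t, p, ":")
    if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]   # hh:mm:ss
    else if (n == 2) secs = p[1] * 60 + p[2]                 # MMM:SS
    else             secs = p[1]
    print days * 86400 + secs
  }'
}

# Examples (made-up values):
#   cpu_time_to_seconds 90:15        prints 5415
#   cpu_time_to_seconds 1-02:03:04   prints 93784
```

Dividing the result by the process's elapsed time gives its average CPU share over its whole lifetime, which is what the %CPU column of ps reports.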

Step 2: Confirm It's Genuine CPU, Not I/O Wait

High CPU numbers can be misleading. When processes are blocked on slow disk or database operations, load average climbs and the server feels busy even though your code isn't actually running. Before investigating the process itself, rule this out:

top -bn1 | grep "Cpu(s)"

Look specifically at the wa (I/O wait) value. If it's above 10%, your bottleneck is the storage layer, not your code. In my case, wa was near zero — this was genuine CPU consumption, not a disguised I/O problem.
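If you want the wa figure on its own, for a threshold check or a cron alert, it can be pulled out of that line. This is a rough sketch that assumes the modern procps layout ("0.1 wa,") and a C locale; iowait_pct is a name I made up.

```shell
# Hypothetical helper: read top's "Cpu(s)" summary line on stdin
# and print the I/O wait percentage.
iowait_pct() {
  awk '/Cpu\(s\)/ {
    gsub(/,/, " ")                      # "0.1 wa," -> "0.1 wa "
    for (i = 2; i <= NF; i++)
      if ($i == "wa") { print $(i - 1); exit }
  }'
}

# Usage:
#   LC_ALL=C top -bn1 | iowait_pct
```

Forcing LC_ALL=C matters here: in locales that use a decimal comma, the comma-stripping step would split the number.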

Step 3: Measure the Pattern, Not Just the Snapshot

A single ps snapshot can be misleading. A process might spike briefly for a legitimate reason. pidstat gives you a time-series view of what a specific process is actually doing:

pidstat -p 3536 2 5

This samples PID 3536 every 2 seconds, 5 times. The output revealed something important:

%usr   %system  %guest  %wait   %CPU   CPU  Command
2.50    5.00     0.00    0.00   7.50    4   MainThread
5.00    7.50     0.00    0.00  12.50    4   MainThread
2.00    4.00     0.00    0.00   6.00    1   MainThread
...
Average: 3.40   5.60     0.00    0.10   9.00    -   MainThread

The %system value (5.60%) was consistently higher than %usr (3.40%). That's the red flag. In normal application workloads, user-space code dominates. When system time is higher, the process is spending most of its time inside the kernel — making syscalls. That pattern points to: heavy network I/O, frequent file operations, or memory pressure triggering repeated allocations.
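The same comparison can be scripted against pidstat's summary. The sketch below assumes the sysstat layout in which the header row and the data row both start with "Average:", and finds the columns by name rather than position; kernel_vs_user is a hypothetical name.

```shell
# Hypothetical helper: read pidstat output on stdin and report
# whether kernel time or user time dominates in the Average row.
kernel_vs_user() {
  awk '$1 == "Average:" {
    if ($0 ~ /%usr/) {
      # Header row: remember where %usr and %system live
      for (i = 1; i <= NF; i++) {
        if ($i == "%usr")    u = i
        if ($i == "%system") s = i
      }
    } else if (u && s) {
      # Data row: compare the two values numerically
      verdict = ($s > $u) ? "kernel-heavy" : "user-heavy"
      print "usr=" $u, "system=" $s, verdict
    }
  }'
}

# Usage:
#   LC_ALL=C pidstat -p 3536 2 5 | kernel_vs_user
```

When the verdict is kernel-heavy, strace -c -f -p <PID> run for a few seconds will summarize which syscalls account for the time. Note that strace adds noticeable overhead to the traced process.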

Step 4: Identify the Container

The process was running from /app — a path that doesn't exist on the host filesystem. That immediately told me it was running inside a Docker container. To find which one:

cat /proc/3536/cgroup | grep docker

Output:

0::/system.slice/docker-ef046589f0df78850cdf86df7a228c91d81b3a338e55b07dce6f77fe857c81e3.scope

Container ID: ef046589. Now I could inspect it directly:

docker inspect ef046589 --format '{{.Name}} | Image: {{.Config.Image}} | Cmd: {{range .Config.Cmd}}{{.}} {{end}}'

Result:

/dokploy.1.g2l35e00g8nbbs55zj47ap3ff | Image: dokploy/dokploy:v0.28.8 | Cmd: sh -c pnpm run wait-for-postgres && exec pnpm start

The culprit was Dokploy — my self-hosted deployment platform.
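The PID-to-container lookup is worth keeping as a small helper. This sketch simply extracts the first 64-character hex string from the cgroup file and shortens it to the 12 characters Docker displays. That covers the systemd layout shown above and should also cover the older /docker/<id> layout, though I have only checked it against the former. The function name is hypothetical.

```shell
# Hypothetical helper: read a /proc/<PID>/cgroup file on stdin and
# print the short (12-character) Docker container ID, if any.
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | head -n 1 | cut -c1-12
}

# Usage:
#   container_id_from_cgroup < /proc/3536/cgroup
#   docker inspect "$(container_id_from_cgroup < /proc/3536/cgroup)" --format '{{.Name}}'
```

Empty output means the process is not in a Docker-managed cgroup, or the host uses a layout this pattern does not match.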

Step 5: See the Full Picture with Docker Stats

With the container identified, I ran docker stats --no-stream to see every container's resource usage side by side:

docker stats --no-stream

CONTAINER ID   NAME                          CPU %    MEM USAGE / LIMIT
ef046589       dokploy.1.g2l35...            19.72%   1.332GiB / 11.68GiB
f7586f0af9b3   dokploy-redis.1.hfn...         3.63%   12.4MiB  / 11.68GiB
0a236ab969f9   dokploy-postgres.1.vx2...      0.01%   66.72MiB / 11.68GiB
510e1f355e75   connectlocal-hotspot-api...    0.98%   77.89MiB / 11.68GiB

Two numbers stood out immediately:

Dokploy: 19.72% CPU, 1.332 GiB RAM — for a deployment manager with nothing actively deploying
dokploy-redis: 3.63% CPU — Redis should be near-zero when idle

High Redis CPU alongside high application CPU is a classic indicator of a queue processing loop — the app is pushing jobs to Redis faster than it can process them, or it's stuck retrying the same failed job repeatedly.
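One way to test that hypothesis is to ask Redis how many commands per second it is serving. The INFO stats section includes an instantaneous_ops_per_sec field. The sketch below parses it, assuming redis-cli is available inside the Redis container (it ships in the official image, but I did not verify which image Dokploy uses); redis_ops_per_sec is a made-up name.

```shell
# Hypothetical helper: read `redis-cli info stats` output on stdin
# and print the current commands-per-second figure.
redis_ops_per_sec() {
  # INFO output uses CRLF line endings, so strip the carriage returns
  tr -d '\r' | awk -F: '$1 == "instantaneous_ops_per_sec" { print $2 }'
}

# Usage:
#   docker exec f7586f0af9b3 redis-cli info stats | redis_ops_per_sec
```

A steady stream of operations on a supposedly idle queue backs up the loop theory, while a value near zero would point elsewhere.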

Step 6: Read the Logs

docker logs ef046589 --tail 80 --timestamps 2>&1 | grep -iE "error|fail|retry"

The output was unambiguous:

2026-05-26T05:28:24.579Z Failed to proxy http://169.254.169.254/latest/meta-data/ Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:24.579Z Failed to proxy http://169.254.169.254/latest/meta-data/iam/security-credentials/ Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:30.245Z Failed to proxy http://169.254.169.254/latest/meta-data/user-data Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:30.830Z Failed to proxy http://169.254.169.254/latest/meta-data/instance-id Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T05:28:31.414Z Failed to proxy http://169.254.169.254/latest/meta-data/ami-id Error: connect ECONNREFUSED 127.0.0.1:80
2026-05-26T12:03:01.018Z Failed to proxy http://metadata.google.internal/computeMetadata/v1/instance/?recursive=true Error: connect ECONNREFUSED 127.0.0.1:80
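To see which endpoints are being probed and how often, the matching lines can be grouped by target host. A minimal sketch, assuming the log format shown above; failures_by_host is a hypothetical name.

```shell
# Hypothetical helper: read container logs on stdin and print
# "count host" for every "Failed to proxy" target, busiest first.
failures_by_host() {
  grep -oE 'Failed to proxy https?://[^/ ]+' \
    | sed -E 's#^Failed to proxy https?://##' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Usage:
#   docker logs ef046589 --since 1h 2>&1 | failures_by_host
```

Running it over a fixed window with --since turns the impression of constant errors into a rate you can compare before and after a fix.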

Root Cause: The Cloud Metadata Detection Loop

169.254.169.254 is a link-local IP address that cloud providers use, by convention, for their instance metadata services. AWS EC2, Google Cloud, DigitalOcean, and other cloud providers expose server information (public IP, instance type, IAM credentials) at this address from within virtual machines.

Dokploy v0.28.8 uses this endpoint to auto-detect:

Which cloud provider the server is running on
The server's public IP address

On AWS or GCP, this works perfectly. On a Contabo VPS (and most bare-metal or non-hyperscaler providers), this endpoint simply doesn't exist. The connection is refused immediately.

The bug: Dokploy didn't treat a missing metadata endpoint as a definitive "not a cloud server" signal and move on. Instead, it retried continuously. Every few seconds, it probed the AWS and GCP metadata endpoints, got ECONNREFUSED, logged the error, and tried again. Each attempt generated syscalls (the high %system we saw earlier), hammered the Redis queue, and burned CPU: indefinitely, silently, for days.

This is a known issue in Dokploy versions prior to v0.29.3, which introduced smarter cloud provider detection and significantly reduced healthcheck frequency.
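For illustration, here is what treating a failed probe as a definitive answer can look like. This is not Dokploy's code, only a minimal sketch of the pattern: run the probe once, cache the outcome in a file, and never probe again while the cache exists. The function name, cache path, and probe command are all my own assumptions.

```shell
# Hypothetical sketch (not Dokploy's implementation):
# probe once, cache the result, answer from the cache afterwards.
#   $1      = cache file
#   $2...   = probe command; exit status 0 means "metadata service found"
detect_cloud_once() {
  cache_file=$1
  shift
  if [ ! -s "$cache_file" ]; then
    if "$@" >/dev/null 2>&1; then
      echo cloud > "$cache_file"
    else
      # Refused or timed out: record a permanent negative, no retry
      echo none > "$cache_file"
    fi
  fi
  cat "$cache_file"
}

# Usage:
#   detect_cloud_once /var/tmp/cloud-detect \
#     curl -sf --max-time 2 http://169.254.169.254/latest/meta-data/
```

A production version would probably expire the cache after a long interval in case the machine is migrated, but the point stands: a refused connection should end the detection rather than restart it.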

The Fix: Upgrade Dokploy

Step 1 — Apply the mandatory security patch

Dokploy v0.29.3 introduced a required auth secret migration, which must be run before upgrading:

curl -sSL https://dokploy.com/security/0.29.3.sh | bash

This generates a unique authentication secret for your installation and migrates any existing 2FA data into Docker Secrets. The cd: can't cd to /app warning you may see is harmless — it's a host-path issue in the script, but the secret is already saved successfully.

Step 2 — Identify your Swarm service

docker stack ls
docker service ls | grep dokploy

In my case, the service name was simply dokploy:

mb9d8fvt7r43   dokploy   replicated   1/1   dokploy/dokploy:v0.28.8

Step 3 — Update the image

docker service update --image dokploy/dokploy:v0.29.5 dokploy

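Docker Swarm performs a rolling update, replacing the old task with one running the new image. With a single replica and Swarm's default stop-first update order, the Dokploy UI is briefly unavailable during the swap. The applications it manages run in their own containers and keep running. After about 60 seconds: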

overall progress: 1 out of 1 tasks
1/1: running   [==================================================>]
verify: Service dokploy converged
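Before trusting the progress bar, it is worth confirming which image the service now references, and knowing how to undo the change. docker service inspect may return the image with a digest appended (name:tag@sha256:...), so the sketch below strips that down to the tag. image_tag is a hypothetical helper, and it assumes the reference carries an explicit tag.

```shell
# Hypothetical helper: reduce "repo/name:tag@sha256:<digest>" to "tag".
image_tag() {
  printf '%s\n' "$1" | sed -E 's/@sha256:[0-9a-f]+$//; s/^.*://'
}

# Usage:
#   image_tag "$(docker service inspect \
#     --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' dokploy)"
#
# If the new version misbehaves, revert to the previous service spec:
#   docker service rollback dokploy
```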

Results

docker stats --no-stream | grep -E "NAME|dokploy"

CONTAINER ID   NAME                     CPU %    MEM USAGE / LIMIT
85a23339a372   dokploy.1.perhu...       0.28%    801.4MiB / 11.68GiB
f7586f0af9b3   dokploy-redis.1.hfn...   0.76%    12.51MiB / 11.68GiB

| Metric | Before | After | Change |
|---|---|---|---|
| Dokploy CPU | 19.72% | 0.28% | -98.6% |
| Dokploy RAM | 1.332 GiB | 801 MiB | -40% |
| Redis CPU | 3.63% | 0.76% | -79% |
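The Change column is plain relative change, (after - before) / before. A small sketch for reproducing it from two docker stats readings; pct_change is a name I picked, and both arguments must be in the same unit.

```shell
# Hypothetical helper: relative change from $1 (before) to $2 (after),
# as a percentage with one decimal place.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Examples, using the CPU readings above:
#   pct_change 19.72 0.28   prints -98.6
#   pct_change 3.63 0.76    prints -79.1
```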

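A single service update recovered nearly 20% of a CPU core, along with more than 500 MiB of RAM. Note that docker stats reports CPU relative to one core, so on this 6-core machine 19.72% is about 3.3% of total capacity. That is small as a share of the machine, but it was a constant drain from a service that should have been idle.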

The Diagnostic Toolkit

Here's the complete set of commands used in this investigation, in order:

# 1. Ranked snapshot of CPU consumers
ps aux --sort=-%cpu | head -20

# 2. Distinguish CPU vs I/O wait
top -bn1 | grep "Cpu(s)"

# 3. Time-series CPU breakdown for a specific PID (user vs system)
pidstat -p <PID> 2 5

# 4. Find which Docker container owns a PID
cat /proc/<PID>/cgroup | grep docker

# 5. Inspect the container
docker inspect <container-id> --format '{{.Name}} | Image: {{.Config.Image}}'

# 6. Cross-container resource comparison
docker stats --no-stream

# 7. Tail logs with error filtering
docker logs <container-id> --tail 80 --timestamps 2>&1 | grep -iE "error|fail|retry"

Key Takeaways

Watch cumulative CPU time, not just the snapshot percentage. A process at 4% right now might have quietly consumed hours of CPU time, and that's the real signal of a sustained problem. Keep in mind that ps aux prints TIME as minutes:seconds.

%system higher than %usr is a red flag. Normal applications spend more time in user space than in the kernel. When %system dominates, look for tight retry loops, excessive file I/O, or heavy network call patterns.

Redis CPU is a proxy for queue activity. If your Redis instance is using unexpected CPU and you have no active jobs, something is hammering the queue in a loop.

Metadata endpoints are cloud-specific. 169.254.169.254 only answers on providers that run a metadata service (AWS, GCP, DigitalOcean, and similar). If your software probes it and you're on bare metal or a provider without one, make sure failure is handled as a permanent negative, not a transient error to retry.

Keep your deployment tooling up to date. Dokploy v0.28.8 had this bug for months. A single docker service update command eliminated a constant load of about a fifth of a CPU core.

All diagnostics were performed on a live production server without any downtime. The investigation took under 30 minutes from first ps aux to confirmed fix.