When File-Based Detection Lies: Switching to RCON for Minecraft Server Status
My Minecraft server was running. The PID file proved it. And yet, the server wasnât running at all.
This paradox cost me a week of intermittent debugging before I understood what was actually happening: Iâd been asking the wrong question entirely.
The Symptom: A Silent Status Bar
I run a Minecraft server on AWS with a Discord bot that displays player count in its status lineâthat little âPlaying: 2 players onlineâ text beneath the botâs name. One day I noticed it had gone blank. The bot was online, the chat bridge worked fine, but the status showed nothing.
My first instinct was to check git history. The status had worked before, so something must have changed. A quick search found it: a cleanup commit had removed the status_updater task along with 270 other lines of âoutdatedâ code. Classic over-aggressive refactoring.
I restored the code. The status stayed blank.
This is when the real debugging began.
The Original Detection: Trusting a File
The status update logic depended on knowing whether the server was running. Hereâs what that check looked like:
def is_server_running(self):
"""Check if server is running via PID file"""
pid_file = Path("/var/run/minecraft/server.pid")
if not pid_file.exists():
return False
pid = int(pid_file.read_text().strip())
try:
os.kill(pid, 0) # Signal 0 just checks if process exists
return True
except OSError:
return False
This seemed bulletproof. The PID file gets created when the server starts and contains the process ID. The os.kill(pid, 0) trick verifies the process is still alive without actually killing it. If both checks pass, the server is running.
Except âthe process is aliveâ and âthe server is accepting playersâ are fundamentally different things.
The Zombie State: When Java Lives But Minecraft Dies
Minecraft servers are Java applications, and Java processes can enter a zombie state where the JVM is technically running but the game server inside has crashed or hung. This happens more often than youâd expect:
- Plugin crashes: A badly-written mod throws an unhandled exception in the main tick loop
- Memory exhaustion: The heap fills up, GC thrashes endlessly, but the process never exits
- Network stack failures: The server socket dies but the main thread keeps spinning
- Deadlocks: Two threads waiting on each other indefinitely
In all these cases, the PID file exists, os.kill(pid, 0) succeeds, and my detection logic reports âserver running.â But try to connect as a player and youâll timeout.
The status updater would call is_server_running(), get True, then try to query player count via RCONâwhich would fail silently because the RCON subsystem was part of the crashed server. The code handled RCON failures by simply not updating the status, leaving the Discord status bar empty.
I had a detection method that perfectly detected the wrong thing.
The Fix: Ask the Server Directly
The insight seems obvious in retrospect: if you want to know whether a Minecraft server can serve players, ask it to serve something.
def is_server_running(self):
"""Check if server is running by attempting RCON connection"""
try:
with MCRcon(self.host, self.password, port=self.port) as mcr:
mcr.command("list")
return True
except Exception:
return False
RCON (Remote Console) lets you send commands to a running server. If the server responds to a list command, itâs genuinely operationalânot just a living process, but an actual functioning game server. If RCON fails, we know the server canât handle players either.
This changes the question from âis the process alive?â to âis the server healthy?â
Verifying the Fix
I didnât want to just deploy and hope. To confirm RCON-based detection would catch zombie states, I simulated one:
# Find the Minecraft server process
$ pgrep -f "minecraft_server"
12345
# Pause it (simulates a hung server)
$ kill -STOP 12345
# Test detection
$ python -c "from rcon_client import RconClient; print(RconClient().is_server_running())"
False
# Resume the server
$ kill -CONT 12345
The PID file check would have said ârunning.â The RCON check correctly identified the frozen server as non-functional.
The Implementation
With the core detection fixed, I updated the status updater:
@tasks.loop(seconds=30)
async def status_updater():
"""Update Discord status with player count from RCON"""
if not rcon_client.is_server_running():
await client.change_presence(
activity=discord.Game(name="Server offline"),
status=discord.Status.idle
)
return
player_info = rcon_client.get_online_players()
if player_info['success']:
count = player_info['count']
status_text = f"{count} player{'s' if count != 1 else ''} online"
await client.change_presence(
activity=discord.Game(name=status_text),
status=discord.Status.online
)
else:
logger.warning(f"RCON player query failed: {player_info.get('error')}")
await client.change_presence(
activity=discord.Game(name="Status unavailable"),
status=discord.Status.dnd
)
Three states now, each with appropriate Discord status colors:
- Green (online): Server running, show player count
- Yellow (idle): Server offline
- Red (do not disturb): Server running but RCON failed
The logging matters. The original code failed silently. Now failures get logged with context.
The Broader Lesson: Presence Isnât Health
This session crystallized a principle I keep relearning: presence isnât health.
A process existing doesnât mean itâs functioning. A file being present doesnât mean it contains valid data. A network connection being open doesnât mean messages are being processed. A service responding to health checks doesnât mean it can handle real requests.
The PID file approach took ten lines of code and worked for months. But it encoded an assumptionââprocess alive equals server workingââthat failed exactly when it mattered most: when the server was in trouble.
Practical Takeaways
For Minecraft server operators: Use RCON for health checking. The list command is lightweight and definitively answers âis the server accepting commands?â File-based checks can supplement but shouldnât be primary.
For anyone building monitoring: Test your detection methods against failure modes, not just happy paths. My PID file check worked when the server was healthy or completely dead. It failed on partial failuresâexactly when good detection matters most.
For debugging silent failures: When restored code doesnât work, the bug is in your assumptions. I spent an hour re-reading the status updater looking for bugs. The actual bug was in a different file entirely, in a function that appeared to work but answered the wrong question.
Conclusion
The status bar works now. More importantly, it fails loudlyâif something goes wrong, Iâll see it in logs and in the Discord status itself, not as mysterious blankness.
The irony: I thought I was debugging a Discord bot issue. I was actually debugging a philosophy-of-detection issue that happened to manifest in a Discord bot. The same flawed thinkingâtrusting presence over healthâcould appear in database connections, API clients, or any system where âthe thing existsâ differs from âthe thing works.â
Now I know to ask: what am I actually detecting? And is that what I actually need to know?