Debugging a Minecraft Server on AWS: From CloudFormation Limits to RCON Credentials
Today's coding session turned into a masterclass on hidden constraints: the kind of debugging that makes you appreciate just how many invisible boundaries exist between cloud services and the applications running on them. I maintain a Minecraft server on EC2 with Discord integration for player management, and today both layers demanded attention.
The Cryptic Error That Started Everything
The day began with a CloudFormation deployment failure. The error message was characteristically unhelpful:
Waiter StackCreateComplete failed: Waiter encountered a terminal failure state:
For expression "Stacks[].StackStatus" we matched expected path: "ROLLBACK_COMPLETE"
Stack rolled back. Great. But why? The first diagnostic step was querying stack events:
aws cloudformation describe-stack-events \
--stack-name minecraft-server \
--query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
--output table
The actual culprit emerged:
MinecraftInstance: Resource handler returned message: "Encoded User data is
limited to 25600 bytes (Service: Ec2, Status Code: 400)"
My EC2 UserData script had grown past AWS's 25,600-byte limit for base64-encoded initialization scripts. The raw script was around 19 KB, and base64 encoding adds roughly 33% overhead, pushing it to approximately 25.3 KB, just over the limit.
The Fix: Breaking Up Monolithic Scripts
The solution was architectural rather than cosmetic. Instead of cramming everything into UserData, I refactored to upload initialization scripts to S3 during deployment, then have the instance download them at boot.
First, the deploy script uploads the scripts before creating the stack:
# Upload initialization scripts to S3
TEMPLATE_BUCKET="cf-templates-${AWS_ACCOUNT_ID}-${AWS_REGION}"
aws s3 cp scripts/ec2-init-system.sh "s3://$TEMPLATE_BUCKET/scripts/"
aws s3 cp scripts/ec2-init-minecraft.sh "s3://$TEMPLATE_BUCKET/scripts/"
aws s3 cp scripts/ec2-init-services.sh "s3://$TEMPLATE_BUCKET/scripts/"
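If every init script in the directory should ship, the three copies can collapse into one sync call; this is just an equivalent alternative, with the include pattern mirroring the filenames above:
# Alternative: upload every ec2-init-*.sh in scripts/ in one command
aws s3 sync scripts/ "s3://$TEMPLATE_BUCKET/scripts/" \
  --exclude "*" --include "ec2-init-*.sh"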
Then the UserData becomes a thin bootstrap that fetches and executes these scripts:
UserData:
  Fn::Base64: !Sub |
    #!/bin/bash
    set -e
    TEMPLATE_BUCKET="cf-templates-${AWS::AccountId}-${AWS::Region}"
    aws s3 cp "s3://$TEMPLATE_BUCKET/scripts/ec2-init-system.sh" /tmp/
    aws s3 cp "s3://$TEMPLATE_BUCKET/scripts/ec2-init-minecraft.sh" /tmp/
    aws s3 cp "s3://$TEMPLATE_BUCKET/scripts/ec2-init-services.sh" /tmp/
    chmod +x /tmp/ec2-init-*.sh
    /tmp/ec2-init-system.sh
    /tmp/ec2-init-minecraft.sh
    /tmp/ec2-init-services.sh
This pattern is cleaner anyway: modular scripts are easier to test and update independently, and there's no practical size limit on what S3 can store.
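One thing the bootstrap quietly assumes is that the instance profile can read the bucket; without s3:GetObject on the scripts prefix, every aws s3 cp above fails at boot. If that grant isn't already in the template, a minimal inline policy looks roughly like this (the role and policy names are placeholders, not my actual resources):
# Placeholder role/policy names; grants read-only access to the scripts prefix
aws iam put-role-policy \
  --role-name minecraft-instance-role \
  --policy-name read-init-scripts \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::cf-templates-ACCOUNT_ID-REGION/scripts/*"
    }]
  }'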
Tracing Bugs Across System Boundaries
With the infrastructure stable, the session shifted to application-level issues. What looked like four unrelated bugs turned out to share a common theme: data failing to cross system boundaries correctly.
The Missing Whitelist Entry
A user had linked their Discord account but wasnât whitelisted. Rather than diving into sync script code, I checked timestamps first:
aws dynamodb get-item \
--table-name minecraft-players \
--key '{"discord_id": {"S": "123456789"}}' \
--query "Item.linked_at.S"
The user linked at 11:42 PM. The server had shut down at 10:30 PM. The whitelist sync only runs when Minecraft is active, so their entry sat in DynamoDB without being applied. Not a code bug, just an edge case: the offline-to-online transition didn't trigger a sync.
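The remedy is to make sure a sync pass runs when the server comes back up, so offline-era links still get applied. A rough sketch of such a catch-up pass, not the real sync script (the minecraft_name attribute is a placeholder for whatever the table actually stores, and the mcrcon flags mirror the diagnostic command later in this post):
# Catch-up whitelist sync at startup: apply Discord links that arrived
# while the server was offline. minecraft_name is a placeholder attribute.
NAMES=$(aws dynamodb scan \
  --table-name minecraft-players \
  --query "Items[].minecraft_name.S" \
  --output text)
for NAME in $NAMES; do
  mcrcon -H localhost -P 25575 -p "$RCON_PASSWORD" "whitelist add $NAME"
done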
The Restore Script That Wouldnât Finish
This one took longer. The backup restore appeared to work (files downloaded, extracted, permissions fixed), but the server readiness check never completed. The script hung indefinitely waiting for an RCON response.
The issue was how credentials crossed a process boundary. The restore script launches the readiness check as a subprocess, and the RCON password it holds was never placed in that subprocess's environment:
# Before: RCON password never reaches the subprocess environment
subprocess.run(["./check_server_ready.sh"], check=True)
The fix was explicit about passing the environment:
# After: pass RCON credentials to the subprocess
env = os.environ.copy()
env["RCON_PASSWORD"] = self.rcon_password
subprocess.run(["./check_server_ready.sh"], env=env, check=True)
Tracing this required connecting three files across two process boundaries: restore script → subprocess call → health check script → RCON connection → missing password. Having full codebase context made that chain visible immediately.
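For context, the readiness check itself is little more than a poll loop over RCON. Reconstructed roughly (this is not the actual script), it shows why a missing password hangs rather than fails:
#!/bin/bash
# check_server_ready.sh (reconstruction): poll until RCON answers.
# With RCON_PASSWORD unset, every attempt fails auth and the loop never exits.
until mcrcon -H localhost -P "${RCON_PORT:-25575}" -p "$RCON_PASSWORD" list >/dev/null 2>&1; do
  sleep 5
done
echo "Minecraft server is accepting RCON connections"
A bounded retry count would have surfaced the misconfiguration as a failure instead of a silent hang, which is a separate improvement worth making.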
RCON Command Feedback Spam
The final issue was cosmetic but annoying: every automated save-all command appeared in players' game chat. The fix was adding a filter to skip automated command responses before relaying to Discord. Minecraft's gamerule sendCommandFeedback would also work, but I wanted players to see feedback from their own commands, just not the server's scheduled ones.
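For comparison, the blunter server-wide switch mentioned above is a single RCON call; it just silences feedback for player commands too:
# Not the chosen fix: suppresses command feedback for everyone
mcrcon -H localhost -P 25575 -p "$RCON_PASSWORD" "gamerule sendCommandFeedback false"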
Configuration Sprawl: The Root Cause
Debugging these issues revealed a deeper problem: the same configuration values were defined in multiple places with slight variations. The whitelist sync needed the server port. The restore script needed RCON credentials. The Discord bot needed to know which commands to filter.
Each component sourced these values differently: some from environment variables, some hardcoded, some from CloudFormation parameters. This made debugging frustrating because fixing a value in one place didn't fix it everywhere.
The solution was creating a single .env.example as the source of truth:
SERVER_PORT=25565
RCON_PASSWORD=changeme
RCON_PORT=25575
MAX_PLAYERS=10
SPAWN_PROTECTION=0
Then I updated the deploy script and CloudFormation template to read from this file consistently. Not glamorous, but it prevents the "but I already changed that setting" debugging loop.
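In the deploy script, the consistency part is just sourcing the file before anything else runs; the template filename and stack parameter names below are illustrative, not the actual ones:
# Load .env and export everything it defines
set -a
. ./.env
set +a

# Illustrative parameter names; the point is every value traces back to .env
aws cloudformation deploy \
  --stack-name minecraft-server \
  --template-file template.yaml \
  --parameter-overrides RconPort="$RCON_PORT" MaxPlayers="$MAX_PLAYERS"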
Diagnostic Commands Worth Keeping
For anyone debugging similar AWS infrastructure issues:
# Get failed resource details from CloudFormation
aws cloudformation describe-stack-events \
--stack-name YOUR_STACK \
--query "StackEvents[?ResourceStatus=='CREATE_FAILED']"
# Check encoded UserData size before deploying (the limit is 25,600 bytes);
# -w0 disables line wrapping so the count matches what EC2 actually receives
base64 -w0 < userdata.sh | wc -c
# Verify S3 script uploads
aws s3 ls s3://YOUR_BUCKET/scripts/
# Test RCON connectivity manually
mcrcon -H localhost -P 25575 -p "$RCON_PASSWORD" "list"
The UserData size check is now part of my pre-deployment validation. Catching the limit before CloudFormation fails saves a 10-minute rollback cycle.
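A sketch of that guard, assuming the rendered UserData sits in userdata.sh:
# Fail fast locally instead of waiting for a CloudFormation rollback
ENCODED_BYTES=$(base64 -w0 < userdata.sh | wc -c)
if [ "$ENCODED_BYTES" -gt 25600 ]; then
  echo "UserData is ${ENCODED_BYTES} bytes encoded; the EC2 limit is 25600" >&2
  exit 1
fi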
What Got Fixed
By session's end: deployments work reliably with S3-based script loading, the whitelist syncs correctly on server startup, restore operations complete with proper credential passing, and command feedback no longer spams player chat.
The real lesson wasn't any single fix; it was recognizing that cloud debugging is often about finding which boundary the data didn't cross. Environment variables that don't propagate. Process state that doesn't inherit. Service assumptions that don't hold when components restart. Today's bugs all lived in those gaps, and understanding that pattern will make the next round of debugging faster.