Troubleshooting Guide¶
This document covers common issues and their solutions.
Connection Issues¶
"Connection refused" error¶
Symptoms: Client cannot connect to server
Causes and solutions:
- Server not running - start the server
- Wrong address/port - check configuration
- Firewall blocking connection - add firewall rule
- Server bound to wrong interface - check
bind_addressin config
"Certificate verify failed" error¶
Symptoms: TLS handshake fails
Causes and solutions:
- Wrong CA certificate - verify CA matches server's issuer
- Certificate expired - check certificate dates with
openssl x509 -in cert.pem -noout -dates - Hostname mismatch - verify server certificate includes the hostname you're connecting to
- Clock skew - ensure client and server clocks are synchronized
"Permission denied" error¶
Symptoms: Client authenticates but operations fail
Causes and solutions:
- Client certificate not trusted by server CA
- Client certificate revoked (if CRL checking enabled)
- Certificate subject doesn't match access rules (if implemented)
Database Issues¶
"Database locked" error (SQLite)¶
Symptoms: Operations fail with database lock errors
Causes and solutions:
- Multiple processes accessing same database file - use PostgreSQL for multi-instance
- Stale lock file - delete
.db-waland.db-shmfiles if server crashed - NFS/network filesystem - SQLite doesn't work well on network filesystems
Hosts File Issues¶
Changes not appearing¶
Symptoms: Added/updated hosts not visible in /etc/hosts
Causes and solutions:
- Server hasn't regenerated file - check server logs
- Post-edit hook failed - check hook execution logs
- File permissions - verify server can write to hosts file
- Atomic rename failed - check disk space and filesystem
Hosts file corrupted¶
Symptoms: /etc/hosts has invalid content
Solutions:
- Rollback to previous snapshot:
router-hosts snapshot rollback <id> - List snapshots and reimport:
router-hosts snapshot list
# Choose a snapshot ID from the list, then rollback:
router-hosts snapshot rollback <id>
ACME Issues¶
See ACME documentation for certificate-specific issues.
Quick ACME Checklist¶
- HTTP-01 failures:
- DNS points to this server?
- Port 80 accessible?
-
Rate limited? (check logs)
-
DNS-01 failures:
- Zone exists in provider?
- API token has correct permissions?
- Record propagated? (use
dig)
Performance Issues¶
Slow list/search operations¶
Causes and solutions:
- Large dataset - add pagination with
--limitand--offset - Missing indexes - check database configuration
- Network latency - consider local caching or PostgreSQL read replicas
High memory usage¶
Causes and solutions:
- Large import - use streaming import instead of loading all at once
- Event log too large - configure retention to prune old events
- Connection pool too large - reduce pool size
Hook Issues¶
Hooks not executing¶
Causes and solutions:
- Hook disabled - check
[hooks]section in config - Script not executable -
chmod +x /path/to/hook.sh - Script not found - use absolute paths
- Timeout - hooks have 30s default timeout
Hook executing but no effect¶
Causes and solutions:
- Environment variables - hooks run in limited environment
- Working directory - hooks run from server's working directory
- Error not logged - add explicit logging to hook script
Kubernetes Operator Issues¶
The router-hosts operator watches Kubernetes resources and creates DNS entries automatically.
Service not being processed¶
Symptoms: Service exists but no DNS entry created
Causes and solutions:
- Missing
router-hosts.fzymgc.house/enabled: "true"annotation - Missing
router-hosts.fzymgc.house/hostnameannotation - required for Services - Invalid service type - only
LoadBalancerandNodePortare supported
# Check annotations on Service
kubectl get svc <name> -o jsonpath='{.metadata.annotations}'
# Verify operator is running
kubectl get pods -n router-hosts -l app=router-hosts-operator
"InvalidServiceType" warning event¶
Symptoms: Kubernetes event shows invalid service type
Cause: ClusterIP and ExternalName Services are not supported
Solution: Use LoadBalancer or NodePort service type, or remove the enabled annotation if DNS registration isn't needed.
"MissingHostname" warning event¶
Symptoms: Service annotated but no hostname configured
Cause: router-hosts.fzymgc.house/hostname annotation is required for Services (unlike Ingress which has spec.rules[].host)
Solution: Add the hostname annotation:
annotations:
router-hosts.fzymgc.house/enabled: "true"
router-hosts.fzymgc.house/hostname: "myservice.example.com"
"InvalidHostname" warning event¶
Symptoms: Hostname annotation present but rejected
Cause: Hostname doesn't conform to RFC 1123 format
Common issues:
- Contains underscores (use hyphens instead)
- Starts or ends with hyphen
- Contains consecutive dots
- Label exceeds 63 characters
Solution: Fix the hostname format:
# Wrong
router-hosts.fzymgc.house/hostname: "my_service.example.com"
router-hosts.fzymgc.house/hostname: "-service.example.com"
# Correct
router-hosts.fzymgc.house/hostname: "my-service.example.com"
"MissingIPAddress" warning event (NodePort)¶
Symptoms: NodePort Service not creating DNS entry
Cause: NodePort Services require explicit IP address annotation because they expose on all nodes
Solution: Add the IP annotation:
annotations:
router-hosts.fzymgc.house/enabled: "true"
router-hosts.fzymgc.house/hostname: "myservice.example.com"
router-hosts.fzymgc.house/ip-address: "192.168.1.100" # Required for NodePort
"PendingLoadBalancer" normal event¶
Symptoms: LoadBalancer Service waiting for IP
Cause: Cloud provider hasn't assigned an external IP yet
Solutions:
- Wait for cloud provider to provision load balancer
- Check cloud provider quotas and limits
- For bare-metal clusters, ensure MetalLB or similar is configured
# Check LoadBalancer status
kubectl get svc <name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
DNS entry not updated after Service change¶
Symptoms: Changed Service but DNS doesn't reflect updates
Causes and solutions:
- Check operator logs for errors
- Verify router-hosts server is reachable
- Check retry backoff - transient errors use exponential backoff
# Check operator logs
kubectl logs -n router-hosts -l app=router-hosts-operator --tail=100
# Force reconciliation by touching annotation
kubectl annotate svc <name> router-hosts.fzymgc.house/timestamp="$(date +%s)" --overwrite
Operator not connecting to router-hosts server¶
Symptoms: All reconciliations fail with client errors
Causes and solutions:
- Verify server address in RouterHostsConfig
- Check mTLS certificates are valid and mounted
- Verify network connectivity between operator and server
# Check operator configuration
kubectl get routerhostsconfig -A -o yaml
# Check certificate secrets exist
kubectl get secrets -n router-hosts | grep tls
Logging and Debugging¶
Enable debug logging¶
# Server (with debug logging)
LOG_LEVEL=debug router-hosts serve
# Very verbose
LOG_LEVEL=trace router-hosts serve
Common log patterns¶
| Pattern | Meaning |
|---|---|
accepted connection |
Client connected successfully |
TLS handshake failed |
Certificate issue |
event stored |
Write operation succeeded |
regenerating hosts file |
About to update /etc/hosts |
hook completed |
Post-edit hook finished |
SIGHUP received |
Certificate reload triggered |
Getting Help¶
If you can't resolve an issue:
- Check the GitHub issues for similar problems
- Enable debug logging and capture relevant output
- Open a new issue with:
- router-hosts version (
router-hosts --version) - Operating system and version
- Configuration (redact sensitive values)
- Error messages and logs
- Steps to reproduce
Recovery Procedures¶
Complete database recovery¶
If the database is corrupted beyond repair:
# Stop server
systemctl stop router-hosts
# Backup corrupted database (for analysis)
mv /var/lib/router-hosts/hosts.db /var/lib/router-hosts/hosts.db.corrupt
# Reimport from most recent export
router-hosts host import /backup/hosts-export.json --input-format json
# Or reimport from /etc/hosts directly
router-hosts host import /etc/hosts
Certificate emergency replacement¶
If certificates are compromised:
# Generate new certificates (example with mkcert)
mkcert -install
mkcert -cert-file server.crt -key-file server.key router.example.com
# Replace on server
cp server.crt /etc/router-hosts/
cp server.key /etc/router-hosts/
# Trigger reload
pkill -HUP router-hosts
# Generate new client certs and distribute to clients
mkcert -client -cert-file client.crt -key-file client.key client@example.com