Troubleshooting Guide¶
This document covers common issues and their solutions.
Connection Issues¶
"Connection refused" error¶
Symptoms: Client cannot connect to server
Causes and solutions:
- Server not running - start the server
- Wrong address/port - check configuration
- Firewall blocking connection - add firewall rule
- Server bound to wrong interface - check bind_address in config
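A quick TCP probe from the client machine can help separate a stopped server from a wrong address or a blocked port. The sketch below assumes bash and coreutils timeout are installed; the helper name check_port and the address and port in the example are illustrative, not part of router-hosts.

```shell
# Probe a TCP port using bash's built-in /dev/tcp redirection.
# Returns 0 if the connection opens within 3 seconds, non-zero otherwise.
check_port() {
    host=$1
    port=$2
    # $0 and $1 inside the inner script are the host and port passed after it.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (hypothetical address and port):
# check_port router.example.com 50051 && echo reachable || echo unreachable
```

If the probe succeeds from the server host but fails from the client, the firewall or the bind address is the more likely cause.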
"Certificate verify failed" error¶
Symptoms: TLS handshake fails
Causes and solutions:
- Wrong CA certificate - verify CA matches server's issuer
- Certificate expired - check certificate dates with openssl x509 -in cert.pem -noout -dates
- Hostname mismatch - verify server certificate includes the hostname you're connecting to
- Clock skew - ensure client and server clocks are synchronized
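The expiry check above can also be scripted. A minimal sketch using the openssl x509 -checkend option; the helper name cert_valid_for is an assumption made for illustration.

```shell
# Succeeds if the certificate is still valid N seconds from now
# (N defaults to 0, meaning "valid right now").
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "${2:-0}" >/dev/null
}

# Example: warn if cert.pem expires within 7 days.
# cert_valid_for cert.pem 604800 || echo "cert.pem expires within 7 days"
#
# To list the names a certificate covers when a hostname mismatch is
# suspected (the -ext option needs OpenSSL 1.1.1 or newer):
# openssl x509 -in cert.pem -noout -subject -ext subjectAltName
```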
"Permission denied" error¶
Symptoms: Client authenticates but operations fail
Causes and solutions:
- Client certificate not trusted by server CA
- Client certificate revoked (if CRL checking enabled)
- Certificate subject doesn't match access rules (if implemented)
Database Issues¶
"Database locked" error (SQLite)¶
Symptoms: Operations fail with database lock errors
Causes and solutions:
- Multiple processes accessing the same database file - use PostgreSQL for multi-instance deployments
- Stale lock file - delete the .db-wal and .db-shm files if the server crashed
- NFS/network filesystem - SQLite doesn't work well on network filesystems
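The -wal file can hold writes that have not yet been checkpointed into the main database, so moving the sidecar files aside may be a safer first step than deleting them. A sketch under the assumption that the server is stopped; the helper name and the backup directory are hypothetical.

```shell
# Move SQLite sidecar files (<db>-wal, <db>-shm) into a backup directory
# instead of deleting them. Run only while the server is stopped.
stash_sqlite_sidecars() {
    db=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for suffix in -wal -shm; do
        if [ -e "$db$suffix" ]; then
            mv "$db$suffix" "$backup_dir/" || return 1
        fi
    done
}

# Example (database path as used in the recovery section of this guide):
# stash_sqlite_sidecars /var/lib/router-hosts/hosts.db /var/tmp/hosts-db-sidecars
```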
Hosts File Issues¶
Changes not appearing¶
Symptoms: Added/updated hosts not visible in /etc/hosts
Causes and solutions:
- Server hasn't regenerated file - check server logs
- Post-edit hook failed - check hook execution logs
- File permissions - verify server can write to hosts file
- Atomic rename failed - check disk space and filesystem
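The permission and disk-space checks can be run together. A rough sketch; the helper name is hypothetical, and it assumes the replacement file is written in the same directory as the target before being renamed, which is how atomic renames are typically done. Run it as the account the server runs under.

```shell
# Report whether a file and its directory are writable by the current user,
# then show free space on the filesystem that holds them.
check_hosts_target() {
    file=$1
    dir=$(dirname "$file")
    [ -w "$file" ] || { echo "not writable: $file"; return 1; }
    # The temporary file for the rename is assumed to live next to the target.
    [ -w "$dir" ] || { echo "directory not writable: $dir"; return 1; }
    df -P "$dir"
}

# Example:
# check_hosts_target /etc/hosts
```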
Hosts file corrupted¶
Symptoms: /etc/hosts has invalid content
Solutions:
- Roll back to a previous snapshot:
router-hosts snapshot rollback <id>
- List snapshots and reimport:
router-hosts snapshot list
# Choose a snapshot ID from the list, then roll back:
router-hosts snapshot rollback <id>
ACME Issues¶
See ACME documentation for certificate-specific issues.
Quick ACME Checklist¶
- HTTP-01 failures:
  - DNS points to this server?
  - Port 80 accessible?
  - Rate limited? (check logs)
- DNS-01 failures:
  - Zone exists in provider?
  - API token has correct permissions?
  - Record propagated? (use dig)
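For the propagation check, DNS-01 validation looks up a TXT record at _acme-challenge.<domain>. The helper below is a small illustration (its name is an assumption) that derives the record name, including for wildcard domains; example.com is a placeholder.

```shell
# Print the TXT record name that a DNS-01 challenge uses for a domain.
# A leading "*." (wildcard certificate) is dropped, because the challenge
# record for *.example.com lives at _acme-challenge.example.com.
acme_challenge_name() {
    printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Example checks:
# dig +short TXT "$(acme_challenge_name example.com)"   # DNS-01: record visible?
# dig +short A example.com                               # HTTP-01: points to this server?
# curl -sI http://example.com/                           # HTTP-01: port 80 answers?
```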
Performance Issues¶
Slow list/search operations¶
Causes and solutions:
- Large dataset - add pagination with --limit and --offset
- Missing indexes - check database configuration
- Network latency - consider local caching or PostgreSQL read replicas
High memory usage¶
Causes and solutions:
- Large import - use streaming import instead of loading all at once
- Event log too large - configure retention to prune old events
- Connection pool too large - reduce pool size
Hook Issues¶
Hooks not executing¶
Causes and solutions:
- Hook disabled - check the [hooks] section in config
- Script not executable - chmod +x /path/to/hook.sh
chmod +x /path/to/hook.sh - Script not found - use absolute paths
- Timeout - hooks have 30s default timeout
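The path and permission checks above can be combined into one pre-flight test. A minimal sketch; the helper name check_hook is hypothetical and it does not read the router-hosts config.

```shell
# Verify that a hook path is absolute, exists, and is executable.
check_hook() {
    hook=$1
    case $hook in
        /*) ;;
        *) echo "not an absolute path: $hook"; return 1 ;;
    esac
    [ -f "$hook" ] || { echo "not found: $hook"; return 1; }
    [ -x "$hook" ] || { echo "not executable: $hook (try chmod +x)"; return 1; }
    echo "ok: $hook"
}

# Example:
# check_hook /path/to/hook.sh
```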
Hook executing but no effect¶
Causes and solutions:
- Environment variables - hooks run in a limited environment; set PATH and any required variables inside the script
- Working directory - hooks run from server's working directory
- Error not logged - add explicit logging to hook script
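One way to add explicit logging is a small function at the top of the hook script. This is a sketch only: HOOK_LOG and the default log path are assumptions, not settings that router-hosts defines.

```shell
# Append a timestamped message to a log file.
# HOOK_LOG is a hypothetical variable; it defaults to a file under /tmp.
hook_log() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
        >>"${HOOK_LOG:-/tmp/router-hosts-hook.log}"
}

# Typical use inside a hook script:
# PATH=/usr/local/bin:/usr/bin:/bin          # do not rely on the inherited PATH
# hook_log "started in $(pwd) as $(id -un)"  # records working directory and user
```

Logging the working directory and user on entry makes the two other causes in this list visible in the same file.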
Kubernetes Operator Issues¶
The router-hosts operator watches Kubernetes resources and creates DNS entries automatically.
Note: The Go operator reconciles two resource types: HostMapping and Traefik IngressRoute/IngressRouteTCP. There is no Kubernetes Service controller and no enabled/hostname/ip-address annotation API. See the Kubernetes Operator guide.
HostMapping not syncing¶
Symptoms: HostMapping exists but no DNS entry is created; status.phase is Error.
Causes and solutions:
- spec.ip is missing or invalid: it is required and must be a valid IPv4/IPv6 address. (The field is spec.ip; the pre-0.10.2 CRD used spec.ipAddress.)
- Read the failure reason from status:
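A possible way to read the status with kubectl is sketched below. It assumes the CRD registers the resource name hostmapping (the lowercased kind); the wrapper function name is hypothetical.

```shell
# Print the status object (including status.phase) of a HostMapping.
# Usage: hostmapping_status <name> <namespace>
hostmapping_status() {
    kubectl get hostmapping "$1" -n "$2" -o jsonpath='{.status}'
}

# Example:
# hostmapping_status <name> <namespace>
# kubectl describe hostmapping <name> -n <namespace>   # also shows recent events
```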
IngressRoute hostnames not registered¶
Symptoms: A Traefik IngressRoute/IngressRouteTCP exists but its hosts are missing from router-hosts.
Causes and solutions:
- Only Host(`…`) (IngressRoute) and HostSNI(`…`) (IngressRouteTCP) patterns in spec.routes[].match are extracted. Other match expressions yield no hostnames.
- Hostnames that fail RFC 1123 validation are logged and skipped; check the operator logs.
- Entries are created with the operator's --default-ingress-ip. If that flag is empty, hosts are created with no IP; set routerHosts.defaultIngressIP in the chart.
# Operator logs (extraction warnings, gRPC errors)
kubectl logs -n router-hosts-system -l app.kubernetes.io/name=router-hosts-operator --tail=100
# Inspect the operator-managed host-id map on the resource
kubectl get ingressroute <name> -n <namespace> \
-o jsonpath='{.metadata.annotations.router-hosts\.fzymgc\.house/host-ids}'
DNS entry not updated after a resource change¶
Symptoms: Changed a HostMapping/IngressRoute but router-hosts doesn't reflect it.
Causes and solutions:
- Check operator logs for reconcile errors.
- Verify the router-hosts server is reachable.
- Transient failures are retried with a requeue backoff, so allow some time before checking again.
Operator not connecting to router-hosts server¶
Symptoms: All reconciliations fail with client errors
Causes and solutions:
- Verify the server address passed via --server-address (Helm routerHosts.serverAddress)
- Check mTLS certificates are valid and mounted
- Verify network connectivity between operator and server
# Check operator configuration (flags are on the Deployment, not a CRD)
kubectl get deployment -n router-hosts-system router-hosts-operator \
-o jsonpath='{.spec.template.spec.containers[0].args}'
# Check certificate secrets exist
kubectl get secrets -n router-hosts-system | grep tls
Logging and Debugging¶
Enable debug logging¶
# Server (with debug logging)
LOG_LEVEL=debug router-hosts serve
# Very verbose
LOG_LEVEL=trace router-hosts serve
Common log patterns¶
| Pattern | Meaning |
|---|---|
| accepted connection | Client connected successfully |
| TLS handshake failed | Certificate issue |
| event stored | Write operation succeeded |
| regenerating hosts file | About to update /etc/hosts |
| hook completed | Post-edit hook finished |
| SIGHUP received | Certificate reload triggered |
Getting Help¶
If you can't resolve an issue:
- Check the GitHub issues for similar problems
- Enable debug logging and capture relevant output
- Open a new issue with:
  - router-hosts version (router-hosts --version)
  - Operating system and version
  - Configuration (redact sensitive values)
  - Error messages and logs
  - Steps to reproduce
Recovery Procedures¶
Complete database recovery¶
If the database is corrupted beyond repair:
# Stop server
systemctl stop router-hosts
# Backup corrupted database (for analysis)
mv /var/lib/router-hosts/hosts.db /var/lib/router-hosts/hosts.db.corrupt
# Reimport from most recent export
router-hosts host import /backup/hosts-export.json --input-format json
# Or reimport from /etc/hosts directly
router-hosts host import /etc/hosts
Certificate emergency replacement¶
If certificates are compromised:
# Generate new certificates (example with mkcert)
mkcert -install
mkcert -cert-file server.crt -key-file server.key router.example.com
# Replace on server
cp server.crt /etc/router-hosts/
cp server.key /etc/router-hosts/
# Trigger reload
pkill -HUP router-hosts
# Generate new client certs and distribute to clients
mkcert -client -cert-file client.crt -key-file client.key client@example.com
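Before triggering the reload in the procedure above, it may be worth confirming that the new certificate and key belong together, since a mismatched pair would fail the TLS handshake. A minimal sketch using standard openssl subcommands; the helper name cert_matches_key is an assumption.

```shell
# Succeeds if the certificate's public key equals the public key derived
# from the private key file.
cert_matches_key() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
    key_pub=$(openssl pkey -in "$2" -pubout) || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Example, using the file names from the procedure above:
# cert_matches_key server.crt server.key && echo "pair matches"
```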