Troubleshooting Guide

This document covers common issues and their solutions.

Connection Issues

"Connection refused" error

Symptoms: Client cannot connect to server

Causes and solutions:

Server not running - start the server
Wrong address/port - check that the client's configured address and port match the server's
Firewall blocking connection - add a firewall rule allowing the server's port
Server bound to wrong interface - check bind_address in config
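
The checks above can be scripted roughly as follows. This is a minimal sketch, assuming bash and the coreutils timeout command are available; the host name and port 50051 in the usage comment are placeholders, not values taken from router-hosts.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Hypothetical usage (substitute your configured host and port):
#   port_open router.example.com 50051 && echo reachable || echo "refused or filtered"
#
# On the server itself, list listening TCP sockets to see which address
# the process is actually bound to (compare with bind_address):
#   ss -ltnp
```

An immediate failure usually means nothing is listening on that address and port; a failure after the full 3 seconds points more toward a firewall silently dropping packets.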

"Certificate verify failed" error

Symptoms: TLS handshake fails

Causes and solutions:

Wrong CA certificate - verify CA matches server's issuer
Certificate expired - check certificate dates with openssl x509 -in cert.pem -noout -dates
Hostname mismatch - verify server certificate includes the hostname you're connecting to
Clock skew - ensure client and server clocks are synchronized
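
A few openssl commands cover most of this list. The sketch below assumes OpenSSL 1.1.1 or later (for the -ext option) and uses placeholder file names (cert.pem, ca.pem).

```shell
# cert_valid_for FILE SECONDS: succeed if the certificate in FILE will still
# be valid SECONDS from now (0 means "not expired right now").
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$2" >/dev/null
}

# cert_summary FILE: print the fields relevant to the causes listed above.
cert_summary() {
    openssl x509 -in "$1" -noout -subject -issuer -dates -ext subjectAltName
}

# Placeholder usage:
#   cert_summary cert.pem
#   cert_valid_for cert.pem 604800 || echo "expires within 7 days"
#   openssl verify -CAfile ca.pem cert.pem    # does the CA match the issuer?
#   date -u                                   # run on client and server to spot clock skew
```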

"Permission denied" error

Symptoms: Client authenticates but operations fail

Causes and solutions:

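Client certificate not trusted by server CA - confirm the client certificate was issued by a CA the server trusts
Client certificate revoked (if CRL checking is enabled) - issue a new client certificate
Certificate subject doesn't match access rules (if implemented) - compare the subject (openssl x509 -in client.crt -noout -subject) with the configured rules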

Database Issues

"Database locked" error (SQLite)

Symptoms: Operations fail with database lock errors

Causes and solutions:

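Multiple processes accessing the same database file - use PostgreSQL for multi-instance deployments
Leftover .db-wal and .db-shm files after a crash - these are SQLite's write-ahead log and shared-memory index, not lock files, and SQLite normally recovers from them on the next open. Delete them only as a last resort, with the server stopped and a backup taken, because the .db-wal file can hold committed changes
NFS/network filesystem - file locking is often unreliable on network filesystems; keep the SQLite file on a local disk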
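
To narrow down which of these applies, it helps to see who has the file open and what filesystem it lives on. A sketch, assuming Linux with findmnt (util-linux) and lsof installed; the database path is the one used in the recovery section of this guide, and the list of filesystem types is illustrative rather than exhaustive.

```shell
# is_network_fs_type TYPE: succeed if TYPE names a common network filesystem.
is_network_fs_type() {
    case "$1" in
        nfs*|cifs|smb*|9p|ceph|fuse.sshfs|fuse.glusterfs) return 0 ;;
        *) return 1 ;;
    esac
}

# check_db_location FILE: warn if FILE sits on a network filesystem.
check_db_location() {
    fstype=$(findmnt -n -o FSTYPE -T "$1") || return 2
    if is_network_fs_type "$fstype"; then
        echo "warning: $1 is on $fstype; SQLite locking may be unreliable there"
        return 1
    fi
    echo "ok: $1 is on $fstype"
}

# Usage sketch:
#   check_db_location /var/lib/router-hosts/hosts.db
#   lsof /var/lib/router-hosts/hosts.db    # lists every process holding the file open
```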

Hosts File Issues

Changes not appearing

Symptoms: Added/updated hosts not visible in /etc/hosts

Causes and solutions:

Server hasn't regenerated file - check server logs
Post-edit hook failed - check hook execution logs
File permissions - verify server can write to hosts file
Atomic rename failed - check disk space and filesystem
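
The permission and disk-space items can be checked from a shell. An atomic replace typically writes a temporary file next to the target and renames it over the target, so the account running the server needs write access to the containing directory, not only to the file itself. A rough sketch:

```shell
# can_replace_file PATH: succeed if the current user could replace PATH via
# write-to-temp-then-rename, i.e. the parent directory exists and is writable.
can_replace_file() {
    dir=$(dirname -- "$1")
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Usage sketch, run as the account the server runs under:
#   can_replace_file /etc/hosts || echo "cannot create or rename files in /etc"
#   ls -l /etc/hosts     # owner and mode of the target file
#   df -h /etc           # free space on the filesystem holding the file
```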

Hosts file corrupted

Symptoms: /etc/hosts has invalid content

Solutions:

Roll back to a previous snapshot: router-hosts snapshot rollback <id>
If you don't know the snapshot ID, list the snapshots first:

router-hosts snapshot list
# Choose a snapshot ID from the list, then roll back:
router-hosts snapshot rollback <id>

ACME Issues

See ACME documentation for certificate-specific issues.

Quick ACME Checklist

HTTP-01 failures:
  DNS points to this server?
  Port 80 accessible?
  Rate limited? (check logs)
DNS-01 failures:
  Zone exists in provider?
  API token has correct permissions?
  Record propagated? (use dig)
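
The DNS items can be checked with dig. For DNS-01, the validation record is a TXT record at _acme-challenge.<domain>, and for a wildcard name the leading "*." is dropped. A small sketch, with example.com as a placeholder domain:

```shell
# acme_challenge_name DOMAIN: print the DNS-01 TXT record name for DOMAIN,
# stripping a leading "*." from wildcard names.
acme_challenge_name() {
    printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Usage sketch (placeholder domain):
#   dig +short A example.com                                  # HTTP-01: should be this server's address
#   dig +short TXT "$(acme_challenge_name '*.example.com')"   # DNS-01: has the record propagated?
#   curl -sI http://example.com/.well-known/acme-challenge/   # HTTP-01: is port 80 reachable?
```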

Performance Issues

Slow list/search operations

Causes and solutions:

Large dataset - add pagination with --limit and --offset
Missing indexes - check database configuration
Network latency - consider local caching or PostgreSQL read replicas
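
Paging through a large result set can be scripted as a loop that advances --offset until a short page comes back. The sketch below is deliberately generic: it takes the list command as arguments, because the exact subcommand (router-hosts host list in the usage comment) is an assumption, and it assumes the command prints one entry per line.

```shell
# page_through LIMIT CMD...: run CMD repeatedly with --limit/--offset appended,
# printing each page, until a page is empty or shorter than LIMIT.
page_through() {
    limit=$1
    shift
    offset=0
    while :; do
        page=$("$@" --limit "$limit" --offset "$offset") || return 1
        [ -n "$page" ] || break
        printf '%s\n' "$page"
        count=$(printf '%s\n' "$page" | wc -l)
        if [ "$count" -lt "$limit" ]; then break; fi
        offset=$((offset + limit))
    done
}

# Hypothetical usage (subcommand name is assumed, flags are from this guide):
#   page_through 100 router-hosts host list
```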

High memory usage

Causes and solutions:

Large import - use streaming import instead of loading all at once
Event log too large - configure retention to prune old events
Connection pool too large - reduce pool size

Hook Issues

Hooks not executing

Causes and solutions:

Hook disabled - check [hooks] section in config
Script not executable - chmod +x /path/to/hook.sh
Script not found - use absolute paths
Timeout - hooks have a 30s default timeout; check whether the script runs longer than that
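
The file-level checks in this list are easy to run by hand. A small sketch that reports the first problem it finds; /path/to/hook.sh is the same placeholder used above:

```shell
# check_hook PATH: report the first problem that would stop PATH from running as a hook.
check_hook() {
    case "$1" in
        /*) ;;
        *) echo "not an absolute path: $1"; return 1 ;;
    esac
    [ -f "$1" ] || { echo "not found: $1"; return 1; }
    [ -x "$1" ] || { echo "not executable (try chmod +x): $1"; return 1; }
    echo "ok: $1"
}

# Usage sketch:
#   check_hook /path/to/hook.sh
#   time /path/to/hook.sh     # compare the run time against the 30s timeout
```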

Hook executing but no effect

Causes and solutions:

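Environment variables - hooks run in a limited environment; set PATH and any other required variables inside the script
Working directory - hooks run from the server's working directory; use absolute paths or cd explicitly
Error not logged - add explicit logging to the hook script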
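
A hook that appears to do nothing is easier to debug when it sets its own environment and writes its own log. A minimal skeleton along those lines; the log path and the HOOK_LOG variable are made up for this example and are not router-hosts settings:

```shell
#!/bin/sh
# Hypothetical hook skeleton: explicit PATH, explicit working directory, explicit log.
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export PATH
HOOK_LOG=${HOOK_LOG:-/tmp/router-hosts-hook.log}

# log MESSAGE...: append a UTC-timestamped line to the hook's own log file.
log() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >>"$HOOK_LOG"
}

cd / || { log "cannot cd to /"; exit 1; }
log "hook started (cwd=$(pwd), uid=$(id -u))"

# ... the hook's real work goes here, using absolute paths ...

log "hook finished"
```

If the "hook started" line never shows up in the log, the script is not being run at all; if it shows up without "hook finished", the failure is inside the hook's own work.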

Kubernetes Operator Issues

The router-hosts operator watches Kubernetes resources and creates DNS entries automatically.

Note: The Go operator reconciles two resource types — HostMapping and Traefik IngressRoute/IngressRouteTCP. There is no Kubernetes Service controller and no enabled/hostname/ip-address annotation API. See the Kubernetes Operator guide.

HostMapping not syncing

Symptoms: HostMapping exists but no DNS entry is created; status.phase is Error.

Causes and solutions:

spec.ip is missing or invalid — it is required and must be a valid IPv4/IPv6 address. (The field is spec.ip; the pre-0.10.2 CRD used spec.ipAddress.)
Read the failure reason from status:

kubectl get hostmapping <name> -n <namespace> \
  -o jsonpath='{.status.phase}{" "}{.status.message}'

IngressRoute hostnames not registered

Symptoms: A Traefik IngressRoute/IngressRouteTCP exists but its hosts are missing from router-hosts.

Causes and solutions:

Only Host(`…`) (IngressRoute) and HostSNI(`…`) (IngressRouteTCP) patterns in spec.routes[].match are extracted. Other match expressions yield no hostnames.
Hostnames that fail RFC 1123 validation are logged and skipped — check the operator logs.
Entries are created with the operator's --default-ingress-ip. If that flag is empty, hosts are created with no IP; set routerHosts.defaultIngressIP in the chart.

# Operator logs (extraction warnings, gRPC errors)
kubectl logs -n router-hosts-system -l app.kubernetes.io/name=router-hosts-operator --tail=100

# Inspect the operator-managed host-id map on the resource
kubectl get ingressroute <name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.router-hosts\.fzymgc\.house/host-ids}'

DNS entry not updated after a resource change

Symptoms: Changed a HostMapping/IngressRoute but router-hosts doesn't reflect it.

Causes and solutions:

Check operator logs for reconcile errors.
Verify the router-hosts server is reachable.
Transient failures are retried with a requeue backoff — give it a moment.

Operator not connecting to router-hosts server

Symptoms: All reconciliations fail with client errors

Causes and solutions:

Verify the server address passed via --server-address (Helm routerHosts.serverAddress)
Check mTLS certificates are valid and mounted
Verify network connectivity between operator and server

# Check operator configuration (flags are on the Deployment, not a CRD)
kubectl get deployment -n router-hosts-system router-hosts-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Check certificate secrets exist
kubectl get secrets -n router-hosts-system | grep tls

Logging and Debugging

Enable debug logging

# Server (with debug logging)
LOG_LEVEL=debug router-hosts serve

# Very verbose
LOG_LEVEL=trace router-hosts serve

Common log patterns

Pattern	Meaning
`accepted connection`	Client connected successfully
`TLS handshake failed`	Certificate issue
`event stored`	Write operation succeeded
`regenerating hosts file`	About to update /etc/hosts
`hook completed`	Post-edit hook finished
`SIGHUP received`	Certificate reload triggered
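
Once server output has been captured to a file, counting these patterns gives a quick overview of what the server has been doing. A sketch, where server.log is a placeholder for wherever the output was saved:

```shell
# summarize_log FILE: count how often each known log pattern occurs in FILE.
summarize_log() {
    for pattern in "accepted connection" "TLS handshake failed" "event stored" \
                   "regenerating hosts file" "hook completed" "SIGHUP received"; do
        # grep -c prints 0 (and exits non-zero) when nothing matches; only the count is used.
        count=$(grep -c -F -- "$pattern" "$1")
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Usage sketch:
#   LOG_LEVEL=debug router-hosts serve 2>&1 | tee server.log
#   summarize_log server.log
```

For example, many "regenerating hosts file" lines with no matching "hook completed" lines would point toward the hook sections of this guide.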

Getting Help

If you can't resolve an issue:

Check the GitHub issues for similar problems
Enable debug logging and capture relevant output
Open a new issue with:
  router-hosts version (router-hosts --version)
  Operating system and version
  Configuration (redact sensitive values)
  Error messages and logs
  Steps to reproduce

Recovery Procedures

Complete database recovery

If the database is corrupted beyond repair:

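# Stop server
systemctl stop router-hosts

# Move the corrupted database aside (keep it for analysis)
mv /var/lib/router-hosts/hosts.db /var/lib/router-hosts/hosts.db.corrupt

# Start the server again so it initializes a fresh database
# (assumption: the import commands below need a running server)
systemctl start router-hosts

# Reimport from most recent export
router-hosts host import /backup/hosts-export.json --input-format json

# Or reimport from /etc/hosts directly
router-hosts host import /etc/hosts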
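
If the SQLite database has .db-wal and .db-shm files next to it, they belong to the same database and are best moved aside together with it, so that they stay paired for later analysis and are not picked up by a fresh database of the same name. A sketch of that step; the corrupt-<timestamp> directory name is a convention invented for this example:

```shell
# quarantine_db FILE: move FILE and any -wal/-shm sidecar files into a new
# timestamped directory next to it, and print that directory's path.
# Run only while the server is stopped.
quarantine_db() {
    [ -f "$1" ] || { echo "no such database: $1" >&2; return 1; }
    dest="$(dirname -- "$1")/corrupt-$(date -u +%Y%m%dT%H%M%SZ)"
    mkdir -- "$dest" || return 1
    for f in "$1" "$1-wal" "$1-shm"; do
        if [ -e "$f" ]; then
            mv -- "$f" "$dest/" || return 1
        fi
    done
    echo "$dest"
}

# Usage sketch:
#   systemctl stop router-hosts
#   quarantine_db /var/lib/router-hosts/hosts.db
```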

Certificate emergency replacement

If certificates are compromised:

# Generate new certificates (example with mkcert)
mkcert -install
mkcert -cert-file server.crt -key-file server.key router.example.com

# Replace on server
cp server.crt /etc/router-hosts/
cp server.key /etc/router-hosts/

# Trigger reload
pkill -HUP router-hosts

# Generate new client certs and distribute to clients
mkcert -client -cert-file client.crt -key-file client.key client@example.com
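
Before triggering the reload, it may be worth confirming that the new certificate and key actually belong together, since a mismatched pair is an easy mistake to make under time pressure. A sketch that compares the public key embedded in the certificate with the one derived from the private key, using the file names from the commands above:

```shell
# cert_matches_key CERT KEY: succeed if KEY is the private key for CERT.
# Returns 2 if either file cannot be read by openssl.
cert_matches_key() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 2
    key_pub=$(openssl pkey -in "$2" -pubout) || return 2
    [ "$cert_pub" = "$key_pub" ]
}

# Usage sketch:
#   cert_matches_key server.crt server.key && pkill -HUP router-hosts
```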