Operations Guide

This document covers operational aspects of running router-hosts in production.

Post-Edit Hooks

The server runs shell commands after each /etc/hosts regeneration:

on_success hooks run after a successful regeneration (e.g., to reload dnsmasq)
on_failure hooks run after a failed regeneration (e.g., to send an alert)
Each hook runs with a 30-second timeout; a hook failure is logged but does not fail the operation that triggered it
Environment variables provide context (event type, entry count, error message)
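
To exercise a hook command before adding it to the configuration, the behaviour described here can be approximated from a shell. This is a minimal sketch, not the server's executor: try_hook is a hypothetical helper, it assumes the command is run through sh -c, and it relies on the GNU coreutils timeout command. Only the ROUTER_HOSTS_* variable names and the 30-second limit come from this guide.

```shell
# try_hook EVENT ENTRY_COUNT COMMAND
# Hypothetical helper: runs COMMAND with the documented context variables and
# a 30-second limit, reports failure, and never propagates it to the caller.
try_hook() {
    if ROUTER_HOSTS_EVENT="$1" ROUTER_HOSTS_ENTRY_COUNT="$2" \
        timeout 30 sh -c "$3"; then
        echo "hook ok"
    else
        # GNU timeout exits with status 124 when the limit is reached.
        echo "hook failed or timed out (status $?)"
    fi
    return 0
}

try_hook success 42 'echo "entries: $ROUTER_HOSTS_ENTRY_COUNT"'
```

Because try_hook always returns 0, it can be dropped into a larger test script without aborting it, which mirrors the "logged but not fatal" behaviour of real hooks.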

Configuration Example

Each hook requires a name and command:

[[hooks.on_success]]
name = "reload-dns"
command = "systemctl reload dnsmasq"

[[hooks.on_success]]
name = "notify-slack"
command = "/usr/local/bin/notify-slack.sh"

[[hooks.on_failure]]
name = "alert-ops"
command = "/usr/local/bin/alert-ops.sh"

Hook Name Requirements

Hook names must follow these rules:

Format: Kebab-case only (lowercase letters, numbers, hyphens)
Length: Maximum 50 characters
Uniqueness: No duplicate names within the same hook type
Examples: reload-dns, alert-ops-team, log-update

Hook names appear in health endpoints and logs, providing meaningful identification without exposing sensitive command details.
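
The format and length rules can be checked before a configuration is deployed. The function below is a sketch under stated assumptions: is_valid_hook_name is a hypothetical helper, and rejecting leading, trailing, or doubled hyphens is one interpretation of "kebab-case" that the guide does not spell out. Uniqueness within a hook type is not checked here.

```shell
# is_valid_hook_name NAME
# Returns 0 if NAME is 1-50 characters of lowercase letters, digits and
# hyphens, and 1 otherwise.
is_valid_hook_name() {
    name=$1
    [ -n "$name" ] && [ "${#name}" -le 50 ] || return 1
    case $name in
        # Any character outside the allowed set (spelled out to avoid
        # locale-dependent ranges).
        *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
        # Assumption: hyphens only separate words.
        -*|*-|*--*) return 1 ;;
    esac
    return 0
}

for candidate in reload-dns alert-ops-team Reload_DNS; do
    if is_valid_hook_name "$candidate"; then
        echo "$candidate: valid"
    else
        echo "$candidate: invalid"
    fi
done
```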

Environment Variables

Hooks receive these environment variables:

Variable	Description
`ROUTER_HOSTS_EVENT`	"success" or "failure"
`ROUTER_HOSTS_ENTRY_COUNT`	Number of host entries
`ROUTER_HOSTS_ERROR`	Error message (failure hooks only)
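
As an illustration of how a hook script might consume these variables, the following sketch builds a one-line summary. The script and the build_hook_message function are hypothetical (the contents of alert-ops.sh are not part of this guide); only the three variable names are taken from the table.

```shell
#!/bin/sh
# Hypothetical hook script body.

# Build a one-line summary from the hook environment.
build_hook_message() {
    event="${ROUTER_HOSTS_EVENT:-unknown}"
    count="${ROUTER_HOSTS_ENTRY_COUNT:-0}"
    if [ "$event" = "failure" ]; then
        # ROUTER_HOSTS_ERROR is only provided to on_failure hooks.
        printf 'router-hosts %s: %s (entries=%s)\n' \
            "$event" "${ROUTER_HOSTS_ERROR:-no error message}" "$count"
    else
        printf 'router-hosts %s (entries=%s)\n' "$event" "$count"
    fi
}

# Emit the summary on stdout; pipe it to whatever alerting tool is in use.
build_hook_message
```

Defaulting every variable keeps the script usable when it is run by hand outside the server.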

Certificate Reload via SIGHUP

The server supports dynamic TLS certificate reload via the SIGHUP signal (Unix only).

How It Works

1. The server receives a SIGHUP signal.
2. It validates the new certificates on disk (PEM format, key present, CA present).
3. If they are valid, it shuts down gracefully (30s drain) and restarts with the new certificates.
4. If they are invalid, it logs an error and keeps running with the current certificates.

Graceful Shutdown Details

During the 30-second graceful shutdown period:

New connections: Rejected (server stops accepting)
In-flight gRPC requests: Allowed to complete
WriteQueue operations: Continue processing until completion or timeout
Storage layer: Shared across reloads (database connections persist)

If the timeout expires before all operations complete, remaining connections are forcibly closed. The server logs a warning indicating some requests may have been interrupted.

What persists across reloads:

Storage backend (SQLite connection)
CommandHandler (business logic)
HookExecutor (post-edit hooks configuration)
HostsFileGenerator (output path configuration)

What is recreated:

TLS certificates (the whole point of SIGHUP)
gRPC server instance
WriteQueue (fresh channel and worker task)

Usage

# Find server PID and send SIGHUP
pkill -HUP router-hosts

# Or with explicit PID
kill -HUP $(pgrep router-hosts)

Integration with Vault Agent

Configure Vault Agent to send SIGHUP after certificate renewal:

template {
  source      = "cert.tpl"
  destination = "/etc/router-hosts/server.crt"
  command     = "pkill -HUP router-hosts"
}

Platform Support

Platform	SIGHUP Support
Linux	Yes
macOS	Yes
Windows	No (logs warning)

Certificate Validation

What gets validated on SIGHUP:

Files exist and are readable
Valid PEM format
Private key can be parsed
CA certificate can be parsed

What doesn't get validated:

Certificate expiry (server starts with expired certs)
CA chain validity (checked at connection time)
Key/cert match (checked by Go crypto/tls on load)
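
Because expiry and key/certificate agreement are not part of the SIGHUP validation, it can be useful to check them before sending the signal. The following is a sketch, assuming the openssl command-line tool is installed; preflight_certs is a hypothetical helper and the file paths are supplied by the caller.

```shell
# preflight_certs CERT_FILE KEY_FILE
# Returns 0 only if the certificate parses, has not expired, and carries the
# same public key as the private key file.
preflight_certs() {
    cert=$1
    key=$2
    # -checkend 0 fails if the certificate has already expired.
    openssl x509 -in "$cert" -noout -checkend 0 >/dev/null 2>&1 || {
        echo "certificate unreadable or expired: $cert" >&2
        return 1
    }
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 1
    key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || {
        echo "private key unreadable: $key" >&2
        return 1
    }
    # Compare the public key embedded in the certificate with the one
    # derived from the private key.
    [ "$cert_pub" = "$key_pub" ] || {
        echo "certificate and key do not match" >&2
        return 1
    }
}
```

A typical use would be to gate the signal on the check, for example preflight_certs /etc/router-hosts/server.crt /etc/router-hosts/server.key && pkill -HUP router-hosts, where the key path is an assumed location.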

Logging

The server uses slog for structured logging.

Log Levels

Level	Use Case
`error`	Operation failures, certificate errors
`warn`	Degraded operation, hook timeouts
`info`	Normal operations, startup/shutdown
`debug`	Request details, hook execution
`trace`	Wire-level details

Configuration

Set via LOG_LEVEL environment variable or in the TOML configuration:

# All components at info level
LOG_LEVEL=info router-hosts serve

# Enable debug logging
LOG_LEVEL=debug router-hosts serve

# Or configure in server config
[logging]
level = "info"  # error, warn, info, debug

Access Logs

The server emits structured access logs at INFO level for every gRPC request. These logs provide observability into server operations.

Log Format:

Each request produces a single log line with structured fields:

Field	Description	Always Present
`method`	gRPC method name (e.g., AddHost, GetHost)	Yes
`status`	Result status (`ok` or `error`)	Yes
`duration_ms`	Request duration in milliseconds	Yes
`id`	Host entry ULID (for CRUD operations)	When available
`hostname`	Host's hostname	When available
`ip`	Host's IP address	When available
`query`	Search query (for SearchHosts)	When available

Example Output:

INFO request method=AddHost id=01JG... hostname=myserver.local ip=192.168.1.10 status=ok duration_ms=5
INFO request method=GetHost id=01JG... hostname=myserver.local ip=192.168.1.10 status=ok duration_ms=2
INFO request method=UpdateHost id=01JG... hostname=newname.local ip=10.0.0.5 status=ok duration_ms=3
INFO request method=DeleteHost id=01JG... status=ok duration_ms=1
INFO request method=ListHosts status=ok duration_ms=12
INFO request method=SearchHosts query=*.example.com status=ok duration_ms=8

Security:

All user-provided fields (id, hostname, ip, query) are sanitized before logging to prevent log injection attacks:

Control characters (newlines, carriage returns, tabs) are replaced with the Unicode replacement character (U+FFFD)
Fields are truncated to 256 characters maximum to prevent log flooding
This protects against malicious input that could inject fake log entries or break log parsers

Privacy Considerations:

Access logs contain IP addresses and hostnames which may be considered sensitive:

IP addresses: May be PII under GDPR and similar regulations
Hostnames: Could reveal internal infrastructure naming

Consider your log retention policies and access controls accordingly. For environments with strict privacy requirements, configure log aggregation to filter or redact these fields.

Querying Logs:

Example queries for common log aggregation tools:

# Find all failed requests
grep 'status=error' /var/log/router-hosts.log

# Find operations on a specific host
grep 'hostname=myserver.local' /var/log/router-hosts.log

# Find slow requests (>=100ms, i.e. a duration of three or more digits)
grep -E 'duration_ms=[0-9]{3,}' /var/log/router-hosts.log

# Find search operations
grep 'method=SearchHosts' /var/log/router-hosts.log
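
Beyond single-pattern searches, the key=value layout lends itself to simple aggregation. The sketch below summarizes requests per method; summarize_access_log is a hypothetical helper, and it assumes that field values contain no spaces and that every access-log line carries a method field, as in the example output.

```shell
# summarize_access_log
# Reads access-log lines on stdin and prints, per gRPC method, the request
# count, the number of errors, and the mean duration in milliseconds.
summarize_access_log() {
    awk '
    {
        method = ""; status = ""; dur = 0
        # Scan every whitespace-separated token for key=value pairs.
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq == 0) continue
            key = substr($i, 1, eq - 1)
            val = substr($i, eq + 1)
            if (key == "method") method = val
            else if (key == "status") status = val
            else if (key == "duration_ms") dur = val + 0
        }
        if (method == "") next
        count[method]++
        total[method] += dur
        if (status == "error") errors[method]++
    }
    END {
        for (m in count)
            printf "%s count=%d errors=%d avg_ms=%.1f\n", m, count[m], errors[m], total[m] / count[m]
    }' | sort
}
```

It can be run as summarize_access_log < /var/log/router-hosts.log to get one summary line per method.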

Log Filtering/Redaction:

For environments requiring PII redaction, configure your log aggregator to filter sensitive fields:

# Vector (vector.dev) example
transforms:
  redact_pii:
    type: remap
    inputs: ["router_hosts_logs"]
    source: |
      .ip = "REDACTED"
      .hostname = redact(.hostname, filters: [r'\.[a-z]+$'])

# Fluentd example
<filter router-hosts.**>
  @type record_transformer
  enable_ruby true
  <record>
    ip ${record["ip"] ? "REDACTED" : nil}
  </record>
</filter>

Note: The logged IP addresses and hostnames are the host entry values being managed, not client connection IPs. Client authentication is via mTLS certificates and is not logged in access logs.

Monitoring

Health Checks

The server exposes health check RPCs within HostsService for monitoring and orchestration systems.

Available Health RPCs:

RPC	Purpose	Checks
`Liveness`	Process alive check	Returns immediately (no I/O)
`Readiness`	Ready to serve	Verifies database connectivity
`Health`	Detailed status	Server, database, ACME, hooks

Using grpcurl for health checks:

# Check readiness (verifies database)
grpcurl -cacert /path/to/ca.crt \
  -cert /path/to/client.crt \
  -key /path/to/client.key \
  localhost:50051 router_hosts.v1.HostsService/Readiness

# Get detailed health status
grpcurl -cacert /path/to/ca.crt \
  -cert /path/to/client.crt \
  -key /path/to/client.key \
  localhost:50051 router_hosts.v1.HostsService/Health

Health Response Fields:

Field	Description
`healthy`	Overall health status
`server_status`	gRPC server status
`database_status`	Storage backend connectivity
`acme_status`	ACME certificate manager status (if configured)
`hooks`	Individual hook health status

The Readiness RPC is suitable for Kubernetes readiness probes as it verifies the server can process requests (database is accessible).
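
Outside Kubernetes, for example in a deployment script, the same RPC can be polled until the server is ready. The helper below is a generic retry sketch: wait_until_ready is a hypothetical name, and it assumes the check command exits non-zero while the server is not ready (grpcurl exits non-zero when an RPC returns an error status; whether Readiness signals "not ready" that way is not stated in this guide).

```shell
# wait_until_ready ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on success, 1 if every attempt failed.
wait_until_ready() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt follows.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A call such as wait_until_ready 10 2 grpcurl -cacert /path/to/ca.crt -cert /path/to/client.crt -key /path/to/client.key localhost:50051 router_hosts.v1.HostsService/Readiness would try for roughly 20 seconds before giving up.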

Operator Health Endpoints

The Kubernetes operator exposes HTTP health endpoints (separate from the server's gRPC health RPCs):

Endpoint	Purpose	Behavior
`/healthz`	Liveness	Returns 200 if process is alive
`/readyz`	Readiness	Returns 200 if startup complete AND router-hosts server reachable

See Operator Documentation for details on probe configuration.

Metrics and Tracing (OpenTelemetry)

All metrics and traces are exported via OpenTelemetry (OTLP/gRPC) to a collector of your choice.

Configuration

Metrics and tracing are opt-in. Add a [metrics.otel] section to enable:

[metrics.otel]
endpoint = "http://otel-collector:4317"
service_name = "router-hosts"     # Optional, defaults to "router-hosts"
export_metrics = true             # Optional, defaults to true
export_traces = true              # Optional, defaults to true
export_interval_secs = 60         # Optional, defaults to 60 seconds
# headers = { "Authorization" = "Bearer token" }  # Optional

The export_interval_secs option controls how frequently metrics are pushed to the OTEL collector. Lower values increase metric freshness but add collector overhead. The default of 60 seconds balances freshness with efficiency for most deployments.

Available Metrics

All metrics recorded via the OpenTelemetry Go SDK are exported to the OTEL collector:

Metric	Type	Labels	Description
`router_hosts_requests_total`	Counter	`method`, `status`	Total gRPC requests
`router_hosts_request_duration_seconds`	Histogram	`method`	Request latency
`router_hosts_storage_operations_total`	Counter	`operation`, `status`	DB operations count
`router_hosts_storage_duration_seconds`	Histogram	`operation`	DB operation latency
`router_hosts_hook_executions_total`	Counter	`name`, `type`, `status`	Hook execution count
`router_hosts_hook_duration_seconds`	Histogram	`name`, `type`	Hook execution time
`router_hosts_hosts_entries`	Gauge	-	Current host entry count

Prometheus Scraping via OTEL Collector

If you need Prometheus-style /metrics scraping, configure your OTEL collector to expose a Prometheus endpoint:

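# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"   # matches the port used in [metrics.otel]

exporters:
  prometheus:
    endpoint: "0.0.0.0:9090"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]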

Then configure your Prometheus to scrape the collector:

# prometheus.yml
scrape_configs:
  - job_name: 'router-hosts'
    static_configs:
      - targets: ['otel-collector:9090']

Trace Context Propagation

Incoming gRPC requests with W3C Trace Context headers (traceparent, tracestate) are automatically linked to distributed traces.
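
To see propagation in action, a request can be sent with a hand-made traceparent header. The sketch below generates one in the W3C format (version 00, a 16-byte trace ID, an 8-byte parent span ID, and the sampled flag 01); random_hex and new_traceparent are hypothetical helper names.

```shell
# random_hex N
# Prints N random bytes as 2*N lowercase hex characters.
random_hex() {
    od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# new_traceparent
# Prints a W3C traceparent value with fresh random IDs and the sampled flag.
new_traceparent() {
    printf '00-%s-%s-01\n' "$(random_hex 16)" "$(random_hex 8)"
}

new_traceparent
```

The value can be attached with grpcurl's -H flag, for example -H "traceparent: $(new_traceparent)", and the trace ID in the middle can then be searched for in the tracing backend.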

Graceful Degradation

No [metrics.otel] config → no OTEL layers, zero overhead
export_traces = false → trace exporter disabled (still logs to console)
export_metrics = false → metrics export disabled
Collector unavailable at runtime → OpenTelemetry SDK handles retry/backoff internally

Note: Invalid configuration (malformed endpoint, invalid headers) will cause server startup to fail. Verify your OTEL collector is reachable before deploying.

Collector Retry Behavior

When the OTEL collector becomes unavailable at runtime:

Traces: The batch exporter retries with exponential backoff (5s initial, 30s max). Failed spans are dropped after retry exhaustion to prevent memory growth.
Metrics: The periodic reader retries on each export cycle (every export_interval_secs, 60 seconds by default). Metrics are aggregated in memory, and the latest values are sent when connectivity resumes.
Logging: Failed exports log at warn level. Enable LOG_LEVEL=debug for detailed retry diagnostics.

This design ensures the server never blocks on telemetry failures—observability is best-effort.
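
To make the trace retry schedule concrete, the sketch below computes the delay before each retry. The 5-second initial delay and 30-second cap come from the description above; the doubling factor is an assumption, and the SDK may also add jitter. backoff_delay is a hypothetical helper, not part of router-hosts.

```shell
# backoff_delay ATTEMPT [INITIAL] [MAX]
# Prints the delay in seconds before retry number ATTEMPT (0-based),
# doubling from INITIAL and never exceeding MAX.
backoff_delay() {
    attempt=$1
    delay=${2:-5}
    max=${3:-30}
    n=0
    while [ "$n" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
        delay=$((delay * 2))
        n=$((n + 1))
    done
    if [ "$delay" -gt "$max" ]; then
        delay=$max
    fi
    printf '%s\n' "$delay"
}

# Prints 5, 10, 20, 30, 30 (one value per line).
for n in 0 1 2 3 4; do
    backoff_delay "$n"
done
```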

Kubernetes Collector Sidecar

Example collector sidecar configuration:

containers:
  - name: otel-collector
    image: otel/opentelemetry-collector:latest
    ports:
      - containerPort: 4317
    volumeMounts:
      - name: otel-config
        mountPath: /etc/otelcol
volumes:
  - name: otel-config
    configMap:
      name: otel-collector-config

Troubleshooting OTEL

No traces/metrics appearing in collector:

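Verify connectivity: grpcurl -plaintext otel-collector:4317 list (if the collector does not expose gRPC reflection, the resulting reflection error still confirms that the port is reachable)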
Check server logs for OTEL.*initialized messages
Confirm export_traces and export_metrics are true (or omitted for defaults)
Ensure collector is configured to receive OTLP/gRPC on port 4317

Server fails to start with OTEL errors:

Validate endpoint URL format: http://host:port (no trailing slash)
Check header syntax in config: headers = { "Key" = "Value" }
Ensure collector is reachable from server network

High memory usage with OTEL enabled:

Verify collector is healthy—backpressure from failing exports can buffer spans
Consider reducing trace sampling in high-throughput scenarios
Check for circular trace propagation in service mesh configurations

Debug logging:

LOG_LEVEL=debug router-hosts serve

Backup and Recovery

Automatic Snapshots

The server creates snapshots before destructive operations:

Before import (replaces all hosts)
Before rollback (creates backup of current state)

Manual Backup

# Export current state
router-hosts host export --export-format json > backup.json

# List available snapshots
router-hosts snapshot list

Recovery

# Rollback to previous snapshot
router-hosts snapshot rollback <id>

# Import from backup
router-hosts host import backup.json --conflict-mode replace

Retention Policy

Configure snapshot retention in server config:

[retention]
max_snapshots = 50      # Keep at most 50 snapshots
max_age_days = 30       # Delete snapshots older than 30 days
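
To estimate the effect of these two limits before changing them, the selection rule can be modelled offline. This sketch assumes a snapshot is pruned when it exceeds either limit; how the server actually combines them is not stated in this guide. snapshots_to_prune is a hypothetical helper, and its input format is invented for the example.

```shell
# snapshots_to_prune MAX_SNAPSHOTS MAX_AGE_DAYS
# stdin: one snapshot per line as "<id> <age_in_days>", newest first.
# Prints the IDs that fall outside either limit.
snapshots_to_prune() {
    awk -v max_n="$1" -v max_age="$2" \
        'NR > max_n || $2 > max_age { print $1 }'
}

# With the limits from the example config, only the 45-day-old snapshot
# is selected.
printf '%s\n' 'snap-c 1' 'snap-b 12' 'snap-a 45' | snapshots_to_prune 50 30
```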

Security Considerations

File Permissions

File	Recommended Permissions
Server certificate	0644
Server private key	0600
CA certificate	0644
Database file	0600
Hosts file	0644
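
These recommendations can be verified with a small script. The sketch below compares a file's mode with an expected value; file_mode and check_mode are hypothetical helpers, and the fallback between GNU stat (-c) and BSD/macOS stat (-f) is an assumption about the platforms in use. Modes are given without the leading zero, for example 600.

```shell
# file_mode PATH
# Prints the octal permission bits of PATH (e.g. 600).
file_mode() {
    stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null
}

# check_mode PATH EXPECTED
# Returns 0 if PATH has mode EXPECTED, 1 on mismatch, 2 if PATH is unreadable.
check_mode() {
    actual=$(file_mode "$1") || return 2
    if [ "$actual" != "$2" ]; then
        printf '%s: mode %s, expected %s\n' "$1" "$actual" "$2" >&2
        return 1
    fi
}
```

For example, check_mode /etc/router-hosts/server.crt 644 checks the server certificate; the private key and database paths depend on the local configuration.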

Network Security

Server binds to configured address only
TLS 1.2+ required (Go crypto/tls defaults)
Client certificates required for all connections
No anonymous or insecure connections allowed

Audit Trail

All operations are logged with:

Client certificate subject (who)
Operation type and parameters (what)
Timestamp (when)
Success/failure status (outcome)