diff --git a/README.md b/README.md
index f64506a..325dace 100644
--- a/README.md
+++ b/README.md
@@ -212,7 +212,8 @@ While not final, some of the metrics collected are as follows:
- `web_path_overflow_total` - Paths rejected due to cardinality limit
- `web_referrer_overflow_total` - Referrers rejected due to limit
- `web_event_overflow_total` - Custom events rejected due to limit
-- `web_blocked_requests_total{reason}` - File server requests blocked by security filters
+- `web_blocked_requests_total{reason}` - File server requests blocked by
+  security filters

**Process metrics:**

diff --git a/docs/observability.md b/docs/observability.md
new file mode 100644
index 0000000..1f3b9dd
--- /dev/null
+++ b/docs/observability.md
@@ -0,0 +1,300 @@
# Observability Setup

Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
time-series database to scrape and store these metrics, then visualize them in
Grafana.

> [!IMPORTANT]
>
> **Why you need Prometheus:**
>
> - Watchdog exposes _current state_ (counters, gauges)
> - Prometheus _scrapes periodically_ and _stores time-series data_
> - Grafana _visualizes_ the historical data from Prometheus
> - Grafana cannot scrape Prometheus-format `/metrics` endpoints directly

## Prometheus Setup

### Configuring Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

    # Optional: scrape multiple Watchdog instances
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       env: 'production'

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

### Verifying Prometheus health

```bash
# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check it's scraping Watchdog
curl 
http://localhost:9090/api/v1/targets
```

### NixOS

Add to your NixOS configuration:

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}
```

For multiple Watchdog instances:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}
```

## Grafana Setup

### NixOS

```nix
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}
```

### Configure Data Source (Manual)

If you're not using NixOS for provisioning, you'll need to add the data source
_imperatively_ through the Grafana admin panel: navigate to `Configuration`,
choose "Add data source" under `Data Sources`, select Prometheus, point it at
your Prometheus URL, and save.

### Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and
multi-site configurations. Import it, configure the data source, and it should
work out of the box.

If you're not using NixOS for provisioning, the dashboard _also_ needs to be
provisioned manually. Under `Dashboards`, select `Import` and paste the JSON
contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
Select your Prometheus data source and import it. 
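If you'd rather not click through the import dialog, Grafana can also load dashboards from disk via a provisioning file. A sketch, assuming Grafana's default provisioning layout (the provider name and dashboard directory are ours; drop `contrib/grafana/watchdog.json` into the `path` directory):

```yaml
# /etc/grafana/provisioning/dashboards/watchdog.yaml
apiVersion: 1

providers:
  - name: watchdog
    type: file
    options:
      # Grafana loads every *.json dashboard found here
      path: /var/lib/grafana/dashboards
```

Grafana reads provisioning files at startup, so restart the service after adding one.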

See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
documentation.

## Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can build
your own panels and queries. Here are some examples written in PromQL, the
Prometheus query language. They won't cover everything you need, but they
should give you a solid starting point for your own setup.

If you build panels or queries you'd like to contribute back, feel free!

### Top 10 Pages by Traffic

```promql
topk(10, sum by (path) (rate(web_pageviews_total[5m])))
```

### Mobile vs Desktop Split

```promql
sum by (device) (rate(web_pageviews_total[1h]))
```

### Unique Visitors

```promql
web_daily_unique_visitors
```

### Top Referrers

```promql
topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
```

### Multi-Site: Traffic per Domain

```promql
sum by (domain) (rate(web_pageviews_total[1h]))
```

### Cardinality Health

```promql
# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
```

## Horizontal Scaling Considerations

When running multiple Watchdog instances:

1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
2. **Prometheus aggregates automatically** - use `sum()` in queries to aggregate
   across instances
3. **No shared state needed** - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial
as long as you have the necessary infrastructure and, well, the patience. 
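For dashboards that always aggregate across instances, a Prometheus recording rule can precompute the sum once per evaluation instead of on every panel refresh. A sketch, assuming a rules file wired into `rule_files` in `prometheus.yml` (the rule name is our convention, not something Watchdog ships):

```yaml
# /etc/prometheus/rules/watchdog.yml
groups:
  - name: watchdog_aggregates
    rules:
      # Total pageview rate summed over all Watchdog instances
      - record: job:web_pageviews:rate5m
        expr: sum(rate(web_pageviews_total[5m]))
```

Panels can then query `job:web_pageviews:rate5m` directly.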

Example with 3 instances:

```promql
# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
```

## Alternatives to Prometheus

### VictoriaMetrics

Drop-in Prometheus replacement with better performance and compression:

```nix
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "12month";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}
```

### Grafana Agent

Lightweight alternative that scrapes and forwards to Grafana Cloud or a local
Prometheus:

```bash
# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent
```

```yaml
# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        # Note: Prometheus only accepts incoming remote writes when started
        # with --web.enable-remote-write-receiver
        - url: http://localhost:9090/api/v1/write
```

## Monitoring the Monitoring

Monitor Prometheus itself:

```promql
# Scrape success per target (1 = up, 0 = down)
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})
```

## Additional Recommendations

1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
   disk space
2. **Backups**: Back up `/var/lib/prometheus` periodically (or whatever your
   state directory is)
3. **Alerting**: Configure Prometheus alerting rules for critical metrics
4. **High Availability**: Run multiple Prometheus instances with identical
   configs
5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
   VictoriaMetrics
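The alerting recommendation can be sketched as a Prometheus rules file. The alert names and thresholds below are illustrative, not something Watchdog ships; they cover the two failure modes this document cares about, cardinality overflow and a dead scrape target:

```yaml
# /etc/prometheus/rules/watchdog-alerts.yml
groups:
  - name: watchdog_alerts
    rules:
      - alert: WatchdogCardinalityOverflow
        # Sustained growth in the overflow counter means paths are being
        # dropped and dashboards are undercounting
        expr: rate(web_path_overflow_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Watchdog is rejecting paths due to the cardinality limit"

      - alert: WatchdogDown
        expr: up{job="watchdog"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Watchdog target {{ $labels.instance }} is not being scraped"
```

Load it via `rule_files` in `prometheus.yml`, and pair it with Alertmanager if you want notifications rather than just a firing state in the UI.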