docs: provide observability stack guide

Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964
2026-06-11 15:43:23 +00:00 · 2026-03-02 22:35:32 +03:00 · 2026-03-02 22:35:32 +03:00 · df06ed38bf
commit df06ed38bf
parent 13343ef2bd
2 changed files with 302 additions and 1 deletions
--- a/docs/observability.md
+++ b/docs/observability.md
@@ -0,0 +1,300 @@
+# Observability Setup
+
+Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
+time-series database to scrape and store these metrics, then visualize them in
+Grafana.
+
+> [!IMPORTANT]
+>
+> **Why you need Prometheus:**
+>
+> - Watchdog exposes _current state_ (counters, gauges)
+> - Prometheus _scrapes periodically_ and _stores time-series data_
+> - Grafana _visualizes_ the historical data from Prometheus
+> - Grafana cannot directly scrape Prometheus `/metrics` endpoints
+
+## Prometheus Setup
+
+### Configuring Prometheus
+
+Create `/etc/prometheus/prometheus.yml`:
+
+```yaml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: "watchdog"
+    static_configs:
+      - targets: ["localhost:8080"]
+
+    # Optional: to scrape multiple Watchdog instances, replace the
+    # static_configs above with:
+    # static_configs:
+    #   - targets:
+    #       - 'watchdog-1.example.com:8080'
+    #       - 'watchdog-2.example.com:8080'
+    #     labels:
+    #       env: 'production'
+
+  # Scrape Prometheus itself
+  - job_name: "prometheus"
+    static_configs:
+      - targets: ["localhost:9090"]
+```
+
+### Verify Prometheus' health state
+
+```bash
+# Check Prometheus is running
+curl http://localhost:9090/-/healthy
+
+# Check it's scraping Watchdog
+curl http://localhost:9090/api/v1/targets
+```
+
+### NixOS
+
+Add to your NixOS configuration:
+
+```nix
+{
+  services.prometheus = {
+    enable = true;
+    port = 9090;
+
+    # Retention period
+    retentionTime = "30d";
+
+    scrapeConfigs = [
+      {
+        job_name = "watchdog";
+        static_configs = [{
+          targets = [ "localhost:8080" ];
+        }];
+      }
+    ];
+  };
+
+  # Open firewall if needed
+  # networking.firewall.allowedTCPPorts = [ 9090 ];
+}
+```
+
+For multiple Watchdog instances:
+
+```nix
+{
+  services.prometheus.scrapeConfigs = [
+    {
+      job_name = "watchdog";
+      static_configs = [
+        {
+          labels.env = "production";
+          targets = [
+            "watchdog-1:8080"
+            "watchdog-2:8080"
+            "watchdog-3:8080"
+          ];
+        }
+      ];
+    }
+  ];
+}
+```
+
+## Grafana Setup
+
+### NixOS
+
+```nix
+{
+  services.grafana = {
+    enable = true;
+    settings = {
+      server = {
+        http_addr = "127.0.0.1";
+        http_port = 3000;
+      };
+    };
+
+    provision = {
+      enable = true;
+
+      datasources.settings.datasources = [{
+        name = "Prometheus";
+        type = "prometheus";
+        url = "http://localhost:9090";
+        isDefault = true;
+      }];
+    };
+  };
+}
+```
+
+### Configure Data Source (Manual)
+
+If you're not using NixOS for provisioning, you'll need to add the data source
+_imperatively_ through Grafana itself. In the admin panel, navigate to
+`Configuration`, open `Data Sources`, and choose "Add data source". Select
+Prometheus as the type, enter the URL of your Prometheus instance (for example
+`http://localhost:9090`), and save it.
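+
+For a file-based setup without NixOS, Grafana can also read data sources from
+its provisioning directory. A minimal sketch, assuming the default
+`/etc/grafana/provisioning/datasources/` location (the file name
+`prometheus.yaml` is arbitrary) and a Prometheus listening on
+`http://localhost:9090`:
+
+```yaml
+# /etc/grafana/provisioning/datasources/prometheus.yaml (path is an assumption)
+apiVersion: 1
+
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy          # Grafana's backend performs the queries
+    url: http://localhost:9090
+    isDefault: true
+```
+
+Grafana reads this directory at startup, so restart the service after adding
+the file.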
+
+### Import Pre-built Dashboard
+
+A sample Grafana dashboard is provided with support for multi-host and
+multi-site configurations. Import it, configure the data source and it should
+work out of the box.
+
+If you're not using NixOS for provisioning, the dashboard _also_ needs to be
+provisioned manually. Under `Dashboards`, select `Import` and provide the JSON
+contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
+Select your Prometheus data source and import it.
+
+See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
+documentation.
+
+## Example Queries
+
+Once Prometheus is scraping Watchdog and Grafana is connected, you can build
+your own panels and queries. The examples below use PromQL, the Prometheus
+query language. They are starting points and will not cover every need, so
+adapt them to your setup.
+
+If you have widgets that you'd like to contribute back, feel free!
+
+### Top 10 Pages by Traffic
+
+```promql
+topk(10, sum by (path) (rate(web_pageviews_total[5m])))
+```
+
+### Mobile vs Desktop Split
+
+```promql
+sum by (device) (rate(web_pageviews_total[1h]))
+```
+
+### Unique Visitors
+
+```promql
+web_daily_unique_visitors
+```
+
+### Top Referrers
+
+```promql
+topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
+```
+
+### Multi-Site: Traffic per Domain
+
+```promql
+sum by (domain) (rate(web_pageviews_total[1h]))
+```
+
+### Cardinality Health
+
+```promql
+# Should be near zero
+rate(web_path_overflow_total[5m])
+rate(web_referrer_overflow_total[5m])
+rate(web_event_overflow_total[5m])
+```
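+
+Because these counters should stay near zero, sustained growth is worth an
+alert. A minimal sketch of a Prometheus rule file for that; the group name,
+alert name, `for` duration, and severity label are illustrative choices, not
+part of Watchdog:
+
+```yaml
+groups:
+  - name: watchdog-cardinality        # hypothetical group name
+    rules:
+      - alert: WatchdogPathOverflow   # hypothetical alert name
+        expr: rate(web_path_overflow_total[5m]) > 0
+        for: 15m                      # only fire if growth persists
+        labels:
+          severity: warning
+        annotations:
+          summary: "web_path_overflow_total is increasing on {{ $labels.instance }}"
+```
+
+The same pattern applies to `web_referrer_overflow_total` and
+`web_event_overflow_total`.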
+
+## Horizontal Scaling Considerations
+
+When running multiple Watchdog instances:
+
+1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
+2. **Aggregation happens at query time** - Prometheus stores each instance's
+   series separately; use `sum()` in queries to combine them
+3. **No shared state needed** - each Watchdog instance is independent
+
+Watchdog is almost entirely stateless, so horizontal scaling should be trivial
+as long as you have the necessary infrastructure and, well, the patience.
+Example with 3 instances:
+
+```promql
+# Total pageviews across all instances
+sum(rate(web_pageviews_total[5m]))
+
+# Per-instance breakdown
+sum by (instance) (rate(web_pageviews_total[5m]))
+```
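+
+If dashboards run these aggregations often, a recording rule can precompute
+them. A sketch, assuming a rule file that Prometheus loads; the group and
+record names are hypothetical and follow the common `level:metric:operations`
+naming convention:
+
+```yaml
+groups:
+  - name: watchdog-aggregates         # hypothetical group name
+    rules:
+      # Total pageview rate across all Watchdog instances
+      - record: job:web_pageviews:rate5m
+        expr: sum by (job) (rate(web_pageviews_total[5m]))
+      # Per-instance breakdown
+      - record: instance:web_pageviews:rate5m
+        expr: sum by (instance) (rate(web_pageviews_total[5m]))
+```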
+
+## Alternatives to Prometheus
+
+### VictoriaMetrics
+
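+Can act as a drop-in Prometheus replacement and advertises better performance
+and compression. The configuration below keeps Prometheus as the scraper and
+forwards samples to VictoriaMetrics via remote write: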
+
+```nix
+{
+  services.victoriametrics = {
+    enable = true;
+    listenAddress = ":8428";
+    # Suffixes: h, d, w, y; a bare number means months
+    retentionPeriod = "1y";
+  };
+
+  # Configure Prometheus to remote-write to VictoriaMetrics
+  services.prometheus = {
+    enable = true;
+    remoteWrite = [{
+      url = "http://localhost:8428/api/v1/write";
+    }];
+  };
+}
+```
+
+### Grafana Agent
+
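+Lightweight alternative that scrapes and forwards to Grafana Cloud or a local
+Prometheus. A local Prometheus must be started with
+`--web.enable-remote-write-receiver` to accept the forwarded samples: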
+
+```bash
+# Systemd setup for Grafana Agent
+sudo systemctl enable --now grafana-agent
+```
+
+```yaml
+# /etc/grafana-agent.yaml
+metrics:
+  wal_directory: /var/lib/grafana-agent
+  configs:
+    - name: watchdog
+      scrape_configs:
+        - job_name: watchdog
+          static_configs:
+            - targets: ["localhost:8080"]
+      remote_write:
+        - url: http://localhost:9090/api/v1/write
+```
+
+## Monitoring the Monitoring
+
+Monitor how Prometheus scrapes Watchdog:
+
+```promql
+# Scrape health: 1 if the last scrape succeeded, 0 otherwise
+up{job="watchdog"}
+
+# Scrape duration
+scrape_duration_seconds{job="watchdog"}
+
+# Time since last scrape
+time() - timestamp(up{job="watchdog"})
+```
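+
+The `up` series is also a natural basis for an availability alert. A minimal
+sketch of a rule file; the group name, alert name, `for` duration, and severity
+are assumptions to adjust:
+
+```yaml
+groups:
+  - name: watchdog-availability       # hypothetical group name
+    rules:
+      - alert: WatchdogDown           # hypothetical alert name
+        expr: up{job="watchdog"} == 0
+        for: 5m                       # tolerate brief restarts
+        labels:
+          severity: critical
+        annotations:
+          summary: "Prometheus cannot scrape Watchdog on {{ $labels.instance }}"
+```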
+
+## Additional Recommendations
+
+1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
+   disk space
+2. **Backups**: Back up `/var/lib/prometheus` periodically (or whatever your
+   state directory is)
+3. **Alerting**: Configure Prometheus alerting rules for critical metrics
+4. **High Availability**: Run multiple Prometheus instances with identical
+   configs
+5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
+   VictoriaMetrics
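+
+For the alerting recommendation, Prometheus needs to know where rule files live
+and where to send alerts. A sketch of the relevant `prometheus.yml` sections,
+assuming rule files under `/etc/prometheus/rules/` and an Alertmanager on its
+default port 9093 (both are assumptions about your environment):
+
+```yaml
+# Load every rule file in this directory (path is an assumption)
+rule_files:
+  - /etc/prometheus/rules/*.yml
+
+# Send firing alerts to a local Alertmanager (address is an assumption)
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets: ["localhost:9093"]
+```
+
+Running `promtool check config /etc/prometheus/prometheus.yml` validates the
+result before a reload.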