docs: provide observability stack guide
Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964
parent 13343ef2bd
commit df06ed38bf
2 changed files with 302 additions and 1 deletions
@@ -212,7 +212,8 @@ While not final, some of the metrics collected are as follows:
 - `web_path_overflow_total` - Paths rejected due to cardinality limit
 - `web_referrer_overflow_total` - Referrers rejected due to limit
 - `web_event_overflow_total` - Custom events rejected due to limit
-- `web_blocked_requests_total{reason}` - File server requests blocked by security filters
+- `web_blocked_requests_total{reason}` - File server requests blocked by
+  security filters

 **Process metrics:**
docs/observability.md (new file, 300 lines)

@@ -0,0 +1,300 @@
# Observability Setup

Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
time-series database to scrape and store these metrics, then visualize them in
Grafana.

> [!IMPORTANT]
>
> **Why you need Prometheus:**
>
> - Watchdog exposes _current state_ (counters, gauges)
> - Prometheus _scrapes periodically_ and _stores time-series data_
> - Grafana _visualizes_ the historical data from Prometheus
> - Grafana cannot scrape `/metrics` endpoints directly
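To make the "current state" point concrete, here is a sketch of the Prometheus text exposition format served at `/metrics`. The metric names come from this guide, but the label sets and values below are made up for illustration:

```shell
# Hand-written sample of the Prometheus text format; values are fabricated.
cat <<'EOF' > /tmp/sample-metrics.txt
# HELP web_pageviews_total Total pageviews
# TYPE web_pageviews_total counter
web_pageviews_total{path="/",device="desktop"} 1234
web_path_overflow_total 0
EOF

# The format is line-oriented: '#' lines are metadata, everything else is
# one sample per line. Print just the metric names and label sets:
awk '!/^#/ { print $1 }' /tmp/sample-metrics.txt
```

Prometheus re-reads this entire payload on every scrape, which is why the history has to live in the time-series database rather than in Watchdog itself.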
## Prometheus Setup

### Configuring Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

    # Optional: scrape multiple Watchdog instances
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       instance: 'production'

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```
### Verify Prometheus' Health State

```bash
# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check that it's scraping Watchdog
curl http://localhost:9090/api/v1/targets
```
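The targets endpoint returns JSON; a healthy target reports `"health":"up"`. As a sketch (using a hand-abbreviated sample response, since a real one carries many more fields per target), you can check this without installing `jq`:

```shell
# Hand-written, abbreviated sample of /api/v1/targets output.
cat <<'EOF' > /tmp/targets.json
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"watchdog","instance":"localhost:8080"},"health":"up"}
]}}
EOF

# Count healthy targets in the response:
grep -c '"health":"up"' /tmp/targets.json
```

If the count is zero, check that Watchdog is actually listening on the target address you configured.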
### NixOS

Add to your NixOS configuration:

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}
```
For multiple Watchdog instances:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}
```
## Grafana Setup

### NixOS

```nix
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}
```
### Configure Data Source (Manual)

If you're not using NixOS for provisioning, you'll need to configure the data
source _imperatively_ through the Grafana admin panel. Navigate to
`Configuration`, choose "Add data source" under `Data Sources`, select your
Prometheus instance, and save it.
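If you'd rather script this step, Grafana also exposes an HTTP API for data sources. A sketch, assuming the ports used in this guide and default admin credentials (both are assumptions to adjust for your setup):

```shell
# Data source payload mirroring the NixOS provisioning block above.
payload='{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://localhost:9090",
  "access": "proxy",
  "isDefault": true
}'

# Sanity-check the JSON before sending it:
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then POST it to a running Grafana (credentials are an assumption):
# curl -X POST -H 'Content-Type: application/json' \
#   -d "$payload" http://admin:admin@localhost:3000/api/datasources
```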
### Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and
multi-site configurations. Import it, configure the data source, and it should
work out of the box.

If you're not using NixOS for provisioning, the dashboard _also_ needs to be
provisioned manually. Under `Dashboards`, select `Import` and provide the JSON
contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
Select your Prometheus data source and import it.

See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
documentation.
## Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can write
your own panels and queries. Here are some example queries written in PromQL,
the Prometheus query language. They might not cover everything you need, but
they make a useful starting point for your own setup.

If you have widgets you believe are valuable and would like to contribute
them back, feel free!
### Top 10 Pages by Traffic

```promql
topk(10, sum by (path) (rate(web_pageviews_total[5m])))
```

### Mobile vs Desktop Split

```promql
sum by (device) (rate(web_pageviews_total[1h]))
```

### Unique Visitors

```promql
web_daily_unique_visitors
```

### Top Referrers

```promql
topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
```

### Multi-Site: Traffic per Domain

```promql
sum by (domain) (rate(web_pageviews_total[1h]))
```

### Cardinality Health

```promql
# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
```
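The cardinality-health queries lend themselves to alerting. A minimal rule-file sketch (the alert name, the `10m` window, and the file path are assumptions, not shipped defaults); reference the file from `rule_files:` in `prometheus.yml`:

```shell
# Write a minimal alerting rule for the overflow counters (sketch only;
# the alert name and thresholds are assumptions).
cat <<'EOF' > /tmp/watchdog-rules.yml
groups:
  - name: watchdog
    rules:
      - alert: WatchdogPathOverflow
        expr: rate(web_path_overflow_total[5m]) > 0
        for: 10m
        annotations:
          summary: "Paths are being rejected by the cardinality limit"
EOF

# One rule defined:
grep -c 'alert:' /tmp/watchdog-rules.yml
```

Validate the rule file with `promtool check rules` before reloading Prometheus.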
## Horizontal Scaling Considerations

When running multiple Watchdog instances:

1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
2. **Prometheus aggregates automatically** - use `sum()` in queries to
   aggregate across instances
3. **No shared state needed** - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial
as long as you have the necessary infrastructure and, well, the patience.
Example with 3 instances:

```promql
# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
```
## Alternatives to Prometheus

### VictoriaMetrics

A drop-in Prometheus replacement with better performance and compression:

```nix
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "12month";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}
```
### Grafana Agent

A lightweight alternative that scrapes metrics and forwards them to Grafana
Cloud or a local Prometheus:

```bash
# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent
```

```yaml
# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        - url: http://localhost:9090/api/v1/write
```

Note that Prometheus only accepts remote writes when started with the
`--web.enable-remote-write-receiver` flag.
## Monitoring the Monitoring

Monitor Prometheus itself:

```promql
# Scrape success (1 = target up, 0 = down)
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})
```
## Additional Recommendations

1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
   disk space
2. **Backups**: Back up `/var/lib/prometheus` (or whatever your state
   directory is) periodically
3. **Alerting**: Configure Prometheus alerting rules for critical metrics
4. **High Availability**: Run multiple Prometheus instances with identical
   configs
5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
   VictoriaMetrics