NotAShelf df06ed38bf

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964

2026-03-02 22:38:33 +03:00

Observability Setup

Watchdog exposes Prometheus-formatted metrics at /metrics. You need a time-series database to scrape and store these metrics, then visualize them in Grafana.

Important

Why you need Prometheus:

Watchdog exposes current state (counters, gauges)

Prometheus scrapes periodically and stores time-series data

Grafana visualizes the historical data from Prometheus

Grafana cannot directly scrape Prometheus /metrics endpoints
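To make the "current state" point concrete, here is a minimal Python sketch of what a single scrape sees. The sample text is hypothetical: the metric names come from the queries later in this document, while the HELP text, label values, and numbers are invented. The parser handles only simple lines, not the full exposition format.

```python
import re

# Hypothetical /metrics response; HELP text, label values and numbers are invented.
SAMPLE = '''\
# HELP web_pageviews_total Total pageviews
# TYPE web_pageviews_total counter
web_pageviews_total{path="/",device="desktop"} 1027
web_pageviews_total{path="/blog",device="mobile"} 311
web_daily_unique_visitors 42
'''

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition lines into (name, labels, value) tuples.

    Simplified: ignores timestamps and escaped quotes in label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        match = LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each scrape yields only this snapshot. The history that functions such as rate() need exists only once Prometheus has stored several snapshots over time.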

Prometheus Setup

Configuring Prometheus

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

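    # Optional: scrape multiple Watchdog instances
    # (use a custom label such as env; overriding the built-in instance
    # label would give every target the same identity)
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       env: 'production'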

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Verify Prometheus' health state

# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check it's scraping Watchdog
curl http://localhost:9090/api/v1/targets
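The targets endpoint returns a fairly large JSON document. As a rough sketch, the Python below reduces it to one line per Watchdog target. The `EXAMPLE` payload is a trimmed, hypothetical response so the snippet runs without a server; `fetch_targets` assumes Prometheus listens on localhost:9090 as configured above.

```python
import json
import urllib.request


def summarize_targets(payload, job="watchdog"):
    """Return (scrape_url, health, last_error) for each active target of a job."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target["labels"].get("job") != job:
            continue
        rows.append((target["scrapeUrl"], target["health"], target.get("lastError", "")))
    return rows


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets listing from a running Prometheus (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as response:
        return json.load(response)


# Trimmed, hypothetical response; a real one carries more fields per target.
EXAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "watchdog", "instance": "localhost:8080"},
                "scrapeUrl": "http://localhost:8080/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ],
        "droppedTargets": [],
    },
}

for row in summarize_targets(EXAMPLE):
    print(row)
```

Against a live server, call `summarize_targets(fetch_targets())`. A health value other than "up" together with a non-empty last error usually points at a wrong address or a closed port.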

NixOS

Add to your NixOS configuration:

{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}

For multiple Watchdog instances:

{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}

Grafana Setup

NixOS

{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}

Configure Data Source (Manual)

If you're not provisioning Grafana through NixOS, you'll need to add the data source manually. In the admin panel, navigate to Configuration, open Data Sources, and choose "Add data source". Select Prometheus as the type, enter the URL of your Prometheus instance, and save it.

Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and multi-site configurations. Import it, configure the data source and it should work out of the box.

If you're not using NixOS for provisioning, the dashboard also needs to be provisioned manually. Under Dashboards, select Import and provide the JSON contents or upload the sample dashboard from contrib/grafana/watchdog.json. Select your Prometheus data source and import it.

See contrib/grafana/README.md for full documentation.

Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can build your own panels and queries. Below are some example queries written in PromQL, the Prometheus query language. They are starting points and might not cover everything you need, so adapt them to your setup.

If you believe you have some valuable widgets that you'd like to contribute back, feel free!

Top 10 Pages by Traffic

topk(10, sum by (path) (rate(web_pageviews_total[5m])))
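For readers new to PromQL, the following Python sketch approximates what `rate()` and `topk()` do in the query above. It is a simplification under stated assumptions: one series per path, no extrapolation to the window boundaries (which Prometheus does perform), and invented sample data.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window of (timestamp, value) samples.

    Simplified: a drop in value is treated as a counter reset, and the
    result is not extrapolated to the window edges as Prometheus would do.
    """
    if len(samples) < 2:
        return None  # like rate(), at least two samples are needed
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts at zero, so the new value is the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def topk(k, rates):
    """Return the k (key, rate) pairs with the highest rate."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical counter samples per path: (unix seconds, counter value).
WINDOWS = {
    "/": [(0, 100), (60, 160), (120, 220)],
    "/blog": [(0, 50), (60, 80), (120, 20)],  # counter reset before the last sample
    "/about": [(0, 10), (60, 10), (120, 16)],
}

rates = {path: simple_rate(samples) for path, samples in WINDOWS.items()}
for path, value in topk(2, rates):
    print(path, value)
```

The reset handling is why restarting a Watchdog instance does not produce a negative spike in the graph.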

Mobile vs Desktop Split

sum by (device) (rate(web_pageviews_total[1h]))

Unique Visitors

web_daily_unique_visitors

Top Referrers

topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))

Multi-Site: Traffic per Domain

sum by (domain) (rate(web_pageviews_total[1h]))

Cardinality Health

# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
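The queries in this section can also be run outside Grafana through the Prometheus HTTP API, which is handy for quick checks or scripts. A minimal sketch, assuming Prometheus is reachable at localhost:9090 as in the setup above; only the URL construction runs without a server.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(expr, base_url="http://localhost:9090"):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr, base_url="http://localhost:9090"):
    """Execute the query and return the result list (needs a running Prometheus)."""
    with urllib.request.urlopen(instant_query_url(expr, base_url)) as response:
        body = json.load(response)
    return body["data"]["result"]


OVERFLOW_QUERIES = [
    "rate(web_path_overflow_total[5m])",
    "rate(web_referrer_overflow_total[5m])",
    "rate(web_event_overflow_total[5m])",
]

for expr in OVERFLOW_QUERIES:
    print(instant_query_url(expr))
```

With a server running, `run_query(OVERFLOW_QUERIES[0])` returns one entry per matching series, each carrying its labels and the latest value.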

Horizontal Scaling Considerations

When running multiple Watchdog instances:

Each instance exposes its own metrics - Prometheus scrapes all instances and stores them as separate series
Aggregation happens at query time - use sum() in queries to combine series across instances
No shared state needed - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial as long as you have the necessary infrastructure and, well, the patience. Example with 3 instances:

# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
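As an illustration of what the two queries above compute, here is a small Python sketch of label-based aggregation. The series values are invented, and the instance names follow the multi-instance scrape config shown earlier.

```python
from collections import defaultdict

# Hypothetical per-series rates, as rate(web_pageviews_total[5m]) might return them.
SERIES = [
    ({"instance": "watchdog-1:8080", "path": "/"}, 1.5),
    ({"instance": "watchdog-2:8080", "path": "/"}, 0.5),
    ({"instance": "watchdog-3:8080", "path": "/"}, 1.0),
    ({"instance": "watchdog-1:8080", "path": "/blog"}, 0.25),
]


def sum_by(series, *label_names):
    """Group series by the given labels and add up their values, like `sum by (...)`.

    With no label names, everything collapses into a single total, like `sum(...)`.
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


print(sum_by(SERIES))              # one total across all instances
print(sum_by(SERIES, "instance"))  # one total per instance
```

Because the grouping happens per query, the same stored data answers both the fleet-wide and the per-instance question without any coordination between Watchdog instances.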

Alternatives to Prometheus

VictoriaMetrics

Prometheus-compatible storage with better performance and compression. The example below keeps Prometheus as the scraper and remote-writes the samples to VictoriaMetrics:

{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "1y";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}

Grafana Agent

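Lightweight alternative that scrapes Watchdog and forwards the samples to Grafana Cloud or a local Prometheus. When forwarding to a local Prometheus, that instance must be started with --web.enable-remote-write-receiver, otherwise it will not accept writes on /api/v1/write: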

# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent

# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        - url: http://localhost:9090/api/v1/write

Monitoring the Monitoring

Check that Prometheus is scraping Watchdog successfully:

# Scrape status (1 = last scrape succeeded, 0 = it failed)
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})

Additional Recommendations

Retention: Set --storage.tsdb.retention.time=30d or longer based on disk space
Backups: Back up /var/lib/prometheus periodically (or whatever your state directory is)
Alerting: Configure Prometheus alerting rules for critical metrics
High Availability: Run multiple Prometheus instances with identical configs
Remote Storage: For long-term storage, use Thanos, Cortex, or VictoriaMetrics
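To pick a retention period that fits your disk, the Prometheus documentation suggests a rough estimate: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies that formula; the series count is a hypothetical input, so substitute the number your own setup reports.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample.

    Samples per second is approximated as active_series / scrape_interval_s.
    """
    retention_seconds = retention_days * 86_400
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s


# Hypothetical setup: 5000 active series scraped every 15s, kept for 30 days.
size = estimate_disk_bytes(retention_days=30, active_series=5000, scrape_interval_s=15)
print(f"{size / 1e9:.2f} GB")
```

Treat the result as an order-of-magnitude figure; high label cardinality or frequent series churn can push real usage well above it.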
