meta: provide sample grafana dashboard; basic promql docs

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: Icb48454e2b0d37fea290c1681ccddcfe6a6a6964
2026-05-30 18:21:30 +00:00 · 2026-03-02 22:35:49 +03:00 · 2026-03-02 22:35:49 +03:00 · e7297bcc8d
commit e7297bcc8d
parent df06ed38bf
2 changed files with 1349 additions and 0 deletions
--- a/contrib/grafana/README.md
+++ b/contrib/grafana/README.md
@@ -0,0 +1,157 @@
+# Grafana Dashboard
+
+We provide a _sample_ Grafana dashboard for Watchdog that supports multi-host
+and multi-site deployments as far as is practical. Note, however, that it is
+intended as a _reference_ more than anything. Updates are not provided, and we
+recommend that you _write your own dashboard_ while using this one as a
+starting point.
+
+Nevertheless, here are some features provided by the sample dashboard at
+`watchdog.json` that you may be interested in:
+
+- **Multi-Instance Support**: Filter and aggregate across multiple Watchdog
+  instances
+- **Multi-Site Support**: Filter by domain for multi-site deployments
+- **Real-time Metrics**: Auto-refresh every 30 seconds
+- **Traffic Analysis**: Pageviews, unique visitors, device breakdown, geographic
+  distribution
+- **Top Content**: Top pages, referrers, custom events
+- **System Health**: Instance health, cardinality overflow monitoring, request
+  rates
+
+To import it, open "Dashboards" in your Grafana instance and click "Import".
+Upload the JSON file, select your Prometheus data source (this assumes
+Prometheus is already scraping Watchdog), and click "Import" again.
+
+## Dashboard Variables
+
+The dashboard includes three template variables for flexible filtering:
+
+### Data Source
+
+- **Variable**: `$datasource`
+- **Type**: Data source selector
+- **Default**: Prometheus
+- **Usage**: Select which Prometheus instance to query
+
+### Instance
+
+- **Variable**: `$instance`
+- **Type**: Multi-select query variable
+- **Default**: All instances
+- **Query**: `label_values(web_pageviews_total, instance)`
+- **Usage**: Filter by specific Watchdog instances (e.g., `watchdog-1:8080`,
+  `watchdog-2:8080`)
+
+### Domain
+
+- **Variable**: `$domain`
+- **Type**: Multi-select query variable
+- **Default**: All domains
+- **Query**: `label_values(web_pageviews_total{instance=~"$instance"}, domain)`
+- **Usage**: Filter by specific domains for multi-site analytics
+
+**Example filters:**
+
+- View all sites across all instances: Instance=All, Domain=All
+- View single site across all instances: Instance=All, Domain=example.com
+- View single instance, all sites: Instance=watchdog-1:8080, Domain=All
+- View single site on single instance: Instance=watchdog-1:8080,
+  Domain=example.com
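+
+As a rough illustration of what these selections do to a query: Grafana
+typically interpolates a multi-value variable into a Prometheus `=~` matcher as
+a regex alternation. The hostnames and domain below are the placeholder values
+from the examples above, the fixed `[5m]` window is an assumption, and the
+exact escaping Grafana applies may differ between versions:
+
+```promql
+# Instance = watchdog-1:8080 and watchdog-2:8080, Domain = example.com
+# (roughly what the $instance / $domain matchers expand to)
+sum(rate(web_pageviews_total{instance=~"(watchdog-1:8080|watchdog-2:8080)",domain=~"example.com"}[5m])) * 60
+```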
+
+## Dashboard Sections
+
+### Overview Row
+
+- **Unique Visitors (Today)**: Current HyperLogLog estimate across selected
+  instances/domains
+- **Pageviews/min**: Real-time pageview rate
+- **Total Pageviews**: Total pageviews in selected time range
+- **Cardinality Overflow/min**: Health indicator (should be ~0)
+- **Pageviews by Domain**: Time series showing traffic per domain
+- **Unique Visitors by Domain**: Time series showing unique visitors per domain
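+
+The two time-series panels can be approximated by grouping on `domain` instead
+of collapsing everything into one series. This is a sketch built from the
+metrics listed under "Metrics Reference", not necessarily the exact queries
+stored in `watchdog.json`:
+
+```promql
+# Pageviews per minute, one series per domain
+sum by (domain) (rate(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__rate_interval])) * 60
+
+# Unique visitors, one series per domain (summed across instances, so a
+# visitor seen by more than one instance is counted more than once)
+sum by (domain) (web_daily_unique_visitors{instance=~"$instance",domain=~"$domain"})
+```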
+
+### Traffic Analysis Row
+
+- **Device Breakdown**: Pie chart of mobile/tablet/desktop traffic
+- **Top 10 Countries**: Geographic distribution of traffic
+- **Top 20 Pages**: Most visited pages with heat map
+- **Top 15 Referrers**: Traffic sources (excludes direct traffic)
+- **Top 15 Custom Events**: Most triggered custom events
+
+### System Health Row
+
+- **Instance Health**: Uptime status for each Watchdog instance (1=up, 0=down)
+- **Cardinality Overflow**: Rate of rejected metrics due to cardinality limits
+  (should be near zero)
+- **Request Rate by Instance**: Request throughput per instance
+
+## Metrics Reference
+
+Metrics are aggregated using `sum()` across the selected instances, except for
+the cardinality check, which is left unaggregated:
+
+```promql
+# Total unique visitors
+sum(web_daily_unique_visitors{instance=~"$instance",domain=~"$domain"})
+
+# Pageview rate
+sum(rate(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__rate_interval])) * 60
+
+# Top pages
+topk(20, sum(increase(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__range])) by (path))
+
+# Device breakdown
+sum(increase(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__range])) by (device)
+
+# Cardinality health
+rate(web_path_overflow_total{instance=~"$instance"}[$__rate_interval]) * 60
+```
+
+### Modify Time Range
+
+Default: Last 24 hours
+
+To change:
+
+1. Dashboard Settings -> Time Options
+2. Set default time range
+3. Save dashboard
+
+### Add Alerts
+
+Example alert for cardinality overflow:
+
+1. Edit "Cardinality Overflow" panel
+2. Click **Alert** tab
+3. Create alert rule:
+   - Condition: `WHEN max() OF query(A,5m,now) IS ABOVE 10`
+   - Message: "Cardinality limits are being hit - increase
+     max_paths/max_sources/max_custom_events"
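+
+If you prefer to express a similar check as a single PromQL expression (for
+example in a Prometheus alerting rule), something along these lines should be
+close, though not identical: it averages over the window rather than taking
+the maximum. The 5-minute window and the threshold of 10 mirror the condition
+above; `$__rate_interval` is a Grafana-only variable, so a fixed range is used:
+
+```promql
+# True for any instance averaging more than 10 overflow events per minute
+# over the last 5 minutes
+sum by (instance) (rate(web_path_overflow_total[5m])) * 60 > 10
+```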
+
+## Multi-Instance Aggregation
+
+When running multiple Watchdog instances, Prometheus scrapes each one
+separately and distinguishes their series by the `instance` label; aggregation
+happens at query time. You can use Prometheus' query language (PromQL) to
+combine or split the data in various ways. Some examples:
+
+**Per-instance breakdown:**
+
+```promql
+sum(rate(web_pageviews_total[$__rate_interval])) by (instance)
+```
+
+**Total across all instances:**
+
+```promql
+sum(rate(web_pageviews_total[$__rate_interval]))
+```
+
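+**Unique visitors (note: HLL estimates don't sum exactly):**
+
+```promql
+# Approximate total: a visitor seen by several instances or domains is
+# counted once per series, so this can overcount
+sum(web_daily_unique_visitors)
+
+# Per-series estimate (no double counting across instances, still an HLL
+# approximation)
+web_daily_unique_visitors
+```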
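+
+If the same visitors can reach several instances (for example behind a load
+balancer), the summed estimate is an upper bound. A sketch of a matching lower
+bound takes the largest per-instance estimate for each domain; the true count
+lies somewhere between the two, within HLL error:
+
+```promql
+# Lower bound per domain: largest single-instance estimate
+max by (domain) (web_daily_unique_visitors)
+
+# Upper bound per domain: every instance's estimate added together
+sum by (domain) (web_daily_unique_visitors)
+```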
--- a/contrib/grafana/watchdog.json
+++ b/contrib/grafana/watchdog.json