meta: provide sample grafana dashboard; basic promql docs

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: Icb48454e2b0d37fea290c1681ccddcfe6a6a6964
This commit is contained in:
raf 2026-03-02 22:35:49 +03:00
commit e7297bcc8d
Signed by: NotAShelf
GPG key ID: 29D95B64378DB4BF
2 changed files with 1349 additions and 0 deletions

157
contrib/grafana/README.md Normal file
View file

@ -0,0 +1,157 @@
# Grafana Dashboard
We provide a _sample_ Grafana dashboard for Watchdog, with complete support for
multi-host and multi-site deployments as much as possible. It should be noted,
however, that this is designed to be a _reference_ more than anything. Updates
cannot be provided, and it is recommended that you _write your own dashboard_
while using this one as a reference.
Nevertheless, here are some features provided by the sample dashboard at
`watchdog.json` that you may be interested in:
- **Multi-Instance Support**: Filter and aggregate across multiple Watchdog
instances
- **Multi-Site Support**: Filter by domain for multi-site deployments
- **Real-time Metrics**: Auto-refresh every 30 seconds
- **Traffic Analysis**: Pageviews, unique visitors, device breakdown, geographic
distribution
- **Top Content**: Top pages, referrers, custom events
- **System Health**: Instance health, cardinality overflow monitoring, request
rates
To import it, go to "Dashboards" on your Grafana instance then hit "Import".
Upload the JSON file, select your Prometheus data source (assuming you have a
scraper set up) and hit Import.
## Dashboard Variables
The dashboard includes three template variables for flexible filtering:
### Data Source
- **Variable**: `$datasource`
- **Type**: Data source selector
- **Default**: Prometheus
- **Usage**: Select which Prometheus instance to query
### Instance
- **Variable**: `$instance`
- **Type**: Multi-select query variable
- **Default**: All instances
- **Query**: `label_values(web_pageviews_total, instance)`
- **Usage**: Filter by specific Watchdog instances (e.g., `watchdog-1:8080`,
`watchdog-2:8080`)
### Domain
- **Variable**: `$domain`
- **Type**: Multi-select query variable
- **Default**: All domains
- **Query**: `label_values(web_pageviews_total{instance=~"$instance"}, domain)`
- **Usage**: Filter by specific domains for multi-site analytics
**Example filters:**
- View all sites across all instances: Instance=All, Domain=All
- View single site across all instances: Instance=All, Domain=example.com
- View single instance, all sites: Instance=watchdog-1:8080, Domain=All
- View single site on single instance: Instance=watchdog-1:8080,
Domain=example.com
## Dashboard Sections
### Overview Row
- **Unique Visitors (Today)**: Current HyperLogLog estimate across selected
instances/domains
- **Pageviews/min**: Real-time pageview rate
- **Total Pageviews**: Total pageviews in selected time range
- **Cardinality Overflow/min**: Health indicator (should be ~0)
- **Pageviews by Domain**: Time series showing traffic per domain
- **Unique Visitors by Domain**: Time series showing unique visitors per domain
### Traffic Analysis Row
- **Device Breakdown**: Pie chart of mobile/tablet/desktop traffic
- **Top 10 Countries**: Geographic distribution of traffic
- **Top 20 Pages**: Most visited pages with heat map
- **Top 15 Referrers**: Traffic sources (excludes direct traffic)
- **Top 15 Custom Events**: Most triggered custom events
### System Health Row
- **Instance Health**: Uptime status for each Watchdog instance (1=up, 0=down)
- **Cardinality Overflow**: Rate of rejected metrics due to cardinality limits
(should be near zero)
- **Request Rate by Instance**: Request throughput per instance
## Metrics Reference
All metrics aggregated using `sum()` across selected instances:
```promql
# Total unique visitors
sum(web_daily_unique_visitors{instance=~"$instance",domain=~"$domain"})
# Pageview rate
sum(rate(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__rate_interval])) * 60
# Top pages
topk(20, sum(increase(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__range])) by (path))
# Device breakdown
sum(increase(web_pageviews_total{instance=~"$instance",domain=~"$domain"}[$__range])) by (device)
# Cardinality health
rate(web_path_overflow_total{instance=~"$instance"}[$__rate_interval]) * 60
```
### Modify Time Range
Default: Last 24 hours
To change:
1. Dashboard Settings -> Time Options
2. Set default time range
3. Save dashboard
### Add Alerts
Example alert for cardinality overflow:
1. Edit "Cardinality Overflow" panel
2. Click **Alert** tab
3. Create alert rule:
- Condition: `WHEN max() OF query(A,5m,now) IS ABOVE 10`
- Message: "Cardinality limits are being hit - increase
max_paths/max_sources/max_custom_events"
## Multi-Instance Aggregation
When running multiple Watchdog instances, Prometheus automatically aggregates
metrics. You may use Prometheus' query language (Promql) to create some queries
to visualise data in various ways. Some examples would be:
**Per-instance breakdown:**
```promql
sum(rate(web_pageviews_total[$__rate_interval])) by (instance)
```
**Total across all instances:**
```promql
sum(rate(web_pageviews_total[$__rate_interval]))
```
**Unique visitors (note: HLL counts don't sum directly):**
```promql
# Approximate total - slight overcount due to HLL properties
sum(web_daily_unique_visitors)
# Per-instance (accurate)
web_daily_unique_visitors
```

File diff suppressed because it is too large Load diff