Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964
# Observability Setup

Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
time-series database to scrape and store these metrics, then visualize them in
Grafana.

> [!IMPORTANT]
>
> **Why you need Prometheus:**
>
> - Watchdog exposes only the _current state_ (counters, gauges)
> - Prometheus _scrapes periodically_ and _stores time-series data_
> - Grafana _visualizes_ the historical data from Prometheus
> - Grafana cannot scrape Prometheus-format `/metrics` endpoints directly

## Prometheus Setup

### Configuring Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

    # Optional: scrape multiple Watchdog instances
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       env: 'production'

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

### Verify Prometheus' health state

```bash
# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check it's scraping Watchdog
curl http://localhost:9090/api/v1/targets
```
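
Before looking at Prometheus, it can help to confirm that Watchdog itself is serving metrics. The sketch below is a small helper and not part of Watchdog: `list_metric_families` is a hypothetical name, and the address `localhost:8080` is assumed from the scrape config above. It lists the metric families declared in any Prometheus text exposition output read from stdin.

```bash
# List metric families and their types from Prometheus text exposition
# output on stdin, using the standard "# TYPE <name> <type>" lines.
# list_metric_families is an illustrative helper, not a Watchdog command.
list_metric_families() {
  awk '$1 == "#" && $2 == "TYPE" { print $3, $4 }' | sort
}

# Example (assumes Watchdog listens on localhost:8080):
#   curl -fsS http://localhost:8080/metrics | list_metric_families
```

If the list is empty, the endpoint is either unreachable or not returning exposition-format output, which is worth fixing before debugging the scrape job.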

### NixOS

Add to your NixOS configuration:

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}
```

For multiple Watchdog instances:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}
```

## Grafana Setup

### NixOS

```nix
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}
```

### Configure Data Source (Manual)

If you're not using NixOS for provisioning, you'll need to add the data source
_imperatively_ from within Grafana. This can be done through the admin panel by
navigating to `Configuration` and choosing "Add data source" under
`Data Sources`. Select Prometheus as the type, point it at your Prometheus
instance, and save it.

### Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and
multi-site configurations. Import it, select the data source, and it should
work out of the box.

If you're not using NixOS for provisioning, the dashboard _also_ needs to be
provisioned manually. Under `Dashboards`, select `Import` and paste the JSON
contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
Select your Prometheus data source and import it.

See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
documentation.

## Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can build
your own panels and queries. Below are some example queries written in PromQL,
the Prometheus query language. They are starting points and may not cover
everything you need, so adapt them to your setup.

If you build panels that you think are worth sharing, feel free to contribute
them back!

### Top 10 Pages by Traffic

```promql
topk(10, sum by (path) (rate(web_pageviews_total[5m])))
```

### Mobile vs Desktop Split

```promql
sum by (device) (rate(web_pageviews_total[1h]))
```

### Unique Visitors

```promql
web_daily_unique_visitors
```

### Top Referrers

```promql
topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
```

### Multi-Site: Traffic per Domain

```promql
sum by (domain) (rate(web_pageviews_total[1h]))
```

### Cardinality Health

```promql
# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
```
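
For a quick check without Prometheus, the overflow counters can also be read straight from the `/metrics` output. This is a minimal sketch under a few assumptions: `overflow_totals` is a hypothetical helper name, Watchdog is assumed to listen on `localhost:8080`, and samples are assumed to carry no trailing timestamp. Counters are cumulative since the process started, so a non-zero total only says that an overflow happened at some point; the `rate()` queries above remain the better signal for ongoing problems.

```bash
# Sum the three overflow counters from Prometheus text exposition output on
# stdin. Assumes samples carry no trailing timestamp, so the value is the
# last field. overflow_totals is an illustrative helper name.
overflow_totals() {
  awk '
    /^web_(path|referrer|event)_overflow_total/ {
      name = $1
      sub(/[{].*/, "", name)  # strip the label set, if any
      total[name] += $NF      # add this sample to the per-metric sum
    }
    END { for (name in total) print name, total[name] }
  ' | sort
}

# Example (assumes Watchdog listens on localhost:8080):
#   curl -fsS http://localhost:8080/metrics | overflow_totals
```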

## Horizontal Scaling Considerations

When running multiple Watchdog instances:

1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
2. **Prometheus aggregates at query time** - use `sum()` in queries to aggregate
   across instances
3. **No shared state needed** - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial
as long as you have the necessary infrastructure and, well, the patience.

Example with 3 instances:

```promql
# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
```
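
To see why no shared state is needed, it can help to do the aggregation by hand once. The sketch below adds up one counter from the raw `/metrics` output of several instances, which is roughly what `sum()` does over the stored series (without the `rate()` step). `sum_metric` is a hypothetical helper, the host names are assumed from the scrape config example, and samples are assumed to carry no trailing timestamp.

```bash
# Add up every sample of one metric from Prometheus text exposition output
# on stdin, ignoring labels. Assumes samples carry no trailing timestamp.
# sum_metric is an illustrative helper, not part of Watchdog.
sum_metric() {
  awk -v metric="$1" '
    {
      name = $1
      sub(/[{].*/, "", name)            # strip the label set, if any
      if (name == metric) total += $NF  # value is the last field
    }
    END { print total + 0 }
  '
}

# Example (host names assumed from the scrape config):
#   for host in watchdog-1 watchdog-2 watchdog-3; do
#     curl -fsS "http://$host:8080/metrics"
#   done | sum_metric web_pageviews_total
```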

## Alternatives to Prometheus

### VictoriaMetrics

A Prometheus-compatible time-series database that aims for lower resource usage
and better compression. The example below keeps Prometheus as the scraper and
forwards samples to VictoriaMetrics via remote write:

```nix
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "12month";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}
```

### Grafana Agent

A lightweight alternative that scrapes Watchdog and forwards samples to Grafana
Cloud or a local Prometheus. If you forward to a local Prometheus, it has to be
started with `--web.enable-remote-write-receiver` to accept remote writes.

```bash
# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent
```

```yaml
# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        - url: http://localhost:9090/api/v1/write
```

## Monitoring the Monitoring

Monitor the Watchdog scrape job from Prometheus' side:

```promql
# Scrape health: 1 if the last scrape succeeded, 0 if it failed
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})
```

## Additional Recommendations

1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
   disk space
2. **Backups**: Back up `/var/lib/prometheus` periodically (or whatever your
   state directory is)
3. **Alerting**: Configure Prometheus alerting rules for critical metrics
4. **High Availability**: Run multiple Prometheus instances with identical
   configs
5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
   VictoriaMetrics