docs: provide obserability stack guide
Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964
parent 13343ef2bd
commit df06ed38bf
2 changed files with 302 additions and 1 deletions
300
docs/observability.md
Normal file
@@ -0,0 +1,300 @@

# Observability Setup

Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
time-series database to scrape and store these metrics, then visualize them in
Grafana.

> [!IMPORTANT]
>
> **Why you need Prometheus:**
>
> - Watchdog exposes _current state_ (counters, gauges)
> - Prometheus _scrapes periodically_ and _stores time-series data_
> - Grafana _visualizes_ the historical data from Prometheus
> - Grafana cannot directly scrape Prometheus `/metrics` endpoints

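To make "current state" concrete, here is a minimal Python sketch that parses
the kind of text a single scrape of `/metrics` returns. The sample payload, its
`HELP` text, and its label values are made up for illustration. The parser is
deliberately simplified (it assumes label values contain no escaped quotes) and
is not how Prometheus itself parses the format.

```python
import re

# Matches one sample line such as: web_pageviews_total{path="/",device="mobile"} 42
# Simplification: label values are assumed not to contain escaped quotes.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical scrape output; real Watchdog output will differ.
SAMPLE_SCRAPE = """\
# HELP web_pageviews_total Total pageviews
# TYPE web_pageviews_total counter
web_pageviews_total{path="/",device="mobile"} 42
web_pageviews_total{path="/blog",device="desktop"} 7
web_daily_unique_visitors 19
"""

samples = parse_metrics(SAMPLE_SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each scrape only yields the values at that instant. The history that Grafana
plots exists only because Prometheus stores one such snapshot per scrape
interval.
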
## Prometheus Setup

### Configuring Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

    # Optional: scrape multiple Watchdog instances
    # (use a custom label such as `env`; overriding `instance` would make
    # the targets indistinguishable)
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       env: 'production'

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```

### Verify Prometheus' health state

```bash
# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check it's scraping Watchdog
curl http://localhost:9090/api/v1/targets
```

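The targets endpoint returns JSON, which is tedious to read by eye. The
following Python sketch filters such a response for targets of the `watchdog`
job that are not healthy. The sample response is trimmed and hypothetical, and
`unhealthy_targets` is a helper name invented for this example; only the field
names (`activeTargets`, `labels`, `scrapeUrl`, `health`, `lastError`) come from
the Prometheus HTTP API.

```python
import json


def unhealthy_targets(payload, job="watchdog"):
    """Return (scrapeUrl, lastError) for every active target of `job` that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("labels", {}).get("job") == job and target.get("health") != "up"
    ]


# Trimmed, hypothetical response body; real responses carry more fields.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "watchdog", "instance": "localhost:8080"},
       "scrapeUrl": "http://localhost:8080/metrics",
       "health": "down", "lastError": "connection refused"},
      {"labels": {"job": "prometheus", "instance": "localhost:9090"},
       "scrapeUrl": "http://localhost:9090/metrics",
       "health": "up", "lastError": ""}
    ]
  }
}
""")

problems = unhealthy_targets(SAMPLE_RESPONSE)
print(problems)
```

Against a live server, the payload could instead be loaded with
`json.load(urllib.request.urlopen("http://localhost:9090/api/v1/targets"))`.
An empty result means every Watchdog target was scraped successfully.
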
### NixOS

Add to your NixOS configuration:

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}
```

For multiple Watchdog instances:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}
```

## Grafana Setup

### NixOS

```nix
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}
```

### Configure Data Source (Manual)

If you're not provisioning Grafana through NixOS, you'll need to add the data
source _imperatively_ through the Grafana UI. In the admin panel, navigate to
`Configuration`, open `Data Sources`, and choose "Add data source". Select
Prometheus as the type, point it at your Prometheus instance (for example
`http://localhost:9090`), and save it.

### Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and
multi-site configurations. Import it, configure the data source, and it should
work out of the box.

If you're not using NixOS for provisioning, the dashboard _also_ needs to be
provisioned manually. Under `Dashboards`, select `Import` and provide the JSON
contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
Select your Prometheus data source and import it.

See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
documentation.

## Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can build
your own panels and queries. Below are some example queries written in PromQL,
the Prometheus query language. They are starting points and might not cover
everything you need, so adapt them to your own setup.

If you have widgets that you think others would find valuable, feel free to
contribute them back!

### Top 10 Pages by Traffic

```promql
topk(10, sum by (path) (rate(web_pageviews_total[5m])))
```

### Mobile vs Desktop Split

```promql
sum by (device) (rate(web_pageviews_total[1h]))
```

### Unique Visitors

```promql
web_daily_unique_visitors
```

### Top Referrers

```promql
topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
```

### Multi-Site: Traffic per Domain

```promql
sum by (domain) (rate(web_pageviews_total[1h]))
```

### Cardinality Health

```promql
# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
```

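The queries above can also be run outside Grafana through Prometheus' HTTP API,
which is handy for quick checks or scripts. As a minimal Python sketch, the
helper below builds the URL for an instant query. `instant_query_url` is a name
invented for this example, and the base URL assumes Prometheus listens on
`localhost:9090` as configured earlier.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def instant_query_url(expr, base_url="http://localhost:9090"):
    """Build the URL for an instant query against Prometheus' HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


url = instant_query_url("web_daily_unique_visitors")
print(url)

# Expressions with parentheses, commas, and spaces are percent-encoded.
top_pages = "topk(10, sum by (path) (rate(web_pageviews_total[5m])))"
top_pages_url = instant_query_url(top_pages)
print(top_pages_url)

# Decoding the query string gives back the original expression.
decoded = parse_qs(urlparse(top_pages_url).query)["query"][0]
```

Fetching the resulting URL with `curl` or `urllib.request.urlopen` returns a
JSON document whose `data.result` array holds the matching series.
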
## Horizontal Scaling Considerations

When running multiple Watchdog instances:

1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
2. **Aggregation happens at query time** - use `sum()` in queries to aggregate
   across instances
3. **No shared state needed** - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial
as long as you have the necessary infrastructure and, well, the patience.

Example with 3 instances:

```promql
# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
```

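To illustrate what the two queries above compute, here is a small Python sketch
that mimics `sum()` and `sum by (instance)` over a handful of series. The
series and their values are made up, and `sum_by` is a hypothetical helper
rather than anything Prometheus or Watchdog provides.

```python
from collections import defaultdict

# Hypothetical per-series pageview rates (views/second), as Prometheus might
# hold them after scraping three instances. All values are made up.
series = [
    ({"instance": "watchdog-1:8080", "path": "/"}, 1.5),
    ({"instance": "watchdog-1:8080", "path": "/blog"}, 0.5),
    ({"instance": "watchdog-2:8080", "path": "/"}, 2.0),
    ({"instance": "watchdog-3:8080", "path": "/"}, 1.0),
]


def sum_by(series, *keep):
    """Mimic PromQL `sum by (keep...)`: group on the kept labels and add values."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in keep)
        totals[key] += value
    return dict(totals)


total = sum_by(series)                     # like sum(...): one group, no labels
per_instance = sum_by(series, "instance")  # like sum by (instance) (...)
print(total)
print(per_instance)
```

With no labels kept, every series falls into a single group, which is why a
plain `sum()` yields one total regardless of how many instances are scraped.
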
## Alternatives to Prometheus

### VictoriaMetrics

Drop-in Prometheus replacement that aims for better performance and
compression. The example below keeps Prometheus for scraping and forwards the
data to VictoriaMetrics via remote write:

```nix
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "12month";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}
```

### Grafana Agent

Lightweight alternative that scrapes and forwards to Grafana Cloud or local
Prometheus:

```bash
# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent
```

Forwarding to a local Prometheus, as in the configuration below, requires
Prometheus to be started with `--web.enable-remote-write-receiver`.

```yaml
# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        - url: http://localhost:9090/api/v1/write
```

## Monitoring the Monitoring

Monitor the scraping of Watchdog itself:

```promql
# Scrape success (1 = last scrape succeeded, 0 = it failed)
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})
```

## Additional Recommendations

1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
   disk space
2. **Backups**: Back up `/var/lib/prometheus` periodically (or whatever your
   state directory is)
3. **Alerting**: Configure Prometheus alerting rules for critical metrics
4. **High Availability**: Run multiple Prometheus instances with identical
   configs
5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
   VictoriaMetrics