docs: provide observability stack guide
Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: Ibadc31d02413da836e85eaa3d446eb9e6a6a6964
parent 13343ef2bd
commit df06ed38bf
2 changed files with 302 additions and 1 deletions
@@ -212,7 +212,8 @@ While not final, some of the metrics collected are as follows:
 - `web_path_overflow_total` - Paths rejected due to cardinality limit
 - `web_referrer_overflow_total` - Referrers rejected due to limit
 - `web_event_overflow_total` - Custom events rejected due to limit
-- `web_blocked_requests_total{reason}` - File server requests blocked by security filters
+- `web_blocked_requests_total{reason}` - File server requests blocked by
+  security filters

 **Process metrics:**
docs/observability.md (new file, 300 lines)

@@ -0,0 +1,300 @@
# Observability Setup

Watchdog exposes Prometheus-formatted metrics at `/metrics`. You need a
time-series database to scrape and store these metrics, then visualize them in
Grafana.

> [!IMPORTANT]
>
> **Why you need Prometheus:**
>
> - Watchdog exposes _current state_ (counters, gauges)
> - Prometheus _scrapes periodically_ and _stores time-series data_
> - Grafana _visualizes_ the historical data from Prometheus
> - Grafana cannot scrape `/metrics` endpoints directly
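To make the "current state" point concrete, here is a sketch of the Prometheus text exposition format served at `/metrics`. The metric names come from this guide, but the label sets and values below are made up for illustration:

```shell
# Hand-written sample of the Prometheus text format; values are fabricated.
cat <<'EOF' > /tmp/sample-metrics.txt
# HELP web_pageviews_total Total pageviews
# TYPE web_pageviews_total counter
web_pageviews_total{path="/",device="desktop"} 1234
web_path_overflow_total 0
EOF

# The format is line-oriented: '#' lines are metadata, everything else is
# one sample per line. Print just the metric names and label sets:
awk '!/^#/ { print $1 }' /tmp/sample-metrics.txt
```

Prometheus re-reads this entire payload on every scrape, which is why the history has to live in the time-series database rather than in Watchdog itself.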
## Prometheus Setup

### Configuring Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "watchdog"
    static_configs:
      - targets: ["localhost:8080"]

    # Optional: scrape multiple Watchdog instances
    # static_configs:
    #   - targets:
    #       - 'watchdog-1.example.com:8080'
    #       - 'watchdog-2.example.com:8080'
    #     labels:
    #       instance: 'production'

  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
```
### Verify Prometheus' Health State

```bash
# Check Prometheus is running
curl http://localhost:9090/-/healthy

# Check that it's scraping Watchdog
curl http://localhost:9090/api/v1/targets
```
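The targets endpoint returns JSON; a healthy target reports `"health":"up"`. As a sketch (using a hand-abbreviated sample response, since a real one carries many more fields per target), you can check this without installing `jq`:

```shell
# Hand-written, abbreviated sample of /api/v1/targets output.
cat <<'EOF' > /tmp/targets.json
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"watchdog","instance":"localhost:8080"},"health":"up"}
]}}
EOF

# Count healthy targets in the response:
grep -c '"health":"up"' /tmp/targets.json
```

If the count is zero, check that Watchdog is actually listening on the target address you configured.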
### NixOS

Add to your NixOS configuration:

```nix
{
  services.prometheus = {
    enable = true;
    port = 9090;

    # Retention period
    retentionTime = "30d";

    scrapeConfigs = [
      {
        job_name = "watchdog";
        static_configs = [{
          targets = [ "localhost:8080" ];
        }];
      }
    ];
  };

  # Open firewall if needed
  # networking.firewall.allowedTCPPorts = [ 9090 ];
}
```
For multiple Watchdog instances:

```nix
{
  services.prometheus.scrapeConfigs = [
    {
      job_name = "watchdog";
      static_configs = [
        {
          labels.env = "production";
          targets = [
            "watchdog-1:8080"
            "watchdog-2:8080"
            "watchdog-3:8080"
          ];
        }
      ];
    }
  ];
}
```
## Grafana Setup

### NixOS

```nix
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        http_addr = "127.0.0.1";
        http_port = 3000;
      };
    };

    provision = {
      enable = true;

      datasources.settings.datasources = [{
        name = "Prometheus";
        type = "prometheus";
        url = "http://localhost:9090";
        isDefault = true;
      }];
    };
  };
}
```
### Configure Data Source (Manual)

If you're not using NixOS for provisioning, you'll need to configure the data
source _imperatively_ through the Grafana admin panel. Navigate to
`Configuration`, choose "Add data source" under `Data Sources`, select your
Prometheus instance, and save it.
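If you'd rather script this step, Grafana also exposes an HTTP API for data sources. A sketch, assuming the ports used in this guide and default admin credentials (both are assumptions to adjust for your setup):

```shell
# Data source payload mirroring the NixOS provisioning block above.
payload='{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://localhost:9090",
  "access": "proxy",
  "isDefault": true
}'

# Sanity-check the JSON before sending it:
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then POST it to a running Grafana (credentials are an assumption):
# curl -X POST -H 'Content-Type: application/json' \
#   -d "$payload" http://admin:admin@localhost:3000/api/datasources
```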
### Import Pre-built Dashboard

A sample Grafana dashboard is provided with support for multi-host and
multi-site configurations. Import it, configure the data source, and it should
work out of the box.

If you're not using NixOS for provisioning, the dashboard _also_ needs to be
provisioned manually. Under `Dashboards`, select `Import` and provide the JSON
contents or upload the sample dashboard from `contrib/grafana/watchdog.json`.
Select your Prometheus data source and import it.

See [contrib/grafana/README.md](../contrib/grafana/README.md) for full
documentation.
## Example Queries

Once Prometheus is scraping Watchdog and Grafana is connected, you can write
your own panels and queries. Here are some example queries written in PromQL,
the Prometheus query language. They might not cover everything you need, but
they make a useful starting point for your own setup.

If you have widgets you believe are valuable and would like to contribute
them back, feel free!
### Top 10 Pages by Traffic

```promql
topk(10, sum by (path) (rate(web_pageviews_total[5m])))
```

### Mobile vs Desktop Split

```promql
sum by (device) (rate(web_pageviews_total[1h]))
```

### Unique Visitors

```promql
web_daily_unique_visitors
```

### Top Referrers

```promql
topk(10, sum by (referrer) (rate(web_pageviews_total{referrer!="direct"}[1d])))
```

### Multi-Site: Traffic per Domain

```promql
sum by (domain) (rate(web_pageviews_total[1h]))
```

### Cardinality Health

```promql
# Should be near zero
rate(web_path_overflow_total[5m])
rate(web_referrer_overflow_total[5m])
rate(web_event_overflow_total[5m])
```
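The cardinality-health queries lend themselves to alerting. A minimal rule-file sketch (the alert name, the `10m` window, and the file path are assumptions, not shipped defaults); reference the file from `rule_files:` in `prometheus.yml`:

```shell
# Write a minimal alerting rule for the overflow counters (sketch only;
# the alert name and thresholds are assumptions).
cat <<'EOF' > /tmp/watchdog-rules.yml
groups:
  - name: watchdog
    rules:
      - alert: WatchdogPathOverflow
        expr: rate(web_path_overflow_total[5m]) > 0
        for: 10m
        annotations:
          summary: "Paths are being rejected by the cardinality limit"
EOF

# One rule defined:
grep -c 'alert:' /tmp/watchdog-rules.yml
```

Validate the rule file with `promtool check rules` before reloading Prometheus.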
## Horizontal Scaling Considerations

When running multiple Watchdog instances:

1. **Each instance exposes its own metrics** - Prometheus scrapes all instances
2. **Prometheus aggregates automatically** - use `sum()` in queries to
   aggregate across instances
3. **No shared state needed** - each Watchdog instance is independent

Watchdog is almost entirely stateless, so horizontal scaling should be trivial
as long as you have the necessary infrastructure and, well, the patience.
Example with 3 instances:

```promql
# Total pageviews across all instances
sum(rate(web_pageviews_total[5m]))

# Per-instance breakdown
sum by (instance) (rate(web_pageviews_total[5m]))
```
## Alternatives to Prometheus

### VictoriaMetrics

A drop-in Prometheus replacement with better performance and compression:

```nix
{
  services.victoriametrics = {
    enable = true;
    listenAddress = ":8428";
    retentionPeriod = "12month";
  };

  # Configure Prometheus to remote-write to VictoriaMetrics
  services.prometheus = {
    enable = true;
    remoteWrite = [{
      url = "http://localhost:8428/api/v1/write";
    }];
  };
}
```
### Grafana Agent

A lightweight alternative that scrapes metrics and forwards them to Grafana
Cloud or a local Prometheus:

```bash
# Systemd setup for Grafana Agent
sudo systemctl enable --now grafana-agent
```

```yaml
# /etc/grafana-agent.yaml
metrics:
  wal_directory: /var/lib/grafana-agent
  configs:
    - name: watchdog
      scrape_configs:
        - job_name: watchdog
          static_configs:
            - targets: ["localhost:8080"]
      remote_write:
        - url: http://localhost:9090/api/v1/write
```

Note that Prometheus only accepts remote writes when started with the
`--web.enable-remote-write-receiver` flag.
## Monitoring the Monitoring

Monitor Prometheus itself:

```promql
# Scrape success (1 = target up, 0 = down)
up{job="watchdog"}

# Scrape duration
scrape_duration_seconds{job="watchdog"}

# Time since last scrape
time() - timestamp(up{job="watchdog"})
```
## Additional Recommendations

1. **Retention**: Set `--storage.tsdb.retention.time=30d` or longer based on
   disk space
2. **Backups**: Back up `/var/lib/prometheus` (or whatever your state
   directory is) periodically
3. **Alerting**: Configure Prometheus alerting rules for critical metrics
4. **High Availability**: Run multiple Prometheus instances with identical
   configs
5. **Remote Storage**: For long-term storage, use Thanos, Cortex, or
   VictoriaMetrics