docs: document workflow and configuration fields

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: I6a1b6adf568d7beae748cdee4ac851f16a6a6964
raf 2026-03-17 16:16:26 +03:00
commit 03215df954
Signed by: NotAShelf
GPG key ID: 29D95B64378DB4BF

README.md

@@ -3,7 +3,17 @@

Declarative, pipeline-based backup orchestrator for Forgejo with a focus on
backpressure tolerance, cryptographic verification, and fan-out concurrency.

## About
### Why?

Currently my work outside of GitHub is scattered across various Forgejo
instances. I do not wish to consolidate them into one, as I use these
instances with different goals and intents, but I _do_ want a safeguard that
covers them all. So I decided to build a proper solution that scratches my
own itch.
### Name Origin
The name is derived through Esperanto morphology, the same way the original
name is:
@@ -13,22 +23,131 @@ name does:

morphing into "preservation place" or "archive."
## Usage

### Prerequisites

- A Forgejo personal access token with read access to repositories
- Access to at least one Forgejo instance with the API enabled
### Configuration

Konservejo is driven entirely by its configuration file; there are no CLI
flags for modifying inputs or outputs. The service, your sources, and your
sinks are all configured in that file.
```toml
[service]
name = "my-backup"
state_db_path = "/var/lib/konservejo/state.db"
temp_dir = "/var/tmp/konservejo"

# Optional: limit concurrent repository processing
concurrency_limit = 4

# Optional: retry settings
[service.retry]
max_retries = 3
initial_backoff_ms = 500
backoff_multiplier = 2.0
max_backoff_ms = 30000

[[source]]
type = "forgejo"
id = "primary"
api_url = "https://git.example.tld/api/v1"
token = "${FORGEJO_TOKEN}"

[source.scope]
organizations = ["my-org"]
exclude_repos = ["my-org/legacy-repo"]

[[sink]]
type = "filesystem"
id = "local"
path = "/backup/repos"
verify_on_write = true
```
> [!TIP]
> Environment variable interpolation is supported with `${VAR}` syntax. Secrets
> should be passed via environment variables.
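The interpolation is shell-style `${VAR}` substitution. As an illustration only (the `interpolate` helper below is a hypothetical sketch, not Konservejo's actual code), the idea looks like this:

```python
import os
import re

def interpolate(value: str) -> str:
    """Replace ${VAR} with the value of the environment variable VAR.

    Hypothetical sketch of shell-style substitution; raises if VAR is
    unset so a missing secret fails loudly instead of yielding "".
    """
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", lookup, value)

# Example: token = "${FORGEJO_TOKEN}" is resolved at config-load time.
os.environ["FORGEJO_TOKEN"] = "s3cret"
print(interpolate("${FORGEJO_TOKEN}"))  # s3cret
```

Failing loudly on an unset variable is the safer default here, since an empty token would otherwise surface later as an opaque API authentication error.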
### Configuration Reference
**`[service]`**
<!--markdownlint-disable MD013-->

| Field                      | Type   | Default  | Description                    |
| -------------------------- | ------ | -------- | ------------------------------ |
| `name`                     | string | required | Service identifier             |
| `state_db_path`            | string | required | SQLite database path           |
| `temp_dir`                 | string | required | Temporary download directory   |
| `concurrency_limit`        | usize  | 4        | Max concurrent repo backups    |
| `retry.max_retries`        | u32    | 3        | Retry attempts for network ops |
| `retry.initial_backoff_ms` | u64    | 500      | Initial backoff delay          |
| `retry.backoff_multiplier` | f64    | 2.0      | Backoff scaling factor         |
| `retry.max_backoff_ms`     | u64    | 30000    | Maximum backoff delay          |

<!--markdownlint-enable MD013-->
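With the defaults above, retry delays grow geometrically from `initial_backoff_ms` by `backoff_multiplier`, capped at `max_backoff_ms`. A sketch of the resulting schedule (illustrative only, not Konservejo's actual implementation):

```python
def backoff_schedule(max_retries: int = 3,
                     initial_backoff_ms: int = 500,
                     backoff_multiplier: float = 2.0,
                     max_backoff_ms: int = 30000) -> list[int]:
    """Delay in ms before each retry, using the [service.retry] defaults.

    Hypothetical sketch: geometric growth with a hard cap.
    """
    delays = []
    delay = float(initial_backoff_ms)
    for _ in range(max_retries):
        delays.append(int(min(delay, max_backoff_ms)))
        delay *= backoff_multiplier
    return delays

print(backoff_schedule())  # [500, 1000, 2000]
```

With more retries configured, the cap kicks in: the seventh delay would be 32000 ms uncapped, so it is clamped to 30000 ms.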
**`[[source]]` (type: "forgejo")**

<!--markdownlint-disable MD013-->

| Field                 | Type     | Description                                 |
| --------------------- | -------- | ------------------------------------------- |
| `id`                  | string   | Source identifier                           |
| `api_url`             | string   | Forgejo API base URL                        |
| `token`               | string   | Access token (use `${VAR}`)                 |
| `scope.organizations` | [string] | Orgs to back up                             |
| `scope.exclude_repos` | [string] | Repos to skip (supports `*/name` wildcards) |

<!--markdownlint-enable MD013-->
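The `*/name` wildcard form can be read as glob matching on the full `owner/name` pair. A sketch of that rule, assuming glob semantics (the exact matcher Konservejo uses may differ):

```python
from fnmatch import fnmatch

def is_excluded(full_name: str, exclude_repos: list[str]) -> bool:
    """True if owner/name matches any exclusion pattern.

    Assumes glob semantics for patterns such as "*/legacy-repo";
    hypothetical sketch, not Konservejo's actual matcher.
    """
    return any(fnmatch(full_name, pattern) for pattern in exclude_repos)

print(is_excluded("my-org/legacy-repo", ["my-org/legacy-repo"]))  # True
print(is_excluded("other-org/legacy-repo", ["*/legacy-repo"]))    # True
print(is_excluded("my-org/active-repo", ["*/legacy-repo"]))       # False
```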
**`[[sink]]` (type: "filesystem")**

| Field             | Type   | Default  | Description                  |
| ----------------- | ------ | -------- | ---------------------------- |
| `id`              | string | required | Sink identifier              |
| `path`            | string | required | Storage directory            |
| `verify_on_write` | bool   | true     | Re-read and hash after write |
### Commands
```bash
# Set credentials
$ export FORGEJO_TOKEN=your_token_here
# Validate configuration
$ konservejo validate-config
# Run backup
$ konservejo backup
# Verify manifest integrity
$ konservejo verify-manifest --run-id <uuid>
```
## Workflow
[Merkle tree]: https://en.wikipedia.org/wiki/Merkle_tree
Repositories are processed concurrently up to `concurrency_limit`. Each
repository is downloaded as a `tar.gz` archive, hashed (Blake3), and written to
all configured sinks. Storage is content-addressed. Artifacts are stored at
`{path}/{hash[0..2]}/{hash[2..4]}/{hash}`.
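The sharded layout can be sketched as follows. Blake3 is not in the Python standard library, so this illustration derives the path from an already-computed hex digest (the digest shown is just an example string):

```python
from pathlib import PurePosixPath

def artifact_path(base: str, hex_digest: str) -> PurePosixPath:
    """Map a content hash to {path}/{hash[0..2]}/{hash[2..4]}/{hash}.

    Sharding on the first two byte pairs keeps per-directory fan-out small.
    """
    return PurePosixPath(base, hex_digest[0:2], hex_digest[2:4], hex_digest)

digest = "af1349b9f5f9a1a6a0404dea36dcc949"  # example digest, truncated
print(artifact_path("/backup/repos", digest))
# /backup/repos/af/13/af1349b9f5f9a1a6a0404dea36dcc949
```

Because the path is derived purely from content, identical artifacts written twice land at the same location, which makes writes naturally idempotent.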
Then, a [Merkle tree] is computed over all artifacts; the root hash is
persisted to the database for integrity verification. You'd do well to protect
your database.
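The Merkle computation can be pictured with a small sketch, using SHA-256 from the standard library in place of Blake3 purely for illustration; the pairing and odd-node handling shown here are assumptions, not necessarily the scheme Konservejo uses:

```python
import hashlib

def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    """Fold a list of leaf hashes up to a single root hash.

    Hypothetical sketch: pairs are concatenated and re-hashed level by
    level; an odd trailing node is promoted unchanged. Uses SHA-256 in
    place of Blake3 for illustration.
    """
    if not leaf_hashes:
        raise ValueError("no artifacts to hash")
    level = leaf_hashes
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i : i + 2]
            if len(pair) == 2:
                nxt.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                nxt.append(pair[0])  # odd node promoted unchanged
        level = nxt
    return level[0]

leaves = [hashlib.sha256(name.encode()).digest()
          for name in ["repo-a.tar.gz", "repo-b.tar.gz", "repo-c.tar.gz"]]
print(merkle_root(leaves).hex())
```

The value of the tree is that any single corrupted artifact changes the root, so one stored hash suffices to detect tampering anywhere in the backup set.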
Retry logic handles transient failures such as network errors. Permanent
failures (4xx) fail immediately, while 429 and 5xx responses are retried.
## Current Limitations/TODO
- S3 sink is not implemented (returns explicit error if configured)
- Checkpoint/resume not yet supported
- No retention policy enforcement yet