

Konservejo

Declarative, pipeline-based backup orchestrator for Forgejo with a focus on backpressure tolerance, cryptographic verification, and fan-out concurrency.

About

Why?

Currently my work outside of GitHub is scattered across various Forgejo instances. I do not wish to consolidate them into one, as I use the different instances with different goals and intents, but I do want a single safeguard that covers them all. So I decided to build a proper solution that scratches that itch.

Name Origin

The name follows the same Esperanto morphology as Forgejo's own name:

  • konservi = to preserve
  • -ejo = place

morphing into "preservation place" or "archive."

Usage

Prerequisites

  • A Forgejo personal access token with read access to repositories
  • Access to at least one Forgejo instance with API enabled

Configuration

Konservejo is driven entirely by a configuration file; there are no CLI flags to override inputs or outputs. The service, your sources, and your sinks are all configured in that file.

[service]
name = "my-backup"
state_db_path = "/var/lib/konservejo/state.db"
temp_dir = "/var/tmp/konservejo"

# Optional: limit concurrent repository processing
concurrency_limit = 4

# Optional: retry settings
[service.retry]
max_retries = 3
initial_backoff_ms = 500
backoff_multiplier = 2.0
max_backoff_ms = 30000

[[source]]
type = "forgejo"
id = "primary"
api_url = "https://git.example.tld/api/v1"
token = "${FORGEJO_TOKEN}"

[source.scope]
organizations = ["my-org"]
exclude_repos = ["my-org/legacy-repo"]

[[sink]]
type = "filesystem"
id = "local"
path = "/backup/repos"
verify_on_write = true

Tip

Environment variable interpolation is supported with ${VAR} syntax. Secrets should be passed via environment variables.
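As a rough illustration of what ${VAR} interpolation does, here is a hypothetical Python sketch; the actual substitution rules (escaping, behaviour on missing variables) may differ:

```python
import os
import re

def interpolate(value: str) -> str:
    # Replace each ${VAR} occurrence with the value of environment
    # variable VAR. In this sketch a missing variable becomes an empty
    # string; that fallback is an assumption, not documented behaviour.
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )
```

For example, with FORGEJO_TOKEN set in the environment, `interpolate('token = "${FORGEJO_TOKEN}"')` yields the config value with the secret filled in, so the secret never lives in the file itself.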

Configuration Reference

[service]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Service identifier |
| state_db_path | string | required | SQLite database path |
| temp_dir | string | required | Temporary download directory |
| concurrency_limit | usize | 4 | Max concurrent repo backups |
| retry.max_retries | u32 | 3 | Retry attempts for network ops |
| retry.initial_backoff_ms | u64 | 500 | Initial backoff delay |
| retry.backoff_multiplier | f64 | 2.0 | Backoff scaling factor |
| retry.max_backoff_ms | u64 | 30000 | Maximum backoff delay |
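To illustrate how these retry fields combine, here is a hedged Python sketch of the resulting backoff schedule (the function name and the exact point at which the cap applies are assumptions, not the actual implementation):

```python
def backoff_schedule(max_retries: int = 3,
                     initial_ms: int = 500,
                     multiplier: float = 2.0,
                     max_ms: int = 30000) -> list[int]:
    # One delay per retry attempt: start at initial_ms, scale by
    # multiplier each time, never exceed max_ms.
    delays = []
    delay = float(initial_ms)
    for _ in range(max_retries):
        delays.append(int(min(delay, max_ms)))
        delay *= multiplier
    return delays
```

With the defaults above this yields delays of 500 ms, 1000 ms, and 2000 ms; with more retries the delay grows until it saturates at max_backoff_ms.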

[[source]] (type: "forgejo")

| Field | Type | Description |
| --- | --- | --- |
| id | string | Source identifier |
| api_url | string | Forgejo API base URL |
| token | string | Access token (use ${VAR}) |
| scope.organizations | [string] | Orgs to back up |
| scope.exclude_repos | [string] | Repos to skip (supports */name wildcards) |
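For illustration, */name wildcard exclusion could be approximated with shell-style pattern matching; this is a sketch, and the real matcher's semantics may differ:

```python
from fnmatch import fnmatchcase

def is_excluded(full_name: str, patterns: list[str]) -> bool:
    # full_name is "owner/repo"; patterns are literal names or
    # shell-style wildcards such as "*/legacy-repo".
    return any(fnmatchcase(full_name, p) for p in patterns)
```

So `"*/legacy-repo"` would skip a repo named legacy-repo under any owner, while an exact `"my-org/legacy-repo"` entry skips only that one.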

[[sink]] (type: "filesystem")

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | required | Sink identifier |
| path | string | required | Storage directory |
| verify_on_write | bool | true | Re-read and hash after write |
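The verify_on_write behaviour might look roughly like this sketch (a hypothetical helper, not Konservejo's code; SHA-256 stands in for Blake3, which is not in the Python standard library):

```python
import hashlib
import os

def write_verified(path: str, data: bytes) -> None:
    # Write the artifact, flush it to disk, then re-read the file and
    # compare digests to catch silent corruption at write time.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        readback = f.read()
    if hashlib.sha256(readback).digest() != hashlib.sha256(data).digest():
        raise IOError(f"verification failed for {path}")
```

Paying the extra read on every write is the trade-off for detecting a bad disk or filesystem immediately rather than at restore time.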

Commands

# Set credentials
$ export FORGEJO_TOKEN=your_token_here

# Validate configuration
$ konservejo validate-config

# Run backup
$ konservejo backup

# Verify manifest integrity
$ konservejo verify-manifest --run-id <uuid>

Workflow

Repositories are processed concurrently up to concurrency_limit. Each repository is downloaded as a tar.gz archive, hashed (Blake3), and written to all configured sinks. Storage is content-addressed. Artifacts are stored at {path}/{hash[0..2]}/{hash[2..4]}/{hash}.
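The content-addressed layout can be illustrated with a short sketch; SHA-256 stands in for Blake3 here (Blake3 is a third-party package in Python), but the path shape follows the {hash[0..2]}/{hash[2..4]}/{hash} scheme above:

```python
import hashlib
from pathlib import PurePosixPath

def artifact_path(root: str, data: bytes) -> PurePosixPath:
    # Hash the archive bytes and fan the store out over two levels of
    # two-character prefix directories to keep directory sizes small.
    h = hashlib.sha256(data).hexdigest()
    return PurePosixPath(root) / h[0:2] / h[2:4] / h
```

Because the path is derived from the content hash, identical archives deduplicate for free and any artifact can be verified by re-hashing it and comparing against its own filename.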

Then, a Merkle tree is computed over all artifacts; the root hash is persisted to the database for integrity verification. You'd do well to protect your database.
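A minimal sketch of one common Merkle construction (pairwise hashing, with an odd node promoted unchanged); Konservejo's actual tree shape and domain separation may differ, and SHA-256 again stands in for Blake3:

```python
import hashlib

def merkle_root(leaf_hashes: list[str]) -> str:
    # Hash adjacent pairs of hex digests level by level until one
    # root remains; a leftover odd node is carried up as-is.
    if not leaf_hashes:
        return hashlib.sha256(b"").hexdigest()
    level = list(leaf_hashes)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                pair = (level[i] + level[i + 1]).encode()
                nxt.append(hashlib.sha256(pair).hexdigest())
            else:
                nxt.append(level[i])
        level = nxt
    return level[0]
```

Persisting only the root is enough to later prove that no artifact in the run was altered: recompute the tree from the stored artifacts and compare roots.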

Retry logic handles transient failures such as network errors. Permanent failures (most 4xx responses) fail immediately, while 429 and 5xx responses are retried with backoff.
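That classification rule can be expressed as a one-liner (a sketch of the rule as stated, not the actual code):

```python
def is_retryable(status: int) -> bool:
    # 429 (rate-limited) and 5xx (server-side) responses are treated
    # as transient; every other 4xx is a permanent client error.
    return status == 429 or 500 <= status <= 599
```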

Current Limitations/TODO

  • S3 sink is not implemented (returns explicit error if configured)
  • Checkpoint/resume not yet supported
  • No retention policy enforcement yet