docs: initial specification; we yap

Signed-off-by: NotAShelf <raf@notashelf.dev>
Change-Id: I885e6317d186ccdc847195957dba4ab26a6a6964
raf 2026-02-23 02:26:05 +03:00
commit 56f15d749e
Signed by: NotAShelf
GPG key ID: 29D95B64378DB4BF

docs/SPEC.md Normal file

@ -0,0 +1,265 @@
# Nixir Technical Specification
This is a distillation of my personal notes on my "research" within the Nix
codebase and the subsequent design notes on Nixir. While some of those,
naturally, belong in the README, I have elected to compile a list of noteworthy
details into a "specification document" for those possibly interested, for some
reason, in integrating with Nixir.
Beware, here be observations.
## What This Project Is
Nixir is, most simply (and elegantly) put, a Nix compiler _and runtime_ packaged
as a plugin. The compiler component compiles a subset of Nix source to a custom
binary intermediate representation (IR), and the runtime component executes that
IR inside a virtual machine running within the plugin process. Hence the name:
Nix-ir.
As you might've caught on from the README already, the project consists of two
artifacts: a standalone compiler tool called `nix-irc` that transforms `.nix`
files into `.nixir` bundles, and a plugin library (`nix-ir-plugin.so`) that Nix
loads to provide three primops for interacting with compiled IR.
The architecture handles the full compilation pipeline. Static imports are
resolved at compile time and inlined into the output bundle, while the VM
handles all evaluation at runtime. This mirrors how Nixpkgs itself
distinguishes between stable library code and application-specific expressions.
The plugin does not intercept evaluation automatically. Instead, it exposes
primops that users invoke explicitly. This design exists because Nix's plugin
API does not provide hooks into the core evaluation loop. Unfortunate, but 'tis
life.
## Why Compile Nix
Every invocation of `nix eval` or `nix build` must parse and evaluate
expressions from scratch. Nix is dynamically typed, so there is no separate
type-checking pass, but parsing alone is not free. For large codebases, this
overhead is measurable.
Nix does provide a persistent evaluation cache, stored in SQLite. However, this
cache only applies to flake-based workflows. Direct imports like
`import ./foo.nix` do not benefit from the cache and re-parse on each
invocation.
For example, a NixOS configuration using direct imports of `nixpkgs.lib`
re-parses source files on every rebuild. Nix's parser front-end accounts for
substantial wall-clock time before evaluation begins.
Precompiled IR eliminates, or rather, attempts to eliminate this cost. A
`.nixir` bundle contains serialized AST nodes with all variable names converted
to numeric indices. Loading skips parsing entirely and begins directly with the
VM executing pre-processed code.
The project _also_ serves as an implementation study. I say also, but it is
actually the main goal of this project. Reimplementing Nix's evaluation
semantics reveals details that the upstream C++ code obscures. The thunk
mechanism, environment model, and cycle detection become tangible when you can
read and step through the implementation. I don't expect to get a better
understanding of the Nix language, but I now have more reasons to badmouth it.
## The IR Format
The binary format uses a 36-byte fixed header followed by variable-length
sections. All multi-byte integers use little-endian byte order.
The header layout:
```plaintext
0x00-0x03: Magic identifier, value 0x4E495258
0x04-0x07: Version number, currently 2
0x08-0x0B: Flags field, reserved
0x0C-0x0F: Offset to string table
0x10-0x13: Offset to primop table
0x14-0x17: Offset to IR blob
0x18-0x1B: String count
0x1C-0x1F: Primop count
0x20-0x23: Reserved
```
The magic value `0x4E495258` corresponds to the bytes N I R X when read in
big-endian order.
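
A minimal sketch of reading this header, written in Python purely for illustration (it is not Nixir's own code). It assumes all nine fields are unsigned 32-bit integers and that the magic is stored little-endian like every other field, in which case the first four bytes on disk read `XRIN`. The names `parse_header` and `FIELDS` are hypothetical.

```python
import struct

HEADER_SIZE = 36
MAGIC = 0x4E495258  # spells "NIRX" when the value is written big-endian

# Field names are illustrative; the order follows the header layout above.
FIELDS = (
    "magic", "version", "flags",
    "string_table_offset", "primop_table_offset", "ir_blob_offset",
    "string_count", "primop_count", "reserved",
)


def parse_header(data: bytes) -> dict:
    """Decode the fixed header into a dict of nine 32-bit fields."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated header")
    # "<9I": little-endian, nine unsigned 32-bit integers (assumed unsigned).
    values = struct.unpack("<9I", data[:HEADER_SIZE])
    header = dict(zip(FIELDS, values))
    if header["magic"] != MAGIC:
        raise ValueError(f"bad magic: 0x{header['magic']:08X}")
    return header
```

Validating the magic before touching any offsets keeps a mistyped path from being interpreted as a bundle.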
The string table follows the header. Each entry encodes length as a varint, then
that many UTF-8 bytes. All attribute names, identifiers, and string literals in
the source are de-duplicated at compile time and stored here. References
throughout the IR use indices into this table rather than inline strings.
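
The entry layout can be sketched as follows. The text does not say which varint encoding is used, so this assumes unsigned LEB128 (7 payload bits per byte, high bit set on every byte except the last). Both function names are hypothetical.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:  # high bit clear marks the final byte
            return value, pos
        shift += 7


def read_string_table(buf: bytes, offset: int, count: int) -> list[str]:
    """Read `count` length-prefixed UTF-8 strings starting at `offset`."""
    strings = []
    pos = offset
    for _ in range(count):
        length, pos = read_varint(buf, pos)
        strings.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return strings
```

Under that assumption, `offset` and `count` would come from the header's string table offset and string count fields.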
The primop table defines built-in operations. Each entry contains the string
table index for the operation name, its arity, and optional flags. This table
enables the VM to dispatch operations by index without string comparison.
The IR blob contains the actual program. Each node begins with a type byte
followed by type-specific payload.
Node type enumeration from the source:
```plaintext
0x01: CONST_INT - Signed 64-bit integer
0x02: CONST_STRING - String table index
0x03: CONST_PATH - String table index
0x04: CONST_BOOL - 0x00 or 0x01
0x05: CONST_NULL - No payload
0x06: CONST_FLOAT - IEEE 754 double
0x07: CONST_URI - String table index
0x08: CONST_LOOKUP_PATH - String table index for <nixpkgs>
0x10: VAR - Two varints: depth and index
0x20: LAMBDA - Arity and body offset
0x21: APP - Function and argument offsets
0x22: BINARY_OP - Operation enum and operands
0x23: UNARY_OP - Operation enum and operand
0x24: IMPORT - String table index for file path
0x30: ATTRSET - Count and recursive flag
0x31: SELECT - Expression, attribute, optional default
0x32: WITH - Attribute set and body offsets
0x33: LIST - Count and element offsets
0x34: HAS_ATTR - Expression and attribute
0x40: IF - Condition, then, and else offsets
0x50: LET - Binding count and body offset
0x51: LETREC - Binding count and body offset
0x52: ASSERT - Condition and body offsets
0x60: THUNK - Expression offset
0x61: FORCE - Expression offset
0xFF: ERROR - Error marker
```
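
For readers writing their own tooling around the format, the table above translates directly into an enumeration. This is a sketch in Python rather than the project's own definition; `node_type_at` is a hypothetical helper, and payload decoding is left out because the table gives payload contents but not their exact encoding.

```python
from enum import IntEnum


class NodeType(IntEnum):
    CONST_INT = 0x01
    CONST_STRING = 0x02
    CONST_PATH = 0x03
    CONST_BOOL = 0x04
    CONST_NULL = 0x05
    CONST_FLOAT = 0x06
    CONST_URI = 0x07
    CONST_LOOKUP_PATH = 0x08
    VAR = 0x10
    LAMBDA = 0x20
    APP = 0x21
    BINARY_OP = 0x22
    UNARY_OP = 0x23
    IMPORT = 0x24
    ATTRSET = 0x30
    SELECT = 0x31
    WITH = 0x32
    LIST = 0x33
    HAS_ATTR = 0x34
    IF = 0x40
    LET = 0x50
    LETREC = 0x51
    ASSERT = 0x52
    THUNK = 0x60
    FORCE = 0x61
    ERROR = 0xFF


def node_type_at(blob: bytes, pos: int) -> NodeType:
    """Interpret the byte at `pos` as a node type, rejecting unknown values."""
    try:
        return NodeType(blob[pos])
    except ValueError:
        raise ValueError(
            f"unknown node type 0x{blob[pos]:02X} at offset {pos}"
        ) from None
```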
Binary operations supported:
```plaintext
ADD, SUB, MUL, DIV - Arithmetic on integers (ADD also accepts strings and paths)
CONCAT - List concatenation (++)
EQ, NE - Equality comparison
LT, GT, LE, GE - Ordering comparison
AND, OR, IMPL - Boolean logic
MERGE - Attribute set override (//)
```
## Variable Representation
The compiler converts variable names to De Bruijn indices during IR generation.
Rather than storing strings like "x" in the output, each variable reference
encodes two numbers: the lexical depth and the position within that scope.
The depth counts how many scope boundaries lie between the reference and the
scope that binds the variable. A variable bound in the scope where it is used
has depth zero. A variable referenced from inside one lambda but bound in the
enclosing scope has depth one.
The index indicates the position in that scope's environment array. The first
bound variable in a scope has index zero, the second has index one, and so
forth.
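
A compile-time resolver for this scheme might look like the sketch below. It assumes depth is counted outward from the reference, so depth zero means "bound in the innermost scope", and it represents scopes as a plain list of name lists. Both choices, and the name `resolve`, are illustrative.

```python
def resolve(scopes: list[list[str]], name: str) -> tuple[int, int]:
    """Return (depth, index) for `name`; scopes[-1] is the innermost scope."""
    # Walk outward from the innermost scope so inner bindings shadow outer ones.
    for depth, scope in enumerate(reversed(scopes)):
        if name in scope:
            return depth, scope.index(name)
    raise NameError(f"undefined variable '{name}'")
```

For `x: y: z: y`, modelled as three nested one-variable scopes, `y` resolves to depth one, index zero.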
During evaluation, the VM combines these two numbers into a single 32-bit value
where the high 16 bits encode depth and the low 16 bits encode index. Lookup
traverses the environment chain depth times, then indexes into the resulting
scope's binding array. Resolution therefore costs O(depth) pointer hops plus one
array access, with no string comparison involved.
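
The packing itself is plain bit arithmetic. A sketch with hypothetical helper names:

```python
def pack_var(depth: int, index: int) -> int:
    """Pack depth into the high 16 bits and index into the low 16 bits."""
    if not (0 <= depth <= 0xFFFF and 0 <= index <= 0xFFFF):
        raise ValueError("depth and index must each fit in 16 bits")
    return (depth << 16) | index


def unpack_var(packed: int) -> tuple[int, int]:
    """Recover (depth, index) from a packed 32-bit variable reference."""
    return packed >> 16, packed & 0xFFFF
```

For example, depth 1 and index 2 pack to `0x00010002`.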
## The Virtual Machine
The VM implements lazy evaluation using an explicit thunk mechanism. Every
unevaluated expression and function argument is wrapped in a Thunk structure
containing the expression AST node and a pointer to the captured environment.
When the VM needs a value, it calls `force()` on the thunk. The force operation
first checks whether the thunk is already being evaluated. Forcing a thunk that
is still mid-evaluation means the expression depends on its own result, so the
VM reports the cycle by raising "infinite recursion encountered". This matches
Nix's behavior for recursive definitions.
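
The cycle check can be sketched with a three-state thunk. This is an illustration, not the plugin's implementation: a Python closure stands in for the (AST node, environment) pair, and caching the result after the first force is an assumption based on ordinary call-by-need semantics, since the text above only describes the cycle check.

```python
class InfiniteRecursion(Exception):
    pass


class Thunk:
    def __init__(self, compute):
        self._compute = compute   # stand-in for expression + captured env
        self._state = "pending"   # "pending" -> "evaluating" -> "done"
        self._value = None

    def force(self):
        if self._state == "done":
            return self._value
        if self._state == "evaluating":
            # Re-entered while still computing: the value depends on itself.
            raise InfiniteRecursion("infinite recursion encountered")
        self._state = "evaluating"
        try:
            self._value = self._compute()
        except BaseException:
            self._state = "pending"  # leave the thunk retryable after an error
            raise
        self._state = "done"
        self._compute = None         # drop the closure once the value is cached
        return self._value
```

Marking the thunk before evaluating it is what turns a would-be stack overflow into a clean error.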
The environment structure is an array-based chain. Each scope holds a pointer to
its parent scope and a vector of bound values. Looking up a variable traverses
parent pointers until reaching the scope at the correct depth, then indexes into
that scope's value array. This replaces string comparison with pointer traversal
and array indexing.
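
A sketch of that structure, with a hypothetical `Env` class holding plain Python values where the VM would hold thunks:

```python
class Env:
    def __init__(self, values, parent=None):
        self.values = list(values)  # bindings of this scope, by index
        self.parent = parent        # enclosing scope, or None at the top

    def lookup(self, depth: int, index: int):
        """Hop `depth` parents up the chain, then index into that scope."""
        scope = self
        for _ in range(depth):
            if scope.parent is None:
                raise LookupError("depth exceeds environment chain")
            scope = scope.parent
        return scope.values[index]
```

With `outer = Env([10, 20])` and `inner = Env([30], outer)`, `inner.lookup(1, 1)` walks one hop up and returns `20`.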
Function application follows currying. When applying a function to an argument,
the VM checks whether the function's arity is satisfied. If yes, it extends the
environment with the new binding and evaluates the body. If not, it returns a
partial application awaiting additional arguments.
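
The arity check can be sketched as below. The names `Lambda`, `Partial`, and `apply` are hypothetical, and a Python callable stands in for "extend the environment and evaluate the body".

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Lambda:
    arity: int
    body: Callable[..., Any]  # stand-in for the compiled body


@dataclass(frozen=True)
class Partial:
    func: Lambda
    args: tuple  # arguments collected so far


def apply(func, arg):
    """Apply one argument; run the body only once the arity is satisfied."""
    if isinstance(func, Partial):
        target, args = func.func, func.args + (arg,)
    elif isinstance(func, Lambda):
        target, args = func, (arg,)
    else:
        raise TypeError("attempt to call something which is not a function")
    if len(args) == target.arity:
        return target.body(*args)
    return Partial(target, args)
```

Because `Partial` is immutable, the same partial application can safely be applied to different final arguments.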
The evaluator handles binary operations with type-specific dispatch. Addition
supports integers, strings, and paths with appropriate type coercion rules.
Comparison operators work on integers and strings. The merge operator combines
two attribute sets with right-side precedence.
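
A rough sketch of type-specific dispatch for two of these operators. Python ints, strs, and dicts stand in for Nix integers, strings, and attribute sets. Path handling is left out because the exact coercion rules are not spelled out here, and all names are hypothetical.

```python
def is_int(value) -> bool:
    # bool is a subclass of int in Python; Nix keeps the two types distinct.
    return isinstance(value, int) and not isinstance(value, bool)


def binary_add(lhs, rhs):
    if is_int(lhs) and is_int(rhs):
        return lhs + rhs
    if isinstance(lhs, str) and isinstance(rhs, str):
        return lhs + rhs
    raise TypeError(
        f"cannot add {type(lhs).__name__} and {type(rhs).__name__}"
    )


def binary_merge(lhs, rhs):
    if not (isinstance(lhs, dict) and isinstance(rhs, dict)):
        raise TypeError("// expects two attribute sets")
    return {**lhs, **rhs}  # keys from the right-hand side win


BINARY_OPS = {"ADD": binary_add, "MERGE": binary_merge}


def eval_binary(op: str, lhs, rhs):
    """Dispatch on the operation first, then on operand types."""
    return BINARY_OPS[op](lhs, rhs)
```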
## Plugin Primops
The plugin registers three primops through Nix's `RegisterPrimOp` interface:
`__nixIR_loadIR` accepts a file path string, deserializes the `.nixir` bundle,
evaluates the entry expression, and returns the resulting value. The VM measures
deserialization time and evaluation time separately, printing timing data to
stderr.
`__nixIR_compile` accepts a string containing Nix source code, parses it
in-memory, generates IR, and evaluates the result. This enables runtime
compilation without external tooling.
`__nixIR_info` returns an attribute set containing the plugin name
"nix-ir-plugin", version "0.1.0", and status "runtime-active". This is a
development-only primop that will be removed eventually.
The primops use the double-underscore prefix internally. Users access them
through `builtins.nixIR_loadIR`, `builtins.nixIR_compile`, and
`builtins.nixIR_info` in their expressions.
## Import Handling
The compiler performs static import resolution when the import path meets
specific conditions. The path must be a string literal in the source, not an
interpolation or variable. The path must not use home directory expansion. The
resolved path must remain within the project root for security. The target file
must exist and be readable at compile time.
When these conditions hold, the compiler reads the imported file, recursively
processes its imports, and embeds the resulting IR into the output bundle. The
final `.nixir` file is self-contained and requires no additional file lookups at
load time.
When conditions do not hold, the compiler records the import as dynamic and
emits an IMPORT node containing the string table index. At runtime, the VM
evaluates the import expression to obtain the actual file path, then uses Nix's
standard evaluator to load that file.
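
The static-resolution conditions can be sketched as a single predicate. This is an approximation under stated assumptions: the "must be a literal" check happens in the parser, so the function receives an already-extracted literal. It also assumes relative paths resolve against the importing file's directory, and that "within the project root" is judged after resolving symlinks. The function name and parameters are hypothetical.

```python
import os


def is_static_import(path_literal: str, importing_file: str,
                     project_root: str) -> bool:
    """Return True if an import with this literal could be inlined."""
    if path_literal.startswith("~"):
        return False  # home directory expansion is never resolved statically
    base = os.path.dirname(os.path.abspath(importing_file))
    resolved = os.path.realpath(os.path.join(base, path_literal))
    root = os.path.realpath(project_root)
    if os.path.commonpath([resolved, root]) != root:
        return False  # escapes the project root
    return os.path.isfile(resolved) and os.access(resolved, os.R_OK)
```

Any import that fails the predicate would take the dynamic route and become an IMPORT node.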
## What Works And What Does Not
The implementation covers a substantial subset of Nix's expression language.
Literals work across all types including integers, floats, strings, paths, URIs,
booleans, and null. Lambda expressions, function application, and currying are
implemented. Attribute sets with both static and dynamic keys are supported. The
`let` and `letrec` forms work with proper recursive binding semantics. The `if`
expression, `assert` expression, `with` expression, and list literals are all
functional.
The implementation does not cover derivations, builtins other than those
required for basic operation, or the full module system. These require
integration with Nix's store and download mechanisms that the VM does not
replicate.
## Building And Using
Create a build directory and configure with CMake:
```
cmake -B build -G Ninja
cmake --build build
```
This produces `nix-irc` in the build directory and `nix-ir-plugin.so` in the
project root.
Compile a Nix file to IR:
```
./build/nix-irc input.nix output.nixir
```
Load and evaluate the compiled bundle through Nix:
```
nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_loadIR "output.nixir"'
```
Compile and evaluate source at runtime:
```
nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_compile "1 + 2"'
```