From 56f15d749edc0a1db07b6438d031d82c65a20f46 Mon Sep 17 00:00:00 2001
From: NotAShelf
Date: Mon, 23 Feb 2026 02:26:05 +0300
Subject: [PATCH] docs: initial specification; we yap

Signed-off-by: NotAShelf
Change-Id: I885e6317d186ccdc847195957dba4ab26a6a6964
---
 docs/SPEC.md | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)
 create mode 100644 docs/SPEC.md

diff --git a/docs/SPEC.md b/docs/SPEC.md
new file mode 100644
index 0000000..f948679
--- /dev/null
+++ b/docs/SPEC.md
@@ -0,0 +1,265 @@
+# Nixir Technical Specification
+
+This is a distillation of my personal notes on my "research" within the Nix
+codebase and the subsequent design notes on Nixir. While some of those,
+naturally, belong in the README I have elected to compile a list of noteworthy
+details into a "specification document" for those possibly interested, for some
+reason, in integrating with Nixir.
+
+Beware, here be observations.
+
+## What This Project Is
+
+Nixir is, most simply (and elegantly) put, a Nix compiler _and runtime_ packaged
+as a plugin. The compiler component compiles a subset of Nix source to a custom
+binary intermediate representation (IR) and then executes IR inside a virtual
+machine running within the plugin process. Hence it's called Nix-ir.
+
+As you might've caught on from the README already, the project consists of two
+artifacts: a standalone compiler tool called `nix-irc` that transforms `.nix`
+files into `.nixir` bundles, and a plugin library (`nix-ir-plugin.so`) that Nix
+loads to provide three primops for interacting with compiled IR.
+
+The architecture handles the full compilation pipeline. Static imports are
+resolved at compile time and inlined into the output bundle, while the compiled
+VM handles all evaluation at runtime. This mirrors how Nixpkgs itself
+distinguishes between stable library code and application-specific expressions.
+
+The plugin does not intercept evaluation automatically.
+Instead, it exposes primops that users invoke explicitly. This design exists
+because Nix's plugin API does not provide hooks into the core evaluation loop.
+Unfortunate, but 'tis life.
+
+## Why Compile Nix
+
+Every invocation of `nix eval` or `nix build` must parse and evaluate
+expressions from scratch. For large codebases, this overhead is measurable.
+
+Nix does provide a persistent evaluation cache, stored in SQLite. However, this
+cache only applies to flake-based workflows. Direct imports like
+`import ./foo.nix` do not benefit from the cache and re-parse on each
+invocation.
+
+For example, a NixOS configuration using direct imports to `nixpkgs.lib`
+re-parses source files on every rebuild. The compiler front-end accounts for
+substantial wall-clock time before evaluation begins.
+
+Precompiled IR eliminates, or rather, attempts to eliminate this cost. A
+`.nixir` bundle contains serialized AST nodes with all variable names converted
+to numeric indices. Loading skips parsing entirely and begins directly with the
+VM executing pre-processed code.
+
+The project _also_ serves as an implementation study. I say also, but it is
+actually the main goal of this project. Reimplementing Nix's evaluation
+semantics reveals details that the upstream C++ code obscures. The thunk
+mechanism, environment model, and cycle detection become tangible when you can
+read and step through the implementation. I don't expect to get a better
+understanding of the Nix language, but I now have more reasons to badmouth it.
+
+## The IR Format
+
+The binary format uses a 36-byte fixed header followed by variable-length
+sections. All multi-byte integers use little-endian byte order.
+
+The header layout:
+
+```plaintext
+0x00-0x03: Magic identifier, value 0x4E495258
+0x04-0x07: Version number, currently 2
+0x08-0x0B: Flags field, reserved
+0x0C-0x0F: Offset to string table
+0x10-0x13: Offset to primop table
+0x14-0x17: Offset to IR blob
+0x18-0x1B: String count
+0x1C-0x1F: Primop count
+0x20-0x23: Reserved
+```
+
+The magic value `0x4E495258` corresponds to the bytes N I R X when read in
+big-endian order.
+
+The string table follows the header. Each entry encodes length as a varint, then
+that many UTF-8 bytes. All attribute names, identifiers, and string literals in
+the source are de-duplicated at compile time and stored here. References
+throughout the IR use indices into this table rather than inline strings.
+
+The primop table defines built-in operations. Each entry contains the string
+table index for the operation name, its arity, and optional flags. This table
+enables the VM to dispatch operations by index without string comparison.
+
+The IR blob contains the actual program. Each node begins with a type byte
+followed by type-specific payload.
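A minimal sketch of reading the header and string table described above, written in Python for brevity rather than in the project's own language. Several things here are assumptions and not taken from the Nixir source: the varints are treated as unsigned LEB128, offsets are measured from the start of the file, the magic is stored as a little-endian 32-bit integer like every other field, and the names `read_varint`, `parse_bundle`, and `build_sample` are made up for illustration.

```python
import struct

# Nine little-endian u32 fields, matching the 36-byte layout above.
HEADER = struct.Struct("<9I")
MAGIC = 0x4E495258


def read_varint(buf, pos):
    """Decode one unsigned LEB128 varint (assumed encoding).

    Returns (value, position after the varint).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return value, pos
        shift += 7


def parse_bundle(buf):
    """Parse the fixed header, then walk the string table."""
    (magic, version, flags, str_off, primop_off, ir_off,
     str_count, primop_count, _reserved) = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a .nixir bundle: bad magic")

    strings = []
    pos = str_off  # assumption: offsets are relative to the file start
    for _ in range(str_count):
        length, pos = read_varint(buf, pos)
        strings.append(buf[pos:pos + length].decode("utf-8"))
        pos += length

    return {
        "version": version,
        "flags": flags,
        "strings": strings,
        "primop_offset": primop_off,
        "primop_count": primop_count,
        "ir_offset": ir_off,
    }


def build_sample():
    """Build a tiny hypothetical bundle: header plus two strings, no IR."""
    table = b"".join(bytes([len(s)]) + s for s in (b"x", b"foo"))
    end = HEADER.size + len(table)
    header = HEADER.pack(MAGIC, 2, 0, HEADER.size, end, end, 2, 0, 0)
    return header + table


bundle = parse_bundle(build_sample())
print(bundle["version"], bundle["strings"])
```

Keeping the offsets in the header means a reader can jump straight to the IR blob and only touch the string table when it needs a name.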
+
+Node type enumeration from the source:
+
+```plaintext
+0x01: CONST_INT - Signed 64-bit integer
+0x02: CONST_STRING - String table index
+0x03: CONST_PATH - String table index
+0x04: CONST_BOOL - 0x00 or 0x01
+0x05: CONST_NULL - No payload
+0x06: CONST_FLOAT - IEEE 754 double
+0x07: CONST_URI - String table index
+0x08: CONST_LOOKUP_PATH - String table index for
+0x10: VAR - Two varints: depth and index
+0x20: LAMBDA - Arity and body offset
+0x21: APP - Function and argument offsets
+0x22: BINARY_OP - Operation enum and operands
+0x23: UNARY_OP - Operation enum and operand
+0x24: IMPORT - String table index for file path
+0x30: ATTRSET - Count and recursive flag
+0x31: SELECT - Expression, attribute, optional default
+0x32: WITH - Attribute set and body offsets
+0x33: LIST - Count and element offsets
+0x34: HAS_ATTR - Expression and attribute
+0x40: IF - Condition, then, and else offsets
+0x50: LET - Binding count and body offset
+0x51: LETREC - Binding count and body offset
+0x52: ASSERT - Condition and body offsets
+0x60: THUNK - Expression offset
+0x61: FORCE - Expression offset
+0xFF: ERROR - Error marker
+```
+
+Binary operations supported:
+
+```plaintext
+ADD, SUB, MUL, DIV - Arithmetic on integers
+CONCAT - List concatenation (++)
+EQ, NE - Equality comparison
+LT, GT, LE, GE - Ordering comparison
+AND, OR, IMPL - Boolean logic
+MERGE - Attribute set override (//)
+```
+
+## Variable Representation
+
+The compiler converts variable names to De Bruijn indices during IR generation.
+Rather than storing strings like "x" in the output, each variable reference
+encodes two numbers: the lexical depth and the position within that scope.
+
+The depth indicates how many lambda boundaries enclose the reference. A variable
+in the outermost scope has depth zero. A variable referenced from inside one
+lambda that refers to the outer scope has depth one.
+
+The index indicates the position in that scope's environment array.
+The first bound variable in a scope has index zero, the second has index one,
+and so forth.
+
+During evaluation, the VM combines these two numbers into a single 32-bit value
+where the high 16 bits encode depth and the low 16 bits encode index. Lookup
+traverses the environment chain depth times, then indexes into the resulting
+scope's binding array. Resolution therefore costs one pointer hop per level of
+depth plus a single array index, with no name comparison.
+
+## The Virtual Machine
+
+The VM implements lazy evaluation using an explicit thunk mechanism. Every
+unevaluated expression and function argument is wrapped in a Thunk structure
+containing the expression AST node and a pointer to the captured environment.
+
+When the VM needs a value, it calls `force()` on the thunk. The force operation
+checks whether the thunk is already being evaluated. If evaluation attempts to
+force a thunk that is currently evaluating, the VM detects the cycle and raises
+"infinite recursion encountered". This matches Nix's behavior for recursive
+definitions.
+
+The environment structure is an array-based chain. Each scope holds a pointer to
+its parent scope and a vector of bound values. Looking up a variable traverses
+parent pointers until reaching the scope at the correct depth, then indexes into
+that scope's value array. This replaces string comparison with pointer traversal
+and array indexing.
+
+Function application follows currying. When applying a function to an argument,
+the VM checks whether the function's arity is satisfied. If yes, it extends the
+environment with the new binding and evaluates the body. If not, it returns a
+partial application awaiting additional arguments.
+
+The evaluator handles binary operations with type-specific dispatch. Addition
+supports integers, strings, and paths with appropriate type coercion rules.
+Comparison operators work on integers and strings. The merge operator combines
+two attribute sets with right-side precedence.
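The environment chain and the thunk check described in this section can be sketched roughly as follows. This is an illustration in Python, not the plugin's implementation: `Env`, `pack_var`, `lookup`, `Thunk`, and `InfiniteRecursion` are hypothetical names, a plain callable stands in for the (AST node, captured environment) pair, and caching the forced value is an assumption the text above does not spell out.

```python
class InfiniteRecursion(Exception):
    """Raised when a thunk is forced while it is already being evaluated."""


class Env:
    """One scope: a vector of bound values plus a pointer to the parent scope."""

    def __init__(self, values, parent=None):
        self.values = values
        self.parent = parent


def pack_var(depth, index):
    """Pack depth into the high 16 bits and index into the low 16 bits."""
    if not (0 <= depth < 1 << 16 and 0 <= index < 1 << 16):
        raise ValueError("depth and index must each fit in 16 bits")
    return (depth << 16) | index


def lookup(env, packed):
    """Follow `depth` parent pointers, then index into that scope's values."""
    depth, index = packed >> 16, packed & 0xFFFF
    for _ in range(depth):
        env = env.parent
    return env.values[index]


class Thunk:
    PENDING, EVALUATING, DONE = range(3)

    def __init__(self, compute):
        self.compute = compute  # stand-in for (AST node, captured env)
        self.state = Thunk.PENDING
        self.value = None

    def force(self):
        if self.state == Thunk.DONE:
            return self.value  # assumed: results are cached after first force
        if self.state == Thunk.EVALUATING:
            raise InfiniteRecursion("infinite recursion encountered")
        self.state = Thunk.EVALUATING
        try:
            self.value = self.compute()
        except BaseException:
            self.state = Thunk.PENDING  # leave the thunk retryable on error
            raise
        self.state = Thunk.DONE
        self.compute = None  # drop the closure once the value is known
        return self.value


outer = Env([Thunk(lambda: 1), Thunk(lambda: 2)])
inner = Env([Thunk(lambda: 40)], parent=outer)

# depth 0 resolves in `inner`, depth 1 resolves one scope up in `outer`.
total = lookup(inner, pack_var(0, 0)).force() + lookup(inner, pack_var(1, 1)).force()
print(total)

# A thunk whose value depends on itself, roughly `let x = x; in x`.
cell = []
loop = Thunk(lambda: cell[0].force())
cell.append(loop)
try:
    loop.force()
except InfiniteRecursion as err:
    print(err)
```

Marking the thunk as evaluating before running it is what turns a self-referential binding into an error rather than unbounded recursion.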
+
+## Plugin Primops
+
+The plugin registers three primops through Nix's `RegisterPrimOp` interface:
+
+`__nixIR_loadIR` accepts a file path string, deserializes the `.nixir` bundle,
+evaluates the entry expression, and returns the resulting value. The VM measures
+deserialization time and evaluation time separately, printing timing data to
+stderr.
+
+`__nixIR_compile` accepts a string containing Nix source code, parses it
+in memory, generates IR, and evaluates the result. This enables runtime
+compilation without external tooling.
+
+`__nixIR_info` returns an attribute set containing the plugin name
+"nix-ir-plugin", version "0.1.0", and status "runtime-active". This is a
+development-only primop that will be removed eventually.
+
+The primops use the double-underscore prefix internally. Users access them
+through `builtins.nixIR_loadIR`, `builtins.nixIR_compile`, and
+`builtins.nixIR_info` in their expressions.
+
+## Import Handling
+
+The compiler performs static import resolution when the import path meets
+specific conditions. The path must be a string literal in the source, not an
+interpolation or variable. The path must not use home directory expansion. The
+resolved path must remain within the project root for security. The target file
+must exist and be readable at compile time.
+
+When these conditions hold, the compiler reads the imported file, recursively
+processes its imports, and embeds the resulting IR into the output bundle. The
+final `.nixir` file is self-contained and requires no additional file lookups at
+load time.
+
+When conditions do not hold, the compiler records the import as dynamic and
+emits an IMPORT node containing the string table index. At runtime, the VM
+evaluates the import expression to obtain the actual file path, then uses Nix's
+standard evaluator to load that file.
+
+## What Works And What Does Not
+
+The implementation covers a substantial subset of Nix's expression language.
+Literals work across all types including integers, floats, strings, paths, URIs,
+booleans, and null. Lambda expressions, function application, and currying are
+implemented. Attribute sets with both static and dynamic keys are supported. The
+let and letrec forms work with proper recursive binding semantics. The if
+expression, assert expression, with expression, and list literals are all
+functional.
+
+The implementation does not cover derivations, builtins other than those
+required for basic operation, or the full module system. These require
+integration with Nix's store and download mechanisms that the VM does not
+replicate.
+
+## Building And Using
+
+Create a build directory and configure with CMake:
+
+```
+cmake -B build -G Ninja
+cmake --build build
+```
+
+This produces `nix-irc` in the build directory and `nix-ir-plugin.so` in the
+project root.
+
+Compile a Nix file to IR:
+
+```
+./build/nix-irc input.nix output.nixir
+```
+
+Load and evaluate the compiled bundle through Nix:
+
+```
+nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_loadIR "output.nixir"'
+```
+
+Compile and evaluate source at runtime:
+
+```
+nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_compile "1 + 2"'
+```