docs: initial specification; we yap
Signed-off-by: NotAShelf <raf@notashelf.dev> Change-Id: I885e6317d186ccdc847195957dba4ab26a6a6964
parent
14bbc09280
commit
56f15d749e
1 changed files with 265 additions and 0 deletions
265
docs/SPEC.md
Normal file
@ -0,0 +1,265 @@
# Nixir Technical Specification

This is a distillation of my personal notes on my "research" within the Nix
codebase and the subsequent design notes on Nixir. While some of those,
naturally, belong in the README I have elected to compile a list of noteworthy
details into a "specification document" for those possibly interested, for some
reason, in integrating with Nixir.

Beware, here be observations.
## What This Project Is

Nixir is, most simply (and elegantly) put, a Nix compiler _and runtime_ packaged
as a plugin. The compiler component compiles a subset of Nix source to a custom
binary intermediate representation (IR); the runtime then executes that IR
inside a virtual machine running within the plugin process. Hence it's called
Nix-ir.

As you might've caught on from the README already, the project consists of two
artifacts: a standalone compiler tool called `nix-irc` that transforms `.nix`
files into `.nixir` bundles, and a plugin library (`nix-ir-plugin.so`) that Nix
loads to provide three primops for interacting with compiled IR.

The architecture handles the full compilation pipeline. Static imports are
resolved at compile time and inlined into the output bundle, while the VM
handles all evaluation at runtime. This mirrors how Nixpkgs itself
distinguishes between stable library code and application-specific expressions.

The plugin does not intercept evaluation automatically. Instead, it exposes
primops that users invoke explicitly. This design exists because Nix's plugin
API does not provide hooks into the core evaluation loop. Unfortunate, but 'tis
life.
## Why Compile Nix

Every invocation of `nix eval` or `nix build` must parse and evaluate
expressions from scratch. (Nix is dynamically typed, so there is no separate
type-checking pass to skip.) For large codebases, this overhead is measurable.

Nix does provide a persistent evaluation cache, stored in SQLite. However, this
cache only applies to flake-based workflows. Direct imports like
`import ./foo.nix` do not benefit from the cache and re-parse on each
invocation.

For example, a NixOS configuration using direct imports of `nixpkgs.lib`
re-parses source files on every rebuild. Nix's parsing front-end accounts for
substantial wall-clock time before evaluation begins.

Precompiled IR eliminates, or rather, attempts to eliminate this cost. A
`.nixir` bundle contains serialized AST nodes with all variable names converted
to numeric indices. Loading skips parsing entirely and begins directly with the
VM executing pre-processed code.

The project _also_ serves as an implementation study. I say also, but it is
actually the main goal of this project. Reimplementing Nix's evaluation
semantics reveals details that the upstream C++ code obscures. The thunk
mechanism, environment model, and cycle detection become tangible when you can
read and step through the implementation. I don't expect to get a better
understanding of the Nix language, but I now have more reasons to badmouth it.
## The IR Format

The binary format uses a 36-byte fixed header followed by variable-length
sections. All multi-byte integers use little-endian byte order.

The header layout:

```plaintext
0x00-0x03: Magic identifier, value 0x4E495258
0x04-0x07: Version number, currently 2
0x08-0x0B: Flags field, reserved
0x0C-0x0F: Offset to string table
0x10-0x13: Offset to primop table
0x14-0x17: Offset to IR blob
0x18-0x1B: String count
0x1C-0x1F: Primop count
0x20-0x23: Reserved
```

The magic value `0x4E495258` corresponds to the bytes N I R X when read in
big-endian order.
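To make the layout concrete, here is a minimal Python sketch of a header reader. It is not Nixir's own loader: the field names and the `parse_header` helper are hypothetical, and it assumes the magic is stored as a little-endian 32-bit integer like every other field (if the implementation writes the magic as raw bytes instead, the on-disk order would differ).

```python
import struct
from typing import NamedTuple

MAGIC = 0x4E495258  # "NIRX" when the four bytes are read big-endian
HEADER = struct.Struct("<9I")  # nine little-endian u32 fields = 36 bytes


class Header(NamedTuple):
    # Field names are illustrative; only the offsets come from the layout above.
    magic: int
    version: int
    flags: int
    string_table_offset: int
    primop_table_offset: int
    ir_blob_offset: int
    string_count: int
    primop_count: int
    reserved: int


def parse_header(data: bytes) -> Header:
    """Decode the fixed header, rejecting short input and a wrong magic."""
    if len(data) < HEADER.size:
        raise ValueError("bundle is shorter than the 36-byte header")
    header = Header(*HEADER.unpack_from(data, 0))
    if header.magic != MAGIC:
        raise ValueError(f"bad magic: {header.magic:#010x}")
    return header


# Synthetic header for illustration only; the offsets and counts are made up.
sample = HEADER.pack(MAGIC, 2, 0, 36, 40, 44, 1, 0, 0)
print(parse_header(sample))
print(sample[:4])  # byte order of the magic under little-endian storage
```

Under the little-endian assumption, the first four bytes of a bundle read `X R I N` in a hex dump, which is worth knowing before concluding that a file is corrupt.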

The string table follows the header. Each entry encodes length as a varint, then
that many UTF-8 bytes. All attribute names, identifiers, and string literals in
the source are de-duplicated at compile time and stored here. References
throughout the IR use indices into this table rather than inline strings.
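The entry format can be sketched as follows. The varint flavor is not stated here, so this example assumes unsigned LEB128 (7 payload bits per byte, high bit as continuation flag); treat that as a guess to verify against the serializer. `read_varint`, `read_string_table`, and `encode_entry` are hypothetical helper names.

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def read_string_table(buf: bytes, offset: int, count: int) -> List[str]:
    """Read `count` length-prefixed UTF-8 strings starting at `offset`."""
    strings = []
    pos = offset
    for _ in range(count):
        length, pos = read_varint(buf, pos)
        end = pos + length
        if end > len(buf):
            raise ValueError("string entry runs past the end of the buffer")
        strings.append(buf[pos:end].decode("utf-8"))
        pos = end
    return strings


def encode_entry(text: str) -> bytes:
    """Inverse of one table entry, used here only to build sample data."""
    raw = text.encode("utf-8")
    prefix = bytearray()
    remaining = len(raw)
    while True:
        low = remaining & 0x7F
        remaining >>= 7
        if remaining:
            prefix.append(low | 0x80)  # more length bytes follow
        else:
            prefix.append(low)
            break
    return bytes(prefix) + raw


table = encode_entry("x") + encode_entry("nixpkgs") + encode_entry("a" * 200)
print(read_string_table(table, 0, 3)[:2])
```

The 200-byte entry exercises the two-byte length prefix; everything shorter than 128 bytes uses a single length byte.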

The primop table defines built-in operations. Each entry contains the string
table index for the operation name, its arity, and optional flags. This table
enables the VM to dispatch operations by index without string comparison.
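One way to picture index-based dispatch: names are resolved once when the bundle loads, and every later call is a plain list index. The sketch below is an assumption-heavy illustration, not the plugin's code. `PrimopEntry`, `link_primops`, `call_primop`, and the two sample operations are hypothetical, and the byte-level encoding of a table entry is deliberately left out because it is not described here.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class PrimopEntry(NamedTuple):
    name_index: int  # index into the string table
    arity: int
    flags: int


# Hypothetical implementations keyed by name; the real set lives in the plugin.
IMPLEMENTATIONS = {
    "add": lambda a, b: a + b,
    "length": lambda xs: len(xs),
}


def link_primops(
    strings: Sequence[str], entries: Sequence[PrimopEntry]
) -> List[Tuple[int, Callable]]:
    """Resolve each entry's name once, at load time."""
    linked = []
    for entry in entries:
        name = strings[entry.name_index]
        linked.append((entry.arity, IMPLEMENTATIONS[name]))
    return linked


def call_primop(linked: List[Tuple[int, Callable]], index: int, args: list):
    """Dispatch by table index; no string comparison happens here."""
    arity, impl = linked[index]
    if len(args) != arity:
        raise TypeError(
            f"primop {index} expects {arity} argument(s), got {len(args)}"
        )
    return impl(*args)


strings = ["x", "add", "length"]
entries = [PrimopEntry(1, 2, 0), PrimopEntry(2, 1, 0)]
linked = link_primops(strings, entries)
print(call_primop(linked, 0, [1, 2]))
print(call_primop(linked, 1, [[10, 20, 30]]))
```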

The IR blob contains the actual program. Each node begins with a type byte
followed by type-specific payload.

Node type enumeration from the source:

```plaintext
0x01: CONST_INT - Signed 64-bit integer
0x02: CONST_STRING - String table index
0x03: CONST_PATH - String table index
0x04: CONST_BOOL - 0x00 or 0x01
0x05: CONST_NULL - No payload
0x06: CONST_FLOAT - IEEE 754 double
0x07: CONST_URI - String table index
0x08: CONST_LOOKUP_PATH - String table index for <nixpkgs>
0x10: VAR - Two varints: depth and index
0x20: LAMBDA - Arity and body offset
0x21: APP - Function and argument offsets
0x22: BINARY_OP - Operation enum and operands
0x23: UNARY_OP - Operation enum and operand
0x24: IMPORT - String table index for file path
0x30: ATTRSET - Count and recursive flag
0x31: SELECT - Expression, attribute, optional default
0x32: WITH - Attribute set and body offsets
0x33: LIST - Count and element offsets
0x34: HAS_ATTR - Expression and attribute
0x40: IF - Condition, then, and else offsets
0x50: LET - Binding count and body offset
0x51: LETREC - Binding count and body offset
0x52: ASSERT - Condition and body offsets
0x60: THUNK - Expression offset
0x61: FORCE - Expression offset
0xFF: ERROR - Error marker
```

Binary operations supported:

```plaintext
ADD, SUB, MUL, DIV - Arithmetic on integers
CONCAT - List concatenation (++)
EQ, NE - Equality comparison
LT, GT, LE, GE - Ordering comparison
AND, OR, IMPL - Boolean logic
MERGE - Attribute set override (//)
```
## Variable Representation

The compiler converts variable names to De Bruijn indices during IR generation.
Rather than storing strings like "x" in the output, each variable reference
encodes two numbers: the lexical depth and the position within that scope.

The depth counts how many scope boundaries separate the reference from the
scope that binds the variable. A reference to a variable bound in the current
scope has depth zero. A reference from inside one lambda to a variable bound in
the enclosing scope has depth one.

The index indicates the position in that scope's environment array. The first
bound variable in a scope has index zero, the second has index one, and so
forth.

During evaluation, the VM combines these two numbers into a single 32-bit value
where the high 16 bits encode depth and the low 16 bits encode index. Lookup
traverses the environment chain depth times, then indexes into the resulting
scope's binding array. Resolution therefore costs `depth` pointer hops plus one
array index: O(depth) rather than strictly O(1), but with no string comparison
or hashing involved.
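The packing and the lookup walk can be sketched in a few lines of Python. This is an illustration of the scheme as described, not the VM's code: `pack_var`, `unpack_var`, `Env`, and `lookup` are hypothetical names. Note that a 16/16 split caps both depth and index at 65535.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


def pack_var(depth: int, index: int) -> int:
    """High 16 bits carry the depth, low 16 bits carry the index."""
    if not (0 <= depth <= 0xFFFF and 0 <= index <= 0xFFFF):
        raise ValueError("depth and index must each fit in 16 bits")
    return (depth << 16) | index


def unpack_var(packed: int) -> Tuple[int, int]:
    return packed >> 16, packed & 0xFFFF


@dataclass
class Env:
    values: List[Any] = field(default_factory=list)
    parent: Optional["Env"] = None


def lookup(env: Env, packed: int) -> Any:
    """Hop `depth` parents up the chain, then index into that scope."""
    depth, index = unpack_var(packed)
    scope = env
    for _ in range(depth):
        if scope.parent is None:
            raise LookupError("depth exceeds the environment chain")
        scope = scope.parent
    return scope.values[index]


outer = Env(values=["a", "b"])
inner = Env(values=["x"], parent=outer)
print(lookup(inner, pack_var(0, 0)))  # first binding of the current scope
print(lookup(inner, pack_var(1, 1)))  # second binding of the enclosing scope
```

The walk takes exactly `depth` hops, so the cost depends on how deeply scopes nest rather than on how many names are in scope.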
## The Virtual Machine

The VM implements lazy evaluation using an explicit thunk mechanism. Every
unevaluated expression and function argument is wrapped in a Thunk structure
containing the expression AST node and a pointer to the captured environment.

When the VM needs a value, it calls `force()` on the thunk. The force operation
checks whether the thunk is already being evaluated. If evaluation attempts to
force a thunk that is currently evaluating, the VM detects the cycle and raises
"infinite recursion encountered". This matches Nix's behavior for recursive
definitions.
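The cycle check is the classic "blackhole" trick: mark a thunk as in-progress before evaluating it, and treat re-entry as an error. Below is a minimal Python sketch under stated assumptions. A closure stands in for the (AST node, environment) pair, and caching the result after the first force is my assumption rather than something the notes above confirm. The `Thunk` and `InfiniteRecursion` names are hypothetical.

```python
from typing import Any, Callable, Optional


class InfiniteRecursion(Exception):
    pass


class Thunk:
    def __init__(self, compute: Callable[[], Any]) -> None:
        # `compute` stands in for "evaluate this AST node in this environment".
        self._compute: Optional[Callable[[], Any]] = compute
        self._state = "pending"  # pending -> evaluating -> done
        self._value: Any = None

    def force(self) -> Any:
        if self._state == "done":
            return self._value
        if self._state == "evaluating":
            # Re-entered while still being evaluated: the value depends on itself.
            raise InfiniteRecursion("infinite recursion encountered")
        self._state = "evaluating"
        try:
            self._value = self._compute()
        except Exception:
            self._state = "pending"  # leave the thunk retryable after a failure
            raise
        self._state = "done"
        self._compute = None  # drop the closure so it can be collected
        return self._value


# Roughly `let x = x + 1; in x`: forcing x requires forcing x.
scope = {}
scope["x"] = Thunk(lambda: scope["x"].force() + 1)
try:
    scope["x"].force()
except InfiniteRecursion as err:
    print(err)
```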

The environment structure is an array-based chain. Each scope holds a pointer to
its parent scope and a vector of bound values. Looking up a variable traverses
parent pointers until reaching the scope at the correct depth, then indexes into
that scope's value array. This replaces string comparison with pointer traversal
and array indexing.

Function application follows currying. When applying a function to an argument,
the VM checks whether the function's arity is satisfied. If yes, it extends the
environment with the new binding and evaluates the body. If not, it returns a
partial application awaiting additional arguments.
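The arity check can be sketched as below. This is a simplified model, not the VM's representation: a tuple of already-supplied arguments stands in for the extended environment, a Python callable stands in for the compiled body, and `Closure` and `apply` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Closure:
    arity: int
    body: Callable[..., Any]  # stand-in for a compiled function body
    bound: Tuple[Any, ...] = ()  # arguments supplied so far


def apply(fn: Closure, arg: Any) -> Any:
    """Apply one argument: run the body if saturated, else stay partial."""
    bound = fn.bound + (arg,)
    if len(bound) == fn.arity:
        return fn.body(*bound)
    # Not enough arguments yet: return a new partial application.
    return Closure(fn.arity, fn.body, bound)


add3 = Closure(3, lambda a, b, c: a + b + c)
partial = apply(apply(add3, 1), 2)
print(isinstance(partial, Closure))  # still waiting for one argument
print(apply(partial, 3))
```

Because each `apply` returns a fresh `Closure`, a partial application can be reused with different final arguments without interfering with earlier calls.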

The evaluator handles binary operations with type-specific dispatch. Addition
supports integers, strings, and paths with appropriate type coercion rules.
Comparison operators work on integers and strings. The merge operator combines
two attribute sets with right-side precedence.
## Plugin Primops

The plugin registers three primops through Nix's `RegisterPrimOp` interface:

`__nixIR_loadIR` accepts a file path string, deserializes the `.nixir` bundle,
evaluates the entry expression, and returns the resulting value. The VM measures
deserialization time and evaluation time separately, printing timing data to
stderr.

`__nixIR_compile` accepts a string containing Nix source code, parses it
in-memory, generates IR, and evaluates the result. This enables runtime
compilation without external tooling.

`__nixIR_info` returns an attribute set containing the plugin name
"nix-ir-plugin", version "0.1.0", and status "runtime-active". This is a
development-only primop that will be removed eventually.

The primops use the double-underscore prefix internally. Users access them
through `builtins.nixIR_loadIR`, `builtins.nixIR_compile`, and
`builtins.nixIR_info` in their expressions.
## Import Handling

The compiler performs static import resolution when the import path meets
specific conditions. The path must be a string literal in the source, not an
interpolation or variable. The path must not use home directory expansion. The
resolved path must remain within the project root for security. The target file
must exist and be readable at compile time.
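The four conditions translate into a short predicate. The sketch below is a guess at the shape of such a check, not the compiler's actual code: `can_inline_import` and its parameters are hypothetical, the "is it a literal" question is passed in as a flag because it is decided on the AST, and paths are assumed to resolve relative to the importing file.

```python
import os
import tempfile
from pathlib import Path


def can_inline_import(raw: str, is_literal: bool, importer: Path, root: Path) -> bool:
    """Return True only if every static-import condition holds."""
    if not is_literal:  # interpolations and variables stay dynamic
        return False
    if raw.startswith("~"):  # no home directory expansion
        return False
    resolved = (importer.parent / raw).resolve()
    try:
        resolved.relative_to(root.resolve())  # must stay inside the project root
    except ValueError:
        return False
    return resolved.is_file() and os.access(resolved, os.R_OK)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    root.mkdir()
    (root / "lib.nix").write_text("{ }")
    (Path(tmp) / "outside.nix").write_text("{ }")
    main = root / "main.nix"
    print(can_inline_import("./lib.nix", True, main, root))       # inside the root
    print(can_inline_import("../outside.nix", True, main, root))  # escapes the root
    print(can_inline_import("~/lib.nix", True, main, root))       # home expansion
```

Resolving before the containment check matters: comparing unresolved strings would let `../` segments or symlinks slip past the project-root rule.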

When these conditions hold, the compiler reads the imported file, recursively
processes its imports, and embeds the resulting IR into the output bundle. The
final `.nixir` file is self-contained and requires no additional file lookups at
load time.

When conditions do not hold, the compiler records the import as dynamic and
emits an IMPORT node containing the string table index. At runtime, the VM
evaluates the import expression to obtain the actual file path, then uses Nix's
standard evaluator to load that file.
## What Works And What Does Not

The implementation covers a substantial subset of Nix's expression language.
Literals work across all types including integers, floats, strings, paths, URIs,
booleans, and null. Lambda expressions, function application, and currying are
implemented. Attribute sets with both static and dynamic keys are supported. The
let and letrec forms work with proper recursive binding semantics. The if
expression, assert expression, with expression, and list literals are all
functional.

The implementation does not cover derivations, builtins other than those
required for basic operation, or the full module system. These require
integration with Nix's store and download mechanisms that the VM does not
replicate.
## Building And Using

Create a build directory and configure with CMake:

```
cmake -B build -G Ninja
cmake --build build
```

This produces `nix-irc` in the build directory and `nix-ir-plugin.so` in the
project root.

Compile a Nix file to IR:

```
./build/nix-irc input.nix output.nixir
```

Load and evaluate the compiled bundle through Nix:

```
nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_loadIR "output.nixir"'
```

Compile and evaluate source at runtime:

```
nix --plugin-files ./nix-ir-plugin.so eval --expr 'builtins.nixIR_compile "1 + 2"'
```