I formalised and proved the correctness of Move’s new borrow checker in Lean: 39,000 lines of mechanised metatheory, produced in under a month with the help of an AI coding assistant. This post tells the story of how it went and what it means for the future of PL research.

Reading guide. This is a long post; depending on your background, feel free to jump straight to the sections that interest you.

The Programming Languages (PL) research community was one of the earliest adopters of interactive theorem provers, such as Rocq, Agda, and Isabelle/HOL, as a key technology for gaining trust in the produced formal models and proofs of their properties. The famous POPLmark Challenge, a set of benchmarks aimed at identifying the main hurdles when stating and proving theorems about programming languages, turned 20 last year.

Proving properties of a PL artefact, such as an optimising compiler or a type system proposal, has traditionally been considered a significant challenge, and, while not insurmountable by trained human provers, it has become almost a necessary requirement to publish a research paper at one of the top-tier PL conferences. Some perceive this trend as a rite of passage. I believe a healthier way is to think of machine-assisted proofs about programming languages as a way to sharpen one’s definitions and statements: it is widely recognised that ugly definitions usually result in significantly more laborious proofs.

The sheer amount of human time spent by PL researchers over the past two decades formalising their results in provers, such as Rocq, is so staggering that studying the proof patterns to discover more concise ways to engineer machine-checked proofs became a research direction of its own with quite a few high-profile publications (Example 1, Example 2, Example 3) and even entire academic conferences dedicated to this topic alone. With the modern trend of applying frontier AI models to facilitate construction of machine-checked proofs of mathematical theorems (predominantly, in Lean) with the help of systems such as Harmonic’s Aristotle and Axiom Prover, it is only a matter of time before these advances facilitate proofs of theorems about programming languages as well. In this blogpost, I describe one such experiment.

What is PL metatheory, and why does it need mechanised proofs?

Let us set up some terminology first. When talking about formal verification of programs, it is important to remember that this is only meaningful when we have a particular specification in mind, which describes the properties of the program of interest that always hold (we have talked about this at length in one of the previous posts).

But what if, instead of properties of a program written in a certain language, we are interested in properties of a programming language itself? If you are wondering what these properties might be, think of your favourite programming language and things that you believe cannot happen when running programs in it. For instance, any implementation of Python guarantees memory safety: it would be really surprising for one to observe a dangling pointer or a buffer overflow when running a Python program—Python’s garbage collector ensures that this does not happen at run time.

Some languages go even further and provide similar guarantees without a garbage collector whatsoever, reasoning purely from the syntax of programs using mechanisms known as type systems. A particularly prominent example of an interesting yet practical type system is that of Rust: through the mechanisms of borrows and lifetimes it guarantees that, if a program without unsafe blocks is accepted by the Rust compiler, it is free from use-after-free errors, dangling pointers, and data races—all being artefacts of so-called unsafe pointer aliasing. But why should we believe that these guarantees do indeed hold in reality when we run our compiled Rust programs? Well, this is where the math behind PL theory steps in: ideally, a programming language designer must prove a Type Soundness Theorem that connects the fact that a program is accepted by a type checker with the absence of certain runtime behaviours—the famous Well-Typed Programs Don’t Go Wrong mantra coined by Robin Milner in his 1978 paper.1

For any interesting programming language, the type soundness theorem is surprisingly non-trivial to state. First, it requires a precise description of the type system itself, defining how exactly it “analyses” a program to determine whether it should be accepted or rejected. Second, one needs to define a semantics of the language’s runtime behaviour, and the errors that are considered “preventable” by the type system (for instance, almost no practical type systems promise to catch errors such as divisions by zero or out-of-memory errors). Finally, the statement of the theorem should say something about the validity of the memory state in which we are going to run a program that has been accepted by a type checker.

Of the components of a Type Soundness Theorem statement, only the type system itself is untrusted. In contrast, the definition of the runtime semantics of the language is almost always taken for granted (akin to the “laws of nature”), while the properties of the initial state are something that is left for the loader to take care of, so the feasibility of such a requirement is not questioned by the theorem itself. What the soundness theorem does deliver is the proof that the type system does not accept programs that would result in a preventable error at run time.

The study of definitions of semantics, type systems, program optimisations, and their interactions with erroneous or harmful runtime behaviours is what is typically called PL metatheory. Type soundness theorems are among the most common statements studied and proven in PL metatheory, and this is, in a nutshell, what we, PL researchers, do for a living.

Why, then, does PL metatheory call for mechanised proofs? For any remotely interesting language, a type soundness proof amounts to a massive case analysis over all possible language constructs, coupled with different cases for how the runtime semantics can treat its state or individual language commands. It is not uncommon for such proofs to span dozens of pages of English prose and, unlike classical mathematical proofs, they are very rarely intellectually rewarding: each case follows a similar pattern, yet every single one must be checked. It is easy to make a mistake, and even a trivial error can render the entire point of a type system proposal false. This is why the PL community has adopted proof assistants since the early 2000s. The first reason was to gain trust in the results. The second was to tame the tedium using clever proof engineering techniques. That said, when approached by a human prover, the amount of mechanised metatheory required for a top-tier conference paper still typically amounts to about 6–10 person-months of work. As I will try to argue below, this is about to change with AI.

Move programming language and its borrow checker

Over the past few months, I have been collaborating with the developers from Mysten Labs on designing a new type system for the Move language for smart contracts. Move is deployed on the Sui and Aptos blockchains. Like Rust, Move enforces an ownership discipline: every value has a unique owner, and references (called borrows) must not outlive the values they borrow. The borrow checker (a special type-based static analysis run by the compiler) rejects programs that could create dangling references, aliased mutations, or use-after-move errors. Unlike Rust, however, Move does not allow references inside data structures (so, no linked lists): every reference is an access path rooted in a local variable and descending through a sequence of field names. This restriction eliminates the need for lifetime annotations entirely: the type system can track each reference’s provenance using just its path.

The key idea behind the new borrow checker design is to track reachability between references using regular expressions over field paths. For example, when a reference w borrows a deep path p.x.f, the type system registers the regex x · f as the path from the parent reference to w. When another reference u later re-borrows p.x, the type checker computes the Brzozowski derivative (i.e., “stripping” the shared prefix x) to discover whether u can reach w. If the resulting regex is non-empty, the two references overlap in memory, so writing through u would invalidate w.
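To make this concrete, here is a minimal, self-contained sketch of field-path regexes and their Brzozowski derivative in Lean. This is my own illustrative code, not the development’s actual definitions; all names (`PathRe`, `deriv`, etc.) are hypothetical.

```lean
-- Regexes over field names, as used for tracking reference reachability.
inductive PathRe where
  | empty                -- ∅: matches no path
  | eps                  -- ε: matches the empty path
  | field (f : String)   -- a single field name
  | cat  (r s : PathRe)  -- concatenation r · s
  | alt  (r s : PathRe)  -- union r + s

-- Does the regex accept the empty path?
def PathRe.nullable : PathRe → Bool
  | .empty   => false
  | .eps     => true
  | .field _ => false
  | .cat r s => r.nullable && s.nullable
  | .alt r s => r.nullable || s.nullable

-- Brzozowski derivative: “strip” a leading field `f`.
def PathRe.deriv (f : String) : PathRe → PathRe
  | .empty   => .empty
  | .eps     => .empty
  | .field g => if f = g then .eps else .empty
  | .cat r s =>
      .alt (.cat (r.deriv f) s)
           (if r.nullable then s.deriv f else .empty)
  | .alt r s => .alt (r.deriv f) (s.deriv f)

-- Does the regex match no path at all?
def PathRe.isEmpty : PathRe → Bool
  | .empty   => true
  | .eps     => false
  | .field _ => false
  | .cat r s => r.isEmpty || s.isEmpty
  | .alt r s => r.isEmpty && s.isEmpty

-- δ(x · f, x) = f ≠ ∅, so the two references overlap:
#eval ((PathRe.cat (.field "x") (.field "f")).deriv "x").isEmpty  -- false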

Consider the following program, written in MoveIR, an LLVM-like intermediate language of Move with minimalistic syntax and explicit borrow/move/copy annotations, which makes it a better target for mechanisation than plain Move:

struct S { f: u64 }
struct P { x: Self.S }
t() {
    let p: Self.P;
    let w: &u64;
    let u: &mut Self.S;
  label b0:
    p = P { x: S { f: 0 } };
    w = &copy(&p).P::x.S::f;   // deep borrow of p.x.f
    u = &mut p.P::x;           // re-borrow of p.x
    *move(u) = S { f: 1 };     // typing error here!
    return;
}

The diagram below shows the reachability graph that the type checker maintains for this program. Each node \(\rho\) is an abstract reference—a symbolic representation for the respective run-time memory locations. The distinguished \(\rho_0\) represents the stack frame root; \(\rho_p\), \(\rho_w\), and \(\rho_u\) correspond to the variables p, w, and u. An edge labelled with a regex records which field paths can be traversed from one location to reach another.

Reachability graph for the nested aliasing example

Here, w borrows the deep path p.x.f, while u re-borrows p.x. The derivative \(\delta(\mathtt{x} \cdot \mathtt{f},\, \mathtt{x}) = \mathtt{f} \neq \emptyset\) reveals that \(\rho_u\) can reach \(\rho_w\) via field f (shown as the blue edge). The write on the last line triggers the safety check, which fails: overwriting the entire S behind u would invalidate the deep borrow w, creating a dangling reference. This is precisely the kind of subtle aliasing that the regex-based approach catches through a clean, decidable mechanism.

This design has been fully implemented in a branch of the Sui blockchain client, where it superseded the original (much more complicated and non-formalised) borrow analyser while maintaining full backwards compatibility. But for a blockchain language, implementation alone is not enough: the designers want ironclad guarantees that the borrow checker is correct, in the sense of the Type Soundness Theorem explained above. A bug in the borrow checker’s logic could allow an attacker to exploit a dangling reference, potentially compromising on-chain funds.

This is where our mechanisation effort comes in. We wanted to formalise the new type system in Lean and prove it sound with respect to reasonable semantics of MoveIR. Our ambition was to cover as much of Move as possible, not just a “toy subset” (as customary in proof-of-concept academic prototypes), and to get confidence that the formalisation faithfully represents the actual deployed implementation.

Encoding Move typing rules in Lean

When PL researchers present a type system, they typically write it down as a collection of inference rules in a logical notation. For instance, the typing rule for writes through a mutable reference looks like this:

\[\text{T-WriteRef} \quad \frac{\displaystyle \Sigma(a) = \mathsf{ref}(\tau, \rho, \mathsf{M}) \quad \Sigma(b) = \mathsf{basic}(\tau) \quad \mathsf{check\_outbound}(\Pi, \rho) \atop \displaystyle \Lambda;\; \mathcal{E}[\Sigma := \Sigma \setminus \{a, b\},\; \Pi := \Pi] \vdash \mathit{cont} : \overline{T}}{\Lambda;\; \mathcal{E} \vdash {*}a := b;\; \mathit{cont} : \overline{T}}\]

Reading bottom-to-top: to type-check a write statement \(*a := b\), the rule requires that \(a\) holds a mutable reference of type \(\tau\), that \(b\) holds a basic value of the same type \(\tau\), and, crucially, that \(\mathsf{check\_outbound}\) passes, meaning no existing borrow extends beyond \(\rho\) (otherwise the write would create a dangling reference). After the write, both sites are removed from \(\Sigma\). Here, a site is a named temporary slot in the stack frame that holds the value produced by an expression before it is consumed by the next operation (think of SSA registers in LLVM), and \(\Sigma\) is the site environment—a map from site names to their types that the type checker maintains as it walks through the program. This is not what the compiler implements directly: the inference rule casts type checking as proof construction in a domain-specific logic. A program that type-checks is one for which a proof tree can be assembled from such rules.

The first step in our mechanisation was to encode MoveIR’s syntax and all its typing rules in Lean. I wrote these definitions by hand. Here is a simplified version of the \(\text{T-WriteRef}\) rule in Lean:

inductive typecheck_stmt : LabelEnv → TypeEnv → Stmt → List MoveType → Prop where
  ...
  | write_ref : ∀ Λ 𝓔 a b τ ρ cont Ts,
      𝓔.Σ(a) = .ref τ ρ .mut →
      𝓔.Σ(b) = .basic τ →
      check_outbound 𝓔 ρ →
      typecheck_stmt Λ (𝓔[Σ := 𝓔.Σ \ {a, b}]) cont Ts →
      typecheck_stmt Λ 𝓔 (*a := b; cont) Ts
  ...

Each premise of the inference rule becomes a hypothesis in the Lean constructor: the lookups check site types, check_outbound enforces the borrow safety condition, and the recursive typecheck_stmt call types the continuation under the updated environment.

With this encoding of the MoveIR type system in place, we can already validate how faithful our Lean model is: take a concrete program from Move’s test suite, translate it to MoveIR, and try to prove that the typing judgement holds—I will call such proofs conformance proofs. This was the point at which I started using Claude Code (Opus 4.5) to construct these proofs. To my surprise, the AI could handle tiny MoveIR programs (4–5 lines of code) successfully, assembling the proof tree step by step. But for anything larger, the approach quickly became impractical: each proof step required instantiating the right rule constructor, providing witnesses for existential variables, and discharging side conditions about the path environment, all of which grew rapidly with program size. This is where I started to take the agenda of conformance proofs seriously, but in a very different form: more on that next.

From conformance proofs to tests via algorithmic type checking

To solve the efficiency bottleneck with conformance proofs for the type system (AI was slow and unreliable), I resorted to a more traditional approach: implementing an actual executable type checker in Lean and running it on the same tests as the production Move implementation, making sure that it accepts and rejects the same programs. Instead of constructing proof trees, the algorithmic checker is a plain recursive Lean function that returns an updated type environment on success or fails with none:

def check_stmt (lenv : LabelEnv) (env : TypeEnv) (s : Stmt)
    (retTypes : List ParamType) : Option TypeEnv

Luckily, there was no shortage of tests in the Move compiler’s test suite, so it did not take long to vibe-code a parser from MoveIR text into our Lean representation and start running the tests.2

But what is the relationship between the relational type checker from before (the inductive typecheck_stmt with inference rules) and the new algorithmic one (the executable check_stmt)? After all, we could have introduced a bug in the algorithmic version that makes it accept programs the relational rules would reject, or vice versa. To close this gap, we proved (entirely by AI, of course) the first important theorem of our development: the soundness of the algorithmic type checker:

theorem check_stmt_sound (lenv : LabelEnv) (env : TypeEnv)
    (s : Stmt) (retTypes : List ParamType) :
    ∀ env', check_stmt lenv env s retTypes = some env' →
    typecheck_stmt lenv env s retTypes

This theorem says: whenever the algorithmic checker accepts a program (returns some env'), the relational typing judgement holds. In other words, the executable checker is a sound decision procedure for the type system: it never accepts a program that the inference rules would reject. Every successful run of check_stmt effectively produces a certificate that the relational typing derivation exists.3
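Concretely, this lets us turn each conformance test into a typing derivation by just running the checker inside Lean. A sketch of such a use, with all the `my*` names hypothetical (standing for a concrete test program and its environments, assumed to be defined elsewhere):

```lean
-- Hypothetical: `myLenv`, `myEnv`, `myProg`, `myTs`, `myEnv'` are concrete
-- inputs for one conformance test. `native_decide` evaluates the executable
-- checker, and the soundness theorem turns the successful run into a
-- derivation in the relational type system.
example : typecheck_stmt myLenv myEnv myProg myTs :=
  check_stmt_sound myLenv myEnv myProg myTs myEnv'
    (by native_decide)  -- checks `check_stmt myLenv myEnv myProg myTs = some myEnv'`
```

The point of this pattern is that no proof tree ever has to be assembled by hand (or by the AI) for a concrete program: evaluation of `check_stmt` does all the work.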

Once we had the parser, the algorithmic checker, and the soundness proof in place, we ran 156 conformance tests on programs drawn from Move’s own type checker test suite. This turned out to be one of the most valuable components of the entire development. In the later stages, we frequently needed to revise our encoding of the typing rules—for instance, when extending the type system with additional features—and it was the conformance tests that gave us trust that we were still formalising the right thing.4

All this, of course, does not yet mean that a Move program accepted by the type checker is actually free of dangling pointers. The algorithmic soundness theorem only connects the executable checker to the inference rules; it says nothing about runtime behaviour. Next, we had to provide the remaining components for the much desired Type Soundness Theorem and prove it.

Runtime semantics and Type Soundness Theorem

To state a Type Soundness Theorem, we need a precise definition of what it means to run a program. In our Lean development, the runtime semantics takes the form of a definitional interpreter: a recursive function run that executes a program for at most fuel steps, returning either a final result or an error:

def run (fuel : Nat) (state : ExecState) : ExecState :=
  match fuel with
  | 0 => .error .outOfFuel
  | n + 1 =>
    match state with
    | .running _ => run n (step state)
    | other => other

The fuel parameter is a standard technique for defining potentially non-terminating computations in proof assistants: instead of proving termination, we give the interpreter a budget of steps. When the budget runs out, it returns outOfFuel rather than looping forever. The soundness theorem will quantify over all fuel values, so this is not a limitation: if a bug exists, it would manifest at some finite step count.

The step function performs a single transition from one “virtual machine” state (variables, sites, stack, heap) to another. When something goes wrong at runtime, it produces an error state. Not all errors are created equal, though. Some are preventable: reading through a dangling reference, accessing a moved value, or encountering a type mismatch at a write. These are exactly the bugs the borrow checker exists to catch. Others are acceptable: division by zero, running out of fuel, or an explicit abort. No reasonable type system promises to prevent those. The type soundness theorem is exactly what guarantees that a program accepted by the type checker never reaches any preventable error.
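The split between preventable and acceptable errors might look as follows in Lean. This is a simplified sketch; the constructor and function names are illustrative, not the development’s actual ones:

```lean
inductive RuntimeError where
  | danglingRef        -- read/write through a freed location: preventable
  | useAfterMove       -- access to a moved-out value: preventable
  | typeMismatch       -- ill-typed write at run time: preventable
  | divByZero          -- arithmetic error: not the type system's job
  | outOfFuel          -- the interpreter ran out of steps
  | abort (code : Nat) -- explicit, user-requested abort

-- Errors the borrow checker does NOT promise to rule out.
def RuntimeError.isAcceptable : RuntimeError → Bool
  | .divByZero | .outOfFuel | .abort _ => true
  | _ => false
```

The soundness theorem then only quantifies over errors for which `isAcceptable` is false.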

The runtime semantics is one of the trusted components of the formalisation: if we get it wrong, the theorem might be true but vacuous. That said, this particular semantics, despite being AI-generated, was relatively straightforward—a standard small-step interpreter over a heap—so I mostly validated it by reading the code. As additional, albeit lightweight, assurance, I vibe-coded a number of simple “litmus” tests, checking that programs I believed should crash due to a dangling pointer or a use-after-move error did indeed crash under our semantics.5

With the semantics in hand, we can finally state the Type Soundness Theorem. Here is a slightly simplified version from our Lean development:

theorem type_soundness (f : FunDef) (lenv : LabelEnv)
    (enumEnv : EnumEnv) (funEnv : AssocMap Id FunDef)
    (args : List Value) (heap : Heap)
    (htyped : typecheck_fun f lenv enumEnv)
    (hfunEnv : ∀ fname fdef, lookup funEnv fname = some fdef →
               FunTypeSafe fdef funEnv enumEnv)
    (ha : SoundnessAssumptions f lenv enumEnv funEnv heap args)
    (e : RuntimeError) (hna : ¬e.isAcceptable) :
    ∀ n, run n (initState f funEnv args heap) ≠ .error e

Reading this bottom-to-top: for any non-acceptable error e and any fuel budget n, running the function f never produces that error. The premises require three things. First, htyped: the function type-checks under our relational type system. Second, hfunEnv: every function that f might call is itself type-safe (this gives us modular reasoning—we check one function at a time). Third, ha: a record called SoundnessAssumptions that bundles 23 well-formedness preconditions on the initial state—things like “argument types match parameter declarations”, “the heap contains no dangling locations”, “label environments are complete”, and structural invariants on enum definitions. These assumptions on the initial state are yet another point where the formalisation could become vacuous: if the 23 preconditions are mutually contradictory, the theorem would be true but useless, as there would simply be no valid inputs to run the function on! We will discuss how to address this concern shortly.

Soundness Proof: the Labours of Claude

Everything described so far—encoding the type system, building the algorithmic checker, writing the parser, running conformance tests—was, in retrospect, the easy part. The real struggle was proving the Type Soundness Theorem itself.

The standard approach, introduced by Wright and Felleisen in 1994, decomposes type soundness into two lemmas: progress and preservation.6 Progress says: if the current state is well-typed, the machine can take a step not resulting in a preventable error. Preservation says: if the current state is well-typed and the machine takes a step, the resulting state is also well-typed. Together, they give an inductive argument: the initial state is well-typed (by the soundness assumptions), each step preserves well-typedness (preservation), and no well-typed state can crash (progress), so the machine never reaches a preventable error, no matter how many steps it takes. Progress was straightforward: for each typing rule, the relevant premises guarantee that the corresponding runtime operation succeeds. The real beast was preservation.
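Schematically, the two lemmas might be stated in Lean as follows, with a placeholder predicate `WellTypedState` standing in for the well-typedness invariant (hypothetical signatures; proofs elided):

```lean
-- Progress: a well-typed state never steps into a preventable error.
theorem progress (st : ExecState) (hwt : WellTypedState st) :
    ∀ e : RuntimeError, ¬e.isAcceptable → step st ≠ .error e := by
  sorry -- proof elided in this sketch

-- Preservation: if a well-typed state takes a step and is still
-- running, the resulting state is well-typed again.
theorem preservation (st : ExecState) (hwt : WellTypedState st)
    (hrun : (step st).isRunning) : WellTypedState (step st) := by
  sorry -- proof elided in this sketch
```

Induction on the fuel budget then threads these two facts through every step of `run`.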

Preservation proof: an exercise in invariant inference

The crux of preservation is defining what a “well-typed state” means for a running machine. Remember, the type system deals with syntactic entities such as types, type environments, abstract references, and regexes, while the runtime operates on heap locations, actual values, and reference chains. The two worlds need to be connected. A well-typed state invariant is precisely this bridge: a predicate that relates the concrete machine state to the abstract type environment, asserting that every promise made by the type checker is backed by reality in the heap. In our case, it is a Lean record with 35 fields. For instance, some clauses say that every abstract reference \(\rho\) tracked by the type checker maps to a live heap location, that the regex paths between references faithfully reflect actual pointer chains in the heap (so the reachability graph from the diagrams above is adequate), and that the check_outbound condition used by \(\text{T-WriteRef}\) is always satisfied for mutable references.
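To give a flavour, here is a sketch of such a record with the three clauses just mentioned. The field and helper names (`tracks`, `concretise`, `regexBetween`, and so on) are illustrative, not the development’s actual ones:

```lean
structure WellTypedState (st : ExecState) : Prop where
  -- every abstract reference ρ tracked by the type checker is backed
  -- by a live heap location
  refsLive :
    ∀ ρ loc, st.typeEnv.tracks ρ → st.concretise ρ = some loc →
      st.heap.isLive loc
  -- the regex edges over-approximate the actual pointer chains in the
  -- heap, so the reachability graph is adequate
  pathsAdequate :
    ∀ ρ₁ ρ₂ path, st.pointerPath ρ₁ path ρ₂ →
      (st.typeEnv.regexBetween ρ₁ ρ₂).matches path
  -- the borrow-safety condition holds for every mutable reference
  outboundOk :
    ∀ ρ, st.typeEnv.isMutable ρ → check_outbound st.typeEnv ρ
  -- ... 32 more clauses in the actual development
```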

It is nearly impossible to predict all 35 (or more) invariant fields from the start. The way it works in practice is: you attempt the proof for a particular statement kind, get stuck because the invariant is too weak, strengthen it with a new clause, and then propagate the change to all other cases. This is very similar to inferring loop invariants in imperative programs or inductive invariants for distributed systems: you just iterate until the invariant is strong enough for the proof to go through in Lean.

Where AI did great, and where it didn’t

Despite MoveIR being a relatively small language, it has over 20 distinct statement forms (borrow, move, copy, field borrow, write, freeze, call, return, jump, branch, pack, unpack, and their vector/enum variants), each with its own semantic step rule. The preservation proof contains 153 lemmas (mostly conjectured by AI), collectively showing that each of these steps preserves the 35-field invariant. Every time I strengthened the invariant with a new clause, dozens of lemmas across 30+ files needed updating. This is where Claude (Opus 4.5, and later 4.6) was invaluable. The AI excelled at proof repair: propagating a change through a large proof landscape, applying the same pattern to case after case. It also handled routine preservation cases—where the environment update is a straightforward pass-through—with high reliability.

But it was not all a walk in the park. At one point, Claude got stuck trying to prove preservation for jumps to LLVM-style labelled blocks, going in circles for hours. I had to step in and recognise that we needed a separate fact: a weakening lemma. Weakening says that if a statement type-checks under a “more restrictive” type environment (one that tracks more paths between references), it also type-checks under a “less restrictive” one. This is what makes control-flow joins sound: the checker types the continuation under a target environment and verifies that each branch’s post-environment subsumes it. The weakening lemma proof turned out to be substantial on its own: about 7,200 lines of Lean.
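Stated roughly, with a hypothetical ordering `moreRestrictive` on type environments (the actual development’s formulation differs in details):

```lean
-- `moreRestrictive 𝓔₁ 𝓔₂` means `𝓔₁` tracks at least the reference
-- paths that `𝓔₂` does. Weakening: anything typeable under the more
-- restrictive environment is typeable under the less restrictive one.
theorem weakening (Λ : LabelEnv) (𝓔₁ 𝓔₂ : TypeEnv)
    (s : Stmt) (Ts : List MoveType)
    (hsub : moreRestrictive 𝓔₁ 𝓔₂)
    (ht : typecheck_stmt Λ 𝓔₁ s Ts) :
    typecheck_stmt Λ 𝓔₂ s Ts := by
  sorry -- ~7,200 lines of Lean in the actual development
```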

At some point I considered switching to Harmonic’s Aristotle prover for the more difficult lemmas, but ultimately decided against it. Claude Code made it very easy to keep the tight control over the entire development: I could refactor proof structures on the fly, ask “why are you proving this?” mid-proof, and steer the decomposition into lemmas interactively. Aristotle is better suited for one-shotting complex standalone mathematical statements (or refuting them), which is quite different from the iterative, gradual workflow I needed when prototyping Move metatheory.

The dragon: function calls

The preservation proof totals about 10,300 lines of Lean. Roughly half of that complexity lives in just two cases: the call and return commands. The typing rule for calls is the most involved in the entire system: it moves from reasoning about variables allocated in a single stack frame to reasoning about the entire call stack, which requires a bunch of additional invariants on saved frames. Additional complexity comes from the fact that our call rule supports relatively accurate inter-procedural tracking of borrows across call boundaries. The details are beyond the scope of this post, but the upshot is: this was the hardest case to dispatch. Proving that all 35 invariant fields are preserved by this operation was, by far, the most difficult part of the entire development. While Claude wrote the actual proof code, it took quite a few high-level hints from me: choosing the right case splits and cutting off unproductive proof attempts early. The call rule alone took nearly two days of Claude running almost non-stop (which forced me to upgrade to the most expensive plan). When it finally went through, it felt like defeating the dragon—the rest was mopping up.

Last wrinkles: soundness assumptions and type system extensions

Let us go back to the Type Soundness Theorem and the 23 SoundnessAssumptions that worried us earlier. Recall the concern: if these preconditions are mutually contradictory, the theorem is vacuously true, as there would be no valid inputs to run the function on, and we would have proven nothing useful.

The fix follows the same pattern that served us well throughout this development: make it executable and test it! Specifically, I vibe-coded a decidable checker that collapses all 23 preconditions plus the type-checking judgement into a single total Boolean function:

def SoundnessAssumptions.checkDecidable
    (f : FunDef) (lenvDec : LabelEnvDec) (enumEnv : EnumEnv)
    (funEnv : AssocMap Id FunDef) (fte : FunTypingEnv)
    (heap : Heap) (args : List Value) : Bool

I then made Claude prove that whenever this Boolean check returns true, the full SoundnessAssumptions record holds—and therefore the type soundness theorem applies. For each of our ~30 runtime test programs, I instructed Claude to instantiate the decidable checker with concrete inputs and let Lean’s evaluator discharge the precondition, producing a per-execution safety certificate: a machine-checked proof that this specific run cannot produce any preventable error.
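A per-test certificate then looks roughly like this. The `test01.*` names and the bridge lemma `checkDecidable_sound` are placeholders for the concrete inputs of one runtime test and for the proof that the Boolean check implies the `SoundnessAssumptions` record:

```lean
-- Hypothetical names throughout; `native_decide` evaluates the total
-- Boolean checker on the concrete inputs and confirms it returns `true`.
theorem test01_safe (e : RuntimeError) (hna : ¬e.isAcceptable) :
    ∀ n, run n (initState test01.fn test01.funEnv test01.args test01.heap)
         ≠ .error e :=
  type_soundness test01.fn test01.lenv test01.enumEnv test01.funEnv
    test01.args test01.heap
    test01.htyped test01.hfunEnv
    (checkDecidable_sound (by native_decide))
    e hna
```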

Stretch goals: vectors and enums

The core type system covers MoveIR’s structs, references, and control flow. Two important features, vectors and multi-variant enums, were initially omitted but added later as “stretch goals.”

Vectors were relatively straightforward to support, as they are conceptually very similar to records that our initial formalisation already supported. The vector extension took about one day to vibe-formalise and touched 30+ files, but the modifications were uniform enough that Claude could apply the same pattern repeatedly without errors. As before, we have tested the extended algorithmic type checker extensively.

Enums were harder. The challenge was choosing the right runtime representation. The preservation proof requires that two values of the same type always have identical field structure (so that overwriting one with the other preserves all existing pointer paths). For structs this is trivially true, but enum variants carry different fields. Claude proposed several encodings that broke this property; the fix required me to provide some insight: represent every enum value as a flat record carrying fields for all variants at once, with inactive ones filled by type-specific default values. This made the structural property hold unconditionally, and the enum extension went through in about five days.
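The representation trick can be illustrated as follows (a schematic sketch in Lean, not the development’s literal definitions):

```lean
-- Suppose the Move enum is:  enum E { A { a: u64 }, B { b: bool } }.
-- Instead of a sum type, a runtime value of `E` is a flat record that
-- carries the fields of ALL variants plus a tag; inactive fields hold
-- type-specific defaults.
structure EnumValueE where
  tag : Nat   -- 0 = variant A, 1 = variant B
  a   : Nat   -- active when tag = 0, default 0 otherwise
  b   : Bool  -- active when tag = 1, default false otherwise

-- Two values of type E now always share the same field structure, so
-- overwriting one with the other preserves every existing pointer path.
def mkA (x : Nat)  : EnumValueE := { tag := 0, a := x, b := false }
def mkB (y : Bool) : EnumValueE := { tag := 1, a := 0, b := y }
```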

Some statistics

Everyone loves numbers, so here they are.

The entire Lean development comprises approximately 39,000 non-blank, non-comment lines of code, across 267 commits, with zero axioms or sorry declarations. To the best of my knowledge, this is the second largest PL metatheory mechanisation in Lean to date (after the Cedar specification, which is around 70,000 LOC) and, very likely, the largest one accomplished predominantly using AI. The soundness proofs dominate the codebase at ~23,000 LOC (59%), with the preservation proof alone weighing in at ~10,300 lines and the weakening lemma at ~7,200. The typing rules (both relational and algorithmic, with the soundness theorem) account for ~5,800 LOC, the parser and translator for ~1,300, and the test suite for ~6,100. The vector extension added ~3,000 LOC in one day; enums required ~5,000 LOC over five.

As a relatively large Lean development with diverse proof patterns, this codebase might serve as an interesting benchmark for AI-assisted theorem proving in Lean. I plan to open-source the full codebase in the coming weeks; feel free to get in touch if you would like early access for research purposes.

In terms of calendar time, the active development phase spanned 27 working days (some of them were on weekends, duh!). The chart below shows the daily commit intensity:

Daily commit intensity during the active development phase

The algorithmic type checker and its soundness proof took two days and kicked off my AI-powered metatheory journey. Parser, macros, and testing harnesses took five. The soundness proofs consumed 13 days, with peaks at the weakening proof (22 commits in a single day) and the call rule (17 commits).

While this post covers the “interesting” PL-theory parts of the formalisation, in my estimate, the vast majority of the generated proofs are horribly boring: wrangling lists, threading facts through semantic rules, and proving excruciatingly mind-numbing lemmas about runtime state representation. Having done quite a few mechanisations in the past (by hand, in Rocq), I would estimate this effort at about 5–6 months of my full-time engagement without AI (the mechanisation described here was done concurrently with all my other duties as a faculty member). That is, of course, assuming I wouldn’t have gone insane from dealing with the tedious parts first. I am very happy that I didn’t have to do those proofs “old style” (and, probably, never will have to again).

What does all this mean for the future of PL research?

Congratulations on making it to the end of this post. Let me reward your patience with some of my thoughts on what all this might mean for PL research.

I have read somewhere recently that one reason skilled programmers love generative AI is that it lets them focus on the creative parts of computing while the machine handles the tedium.7 I think the same dynamic applies to PL researchers. The creative work—designing logics and type systems, choosing the right semantics, figuring out non-standard proof strategies—still belongs to humans. The tedium—threading a new invariant clause through 153 lemmas, or proving that removing a key from a key-value map preserves some property of the remaining entries—is exactly what AI handles well. This is a good trade.
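To give a flavour of this tedium, here is an illustrative Lean lemma of the kind such developments are full of. The name and the association-list encoding are my own invention for this sketch, not taken from the actual mechanisation: removing all entries for a given key preserves any pointwise property of the remaining entries.

```lean
-- Illustrative only: a typical "boring" map lemma, stated over an
-- association list. Erasing a key (here, filtering it out) preserves
-- any property p that held of every entry beforehand.
theorem erase_key_preserves {α β : Type} [BEq α]
    (p : α × β → Prop) (k : α) (l : List (α × β))
    (h : ∀ e ∈ l, p e) :
    ∀ e ∈ l.filter (fun e => e.1 != k), p e := by
  intro e he
  -- membership in the filtered list implies membership in the original
  exact h e (List.mem_of_mem_filter he)
```

In isolation this proof is two lines; the pain comes from stating and re-proving hundreds of small variations of it as the runtime state representation evolves.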

I personally know some established PL academics who actively resist mechanising their metatheory proofs because the overhead slows down their research. I believe that objection is no longer convincing. While I probably shouldn’t generalise from a single data point, I would estimate that for a solid type system idea and a well-known proof technique (progress and preservation, in my case, which every PL researcher learns in graduate school), a single researcher can go from inception to a mechanised result at the level of a top-tier conference submission in about one month. To put it differently (and I apologise if this offends anyone), the effort of writing a POPL/PLDI paper will soon be comparable to the effort of writing an ICML/ICLR paper.

This does not mean that PL researchers will be out of a job any time soon. In another experiment (which I will save for a different post), I tried to tackle a far less conventional published result that relies on a non-standard proof technique, which was only briefly sketched in the respective manuscript. After about a week of fairly involved “vibe-mathing” with Claude Opus 4.6, my AI-assisted attempts produced no usable result (there are still dozens of sorrys in that project and I have no good idea how provable they are). In the experiment described in this post, the model clearly benefited from the fact that the progress-and-preservation technique is 32 years old and has numerous mechanised instances in the easily accessible training data. Novel proof strategies remain beyond AI’s reach (for now).

But let us forget about academic publications for a moment and think bigger. Over the past two decades, the PL community has done a tremendous job distilling reusable “proof harnesses” for challenges like type soundness, compiler correctness, and program logics. With AI-assisted theorem proving, we can hope to see many more real-world languages formalised end-to-end—not just Wasm and Cedar. Programming language design has a reputation for moving slowly, in part because of its insistence on rigour. AI-assisted mechanisation lets us keep the rigour and lose the slowness. We can finally move fast and break nothing.

Acknowledgements

I am grateful to Todd Nowacki and Sam Blackshear for many discussions on the design and implementation of Move’s borrow checker, and for their patience in answering my endless questions about corner cases. Thanks to the members of the VERSE lab and participants of the IFIP WG 2.8 meeting in March 2026 for their comments on the presentation that preceded this post.

  1. Does it mean that in practice PL designers prove such a theorem for every new programming language proposal? Well, of course not: real programming languages are large beasts with complicated syntax and unclear semantics, so errors in the type system design (not even type checker implementation) often go unnoticed for decades, making academic researchers very happy when they discover them. 

  2. The parser itself was validated by 37 alpha-equivalence tests comparing the output of the translation against hand-written reference ASTs. 

  3. The converse property—completeness—would say that the algorithmic checker accepts every program the relational rules accept. We have not proved completeness, but this is less important for our goals: soundness guarantees that no unsafe program slips through, while a completeness gap would only cause the checker to reject some safe programs, which would be caught by the conformance tests. 

  4. This approach is not entirely novel. The formalisation of the AWS Cedar authorisation language follows a similar pattern: a Lean implementation of Cedar’s evaluator is tested against its production version in Rust, ensuring that the two agree on a shared test suite. 

  5. A more rigorous approach would be to instrument the Move runtime to emit intermediate machine states, parse them into our Lean formalisation, and check that our definitional interpreter passes through corresponding states at each step. In the interest of time, I have not carried out this exercise so far. 

  6. More powerful techniques have been developed since 1994—notably logical relations—which can handle features like higher-order state and recursive types. We did not need them here: Move’s first-order setting is well within the reach of classical progress-and-preservation. 

  7. The opposite, apparently, is true for creative professions such as writing and art.