On the Unreasonable Effectiveness of Property-Based Testing for Validating Formal Specifications

In this post, we show that property-based testing (PBT) is surprisingly effective for validating LLM-synthesised specifications of Lean programs: it is a cheap alternative to symbolic proofs, which helped to detect underspecification in 10% of the specs in state-of-the-art benchmarks for verified code generation.

Getting Program Specifications that are “Just Right”

Formal program verification and program synthesis are only as reliable as the specifications used to validate the programs in question. A specification is a mathematical contract between a programmer and a machine: it captures what a program is supposed to do, and the verifier (or the synthesiser) ensures the contract is respected when producing the implementation and the proof. If the contract is flawed, the result is meaningless—a program verified against a wrong spec does what the specification states, but not necessarily what the user wants.

It is easy to write a contract that is useless on purpose. In Hoare logic, a specification takes the form

\[\{P\}\ \texttt{program}\ \{Q\},\]

read as: if the precondition $P$ holds of the inputs, then after program runs, the postcondition $Q$ holds of the outputs. This is, for instance, the style of specification used by Velvet—a program verifier embedded in Lean that we discussed in one of previous posts. Setting $P \equiv \bot$ (i.e., $\mathit{false}$) makes the triple hold vacuously for any program: there are no inputs satisfying the precondition, so there is nothing to check, and even completely degenerate code is certified. Symmetrically, setting $Q \equiv \top$ ($\mathit{true}$) means that every possible output satisfies the postcondition, so once again any program will pass. The first specification’s precondition is too strong (it rules out every input), and the second specification’s postcondition is too weak (it rules out no output, making verification and synthesis useless, since such a program can return anything). Neither tells us anything about what the program should do.

The interesting middle ground—a specification that is just right—is an open research problem. A good specification must be precise enough to pin down the programmer’s intent, but not so precise that it inadvertently re-states the program itself. Consider a textbook task: sort a list of integers in ascending order. A specification that says “the output is the list produced by merge sort applied to the input” is, technically, a specification, but it has fused the what with the how: it commits to an algorithm and inherits all of its incidental details, defeating the purpose of having a specification in the first place. A good specification of sorting, by contrast, demands only two things of the result: it is in ascending order, and it is a permutation of the input. Whether the implementation runs merge sort, quicksort, gnome sort, bogosort, or sleepsort is now irrelevant—the specification abstracts over all of them. Producing this kind of clean, intent-capturing formal statement from an informal English description is the part that humans are, even today, still better at than machines.

Hence the conundrum. The most reliable way to write a “just right” specification is to put a human expert in the loop: someone who reads the formal statement, compares it to the informal intent, and corrects it. But reading formal specifications fluently requires considerable training—precisely the kind of cognitive overhead that has kept formal methods out of the mainstream. If we want certified programs to become widely accessible, we need to dramatically reduce the human effort required to write and validate specifications, without simply passing control to a large language model and hoping for the best.

Our goal, thus, is to produce formal specifications with minimal user involvement, and to identify principles for generating specifications that are close enough to a human’s intent to drive the synthesis of certified programs—i.e., programs that come bundled with their formal specifications and machine-checkable correctness proofs. This problem has only become more urgent: in a world where more and more of the code we run is produced by LLMs, we cannot afford to also delegate the synthesis of specifications to LLMs without a reliable, automated way to validate that the resulting specs are just right. The rest of this post discusses two such validation techniques and the trade-offs between them.

Symbolic Specification Testing

One natural approach is to validate an LLM-synthesised specification by proving it on a handful of representative inputs—using a verifier and an SMT solver in the loop. This idea has been actively explored over the past year by Shuvendu Lahiri, who frames the problem as intent formalisation: closing the gap between informal natural-language intent and a machine-checkable specification.

Take the sorting task from earlier and imagine that an LLM proposes the following candidate postcondition: “result is in ascending order, and every element of arr also occurs in result“. It looks plausible, but it is silently too weak: it lets the implementation add elements that were never in arr. The Small Proof-Oriented Tests (SPOTs) methodology of Nik Swamy and Shuvendu Lahiri catches defects like this by writing tiny verified test cases against the spec. A SPOT for our sorting task looks like:

let arr    := #[3, 1, 2]
let result := sort(arr)
assert result == #[1, 2, 3]

and is handed to the verifier with a single demand: prove that the assertion holds using only the specification of sort, without running the code. If the spec is precise enough to force sort(#[3, 1, 2]) to equal #[1, 2, 3], the proof goes through, and the SPOT becomes a small, machine-checked theorem about the spec. If the spec is too weak, the proof fails: our buggy postcondition admits not only $\texttt{#[1,2,3]}$ but also $\texttt{#[1,2,3,0]}$, $\texttt{#[-7,1,2,3]}$, and infinitely many other ascending lists that contain $1$, $2$, and $3$, so the verifier has no way to derive the asserted equality. The failure pinpoints exactly the weakness in the postcondition.

This approach works remarkably well in auto-active verifiers such as Dafny, Verus, and F*, where industrial-grade SMT solvers discharge the proof obligations a SPOT generates—typically over arithmetic, bitvectors, fixed-size arrays, and list or string equality—in milliseconds.

Reframing Validation: Soundness and Uniqueness

We tried to use SPOTs in Lean too, and the result was disappointing. The reason is: Lean is a foundational proof assistant, with a logic much richer than what auto-active verifiers expose, and a correspondingly smaller fraction of its proof obligations fall inside what SMT solvers can discharge automatically. Many of the assertions a SPOT generates—equalities between recursively defined functions, arithmetic that needs induction, list manipulations—sit just outside the solver’s automation. To close such a proof in Lean one usually has to fall back on an interactive proof script, or, increasingly, on an LLM-driven proof search agent, such as Aristotle. Both are orders of magnitude slower per obligation than discharging the same goal in Dafny, and each attempt also burns through a non-trivial number of LLM tokens. A budget that comfortably validates dozens of SPOTs in Dafny barely covers three or four in Lean.

To make the symbolic style work in Lean, then, we need to ask less of the verifier per test case. Rather than treating a SPOT as a single, all-or-nothing proof obligation, we can decompose it into three strictly weaker checks—one that captures whether the input is one the spec claims to handle, one that captures whether the spec accepts the intended output, and one that captures whether it forbids the unintended ones.

Concretely, given a test case $(i, o)$—an input $i$ paired with its intended output $o$—we ask three separate questions about a candidate specification, given by a precondition $\mathit{pre}$ and a postcondition $\mathit{post}$:

Admissibility: $\mathit{pre}(i)$ holds. Intuitively, the chosen test input is one the spec actually claims to handle. A failure here means the test case lies outside the precondition—we would be checking the spec on inputs it has explicitly opted out of, and any verdict would be uninformative.
Soundness: $\mathit{post}(i, o)$ holds. Intuitively, the spec accepts the intended output. A failure here means the spec is too strong—it rejects something the informal intent says is correct.
Uniqueness: $\forall o’ \neq o,\ \neg\mathit{post}(i, o’)$ holds. Intuitively, the spec rejects every alternative output. A failure here means the spec is too weak—it admits outputs the informal intent does not.

All three properties can be discharged by a symbolic verifier—this is essentially what a SPOT does, just packaged as a single obligation rather than three. More importantly for us, however, each of them can also be effectively invalidated by testing, without attempting a proof at all: a single counterexample is enough to refute any one of these properties and flag the spec. The next sections explain how to do so.

Property-Based Testing

To turn the invalidation strategy from the previous section into actual code, we need a tool that can draw candidate inputs (and candidate alternative outputs $o’$) automatically and check a Boolean property on each. That tool is property-based testing (PBT): a technique for checking that an object satisfies a property by drawing random inputs from a generator, evaluating the property on each, and reporting any concrete counterexample it finds. The idea was introduced in 2000 in the seminal paper on QuickCheck for Haskell, and has since been re-implemented in essentially every modern programming language, including Lean.

In the twenty-five years since, PBT has proven dramatically effective across a wide range of domains: it has uncovered bugs in production compilers, driven testing of financial software, surfaced subtle defects in computational geometry algorithms and in smart contract runtimes. What makes PBT so effective is that a precise formal property is, all by itself, an automatic refutation engine: any concrete input that violates it counts as a bug, and finding one requires only that the generator stumbles onto a witness.

Lean 4 ships with Plausible, a property-based testing library in the QuickCheck tradition. Given a theorem statement, Plausible generates random inputs from typeclass-derived generators and tries to refute the goal by exhibiting a counterexample.¹

To see Plausible in action, consider an insertion-sort implementation written in Velvet. The Velvet method comes with a postcondition that we believe expresses what sorting means:

method insertionSort (mut arr: Array Int) return (u: Unit)
  require 1 ≤ arr.size
  ensures ∀ i j, 0 ≤ i ∧ i ≤ j ∧ j < arr.size → arr[i]! ≤ arr[j]!
  ensures arr.toMultiset = arrOld.toMultiset
  do
    -- implementation elided

The two ensures clauses say, respectively, that the array is in ascending order after the method runs, and that its multiset of elements is unchanged—i.e., the result is a permutation of the input. Without ever proving this method correct, we can already use Plausible to test the implementation against the postcondition:

let g : Plausible.Gen (_ × Bool) := do
  let arr ← Plausible.SampleableExt.interpSample (Array Int)
  let res := insertionSortTester arr
  pure (arr, res)

for _ in [1 : 500] do
  let res ← Plausible.Gen.run g 10
  unless res.2 do
    IO.println s!"postcondition violated for input {res.1}"
    break

The helper insertionSortTester is auto-derived from the method’s signature:² it runs insertionSort on the sampled array and evaluates the postcondition on the resulting state. The loop is trying to invalidate the postcondition by drawing 500 fresh inputs and looking for one that breaks it. If none is found, we have a strong evidence—though not a proof—that the implementation respects the spec on the distribution of inputs the generator explores; if one is found, the offending array is a concrete witness of a bug. This is the canonical PBT trade-off against a symbolic correctness proof: dramatically cheaper to run, at the cost of giving up the universal guarantee that a proof would provide.

Notice, though, what we are testing here: we are checking the implementation of insertionSort against a specification we already trust. The next step—the one this whole post has been building towards—is to flip the script, and use the very same machinery to validate the specification itself.

Catching a Bad Spec with PBT

Let’s put PBT and the soundness/uniqueness pair to work on a concrete LLM-synthesised specification. We stay with the sorting task: given a list of integers, produce one in ascending order. Asked to write a Lean specification for this problem, an LLM might propose:

def precondition (arr : List Int) : Prop :=
  True

def postcondition (arr : List Int) (result : List Int) : Prop :=
  (∀ i j, 0 ≤ i < j < result.size → result[i]! ≤ result[j]!) ∧
  (∀ i, 0 < i < arr.size → arr.count arr[i]! = result.count arr[i])

In plain English, the postcondition says that result is in ascending order and that every element of arr also appears in result. This is silently too weak: it never forbids result from containing extra elements that were never in arr.

To validate this spec, we run the three PBT checks from the previous section. Admissibility is trivial here: the precondition is True, so every input we pick passes. For soundness, we choose a few small test cases $(i, o)$ where $o$ is the intended sort of $i$³—say $(\texttt{#[]}, \texttt{#[]})$ and $(\texttt{#[3,1,2]}, \texttt{#[1,2,3]})$—and check that postcondition i o holds. It does, on every case we try: the spec accepts the intended outputs, so soundness is not the problem here.

For uniqueness, Plausible randomly samples alternative outputs $o’ \neq o$ for each test case and checks whether postcondition i o' is still accepted. On the very first test case $i = \texttt{#[]}$, the generator quickly stumbles onto $o’ = \texttt{#[0]}$: the array is trivially in ascending order and trivially contains every element of arr (because arr is empty), so the postcondition is satisfied. Plausible reports the counterexample, and the spec is flagged as too weak—exactly as the reframing predicted.

What’s Hard to Test (and How We Cope)

PBT thrives when the property under test is a universally-quantified Boolean predicate one can evaluate directly on each sampled input. It struggles when the property contains an unbounded existential quantifier: refuting $\exists x,\, P(x)$ means establishing $\forall x,\, \neg P(x)$, which testing alone cannot do. We found two simple patterns that let us tackle the awkward cases without falling back to a full symbolic proof.

First, in program specifications, existentials are almost always implicit and bounded by the surrounding context—an index into arr lives in $[0,\, \texttt{arr.size})$, not in all of $\mathbb{N}$. We infer such bounds heuristically from the property’s structure, prove that they indeed hold for the respective program variables (such as array indices) with Lean tactics like grind and omega, and then enumerate the existential variable over the resulting finite range.

Second, sampling alternative outputs $o’ \neq o$ uniformly at random is usually quite ineffective: the odds of landing on a meaningful counterexample in this case are astronomically small. Instead, we take a page from the fuzz testing book and use mutation-based sampling: we perturb the intended output by flipping a Boolean, adding $\pm 1$ to an integer, or inserting/deleting an element from a list. This is because bad outputs tend to look much like good ones, and this empirical fact makes such “small-scale” mutation much more effective than blind random search in a large space of possible outputs.

With these two adaptations in place, one might wonder whether the testing-based approach to validating formal specifications actually pays off in the wild. The rest of the post discusses the experience of using a PBT-based spec-validation pipeline on two state-of-the-art benchmarks of specifications for Lean programs.

Catching Specification Bugs in Verified Coding Benchmarks

Our hypothesis going in was as follows: if PBT-based spec validation works as advertised, it should be able to find underspecified specifications even in published, human-written Lean benchmarks that have already been vetted by their authors as reference examples.

To put this to the test, we ran our pipeline on two state-of-the-art benchmarks for Lean specification synthesis: VERINA and CLEVER. Both benchmark suites provide natural-language problem descriptions, formal specifications, and a handful of test cases; some problems also include a reference implementation. When one was available, we used it to compute the intended outputs from precondition-satisfying inputs. CLEVER’s specifications do not separate preconditions from postconditions, so we wrote a small script to do that conversion; due to formatting issues, only 104 of CLEVER’s specifications converted cleanly, and those are the ones we tested.

Across 188 problems from VERINA and 104 from CLEVER, PBT flagged 13 underspecified specifications in the former and 18 in the latter—about 10% of everything we tested.

We reported these findings to the benchmark authors. The VERINA team has acknowledged the bugs we surfaced and patched nearly all of them (one is still under review); the CLEVER team has acknowledged all 18 issues, though fixes have not yet shipped at the time of writing.

Let us now look at three of the bugs we found.

Example 1: Forgotten Range Constraints

The first comes from VERINA’s Basic 46. The task is to find the last position of a given element in a sorted array of integers, returning -1 if the element is absent. VERINA’s specification is:

def lastPosition_precond (arr : Array Int) (elem : Int) : Prop :=
  List.Pairwise (· ≤ ·) arr.toList

def lastPosition_postcond
  (arr : Array Int) (elem : Int) (result : Int)
  (h_precond : lastPosition_precond arr elem) :=
    (result ≥ 0 →
      arr[result.toNat]! = elem ∧
      (arr.toList.drop (result.toNat + 1)).all (· ≠ elem)) ∧
    (result = -1 → arr.toList.all (· ≠ elem))

At a glance this looks fine: when result ≥ 0, the result-th element equals elem and no element after it does; when result = -1, no element of the array equals elem.

PBT quickly finds the problem. With arr = #[] and elem = 0, the postcondition accepts both -1 and 0, even though only -1 is correct. The reason: when result = 0, the lookup arr[result.toNat]! is out of bounds, and Lean’s accessor silently returns the default Int, which is 0, and happens to equal elem. The postcondition is satisfied by accident. Worse, the postcondition also accepts -2: nothing in the spec restricts result to be in [-1, arr.size), so anything below -1 is unconstrained.

What looks like an aesthetic problem is actually an exploit waiting to happen. A motivated attacker—or just a lazy LLM optimising for the cheapest implementation the verifier will accept—can ship code that returns 0 whenever the input array is empty, or uses any negative integer below -1 as a “not found” sentinel. The verifier, reading only the postcondition, will stamp this program as correct; the user, trusting that stamp, will never notice that the “verified” program disagrees with the natural-language task it was supposed to solve.

Adding the explicit constraint -1 ≤ result < arr.size fixes the spec. The general lesson: when writing a specification, always constrain the domain of every output variable.

Example 2: Silently Truncated Subtraction

The second example comes from CLEVER’s problem 79. The task is to convert a number in decimal to binary and wrap the result with "db" at both ends—so the desired output for 32 is "db100000db". CLEVER’s specification reads:

def problem_spec
  (implementation : Nat → String)
  (decimal : Nat) :=
  let spec (result : String) :=
    4 < result.length ∧
    result.drop (result.length - 2) = "db" ∧
    result.take 2 = "db" ∧
    let resultTrimmed :=
      (result.toList.drop 2).dropLast.dropLast.map
        (fun c => c.toNat - '0'.toNat)
    decimal = Nat.ofDigits 2 resultTrimmed.reverse
  ∃ result, implementation decimal = result ∧
  spec result

The first three lines say result starts and ends with "db". The next line strips the wrappers and turns each remaining character into a digit by subtracting '0'’s Unicode value. The last line says the resulting digits, read as base 2, equal the input.

The bug hides in that subtraction. Lean’s Nat subtraction is truncated: any negative result clamps to 0. So any character whose Unicode value is $\le$ that of '0'—including '/'—silently maps to 0. When the input is 0, the postcondition therefore accepts both "db0db" and "db/db".

A random generator would essentially never produce a string starting and ending with "db" by chance—the structural constraint is too tight. Mutation-based sampling, on the other hand, perturbs the expected output one character at a time and immediately exposes the bug. Mutation also surfaced a different underspecification in VERINA’s Basic 97 (an in-place update of an array element to 60): the postcondition does not require the output to have the same length as the input, and PBT caught this by appending an extra element to the expected result.

Example 3: Catching Implementation Bugs

The third example, from CLEVER’s problem 9, is a bonus: we were hunting for spec bugs, but this one is an implementation bug. The task is to compute the running maximum of a list of integers, and the reference implementation is:

def implementation (numbers : List Int) : List Int :=
  let rec rolling_max
          (numbers : List Int)
          (results : List Int)
          (acc : Int) : List Int :=
            match numbers with
            | [] => results
            | n :: ns =>
              let new_acc := max acc n
              let new_results := results ++ [new_acc]
              rolling_max ns new_results new_acc
  rolling_max numbers [] 0

The accumulator acc is initialised to 0, which gives the wrong answer whenever the first element of the input is negative. PBT generated such an input within a handful of samples, the implementation’s output failed the spec, and the bug came out for free.

Limitations and Conclusions

In the traditional formal verification setting—where we want to check that a program meets a specification we trust—testing is undoubtedly a much weaker tool than a formal proof: a test suite can never quite rule out an unseen bug, while a verifier gives a guarantee that no amount of testing can match. When the artefact under scrutiny is the specification itself, however, the picture becomes much subtler. The underlying problem—does this formal statement faithfully capture what the human actually meant?—is inherently non-formal, and even SPOT-style symbolic validation does not give a 100% guarantee: it is only as good as the test cases one chooses. Once absolute certainty is no longer on the table for any method, PBT-based validation becomes a genuinely worthy alternative in the design space, with the added benefit of being much cheaper than anything that involves proofs.

That said, our method has several opportunities for future improvements, which we discuss next.

First, the uniqueness property simply does not hold for every task. A clean example is quicksort-style partitioning, where the relative order within each partition is irrelevant to correctness and the spec legitimately admits many distinct outputs for the same input. Our approach can, however, be extended by replacing uniqueness with other, more permissive universally-quantified meta-properties of specifications—for instance, “any two accepted outputs are related by some user-supplied equivalence”. Identifying useful such properties and integrating them into a single PBT-driven validation pipeline is an exciting direction for future work.

Second, a good specification should capture the relation between inputs and outputs, rather than encode a particular algorithm in disguise. PBT can flag specs that are too weak or too strong, but it has nothing to say about specs that are simply too operational. We believe this dimension can be addressed by restricting the specification language itself—e.g., to a fragment restricted to particular sets of computational primitives—and we view this, too, as a promising future direction.

More broadly, the most interesting story here is probably not “testing versus proof” but “testing and proofs”: combining lightweight randomised validation with heavyweight deductive methods opens up an intriguing design space for cutting the cost of formal verification and verified code generation, using each technique where it is the most appropriate.⁴ Readers curious about how specification testing fits into the larger picture of verifiable code generation may enjoy our recent paper on LeetProof, where PBT-based specification validation works alongside SMT and agentic proof search in a single end-to-end pipeline.

There is a subtlety here: Plausible can only refute propositions that are decidable—i.e., come with a procedure that returns a Boolean for every concrete input. In practice, Lean’s type class resolution synthesises the required Decidable instance automatically for most propositions one encounters in program specifications: equalities over built-in types, comparisons, finitely-quantified statements, and Boolean combinations of all of the above. ↩
The details of how Velvet turns a method signature into an executable tester are described in our CAV’26 paper. ↩
In practice, these input–output pairs can be drafted by an LLM and then sanity-checked by a human—the same assumption SPOTs make about their concrete witnesses. ↩
The same applies beyond specifications for synthesised programs: testing the definitions that themselves appear inside our theorems—e.g., a programming language semantics in a soundness statement—is just as valuable, since a flawed definition can make even a true theorem useless. See our earlier post on mechanising the Move borrow checker in Lean for a workflow of this kind. ↩

Proofs and Intuitions