- CLEVER: A Curated Benchmark for Formally Verified Code Generation
We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing CLEVER as a challenging frontier benchmark for program synthesis and formal reasoning.
Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness.
- CLEVER: Curated Lean Verified Code Generation Benchmark
Overview: CLEVER is a benchmark suite for end-to-end code generation and formal verification in Lean 4, adapted from the HumanEval dataset.
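To make the end-to-end pipeline concrete, here is a minimal sketch of what a task in this style involves: a formal specification, a candidate implementation, and a correctness proof that Lean's type checker verifies. The names (`maxSpec`, `myMax`) and the problem itself are illustrative assumptions, not taken from the benchmark.

```lean
-- Hypothetical CLEVER-style task (names and problem are illustrative):
-- a spec, an implementation, and a machine-checked correctness proof.

-- Specification: `r` is the maximum of `a` and `b`.
def maxSpec (a b r : Nat) : Prop :=
  a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Candidate implementation (what a model would be asked to generate).
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Correctness theorem; acceptance is simply that Lean's kernel
-- type-checks this proof, with no test cases involved.
theorem myMax_correct (a b : Nat) : maxSpec a b (myMax a b) := by
  unfold maxSpec myMax
  split <;> omega
```

Post-hoc verification as described above then amounts to running the Lean type checker on the generated implementation and proof: if the file elaborates without errors, the solution is accepted.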
We introduce CLEVER, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean.
Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever).
- CLEVER: A Curated Benchmark for Formally Verified Code Generation
We present and test the largest benchmark for vericoding, the LLM generation of formally verified code from formal specifications, in contrast to vibe coding, which generates potentially buggy code from…