Deep technical specifications for executing autonomous benchmarks, STEM validation, and deterministic sandboxing.
Evaluating an LLM's ability to autonomously solve GitHub issues requires executing untrusted, model-generated code. Acadify handles this securely using Ephemeral Deterministic Sandboxes.
When you initiate a SWE-bench evaluation via the API, the following sequence occurs within our Kubernetes cluster:
1. A sandbox is provisioned with the requested python_version.
2. The base_commit hash is checked out to guarantee identical starting states across models.

POST /v1/sandbox/init
curl -X POST https://api.acadifysolution.com/v1/sandbox/init \
  -H "Authorization: Bearer aca_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "repository": "django/django",
    "base_commit": "11a681373",
    "python_version": "3.11",
    "timeout_minutes": 45
  }'
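For callers who prefer Python over curl, the same request can be assembled with the standard library. This is a minimal sketch: the helper name `build_init_request` is hypothetical, the request is built but not sent, and the response is not parsed because its schema is not described in this section.

```python
import json
import urllib.request

# Base URL taken from the curl example above.
API_BASE = "https://api.acadifysolution.com/v1"


def build_init_request(api_key, repository, base_commit, python_version, timeout_minutes):
    """Build (but do not send) the POST /v1/sandbox/init request."""
    payload = {
        "repository": repository,
        "base_commit": base_commit,
        "python_version": python_version,
        "timeout_minutes": timeout_minutes,
    }
    return urllib.request.Request(
        url=f"{API_BASE}/sandbox/init",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


init_request = build_init_request(
    "aca_live_abc123", "django/django", "11a681373", "3.11", 45
)
# To send it: urllib.request.urlopen(init_request, timeout=30)
```

Keeping construction separate from sending makes the payload easy to inspect or log before any untrusted code is executed on your behalf.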
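GPQA is a benchmark consisting of PhD-level questions spanning biology, physics, and chemistry. The questions are written by domain experts and designed to be "Google-proof": skilled non-experts with unrestricted internet access still answer most of them incorrectly, and experts in the relevant field are far from perfect.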
To successfully evaluate a model's logical reasoning trace, the output must be strictly formatted. Our parsing engine requires final mathematical answers to be enclosed in LaTeX block equations (\[ ... \]) and intermediate chain-of-thought logic to be contained within <reasoning> XML tags.
<!-- Example of a perfectly formatted model response -->
<reasoning>
To calculate the Hamiltonian for the perturbed system, we first identify the unperturbed state...
Applying first-order perturbation theory, the energy shift is given by the integral over the wavefunctions...
</reasoning>
The final energy shift is:
\[ \Delta E^{(1)} = \frac{e^2 \hbar B}{2m_e} \]
If the deterministic LaTeX parser fails to match the model's output to the strict benchmark key (e.g., due to an equivalent algebraic form that the regex missed), the trajectory is automatically flagged for SME Review. One of our STEM PhD network members will manually verify the mathematical equivalence.
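The format check and the review fallback described above can be sketched roughly as follows. This is an illustration, not Acadify's parser: it assumes the standard LaTeX display-math delimiters `\[ ... \]`, compares answers by whitespace-insensitive string equality, and uses hypothetical status labels (`pass`, `format_error`, `sme_review`).

```python
import re

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"\\\[(.*?)\\\]", re.DOTALL)  # matches \[ ... \]


def parse_response(text):
    """Return (reasoning, answer); either is None if the format is violated."""
    reasoning = REASONING_RE.search(text)
    answers = ANSWER_RE.findall(text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answers[-1].strip() if answers else None,  # last block equation is taken as final
    )


def normalize(latex):
    """Remove all whitespace so cosmetic spacing differences are ignored."""
    return re.sub(r"\s+", "", latex)


def grade(text, benchmark_key):
    """Return 'pass', 'format_error', or 'sme_review' (hypothetical labels)."""
    reasoning, answer = parse_response(text)
    if reasoning is None or answer is None:
        return "format_error"
    if normalize(answer) == normalize(benchmark_key):
        return "pass"
    # A string mismatch may still be an equivalent algebraic form: escalate to a human.
    return "sme_review"


response = r"""<reasoning>
Applying first-order perturbation theory...
</reasoning>
The final energy shift is:
\[ \Delta E^{(1)} = \frac{e^2 \hbar B}{2m_e} \]"""

exact_key = r"\Delta E^{(1)}=\frac{e^2\hbar B}{2m_e}"
reordered_key = r"\Delta E^{(1)} = \frac{\hbar e^2 B}{2m_e}"
```

With the reordered key, the answer is mathematically the same but the strings differ, which is exactly the case that gets routed to SME Review rather than marked wrong.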
Enterprise teams cannot wait for manual benchmarking. Acadify supports seamless integration into modern CI/CD pipelines, automatically running evaluation suites against new model checkpoints before deployment.
You can trigger an Acadify evaluation directly from your workflow file. Use our official GitHub Action to block PR merges if the new model checkpoint suffers a regression in SWE-bench scores.
name: Acadify Model Eval
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run SWE-bench Suite
        uses: acadify/eval-action@v1
        with:
          api_key: ${{ secrets.ACA_LIVE_KEY }}
          model_endpoint: "https://api.your-company.com/v1/inference"
          benchmark: "swe-bench-lite"
          regression_threshold: 0.95
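This section does not define how regression_threshold is applied. One plausible reading, shown here purely as an assumption, is that the new checkpoint must retain at least that fraction of the baseline score for the merge to proceed. The function name and the baseline/candidate inputs are hypothetical.

```python
def passes_regression_gate(baseline_score, candidate_score, regression_threshold=0.95):
    """Assumed semantics: candidate must keep >= threshold * baseline score."""
    if not 0.0 <= regression_threshold <= 1.0:
        raise ValueError("regression_threshold must be between 0 and 1")
    return candidate_score >= regression_threshold * baseline_score


# Baseline resolves 40% of issues; the gate then requires at least 38%.
small_drop_ok = passes_regression_gate(0.40, 0.39)
large_drop_ok = passes_regression_gate(0.40, 0.37)
```

Under this reading, a drop from 40% to 39% would pass the gate while a drop to 37% would block the merge. Confirm the actual semantics before relying on the threshold.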
How do we decide if an autonomous agent "passed" a SWE-bench evaluation? It is not just about whether the final test suite passes; it is about efficiency, cost, and safety.
| Metric | Weight | Description |
|---|---|---|
| Functional Pass Rate | 70% | Did the model's final git patch resolve the target issue without breaking existing tests? |
| Token Efficiency | 15% | How many tokens were consumed exploring the codebase? Lower consumption for the same outcome is scored as more efficient reasoning. |
| Execution Safety | 15% | Did the model attempt to execute destructive bash commands (e.g., rm -rf /)? |
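The weights in the table combine into a single score as a weighted sum. The sketch below assumes each metric has already been normalized to the range 0 to 1; how Acadify derives those per-metric values is not specified here, and the dictionary keys are hypothetical names.

```python
import math

# Weights from the table above (70% / 15% / 15%).
WEIGHTS = {
    "functional_pass_rate": 0.70,
    "token_efficiency": 0.15,
    "execution_safety": 0.15,
}


def composite_score(sub_scores):
    """Weighted sum of per-metric scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# A run that resolves the issue safely but with middling token efficiency.
example = composite_score(
    {"functional_pass_rate": 1.0, "token_efficiency": 0.5, "execution_safety": 1.0}
)
```

Here the total is 0.70 + 0.075 + 0.15 = 0.925, which shows that a correct and safe patch still loses points for wasteful exploration.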
You can pull the exact step-by-step terminal history of what your model did inside the sandbox.
GET /v1/eval/{job_id}/trajectory
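A request for this endpoint might be built as below. The sketch assumes the same host and Bearer authentication as the sandbox endpoint, which this section does not confirm for the trajectory route; the helper name and the job ID are hypothetical, and the response format is left unparsed because it is not documented here.

```python
import urllib.parse
import urllib.request

# Assumed to share the base URL used by the sandbox endpoint.
API_BASE = "https://api.acadifysolution.com/v1"


def build_trajectory_request(api_key, job_id):
    """Build (but do not send) GET /v1/eval/{job_id}/trajectory."""
    safe_id = urllib.parse.quote(job_id, safe="")  # keep odd characters out of the path
    return urllib.request.Request(
        url=f"{API_BASE}/eval/{safe_id}/trajectory",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


trajectory_request = build_trajectory_request("aca_live_abc123", "job_123")
# To send it: urllib.request.urlopen(trajectory_request, timeout=30)
```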