Writeup Highlights

I built the system as an argument for thoughtful agentic design: use language models where judgment, creativity, and critique are valuable, then force their outputs through deterministic gates whose statistical thresholds are pre-registered.

Correctness guardrails

The system treats LLM output as untrusted research input. Typed artifacts, schema validation, deterministic gates, and replayable result files keep persuasive prose from becoming executable truth.
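
A minimal sketch of the "untrusted input" idea, not the system's actual code: raw model text is parsed into a typed artifact and rejected outright if it fails validation. The `SignalProposal` type, its three fields, and the allowed directions are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of allowed values; the real schema is not given in the writeup.
ALLOWED_DIRECTIONS = {"long", "short"}
REQUIRED_FIELDS = {"name", "expression", "direction"}


@dataclass(frozen=True)
class SignalProposal:
    """Typed artifact: the only form in which a proposal may move downstream."""
    name: str
    expression: str
    direction: str


def parse_proposal(raw: str) -> Optional[SignalProposal]:
    """Return a typed proposal, or None if the raw LLM text fails validation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free-form prose never becomes an artifact
    if not isinstance(data, dict) or set(data) != REQUIRED_FIELDS:
        return None  # missing or unexpected fields are rejected, not repaired
    if not all(isinstance(v, str) and v.strip() for v in data.values()):
        return None
    if data["direction"] not in ALLOWED_DIRECTIONS:
        return None
    return SignalProposal(**data)
```

In this sketch, a persuasive paragraph that is not valid JSON simply yields `None`; nothing downstream ever sees it.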

Agent-first communication

Agents communicate in structured decision records: what they saw, what they chose, why they chose it, and what downstream measurement changed. That makes review possible after the run.
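
One way such a record could look, sketched under assumptions: the field names below mirror the four items listed above but are hypothetical, as is the choice of one JSON object per line as the storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical record shape: who decided, what they saw, chose, and why."""
    agent: str
    observed: str
    chosen: str
    rationale: str
    measurement_changed: str


def to_jsonl(record: DecisionRecord) -> str:
    """Serialize to a single line so records can be appended to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


def from_jsonl(line: str) -> DecisionRecord:
    """Rebuild the typed record when reviewing a finished run."""
    return DecisionRecord(**json.loads(line))
```

Because each record round-trips through plain text, a reviewer can replay the log after the run without access to the agents that wrote it.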

Point-in-time discipline

Fundamentals are keyed by the date they became available, training windows are kept separate from the holdout and out-of-sample (OOS) windows, and the signal proposal layer cannot inspect future returns.
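
A small sketch of availability-keyed lookup, assuming each fundamental is stored as an `(available_on, value)` pair sorted by availability date. The function name and the sample figures are illustrative, not taken from the system's data.

```python
from bisect import bisect_right
from datetime import date


def value_as_of(records, as_of):
    """Return the latest value already available on `as_of`, or None.

    `records` is a list of (available_on, value) pairs sorted by available_on.
    """
    dates = [available_on for available_on, _ in records]
    i = bisect_right(dates, as_of)  # count of records available on or before as_of
    return records[i - 1][1] if i else None


# Illustrative data: the second figure only becomes visible on 2013-02-15,
# even if the fiscal period it describes ended earlier.
eps = [(date(2012, 11, 10), 1.10), (date(2013, 2, 15), 1.25)]
```

With this sample, a query dated 2013-01-31 returns 1.10, because the later figure had not yet been published on that date.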

Pitfall-aware quant methodology

The writeup foregrounds survivorship bias, selection effects, null-model choice, multiple testing, benchmark choice, and the limits of a short bull-market OOS window.
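
To make the multiple-testing point concrete, here is a sketch of the Holm step-down correction. This is an assumption for illustration: the writeup does not state which correction the system uses, and the function name is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return one reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

For four candidates with p-values 0.01, 0.04, 0.03, and 0.005, every one clears a naive 0.05 cutoff, but only the first and last survive the correction.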

What the system demonstrates

  • The most important agentic feature was not autonomy; it was refusal. A useful system must be able to reject every LLM-generated candidate.
  • Benchmark choice is part of the result. A low-beta book, including a market-neutral one, can look disciplined and still lose badly to broad beta in a bull window.
  • LLMs are better placed in the research conversation than in the capital path. They can propose, critique, summarize, and leave artifacts; deterministic code should decide what passes.
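
The refusal property in the first bullet can be sketched as a gate whose thresholds are frozen before any candidate is measured. The metric names (`holdout_sharpe`, `adjusted_p`) and the threshold values are hypothetical; the writeup does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSpec:
    """Thresholds fixed before the run; frozen so they cannot be tuned afterwards."""
    min_holdout_sharpe: float
    max_adjusted_p: float


@dataclass(frozen=True)
class CandidateResult:
    name: str
    holdout_sharpe: float
    adjusted_p: float


def apply_gate(results, spec):
    """Return names of candidates that pass every threshold; may be empty."""
    return [
        r.name
        for r in results
        if r.holdout_sharpe >= spec.min_holdout_sharpe
        and r.adjusted_p <= spec.max_adjusted_p
    ]


# Illustrative values only.
SPEC = GateSpec(min_holdout_sharpe=0.5, max_adjusted_p=0.05)
```

An empty list is a valid outcome here: the gate has no fallback that promotes the "least bad" candidate, and no model output can alter `SPEC`.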

Honest bottom line

The system did not beat the equal-weight or cap-weighted universe benchmark over 2013-2015. The strongest outcome is methodological: the creative agent proposed ideas, the critic and validators applied real pressure, and the final report preserves the negative result instead of converting it into a marketing claim.