docs: add auto-reverse design spec

Conversational CLI that reverse-engineers a website API: LLM-driven headed browser (approach 3) + embedded mitmproxy capture/doc pipeline (approach 5), unified as a single tool-use agent. Free-threaded single-process architecture, intent-driven exploration, hybrid human/agent control, bounded LLM cost via endpoint-signature dedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 23:13:40 +08:00
parent adcd280bbd
commit 879dfc347d
1 changed files with 274 additions and 0 deletions
@@ -0,0 +1,274 @@
+# auto-reverse — Design
+
+**Date:** 2026-05-31
+**Status:** Approved (pending implementation plan)
+
+## Summary
+
+`auto-reverse` is a conversational CLI that reverse-engineers a website's API by
+combining an LLM-driven headed browser (so you can watch and take over) with an
+embedded intercepting proxy that captures and documents real traffic in real
+time. You state intent in plain language ("map the checkout flow"); Claude
+pursues it through the browser; every real request is captured, deduplicated,
+and turned into a growing OpenAPI spec + markdown as it happens; Claude reports
+findings and you steer — back and forth — until the intent is covered.
+
+It unifies two classic approaches into one tool:
+
+- **Approach 3** — an agent drives a real (headed) browser, so a React/SPA
+  behaves normally and runtime API calls actually fire.
+- **Approach 5** — an in-process capture pipeline documents each new endpoint in
+  real time as traffic flows.
+
+The 3+5 split is preserved as an *internal* boundary: the Driver Agent only
+*acts* (browser) and *queries/commands* (flows, doc); capture + documentation
+run independently in the background and keep working even when a human drives.
+
+## Goals
+
+- Conversational, intent-driven exploration (not a blind crawler, not
+  fire-and-forget). How far to explore ("sensible length") is Claude's
+  judgment against the user's stated intent.
+- Watchable: headed browser by default; human can grab control at any time
+  (hybrid driving), with all traffic still captured.
+- Bounded LLM cost: documentation cost scales with *distinct endpoints*, not
+  request volume.
+- Lossless capture: a raw archive retains everything even when the spec filters
+  noise out.
+
+## Non-goals (v1)
+
+- Robust automated login. Auth is a configurable, pluggable concern with a
+  stubbed default (manual-pause strategy); deeper strategies come later.
+- Defeating bot-detection / captcha / anti-automation.
+- Documenting WebSocket and gRPC traffic: it is recorded to the archive but
+  not modeled in v1.
+
+## Approach decision
+
+Chosen: **single agent, with browser + recon + doc all exposed as tools**
+(approach "A" from brainstorming). One Claude tool-use loop is the brain;
+capture and deterministic schema inference run in background threads; the LLM
+enriches only *new* endpoint signatures. Rejected: two cooperating agents (more
+cost/orchestration, harder to keep coherent in one chat) and fully-deterministic
+docs (cheapest but mechanical descriptions — offered instead as a `--no-llm-doc`
+flag).
+
+## Architecture
+
+Single free-threaded process (Python 3.14, `3.14+freethreaded`), four concurrent
+roles sharing an in-memory flow store. Free-threading is what gives real
+parallelism to the Python-side work (schema inference, doc generation, capture)
+without GIL contention.
+
+```
+┌────────────────────────────────────────────────────────────────┐
+│  auto-reverse  (one process, free-threaded)                      │
+│                                                                  │
+│  [main thread]  Chat REPL  ◄──── you type intent / steer         │
+│       │              ▲                                           │
+│       ▼              │ streamed replies + findings               │
+│  ┌─────────────────────────────┐                                │
+│  │  Driver Agent (Claude API,  │   tools:                        │
+│  │  tool-use loop)             │   • browser.*  (act)            │
+│  │                             │   • flows.*    (recon)          │
+│  └───┬─────────────┬───────────┘   • doc.*      (document)       │
+│      │ browser.*   │ flows.* / doc.*                             │
+│      ▼             ▼                                             │
+│  ┌─────────┐   ┌──────────────────────────────┐                 │
+│  │Playwright│  │  Flow Store (locked)          │◄──┐             │
+│  │ headed   │  │  dedup by signature, samples  │   │ push flows  │
+│  │ browser  │  └───────────┬──────────────────┘   │             │
+│  └────┬─────┘              │ new-signature events  │             │
+│       │ proxied            ▼                       │             │
+│       │            ┌───────────────┐               │             │
+│       │            │ Doc Worker     │  genson schema│            │
+│       │            │ [thread]       │  + LLM enrich │            │
+│       │            └───────┬────────┘   (new only)  │            │
+│       │                    ▼                        │            │
+│       │            openapi.yaml + API.md            │            │
+│       ▼                                             │            │
+│  ┌──────────────────────────────────────────┐      │            │
+│  │ mitmproxy DumpMaster [thread, asyncio]     │──────┘            │
+│  │ + addon  →  raw archive (flows dump + HAR) │                  │
+│  └──────────────────────────────────────────┘                  │
+└────────────────────────────────────────────────────────────────┘
+```
+
+### Thread roles
+
+- **Main thread** — chat REPL + Driver Agent tool-use loop (synchronous, easy to
+  reason about).
+- **mitmproxy thread** — embedded `DumpMaster` on its own asyncio loop; a capture
+  addon pushes each flow into the Flow Store and streams raw flows to disk.
+- **Doc Worker thread(s)** — consume *new-signature* events, run deterministic
+  schema inference, call the LLM only to enrich novel endpoints, write
+  spec/markdown.
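+- Playwright's browser driver is a separate Node subprocess, so browser
+  automation itself runs outside the interpreter. The Python binding's own
+  dependencies (e.g. `greenlet`, used by the sync API) still need
+  free-threaded wheels.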
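
The Doc Worker role described above can be sketched with only the standard library. This is a minimal illustration, not the project's implementation: `doc_worker`, `run_doc_worker`, and the `_STOP` sentinel are hypothetical names, and the `document` callback stands in for schema inference plus LLM enrichment.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel that tells the worker to exit


def doc_worker(events: queue.Queue, document) -> None:
    """Consume new-signature events until the stop sentinel arrives."""
    while True:
        item = events.get()
        if item is _STOP:
            break
        try:
            document(item)
        except Exception as exc:  # keep the worker alive if one endpoint fails
            print(f"doc worker: failed on {item!r}: {exc}")


def run_doc_worker(signatures) -> list:
    """Start one worker thread, feed it events, and wait for it to drain."""
    events: queue.Queue = queue.Queue()
    documented: list = []
    worker = threading.Thread(
        target=doc_worker, args=(events, documented.append), name="doc-worker"
    )
    worker.start()
    for signature in signatures:
        events.put(signature)
    events.put(_STOP)  # ask the worker to finish once the queue is drained
    worker.join()
    return documented
```

A single FIFO queue keeps events in arrival order, and the sentinel gives a clean shutdown path without daemon threads.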
+
+### Components (modules under `src/auto_reverse/`)
+
+- `cli.py` — entrypoint, arg parsing, wires everything, starts threads.
+- `repl.py` — chat loop, renders streamed agent output, handles steer/interrupt
+  and the take-over keypress, dispatches `/` meta-commands locally.
+- `agent.py` — Claude tool-use loop; owns the conversation.
+- `tools/browser.py`, `tools/flows.py`, `tools/doc.py` — the three tool groups.
+- `browser.py` — Playwright launch (headed, proxied), take-over/release.
+- `proxy.py` — embedded mitmproxy master + capture addon.
+- `store.py` — thread-safe Flow Store: signature dedup, sample retention, scope
+  filtering.
+- `doc/schema.py` — deterministic JSON Schema inference + merge.
+- `doc/openapi.py` — incremental OpenAPI assembly.
+- `doc/markdown.py` — human-readable API docs.
+- `doc/client.py` — optional typed httpx client generation from the spec.
+- `config.py` — config + the stubbed pluggable auth strategy.
+
+## Data flow
+
+The intent → action → capture → doc cycle:
+
+1. **You state intent** in the REPL. It is added to the conversation; the Driver
+   Agent takes the turn.
+2. **Agent acts** via `browser.navigate` / `click` / `type` etc. Each action
+   returns a *compact* page snapshot (URL, accessibility-tree summary or trimmed
+   DOM, visible interactive elements) — not raw HTML — so the agent reasons
+   cheaply about the next step.
+3. **Browser fires real requests** through the proxy. The capture addon
+   intercepts every flow regardless of who triggered it (agent or human).
+4. **Flow Store ingests** each flow: applies the scope filter, computes a
+   signature, dedups, retains a bounded set of samples, and streams the raw flow
+   to the archive on disk. New signatures emit an event.
+5. **Doc Worker** consumes new-signature events: infers/merges JSON Schema from
+   samples (deterministic), and on *first* sighting of a signature calls the LLM
+   once to name it, describe it, and group it. Writes `openapi.yaml` + `API.md`
+   incrementally.
+6. **Agent observes & reports**: between actions it calls `flows.search` /
+   `flows.get` to see what surfaced, then summarizes in chat (including noting
+   filtered third-party calls).
+7. **You steer**: redirect, ask questions, approve, or take the mouse. The loop
+   continues until the agent judges the intent covered, then it summarizes and
+   awaits the next intent.
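
Step 4 of the cycle above (ingest, dedup, bounded samples, new-signature event) can be sketched as follows. This is a hedged illustration only: the `FlowStore` class shape, the `max_samples` parameter, and the `new_signatures` queue are assumptions, and scope filtering and archive writes are left out.

```python
import queue
import threading
from collections import defaultdict, deque


class FlowStore:
    """Lock-guarded store that dedups flows by signature (illustrative only)."""

    def __init__(self, max_samples: int = 5):
        self._lock = threading.Lock()
        # Each signature keeps only its most recent `max_samples` samples.
        self._samples = defaultdict(lambda: deque(maxlen=max_samples))
        self.new_signatures: queue.Queue = queue.Queue()

    def ingest(self, signature: tuple, sample: dict) -> bool:
        """Record a sample; return True if the signature was not seen before."""
        with self._lock:
            is_new = signature not in self._samples
            self._samples[signature].append(sample)
        if is_new:
            # Emitted once per signature, so downstream LLM work stays bounded.
            self.new_signatures.put(signature)
        return is_new

    def samples(self, signature: tuple) -> list:
        """Return a snapshot copy of the retained samples for a signature."""
        with self._lock:
            return list(self._samples.get(signature, ()))
```

The membership check and the append happen under one lock acquisition, so concurrent ingests of the same signature produce exactly one event.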
+
+### Dedup signature
+
+```
+signature = (method, host, path_template, response_status_class)
+```
+
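+- `path_template` collapses variable segments via heuristics (numeric ids,
+  UUIDs, hashes, long opaque tokens → `{param}`), e.g.
+  `/api/users/4812/orders/99` → `/api/users/{param}/orders/{param}`. OpenAPI
+  requires distinct path parameter names, so placeholders need unique names
+  when the template is written to the spec.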
+- Query params are recorded as parameters, not part of the signature.
+- A repeated signature triggers **no LLM call**; its body/response are merged
+  into the existing schema samples (widening the schema, capturing optional
+  fields).
+- Net effect: LLM doc cost scales with *distinct endpoints*, not request volume.
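
The signature rules above could be implemented roughly like this. The regular expressions are assumed heuristics chosen for illustration (the spec leaves their tuning open), and `template_path` and `signature` are hypothetical names.

```python
import re
from urllib.parse import urlsplit

# Assumed heuristics for "variable-looking" path segments; tune per target.
_VARIABLE_PATTERNS = (
    re.compile(r"\d+"),                                              # numeric id
    re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"),  # UUID
    re.compile(r"[0-9a-fA-F]{16,}"),                                 # hex hash
    re.compile(r"(?=.*\d)[A-Za-z0-9_-]{24,}"),                       # long opaque token
)


def template_path(path: str) -> str:
    """Replace variable-looking segments with the `{param}` placeholder."""
    return "/".join(
        "{param}"
        if segment and any(p.fullmatch(segment) for p in _VARIABLE_PATTERNS)
        else segment
        for segment in path.split("/")
    )


def signature(method: str, url: str, status: int) -> tuple[str, str, str, str]:
    """Build (method, host, path_template, response_status_class)."""
    parts = urlsplit(url)  # the query string is deliberately ignored
    return (
        method.upper(),
        parts.hostname or "",
        template_path(parts.path),
        f"{status // 100}xx",
    )
```

Two requests that differ only in ids or query parameters map to the same tuple, so only the first one triggers documentation work.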
+
+### Scope filtering
+
+- Default in-scope: same-site / same-origin XHR/fetch/document requests to the
+  target host(s). Static assets (`.js/.css/.png/.woff`…) dropped.
+- Common third-party/analytics hosts (google-analytics, segment, stripe-js,
+  sentry, doubleclick…) dropped by a default denylist but *noted* so the agent
+  can mention them.
+- Configurable allowlist/denylist of hosts + path globs in `config.py`; the
+  agent can also be told in chat to include/exclude a host.
+- Everything is still written to the **raw archive** even when filtered from the
+  spec — filtering only affects what gets documented.
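
A possible shape for the scope decision is sketched below. The host patterns and suffix list are illustrative stand-ins for the default denylist and static-asset rules, and `classify` with its three return labels is a hypothetical interface; in every case the flow would still go to the raw archive.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Illustrative defaults only; the real lists would live in config.
STATIC_SUFFIXES = (".js", ".css", ".png", ".woff")
DEFAULT_DENYLIST = (
    "*google-analytics.com",
    "*segment.io",
    "*sentry.io",
    "*doubleclick.net",
)


def classify(url: str, target_hosts: set[str], denylist=DEFAULT_DENYLIST) -> str:
    """Return 'document', 'noted', or 'archive-only' for a captured URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if any(fnmatchcase(host, pattern) for pattern in denylist):
        return "noted"  # excluded from the spec, but the agent can mention it
    if host not in target_hosts:
        return "archive-only"
    if parts.path.lower().endswith(STATIC_SUFFIXES):
        return "archive-only"  # static assets are not documented
    return "document"
```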
+
+## CLI / REPL UX
+
+Invocation:
+
+```
+auto-reverse <target-url> [options]
+
+  --out DIR            output dir (default ./auto-reverse-out/<host>-<timestamp>/)
+  --proxy-port N       mitmproxy listen port (default 8080)
+  --headless           run browser headless (default: headed, so you can watch)
+  --profile DIR        persistent browser profile (cookies persist across runs)
+  --gen-client         after the session, generate a typed httpx client from openapi.yaml
+  --model NAME         Claude model (default: claude-opus-4-8)
+  --scope HOST,...     extra in-scope hosts (added to target)
+  --no-llm-doc         deterministic docs only (zero doc-LLM cost)
+  --resume DIR         reopen a previous session's store/spec and keep going
+```
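
The option table above maps onto `argparse` fairly directly. This is a sketch of how `cli.py` might declare it, with `build_parser` as a hypothetical name; the default for `--out` is left to be computed at runtime from host and timestamp.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; defaults mirror the usage text."""
    parser = argparse.ArgumentParser(prog="auto-reverse")
    parser.add_argument("target_url", metavar="<target-url>")
    parser.add_argument("--out", metavar="DIR", default=None,
                        help="output dir (derived from host and timestamp if omitted)")
    parser.add_argument("--proxy-port", type=int, default=8080, metavar="N")
    parser.add_argument("--headless", action="store_true")
    parser.add_argument("--profile", metavar="DIR")
    parser.add_argument("--gen-client", action="store_true")
    parser.add_argument("--model", default="claude-opus-4-8", metavar="NAME")
    parser.add_argument(
        "--scope",
        metavar="HOST,...",
        default=[],
        # Split the comma-separated value into a list of host names.
        type=lambda value: [host for host in value.split(",") if host],
    )
    parser.add_argument("--no-llm-doc", action="store_true")
    parser.add_argument("--resume", metavar="DIR")
    return parser
```

For example, `build_parser().parse_args(["https://shop.example.com", "--headless"])` yields a namespace with `headless=True` and `proxy_port=8080`.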
+
+REPL — plain chat plus a few `/` meta-commands handled locally (not sent to the
+LLM):
+
+```
+> map the checkout flow                ← natural-language intent
+/take         hand browser control to you; capture keeps running; /done to return
+/stop         interrupt the agent's current pursuit (keeps session alive)
+/flows [q]    print discovered endpoints (optionally filter) — local, no LLM
+/spec         show current openapi.yaml path + endpoint count
+/save         flush spec/markdown/archive now
+/help  /quit
+```
+
+- **Streaming**: agent replies and tool-call narration stream live.
+- **Take-over mechanics**: `/take` pauses the agent loop and surfaces the headed
+  browser to the human; mitmproxy keeps capturing into the same store; `/done`
+  resumes the agent, which first calls `flows.search` to catch up on what the
+  human did, then continues.
+- **Interrupt**: Ctrl-C / `/stop` cleanly interrupts mid-pursuit without killing
+  the session or losing captured data.
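
Local handling of `/` meta-commands could start from a small parser like the one below. The command names come from the list above; `parse_repl_line` and its tuple return shape are assumptions made for illustration.

```python
# Command names taken from the REPL help text in this spec.
META_COMMANDS = {"take", "done", "stop", "flows", "spec", "save", "help", "quit"}


def parse_repl_line(line: str) -> tuple:
    """Classify one REPL line as a local meta-command or chat for the agent.

    Returns ('command', name, arg), ('unknown', name, None), or
    ('chat', text, None). Only 'chat' lines would be sent to the LLM.
    """
    text = line.strip()
    if text.startswith("/"):
        name, _, arg = text[1:].partition(" ")
        if name in META_COMMANDS:
            return ("command", name, arg.strip() or None)
        return ("unknown", name, None)
    return ("chat", text, None)
```

Rejecting unknown `/` commands locally avoids sending a typo such as `/flow` to the model as if it were an intent.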
+
+## Error handling
+
+- **Browser action fails** (selector gone, navigation timeout): the tool returns
+  a structured error + fresh snapshot; the agent re-plans rather than crashing.
+  Bounded retries per action.
+- **LLM errors / rate limits**: exponential backoff in the agent loop;
+  doc-enrichment failures degrade gracefully to deterministic-only docs (the
+  endpoint is still recorded with a mechanical description) and are retried
+  later.
+- **Proxy/TLS**: first run installs/uses the mitmproxy CA; undecryptable flows
+  are logged to the archive and skipped for docs. Clear error if the port is in
+  use.
+- **Crash safety**: spec, markdown, and raw archive are written incrementally and
+  flushed on every new signature and on exit (including Ctrl-C via a signal
+  handler), so a mid-session crash loses at most the work since the last
+  flush. `--resume` reopens them.
+- **Free-threading caveat**: the Flow Store is guarded by a lock; queues are
+  thread-safe. If a required C-extension lacks free-threaded wheels, the README
+  documents the fallback (run on the GIL-enabled interpreter).
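
The exponential backoff mentioned for LLM errors might look like the following. It is a generic sketch, not tied to the `anthropic` client: `call_with_backoff` and all of its parameters are hypothetical, and the caller decides which exception types count as retryable.

```python
import random
import time


def call_with_backoff(
    fn,
    *,
    retries: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (Exception,),
    sleep=time.sleep,
):
    """Call `fn()`, retrying on `retry_on` with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller degrade gracefully
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

When the final attempt fails the exception propagates, which is the point where doc enrichment would fall back to a deterministic description and queue the endpoint for a later retry.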
+
+## Testing
+
+- **Unit (pure, no network/LLM):**
+  - `store`: signature templating (ids/UUIDs/hashes → `{param}`), dedup, sample
+    merging, scope filter allow/deny.
+  - `doc/schema`: schema inference + merge widening (optional fields, unions).
+  - `doc/openapi` + `doc/markdown`: golden-file output from canned flows.
+- **Tool layer**: browser/flows/doc tools tested against a Playwright-driven
+  **local fixture site** (a tiny Flask/Starlette app with a few JSON endpoints)
+  routed through a real embedded mitmproxy — verifies the full
+  capture→store→doc path with zero external dependencies and no LLM.
+- **Agent loop**: tested with a **mocked Claude client** returning scripted tool
+  calls, asserting the intent→action→observe cycle and graceful error re-planning.
+- **End-to-end smoke**: against the fixture site, assert a known endpoint lands in
+  `openapi.yaml` with the correct method/path-template/schema.
+- LLM-dependent enrichment is mocked in CI; a manual/optional live test is gated
+  behind an env flag.
+
+## Key dependencies
+
+- `playwright` — headed browser automation.
+- `mitmproxy` — embedded intercepting proxy (`DumpMaster` + addon).
+- `anthropic` — Claude API (tool-use loop + doc enrichment).
+- `genson` (or equivalent) — deterministic JSON Schema inference.
+- `openapi-python-client` (or equivalent) — optional `--gen-client` codegen.
+- All must be validated for Python 3.14 free-threaded wheel availability during
+  implementation; fallback documented if any are missing.
+
+## Open questions for implementation
+
+- Confirm free-threaded wheel availability for mitmproxy / playwright on 3.14t;
+  decide fallback interpreter if needed.
+- Exact compact-snapshot format the browser tools return to the agent
+  (accessibility tree vs. trimmed DOM) — tune for token cost vs. usefulness.
+- Path-template heuristics tuning (avoid over-collapsing legitimately distinct
+  static paths).