# research-ingest

Unified investigative research pipeline. Ingest articles, books, depositions, and media. Track themes across sources. Query in plain English.
```
cp research-ingest.md .claude/commands/
```
A Claude Code skill that handles the full investigative research pipeline — from raw source acquisition through structured fact extraction, database accumulation, and theme tracking. Point it at any URL or file and it builds a queryable facts database you can research against.
- Auto-detects source type from URL or file extension — no prefix needed for articles, books, or media
- Article ingestion with multi-strategy paywall bypass
- Multi-agent book extraction: Haiku×4 specialists + Sonnet×2 synthesizers per chunk
- Theme tracking — investigative threads that accumulate evidence across sources
- Project-agnostic: configure any investigation via the `--project` flag
- No external dependencies beyond the Python standard library
## Subsystems
Pass any URL or file path. Claude detects the source type automatically and routes to the right pipeline.
| Command | What it does |
|---|---|
| `ingest <url>` | Web article, auto-detected |
| `ingest <path>` | Book (EPUB/PDF), auto-detected |
| `ingest <url>` | YouTube / audio, auto-detected |
| `ingest doc <path>` | Court docs / PDFs (use `doc` for explicit routing) |
| `ingest extract <source>` | Run multi-agent extraction on a saved source file |
| `ingest theme <action>` | Define, tag, suggest, and report on investigative themes |
| `ingest report` | Project status + per-theme evidence summaries |
## Article ingestion
Tries multiple access strategies automatically — direct fetch, archived copies, syndication — and stops when it gets full text.
**You say**

```
ingest https://www.thedailybeast.com/... --queue 7
```
**What lands in `sources/`**

```
# Maxwell Sends Mystery USB to DOJ
**Source:** Daily Beast
**Author:** Jose Pagliery
**Date:** April 22, 2026
**URL:** https://www.thedailybeast.com/...
**Access:** DONE — via Jina AI Reader
**Retrieved:** 2026-04-27
---
## Key Content
Maxwell sent a USB drive via FedEx to U.S. Attorney Jay Clayton on April 16, 2026,
days after a sealed filing referenced Melania Trump. The USB arrived at the U.S.
Attorney's office in Manhattan without explanation. Clayton had been confirmed as
U.S. Attorney for SDNY three days earlier...
## Key Facts for Sourcing
- **Maxwell sent USB drive to Jay Clayton via FedEx** — April 16, 2026
- **Government deadline to respond to Maxwell petition** — June 5, 2026
```
## Book extraction

Large documents are processed in 8,000-character chunks. Six specialist agents run on each chunk in parallel: four Haiku agents for speed and two Sonnet agents for nuance. Facts accumulate into `facts.jsonl` as chunks complete.
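The split itself is straightforward. A sketch, assuming fixed-size non-overlapping chunks (the skill does not specify any overlap):

```python
def chunk_text(text: str, size: int = 8000) -> list[str]:
    """Split a document into fixed-size chunks for parallel extraction."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# A 340,000-character book yields 43 chunks (42 full + 1 partial).
chunks = chunk_text("x" * 340_000)
```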
### The six agents
| Agent | Model | Extracts |
|---|---|---|
| People | Haiku | Names, roles, relationships, affiliations |
| Timeline | Haiku | Dated events, sequences, cause-and-effect |
| Locations | Haiku | Named places, addresses, geographic context |
| Bibliography | Haiku | Citations, sources, cross-references |
| Claims | Sonnet | Assertions with certainty scores + attribution |
| Scenes | Sonnet | Narrative moments: who, where, what was said |
**You say**

```
ingest books/brown-perversion-of-justice.md
```

**What happens**

```
# Splits 340,000 chars into ~43 chunks
# Launches 6 agents × 43 chunks = 258 parallel extractions
# Agents run with Haiku (fast, cheap) or Sonnet (nuanced)
# Each chunk returns structured JSON
→ Chunk 1/43: People (14 entities), Timeline (8 events), Claims (11 facts)...
→ Chunk 2/43: ...
→ Complete: 394 facts extracted · Total in store: 9,241
```
## Theme tracking
Themes are investigative threads — cross-cutting patterns that accumulate evidence across sources. Distinct from tags (low-level descriptors). Each theme has keywords Claude uses to auto-suggest fact assignments as new sources are ingested.
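The keyword pass of `theme suggest` can be sketched as a substring match over fact claims; the Haiku classification step then filters these candidates. Names here are illustrative:

```python
def keyword_candidates(facts: list[dict], themes: list[dict]) -> list[tuple[str, str]]:
    """Propose (fact_id, theme_id) pairs wherever a theme keyword appears in a claim."""
    pairs = []
    for fact in facts:
        claim = fact["claim"].lower()
        for theme in themes:
            if any(kw in claim for kw in theme["keywords"]):
                pairs.append((fact["id"], theme["id"]))
    return pairs
```

Substring matching over-proposes by design (e.g. "trust" matches "trustee"); that is why the proposals go through classification and user approval before anything is written.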
### Workflow

| Command | What it does |
|---|---|
| `theme define "financial-concealment"` | Create a theme with description + keywords |
| `theme tag F-0123 financial-concealment` | Manually assign a fact to a theme |
| `theme suggest` | Auto-propose fact→theme mappings via keyword match + Haiku classification |
| `theme report financial-concealment` | Evidence summary: count, sources, date range, gaps |
### Theme schema (`themes.jsonl`)

```json
{
  "id": "T-001",
  "name": "financial-concealment",
  "display_name": "Financial Concealment",
  "description": "Methods used to hide assets: shell companies, offshore trusts, nominee accounts",
  "keywords": ["offshore", "shell", "trust", "launder", "conceal", "nominee"],
  "fact_ids": ["F-0123-0045", "F-0456-0012"],
  "source_ids": ["SRC-BOOK-brown-poj"],
  "fact_count": 127,
  "source_count": 23,
  "earliest_date": "1985",
  "latest_date": "2019",
  "status": "active"
}
```
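`fact_count`, `source_count`, and the date range are derived fields. A sketch of recomputing them from the assigned facts, assuming each fact carries a `date` string and a `source_id` as in the fact schema (the function name is illustrative):

```python
def refresh_theme(theme: dict, facts_by_id: dict) -> dict:
    """Recompute fact_count, source_count, and date range from assigned fact IDs."""
    facts = [facts_by_id[fid] for fid in theme["fact_ids"] if fid in facts_by_id]
    # ISO-ish date strings ("1985", "2019-05-01") sort correctly as text
    dates = sorted(f["date"] for f in facts if f.get("date"))
    theme["fact_count"] = len(facts)
    theme["source_count"] = len({f["source_id"] for f in facts})
    theme["earliest_date"] = dates[0] if dates else None
    theme["latest_date"] = dates[-1] if dates else None
    return theme
```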
### Theme report output

```
## Theme: Financial Concealment (T-001)
127 facts across 23 sources | 1985–2019
### Core claim
[2–3 sentence synthesis from highest-certainty facts]
### Evidence by period
- 1985–1990: 12 facts (3 sources)
- 1990–2000: 45 facts (8 sources)
- 2000–2019: 70 facts (12 sources)
### Strongest sources
- Brown, Perversion of Justice (certainty 7) — 34 facts
- DOJ-OGR-000234 (certainty 9) — 8 facts
### Gaps
- No sources cover offshore account origins pre-1985
```
## The database
Three JSONL files grow as you ingest. Migrate to SQLite when you need cross-entity queries or timeline generation — the migration is safe to re-run at any point.
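A minimal sketch of the JSONL→SQLite migration. The table layout here is illustrative; the re-run safety comes from rebuilding the table from scratch each time:

```python
import json
import sqlite3

def migrate(facts_path: str = "facts/facts.jsonl",
            db_path: str = "facts/facts.db") -> None:
    """Rebuild the SQLite facts table from JSONL. Safe to re-run: drops and reloads."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS facts")
    con.execute("""CREATE TABLE facts (
        id TEXT PRIMARY KEY, claim TEXT, date TEXT,
        certainty INTEGER, source_id TEXT, raw TEXT)""")
    with open(facts_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            f = json.loads(line)
            con.execute("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
                        (f["id"], f["claim"], f.get("date"),
                         f.get("certainty"), f.get("source_id"), line.strip()))
    con.commit()
    con.close()
```

Keeping the raw JSON line in a `raw` column preserves fields (entities, tags, themes) the flat columns don't model, so nothing is lost in translation.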
| File | Contains |
|---|---|
| `facts/facts.jsonl` | Every extracted fact — claim, date, certainty, source, entities, tags, themes |
| `facts/sources.jsonl` | Source registry — title, author, type, certainty tier, ingest date |
| `facts/themes.jsonl` | Investigative threads — keywords, fact IDs, source IDs, date range |
### Fact schema

```json
{
  "id": "F-08910",
  "claim": "Maxwell sent a USB drive via FedEx to U.S. Attorney Jay Clayton on April 16, 2026",
  "date": "2026-04-16",
  "certainty": 7,
  "source_id": "ART-dailybeast-maxwell-usb-2026",
  "people": ["Ghislaine Maxwell", "Jay Clayton"],
  "places": ["New York"],
  "orgs": ["DOJ", "SDNY"],
  "tags": ["maxwell", "doj", "2026"],
  "themes": ["T-003"]
}
```
## The skill file

Copy this into `.claude/commands/research-ingest.md`. Claude Code picks it up automatically and knows how to route all subsystem commands.
---
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch, Task
description: "Unified investigative research pipeline — articles, documents, books, media, multi-agent extraction, and theme tracking"
argument-hint: "<url or file path> — auto-detects source type"
---
# research-ingest
Unified investigative research ingestion skill (v1.2).
## Auto-detection dispatch
Inspect the first token of `$ARGUMENTS`:
| Input | Routes to |
|-------|-----------|
| URL containing `youtube.com` or `youtu.be` | `media` |
| Any other URL (starts with `http`) | `article` |
| Path ending in `.epub`, `.pdf`, or `.md` | `book` |
| First token is `doc` | `doc` subsystem |
| First token is `extract` | `extract` subsystem |
| First token is `theme` | `theme` subsystem |
| First token is `report` | `report` subsystem |
If no match, assume `article`.
---
## article — Paywall bypass cascade
Flags: `--title`, `--pub`, `--author`, `--date`, `--queue N`, `--project DIR`, `--no-save`
### CRITICAL: Ask first — never reconstruct silently
If Strategy 1 fails and the user is in the conversation, STOP. Say:
> "I couldn't get the full text from [domain]. Do you have access? Can you paste it?"
Only proceed to Strategy 2+ if user says go ahead. Fail loudly in first response line, never buried.
### Strategy cascade (try in order, stop when you have full text)
1. **Direct fetch + Trafilatura** — WebFetch original URL
2. **Jina AI Reader** — `WebFetch: https://r.jina.ai/{URL}` — clean Markdown, works for soft paywalls
3. **Wayback CDX check** — `http://web.archive.org/cdx/search/cdx?url={URL}&output=json&fl=timestamp,statuscode&filter=statuscode:200&limit=1` — if empty, skip; else fetch `https://web.archive.org/web/{TIMESTAMP}/{URL}`
4. **archive.ph** — `https://archive.ph/newest/{URL}` — conditional, only if previously archived
5. **Syndication hunt** — WebSearch: `"{title}" site:yahoo.com OR site:msn.com OR site:britbrief.co.uk`
6. **AMP/mobile versions** — `https://amp.{domain}/{path}` or `https://{domain}/amp/{path}`
7. **Secondary reconstruction** — only with explicit user permission
8. **Parallel agent** — Task (Sonnet) deep research sweep
### Source file format
Save to `03_Resources/{project}/sources/{publication}-{year}-{slug}.md`:
```
# {Article Title}
**Source:** {Publication}
**Author:** {Author}
**Date:** {Month Day, Year}
**URL:** {url}
**Access:** {DONE | PARTIAL | DEAD} — {notes}
**Retrieved:** {today}
---
## Summary
{2–3 sentence summary}
## Key Content
{Full article text}
## Key Facts for Sourcing
- **{Fact}** — {context}
```
---
## book — Multi-agent extraction
Splits the source into 8,000-character chunks. Launches 6 specialist agents per chunk in parallel.
| Agent | Model | Extracts |
|-------|-------|---------|
| People | Haiku | Names, roles, relationships |
| Timeline | Haiku | Dated events, sequences |
| Locations | Haiku | Places, addresses, geography |
| Bibliography | Haiku | Citations, cross-references |
| Claims | Sonnet | Assertions + certainty scores |
| Scenes | Sonnet | Narrative moments: who, where, what was said |
Each agent returns structured JSON. Facts accumulate to `facts/facts.jsonl`. After all chunks complete, run `generate_views.py` and `validate.py`.
---
## extract — Run extraction on a saved source
Point at any saved source file. Runs the same 6-agent pipeline as `book`. Use after ingesting articles or docs when you want deeper structured extraction beyond the initial key facts.
```
python3 scripts/extract_facts_from_source.py sources/filename.md
python3 scripts/build_source_registry.py
python3 scripts/generate_views.py
python3 scripts/validate.py
```
---
## theme — Investigative thread tracking
Themes are cross-cutting patterns that accumulate evidence across sources. Distinct from `tags`.
### Actions
**`theme define <name>`**
Create a theme. Ask for: display name, description, keywords (comma-separated). Write to `facts/themes.jsonl`.
**`theme tag <fact-id> <theme-id>`**
Manually assign a fact to a theme. Update both the fact in `facts.jsonl` (add theme_id to `themes[]`) and the theme entry (add fact_id to `fact_ids[]`).
**`theme suggest`**
Scan all facts in `facts.jsonl` against all theme keyword lists. For each potential match, use Haiku to classify (yes/no/uncertain). Present proposed mappings for user approval before writing.
**`theme report <theme-name>`**
Generate structured evidence summary:
- Fact count + source count + date range
- Core claim synthesis (from highest-certainty facts)
- Evidence by time period
- Strongest sources by fact count
- Coverage gaps
### Schema (facts/themes.jsonl)
```json
{
"id": "T-001",
"name": "financial-concealment",
"display_name": "Financial Concealment",
"description": "Methods used to hide assets: shell companies, offshore trusts, nominee accounts",
"keywords": ["offshore", "shell", "trust", "launder", "conceal", "nominee"],
"fact_ids": [],
"source_ids": [],
"fact_count": 0,
"source_count": 0,
"earliest_date": null,
"latest_date": null,
"created": "{today}",
"updated": "{today}",
"status": "active"
}
```
Facts gain a `themes: [theme_id, ...]` field alongside existing `tags`.
---
## report — Project status
Show:
- Facts in `facts.jsonl` (total, by source type, last 7 days)
- Sources in `sources.jsonl` (total, by type)
- Themes in `themes.jsonl` (active, fact coverage)
- Any validation errors
---
## Project config
Default: uses paths configured in the project's `CLAUDE.md`.
With `--project <name>`: reads `~/.research-ingest/<name>.toml`:
```toml
[project]
name = "my-investigation"
vault_path = "~/research/my-investigation"
facts_path = "{vault_path}/facts/facts.jsonl"
themes_path = "{vault_path}/facts/themes.jsonl"
sources_path = "{vault_path}/facts/sources.jsonl"
sources_dir = "{vault_path}/sources"
```
---
## Rules
- **Never fabricate content.** If you can't find a quote, don't make one up.
- **Ask before reconstructing.** Strategy 1 failure = ask user, not proceed silently.
- **Fail loudly.** Access failure goes in first response line, never buried.
- **User-pasted content = full access.** If the article is in the conversation, use it.
- **Save before analyzing.** Content not saved will evaporate between sessions.
- **Check for existing source files** before creating new ones — don't duplicate.
- **Always run validate.py** after bulk extraction. Zero errors = clean corpus.
---
## Examples
```
ingest https://bloomberg.com/... --pub Bloomberg --queue 7
ingest https://substack.com/... --no-save
ingest ~/Downloads/perversion-of-justice.epub --author "Julie K. Brown"
ingest https://www.youtube.com/watch?v=abc123
ingest https://www.youtube.com/@bekahdayyy/videos --channel --confidence MEDIUM
doc court-filing.pdf
extract sources/bloomberg-2025-epstein-emails.md --agents claims,scenes
theme define "victim-network"
theme suggest --source sources/bloomberg-2025-epstein-emails.md
theme report financial-concealment
report
```
## Tips
### Start with themes before you have all your sources

Define your investigative themes early — even with just a few sources. As new material comes in, `theme suggest` auto-proposes mappings against the keyword lists. The theme database fills up gradually rather than requiring a big classification pass at the end.
### Ingest everything — let certainty sorting do its job

Don't skip sources you don't fully trust. Flag them with their tier (`--certainty 3` for conspiracy-adjacent, `--certainty 9` for court docs). Facts land in the database tagged accordingly. When a trusted source later corroborates a low-certainty claim, the score rises. The database is the filter, not the front door.
### Use extract on articles, not just books

`ingest <url>` saves a source file and pulls key facts from the summary. `extract <source>` runs the full 6-agent pipeline on that file. Run `extract` on high-value articles when you need deeper structured extraction — named entities, claimed relationships, precise dates.
### For large sources, watch chunk progress

Book extraction launches hundreds of parallel agents. Ask Claude to show chunk progress inline so you can see the extraction working in real time. If a chunk fails, it's safe to re-run — the extractor deduplicates on fact content before appending.
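The dedupe-before-append behavior can be sketched as below, keying on the claim text (the actual extractor's dedupe key may differ):

```python
import json

def append_facts(new_facts: list[dict], path: str = "facts/facts.jsonl") -> None:
    """Append facts, skipping any whose claim text is already in the store."""
    try:
        with open(path, encoding="utf-8") as fh:
            seen = {json.loads(line)["claim"] for line in fh if line.strip()}
    except FileNotFoundError:
        seen = set()
    with open(path, "a", encoding="utf-8") as fh:
        for fact in new_facts:
            if fact["claim"] not in seen:
                fh.write(json.dumps(fact, ensure_ascii=False) + "\n")
                seen.add(fact["claim"])
```

Because duplicates are dropped before the append, re-running a failed chunk is idempotent: facts already in the store are never written twice.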
## FAQ
### What's the difference between tags and themes?

Tags are low-level descriptors — `palm-beach`, `2003`, `maxwell`. Themes are investigative threads — `financial-concealment`, `victim-network`, `institutional-complicity`. A theme accumulates evidence over time and produces reports. A tag is just a label.
### Extraction ran but I got 0 facts

Three causes: empty input (check `wc -l` on the file — conversion may have failed silently), score threshold too high (try `--threshold 3`), or page-marker artifacts from HTML EPUB extraction (strip `--- Page N ---` lines before running).
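Stripping the page markers is a one-liner, assuming each marker sits alone on its own line:

```python
import re

def strip_page_markers(text: str) -> str:
    """Remove '--- Page N ---' artifact lines left by EPUB/HTML conversion."""
    return re.sub(r"^--- Page \d+ ---\n?", "", text, flags=re.MULTILINE)
```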
### How do I use this for a non-Epstein project?

Create a config file at `~/.research-ingest/your-project.toml` with `vault_path`, `facts_path`, `themes_path`, and `sources_path`. Then pass `--project your-project` to any subsystem command. The skill defaults to whatever paths are set in the current project's `CLAUDE.md`.
### JSONL vs SQLite — when do I migrate?

Stay in JSONL while actively ingesting. Migrate to SQLite when you need entity cross-queries, timeline generation, or theme bucketing across the full corpus. The SQLite database is fully rebuilt from JSONL — re-migrating is safe and normal.
### What's the relationship to the old /ingest skill?

`ingest <url>` is the full article ingestion subsystem from the old `/ingest` skill — same cascade, same source file format. `book` and `extract` add the multi-agent pipeline. Theme tracking is new in v1.2. If you have an existing `/ingest` setup, the source files and facts database are fully compatible.