research-ingest

Unified investigative research pipeline. Ingest articles, books, depositions, and media. Track themes across sources. Query in plain English.

Add to Claude Code
$ cp research-ingest.md .claude/commands/

A Claude Code skill that handles the full investigative research pipeline — from raw source acquisition through structured fact extraction, database accumulation, and theme tracking. Point it at any URL or file and it builds a queryable facts database you can research against.

  • Auto-detects source type from URL or file extension — no prefix needed for articles, books, or media
  • Article ingestion with multi-strategy paywall bypass
  • Multi-agent book extraction: Haiku×4 specialists + Sonnet×2 synthesizers per chunk
  • Theme tracking — investigative threads that accumulate evidence across sources
  • Project-agnostic: configure any investigation via --project flag
  • No external dependencies beyond the Python standard library

Subsystems

Pass any URL or file path. Claude detects the source type automatically and routes to the right pipeline.

| Command | What it does |
|---------|--------------|
| ingest <url> | Web article, auto-detected |
| ingest <path> | Book (EPUB/PDF), auto-detected |
| ingest <url> | YouTube / audio, auto-detected |
| ingest doc <path> | Court docs / PDFs (use `doc` to route explicitly) |
| ingest extract <source> | Run multi-agent extraction on a saved source file |
| ingest theme <action> | Define, tag, suggest, and report on investigative themes |
| ingest report | Project status + per-theme evidence summaries |

Article ingestion

Works through multiple access strategies — direct fetch, archived copies, syndication — and stops as soon as it has full text. If the direct fetch fails, it asks before escalating rather than reconstructing silently.

You say

ingest https://www.thedailybeast.com/... --queue 7

What lands in sources/

# Maxwell Sends Mystery USB to DOJ

**Source:** Daily Beast
**Author:** Jose Pagliery
**Date:** April 22, 2026
**URL:** https://www.thedailybeast.com/...
**Access:** DONE — via Jina AI Reader
**Retrieved:** 2026-04-27

---

## Key Content

Maxwell sent a USB drive via FedEx to U.S. Attorney Jay Clayton on April 16, 2026,
days after a sealed filing referenced Melania Trump. The USB arrived at the U.S.
Attorney's office in Manhattan without explanation. Clayton had been confirmed as
U.S. Attorney for SDNY three days earlier...

## Key Facts for Sourcing

- **Maxwell sent USB drive to Jay Clayton via FedEx** — April 16, 2026
- **Government deadline to respond to Maxwell petition** — June 5, 2026

Book extraction

Large documents are processed in 8,000-character chunks. Six specialist agents run on each chunk in parallel — four Haiku agents for speed and two Sonnet agents for nuanced synthesis. Facts accumulate into facts.jsonl as chunks complete.
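The chunk arithmetic above can be sketched as follows — a minimal illustration assuming plain character slicing (the real splitter may respect paragraph boundaries; `split_into_chunks` is a hypothetical name):

```python
# Fixed-size chunking: slice the text every 8,000 characters.
def split_into_chunks(text: str, size: int = 8000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

book_text = "x" * 340_000          # a 340,000-character book
chunks = split_into_chunks(book_text)
print(len(chunks))                 # 43 chunks (the last one partial)
print(len(chunks) * 6)             # 258 extractions at 6 agents per chunk
```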

The six agents

| Agent | Model | Extracts |
|-------|-------|----------|
| People | Haiku | Names, roles, relationships, affiliations |
| Timeline | Haiku | Dated events, sequences, cause-and-effect |
| Locations | Haiku | Named places, addresses, geographic context |
| Bibliography | Haiku | Citations, sources, cross-references |
| Claims | Sonnet | Assertions with certainty scores + attribution |
| Scenes | Sonnet | Narrative moments: who, where, what was said |

You say

ingest books/brown-perversion-of-justice.md

What happens

# Splits 340,000 chars into ~43 chunks
# Launches 6 agents × 43 chunks = 258 parallel extractions
# Agents run with Haiku (fast, cheap) or Sonnet (nuanced)
# Each chunk returns structured JSON

→ Chunk 1/43: People (14 entities), Timeline (8 events), Claims (11 facts)...
→ Chunk 2/43: ...
→ Complete: 394 facts extracted · Total in store: 9,241

Theme tracking

Themes are investigative threads — cross-cutting patterns that accumulate evidence across sources. Distinct from tags (low-level descriptors). Each theme has keywords Claude uses to auto-suggest fact assignments as new sources are ingested.
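A minimal sketch of that keyword pre-filter, assuming a case-insensitive substring match (`keyword_candidates` is an illustrative helper — the skill then confirms each candidate with a Haiku classification pass):

```python
# Match a fact's claim text against every theme's keyword list.
def keyword_candidates(fact_claim: str, themes: list[dict]) -> list[str]:
    claim = fact_claim.lower()
    return [
        t["id"]
        for t in themes
        if any(kw in claim for kw in t["keywords"])
    ]

themes = [{"id": "T-001", "keywords": ["offshore", "shell", "nominee"]}]
print(keyword_candidates("Funds moved through an offshore trust", themes))
# ['T-001']
```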

Workflow

| Command | What it does |
|---------|--------------|
| theme define "financial-concealment" | Create a theme with description + keywords |
| theme tag F-0123 financial-concealment | Manually assign a fact to a theme |
| theme suggest | Auto-propose fact→theme mappings via keyword match + Haiku classification |
| theme report financial-concealment | Evidence summary: count, sources, date range, gaps |

Theme schema (themes.jsonl)

{
  "id": "T-001",
  "name": "financial-concealment",
  "display_name": "Financial Concealment",
  "description": "Methods used to hide assets: shell companies, offshore trusts, nominee accounts",
  "keywords": ["offshore", "shell", "trust", "launder", "conceal", "nominee"],
  "fact_ids": ["F-0123-0045", "F-0456-0012"],
  "source_ids": ["SRC-BOOK-brown-poj"],
  "fact_count": 127,
  "source_count": 23,
  "earliest_date": "1985",
  "latest_date": "2019",
  "status": "active"
}

Theme report output

## Theme: Financial Concealment (T-001)

127 facts across 23 sources | 1985–2019

### Core claim
[2–3 sentence synthesis from highest-certainty facts]

### Evidence by period
- 1985–1990: 12 facts (3 sources)
- 1990–2000: 45 facts (8 sources)
- 2000–2019: 70 facts (12 sources)

### Strongest sources
- Brown, Perversion of Justice (certainty 7) — 34 facts
- DOJ-OGR-000234 (certainty 9) — 8 facts

### Gaps
- No sources cover offshore account origins pre-1985

The database

Three JSONL files grow as you ingest. Migrate to SQLite when you need cross-entity queries or timeline generation — the migration is safe to re-run at any point.

| File | Contains |
|------|----------|
| facts/facts.jsonl | Every extracted fact — claim, date, certainty, source, entities, tags, themes |
| facts/sources.jsonl | Source registry — title, author, type, certainty tier, ingest date |
| facts/themes.jsonl | Investigative threads — keywords, fact IDs, source IDs, date range |

Fact schema

{
  "id": "F-08910",
  "claim": "Maxwell sent a USB drive via FedEx to U.S. Attorney Jay Clayton on April 16, 2026",
  "date": "2026-04-16",
  "certainty": 7,
  "source_id": "ART-dailybeast-maxwell-usb-2026",
  "people": ["Ghislaine Maxwell", "Jay Clayton"],
  "places": ["New York"],
  "orgs": ["DOJ", "SDNY"],
  "tags": ["maxwell", "doj", "2026"],
  "themes": ["T-003"]
}
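Because each line of facts.jsonl is a standalone JSON object, the store can be queried with the standard library alone. A sketch, filtering on the themes[] field shown in the schema (helper name and paths are illustrative):

```python
import json

# Return every fact in a JSONL store tagged with the given theme ID.
def facts_for_theme(path: str, theme_id: str) -> list[dict]:
    with open(path) as f:
        facts = [json.loads(line) for line in f if line.strip()]
    return [fact for fact in facts if theme_id in fact.get("themes", [])]
```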

The skill file

Copy this into .claude/commands/research-ingest.md. Claude Code picks it up automatically and knows how to route all subsystem commands.

research-ingest.md
---
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, WebSearch, WebFetch, Task
description: "Unified investigative research pipeline — articles, documents, books, media, multi-agent extraction, and theme tracking"
argument-hint: "<url or file path> — auto-detects source type"
---

# research-ingest

Unified investigative research ingestion skill (v1.2).

## Auto-detection dispatch

Inspect the first token of `$ARGUMENTS`:

| Input | Routes to |
|-------|-----------|
| URL containing `youtube.com` or `youtu.be` | `media` |
| Any other URL (starts with `http`) | `article` |
| Path ending in `.epub`, `.pdf`, or `.md` | `book` |
| First token is `doc` | `doc` subsystem |
| First token is `extract` | `extract` subsystem |
| First token is `theme` | `theme` subsystem |
| First token is `report` | `report` subsystem |

If no match, assume `article`.
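The dispatch table can be sketched as a small routing function (`route` is a hypothetical name — the skill performs this routing in-prompt rather than via a script):

```python
# Route the first token of $ARGUMENTS per the dispatch table above.
def route(arguments: str) -> str:
    first = arguments.split()[0]
    if first.startswith("http"):
        return "media" if ("youtube.com" in first or "youtu.be" in first) else "article"
    if first.endswith((".epub", ".pdf", ".md")):
        return "book"
    if first in ("doc", "extract", "theme", "report"):
        return first
    return "article"  # fallback when nothing matches

print(route("https://youtu.be/abc123"))      # media
print(route("books/brown.md"))               # book
print(route("theme report financial"))       # theme
```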

---

## article — Paywall bypass cascade

Flags: `--title`, `--pub`, `--author`, `--date`, `--queue N`, `--project DIR`, `--no-save`

### CRITICAL: Ask first — never reconstruct silently

If Strategy 1 fails and the user is in the conversation, STOP. Say:
> "I couldn't get the full text from [domain]. Do you have access? Can you paste it?"

Only proceed to Strategy 2+ if the user says to go ahead. Report any access failure in the first line of the response — never buried.

### Strategy cascade (try in order, stop when you have full text)

1. **Direct fetch + Trafilatura** — WebFetch original URL
2. **Jina AI Reader** — `WebFetch: https://r.jina.ai/{URL}` — clean Markdown, works for soft paywalls
3. **Wayback CDX check** — `http://web.archive.org/cdx/search/cdx?url={URL}&output=json&fl=timestamp,statuscode&filter=statuscode:200&limit=1` — if empty, skip; else fetch `https://web.archive.org/web/{TIMESTAMP}/{URL}`
4. **archive.ph** — `https://archive.ph/newest/{URL}` — conditional, only if previously archived
5. **Syndication hunt** — WebSearch: `"{title}" site:yahoo.com OR site:msn.com OR site:britbrief.co.uk`
6. **AMP/mobile versions** — `https://amp.{domain}/{path}` or `https://{domain}/amp/{path}`
7. **Secondary reconstruction** — only with explicit user permission
8. **Parallel agent** — Task (Sonnet) deep research sweep

### Source file format

Save to `03_Resources/{project}/sources/{publication}-{year}-{slug}.md`:

```
# {Article Title}

**Source:** {Publication}
**Author:** {Author}
**Date:** {Month Day, Year}
**URL:** {url}
**Access:** {DONE | PARTIAL | DEAD} — {notes}
**Retrieved:** {today}

---

## Summary

{2–3 sentence summary}

## Key Content

{Full article text}

## Key Facts for Sourcing

- **{Fact}** — {context}
```

---

## book — Multi-agent extraction

Splits the source into 8,000-character chunks. Launches 6 specialist agents per chunk in parallel.

| Agent | Model | Extracts |
|-------|-------|---------|
| People | Haiku | Names, roles, relationships |
| Timeline | Haiku | Dated events, sequences |
| Locations | Haiku | Places, addresses, geography |
| Bibliography | Haiku | Citations, cross-references |
| Claims | Sonnet | Assertions + certainty scores |
| Scenes | Sonnet | Narrative moments: who, where, what was said |

Each agent returns structured JSON. Facts accumulate to `facts/facts.jsonl`. After all chunks complete, run `generate_views.py` and `validate.py`.

---

## extract — Run extraction on a saved source

Point at any saved source file. Runs the same 6-agent pipeline as `book`. Use after ingesting articles or docs when you want deeper structured extraction beyond the initial key facts.

```
python3 scripts/extract_facts_from_source.py sources/filename.md
python3 scripts/build_source_registry.py
python3 scripts/generate_views.py
python3 scripts/validate.py
```

---

## theme — Investigative thread tracking

Themes are cross-cutting patterns that accumulate evidence across sources. Distinct from `tags`.

### Actions

**`theme define <name>`**
Create a theme. Ask for: display name, description, keywords (comma-separated). Write to `facts/themes.jsonl`.

**`theme tag <fact-id> <theme-id>`**
Manually assign a fact to a theme. Update both the fact in `facts.jsonl` (add theme_id to `themes[]`) and the theme entry (add fact_id to `fact_ids[]`).
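A sketch of that two-way update (illustrative only — it rewrites both JSONL files in place, which assumes they fit comfortably in memory):

```python
import json

# Two-way update: the fact gains the theme ID, the theme gains the fact ID.
def theme_tag(facts_path: str, themes_path: str, fact_id: str, theme_id: str) -> None:
    def add_to_list(path, row_id, list_key, new_item):
        with open(path) as f:
            rows = [json.loads(line) for line in f if line.strip()]
        for row in rows:
            if row["id"] == row_id and new_item not in row.setdefault(list_key, []):
                row[list_key].append(new_item)
        with open(path, "w") as f:
            f.writelines(json.dumps(r) + "\n" for r in rows)

    add_to_list(facts_path, fact_id, "themes", theme_id)      # facts.jsonl side
    add_to_list(themes_path, theme_id, "fact_ids", fact_id)   # themes.jsonl side
```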

**`theme suggest`**
Scan all facts in `facts.jsonl` against all theme keyword lists. For each potential match, use Haiku to classify (yes/no/uncertain). Present proposed mappings for user approval before writing.

**`theme report <theme-name>`**
Generate structured evidence summary:
- Fact count + source count + date range
- Core claim synthesis (from highest-certainty facts)
- Evidence by time period
- Strongest sources by fact count
- Coverage gaps

### Schema (facts/themes.jsonl)

```json
{
  "id": "T-001",
  "name": "financial-concealment",
  "display_name": "Financial Concealment",
  "description": "Methods used to hide assets: shell companies, offshore trusts, nominee accounts",
  "keywords": ["offshore", "shell", "trust", "launder", "conceal", "nominee"],
  "fact_ids": [],
  "source_ids": [],
  "fact_count": 0,
  "source_count": 0,
  "earliest_date": null,
  "latest_date": null,
  "created": "{today}",
  "updated": "{today}",
  "status": "active"
}
```

Facts gain a `themes: [theme_id, ...]` field alongside existing `tags`.

---

## report — Project status

Show:
- Facts in `facts.jsonl` (total, by source type, last 7 days)
- Sources in `sources.jsonl` (total, by type)
- Themes in `themes.jsonl` (active, fact coverage)
- Any validation errors

---

## Project config

Default: uses paths configured in the project's `CLAUDE.md`.

With `--project <name>`: reads `~/.research-ingest/<name>.toml`:
```toml
[project]
name = "my-investigation"
vault_path = "~/research/my-investigation"
facts_path = "{vault_path}/facts/facts.jsonl"
themes_path = "{vault_path}/facts/themes.jsonl"
sources_path = "{vault_path}/facts/sources.jsonl"
sources_dir = "{vault_path}/sources"
```

---

## Rules

- **Never fabricate content.** If you can't find a quote, don't make one up.
- **Ask before reconstructing.** Strategy 1 failure = ask user, not proceed silently.
- **Fail loudly.** Access failure goes in first response line, never buried.
- **User-pasted content = full access.** If the article is in the conversation, use it.
- **Save before analyzing.** Content not saved will evaporate between sessions.
- **Check for existing source files** before creating new ones — don't duplicate.
- **Always run validate.py** after bulk extraction. Zero errors = clean corpus.

---

## Examples

```
ingest https://bloomberg.com/... --pub Bloomberg --queue 7
ingest https://substack.com/... --no-save
ingest ~/Downloads/perversion-of-justice.epub --author "Julie K. Brown"
ingest https://www.youtube.com/watch?v=abc123
ingest https://www.youtube.com/@bekahdayyy/videos --channel --confidence MEDIUM
doc court-filing.pdf
extract sources/bloomberg-2025-epstein-emails.md --agents claims,scenes
theme define "victim-network"
theme suggest --source sources/bloomberg-2025-epstein-emails.md
theme report financial-concealment
report
```

Tips

Start with themes before you have all your sources

Define your investigative themes early — even with just a few sources. As new material comes in, theme suggest auto-proposes mappings against the keyword lists. The theme database fills up gradually rather than requiring a big classification pass at the end.

Ingest everything — let certainty sorting do its job

Don't skip sources you don't fully trust. Flag them with their tier (--certainty 3 for conspiracy-adjacent, --certainty 9 for court docs). Facts land in the database tagged accordingly. When a trusted source later corroborates a low-certainty claim, the score rises. The database is the filter, not the front door.

Use extract on articles, not just books

ingest <url> saves a source file and pulls key facts from the summary. extract <source> runs the full 6-agent pipeline on that file. Run extract on high-value articles when you need deeper structured extraction — named entities, claimed relationships, precise datings.

For large sources, watch chunk progress

Book extraction launches hundreds of parallel agents. Ask Claude to show chunk progress inline so you can see the extraction working in real time. If a chunk fails, it's safe to re-run — the extractor deduplicates on fact content before appending.
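That dedup-on-re-run behavior can be sketched like this, assuming the dedup key is the exact claim text (the real extractor may normalize further; `append_unique` is an illustrative name):

```python
import json

# Append only facts whose claim text is not already in the store.
def append_unique(path: str, new_facts: list[dict]) -> int:
    try:
        with open(path) as f:
            seen = {json.loads(line)["claim"] for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()
    added = 0
    with open(path, "a") as f:
        for fact in new_facts:
            if fact["claim"] not in seen:
                f.write(json.dumps(fact) + "\n")
                seen.add(fact["claim"])
                added += 1
    return added
```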

FAQ

What's the difference between tags and themes?

Tags are low-level descriptors — palm-beach, 2003, maxwell. Themes are investigative threads — financial-concealment, victim-network, institutional-complicity. A theme accumulates evidence over time and produces reports. A tag is just a label.

Extraction ran but I got 0 facts.

Three causes: empty input (check wc -l on the file — conversion may have failed silently), score threshold too high (try --threshold 3), or page-marker artifacts from HTML EPUB extraction (strip --- Page N --- lines before running).
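Stripping the page-marker artifacts can be sketched in one regex, assuming the marker shape named above:

```python
import re

# Remove `--- Page N ---` lines left over from HTML EPUB extraction.
def strip_page_markers(text: str) -> str:
    return re.sub(r"^--- Page \d+ ---\n?", "", text, flags=re.MULTILINE)

print(strip_page_markers("intro\n--- Page 3 ---\nbody"))
# intro
# body
```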

How do I use this for a non-Epstein project?

Create a config file at ~/.research-ingest/your-project.toml with vault_path, facts_path, themes_path, and sources_path. Then pass --project your-project to any subsystem command. The skill defaults to whatever paths are set in the current project's CLAUDE.md.

JSONL vs SQLite — when do I migrate?

Stay in JSONL while actively ingesting. Migrate to SQLite when you need entity cross-queries, timeline generation, or theme bucketing across the full corpus. The SQLite database is fully rebuilt from JSONL — re-migrating is safe and normal.
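A sketch of that rebuild, under an assumed column layout (`migrate` is an illustrative name; dropping and recreating the table is what makes re-migration safe):

```python
import json
import sqlite3

# Rebuild the facts table from facts.jsonl. The database is derived
# state, so this can be re-run at any time.
def migrate(facts_path: str, db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS facts")
    conn.execute("CREATE TABLE facts "
                 "(id TEXT PRIMARY KEY, claim TEXT, date TEXT, "
                 "certainty INTEGER, source_id TEXT)")
    n = 0
    with open(facts_path) as f:
        for line in f:
            if not line.strip():
                continue
            fact = json.loads(line)
            conn.execute(
                "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
                (fact["id"], fact["claim"], fact.get("date"),
                 fact.get("certainty"), fact.get("source_id")),
            )
            n += 1
    conn.commit()
    conn.close()
    return n
```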

What's the relationship to the old /ingest skill?

ingest <url> is the full article ingestion subsystem from the old /ingest skill — same cascade, same source file format. Book and extract add the multi-agent pipeline. Theme tracking is new in v1.2. If you have an existing /ingest setup, the source files and facts database are fully compatible.