Citation format for history.croton.news stories
This file documents the stable citation format used in all `_long.md` story files. Written during the April 2026 citation-integrity refactor.
The problem we solved
Before this refactor, stories cited primary sources by raw `chunk N` row IDs from `history.db`:
(Bolton 1848) chunks 3987, 3989 — Kitchawank villages
`chunks.id` is an `AUTOINCREMENT` column. Every time a source was re-ingested the ids shifted, and those inline references pointed at nothing or at the wrong row. The audit in April 2026 found 127 broken chunk references across stories 03–10.
The stable format
Citations now use the `[src: slug#hash]` format:[1][2]
Where:

- `slug` = the `source_file` value in `history.db.chunks.source_file`
- `hash` = the first 12 hex characters of `sha1(content.strip())`, i.e. the `content_hash` column on the `chunks` table
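The hash rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `ingest_sources.py`: the function names are hypothetical, and UTF-8 encoding is an assumption, since this document does not say how the text is encoded before hashing.

```python
import hashlib


def content_hash(content: str) -> str:
    """Return the first 12 hex characters of the SHA-1 of the stripped text."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.sha1(content.strip().encode("utf-8")).hexdigest()[:12]


def format_citation(slug: str, content: str) -> str:
    """Build a `[src: slug#hash]` pin for one chunk of the given source."""
    return f"[src: {slug}#{content_hash(content)}]"
```

Because the text is stripped first, leading and trailing whitespace does not affect the hash; any other change to the chunk text does.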
Why this survives re-ingestion
The `chunks.content_hash` column is computed deterministically from the chunk text itself. Re-ingesting the same source file produces the same chunks with the same hashes. `ingest_sources.py` has been patched to upsert by `(source_file, content_hash)` — when the hash is already present for that source_file, the existing row's `id` is preserved and only its chunk_index/metadata is updated. Only genuinely new content gets a new `id`. Obsolete content (in the old version but not the new) is deleted along with its embeddings.
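The upsert behaviour described above might look roughly like the following sketch. It is not the patched `ingest_sources.py`: the table schema, the `reingest` function, and the UTF-8 assumption are all illustrative, and embedding cleanup is left out.

```python
import hashlib
import sqlite3

# Assumed minimal schema; the real `chunks` table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_file TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    UNIQUE (source_file, content_hash)
);
"""


def content_hash(content: str) -> str:
    # Assumption: UTF-8 encoding before hashing.
    return hashlib.sha1(content.strip().encode("utf-8")).hexdigest()[:12]


def reingest(conn: sqlite3.Connection, source_file: str, chunks: list[str]) -> None:
    """Upsert chunks by (source_file, content_hash) and delete obsolete rows."""
    seen = set()
    for index, text in enumerate(chunks):
        digest = content_hash(text)
        seen.add(digest)
        row = conn.execute(
            "SELECT id FROM chunks WHERE source_file = ? AND content_hash = ?",
            (source_file, digest),
        ).fetchone()
        if row:
            # Known content: keep the existing id, refresh only the position.
            conn.execute(
                "UPDATE chunks SET chunk_index = ? WHERE id = ?", (index, row[0])
            )
        else:
            # Genuinely new content: this row receives a new id.
            conn.execute(
                "INSERT INTO chunks (source_file, content_hash, chunk_index, content) "
                "VALUES (?, ?, ?, ?)",
                (source_file, digest, index, text),
            )
    # Rows whose hash is absent from the new version are obsolete.
    existing = conn.execute(
        "SELECT id, content_hash FROM chunks WHERE source_file = ?", (source_file,)
    ).fetchall()
    for row_id, digest in existing:
        if digest not in seen:
            conn.execute("DELETE FROM chunks WHERE id = ?", (row_id,))
    conn.commit()
```

Calling `reingest` twice with overlapping chunk lists leaves the `id` of every unchanged chunk as it was, which is the property the citation format relies on.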
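The practical consequence: citations survive re-chunking, re-embedding, or full DB rebuilds, as long as the text of the cited chunk comes out unchanged. Because the hash is derived from the chunk text, an edit to the source passage or a shift in chunk boundaries yields a new hash.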
Where to put citations
Inline body refs
Place `[src: slug#hash]` at the end of a sentence or paragraph that makes a specific factual claim from the source. The renderer strips these from visible prose automatically — they are editorial pins for migration and fact-checking tooling, not rendered text. Readers see the claim without visible markup.
The first stone of the dam was placed in 1892 and the last on January 10, 1906.[3]
Sources block refs
In the `Sources:` block at the end of each story, use `[src: slug#hash]` at the end of each bibliography line — or just `[src: slug]` without a hash for entries that cover the whole source rather than a specific passage. The renderer converts these into clickable `[source]` links pointing to the source page anchor.
- Bolton, Robert Jr. *A History of the County of Westchester*, Vol. I (1848).[4] Primary for the Catharine Philipse will.
Out of scope
Do not use the `[src:]` format inside:

- Footnote markers (`[^name]`): footnotes are for standalone explanations, not sources
- Markdown link titles or alt text
- Code blocks or pre-formatted sections
How to get the right hash
When writing a new story, find the chunk you want to cite and look up its `content_hash`:
```bash
ssh croton "sqlite3 /opt/croton-news/rag/history.db \
  \"SELECT source_file, content_hash, substr(content, 1, 200)
    FROM chunks
    WHERE source_file = 'bolton_1848_v1'
      AND content LIKE '%Matty and Sarah%'
    LIMIT 3\""
```
Or use the helper script:
```bash
python3 /opt/croton-news/rag/history/resolve_citation.py \
  bolton_1848_v1 "Matty and Sarah, my Indians or muster slaves"
```
The helper returns the best FTS match for the given phrase within the named source_file, printing `[src: slug#hash]` ready to paste.
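For readers without access to the helper, the lookup can be approximated with a plain substring search. The sketch below is not `resolve_citation.py`: it swaps FTS ranking for SQLite's `instr()` and returns the first matching chunk in `chunk_index` order. The function name `find_citation` is hypothetical.

```python
import sqlite3
from typing import Optional


def find_citation(conn: sqlite3.Connection, slug: str, phrase: str) -> Optional[str]:
    """Return `[src: slug#hash]` for the first chunk of `slug` containing `phrase`."""
    row = conn.execute(
        "SELECT content_hash FROM chunks "
        "WHERE source_file = ? AND instr(content, ?) > 0 "
        "ORDER BY chunk_index LIMIT 1",
        (slug, phrase),
    ).fetchone()
    # None means the phrase was not found in any chunk of that source.
    return f"[src: {slug}#{row[0]}]" if row else None
```

A substring search only matches exact wording, so a phrase that differs from the OCR text by a single character returns nothing; the FTS-based helper is presumably more forgiving.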
Migration history
- April 2026: 127 raw `(chunk N)` refs across stories 03–10 migrated to `[src: slug#hash]` format via `/tmp/migrate_citations_v3.py`. 48 resolved via FTS matching; 79 had no recoverable anchor and were stripped (they were already invisible to readers via `_parse_story()` regex).
- Backup files: each migrated story has a `.bak_cite_v3` backup in-place, plus various `.bak_factcheck` / `.bak_dam_fix` / `.bak_higgins` backups from the preceding editorial pass. All backups are safe to delete once the next round of edits is complete.
Renderer contract
`app.py::_parse_story()` handles the format at render time:
1. In prose body: `[src: slug#hash]` patterns are stripped entirely before the body is passed to the template. Readers never see them.
2. In sources block: `[src: slug#hash]` is converted to `[source]`, producing clickable anchors that jump to the specific chunk on the source page.
3. Backward compatibility: the old `(chunk N)` and `chunks N, M` stripping regexes remain in place for any files that slip through without migration. New stories should use only the `[src:]` format.
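The first two rules of the contract can be sketched with a single regular expression. This is an illustration, not the code in `_parse_story()`: the pattern, the function names, and the `/sources/<slug>#<hash>` URL layout are all assumptions.

```python
import re

# Assumed grammar: a slug of word characters, dots, or hyphens,
# optionally followed by '#' and a 12-character lowercase hex hash.
SRC_PATTERN = r"\[src:\s*([\w.-]+)(?:#([0-9a-f]{12}))?\]"
SRC_RE = re.compile(SRC_PATTERN)


def strip_body_refs(body: str) -> str:
    """Remove citation pins from prose, with any whitespace just before them."""
    return re.sub(r"\s*" + SRC_PATTERN, "", body)


def link_source_refs(line: str, base: str = "/sources") -> str:
    """Turn each pin in a Sources line into a Markdown `[source]` link."""

    def repl(match: re.Match) -> str:
        slug, digest = match.group(1), match.group(2)
        anchor = f"#{digest}" if digest else ""
        # Hypothetical URL layout; the real source-page route may differ.
        return f"[source]({base}/{slug}{anchor})"

    return SRC_RE.sub(repl, line)
```

Applying `strip_body_refs` to the story body and `link_source_refs` to each line of the `Sources:` block mirrors the split described in rules 1 and 2.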
References
1. Bolton, Robert Jr. A History of the County of Westchester, from its First Settlement to the Present Time, Vol. I. New York: Alexander S. Gould, 1848, §0.
2. lossing_1866_hudson
3. croton_jubilee_1948
4. bolton_1848_v1