※
UAP Archives
Public Record
◆ Sources · Methodology
Cross-source corroboration
Most UAP-related releases live in silos — NARA, AARO, civilian sighting databases, independent FOIA archives.
For every document in this archive, we look across those four sources and surface matching records on the document's page.
Below: what the four sources are, what we promised them, and how the matcher scores a pair.
Sources
US-government records, public domain. NARA publishes a structured bulk-download manifest (S3 + JSON) that we mirror metadata-only. We never republish full PDFs from NARA — every match links to catalog.archives.gov where the canonical PDF lives.
Today · 35 record-series ingested. Series-level only for now; per-item NAID ingest is gated on getting a NARA Catalog API v2 key.
US-government case-resolution PDFs from the All-domain Anomaly Resolution Office. Public domain, but aaro.mil hard-blocks programmatic fetch (Akamai bot wall on every UA/header combo). Our index is hand-curated from public reporting + congressional-hearing coverage. Matches link to the source PDF on aaro.mil.
Today · 7 case resolutions indexed. PDF bodies not OCR'd at ingest time — title + date + location drive the matcher.
Civilian-witness sighting database with an explicit no-scraping ToS clause. We do not maintain a local cache and we never excerpt witness narratives. Matching against NUFORC is gated behind an explicit operator opt-in (NUFORC_FEDERATION_ENABLED=1) so a fresh checkout doesn't quietly federate. When enabled, the matcher hits NUFORC's own public search URLs at sidecar-build time and stores only the resulting outbound link. If NUFORC grants a written research-archive exception, that's the path to broaden the integration.
Today · 0 matches surfaced. Live-federation disabled by default.
Independent FOIA archive curated by John Greenewald. Underlying documents are public domain (FOIA releases) but the curation, OCR, and editorial annotation are Greenewald's work. We render outbound link + title + date only — no body excerpt — until an attribution exchange is on file. robots.txt explicitly allows our crawler, so the index pull is sanctioned.
Today · 15 prominent posts indexed by hand. Future upgrade: WP REST or sitemap-driven crawl.
Scoring
Every (our_document, foreign_record) pair gets a score in [0, 1] from a weighted sum of four components.
Weights are renormalized while the embedding-similarity component (worth 0.20 in the long-term design) is still offline.
- Date proximity weight 0.35
- 1.0 when our doc's occurred_at falls inside the foreign record's date range (NARA series like Blue Book Case Files, 1947–1969). Otherwise, distance-to-nearest-endpoint with a 14-day half-life.
- Location proximity weight 0.30
- Best-of-N: for each of our doc's top_mentions[].canonical location coordinates, take the max of (1 − haversine_km / 100). Falls back to 0 when either side lacks lat/lon.
- Title token overlap weight 0.20
- Jaccard similarity over stopword-filtered token sets. Our side folds in top_mentions entity names + the first 240 chars of OCR (most of our titles are numeric file IDs that score 0 on their own). Foreign side folds in scope-and-content description when present.
- Agency family weight 0.15
- 1.0 on exact match, 0.5 on family match (defense / intel / civilian / diplomatic), 0.0 otherwise. NUFORC sits in its own civilian-witness family so it never registers same-family-credit with government sources.
Pairs scoring ≥ 0.50 surface as definite matches by default.
Pairs from 0.40 to 0.50 are possible matches, tucked behind a disclosure on each doc page.
Below 0.40 the matcher discards the pair before it reaches the sidecar.
Wrong match? Missing source?
The matcher is heuristic — it errs on the side of recall in this MVP. If a definite match on your document is wrong, or a source you maintain wants its material handled differently, we'd like to know.
The full design and per-phase changelog lives in
docs/superpowers/plans/2026-05-19-sprint-d.md
in the project repo.