
Building a Real-Time OSINT Pipeline

James Reed | March 25, 2026 | 12 min read

Key Takeaways

  • The pipeline processes 60,000+ events/hour from 20 sources using a four-stage architecture: fetch, normalize, deduplicate, store.
  • Two-layer deduplication (Redis + Postgres) ensures one record per event while handling updates gracefully.
  • Each feed is self-contained — one feed crash never affects others, and the system degrades gracefully without Redis.
  • Entity tracking maintains position history for aircraft and satellites, enabling persistent surveillance over time.
  • Hash-based staggered startup prevents thundering herd when all feeds boot simultaneously.

Processing 60,000+ events per hour from 20 different data sources is a non-trivial engineering challenge. This post walks through the architecture of Sentinel's data pipeline — from feed collection to normalized storage.

The pipeline follows a four-stage architecture: fetch, normalize, deduplicate, and store. Each stage is designed for independent failure — if one feed goes down, the others keep running. If Redis is unavailable, deduplication falls back to Postgres unique constraints. If Supabase is unreachable, events are logged locally for later replay.

At the collection layer, each feed has its own collector that extends a common BaseCollector class. The collector handles polling intervals, error tracking, and automatic disable after consecutive failures. Polling intervals range from 15 seconds (ADS-B military tracking) to 12 hours (satellite TLE data from CelesTrak).
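A collector base class along these lines would cover the behavior described above. The class shape and the failure threshold here are assumptions for illustration; only the responsibilities (polling interval, error tracking, auto-disable) come from the post.

```python
import asyncio

class BaseCollector:
    """Hypothetical sketch of a per-feed collector base class."""

    def __init__(self, name: str, interval_s: float, max_failures: int = 5):
        self.name = name
        self.interval_s = interval_s   # e.g. 15 for ADS-B, 43200 for TLE data
        self.max_failures = max_failures
        self.failures = 0
        self.enabled = True

    async def fetch(self):
        raise NotImplementedError      # each feed implements its own fetch

    async def poll_once(self):
        if not self.enabled:
            return None
        try:
            events = await self.fetch()
            self.failures = 0          # any success resets the failure counter
            return events
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.enabled = False   # auto-disable a persistently failing feed
            return None
```

Subclasses only implement `fetch()`; the loop timing and failure accounting stay in one place.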

Normalization is where the magic happens. Every event, regardless of source, is transformed into a NormalizedEvent schema: source, event type, coordinates, timestamp, severity, raw data, and a computed content hash. This common schema is what allows cross-domain correlation downstream.
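A schema like the following would match the fields listed above. The exact types and the hash recipe are assumptions; the key idea is that the content hash is computed from identifying fields, not the volatile raw payload, so repeated reports of the same event collapse to one hash.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class NormalizedEvent:
    # Field names follow the schema described above; types are illustrative.
    source: str
    event_type: str
    lat: float
    lon: float
    timestamp: datetime
    severity: int
    raw: dict[str, Any] = field(default_factory=dict)

    def content_hash(self) -> str:
        # Hash identifying fields only, so two polls that return the same
        # event (with slightly different raw payloads) produce the same hash.
        key = (f"{self.source}|{self.event_type}|{self.lat:.4f}|"
               f"{self.lon:.4f}|{self.timestamp.isoformat()}")
        return hashlib.sha256(key.encode()).hexdigest()
```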

Deduplication runs at two layers. Redis provides fast, in-memory dedup with a 24-hour TTL per event hash. This catches the 90% case — the same earthquake reported in consecutive USGS polls, the same aircraft appearing in sequential ADS-B sweeps. Postgres provides durable dedup via a unique index on (source, source_id), catching anything Redis missed.

Storage uses batch writes to Supabase (Postgres via PostgREST). We batch up to 500 rows per write to stay within payload limits, using upsert operations to handle race conditions between the two dedup layers.
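The batching logic is straightforward to sketch. The `db.upsert` call and the `on_conflict` parameter here are stand-ins for the real Supabase/PostgREST client, not its actual API:

```python
BATCH_SIZE = 500  # stay within payload limits per write

def chunked(rows, size=BATCH_SIZE):
    """Yield rows in fixed-size chunks."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

async def store_events(db, rows):
    written = 0
    for batch in chunked(rows):
        # Upsert on (source, source_id): a duplicate that slipped past the
        # Redis layer updates the existing row instead of raising a
        # unique-violation error.
        written += await db.upsert("events", batch,
                                   on_conflict="source,source_id")
    return written
```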

The scheduler uses hash-based staggered startup to prevent thundering herd on boot. Each feed's first poll is delayed by a deterministic offset derived from its name, spreading initial requests across the first few seconds of startup.
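The deterministic offset can be derived by hashing the feed name and mapping the digest onto a small window. A sketch, with the 10-second spread chosen for illustration:

```python
import hashlib

def startup_delay(feed_name: str, spread_s: float = 10.0) -> float:
    """Deterministic per-feed startup delay in [0, spread_s)."""
    digest = hashlib.sha256(feed_name.encode()).digest()
    # Map the first 4 hash bytes onto the spread window.
    return int.from_bytes(digest[:4], "big") / 2**32 * spread_s
```

Because the delay depends only on the name, a feed gets the same slot on every boot, and distinct names spread roughly uniformly across the window without any coordination.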

Entities — aircraft and satellites — get special treatment. Rather than just recording events, we maintain an entity history table that tracks position, altitude, speed, and metadata over time. This enables the persistent tracking features in Sentinel's analyst and professional tiers.
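Conceptually, the entity history is append-only: each observation adds a position fix to the entity's track instead of overwriting the last one. A minimal in-memory sketch (the real table lives in Postgres; field names like `altitude` and `speed` follow the text, the rest is assumed):

```python
from collections import defaultdict
from datetime import datetime, timezone

class EntityHistory:
    """Illustrative append-only track store keyed by entity ID."""

    def __init__(self):
        self._tracks = defaultdict(list)

    def record(self, entity_id, lat, lon, altitude, speed, seen_at=None):
        # Append a fix rather than replacing the previous one, so the
        # full track can be replayed over time.
        self._tracks[entity_id].append({
            "lat": lat, "lon": lon,
            "altitude": altitude, "speed": speed,
            "seen_at": seen_at or datetime.now(timezone.utc),
        })

    def track(self, entity_id):
        return list(self._tracks[entity_id])
```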

The entire pipeline is async Python, running on FastAPI with asyncio tasks managed by a custom scheduler. No Celery. No external queue for the core pipeline. Just clean async/await with proper error boundaries.

In a future post, we will cover the AI agent layer that sits on top of this pipeline — how domain-specific agents consume normalized events and produce the delta intelligence that Sentinel is built on.