Written By: Anirudh Mehta, Engineering Manager, and Jonas Björk, CTO
Most data engineering problems are hard because the data is big. Ours is hard because the data keeps changing its mind.
At Funnel, we've spent over a decade building infrastructure to collect marketing data from hundreds of platforms - Google Ads, Meta, TikTok, LinkedIn, and 600+ more - and deliver it to our customers in a form that's fresh, accurate, and trustworthy. What looks like a straightforward ETL problem quickly becomes a system for managing continuously mutating data at scale. Here's why - and here's what we're building next.
The Problem: Marketing Data Doesn't Behave
Imagine you're a meteorologist, and the weather from yesterday keeps getting revised. Not the forecast - the actual recorded temperature from last Tuesday. That's roughly what it's like to work with marketing data.
Performance metrics from ad platforms are frequently estimated when they first appear, then quietly corrected days later. This breaks the standard data engineering playbook. The immutable partition - the bedrock of scalable pipelines - assumes data, once written, stays written. With marketing data, that assumption fails constantly. We have to revisit, compare, and selectively overwrite. At scale. Across thousands of customers and hundreds of sources.
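To make the "revisit, compare, and selectively overwrite" loop concrete, here is a deliberately simplified sketch in Python. The partition keys, metric names, and dict-based storage are invented for illustration; they are not Funnel's data model.

```python
from datetime import date

def merge_revisions(stored: dict, fetched: dict) -> set:
    """Compare re-fetched partitions against stored ones and selectively
    overwrite those whose values were revised by the platform."""
    overwritten = set()
    for partition_key, new_rows in fetched.items():
        if stored.get(partition_key) != new_rows:
            stored[partition_key] = new_rows  # revision detected: overwrite
            overwritten.add(partition_key)
    return overwritten

stored = {
    date(2024, 5, 7): {"clicks": 120, "spend": 34.10},  # estimated at first fetch
    date(2024, 5, 8): {"clicks": 98, "spend": 21.00},
}
fetched = {
    date(2024, 5, 7): {"clicks": 131, "spend": 36.55},  # quietly corrected upstream
    date(2024, 5, 8): {"clicks": 98, "spend": 21.00},   # unchanged
}

changed = merge_revisions(stored, fetched)
print(changed)  # only the revised partition needs rewriting
```

The point of the sketch: "written once" is not a safe assumption here, so every partition stays a candidate for comparison and rewrite long after its first ingestion.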
How We Ingest It
Airflow excels at DAG-based orchestration; Kubernetes at long-running services. Neither is designed for the combination we need: millions of short-lived, API-driven jobs per day with per-partition freshness tracking, per-tenant isolation, and dynamic adaptation to undocumented rate limits.
In practice, the overhead of DAG scheduling and task coordination made it difficult to operate efficiently at this granularity while still reacting in near real time to external API behaviour. Rather than contorting either tool into something it wasn't built for, we built a scheduler and runtime purpose-built for this specific problem.
Our Data-In Platform - now in its fourth generation - has an intelligent scheduler at its core. Each data source carries its own mutation profile built from years of observed platform behaviour, which drives per-partition expiration policies. In practice, this means we track how frequently data changes over time per source and apply a decay-based refresh strategy - aggressively re-fetching recent partitions while letting older ones stabilize unless signals suggest drift. For APIs with undocumented quota behaviour - Meta's Insights API being the canonical example - the scheduler infers limits dynamically from observed responses and error patterns, rather than tripping over invisible ceilings.
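One way to picture a decay-based refresh policy is an interval that grows with partition age, scaled by a per-source "mutation half-life" learned from observed behaviour. The sketch below is purely illustrative; the constants and the exponential shape are assumptions, not Funnel's actual parameters.

```python
from datetime import timedelta

def refresh_interval(partition_age_days: float, halflife_days: float = 7.0,
                     min_hours: float = 1.0, max_hours: float = 24 * 30) -> timedelta:
    """Young partitions are re-fetched aggressively; old ones rarely.
    The interval doubles for every `halflife_days` of partition age,
    clamped between a floor and a ceiling."""
    hours = min_hours * (2.0 ** (partition_age_days / halflife_days))
    return timedelta(hours=min(max(hours, min_hours), max_hours))

print(refresh_interval(0))    # fresh data: re-fetched every hour
print(refresh_interval(7))    # a week old: every two hours
print(refresh_interval(90))   # stabilized: capped at the monthly ceiling
```

A source whose history shows late corrections would get a longer half-life, keeping its older partitions under watch; a stable source decays quickly toward the ceiling.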
The numbers: around 100 refresh jobs per second, scaling to hundreds of thousands for heavy users, with the platform ingesting tens of terabytes daily in aggregate. Over 70% of fetched data doesn't need to be rewritten - our change detection, based on comparing partition-level snapshots, catches it first, saving compute for our customers downstream.
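Change detection by comparing partition-level snapshots can be as simple as fingerprinting each fetched partition and comparing against the fingerprint stored at the last write. This is a minimal sketch of the idea, not Funnel's implementation:

```python
import hashlib
import json

def partition_digest(rows: list) -> str:
    """Stable content hash of a partition, independent of row order."""
    canonical = json.dumps(sorted(rows, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Digest recorded when the partition was last written.
previous = {"2024-05-07": partition_digest([{"clicks": 120}])}

fetched_rows = [{"clicks": 120}]
if partition_digest(fetched_rows) == previous["2024-05-07"]:
    print("unchanged: skip rewrite")   # the common case
else:
    print("changed: overwrite partition")
```

Matching digests mean the fetch can be discarded without touching storage, which is exactly where the downstream compute savings come from.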
The execution layer is a Rust-based runtime managing a fleet of worker nodes. Each job runs inside a Linux-native OS-level sandbox with hard controls on CPU, memory, network, and filesystem access. Multi-tenancy means a bug in one customer's connector must never reach another's data - so isolation is enforced at the OS level, not the application layer. Every job reads and writes exclusively through pre-signed URLs managed by the platform.
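The pre-signed URL idea can be shown with a simplified HMAC scheme (a real platform would use its object store's signing mechanism, such as S3 Signature Version 4). The signing key never enters the sandbox; the job only ever sees a URL scoped to one object and one expiry:

```python
import hashlib
import hmac
import time

PLATFORM_SECRET = b"platform-only-signing-key"  # held by the platform, not the job

def presign(path: str, expires_at: int) -> str:
    """Sign a URL granting access to exactly one object until expires_at."""
    msg = f"{path}|{expires_at}".encode()
    sig = hmac.new(PLATFORM_SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://storage.internal{path}?expires={expires_at}&sig={sig}"

def verify(path: str, expires_at: int, sig: str, now: int) -> bool:
    """Storage-side check: signature must match and the URL must not be expired."""
    expected = hmac.new(PLATFORM_SECRET, f"{path}|{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    return now < expires_at and hmac.compare_digest(sig, expected)

url = presign("/tenant-42/job-7/output.parquet", expires_at=int(time.time()) + 900)
print(url)
```

Because the capability is baked into the URL itself, a compromised connector cannot widen its access by tampering with the path or extending the expiry: either change invalidates the signature.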
What We're Building Next
The ingestion layer is strong, but the platform our customers interact with is undergoing a significant architectural shift - toward a composable data platform - and we're building most of it from the ground up.

A storage format the whole industry can speak. We're migrating away from Funnel's internal storage format to Parquet with Iceberg table metadata. The goal is interoperability: Funnel data as a first-class citizen in the open table format ecosystem, with all the snapshot isolation, schema evolution, and partition pruning that comes with it.
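To see what table metadata buys, consider partition pruning: each data file carries min/max column statistics, so a query over a date range can skip files without opening them. The toy model below is ours for illustration; real Iceberg manifests carry far richer metadata (snapshots, schemas, partition specs).

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataFile:
    path: str
    min_date: date
    max_date: date

# A toy "manifest": per-file column statistics, as Iceberg metadata provides.
manifest = [
    DataFile("s3://bucket/t/file-a.parquet", date(2024, 1, 1), date(2024, 1, 31)),
    DataFile("s3://bucket/t/file-b.parquet", date(2024, 2, 1), date(2024, 2, 29)),
    DataFile("s3://bucket/t/file-c.parquet", date(2024, 3, 1), date(2024, 3, 31)),
]

def prune(manifest: list, lo: date, hi: date) -> list:
    """Keep only files whose stats overlap the queried date range."""
    return [f for f in manifest if f.max_date >= lo and f.min_date <= hi]

hit = prune(manifest, date(2024, 2, 10), date(2024, 2, 20))
print([f.path for f in hit])  # only file-b survives pruning
```

Any engine that reads the same metadata gets the same pruning for free, which is the interoperability argument in miniature.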
A new query layer. We're replacing our internal query engine with DataFusion - the Arrow-native query engine built on the Apache Arrow columnar memory format. Vectorised execution, zero-copy reads from Parquet, and a clean extension model we can build on.
A query planner that outlives any single engine. Coupling query logic to a specific execution engine makes migration expensive. We're building a query planner that emits Substrait - a portable, cross-language query plan representation - with SQL as the input interface. The plan compiles once and is executable against whatever engine sits underneath, now or in the future.
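The "plan once, execute anywhere" idea can be illustrated with a toy engine-neutral plan and two interchangeable backends. The miniature IR below is invented for the sketch; Substrait's actual representation is a cross-language protobuf schema, not these Python classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    equals: object

def to_sql(plan) -> str:
    """One backend: compile the portable plan to a SQL string."""
    if isinstance(plan, Scan):
        return f"SELECT * FROM {plan.table}"
    if isinstance(plan, Filter):
        return f"{to_sql(plan.child)} WHERE {plan.column} = '{plan.equals}'"
    raise TypeError(plan)

def run_in_memory(plan, tables: dict) -> list:
    """Another backend: interpret the same plan over in-memory rows."""
    if isinstance(plan, Scan):
        return tables[plan.table]
    if isinstance(plan, Filter):
        return [r for r in run_in_memory(plan.child, tables)
                if r[plan.column] == plan.equals]
    raise TypeError(plan)

plan = Filter(Scan("ads"), "platform", "meta")
tables = {"ads": [{"platform": "meta", "spend": 10},
                  {"platform": "tiktok", "spend": 7}]}
print(to_sql(plan))
print(run_in_memory(plan, tables))
```

The plan is built once and knows nothing about either backend; swapping the execution engine means writing a new interpreter for the plan, not rewriting the queries.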
A semantic layer that speaks both to dashboards and to LLMs. We're building a semantic layer that participates in query compilation: it resolves business-level metrics and dimensions into physical query plans before they hit DataFusion. The same layer doubles as structured context for LLMs - grounded, schema-aware, queryable - which matters as AI-assisted analytics becomes a real part of how our customers consume data.
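A minimal sketch of a semantic layer participating in query compilation: business metrics and dimensions resolve to physical expressions before any engine sees the query. The metric definitions and column names here are invented for illustration, not Funnel's model.

```python
# Business-level definitions: the same structured catalogue that compiles
# queries can be handed to an LLM as grounded, schema-aware context.
METRICS = {
    "cpc": "SUM(spend) / NULLIF(SUM(clicks), 0)",        # cost per click
    "ctr": "SUM(clicks) / NULLIF(SUM(impressions), 0)",  # click-through rate
}
DIMENSIONS = {"platform": "source_platform", "day": "date_utc"}

def compile_query(metrics: list, dimensions: list, table: str) -> str:
    """Resolve metric/dimension names into a physical SQL query."""
    select = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    select += [f"{METRICS[m]} AS {m}" for m in metrics]
    group_by = ", ".join(DIMENSIONS[d] for d in dimensions)
    return f"SELECT {', '.join(select)} FROM {table} GROUP BY {group_by}"

sql = compile_query(["cpc"], ["platform"], table="ad_metrics")
print(sql)
```

A dashboard asking for "CPC by platform" and an LLM answering the same question in prose both resolve through one set of definitions, so the numbers agree by construction.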
Direct data access via Delta Sharing. Today Funnel pushes exports to wherever customers want them - BigQuery, Snowflake, Databricks, GCS, S3, Google Sheets, and more. We're adding Delta Sharing as a first-class access primitive. This gives customers a standardized and secure access layer without exposing underlying storage credentials or requiring data duplication.
Taken together: a composable data platform where storage, query, and semantic layers evolve independently and can be swapped or extended without rewriting the ones above or below.
Why This Is Interesting Work
The surface description - "we sync marketing data" - undersells it considerably. What we're actually doing is operating a high-scale ingestion system with hard correctness guarantees over mutable data, while simultaneously rebuilding the analytical layer on top of it using some of the most interesting open-source infrastructure the data ecosystem has produced in years: DataFusion, Arrow, Iceberg, Substrait, Delta Sharing. Each one is a composable primitive, and together they form a platform where the storage, query, and semantic layers can evolve independently.
This approach comes with trade-offs, particularly the operational overhead of running our own orchestration layer, but gives us the control needed to handle unpredictable external APIs at scale. The decisions we make now will shape how hundreds of thousands of marketing datasets are stored, queried, and reasoned over for years to come.
If these kinds of challenges sound exciting to you, we’d love to connect and exchange experiences!