By: Jonas Björk, CTO at Funnel

At Funnel, we deal with a surprisingly interesting problem: keeping marketing data fresh. On the surface, it sounds simple: call some APIs, store the data, and you’re done. But anyone who’s worked with APIs like Meta’s or Google Ads knows that freshness, completeness, and correctness are moving targets given the nature of marketing data.

We’ve spent over a decade building a purpose-built Data-In Platform, now in its fourth generation, designed to handle this complexity at scale across hundreds of connectors and thousands of customers. Here’s how we do it.


Why Ingesting Marketing Data Isn't Simple

While modern workflow orchestrators like Airflow and container platforms like Kubernetes might seem like natural fits, they don’t fully meet the demands of a multi-tenant, high-scale, freshness-critical data platform for marketing data. Airflow excels at scheduled pipelines expressed as directed acyclic graphs (DAGs), while Kubernetes is optimized for long-running services and infrastructure orchestration, not for executing millions of short-lived jobs per day.

Most marketing platforms expose their data via APIs designed for reporting and analytics. But these APIs often deliver data with delayed accuracy. For example, performance metrics are frequently unavailable in real-time and may only be updated once per day. Early data is often estimated, and these estimates can change, sometimes several days after they first appear.

In traditional data engineering, the goal is to produce immutable partitions, as mutations require expensive reprocessing downstream. Unfortunately, with marketing data, immutability is rarely feasible. Keeping data both accurate and up-to-date turns out to be a far more nuanced problem than it may appear.

Scheduler Optimized for Freshness and Scale

To tackle this, Funnel has developed an intelligent scheduler within our Data-In Platform. It tracks data freshness for every partition, across all data sources and all customers. Each source platform’s historical mutation patterns are encoded into expiration policies for its partitions. When a partition expires, the platform re-runs the extraction job, checks for changes, and rewrites the partition only if necessary.
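To make the idea concrete, here is a minimal Rust sketch of an age-based expiration check. The type names, TTL values, and the 28-day restatement window are illustrative assumptions for the example, not our production policy:

```rust
use std::time::{Duration, SystemTime};

/// Illustrative per-platform expiration policy: how long a daily partition
/// stays "fresh" depending on how old the underlying data is.
struct ExpirationPolicy {
    /// Partitions younger than this are still likely to be restated.
    volatile_window: Duration,
    /// Recheck interval while the partition is still volatile.
    volatile_ttl: Duration,
    /// Recheck interval once the platform has usually stopped restating data.
    stable_ttl: Duration,
}

/// State the scheduler tracks for one (customer, data source, date) partition.
struct Partition {
    partition_date: SystemTime,
    last_refreshed: SystemTime,
}

impl ExpirationPolicy {
    /// A partition expires when the time since its last refresh exceeds the
    /// TTL implied by the age of the data it covers.
    fn is_expired(&self, p: &Partition, now: SystemTime) -> bool {
        let data_age = now
            .duration_since(p.partition_date)
            .unwrap_or(Duration::ZERO);
        let ttl = if data_age < self.volatile_window {
            self.volatile_ttl
        } else {
            self.stable_ttl
        };
        now.duration_since(p.last_refreshed).unwrap_or(Duration::ZERO) >= ttl
    }
}

fn main() {
    // Example: an API that keeps restating metrics for roughly 28 days.
    let policy = ExpirationPolicy {
        volatile_window: Duration::from_secs(28 * 24 * 3600),
        volatile_ttl: Duration::from_secs(6 * 3600),     // recheck every 6 h
        stable_ttl: Duration::from_secs(7 * 24 * 3600),  // then weekly
    };
    let now = SystemTime::now();
    let partition = Partition {
        partition_date: now - Duration::from_secs(3 * 24 * 3600), // 3-day-old data
        last_refreshed: now - Duration::from_secs(8 * 3600),      // checked 8 h ago
    };
    println!("expired: {}", policy.is_expired(&partition, now));
}
```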

The scheduler uses adaptive refresh strategies: recent data is checked frequently, while older data that is less likely to change is checked less often. It also respects a wide variety of API constraints, including rate limits and quotas. For well-behaved APIs, such as Google Ads, these constraints are clearly documented. For others, like Meta’s Insights API, black-box quota behavior requires dynamic scheduling adjustments to avoid throttling while maintaining performance and data availability.
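One common way to express this kind of pacing is a token bucket that backs off whenever the API throttles us anyway. The sketch below is a simplified illustration; the rates and the halving strategy are assumptions, not how our scheduler is actually tuned:

```rust
use std::time::Instant;

/// Illustrative per-tenant, per-API token bucket used to pace refresh jobs.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Top the bucket up based on time elapsed since the last refill.
    fn refill(&mut self) {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = Instant::now();
    }

    /// Returns true if a job may run now; otherwise the caller re-queues it.
    fn try_acquire(&mut self) -> bool {
        self.refill();
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    /// If the API signals throttling despite our bookkeeping (a black-box
    /// quota), shrink the assumed refill rate and drain the bucket.
    fn on_throttled(&mut self) {
        self.refill_per_sec *= 0.5;
        self.tokens = 0.0;
    }
}

fn main() {
    let mut bucket = TokenBucket::new(10.0, 2.0); // assume ~2 requests/second
    for job in 0..15 {
        if bucket.try_acquire() {
            println!("job {job}: dispatch");
        } else {
            println!("job {job}: defer and re-queue");
        }
    }
    // If the API still returns a throttling response (e.g. HTTP 429),
    // adapt the assumed quota downwards.
    bucket.on_throttled();
}
```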

On average, our customers see around 400 data refresh jobs per day. For those with many connected data sources, this can scale to hundreds of thousands of jobs daily, with several download jobs completing every second. Most jobs download relatively small volumes of data, typically a few megabytes, while others handle much larger loads reaching into the gigabytes. In total, our platform ingests tens of terabytes of data per day across all customers. Some APIs support ETags to detect changes efficiently, but in most cases we run custom change detection internally. As a result, over 70% of fetched data does not need to be rewritten, saving resources both in Funnel’s infrastructure and in our customers’ data warehouses by preventing unnecessary downstream processing.
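As a rough illustration of the post-download change check, the sketch below fingerprints the fetched bytes and compares them with the fingerprint stored for the existing partition, rewriting only on a mismatch. The hash choice and function names are assumptions for the example, not our internal implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute a fingerprint of a partition's raw bytes.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Compare freshly downloaded bytes with the stored fingerprint. APIs with
/// ETag support let us skip the download entirely; this fallback runs after
/// the fetch.
fn partition_changed(downloaded: &[u8], stored_fingerprint: Option<u64>) -> bool {
    stored_fingerprint != Some(fingerprint(downloaded))
}

fn main() {
    let previous = Some(fingerprint(b"clicks,impressions\n120,4500\n"));
    let fresh = b"clicks,impressions\n120,4500\n";

    if partition_changed(fresh, previous) {
        println!("rewrite partition and let downstream reprocess");
    } else {
        println!("unchanged: skip rewrite, no downstream reprocessing");
    }
}
```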

Isolation, Performance, and Security

Funnel currently offers more than 600 connectors, and to power them we’ve built a proprietary connector runtime optimized for performance and isolation. At a high level, it is a Rust-based orchestration layer that manages a fleet of worker nodes. Each worker node runs many jobs in parallel, such as executing a connector and downloading data. Each connector process is isolated using Linux-native sandboxing, with strict controls over CPU, memory, network access, and filesystem permissions, providing robust, low-level isolation beyond the application layer. To further prevent data leaks across tenants, even in the presence of bugs, each job can only read and write data via pre-signed URLs managed by the platform.
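In simplified form, a job handed to a sandboxed worker might look like the sketch below: it carries only pre-signed input and output URLs plus its resource limits, never platform credentials. Field names, limits, and URLs here are hypothetical, not our actual job format:

```rust
/// Illustrative description of one connector job as the orchestration layer
/// hands it to a sandboxed worker process.
struct JobSpec {
    connector_id: String,
    /// Pre-signed GET URL for the job's input (query window, settings, etc.).
    input_url: String,
    /// Pre-signed PUT URL where the downloaded data must be written.
    output_url: String,
    /// Resource limits enforced at the OS level (namespaces, cgroups).
    cpu_millis: u32,
    memory_mb: u32,
    /// Hosts the sandboxed process is allowed to reach.
    network_allowlist: Vec<String>,
}

fn main() {
    // Hypothetical job; the job itself never sees platform credentials,
    // only these short-lived pre-signed URLs.
    let job = JobSpec {
        connector_id: "google-ads".to_string(),
        input_url: "https://storage.example.com/jobs/123/input?signature=...".to_string(),
        output_url: "https://storage.example.com/jobs/123/output?signature=...".to_string(),
        cpu_millis: 500,
        memory_mb: 512,
        network_allowlist: vec!["googleads.googleapis.com".to_string()],
    };
    println!(
        "dispatching {} with {} mCPU, {} MB memory, {} allowed hosts",
        job.connector_id,
        job.cpu_millis,
        job.memory_mb,
        job.network_allowlist.len()
    );
}
```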


Why We Keep Building

Our core promise is to deliver business-ready data that is as fresh and accurate as possible, in a secure way. To uphold that promise, we continue to evolve our proprietary Data-In Platform, purpose-built for the realities of mutable marketing data, API rate limits, and secure, per-tenant isolation at massive scale. As the marketing ecosystem evolves and more types of data come into use, the platform will keep growing to deliver even more fresh, trusted data at scale.

 

Let’s build something great together. Join us