Ad Click Event Aggregation : With platforms like YouTube, TikTok, and Facebook drawing massive audiences, online advertising has become a cornerstone of digital marketing. Businesses are increasingly focused on tracking ad spend to ensure efficient and effective campaigns.
Real-Time Bidding (RTB) plays a central role in this ecosystem, allowing advertisers to bid for ad impressions in real time based on user data and campaign goals. This dynamic process ensures ads reach the right users at the right moment.
To evaluate performance, companies analyze Ad Click Aggregation—data on impressions and clicks—to understand user engagement and optimize their strategies. These insights support smarter bidding and better returns on investment.
An effective system must offer near real-time analytics, manage delayed data, and ensure high reliability and accuracy to support timely and informed decision-making.
Metric Source : Online advertising platforms like Google and Meta collect impressions and clicks from a variety of sources. These include their own properties—such as search engines, social media platforms, email services, and app stores—as well as external display networks made up of partner websites, mobile apps, and embedded content. Search and content partners, including third-party directories and search engines, help extend ad reach beyond the core platform ecosystem.
To re-engage users, advertisers use remarketing and retargeting strategies, targeting individuals who previously interacted with their websites, apps, or videos. Cost data for these ads is captured through real-time bidding (RTB) during ad auctions, where pricing is determined by competition, ad relevance, and bidding strategies.
Platforms also integrate with ad exchanges and publishers to gather cost information for display and video ads. Most support APIs and data feeds, enabling advertisers to import cost data from other marketing channels. For a unified view of spending, third-party ad servers and attribution tools consolidate cost data across sources. Advertisers can also manually upload data for offline conversions or non-digital campaigns.
To collect impression and click data, a common approach involves having the source system send event data to a designated endpoint whenever a view or click occurs. This endpoint then pushes the events into a messaging system—such as Apache Kafka—for downstream processing. Kafka supports critical streaming needs, including exactly-once delivery, log retention, and distributed execution, making it a reliable choice for scalable, fault-tolerant event handling.
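As a rough sketch of this collection path (the topic name ad-click-events, the JSON payload, and the broker addresses are illustrative assumptions, not fixed requirements), the endpoint can publish each event to Kafka keyed by ad_id so that all events for one ad land on the same partition:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickEventPublisher {
  // Producer configuration; broker addresses and topic name are placeholders.
  private val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092,broker2:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "all")                 // wait for all in-sync replicas to acknowledge
  props.put("enable.idempotence", "true")  // avoid duplicates introduced by producer retries

  private val producer = new KafkaProducer[String, String](props)

  // Called by the collection endpoint for every impression or click.
  // Keying by adId routes all events for one ad to the same partition.
  def publish(adId: String, eventJson: String): Unit =
    producer.send(new ProducerRecord[String, String]("ad-click-events", adId, eventJson))
}
```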
Messaging Layer : Kafka is a robust and widely adopted solution for storing and processing click and impression events in large-scale advertising systems. Its distributed architecture makes it particularly well-suited for high-throughput, low-latency event streaming use cases. Kafka excels in scenarios that demand scalability, fault tolerance, and reliability, offering exactly-once processing semantics that ensure accurate, duplicate-free event handling—an essential requirement for maintaining the integrity of advertising metrics.
One of Kafka's key advantages is its ability to retain logs for configurable durations. This feature enables reprocessing of historical data in cases where downstream systems fail, logic changes, or analytics need to be recalculated. Kafka integrates seamlessly with popular real-time analytics frameworks such as Apache Flink and Spark Streaming, enabling near real-time processing and decision-making based on incoming events.
Kafka clusters can scale horizontally, often consisting of dozens or even hundreds of brokers, and are capable of handling millions of messages per second. With petabyte-scale storage and partitioned topic architecture, Kafka ensures durability and efficient data distribution, even under heavy workloads. Its high availability and built-in replication provide resilience against hardware or node failures, making it a reliable backbone for event-driven systems in the ad tech ecosystem.
Despite its strengths, operating Kafka at scale introduces significant operational complexity. It’s crucial to size brokers and storage appropriately based on expected throughput, message size, and retention policies. Running Kafka effectively requires a skilled team responsible for managing the cluster, tuning performance, handling capacity planning, and ensuring security. Proactive monitoring and alerting are critical, as any downtime or bottlenecks can directly impact data pipelines and business operations.
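To make the sizing discussion concrete, here is a minimal sketch of creating the topic with an explicit partition count, replication factor, and retention policy through Kafka's AdminClient; the specific numbers are placeholders to be derived from measured throughput, message size, and reprocessing needs, not recommendations:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val adminProps = new Properties()
adminProps.put("bootstrap.servers", "broker1:9092")
val admin = AdminClient.create(adminProps)

// 64 partitions, replication factor 3, and 7-day retention are placeholder values.
val topic = new NewTopic("ad-click-events", 64, 3.toShort)
  .configs(Map(
    "retention.ms"     -> (7L * 24 * 60 * 60 * 1000).toString, // keep logs for 7 days
    "compression.type" -> "lz4"                                 // reduce storage and bandwidth
  ).asJava)

admin.createTopics(List(topic).asJava).all().get()
admin.close()
```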
For teams seeking a managed alternative, cloud-native services such as Google Pub/Sub and Amazon Kinesis offer similar streaming capabilities without the operational overhead of maintaining the infrastructure. These platforms abstract away many of the complexities of scaling and monitoring, though they may trade off some flexibility and fine-grained control.
In summary, Kafka is an excellent choice for click and impression event storage and processing when reliability, scalability, and flexibility are top priorities. However, its operational demands must be carefully considered and addressed to ensure smooth and uninterrupted performance in production environments.
Stream Processors: We can use stream processors like Spark Streaming or Apache Flink to read the event stream from the message queue and write the aggregated results into the database.
Features to consider when using Spark Streaming:
Max Bytes : Set fetch.max.bytes and max.partition.fetch.bytes appropriately to maximize batch processing efficiency.
Auto Commit : Disable Kafka auto commit (enable.auto.commit = false) and manage offsets manually for reliability.
Store Offsets : Persist offsets programmatically in an external store such as Redis or MongoDB, and read the latest committed offset from that store on startup so data is processed exactly once (see the offset-management sketch after this list).
Batch Intervals : Tune batch intervals (e.g., 1-5 seconds) to balance latency and throughput.
Back Pressure : Enable backpressure (spark.streaming.backpressure.enabled = true) to prevent consumers from being overwhelmed during traffic spikes.
Checkpointing : Enable checkpointing to HDFS, S3, or other cloud storage to recover from failures.
Sliding Windows : Use sliding windows so that late-arriving data still lands in the correct aggregation windows.
Watermarks : Set up watermarks in the stream processing job to bound how long late arrivals are accepted (see the windowed-aggregation sketch after this list).
Map State : Use mapWithState or the StateStore API to maintain running aggregates per user_id and session, so that late-arriving records update the latest aggregated values.
Serialization : Use Apache Avro or Protobuf instead of JSON for efficient serialization, and enable Kafka compression (e.g., Snappy, LZ4, or Gzip) to reduce network bandwidth usage.
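The manual offset bullets could look roughly like the following sketch with the DStream API, persisting each partition's offset to Redis after the batch succeeds. The topic name, Redis host, key format, and the Jedis client are assumptions for illustration; the actual aggregation step is omitted.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}
import redis.clients.jedis.Jedis

val conf = new SparkConf().setAppName("AdClickOffsets")
val ssc  = new StreamingContext(conf, Seconds(5))  // 5-second batch interval (example value)

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "group.id"           -> "ad-click-aggregator",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets are managed manually below
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("ad-click-events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  println(s"processed ${rdd.count()} events in this batch")  // aggregation/write step omitted

  // Persist the next offset to read for each topic-partition only after the batch succeeded.
  // On restart, these stored offsets would be read back and passed to ConsumerStrategies.Assign
  // to resume from the exact position (omitted here for brevity).
  val jedis = new Jedis("redis-host", 6379)
  offsetRanges.foreach { o =>
    jedis.set(s"offset:${o.topic}:${o.partition}", o.untilOffset.toString)
  }
  jedis.close()
}

ssc.start()
ssc.awaitTermination()
```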
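The watermark and window bullets map most directly onto Spark Structured Streaming. A minimal sketch follows; the topic name, JSON schema fields, and checkpoint path are assumptions, and the console sink stands in for a real database writer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("AdClickWindows").getOrCreate()

// Assumed JSON payload: {"ad_id": "...", "event_time": "...", "event_type": "click" | "impression"}
val schema = new StructType()
  .add("ad_id", StringType)
  .add("event_time", TimestampType)
  .add("event_type", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "ad-click-events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

// Accept events up to 2 minutes late; count clicks per ad in 1-minute sliding windows.
val counts = events
  .filter(col("event_type") === "click")
  .withWatermark("event_time", "2 minutes")
  .groupBy(window(col("event_time"), "1 minute", "30 seconds"), col("ad_id"))
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")                                                     // replace with the real sink
  .option("checkpointLocation", "s3://my-bucket/checkpoints/ad-clicks")  // placeholder path
  .start()
  .awaitTermination()
```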
Features to consider when using Apache Flink:
Watermarks : Flink processes data based on event time rather than arrival time. Watermarks define a threshold for how long late events are still accepted; events that arrive after the watermark (and beyond any allowed lateness) are dropped by default.
Allowed Lateness : Instead of dropping late events outright, Flink can keep windows open longer or divert late events. allowedLateness(Time.minutes(2)) keeps a window's state around so late events can still update its result, while sideOutputLateData routes events that arrive even later to a separate stream for potential reprocessing (see the sketch after this list).
Parallelism : Increase parallelism (env.setParallelism(n)) to utilize more cores and scale horizontally.
State Management : Use RocksDB as the state backend so large state spills to local disk instead of being held entirely in JVM memory.
Checkpointing : Enable periodic checkpoints for recovery in case of failure, and use savepoints to persist the current state for planned restarts or future reprocessing.
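Putting these Flink points together, a condensed sketch could look like the following. Class names, the inline source, durations, and parallelism are illustrative; the RocksDB backend assumes the flink-statebackend-rocksdb dependency, and in production the source would be Kafka rather than fromElements.

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class ClickEvent(adId: String, eventTime: Long)  // event time as epoch millis

object FlinkClickAggregation {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(8)                                  // scale across more slots/cores
    env.enableCheckpointing(60000)                         // checkpoint every 60 s for recovery
    env.setStateBackend(new EmbeddedRocksDBStateBackend()) // large state lives on local disk

    // Inline events keep the sketch self-contained; a Kafka source would be used in practice.
    val events = env.fromElements(
      ClickEvent("ad-1", System.currentTimeMillis()),
      ClickEvent("ad-2", System.currentTimeMillis()))

    val withTimestamps = events.assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[ClickEvent](Duration.ofSeconds(30)) // tolerate 30 s of disorder
        .withTimestampAssigner(new SerializableTimestampAssigner[ClickEvent] {
          override def extractTimestamp(e: ClickEvent, recordTs: Long): Long = e.eventTime
        }))

    // Records later than watermark + allowed lateness are diverted here instead of dropped.
    val lateTag = OutputTag[(String, Long)]("late-clicks")

    val perAdCounts = withTimestamps
      .map(e => (e.adId, 1L))
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.minutes(1)))
      .allowedLateness(Time.minutes(2))   // keep window state 2 extra minutes for late updates
      .sideOutputLateData(lateTag)        // anything later goes to the side output
      .sum(1)

    perAdCounts.print()
    perAdCounts.getSideOutput(lateTag).print()  // reprocess or log very late events separately

    env.execute("ad-click-aggregation")
  }
}
```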
Raw Data Storage for Time-Series Events : There are several database technologies well-suited for storing time-series data such as impressions and clicks. Choosing the right one depends on specific requirements like query patterns, scalability, and integration with analytics tools. Below are some commonly used options:
Apache Cassandra: A distributed, highly available, and scalable NoSQL database that works well for time-series event storage. Data can be partitioned by event_date and hour to optimize query performance and parallel processing (see the schema sketch after this list).
Google BigQuery: A fully managed, columnar data warehouse designed for fast analytics. Proper schema design is crucial for performance. Leveraging clustering on relevant columns (like date or campaign ID) can further improve query efficiency.
Apache Druid: Designed for real-time analytics on high-volume streaming data, Druid is ideal for clickstreams and log data. Partitioning by day and setting storage granularity to daily intervals helps manage data efficiently. Enabling rollups allows pre-aggregation of data, reducing storage needs and improving performance.
InfluxDB: A purpose-built time-series database that is highly optimized for this use case. It is schemaless: fields are stored as values without a predefined schema, data is categorized using tags, and the equivalent of a table is called a measurement. Time-based queries are straightforward and efficient.
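For the Cassandra option, the (event_date, hour) partitioning mentioned above could be expressed with a table definition along these lines; the keyspace, table, and column names are illustrative, and the DDL is issued here through the DataStax Java driver from Scala.

```scala
import com.datastax.oss.driver.api.core.CqlSession

// Assumes an existing "ads" keyspace and a reachable Cassandra node (localhost:9042 by default).
val session = CqlSession.builder().build()

// Partitioning by (event_date, hour) keeps each partition bounded to one hour of events
// and lets queries and batch jobs fan out over hours in parallel.
session.execute(
  """CREATE TABLE IF NOT EXISTS ads.raw_ad_events (
    |  event_date date,
    |  hour       int,
    |  event_time timestamp,
    |  ad_id      text,
    |  event_type text,
    |  user_id    text,
    |  PRIMARY KEY ((event_date, hour), event_time, ad_id)
    |) WITH CLUSTERING ORDER BY (event_time DESC, ad_id ASC)""".stripMargin)

session.close()
```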
Best Practices
Regardless of the technology chosen, it’s critical to:
Design the data model carefully upfront to handle all potential edge cases and ensure future scalability.
Choose the right partitioning strategy based on time intervals or other logical keys to support fast and efficient data retrieval.
Implement regular backups to a secondary storage layer (such as cloud object storage) to safeguard against data loss due to failures or corruption.
Proper planning and maintenance of your raw data storage layer are essential to ensure performance, reliability, and long-term data integrity in any analytics or streaming pipeline.
Aggregator Service : The Aggregator Service is responsible for consolidating ad click data consumed from the message queue (e.g., Kafka). It reads incoming event streams and performs real-time aggregation to compute key advertising metrics. The core aggregation tasks include:
Top 10 Ads per minute based on click or impression volume
Total number of clicks or impressions per campaign
These aggregations can be efficiently performed using stream processing frameworks such as Apache Spark Streaming or Apache Flink (a ranking sketch follows below).
To ensure correctness and efficient processing using Kafka:
All events from the same ad should be routed to the same Kafka partition. This guarantees that events for a particular ad are handled by the same executor or reducer during stream processing.
Input data should be read in 1-minute intervals, aligning with the aggregation window.
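A rough sketch of the top-10 ranking, continuing the Flink style used earlier: the AdCount type and the upstream per-ad, per-minute count stream are assumptions, and the ranking is done in a second, non-keyed window over those counts.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// One record per ad per 1-minute window, produced by an upstream keyed count (assumed).
case class AdCount(adId: String, windowEnd: Long, clicks: Long)

def top10PerMinute(adCounts: DataStream[AdCount]): DataStream[Seq[AdCount]] =
  adCounts
    // Window results carry the window-end timestamp, so a second event-time window lines up.
    // windowAll runs with parallelism 1, which is acceptable for a small ranked output per minute.
    .windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
    .process(new ProcessAllWindowFunction[AdCount, Seq[AdCount], TimeWindow] {
      override def process(context: Context,
                           elements: Iterable[AdCount],
                           out: Collector[Seq[AdCount]]): Unit =
        // Rank every ad seen in this minute and keep the 10 with the most clicks.
        out.collect(elements.toSeq.sortBy(-_.clicks).take(10))
    })
```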
Streaming System Design Best Practices
Resource Allocation : Allocate sufficient CPU and memory to streaming jobs to handle high-throughput data ingestion and processing.
Checkpointing : Enable checkpointing to persist intermediate results. This allows recovery and continuation in case of job failures or restarts, minimizing data loss.
Late Events Handling : Use watermarks and define appropriate window durations to handle late-arriving events gracefully, ensuring accurate and complete aggregations.
Offset Management : Avoid relying solely on Kafka for offset storage. Instead:
Persist offsets in an external system like MongoDB after each batch is successfully processed.
Combined with idempotent writes, this enables effectively exactly-once processing and safe restarts without reprocessing already handled data.
Idempotent Processing : Ensure that message processing is idempotent, meaning duplicate messages do not affect final results. This may add complexity to the system design but is crucial for correctness and reliability.
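One common way to make processing idempotent is to deduplicate on a unique event id before counting. A minimal sketch using a Redis set-if-absent key per event follows; the event id field, key prefix, and 24-hour TTL are assumptions for illustration.

```scala
import redis.clients.jedis.Jedis
import redis.clients.jedis.params.SetParams

// Returns true only the first time a given event id is seen; replayed or duplicated
// messages are skipped so they do not inflate the aggregated counts.
def isFirstDelivery(jedis: Jedis, eventId: String): Boolean = {
  val reply = jedis.set(
    s"seen:$eventId", "1",
    SetParams.setParams().nx().ex(86400))  // set only if absent; expire after 24 h (assumed TTL)
  reply != null                            // Redis replies OK on first set, null on duplicates
}

// Usage inside the aggregator loop (sketch; event.eventId is an assumed unique id field):
// if (isFirstDelivery(jedis, event.eventId)) applyToAggregates(event)
```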
Query Service : The Query Service is responsible for handling requests originating from the dashboard. It processes these requests by generating and executing queries against the golden database, which serves as the trusted source of truth for analytics data.
Upon receiving a query request, the service constructs the appropriate query, executes it on the database, and returns the results to the dashboard for visualization or further analysis.
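As a sketch of that request path (the table name, column names, and JDBC URL are assumptions about the golden database), the service might translate a dashboard request into a parameterized query like this:

```scala
import java.sql.{DriverManager, Timestamp}

// Returns (ad_id, total clicks) for one campaign over a time range, as requested by the dashboard.
def clicksPerAd(campaignId: String, from: Timestamp, to: Timestamp): Seq[(String, Long)] = {
  val conn = DriverManager.getConnection("jdbc:postgresql://golden-db:5432/analytics")  // placeholder URL
  try {
    val stmt = conn.prepareStatement(
      """SELECT ad_id, SUM(clicks) AS total_clicks
        |FROM   ad_click_aggregates
        |WHERE  campaign_id = ? AND window_start BETWEEN ? AND ?
        |GROUP BY ad_id
        |ORDER BY total_clicks DESC""".stripMargin)
    stmt.setString(1, campaignId)
    stmt.setTimestamp(2, from)
    stmt.setTimestamp(3, to)
    val rs = stmt.executeQuery()
    Iterator.continually(rs)
      .takeWhile(_.next())
      .map(r => (r.getString("ad_id"), r.getLong("total_clicks")))
      .toList
  } finally conn.close()
}
```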
Dashboard : The dashboard can be built using BI tools such as Tableau, Apache Superset, or cloud-based visualization platforms like Amazon QuickSight (AWS), Looker (GCP), or Azure Dashboards.