Becoming a data-driven organization remains one of the top strategic priorities for companies in 2022. Data is still a key differentiator for gaining an edge in today’s market. From curating customer experience to achieving operational excellence, data enables every initiative to support quick and informed decision making. People usually equate being data-driven with the ability to store and process massive volumes of data; however, delivering data and insights on time is equally crucial. For example, consider fraudulent credit card transaction detection in the Banking & Financial Services (BFS) industry: an immediate notification to the customer can prevent the fraud. Traditional batch (ETL) architectures cannot accommodate such low-latency data requirements. The velocity of retrieval was less important in the past, but with the increasing volume of streaming data, enterprises today need to design infrastructure that addresses this aspect of big data.
The big data buzz originally started with the 3 V’s (Volume, Variety and Velocity); the list has since grown past 40 (see “42 V’s of Big Data & Data Science”). A big data architecture handles the ingestion, processing and analysis of data that traditional database systems cannot handle easily. To show how big data pipelines can be architected around the velocity requirements of a use case, we will discuss two popular architectures here: Lambda and Kappa. But before getting into those, let’s first revisit the two processing approaches: batch and streaming.
Data Processing — Batch and Streaming
Batch Processing — In this approach, data is collected over a period of time in batches and then processed using transformation and aggregation functions, before it is moved to an analytical system such as a data warehouse (DWH) or another database. Many legacy systems still run completely on batch processing, as it is ideal for processing large datasets and implementing complex transformations using powerful tools. Some of the characteristics of batch processing are:
· Able to recalculate the complete dataset from the raw/persistence layer based on the logic in the data pipelines.
· Easy to implement complex logic using the powerful tools available.
· The computed results are accurate. Queries are re-executed across the entire dataset to update the results when new data arrives.
· Significant delays in data processing (depending on the job frequency).
· Data needs to be loaded and processed; hence the volume of processed data is limited by the disk space of the platform.
Platforms — ETL/ELT tools
Processing — Hadoop (MapReduce), Spark
Use cases — Payroll processing, credit card bill generation, etc.
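The recompute-everything behaviour described above can be illustrated with a minimal sketch. The dataset `RAW_TRANSACTIONS` and the function `run_batch_job` are hypothetical names chosen for this example; a real batch job would run on an engine such as Spark against a persistent raw layer.

```python
from collections import defaultdict

# Hypothetical raw layer: every record ever collected, kept unchanged.
RAW_TRANSACTIONS = [
    {"card": "A", "amount": 120.0},
    {"card": "B", "amount": 40.0},
    {"card": "A", "amount": 15.5},
]


def run_batch_job(raw_records):
    """Recompute total spend per card from the complete raw dataset."""
    totals = defaultdict(float)
    for record in raw_records:
        totals[record["card"]] += record["amount"]
    return dict(totals)


# First scheduled run.
statement_v1 = run_batch_job(RAW_TRANSACTIONS)

# New data arrives; the next run re-executes over everything, not just the delta.
RAW_TRANSACTIONS.append({"card": "B", "amount": 60.0})
statement_v2 = run_batch_job(RAW_TRANSACTIONS)
```

Because each run starts from the raw records, the result is always consistent with the full history, but it is only as fresh as the last scheduled run.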
Figure-1 below shows the traditional ETL and ELT architectures where batch processing is leveraged.
Stream Processing — Streaming scenarios are event oriented. In an event-driven architecture, an event can be defined as “a significant change in state”. For example, an event could be a click on a website or any write/modify operation in a database; it triggers a data stream that is processed in real time before being moved to the target systems. Stream processing is ideal for scenarios that require speed and nimbleness. Some of the characteristics of stream processing are:
· Data transformations such as aggregation and filtering are executed on the fly.
· Complex processing logic is difficult to implement on data streams.
· Data streams are processed record by record as they arrive, or in windows of a configured size, so processing is not limited by disk space.
· Results are likely to be less accurate because they are updated incrementally for each newly arriving record.
· Real-time processing makes it difficult to correct errors retroactively, since the data has already been processed and moved to the target system.
Platforms — Apache Kafka
Processing — Apache Storm, Apache Flink, Spark Streaming
Use Cases — Fraud detection, ATM transactions, financial stock trading.
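As a rough illustration of record-by-record processing with a window, the sketch below counts card events in fixed (tumbling) time windows and raises an alert the moment a threshold is reached. The class `TumblingWindowCounter`, the function `detect_bursts`, the 60-second window and the threshold of 3 are all assumptions made for this example, not taken from any particular framework.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per key in fixed, non-overlapping time windows."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (window_start, key) -> count

    def on_event(self, timestamp, key):
        # Assign the event to the window that contains its timestamp.
        window_start = timestamp - (timestamp % self.window_seconds)
        self.counts[(window_start, key)] += 1
        # The updated result is available immediately after each record.
        return window_start, self.counts[(window_start, key)]


def detect_bursts(events, window_seconds=60, threshold=3):
    """Yield an alert as soon as a card reaches `threshold` events in one window."""
    counter = TumblingWindowCounter(window_seconds)
    for timestamp, card in events:
        window_start, count = counter.on_event(timestamp, card)
        if count == threshold:
            yield {"card": card, "window_start": window_start, "at": timestamp}


# Hypothetical stream of (timestamp_in_seconds, card) events.
stream = [(0, "A"), (10, "B"), (20, "A"), (45, "A"), (70, "A")]
alerts = list(detect_bursts(stream))
```

The alert is produced while the stream is still being consumed, which is the point of the approach. Note that this sketch never evicts old windows; a production job would expire closed windows to keep state bounded.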
The following is a fair comparison between batch and stream processing based on their capabilities:
To summarize, there is no winner here, since batch and stream processing each have their own strengths and weaknesses. As promised, let us now discuss the architectures that leverage one or both of them in designing big data pipelines.
Lambda architecture, first proposed by Nathan Marz, combines batch and stream-processing methods. It attempts to mitigate the high latency of batch processing, maintain a balanced throughput to avoid storage and memory issues, and remain fault tolerant. It does so by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The processing layers ingest from an immutable master copy of the entire dataset, which makes it easy to restore the system state by simply reprocessing all of the data.
Event streams are forked (duplicated) across two independent processing layers. Events are timestamped and appended to the existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
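A minimal sketch of this idea, assuming a simple in-memory log: events are only ever appended, and the current state is derived by applying them in timestamp order. `Event`, `EventLog` and `current_state` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # event time, e.g. epoch seconds
    key: str
    value: str


class EventLog:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_all(self):
        # Return an immutable snapshot so readers cannot modify the log.
        return tuple(self._events)


def current_state(events):
    """Derive the latest value per key by applying events in timestamp order."""
    state = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        state[event.key] = event.value
    return state


log = EventLog()
log.append(Event(100, "order-1", "created"))
log.append(Event(105, "order-2", "created"))
log.append(Event(110, "order-1", "shipped"))

# Fork: both processing layers read the same events independently.
batch_input = log.read_all()
speed_input = log.read_all()
```

Since "shipped" is recorded as a new event rather than overwriting "created", the full history of `order-1` remains available for reprocessing.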
Lambda architecture (Figure-2) consists of three layers: batch, speed, and serving. Let’s discuss each layer before moving to the overall architecture.
Batch Layer — combines incoming data events with large volumes of historical data, stores them as an immutable, constantly growing master dataset, and then iteratively recomputes over the entire dataset to produce accurate results. Initially, the raw data is stored in a Hadoop-based data lake or in a cloud data warehouse such as Snowflake or Redshift. The results are produced with high latency and are generally stored in a denormalized view called a batch view. Thanks to the batch layer, the system can update business logic or fix a bug by reprocessing the entire immutable dataset and restoring the state from it.
Stream Layer (Speed Layer) — handles the data that has arrived since the last processing cycle in the batch layer. These incremental loads are processed with low latency, producing near-real-time results. The data here may lag by only a small time window, but the trade-off is lower accuracy in exchange for low latency. The outputs are called speed views; they are typically stored in fast NoSQL or column-oriented databases and are deleted once the batch layer has processed the same data.
Serving Layer — Eventually, the batch views and the incremental speed views from the stream layer are merged and analyzed to produce the final output. The serving layer indexes the batch view for efficient querying and updates it with the incremental updates from the stream layer. Any database can be used here, either in-memory or persistent, including special-purpose stores, e.g., for full-text search.
The data is queried from this merged view through any communication paradigm (real-time, near real-time, batch, request-response) to be ingested into a business application or an analytical database.
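The interplay of the three layers can be sketched in a few lines. This is a simplified, in-memory illustration under assumed names (`master_dataset`, `build_batch_view`, `build_speed_view`, `serve`); real deployments would use separate engines and stores for each layer.

```python
from collections import Counter

# Hypothetical master dataset: append-only (timestamp, hashtag) events.
master_dataset = [(1, "#kafka"), (2, "#spark"), (3, "#kafka"), (4, "#flink"), (5, "#kafka")]


def build_batch_view(events, up_to):
    """Batch layer: recompute counts from scratch for all events up to a cutoff."""
    return Counter(tag for ts, tag in events if ts <= up_to)


def build_speed_view(events, after):
    """Speed layer: count only events that arrived after the last batch cutoff."""
    view = Counter()
    for ts, tag in events:
        if ts > after:
            view[tag] += 1  # incremental, per-record update
    return view


def serve(batch_view, speed_view, tag):
    """Serving layer: answer a query by merging both views."""
    return batch_view[tag] + speed_view[tag]


batch_cutoff = 3  # the last batch run covered events with timestamp <= 3
batch_view = build_batch_view(master_dataset, up_to=batch_cutoff)
speed_view = build_speed_view(master_dataset, after=batch_cutoff)
kafka_mentions = serve(batch_view, speed_view, "#kafka")
```

When the next batch run moves the cutoff forward, the speed view for the covered period becomes redundant and can be discarded, which mirrors the deletion of speed views described above.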
Figure-3 below shows a Lambda architecture implemented for real-time analysis of tweets. It can process and analyze a tweet dataset that is highly random and frequent. The project uses the Twitter4J streaming API and Apache Kafka to collect and store Twitter’s real-time data. The Apache Spark engine processes the data in batch, while an Akka scheduler triggers the batch processes at specified intervals. Stream layer processing is handled by Spark Streaming. The batch view and speed view results are stored in Cassandra, and the two views are combined in the serving layer on Apache HBase, a column-oriented non-relational database. Finally, Akka HTTP is used to build REST APIs that fire ad hoc queries against the serving layer.
Concerns with Lambda
Companies invest a lot of time, effort, and money to augment the traditional ETL with real-time processing through Lambda architecture. Despite many success stories, we still hear about issues with this architecture. Below are some of the common issues discussed:
Dual code base — Two different code bases need to be maintained: one in the batch layer and one in the streaming layer. Any code change, new functionality or hotfix in one has to be propagated to the other. In the batch layer, development can be relatively easy thanks to the availability of sophisticated tools, but translating the same logic to the streaming layer is complex and challenging.
Infrastructure Overhead — Two distributed systems need to be set up to run this architecture, and everything is processed (at least) twice. This roughly doubles the costs of platform maintenance (storage, network, compute), soft costs (development), infrastructure monitoring, logs, support, etc.
Data Quality — Ideally, both layers should apply matching processing logic to ensure the quality of the data delivered, but it is difficult to measure coherence across two distributed systems. Developers mostly rely on manual code checks or programming acumen to validate this, which calls the overall data quality of the results into question.
Complexity — It is difficult to implement a seamless toggle mechanism that decides whether results are read from the batch layer or the stream layer. Delays in batch jobs often show through in the stream layer.
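The dual code base and data quality concerns above can be made concrete with a small sketch. Here the same business rule is written twice, once per layer, and the streaming copy contains a deliberate bug; a simple reconciliation function then reports where the two views disagree. All names (`batch_total`, `stream_total`, `reconcile`) are hypothetical and the check is only one possible way to automate such a comparison.

```python
def batch_total(events):
    """Batch-layer logic: sum amounts per account over the full dataset."""
    totals = {}
    for account, amount in events:
        totals[account] = totals.get(account, 0) + amount
    return totals


def stream_total(events):
    """Speed-layer logic, written separately; this copy has a deliberate bug."""
    totals = {}
    for account, amount in events:
        if amount > 0:  # bug: refunds (negative amounts) are silently dropped
            totals[account] = totals.get(account, 0) + amount
    return totals


def reconcile(batch_view, speed_view):
    """Return the keys whose values differ, as key -> (batch, speed)."""
    keys = set(batch_view) | set(speed_view)
    return {
        key: (batch_view.get(key), speed_view.get(key))
        for key in keys
        if batch_view.get(key) != speed_view.get(key)
    }


sample = [("acc-1", 100), ("acc-2", 50), ("acc-1", -30)]
mismatches = reconcile(batch_total(sample), stream_total(sample))
```

Running both implementations over the same sample input surfaces the drift for `acc-1`, something a manual code review could easily miss.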
Because of the added complexity of maintaining two different processing layers in Lambda architecture, Jay Kreps proposed a new architecture that could perform both kinds of processing with a single technology stack, known as Kappa architecture.
Kappa architecture is an event-based software architecture pattern for handling data at any scale in real time, for both transactional and analytical workloads. The key idea is to simplify the infrastructure and process all of the data on a single processing engine. The data is ingested as a stream of events into a distributed, fault-tolerant, immutable log.
Figure-4 below depicts a Kappa architecture, which is composed of two layers: stream and serving. These are essentially the same layers as in Lambda architecture: the stream layer runs the stream processing jobs that enable real-time processing of the event streams, and the serving layer makes the results available for querying. Alongside real-time processing in the stream layer, the ordered immutable log is mirrored to cheap, persistent storage such as S3 object storage in the public cloud or Hadoop on on-prem systems. This tiered storage is introduced in Kappa architecture to address issues related to scalability, efficiency, and operations. Unlike in Lambda architecture, the entire dataset is reprocessed only when the processing logic changes. The immutable log is then replayed from persistent storage at high throughput, typically using parallelism to complete the computation in a timely fashion.
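The replay mechanism can be sketched as follows, assuming a log that is split into partitions and a single job function that serves both live processing and reprocessing. `LOG_PARTITIONS`, `process_v1`, `process_v2` and `run_stream_job` are hypothetical names, and a thread pool stands in for the parallelism a distributed engine would provide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical immutable log, split into partitions as in a distributed log.
LOG_PARTITIONS = [
    [("sensor-1", 20), ("sensor-2", 35)],
    [("sensor-1", 22), ("sensor-3", 90)],
]


def process_v1(event):
    """Original logic: pass the reading through unchanged."""
    sensor, reading = event
    return sensor, reading


def process_v2(event):
    """Changed logic: clamp readings to a maximum of 50."""
    sensor, reading = event
    return sensor, min(reading, 50)


def run_stream_job(partitions, process):
    """Replay every partition through one processing function, in parallel."""

    def replay(partition):
        return [process(event) for event in partition]

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # map() returns results in the same order as the input partitions.
        results = list(pool.map(replay, partitions))
    return [row for partition_rows in results for row in partition_rows]


serving_v1 = run_stream_job(LOG_PARTITIONS, process_v1)
# Logic changed: replay the same log with the new code to rebuild the output.
serving_v2 = run_stream_job(LOG_PARTITIONS, process_v2)
```

There is only one code path to maintain: changing the logic means swapping the processing function and replaying the log, rather than updating a separate batch implementation.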
Kappa architecture is implemented at Uber to manage trillions of messages and petabytes of data per day, running on 2000+ nodes. Figure-5 below shows how Kappa is implemented at Uber, enabling both transactional and analytical workloads. The infrastructure is real-time, scalable, fault-tolerant, and reliable. Events are ingested from different producers such as applications and relational and NoSQL databases, and are processed in real time on a Kafka platform. The results can be consumed by different microservices in the data mesh or by analytical platforms via any communication paradigm.
Concerns with Kappa
Event streaming is a paradigm shift. Although Kappa sounds like a perfect fit for the enterprise architecture, this pattern faces a few challenges and limitations.
Cost — The simplicity achieved with Kappa comes at a significant cost. Running streams with an indefinite TTL (Time to Live) is very expensive. Native cloud streaming services often do not support long retention (for example, more than 7 days), so the architecture typically runs on Platform as a Service (PaaS) or Infrastructure as a Service (IaaS), which adds administration cost to your architecture.
Event streaming platforms don’t align with traditional data catalog tooling. Strong organizational governance over data in motion would require solutions such as Confluent Enterprise, which incur additional licensing costs.
There are not many good streaming frameworks available for this architecture as of now. Further, not all algorithms can be expressed as streaming computations. Hence, a lot of development cost is involved.
Reprocessing data — Any change to the existing processing code, such as a change in logic or the addition of a new field, requires the entire stream to be replayed and the time slices in the serving layer to be rebuilt as needed.
Out of Order Data — Delays in processing might lead to an accumulation of out-of-order data in the stream. Hence, an exception dispatcher is needed for data re-processing or validation.
Complex Queries — Kappa is not ideal for complex queries that translate to joining a large number of tables from an RDBMS.
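To make the out-of-order concern more tangible, here is a minimal sketch of one common way to handle it: track the highest timestamp seen so far, tolerate a bounded amount of lateness, and route anything older to an exception queue for later re-processing. The class `LatenessRouter` and the `allowed_lateness` value are assumptions for illustration and do not represent any specific framework's API.

```python
class LatenessRouter:
    """Routes events to the normal path or to an exception queue by lateness."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_seen = None   # highest event timestamp observed so far
        self.accepted = []     # events processed on the normal path
        self.exceptions = []   # late events set aside for re-processing

    def on_event(self, timestamp, payload):
        if self.max_seen is None or timestamp > self.max_seen:
            self.max_seen = timestamp
        # Watermark: events older than this are considered too late.
        watermark = self.max_seen - self.allowed_lateness
        if timestamp < watermark:
            self.exceptions.append((timestamp, payload))
        else:
            self.accepted.append((timestamp, payload))


router = LatenessRouter(allowed_lateness=5)
for ts, payload in [(10, "a"), (12, "b"), (9, "c"), (20, "d"), (11, "e")]:
    router.on_event(ts, payload)
```

Event `c` arrives slightly out of order but within the tolerance, so it stays on the normal path; event `e` arrives after the stream has moved well past it and is dispatched to the exception queue instead.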
Big data architectures for enterprises require a lot of consideration. Many pipelines today are still based on the traditional batch-only mode. Once the data is ingested using a publish-subscribe messaging system, either architectural pattern can be implemented by combining various open-source technologies such as Spark, Apache Hadoop (HDFS, MapReduce), Apache Drill, Spark Streaming, Apache Samza, etc. The benefit of one architecture looks like a pitfall of the other, so the choice between them depends on the “Velocity vs Veracity” trade-off dictated by the requirements of the use case.
Kappa architecture provides enormous benefits and a much simpler infrastructure than Lambda architecture, but it is not a silver bullet for every problem. When the algorithms used for real-time and historical data are identical and can be implemented on streams, it is clearly beneficial to implement Kappa and process historical and real-time data on the same codebase. For more complex use cases with dynamic, bursty loads, heavy compute and long processing times, Lambda is often the better choice. Machine learning and data science applications often implement Lambda architectures for these reasons.
Ultimately the primary goal is to minimize time to value — the reason for considering distributed systems architecture in the first place! If you are still experimenting, the key is to start small in scope with well-defined deliverables for a use case and then iterate.