Apache Kafka is an open-source distributed event streaming platform developed by the Apache Software Foundation and written in Scala and Java. It implements a software bus using stream processing, providing a unified, high-throughput, low-latency platform for handling real-time data feeds, and it is used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Kafka was built with the vision of becoming the central nervous system that makes real-time data available to every application that needs it, with use cases ranging from stock trading and fraud detection to transportation, data integration, and real-time analytics. It is a distributed streaming platform that offers everything from redundant storage of massive data volumes to a message bus capable of throughput reaching millions of messages per second. These capabilities make Kafka well suited to processing streaming data from real-time applications.
At its core, Kafka is a commit log with a very simple data structure; it just happens to be an exceptionally fault-tolerant and horizontally scalable one. The commit log is a persistent, ordered data structure: records can only be appended to the log, never modified or deleted in place. The Kafka cluster creates and maintains a partitioned commit log for each topic, and all messages sent to the same partition are stored in the order they arrive. Because of this, the sequence of records within a partition is ordered and immutable. Kafka also assigns each record a unique sequential ID known as an "offset," which is used to retrieve data.
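To make the offset concept concrete, here is a minimal sketch using the official Java client. The broker address (localhost:9092) and topic name ("events") are illustrative assumptions; the sketch assigns partition 0, seeks to the beginning of the log, and prints each record's offset, which increases in exactly the order the records were appended.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Read partition 0 of a hypothetical "events" topic from the start of the log.
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            // Records come back in offset order, i.e. the append order of the commit log.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```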
Kafka Terminology:
Kafka uses its own terminology for its basic building blocks and key concepts, and the meaning of these terms can differ from how other technologies use them. The following list defines the most important ones (a short producer sketch follows the list):
Broker
A broker is a server that stores messages sent to topics and serves consumer requests.
Topic
A topic is a queue of messages written by one or more producers and read by one or more consumers.
Producer
A producer is an external process that sends records to a Kafka topic.
Consumer
A consumer is an external process that receives topic streams from a Kafka cluster.
Client
Client is a term used to refer to either producers or consumers.
Record
A record is a publish-subscribe message. A record consists of a key/value pair and metadata including a timestamp.
Partition
Kafka divides records into partitions. A partition can be thought of as a subset of all the records for a topic.
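As referenced above, the following sketch ties the terms together from the producer side, using the official Java client. The broker address (localhost:9092) and topic name ("events") are assumptions for illustration: the producer client sends a single record (a key/value pair) to the topic, and the broker reports back which partition and offset the record landed on.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class TerminologyDemoProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer is a client process that sends records to a Kafka topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record is a key/value pair; records with the same key go to the same partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "user-42", "page_view");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("wrote to partition %d at offset %d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```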
References:
Apache Kafka - https://kafka.apache.org/
Kafka GitHub - https://github.com/apache/kafka
Confluent (real-time streams powered by Apache Kafka) GitHub - https://github.com/confluentinc
ZeroMQ vs Kafka - https://www.educba.com/zeromq-
Kafka architecture -
1. https://www.instaclustr.com/apache-kafka-architecture/
2. https://data-flair.training/blogs/kafka-architecture/
3. https://dzone.com/articles/kafka-architecture
4. https://docs.cloudera.com/runtime/7.2.7/kafka-overview/topics/kafka-overview-architecture.html
Fundamentals of Apache Kafka - https://www.confluent.io/online-talks/fundamentals-for-apache-kafka
How Kafka works - https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand/
Confluent Platform Reference Architecture for Kubernetes - https://www.confluent.io/resources/confluent-platform-reference-architecture-kubernetes/
How to download Kafka code - https://kafka.apache.org/code
Kafka Go clients -
1. Uber Kafka client - https://github.com/uber-go/kafka-client
2. Confluent Kafka client - https://github.com/confluentinc/confluent-kafka-go
3. Segmentio Kafka client - https://github.com/segmentio/kafka-go
Kafka Python Client - https://github.com/dpkp/kafka-python
Kafka examples - https://github.com/confluentinc/examples