In this world of data where things and systems started depending on data, it is very important to get the right data at a right time to get the most of it. In this a great architecture of data streaming – “Apache Kafka” has introduced in 2011
Here I am brining a short course for Kafka where try to provide a basic understanding of Kafka with it’s core architecture and some hands-on on the producer consumer code.
So let’s get started 😊
What is Kafka?
Apache Kafka was originated at LinkedIn and later became an open-sourced Apache project in 2011, then a first-class Apache project in 2012. Kafka is written in Scala and Java.
Apache Kafka is a publisher-subscriber concept based on a fault-tolerant messaging system. It is fast, scalable, and distributed by design.
“Kafka is an Event Streaming architecture.”
Event streaming is capturing data in real-time from various event sources like databases, cloud services, software applications, etc.
Kafka is a messaging system. This is typically suits for the application that requires high throughput and low latency. It can be used for real-time analytics.
Kafka can work with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingesting, analysis and processing of streaming data. Kafka is a data stream used to feed Hadoop BigData lakes. Kafka brokers support massive message streams for a low-latency follow-up analysis in Hadoop or Spark.
Basics of Kafka:
Apache.org states that:
- Kafka runs as a cluster on one or more servers.
- The Kafka cluster stores a stream of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Key Concepts :
Events and Offset :
Kafka uses Log data structure to store the Event/Messages. Each message/Event has a unique Key. Kafka ensures that the message should not be duplicate and must be in sequence.
Offsets are the pointers to understand from where data needs to be picked.
Events/Messages can stay in the partition for very long period and even forever.
Topic and Partitions :
Topic is a uniquely defined category in which producer publishes messages.
Each topic contain one or many partitions. Partitions contains messages.
Messages are written to topics and kafka uses round robin to selects which partition to write the message to.
To make sure that some particular type of messages should go to same partition we can assign Key to the messages, attaching a key to messages will ensure messages with the same key always go to the same partition in a topic. Kafka guarantees order within a partition, but not across partitions in a topic.
Cluster and Broker :
Kafka cluster can have multiple brokers inside it, to maintain load balancing. A single Kafka server is called as Kafka broker. Kafka cluster is stateless hence to maintain cluster state Kafka uses Zookeeper.
I’ll cover zookeeper in the next point. For now let’s understand what is broker.
Broker receives messages from producer and assign offset to it and then store it on local disk.
Broker is also responsible to serve message fetch request coming from consumer.
Each broker contains one or more Topics. Each topic along with their partitions can be assigned to multiple broker but the owner or leader will be only one.
For example in the below diagram Partition 0 is replicated along with topic X in Broker 1 and Broker 2, but the leader will always be only one. The replica is used as a backup of partition. So that if any particular broker fails then the replicator takes leadership.
Producer and consumer only connects to the Leader partition.
Kafka uses Zookeeper to maintain and coordinate between brokers.
Zookeeper is also sends notification to the Producer and consumer about the presence of any new broker or if any new leader created. So that according to that they can make decision and start coordinating the task accordingly.
A consumer group is a platform where we can have multiple consumers. Each consumer group has one unique Id.
Only one consume in the group can pull the messages from a particular partition. Same consumer group can not have multiple consumers of same partition.
Multiple consumers can consume messages from same partition but they must be from different consumer groups.
If the consumers are more in same group and partitions are less then there are changes to have some inactive consumers in the group.
Kafka is an event based messaging system. Mostly suited for applications where big amount of real time data needs to be processed.
In the complete architecture of Kafka it provides load balancing, data backup, maintain message order, facility to read messages from a particular position, message storage for longer period, message can be fetched by multiple consumers of different groups.