Stream ingestion and stream processing are fundamental components of distributed systems. This blog series focuses on how Kafka can be used to create fast and fault-tolerant systems for processing data streams. The overall goal of this multi-part guide is to take a practical learning approach where we build a simple web application that produces data on click events and a Kafka cluster that can process this data.
This post is about getting acquainted with Kafka and its main uses. By the end, you should:
- Understand what Kafka is used for.
- Be familiar with Kafka’s terminology.
- Build a Kafka cluster that you can run from the command line.
The features that distinguish Apache Kafka from other technologies may initially seem rather subtle. I will introduce concepts and terminology gradually, as they become relevant rather than up front. If you’d rather brush up on your own first, feel free to browse these resources:
- Kafka’s “Get Started” page is a very thorough but at times undermotivated description of Kafka and its capabilities.
- Cloudera breaks down some use cases for Kafka.
- Hooking up Kafka Manager is a great way to start visualizing clusters.
What Kafka Gives You
To put things somewhat crudely, Kafka is a waypoint for data moving into or out of a distributed system. Kafka decouples data producers from data consumers, but it is different from other messaging systems due to the way it stores and handles data as it passes through your pipeline. Here are Kafka’s three primary roles in a data pipeline:
- Messaging System: When data enters Kafka, it can be routed to one or more “consumer groups.” Like a queue, multiple consumers within the group can help process the data. Like a publish-subscribe system, multiple consumer groups can subscribe to a single stream of data.
- Storage System: Data passing through Kafka is written to disk and replicated. It can be stored for an arbitrary period of time, and consumers are responsible for maintaining their own read position in the data. This means that if a server fails, a consumer knows which position to resume reading from on a replica.
- Stream Processing System: Data passing through Kafka can be transformed before it is streamed out to consumers.
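The messaging-system role is the subtlest of the three, because Kafka behaves like a queue and like a publish-subscribe system at the same time. Here is a minimal Python sketch of that idea (a toy, not real Kafka; the class and method names are my own invention): every consumer group sees every message, while members within a group split the work.

```python
from collections import defaultdict
from itertools import cycle

class MiniBroker:
    """Toy broker: every group sees every message (publish-subscribe),
    but within a group, members take turns (queue semantics)."""
    def __init__(self):
        self.groups = defaultdict(list)   # group name -> member names
        self.rotors = {}                  # group name -> round-robin iterator

    def subscribe(self, group, member):
        self.groups[group].append(member)
        self.rotors[group] = cycle(self.groups[group])

    def publish(self, message):
        # Each group receives the message; one member per group handles it.
        return {group: (next(rotor), message)
                for group, rotor in self.rotors.items()}

broker = MiniBroker()
broker.subscribe("analytics", "a1")
broker.subscribe("analytics", "a2")
broker.subscribe("audit", "b1")

for msg in ["click-1", "click-2"]:
    print(broker.publish(msg))
```

Both groups receive both clicks, but within "analytics" the two members alternate, which is exactly the dual behavior described above.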
Exploring a Kafka Cluster
Before we build anything, it’s instructive to run a Kafka cluster and get used to some of Kafka’s main entities. The walkthrough below is largely taken from Apache’s Quickstart Guide, but it includes alternative installation instructions and my editorials.
1. Installing Kafka
On a Unix-based machine, you can download a tarball and untar it:
tar -xzf kafka_<my-kafka-version>.tgz
If you are on Mac OS X, you can use Homebrew to install a command line utility that runs Kafka.
brew install kafka
Note: The Homebrew installation on Mac OS X provides convenience wrappers for the shell scripts found within the bin directory of the un-tarred archive. However, you will still need the tarball above for its various configuration files, which are passed as arguments to the commands below. The commands in the next code block demonstrate this pattern.
2. Setting Up A Cluster
Kafka requires ZooKeeper, which is a shared configuration service that provides distributed synchronization and naming. ZooKeeper keeps track of the many streams that pass through a Kafka cluster and coordinates their consumption. The packaged configuration runs ZooKeeper on port 2181.
cd kafka_<my-kafka-version>
bin/zookeeper-server-start.sh config/zookeeper.properties # without Homebrew
zookeeper-server-start config/zookeeper.properties # with Homebrew
Next, we need to start up our Kafka message broker in another terminal:
bin/kafka-server-start.sh config/server.properties # without Homebrew
kafka-server-start config/server.properties # with Homebrew
At this point, we have the configuration server (ZooKeeper) and a single message broker (Kafka Server) set up. Technically, this is a Kafka cluster consisting of a single Kafka message broker. However, this cluster is not terribly useful because there are no data producers or consumers in our cluster yet. (Imagine the diagram below without a Producer or a Consumer Group.)
3. Adding Topics to our “Cluster”
Before we can register our producers and consumers, our Kafka cluster needs to define topics. A topic is Kafka’s term for a stream of records or messages. (Aside: The term “stream” implies that the data is moving whereas the term “topic” captures the fact that a particular record could be in transit or stored on disk.)
When making a topic, we must pass in four arguments:
- ZooKeeper’s address
- How many replicas of the topic we want
- How many data partitions the topic should use
- The topic’s name
The following topic, named test, has one replica and one partition.
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
We can see all of the topics that our cluster is serving by issuing the following command:
bin/kafka-topics.sh --list --zookeeper localhost:2181
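The partition count deserves a quick aside: the producer assigns each record to one of the topic's partitions, typically by hashing the record's key so that a given key always lands in the same partition (preserving per-key ordering). Kafka's default partitioner uses a murmur2 hash; the md5-based stand-in below is purely illustrative.

```python
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    # Hash the key so the same key always maps to the same partition.
    # (Kafka's default partitioner uses murmur2; md5 here is just a
    # deterministic stand-in for illustration.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for user in ["alice", "bob", "alice"]:
    print(user, "-> partition", choose_partition(user, 3))
```

Note that "alice" maps to the same partition both times, which is what lets a consumer of that partition see one user's clicks in order.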
4. Adding Producers and Consumers to our “Cluster”
Now that we have a topic, we are ready to create a producer and consumer. We can use Kafka’s built-in console producer to send messages to our Kafka message broker from the command line. In a new terminal session, issue the command below and write a few messages separated by a newline. Notice that our producer must know the address for the broker and the name of the topic it is producing messages for.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is my first message!
This is my second message!
Kafka also comes with a command line consumer that we can use to access the messages that we just sent to Kafka’s message broker. Issue the command below in yet another terminal session. Notice that the consumer must know the ZooKeeper address and the topic for which it is consuming messages. This is because ZooKeeper keeps track of which message in a topic each consumer has consumed. If a consumer goes offline for a while and comes back, ZooKeeper can inform that consumer where to resume consumption.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Because we instructed the consumer to access all messages --from-beginning, we should see our initial messages:
This is my first message!
This is my second message!
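The --from-beginning flag hints at Kafka's storage-system role: the broker keeps the full log on disk, and each consumer merely tracks an offset into it, so it can replay from the start or resume where it left off. A toy sketch of those two read modes (the names are illustrative, not Kafka's API):

```python
class MiniLog:
    """Toy append-only log with per-consumer offsets, mimicking how a
    consumer can replay from the start or resume where it left off."""
    def __init__(self):
        self.records = []
        self.offsets = {}   # consumer id -> next index to read

    def append(self, record):
        self.records.append(record)

    def poll(self, consumer, from_beginning=False):
        start = 0 if from_beginning else self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)  # "commit" the new position
        return batch

log = MiniLog()
log.append("This is my first message!")
log.append("This is my second message!")

print(log.poll("console-consumer", from_beginning=True))  # replays everything
log.append("This is my third message!")
print(log.poll("console-consumer"))                       # only the new record
```

Because the log itself is retained, a second consumer group can show up later and still read every message from the beginning.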
Up Next: Building Our Clickstream
The single-broker cluster that we just made is rather instructive, but it lacks some features of a realistic Kafka cluster:
- Automated producers and consumers. Real systems allow any application to connect to the message broker as a producer or consumer. There is a specific API that producers and consumers must adhere to.
- Multiple message brokers. Multiple broker nodes help process the many messages flooding in or out for a given topic. Additionally, having multiple brokers makes clusters fault-tolerant. If the “leader” node for a given topic goes down, a “follower” node becomes the new leader.
- Multiple topics. A messaging system like Kafka is probably overkill for handling a single data stream. The real utility of Kafka comes from handling multiple topics gracefully.
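The leader/follower arrangement in the second point can be sketched in a few lines of Python (again a toy model with invented names, not Kafka's actual election protocol, which ZooKeeper coordinates): writes go through a leader, every replica keeps a full copy, and a follower is promoted when the leader dies.

```python
class MiniPartition:
    """Toy view of replication: one leader takes writes, followers keep
    copies, and a follower is promoted if the leader fails."""
    def __init__(self, brokers):
        self.replicas = {b: [] for b in brokers}
        self.leader = brokers[0]

    def write(self, record):
        for log in self.replicas.values():   # replicate to every broker
            log.append(record)

    def fail(self, broker):
        del self.replicas[broker]
        if broker == self.leader:            # promote a surviving follower
            self.leader = next(iter(self.replicas))

p = MiniPartition(["broker-0", "broker-1", "broker-2"])
p.write("click-1")
p.fail("broker-0")
print(p.leader)               # a follower has taken over
print(p.replicas[p.leader])   # with a full copy of the data
```

No data is lost when broker-0 dies, because the promoted follower already holds every record, which is the fault-tolerance property we want from a multi-broker cluster.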
In the next post, we will deal with each of these concerns.