DATA PIPELINE

A Messaging System is responsible for transferring data from one application to another, so the applications can focus on data, but not worry about how to share it. In Big Data, an enormous volume of data is used.
Two types of messaging patterns are available:
In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by a maximum of one consumer only.

In the publish-subscribe system, messages are persisted in a topic. Unlike point-to-point system, consumers can subscribe to one or more topic and consume all the messages in that topic. In the Publish-Subscribe system, message producers are called publishers and message consumers are called subscribers.

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables us to pass messages from one end-point to another.
The following diagram illustrates the main terminologies.

A stream of messages belonging to a category is called a topic. Data is stored in topics. Kafka topics are analogous to radio / TV channels. Multiple consumers can subscribe to same topic and consume the messages.
Topics are split into partitions. For each topic, Kafka keeps a minimum of one partition. Each such partition contains messages in an immutable ordered sequence.
Each partitioned message has a unique sequence id called as offset. For each topic, the Kafka cluster maintains a partitioned log that looks like this:

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.

Replicas are nothing but backups of a partition. Replicas are never read or write data. They are used to prevent data loss.
Brokers are simple system responsible for maintaining the published data. Each broker may have zero or more partitions per topic.
Kafka’s having more than one broker are called as Kafka cluster.

ZooKeeper is used for managing and coordinating Kafka broker. ZooKeeper service is mainly used to notify producer and consumer about the presence of any new broker in the Kafka system or failure of the broker in the Kafka system.
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
Prerequisite: Install Java
> tar -xzf kafka_2.11-2.1.0.tgz
> cd kafka_2.11-2.1.0
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
Let's create a topic named "test" with a single partition and only one replica:
In below command 2181 is the port number we have specified in zookeeper.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
To list the current list of topics, we can query the zookeeper using below command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Test
Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message. Run the producer and then type a few messages into the console to send to the server
(In below command 9092 is the port number we configured for the broker in server.properties)
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
Kafka also has a command line consumer that will dump out messages to standard output.
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message
The Kafka producer and consumer can be coded in many languages like java, python, etc. In this section, we will see how to send and receive messages from a python topic using python.
pip install kafka-python
Producer.py
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
with open('/example/data/inputfile.txt') as f:
lines = f.readlines()
for line in lines:
producer.send('test', line)
Below python program consumes the messages from the kafka topic and prints them on the screen.
Consumer.py
from kafka import KafkaConsumer
consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'], auto_offset_reset='earliest')
consumer.subscribe(['test'])
for msg in consumer:
print(msg)
In the above program, we have
auto_offset_reset=’earliest’.
This will cause the consumer program to read all the messages from beginning if the same program is run again and again.
Instead if we have
auto_offset_reset=’latest’,
the consumer program will read messages starting from the latest offset which was consumed earlier.
Share this:

In the first part of this series, we introduced the idea of moving beyond dashboards to build diagnostic AI agents capable of uncovering the why behind business performance shifts. That article focused on architectural principles and the role of AWS Strands in enabling controlled agentic behavior. In this follow-up, we take a more detailed look at how […]

Organizations continue to process a significant portion of their operational data through documents—particularly invoices, which arrive in multiple formats, structures, and levels of quality. Traditionally, handling these documents requires manual review, data entry, and routing, which introduces delays and increases the likelihood of errors. With the steady advancement of Azure’s AI capabilities and serverless integration services, customers […]

The AI era demands more from our applications than ever before. Legacy ASP.NET applications, while reliable workhorses, often struggle with the scalability, flexibility, and integration capabilities needed to leverage modern AI services. But how do you modernize without risking business continuity? At CloudIQ, we've not only researched and documented the best strategies—we've built them. This post brings together everything we've learned: comprehensive strategy, […]
Partner with CloudIQ to achieve immediate gains while building a strong foundation for long-term, transformative success.