Getting started with Kafka client metrics
Apache Kafka is a popular open source event storage and stream processing platform. Used by more than 80% of Fortune 500 companies, it has evolved into the de facto standard for data streaming. All major cloud providers offer managed data streaming services to meet this growing demand.
One of the key benefits of choosing a managed Kafka service is that responsibility for the brokers and their operational metrics shifts to the provider, leaving you to focus on the metrics that are relevant to your application. In this article, Product Manager Uche Nwankwo provides guidance on a set of producer and consumer metrics that customers should monitor for optimal performance.
With Kafka, monitoring typically includes a variety of metrics related to topics, partitions, brokers, and consumer groups. Standard Kafka metrics include information about throughput, latency, replication, and disk usage. Consult the Kafka documentation and related monitoring tools to understand the specific metrics available for your Kafka version and how to interpret them effectively.
Why is it important to monitor Kafka clients?
Monitoring your IBM® Event Streams for IBM Cloud® instance is important to ensure optimal functioning and overall health of your data pipeline. Monitoring Kafka clients helps you identify early signs of application failures, such as high resource usage, lagging consumers, and bottlenecks. Identifying these warning signs early allows you to proactively respond to potential problems to minimize downtime and prevent disruption to business operations.
Kafka clients (producers and consumers) have their own set of metrics to monitor performance and health. The Event Streams service also supports a rich set of server-generated metrics. For more information, see Monitoring Event Streams metrics using IBM Cloud Monitoring.
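For the Java clients, these client-side metrics can be read programmatically through the `metrics()` method exposed by both `KafkaProducer` and `KafkaConsumer` (they are also published over JMX). Below is a minimal sketch, assuming a Java producer and placeholder connection settings; point `bootstrap.servers` and your credentials at your own Event Streams instance.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClientMetricsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder: replace with your Event Streams bootstrap servers and credentials.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every Kafka client exposes its metrics as a read-only map.
            Map<MetricName, ? extends Metric> metrics = producer.metrics();
            metrics.forEach((name, metric) ->
                System.out.printf("%s [%s] = %s%n",
                    name.name(), name.group(), metric.metricValue()));
        }
    }
}
```

The same pattern works for a consumer: call `consumer.metrics()` and scrape or log the values on whatever interval your monitoring stack expects.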
Client metrics to monitor
Producer metrics
| Metric | Description |
| --- | --- |
| record-error-rate | This metric measures the average number of records sent per second that resulted in errors. A high or increasing record error rate can signal data loss or data not being processed as expected, either of which can compromise the integrity of the data that Kafka processes and stores. Monitoring this metric helps ensure that the data sent by producers is written accurately and reliably to Kafka topics (the sketch after this table shows one way to read these values programmatically). |
| request-latency-avg | This is the average latency, in milliseconds, of each produce request. An increase in latency affects performance and might signal an issue. Measuring request-latency-avg helps you identify bottlenecks within your instance. For many applications, low latency is critical to a high-quality user experience, and a spike in average request latency might indicate that you are reaching the limits of your provisioned instance. You can address the issue by adjusting producer settings such as batching, or by scaling your plan to optimize performance. |
| byte-rate | The average number of bytes sent per second for a topic is a measure of throughput. If you stream data regularly, a drop in throughput can indicate an anomaly in your Kafka instance. Event Streams Enterprise plans start at 150 MB per second, split one-to-one between ingress and egress, so it is important to know how much of that you are consuming for effective capacity planning. Do not exceed two-thirds of the maximum throughput, to leave headroom for operational actions such as internal updates or failure modes (for example, the loss of an availability zone). |
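The producer metrics above can be pulled from the same `metrics()` map and filtered by name. A minimal sketch, assuming the Java producer client; the `ProducerMetricsLogger` class name is only illustrative:

```java
import java.util.Set;

import org.apache.kafka.clients.producer.Producer;

public final class ProducerMetricsLogger {
    // Metric names from the table above; byte-rate is reported per topic
    // (producer-topic-metrics group), the others in producer-metrics.
    private static final Set<String> WATCHED =
        Set.of("record-error-rate", "request-latency-avg", "byte-rate");

    public static void log(Producer<?, ?> producer) {
        producer.metrics().forEach((name, metric) -> {
            if (WATCHED.contains(name.name())) {
                System.out.printf("%s/%s = %s%n",
                    name.group(), name.name(), metric.metricValue());
            }
        });
    }
}
```

If request-latency-avg trends upward, producer batching settings such as batch.size and linger.ms are a common first lever to adjust before scaling your plan.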
Consumer metrics
| Metric | Description |
| --- | --- |
| fetch-rate, fetch-size-avg | The number of fetch requests per second (fetch-rate) and the average number of bytes fetched per request (fetch-size-avg) are key indicators of how well a Kafka consumer is performing. A high fetch-rate can signal inefficiency, especially when each fetch returns only a small number of messages, because it means too little data is received per request. fetch-rate and fetch-size-avg are affected by three configuration settings: fetch.min.bytes, fetch.max.bytes, and fetch.max.wait.ms. Tune these settings to achieve the desired overall latency while minimizing the number of fetch requests and, potentially, the load on the broker CPU (see the sketch after this table). Monitoring and optimizing both metrics ensures that data is processed efficiently for current and future workloads. |
| commit-latency-avg | This metric measures the average time between a committed record being sent and the commit response being received. Similar to the producer metric request-latency-avg, a stable commit-latency-avg means that offset commits happen in a timely manner. High commit latency might indicate a problem within the consumer that prevents it from committing offsets quickly, which directly affects the reliability of data processing. It can lead to duplicate processing if the consumer must restart and reprocess messages from a previously uncommitted offset. High commit latency also means that more time is spent on administrative tasks than on actual message processing, which can cause a backlog of messages waiting to be processed, especially in high-volume environments. |
| bytes-consumed-rate | This is a consumer fetch metric that measures the average number of bytes consumed per second. Similar to the producer byte-rate metric, this should be stable and predictable. A sudden change in the expected trend of bytes-consumed-rate might indicate an issue with your application. A low rate can be a sign of inefficiency in data fetching or of over-provisioned resources. A higher rate can overwhelm the consumers' processing capacity and therefore requires scaling, either by creating more consumers to balance the load or by changing consumer configuration such as fetch sizes. |
| rebalance-rate-per-hour | The number of group rebalances participated in per hour. Rebalancing occurs whenever a new consumer joins or an existing consumer leaves the group, and processing is delayed while partitions are reassigned, so many rebalances per hour make Kafka consumers less efficient. A misconfiguration can result in unstable consumer behavior and a higher rebalance rate. These rebalancing actions can increase latency and can cause applications to crash. Look for a low and stable rebalance-rate-per-hour to confirm that your consumer group is stable. |
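The fetch tuning settings and consumer metrics named above can be exercised together. Below is a minimal sketch, assuming the Java consumer client; the bootstrap servers, group id, topic name, and the specific fetch values are placeholders to adapt to your own workload.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerMetricsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder: replace with your Event Streams bootstrap servers and credentials.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // The three settings that drive fetch-rate and fetch-size-avg (example values).
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);      // wait for at least 1 KiB per fetch
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);  // cap each fetch at 50 MiB
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);     // but wait no longer than 500 ms

        // Metric names from the table above.
        Set<String> watched = Set.of("fetch-rate", "fetch-size-avg",
            "commit-latency-avg", "bytes-consumed-rate", "rebalance-rate-per-hour");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));  // hypothetical topic name
            consumer.poll(Duration.ofSeconds(5));     // fetch once so the metrics have data

            consumer.metrics().forEach((name, metric) -> {
                if (watched.contains(name.name())) {
                    System.out.printf("%s/%s = %s%n",
                        name.group(), name.name(), metric.metricValue());
                }
            });
        }
    }
}
```

Raising fetch.min.bytes or fetch.max.wait.ms trades a little latency for fewer, larger fetches, which usually lowers fetch-rate and raises fetch-size-avg.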
These metrics should cover a wide variety of applications and use cases. Event Streams on IBM Cloud provides a rich set of metrics, documented here, that offer additional useful insight depending on the domain of your application. Learn more about Event Streams for IBM Cloud.
What's next?
You now know the essential Kafka client metrics to monitor. We encourage you to put this knowledge into practice and try the fully managed Kafka offering on IBM Cloud. For any setup-related issues, see the Getting Started guide and FAQ.
- Learn more about Kafka and its use cases
- Provision an Event Streams instance on IBM Cloud