At LinkedIn, Kafka is the de-facto messaging platform that powers diverse sets of geographically-distributed applications at scale. Examples include our distributed NoSQL store (Espresso), stream processing framework (Samza), monitoring infrastructure (InGraphs), and derived data serving platform (Venice).
Given these use cases, it’s not surprising that Kafka usage at LinkedIn has grown exponentially. Recent production figures show that Kafka is handling around 4.5 trillion messages/day, with over 2,000 Kafka brokers. Given the scale at which Kafka is operating at LinkedIn, the underlying hardware is really put to the test. Kafka distribution ships with scripts to assist with basic operations on the cluster, such as broker-removal and partition-movement. However, advanced operations, including rebalancing the cluster, were not available in the basic distribution. So, over the last several years, LinkedIn has created and open sourced Kafka Tools that help automate and also manage Kafka clusters, up to a certain level.
As the usage of Kafka has grown at an unprecedented rate, the number of operational issues has also increased. Therefore, in August 2017, we open sourced Kafka Cruise Control to handle the large-scale operational challenges with running Kafka. As Cruise Control has evolved from its creation in mid-2017, we have added a REST interface to help with the administration of Kafka clusters remotely. As more and more Cruise Control instances have been deployed, we felt the need to have a centralized dashboard to operate and check the status of any Kafka cluster at LinkedIn. For this reason, we’re excited to announce today our latest open source Kafka tool, Cruise Control Frontend (CCFE).
For those that may be unfamiliar, Cruise Control features include:
- Kafka broker resource utilization tracking
- The ability to query the latest replica state (offline, URP, out of sync) from brokers
- Goal-based resource distribution
- Anomaly detection with self-healing
- Admin operations on Kafka (add/remove/demote brokers, rebalance cluster, run PLE)
In this post, we will take a look at the frontend for Cruise Control, which provides a birds-eye view of all the Kafka installations and provides a single place to manage all of them.
Cruise Control Frontend (CCFE)
At LinkedIn, Cruise Control has simplified Kafka cluster management. However, the Cruise Control REST API is very powerful, with many features to play around with, and this can be overwhelming to beginners. Given the distributed nature of both Kafka clusters and the teams who operate them, it was a challenge to keep everyone on the same page. CCFE has been created by the streaming SREs to address these above challenges.
CCFE is a Single Page Web Application that can be deployed with either Cruise Control or any standard webserver. Details about the clusters to be managed are made available via a simple configuration file.
CCFE features
CCFE leverages the REST API exposed by Cruise Control and provides a neat GUI to interact with. Its features include (but aren’t limited to):
-
UI Features
-
Central dashboard: Go-to place for managing all Kafka clusters in the organization.
-
Dry-run by default: To ensure that a certain action is intended, all requests to Cruise Control (CC) are made in dry-run mode by default.
-
Expensive operations: Some CC features, such as rebalancing, are expensive and depend on the cluster load; therefore, these features are highlighted in a bold color on the UI.
-
API flags: The REST API exposed by CC has conflicting flags when passed as part of the URL. It requires some experience to understand which are conflicting, so in CCFE, the UI takes care of these flags by showing only the relevant ones for a chosen action.
-
Exception responses: User can choose to see full stack traces or partial ones on the UI.
-
Async response: When a request cannot be served within a given time limit, CC automatically converts the request to async and sends the progress data back.
-
Manual refresh: All features on the UI have manual refresh (to avoid bombarding CC with periodic polling on heavy-duty clusters)
-
Show/Hide URL: Toggle the URL display in every tab to see where the request is being made from the application.
-
Preferences: Allows users to show/hide advanced features on the UI.
-
-
Informative Features
-
Cruise Control status: Displays the CC internal status, which includes details of Monitor, Executor, Analyzer, and Anomaly Detector components.
-
Kafka cluster load: Displays the Kafka cluster load, which includes broker- and host-level metrics.
-
Partition resource utilization: Displays the resource (Network In/Out, CPU, Disk) utilization for the partitions in the cluster.
-
Partition state: Displays the leaders/followers/out-of-sync replica details on each broker in the cluster.
-
Replica state: Displays the leaders/followers/in-sync/out-of-sync replica information for online/offline/under-replicated partitions.
-
Optimization proposals: Displays the optimization proposals generated based on the workload model.
-
Cruise Control tasks: Displays the history of tasks submitted to CC, along with their status.
-
-
Administrative Features
-
Add brokers to cluster: Allows movement of replicas only to new brokers.
-
Remove brokers from cluster: Allows removal of brokers from the cluster. This doesn't allow movement of replicas in the remaining brokers.
-
Demote broker from cluster: Remove the leadership of all replicas owned by the selected broker(s), and if requested so, make them the least-preferred replica for leadership election within their corresponding partitions.
-
Rebalance cluster: Performs cluster rebalance according to the default workload or user-selected parameters.
-
Stop executions: Stops ongoing add/remove/demote brokers, rebalance on the cluster.
-
Fix offline replicas: Identify and relocate offline replicas to healthy disks on alive brokers.
-
Let’s take a closer look at some of these features.
Kafka cluster status
This screen gives a quick summary of the selected Kafka cluster's status. We can get details about the number of Kafka brokers, total leader partitions, total replicas, average replication factor, out-of-sync replicas, and a detailed view of each broker in the Kafka cluster, along with details about offline partitions, under-replicated partitions, and offline log directories.
Kafka cluster status (with three brokers)
Kafka cluster load
Cruise Control internally leverages the metrics exported by the brokers and computes the resource usage (e.g., CPU, network, and disk), which is further exposed via REST API. On this page, shown in the image below, we can see the computed load from both the broker and server perspective. This is helpful to understand Kafka server and broker performance metrics.
In addition to the resource details provided by Cruise Control, several other metrics, like leader-to-follower ratio and input/output ratio, are displayed to get a clear picture of the message patterns on the brokers.
Kafka cluster load as calculated by Cruise Control
Kafka cluster administration
This is the place where we can perform all administrative activities on Kafka clusters, such as: PLE (preferred leader election), Kafka cluster rebalance, add/remove/demote brokers, and fix offline replicas.
In the case of advanced users, all supported operations of the REST API are also displayed as different UI controls (along with their conflicting behaviors).
Kafka cluster administration page
For example, when accessing the advanced features of the Rebalance cluster operation, the options look something like this:
Rebalancing Kafka cluster using advanced options
Preferences
Preferences allows users to view advanced exception responses in every tab; to view the REST endpoint that’s used for communication in every tab; and to show/hide the list of available tabs. The UI is designed in such a way that each top-level feature is shown as a tab.
Cruise Control UI preferences
Other features
In addition to those described above, there are other features in the UI that we recommend exploring to make the best utilization of CCFE. One important thing to remember is that most of the actions you can implement from the UI are run in dry-run mode by default for safety reasons. This behavior can be easily switched in the appropriate page if needed.
CCFE deployment methods
At the core, CCFE is a web application that’s written in JavaScript and runs completely in the web browser. To accommodate different types of installations, CCFE was designed to run as an admin frontend for a single Cruise Control server or as a Central UI to manage many Kafka clusters.
We define these two methods as Embedded and Standalone Methods.
Embedded Method
In this method, CCFE is deployed within Cruise Control itself and is used to manage the Cruise Control instance, where it’s deployed. To run CCFE in this mode, users have to download the CCFE distribution from GitHub and extract the artifacts into the cruise-control-ui folder within the Cruise Control application. Cruise Control has built-in support to detect this method of deployment of CCFE and automatically serves the UI when the user visits the hostname and port combination of Cruise Control in a web browser. In other words, if Cruise Control is deployed on example.com on port 8080, navigating to http://example.com:8080/ shows the CCFE.
More details about this method of deployment is available on the CCFE GitHub page.
Standalone Method
In this method, CCFE is deployed within a webserver like Apache or Nginx that has support for serving static content from a given directory. Application servers can also be used if there is any need to discover the deployed Cruise Control hostnames and ports from an external configuration system.
To let the CCFE know all of the available Cruise Control server details, a configuration file should be made available at the URL /static/config.csv. At LinkedIn, we have built a custom webserver, using the Flask framework, that pulls the Cruise Control server and port details from our own internal cloud to auto-generate the configuration file.
Additionally, proper firewalls should be set up to ensure that the deployment is secure.
In the Standalone Method, CCFE can be deployed in these two architectures: Reverse Proxy Architecture or CORS Architecture.
Reverse Proxy Architecture
In this architecture, CCFE always communicates with the Reverse Proxy to access the REST API provided by Cruise Control. An architecture diagram for this setup looks like this:
CORS Architecture
Both Cruise Control and CCFE have built-in support for leveraging CORS to manage the Kafka clusters. In this setup, CCFE is served from a standalone webserver, and all the requests related to managing Kafka clusters are made directly to Cruise Control.
This architecture is recommended for environments where end-to-end encryption is present in the network and Cruise Control servers are protected from unwanted traffic by using a firewall. As of writing, SSL/TLS support for Cruise Control is a work in progress. For more details, see these authentication-support and ssl support tickets on GitHub.
Conclusion
Kafka usage at LinkedIn has grown at an unprecedented rate and so have the operational challenges. During the course of Kafka's evolution, LinkedIn has created and open sourced tools like Kafka Monitor, Kafka Tools, and Burrow that provide insight into Kafka clusters. Along the same lines, Cruise Control has been built and made open source, providing a unique system to automatically manage Kafka clusters.
Continuing the same trend, Cruise Control Frontend has been built to address challenges with running distributed teams and Kafka clusters. CCFE leverages the Cruise Control API to operate on Kafka clusters and act as a central dashboard for the entire Kafka ecosystem. Cruise Control Frontend is the go-to application in the Kafka ecosystem at LinkedIn, which we are very happy to make available to the open source community.
Acknowledgments
The Data Streaming Team at LinkedIn has been very helpful in designing and implementing both the Cruise Control and CCFE applications to address the large-scale operational challenges with Kafka deployments. This work is not possible without their help. We also would like to thank the open source community on GitHub for providing feedback to improve the Cruise Control project.