
Less is More: How Salla Optimized MySQL-to-ClickHouse CDC Pipeline with Timeplus

By Jove Zhong

Salla, Saudi Arabia's leading e-commerce platform provider, partnered with Timeplus to improve the efficiency of its CDC and ETL data pipeline, from MySQL to ClickHouse. By implementing Timeplus Enterprise, they significantly reduced the CPU and memory footprint of ClickHouse and expanded the scope of analytics to handle e-commerce data from thousands of MySQL tables.

Webinar: How Salla Manages Real-Time Deduplication and Denormalization with Timeplus. How did Salla reduce the overhead of their CDC pipelines to one-third using Timeplus, and which new Timeplus features did they leverage to efficiently scale their data processing? Check out our Timeplus Live webinar featuring our team and Ibrahim Bakhsh (Senior Cloud Data Manager at Salla). See Webinar Recording


 

About Salla


Founded in 2016, Salla has emerged as a cornerstone of Saudi Arabia's digital commerce revolution. The company provides a comprehensive e-commerce platform that enables businesses of all sizes to establish and manage their online presence. As a Software-as-a-Service (SaaS) provider, Salla serves over 60,000 merchants across the region, processing millions of transactions and generating massive amounts of operational data daily. 


 


The Challenge: Scaling Data Analytics for E-commerce


As Salla's platform grew, so did their data challenges. Their existing data pipelines replicated data from thousands of relational database tables to ClickHouse for analytics. As one would expect, these tables are organized in a typical star schema:


Example of a snowflake schema. Source link

A Change Data Capture (CDC) process creates a Kafka topic for each table, which in turn is read by a table in ClickHouse via the Kafka Table Engine. At least one materialized view then needs to be created in ClickHouse to extract the CDC JSON payload and write it into a landing table, and further tables are created for different analysis purposes.
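
For reference, a minimal sketch of this per-table pattern in ClickHouse could look like the following; the topic, table, and column names are illustrative, not Salla's actual schema.

```sql
-- Hypothetical per-table setup: a Kafka engine table consumes one CDC topic...
CREATE TABLE orders_kafka
(
    raw String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'cdc.shop.orders',
         kafka_group_name = 'clickhouse_orders',
         kafka_format = 'JSONAsString';

-- ...a landing table stores the extracted rows...
CREATE TABLE orders_landing
(
    id UInt64,
    status String,
    updated_at DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- ...and a materialized view extracts fields from the CDC JSON payload.
CREATE MATERIALIZED VIEW orders_mv TO orders_landing AS
SELECT
    JSONExtractUInt(raw, 'after', 'id')       AS id,
    JSONExtractString(raw, 'after', 'status') AS status,
    parseDateTimeBestEffort(JSONExtractString(raw, 'after', 'updated_at')) AS updated_at
FROM orders_kafka;
```

Multiplied across thousands of tables, each of these trios adds its own consumer thread and connection, which is exactly the overhead described below.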



The data engineering team at Salla has put great effort into setting up and maintaining this data infrastructure. While it works well in many ways and solves critical business problems, the team still faced several challenges:


  • CPU and memory requirements that grow with the number of tables. For each table, a thread is created to update a materialized view. Since further joins and enrichment steps happen there, thread and memory usage compound within ClickHouse as the number of tables increases.

  • Kafka-to-ClickHouse inefficiency. Instantiating a Kafka consumer per table consumes a lot of threads and puts pressure on Apache Kafka itself, since ClickHouse keeps thousands of connections open. Writing each Kafka message to ClickHouse individually is inefficient, and every message committed by ClickHouse adds further load on Kafka.


  • Performance of UPSERT. It’s natural for an OLTP database to contain multiple updates for records with the same primary key. Occasionally, the data engineering team may also need to replay the CRUD events in the CDC pipeline due to an outage or a reconciliation exercise. As a result, there is a considerable amount of duplicated data in the Kafka topics. ClickHouse, as a columnar database, is not designed for such frequent INSERT, UPDATE, or DELETE operations. Although ReplacingMergeTree is the recommended way to handle such workloads in ClickHouse, it is sub-optimal for Salla’s use case. For example, to get correct results from a ReplacingMergeTree table, the FINAL keyword must be added to the SELECT query; this slows the query down, and when the query also needs to JOIN data from other tables, it significantly degrades the end-to-end query experience (see the sketch after this list).


  • Lagging performance of complex views via JOINs. Some business requirements need to join data across 5-30 fact/dimension tables in ClickHouse, which is inefficient when done on demand by user applications, and some of the dimension tables contain tens of millions or even billions of rows. The ability to incrementally maintain complex materialized views across many tables would alleviate these issues and reduce the need for end users to hammer ClickHouse with complex, heavyweight queries. This is difficult to do performantly in ClickHouse across many tables.


  • Serving multiple downstream consumers. Streaming changes also need to be written to other downstream systems, such as search engines, or delivered via API calls to third-party systems.
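
To make the UPSERT problem concrete, here is a minimal ClickHouse sketch of the ReplacingMergeTree pattern, assuming the same hypothetical orders table as above.

```sql
-- ReplacingMergeTree keeps the latest version per sorting key, but only
-- deduplicates at merge time, which happens at an unspecified point.
CREATE TABLE orders_dedup
(
    id UInt64,
    status String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

-- Without FINAL, this query may return several versions of the same id.
-- With FINAL it is correct, but noticeably slower, and the penalty grows
-- when the query also JOINs other tables.
SELECT id, status
FROM orders_dedup FINAL
WHERE status = 'paid';
```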


 

The Solution: A Streamlined Data Pipeline with Timeplus


Salla chose Timeplus Enterprise to modernize their data infrastructure by adding Timeplus between Kafka and ClickHouse.




A 3-node Timeplus Enterprise cluster was set up, with auto-load-balancing for:

  • Reading and processing Kafka messages

  • JSON transformation and normalization

  • Efficient loading into ClickHouse

  • Managing materialized views

  • Downstream data sinks to third-party systems such as search engines and APIs


Timeplus reads data from Kafka efficiently and commits consumed offsets efficiently for downstream use within Timeplus. It also pools Kafka consumer objects per Timeplus instance to minimize the number of connections open to Kafka, reducing both the load on Kafka and the latency of reads from it.
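
As a rough sketch, a Kafka External Stream in Timeplus can be declared with SQL like the following; the broker address and topic name are illustrative.

```sql
-- Hypothetical Kafka External Stream. One definition can feed many
-- materialized views, while Timeplus pools the underlying Kafka
-- consumers instead of opening one connection per downstream reader.
CREATE EXTERNAL STREAM orders_cdc
(
    raw string
)
SETTINGS type = 'kafka',
         brokers = 'kafka:9092',
         topic = 'cdc.shop.orders';
```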


Timeplus writes data to ClickHouse by multiplexing writes over a fixed number of ClickHouse threads. The data is batched according to arrival rate or configured time intervals, so a smaller number of requests is sent to ClickHouse, reducing the load even further.
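
A sketch of this write path, assuming the hypothetical orders_cdc stream above and an illustrative ClickHouse database named analytics:

```sql
-- Hypothetical ClickHouse External Table in Timeplus. Writes through it
-- are multiplexed and batched, so thousands of small inserts become a
-- much smaller number of larger ones.
CREATE EXTERNAL TABLE ch_orders
SETTINGS type = 'clickhouse',
         address = 'clickhouse:9000',
         database = 'analytics',
         table = 'orders_landing';

-- A long-running materialized view extracts the CDC payload once and
-- streams the result into ClickHouse in efficient batches.
CREATE MATERIALIZED VIEW mv_orders_to_ch INTO ch_orders AS
SELECT
    raw:after.id::uint64 AS id,
    raw:after.status     AS status
FROM orders_cdc;
```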

 

Timeplus' Materialized Views are based on Streaming SQL and run continuously, updating whenever data in any of the streams referenced by the SQL is inserted or updated. They support powerful and flexible multidimensional JOINs across as many as 30 tables and write the results to ClickHouse. An innovation here is a Remote Dictionary interface that looks up reference data directly from the relational databases, rather than replicating it into Timeplus and then performing the join (as typical stream processing patterns do). This avoids the large amounts of memory and disk space otherwise required to keep the JOIN state for very high-cardinality joins. For certain cases, such reference data can be cached to increase performance. For very high-throughput cases, a local copy of the data can be kept up to date from Kafka using Timeplus' Mutable Streams, which can also be used in these Materialized Views, which in turn emit their changes to ClickHouse in efficient batches.
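
The Mutable Streams pattern might look roughly like this, with a hypothetical product dimension, a hypothetical orders_stream event feed, and a hypothetical wide target table ch_orders_wide:

```sql
-- Hypothetical Mutable Stream holding the latest version of each
-- dimension row, keyed and upserted by its primary key.
CREATE MUTABLE STREAM dim_products
(
    product_id uint64,
    name       string,
    category   string
)
PRIMARY KEY (product_id);

-- A streaming materialized view enriches order events with product
-- attributes and emits denormalized rows to ClickHouse in batches.
CREATE MATERIALIZED VIEW mv_orders_enriched INTO ch_orders_wide AS
SELECT
    o.id,
    o.product_id,
    p.name,
    p.category
FROM orders_stream AS o
JOIN dim_products AS p ON o.product_id = p.product_id;
```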


Timeplus' Materialized Views can thus be used by the central data team to create far more complex, incrementally updated pre-aggregated views than ClickHouse materialized views allow. These pre-computed views in ClickHouse can be queried very fast compared to ad-hoc versions of the same queries.


Additional advanced features were used in this project, such as JavaScript UDFs for highly customizable data processing logic, and hybrid hash tables that avoid loading all dimensional data into memory for JOINs.
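
A JavaScript UDF in Timeplus is declared inline with SQL. This minimal sketch assumes a phone-number normalization step; the function name and logic are invented for illustration.

```sql
-- Hypothetical JavaScript UDF. Timeplus passes each argument as an
-- array of values and expects an array of results of the declared
-- return type, so the function processes a batch per call.
CREATE OR REPLACE FUNCTION normalize_phone(value string)
RETURNS string
LANGUAGE JAVASCRIPT AS $$
    function normalize_phone(values) {
        // Strip everything except digits from each input value.
        return values.map(v => String(v).replace(/\D/g, ''));
    }
$$;
```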


 

Impact and Results


  1. Reduced ClickHouse Load

    By moving the materialized views from ClickHouse to Timeplus Enterprise, which supports powerful streaming JOINs, lookups, and caching, ClickHouse now has a cleaner, dedicated role: serving user aggregation queries against the denormalized wide tables. This not only frees up CPU and memory on expensive hardware, but also allows the team to further expand the data coverage of the MySQL-to-ClickHouse replication.


  2. Reduced Kafka Load and Processing Lag

    Previously, thousands of Kafka clients connected to Kafka increased its load, while small commits reduced processing throughput and delayed the final data arriving in ClickHouse. Timeplus' pooled consumers and batched commits removed both sources of overhead.


  3. Minimize Data Latency

    Timeplus is a streaming-first data platform. Enhancements in Kafka External Stream pooling and commit strategies, ClickHouse client batching and pooling, and Materialized View JOIN strategies created a much smoother data pipeline, reducing data processing lag from minutes to seconds.


  4. Enforced Data Quality and Performance

    Previously, end users' ad-hoc queries with many complex, non-reusable JOINs put heavy load on ClickHouse and led to data consistency issues. Now, pre-aggregated views created in the enhanced stream processing layer keep the data consistently updated in reusable materialized views that end users can read quickly with simpler queries.



“Timeplus enabled us to efficiently scale our data processing with minimal latency and maximum throughput, which lets us reduce bottlenecks and enhance service resilience.”





Salah Alkhwlani | CTO, Salla



“We constantly seek ways to accelerate analytics and reporting. However, handling multi-stream table JOINs on the fly is not ClickHouse’s strength. This is where Timeplus' MULTI-JOIN on stream acts as a crucial solution, minimizing duplicate processing overhead.”





Ibrahim Bakhsh | Senior Cloud Data Engineer, Salla


 

Looking Ahead


With Timeplus Enterprise as their data pipeline foundation, Salla is well-positioned to:

  • Scale the Timeplus cluster out or in based on processing demand, handling more and more operations without added latency or delayed messages

  • Scale the current pipeline to serve a higher number of threads/tables. 

  • Move all processing-layer materialized views into Timeplus Enterprise within nine months.

  • Further consolidate the ETL processing and data orchestration with Timeplus

  • Move workloads from other solutions (Airflow, dbt) to Timeplus

  • Continue optimizing data infrastructure cost and efficiency




 

Do you want to take your ClickHouse analytics to the next level like the Salla team? Try Timeplus Enterprise free for 30 days: timeplus.com/download.
