
Query and Process Pulsar data with SQL: Introducing Pulsar External Stream

Congratulations to the Apache Pulsar community on their recent 4.0 LTS (Long-Term-Support) release! At Timeplus, we’re thrilled to announce that in Timeplus Enterprise v2.5, a new type of External Stream is implemented in our C++ core timeplusd engine. With a single binary, developers can read millions of events from Pulsar each second with streaming SQL, or write data to Pulsar topics in JSON, CSV, Protobuf, or Avro format.


Apache Pulsar is a multi-tenant, high-performance solution for server-to-server messaging. Originally developed by Yahoo, Pulsar is under the stewardship of the Apache Software Foundation.


Both Apache Kafka and Apache Pulsar are popular event streaming platforms. Compared to Apache Kafka, Pulsar's key differences are:

  • Native support for multiple clusters in a Pulsar instance, with seamless geo-replication of messages across clusters.

  • Very low end-to-end latency.

  • Seamless scalability to over a million topics.

  • Multiple subscription types (exclusive, shared, and failover) for topics.

  • Guaranteed message delivery with persistent message storage provided by Apache BookKeeper.

  • Tiered Storage offloads data from hot/warm storage to cold/long-term storage (such as S3 and GCS) as data ages out.

  • Queuing semantics that go beyond a simple append-only log.


Over two years ago, Timeplus added an integration for Apache Pulsar. We built a rich UI for users to easily set up connections to Apache Pulsar or StreamNative Cloud, preview messages and load them into Timeplus streams, then run streaming SQL. This integration was built on Redpanda Connect, formerly known as the Benthos framework.


In Timeplus Enterprise v2.5, the new Pulsar External Stream is available in the core engine, without any dependencies. By leveraging the Pulsar C++ Client, the Timeplus core engine (timeplusd) can interact with the Pulsar server at high performance and low latency. External Streams also have the added benefit that Timeplus does not copy the full stream: it only brings in the data needed by your transformations and aggregations, saving both latency and storage costs.



Why Timeplus + Pulsar?


For data teams who are using Apache Pulsar or StreamNative Cloud, choosing Timeplus Enterprise makes exploring real-time data in Pulsar easy and intuitive:

  • Rich web UI. No need to write Java code to connect to the Pulsar clusters or use command line interface to preview the raw messages. Your data team can use the Timeplus web UI to quickly load data from Pulsar to Timeplus, then use the purpose-built streaming tables and charts in Timeplus to view live data and understand their patterns.

  • Powerful streaming analytics with SQL. Apache Flink is a powerful and flexible tool; however, integrating Apache Pulsar with Apache Flink when building real-time analytics applications is often not easy. With Timeplus, a similar analytics workload can be implemented in a tenth of the time and cost. For example, a new hopping-window analytics app in Flink SQL may take 1-2 minutes while the Flink job is submitted and waits to return results, and writing and uploading Java libraries adds further overhead for managing dependencies and patching vulnerabilities. In the Timeplus web UI, you can write ad-hoc streaming SQL and get results within a second, for a seamless, fast experience with no extra code packages to upload and manage against the running cluster.

  • Full SQL-based routing/transformation/alerting. With Pulsar as a natively supported external stream type in Timeplus, your data engineering team can use Timeplus to transform data among Pulsar topics, or migrate data from one message bus to another. Compared to pfSQL, a new StreamNative Cloud feature in public preview, streaming SQL in Timeplus supports multi-way JOINs, UDFs, high-cardinality GROUP BY, and many other features.

  • Queryable Materialized Views. Any intermediate or final transformations done within Timeplus produce queryable materialized views that your applications can query like normal database/data warehouse tables.
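As an illustrative sketch of SQL-based routing, a materialized view can continuously read from one Pulsar external stream, transform the data, and write the result into another. The stream names (`pulsar_raw_logs`, `pulsar_error_logs`) and columns below are hypothetical:

```sql
-- Hypothetical sketch: route and transform data between two Pulsar topics.
-- `pulsar_raw_logs` and `pulsar_error_logs` are assumed to be existing
-- Pulsar external streams; column names are illustrative.
CREATE MATERIALIZED VIEW mv_route_errors
INTO pulsar_error_logs
AS
SELECT
    _tp_time,
    upper(level) AS level,
    message
FROM pulsar_raw_logs
WHERE level = 'error';
```

The same pattern without the INTO clause keeps the results inside Timeplus, where they remain queryable like a regular table.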


Additional use cases for the Timeplus and Pulsar integration:

  • Internet-of-Things (IoT): The explosive growth of connected remote devices is posing challenges for the centralized computing paradigm. Pulsar is cloud-native and can run in any cloud, on-premises, or Kubernetes environment. It supports multiple messaging protocols, including MQTT, AMQP, and JMS. By connecting Timeplus with Pulsar, real-time data from IoT sensors can be processed with sub-second latency, giving internal and external users real-time visibility to improve operational efficiency or customer satisfaction.

  • Application monitoring and observability. With Pulsar’s built-in multi-tenant support, tier-storage, and the separation of computing and storage, it’s one of the most common tools to consolidate logs/metrics/tracing from various IT systems and applications. Timeplus provides a unified storage and processing engine for both historical and real-time data. With our high performance, low latency, and purpose-built streaming charts and alerts, your IT infrastructure team can gain real-time observability with much lower complexity and cost-of-ownership.
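For instance, per-sensor aggregations over Pulsar data can be expressed with a hopping window in streaming SQL. This is a sketch; the `sensor_readings` external stream and its columns are hypothetical:

```sql
-- Hypothetical sketch: 1-minute averages per sensor, refreshed every
-- 10 seconds, computed directly over a Pulsar external stream
-- named `sensor_readings`.
SELECT
    window_start,
    sensor_id,
    avg(temperature) AS avg_temp,
    max(temperature) AS max_temp
FROM hop(sensor_readings, 10s, 1m)
GROUP BY window_start, sensor_id;
```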


Check out the short demo video for the new Pulsar integration in Timeplus:




For more technical details, please check our documentation: https://docs.timeplus.com/pulsar-external-stream 


Here’s how you can load your Pulsar data into Timeplus:


  1. Select "Apache Pulsar" on the Data Collection page.

  2. Enter the service URL, API key, and optionally enable any additional settings.

  3. Choose your Pulsar topic and data format.

  4. Preview your data and confirm the schema.

  5. Enter a name for the external stream and review the SQL that will create it.

This UI wizard will help you create a Pulsar external stream with SQL DDL in the following syntax. You can also customize the parameters in the SQL Console.

CREATE EXTERNAL STREAM [IF NOT EXISTS] stream_name
    (<col_name1> <col_type>)
SETTINGS
    type='pulsar', -- required
    service_url='pulsar://host:port', -- required
    topic='..', -- required
    jwt='..',
    data_format='..',
    format_schema='..',
    one_message_per_row=..,
    skip_server_cert_check=..,
    validate_hostname=..,
    ca_cert='..',
    client_cert='..',
    client_key='..',
    connections_per_broker=..,
    memory_limit=..,
    io_threads=..
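
For example, a minimal external stream over a JSON topic might look like the following. The service URL, topic, and column names are illustrative, and `data_format='JSONEachRow'` assumes one JSON document per message:

```sql
-- Minimal sketch of a Pulsar external stream over a JSON topic.
-- Service URL, topic, and columns are hypothetical.
CREATE EXTERNAL STREAM IF NOT EXISTS pulsar_orders
(
    order_id string,
    amount float64
)
SETTINGS
    type='pulsar',
    service_url='pulsar://localhost:6650',
    topic='persistent://public/default/orders',
    data_format='JSONEachRow';
```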

Once the external stream is created, you can query it in the SQL Console, or via our REST API or SDKs:


You can also use Timeplus to scan the existing messages in a Pulsar topic; the throughput depends on the network connectivity between the Timeplus and Pulsar servers, the number of partitions, and other factors. This can be helpful for “backtesting” while building trading algorithms. For example, with a single partition in StreamNative Cloud, my laptop can read tens of thousands of messages per second.

select count() from pulsar_ext_stream where _tp_time>earliest_ts() EMIT PERIODIC 5s

┌─count()─┐
│   57900 │
└─────────┘
┌─count()─┐
│  114200 │
└─────────┘
↖ Progress: 155.40 thousand rows, 1.24 MB (10.79 thousand rows/s., 86.29 KB/s.) 

 

Want to learn how Timeplus can help you get powerful real-time insights with minimal set-up time? Try Timeplus Enterprise free for 30 days.


Join our Timeplus Community! Connect with other users or get support in our Slack community.
