Streaming Data Integration for Hadoop

With Striim’s streaming data integration for Hadoop, you can easily feed your Hadoop and NoSQL solutions continuously with real-time, pre-processed data from enterprise databases, log files, messaging systems, and sensors to support operational intelligence.

Ingest Real-time, Pre-Processed Data for Operational Intelligence

Striim is a software product that continuously moves real-time data from a wide range of sources into Hadoop, Kafka, relational and NoSQL databases — on-prem or in the cloud — with in-line transformation and enrichment capabilities. Brought to you by the core team behind GoldenGate Software, Striim offers a non-intrusive, quick-to-deploy solution for streaming integration so your Hadoop solution can support a broader set of operational use cases.

With the following capabilities, Striim’s streaming data integration for Hadoop enables a smart data architecture that supports use-case-driven analytics in enterprise data lakes:

  • Ingests large volumes of real-time data from databases, log files, message systems, and sensors
  • Collects change data non-intrusively from enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, MariaDB, Amazon RDS
  • Delivers data in milliseconds to Hadoop (HDFS, HBase, Hive, Kudu), Kafka, Cassandra, MongoDB, relational databases, cloud environments, and other targets
  • Supports mission-critical environments with end-to-end security, reliability, HA, and scalability

Benefits

  • Uses low-latency data for operational use cases
  • Accelerates time to insight with a continuous flow of transformed data
  • Ensures scalability, security, and reliability for business-critical solutions
  • Achieves fast time-to-market with wizards-based UI and SQL-based language

Key Features

  • Enterprise-grade and fast-to-deploy streaming integration for Hadoop
  • Real-time integration of structured and unstructured data
  • In-flight filtering, aggregation, transformation, and enrichment
  • Continuous ingestion and processing, at scale
  • Integration with existing technologies and open source solutions

Striim enables businesses to get the maximum value from high-velocity, high-volume data by delivering it to Hadoop environments in real-time and in the right format for operational use cases.

Real-time, Low-impact Change Data Capture

Striim ingests real-time data from transactional databases, log files, message queues, and sensors. For enterprise databases, including Oracle, Microsoft SQL Server, MySQL, and HPE NonStop, Striim offers a non-intrusive change data capture (CDC) feature to ensure real-time data integration has minimal impact on source systems and to optimize network utilization by moving only the changed data.
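
As a rough illustration, a CDC source in Striim is declared as a component in a data flow. The sketch below assumes Striim's OracleReader adapter and uses illustrative connection properties; the exact property names and values depend on your environment and Striim version.

-- Minimal sketch: capture Oracle change data into a stream.
-- The adapter properties and connection details below are illustrative placeholders.
CREATE SOURCE OrdersCDC USING OracleReader (
  Username: 'striim_user',
  Password: '********',
  ConnectionURL: 'localhost:1521:ORCL',
  Tables: 'SALES.ORDERS'
)
OUTPUT TO OrdersChangeStream;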

In-Flight Data Processing

As data volumes continue to grow, the ability to filter and aggregate data before analytics becomes a key way to manage limited storage resources. Striim enables in-flight data filtering and aggregation before delivery to Hadoop, reducing the data storage footprint. By performing in-line transformation (such as denormalization) and enrichment with static or dynamically changing data in memory, Striim feeds large data volumes in the right format without introducing latency.
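
For example, an in-flight processing step in Striim is written as a SQL-based continuous query. The following sketch filters low-value events and enriches the rest by joining against a reference cache before delivery; the stream, cache, and field names are illustrative assumptions rather than a prescribed schema.

-- Minimal sketch: filter and enrich events in flight before delivery to Hadoop.
-- OrderStream, ProductCache, and the field names are illustrative placeholders.
CREATE CQ FilterAndEnrichOrders
INSERT INTO EnrichedOrderStream
SELECT o.orderId, o.productId, o.amount, p.productName, p.category
FROM OrderStream o
JOIN ProductCache p ON o.productId = p.productId
WHERE o.amount > 100;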

Enterprise-grade Solution

Striim is designed to meet the needs of mission-critical environments with end-to-end security, reliability (including out-of-the-box exactly-once processing), high performance, and scalability. Users can focus on the application logic, knowing that from ingestion to alerting and delivery, the platform is built to support the business as required.

Fast Time to Market

An intuitive development experience with a drag-and-drop UI, along with prebuilt data flows from popular sources to multiple Hadoop targets, allows fast deployment. Striim uses a SQL-based language, so no special skills are required to develop or modify streaming applications.

Operationalizing Machine Learning

Striim can pre-process and extract features suitable for machine learning before continually delivering training files to Hadoop. Once data scientists build their models using Hadoop technologies, these can be brought into Striim, using the new open processor component, so real-time insights can guide operational decision making and truly transform the business. Striim can also monitor model fitness and trigger retraining of models for full automation.

Differences from ETL

Compared to traditional ETL offerings that use bulk data extracts, Striim enables continuous ingestion of structured, semi-structured, and unstructured data in real time delivering granular data flow for richer analytics. By performing in-memory transformations on data-in-motion using SQL-based continuous queries, Striim avoids adding latency and enables real-time delivery. While ETL solutions are optimized for database sources and targets, Striim provides native integration and optimized delivery for Hadoop, Kafka, databases, and files, on-prem or in the cloud. Striim also offers stream analytics and data visualization capabilities within the same platform, without requiring additional licenses.

To learn more about streaming data integration for Hadoop, visit our Hadoop and NoSQL Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started!

Move Real-Time Data to Cloudera Using the Striim Platform

In this blog post, we’re going to take a look at how you can use the Striim platform to move real-time data to Cloudera from a variety of sources.

The Striim platform provides an enterprise-grade streaming integration solution for moving real-time change data from a wide variety of sources to Cloudera distributions of Apache Kafka, Apache Kudu, and Apache Hadoop, without impacting source systems. With support for hybrid IT infrastructures, Striim complements Cloudera solutions by enabling organizations to use full breadth and depth of their data in real time in order to gain a complete and up-to-date view into their operations.

Benefits

  • Ingest real-time data into CDK (Kafka), Kudu, Hadoop with low impact
  • Continuously collect data from databases, logs, messaging, sensors, and more
  • Process data in-flight without extensive coding
  • Get immediate insights and alerts
  • Use low-latency data in Cloudera for operational decision making

Why Striim?

  • Real-time data integration from a wide variety of data sources
  • Designed for high-volume, high-velocity data
  • Non-intrusive CDC from databases with event guarantees
  • Built-in security, scalability, and reliability
  • In-flight enrichment via built-in cache
  • Quick to deploy via SQL-like queries and wizards-based UI

Non-intrusive, Real-time Data Ingestion

The Striim platform continuously ingests real-time data from a variety of sources out-of-the-box – including databases, cloud applications, files, message queues, and devices – on-premises or in the cloud. For enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, and MariaDB, the platform offers non-intrusive change data capture (CDC) to minimize the impact on source systems. Striim supports major data formats, including JSON, XML, Avro, delimited, binary, free text, and change records.

With a drag-and-drop UI and wizards, Striim simplifies creating data flows from popular sources to move data to Cloudera solutions including CDK (Kafka), Hadoop, HBase, Hive, and Kudu. The data can be delivered “as-is,” or be put through a series of in-flight transformations and enrichments. By using real-time, pre-processed data – especially in Kudu, Impala, and Kafka – customers can rapidly gain timely, operational intelligence from their Cloudera applications.

Delivery to Cloudera, On-premises or Cloud

The Striim platform can continuously apply pre-processed, streaming data to Cloudera solutions with sub-second latency. With parallelization capabilities, Striim offers optimized loading to Cloudera solutions. Striim can also deliver real-time data to other targets such as databases and files.

Built-in Stream Processing and Monitoring

Through SQL-based continuous queries, the Striim platform filters, aggregates, transforms, joins, and enriches multiple streams of real-time data in-memory to rapidly prepare the data for different downstream users before delivering to Cloudera environments. 

Striim also comes with built-in validation and monitoring capabilities. The platform enables users to continuously monitor the health of the data pipelines via real-time dashboards and alerts.

Enterprise-grade Modern Streaming Integration

Striim is designed from the ground up to support high-volume, high-velocity data, with built-in validation, security, high availability, reliability, and scalability to support mission-critical applications.

Unlike traditional ETL solutions, Striim continuously ingests granular and larger data sets for richer analytics. It does so without impacting source systems, and processes the data in-memory, while it is streaming, to enable sub-second latency. Striim also differs from traditional logical replication tools with its optimized support for a wide range of data types, data sources, and targets, and its out-of-the-box comprehensive stream processing capabilities.

To learn more about how you can utilize the Striim platform to move data to Cloudera, please reach out to schedule a demo with a Striim expert or download the platform and try it for yourself.

Moving Real-Time Data to the Google Cloud Platform

Transcript:

So, adopting the Google Cloud Platform is important to your business. Why are real-time data movement, change data capture, and stream processing necessary parts of this process? You've already decided that you want to adopt the Google Cloud Platform. This could be Google BigQuery, Pub/Sub, Cloud SQL, Dataproc, or any number of other technologies. You may want to migrate existing applications to the cloud, scale elastically as necessary, or drive analytics and machine learning. But running applications in the cloud as VMs or containers is only part of the problem. You also need to consider how to move data to the cloud, keep your applications or analytics always up to date, and make sure the data is in the right format to be valuable. The most important starting point is ensuring you can stream data to the cloud in real time. Batch data movement can cause unpredictable load on the cloud targets and has high latency.

Speaker 2: 00:59 The data is often hours old, but for modern applications, having up-to-the-second information is essential, for example, to provide current customer information, accurate business reporting, or real-time decision making. Streaming data from on-premises to the Google Cloud Platform requires making use of appropriate data collection technologies. For example, change data capture, or CDC, works by continuously intercepting database activity and collecting all the inserts, updates, and deletes as events, as they happen. Log data requires file tailing, which reads from the end of one or more files, potentially across multiple machines, and streams the latest records as they are written. Other sources like IoT data or third-party SaaS applications also require specific treatment. Once you know that your data can stream in real time, the next consideration is what processing is necessary to make the data valuable for your specific Google Cloud destination.

Speaker 2: 01:57 And this depends on the use case. For database migration and elastic scalability use cases, where the target may be similar to the source, moving raw change data from on-premises databases to Google Cloud SQL may be sufficient. However, for real-time applications sourcing from Google Pub/Sub, or analytics use cases built on Google BigQuery or Dataproc, it may be necessary to perform stream processing before the data is delivered to the cloud. This processing can transform the data structure and enrich it with additional context information while the data is in flight, adding value to the data and optimizing the downstream analytics. As a streaming integration platform, Striim continuously collects data from on-premises or private cloud databases and delivers it to Google Cloud endpoints. Striim can take care of the initial load as well as CDC for the continuous application of change, and these data flows can be created rapidly and monitored and validated continuously through an intuitive UI, so your cloud migration, scaling, and analytics can be built at the speed of your business, and your data is always where it is needed, when it is needed.

 

What Is Streaming Data Integration?

Streaming data integration is a fundamental component of any modern data architecture. Increasingly, companies need to make data-driven decisions – regardless of where data resides, when it matters most – immediately. Streaming data integration is one of the first steps in being able to leverage the next-generation infrastructures such as Cloud, Big Data, real-time applications, and IoT that underlie these decisions.

In this post, we’re going to take a look at how the Striim platform was built from the ground up for streaming data integration, and how organizations are benefitting from it. Striim enables businesses to move to Cloud, easily build real-time applications, and get more value from Hadoop solutions.

Striim is patented, enterprise-grade software for streaming data integration, which offers continuous data collection, stream processing, pipeline monitoring, and real-time delivery with verification across heterogeneous systems. Striim provides up-to-date data in a consumable form in Kafka, Hadoop, and databases — on-prem or in the Cloud — to support operational intelligence and other high-value workloads.

Core Platform Capabilities

  • Continuous, Structured, and Unstructured Data Collection: Striim captures real-time data from a wide variety of sources including databases (using low-impact change data capture), cloud applications, log files, IoT devices, and message queues.
  • SQL-based Stream Processing: Striim applies filtering, transformations, aggregations, masking, and enrichment using static or streaming reference data.
  • Pipeline Monitoring and Alerting: Striim allows users to visualize the data flow and the content of data in real time, and offers delivery validation.
  • Real-Time Delivery: Striim distributes real-time data in a consumable form to all major targets including Cloud environments, Kafka and other messaging systems, Hadoop, relational and NoSQL databases, and flat files.
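
Tying these capabilities together, a Striim application is a data flow of sources, continuous queries, and targets. The following minimal sketch, modeled on the adapters that appear later in this document (KafkaReader, JSONParser, KafkaWriter, JSONFormatter), reads JSON events from one Kafka topic, filters them with a SQL-based continuous query, and writes the results to another topic; the broker address, topics, event type, and field names are illustrative placeholders.

-- Minimal end-to-end sketch: collect, process, and deliver in one data flow.
-- Broker address, topics, the event type, and the status field are illustrative placeholders.
CREATE SOURCE PageViewSource USING KafkaReader (
  brokerAddress: 'localhost:9092',
  Topic: 'pageviews',
  startOffset: '0'
)
PARSE USING JSONParser (
  eventType: 'admin.PageViewType'
)
OUTPUT TO PageViewStream;

CREATE CQ FilterServerErrors
INSERT INTO ErrorStream
SELECT * FROM PageViewStream p
WHERE p.status >= 500;

CREATE TARGET ErrorTopicWriter USING KafkaWriter VERSION '0.10.0' (
  brokerAddress: 'localhost:9092',
  Topic: 'pageview_errors'
)
FORMAT USING JSONFormatter (
  EventsAsArrayOfJsonObjects: false
)
INPUT FROM ErrorStream;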

Key Platform Differentiators

  • Streaming data integration with intelligence via an in-memory platform
  • Real-time data movement across on-prem and cloud environments
  • Low-impact CDC for Oracle, SQL Server, HPE NonStop, and MySQL
  • In-flight filtering, aggregation, transformation, and enrichment using SQL
  • Quick-to-deploy and easy-to-integrate via drag-and-drop UI
  • Continuous data pipeline monitoring and built-in delivery validation
  • Integration with existing technologies and open source solutions

Common Use Cases

Here are just a few of the most common ways Striim customers leverage its patented software to solve critical enterprise challenges:

Hybrid Cloud Integration

Striim eases cloud adoption by continuously moving real-time data from on-premises and cloud sources to Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform environments. Many Striim customers use pre-built data pipelines to feed their cloud solutions from their on-premises databases, files, messaging systems, and sensors to enable operational workloads in the cloud. By filtering, aggregating, transforming, and enriching the data-in-motion before delivering to the cloud, Striim delivers real-time data in consumable form and helps to optimize cloud storage. Available on-premises or in the cloud, Striim enables businesses to get up and running in a matter of minutes.

Data Integration for Real-Time Applications

Striim enables real-time applications on event-based messaging systems such as Kafka, fast analytics storage solutions such as Kudu, and NoSQL databases such as Cassandra by continuously feeding pre-processed data in real time. Striim offers a wizard-based UI and SQL-based language for easy and fast development. Also, when needed Striim performs SQL-based streaming analytics and visualizes the streaming data, before delivering the data to the target to provide real-time operational intelligence.

Real-Time Integration and Pre-Processing for Hadoop

Striim enables a modern, smart data architecture for data lakes by non-intrusively and continuously collecting real-time data from databases, logs, messaging systems, and sensors, and pre-processing the data-in-motion for operational reporting and analytics. To accelerate insights and optimize storage, Striim filters, masks, aggregates, transforms, and enriches the data before delivering with sub-second latency to HDFS, HBase, and Hive. Striim can also pre-process and extract features suitable for machine learning before continually delivering training files to Hadoop. Models built using Hadoop technologies can be brought into Striim, so real-time insights can guide operational decision making and truly transform the business. Striim can also monitor model fitness and trigger retraining of models for full automation.

To learn more about our streaming data integration capabilities, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started!

Real-Time Data Warehousing with Azure SQL Data Warehouse and Striim

[This post was originally published by Ellis Butterfield, Program Manager for Azure SQL Data Warehouse, on the Microsoft Azure blog. For more information about Azure SQL Data Warehouse, please visit https://azure.microsoft.com/en-us/services/sql-data-warehouse/.]

Gaining insights rapidly from data is critical to competitiveness in today’s business world. Azure SQL Data Warehouse (SQL DW), Microsoft’s fully managed analytics platform, leverages Massively Parallel Processing (MPP) to run complex interactive SQL queries at every level of scale.

Users today expect data within minutes, a departure from traditional analytics systems, which used to operate on data latency of a single day or more. With the requirement for faster data, users need ways of moving data from source systems into their analytical stores in a simple, quick, and transparent fashion. In order to deliver on modern analytics strategies, users must be acting on current information. This means that users must enable the continuous movement of enterprise data, from on-premises to cloud and everything in between.

The SQL Data Warehouse team is happy to announce that Striim now fully supports SQL Data Warehouse as a target for Striim for Azure. Striim enables continuous, non-intrusive, performant ingestion of all your enterprise data from a variety of sources in real time. This means that users can use intelligent pipelines for change data capture from sources such as Oracle Exadata straight into SQL Data Warehouse. Striim can also be used to move fast-moving data landing in your data lake into SQL Data Warehouse, with advanced functionality such as on-the-fly transformation and model-based scoring with Azure Databricks.

“Enterprises adopting cloud-based analytics need to ensure reliable, real-time and continuous data delivery from on-prem and cloud-based data sources to reduce decision latencies inherent in batch based analytics. Striim’s solution for SQL Data Warehouse is offered in the Azure marketplace, and can help our customers quickly ingest, transform, and mask real time data from transactional systems or Kafka into SQL Data Warehouse to support both operational and analytics workloads”.

– Alok Pareek, Founder and EVP of Products for Striim

 

By performing in-line transformations, including denormalization, before delivering to Azure SQL Data Warehouse, Striim reduces both the on-premises ETL workload and data latency. Striim enables fast data loading to Azure SQL DW through optimized interfaces such as streaming (JDBC) or batching (PolyBase). Azure customers can store the data in the right format and provide full context for any downstream operations, such as reporting and analytical applications.
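
Conceptually, the delivery side of such a pipeline is just a target component in the data flow. The sketch below assumes Striim's generic DatabaseWriter adapter and an illustrative JDBC connection string for Azure SQL Data Warehouse; the property names, credentials, and table mapping are placeholders rather than an exact configuration.

-- Minimal sketch: deliver a processed stream into Azure SQL Data Warehouse over JDBC.
-- The connection string, credentials, and table mapping are illustrative placeholders.
CREATE TARGET AzureSQLDWTarget USING DatabaseWriter (
  ConnectionURL: 'jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydw',
  Username: 'striim_user',
  Password: '********',
  Tables: 'SRC.ORDERS,dbo.ORDERS'
)
INPUT FROM EnrichedOrderStream;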

Next steps

To learn more about how you can build a modern data warehouse using Azure SQL Data Warehouse and Striim, watch this video, schedule a demo with a Striim technologist, or get started now on the Azure Marketplace.

Learn more about SQL DW and stay up-to-date with the latest news by following @AzureSQLDW on Twitter.

Striim TQL vs. KSQL: An Analysis of Streaming SQL Engines

The following blog outlines some benchmarks on streaming SQL engines that we cited in our recent paper, Real-time ETL in Striim, presented at VLDB in Rio de Janeiro in August 2018.

BIRTE ’18
Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics
Article No. 3

In the past couple of years, Apache Kafka has proven itself as a fast, scalable, fault-tolerant messaging system, and has been chosen by many leading organizations as the standard for moving data around in a reliable way. Once data has landed into Kafka, enterprises want to derive value out of that data. This fueled the need to support a declarative way to access, manage and manipulate the data residing in Kafka. Striim introduced its streaming SQL engine, TQL (Tungsten Query Language), in 2014 for data engineers and business analysts to write SQL-style declarative queries over streaming data including data in Kafka topics. Recently, KSQL was announced as an open source, streaming SQL engine that enables real-time data processing against Apache Kafka.

In this blog post, we will attempt a competitive analysis of these streaming SQL engines – Striim's TQL Engine vs. KSQL – based on two dimensions: (a) usability and (b) performance. We will compare and contrast the approaches taken by both platforms, and we will use two workloads to test the performance of both engines:

  • Workload A: We use the ever-popular TPCH data engineering benchmark and a representative query (with modifications for streaming).
  • Workload B: We use a clickstream-analysis workload from KSQL's GitHub page and a query file that is also part of KSQL's sample query set.

Usability

In this section, we will spend some time discussing how the two platforms differ in terms of basic constructs and capabilities. In every streaming compute/analytics platform, the following constructs are core to developing applications:

  • Streams: A stream is an unbounded sequence or series of data events.
  • Windows: A window is used to bound a stream by time or count.
  • Continuous Queries: Continuously running SQL-like queries that filter, enrich, aggregate, join, and transform the data.
  • Caches: A cache is a set of historical or reference data used to enrich streaming data.
  • Tables: A table is a view of the events (rows) in a stream. Rows in a table are mutable, which means that existing rows can be updated or deleted.
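
To make these constructs concrete, here is a minimal TQL-style sketch that declares a typed stream, bounds it with a time window, and aggregates it with a continuous query; the type, names, and window size are illustrative and the syntax is abbreviated.

-- Illustrative sketch of the core constructs; names, types, and sizes are placeholders.
CREATE TYPE OrderType (orderId String, amount Double, orderTime DateTime);
CREATE STREAM OrderStream OF OrderType;

-- Bound the unbounded stream by time.
CREATE JUMPING WINDOW OrderWindow OVER OrderStream KEEP WITHIN 5 MINUTE ON orderTime;

-- Continuously aggregate each window as it closes.
CREATE CQ OrderTotals
INSERT INTO OrderTotalsStream
SELECT COUNT(*) AS orders, SUM(amount) AS revenue
FROM OrderWindow;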

In addition to the above core constructs, because of the high volume and velocity of today’s streaming applications, all streaming platforms must be horizontally and vertically scalable. Also, because of the business nature of the data, all platforms must support Exactly Once Processing (E1P) semantics even for non-replayable sources.

Below, we highlight some differences between Striim TQL and KSQL in terms of how core streaming compute constructs are defined and managed:

  • Streams: KSQL has no in-memory version; a disk-based Kafka topic is always required. TQL supports both in-memory and persisted streams.
  • Windows: KSQL has no attribute (column)-based time windows, and the same window cannot be used in multiple queries. TQL supports all types of windows, and the same window can be used in multiple queries, amortizing memory cost.
  • Queries: KSQL has no support for grouping on derived columns, limited aggregate support, and no inner joins. TQL supports all types of joins and aggregate queries.
  • Caches: KSQL requires maintaining an external cache. TQL has an in-house built cache with refresh.
  • Tables: KSQL has an external dependency on RocksDB. TQL has an in-house built EventTable.

 

Performance Using Workload A

In this section, we will attempt a performance evaluation of the two platforms using a well-known benchmark in the data engineering space. We selected the TPCH benchmark, which is a very popular analytics benchmark amongst data processing vendors, and modified the core nature of the queries from batching to streaming. The experiments were conducted on an Amazon EC2 instance of type i3.xlarge.

As KSQL does not support inner joins, we were very limited in what we could run in KSQL, since most of the queries in TPCH require inner join support. So, we limited ourselves to just one query that had some filtering and aggregation. We generated data for Scale Factor 10, which led to a row count of 60M for the lineitem table. In order to make the workload streaming, we introduced timestamps (borrowed from the order date in the Orders table) in the rows so that we could apply windowing and aggregation on the data. Here is the schema for the lineitem table (prior to adding timestamps).
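
For reference, the standard TPC-H lineitem definition is roughly as follows; the column names and types below follow the TPC-H specification, while the streaming experiments later in this post use abbreviated, camel-cased column names.

CREATE TABLE lineitem (
    l_orderkey      BIGINT,
    l_partkey       BIGINT,
    l_suppkey       BIGINT,
    l_linenumber    INTEGER,
    l_quantity      DECIMAL(15,2),
    l_extendedprice DECIMAL(15,2),
    l_discount      DECIMAL(15,2),
    l_tax           DECIMAL(15,2),
    l_returnflag    CHAR(1),
    l_linestatus    CHAR(1),
    l_shipdate      DATE,
    l_commitdate    DATE,
    l_receiptdate   DATE,
    l_shipinstruct  CHAR(25),
    l_shipmode      CHAR(10),
    l_comment       VARCHAR(44)
);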

We then performed a set of experiments.

  • The first experiment is when the data comes in as raw files and is constantly fed to the Striim TQL engine. This experiment could not be repeated for KSQL since KSQL can only get data from Kafka topics.
  • The second experiment is when the data comes in as events in a Kafka topic. Striim TQL can directly read from Kafka topics (by using a construct called persisted streams that directly map to a Kafka topic).

We selected TPCH Q6, which has a filter and an aggregation.

SELECT
    sum(l_extendedprice * l_discount) as revenue
FROM
    lineitem
WHERE
    l_shipdate >= date '1994-01-01'
    AND l_shipdate < date '1994-01-01' + interval '1' year
    AND l_discount between 0.05 AND 0.07
    AND l_quantity < 24;


Since we had to convert the query to something that made sense in a streaming environment, we removed the predicates on l_shipdate and instead applied a 5-minute jumping (also commonly known as tumbling) window on the streaming data as it comes in, while still retaining the predicates on l_discount and l_quantity and the aggregate on l_extendedprice and l_discount. The original query gets converted to the following pseudo-queries:

  • First create a stream S1 based on the stream of rows in the fact table lineitem
  • Filter out rows in S1 based on the predicates on l_discount and l_quantity
  • The filtered rows would keep forming 5 minute windows
    • For each window, compute the aggregate and output to a result Kafka topic

KSQL

We inserted all the data in a Kafka topic line_10 and executed the following queries. Since KSQL did not support the original form of the query, we had to insert an arbitrary column ‘group_key’ (that had a single unique value) and use it for the grouping. The output of the final query also goes to a Kafka table named line_query6.

CREATE STREAM lineitem_raw_10 (shipDate varchar, orderKey bigint, discount Double , extendedPrice Double,  suppKey bigint, quantity bigint, returnflag varchar, partKey bigint, linestatus varchar, tax double , commitdate varchar, recieptdate varchar,shipmode varchar, linenumber bigint,shipinstruct varchar,orderdate varchar, group_key smallint) WITH (kafka_topic='line10', value_format='DELIMITED'); 

CREATE STREAM lineitem_10 AS 
Select STRINGTOTIMESTAMP(orderdate, 'yyyy-MM-dd HH:mm:ss') , orderkey, discount, suppKey, extendedPrice , quantity, group_key from lineitem_raw_10 where quantity<24 and   discount <0.07 and discount >0.05;

CREATE TABLE line_query6 AS SELECT group_key, sum(extendedprice * (1-discount)) as revenue from lineitem_10 WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY group_key;


TQL

In Striim, there are several ways to enable the same workload. Striim allows data to be read directly from files as well as from Kafka topics, thereby avoiding a complete I/O cycle.

In order to be compatible (testing-wise) with KSQL, we loaded the data into a Kafka topic 'LineItemDataStreamAll' and modeled it as a persisted stream 'LineItemDataStreamAll'. We wrote the following TQL queries, where the final query writes the results to a Kafka topic named LineDiscountStreamGrouped. Alternatively, the first query, LineItemFilter, could also be implemented via Striim's built-in KafkaReader adapter.

CREATE CQ LineItemFilter
INSERT INTO LineItemDataStreamFiltered
select * from LineItemDataStreamAll l
WHERE l.discount >0.05 and l.discount <0.07 and l.quantity < 24
;

CREATE JUMPING WINDOW LineWindow OVER LineItemDataStreamFiltered KEEP WITHIN 5 MINUTE ON LOrderDate;

CREATE TYPE ResultType (
   Group_key Integer,
  revenue Double
);
create stream LineDiscountStreamGrouped of ResultType
persist using KafkaProps;


CREATE OR REPLACE CQ LineItemDiscount
INSERT INTO LineDiscountStreamGrouped
SELECT l.group_key, SUM(l.extendedPrice * (1 - l.discount)) AS revenue
FROM LineWindow l
GROUP BY l.group_key;


Performance Numbers

We measured the execution time and average event throughput for both platforms. We also tried a variant where we only performed the filter (more like an ETL case) and not the windowing and subsequent aggregation. The following are the total execution times and average event throughput for both platforms. We used Striim 3.7.4 and KSQL 0.4 for the experiments.

As we can see from both charts, Striim's TQL beat KSQL in both scenarios. We believe the main reason Striim outperforms KSQL is that in Striim the entire computation pipeline can run in-memory, whereas for KSQL the output of every query is either a stream or a table, both of which are backed by (disk-based) Kafka topics. Another interesting element is partitioning; in this experiment we could not partition the data because the aggregate query did not have any inherent grouping. Having said that, if there were partitioning in the storage and querying, Striim would also benefit from running computation tasks in parallel.

Another point to note is that KSQL is severely constrained in how many analytical query forms it can run, since it still doesn't support inner joins or aggregations such as avg or count distinct without a grouping key. Until KSQL adds these core capabilities to the product, we really cannot compare the performance of analytical queries across the two platforms.

Performance Using Workload B

As mentioned in the last section, KSQL is severely constrained in the types and forms of analytical queries it can support and run, so it was very hard to do an apples-to-apples comparison with Striim, since Striim TQL is very feature-rich and can run many complex forms of streaming analytics queries. Therefore, in order to make a realistic comparison, we decided to pick a dataset and query from the KSQL GitHub page and used them to run the next set of experiments. The experiments were conducted on an Amazon EC2 instance of type i3.xlarge.

The dataset we picked is the clickstream dataset available on the KSQL GitHub page, and we picked the following sample query from one of its files, clickstream-schema.sql. We ran the following queries, which fall into the category of streaming data enrichment, where the incoming streaming data is enriched with data that belongs to another table or cache (one use case mentioned in this KSQL article).

KSQL Queries

CREATE STREAM clickstream (_time bigint,time varchar, ip varchar, request varchar, status int, userid int, bytes bigint, agent varchar) with (kafka_topic = 'clickstream', value_format = 'json');

CREATE TABLE WEB_USERS (user_id int, registered_At long, username varchar, first_name varchar, last_name varchar, city varchar, level varchar) with (key='user_id', kafka_topic = 'clickstream_users', value_format = 'json');

CREATE STREAM customer_clickstream WITH (PARTITIONS=2) as SELECT userid, u.first_name, u.last_name, u.level, time, ip, request, status, agent FROM clickstream c LEFT JOIN web_users u ON c.userid = u.user_id;

CREATE TABLE custClickStream as select * from customer_clickstream ;


We then used Striim to run a similar query that reads from and writes to Kafka. We used KafkaReader with the JSON parser to read into a typed stream. For the "users" data set, we loaded Striim's refreshable cache component before performing the join, and the resulting stream is written back to Kafka as a new topic via the KafkaWriter. The Striim TQL for this is as follows:

CREATE STREAM clickStrm1 of clickStrmType;
CREATE SOURCE clickStreamSource USING KafkaReader (
 brokerAddress:'localhost:9092',
 Topic:'clickstream',
 startOffset:'0'
)
PARSE USING JSONParser (
 eventType:'admin.clickStrmType'
)
OUTPUT TO clickStrm1;

CREATE CQ customer_clickstreamCQ
INSERT INTO customer_clickstream1
select c.userid,u.first_name,u.last_name,u.level,c.time,c.ip,c.request,c.status,c.agent
From clickStrm1 c LEFT JOIN users_cache u
on c.userid = u.user_id ;

create Target writer using KafkaWriter VERSION '0.10.0' (
        brokerAddress:'localhost:9092',
        Topic:'StriimclkStrm'
)
format using JSONFormatter (
        EventsAsArrayOfJsonObjects:false,
        members:'userid,first_name,last_name,level,time,ip,request,status,agent'
)
INPUT FROM customer_clickstream1;


It is worthwhile to note here that even though we read and write to Kafka in this experiment, Striim is not limited to reading data from Kafka alone. Striim supports a much wider variety of input sources and output targets.


Performance Numbers

We measured the execution time and average event throughput for both platforms for the following datasets:

(a) DataSet1: 2 million rows in clickstream topic and 4 thousand rows in users topic.

(b) DataSet2: 4 million rows in clickstream topic and 8 thousand rows in users topic.

(c) DataSet3: 16 million rows in clickstream topic and 32 thousand rows in users topic.

The data was generated using scripts provided on the KSQL GitHub page; both KSQL and Striim consumed data from the same Kafka topic, named clickstream.

The following are the total execution times and average event throughput for both platforms. We used Striim 3.7.4 and KSQL 0.4 (released December 2017) for the experiments.

As we can see from both charts, Striim's TQL beat KSQL in both scenarios by a factor of 3. Again, we believe the main reason Striim outperforms KSQL is that in Striim the entire computation pipeline can run in-memory, whereas for KSQL the output of every query is either a stream or a table, both of which are backed by (disk-based) Kafka topics. Also, since the input Kafka topic is partitioned, Striim was able to employ auto-parallelism and use multiple cores to read from Kafka, perform the query, and write to the output Kafka topic.

Hardware

The experiments were all done using an Amazon EC2 instance of type i3.xlarge. The hardware configuration is as follows:

  1. 4 vCPUs; each vCPU (virtual CPU) is a hardware hyperthread on an Intel E5-2686 v4 (Broadwell) processor running at 2.3 GHz.
  2. 30.5 GB RAM
  3. We used EBS disks for storage.

Code

All the code used to run the experiments on streaming SQL engines is available on Striim's GitHub page at https://github.com/striim/KSQL-Striim.

Striim Talks Cloud Integration at Strata Data Conference NYC, September 11-13

We at Striim are looking forward to seeing everyone next week at Strata Data Conference in NY! Stop by booth #1107 for ongoing presentations by Striim CTO Steve Wilkes on the topics of Microsoft Azure, AWS, and Google Cloud integration, as well as delivering real-time analytics in Apache Kudu.

This year at the Strata Data Conference, Striim will be showcasing how enterprise companies can utilize the Striim platform to adopt a hybrid cloud solution. Over the last few months, we have delivered two new platform releases (3.8.4 and 3.8.5) that focus on cloud integration and the extensibility of the platform, providing users with easier access to major cloud environments from Microsoft, AWS, and Google, along with further integration with Cloudera, Kafka, and Kudu.

To learn more about these capabilities, read our recent blog post, “Striim’s Latest Releases Boost Cloud Integration Capabilities, Ease of Use, and Extensibility.”

Booth Presentations/Demos

To learn more about how Striim can help support your hybrid cloud initiatives, Steve Wilkes, co-founder and CTO of Striim, will be giving booth presentations on our integration capabilities with Microsoft, AWS, and Google. Presentations at Strata Data NY will run throughout all three days of the conference.

Expo Hall Hours

Tuesday, September 11: 5:00-6:30pm ET

Wednesday, September 12: 10:30am-7:05pm ET

Thursday, September 13: 10:30am-3:30pm ET

Discounted Passes to Strata Data NY

If you haven’t picked up your pass yet, feel free to use our code “Striim20” to save 20% off all Strata conference passes.

On top of all of this, we’ll also be featuring our Geek Gadget Giveaway, so stop by booth #1107 to try your luck at winning cool prizes such as mini drones, smart watches, bobblehead collectables, tech gear, and Star Wars gadgets.

For more information, download our Striim Overview datasheet, visit our blog, or request a demo. We look forward to speaking with you more at Strata!

Using Change Data Capture to Solve the Cache Consistency Problem

In this post, we take a look at change data capture (CDC) as a solution for the cache consistency problem. As a visual aid, included below is a brief video demo that will run you through how to push changes from a database to the cache in real time.

Imagine that you have an application that works by retrieving and storing information in a database. To get faster response times, you may utilize an in-memory cache for rapid access to data. However, other applications also make database updates, which leads to a cache consistency problem, and the application now shows out of date or invalid information.

Hazelcast Striim Hot Cache

Hazelcast Striim Hot Cache easily solves issues with cache consistency by using streaming change data capture to synchronize the cache with the database in real time, ensuring the cache and associated application always have the correct data. In the demo video, we have a MySQL database, a Hazelcast cache, and a Striim server. We use test programs to create, modify, and dump data in the MySQL database, and a test program for the Hazelcast cache that can also dump its data.

We start by creating a table using the test code and loading it with data. We then use Striim to load the data from the database into the Hazelcast cache. Next, we create a CDC flow from the database and use this to deliver live changes into the cache. We run continuous modifications against a database which are replicated to the cache.

After some time, we dump the database and the cache to files and run a diff to prove that they are the same. In the demo, you’ll see the use of wizards to simplify the CDC setup and delivery to Hazelcast, real-time CDC from the database, real time delivery of change to synchronize the cache, and a real-time custom monitoring of the whole solution.

 

The first thing we need to do is set up the demo by using our test program to create a table in the MySQL database. Next, we use the test code to insert 200,000 rows of data. To perform the initial load from database to cache, we need a data flow. We use a database reader to extract data from MySQL, which we configure for MySQL instance and test table.

The target is a Hazelcast writer, which is configured to map the table data into our cache. To load the cache, we need to deploy it and start the application, which streams the table data from the database into the cache. The cache now has 200,000 entries. For Hot Cache, we need to set up change data capture to stream live database data to the cache.

First, we configure properties to connect to the MySQL database, and Striim checks to make sure CDC will work properly. It will tell you if the configuration needs changing and what to do. You can then browse and select tables of interest.

Next, you enter the Hazelcast cluster information and mapping information in the form of a file linking database tables to cache objects. The last step finalizes the configuration of the Hazelcast target. The wizard results in a data flow from the MySQL CDC source to the Hazelcast target.
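
Under the hood, the generated data flow is equivalent to a small TQL application. The sketch below is a rough illustration that assumes Striim's MysqlReader CDC adapter and a Hazelcast writer target; the adapter names, property names, cluster address, table, and mapping file are illustrative placeholders rather than the exact output of the wizard.

-- Illustrative sketch of a MySQL CDC to Hazelcast hot cache flow.
-- Adapter and property names are placeholders; the wizard generates the real configuration.
CREATE SOURCE MySQLCDCSource USING MysqlReader (
  ConnectionURL: 'mysql://localhost:3306',
  Username: 'striim_user',
  Password: '********',
  Tables: 'demo.CUSTOMER'
)
OUTPUT TO CustomerChangeStream;

CREATE TARGET HazelcastHotCache USING HazelcastWriter (
  clusterAddress: 'localhost:5701',
  ormFile: 'customer-mapping.xml'
)
INPUT FROM CustomerChangeStream;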

When deployed and started, the cache is synced with modifications from database change. We run some modifications in the form of inserts, updates, and deletes against the table. We see the table size, which we compare with the number of entries in the cache. Both the table and cache have 180,551 records.

If we dump the MySQL data from the database into a file and do the same with the cache data into a different file, a diff between the two files returns no results, proving that they are the same. Striim can also monitor the data flow, using a separate data flow for analytics and fully customizable dashboards to meet your business requirements.

Cache Consistency

Are you having issues with cache consistency? Download Striim for Hazelcast to try it out yourself, or chat with a Striim technologist by scheduling a demo to learn more.

Move Oracle to Azure SQL Server in Real-Time

In this demo, you’re going to see how you can utilize Striim to do real-time collection of change data capture from Oracle Database and deliver that, in real-time, into Microsoft Azure SQL Server. I’m also going to build a custom monitoring solution of the whole end-to-end data flow. (The demo starts at the 1:43 mark.)

Transcript:

Today, you'll see how you can move data from Oracle to Azure SQL Server running in the cloud, in real time, using Striim and change data capture. So you have data in lots of Oracle tables on-premises and you want to move this into Microsoft Azure SQL Server in real time. How do you go about doing this without affecting your production databases? You can't use SQL queries, because typically these would be queries against a timestamp, like table scans that you do over and over again; that puts a load on the Oracle database, and you can also miss important transactions. You need change data capture. CDC enables non-intrusive collection of streaming database change, and Striim provides change data capture as a collector out of the box. This enables real-time collection of change data from Oracle, SQL Server, and MySQL. CDC works because databases write all the operations that occur into transaction logs; change data capture listens to those transaction logs and, instead of using triggers or timestamps, directly reads these logs to collect operations.

This means that every DML operation, every insert, update, and delete, is written to the logs, captured by change data capture, and turned into events by our platform. So in this demo you're going to see how you can utilize Striim to do real-time collection of change data capture from your Oracle database and deliver that, in real time, into Microsoft Azure SQL Server. We're also going to build a custom monitoring solution for the whole end-to-end data flow. First of all, we connect to Microsoft Azure SQL Server. In this instance we have two tables, T_CUSTOMER and T_CUSTORD, which we can show here are currently completely empty.

We're going to use a data flow that we've built in Striim to capture data from an on-premises Oracle database using change data capture (you can see some of the configuration properties here) and deliver it, after some processing, into Microsoft Azure SQL Server (you can see the properties for configuring that here). To show this, we're going to run some SQL against Oracle; the SQL does a combination of inserts, updates, and deletes against our two Oracle tables. When we run this, you can see the data immediately in the initial stream. That data stream is then split into multiple processing steps and delivered into Azure SQL Server. If we redo the query against the Azure tables here, you can see that the previously empty tables now have data in them, and that data was delivered, and will continue to be delivered, live, as long as changes are happening in the Oracle database. In addition to the data movement, we've also built a monitoring application, complete with a dashboard, that shows you the data flowing through the various tables, the types of operations that are occurring, and the entire end-to-end transaction lag, which is the difference between when a transaction was committed on the source system and when it was captured and applied to the target, as well as some of the most recent transactions. This was built, again, using a data flow within the Striim platform.

This data flow uses the original streaming change data from the Oracle database and then applies processing, in the form of SQL queries, to generate statistics. In addition to generating data for the dashboard, you can also use these as rules to generate alerts for thresholds, etc. And the dashboard itself is not hard-coded. It's generated using our dashboard builder, which utilizes queries to connect to the back end. Each visualization you're seeing here is powered by a query against the back-end data, and there are lots of visualizations to choose from. So, we hope you've enjoyed seeing how to move on-premises Oracle data into the cloud using Striim.
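
As a rough illustration of the monitoring side of such a flow, the sketch below counts change operations per table over a one-minute window, assuming the CDC stream carries metadata that is accessible through a META() function; the stream name, metadata keys, and window size are illustrative assumptions rather than the exact application shown in the demo.

-- Illustrative monitoring sketch: count CDC operations per table and operation type per minute.
-- OracleCDCStream and the metadata keys are assumptions for illustration only.
CREATE JUMPING WINDOW CDCStatsWindow OVER OracleCDCStream KEEP WITHIN 1 MINUTE;

CREATE CQ OperationCounts
INSERT INTO OperationCountStream
SELECT META(c, 'TableName') AS tableName,
       META(c, 'OperationName') AS operation,
       COUNT(*) AS operations
FROM CDCStatsWindow c
GROUP BY META(c, 'TableName'), META(c, 'OperationName');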

 

Streaming Integration: What Is It and Why Do I Need It?

In this blog post, we’re going to tackle the basics behind what streaming integration is, and why it’s critical for enterprises looking to adopt a modern data architecture. Let’s begin with a working definition of streaming integration.

What is Streaming Integration?

You’ve heard about streaming data and streaming data integration and you’re wondering, why is it an essential part of any enterprise infrastructure?

Well, streaming integration is all about continuously moving any enterprise data with very high throughput in a scalable fashion, while processing that data, correlating it, and analyzing it in-memory, so that you can get real value out of that data, and visibility into it, in a verifiable fashion.

And streaming data integration is the foundation for so many different use cases in this modern world, especially if you have legacy systems and you need to modernize, you need to use new technologies to get the right answers from your data, and you need to do that continuously, in real time.

Why Streaming Integration?

Now that we’ve outlined a high-level understanding of what streaming integration is, let’s discuss why it’s important. You now know streaming integration is an essential part of enterprise modernization. But why? Why streaming integration and why now?

Well, streaming data integration is all about treating your data the way it should be treated. Batch processing is an artifact of technology history: storage was cheap, while memory and CPU were expensive. And so, people would store lots of data and then process it later.

But data is not created in batches. Data is created row-by-row, line-by-line, event-by-event as things in the real world happen. So, if you’re treating your data in batches, you’re not respecting it; you’re not treating it the way that it’s created. In order to do that, you need to collect that data and process it as it’s being produced, and do all of this in a streaming fashion. And that’s what streaming integration is all about.

 

If you’re interested in learning more about streaming data integration and why it’s needed, please visit our Real-Time Data Integration Solution page, or view the wide variety of sources and targets that Striim supports.
