Irem Radzik


The Inevitable Evolution from Batch ETL to Real-Time ETL (Part 1 of 2)

Traditional extract, transform, load (ETL) solutions have, by necessity, evolved into real-time ETL solutions as digital businesses have increased both the speed at which they execute transactions and the volume of data they need to share across systems. In this two-part blog post series, I will describe the transition from traditional ETL to streaming, real-time ETL and how that shift benefits today’s data-driven organizations.

The Evolution of Real-Time ETL

Data integration has been the cornerstone of digital innovation for the last several decades, enabling the movement and processing of data across the enterprise to support data-driven decision making. In decades past, when businesses collected and shared data primarily for strategic decision making, batch-based ETL solutions served these organizations well. A traditional ETL solution extracts data from databases (typically at the end of the day), transforms the data extensively on disk in a middle-tier server into a consumable form for analytics, and then loads it in batch to a target data warehouse with a significantly different schema to enable various reporting and analytics solutions.

As consumers demanded faster transaction processing, personalized experiences, and self-service access to up-to-date data, the data integration approach had to adapt to collect and distribute data to customer-facing and analytical applications more efficiently and with lower latency. In response, two decades ago, logical data replication with change data capture (CDC) capabilities emerged. CDC moves only the changed data in real time, as opposed to all available data as a snapshot, and delivers it to various databases.
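The contrast can be shown with a minimal sketch. Nothing here reflects a real product API: the table, change log, and function names are hypothetical stand-ins for a database and its transaction log.

```python
# Hedged sketch: snapshot-based extraction vs. change data capture (CDC).
# "customers" stands in for a source table; "change_log" stands in for the
# database's transaction log that a CDC reader would tail.

def snapshot_extract(table):
    """Batch ETL: ship every row, whether or not it changed."""
    return list(table.values())

def cdc_extract(change_log):
    """CDC: ship only the inserts, updates, and deletes since the last read."""
    return list(change_log)

# A small table and the single change applied to it today.
customers = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Grace"}}
change_log = [{"op": "UPDATE", "id": 2, "name": "Grace H."}]

print(len(snapshot_extract(customers)))  # every row moves: 2
print(len(cdc_extract(change_log)))      # only the change moves: 1
```

The snapshot cost grows with table size; the CDC cost grows only with the rate of change, which is why CDC keeps latency and network impact low.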

These “new” technologies enabled businesses to create real-time replicas of their databases to support customer applications, migrate databases without downtime, and make real-time operational decisions. Because CDC was not designed for extensive transformations of the data, logical replication and CDC tools led to an “extract, load, and transform” (ELT) approach, in which significant transformation and enrichment is performed on the target system to put the data in the desired form for analytical processing. Many of the original logical replication offerings are also architected to run single processes on one node, which creates a single point of failure and requires an orchestration layer to achieve true high availability.

The next wave of change came as analytical solutions shifted from traditional on-premises data warehousing on relational databases to Hadoop and NoSQL environments and Kafka-based streaming data platforms, deployed heavily in the cloud. Traditional ETL now had to evolve further into a real-time ETL solution that works seamlessly with data platforms both on-premises and in the cloud, and combines the robust transformation and enrichment capabilities of traditional ETL with the low-latency data capture and distribution capabilities of logical replication and CDC.

In Part 2 of this blog post, I will discuss these real-time ETL solutions in more detail, particularly focusing on Striim’s streaming data integration software which moves data across cloud and on-premises environments with in-memory stream processing before delivering data in milliseconds to target data platforms. In the meantime, please check out our product page to learn more about Striim’s real-time ETL capabilities.

Feel free to schedule a technical demo with one of our lead technologists, or download or provision Striim for free to experience first-hand its broad range of capabilities.


Streaming Data Integration for Hadoop

With Striim’s streaming data integration for Hadoop, you can easily feed your Hadoop and NoSQL solutions continuously with real-time, pre-processed data from enterprise databases, log files, messaging systems, and sensors to support operational intelligence.

Ingest Real-time, Pre-Processed Data for Operational Intelligence

Striim is a software product that continuously moves real-time data from a wide range of sources into Hadoop, Kafka, relational and NoSQL databases — on-prem or in the cloud — with in-line transformation and enrichment capabilities. Brought to you by the core team behind GoldenGate Software, Striim offers a non-intrusive, quick-to-deploy solution for streaming integration so your Hadoop solution can support a broader set of operational use cases.

With the following capabilities, Striim’s streaming data integration for Hadoop enables a smart data architecture that supports use-case-driven analytics in enterprise data lakes:

  • Ingests large volumes of real-time data from databases, log files, message systems, and sensors
  • Collects change data non-intrusively from enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, MariaDB, Amazon RDS
  • Delivers data in milliseconds to Hadoop (HDFS, HBase, Hive, Kudu), Kafka, Cassandra, MongoDB, relational databases, cloud environments, and other targets
  • Supports mission-critical environments with end-to-end security, reliability, HA, and scalability

Benefits

  • Uses low-latency data for operational use cases
  • Accelerates time to insight with a continuous flow of transformed data
  • Ensures scalability, security, and reliability for business-critical solutions
  • Achieves fast time-to-market with a wizard-based UI and SQL-based language

Key Features

  • Enterprise-grade and fast-to-deploy streaming integration for Hadoop
  • Real-time integration of structured and unstructured data
  • In-flight filtering, aggregation, transformation, and enrichment
  • Continuous ingestion and processing, at scale
  • Integration with existing technologies and open source solutions

Striim enables businesses to get the maximum value from high-velocity, high-volume data by delivering it to Hadoop environments in real-time and in the right format for operational use cases.

Real-time, Low-impact Change Data Capture

Striim ingests real-time data from transactional databases, log files, message queues, and sensors. For enterprise databases, including Oracle, Microsoft SQL Server, MySQL, and HPE NonStop, Striim offers a non-intrusive change data capture (CDC) feature to ensure real-time data integration has minimal impact on source systems and to optimize network utilization by moving only the changed data.

In-Flight Data Processing

As data volumes continue to grow, the ability to filter and aggregate data before analytics becomes a key way to manage limited storage resources. Striim enables in-flight data filtering and aggregation before delivery to Hadoop, reducing the data storage footprint. By performing in-line transformation (such as denormalization) and enrichment with static or dynamically changing data in memory, Striim feeds large data volumes in the right format without introducing latency.
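The filter-then-aggregate pattern can be sketched in a few lines. This is an illustration only, not Striim's API: the event shape, sensor names, and alert threshold are all invented for the example.

```python
# Hedged sketch of in-flight filtering and aggregation: drop uninteresting
# events and roll up the rest before anything is written to storage.
from collections import defaultdict

events = [
    {"sensor": "pump-1", "temp": 71.0},
    {"sensor": "pump-1", "temp": 98.5},
    {"sensor": "pump-2", "temp": 99.1},
    {"sensor": "pump-2", "temp": 70.2},
]

# Filter: keep only readings above a (hypothetical) alert threshold.
hot = (e for e in events if e["temp"] > 90.0)

# Aggregate: one summary row per sensor instead of every raw reading.
summary = defaultdict(lambda: {"count": 0, "max_temp": 0.0})
for e in hot:
    s = summary[e["sensor"]]
    s["count"] += 1
    s["max_temp"] = max(s["max_temp"], e["temp"])

# Two summary rows reach the target instead of four raw events.
print(dict(summary))
```

In a streaming platform the same logic runs continuously in memory over unbounded input, so the storage savings compound as the raw event volume grows.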

Enterprise-grade Solution

Striim is designed to meet the needs of mission-critical environments with end-to-end security and reliability — including out-of-the-box exactly-once processing — as well as high performance and scalability. Users can focus on the application logic, knowing that from ingestion to alerting and delivery, the platform is robust enough to support the business as required.

Fast Time to Market

An intuitive development experience with a drag-and-drop UI, along with prebuilt data flows from popular sources to multiple Hadoop targets, allows fast deployment. Striim uses a SQL-based language that requires no special skills to develop or modify streaming applications.

Operationalizing Machine Learning

Striim can pre-process and extract features suitable for machine learning before continually delivering training files to Hadoop. Once data scientists build their models using Hadoop technologies, these can be brought into Striim, using the new open processor component, so real-time insights can guide operational decision making and truly transform the business. Striim can also monitor model fitness and trigger retraining of models for full automation.
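The two halves of that workflow — feature extraction from raw events, and a fitness check that triggers retraining — can be sketched as follows. The feature set, accuracy metric, and threshold are hypothetical, chosen only to make the pattern concrete.

```python
# Hedged sketch of operationalizing ML in a streaming pipeline:
# (1) flatten raw events into feature rows for training files, and
# (2) monitor model fitness, firing a retraining trigger on drift.

def extract_features(event):
    """Turn a raw event into a flat feature row suitable for a training file."""
    return {
        "amount": event["amount"],
        "hour_of_day": event["ts_hour"],
        "is_weekend": int(event["day"] in ("Sat", "Sun")),
    }

def should_retrain(recent_accuracy, threshold=0.90):
    """Fitness monitor: trigger retraining when accuracy drifts below threshold."""
    return recent_accuracy < threshold

events = [{"amount": 120.0, "ts_hour": 23, "day": "Sat"}]
training_rows = [extract_features(e) for e in events]

print(training_rows)       # feature rows ready to deliver to Hadoop
print(should_retrain(0.87))  # drifted below threshold: True
```

In production the extraction step would run continuously over the event stream, and the fitness check would consume a scored stream of predictions versus outcomes.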

Differences from ETL

Compared to traditional ETL offerings that use bulk data extracts, Striim enables continuous ingestion of structured, semi-structured, and unstructured data in real time, delivering a granular data flow for richer analytics. By performing in-memory transformations on data-in-motion using SQL-based continuous queries, Striim avoids adding latency and enables real-time delivery. While ETL solutions are optimized for database sources and targets, Striim provides native integration and optimized delivery for Hadoop, Kafka, databases, and files, on-prem or in the cloud. Striim also offers streaming analytics and data visualization capabilities within the same platform, without requiring additional licenses.

To learn more about streaming data integration for Hadoop, visit our Hadoop and NoSQL Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started!

Move Real-Time Data to Cloudera Using the Striim Platform

In this blog post, we’re going to take a look at how you can use the Striim platform to move real-time data to Cloudera from a variety of sources.

The Striim platform provides an enterprise-grade streaming integration solution for moving real-time change data from a wide variety of sources to Cloudera distributions of Apache Kafka, Apache Kudu, and Apache Hadoop, without impacting source systems. With support for hybrid IT infrastructures, Striim complements Cloudera solutions by enabling organizations to use the full breadth and depth of their data in real time in order to gain a complete and up-to-date view into their operations.

Benefits

  • Ingest real-time data into CDK (Kafka), Kudu, and Hadoop with low impact
  • Continuously collect data from databases, logs, messaging, sensors, and more
  • Process data in-flight without extensive coding
  • Get immediate insights and alerts
  • Use low-latency data in Cloudera for operational decision making

Why Striim?

  • Real-time data integration from a wide variety of data sources
  • Designed for high-volume, high-velocity data
  • Non-intrusive CDC from databases with event delivery guarantees
  • Built-in security, scalability, and reliability
  • In-flight enrichment via built-in cache
  • Quick to deploy via SQL-like queries and a wizard-based UI

Non-intrusive, Real-time Data Ingestion

The Striim platform continuously ingests real-time data from a variety of sources out-of-the-box – including databases, cloud applications, files, message queues, and devices – on-premises or in the cloud. For enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, and MariaDB, the platform offers non-intrusive change data capture (CDC) to minimize the impact on source systems. Striim supports major data formats, including JSON, XML, Avro, delimited, binary, free text, and change records.

With a drag-and-drop UI and wizards, Striim simplifies creating data flows from popular sources to move data to Cloudera solutions including CDK (Kafka), Hadoop, HBase, Hive, and Kudu. The data can be delivered “as-is,” or be put through a series of in-flight transformations and enrichments. By using real-time, pre-processed data – especially in Kudu, Impala, and Kafka – customers can rapidly gain timely, operational intelligence from their Cloudera applications.

Delivery to Cloudera, On-premises or Cloud

The Striim platform can continuously apply pre-processed, streaming data to Cloudera solutions with sub-second latency. With parallelization capabilities, Striim offers optimized loading to Cloudera solutions. Striim can also deliver real-time data to other targets such as databases and files.

Built-in Stream Processing and Monitoring

Through SQL-based continuous queries, the Striim platform filters, aggregates, transforms, joins, and enriches multiple streams of real-time data in-memory to rapidly prepare the data for different downstream users before delivering to Cloudera environments. 
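The enrichment half of that step — joining a stream against in-memory reference data before delivery — can be sketched as below. The cache contents and field names are illustrative; a real deployment would populate the reference cache from a database or file, not a literal dict.

```python
# Hedged sketch of in-memory stream enrichment: join each change event
# against a reference cache so downstream consumers receive denormalized
# records, without a round trip to the source database per event.

# Reference data held in memory (a stand-in for a platform-managed cache).
product_cache = {"p-1": {"name": "Widget", "category": "Hardware"}}

def enrich(event, cache):
    """Join a streaming event with its reference row, in memory."""
    ref = cache.get(event["product_id"], {})
    return {**event, **ref}

stream = [{"order_id": 42, "product_id": "p-1", "qty": 3}]
enriched = [enrich(e, product_cache) for e in stream]

print(enriched[0])
# {'order_id': 42, 'product_id': 'p-1', 'qty': 3, 'name': 'Widget', 'category': 'Hardware'}
```

Because the lookup is a local in-memory read, enrichment adds negligible latency even at high event rates, which is what makes it viable on data-in-motion.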

Striim also comes with built-in validation and monitoring capabilities. The platform enables users to continuously monitor the health of the data pipelines via real-time dashboards and alerts.

Enterprise-grade Modern Streaming Integration

Striim is designed from the ground up to support high-volume, high-velocity data with built-in validation, security, high availability, reliability, and scalability to support mission-critical applications.

Unlike traditional ETL solutions, Striim continuously ingests granular and larger data sets for richer analytics. It does so without impacting source systems, and processes the data in-memory, while it is streaming, to enable sub-second latency. Striim also differs from traditional logical replication tools with its optimized support for a wide range of data types, data sources, and targets, and its out-of-the-box comprehensive stream processing capabilities.

To learn more about how you can utilize the Striim platform to move data to Cloudera, please reach out to schedule a demo with a Striim expert or download the platform and try it for yourself.

What Is Streaming Data Integration?

Streaming data integration is a fundamental component of any modern data architecture. Increasingly, companies need to make data-driven decisions – regardless of where data resides – when it matters most: immediately. Streaming data integration is one of the first steps in being able to leverage the next-generation infrastructures, such as cloud, big data, real-time applications, and IoT, that underlie these decisions.

In this post, we’re going to take a look at how the Striim platform was built from the ground up for streaming data integration, and how organizations are benefiting from it. Striim enables businesses to move to the cloud, easily build real-time applications, and get more value from Hadoop solutions.

Striim is patented, enterprise-grade software for streaming data integration, which offers continuous data collection, stream processing, pipeline monitoring, and real-time delivery with verification across heterogeneous systems. Striim provides up-to-date data in a consumable form in Kafka, Hadoop, and databases — on-prem or in the Cloud — to support operational intelligence and other high-value workloads.

Core Platform Capabilities

  • Continuous, Structured, and Unstructured Data Collection: Striim captures real-time data from a wide variety of sources including databases (using low-impact change data capture), cloud applications, log files, IoT devices, and message queues.
  • SQL-based Stream Processing: Striim applies filtering, transformations, aggregations, masking, and enrichment using static or streaming reference data.
  • Pipeline Monitoring and Alerting: Striim allows users to visualize the data flow and the content of data in real time, and offers delivery validation.
  • Real-Time Delivery: Striim distributes real-time data in a consumable form to all major targets including Cloud environments, Kafka and other messaging systems, Hadoop, relational and NoSQL databases, and flat files.

Key Platform Differentiators

  • Streaming data integration with intelligence via an in-memory platform
  • Real-time data movement across on-prem and cloud environments
  • Low-impact CDC for Oracle, SQL Server, HPE NonStop, and MySQL
  • In-flight filtering, aggregation, transformation, and enrichment using SQL
  • Quick-to-deploy and easy-to-integrate via drag-and-drop UI
  • Continuous data pipeline monitoring and built-in delivery validation
  • Integration with existing technologies and open source solutions

Common Use Cases

Here are just a few of the most common ways Striim customers leverage its patented software to solve critical enterprise challenges:

Hybrid Cloud Integration

Striim eases cloud adoption by continuously moving real-time data from on-premises and cloud sources to Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform environments. Many Striim customers use pre-built data pipelines to feed their cloud solutions from their on-premises databases, files, messaging systems, and sensors to enable operational workloads in the cloud. By filtering, aggregating, transforming, and enriching the data-in-motion before delivering to the cloud, Striim delivers real-time data in consumable form and helps to optimize cloud storage. Available on-premises or in the cloud, Striim enables businesses to get up and running in a matter of minutes.

Data Integration for Real-Time Applications

Striim enables real-time applications on event-based messaging systems such as Kafka, fast analytics storage solutions such as Kudu, and NoSQL databases such as Cassandra by continuously feeding them pre-processed data in real time. Striim offers a wizard-based UI and SQL-based language for easy and fast development. Also, when needed, Striim performs SQL-based streaming analytics and visualizes the streaming data before delivering it to the target, to provide real-time operational intelligence.

Real-Time Integration and Pre-Processing for Hadoop

Striim enables a modern, smart data architecture for data lakes by non-intrusively and continuously collecting real-time data from databases, logs, messaging systems, and sensors, and pre-processing the data-in-motion for operational reporting and analytics. To accelerate insights and optimize storage, Striim filters, masks, aggregates, transforms, and enriches the data before delivering with sub-second latency to HDFS, HBase, and Hive. Striim can also pre-process and extract features suitable for machine learning before continually delivering training files to Hadoop. Models built using Hadoop technologies can be brought into Striim, so real-time insights can guide operational decision making and truly transform the business. Striim can also monitor model fitness and trigger retraining of models for full automation.

To learn more about our streaming data integration capabilities, please visit our Real-time Data Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started!

Logical Replication vs. Streaming Data Integration – Which is Better for Building a Streaming Architecture?

Out with the old and in with the new: Streaming data integration offers so much more than traditional logical replication – from cloud integration, to advanced analytics, to support for machine learning.

There was once a time, not too long ago, when only data in databases was collected and analyzed. This sounds like a crazy concept today, given the myriad of digital frameworks data now resides in – log files, connected devices, Hadoop, Kafka, cloud-based data stores and applications, and more. This is due, in part, to the wide variety of sources where data originates, not to mention the sheer amount of data now being created. In its day, logical replication was the best option for organizations to share data across systems with low latency, so that companies were working with the most up-to-date data possible, regardless of location.

However, over the last few years, thanks to digital transformation, we’ve seen a fundamental shift in data management, demand for faster and better analytics, and advancements in computing technologies. We’ve seen CPU and RAM get cheaper and faster, enabling organizations to ingest, process, and analyze broader types of data, in real time, regardless of what environment enterprise data is in.

While logical replication (also known as transactional replication) vendors have done their best to support integration with modern data sources and targets, their products weren’t designed to reliably and securely stream high-velocity data across new IoT, advanced analytics, and cloud systems. As a result, companies that try to use logical replication systems for next-generation analytics solutions, whether on-premises or in the cloud, often feel like they’re fitting round pegs into square holes.

The Striim platform was built from the ground up with a streaming architecture in mind, offering capabilities where transactional replication falls short and bringing enterprise companies to the next stage of a modern data architecture.

The image below succinctly details the differences between logical replication and streaming data integration, and why Striim is the better option for implementing a streaming architecture to gain maximum value from real-time data.

Companies need to work with all of their data, while it’s still relevant, in order to gain data-driven insights. A streaming architecture helps digital businesses make the most of their data assets for operational excellence, and for that, it needs solutions that go beyond just real-time data movement between databases. Companies can start building a streaming architecture with platforms like Striim that make it easy to collect, prepare, analyze, and visualize high-volume, high-velocity data (structured, semi-structured, or unstructured) from a diverse set of sources in real time, and share it with any system regardless of its location.

Learn more about how your organization can take the first step in adopting a streaming architecture by visiting the Striim website, where you can find further information about our platform’s capabilities, use cases, case studies, and other materials to guide you in the right direction. Additionally, you can download the Striim platform or schedule a demo to learn more.
