Streaming Integration to Azure Cosmos DB

Real-time integration to Azure Cosmos DB enables companies to make the most of Azure’s globally distributed, multi-model database service. With Striim’s streaming integration to Azure Cosmos DB solution, companies can continuously feed real-time operational data into Cosmos DB from a wide range of on-premises and cloud-based data sources.

What is Striim?

The Striim software platform offers continuous, real-time data movement from enterprise document and relational databases, sensors, messaging systems, and log files into Azure Cosmos DB with in-flight transformations and built-in delivery validation to support real-time reporting, IoT analytics, and transaction processing.

Offload Operational Reporting

  • Move real-time unstructured and structured data to Cosmos DB to support operational workloads including real-time reporting
  • Continuously collect data from a diverse set of sources (such as Internet of Things (IoT) sensors) for timely and rich insight

Accelerate and Simplify Processing

  • Perform filtering, transformations, aggregation, and enrichments in-flight before delivery to Cosmos DB
  • Avoid added latency by processing data in-flight rather than in batch
  • Easily convert structured data to document form

Ease the Cosmos DB Adoption Process

  • Use phased, zero-downtime migration from MongoDB by running MongoDB and Cosmos DB in parallel
  • Continuously visualize and monitor data pipelines with real-time alerts
  • Prevent data loss with built-in validation

How Striim Delivers Streaming Integration to Azure Cosmos DB

Low-Impact Change Data Capture from Enterprise Databases

  • Continuous, non-intrusive data ingestion for high-volume data
  • Support for databases such as Oracle, SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, Amazon RDS for Oracle, and Amazon RDS for MySQL
  • Real-time data collection from logs, sensors, Hadoop and message queues to support rich and timely analytics

Continuous, In-Flight Data Processing

  • In-line transformation, filtering, aggregation, and enrichment to store only the data you need, in the right format
  • Uses SQL-based continuous queries via a drag-and-drop UI (see the sketch below)
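
To make this concrete, here is a minimal sketch of such a pipeline in Striim’s SQL-based query language (TQL). It is illustrative only: the adapter names, property names, table and field names, and the simplified field access on the CDC stream are all assumptions, and a wizard-generated application would differ in detail.

-- Illustrative TQL sketch (names and properties are assumptions):
-- capture changes from a relational table, filter and reshape them
-- in-flight, and deliver the result to Cosmos DB as documents.
CREATE SOURCE OrdersCDC USING OracleReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'dbhost:1521:orcl',
  Tables: 'SALES.ORDERS'
)
OUTPUT TO OrdersStream;

-- Continuous query: keep only completed orders and select the fields
-- that will become the Cosmos DB document. Field access is simplified;
-- raw CDC events may first need conversion to a typed stream.
CREATE CQ PrepareOrders
INSERT INTO PreparedOrders
SELECT ORDER_ID, CUSTOMER_ID, ORDER_TOTAL, ORDER_TS
FROM OrdersStream
WHERE STATUS = 'COMPLETE';

CREATE TARGET CosmosTarget USING CosmosDBWriter (
  ServiceEndpoint: 'https://myaccount.documents.azure.com:443/',
  AccessKey: '********',
  Collections: 'salesdb.orders'
)
INPUT FROM PreparedOrders;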

Real-Time Data Delivery with Built-In Monitoring

  • Continuous verification of source and target database consistency
  • Interactive, live dashboards for streaming data pipelines
  • Real-time alerts via web, text, email

To learn more about how to leverage Striim’s solution for streaming integration to Azure Cosmos DB, check out our Striim for Azure Cosmos DB solution page, schedule a brief demo with a Striim technologist, provision Striim for Cosmos DB on the Azure marketplace, or download a free trial of the Striim platform and get started today!

Streaming Integration to Azure

To adopt modern data warehousing, advanced big data analytics, and machine learning solutions in the Azure Cloud, businesses need streaming integration to Azure. They need to be able to continuously feed real-time operational data from existing on-premises and cloud-based data stores and data warehouses.

What is Striim?

The Striim software platform offers continuous, real-time data movement from heterogeneous, on-premises systems and AWS into Azure with in-flight transformations and built-in delivery validation to make data immediately available in Azure, in the desired format.

Implement Operational Data Warehouse on Azure Cloud

  • Rapidly set up real-time data pipelines from on-prem databases and AWS to enable a real-time operational data store
  • Perform transformations, including denormalization, in-flight
  • Use phased, zero-downtime migration from Oracle Exadata, Teradata, or Amazon Redshift by running the legacy and Azure systems in parallel
  • Prevent data loss with built-in validation

Run Operational Workloads in Azure Databases

  • Continuously stream on-prem and AWS data to Azure SQL DB, Cosmos DB, Azure Database for MySQL, and Azure Database for PostgreSQL
  • Use non-intrusive change data capture to avoid impacting sources
  • Offload operational reporting
  • Move data continuously from MongoDB, sensors and other sources to Cosmos DB

Use Pre-Processed, Real-Time Data for Advanced Big Data Analytics and ML

  • Feed real-time data to Azure Data Lake Storage, Azure Databricks, and Azure HDInsight from on-prem or AWS databases, log files, messaging systems, Hadoop, and sensors
  • Pre-process data-in-motion to reduce ETL efforts and accelerate insight
  • Continuously visualize and monitor data pipelines with real-time alerts

How Striim Works to Achieve Streaming Integration to Azure

Low Impact Change Data Capture from Enterprise Databases

  • Non-stop, non-intrusive data ingestion for high-volume data
  • Support for data warehouses such as Oracle Exadata, Teradata, Amazon Redshift; and databases such as Oracle, SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, Amazon RDS for Oracle, Amazon RDS for MySQL
  • Real-time data collection from logs, sensors, Hadoop and message queues to support operational decision making

Continuous Data Processing and Delivery

  • In-flight transformation, including denormalization, filtering, aggregation, and enrichment to store only the data you need, in the right format
  • Real-time data delivery to Azure SQL Data Warehouse, SQL Server on Azure, Azure SQL Database, Azure Data Lake Storage, Azure Databricks, Kafka, Azure HDInsight, and Cosmos DB

Built-In Monitoring and Validation

  • Interactive, live dashboards for streaming data pipelines
  • Continuous verification of source and target database consistency
  • Real-time alerts via web, text, and email

Why Striim?

As an enterprise-grade platform with built-in high-availability, scalability, and reliability, Striim is designed to deliver tangible ROI with low TCO to meet the real-time requirements for streaming integration to Azure in mission-critical environments.

With a broad set of supported sources, Striim enables you to make virtually any data available on Azure in real time and the desired format to support next-generation cloud analytics and operational decision making on a continuous basis.

To learn more about how to use Striim for streaming integration to Azure, check out our Striim for Azure product page, schedule a short demo with a Striim technologist, or download a free trial of the Striim platform and get started today.

Setting Up Streaming ETL to Snowflake

Snowflake, the data warehouse built for the cloud, is designed to bring power and simplicity to your cloud-based analytics solutions, especially when combined with streaming ETL to Snowflake running in the cloud.

Snowflake helps you make better and faster business decisions using your data on a massive scale, fueling data-driven organizations. Just take a look at Snowflake’s example use cases and you can see how companies are creating value from their data with Snowflake. There’s just one key caveat – how do you get your data into Snowflake in the first place?

Approaches – ETL/CDC/ELT

There are plenty of options when it comes to data integration technologies for ETL to Snowflake.

Let’s start with traditional ETL. Now a legacy technology more than 50 years old, ETL was the genesis of data movement and enabled batch, disk-based transformations. While ETL is still used for advanced transformation capabilities, the high latencies and the immense load it places on your source databases leave something to be desired.

Next, there was Change Data Capture (CDC). Pioneered by the founders of Striim at their previous company, GoldenGate Software (acquired by Oracle), CDC technology enabled use cases such as zero-downtime database migration and heterogeneous data replication. However, CDC lacks transformation capabilities, forcing you into an ELT approach: first landing the data in a staging area, such as cloud storage, and then transforming it into its final form. While this works, the multiple hops increase your end-to-end latency and architectural complexity.

Continuously Integrating Transactional Data into Snowflake

Enter Striim. Striim is an evolution of GoldenGate, combining the real-time nature of CDC with many of the transformational capabilities of ETL in a next-generation streaming solution for ETL to Snowflake and other analytics platforms, on-premises or in the cloud. Enabling real-time data movement into Snowflake, Striim continuously ingests data from on-premises systems and other cloud environments into Snowflake. In this quick start guide, we will walk you through, step by step, how to use Striim for streaming ETL to Snowflake by loading data in real time, whether you run Snowflake on Azure or AWS.

Data Flow

We’ll get started with an on-premises Oracle to Snowflake application with in-line transformations and denormalization. This guide assumes you already have Striim installed either on-premises or in the cloud, along with your Oracle database and Snowflake account configured.

After installing Striim, there are a variety of ways to create applications, or data pipelines, from a source to a target. Here, I’ll focus on using our pre-built wizards and drag-and-drop UI, but you can also build applications from scratch with the drag-and-drop UI, or by using a declarative language via the CLI.

We will show how you can set up the flow between the source and target, and then how you can enrich records using an in-memory cache that’s preloaded with reference data.

1. In the Add App page, select Start with Template.

2. In the following App Wizard screen, search for Snowflake.

3. For this example, we’ll choose Oracle CDC to Snowflake.

4. Name the application whatever you’d like – we’ll choose oracleToSnowflake. Go ahead and use the default admin Namespace. Namespaces both organize applications and enable a microservices approach when you have multiple data pipelines. Click Save.

5. Follow the wizards, entering first your on-premises Oracle configuration properties, and then your Snowflake connection properties. In this case, I’m migrating an Oracle orders table. Click Save, and you’ll be greeted by our drag-and-drop UI with the source and target pre-populated. If you just want to do a straight source-to-target migration, that’s it! However, we’ll continue this example with enrichment and denormalization, editing our application using the connectors located on the left-hand side menu bar.

6. In this use case, we’ll enrich the Orders table with another table of the same on-premises Oracle database. Locate the Enrichment tab on the left-hand menu bar, and drag and drop the DB Cache to your canvas.

7. First, name the cache whatever you’d like – I chose salesRepCache. Then, specify the Type of your cache. In this case, my enrichment table contains three fields: ID, Name, and Email. Specify a Key to map. This tells Striim’s in-memory cache how to position the data for the fastest possible joins. Finally, specify your Oracle Username, the JDBC Connection URL, your password, and the tables that you want to use as a cache. Click Save.
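
Behind the scenes, the cache configured in this step corresponds to a TQL definition along the following lines. This is a rough sketch based on the ID/Name/Email enrichment table described above; the property names are assumptions, and the UI generates the exact definition for you.

-- Illustrative sketch of the salesRepCache definition (the UI
-- generates the real one; property names are assumptions).

-- The cache’s Type: the three fields of the enrichment table, with ID
-- marked as the key Striim uses to organize the cache for fast joins.
CREATE TYPE SalesRepType (
  ID    Integer KEY,
  Name  String,
  Email String
);

-- Load the enrichment table into the in-memory cache.
CREATE CACHE salesRepCache USING DatabaseReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'jdbc:oracle:thin:@dbhost:1521:orcl',
  Query: 'SELECT ID, NAME, EMAIL FROM SALES_REPS'
)
QUERY (keytomap: 'ID')
OF SalesRepType;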

8. Now we’ll go ahead and join our streaming CDC source with the static Database Cache. Click the circular stream beneath your Oracle source, and click Connect next CQ component.

9. Application logic in Striim is expressed using Continuous Queries, or CQs. You do so using standard SQL syntax and optional Java functionality for custom scenarios. Unlike a query on a database where you run one query and receive one result, a CQ is constantly running, executing the query on an event-by-event basis as the data flows through Striim. Data can be easily pre-formatted or denormalized using CQs.

10. In this example, we are doing a few simple transformations of the fields of the streaming Oracle CDC source, as well as enriching the source with the database cache – adding in the SALES_REP_NAME and SALES_REP_EMAIL fields where the SALES_REP_ID of the streaming CDC source equals the SALES_REP_ID of the static database cache. Specify the name of the stream you want to output the result to, and click Save. Your logic here may vary depending on your use case.
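
For reference, the CQ described in this step might look roughly like the following. The stream and cache names come from this walkthrough; the field names and the direct field access on the CDC stream are simplified assumptions rather than copy-ready code.

-- Illustrative sketch of the enrichment CQ from this step: join the
-- streaming CDC events with the static salesRepCache and write the
-- enriched events to enrichedStream. Field access is simplified.
CREATE CQ EnrichOrders
INSERT INTO enrichedStream
SELECT o.ORDER_ID,
       o.ORDER_TOTAL,
       c.Name  AS SALES_REP_NAME,
       c.Email AS SALES_REP_EMAIL
FROM oracleOrdersStream o
JOIN salesRepCache c
  ON o.SALES_REP_ID = c.ID;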

11. Lastly, we have to configure our SnowflakeTarget to read from the enrichedStream, not the original CDC source. Click on your Snowflake target and change the Input Stream from the Oracle source stream to your enriched stream. Click Save.

12. Now you’re good to go! In the top menu bar, click on Created and press Deploy App.

13. The deployment page allows you to specify where you want specific parts of your data pipeline to run. In this case I have a very simple deployment topology – I’m just running Striim on my laptop, so I’ll choose the default option.

14. Click the eye next to your enrichedStream to preview your data as it’s flowing through, and press Start App in the top menu bar.

15. Now that the app’s running, let’s generate some data. In this case, I just have a sample data generator that is connecting to my source on-premises Oracle database.

16. Data is flowing through the Striim platform, and you can see the enriched Sales Rep Name and Emails. 

17. Lastly, let’s go to our Snowflake warehouse and just do a simple select * query. Data is now being continuously written to Snowflake.

That’s it! Without any coding, you have now set up streaming ETL to Snowflake to load data continuously, in real time.

Interested in learning more about streaming ETL to Snowflake? Check out our Striim for Snowflake solution page, schedule a demo with a Striim technologist, or download the Striim platform to get started!

Processing and Analytics In-Flight on Streaming Data Pipelines

With the validation of usage scenarios and the promotion of those applications to the “production floor” with enterprise-grade streaming integration facilities, organizations are nearly always drawn to using analytics in-flight on their streaming data pipelines. However, these analytics are not typical retroactive charts and graphs. More often than not, these analytical scenarios are about the data and how it is behaving within the streaming landscape.

For example, analytics on a streaming data pipeline usually start with examining the data within the pipeline. While all streaming integration facilities start with the ideal path in mind for the performance and content of the data, the real-time nature of business events often requires organizations to continually monitor for exceptions in the data pipeline. Occasional missing values in the event stream are dealt with by the error-handling capabilities of an enterprise-grade platform. However, fundamental changes to the values of events (new data, different data types, etc.) or the quality of the data (corrupted, missing data) need to be monitored and accounted for by support teams in the short-term, or data architects and data engineering teams in the long-term.

Another use for analytics is understanding the operational aspects of the data stream. How many events are flowing from the sources? How many reach the targets? How do these match up against the historical or projected norms for the environment? In each of these cases, it is important to understand these operational-visibility key performance indicators to properly manage any streaming environment. Data engineers cannot wait for a static report or dashboard to deliver these real-time insights.

Finally, as organizations implement streaming analytical applications based on their business events, they will need to perform analytics in-flight and augment the data with that information. As usage scenarios like fraud management in real-time ordering become more prevalent, it is important for organizations to be able to provide analytical results that match the speed of those use cases. This not only delivers the competitive advantages of these use cases, but also mitigates the risks associated with these leading-edge, real-time usage scenarios.

To learn more about the benefits of performing processing and analytics in-flight, visit our Striim Platform Overview product page, schedule a demo with a Striim technologist, or download a free trial of the Striim platform and demo it yourself.

Streaming Data Integration to AWS

As businesses adopt Amazon Web Services, streaming data integration to AWS – with change data capture (CDC) and stream processing – becomes a necessary part of the solution.

You’ve already decided that you want to enable integration to AWS. This could be to Amazon RDS or Aurora, Amazon Redshift, Amazon S3, Amazon Kinesis, Amazon EMR, or any number of other technologies.

You may want to migrate existing applications to AWS, scale elastically as necessary, or use the cloud for analytics or machine learning, but running applications in AWS, as VMs or containers, is only part of the problem. You also need to consider how you move data to the cloud, ensure your applications or analytics are always up to date, and make sure the data is in the right format to be valuable.

The most important starting point is ensuring you can stream data to the cloud in real time. Batch data movement can cause unpredictable load on cloud targets and has high latency, meaning your data is often hours old. For modern applications, having up-to-the-second information is essential, for example to provide current customer information, accurate business reporting, or real-time decision making.

Moving Data to Amazon Web Services in Real-Time

Streaming data integration to AWS from on-premises systems requires appropriate data collection technologies. For databases, this means change data capture, or CDC, which directly and continuously intercepts database activity and collects all the inserts, updates, and deletes as events, as they happen. Log data requires file tailing, which reads from the end of one or more files across potentially multiple machines and streams the latest records as they are written. Other sources, such as IoT data or third-party SaaS applications, also require specific treatment to ensure data can be streamed in real time.
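
To illustrate the two collection styles described above, here is a hedged TQL sketch of a CDC source and a file-tailing source. The adapter and property names are assumptions for illustration, not exact syntax.

-- Illustrative TQL sketch (adapter and property names are assumptions).

-- Change data capture: continuously intercept inserts, updates, and
-- deletes from a database as events, without batch extracts.
CREATE SOURCE OrdersCDC USING MysqlReader (
  ConnectionURL: 'mysql://dbhost:3306',
  Username: 'striim',
  Password: '********',
  Tables: 'appdb.orders'
)
OUTPUT TO OrdersChangeStream;

-- File tailing: start at the end of matching log files and stream new
-- records as they are appended.
CREATE SOURCE AppLogs USING FileReader (
  Directory: '/var/log/app',
  WildCard: 'app*.log',
  PositionByEOF: true
)
PARSE USING FreeFormTextParser ()
OUTPUT TO LogStream;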

Once you have streaming data, the next consideration is what processing is necessary to make the data valuable for your specific AWS destination, and this depends on the use case.

Use Cases

For database migration or elastic scalability use cases, where the target schema is similar to the source, moving raw data from on-premises databases to Amazon RDS or Aurora may be sufficient. The important consideration here is that the source applications typically cannot be stopped, and it takes time to do an initial load. This is why collecting and delivering database changes, during and after the initial load, is essential for zero-downtime migrations.

For real-time applications sourcing from Amazon Kinesis, or analytics use-cases built on Amazon Redshift or Amazon EMR, it may be necessary to perform stream processing before the data is delivered to the cloud. This processing can transform the data structure, and enrich it with additional context information, while the data is in-flight, adding value to the data and optimizing downstream analytics.

Striim’s Streaming Integration to AWS

Striim’s streaming integration to AWS can continuously collect data from on-premises or other cloud databases and deliver it to all of your Amazon Web Services endpoints. Striim can take care of initial loads, as well as CDC for the continuous application of changes, and these data flows can be created rapidly, then monitored and validated continuously through our intuitive UI.

With Striim, your cloud migrations, scaling, and analytics can be built and iterated on at the speed of your business, ensuring your data is always where you want it, when you want it.

To learn more about streaming integration to AWS with Striim, visit our “Striim for Amazon Web Services” product page, schedule a demo with a Striim technologist, or download a free trial of the platform.

Real-Time AWS Cloud Migration Monitoring: 3-Minute Demo

AWS cloud migration requires more than just being able to run in VMs or cloud containers. Applications rely on data, and that data needs to be migrated as well.

In most cases, the original applications are essential to the business and cannot be stopped during this process. Since it takes time to migrate the data, and time to verify the application after migration, it is essential that data changes are collected and delivered during and after the initial load.

As the data is so crucial to the business, and change data will be continually applied for a long time, mechanisms that verify that the data is delivered correctly are an important aspect of any AWS cloud migration.

Migration Monitoring Demo

In this migration monitoring demo, we will show how, by collecting change data from source and target and matching the transactions applied to each in real time, you can ensure your cloud database is completely synchronized with the on-premises database, and detect any data divergence during the migration.

AWS Cloud Migration Monitoring with Striim

Key Challenges

The key challenges with monitoring AWS cloud migration include:

  • Enabling data migration without a production outage, with monitoring during and after the migration.
  • Detecting out-of-sync data immediately, should any divergence occur, to prevent further data corruption.
  • Running the monitoring solution non-intrusively, with low overhead, while obtaining sufficient information to enable fast resynchronization.

In our scenario, we are monitoring the migration of an on-premises application to AWS. A Striim dashboard shows real-time status, complete with alerts, and is powered by a continuously running data pipeline. The on-premises application uses an Oracle database and cannot be stopped. The database transactions are continually replicated to an Amazon Aurora MySQL database. The underlying migration solution could be either Striim’s migration solution or another solution, such as AWS DMS.

The objective is to monitor the ongoing migration of transactions and alert when any transactions go out of sync, indicating a potential data discrepancy. This is achieved in the Striim platform through its continuous query processing layer. Transactions are continuously collected from the source and target databases in real time and matched within a time window. If a matching transaction does not occur within a set period, the transaction is considered long-running. If no match occurs within an additional period, the transaction is considered missing. Alerts are generated in both cases.
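
In heavily simplified form, that matching logic could be sketched as two time windows plus a continuous query that flags source transactions with no counterpart on the target. The window syntax, names, and five-minute threshold below are illustrative assumptions, not the demo’s actual application.

-- Illustrative sketch of the matching logic (syntax and thresholds are
-- assumptions). Hold recent transactions from each side for the allowed
-- propagation period, keyed by transaction ID.
CREATE WINDOW SourceTxns OVER SourceTxnStream
KEEP WITHIN 5 MINUTE PARTITION BY txnId;

CREATE WINDOW TargetTxns OVER TargetTxnStream
KEEP WITHIN 5 MINUTE PARTITION BY txnId;

-- Flag source transactions with no matching target transaction in the
-- window; downstream logic raises the missing/long-running alerts.
CREATE CQ DetectUnmatchedTxns
INSERT INTO UnmatchedTxnStream
SELECT s.txnId, s.tableName, s.commitTime
FROM SourceTxns s
LEFT OUTER JOIN TargetTxns t
  ON s.txnId = t.txnId
WHERE t.txnId IS NULL;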

Results

The numbers of alerts for missing transactions and long-running transactions are displayed in the dashboard. Transaction rates and operation activity are also available in the dashboard and can be displayed for all tables, or for critical tables and users.

You can immediately see live updates and alerts when transactions do not get propagated to the target within a user-configured window, with long-running transactions that eventually make it to the target also tracked.

The dashboard is user-customizable, making it easy to add additional visualizations for specific monitoring as necessary.

You have seen how Striim can be used for continuous monitoring of your on-premises-to-AWS cloud migration. For more information, visit our AWS solution page, schedule a demo with a Striim technologist, or get started immediately using a download from our website, or via the AWS marketplace.

Real-Time Data Visualization and Data Exploration

When business operations run at lightning speed, generating large data volumes, and operational complexity abounds, real-time data visualization and data exploration become increasingly critical for managing daily operations. Striim enables businesses to access, analyze, visualize, and explore live operational data to understand their “Now” and take control of business operations.

Real-Time, Comprehensive Insight Made Easy

By combining real-time data integration, streaming analytics, and rich data visualization in a single, enterprise-grade platform, Striim allows businesses to respond to business trends and emerging issues proactively and with full context. With Striim, users not only have up-to-the-second visibility into all corners of the business with advanced custom metrics, but also the flexibility to explore streaming data without needing to write code.

Create Sophisticated Metrics Easily

Unlike packaged solutions with fixed, generic metrics, Striim’s software platform gives businesses the flexibility to gain fast, deep insight using business-specific metrics. By ingesting, filtering, aggregating, transforming, enriching, and analyzing real-time data from virtually any source, it enables custom metrics built on all relevant data, with the ability to slice and dice those metrics across a wide range of dimensions for fast insight. A comprehensive set of built-in SQL operations and functions – such as math, statistics, date, spatial, and string functions – along with customizable jumping and sliding time windows, provides the granular and precise metric definitions that deliver accurate performance assessment.
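
As a small, hypothetical example of such a metric, the sketch below computes per-region order statistics over a sliding five-minute window. The names and exact window syntax are assumptions.

-- Illustrative sketch of a business-specific metric (names and window
-- syntax are assumptions). Keep the last five minutes of order events,
-- sliding as new events arrive.
CREATE WINDOW RecentOrders OVER OrdersStream
KEEP WITHIN 5 MINUTE;

-- Continuously recompute per-region counts and averages; the output
-- stream can back a live dashboard chart.
CREATE CQ RegionalOrderMetrics
INSERT INTO OrderMetricsStream
SELECT region,
       COUNT(*)        AS orderCount,
       AVG(orderTotal) AS avgOrderValue,
       MAX(orderTotal) AS maxOrderValue
FROM RecentOrders
GROUP BY region;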

Gain Real-Time and Flexible Visibility into Operations

By combining streaming integration and analytics capabilities with in-memory processing, Striim updates all metrics in real time as new data streams in from various sources, and stores historical data within the built-in results store for time-based comparisons.

Via the dashboards, users can compare live data to historical averages or to a specific date and time in the past, without having to write code. Real-time, interactive dashboards allow business users to view live data with detailed field and time-based filtering at the page or chart level. In addition, users can search streaming data directly on the dashboard and drill down to detail pages.

Key Platform Features for Real-Time Data Visualization and Data Exploration

Striim offers an end-to-end, enterprise-grade platform to deliver instant insights from high-volume, high-velocity data. Some of the key features for real-time data visualization and data exploration are as follows:

  • Real-time data ingestion from diverse sources: Ingests, processes, and enriches unstructured, semi-structured, and structured data from databases, log files, message queues, and sensors
  • Multi-source stream processing and analytics: Performs SQL-based continuous processing on multiple streams of live data including enrichment with static and streaming reference data
  • Flexible time windows: Offers time-based, event-based, and session-based windowing
  • Interactive, live dashboards: Delivers push-based visualization with automatic refresh
  • Rewinding: Enables viewing and comparing historical data via the UI
  • Search: Offers keyword search on live, streaming data
  • Field and time-based filtering: Allows filtering and comparing each chart by different dimensions
  • Page and chart level filtering: Gives the flexibility to apply filters at the chart or page level
  • Embedding into custom websites: Striim charts can be embedded into any HTML5 page via iFrame along with filtering and search capabilities.

Deploy and Modify Easily as Business Needs Change

Businesses can quickly gain real-time visibility into their operations via Striim’s intuitive UI without any coding. Using Striim’s simple yet powerful streaming SQL engine, Striim applications can ingest millions of data points per second and create visualization-specific aggregates. Striim’s GUI and SQL-based language make it easy to correlate live, streaming data with historical aggregates.

Data Visualization and Data Exploration
Striim offers an intuitive UI to easily set up data flows and correlate historical data with streaming data

Within seconds of establishing data sources and flows, users can create dashboards to view live data, and modify the dashboards and charts as needed to meet ever-changing business needs. Visualizations can use a variety of chart types, such as line, area, column, maps, heat maps, and tables. Dashboards can contain multiple pages, with in-page filtering and drill-down available for deeper understanding of operational metrics.

Striim’s charts can be embedded into any custom dashboard or web page to support broad collaboration and distribution of real-time insights. Striim issues real-time alerts based on custom thresholds, and can trigger workflows to enable timely action.

Benefits of Data Exploration with Striim

Using Striim for live operational dashboards and streaming data exploration, businesses gain several competitive advantages including:

  • Real-time, granular, and comprehensive insights with business-specific metrics
  • Correlation of real-time and historical data to detect deviations immediately
  • Rapid iteration of the dashboards and data flows as business needs change
  • Proactive response to emerging trends based on in-time, in-context insights
  • The ability to easily meet strict SLAs and improve customer experience

Striim enables businesses to accurately track operational performance with the right metrics, in real time, so they can course-correct fast, with full confidence.

To learn more about Striim’s real-time data visualization and data exploration capabilities, visit our Creating and Monitoring Operational Metrics solutions page, schedule a demo with a Striim technologist, or download a free trial of the platform and try it for yourself!

The Inevitable Evolution from Batch ETL to Real-Time ETL (Part 1 of 2)

Traditional extract, transform, load (ETL) solutions have, by necessity, evolved into real-time ETL solutions as digital businesses have increased both the speed of executing transactions and the need to share larger volumes of data across systems faster. In this two-part blog post series, I will describe the transition from traditional ETL to streaming, real-time ETL and how that shift benefits today’s data-driven organizations.

Data integration has been the cornerstone of digital innovation for the last several decades, enabling the movement and processing of data across the enterprise to support data-driven decision making. In decades past, when businesses collected and shared data primarily for strategic decision making, batch-based ETL solutions served these organizations well. A traditional ETL solution extracts data from databases (typically at the end of the day), transforms the data extensively on disk in a middle-tier server into a consumable form for analytics, and then loads it in batch to a target data warehouse with a significantly different schema to enable various reporting and analytics solutions.

As consumers demanded faster transaction processing, personalized experience, and self-service with up-to-date data access, the data integration approach had to adapt to collect and distribute data to customer-facing applications and analytical applications more efficiently and with lower latency. In response, two decades ago, logical data replication with change data capture (CDC) capabilities emerged. CDC moves only the change data in real time, as opposed to all available data as a snapshot, and delivers data to various databases.

These “new” technologies enabled businesses to create real-time replicas of their databases to support customer applications, migrate databases without downtime, and allow real-time operational decision making. Because CDC was not designed for extensive transformations of the data, logical replication and CDC tools led to an “extract, load, and transform” (ELT) approach, in which significant transformation and enrichment are required on the target system to put the data in the desired form for analytical processing. Many of the original logical replication offerings are also architected to run single processes on one node, which creates a single point of failure and requires an orchestration layer to achieve true high availability.

The next wave of change came as analytical solutions shifted from traditional on-premises data warehousing on relational databases to Hadoop and NoSQL environments and Kafka-based streaming data platforms, deployed heavily in the cloud. Traditional ETL now had to evolve further into a real-time ETL solution that works seamlessly with data platforms both on-premises and in the cloud, combining the robust transformation and enrichment capabilities of traditional ETL with the low-latency data capture and distribution capabilities of logical replication and CDC.

In Part 2 of this blog post, I will discuss these real-time ETL solutions in more detail, particularly focusing on Striim’s streaming data integration software which moves data across cloud and on-premises environments with in-memory stream processing before delivering data in milliseconds to target data platforms. In the meantime, please check out our product page to learn more about Striim’s real-time ETL capabilities.

Feel free to schedule a technical demo with one of our lead technologists, or download or provision Striim for free to experience its broad range of capabilities first-hand.

Streaming Data Integration for Hadoop

With Striim’s streaming data integration for Hadoop, you can easily feed your Hadoop and NoSQL solutions continuously with real-time, pre-processed data from enterprise databases, log files, messaging systems, and sensors to support operational intelligence.

Ingest Real-time, Pre-Processed Data for Operational Intelligence

Striim is a software product that continuously moves real-time data from a wide range of sources into Hadoop, Kafka, relational and NoSQL databases — on-prem or in the cloud — with in-line transformation and enrichment capabilities. Brought to you by the core team behind GoldenGate Software, Striim offers a non-intrusive, quick-to-deploy solution for streaming integration so your Hadoop solution can support a broader set of operational use cases.

With the following capabilities, Striim’s streaming data integration for Hadoop enables a smart data architecture that supports use-case-driven analytics in enterprise data lakes:

  • Ingests large volumes of real-time data from databases, log files, message systems, and sensors
  • Collects change data non-intrusively from enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, MariaDB, Amazon RDS
  • Delivers data in milliseconds to Hadoop (HDFS, HBase, Hive, Kudu), Kafka, Cassandra, MongoDB, relational databases, cloud environments, and other targets
  • Supports mission-critical environments with end-to-end security, reliability, HA, and scalability

Benefits

  • Uses low-latency data for operational use cases
  • Accelerates time to insight with a continuous flow of transformed data
  • Ensures scalability, security, and reliability for business-critical solutions
  • Achieves fast time-to-market with wizards-based UI and SQL-based language

Key Features

  • Enterprise-grade and fast-to-deploy streaming integration for Hadoop
  • Real-time integration of structured and unstructured data
  • In-flight filtering, aggregation, transformation, and enrichment
  • Continuous ingestion and processing, at scale
  • Integration with existing technologies and open source solutions

Striim enables businesses to get the maximum value from high-velocity, high-volume data by delivering it to Hadoop environments in real-time and in the right format for operational use cases.

Real-time, Low-impact Change Data Capture

Striim ingests real-time data from transactional databases, log files, message queues, and sensors. For enterprise databases, including Oracle, Microsoft SQL Server, MySQL, and HPE NonStop, Striim offers a non-intrusive change data capture (CDC) feature to ensure real-time data integration has minimal impact on source systems and optimizes the network utilization by moving only the change data.

In-Flight Data Processing

As data volumes continue to grow, the ability to filter and aggregate data before analytics becomes a key way to manage limited storage resources. Striim enables in-flight data filtering and aggregation before delivery to Hadoop, reducing the data storage footprint. By performing in-line transformations (such as denormalization) and enrichment with static or dynamically changing data in memory, Striim feeds large data volumes in the right format without introducing latency.

Enterprise-grade Solution

Striim is designed to meet the needs of mission-critical environments with end-to-end security and reliability, including out-of-the-box exactly-once processing, plus high performance and scalability. Users can focus on the application logic knowing that, from ingestion to alerting and delivery, the platform is built to support the business as required.

Fast Time to Market

An intuitive development experience, with a drag-and-drop UI and prebuilt data flows for multiple Hadoop targets from popular sources, allows fast deployment. Striim uses an SQL-based language that requires no special skills to develop or modify streaming applications.

Operationalizing Machine Learning

Striim can pre-process and extract features suitable for machine learning before continually delivering training files to Hadoop. Once data scientists build their models using Hadoop technologies, these can be brought into Striim, using the new open processor component, so real-time insights can guide operational decision making and truly transform the business. Striim can also monitor model fitness and trigger retraining of models for full automation.

Differences from ETL

Compared to traditional ETL offerings that use bulk data extracts, Striim enables continuous ingestion of structured, semi-structured, and unstructured data in real time, delivering granular data flows for richer analytics. By performing in-memory transformations on data-in-motion using SQL-based continuous queries, Striim avoids adding latency and enables real-time delivery. While ETL solutions are optimized for database sources and targets, Striim provides native integration and optimized delivery for Hadoop, Kafka, databases, and files, on-prem or in the cloud. Striim also offers streaming analytics and data visualization capabilities within the same platform, without requiring additional licenses.

To learn more about streaming data integration for Hadoop, visit our Hadoop and NoSQL Integration solution page, schedule a demo with a Striim expert, or download the Striim platform to get started!

Move Real-Time Data to Cloudera Using the Striim Platform

In this blog post, we’re going to take a look at how you can use the Striim platform to move real-time data to Cloudera from a variety of sources.

The Striim platform provides an enterprise-grade streaming integration solution for moving real-time change data from a wide variety of sources to Cloudera distributions of Apache Kafka, Apache Kudu, and Apache Hadoop, without impacting source systems. With support for hybrid IT infrastructures, Striim complements Cloudera solutions by enabling organizations to use the full breadth and depth of their data in real time to gain a complete, up-to-date view of their operations.

Benefits

  • Ingest real-time data into CDK (Kafka), Kudu, Hadoop with low impact
  • Continuously collect data from databases, logs, messaging, sensors, and more
  • Process data in-flight without extensive coding
  • Get immediate insights and alerts
  • Use low-latency data in Cloudera for operational decision making

Why Striim?

  • Real-time data integration from a wide variety of data sources
  • Designed for high-volume, high-velocity data
  • Non-intrusive CDC from databases with event guarantees
  • Built-in security, scalability, and reliability
  • In-flight enrichment via built-in cache
  • Quick to deploy via SQL-like queries and wizards-based UI

Non-intrusive, Real-time Data Ingestion

The Striim platform continuously ingests real-time data from a variety of sources out-of-the-box – including databases, cloud applications, files, message queues, and devices – on-premises or in the cloud. For enterprise databases such as Oracle, SQL Server, MySQL, HPE NonStop, and MariaDB, the platform offers non-intrusive change data capture (CDC) to minimize the impact on source systems. Striim supports major data formats, including JSON, XML, Avro, delimited, binary, free text, and change records.

With a drag-and-drop UI and wizards, Striim simplifies creating data flows from popular sources to move data to Cloudera solutions including CDK (Kafka), Hadoop, HBase, Hive, and Kudu. The data can be delivered “as-is,” or be put through a series of in-flight transformations and enrichments. By using real-time, pre-processed data – especially in Kudu, Impala, and Kafka – customers can rapidly gain timely, operational intelligence from their Cloudera applications.

Delivery to Cloudera, On-premises or Cloud

The Striim platform can continuously apply pre-processed, streaming data to Cloudera solutions with sub-second latency. With parallelization capabilities, Striim offers optimized loading to Cloudera solutions. Striim can also deliver real-time data to other targets such as databases and files.

Built-in Stream Processing and Monitoring

Through SQL-based continuous queries, the Striim platform filters, aggregates, transforms, joins, and enriches multiple streams of real-time data in-memory to rapidly prepare the data for different downstream users before delivering to Cloudera environments. 
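
For example, delivering a processed stream to a Cloudera-managed Kafka topic might look roughly like the sketch below. The writer version, broker address, and property names are assumptions.

-- Illustrative sketch: deliver an enriched stream to a Kafka topic in
-- a Cloudera cluster, formatted as JSON (property values are
-- assumptions for illustration).
CREATE TARGET ClouderaKafka USING KafkaWriter VERSION '0.11.0' (
  brokerAddress: 'cdh-broker:9092',
  Topic: 'enriched_orders'
)
FORMAT USING JSONFormatter ()
INPUT FROM EnrichedOrdersStream;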

Striim also comes with built-in validation and monitoring capabilities. The platform enables users to continuously monitor the health of the data pipelines via real-time dashboards and alerts.

Enterprise-grade Modern Streaming Integration

Striim is designed from the ground up to support high-volume, high-velocity data, with built-in validation, security, high availability, reliability, and scalability to support mission-critical applications.

Unlike traditional ETL solutions, Striim continuously ingests granular and larger data sets for richer analytics. It does so without impacting source systems, and processes the data in-memory, while it is streaming, to enable sub-second latency. Striim also differs from traditional logical replication tools with its optimized support for a wide range of data types, data sources, and targets, and its out-of-the-box comprehensive stream processing capabilities.

To learn more about how you can utilize the Striim platform to move data to Cloudera, please reach out to schedule a demo with a Striim expert or download the platform and try it for yourself.
