We host Arpit Choudhury – well known for his work building data communities such as Astorik.com – to talk about the ‘modern data divide’ and how to overcome friction between data people and non-data people. Arpit also discusses the value of ‘all-in-one’ tools versus a multivariate modern data stack. Follow Arpit Choudhury on LinkedIn and check out his community of data practitioners at Astorik.com.
Tutorial
Oracle Change Data Capture to Databricks
Benefits
- Migrate your database data and schemas to Databricks in minutes.
- Stream operational data from Oracle to your data lake in real time.
- Automatically keep schemas and models in sync with your operational database.
We will go over two ways to create smart pipelines that stream data from Oracle to Databricks. Striim also offers streaming integration to Databricks Delta Lake from popular databases such as PostgreSQL, SQL Server, MongoDB, and MySQL, and from applications such as Salesforce.
In the first half of the demo, we focus on moving historical data for migration use cases, which are becoming more and more common as users move from traditional on-premises systems to cloud-hosted services.
Striim is also proud to offer the industry’s fastest and most scalable Oracle change data capture to address the most critical use cases.

Striim makes initial load, schema conversion, and change data capture a seamless experience for data engineers.
In a traditional pipeline approach, we would sometimes have to create the schema manually, either through code or by inferring it from a CSV file, and then configure the connectivity parameters for the source and target.
Striim reduces the time and manual effort needed to set up these connections, and it also creates the schema at the target with the help of a simple wizard.
Here we have a view of the Databricks homepage with no schema or table created in DBFS.
In the Striim UI, under the ‘Create app’ option, we can choose from templates offered for a wide array of data sources and targets.
With our most recent 4.1 release, we also support the Delta Lake adapter as a target data sink.

Part 1: Initial Load and Schema Creation
In this demo, we will go over how to move historical data from Oracle to Databricks Delta Lake.
- With the help of Striim’s intuitive wizard, we name the application, with the option to create multiple namespaces depending on our pipeline’s needs and requirements.
- First, we configure the source details for the Oracle database.

- We can validate our connection details.

- Next, we have the option to choose the specific schemas and tables that we want to move, giving us more flexibility than replicating the entire database or schema.


- Now we can configure our target, Delta Lake, which supports ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
- Striim can also migrate schemas as part of the wizard, which makes the process seamless and easy.

- The wizard takes care of validating the target connections, using the Oracle metadata to create the schema in the target, and initiating the historical data push to Delta Lake, completing the whole end-to-end operation in a fraction of the time it would take with traditional pipelines.

Once the schema is created, we can also verify it before we go ahead with the migration to Delta Lake.

- Striim’s unified data integration provides unprecedented speed and simplicity, as we have just seen in how simple it was to connect a source and a target.

If we want to make additional changes, such as adjusting the fetch size or providing a custom query, the second half of the demo highlights how we can apply those changes without the wizard.
- We can monitor the progress of the job with detailed metrics, which helps with data governance by ensuring data has been replicated appropriately.

Part 2: Change Data Capture
As part of our second demo, we will highlight Striim’s Change Data Capture (CDC), which helps drive digital transformation and enables true real-time analytics.
- Earlier we went through how to create a pipeline with the wizard; now we will look at how we can tune our pipeline without the wizard, using the intuitive drag-and-drop flow designer.

From the Striim dashboard, we can navigate the same way as before to create an application from scratch, or import a TQL file if we already have a pipeline defined.
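For reference, here is a minimal, illustrative TQL sketch of such a pipeline. All names, credentials, and connection values below are hypothetical, and the adapter and property names are indicative only; check the Striim documentation for your release before relying on them.

CREATE APPLICATION OracleToDeltaLake;

-- Log-based CDC source reading from Oracle (hypothetical connection values).
CREATE SOURCE OracleCDCSource USING OracleReader (
  Username: 'striim',
  Password: '********',
  ConnectionURL: 'localhost:1521:orcl',
  Tables: 'HR.PURCHASES'
)
OUTPUT TO OracleCDCStream;

-- Delta Lake target; the values mirror what this walkthrough gathers later
-- (the cluster JDBC URL, a personal access token, and a DBFS stage directory).
CREATE TARGET DeltaLakeTarget USING DeltaLakeWriter (
  ConnectionURL: '<jdbc-url-from-databricks-cluster>',
  PersonalAccessToken: '<token>',
  Tables: 'HR.PURCHASES,demo.purchases',
  StageLocation: '/striim/stage'
)
INPUT FROM OracleCDCStream;

END APPLICATION OracleToDeltaLake;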

- From the search bar, we can search for the Oracle CDC adapter. The UI is friendly, with an easy drag-and-drop approach.

- We can skip the wizard if we want and enter the connection parameters as before.

- In the additional parameters, we have the flexibility to make changes to the data we pull from the source.
Lastly, we can create an output stream that will connect to the data sink.
We can test and validate our connections even without deploying the app or pipeline.

- Once the source connection is established, we can connect a target component and select the Delta Lake adapter from the drop-down.

- Databricks has a unified design that bridges the gap between different types of users, including analysts, data scientists, and machine learning engineers.
From the Databricks dashboard, we can navigate to the Compute section to access the cluster’s connection parameters.

- Under the advanced settings, select the JDBC/ODBC settings to view the cluster’s Hostname and JDBC URL.

- Next, we can generate a personal access token that will be used to authenticate the user’s access to Databricks. From the user settings, we can click Generate a new token.

- After adding the required parameters, we can create the staging directory in DBFS using file system commands in a notebook.


- Next, we can deploy the app and start the flow to initiate CDC.

- We can refresh Databricks to view the CDC data. Striim also allows us to view the detailed metrics of a pipeline in real time.



Tools you need

Striim
Striim’s unified data integration and streaming platform connects clouds, data and applications.

Databricks
Databricks combines data warehouses and data lakes into a lakehouse architecture.

Oracle
Oracle is a multi-model relational database management system.

Delta Lake
Delta Lake is an open-source storage framework that supports building a lakehouse architecture.
Conclusion
Managing large-scale data is a challenge for every enterprise. Real-time, integrated data is a requirement to stay competitive, but modernizing your data architecture can be an overwhelming task.
Striim can handle the volume, complexity, and velocity of enterprise data by connecting legacy systems to modern cloud applications on a scalable platform. Our customers don’t have to pause operations to migrate data or juggle different tools for every data source—they simply connect legacy systems to newer cloud applications and get data streaming in a few clicks.
Seamless integrations. Near-perfect performance. Data up to the moment. That’s what embracing complexity without sacrificing performance looks like to an enterprise with a modern data stack.
Use cases
Integrating Striim’s CDC capabilities with Databricks makes it easy to rapidly expand the capabilities of a lakehouse with just a few clicks.
Striim’s additional components allow you not only to capture real-time data but also to apply transformations on the fly before it even lands in the staging zone, reducing the amount of data cleansing required.
Striim’s wide array of event transformers makes handling any type of sensitive data as seamless as possible, allowing users to maintain compliance norms at various levels.
The result is high-quality data in Databricks, which can then be transformed via Spark code and loaded into newer Databricks services such as Delta Live Tables.
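As a sketch of the kind of in-flight transformation described above, the TQL continuous query (CQ) below forwards only non-sensitive fields and filters rows before events ever reach the target. The stream and field names are hypothetical.

-- PurchasesStream is a hypothetical typed stream with fields
-- orderId, cardNumber, and amount. The CQ projects away cardNumber,
-- so the sensitive value never lands in the staging zone.
CREATE CQ DropSensitiveFields
INSERT INTO CleanPurchases
SELECT orderId, amount
FROM PurchasesStream
WHERE amount > 0;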
Tutorial
Migrating from MySQL to BigQuery for Real-Time Data Analytics
How to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC)
Benefits
- Operational Analytics: Analyze your data in real time without impacting the performance of your operational database.
- Act in Real Time: Predict, automate, and react to business events as they happen, not minutes or hours later.
- Empower Your Teams: Give teams across your organization a real-time view into operational data.
Overview
In this post, we will walk through an example of how to replicate and synchronize your data from on-premises MySQL to BigQuery using change data capture (CDC).
Data warehouses have traditionally been on-premises services that required data to be transferred using batch load methods. Ingesting, storing, and manipulating data with cloud data services like Google BigQuery makes the whole process easier and more cost effective, provided that you can get your data in efficiently.
Striim’s real-time data integration platform allows you to move data in real time, as changes are recorded, using a technology called change data capture. This allows you to build real-time analytics and machine learning capabilities from your on-premises datasets with minimal impact.
Step 1: Source MySQL Database
Before you set up the Striim platform to synchronize your data from MySQL to BigQuery, let’s take a look at the source database and prepare the corresponding database structure in BigQuery. For this example, I am using a local MySQL database with a simple purchases table to simulate a financial datastore that we want to ingest from MySQL to BigQuery for analytics and reporting.
I’ve loaded a number of initial records into this table and have a script to apply additional records once Striim has been configured to show how it picks up the changes automatically in real time.
Step 2: Targeting Google BigQuery
You also need to make sure your instance of BigQuery has been set up to mirror the source or the on-premises data structure. There are a few ways to do this, but because you are using a small table structure, you are going to set this up using the Google Cloud Console interface. Open the Google Cloud Console, and select a project, or create a new one. You can now select BigQuery from the available cloud services. Create a new dataset to hold the incoming data from the MySQL database.
Once the dataset has been created, you also need to create a table structure. Striim can perform the transformations while the data flows through the synchronization process. However, to make things a little easier here, I have replicated the same structure as the on-premises data source.
You will also need a service account to allow your Striim application to access BigQuery. Open the service account option through the IAM window in the Google Cloud Console and create a new service account. Give the necessary permissions for the service account by assigning BigQuery Owner and Admin roles and download the service account key to a JSON file.
Step 3: Set Up the Striim Application
Now you have your data in a table in the on-premises MySQL database and have a corresponding empty table with the same fields in BigQuery. Let’s now set up a Striim application on Google Cloud Platform for the migration service.
Open your Google Cloud Console and open or start a new project. Go to the marketplace and search for Striim. A number of options should return, but the option you are after is the first item that allows integration of real-time data to Google Cloud services.
Select this option and start the deployment process. For this tutorial, you are just using the defaults for the Striim server. In production, you would need to size appropriately depending on your load.
Click the deploy button at the bottom of this screen and start the deployment process.
Once this deployment has finished, the details of the server and the Striim application will be generated.
Before you open the admin site, you will need to add a few files to the Striim Virtual Machine. Open the SSH console to the machine and copy the JSON file with the service account key to a location Striim can access. I used /opt/striim/conf/servicekey.json.
Give the service account key file the right permissions by running the following commands:
chown striim:striim /opt/striim/conf/servicekey.json
chmod 770 /opt/striim/conf/servicekey.json
You also need to restart the Striim services for these changes to take effect. The easiest way to do this is to restart the VM.
Once this is done, close the shell and click on the Visit The Site button to open the Striim admin portal.
Before you can use Striim, you will need to configure some basic details. Register your details and enter in the Cluster name (I used “DemoCluster”) and password, as well as an admin password. Leave the license field blank to get a trial license if you don’t have a license, then wait for the installation to finish.
When you get to the home screen for Striim, you will see three options. Let’s start by creating an app to connect your on-premises database with BigQuery to perform the initial load of data. To create this application, you will need to start from scratch from the applications area. Give your application a name and you will be presented with a blank canvas.
The first step is to read data from MySQL, so drag a database reader from the sources tab on the left. Double-click on the database reader to set the connection string with a JDBC-style URL using the template:
jdbc:mysql://<host>:<port>/<database>
You must also specify the tables to synchronize — for this example, purchases — as this allows you to restrict what is synchronized.
Finally, create a new output. I called mine PurchasesDataStream.
You also need to connect your BigQuery instance to your source. Drag a BigQuery writer from the targets tab on the left. Double-click on the writer and select the input stream from the previous step and specify the location of the service account key. Finally, map the source and target tables together using the form:
<source database>.<source table>,<target dataset>.<target table>
For this use case, this is just a single table on each side.
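Equivalently, the whole initial-load application can be expressed as a TQL file. The sketch below reuses the settings from this walkthrough; names in angle brackets are placeholders, and the property names are indicative, so verify them against the Striim documentation.

CREATE APPLICATION MySQLInitialLoad;

-- Bulk reader for the one-time initial load (not CDC).
CREATE SOURCE PurchasesDB USING DatabaseReader (
  ConnectionURL: 'jdbc:mysql://<host>:<port>/<database>',
  Username: 'striim',
  Password: '********',
  Tables: '<database>.purchases'
)
OUTPUT TO PurchasesDataStream;

-- BigQuery target using the service account key copied to the VM earlier.
CREATE TARGET PurchasesToBigQuery USING BigQueryWriter (
  ServiceAccountKey: '/opt/striim/conf/servicekey.json',
  ProjectId: '<gcp-project-id>',
  Tables: '<database>.purchases,<dataset>.purchases'
)
INPUT FROM PurchasesDataStream;

END APPLICATION MySQLInitialLoad;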
Once both the source and target connectors have been configured, deploy and start the application to begin the initial load process. Once the application is deployed and running, you can use the monitor menu option on the top left of the screen to watch the progress.
Because this example contains a small data load, the initial load application finishes pretty quickly. You can now stop this initial load application and move on to the synchronization.
Step 4: Updating BigQuery with Change Data Capture
Striim has pushed your current database up into BigQuery, but ideally you want to update this every time the on-premises database changes. This is where the change data capture application comes into play.
Go back to the applications screen in Striim and create a new application from a template. Find and select the MySQL CDC to BigQuery option.
Like the first application, you need to configure the details for your on-premises MySQL source. Use the same basic settings as before. However, this time the wizard adds the JDBC component to the connection URL.
When you click Next, Striim will ensure that it can connect to the local source. Striim will retrieve all the tables from the source. Select the tables you want to sync. For this example, it’s just the purchases table.
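In TQL terms, the only structural difference from the initial-load sketch above is the source: the bulk DatabaseReader is swapped for the log-based MySQL CDC reader. A hedged sketch, with adapter and property names indicative:

-- CDC source reading the MySQL binary log. Per the wizard behavior noted
-- above, the jdbc: component of the URL may be added for you; adjust to
-- your Striim release.
CREATE SOURCE PurchasesCDC USING MysqlReader (
  ConnectionURL: 'mysql://<host>:3306',
  Username: 'striim',
  Password: '********',
  Tables: '<database>.purchases'
)
OUTPUT TO PurchasesCDCStream;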
Once the local tables are mapped, you need to connect to the BigQuery target. Again, you can use the same settings as before by specifying the same service key JSON file, table mapping, and GCP Project ID.
Once the setup of the application is complete, you can deploy and turn on the synchronization application. This will monitor the on-premises database for any changes, then synchronize them into BigQuery.
Let’s see this in action by clicking on the monitor button again and loading some data into your on-premises database. As the data loads, you will see the transactions being processed by Striim.
Next Step
As you can see, Striim makes it easy for you to synchronize your on-premises data from existing databases, such as MySQL, to BigQuery. By constantly moving your data into BigQuery, you could now start building analytics or machine learning models on top, all with minimal impact to your current systems. You could also start ingesting and normalizing more datasets with Striim to fully take advantage of your data when combined with the power of BigQuery.
To learn more about Striim for Google BigQuery, check out the related product page. Striim is not limited to MySQL to BigQuery integration, and supports many different sources and targets. To see how Striim can help with your move to cloud-based services, schedule a demo with a Striim technologist or download a free trial of the platform.
Tools you need

Striim
Striim’s unified data integration and streaming platform connects clouds, data and applications.

MySQL
MySQL is an open-source relational database management system.

BigQuery
Google BigQuery is a serverless, highly scalable cloud data warehouse for fast SQL analytics.
Snowflake Summit & Data In a Recession with Matt Turck
Matt Turck from FirstMark Capital joins us on a live episode of ‘What’s New In Data’ from Snowflake Summit. We recap the summit, talk about the state of the data industry, and look ahead to how a potential recession will play a role.
Unlock Insights to Your Data with Azure Synapse Link
Edward Bell from Striim and Mahesh Prakriya from Microsoft demonstrate the value of Striim and Azure Synapse for Oracle and Salesforce users.
How to Choose the Right Change Data Capture Solution
An introduction to CDC (and pros and cons of different CDC methods) | 5 reasons organizations need CDC | Key features to consider in your CDC solution
Introducing Striim Platform 4.1
Striim is pleased to introduce a broad set of enhancements to our on-premises and cloud marketplace offerings that add new sources and targets, provide increased manageability, and further enhance performance.
Kafka Stream Processing with Striim
Apache Kafka has proven itself as a fast, scalable, fault-tolerant messaging system, chosen by many leading organizations as the standard for moving data around in a reliable way.
However, Kafka was created by developers, for developers. This means that you’ll need a team of developers to build, deploy, and maintain any stream processing or analytics applications that use Kafka.
Striim is designed to make it easy to get the most out of Kafka, so you can create business solutions without writing Java code. Striim simplifies and enhances Kafka stream processing by providing:
- Continuous ingestion into Kafka and a range of other targets from a wide variety of sources (including Kafka) via built-in connectors
- UI for data formatting
- In-memory, SQL-based stream processing for Kafka
- Multi-thread delivery for better performance
- Enterprise-grade Kafka applications with built-in high availability, scalability, recovery, failover, security, and exactly-once processing guarantees
5 Key Areas Where Striim Simplifies and Enhances Kafka Stream Processing
1. Ingestion from a wide range of data sources with Change Data Capture support
Striim has over 150 out-of-the-box connectors to ingest real-time data from a variety of sources, including databases, files, message queues, and devices. It also provides wizards that automate the development of data flows from popular sources, such as MySQL, Oracle, and SQL Server, to Kafka. Striim can also read from Kafka as a source.
Striim uses change data capture (CDC) — a modern replication mechanism — to track changes from a database for Kafka. This can help Kafka to receive real-time updates of database operations (e.g., inserts, updates).
2. UI for data formatting
Kafka handles data at the byte level, so it doesn’t know the data format. However, Kafka consumers have varying requirements: they want data in JSON, structured XML, delimited formats (e.g., CSV), plain text, or other formats. Striim provides a UI, known as Flow Designer, with a drop-down menu that lets users customize data formats. This way, you don’t have to do any coding for data formatting.
3. TQL for flexible and fast in-memory queries
Once data has landed in Kafka, enterprises want to derive value out of that data. In 2014, Striim introduced its streaming SQL engine, TQL (Tungsten Query Language) for data engineers and business analysts to write SQL-style declarative queries over streaming data including data in Kafka topics. Users can access, manage, and manipulate data residing in Kafka with Striim’s TQL. In 2017, Confluent announced the release of KSQL, an open-source, streaming SQL engine that enables real-time data processing against Apache Kafka. However, there are some significant performance differences between TQL and KSQL.

In a benchmarking study using the TPC-H benchmark, TQL was observed to be 2–3 times faster than KSQL as measured by execution time. This is because Striim’s computation pipeline can run in memory, while KSQL relies on disk-based Kafka topics. In addition to speed, TQL offers additional features including:
- Windows: You cannot create attribute-based time windows with KSQL, and it doesn’t support writing multiple queries against the same window. TQL supports all forms of windows and lets you write multiple queries against the same window.
- Queries: KSQL comes with limited aggregate support, and you can’t use inner joins. TQL, meanwhile, supports all types of aggregate queries and joins, including inner joins (see the sketch after this list).
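To make the window features concrete, here is a hedged TQL sketch: one attribute-based, partitioned time window, with two independent queries reading from it. The stream and field names are hypothetical.

-- Five-minute window keyed on the event's own timestamp attribute.
CREATE WINDOW OrdersWindow
OVER OrdersStream
KEEP WITHIN 5 MINUTE ON orderTime
PARTITION BY productId;

-- Two separate queries over the same window.
CREATE CQ OrderTotals
INSERT INTO TotalsStream
SELECT productId, SUM(amount) AS total
FROM OrdersWindow
GROUP BY productId;

CREATE CQ OrderCounts
INSERT INTO CountsStream
SELECT productId, COUNT(*) AS orders
FROM OrdersWindow
GROUP BY productId;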
4. Multi-thread delivery for better performance
Striim has features that can improve performance while handling large amounts of data in real time. It uses multi-threaded delivery with automated thread management and data distribution. This is done through Kafka Writer in Striim, which can be used to write to topics in Kafka. When your target system struggles to keep up with incoming streams, you can use the Parallel Threads property in Kafka Writer to create multiple instances for better performance. This helps you to handle large volumes of data.
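As a sketch of what this looks like in TQL, the target below writes a stream to a Kafka topic as JSON and requests multiple parallel delivery threads. The broker address, topic, input stream, and thread count are hypothetical, and the property names should be checked against the Striim documentation.

-- Kafka target with multi-threaded delivery enabled.
CREATE TARGET PurchasesToKafka USING KafkaWriter (
  brokerAddress: 'localhost:9092',
  Topic: 'purchases',
  Mode: 'Sync',
  ParallelThreads: '4'
)
FORMAT USING JSONFormatter ()
INPUT FROM PurchasesStream;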
5. Support for mission-critical applications
Striim delivers built-in, exactly-once processing (E1P) in addition to the security, high availability and scalability required of an enterprise-grade solution. Using Striim’s Kafka Writer, if recovery is enabled, events are written in order with no duplicates (E1P). This means that in the event of cluster failure, Striim applications can be recovered with no loss of data.
Take Kafka to the Next Level: Try Striim
If you want to make the most of Kafka, you shouldn’t have to architect and build a massive infrastructure, nor should you need an army of developers to craft your required processing and analytics. Striim enables data scientists, business analysts, and other IT and data professionals to get the most value out of Kafka without having to learn and code to APIs.
See for yourself how Striim can help you take Kafka to the next level. Start a free trial today!
What Is Batch Processing? Understanding Key Differences Between Batch Processing vs Stream Processing
Before stream processing became essential for businesses, batch processing was the standard. Today, batch processing can feel outdated—can you imagine having to book a ride-share hours in advance or playing online multiplayer games with significant delays? What about trading stocks based on prices that are minutes or hours old?
Fortunately, stream processing has transformed how we handle real-time data, eliminating these inefficiencies. To fully grasp why stream processing is crucial for modern businesses, it’s important to first understand batch processing. In this guide, we’ll explore the fundamentals of batch processing, compare batch processing vs stream processing, and provide a clear batch processing definition for your reference.
Batch Processing Definition: What is Batch Processing?
Batch processing involves collecting data over time and processing it in large, discrete chunks, or “batches.” This data is moved at scheduled intervals or once a specific amount has been gathered. In a batch processing system, data is accumulated, stored, and processed in bulk, typically during off-peak hours to reduce system impact and optimize resource usage.
Batch processing still has various uses, including:
- Credit card transaction processing
- Maintaining an index of company files
- Processing electric consumption for billing purposes once monthly
“Batch will always have its place,” shares Benjamin Kennady, a Cloud Solutions Architect at Striim. “There are many situations and data sources where batch processing is the only technical option. This doesn’t negate the value that streaming can provide … but to say it’s outdated compared to streaming would be incorrect. Most organizations are going to require both.”
Batch processing, however, isn’t ideal for businesses that need to respond to real-time events—hence why its use cases are fairly limited. For immediate data handling, stream processing is the solution. Stream processing processes and transfers data as soon as it is collected, allowing businesses to act on current information without delay.
“There are many use cases where the current pipeline built using batch processing could be upgraded into a streaming process,” says Kennady. “Real time streaming unlocks potential use cases that aren’t available when using batch, but batch is relatively simpler to manage is one way to view the tradeoff.”
Batch Processing and Batch-Based Data Integration
When discussing batch processing, you’ll often hear the term batch-based data integration. While related, they differ slightly. Batch processing involves executing tasks on large volumes of data at scheduled intervals, such as generating reports or processing payroll. Batch-based data integration, however, specifically focuses on moving and consolidating data from various sources into a target system in batches. In short, batch-based data integration is a subset of batch processing, with its primary focus on unifying data across systems.
How does Batch Processing Work?
Logistically speaking, here’s how batch processing works.
1. Data collection occurs.
Batch processing begins with the collection of data over time from several sources. This data is stored in a staging area, and may include transactional records, logs, sensor data, inventory data, and more.
2. Batches are created.
Once a predefined quantity of data has been collected, it is assembled into a batch. A batch can also be created based on specific triggers, such as the end of a day’s transactions or reaching a certain data volume.
3. Batch processing occurs.
Your batches are processed as a single unit. Processing includes executing data transformation tasks, such as aggregations, calculations, and conversions, required to produce the final output.
4. Results are transferred and stored.
After processing, the results are typically stored in a database or data warehouse. The processed data may be used for reporting, analysis, or other business functions.
The most important thing to remember about this process is that it is performed only at scheduled intervals. Depending on your business requirements and data volume, you can determine if you’d like this to occur daily, weekly, monthly, or as necessary.
Let’s dive deeper and compare batch processing vs stream processing to get a clearer understanding of key differences.
Batch Processing vs Stream Processing: What’s the Difference?
While batch processing and stream processing aim to achieve the same result—data processing and analysis—the way they go about doing so differs tremendously.
Batch processing:
- Processes data in bulk: Data is collected over time and processed in large, discrete batches, often at scheduled intervals (e.g., hourly, daily, or weekly).
- Latency is higher: Since data is processed in batches, there is an inherent delay between when data is collected and when it is analyzed or acted upon. This makes it suitable for tasks where real-time response isn’t critical.
- Inefficient for real-time needs: While batch processing can handle large volumes of data, it delays action by processing data in bulk at scheduled times, making it unsuitable for businesses that need real-time insights. This lag can lead to outdated information and missed opportunities.
Batch processing isn’t inherently bad; it’s effective for tasks like large-scale data aggregation or historical reporting where real-time updates aren’t critical. However, stream processing is a better fit in certain scenarios. For example, technologies like Change Data Capture (CDC) capture real-time data changes, while stream processing immediately processes and analyzes those changes. This makes stream processing ideal for use cases such as operational analytics and customer-facing applications, where stale data can lead to missed insights or a poor user experience.

Stream processing, by contrast:
- Processes data in real-time: Stream processing continuously processes data as it’s collected, enabling immediate analysis and action. This capability is crucial for businesses that rely on up-to-the-minute insights to stay competitive, such as in fraud detection, stock trading, or personalized customer interactions.
- Low latency: Stream processing delivers results with minimal delay, providing businesses with real-time information to make timely and informed decisions. “Real time streaming and processing of data is most crucial for dynamic environments where low-latency data handling is required,” says Kennady. “This is vital for dynamic datasets that are continuously changing. Anywhere you have databases or datasets changing and you need a low latency replication solution is where you should consider a data streaming solution like Striim.” This speed is essential for applications where every second counts, ensuring rapid responses to critical events.
- Maximized system performance: While stream processing requires continuous system operation, this investment ensures that data is always up-to-date, empowering real-time decision-making and giving businesses a competitive edge in fast-paced industries. The always-on nature of stream processing ensures no opportunity is missed.
That being said, modern data streaming platforms, such as Striim, can still support batch processing should you choose to use it. “Batch still has its role in the modern world and Striim fully supports it via its initial load capabilities,” says Dmitriy Rudakov, Director of Solution Architecture at Striim.
Batch Processing Example
Let’s walk through a batch processing example using a bank. In a traditional banking setup, batch processing is often used to generate monthly credit card statements. It usually works like this:
- Data Accumulation: Throughout the month, the bank collects all credit card transactions from customers. These transactions include purchases, payments, and fees, which are stored in a staging area.
- Batch Processing: At the end of the month, the bank processes all collected transactions in one large batch. This involves calculating totals, applying interest rates, and preparing the statements for each customer.
- Statement Generation: After processing the batch, the bank generates and sends out the statements to customers.

Batch processing is well-suited for tasks like statement generation, where the process only needs to occur periodically, such as once a month. In this case, there’s no need for real-time updates, and the focus is on processing large volumes of data at scheduled intervals.
If we tried to use the same batch processing pipeline for a more operational use case like fraud detection, we’d face several challenges, including:
- Delayed Insights: Because transactions are processed in bulk at the end of the month, any discrepancies or issues, such as fraudulent charges, are only identified after the batch processing is complete. This delay means that customers or the bank may not detect and address issues until after they’ve had a significant impact.
- Missed Opportunities for Immediate Action: If a customer reports a suspicious transaction shortly after it occurs, the bank might not be able to take immediate action due to the delay inherent in batch processing. Real-time fraud detection and response are not possible, potentially allowing fraudulent activity to continue for weeks.
- Customer Dissatisfaction: Customers who experience issues with their transactions or statements must wait until the end of the month for resolution, leading to potential dissatisfaction and erosion of trust.
However, by leveraging stream processing instead, the bank gains the ability to analyze transactions as they occur, enabling real-time fraud detection, immediate customer notifications, and quicker resolution of issues. “In any use case where latency or speed is important, data engineers want to use streaming instead of batch processing,” shares Dmitriy Rudakov. “For example, if you have a bank withdrawal and simultaneously there’s an audit check or some other need to see an accurate account balance.”
This approach ensures that both the bank and its customers can respond to and manage transactions in real-time, avoiding the delays and missed opportunities associated with batch processing. Through this batch processing example, you see why stream processing is imperative for modern businesses to utilize.
Stream Processing and Real-Time Data Integration
Often when discussing stream processing, real-time data integration is also a key topic—similar to how batch processing and batch-based data integration go hand-in-hand. These two concepts are closely related and work together to provide immediate insights and ensure synchronized data across systems.
Stream processing involves the continuous analysis of data as it flows in, allowing businesses to respond to events and trends in real time. It handles data streams instantaneously to deliver up-to-the-minute information and actions. Stream processing platforms are essential for businesses aiming to harness real-time data effectively. According to Dmitriy Rudakov, “Striim supports real-time streaming from all popular data sources such as files, messaging, and databases. It also provides an SQL-like language that allows you to enhance your streaming pipelines with any transformations.”
Real-time data integration, on the other hand, ensures that the processed data is accurately and consistently updated across various systems and platforms. By integrating data in real-time, organizations synchronize their databases, applications, and data warehouses, ensuring that all components operate with the most current information. Together, stream processing and real-time data integration offer a unified approach to dynamic data management, significantly enhancing operational efficiency and decision-making capabilities.
Four Reasons You Need Real-Time Data Integration
Now that you understand why batch processing falls short for modern businesses seeking to gain real-time insights, respond swiftly to critical events, and optimize operational efficiency, it’s clear that adopting stream processing is essential for meeting these needs effectively. Here are four reasons real-time data integration is a must-have.

It enables quick, informed decision-making.
According to Statista, in July 2024, 67% of the global population were internet users, each producing ever-larger amounts of data. Real-time integration enables businesses to act on this information quickly.
Data from on-premises and cloud-based sources can easily be fed, in real time, into cloud-based analytics built on, for instance, Kafka or comparable cloud messaging services (such as Google Pub/Sub, AWS Kinesis, and Azure Event Hubs), Snowflake, or BigQuery, providing timely insights and allowing fast decision making.
The importance of speed can’t be overstated. Detecting and blocking fraudulent credit card usage requires matching payment details with a set of predefined parameters in real time. If, in this case, data processing took hours or even minutes, fraudsters could get away with stolen funds. But real-time data integration allows banks to collect and analyze information rapidly and cancel suspicious transactions.
Companies that ship their products also need to make decisions quickly. They require up-to-date information on inventory levels so that customers don’t order out-of-stock products. Real-time data integration prevents this problem because all departments have access to continuously updated information, and customers are notified about sold-out goods.
Cumulatively, the result is enhanced operational efficiency. By ensuring timely and accurate data, businesses can not only respond to immediate issues but also optimize their operations for improved service delivery and strategic decision-making.
It breaks down data silos.
When dealing with data silos, real-time data integration is crucial. It connects data from disparate sources—such as Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, Internet of Things (IoT) sensors, and log files—into a unified system with sub-second latency. This consolidation eliminates isolation, providing a comprehensive view of operations.
For example, in hospitals, real-time data integration links radiology units with other departments, ensuring that patient imaging data is instantly accessible to all relevant stakeholders. This improves visibility, enhances decision-making, and optimizes operational efficiency by breaking down data silos and delivering timely, accurate information.
It improves customer experience.
The best way to give customer experience a boost is by leveraging real-time data integration.
Your support reps can better serve customers by having data from various sources readily available. Agents with real-time access to purchase history, inventory levels, or account balances will delight customers with an up-to-the-minute understanding of their problems. Rapid data flows also allow companies to be creative with customer engagement. They can program their order management system to inform a CRM system to immediately engage customers who purchased products or services.
Better customer experiences translate into increased revenue, profits, and brand loyalty. Almost 75% of consumers say a good experience is critical for brand loyalty, while most businesses consider customer experience a competitive differentiator vital for their survival and growth.
It boosts productivity.
Spotting inefficiencies and taking corrective actions is crucial for modern companies. Having access to real-time data and continuously updated dashboards is essential for this purpose. Relying on periodically refreshed data can slow progress, causing delays in problem identification and leading to unnecessary costs and increased waste.
Optimizing business productivity hinges on the ability to collect, transfer, and analyze data in real time. Many companies recognize this need: according to an IBM study, 44% of businesses expect that rapid data access will lead to better-informed decisions.
Real-Time Data Integration Requires New Technology: Try Striim
Real-time data integration involves processing and transferring data as soon as it’s collected, utilizing advanced technologies such as Change Data Capture (CDC) and in-flight transformations. Luckily, Striim can help. Striim’s CDC tracks changes in a database’s logs, converting inserts, updates, and other events into a continuous data stream that updates a target database. This ensures that the most current data is always available for analysis and action. Transform-in-flight is another key Striim feature that enables data to be formatted and enriched as it moves through the system. This capability ensures that data is delivered in a ready-to-use format, incorporating inputs from various sources and preparing it for immediate processing.
Striim leverages these technologies to provide seamless real-time data integration. By capturing data changes and transforming data in-flight, Striim delivers accurate, up-to-date information that supports efficient decision-making and operational excellence. Ready to ditch batch processing and experience the difference of stream processing and real-time data integration? Book a demo today and see for yourself how Striim can fuel better decision-making, enhanced customer experience, and beyond.