Tutorial: Migrate and Replicate Data from MySQL to Snowflake Using Striim

In this article, we’ll go over how to use Striim to migrate and replicate data from an existing MySQL database to Snowflake.

What is Striim?

Striim is a next-generation cloud data integration product that offers change data capture (CDC), enabling continuous replication from popular databases (such as MySQL, Oracle, SQL Server, and PostgreSQL) to data warehouses, databases, messaging systems, and more.

In addition to CDC connectors, Striim has hundreds of automated adapters for file-based data (logs, XML, CSV), IoT data (OPC UA, MQTT), and applications such as Salesforce and SAP. Our SQL-based stream processing engine makes it easy to enrich and normalize data before it’s written to Snowflake.

In this tutorial, we’ll show you how to use Striim to migrate schemas and data from an existing MySQL database into Snowflake.

This tutorial uses the Sakila sample database to illustrate the steps.

Tutorial: Migrate and Replicate Data from MySQL to Snowflake

Prerequisites

The following are required to complete this tutorial:

  • Striim Account. You can sign up for a free trial account.
  • Snowflake Account. You can sign up for a free 30-day trial account to get started.
  • MySQL database that meets the following requirements:
    • Accessible over the Internet.
    • SHA256 password authentication enabled.
    • A user with full privileges to the database that Striim will authenticate as. Construct the username by concatenating the user’s name, the @ symbol, and Striim’s IP address (34.66.168.224). For example: john@34.66.168.224.

Note: Ensure you have the database’s name, hostname, port, username, and password as they will be required in the steps below to connect to your database.

  • At least one table in the MySQL database
  • (Optional) Sakila sample database.

Prepare a Snowflake Database and Launch Striim

Before migrating your data from MySQL, you must first create a database in Snowflake to store the migrated data. Once that database has been created, you can launch Striim as a Snowflake partner service directly from within Snowflake.

Follow the steps below to prepare a database and launch Striim in Snowflake:

  • Launch Snowflake in a web browser.
  • Click on Databases > Create:

Create Snowflake database

  • Enter a unique name for the database and click Finish:

Name database

  • Click on Partner Connect in the top right corner of the navigation bar.
  • Locate and click on Striim in the list of Snowflake partners. Note: you may need to first switch your user role to ACCOUNTADMIN in order to launch Striim from Snowflake:

Locate Striim Partner Connect

  • Activate the partner account if the account has not been previously activated:

Ready to activate partner account

  • Confirm that the database you created above is listed in Database(s) with USAGE privilege granted and click Connect.
  • Note: On subsequent launches after activation has been completed for the first time, Snowflake will just prompt you to launch:

Account previously created

Create a Striim App

In Striim, an app performs the migration from MySQL to a Snowflake database.

Follow the steps below to create a new Striim App:

  • Ensure you’re on the App screen in Striim and click Create App:

Create Striim app

  • Locate the Build using Wizards section on the Create a New App screen and select MySQL Database to Snowflake:

MySQL to snowflake wizard

  • Enter a unique name for your app and click Save:

Name Striim app

  • The data migration wizard is displayed:

Migration wizard displayed

Prepare for Data Migration to Snowflake

In this section you will configure your app to access your source MySQL database. As you proceed through Striim’s data migration wizard, Striim will validate that it can access and fetch the metadata and data of your source MySQL database.

  • Enter the host/port for your MySQL database and the login credentials for the Striim user who has been granted full privileges to your MySQL database:

Enter host port

  • Striim will verify that it can connect to your database and obtain metadata:

Verify connection

  • Click Next to advance to the Select Databases to move screen.
  • Select the databases to migrate from your MySQL database to Snowflake and click Next:

Select database to move

  • Striim will fetch and validate metadata for each table in your database:

Validate and fetch

  • Click Next to advance to the Select Tables screen. Navigate through each schema on the left-hand side, and select the table(s) to migrate:

Select tables to migrate

  • Click Next to complete the wizard. The target creation screen is displayed:

Target creation screen 1

Prepare Your Target and Migrate Your Data to Snowflake

Now that Striim can read from your source MySQL database, you must configure Striim to write to your target Snowflake database.

Follow the steps below to prepare a Snowflake target and start the migration process:

  • Enter a unique name for the target in the Target Name field on the Create Snowflake Target(s) screen.
  • Ensure Input From is set to the stream you created using the steps in the previous sections. Note that the name will be in the form of <your application name> + _OutputStream.
  • Prepare the URL of the target Snowflake database: copy the following URL into the Connection URL field, replacing YOUR_HOST with the base host assigned by Snowflake to your account and YOUR_DATABASE_NAME with the name of your database:

jdbc:snowflake://YOUR_HOST.snowflakecomputing.com/?db=YOUR_DATABASE_NAME&schema=public

For example, the following URL has a base host of xr86987.ca-central-1.aws and the database name set to SAKILA:

jdbc:snowflake://xr86987.ca-central-1.aws.snowflakecomputing.com/?db=SAKILA&schema=public

  • Enter your credentials corresponding to your Snowflake account into the Username and Password fields.
  • (Optional) Modify which tables to migrate by configuring the table name(s) listed in the Tables field. By default, the tables listed will be based on those specified in the steps from the previous section and include the % as a wildcard character:

Create target 2
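The Connection URL pattern from the steps above can be assembled with a small helper. This is a hypothetical Python sketch for illustration (Striim itself only needs the finished URL string); the host and database values are the examples from this tutorial:

```python
# Hypothetical helper for building the Snowflake JDBC connection URL
# used by the Striim target. Host/database values are the tutorial's examples.

def snowflake_jdbc_url(host: str, database: str, schema: str = "public") -> str:
    """Assemble a JDBC URL in the form the Snowflake target expects."""
    return (
        f"jdbc:snowflake://{host}.snowflakecomputing.com/"
        f"?db={database}&schema={schema}"
    )

url = snowflake_jdbc_url("xr86987.ca-central-1.aws", "SAKILA")
print(url)
# jdbc:snowflake://xr86987.ca-central-1.aws.snowflakecomputing.com/?db=SAKILA&schema=public
```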

Click Next. Striim will recreate the schema(s) in your Snowflake database:

Create Schema

Click Next after target creation is complete. Striim will begin migrating your data to Snowflake and will provide a detailed Application Progress popup showing how the migration is progressing:

monitor data flow

The amount of time required for migration will vary depending on how much data is being migrated from your source MySQL database and on your internet connection speed.

Conclusion: Start Your Free Trial

Our tutorial showed you how easy it is to migrate data from MySQL to Snowflake, a leading cloud data warehouse. Once your data has been migrated, Striim enables continuous, real-time updates via change data capture. For cases where changes continue to be made to the source database during migration, Striim enables zero-downtime, zero-data-loss migrations to Snowflake.

As always, feel free to reach out to our integration experts to schedule a demo, or try Striim for free here.

Data Transformation 101: The What, Why, and How

Ninety percent of the data generated daily is unstructured and almost meaningless to organizations. The continuous increase in the volume of data is a potential gold mine for businesses. Like gold, data has to be carefully mined and refined, or in this case, extracted and transformed to get the most value.

Businesses receive tremendous volumes of data from various sources to make better decisions. But raw data can be complex, challenging, and almost meaningless to the decision-makers in an organization. By transforming data, businesses can fully maximize the value of their data and use it to make more in-depth strategic decisions.

In this post, we’ll share an overview of data transformation, how it works, its benefits and challenges, and different data transformation approaches.

  1. What is Data Transformation?
  2. Why Transform Data?
  3. How Data Transformation Works
  4. Data Transformation Methods
  5. Data Transformation Use Cases
  6. Data Transformation Challenges
  7. Data Transformation Best Practices
  8. Use Data Transformation Tools Over Custom Coding

What is Data Transformation?

Data transformation is the process of converting data from a complex form into a more straightforward, usable format. Depending on the desired result, it can involve cleaning data, changing data types, removing duplicate records, integrating data from multiple sources, or replicating data.

Data transformation is an integral part of any data pipeline including ETL (Extract, Transform, and Load) and ELT (Extract, Load, Transform) pipelines. ETL involves extracting data from multiple sources, transforming it into a more intelligent structure, and loading or storing it in a data warehouse. In contrast, ELT shifts the bulk of the transformations to the destination data warehouse.

ETL vs ELT Infographic

Data transformation is essential to standardize data quality across an organization and create consistency in the type of data shared between systems in the organization.
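The ordering difference matters: in ETL, the transform runs before the load. The ETL sequence can be sketched in a few lines of plain Python (a toy example; the row shapes and the in-memory "warehouse" are purely illustrative):

```python
# Toy illustration of the ETL ordering described above: extract raw rows,
# transform them, and only then load them into the warehouse.

def extract():
    # Rows as they might arrive from a source system (messy strings)
    return [{"name": " Jane ", "amount": "100"}, {"name": "John", "amount": "200"}]

def transform(rows):
    # Normalize types and trim whitespace before loading
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

warehouse = []  # stand-in for the destination data warehouse

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))  # ETL: transform, then load
print(warehouse)
# [{'name': 'Jane', 'amount': 100}, {'name': 'John', 'amount': 200}]
```

In ELT, the same `transform` step would run inside the warehouse (for example via stored procedures or dbt) after the raw rows are loaded.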

Why Transform Data? (Benefits)

With data transformation, businesses can increase efficiencies, make better decisions, generate more revenue, and gain many other benefits, including:

  • Higher data quality: Businesses are very concerned with data quality because it is crucial for making accurate decisions. Data transformation activities like removing duplicate data and deleting null values can reduce inconsistencies and improve data quality.
  • Improved data management: Organizations rely on data transformation to handle the tremendous amounts of data generated from emerging technologies and new applications. By transforming data, organizations can simplify their data management and reduce the dreaded feeling of information overload.
  • Seamless data integration: It is normal for a business to run on more than one technology or software system. Some of these systems need to transfer data between one another. With data transformation, the data sent can be converted into a usable format for the receiving system, making data integration a seamless process.
  • Obfuscate sensitive data: In order to abide by GDPR, HIPAA, and other regulations, companies need to be able to mask or remove sensitive information such as PII or credit card numbers when transferring data from one system to another.
masking data with Striim
An example showing how Striim can be used to mask credit card numbers during data transfer from one system to another.
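A minimal sketch of the masking idea in plain Python (not Striim's implementation), assuming the policy is to keep only the last four digits:

```python
# Sketch of credit-card masking: strip separators, then replace everything
# but the last four digits with 'x'. Policy and format are illustrative.

def mask_card(number: str) -> str:
    digits = number.replace("-", "").replace(" ", "")
    return "x" * (len(digits) - 4) + digits[-4:]

print(mask_card("4111-1111-1111-1234"))  # xxxxxxxxxxxx1234
```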

How Data Transformation Works

When data is extracted from the source, it is raw and nearly impossible to use. The data transformation process involves identifying the data, structuring it, and generating a workflow that can be executed to convert the data. Sometimes, it is mandatory to clean the data first for easy identification and mapping.

The steps to transform raw data into an intelligent format include:

  1. Identify the data: This is the data discovery phase, and it involves identifying and understanding the data in its source/extracted format. Data discovery is usually best accomplished with the help of a data profiling tool. Here, you have an idea of what should be done to get the data into the desired format.
  2. Structure the data: This is the data mapping stage where the actual transformation process is planned. Here, you define how the fields in the data are connected and the type of transformation they will need. This stage is also where you consider if there would be any loss of data in the transformation process. For example, if we have a simple Excel spreadsheet with a date column in an incorrect format, we would want to make a ‘mapping’ to determine the type of transformation needed to change the date to a correct format.
  3. Generate a workflow: For transformation to occur, a workflow or code needs to be generated. You can write custom code or use a data transformation tool. Python and R are the most common languages for custom code, but it can be done in any language, including an Excel macro, depending on the transformation needs. When developing a transformation workflow, consider factors like scalability (will the transformation needs change over time?) and usability (will other people need this workflow?).
  4. Execute the workflow: Here, data is restructured and converted to the desired format.
  5. Verify the data: After transformation, it is best to check if the output is in the expected format. If it isn’t, review the generated workflow, make necessary changes, and try again.
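The steps above can be sketched in a few lines of Python. This is a toy example, assuming a hypothetical date column that must move from day/month/year to ISO format; it covers steps 2 through 5 (map, generate, execute, verify):

```python
from datetime import datetime

# Toy workflow: rows with a date column in the wrong format (hypothetical data)
rows = [{"signup": "30/07/2021"}, {"signup": "29/07/2021"}]

def transform_date(value: str) -> str:
    # Mapping decision (step 2): source is day/month/year, target is ISO 8601
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()

# Execute the workflow (step 4)
transformed = [{"signup": transform_date(r["signup"])} for r in rows]

# Verify (step 5): every value must parse back as an ISO date
assert all(datetime.fromisoformat(r["signup"]) for r in transformed)
print(transformed)  # [{'signup': '2021-07-30'}, {'signup': '2021-07-29'}]
```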

Data Transformation Methods

Organizations can transform data in different ways to better understand their operations. For example, data aggregation, data filtering, data integration, etc., are all forms of data transformation, and they can happen with any of these types of data transformation:

Transformation with scripting

This type of transformation involves manually coding the data transformation process from start to finish in Python, R, SQL, or any other language. It is an excellent approach for customization but often results in unintentional errors and misunderstandings as developers sometimes fail to interpret the exact requirements in their custom-coded solutions.

Transformation with on-site ETL tools

These tools work through on-site servers to extract, transform, and load information into an on-site data warehouse. However, an on-site transformation solution can be expensive to set up and manage as data volume increases, so big data companies have moved to more advanced cloud-based ETL tools.

Transformation with cloud-based ETL tools

Cloud-based ETL tools have simplified the process of data transformation. Instead of working on an on-site server, they work through the cloud. In addition, these tools make it easier to link cloud-based platforms with any cloud-based data warehouse.

Transformations within a data warehouse

Instead of ETL (where data is transformed before it’s loaded into a destination) ELT shifts the transformations to the data warehouse. dbt (data build tool) is a popular development framework that empowers data analysts to transform data using familiar SQL statements in their data warehouse of choice.

Data Transformation Use Cases

Data Preparation and Cleaning for ELT Pipelines

In ELT, users load data into the data warehouse in its raw form and apply transformations in the warehouse layer using stored procedures and tools like dbt. However, the raw data can have many issues preventing it from being actionable.

Data Transformation can address the following issues:

  • The raw data may have lookup IDs instead of human-readable names, for example like this:

        User ID      Page Viewed     Timestamp
        12309841234  Homepage        3:00 pm UTC July 30, 2021
        12309841235  Request a demo  4:00 pm UTC July 30, 2021

    Instead of this:

        Username     Page Viewed     Timestamp
        Jane Strome  Homepage        3:00 pm UTC July 30, 2021
        John Smith   Request a demo  4:00 pm UTC July 30, 2021
  • De-duplication of redundant records
  • Removing null or invalid records that would break reporting queries
  • Transforming your data to fit machine learning training models
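The lookup-ID case above can be sketched as a simple join against a lookup table (plain Python, reusing the IDs and names from the example tables):

```python
# Replace lookup IDs with human-readable names via a lookup table.
# IDs and usernames come from the example tables in this section.

users = {12309841234: "Jane Strome", 12309841235: "John Smith"}

raw = [
    {"user_id": 12309841234, "page": "Homepage"},
    {"user_id": 12309841235, "page": "Request a demo"},
]

readable = [{"username": users[r["user_id"]], "page": r["page"]} for r in raw]
print(readable)
# [{'username': 'Jane Strome', 'page': 'Homepage'},
#  {'username': 'John Smith', 'page': 'Request a demo'}]
```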

Business Data Transformations for Operational Analytics

Business and analytical groups may have some north star metrics that can only be tracked by correlating multiple data sets. Data transformation jobs can join data across multiple data sets and aggregate it to create a unified view of this data. Examples of this include:

  • Joining unique customer records across multiple tables – your customer Jane Smith exists in your CRM, your support ticketing system, your website analytics system (e.g. Google Analytics), and your invoicing system. You can write a transform to create a unified view of all her sales and marketing touchpoints and support tickets to get a 360 view of her customer activity.

Creating a 360 view of the customer

  • Creating aggregate values from raw data – for instance you can take all your invoices and transform them to build a table of monthly revenue categorized by location or industry
  • Creating analytical tables – your data analysts may have to write complex queries just to answer simple questions. Alternatively, your data engineering team can make life easy for data analysts by pre-creating analytical tables. That makes an analyst’s job as easy as writing a ‘select * from x limit 100’ to generate a new report, and drives down compute costs in your warehouse. For example:

Sales Data Table:

    Customer  Invoice Amount  Invoice Date
    ACME      $100            3:00 pm UTC July 30, 2021
    MyShop    $200            1:00 pm UTC July 29, 2021

And Monthly Sales (analytical table):

    Month   Invoice Amount
    July    $500
    August  $200
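The roll-up from raw invoices to a monthly table can be sketched in a few lines of plain Python (the invoice rows below are illustrative, not the exact sample data):

```python
from collections import defaultdict

# Aggregate raw invoices into a monthly-revenue table (illustrative data).
invoices = [
    {"customer": "ACME",   "amount": 100, "month": "July"},
    {"customer": "MyShop", "amount": 200, "month": "July"},
    {"customer": "ACME",   "amount": 200, "month": "August"},
]

monthly = defaultdict(int)
for inv in invoices:
    monthly[inv["month"]] += inv["amount"]

print(dict(monthly))  # {'July': 300, 'August': 200}
```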

Data Transformation Challenges

Data transformation has many benefits to an organization, but it is also important to note that certain hurdles make data transformation difficult.

Data transformation can be a costly process. ETL tools come at a price, and training staff on data management processes is not exactly a walk in the park either. The cost of data transformation depends on infrastructure size. Extensive infrastructure will require hiring a team of data experts to oversee the transformation process.

To transform data effectively, an organization would have to set up tools, train its current employees, and/or hire a new set of experts to oversee the process. Either way, it is going to cost time and resources to make that happen.

Data Transformation Best Practices

The data transformation process seems like an easy step-by-step workflow, but there are certain things to keep in mind to avoid running into blockers or carrying out the wrong type of data transformation. The following are data transformation best practices:

  • Start by designing the target format. Jumping right into the nitty-gritty of data transformation without understanding the end goal is not a good idea. Communicate with business users to understand the process you are trying to analyze and design the target format before transforming data into insights.
  • Profile the data. In other words, get to know the data in its native form before converting it. It helps to understand the state of the raw data and the type of transformation required. Data profiling enables you to know the amount of work required and the workflow to generate for transformation.
  • Cleanse before transforming. Data cleaning is an essential pre-transformation step. It reduces the risk of having errors in the transformed data. Data can have missing values or information that is irrelevant to the desired format. By cleansing your data first, it increases the accuracy of the transformed data.
  • Audit the data transformation process. At every stage of the transformation process, track the data and changes that occur. Auditing the data transformation process makes it easier to identify the problem source if complications arise.

Use Data Transformation Tools Over Custom Coding

Data transformation tools are more cost-effective and more efficient than custom coding. Writing custom code for data transformation carries a higher risk of inefficiency, human error, and excessive use of time and resources.

Striim data integration platform
Striim is a data integration solution that offers real-time, high-speed SQL-based transformations.

Data transformation tools are usually designed to execute the entire ETL process. If you are not using tools for the transformation process, you’ll also need to worry about the “extraction” and “loading” steps. Custom coding allows for a fully customized data transformation solution, but as data sources, volumes, and other complexities increase, scaling and managing this becomes increasingly difficult.

Striim is an end-to-end data integration solution that offers scalable in-memory transformations, enrichment and analysis, using high-speed SQL queries. To get a personalized walkthrough of Striim’s real-time data transformation capabilities, please request a demo. Alternatively, you can try Striim for free.

Optimize BigQuery Cost and Performance with Partition Pruning by Striim

Striim is a real-time data integration platform that integrates with over 100 data sources and targets, including Google BigQuery. In this post, we’ll share how Striim’s new partition pruning feature can help BigQuery users optimize their query costs and performance.

How Striim Writes Data to Data Warehouses

Striim Data warehouse writer architecture
How Striim writes data to BigQuery

Striim ingests data from a range of data sources including databases, files/logs, messaging systems, IoT devices, and data warehouses. BigQuery is a popular data target used by companies to efficiently analyze large datasets. Striim moves data into BigQuery in three steps when using the MERGE mode (which allows users to combine updates, inserts, deletes into a single statement):

  1. Upload from source table to stage-1 table
  2. Compact by moving only the most recent snapshot of each record to the stage-2 table
  3. Merge from stage-2 table to the target table in BigQuery

In step 3 (Merge), the MERGE query has to scan the entire table to search for matching records and update (or delete) them, or insert a new record if one doesn’t already exist. However, scanning through an entire table is both costly and time-consuming.
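The compaction in step 2 boils down to keeping only the latest snapshot of each record per primary key. A minimal Python sketch (the `seq` field is a hypothetical stand-in for Striim's change-ordering metadata, not its actual field name):

```python
# Compact a stage-1 change stream: keep only the most recent snapshot
# per primary key before merging. Field names are hypothetical.

stage1 = [
    {"id": 1, "name": "Jane",    "seq": 1},
    {"id": 1, "name": "Jane S.", "seq": 3},  # latest snapshot for id 1
    {"id": 2, "name": "John",    "seq": 2},
]

latest = {}
for rec in sorted(stage1, key=lambda r: r["seq"]):
    latest[rec["id"]] = rec  # later sequence numbers overwrite earlier ones

stage2 = list(latest.values())
print(stage2)
# [{'id': 1, 'name': 'Jane S.', 'seq': 3}, {'id': 2, 'name': 'John', 'seq': 2}]
```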

What are Partitioned Tables?

Partitioned tables are tables that have been divided into segments. Table partitioning enables users to improve query performance and optimize their costs.

For example, the diagram below depicts an integer-range partitioned table where the target table is partitioned according to User ID values (integers). Partition 1 contains the records with User IDs 1-5, partition 2 contains User IDs 6-10, and so on.

Example of a partitioned table
Example of a target table that’s partitioned by User ID. Since the updates only apply to records in partitions 2 and 3, only those partitions need to be queried.

In this example, a user wants to update three customer records with their first names. Partitioned tables allow the user to denote the partition(s) to be searched (also known as partition pruning). Without table partitioning, the user would need to search the full table to locate and update the corresponding records. With table partitioning, the records that need to be updated with user names are located in partitions 2 and 3, so only those partitions need to be searched.
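The effect of pruning can be modeled in a few lines of Python: a toy partition map keyed by User ID ranges, where partitions that cannot contain a target ID are never read:

```python
# Toy model of partition pruning: records partitioned by User ID range.
# Only partitions whose range overlaps the target IDs are scanned.

partitions = {
    (1, 5):   [{"id": i} for i in range(1, 6)],
    (6, 10):  [{"id": i} for i in range(6, 11)],
    (11, 15): [{"id": i} for i in range(11, 16)],
}

targets = {7, 12}  # IDs to update (they fall in partitions 2 and 3)

scanned = 0
for (lo, hi), rows in partitions.items():
    if not any(lo <= t <= hi for t in targets):
        continue  # pruned: this partition is never read
    for row in rows:
        scanned += 1
        if row["id"] in targets:
            row["first_name"] = "updated"

print(scanned)  # 10 rows scanned instead of 15
```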

How BigQuery Partitioned Tables Reduce Query Costs (and Improve Query Performance) in BigQuery

In BigQuery, tables can be partitioned by the following types of columns:

  • TIMESTAMP, DATE, or DATETIME columns
  • Integer-type columns (as shown in the example above)
  • Ingestion time (virtual column) based on a timestamp from when BigQuery ingests data

BigQuery officially supports partition pruning when using a MERGE query with a filter in the merge_condition.

If a query filters on the partitioning column’s value, BigQuery will only scan the partitions that match the filter values.

  • In the example below, the destination table is partitioned based on the “ID” column (the partitioning column). The MERGE query on the right includes a filter based on a range of the “ID” column values. The MERGE query on the left does not include a partition filter, which means that the full table will be scanned (resulting in 61.12 MB processed vs 4.01 MB processed).

How partitioning a table in BigQuery reduces query costs and time
Querying a table with and without using a filter in the MERGE query.

Even better, since partition pruning is performed before running the query, users can calculate their query costs ahead of time. This is especially compelling for users who take advantage of BigQuery’s on-demand pricing.

New: Striim’s BigQuery Writer Now Supports BigQuery Partition Pruning

Striim recently added support for partition pruning in BigQuery, so you can start optimizing your BigQuery performance and costs today. Ready to try this new feature? Make sure that a) the partitioning column in your source database has supplemental logging enabled and b) your target table in BigQuery is partitioned.

With this newly added functionality, Striim’s BigQuery Writer will detect the partitioned column and automatically optimize the merge without any user input. The optimized MERGE query will scan only the required partitions, resulting in faster query execution and lower costs.

Ready to see how Striim can help you make the most of BigQuery? Request a personalized demo of the Striim platform or start your free trial today!

Oracle LogMiner Continuous Mining Deprecation: What You Need to Know

Striim for Oracle CDC

In this era of digital transformation, companies are shifting to the cloud to increase their agility and shorten the time to business insights. Oracle databases have long been entrusted with operational data, and various methods have emerged to replicate or migrate data to cloud targets.

Change Data Capture (CDC) tracks database changes as they happen and generates a continuous stream of change data. Oracle CDC methods allow companies to stream database changes to their cloud targets in real time, enabling timely business intelligence and zero-downtime cloud migrations.

Various CDC approaches have been developed over the years to facilitate real time business operations.

“Efficient and reliable log mining techniques are critical for real time business operations. As database log-mining approaches continue to evolve, data architects need to incorporate low latency methods to stream changes from mission critical database workloads.”
-Alok Pareek, Founder at Striim, former CTO at GoldenGate

Oracle CDC methods include LogMiner-based CDC, trigger-based CDC, and third-party tools like Striim. Here’s a brief overview of these methods:

Oracle LogMiner-based CDC

LogMiner provides a SQL-based interface to Oracle database redo log files. LogMiner is designed to help users track data corruption (and identify any required recovery actions) and carry out performance tuning and capacity planning. 

The LogMiner “continuous mine” feature was introduced to allow automatic, continuous mining of the redo logs. This feature provides near real-time access to redo log changes, which has made it the target for many third-party data integration tools that offer CDC from Oracle.

Oracle deprecated the continuous mine feature in Oracle Database 12.2 and removed it in Oracle Database 19c. As a result, companies using newer versions of Oracle can no longer use the continuous mine feature for CDC.

Trigger-based Oracle CDC

Trigger-based Oracle CDC, also known as “synchronous” CDC, starts with defining database triggers that fire before or after INSERT, UPDATE, or DELETE commands. Each trigger records the change in a change table; each source table requires its own change table.

Trigger based Oracle CDC
Image source: Oracle documentation

Triggers add overhead to operational tables while changes are being made and can significantly impact the performance of any applications accessing those tables. Furthermore, maintaining the triggers as the application evolves adds ongoing management burden.
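The trigger-plus-change-table pattern looks roughly like this. The sketch below uses SQLite (via Python) rather than Oracle, so the trigger syntax differs from Oracle's, but the shape is the same: an AFTER UPDATE trigger copies each change into a per-table change table.

```python
import sqlite3

# Trigger-based CDC sketch in SQLite (Oracle's PL/SQL syntax differs,
# but the concept is identical): every UPDATE on `customers` is
# recorded in its dedicated change table.

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers_changes (id INTEGER, name TEXT, op TEXT);
    CREATE TRIGGER customers_upd AFTER UPDATE ON customers
    BEGIN
        INSERT INTO customers_changes VALUES (NEW.id, NEW.name, 'UPDATE');
    END;
""")
con.execute("INSERT INTO customers VALUES (1, 'Jane')")
con.execute("UPDATE customers SET name = 'Jane S.' WHERE id = 1")

print(con.execute("SELECT id, name, op FROM customers_changes").fetchall())
# [(1, 'Jane S.', 'UPDATE')]
```

Note how the overhead is visible even in this toy: every write to the operational table now performs an extra insert inside the same transaction.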

Striim for Oracle CDC

Striim is a real-time data integration platform that natively performs change data capture by mining Oracle redo logs while managing additional metadata.

Striim for Oracle CDC

Striim’s Oracle CDC solution offers a variety of benefits:

  • It does not require the LogMiner continuous mining feature (and has a 3x faster capture rate than LogMiner-based CDC)
  • It can also read Oracle GoldenGate Trail files (for companies who are using Oracle GoldenGate)
  • It is RAC-cluster compatible
  • It offers reliable, high-performance change data capture, with exactly-once processing and support for long-running transactions
  • It’s an end-to-end data integration tool with real time monitoring, analysis, and enrichment; and connectors to over 100 sources/targets 
  • It’s designed for mission critical, high-volume environments

Striim offers Oracle CDC to Snowflake, Kafka, BigQuery, Azure SQL database and many more targets.

If you’re looking for a high-performance alternative to LogMiner-based or trigger-based Oracle CDC, one of Striim’s CDC experts would be happy to give you a personalized walkthrough of the Striim platform. Alternatively, you can give Striim a try for free. 

Cloud Data Warehouse Comparison: Redshift vs BigQuery vs Azure vs Snowflake for Real-Time Workloads

data warehouse comparisons

Introduction

Data helps companies take the guesswork out of decision-making. Teams can use data-driven evidence to decide which products to build, features to add, and growth initiatives to pursue. And such insights-driven businesses grow at an annual rate of over 30%.

But there’s a difference between being merely data-aware and insights-driven. Discovering insights requires finding a way to analyze data in near real time, which is where cloud data warehouses play a vital role. As scalable repositories of data, warehouses allow businesses to find insights by storing and analyzing huge amounts of structured and semi-structured data.

And running a data warehouse is more than a technical initiative. It’s vital to the overall business strategy and can inform an array of future product, marketing, and engineering decisions.

But choosing a cloud data warehouse provider can be challenging. Users have to evaluate costs, performance, the ability to handle real-time workloads, and other parameters to decide which vendor best fits their needs.

To help with these efforts, we analyze four cloud data warehouses: Amazon Redshift, Google BigQuery, Azure Synapse Analytics, and Snowflake. We cover the pros and cons of each of these options and dive into the factors you’ll need to consider when choosing a cloud data warehouse.

What is a data warehouse and when should I use one?

A data warehouse is a system that brings data from various sources to a central repository and prepares it for quick retrieval. Data warehouses usually contain structured and semi-structured data pulled from transactional systems, operational databases, and other sources. Engineers and analysts use this data for business intelligence and various other purposes.

Data warehouses can be implemented on-premise, in the cloud, or as a mix of both. The on-premise approach requires physical servers, which makes scaling more expensive and challenging as users have to buy more hardware. Storing data in the cloud is less expensive, and scaling is nearly automatic.

When to use a data warehouse

A data warehouse can be used for various tasks. You can use it to store historical data in a unified environment that acts as a single source of truth. Users from an entire organization can then rely on that repository for day-to-day tasks.

Data warehouses can also unify and then analyze data streams from the web, customer relationship management (CRM), mobile, and other apps. Today’s companies use an ever-growing number of software tools. Pulling data from multiple sources, transforming it into consumable formats, and storing it in a warehouse is vital for making sense of data.

And with valuable data stored in warehouses, you can go beyond traditional analytics tools and query data with SQL to discover deep business insights.

For instance, companies use Google Analytics (GA) to learn how customers engage with their apps or websites. But the depth of insights users can discover is limited by the properties of GA. A better way would be to connect GA with a data warehouse that already stores data from platforms such as Salesforce, Zendesk, Stripe, and others. With all your data stored in one place, it’s much easier to analyze it, compare different variables, and produce insightful data visualizations.

Can’t I just use a database?

Conventional wisdom says you can probably use an OLTP database such as PostgreSQL unless you have terabytes or petabytes of complex data sets. However, cloud computing has made data warehousing cost-effective even for smaller data volumes. For instance, BigQuery is free for the first terabyte of query processing each month. Also, the low total cost of ownership of serverless cloud data warehouses makes analytics simple. Not to mention, there is an expansive ecosystem for data integration, data observability, and business intelligence on top of popular cloud data warehousing tools that can accelerate your analytical operations.

BigQuery pricing. Source: Google Cloud

Many of today’s new cloud data warehouses are built using solutions from major vendors such as Amazon Redshift, Google BigQuery, Microsoft Azure Synapse Analytics, and Snowflake.

Major vendors differ in costs and technical details, but they also share some common traits. Their cloud data warehouses are highly reliable: while outages or failures might happen, data replication and other reliability features ensure your data is backed up and can be quickly retrieved.

Amazon, Google, Microsoft, and Snowflake also offer highly scalable cloud data warehouses. Their solutions use massively parallel processing (MPP), an architecture that handles multiple operations simultaneously, to rapidly scale storage and compute resources up or down. And data is stored in columnar format to achieve better compression and faster querying.
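The columnar idea can be illustrated with a toy example (this is not a real warehouse engine, just a sketch of the layout): summing one column in columnar form touches a single contiguous array, while a row layout forces a scan of every full record, and repeated column values compress well.

```python
# Toy illustration of row vs columnar layout.
row_store = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.0},
    {"order_id": 3, "region": "EU", "amount": 15.0},
]

# The same table, pivoted into one array per column.
column_store = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [20.0, 35.0, 15.0],
}

total_rowwise = sum(r["amount"] for r in row_store)   # reads whole rows
total_columnar = sum(column_store["amount"])          # reads one column only
assert total_rowwise == total_columnar == 70.0

# Columnar arrays also compress well: runs of repeated values can be
# stored as (value, run_length) pairs, i.e. simple run-length encoding.
def rle(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

print(rle(["EU", "EU", "EU", "US", "US"]))
```

Real engines use far more sophisticated encodings (dictionary, delta, bit-packing), but the principle is the same: columns of similar values are cheap to scan and cheap to store.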

Compared to on-premises data warehouses, cloud alternatives are more scalable and faster, go live in minutes, and are always up to date.

Data warehouses at-a-glance: Snowflake vs Redshift vs BigQuery vs Azure

Administration & Management

  • Snowflake: Simple provisioning. You need only select a cloud provider and a virtual warehouse size.
  • Redshift: Must select the correct instance size and configure and scale nodes manually. Requires AWS expertise.
  • BigQuery: Completely serverless; provisioning is automatic.
  • Azure: Offers both serverless and dedicated options.

Scalability

  • Snowflake: Storage and compute scale independently; Snowflake automatically adds and removes nodes.
  • Redshift: Decoupled storage and compute with RA3 nodes; storage and compute scale independently.
  • BigQuery: Scaling is handled automatically by BigQuery.
  • Azure: The serverless option scales automatically. For the dedicated option, additional storage must be added manually.

Analytics Ecosystem

  • Snowflake: Analytics ecosystem is wholly based on the Snowflake platform (e.g., Snowpark) and partners on Snowflake Partner Connect.
  • Redshift: Business intelligence with Amazon QuickSight and a roster of analytics platforms.
  • BigQuery: Google Workspace (simple Google Sheets upload) and Google Cloud. Business intelligence with Looker.
  • Azure: Azure ecosystem for analytics, including Power BI for business intelligence and Cosmos DB for NoSQL.

Integrations Ecosystem

  • Snowflake: Data integration via partners on Snowflake Partner Connect.
  • Redshift: Data integration with AppFlow and DMS, along with partners on AWS Marketplace.
  • BigQuery: Native data integration via Cloud Data Fusion.
  • Azure: Integration supported with Azure Data Factory.

Ingestion of Streaming Data

  • Snowflake: Yes, with an added service. For continuous data ingestion, Snowflake offers Snowpipe, which loads data within minutes after it is added to a staging file. Alternatively, Striim's Snowflake Writer continuously and securely writes data to Snowflake directly or via staging files with low latency, allowing initial load and continuous ingestion of data from databases (via change data capture), log files, IoT devices, and more, with data delivery and latency SLAs plus in-memory transformations and analysis.
  • Redshift: No built-in capability for ingesting data streams. Kinesis Firehose can define batch intervals as low as 60 seconds for near-real-time, scalable loading of data into Redshift. Alternatively, Striim's Redshift Writer enables continuous writing of data to Redshift from databases (via change data capture), log files, IoT devices, and more, with data delivery and latency SLAs plus in-memory transformations and analysis.
  • BigQuery: Yes. Users can write code that calls the streaming API and inserts one record at a time, or use Dataflow with the Apache Beam SDK to set up a streaming pipeline. Alternatively, Striim's BigQuery Writer continuously writes new data to BigQuery via its streaming API, ingesting from databases (via change data capture), log files, IoT devices, and more, with data delivery and latency SLAs plus in-memory transformations and analysis.
  • Azure: Yes. Users can use the Apache Spark streaming functionality in Azure Synapse or the Azure Stream Analytics event processing engine. Alternatively, Striim's Azure Synapse Writer allows continuous ingestion of data from databases (via change data capture), log files, IoT devices, and more, with data delivery and latency SLAs plus in-memory transformations and analysis.

Data Backup and Recovery

  • All four platforms: Yes.

Columnar Architecture

  • All four platforms: Yes.

Massively Parallel Processing (MPP)

  • All four platforms: Yes.

Pricing

  • Snowflake: On-demand pricing or pre-purchase storage capacity at a discount. Compute time is billed separately.
  • Redshift: On-demand pricing depending on cluster configuration. Reserved nodes can be purchased at a discount.
  • BigQuery: Choose between on-demand or discounted flat-rate pricing for analysis. Pay for both active and long-term storage.
  • Azure: On-demand pricing or the option to pre-purchase reserved storage at a discount.

Snowflake

Snowflake is a cloud data warehouse that runs on top of Google Cloud, Microsoft Azure, and AWS infrastructure. Because the service runs on the major public clouds rather than its own, it's easier to move data across clouds and regions.

Snowflake supports a nearly unlimited number of concurrent users and can be run with almost zero maintenance or administration. Updating metadata, vacuuming, and many other menial maintenance tasks are automated. Scaling is automatic as well, with per-second pricing.

Users can also query semi-structured data with SQL or other BI and ML tools. Snowflake offers native support for semi-structured formats such as XML, JSON, and Avro. And its hybrid architecture is divided into three distinct layers: the cloud services layer, the compute layer, and the storage layer.

Snowflake's three-layered architecture. Image source: Snowflake documentation.

Snowflake is growing in popularity and has a number of major customers, including Rakuten. The Japanese ecommerce group uses Snowflake to scale its data resources. The company’s cash back and shopping reward program, called Rakuten Rewards, was using ever-growing amounts of CPU and memory. Demand exceeded the capability of the existing data warehouse.

Rakuten then introduced Snowflake and set up specialized warehouses for individual teams. Workloads from different business units were isolated into different warehouses to prevent them from disrupting each other. This approach was possible because Snowflake separates the storage and compute layers. As a result, Rakuten has decreased costs, improved data processing efficiency, and gained more visibility into its data ops. "I know how much I'm paying to supply the sales team with reports, and I can see how much we are spending to extract data for financial analysis," says Mark Stange-Tregear, vice president of analytics at Rakuten.

Amazon Redshift

Amazon Redshift is a cloud data warehouse service offered by Amazon. The service handles datasets of various sizes ranging from a few gigabytes to a petabyte or more.

Users initially launch a set of nodes and provision them, after which they upload data and carry out an analysis. Part of a broader Amazon Web Services (AWS) ecosystem, the Redshift data warehousing service offers various features. For instance, users can export data to and from their data lake and integrate with other platforms such as Salesforce, Google Analytics, Facebook Ads, Slack, Jira, Splunk, and Marketo. The warehouse service achieves high performance and efficient storage using columnar storage, data compression, and zone maps.

AWS Redshift architecture. Image source: Redshift documentation.

Redshift boasts tens of thousands of customers, including Pfizer, Equinox, Comcast, and others. In 2020, Amazon also started working with Pizza Hut. The restaurant chain uses Redshift to consolidate data produced by its stores in Asia-Pacific. This data warehouse allows teams to quickly access petabytes of data, run queries, and produce visualizations. Business intelligence reports are now produced in minutes instead of hours. “Within two months, we could see whether the region was hitting sales targets and performance goals with green and red indicators,” says Pin Yiing Kwok, digital experience manager at Pizza Hut Asia-Pacific. “We could also drill down on any potential issues and identify what needed troubleshooting.”

Google BigQuery

BigQuery is a serverless multi-cloud data warehouse offered by Google. The service can rapidly analyze terabytes to petabytes of data.

Unlike Redshift, BigQuery doesn’t require upfront provisioning and automates various back-end operations such as data replication or scaling of compute resources. It encrypts data at rest and in transit automatically.

The BigQuery architecture consists of several components. Borg is the overall compute part, while Colossus is the distributed storage. The execution engine is called Dremel, and Jupiter is the network.

BigQuery architecture. Image source: BigQuery documentation.

BigQuery connects well with other Google Cloud products. Toyota Canada, for instance, has built Build & Price, an online comparison tool that allows site visitors to customize vehicles and get instant quotes. This first-party data is collected by Google Analytics 360 and extracted into BigQuery. The warehousing service then applies machine learning (ML) models on visitors’ data and assigns a propensity score to each individual based on their likelihood of making a purchase. The predictions are refreshed every eight hours.

Toyota’s team then pulls these predictions back into Analytics 360. The team creates 10 audiences using propensity scores and runs personalized ads to each group in a bid to move them down the sales funnel.

BigQuery is also used by many other well-known customers such as Dow Jones, Twitter, The Home Depot, and UPS.

Azure Synapse Analytics

Azure Synapse Analytics is a cloud-based data warehouse offered by Microsoft. The service brings together data warehousing, data integration, and big data analytics through a single user interface (UI).

Users can ingest data from almost 100 native connectors by building ETL/ELT processes in a code-free environment. Users also benefit from integrated artificial intelligence (AI) and business intelligence tools, including Azure Machine Learning, Azure Cognitive Services, and Power BI. Intelligence tools can easily be applied across diverse data sets, including those in Dynamics 365, Office 365, and SaaS products.

Users can analyze data using provisioned or serverless on-demand resources. And from T-SQL and Python to Scala and .NET, you can use various languages in Azure Synapse Analytics.

Azure Synapse Analytics architecture. Image source: Microsoft documentation.

Microsoft’s cloud data warehousing service boasts many customers, including Walgreens. The retail and wholesale pharmacy giant has migrated its inventory management data into Azure Synapse. Instead of on-premise data warehouses, the company is now using the cloud to enable its supply chain analysts to query data and create visualizations using tools such as Microsoft Power BI.

An intuitive drag-and-drop interface makes working with data easy. Costs went down as well. "Azure was a third of the cost compared to setting up a new data warehouse appliance on-prem," says Anne Cruz, an IT manager for supply chain and merchandising at Walgreens. And instead of waiting until 1 p.m. to get the previous day's data report, users have information ready by 9 a.m. every weekday.

Factors to consider when selecting a cloud data warehouse

Major cloud data warehouses share some similarities but also have major differences. Deciding on which warehousing service to use is never an easy task. Consider the following factors when evaluating platforms to ensure your team is set for success.

Use cases

A company’s unique circumstances and the use case are critical factors for evaluating data warehousing providers. For instance, businesses that work with JSON may prefer Snowflake as it offers native support for this format. And smaller organizations without a dedicated data administrator might avoid Redshift as it requires regular monitoring and configuration. Services with a plug-and-play setup may be a better fit in this case.

Support for real time workloads

Many companies need to analyze data as soon as it’s generated. For example, some companies may need to detect fraud or security issues in real time, while others may need to process large volumes of streaming IoT data for anomaly detection. In these cases, it’s important to evaluate how different cloud data warehouses handle the ingestion of streaming data.

BigQuery offers a streaming API that users can call with a few lines of code. Azure offers several options for real-time data ingestion, including the built-in Apache Spark streaming functionality. Snowflake offers Snowpipe as an add-on to enable real-time ingestion, while Redshift requires a service such as Kinesis Firehose for streaming data ingestion. A real-time data integration solution like Striim provides scalable, enterprise-grade streaming data ingestion for all four of these data warehouses.

Security

Each cloud data warehouse provider takes security seriously. But there are technical differences that users should be aware of when deciding on which vendors to use. For instance, encryption is handled differently: BigQuery encrypts data in transit and at rest by default, while this feature needs to be explicitly enabled in Redshift.

Billing

Vendors calculate costs in different ways. Companies need to know how much data they expect to integrate, store, and analyze each month to estimate costs. Based on these inputs, IT teams can then choose a cloud data warehouse vendor with the most suitable pricing method.

Redshift offers on-demand pricing that varies according to the type and number of nodes in your cluster. Additional capabilities, such as concurrency scaling and managed storage, are billed separately. BigQuery offers separate on-demand and discounted flat rate pricing for storage and analysis. Additional operations, including streaming inserts, incur additional costs.

Azure Synapse uses a Data Warehouse Unit (DWU), a bundle of technical cost factors, to price compute resources. Users are separately charged for storage. And Snowflake uses credits to charge users based on how many virtual warehouses they use and for how long. Storage is billed separately on a terabyte-per-month basis.
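The common thread across these models is that storage and analysis are usually billed separately. The sketch below shows the arithmetic; the rates are hypothetical placeholders, not actual vendor prices, so check each vendor's current pricing page before relying on any numbers.

```python
# Back-of-the-envelope monthly cost model. The rates used below are
# HYPOTHETICAL placeholders, not actual vendor prices; always check the
# vendor's current pricing page.
def monthly_cost(storage_tb, query_tb, storage_rate_per_tb, query_rate_per_tb):
    """Separate storage and analysis charges, as on-demand pricing
    models typically do."""
    return storage_tb * storage_rate_per_tb + query_tb * query_rate_per_tb

# Example: 5 TB stored and 20 TB scanned by queries each month, at
# illustrative rates of $20/TB stored and $5/TB scanned.
estimate = monthly_cost(storage_tb=5, query_tb=20,
                        storage_rate_per_tb=20.0, query_rate_per_tb=5.0)
print(f"${estimate:.2f}/month")  # $200.00/month
```

Plugging your expected volumes into a model like this, with each vendor's real rates, is a quick way to compare pricing methods before committing.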

Ecosystem

It’s also important to consider in which ecosystems your existing apps and data reside. For example, businesses whose data is already in Google Cloud could get an additional performance boost by using BigQuery or Snowflake on Google Cloud. Data transfer paths will be better optimized since they share the same infrastructure. And data won’t have to move across the public Internet.

Data types

Businesses work with structured, semi-structured, and unstructured data. Most data warehouses usually support the first two data types. Depending on their needs, IT teams should ensure that the vendor they opt for offers the best infrastructure for storing and querying relevant types of data.

Scaling

Another factor to consider when choosing a cloud data warehouse vendor is how the service scales for storage and performance. Redshift requires users to manually add more nodes to ramp up storage and computing power resources. But Snowflake has an auto-scale function that dynamically adds or removes nodes.

Maintenance

Depending on company size and data needs, day-to-day management of data warehouses can be mostly automated or done manually. Small teams may prefer the self-optimizing warehouses offered by BigQuery or Snowflake. But maintaining data warehouses manually offers more flexibility and greater control, allowing teams to better optimize their data assets. This level of control is offered by Redshift and several other vendors.

Up your data game

From Redshift and BigQuery to Azure and Snowflake, teams can use a variety of cloud data warehouses. But finding the service that best fits company needs is a challenging task. Teams have to consider various parameters, technical specs, and pricing models to make the final decision.

These efforts will eventually pay off. Cloud data warehouses enable product, marketing, sales, and many other departments to up their data game and discover vital insights. And less guesswork and more data-driven evidence will pave the way toward achieving and maintaining a competitive edge.  

Schedule a demo and we’ll give you a personalized walkthrough or try Striim at production-scale for free! Small data volumes or hoping to get hands on quickly? At Striim we also offer a free developer version.

What is Data Ingestion and Why This Technology Matters

  1. Introduction
  2. Types of Data Ingestion
  3. Benefits of Data Ingestion
  4. Data Ingestion Challenges
  5. Data Ingestion Tools
  6. Finding a Key Differentiator

Introduction

Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices, on-premises databases, and SaaS apps, and end up in different target environments, such as cloud data warehouses or data marts.

Data ingestion is a critical technology that helps organizations make sense of an ever-increasing volume and complexity of data. To help businesses get more value out of data ingestion, we’ll dive deeper into this technology. We’ll cover types of data ingestion, how data ingestion is done, the difference between data ingestion and ETL, data ingestion tools, and more.

Types of Data Ingestion

There are three ways to carry out data ingestion: in real time, in batches, or a combination of both in a setup known as lambda architecture. Companies can opt for one of these types depending on their business goals, IT infrastructure, and financial limitations.

Real-time data ingestion

Real-time data ingestion is the process of collecting and transferring data from source systems in real time using solutions such as change data capture (CDC). CDC constantly monitors transaction or redo logs and moves changed data without interfering with the database workload. Real-time ingestion is essential for time-sensitive use cases, such as stock market trading or power grid monitoring, when organizations have to rapidly react to new information. Real-time data pipelines are also vital when making rapid operational decisions and identifying and acting on new insights.
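The core of log-based CDC can be sketched in a few lines: rather than querying source tables, a CDC reader tails an append-only change log and applies each entry to a target. The log format below is invented for illustration; real transaction and redo logs are binary and database-specific.

```python
# Minimal sketch of log-based change data capture. Each entry mirrors a
# committed change in the source database's transaction log.
change_log = [
    {"op": "insert", "key": "acct-1", "value": {"balance": 100}},
    {"op": "update", "key": "acct-1", "value": {"balance": 80}},
    {"op": "insert", "key": "acct-2", "value": {"balance": 50}},
    {"op": "delete", "key": "acct-2", "value": None},
]

replica = {}
for entry in change_log:           # a real CDC reader tails this forever
    if entry["op"] == "delete":
        replica.pop(entry["key"], None)
    else:                          # inserts and updates are both upserts
        replica[entry["key"]] = entry["value"]

print(replica)  # {'acct-1': {'balance': 80}}
```

Because only the log is read, the source database's workload is untouched, which is what makes CDC suitable for continuously replicating busy production systems.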

Batch-based data ingestion

Batch-based data ingestion is the process of collecting and transferring data in batches according to scheduled intervals. The ingestion layer may collect data based on simple schedules, trigger events, or any other logical ordering. Batch-based ingestion is useful when companies need to collect specific data points on a daily basis or simply don’t need data for real-time decision-making.

Lambda architecture-based data ingestion

Lambda architecture is a data ingestion setup that consists of both real-time and batch methods. The setup consists of batch, serving, and speed layers. The first two layers index data in batches, while the speed layer instantaneously indexes data that has yet to be picked up by slower batch and serving layers. This ongoing hand-off between different layers ensures that data is available for querying with low latency.
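The hand-off between layers can be sketched as a merge of two views at query time. The metric names and counts below are hypothetical; the point is that the batch layer holds precomputed results up to the last batch run, the speed layer covers everything since, and a query combines both.

```python
# Toy lambda architecture: merge a (slow, complete) batch view with a
# (fast, recent) speed view at query time.
batch_view = {"clicks": 1_000, "signups": 40}   # built by slow batch jobs
speed_view = {"clicks": 27, "purchases": 3}     # arrived since the last batch

def query(metric):
    # The serving layer answers from both views so results are complete
    # yet available with low latency.
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))     # batch + speed: 1027
print(query("purchases"))  # only in the speed layer so far: 3
```

When the next batch run completes, its results absorb the speed layer's entries and the speed view is reset, so no event is counted twice.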

Benefits of Data Ingestion

Data ingestion technology offers various benefits, enabling teams to manage data more efficiently and gain a competitive advantage. Some of these benefits include:

  • Data is readily available: Data ingestion helps companies gather data stored across various sites and move it to a unified environment for immediate access and analysis.
  • Data is less complex: Advanced data ingestion pipelines, combined with ETL solutions, can transform various types of data into predefined formats and then deliver it to a data warehouse.
  • Teams save time and money: Data ingestion automates some of the tasks that previously had to be manually carried out by engineers, whose time can now be dedicated to other more pressing tasks.
  • Companies make better decisions: Real-time data ingestion allows businesses to quickly notice problems and opportunities and make informed decisions.
  • Teams create better apps and software tools: Engineers can use data ingestion technology to ensure that their apps and software tools move data quickly and provide users with a superior experience.

Data Ingestion Challenges

Setting up and maintaining data ingestion pipelines might be simpler than before, but it still involves several challenges:

  • The data ecosystem is increasingly diverse: Teams have to deal with an ever-growing number of data types and sources, making it difficult to create a future-proof data ingestion framework.
  • Legal requirements are more complex: From GDPR to HIPAA to SOC 2, data teams have to familiarize themselves with various data privacy and protection regulations to ensure they’re acting within the boundaries of the law.
  • Cyber-security challenges grow in size and scope: Data teams have to fend off frequent cyber-attacks launched by malicious actors in an attempt to intercept and steal sensitive data.

Data Ingestion Tools

Data ingestion tools are software products that gather and transfer structured, semi-structured, and unstructured data from source to target destinations. These tools automate otherwise laborious and manual ingestion processes. Data is moved along a data ingestion pipeline, which is a series of processing steps that take data from one point to another.
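A "series of processing steps" maps naturally onto chained functions. This sketch uses Python generators so records stream through one at a time; the field names and validation rule are invented for illustration.

```python
# A minimal ingestion pipeline: parse -> validate -> load, with records
# streaming through one at a time via generators.
def parse(lines):
    for line in lines:
        user, amount = line.strip().split(",")
        yield {"user": user, "amount": float(amount)}

def validate(records):
    for r in records:
        if r["amount"] >= 0:        # drop malformed/negative amounts
            yield r

def load(records, target):
    for r in records:
        target.append(r)            # stand-in for a warehouse write

raw = ["alice,10.5\n", "bob,-3\n", "carol,7.25\n"]
warehouse = []
load(validate(parse(raw)), warehouse)
print(len(warehouse))  # 2 (bob's negative amount was dropped)
```

Production tools add the hard parts around this skeleton: retries, checkpointing, schema handling, and parallelism, but the pipeline shape is the same.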

Data ingestion tools come with different features and capabilities. To select the tool that fits your needs, you’ll need to consider several factors and decide accordingly:

  • Format: Is data arriving as structured, semi-structured, or unstructured?
  • Frequency: Is data to be ingested and processed in real time or in batches?
  • Size: What’s the volume of data an ingestion tool has to handle?
  • Privacy: Is there any sensitive data that needs to be obfuscated or protected?

And data ingestion tools can be used in different ways. For instance, they can move millions of records into Salesforce every day. Or they can ensure that different apps exchange data on a regular basis. Ingestion tools can also bring marketing data to a business intelligence platform for further analysis.

Data ingestion vs. ETL

Data ingestion tools may appear similar in function to ETL platforms, but there are some differences. For one, data ingestion is primarily concerned with extracting data from the source and loading it into the target site. ETL, however, is a type of data ingestion process that involves not only the extraction and transfer of data but also the transformation of that data before its delivery to target destinations.

ETL platforms, such as Striim, can perform various types of transformation, such as aggregation, cleansing, splitting, and joining. The goal is to ensure that the data is delivered in a format that matches the requirements of the target location.

Finding a Key Differentiator

Data ingestion is a vital technology that helps companies extract and transfer data in an automated way. With data ingestion pipelines established, IT and other business teams can focus on extracting value from data and finding new insights. And automated data ingestion can become a key differentiator in today's increasingly competitive marketplaces.

Schedule a demo and we’ll give you a personalized walkthrough or try Striim at production-scale for free! Small data volumes or hoping to get hands on quickly? At Striim we also offer a free developer version.

What to Look for in Data Replication Software

Reliable access to data is vital for companies to thrive in this digital age. But businesses struggle with various risk factors, such as hardware failures, cyberattacks, and geographical distances, that could block access to data or corrupt valuable data assets. Left without access to data, teams may struggle to carry out day-to-day tasks and deliver on important projects.

One way to safeguard your data from those risks is using data replication solutions. This technology is indispensable for teams that want to replicate and protect their mission-critical data and use it as a source of competitive advantage.

To help businesses explore data replication, we’ll dive into this technology and explore what features you should look for in data replication software.

What is Data Replication?

Data replication is the process of copying data from an on-premise or cloud server and storing it on another server or site. The result is a multitude of exact data copies residing in multiple locations.

These data replicas support teams in their disaster recovery and business continuity efforts. If data is compromised at one site (for example by a system failure or a cyberattack), teams can pull replicated data from other servers and resume their work.

Replication also allows users to access data stored on servers close to their offices, reducing network latency. For instance, users in Asia may experience a delay when accessing data stored in North America-based servers. But the latency will decrease if a replica of this data is kept on a node that’s closer to Asia.

Data replication also plays an important role in analytics and business intelligence efforts, in which data is replicated from operational databases to data warehouses.

Types and Methods of Data Replication

Depending on their needs, companies can choose among several types of data replication:

  • Transactional replication: Users receive a full copy of their data sets, and updates are continuously replicated as data in the source changes.
  • Snapshot replication: A snapshot of the database is sent to replicated sites at a specific moment.
  • Merge replication: Data from multiple databases is replicated into a single database.

In tactical terms, there are several methods for replicating data, including:

  • Full-table replication: Every piece of new, updated, and existing data is copied from the source to the destination site. This method copies all data every time and requires a lot of processing power, which puts networks under heavy stress.
  • Key-based incremental replication: Only data changed since the previous update will be replicated. This approach uses less processing power but can’t replicate hard-deleted data.
  • Log-based incremental replication: Data is replicated based on information in database log files. This is an efficient method but works only with database sources that support log-based replication (such as Microsoft SQL Server, Oracle, and PostgreSQL).
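Key-based incremental replication is simple enough to sketch in full. In this hypothetical example the replication key is an `updated_at` timestamp: only rows newer than the last checkpoint are copied, and note that a hard-deleted source row is never seen again, which is exactly the limitation mentioned above.

```python
# Sketch of key-based incremental replication using `updated_at` as the
# replication key (illustrative field names and values).
source = [
    {"id": 1, "name": "alpha", "updated_at": 100},
    {"id": 2, "name": "beta",  "updated_at": 250},
    {"id": 3, "name": "gamma", "updated_at": 300},
]

def incremental_sync(source_rows, target, last_key):
    # Copy only rows changed since the previous sync, then advance the
    # checkpoint to the newest key we saw.
    new_rows = [r for r in source_rows if r["updated_at"] > last_key]
    for r in new_rows:
        target[r["id"]] = r                      # upsert into the replica
    return max((r["updated_at"] for r in new_rows), default=last_key)

target = {}
checkpoint = incremental_sync(source, target, last_key=0)   # initial full load
print(checkpoint, len(target))                              # 300 3

source[1]["updated_at"] = 400                               # "beta" changes
checkpoint = incremental_sync(source, target, checkpoint)   # only "beta" moves
print(checkpoint)                                           # 400
```

Compared with full-table replication, each subsequent sync moves only the changed rows, which is why this method uses far less processing power and bandwidth.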

What to Look for in Data Replication Software

Data replication software should ideally contain the following features:

A large number of connectors: A replication tool should allow you to replicate data from various sources and SaaS tools to data warehouses and other targets.

Log-based capture: An ideal replication software product should capture streams of data using log-based change data capture.

Data transformation: Data replication solutions should also allow users to clean, enrich, and transform replicated data.

Built-in monitoring: Dashboards and monitoring enable you to see the state of your data flows in real time and easily identify any bottlenecks. For mission-critical systems that have data delivery Service Level Agreements (SLAs), it's also important to have visibility into end-to-end lag.

Custom alerts: Data replication software should offer alerts that can be configured for a variety of metrics, keeping you up to date on the status and performance of your data flows.

Ease of use: A drag-and-drop interface is an ideal solution for users to quickly set up replication processes.

Data replication software vs. writing code internally

Of course, users can set up the replication process by writing code internally. But managing yet another in-house app is a major commitment of energy, staff, and money. The app also may require the team to handle error logging, refactoring code, alerting, etc. It comes as no surprise that many teams are opting for third-party data replication software.

Use Striim to replicate data in real time

There are also real-time database replication solutions such as Striim. Striim is a unified streaming and real-time data integration platform that connects over 150 sources and targets. Striim provides real-time data replication by extracting data from databases using log-based change data capture and replicating it to targets in real time.

Striim is a unified real-time data integration and streaming platform that connects clouds, data, and applications. With log-based change data capture from a range of databases, Striim supports real-time data replication.

Striim's data integration and replication capabilities support various use cases. The platform can, for instance, enable financial organizations to near-instantaneously replicate transaction and new balance data to customer accounts. Inspyrus, a San Francisco-based fintech startup, uses Striim to replicate invoicing data from its private cloud operational databases to other cloud targets such as Snowflake for real-time analytics.

Striim can also be used to replicate obfuscated sensitive data to Google Cloud while original data is safely kept in an on-premises environment. Furthermore, Striim supports mission-critical use cases with data delivery and latency SLAs. Striim customer Macy’s uses Striim to streamline retail operations and provide a unified customer experience. Even at Black Friday traffic levels, Striim is able to deliver data from Macy’s on-premises data center to Google Cloud with less than 200ms latency.

Have More Time to Analyze Data

Striim replicates data from databases using high-performance log-based change data capture.

Reliable access to data is of vital importance for today’s companies. But that access can often be blocked or limited, which is why data replication solutions are increasingly important. They enable teams to replicate and protect valuable data assets, and support disaster recovery efforts. And with data secured, teams can have more time and energy to analyze data and find insights that will provide a competitive edge.

Ready to see how Striim can help you simplify data integration and replication? Request a demo with one of our data replication experts, or try Striim for free.

Guide to Modernizing Data Integration and Supercharging Digital Transformation

Over 80% of digital transformation (DT) initiatives fail because of unreliable data integration methods and siloed data. This figure comes as no surprise because companies find it challenging to handle ever-larger volumes, sources, and types of data. As these businesses struggle to bring data to a unified environment, they’re unable to gain critical insights and make informed decisions.

Legacy data integration solutions share part of the blame. They’re poorly equipped to integrate data in a flexible and scalable way. Companies are left with no choice but to modernize their data integration processes to take advantage of data sets and drive digital transformation.

To help you navigate modernization efforts, we’ve developed a guide. Including conversations with our CTO, Steve Wilkes, the guide dives into key considerations and issues you should prioritize when taking your data integration processes to the next level.

1. Develop a data integration modernization roadmap

Creating a high-level roadmap is the first step in data integration modernization efforts. Ideally, your roadmap should follow a three-step approach.

First, assess existing integrations. Your teams should collect details about integration patterns (REST API, event-driven, P2P, etc.) as well as source and target apps. Knowing the integration architecture and security needs is vital, too.

With the initial analysis completed, define the desired integration architecture and how to deploy it. If needed, consider and plan for any enterprise-specific requirements at this stage. It’s also recommended you identify an integration platform that’s aligned with your desired data architecture.

The final step is to create a plan for executing your modernization ideas. But before launching large-scale modernization, test your plan on a smaller project. Deploying a few integrations will help you identify risks that, left unresolved, could delay your modernization program.

Steve Wilkes, our CTO, recommends starting with initiatives that provide the fastest ROI as proof points. He says "one path for initial modernization efforts we have seen over and over again with our customers that can rapidly provide results is to migrate some key databases and the applications that use them to the cloud."

With the roadmap ready and adjusted to your specific needs, you can move forward and focus on crucial modernization tasks.

2. Add real-time data ingestion techniques

Introduce a broader range of modern data ingestion techniques, including real-time, and avoid outdated “batch” ETL techniques that cause latency and poor performance.

Steve says that "for modern applications, real-time user experience is the new standard. Users expect data to be fresh, and for their reports to show accurate up-to-the-second information."

Advanced data integration platforms allow you to capture and ingest data faster. Rather than arriving hours or days later, data enters staging areas, file systems, and other targets in near real time.

You can then more effectively gather and analyze data from on-premise or cloud databases, sensors, robots, vehicles, and other sources. Advanced data ingestion techniques also enable companies to rapidly react to changing operational and business circumstances.

Manufacturing robots, for instance, can inform operators of bad parts instantaneously instead of the problems becoming apparent only when production halts. Or customer data from contact centers can be written to a customer relationship management (CRM) tool in real-time. Support agents and sales teams would then have access to up-to-date information all the time.

3. Support self-service access

Ensure that data integration enhances self-service access and, by extension, allows a broader group of analysts to run data analysis and visualization queries. Self-service can take other forms, including data prep and report creation. With users doing many tasks independently, IT teams have more time to tackle complex tasks.

Modern data integration solutions support self-service access in different ways. Integration tech, for instance, can feed raw data into databases and file systems and enable users to conduct data exploration and analytics. Also, integration platforms make it easy to visualize data in a user-friendly way.

Steve says that "data integration used to be a proprietary skill that only a few had adequate knowledge to execute. With modern data integration tools like Striim, all analysts have single-click access to powerful data integration capabilities."

4. Take advantage of new data platform types

Leverage new data platform types, such as Google Cloud, Microsoft Azure, and Snowflake. The change data capture (CDC) feature plays an essential role in these efforts. Change data capture allows you to continuously migrate data from on-premises and cloud-based data warehouses to new platforms.

Modern data integration solutions also provide in-flight data processing. Data is delivered in a format suitable for advanced analytics. Advanced data integration architecture allows you to integrate data from Oracle, PostgreSQL, AWS RDS, and other data warehouses or databases to Google Cloud. Or you can continuously move data from various sources to different Azure Analysis Services or Cosmos DB. Whatever your preference is, integration technologies play an important role. You can establish, run, and enrich real-time data streams to new data platform types and execute your digital transformation strategy.

"New platforms help companies innovate faster with low-code/no-code applications fully managed in the cloud," says Steve. "Traditionally companies would have to set aside or purchase new hardware in their data centers, install thick software clients, and build software in languages like C, Java, and shell scripts. The new cloud-based paradigms have truly shifted the build-vs-buy question firmly into the buy category – where buy is more of a monthly lease than a costly one-time purchase."

5. Get value from various types of data

Working with cutting-edge data integration platforms allows you to capture and get business and analytics value from multi-structured, unstructured, and non-traditional data. These platforms transform data into a consumable format that you can work with. You can also combine information you already collect in a CRM tool with external data generated from social media, sensors, emails, events, and audiovisual sources. The fact that each of these sources has its own format is no longer an obstacle to handling data.

According to Steve, “the ability to not only source a large variety of data in real-time, but to process, combine, and enrich it while it is moving enables organizations to understand information contextually and make the correct decisions, faster.”

Businesses can then gain more accurate insights. For instance, instead of merely relying on its sales data, a company can run a sentiment analysis of social media to gauge how people are responding to new products. If a negative tone appears to be dominant, the company can analyze this problem further.

6. Partner with a versatile data integration vendor

Choose a data integration vendor that supports on-premises and cloud deployment and different types of integration (real-time, batch). This vendor will provide you with much-needed flexibility. As your data requirements evolve, data may have to be stored and moved across a number of private clouds, public clouds, on-premise databases, and other environments. Modern data architecture should support integration across all of these points.

Steve says that “your data integration vendor should be a partner for the long-term. This means they need to understand and support modern approaches to integration, including how the platform is accessed and which endpoints are supported. While there is really no such thing as future proof, your vendor should definitely not be living in the past.”

Versatile data integration vendors should also offer in-flight data processing capabilities, such as denormalization, enrichment, filtering, and masking. These data transformation processes minimize the ETL workload. Also, they reduce the architecture complexity, enable full resiliency, and improve compliance with data privacy regulations.

Integration tech helps you maintain a competitive edge

Modernizing data integration processes enables you to harness the power of digital transformation. Having different types, volumes, and sources of data is no longer a challenge. Modern integration platforms bring data to a unified environment and allow your team to gain critical insights.

Organizations are advised to focus on key modernization tasks, such as adding new ingestion techniques, exploring new platform types, and supporting self-service access. And as data fuels growth in today’s economy, improving your integration tech goes a long way toward maintaining a competitive edge.

Summing this up, Steve says “your journey to modernization needs a sound roadmap, but you also need a way of getting there. A real-time data integration platform like Striim is the engine that drives digital transformation.”

8 Reasons You Need Change Data Capture Solutions

Data is the new oil, but it’s only useful if you can move, analyze, and act on it quickly. A Nucleus Research study shows that tactical data loses half its value 30 minutes after it’s generated, while operational data loses half its value after eight hours.

Change data capture (CDC) plays a vital role in the efforts to ensure that data in IT systems is quickly ingested, transformed, and used by analytics and other types of platforms. Change data capture is a software process that identifies changes to data in source systems and replicates those changes to target systems. And this process can be achieved using various change data capture methods, such as table deltas and database log mining.
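To make the table-delta method mentioned above concrete, here is a minimal Python sketch (not any product's implementation): two snapshots of a table, keyed by primary key, are diffed to recover the inserts, updates, and deletes that happened between them. The table contents are hypothetical.

```python
def table_delta(before: dict, after: dict):
    """Compare two {pk: row} snapshots and return (inserts, updates, deletes)."""
    inserts = {pk: row for pk, row in after.items() if pk not in before}
    deletes = {pk: row for pk, row in before.items() if pk not in after}
    updates = {pk: row for pk, row in after.items()
               if pk in before and before[pk] != row}
    return inserts, updates, deletes

# Example: a customers table snapshotted before and after some activity.
before = {1: {"name": "Ada"}, 2: {"name": "Bo"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}

ins, upd, dele = table_delta(before, after)
print(ins)   # row 3 was inserted
print(upd)   # row 1 was updated
print(dele)  # row 2 was deleted
```

Log mining avoids the cost of re-reading whole tables like this: instead of comparing snapshots, it reads the changes directly from the database's transaction log.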

Failure to handle data effectively prevents companies from taking advantage of their data assets. To help teams modernize their data environment, we explore reasons why you need change data capture solutions and different tactics you can use.

1. Change Data Capture accelerates reporting and business intelligence

CDC enables companies to quickly move data from different systems, like customer relationship management (CRM) or enterprise resource planning (ERP), into a single dashboard. Faster data collection then leads to timely reporting and improved business intelligence capabilities.

Improved reporting and business intelligence benefit businesses in different ways. For example, companies can use CDC to report on customer purchases more quickly; with timely purchase data in hand, sales teams can extend special offers to qualifying customers sooner and close more deals.

Manufacturers, likewise, can use CDC to ensure that data generated by production machines moves instantaneously to reporting or intelligence dashboards. Analysts can then discover that, for example, specific manufacturing lines are slower than usual. These findings can be used not only to fix the inefficient machines but also to better manage supply chain operations.

2. Change Data Capture connects different database systems

CDC helps teams with incompatible database systems share data in near real-time. One way to do that is by using CDC to populate a staging database with data from different systems. Users can then access the staging database without negatively affecting the performance of primary databases.

CDC especially benefits mid- to large-sized companies that often use a number of different database systems. Support agents in mid- to large-sized companies often need to access data on products purchased by customers, a piece of information typically managed by sales teams. In this case, CDC can be used to move relevant data to a data mart and have support agents access it to get the information they need.

CDC also helps companies better integrate their IT infrastructure following an acquisition or merger. Corporate systems often tend to be different or even incompatible. This challenge can be solved by merging disparate data sets into a single repository, which allows teams to access data and carry out their reporting and analysis work. The repository can also feed data to new systems. And the original databases can continue running for as long as necessary.

3. Change Data Capture pushes data to multiple lines of business

Companies can use CDC to continuously update a data mart with sales or customer data and have different lines of business access that data through web apps. Providing data through data marts reduces the user traffic previously directed to remote databases, including CRM or ERP.

Manufacturers can also use change data capture to ensure customer data, such as canceled or changed orders, is promptly forwarded to various business units, such as production and finance teams. Employees can then take appropriate measures to change a manufacturing deadline or update a sales forecast.

4. Change Data Capture handles two-way data flows

CDC helps companies instantaneously collect and share business data. Such two-way data flows provide departments with the information needed to make new decisions and policies.

Take, for instance, an insurance business. When working on new pricing models or business strategies, corporate teams need to collect and access multiple data points. They need to review data on client responses to past pricing models. Changes in consumer behavior, such as unusual buying patterns in specific regions, are of interest as well. CDC solutions can be used to feed all of these data points into the systems used by the corporate teams. Armed with these insights, teams can then develop better rates and policies. But if these pieces of information aren’t refreshed continuously, insurers may create policies using outdated data.

And information on new policies and rates needs to be quickly pushed to customer-facing apps and systems used by sales agents. Failure to do so may cause agents to prepare inaccurate offers to potential clients, for example. Once again, CDC solutions can be used to push data to specific apps or systems.

5. Change Data Capture improves a master data management system

IT teams can use CDC to draw data from multiple databases and continuously update the master data management (MDM) system, a master record of critical data. CDC, in combination with data integration tech, can transform or adjust data so it reaches MDM in the desired format. Various departments and systems can then pull data from the MDM system or use it for reporting, analysis, and operational purposes.

6. Change Data Capture integrates apps with otherwise incompatible databases

Businesses can use data integration platforms with CDC capabilities to integrate certain software tools into otherwise incompatible in-house database systems. Doing so provides teams with more flexibility when it comes to choosing and deploying business apps. Employees can now focus on working with apps that help the company reach its business goals instead of worrying about database compatibility.

7. Change Data Capture reduces pressure on operational databases

Companies can use CDC to create a copy of operational databases that users can then access. This reduces the stress on vital production systems by diverting heavy user traffic to a secondary database that’s constantly refreshed. The primary operational databases are then less likely to suffer issues, such as unanticipated downtime or poor performance.
Analytics queries, in particular, can consume a lot of processing power. Diverting them to secondary databases removes a major source of stress to the production system.

8. Change Data Capture acts as an integral part of your disaster recovery or backup plan

Businesses can use CDC to maintain a standby copy of their data that can be accessed in the event of a disaster. And disaster recovery solutions are mandatory in some industries where failures can lead to catastrophic consequences for users and businesses. CDC-powered disaster recovery plans can thus prevent major disruptions.

How to select Change Data Capture tools

Companies evaluate potential CDC solutions based on various factors. And while some companies may have industry-specific requirements, there are certain features that any platform with change data capture capabilities should have, including:

    • Support for log reading
    • Transaction integrity
    • In-flight change data processing
    • Distributed processing in a clustered environment
    • Continuous monitoring of change data streams
    • Real-time delivery to a wide range of targets

The fintech startup Inspyrus, for instance, is using Striim’s CDC capabilities to feed real-time operational data from its databases to Snowflake. As a result, customers can now access visualized real-time invoicing data and enjoy improved business intelligence reports.

Supercharging business growth

From improving data flows to strengthening data recovery solutions, change data capture can be of use in many ways. Getting the most value out of this technology is of vital importance. CDC enables teams to move and act on data quickly, producing new insights and business opportunities. And only then are companies able to use data as a fuel that supercharges their growth.

A Guide to Modern Database Replication

With the rise of digital-first experiences, companies are grappling with more data than ever before. Gartner notes that 70% of organizations using cloud services today plan to increase their cloud spending to support digital experiences. They elaborate: "Cloud adoption therefore becomes a significant means to stay ahead in a post-COVID-19 world focused on agility and digital touch points."

Given the rapid adoption of cloud technologies, integration with core private cloud and on-premises operations is a critical requirement. This is where modern database replication platforms play a significant role in enabling that strategy.

What is Data Replication?

Data replication is the process of copying data to different physical and virtual locations in a manner where each instance of the data stays consistent – increasing availability and accessibility across networks. Data replication carries several technical implications and caveats, including the tradeoff between eventual and strong consistency within a single distributed system.

What is Database Replication?

Database replication is the process of copying data from a source database to one or more target databases. A widely-used approach for database replication is replaying the data manipulation language (or DML) statements from the source database to the relevant targets with minimal latency to ensure data is consistent across each instance.
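As a toy illustration of the DML-replay approach described above, here is a hedged Python sketch using in-memory sqlite3 databases as stand-in targets. The accounts table and the captured statements are hypothetical; a real replication platform mines these changes from the source's transaction log rather than keeping them in a list.

```python
import sqlite3

# Two target databases that must stay consistent with the source.
targets = [sqlite3.connect(":memory:") for _ in range(2)]
for t in targets:
    t.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

# DML statements as captured from the source, in commit order.
captured_dml = [
    ("INSERT INTO accounts VALUES (?, ?)", (1, 100.0)),
    ("INSERT INTO accounts VALUES (?, ?)", (2, 250.0)),
    ("UPDATE accounts SET balance = balance - ? WHERE id = ?", (40.0, 1)),
]

# Replay each statement against every target so all instances converge.
for sql, params in captured_dml:
    for t in targets:
        t.execute(sql, params)
        t.commit()

for t in targets:
    print(t.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
    # each target prints [(1, 60.0), (2, 250.0)]
```

Keeping the replay in commit order is what preserves consistency: applying the update before the insert would fail or produce a different balance.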

Database Replication in a Distributed World

In a multi-instance cloud environment with endless distributed applications, file systems, storage systems, databases, and data warehouses, every enterprise is striving to implement a single source of truth to drive their analytics and operations.

Yet as industry thought leader and lauded database researcher Dr. Michael Stonebraker notes, "One size does not fit all." The key to supporting diverse data management operations and analytics at scale is a modern data replication strategy.

However, as Striim Co-Founder and EVP of Product Alok Pareek notes in his lecture at Boston University's Department of Computer Science, there is significant complexity in handling change streams to reliably deliver database replication for real-world use cases.

We will elaborate on how those challenges are addressed in the section below, but first we will provide some context on the problem.

Why Data Replication?

Reliable data replication in a digital-first world is critical to every business. In retail, it can be the difference between a sale and an abandoned cart, a blocked suspicious transaction versus thousands of dollars lost in fraud, and an on-time delivery versus a lost package. 

8 Key Categories to Evaluate Modern Database Replication Platforms 

There are 8 key feature categories to consider when looking at modern data replication platforms to support digital-first experiences and cloud adoption. At a high level, these features should deliver the following:

  • Low-latency replication
  • Distributed architecture
  • Replication across private and public clouds

One real-world example of a modern database replication architecture is Macy's implementation, which uses Striim for replication to Google Cloud. Macy's approach can handle peak holiday transaction workloads.

1. Change data capture 

The ability to mine transaction logs and change-data from applications in real-time is the core requirement of data replication platforms. While change data capture (or CDC) has been around for decades, modern implementations take into account new interfaces to change streams.

This allows companies to build change data capture topologies that meet the scale of distributed cloud platforms. Striim's ground-up distributed architecture sets it apart from legacy CDC tools that rely entirely on brittle, file-based buffering.

2. Observable data at every step

A replication pipeline should expose what is happening at every hop – reads from the source, in-flight processing, and writes to the target – so teams can continuously monitor change data streams and catch lag or failures before they affect downstream consumers.

3. SQL-based access and transformation of real-time data

When combining data across multiple sources into a single 'source of truth', modern data replication solutions should be able to transform data with SQL constructs that data managers are already proficient with.
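As a rough sketch of the idea, the snippet below uses an in-memory sqlite3 table as a stand-in for a streaming SQL engine: a micro-batch of change events is normalized with ordinary SQL before delivery to the target. The table, columns, and transformation rules are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_stream (order_id INTEGER, amount_cents INTEGER, country TEXT)"
)

# A micro-batch of change events arriving from the source.
batch = [(101, 1999, "us"), (102, 525, "de"), (103, 87000, "us")]
conn.executemany("INSERT INTO orders_stream VALUES (?, ?, ?)", batch)

# Familiar SQL constructs normalize the data in flight: convert cents to
# dollars, uppercase country codes, and filter out small orders.
transformed = conn.execute("""
    SELECT order_id, amount_cents / 100.0 AS amount_usd, UPPER(country)
    FROM orders_stream
    WHERE amount_cents >= 1000
    ORDER BY order_id
""").fetchall()

print(transformed)  # [(101, 19.99, 'US'), (103, 870.0, 'US')]
```

The point is that analysts who already know SQL can express the transformation directly, without learning a new programming model.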

4. Horizontally scalable

Here is just a short list of tools in the modern data stack that are built to scale horizontally on commodity hardware:

  • Azure PostgreSQL Hyperscale
  • BigQuery (fully managed)
  • Databricks
  • Apache Kafka
  • Apache Spark 
  • MongoDB
  • MariaDB
  • Snowflake (fully managed)
  • And the list goes on…

See a trend here?

Yet legacy database replication tools are built to scale vertically on a single machine. This was suitable for the era of scaling cores vertically on proprietary machines running on-premise, but does not meet the needs of modern data management.

A modern database replication solution should support multi-node deployments across clouds to help unify your data at scale with maximum reliability.

5. A single view of metadata across nodes

Building on the last point: a horizontally distributed replication platform would be impossible to manage without a single, queryable view into your objects.

6. Data delivery SLAs for minimum latency

In the example noted above, Macy’s modern replication stack was built to deliver end-to-end latencies under 200 milliseconds – even with Black Friday loads of 7500+ transactions per second (or 19 billion transactions a month). Choosing a data replication platform that can handle peak loads and monitor latency is critical to operate at scale.

7. Flexibility between ETL and ELT

A modern data replication tool should give users the choice to rely on external transformation tools like dbt when applicable. However, for low-latency transformation requirements and custom EtLT, modern database replication should support streaming transformations. In some cases data needs to be formatted and transformed just to adhere to the format of the target data warehouse, and in those cases ELT is not a suitable solution.

8. Data validation at scale

Modern data replication platforms should give users the ability to query data validation statistics to ensure all data read from a source application is successfully written to its target platform. A practical use case is reporting organization-wide data reliability: if your data validation dashboard shows that your target data warehouse is hours behind your source database, you can immediately notify other orgs that data may be stale while you triage the issue, with root causes ranging from a server outage to bottlenecked resources in the data warehouse.
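A validation check like the one described above can be sketched in a few lines of Python: compare per-table row counts and a cheap checksum between source and target, and flag tables that have fallen behind. The table names, counts, and checksum values are made up for illustration.

```python
def validate(source_stats: dict, target_stats: dict) -> dict:
    """Each stats dict maps table -> (row_count, checksum)."""
    report = {}
    for table, (src_count, src_sum) in source_stats.items():
        tgt_count, tgt_sum = target_stats.get(table, (0, None))
        report[table] = {
            # Rows read from the source but not yet written to the target.
            "missing_rows": src_count - tgt_count,
            "in_sync": src_count == tgt_count and src_sum == tgt_sum,
        }
    return report

# Hypothetical statistics gathered from both ends of the pipeline.
source = {"orders": (10_000, 0x5AC3), "customers": (2_500, 0x91F0)}
target = {"orders": (9_950, 0x51AA), "customers": (2_500, 0x91F0)}

report = validate(source, target)
print(report["orders"])     # orders is 50 rows behind -> data may be stale
print(report["customers"])  # customers is fully in sync
```

In production, the counts and checksums would come from queries against the source and target systems, and a mismatch would trigger an alert rather than a print.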
