Unleashing the Power of Striim: Oracle to Snowflake Data Stream with Real-time CDC

 

Watch our recorded webinar to discover how Striim, a powerful enterprise-grade streaming integration and intelligence platform, revolutionizes the process of data migration and real-time Change Data Capture (CDC) from Oracle to Snowflake.
As organizations increasingly adopt Snowflake’s cloud data platform for its scalability, flexibility, and performance, the need to seamlessly transition data from legacy systems becomes paramount. Striim offers a comprehensive solution by leveraging its advanced capabilities to ensure a smooth, efficient, and near real-time migration process.
In this webinar, our experts will demonstrate how Striim’s intuitive interface simplifies the complexities of Oracle to Snowflake migration, saving significant time and effort. They will showcase Striim’s powerful CDC functionality, enabling you to capture and replicate changes from Oracle databases in real time to Snowflake, ensuring data integrity and continuous availability.
By attending this session, you will gain insights into:
  1. The challenges of traditional data migration methods and the advantages of using Striim’s streamlined approach.
  2. How Striim’s intelligent integration and transformation capabilities enable seamless data transfer from Oracle to Snowflake.
  3. Real-time Change Data Capture (CDC) and its significance in maintaining data accuracy and consistency during migration.
  4. A live demonstration of Striim’s Oracle to Snowflake migration and CDC functionalities, highlighting key features.
  5. Customer success stories and real-world examples of organizations that have leveraged Striim to achieve efficient Oracle to Snowflake data migration.
Don’t miss this opportunity to explore how Striim empowers your organization to unlock the full potential of Snowflake’s cloud data platform. Watch now and embark on a journey towards seamless data migration and real-time CDC with Striim’s cutting-edge capabilities.

Presented by:

Srdan Dvanajscak
Director of Solutions Consulting, Striim

 

Dave Buswell
Solutions Architect, Striim

A Comprehensive Guide to Migrating On-Premise Oracle Data to Databricks Unity Catalog with Python and Databricks Notebook

Tutorial


Seamlessly establish connectivity between the Oracle database and Databricks

Benefits

Migrate your database data and schemas to Databricks in minutes.

Stream operational data from Oracle to your data lake in real time.

Automatically keep schemas and models in sync with your operational database.

In today’s data-driven world, businesses are constantly seeking ways to enhance data accessibility and accelerate analytics workflows. In this comprehensive guide, we will explore how to seamlessly bring data from an on-premise Oracle database to Databricks Unity Catalog using the powerful combination of Databricks Notebook and Python. Databricks Unity Catalog serves as an enterprise data catalog and collaborative platform for data discovery and management, enabling organizations to centralize and leverage their data assets effectively. Additionally, Striim is a robust real-time data integration platform, which complements Databricks Unity Catalog by facilitating continuous data ingestion and synchronization. By following these step-by-step instructions, you’ll be able to harness the benefits of cloud computing, streamline data integration, and enable data agility for your organization with the integration of Databricks Unity Catalog and Striim.

Before diving into the migration process, ensure you have the following prerequisites in place:

  1. Access to a Striim instance: You will need a functional Striim instance configured to communicate with the source Oracle database and the Databricks environment. If you don’t have a Striim instance set up, refer to the Striim Cloud documentation for deployment and configuration instructions.
  2. Access to a Databricks instance with Unity Catalog enabled: To migrate data from the on-premise Oracle database to Databricks Unity Catalog, you’ll need access to a Databricks instance where Unity Catalog is enabled. If you are unsure how to enable Unity Catalog in your Databricks instance, you can follow the instructions provided in the Databricks documentation: Enabling Unity Catalog.
  3. Familiarity with creating Databricks clusters and notebooks.
  4. Ensure that you have created the necessary target schema and tables within the Databricks database prior to proceeding.

Once you have ensured that you meet the prerequisites, follow the steps below to set up your Databricks environment:

Step 1: Create a Databricks Cluster

In your Databricks instance, navigate to the cluster creation interface. Configure the cluster settings according to your requirements, such as the cluster type, size, and necessary libraries.

Additionally, make sure to set the following environment variables by clicking on “Advanced Options” and selecting “Spark”:

DATABRICKS_ACCESS_TOKEN=<access_token>

PYSPARK_PYTHON=/databricks/python3/bin/python3

ORACLE_JDBC_URL=<jdbc_oracle_conn_url>

DATABRICKS_JDBC_URL=<jdbc_databricks_conn_url>

DATABRICKS_HOSTNAME=<databricks_host>

ORACLE_USERNAME=<oracle_username>

STRIIM_USERNAME=<striim_username> # We will be using the ‘admin’ user

ORACLE_PASSWORD=<oracle_password>

STRIIM_PASSWORD=<striim_pass>

STRIIM_IP_ADDRESS=<striim_ip_address> #Example: <ip_address>:9080

Later in our Databricks notebook, we will extract the values of these environment variables to obtain a Striim authentication token and create our first data pipeline.

Note: To adhere to best practices, it is recommended to use Databricks Secrets Management for storing these credentials securely. By leveraging Databricks Secrets Management, you can ensure that sensitive information, such as database credentials, is securely stored and accessed within your Databricks environment. This approach helps enhance security, compliance, and ease of management.
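As a minimal sketch of that practice (the secret scope name below is hypothetical, and the environment-variable fallback is only for local testing), credentials can be read like this:

```python
import os

def get_credential(scope: str, key: str) -> str:
    """Read a credential from Databricks Secrets when running in a
    notebook, falling back to environment variables elsewhere.
    The scope name used below ('striim-creds') is a hypothetical example."""
    try:
        # dbutils is injected into Databricks notebooks; it is not
        # importable in plain Python, hence the NameError fallback.
        return dbutils.secrets.get(scope=scope, key=key)  # noqa: F821
    except NameError:
        return os.getenv(key, "")

striim_password = get_credential("striim-creds", "STRIIM_PASSWORD")
```

Storing `STRIIM_PASSWORD` and the other credentials in a secret scope keeps them out of cluster environment variables and notebook output.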

Create a Databricks Notebook: With the cluster up and running, you are ready to create a notebook. To do this, click on “New” in the Databricks workspace interface and select “Notebook.” Provide a name for your notebook and choose the desired programming language (Python) for your notebook.

By creating a notebook, you establish an environment where you can write and execute Python code to perform the necessary data extraction and loading tasks using Striim.

Once you have created your Databricks Python Notebook, follow the steps below to begin bringing data from an on-prem Oracle database to Databricks Unity Catalog:

Generate an Authentication Token: To interact with the Striim instance programmatically, we will use the Striim REST API. The first step is to generate an authentication token that will allow your Python code to authenticate with the Striim instance:

import requests, os

striim_username = os.getenv('STRIIM_USERNAME')
striim_password = os.getenv('STRIIM_PASSWORD')
striim_ip_address = os.getenv('STRIIM_IP_ADDRESS')  # Example: <Striim_IP_Address>:9080

striim_api_info = {
    'auth_endpoint': '/security/authenticate',
    'tungsten_endpoint': '/api/v2/tungsten',
    'applications_endpoint': '/api/v2/applications'
}

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
}

data = 'username={username}&password={password}'.format(username=striim_username,
                                                       password=striim_password)

response = requests.post('http://{ip_address}{auth_endpoint}'.format(ip_address=striim_ip_address,
                                                                     auth_endpoint=striim_api_info['auth_endpoint']),
                         headers=headers, data=data)
token = response.json()['token']

The code snippet generates an authentication token by making an HTTP POST request to the Striim REST API. It retrieves the Striim username, password, and IP address from environment variables, sets the necessary headers, and sends the request to the authentication endpoint. The authentication token is then extracted from the response for further API interactions.

Step 2: Create a Striim Application

With the authentication token in hand, you will use Python and the Striim REST API to create a Striim application. This application will serve as the bridge between the Oracle database and Databricks.

headers = {
    'authorization': 'STRIIM-TOKEN {token}'.format(token=token),
    'content-type': 'text/plain'
}

# Extracting Oracle database credentials from env vars
oracle_jdbc_url = os.getenv('ORACLE_JDBC_URL')
oracle_username = os.getenv('ORACLE_USERNAME')
oracle_password = os.getenv('ORACLE_PASSWORD')

# Extracting Databricks credentials from env vars
databricks_hostname = os.getenv('DATABRICKS_HOSTNAME')
databricks_jdbc_url = os.getenv('DATABRICKS_JDBC_URL')
databricks_access_token = os.getenv('DATABRICKS_ACCESS_TOKEN')

app_data = '''
CREATE APPLICATION InitialLoad_OracleToDatabricks;
CREATE SOURCE Oracle_To_Databricks USING Global.DatabaseReader (
  Username: '{oracle_username}',
  QuiesceOnILCompletion: false,
  DatabaseProviderType: 'Oracle',
  ConnectionURL: '{oracle_jdbc_url}',
  FetchSize: 100000,
  Password_encrypted: 'false',
  Password: '{oracle_password}',
  Tables: 'DMS_SAMPLE.NBA_SPORTING_TICKET' )
OUTPUT TO ORACLE_OUTPUT;
CREATE OR REPLACE TARGET Databricks USING Global.DeltaLakeWriter (
  hostname: '{databricks_hostname}',
  Tables: 'DMS_SAMPLE.NBA_SPORTING_TICKET,main.default.nba_sporting_ticket',
  Mode: 'APPENDONLY',
  stageLocation: '/',
  ParallelThreads: '12',
  personalAccessToken: '{databricks_access_token}',
  CDDLAction: 'Process',
  adapterName: 'DeltaLakeWriter',
  personalAccessToken_encrypted: 'false',
  uploadPolicy: 'eventcount:100000,interval:5s',
  ConnectionRetryPolicy: 'initialRetryDelay=10s, retryDelayMultiplier=2, maxRetryDelay=1m, maxAttempts=5, totalTimeout=10m',
  connectionUrl: '{databricks_jdbc_url}' )
INPUT FROM ORACLE_OUTPUT;
END APPLICATION InitialLoad_OracleToDatabricks;
'''.format(oracle_username=oracle_username,
           oracle_jdbc_url=oracle_jdbc_url,
           oracle_password=oracle_password,
           databricks_hostname=databricks_hostname,
           databricks_access_token=databricks_access_token,
           databricks_jdbc_url=databricks_jdbc_url)

response = requests.post('http://{ip_address}{tungsten_endpoint}'.format(ip_address=striim_ip_address,
                                                                         tungsten_endpoint=striim_api_info['tungsten_endpoint']),
                         headers=headers,
                         data=app_data)

The provided code creates a Striim application named InitialLoad_OracleToDatabricks to migrate the DMS_SAMPLE.NBA_SPORTING_TICKET Oracle table.

The code sets the necessary headers for the HTTP request using the authentication token obtained earlier. It retrieves the Oracle database and Databricks credentials from the environment variables.

Using the retrieved credentials, the code defines the application data, specifying the source as the Oracle database using the Global.DatabaseReader adapter, and the target as Databricks using the Global.DeltaLakeWriter adapter. More information about the Delta Lake Writer adapter can be found here: https://www.striim.com/docs/en/databricks-writer.html

After formatting the application data with the credentials and configuration details, the code sends a POST request to the Striim tungsten endpoint to create the application.

To verify that it was created successfully, we will log in to the Striim console and go to the “Apps” page.

Step 3: Deploy and Start the Striim Application

Once the Striim application is created, we will deploy and start it using the following HTTP POST requests:

headers = {
    'Authorization': 'STRIIM-TOKEN {token}'.format(token=token),
    'Content-Type': 'application/json',
}

# POST request to DEPLOY the application
response = requests.post(
    'http://{ip_address}{applications_endpoint}/admin.InitialLoad_OracleToDatabricks/deployment'.format(
        ip_address=striim_ip_address,
        applications_endpoint=striim_api_info['applications_endpoint']),
    headers=headers,
)

# POST request to START the application
response = requests.post(
    'http://{ip_address}{applications_endpoint}/admin.InitialLoad_OracleToDatabricks/sprint'.format(
        ip_address=striim_ip_address,
        applications_endpoint=striim_api_info['applications_endpoint']),
    headers=headers,
)

Step 4: Validate the Striim Application Status and Metrics

headers = {
    'authorization': 'STRIIM-TOKEN {token}'.format(token=token),
    'content-type': 'text/plain',
}

# Tungsten console command that monitors the Databricks target component
target_component_name = 'mon admin.Databricks;'

post_response = requests.post('http://{ip_address}{tungsten_endpoint}'.format(ip_address=striim_ip_address,
                                                                              tungsten_endpoint=striim_api_info['tungsten_endpoint']),
                              headers=headers,
                              data=target_component_name)

target_response_summary = post_response.json()

print("Status: ", target_response_summary[0]['executionStatus'])
print("Timestamp: ", target_response_summary[0]['output']['timestamp'])
print("Total Input (Read): ", target_response_summary[0]['output']['input'])
print("Total Output (Written): ", target_response_summary[0]['output']['output'])

Output:

Status:  Success

Timestamp:  2023-06-07 16:41:26

Total Input (Read):  1,510,000

Total Output (Written):  1,510,000


If the Total Output and Input values are equal, it indicates the successful migration of Oracle data to the Databricks catalog. To further validate the completeness of the migration, execute the following query in the Databricks editor to verify the total count of our NBA_SPORTING_TICKET table:
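As a sketch of what that check looks like (the SQL mirrors the target table created earlier, and the dictionary shape mirrors the monitor response from Step 4; the values are illustrative):

```python
# SQL to run in the Databricks editor (target table from the pipeline above):
#   SELECT COUNT(*) FROM main.default.nba_sporting_ticket;

def migration_complete(component_summary: dict) -> bool:
    """Compare the reader and writer event counts from the Striim monitor
    response; equal counts indicate a complete initial load."""
    output = component_summary['output']
    return output['input'] == output['output']

# Shape mirrors the Step 4 monitor output (illustrative values)
sample = {'executionStatus': 'Success',
          'output': {'timestamp': '2023-06-07 16:41:26',
                     'input': 1510000, 'output': 1510000}}
print(migration_complete(sample))  # True when counts match
```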

Conclusion

In conclusion, this comprehensive guide has walked you through the process of migrating data from an on-premise Oracle database to the Databricks Unity Catalog using Databricks Notebook and Python. By leveraging the power of Striim and its REST API, we were able to seamlessly establish connectivity between the Oracle database and Databricks. Through the step-by-step instructions and code snippets provided, you have learned how to generate an authentication token, create a Striim application, and deploy it to facilitate the data transfer process. With the integration of Databricks’ data processing capabilities and the centralized data catalog provided by the Unity Catalog, organizations can unlock data agility and enable streamlined data integration.

Wrapping Up: Start your Free Trial Today

In this recipe, we have walked you through the steps for migrating on-premise Oracle data to Databricks Unity Catalog with Python and a Databricks Notebook. You can easily set up a streaming app by configuring your Databricks target. As always, feel free to reach out to our integration experts to schedule a demo, or try Striim Developer for free here.

Tools you need

Striim

Striim’s unified data integration and streaming platform connects clouds, data and applications.

Databricks

Databricks combines the data warehouse and data lake into a lakehouse architecture.

Python

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation via the off-side rule.

When Change Data Capture Wins

A guide on when real-time data pipelines are the most reliable way to keep production databases and warehouses in sync.

Photo by American Public Power Association on Unsplash

 

Co-written with John Kutay of Striim

Data warehouses emerged after analytics teams slowed down the production database one too many times. Analytical workloads aren’t meant for transactional databases, which are optimized for low-latency reads, writes, and data integrity. Similarly, there’s a reason production applications are run on transactional databases.

Definition: Transactional (OLTP) data stores are databases that keep ACID (atomicity, consistency, isolation, and durability) properties in a transaction. Examples include PostgreSQL and MySQL, which scale to 20 thousand transactions per second.

Analytics teams aren’t quite so concerned with inserting 20 thousand rows in the span of a second — instead, they want to join, filter, and transform tables to get insights from data. Data warehouses optimize for precisely this using OLAP.

Definition: OLAP (online analytical processing) databases optimize for multidimensional analyses on large volumes of data. Examples include popular data warehouses like Snowflake, Redshift, and BigQuery.

Different teams, different needs, different databases. The question remains: if analytics teams use OLAP data warehouses, how do they get populated?


Use CDC to improve data SLAs

Let’s back up a step. A few examples of areas analytics teams own:

  • Customer segmentation data, sent to third party tools to optimize business functions like marketing and customer support
  • Fraud detection, to alert on suspicious behavior on the product

If these analyses are run on top of a data warehouse, the baseline amount of data required in the warehouse is just from the production database. Supplemental data from third party tools is very helpful but not usually where analytics teams start. The first approach usually considered when moving data from a database to a data warehouse is batch based.

Definition: Batch process data pipelines involve checking the source database on scheduled intervals and running the pipeline to update data in the target (usually a warehouse).

There are technical difficulties with this approach, most notably the logic required to know what has changed in the source and what needs to be updated in the target. Batch ELT tools have really taken this burden off of data professionals. No batch ELT tool, however, has solved for the biggest caveat of them all: data SLAs. Consider a data pipeline that runs every three hours. Any pipelines that run independently on top of that data, even if running every three hours as well, would in the worst case scenario be six hours out of date. For many analyses, the six hour delay doesn’t move the needle. This begs the question: when should teams care about data freshness and SLAs?
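The worst-case arithmetic in that example generalizes: chained batch pipelines each add up to one full scheduling interval of staleness. A quick sketch:

```python
def worst_case_staleness(interval_hours):
    """Worst-case data age for a chain of independently scheduled batch
    pipelines: each stage's data can be up to one full interval old by
    the time the next stage reads it."""
    return sum(interval_hours)

# Two independent pipelines, each running every three hours:
print(worst_case_staleness([3, 3]))  # 6 hours in the worst case
```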

Definition: An SLA (service level agreement) is a contract between a vendor and its customers as to what they can expect from the vendor when it comes to application availability and downtime. A data SLA is an agreement between the analytics team and its stakeholders around how fresh the data is expected to be.

When fresh data makes a meaningful impact on the business, that’s when teams should care. Going back to the examples of analytics team projects, if a fraudulent event happens (like hundreds of fraudulent orders) time is of the essence. A data SLA of 3 hours could be what causes the business to lose thousands of dollars instead of less than $100.

When freshness can’t wait — cue CDC, or change data capture. CDC tools read change logs on databases and mimic those changes in the target data. This happens almost immediately, with easy reruns if a data pipeline encounters errors.

With live change logs, CDC tools keep two data stores (a production database and analytics warehouse) identical in near-realtime. The analytics team is then running analyses on data fresh as a daisy.

Getting started with CDC


The most common production transactional databases are PostgreSQL and MySQL, which have both been around for decades. Being targets more often than sources, warehouses don’t usually support CDC in the same way (although even this is changing).

To set up a source database for CDC, you need to:

  • Make sure write-ahead logs (WAL) are enabled and the WAL retention timeout is high enough. This is configured directly in database settings.
  • Make sure archive logs are stored on the source based on the CDC tool’s current specifications.
  • Create a replication slot, where a CDC tool can subscribe to change logs.
  • Monitor source and target database infrastructure to ensure neither is overloaded.
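For a PostgreSQL source, the first three items translate roughly into statements like these (a sketch only; the slot name and decoder plugin are example values, and your CDC tool’s documentation governs the exact settings):

```sql
-- Enable logical decoding so change logs are produced (restart required)
ALTER SYSTEM SET wal_level = 'logical';

-- Retain enough WAL for the CDC tool to catch up after a pause
ALTER SYSTEM SET wal_keep_size = '1GB';

-- Create a replication slot the CDC tool can subscribe to
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```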

On the source database, if a row of data changes to value A, then to value B, then back to A, this behavior is replayed on the target warehouse. The replay ensures data integrity and consistency.
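That replay can be pictured as applying the ordered change log one event at a time (a toy sketch, not any particular CDC tool’s implementation):

```python
def replay(initial, change_log):
    """Apply an ordered log of new values to a target the way a CDC
    pipeline replays source changes: every intermediate state is
    visited, and the final state matches the source."""
    value, history = initial, [initial]
    for new_value in change_log:
        value = new_value
        history.append(value)
    return value, history

# A row changes to 'A', then 'B', then back to 'A':
final, states = replay(None, ['A', 'B', 'A'])
print(final, states)  # A [None, 'A', 'B', 'A']
```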

While open source CDC solutions like Debezium exist, hosted CDC solutions allow users to worry less about infrastructure and more about the business specifications of the pipeline, unique to their business.

As a consultant in analytics & go-to-market for dev-tools, I previously led the data engineering function at Perpay and built out a change data capture stack. From my perspective, change data capture isn’t just about real-time analytics. It’s simply the most reliable and scalable way to copy data from an operational database to analytical systems, especially when downstream latency requirements are at play.

Oracle to Snowflake Initial Load

 

1. In this video tutorial, we will show you how to complete the initial load from Oracle to Snowflake using Striim’s Flow Designer.

2. To get started, we will go to the Create App page in Striim and click Start from Scratch using the Flow Designer. Next, you will name your new application and begin to design your initial load application.

3. In the components panel, search for database as a source and drag it over to the right. With this component, we will connect to Oracle as our source database.

4. Setting up Oracle as a Source requires your connection URL, Username, and Password. Under Advanced settings, you will enter the tables you are going to move and any other necessary details. Select New Output and enter a name for the data stream.

5. Now that you have your source configured, you will drag a database as a target component over to configure your Snowflake connection. The same information is required when setting up your target. Under Advanced Settings, be sure to enter your batch and commit policy.

6. Once you have Oracle and Snowflake configured, we will deploy your application.

7. Before running your application, you can preview the data stream by clicking on the blue eye icon or by going directly into Oracle and Snowflake. Let’s take a look at what is happening in our source and target.

8. In Oracle we can see that we have read 1,000 events, and Snowflake has not received any data at this point since the application is not yet running.

9. Now we will start the application and can watch as the initial load is completed in real time. Our Striim Application Progress screen shows that 1,000 events have been moved from Oracle to Snowflake.

10. In Snowflake, we will double-check that all 1,000 events were written by running a query. You can view additional details about your initial load by reviewing the Monitoring page in Striim.

11. In this video, you have seen an initial load application from Oracle to Snowflake created and running in just a few minutes. Thanks for watching!

Snowflake to Oracle Initial Load

1. In this video tutorial, we will show you how to complete the initial load from Snowflake to Oracle using Striim’s Flow Designer.

2. To get started, we will go to the Create App page in Striim and click Start from Scratch using the Flow Designer. Next, you will name your new application and begin to design your initial load application.

3. In the components panel, search for database as a source and drag it over to the right. With this component, we will connect to Snowflake as our source database.

4. Setting up Snowflake as a Source requires your connection URL, Username, and Password. Under Advanced settings, you will enter the tables you are going to move and any other necessary details. Select New Output and enter a name for the data stream.

5. Now that you have your source configured, you will drag a database as a target component over to configure your Oracle connection. The same information is required when setting up your target. Under Advanced Settings, be sure to enter your batch and commit policy.

6. Once you have Snowflake and Oracle configured, we will deploy your application.

7. Before running your application, you can preview the data stream by clicking on the blue eye icon or by going directly into Snowflake and Oracle. Let’s take a look at what is happening in our source and target.

8. In Snowflake we can see that we have read 1,000 events, and Oracle has not received any data at this point since the application is not yet running.

9. Now we will start the application and can watch as the initial load is completed in real time. Our Striim Application Progress screen shows that 1,000 events have been moved from Snowflake to Oracle.

10. In Oracle, we will double-check that all 1,000 events were written by running a query. You can view additional details about your initial load by reviewing the Monitoring page in Striim.

11. In this video, you have seen an initial load application from Snowflake to Oracle created and running in just a few minutes. Thanks for watching!

Introducing Striim for Databricks

Striim is excited to introduce our fully managed, purpose-built service for Databricks. In this demo, you will see how simple overall data pipeline configuration is from Oracle to Databricks. You will be able to set up a pipeline in under 5 minutes and watch the data land in Databricks in real time.

Striim for Databricks is the first fully managed, purpose-built streaming service for Databricks in the industry. Designed for everyone, it requires no prior ETL expertise, and the simplified user experience requires little to no coding. This solution also offers reduced total cost of ownership with consumption-based metering and billing.

In this demo, we will create an Oracle to Databricks pipeline and move your data in a few simple steps. In addition to this video, we have added inline documentation to help you understand the on-screen information.

When you launch the service, you will be brought to the Create a Pipeline screen. On this screen, you will enter in a Data Pipeline Name.

Then, you will connect to Databricks. Connection details to Databricks are saved so you can reuse them for future pipeline configurations. In this demo, we will create a new connection to Databricks, which requires the account keys to be entered. The service automatically validates the connection and checks for all the necessary prerequisites.


Similar to the target, the service will save the source connection details for future use. Prerequisite checks are run against the source as well and the report will be shown to you.

If the connection is valid, the service identifies the schema on the Oracle source and presents the list for you to select the correct one.

The service then checks for the compatibility of the source schema with Databricks and presents the table list for your selection. While selecting your tables, you can also choose the transformation per table that will be applied as the data flows through the pipeline in real time. For this demo, let’s choose to mask this specific column’s data.

Striim for Databricks also offers intelligent performance optimization with parallel data processing by grouping the tables.

A summary is shown in case you choose to make modifications before running the pipeline. In this demo, we reviewed it and started the first pipeline. As you can see, within a few seconds the pipeline was created and started to move the initial load automatically.

Striim for Databricks also has intuitive overview dashboards and monitoring screens. The source and target statuses are displayed here on the Overview screen. In this case, Oracle is online and green, and Databricks is Paused, which means the data is not yet flowing between the source and target. We will review our Oracle data and Databricks to ensure the data flow is going to move smoothly.

Let’s check what’s happening in our source and target. First, we will go into Oracle to check the number of records that have been moved through change data capture (CDC) for each table. Then we will check in Databricks that the same number of records have been updated for each table. We can also review the tables and columns that we have masked to ensure they were processed correctly.

We will also use the Manage Tables in Pipeline feature to remove any tables that we no longer want to stream.

Next, we will use the Optimize Pipeline Performance screen which will show us which tables in the pipeline may be causing issues in the data stream. We can then pause the pipeline to optimize performance by creating table groupings and reducing the time spent between batches being sent to Databricks.

Now we will go back to the Oracle database and insert values into the source table. As you can see, the pipeline immediately recognizes the changes on Oracle and starts capturing the changes in real time. If we run a query on Databricks now to check the changes made to that table, we will see the CDC events are already available.

Thanks for watching! You have now seen our seamless, automated, and real time data capture using Striim for Databricks service.

Striim for Snowflake: Stream data in real-time to Snowflake

Tutorial

Striim for Snowflake:
Real-time streaming ingest for Snowflake

Turn Snowflake into a real-time source of truth with Striim

Benefits

Operational Analytics 
Use non-intrusive CDC to Kafka to create persistent streams that can be accessed by multiple consumers and automatically reflect upstream schema changes

Empower Your Teams
Give teams across your organization a real-time view of your Oracle database transactions.

Get Analytics-Ready Data
Get your data ready for analytics before it lands in the cloud. Process and analyze in-flight data with scalable streaming SQL.

Introduction

Striim for Snowflake combines the power of fast data streaming with the simplicity of fully automated ELT (Extract-Load-Transform) data integration to replicate databases to Snowflake in real time.

Striim for Snowflake is also the only automated data integration product that leverages Snowflake’s new Snowpipe Streaming API to provide fast, cost-optimized data ingest.

You can try Striim for Snowflake yourself by signing up for a 14-day free trial with $1,000 worth of free credits included.

Follow this step-by-step guide to configure your data streaming pipeline in Striim for Snowflake.

Step 1: Configure your Snowflake target connection

  • To start building your data pipeline, first name your pipeline, and optionally add a description.

  • Select an existing connection or add a new connection from scratch by providing the hostname or account identifier, the username and password for your Snowflake account, the target database, the role associated with the user, the Snowflake warehouse associated with the specified user, and any additional properties in the JDBC string in <key>=<value> format separated by ‘&’.

  • Name your Snowflake connection. Striim saves these connection details under the connection name for future use.

  • After you have entered your connection details, click Next. Striim will check if all the prerequisites have been met.
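For reference, a Snowflake JDBC connection string with additional properties in <key>=<value> form joined by ‘&’ looks roughly like this (every bracketed name is a placeholder):

```
jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?db=<database>&warehouse=<warehouse>&role=<role>
```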

Step 2: Configure your Source connection

  • Once your target connection is verified, configure your source connection by selecting your source database. Here, we have selected PostgreSQL as our data source.

  • Select an existing connection or add a new connection. Enter the hostname, username, password, and database name. You can also connect securely by using an SSH tunnel.

  • Now Striim will run prerequisite checks to verify source configuration and user permissions.

Step 3: Select Schema and Tables to move into the target

  • Select the source schemas containing the tables you want to sync with Snowflake.

  • Next, select the tables; optionally, you can mask fields or select key columns.

  • Striim will now check the target warehouse and give you the option to sync with existing tables or create a new table.

  • You have the option to create table groups based on input change rates and low latency expectations.

Step 4: Input Additional Settings to complete your CDC pipeline

  • You have options to write changes as audit records or directly to your target table. You can specify how you want to handle schema changes. For the streaming option, enter the private key associated with the public key of your Snowflake user.

  • For the source database, enter your PostgreSQL replication slot, which is required so that the primary server retains the WAL logs the pipeline needs.

Step 5: Review your Pipeline

  • Review your source and target connection before running the pipeline. Make sure that your source and target connection details, selected tables and additional settings are correct.

Step 6: Run your pipeline to sync the source and target tables and stream data from your source

  • After you have reviewed your pipeline, run it to sync the table from your source with the target, followed by continuous data streaming.

Tools you need

Striim

Striim’s unified data integration and streaming platform connects clouds, data and applications.

Oracle Database

Oracle is a multi-model relational database management system.

Apache Kafka

Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale.

Azure Cosmos

Azure Cosmos is a fully managed NoSQL database.

Azure Blob Storage

Azure Blob Storage is an object store designed to store massive amounts of unstructured data.

Microsoft Fabric and Striim


Using Microsoft Fabric, we created real time dashboards in minutes using CDC from MongoDB. See how simple it is!

This data pipeline continuously takes Sales and Inventory data from MongoDB using Striim, integrates with Azure Event Hubs, and visualizes it in OneLake using Power BI. It’s a solution put together by Striim developers to combine operations, events, and analytics, highlighting our partnership with the newly announced unified Microsoft Fabric.

 

 

Hot Topics in Data with Ethan Aaron from Portable.io

We are kicking off Season 3 of What’s New in Data with none other than Ethan Aaron, CEO of Portable.io. Ethan is known for founding Portable.io – a leading cloud ETL product – along with his viral perspectives on data and “low key” data events that have taken over the industry, with appearances in New York City, Chicago, Denver, Toronto, Boston, and many other cities. We chat with Ethan on the hottest topics in data, including macroeconomic effects on the data industry, consolidation of the data ecosystem, and of course data contracts.

BONUS: Sign up for Portable.io’s Low Key Data Conference: a free data conference featuring Fortune 500 data executives, leading data venture capitalists, and other expert data practitioners:

https://docs.google.com/forms/d/e/1FA…
