Bruno Aziza – Head of Data and Analytics at Google Cloud – has a unique expert perspective on the data industry through his work with data executives and visionaries while managing a broad portfolio of products at Google including BigQuery, Dataflow, Dataproc, Looker, Data Studio, PubSub, Composer, Data Fusion, Catalog, Dataform, Dataprep and Dataplex.
In this episode with Bruno, we dive into topics such as the development (and disappointments) of Data Mesh, how it relates to data products, and how data leaders can transform their organizations by first looking inward and transforming themselves.
We also discuss why Chief Data Officers are navigating from ‘reactive analytics’ to data products driven by real-time data.
Follow Bruno Aziza on LinkedIn, Twitter, and Medium.
Go from zero to Streaming Analytics in a matter of clicks
Benefits
Get Started with Streaming
Learn how to play with real-time streams with simple auto-generated data streams
Explore Striim Dashboards
Use Striim’s SQL dashboard for real-time analytics
Free Streaming with Striim Developer
Stream up to 10 million events per month for free with Striim Developer
Overview
In today’s fast-paced, always-on lifestyle, real-time data is crucial. No one wants to know where their rideshare was ten minutes ago, miss out on the trade of a lifetime, or find out that half the items they ordered from their delivery app were out of stock. However, for many organizations, real-time data is out of reach due to the complexity of the infrastructure and the need to integrate with internal systems. This is where Striim comes in.
Why Striim?
Striim provides a simple unified data integration and streaming platform that combines change data capture (CDC), application integration, and Streaming SQL as a fully managed service.
Free Streaming for Developers!
With Striim Developer, you can prototype streaming use cases for production use with no upfront cost, stream up to 10 million events per month with unlimited Streaming SQL queries, and simulate real-time data behavior using Striim’s synthetic continuous data generator.
Want to see how easy it is to use Striim Developer for a real-time analytics use case? This tutorial will show you how to use Striim to process and analyze real-time data streams using continuous queries. You’ll also learn how to use a Striim dashboard for real-time data visualizations and reports. Whether you’re a data engineer or an analyst, this tutorial is the perfect introduction to real-time data insights with Striim Developer.
Core Striim Components
Continuous Generator: A continuous data generator that can auto-generate meaningful data for a given set of fields
Continuous Query: Striim’s continuous queries are continually running SQL queries that act on real-time data and may be used to filter, aggregate, join, enrich, and transform events.
Window: A window bounds real-time data by time, event count or both. A window is required for an application to aggregate or perform calculations on data, populate the dashboard, or send alerts when conditions deviate from normal parameters.
Dashboard: A Striim dashboard gives you a visual representation of data read and written by a Striim application.
Get Started
Watch this step-by-step video walkthrough or read through all the steps below:
Step 1: Sign up
To sign up for free Striim Developer, you can use either your personal or work email address. Go to https://signup-developer.striim.com/ and use your referral code. If you cannot reach this page, try disabling any ad/cookie blockers or sign up from an incognito window. Fill in all your details and select your choice of source and target from the dropdown list.
Step 2: Create your account and complete signup
After you submit the form in Step 1, check your inbox. You will receive an email from Striim Developer with the next steps; check your spam folder if you do not see it.
Complete your signup by clicking ‘Complete’. That will prompt you to confirm your email address and set your username and password. You can ignore the part about whitelisting the IP address.
Step 3: Sign in using your username and password
Once you submit the form from Step 2, you will be redirected to the developer.striim.com sign-in page, where you can enter your username and password. You will also receive a confirmation email that contains a password-reset link in case you forget your password.
Create your first real-time analytics app
Step 1: Create and name your Striim App
After you have signed into your developer account, click on ‘Create an App’ on the landing page.
Build your app using flow designer by selecting ‘Start from scratch’.
Next, name your app ‘salesanalytics’ and click ‘save’ on the lower right. Select the namespace that’s automatically set.
Step 2: Build a streaming application
Now you are ready to build your first real-time analytics application on Striim. To add an auto-generating data source, click on the relevant link on your screen as shown below:
You will have the option to choose between ‘Simple Source’ and ‘Advanced Source’. For this use case, create an advanced source with seven fields. This will start generating streaming sample data.
The advanced source comes with a continuous query that processes and analyzes the data stream.
Next, add a Window component to bound the streaming data into buckets. Click the ‘+’ button and select ‘Window’.
We will set the size of the window to ‘1-second’ on the timestamp of the sample data events. Important: name it ‘onesecondwindow’ for this exercise.
Now scroll down in the window editor and populate the following fields:
Time: 1 Second, on: EventTime
Next we will create a query to analyze this data. A CQ (Continuous Query) processes streaming data in real time. You can name it ‘getcounts’ or whatever you want (no spaces or special characters). To add a CQ, select ‘+’ and connect the new CQ component.
Copy and paste the following Streaming SQL snippet into the ‘CQ’ box:
SELECT count(*) as transactioncount, DNOW() as time FROM
onesecondwindow o;
IMPORTANT: name the New Output ‘countstream’.
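The window and CQ pair above effectively computes a tumbling count: events are grouped into one-second buckets and each bucket’s size is emitted. As a conceptual sketch (this is plain Python illustrating the idea, not Striim’s implementation):

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=1):
    """Group event timestamps into fixed one-second buckets and count each
    bucket, mirroring what the 'onesecondwindow' + 'getcounts' pair computes."""
    buckets = Counter()
    for ts in events:  # ts = event time in seconds (float)
        buckets[int(ts // window_seconds)] += 1
    return dict(buckets)

# Three events fall in the first second, two in the next
print(tumbling_window_counts([0.1, 0.4, 0.9, 1.2, 1.8]))  # {0: 3, 1: 2}
```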
As you may have noticed, the sample data has an IP address (every computer’s network ID) for each transaction. However, the business wants to know where their customers are coming from and we are missing that data. Luckily, Striim has a way of pulling locations for IP addresses.
Click and drag another ‘Continuous Query’ from the left-side panel onto the flow designer canvas (anywhere in the blank space)
You can name the component ‘getlocations’ or whatever you want (no special characters or whitespaces)
Now copy and paste the following snippet into the query box:
SELECT s.Name,s.Product_Name, IP_LAT(s.IP) as latitude, IP_LON(s.IP) as longitude, IP_COUNTRY(s.IP) FROM salesanalytics_PurchasesGDPRStream s;
IMPORTANT: name the New Output ‘locations’
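Under the hood, functions like IP_LAT, IP_LON, and IP_COUNTRY behave like a lookup join against a geolocation table. A conceptual Python sketch using a hypothetical in-memory table (the IP, coordinates, and field names are made up, not Striim’s actual data):

```python
# Hypothetical in-memory IP-to-location table standing in for a geo database
GEO_TABLE = {
    "203.0.113.7": {"lat": 37.77, "lon": -122.42, "country": "US"},
}

def enrich(event, geo=GEO_TABLE):
    """Join a purchase event with location fields derived from its IP address."""
    loc = geo.get(event["IP"], {"lat": None, "lon": None, "country": None})
    return {**event, "latitude": loc["lat"],
            "longitude": loc["lon"], "country": loc["country"]}

print(enrich({"Name": "Ada", "Product_Name": "Widget", "IP": "203.0.113.7"})["country"])  # US
```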
Step 3: Deploy and Run your Striim Application
After you have saved all the components of your streaming app, you may deploy and run the Striim application. Click on the dropdown list next to ‘Created’ and choose ‘Deploy App’.
You can select any available deployment group and click ‘Deploy’.
After your Striim app is deployed, you can run the streaming application by clicking ‘Start App’.
You can monitor the processed data through the ‘eye’ icon next to any stream component.
Explore Striim Dashboards
A Striim dashboard gives you a visual representation of data read and written by a Striim application. We will create a dashboard for the above data stream to visualize streaming data in real-time. Start by downloading this file.
Click on ‘View All Dashboards’ from the dropdown next to ‘Dashboards’ at the top of screen.
Click on ‘Create Dashboard’, import the file you downloaded above, and select ‘Import all queries into this namespace’ using the auto-selected namespace.
Here you will see a Striim Dashboard with a map already created. You will create a real-time chart yourself!
We will now create a line chart with our sales data. Click the ‘Line’ chart on the left side and drag it into the panel. Then select ‘Edit Query’ via the ‘<>’ icon at the top left.
Name the query ‘getcounts’ or whatever you want (no whitespace or special characters) and push ‘Enter’ on your keyboard.
Enter the following query into the input:
SELECT * FROM countstream;
Click the ‘Configure’ button to add axis details to your line graph. Choose the ‘transactioncount’ field for the y-axis and ‘time’ (with datetime format) for the x-axis. For a real-time chart, set the data retention time to ‘current’.
Now you have a real-time sales dashboard!
Wrapping Up: Start your Free Trial Today
In this recipe, we walked you through the steps for creating a Striim application using test data from Striim’s Continuous Generator adapter. You can query the data stream using continuous queries and partition it using a Window. We also demonstrated how to create a Striim dashboard for real-time data visualization. Try adding your own continuous queries to the sales app and build whatever charts you want!
As always, feel free to reach out to our integration experts to schedule a demo, or try Striim developer for free here.
Tools you need
Striim
Striim’s unified data integration and streaming platform connects clouds, data and applications.
Real-Time Ingest for Snowflake
Enable true real-time ingest for Snowflake via Snowpipe Streaming
Activate Data
With real-time data in Snowflake, you can power data activation workflows fed by fresh data and in-the-moment actions
Overview
Striim is a unified data streaming and integration product that offers change data capture (CDC), enabling continuous replication from popular databases such as Oracle, SQL Server, PostgreSQL, and many others to target data warehouses like BigQuery and Snowflake.
In this recipe, we walk you through setting up a streaming application to a Snowflake target. To begin with, we will generate synthetic data to get a feel for Striim’s streaming platform. We use Striim’s Continuous Generator component to generate test data which is then queried by a SQL-based Continuous Query. Follow the steps to configure your own streaming app on Striim.
Core Striim Components
Continuous Generator: A continuous data generator can auto-generate meaningful data for a given set of fields
Continuous Query: Striim continuous queries are continually running SQL queries that act on real-time data and may be used to filter, aggregate, join, enrich, and transform events.
Snowflake Writer: Striim’s Snowflake Writer writes to one or more existing tables in Snowflake. Events are staged to local storage, Azure Storage, or AWS S3, then written to Snowflake as per the Upload Policy setting.
Step 1: Log into your Striim account and select the source
If you do not have an account yet, please go to signup-developer.striim.com to sign up for a free Striim developer account in a few simple steps. You can learn more on how to get started with free Striim Developer here. To configure your source adapter from the flow designer, click on ‘Create app’ on your homepage followed by ‘Start from scratch’. Name your app and click ‘Save’.
Click on the relevant link on the flow-designer screen to add an auto-generated data source.
You will be prompted to select a simple or an advanced source. For this application, we’ll add a simple source. The simple source has a continuous generator with four fields that are queried by a CQ component of Striim.
Step 2: Add a target table on your Snowflake Data Warehouse and enter the connection details on the Striim Target Snowflake adapter
On your Snowflake warehouse, add a table with the same fields and data types as the outgoing stream from the Continuous Query.
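For example, if your Continuous Query outputs four fields, the target table might look like the following. The database, table, and column names here are illustrative assumptions; match them to your actual stream’s schema:

```sql
-- Illustrative only: adjust names and types to match your CQ's output stream
CREATE TABLE MY_DB.PUBLIC.SALES_EVENTS (
    MERCHANT_ID  VARCHAR,
    EVENT_TIME   TIMESTAMP_NTZ,
    AMOUNT       FLOAT,
    CITY         VARCHAR
);
```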
Drag the Snowflake component from the left panel and configure your target. The connection URL follows Snowflake’s JDBC format.
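Snowflake’s JDBC connection URL generally follows this pattern (all bracketed values are placeholders for your own account details):

```
jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?db=<database>&schema=<schema>&warehouse=<warehouse>
```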
Once the source, target, and CQ are configured, select ‘Deploy’ from the dropdown menu next to ‘Created’. Choose any available node and click ‘Deploy’. After the app is deployed, select ‘Start App’ from the same dropdown.
You can preview the processed data by clicking on the ‘eye’ icon next to the stream component.
Snowflake Writer: Support for Streaming API (Optional)
The Snowpipe Streaming API is designed to supplement Snowpipe, rather than replace it. It is intended for streaming scenarios where data is transmitted in row format, such as from Apache Kafka topics, rather than written to files. It enables low-latency loading of streaming data directly to the target table using the Snowflake Ingest SDK and Striim’s Snowflake Writer, thereby saving the costs associated with writing the data from staged files.
Configurations:
Users should enable streaming support for their Snowflake account along with key-pair authentication. The private key is passed in the Snowflake Writer property with the header, footer, and line breaks removed:
-----BEGIN ENCRYPTED PRIVATE KEY----- ## HEADER
*************************
*******************
…
-----END ENCRYPTED PRIVATE KEY----- ## FOOTER
To configure the Snowflake Writer, under Advanced Settings, enable APPEND ONLY and STREAMING UPLOAD. With these settings, data will be streamed to the target table directly. Enter your user role and private key.
You can fine-tune the upload policy settings based on your needs, but you may start by changing ‘UploadPolicy’ to ‘eventcount:500,interval:5s’ to load at every 500 events or every 5 seconds (whichever comes first).
There are a few limitations to this approach, as follows:
The Snowflake Streaming API does not support AUTO INCREMENT or IDENTITY columns.
Default column values other than NULL are not supported.
Data re-clustering is not available on Snowpipe streaming target tables.
The GEOGRAPHY and GEOMETRY data types are not supported.
Wrapping Up: Start your Free Trial Today
In this recipe, we have walked you through steps for creating a Striim application with Snowflake as a target using test data from our Continuous Generator adapter. You can easily set up a streaming app by configuring your Snowflake target. As always, feel free to reach out to our integration experts to schedule a demo, or try Striim developer for free here.
Tools you need
Striim
Striim’s unified data integration and streaming platform connects clouds, data and applications.
Snowflake
Snowflake is a cloud-native relational data warehouse that offers flexible and scalable architecture for storage, compute and cloud services.
Data integration and SQL-based processing for Kafka with Striim
Benefits
Efficient Data Processing
Process streaming data quickly and effectively between enterprise databases and Kafka
Streamlined SQL-Based Queries
Transform, filter, aggregate, enrich, and correlate your real-time data using continuous queries
ACID-Compliant CDC
Striim and Confluent work together to ensure high-performance, ACID-compliant Change Data Capture
Overview
Apache Kafka is a powerful messaging system, renowned for its speed, scalability, and fault-tolerant capabilities. It is widely used by organizations to reliably transfer data. However, deploying and maintaining Kafka-based streaming and analytics applications can require a team of developers and engineers capable of writing and managing substantial code. Striim is designed to simplify the process, allowing users to reap the full potential of Kafka without extensive coding.
Striim and Confluent, Inc. (founded by the creators of Apache Kafka), partnered to bring real-time change data capture (CDC) to the Kafka ecosystem. By integrating Striim with Confluent Kafka, organizations can achieve a cost-effective, unobtrusive solution for moving transactional data onto Apache Kafka message queues in real time. This delivery solution is managed through a single application that offers enterprise-level security, scalability, and dependability.
The Striim platform helps Kafka users quickly and effectively process streaming data from enterprise databases to Kafka. Streamlined SQL-like queries allow for data transformations, filtering, aggregation, enrichment, and correlation. Furthermore, Striim and Confluent work together to ensure high-performance, ACID-compliant CDC and faster Streaming SQL queries on Kafka. For further insights into the strengths of the Striim and Kafka integration, visit our comparison page.
This recipe will guide you through the process of setting up Striim applications (Striim apps) with Confluent Kafka. Two applications will be set up: one with Kafka as the data source using the Kafka Reader component and another with Kafka as the destination with the Kafka Writer component. You can download the associated TQL files from our community GitHub page and deploy them into your free Striim Developer account. Please follow the steps outlined in this recipe to configure your sources and targets.
Core Striim Components
Kafka Reader: Kafka Reader reads data from a topic in Apache Kafka 0.11 or 2.1.
Kafka Writer: Kafka Writer writes to a topic in Apache Kafka 0.11 or 2.1.
Stream: A stream passes one component’s output to one or more other components.
Snowflake Writer: Striim’s Snowflake Writer writes to one or more existing tables in Snowflake. Events are staged to local storage, Azure Storage, or AWS S3, then written to Snowflake as per the Upload Policy setting.
MongoDB Reader: Striim supports MongoDB versions 2.6 through 5.0, and MongoDB and MongoDB Atlas on AWS, Azure, and Google Cloud Platform.
Continuous Query: Striim continuous queries are continually running SQL queries that act on real-time data and may be used to filter, aggregate, join, enrich, and transform events.
App 1: Kafka Source to Snowflake Target
For the first app, we have used Confluent Kafka (Version 2.1) as our source. Data is read from a Kafka topic and processed in real time before being streamed to a Snowflake target warehouse. Please follow the steps below to set up the Striim app from the Flow Designer in your Striim Developer account. If you do not have an account yet, please follow this tutorial to sign up for a free Striim Developer account in a few simple steps.
Step 1: Configure the Kafka Source adapter
In this recipe, the Kafka topic is hosted on Confluent. Confluent offers a free trial for learning and exploring Kafka and Confluent Cloud; to sign up, please follow the Confluent documentation. You can create a topic inside your free cluster and use it as the source for the Striim app.
To configure your source adapter from the Flow Designer, click on ‘Create app’ on your homepage followed by ‘Start from scratch’. Name your app and click ‘Save’.
From the side panel, drag the Kafka source component and enter the connection details.
Add the broker address, also called the bootstrap server, which you can find under client information on Confluent Cloud.
Enter the offset from which you want to stream data from your topic, and set the Kafka Config value and property separators. For the Kafka Config field you will need the API key and API secret of your Confluent Kafka topic. The Kafka Config is entered in the following format:
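A typical Confluent Cloud SASL configuration, using ‘;’ as the property separator and ‘=’ as the value separator, looks roughly like this (the key and secret are placeholders, and the exact separators must match your adapter settings):

```
security.protocol=SASL_SSL;sasl.mechanism=PLAIN;sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";
```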
You can copy the sasl.jaas.config from client information on Confluent Cloud and use the correct separators for the Kafka Config string.
Step 2: Add a Continuous Query to process the output stream
Now the data streamed from the Kafka source will be processed in real time for various analytical applications. In this recipe, the data is processed with a SQL-like query that converts the JSON values into a structured table, which is then streamed into your Snowflake warehouse, all in real time.
Drag the CQ component from the side panel and enter the following query. You can copy the SQL query from our GitHub page.
Step 3: Configure your Snowflake Target
On your target Snowflake warehouse, create a table with the same schema as the processed stream from the above Continuous Query. Enter the connection details and save. You can learn more about Snowflake Writer from this recipe.
Step 4: Deploy and run the app
Once the source, target, and CQ are configured, select ‘Deploy’ from the dropdown menu next to ‘Created’. Choose any available node and click ‘Deploy’. After the app is deployed, select ‘Start App’ from the same dropdown.
You can preview the processed data by clicking on the ‘eye’ icon next to the stream component.
App 2: MongoDB Source to Kafka Target
In this app, real-time data from MongoDB has been processed with SQL-like queries and replicated to a Kafka topic on Confluent. Follow the steps below to configure a MongoDB to Kafka streaming app on Striim. As shown in app 1 above, first name your app and go to the Flow Designer.
Step 1: Set up your MongoDB Source
Configure your MongoDB source by filling in the connection details. Follow this recipe for detailed steps on setting up a MongoDB source on Striim. Enter the connection URL, username, password, and the collection that you want to stream.
Step 2: Add a Continuous Query to process incoming data
Once the source is configured, we will run a query on the data stream to process it. You can copy and paste the code from our GitHub page.
Step 3: Set up the Kafka target
After the data is processed, it is written to a Confluent Kafka topic. The configuration for the Kafka Writer is similar to the Kafka Reader shown in App 1. Enter your Kafka connection details and click ‘Save’.
Step 4: Deploy and run the app
After the source and target adapters are configured, click ‘Deploy’ followed by ‘Start App’ to run the data stream.
You can preview the processed data through the ‘eye’ icon next to the data stream.
As seen in the target Kafka topic’s messages, the data from the MongoDB source is streamed into the Kafka topic.
App 1, Step 2: Continuous Query to process the output stream
SELECT TO_STRING(data.get("ordertime")) as ordertime,
TO_STRING(data.get("orderid")) as orderid,
TO_STRING(data.get("itemid")) as itemid,
TO_STRING(data.get("address")) as address
FROM kafkaOutputStream;
App 2, Step 2: Continuous Query to process incoming MongoDB data
SELECT
TO_STRING(data.get("_id")) as id,
TO_STRING(data.get("name")) as name,
TO_STRING(data.get("property_type")) as property_type,
TO_STRING(data.get("room_type")) as room_type,
TO_STRING(data.get("bed_type")) as bed_type,
TO_STRING(data.get("minimum_nights")) as minimum_nights,
TO_STRING(data.get("cancellation_policy")) as cancellation_policy,
TO_STRING(data.get("accommodates")) as accommodates,
TO_STRING(data.get("bedrooms")) as no_of_bedrooms,
TO_STRING(data.get("beds")) as no_of_beds,
TO_STRING(data.get("number_of_reviews")) as no_of_reviews
FROM mongoOutputStream;
Wrapping Up: Start your Free Trial Today
The above tutorial describes how you can use Striim with Confluent Kafka to move change data into the Kafka messaging system. Striim’s pipelines are portable between multiple clouds across hundreds of endpoint connectors. You can create your own applications that cater to your needs. Please find the app TQL and data used in this recipe on our GitHub repository.
As always, feel free to reach out to our integration experts to schedule a demo, or try Striim for free here.
Tools you need
Striim
Striim’s unified data integration and streaming platform connects clouds, data and applications.
Snowflake
Snowflake is a cloud-native relational data warehouse that offers flexible and scalable architecture for storage, compute and cloud services.
Apache Kafka
Apache Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale.
MongoDB
NoSQL database that provides support for JSON-like storage with full indexing support.
Data lakehouses are a strategic platform for unification of enterprise decision support, advanced analytics, and machine learning.
If engineered for streaming and other low-latency workloads, data lakehouses can enable companies to deliver trusted data-driven insights and sophisticated analytics in real time to internal operations as well as to customer-facing applications.
Delivering on the promise of the real-time cloud data lakehouse demands careful attention to a wide range of architecture, deployment, governance, and performance issues. Please join TDWI’s senior research director James Kobielus on this on-demand webinar, in which he, Databricks’ Spencer Cook, and Striim’s John Kutay, discuss how enterprises can incorporate change data capture and data streaming in a cloud-based lakehouse to drive real-time data-driven insights, operations, and customer engagement. After brief presentations, Kobielus, Cook, and Kutay engage in a roundtable discussion focused on the following issues:
What are the core business use cases for a real-time enterprise cloud data lakehouse?
What are the essential ingredients of a real-time enterprise cloud data lakehouse?
How should enterprises deploy change data capture, streaming, and other low-latency infrastructure within a real-time cloud data lakehouse?
What challenges do enterprises face when migrating schemas, data, and workloads from legacy data warehouses to a cloud data lakehouse?
How can enterprises future-proof their investments in real-time cloud data lakehouses?
Spencer Cook, M.S., is a data professional with experience delivering end-to-end analytics solutions in the cloud to iconic brands. Since 2021, Spencer has been a financial services solutions architect at Databricks focused on revolutionizing the industry with lakehouse architecture.
John Kutay – Director of Product Management, Striim
John Kutay is director of product management at Striim with prior experience as a software engineer, product manager, and investor. His podcast “What’s New in Data” best captures his ability to understand upcoming trends in the data space with thousands of listeners across the globe. In addition, John has over 10 years of experience in the streaming data space through academic research and his work at Striim.
Companies need to analyze large volumes of datasets, leading to an increase in data producers and consumers within their IT infrastructures. These companies collect data from production applications and B2B SaaS tools (e.g., Mailchimp). This data makes its way into a data repository, like a data warehouse (e.g., Redshift), and is shown to users via a dashboard for decision-making.
This entire data ecosystem can be wobbly at times because it rests on a number of assumptions. Dashboard users may assume that data is being transformed the same way as when the service was initially launched. Similarly, an engineer might change something in the schema of the source system; although it might not affect production, it can break things downstream.
Data contracts tackle this uncertainty and end assumptions by creating a formal agreement. This agreement contains a schema that describes and documents data, which determines who can expose data from your service, who can consume your data, and how you can manage your data.
What are data contracts?
A data contract is a formal agreement between the users of a source system and the data engineering team that is extracting data for a data pipeline. This data is loaded into a data repository, such as a data warehouse, where it can be transformed for end users. A data contract typically answers questions such as:
What method are you using to extract data (e.g., change data capture)?
At what frequency are you ingesting data?
Who are the points of contact for the source system and ingestion?
You can write a data contract in a text document. However, it’s better to use a configuration file to standardize it. For example, if you are ingesting data from a table in Postgres, your data contract could look like the following in JSON format:
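As an illustrative sketch (all names, addresses, and values here are hypothetical, and the exact fields will vary by team):

```json
{
  "contract_name": "orders_table_v1",
  "source": {"system": "postgres", "schema": "public", "table": "orders"},
  "ingestion": {"method": "change_data_capture", "frequency": "continuous"},
  "schema": {
    "order_id": {"type": "integer", "nullable": false},
    "amount": {"type": "numeric", "unit": "USD"},
    "created_at": {"type": "timestamp"}
  },
  "contacts": {"producer": "app-team@example.com", "consumer": "data-eng@example.com"}
}
```

Note how the contract captures the extraction method, the ingestion frequency, and the points of contact alongside the schema itself.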
When your architecture becomes distributed or large enough, it’s increasingly difficult to track changes, and that’s where a data contract brings value to the table.
When your applications access data from each other, it can cause high coupling, i.e., applications are highly interdependent on each other. If you make any changes in the data structure, such as dropping a table from the database, it can affect the applications that are ingesting or using data from it. Therefore, you need data contracts to implement versioning to track and handle these changes.
To ensure your data contracts fulfill their purpose, you must:
Enforce data contracts at the data producer level. You need someone on the data producer side to manage data contracts, because you don’t know how many target environments will ingest data from your operational systems. Perhaps you first load data into a data warehouse and later also load it into a data lake.
Cover schemas in data contracts. On a technical level, data contracts handle schemas of entities and events. They also prevent changes that are not backward-compatible, such as dropping a column.
Cover semantics in data contracts. If you alter the underlying meaning of the data being generated, it should break the contract. For instance, if your entity has distance as a numeric field, and you start collecting distance in kilometers instead of miles, this alteration is a breaking change. This means that your contract should include metadata about your schema, which you can use to describe your data and add value constraints for certain fields (e.g., temperature).
Ensure data contracts don’t affect iteration speed for software developers. Provide developers with familiar tools to define and implement data contracts and add them to your CI/CD pipeline. Implementing data contracts can minimize tech debt, which can positively impact iteration speed.
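The schema and semantics points above can be made concrete: a contract that records each field’s type and unit can flag a unit switch (miles to kilometers) as a breaking change. A minimal sketch, with a hypothetical contract structure:

```python
def is_breaking_change(old_contract, new_contract):
    """A change is breaking if a field disappears, changes type, or changes
    its declared unit (e.g., distance switching from miles to kilometers)."""
    for field, spec in old_contract.items():
        new_spec = new_contract.get(field)
        if new_spec is None:
            return True  # dropped column: not backward-compatible
        if new_spec.get("type") != spec.get("type"):
            return True  # type change breaks consumers
        if new_spec.get("unit") != spec.get("unit"):
            return True  # semantic change: same shape, different meaning
    return False

old = {"distance": {"type": "numeric", "unit": "miles"}}
new = {"distance": {"type": "numeric", "unit": "kilometers"}}
print(is_breaking_change(old, new))  # True
```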
In their recent article, Chad Sanderson and Adrian Kreuziger shared an example of a CDC-based implementation of data contracts. According to them, a data contract implementation consists of the following components, as depicted below:
Defining data contracts as code using open-source projects (e.g. Apache Avro) to serialize and deserialize structured data.
Data contract enforcement using integration tests to verify that the data contract is correctly implemented, and ensuring schema compatibility so that changes in the producers won’t break downstream consumers. In their example, they use Docker compose to spin up a test instance of their database, a CDC pipeline (using Debezium), Kafka, and the Confluent Schema Registry.
Data contract fulfillment using stream processing jobs (using KSQL, for example) to process CDC events and output a schema that matches the previously-defined data contract.
Data contract monitoring to catch changes in the semantics of your data.
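For the first component, a contract defined as code with an Avro schema might look like this (the record and field names are illustrative, not from Sanderson and Kreuziger’s article):

```json
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer_id", "type": "long"},
    {"name": "amount_usd", "type": "double"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```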
Data contracts can be useful in different stages, such as production and development, by acting as a validation tool, as well as supporting your data assets like data catalogs to improve data quality.
Assess how data behaves on the fly
During production, you can use data contracts as a data validation tool to see how data needs to behave in real time. For example, let’s say your application is collecting data for equipment in a manufacturing plant. Your data contract says that the pressure for your equipment should not exceed the specified limit. You can monitor the data in the table and send out a warning if the pressure is getting too high.
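That monitoring check can be sketched in a few lines (the field names and the 150 psi limit are made up for illustration):

```python
PRESSURE_LIMIT_PSI = 150.0  # illustrative contract constraint

def validate_reading(event, limit=PRESSURE_LIMIT_PSI):
    """Return a warning string when a reading violates the contract's
    pressure limit, or None when the reading is within bounds."""
    if event["pressure_psi"] > limit:
        return (f"WARNING: {event['equipment_id']} pressure "
                f"{event['pressure_psi']} exceeds {limit}")
    return None

print(validate_reading({"equipment_id": "pump-7", "pressure_psi": 162.5}))  # prints a warning line
```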
Avoid breaking changes
During software development, you can use data contracts to avoid breaking changes that could cause any of your components to fail, since data contracts can validate the latest version of the data.
Improve discoverability and data understanding
Like data contracts, data catalogs accumulate and show various types of information about data assets. However, data catalogs only define the data, whereas data contracts define the data and specify how your data should look. Moreover, data catalogs are made for humans, whereas data contracts are made for computers. Data contracts can be used with data catalogs by acting as a reliable source of information for the latter to help people discover and understand data through additional context (e.g., tags).
Striim helps you manage breaking changes
Striim Cloud enables you to launch fully-managed streaming Change Data Capture pipelines, greatly simplifying and streamlining data contract implementation and management. With Striim, you can easily define, enforce, fulfill, and monitor your data contracts without having to wrangle with various open-source tools.
For example, using Striim, you can set parameters for Schema Evolution based on internal data contracts. This allows you to pass schema changes to data consumers on an independent, table-specific basis. If your data contract is broken, you can use Striim to automate sending alerts to Slack. Consider the workflow in the following diagram:
You can use Striim to move data from a database (PostgreSQL) to a data warehouse (BigQuery). Striim uses Streaming SQL to filter tables in your PostgreSQL database based on data contracts. If a schema change breaks your contract, Striim will stop the application and send you an alert through Slack, allowing your engineers to stop the changes in the source schema. If the schema change is in line with your contract, Striim will automatically propagate the changes to BigQuery.
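The decision at the heart of that workflow can be sketched in a few lines. This is a conceptual stand-in for schema-evolution settings, not Striim's actual API, and the field names are hypothetical:

```python
CONTRACT_FIELDS = {"order_id", "amount_cents"}   # fields consumers depend on

def handle_schema_change(change, contract_fields=CONTRACT_FIELDS):
    """Decide what the pipeline should do with an upstream schema change."""
    breaking = (change["kind"] in ("drop_column", "change_type")
                and change["column"] in contract_fields)
    if breaking:
        return "halt_and_alert"   # stop the app and notify Slack
    return "propagate"            # forward the change to BigQuery

print(handle_schema_change({"kind": "add_column", "column": "note"}))       # propagate
print(handle_schema_change({"kind": "drop_column", "column": "order_id"}))  # halt_and_alert
```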
Learn more about how Striim helps you manage data contracts here.
Businesses of all scales and industries have access to increasingly large amounts of data, which need to be harnessed effectively. According to an IDG Market Pulse survey, companies collect data from 400 sources on average. Companies that can't process and analyze that data to glean useful insights for their operations are falling behind.
Thousands of companies are centralizing their analytics and applications on the AWS ecosystem. However, fragmented data can slow down the delivery of great product experiences and internal operations.
We are excited to launch Striim Cloud on AWS: a real-time data integration and streaming platform that connects clouds, data and applications with unprecedented speed and simplicity.
With a serverless experience to build smart data pipelines in minutes, Striim Cloud on AWS helps you unify your data in real time with out-of-the-box support for the following targets:
AWS S3
AWS Databases on RDS and Aurora
AWS Kinesis
AWS Redshift
AWS MSK
Snowflake
Databricks with Delta Lake on S3
along with over 100 additional connectors available at your fingertips as a fully managed service.
Striim Cloud runs natively on AWS services like EKS, VPC, EBS, CloudWatch, and S3, enabling it to offer large-scale, high-performance, and reliable data streaming.
How does Striim Cloud bring value to the AWS ecosystem?
Striim enables you to ingest and process real-time data from over one hundred streaming sources. This includes enterprise databases via Change Data Capture, transactional data, and AWS Cloud environments. When you run Striim on AWS, it lets you create real-time data pipelines for Redshift, S3, Kinesis, Databricks, Snowflake and RDS for enterprise workloads.
Sources and targets
Striim supports more than 120 sources and targets. It comes with pre-built data connectors that can automate your data movement from any source to AWS Redshift or S3 within a few minutes.
With Striim, all your team needs to do is click through a few configuration steps, and an automated pipeline will be created between your source and AWS targets. Some of the sources Striim supports include:
Databases: Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
Data Streams: Kafka, JMS, IBM MQ, Rabbit MQ, IoT data over MQTT
Data formats: JSON, XML, Parquet, free-form text, and CSV
AWS targets: RDS for Oracle, RDS for MySQL, RDS for SQL Server, Amazon S3, Databricks via Delta Lake on S3, Snowflake, Redshift, and Kinesis
Additional targets: Over 100 additional connectors including custom Kafka endpoints with Striim’s full-blown schema registry support
Change data capture
Change data capture (CDC) is a process in ETL used to track changes to data in databases (e.g., insert, update, delete) and stream those changes to target systems like Redshift. However, CDC approaches like trigger-based CDC or timestamps can affect the performance of the source system.
Striim supports the latest form of CDC — log-based CDC — which can reduce overhead on source systems by retrieving transaction logs from databases. It also helps move data continuously in real time in a non-intrusive manner. Learn about log-based CDC in detail here.
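Conceptually, a log-based CDC pipeline turns each committed transaction-log entry into an event carrying the row's before and after images, then replays those events on the target. The sketch below uses a generic, Debezium-style event shape (not Striim's exact format) and an in-memory dict standing in for the target table:

```python
# A generic CDC event: operation type, before/after row images, and the
# source log position (values are illustrative).
event = {
    "op": "u",                               # i=insert, u=update, d=delete
    "before": {"id": 7, "status": "placed"},
    "after":  {"id": 7, "status": "shipped"},
    "source": {"table": "orders", "lsn": 123456},
}

def apply_event(target, ev):
    """Replay one CDC event against an in-memory 'table' keyed by id."""
    key = (ev["after"] or ev["before"])["id"]
    if ev["op"] == "d":
        target.pop(key, None)    # delete removes the row
    else:
        target[key] = ev["after"]  # insert/update upserts the after-image

table = {7: {"id": 7, "status": "placed"}}
apply_event(table, event)
print(table[7]["status"])  # shipped
```

Because the events come from the transaction log, the source database never has to be queried for "what changed", which is where the low overhead comes from.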
Streaming SQL
Standard SQL can only work with bounded data that is stored in a system. When dealing with streaming data bound for Redshift, you can't use standard SQL because you are dealing with unbounded data, i.e., data that keeps arriving. Striim provides a Streaming SQL engine that helps your data engineers and business analysts write SQL-style declarative queries over streaming data. These queries never stop running and continuously produce outputs as streams.
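The mental model is that a continuous query is a function over an unbounded iterator that emits a result after every input, rather than once at the end. A toy Python analogue (this only illustrates the idea, it is not how Striim's engine is built):

```python
def running_count_by_key(events):
    """A toy 'continuous query': consume an unbounded event stream and
    yield an updated per-key count after every event."""
    counts = {}
    for ev in events:
        counts[ev["product"]] = counts.get(ev["product"], 0) + 1
        yield dict(counts)  # each yield is one output "row" of the result stream

stream = [{"product": "shirt"}, {"product": "hat"}, {"product": "shirt"}]
for snapshot in running_count_by_key(stream):
    print(snapshot)
# final snapshot: {'shirt': 2, 'hat': 1}
```

A streaming SQL equivalent would be something like a `SELECT product, COUNT(*) ... GROUP BY product` that never terminates, with each arriving event updating the output.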
Data transformation and enrichment
Data transformation and enrichment are critical steps to creating operational data products in the form of tables and materialized views with minimal cost and duplication of data. To organize these data into a compatible format for the target system, Striim helps you perform data transformation with Streaming SQL. This can include operations such as joining, cleaning, correlating, filtering, and enriching. For example, enriching helps you to add context to your data (e.g., by adding geographical information to customer data to understand their behavior).
What makes Striim unique in this regard is that it not only supports data transformation for batch data, but also supports in-flight transformations for real-time streams with a full-blown Streaming SQL engine called Tungsten.
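As an illustration of in-flight cleaning and enrichment (the field names and lookup table below are invented), each event in a customer stream can be joined against reference data before it lands on the target:

```python
# Reference data loaded once up front; in a streaming platform this would
# typically be a cache or lookup table rather than a hard-coded dict.
GEO_BY_ZIP = {"94105": "San Francisco, CA", "10001": "New York, NY"}

def enrich(event):
    """Drop malformed events (cleaning) and attach geographic context
    (enrichment) before the event is written to the target."""
    if "zip" not in event:          # cleaning: discard incomplete records
        return None
    out = dict(event)
    out["city"] = GEO_BY_ZIP.get(event["zip"], "unknown")
    return out

print(enrich({"customer_id": 1, "zip": "94105"}))
print(enrich({"customer_id": 2}))  # None
```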
Use case: How can an apparel business analyze data with Striim?
Suppose there’s a hypothetical company, Acme Corporation, which sells apparel across the country. The management wants to make timely business decisions that can help them to increase sales and minimize the number of lost opportunities due to delays in decision-making. Some of the questions that can help them to make the right decisions include the following:
Which product is trending at the moment?
Which store and location received the highest traffic last month?
What’s the inventory status across warehouses?
Currently, all store data is stored in a transaction database (Oracle). Imagine you’re Acme Corporation’s data architect. You can generate and visualize answers to the above questions by building a data pipeline in two steps:
Use Striim Cloud Enterprise to stream data from Oracle to Amazon Redshift.
After data is loaded into Redshift, use Amazon QuickSight service to show data insights and create dashboards.
Here’s how the flow will look:
In this blog, we will show you how you can configure and manage Striim Cloud Enterprise on AWS to create this pipeline for your apparel business within a few minutes.
Sign up for Striim Cloud
Signing up for Striim Cloud Enterprise is simple: just visit striim.com, get a free trial and sign up for the AWS solution. Activate your account by following the instructions.
Once you are signed in, create a Striim Cloud service, which essentially runs in the background and creates a dedicated Kubernetes cluster (EKS service on AWS) to host your pipeline, as you can see in the picture below.
Once the cluster is ready and before launching your service, configure secure connections using the secure SSH connection configuration, as seen below.
Create a pipeline for Oracle to Amazon Redshift
To create a pipeline, select Oracle as the source, then type "Amazon" in the target field to see all the supported targets (Amazon Redshift, Kinesis, etc.). In our example, we are selecting Amazon S3 as our target.
The wizard walks you through the simple process of entering source and target credentials. The service automatically validates the credentials, connects to the sources, and fetches the list of schemas and tables available for your selection, as shown below.
On the target side, enter your Amazon Redshift access key and secret key, along with the appropriate S3 bucket name and object names to write Oracle data into, as depicted in the image below.
Follow the wizard to finish the configuration, which creates a data pipeline that collects historical data from the Oracle database and moves them to Amazon Redshift. For example, you can see the total number of sales across all branches during the last week.
In the next step, you can create an Oracle CDC pipeline via Striim to stream the real-time change data, arriving in Oracle from different stores, on to Redshift. Now you can see real-time store data.
A data pipeline streaming data from the source (Oracle) to Amazon RedShift
Your data engineers can use streaming SQL to join, filter, cleanse, and enrich data on the real-time data stream before it’s written to the target system (S3). A monitoring feature offers real-time stream views for further low-latency insights.
Once data becomes available on Redshift, your data engineer can create dashboards and set up metrics for the relevant business use cases such as:
Current product trending
Store and location with the highest traffic last month
Inventory status dashboard across warehouses; quantity sold by apparel, historic graph vs. latest (last 24 hours)
Data like current trending products can be easily shared with management for real-time decision-making and the creation of business strategies.
For example, here’s a real-time view of the apparel trends by city:
And below are insights on the overall business, where you can see the top-selling and bottom-selling locations. The management can use this information to try out new strategies to increase sales in the bottom-selling locations, such as by introducing discounts or running a more aggressive social media campaign in those locations.
Striim is available for other cloud environments, too
Like AWS, Striim Cloud is available on other leading cloud ecosystems like Google Cloud and Microsoft Azure. You can use Striim with Azure to move data between on-premises and cloud enterprise sources while using Azure analytics tools like Power BI and Synapse. Similarly, you can use Striim with Google Cloud to move real-time data to analytics systems, such as Google BigQuery, without putting any significant load on your data sources.
The streaming data market is constantly changing and evolving due to technological innovation and market demand. In order to stay ahead of the competition, companies need to have the most current and accurate information about the streaming data landscape. Watch our on-demand webinar to learn more about the future of the streaming data market. Sanjeev Mohan (Principal, SanjMo & Former Gartner Research VP) and Alok Pareek (Founder & EVP Products, Striim) discuss topics including:
The latest data trends including data products, data mesh, data observability, and more
The growth of real-time streaming replication and multiple market patterns to achieve replication (and their use cases)
On a Tuesday night, a nurse in the emergency department receives a real-time alert on her smartphone: the department will be overcrowded within 1.5 hours. This alert, powered by real-time healthcare analytics, projects bed occupancy and anticipated care needs, allowing the nurse to coordinate with transport, radiology, and lab teams to prepare for the surge.
Historically, data silos limited information access, but real-time analytics now makes healthcare processes more connected. By aggregating and analyzing data, these insights boost operational efficiency and enhance patient care. In this post, we’ll explore how leveraging real-time healthcare analytics ensures seamless patient care and a smoother workflow for your team.
Why Leverage Real-Time Healthcare Analytics?
There are several compelling reasons why real-time healthcare analytics is essential for healthcare institutions. These include:
To Analyze EHR Data and Improve Patient Care
An electronic health record (EHR) digitally stores patient information, such as medical history, prescriptions, lab results, and treatments. While EHRs collect and display data, they lack real-time analysis capabilities — a gap filled by real-time healthcare analytics.
With real-time analytics, medical professionals can instantly access insights and recommendations based on current EHR data. This system ingests relevant data points, like progress and nursing notes, identifies diagnostic patterns, detects minor condition changes, and prioritizes patients with deteriorating health, enabling swift and proactive care.
Leveraging real-time healthcare analytics is essential in early sepsis detection. According to the CDC, sepsis claims 350,000 adult lives annually in the U.S. Early detection is vital yet challenging due to symptom overlap with other conditions. However, real-time analytics combined with AI can improve sepsis detection rates by up to 32%, according to one report.
The Medical University of South Carolina (MUSC) uses this technology to monitor patient health continuously, drawing on EHR data and machine learning to classify signs of sepsis onset. This proactive approach enables timely intervention, potentially saving lives, due to real-time data.
To Encourage People to Take a Proactive Approach to their Health
Another popular use case of real-time analytics in healthcare includes smartwatches and fitness trackers. Devices from the likes of Apple, Samsung, Fitbit, and others have exploded in popularity in recent years, enabling people to monitor their own health and adopt healthier habits.
They help people walk more by tracking their daily step count via in-app challenges, calculate the calories they lose during workouts and sports activities, and monitor their daily caloric intake. These wearables collect data from their sensors and use real-time analytics to provide useful insights.
While these devices are far from replacements for a doctor visit, they can alert the user to potential health risks. If someone notices their heart rate is often too high or too low, they may be more likely to visit their physician for a checkup.
For instance, a 12-year-old girl was alerted by her Apple Watch that she had an unusually high heart rate, and promptly sought medical attention. She was taken to a healthcare facility where doctors found her suffering from a rare condition in children: a neuroendocrine tumor on her appendix.
To Manage the Spread of Disease
Real-time analytics in healthcare can also help healthcare institutions and doctors identify trends in the spread of illness. For instance, during the Covid-19 pandemic in 2020, healthcare institutions leveraged real-time analytics to track the growing outbreak. Healthcare organizations used machine learning algorithms fueled by data to analyze trends from the 50 countries with the highest rates of Covid-19 and predict what would happen in the next several days.
Healthcare providers also leveraged real-time analytics in healthcare to determine how fast the virus was spreading in real time and how it mutated under various conditions. For example, in 2020 the EU launched a software tool, InferRead, that collected image data from CT scanners to analyze whether lungs were damaged by a COVID infection. The analysis was generated within a few seconds, allowing a doctor to study it and diagnose the patient quickly.
Real-time analytics can also help to manage resources in the case of an outbreak. In the US, the Kinetica Active Analytics Platform was used to create a real-time analytics program for aggregating and tracking data. The purpose of this program was to aid emergency responders by collecting information on test kit quantities, personal protective equipment (PPE) availability, and hospital capacity. This allowed decision-makers to determine whether they could redirect patients to a hospital with capacity or set up alternative triage centers. Similarly, these insights also helped to distribute PPE to the locations where it was needed most, especially when a shortage made access more difficult.
To Optimize Hospital Staff Allocation
Healthcare institutions often face the critical challenge of maintaining optimal staffing levels. Leveraging real-time healthcare analytics can transform how hospitals predict staffing needs by analyzing historical data and identifying patterns in staffing operations. By continuously examining how nurses and other staff operated under varying circumstances, real-time analytics generates recommendations for each hour, considering potential unforeseen scenarios. This ensures that patients receive an appropriate level of care, minimizing resource gaps and elevating the standard of patient care.
Intel’s recent paper highlights how real-time healthcare analytics enables four hospitals to use data from diverse sources to forecast admissions accurately. By applying time series analysis — a statistical technique designed to identify patterns within admission records — these hospitals can predict patient arrivals hour by hour, optimizing preparation and resource allocation.
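Time series analysis for admissions can range from sophisticated seasonal models down to simple baselines; even a naive average of admissions per hour of day (with made-up numbers below) shows the shape of the idea:

```python
def hourly_baseline(history):
    """Average admissions per hour-of-day over historical records.
    history: list of (hour_of_day, admissions) tuples. This is a toy
    baseline, far simpler than the models hospitals actually use."""
    totals, counts = {}, {}
    for hour, n in history:
        totals[hour] = totals.get(hour, 0) + n
        counts[hour] = counts.get(hour, 0) + 1
    return {h: totals[h] / counts[h] for h in totals}

history = [(9, 12), (9, 16), (10, 20), (10, 24)]  # illustrative data
forecast = hourly_baseline(history)
print(forecast[9], forecast[10])  # 14.0 22.0
```

A real deployment would layer day-of-week and seasonal effects on top of this, and feed live admissions back into the model as they arrive.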
Additionally, data insights from real-time analytics empower healthcare institutions to enhance job satisfaction and reduce turnover. By identifying the percentage of experienced staff open to emergency shifts or overtime with incentives, healthcare providers can better manage workloads and redistribute tasks to prevent burnout.
Improve Patient Care and Operational Efficiency with Striim
For healthcare organizations aiming to optimize real-time healthcare analytics, Striim 5.0 offers a robust, secure solution. The platform not only ingests and analyzes high volumes of data in real time but also introduces AI agents Sentinel and Sherlock to protect sensitive patient information. This feature automates authentication and connection processes, reducing overhead, enhancing data security, and ensuring compliance by masking personally identifiable information.
Discovery Health achieved a remarkable transformation with Striim, slashing data processing times from 24 hours to seconds. By replacing daily ETL processes with Striim’s Change Data Capture (CDC) technology, the organization seamlessly integrated disparate systems, eliminating delays and enabling faster, more responsive decisions. This innovation improved efficiency, reduced costs, and fostered personalized engagement by leveraging predictive analytics to encourage healthier member choices.
Backed by Oracle, Striim delivered unmatched reliability and scalability, utilizing advanced logical database replication expertise. The platform’s real-time insights empowered Discovery Health to promote wellness, enhance health outcomes, and streamline workflows. Through ongoing optimization, Discovery Health revolutionized its data infrastructure, driving informed decision-making and elevating customer experiences on a global scale.
Another healthcare organization that leverages Striim is Boston Children’s Hospital. In addition to enhancing patient outcomes, improving operational efficiency is critical to success in healthcare organizations. By consolidating data from multiple systems, including patient, billing, scheduling, clinical, and financial information, hospitals can streamline their operations and make faster, data-driven decisions.
Striim’s platform enables near real-time and batch-style processing of data from diverse sources like MS SQL Server, Google BigQuery, and Oracle, all feeding into a centralized Snowflake data warehouse. This seamless integration reduces the need for various scripts and disparate source systems, providing a single, cohesive view of the data pipelines. The hospital has not only saved time and money on support resources but has also significantly reduced the time it takes to deliver actionable insights to business users, a crucial factor in the fast-paced healthcare industry.
Ready to see for yourself how Striim can streamline operations and improve patient outcomes? Get started with a demo today.
We recently announced that Striim is a participant in Microsoft’s Intelligent Data Platform partner ecosystem. We’re also excited to share that Striim extends Synapse Link to add support for additional source systems.
There’s no question about the benefits of Azure Synapse. Whether it’s around on-demand usage, the ability to reduce high CapEx projects and increase cost savings, or enabling insight-driven decisions as fast as possible, Synapse can be an integral piece to your digital transformation journey. However, in order to make the most of Synapse and Power BI you need to reliably ingest data from disparate sources in real time.
In order to do so, Azure introduced Synapse Link, a method of easily ingesting data from Cosmos DB, SQL Server 2022, SQL DB, and Dataverse. Synapse Link utilizes either the change feed or change tracking to support continuous replication from the source transactional system. Rather than relying on legacy ETL tools to ingest data into Synapse on a nightly basis, Synapse Link enables more real-time analytical workloads with a smaller performance impact on the source database.
Outside of the sources included today with Synapse Link, Microsoft partnered with Striim to add support for real-time ingestion from Oracle and Salesforce to Synapse. Striim enables real-time Smart Data Pipelines into critical cloud services via log-based change data capture (CDC). CDC is the least intrusive method of reading from a source database, since it reads from the underlying transaction logs rather than querying the database itself, empowering replication of high-value, business-critical workloads to the cloud with minimal downtime and risk.
Besides pure data replication use cases, one common pattern that we see is the requirement to pre-process data in flight before even landing on Synapse. This reduces the time to value, and gets the data in the right format ahead of time. Within Striim it’s incredibly easy to do so either with out-of-the-box transformations, SQL code, or even Java for the most flexibility.