John Kutay and Mariana Park


An Overview of Reverse ETL: A New Approach to Make Your Operational Teams More Data-Driven


Consider the following stats: the 2020 Vena Industry Benchmark Report found that 57% of finance teams see data silos as a challenge; Treasure Data’s Customer Journey Report noted that 47% of marketers find it hard to access their information due to data silos; and a Forrester study stated that 51% of sales professionals aren’t satisfied with how their organizations provide customer data. In short, non-technical users are struggling with data access. So, what’s causing these data silos?

Most organizations store their data in a data warehouse. Due to the inherent structure of the extract, transform, and load (ETL) process, this data is mainly used by data scientists, data engineers, and data analysts. These roles do their best to provide data to other non-data departments like customer success, sales, and marketing. However, these non-data departments need a better form of data access and analytical insights. That’s where reverse ETL can be a game-changer.

Reverse ETL is a new addition to the modern data ecosystem that can make organizations more data-driven. It empowers operational teams to get access to transformed data in their day-to-day business platforms, such as ERPs, CRMs, and MarTech tools.

  1. What is ETL vs. Reverse ETL?
  2. How does Reverse ETL work?
  3. Why Should You Adopt Reverse ETL?
  4. Beat Your Competition with a Personalized Customer Experience

What Is ETL vs. Reverse ETL?

The ETL process takes data from a source, such as customer touchpoints (e.g., a CRM), processes/transforms this data, and then loads it into a target, which is usually a data warehouse. Reverse ETL does the opposite by swapping the source and destination: it takes data from the data warehouse and sends it to operational business platforms.

Another difference between ETL and reverse ETL is their approach to data transformations. With ETL, data engineers perform data transformation before loading data into a data warehouse. This data can be used by data scientists and data analysts who analyze it for different purposes, such as building reports and dashboards.

In reverse ETL, data engineers perform data transformation on the data in the data warehouse so that third-party tools can use it immediately. This data is used by marketers, sales professionals, customer success managers, and other non-data roles to make data-driven decisions.

For instance, suppose your BI report shows cost per lead (CPL) data that you need to send to a CRM system. In that case, your data engineer performs data transformations via SQL in your data warehouse. This transformation isolates your CPL data, formats it for the destination platform, and loads it into the CRM so your marketing experts can use it in their campaigns.
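To make this concrete, here is a minimal Python sketch of that flow. It assumes a Snowflake warehouse (accessed with the snowflake-connector-python package) and a hypothetical CRM REST endpoint; it is not Striim’s implementation, and a production pipeline would add batching, retries, and error handling.

```python
# Illustrative reverse ETL push: compute CPL in the warehouse, send it to a CRM.
# The table, columns, and CRM endpoint below are hypothetical.
import requests
import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()

# Transformation done in the warehouse: isolate CPL per campaign and
# shape it into the field the CRM expects.
cur.execute("""
    SELECT campaign_id, SUM(spend) / NULLIF(COUNT(lead_id), 0) AS cost_per_lead
    FROM marketing.lead_events
    GROUP BY campaign_id
""")

for campaign_id, cpl in cur.fetchall():
    if cpl is None:
        continue  # campaign has no leads yet
    # Hypothetical CRM endpoint; a real integration would use the CRM's API client.
    requests.patch(
        f"https://crm.example.com/api/campaigns/{campaign_id}",
        json={"cost_per_lead": float(cpl)},
        timeout=10,
    )
```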

How Does Reverse ETL Work?

Reverse ETL solutions deliver real-time data to operational and business platforms (like Salesforce, Intercom, Zendesk, MailChimp, etc.). It is a process that turns your data warehouse into a data source and the operational and business platforms into a data destination. Making data readily available to these platforms can give your front-line teams a 360-degree view of customer data. They can use data-driven decision-making for personalized marketing campaigns, smart ad targeting, proactive customer feedback, and other use cases.

The modern data stack. Striim (a real-time data integration and streaming ETL tool) continuously feeds data to your cloud data warehouse, where it can be processed and analyzed by members of the data team. Reverse ETL activates your data, making it accessible in platforms used by your front-line teams.

One might wonder: why are we moving the data back to those SaaS tools after moving data from them to data warehouses? That’s because sometimes, data warehouses can fail to address data silos.

Your key business metrics might be isolated in your data warehouse, limiting your non-data departments from making the most of your data. With traditional ETL, these departments are highly dependent on your data teams. They have to ask data analysts to send a report every time they need relevant insights. Likewise, once they add a new SaaS tool to their workflow, they rely on your data engineer to write custom API connectors. These issues can slow the speed of data access and availability for your front-line business users. Fortunately, reverse ETL can plug this gap.

Reverse ETL can help you to sync your KPIs (e.g., customer lifetime value) with your operational platforms. It ensures your departments can get real-time and accurate insights to pave the way for data-driven decision-making.

Why Should You Adopt Reverse ETL?

Reverse ETL solves a myriad of issues by democratizing data access, saving your data resources, and automating workflows.

It democratizes data beyond the data team

Reverse ETL enables data teams to channel data insights to other operational business teams in their usual workflow. Data becomes accessible and actionable because it is streamed directly from the data warehouse to platforms like CRMs, advertising, marketing automation, and customer support ticketing systems.

Providing more in-depth knowledge to front-line teams, such as your customer success team, helps team members make better decisions. It ensures that your front-line personnel are equipped with comprehensive insights that help them personalize the customer experience. For instance, suppose your data science team uses complex modeling to segment your customer data and refreshes the segments every week. Your customer success team can use reverse ETL to sync this data automatically to an email platform and send personalized emails.

It reduces the engineering burden on data engineers

Traditionally, data engineers have had to build API connectors to channel data from the data warehouse to operational business platforms. These API connectors come with a myriad of challenges, including:

  1. Writing APIs and maintaining them is challenging for data engineers.
  2. It can take a few days to map fields from a source of truth (e.g., a data lake) to a SaaS app.
  3. Often, these APIs are unable to process real-time data transfer.

Reverse ETL is designed to address these challenges. For starters, reverse ETL tools come with built-in connectors, so data teams don’t have to write and maintain custom API connectors. Previously, data teams might have written only a limited number of connectors; reverse ETL’s out-of-the-box connectors mean that companies can now send data into many more systems.

Moreover, reverse ETL tools provide a visual interface that allows you to populate SaaS fields automatically, and they let you define what triggers the movement of data between your data warehouse and operational business platforms so that data moves in real time.

As a result, you can save your data engineers’ time, and they can now turn their focus to other pressing data issues.

It automates and distributes data flow across multiple apps

Reverse ETL eliminates the manual process of switching between apps to get information. Reverse ETL feeds relevant KPIs and metrics to the operational systems at a pre-established frequency. This way, it can automate a number of workflows.

For instance, consider that your sales team uses Zendesk Sell as a CRM. One of the things that they do manually is to track freemium accounts and look for ways to turn them into paid users. For this purpose, your account managers need to jump back and forth between BI and CRM tools to view where these users are placed in the sales funnel.

Reverse ETL can load your product data from your data warehouse into Zendesk and generate an alert that notifies the account managers as soon as a freemium account crosses a defined threshold in your sales funnel.
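Purely as an illustration (the table name, usage threshold, and alert endpoint are invented, and a real Zendesk Sell integration would go through its own API), the detection-and-alert logic might look like this:

```python
# Hypothetical sketch: find freemium accounts that crossed a usage threshold
# in the warehouse and flag them so account managers are notified.
import requests
import snowflake.connector

USAGE_THRESHOLD = 1000  # hypothetical "ready to upgrade" signal

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()
cur.execute(
    """
    SELECT account_id, weekly_active_users
    FROM product.usage_summary
    WHERE plan = 'freemium' AND weekly_active_users >= %s
    """,
    (USAGE_THRESHOLD,),
)

for account_id, wau in cur.fetchall():
    # Placeholder alerting endpoint; this is where the CRM update would happen.
    requests.post(
        "https://crm.example.com/api/upgrade_alerts",
        json={"account_id": account_id, "weekly_active_users": int(wau)},
        timeout=10,
    )
```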

While Reverse ETL has many benefits, the long-term payoff is in an improved customer experience. By addressing ETL’s woes, reverse ETL puts contextual information at the fingertips of your customer-facing teams. The end result is a seamless, personalized service that enriches the customer experience.

Beat Your Competition With a Personalized Customer Experience

Every organization wants to get the most value out of the data in their data warehouse — because therein lies the answers to serving customers better and creating hyper-personalized experiences. By feeding comprehensive insights to the front-line teams, reverse ETL can help you to improve your customer personalization.

Suppose there’s a winter clothing brand that has the data to identify buyers who bought their winter coats last winter. If they want to launch a winter sale for their previous customers, reverse ETL can help their marketing teams to view detailed information from their tools. This is done by pulling the relevant data directly from the data warehouse and placing it in the software they are already using. They can use this data access to work on a hyper-personalized marketing campaign to appeal to those customers. Hence, by using reverse ETL, you can get a unified view of the customer in all your tools.

In many cases, data has a short shelf life, and needs to be acted on quickly. For example, SaaS companies that follow the Product Led Growth (PLG) model continuously collect product usage data. If a user hits a key milestone in product usage, or gets stuck at a certain point, this information can be shared with the sales or customer support teams for personalized outreach and support at exactly the right moment. Waiting hours or days to act on insights may mean a lost customer or upgrade opportunity.

A real-time data pipeline starts with a streaming ETL platform like Striim that continuously delivers data to your data warehouse. Once there, customer data can be synced to your applications to support your customer-facing team members. Real-time data pipelines underpin superior customer experiences and increased revenue.

To learn more about how Striim supports real-time data integration use cases, please get in touch or try Striim for free today.

 

DataOps vs DevOps: An Overview of the Twins of Digital Transformation


If you’re comparing DataOps vs DevOps for a practical digital transformation approach, you’re engaging in the wrong debate. Sure, DevOps has revolutionized how companies deliver software. But DataOps is transforming how companies utilize data. So instead, the better question should be: How can I use both DevOps and DataOps to enhance the value I deliver to my customers?

Regardless of the size of your enterprise or the industry you operate in, a good understanding of DevOps and DataOps principles, differences, use cases, and how they fit together is instrumental to accelerating product development and improving your technology processes.

What is DevOps?

Development Operations (or DevOps) combines principles, technologies, and processes that improve and accelerate an organization’s capacity to produce high-quality software applications, allowing it to evolve and improve products faster than traditional software development methods.

How DevOps Fits Into The Technology Stack

Fundamentally, DevOps is about improving the tools and processes around delivering software. In the traditional software development model, the development and operations teams are separate. The development team focuses on designing and writing software. The operations team handles processes not directly related to writing code, such as software deployments, server provisioning, and operational support. The disadvantage of this approach is that the development team has to depend on the operations team to ship new features, which leads to slower deployments.

Additionally, when bug fixes and problems arise, the operations team has to depend on the development team to resolve them; this leads to a longer time to detect and resolve issues and ultimately affects software quality. Consequently, the DevOps model came as a solution to these pain points.

In a DevOps model, the development and operations teams no longer work in isolation. Often, these two teams combine into a single unit where the software engineers work on the entire application cycle, from development and testing to deployment and operations, to deliver the software faster and more efficiently. Larger companies often have specialized “DevOps engineers” whose primary purpose is to build, test, and maintain the infrastructure and tools that empower software developers to release high-quality software quickly. 

What is DataOps?

Data Operations (or DataOps) is a data management strategy that focuses on improving collaboration, automation, and integration between data managers and consumers to enable quick, automated, and secure data flows (acquisition, transformation, and storage) across an organization. Its purpose is to deliver value from data faster by enabling proper data management and bringing together those who require data and those who operate it, removing friction from the data lifecycle.

How DataOps Fits Into The Technology Stack

The primary purpose of DataOps is to deliver value from data faster by streamlining and optimizing the management and delivery of data, thereby breaking down the traditional barriers that prevent people from accessing the data they need. 

According to a recent study by VentureBeat, lack of data access is one of the reasons why 87% of data science projects never make it to production. Data consumers like data scientists and analysts, who are responsible for utilizing data to generate insights, depend on data operators such as database administrators and data engineers to provide data access and infrastructure. For example, take a data scientist who has to rely on a data engineer to clean and validate the data and set up the required environment to run machine learning (ML) models. In this case, the faster the data scientists get their requirements met, the quicker they can start delivering value on projects.

Additionally, a data scientist who doesn’t understand how the data engineer collected and prepared their data will waste time making inferences out of the noise. Similarly, a data engineer who doesn’t understand the use cases of their data will create unusable data schemas and miss crucial data quality issues. Consequently, the DataOps process came to mitigate these data pain points.

DataOps takes this cluttered mess and turns it into a smooth process where data teams aren’t spending their time trying to fix problems. Instead, they can focus on what matters: providing actionable insights. DataOps relies heavily on the automation capabilities of DevOps to overcome data friction. For example, suppose processes such as server provisioning and data cleaning are automated. A data scientist can easily access the data they need to run their models, and analysts can run reports in minutes, not days. Larger companies often have specialized “DataOps engineer” roles whose purpose is to automate data infrastructure needs and to build and deploy tools that help data consumers utilize data quickly to deliver value to the enterprise.

DataOps and DevOps: Overlapping Principles

DevOps and DataOps share some underlying principles. They require a cultural shift from isolation to collaboration. They both depend heavily on tools and technologies for process automation, and they both employ agile methodologies that support incremental delivery. As such, both DevOps and DataOps represent a sweeping change to the core areas of culture, process, and technology.

  • Culture: DevOps and DataOps require a change in culture and mindset, focused on collaboration and delivering value instead of isolation and performing functions. In both cases, every team works together towards a common goal – delivering value. DevOps is about removing the barriers that prevent software from being deployed fast. For DataOps, it is breaking down obstacles that prevent people from managing and accessing data quickly. 
  • Process: DevOps and DataOps require an end-to-end revision of traditional methods, focusing on automation and continuous improvement. Both leverage continuous integration and delivery (CI/CD) in their processes. In the case of DevOps, software is merged into a central repository, tested, built, and deployed to various environments (test and production). In the case of DataOps, CI/CD involves setting up workflows that automate data processes such as uploading, cleaning, and validating data from source to destination (see the sketch after this list).
  • Technology: DevOps and DataOps rely heavily on tools to provide complete automation for different workflows (development, testing, deployment, monitoring). For DevOps, tools such as Jenkins and Ansible help automate the entire application lifecycle from development to deployment. In the case of DataOps, platforms like Apache Airflow and DataKitchen help organizations control their data pipelines from data orchestration to deployment. Additionally, data integration tools like Striim automate data integration from multiple sources, helping organizations quickly access their data.
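As a minimal, tool-agnostic sketch of the kind of workflow DataOps CI/CD automates (the file name, columns, and quality rule are hypothetical, and in practice an orchestrator such as Apache Airflow would schedule and monitor the steps), a data pipeline stage might look like this:

```python
# Hypothetical extract -> clean -> validate -> load pipeline.
import csv
from pathlib import Path

def extract(path: Path) -> list[dict]:
    with path.open(newline="") as f:
        return list(csv.DictReader(f))

def clean(rows: list[dict]) -> list[dict]:
    # Drop rows missing a customer id and normalize email casing.
    return [
        {**r, "email": r.get("email", "").strip().lower()}
        for r in rows
        if r.get("customer_id")
    ]

def validate(rows: list[dict]) -> None:
    # Fail the run if a basic quality check is violated, so bad data never loads.
    if any("@" not in r["email"] for r in rows):
        raise ValueError("data quality check failed: malformed email")

def load(rows: list[dict]) -> None:
    # Placeholder for a warehouse load (e.g., a COPY or INSERT via your client).
    print(f"loading {len(rows)} validated rows")

if __name__ == "__main__":
    rows = clean(extract(Path("customers.csv")))
    validate(rows)
    load(rows)
```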

DataOps vs DevOps: Differences and Use Cases

Although DevOps and DataOps have similarities, one mistake companies make when comparing them is thinking they are the same thing. They tend to take everything they have learned about DevOps, apply it to “data,” and present it as DataOps; this misconception can add needless complexity and confusion and fails to reap the benefits of DataOps processes. Some differences between DevOps and DataOps are:

  • DevOps focuses on optimizing software delivery; DataOps focuses on optimizing data management and access.
  • DevOps involves primarily technical people: software engineers, testers, and IT operations teams. In contrast, the participants in DataOps are a mix of technical (data engineers, data scientists) and non-technical people (business users and other stakeholders).
  • DevOps requires somewhat limited coordination once set up. However, because of the ever-changing nature of data, its use cases, and everyone who works with it, DataOps requires consistent coordination of data workflows across the entire organization.

While foundationally, the concepts of DevOps serve as a starting point for DataOps, the latter involves additional considerations to maximize efficiency when operating data and analytical products. 

Each approach has its unique strengths, making it the best choice for different scenarios.

Top DevOps Use Cases

  • Faster development of quality software: One of the top use cases of DevOps is shipping software products faster. Google’s 2021 State of DevOps report stated that DevOps teams now deploy software updates 973x more frequently than traditional development teams. Additionally, Netflix reported faster, higher-quality software deployments after switching to DevOps. They implemented a model where all software developers are “full cycle developers” responsible for the entire application lifecycle. The result was a boost in the speed of software deployments from days to hours.
  • Improved Developer Productivity: Organizations that have implemented DevOps have seen an increase in the productivity of their development teams. By automating the underlying infrastructure to build, test, and deploy applications, developers can focus on what matters: building quality solutions. By implementing DevOps practices, Spotify boosted its developer productivity by 99%; time to develop and deploy websites and backend services went from 14 days to 5 minutes.

Top DataOps Use Cases

  • Accelerates Machine Learning and Data Analytics Workloads: The main goal of DataOps is to remove barriers that prevent people from accessing data. A recent survey by McKinsey reported that organizations spend 80 percent of their analytics time on repetitive tasks such as preparing data. When such repetitive tasks are automated, data consumers like data scientists and data analysts can access data faster to perform machine learning and data analytics workloads. In a recent case study by DataKitchen, pharmaceutical company Celgene saw improvements in the development and deployment of analytics processes and in the quality of insights after implementing DataOps. Visualizations that previously took weeks or months now take less than a day.
  • Improved Data Quality: Ensuring data quality is of utmost importance to any enterprise. In a study by Gartner, organizations surveyed reported that they lose close to $15 million per annum due to poor data quality. By implementing DataOps practices, companies can improve the quality of data that flows through the organization and save costs. In 2019, Airbnb embarked on a data quality initiative to rebuild the processes and technologies of their data pipelines. They automated the data validation and anomaly detection by leveraging tools that enabled extensive data quality and accuracy in their data pipelines. 
  • Maintain Data Freshness Service Level Agreements: Data has entered a new stage of maturity where data teams must adhere to strict Service Level Agreements (SLAs). Data must be fresh, accurate, traceable, and scalable with maximum uptime so businesses can react in real time to business events and ensure superior customer experiences. By incorporating DataOps practices, companies can ensure that data is dispersed across business systems with minimal latency. For example, Macy’s uses Striim to meet the demands of online buyers by replicating inventory data with sub-second latency, allowing them to scale for peak holiday shopping workloads.

Use Both DevOps and DataOps For The Best Of Both Worlds

Both DevOps and DataOps teams depend on each other to deliver value. Therefore, companies get the best success by incorporating DevOps and DataOps in their technology stack. The result of having both DevOps and DataOps teams working together is accelerated software delivery, improved data management and access, and enhanced value for the organization.

As you begin your digital transformation journey, choose Striim. Striim makes it easy to continuously ingest and manage all your data from different sources in real-time for data warehousing. Sign up for your free trial today!

Your Guide to Optimizing Snowflake Costs for Real-Time Analytics

Data warehouses allow businesses to find insights by storing and analyzing huge amounts of data. Over the last few years, Snowflake, a cloud-native relational data warehouse, has gained significant adoption across organizations for real-time analytics.

In this post, we’ll share an overview of real-time analytics with Snowflake, some challenges with real-time data ingestion in Snowflake, and how Striim addresses these challenges and enables users to optimize their costs.

Continuous Pipelines and Real-Time Analytics With Snowflake: Architecture and Costs

Snowflake’s flexible and scalable architecture allows users to manage their costs by independently controlling three functions: storage, compute, and cloud services. Consumption of Snowflake resources (e.g. loading data into a virtual warehouse and executing queries) is billed in the form of credits. Data storage is billed at a flat-rate based on TB used/month. Storage and credits can be purchased on demand or up front (capacity).

In 2017, Snowflake introduced the Snowpipe data loading service to allow Snowflake users to continuously ingest data. This feature, together with Snowflake Streams and Tasks, allows users to create continuous data pipelines. As shown below, Snowpipe loads data from an external stage (e.g., Amazon S3). When data enters the external staging area, an event is generated to request data ingestion by Snowpipe. Snowpipe then copies files into a queue before loading them into internal staging table(s). Snowflake Streams continuously record subsequent changes to the ingested data (for example, INSERTs or UPDATEs), and Tasks automate SQL queries that transform and prepare the data for analysis. A rough sketch of these objects follows the diagram below.

Creating a continuous data pipeline in Snowflake using Snowpipe to ingest data, and Streams and Tasks to automate change detection and data transformation. Image source: Snowflake docs.
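For readers who want to see what these building blocks look like in practice, here is a heavily abbreviated sketch that issues the corresponding Snowflake statements through the snowflake-connector-python package. The stage, table, and warehouse names are hypothetical, and the statements omit many options; consult the Snowflake documentation for the full syntax.

```python
# Minimal sketch of a Snowpipe + Streams + Tasks pipeline (hypothetical names).
import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()

# Snowpipe: auto-ingest files that land in an external stage (e.g., on Amazon S3).
cur.execute("""
    CREATE PIPE raw.orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw.orders FROM @raw.orders_stage FILE_FORMAT = (TYPE = 'JSON')
""")

# Stream: record subsequent changes (inserts/updates) on the ingested table.
cur.execute("CREATE STREAM raw.orders_stream ON TABLE raw.orders")

# Task: periodically run SQL that prepares newly captured changes for analysis.
cur.execute("""
    CREATE TASK analytics.refresh_orders
      WAREHOUSE = transform_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('raw.orders_stream')
    AS
      INSERT INTO analytics.orders SELECT * FROM raw.orders_stream
""")
cur.execute("ALTER TASK analytics.refresh_orders RESUME")
```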

While Snowpipe is ideal for many use cases, continuous ingestion of large volumes of data can present challenges, including:

  • Latency: On average, once a file notification is sent, Snowpipe loads incoming data after about a minute. Larger files take longer to load. Similarly, if your workload requires a significant amount of compute for decompression, decryption, and transformation of the fresh data, ingestion with Snowpipe will take longer.
  • Cost: Snowpipe utilization costs include an overhead for managing files in the internal load queue. The more files queued for loading, the more this overhead grows. For every 1,000 files queued, you pay 0.06 credits. For example, if your application loads 100,000 files into Snowflake daily, Snowpipe will charge you six credits per day.

Real-time Analytics With Striim and Snowflake: An Overview

Striim is a real-time data integration platform that enables you to process and analyze your streaming data and control how your data is uploaded to Snowflake.

With growing workloads, you can reduce costs incurred during real-time analytics by integrating Striim with Snowflake. Striim is an end-to-end data integration platform that enables enterprises to derive quick insights from high-velocity and high-volume data. Striim combines real-time data integration, streaming analytics, and live data visualization in a single platform.

Some benefits of using Striim for real-time analytics include:

  • Support for all types of data, including structured, unstructured, and semi-structured data from on-premises and cloud sources
  • High-performance change data capture from databases
  • In-memory, high-speed streaming SQL queries for in-flight transformation, filtering, enrichment, correlation, aggregation, and analysis
  • Live dashboards and visualizations
  • A customizable Snowflake Writer that gives users granular control over how data is uploaded to Snowflake

Next, we’ll share two examples demonstrating how Striim can support real-time analytics use cases while helping you optimize your Snowflake costs.

Optimize Snowflake Costs With Striim: Examples 

Earlier, we discussed the latency and cost considerations for continuous ingestion with Snowpipe. Striim lets you minimize these costs via an “Upload Policy” that allows you to set parameters that control the upload to Snowflake. These parameters allow the Snowflake Writer, which writes to one or more tables in Snowflake, to consolidate the multiple input events of individual tables and perform operations on these groups of events with greater efficiency. You can control the number of events and set interval and file sizes. Unlike other tools that limit you to these settings at the global level, Striim goes one step further by allowing these configurations at the table level.

These events are staged to AWS S3, Azure Storage, or local storage, after which they are written to Snowflake. You can configure the Snowflake Writer in Striim to collect all the incoming events, batch them locally as a temp file, upload them to a stage table, and finally merge them with your final Snowflake table. Here’s what an example workflow would look like:

Oracle Reader → Stream → Snowflake Writer (batch → temp file → upload to cloud storage → stage → merge)

Example 1: Choose how to batch your data to meet your SLAs

The Snowflake Writer’s “Upload Policy” allows you to batch data at varying frequencies based on your company’s data freshness service level agreement (SLA). Data freshness is the latency between data origination and its availability in the warehouse. For example, data freshness can measure the time it takes for data to move from your sales CRM into your Snowflake warehouse. A data freshness SLA is a commitment that guarantees moving data within a specific period.

Let’s take a look at the following diagram to review how Striim manages data freshness SLAs. If you have a large data warehouse, you often need critical data views (e.g., customer data) and reports. With a 5-minute data freshness SLA, you can ensure that this data loads into your data warehouse within five minutes of being generated in its data source (e.g., an ERP system). However, for other use cases, such as reports, you don’t necessarily need data immediately, so you can settle for a 30-minute data freshness SLA. Depending on how fast you need data, Striim uses fast or medium/low SLA tables to deliver data via its Snowflake Writer.

How to upload data into Snowflake with Striim

While other tools have a sync frequency, Striim has come up with an innovative way of handling data freshness SLAs. For instance, if 50% of your tables can be reported with a one-hour data freshness SLA, 35% with a 30-minute SLA, and 15% with a five-minute SLA, you can split up these tables. This way, you can use Striim to optimize the cost of ingesting your data.
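As a rough illustration of that split (the table names and SLA assignments are hypothetical, and the exact upload-policy syntax should be taken from the Striim documentation rather than from this sketch):

```python
# Group tables by data freshness SLA and derive an interval-style upload policy.
SLA_MINUTES = {
    "sales.orders": 5,         # critical, customer-facing views
    "crm.accounts": 30,        # operational reporting
    "finance.gl_entries": 60,  # end-of-day reports
}

def upload_policy(sla_minutes: int) -> str:
    # Mirrors the "Interval" upload policy discussed in this post (illustrative only).
    return f"Interval:{sla_minutes}m"

for table, sla in SLA_MINUTES.items():
    print(f"{table}: UploadPolicy = {upload_policy(sla)}")
```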

In addition to giving you granular control over the upload process, Striim enables you to perform in-flight, real-time analysis of your data before it’s uploaded to Snowflake. Striim is horizontally scalable, so there are no limitations if you need to analyze large volumes of data in real time. So, even if you choose to batch your data for Snowflake, you can still analyze it in real time in Striim.

Example 2: Reduce costs by triggering uploads based on a given time interval

As mentioned previously, the Snowflake Writer’s “Upload Policy” governs how and when the upload to Snowflake is triggered. Two of the parameters that control the upload are “EventCount” and “Interval”. So, which of these two parameters yields lower latencies and costs? In most cases, using the “Interval” parameter is the better option; here’s an example to show why.

Assume that a system is producing 100k events a minute. Setting “UploadPolicy = EventCount:10000” instructs the Snowflake Writer to work through 10 upload-and-merge tasks for each minute of incoming data (100,000/10,000). Assume that each upload-and-merge takes one minute to complete, so the first minute’s events alone take at least 10 minutes to process. By the end of the fifth minute, the source will have pumped another 400k events, queueing another 40 upload-and-merge tasks (400,000/10,000), while only a handful have completed. In this configuration, the lag keeps increasing over time.

If you go with the “Interval” approach, you get better results. Setting “UploadPolicy = Interval:5m” instructs the Snowflake Writer to run an upload-and-merge task every five minutes. Every five minutes, the Snowflake Writer receives 500,000 events from the source and completes the upload and merge in two minutes (an assumption). You would see a constant latency of about seven minutes (the five-minute interval plus the two-minute upload-and-merge time) across all batches.

Cost is another advantage of the “Interval”-based approach. With “EventCount”, the Snowflake virtual warehouse is kept active continuously: over any five-minute window, the Snowflake Writer is busy for the entire five minutes, compared to roughly two minutes with “Interval”. Therefore, the “Interval”-based approach can also help you optimize your Snowflake costs.
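A back-of-the-envelope calculation, using the same assumptions as above (100k events per minute from the source, one minute per 10,000-event upload-and-merge, two minutes per 500,000-event upload-and-merge), makes the difference concrete:

```python
# Compare the two upload policies under the stated assumptions.
SOURCE_RATE = 100_000  # events produced per minute

# EventCount:10000 -> effective throughput is 10k events per minute of work,
# so the backlog grows by 90k events every minute and lag keeps increasing.
eventcount_throughput = 10_000
backlog_growth = SOURCE_RATE - eventcount_throughput
print(f"EventCount policy: backlog grows by {backlog_growth:,} events per minute")

# Interval:5m -> a 500k-event batch accumulates for 5 minutes and takes about
# 2 minutes to upload and merge, giving a steady end-to-end latency.
interval_minutes = 5
merge_minutes = 2
print(f"Interval policy: steady latency of about {interval_minutes + merge_minutes} minutes")

# Warehouse activity per 5-minute window (a rough proxy for credit consumption).
print("EventCount policy: warehouse busy ~5 of every 5 minutes")
print(f"Interval policy: warehouse busy ~{merge_minutes} of every 5 minutes")
```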

Go Through Striim Documentation to Improve Your Data-Centric Operations

If you’d like to optimize your other data-related workloads, Striim’s documentation shows some of the things that you can do.

Want a customized walkthrough of the Striim platform and its real-time integration and analytics capabilities? Request a demo with one of our platform experts, or alternatively, try Striim for free.

Data Fabric: What is it and Why Do You Need it?

Insight-driven businesses have the edge over others; they grow at an average of more than 30% annually. Noting this pattern, modern enterprises are trying to become data-driven organizations and get more business value out of their data. But the rise of cloud, the emergence of the Internet of Things (IoT), and other factors mean that data is not limited to on-premises environments.

In addition, there are voluminous amounts of data, many data types, and multiple storage locations. As a consequence, managing data is getting more difficult than ever.

One of the ways organizations are addressing these data management challenges is by implementing a data fabric. Using a data fabric is a viable strategy to help companies overcome the barriers that previously made it hard to access data and process it in a distributed data environment. It empowers organizations to manage mounting amounts of data with more efficiency. Data fabric is one of the more recent additions to the lexicon of data analytics. Gartner listed data fabric as one of the top 10 data and analytics trends for 2021.

  1. What is a data fabric?
  2. Why do you need a data fabric in today’s digital world?
  3. Data fabric examples to consider for improving your organization’s processes
  4. Security is key to a successful data fabric implementation
  5. Building your data fabric with Striim
  6. Learn more: on-demand webinar with James Serra

What is a data fabric?

A data fabric is an architecture that runs technologies and services to help an organization manage its data. This data can be stored in relational databases, tagged files, flat files, graph databases, and document stores.

A data fabric architecture facilitates data-centric tools and applications to access data while working with various services. These can include Apache Kafka (for real-time streaming), ODBC (open database connectivity), HDFS (Hadoop distributed file system), REST (representational state transfer) APIs, POSIX (portable operating system interface), NFS (network file system), and others. It’s also crucial for a data fabric architecture to support emerging standards.

A data fabric is agnostic to architectural approach, geographical location, data use case, data process, and deployment platform. With a data fabric, organizations can work toward one of their most desired goals: having access to the right data in real time, with end-to-end governance, and all at a low cost.

Data fabric vs. data lake

Often it happens that organizations lack clarity on what makes a data lake different from a data fabric. A data lake is a central location that stores large amounts of data in its raw and native format.

However, there’s an increase in the trend of data decentralization. Some data engineers believe that it’s not practical to build a central data repository, which you can govern, clean, and update effectively.

On the other hand, a data fabric supports heterogeneous data locations. It simplifies managing data stored in disparate data repositories, which can be a data lake or a data warehouse. Therefore, a data fabric doesn’t replace a data lake. Instead, it helps it to operate better.

Why do you need data fabric in today’s digital world?

Data fabrics empower businesses to use their existing data architectures more efficiently without structurally rebuilding every application or data store. But why is a data fabric relevant today?

Organizations are handling challenges of growing scale and complexity. Today, their IT systems span disparate environments while managing both existing applications and modern applications powered by microservices.

Previously, software development teams went with their own implementation for data storage and retrieval. A typical enterprise data center stores data in relational databases (e.g., Microsoft SQL Server), non-relational databases (e.g., MongoDB), data repositories (e.g., a data warehouse), flat files, and other platforms. As a result, data is spread across rigid and isolated data silos, which creates issues for modern businesses.

Unifying this data isn’t trivial. Apps store data in a wide range of formats, even if they are using the same data. Besides, organizations store data in various siloed applications. Consolidating this data includes going through data deduplication — a process that removes duplicate copies of repeating data. Taking data to the right application at the right time is desirable, but it’s a tough nut to crack. That’s where a data fabric architecture can resolve your problem.

A data fabric helps to:

  • Handle multiple environments simultaneously, including on-premises, cloud, and hybrid.
  • Use pre-packaged modules to establish connections to any data source.
  • Bolster data preparation, data quality, and data governance capabilities.
  • Improve data integration between applications and sources.

A data fabric architecture allows you to map data from different apps, making business analysis easier. Your team can draw decisions and insights from existing and new data points with connected data. For instance, suppose an authorized user in the sales department wants to look at data from marketing. A data fabric lets them access marketing data seamlessly, in the same way they access sales data.

With a data fabric, you can build a global and agile data environment that can track and govern data across applications, environments, and users. For instance, if objects move from one environment to another, the data fabric notifies each component about this change and oversees the required processes, such as what process to run, how to run, and what’s the object’s state.

Data fabric examples to consider for improving your organization’s processes

The flexibility of a data fabric architecture helps in more ways than one. Some of the data fabric examples include the following:

Enhancing machine learning (ML) models

When the right data is fed to machine learning (ML) models in a timely manner, their learning capabilities improve. ML algorithms can be used to monitor data pipelines and recommend suitable relationships and integrations. These algorithms can obtain information from data while being connected to the data fabric, go through all the business data, examine that data, and identify appropriate connections and relationships.

One of the most time-consuming elements of training ML models is getting the data ready. A data fabric architecture helps to use ML models more efficiently by reducing data preparation time. It also aids in increasing the usability of the prepared data across applications and models. When an organization distributes data across on-premises, cloud, and IoT, it’s the data fabric that provides controlled access to secure data, enhancing ML processes.

Building a holistic customer view

Businesses can employ a data fabric to harness data from customer activities and understand how interacting with customers can offer more value. This could include consolidating real-time data of different sales activities, the time it takes to onboard a customer, and customer satisfaction KPIs.

For instance, an IT consulting firm can consolidate data from customer support requests and rework their sales activities accordingly. The firm receives concerns from its clients regarding the lack of a tool that can help them to migrate their on-premises databases to multi-cloud environments without downtime. The firm can then recognize the need to resolve this issue, find a reliable tool like Striim to address it, and have its sales representatives recommend the tool to customers.

Security is key to a successful data fabric implementation

Over the past few years, cyberattacks, especially ransomware attacks, have grown at a rapid rate. So, it’s no surprise organizations are concerned about the risk these attacks pose to their data security while data is being moved from one point to another in the data fabric.

Organizations can improve data protection by incorporating security protocols to protect their data from cyber threats. These protocols include firewalls, IPSec (IP Security), and SFTP (Secure File Transfer Protocol). Another thing to consider is a dynamic and fluid access control policy, which can be adapted dynamically to tackle evolving cyber threats.

With so many cyberattacks causing damages worth millions, securing your data across all points is integral for successfully implementing your data fabric architecture.

This can be addressed in multiple ways:

  • Ensuring data at-rest and in-flight are encrypted
  • Protecting your networking traffic from the public internet by using PrivateLink on services like Azure and AWS
  • Managing secrets and keys securely across clouds

Building your data fabric with Striim

Now that you know the benefits and some use cases of a data fabric, how can you start the transition towards a data fabric architecture in your organization? 

According to Gartner, a data fabric should have the following components:

  1. A data integration backbone that is compatible with a range of data delivery methods (including ETL, streaming, and replication)
  2. The ability to collect and curate all forms of metadata (the “data about the data”)
  3. The ability to analyze and make predictions from data and metadata using ML/AI models
  4. A knowledge graph representing relationships between data

While there are various ways to build a data fabric, the ideal solution simplifies the transition by complementing your existing technology stack. Striim serves as the foundation for a data fabric by connecting with legacy and modern solutions alike. Its flexible and scalable data integration backbone supports real-time data delivery via intelligent pipelines that span hybrid cloud and multi-cloud environments. 

Striim enables a multi-cloud/hybrid cloud data fabric architecture with automated, intelligent pipelines that continuously deliver data to consumers including data warehouses and data lakes.

Striim continuously ingests transaction data and metadata from on-premise and cloud sources and is designed ground-up for real-time streaming with:

  • An in-memory streaming SQL engine that transforms, enriches, and correlates transaction event streams
  • Machine learning analysis of event streams to uncover patterns, identify anomalies, and enable predictions
  • Real-time dashboards that bring streaming data to life, from live transaction metrics to business-specific metrics (e.g. suspected fraud incidents for a financial institution or live traffic patterns for an airport)
  • Hybrid and multi-cloud vault to store passwords, secrets, and keys. Striim’s vault also integrates seamlessly with 3rd party vaults such as HashiCorp

Continuous movement of data (without data loss or duplication) is essential to mission-critical business processes. Whether a database schema changes, a node fails, or a transaction is larger than expected — Striim’s self-healing pipelines resolve the issue via automated corrective actions. For example, Striim detects schema changes in source databases (e.g. create table, drop table, alter column/add column events), and users can set up intelligent workflows to perform desired actions in response to DDL changes. 

As shown below, in the case of an “Alter Table” DDL event, Striim is configured to automatically propagate the change to downstream databases, data warehouses and data lakehouses.  In contrast, in the case of a “Drop Table” event, Striim is set up to alert the Ops Team.

How intelligent workflows can be set up to automatically respond to different types of DDL/schema changes.
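As an illustration only (this is not Striim’s workflow configuration syntax; the event fields and actions are placeholders), the routing logic behind such a workflow can be pictured like this:

```python
# Route DDL change events to different actions depending on their type.
from dataclasses import dataclass

@dataclass
class DdlEvent:
    ddl_type: str   # e.g., "ALTER TABLE", "DROP TABLE", "CREATE TABLE"
    table: str
    statement: str

def propagate_to_targets(event: DdlEvent) -> None:
    # Placeholder: apply the equivalent DDL on downstream warehouses/lakehouses.
    print(f"propagating {event.ddl_type} on {event.table}")

def alert_ops_team(event: DdlEvent) -> None:
    # Placeholder: e.g., post to an incident or chat webhook instead of applying.
    print(f"ALERT: {event.ddl_type} detected on {event.table}")

def handle_ddl(event: DdlEvent) -> None:
    if event.ddl_type == "DROP TABLE":
        alert_ops_team(event)
    else:
        propagate_to_targets(event)

handle_ddl(DdlEvent("ALTER TABLE", "customers", "ALTER TABLE customers ADD COLUMN tier TEXT"))
handle_ddl(DdlEvent("DROP TABLE", "orders", "DROP TABLE orders"))
```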

With Striim at its core, a data fabric functions as a comprehensive source of truth — whether you choose to maintain a current snapshot or a historical ledger of your customers and operations. The example below shows how Striim can replicate exact DML statements to the target system, creating an exact replica of the source:

DML propagation to replicate database changes from source to target. This will actually perform updates and deletes on your target system to match it to the source exactly.

And the following example shows how Striim can be used to maintain a historical record of all the changes in the source system:

History mode for a record of all changes. This shows the logical change event and the optype, including what has changed in the row.
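To make the contrast between the two modes concrete, here is a small, hypothetical Python sketch (the schema and change events are invented for illustration): the snapshot target mirrors the source exactly, while the history target keeps every logical change event with its optype.

```python
# Contrast "current snapshot" replication with a "history" ledger of changes.
from typing import Optional

snapshot: dict = {}  # current state keyed by primary key
history: list = []   # append-only ledger of change events

def apply_change(optype: str, key: int, data: Optional[dict]) -> None:
    # History mode: keep the logical change event, including the optype.
    history.append({"optype": optype, "key": key, "data": data})

    # Current snapshot mode: replicate the DML so the target matches the source.
    if optype in ("INSERT", "UPDATE"):
        snapshot[key] = data
    elif optype == "DELETE":
        snapshot.pop(key, None)

apply_change("INSERT", 1, {"name": "Acme", "plan": "freemium"})
apply_change("UPDATE", 1, {"name": "Acme", "plan": "paid"})
apply_change("DELETE", 1, None)

print(snapshot)      # {} -> the row is gone, matching the source exactly
print(len(history))  # 3 -> every change is preserved along with its optype
```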

Taken together, Striim makes it possible to build an intelligent and secure real-time data fabric across multi-cloud and hybrid cloud environments. Once data is unified in a central destination (e.g. a data warehouse), a data catalog solution can be used to organize and manage data assets.

Learn More: On-Demand Data Fabric Webinar

Looking for more examples and use cases of enterprise data patterns including data fabric, data mesh, and more? Watch our on-demand webinar with James Serra (Data Platform Architecture Lead at EY) on “Building a Multi-Cloud Data Fabric for Analytics”. Topics covered include:

  • Pros and cons of multi-cloud vs doubling down on a single cloud
  • Enterprise data patterns such as Data Fabric, Data Mesh, and The Modern Data Stack
  • Data ingestion and data transformation in a multi-cloud/hybrid cloud environment
  • Comparison of data warehouses (Snowflake, Synapse, Redshift, BigQuery) for real-time workloads

 
