John Kutay


How Striim’s Data-Streaming Capabilities Help Tackle These 4 Data Governance Challenges

Organizations today are building data architectures and infrastructures around real-time streaming data, making data governance more crucial than ever. When large volumes of data must be processed in near real-time, data in motion, in the form of streaming data, is an excellent option.

Governing data at rest (i.e., data stored in databases) wasn’t easy in the first place; now, organizations have to deal with a tougher challenge following the rise of data in motion (i.e., data that is moved between different sources and environments). As more and more data is streamed in real-time, managing streams spread across multiple sources and apps (e.g., databases and CRMs) takes the matter to a whole new level.

There is a gap in the data governance space. Organizations use data governance tools that are better suited to managing data at rest, even as the growing adoption of big data analytics makes managing data in motion more crucial than ever. Striim has prioritized closing this gap by introducing solutions to some of the most common data governance challenges.

Challenge #1: Lack of visibility into your data

Data security is one of the major data governance challenges. But you can only secure data that you’re aware of. That’s why every data governance team sets the same objective: identify and classify the data that exists within the enterprise.

It’s common for businesses to have complex data environments that are often all over the place. Developing a framework to find data sources on a continuous basis is a tough nut to crack, and keeping data categorized is equally challenging.

Solution: Striim enables data discovery

To discover and classify your data, you have to answer a few questions, such as:

  • Where is the data located?
  • Who can access the data?
  • How long will you keep the data?

Striim can help you answer these questions by bringing your streaming data into a centralized location (e.g., a data warehouse) where a data catalog solution can provide a bird’s-eye view of all the data within your organization. Furthermore, Striim allows you to enrich your data with reference data to make it more meaningful. For example, a B2B company may use a relational database to store order information. With a normalized schema, many of the data fields are in the form of IDs, e.g. the “Orders” table may have a column for “Sales Rep ID”. Striim can add valuable context to the “Orders” data by adding sales rep names and emails (from the “SalesRep” table) to the streaming data en route to a data warehouse.

Striim enriches the “Orders” stream with cached data from the “SalesRep” table (name, email)
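
In Striim, this kind of enrichment is typically a continuous query (CQ) that joins the incoming stream against a cache of reference data. Here is a rough sketch in the spirit of Striim’s SQL-based TQL; the object names are invented and the syntax is approximate, so consult the Striim docs for the exact form:

CREATE CQ EnrichOrders
INSERT INTO EnrichedOrderStream
SELECT o.OrderID, o.Amount, s.Name, s.Email
FROM OrderStream o, SalesRepCache s
WHERE o.SalesRepID = s.ID;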

Once the data is centralized, a data catalog arranges it into an easy-to-understand format so data consumers can find and use it readily. It also addresses multiple data governance issues. For example, your data catalog can connect siloed data, which helps fix data inconsistencies and improve data quality. Or, you can use it to control data for compliance.

You can use these catalogs to store metadata and integrate it with data collaboration, management, and search tools (e.g., Tableau, Elasticsearch). This way, your users can locate and utilize relevant data instantly, and each data role gets the context it needs. For instance, your data scientists can use a data catalog to find and understand a dataset, uncovering crucial insights such as market trends, correlations, hidden patterns, and customer preferences that help your business make informed decisions.

And if you’re using Confluent’s streaming platform, Striim offers a seamless integration with Confluent’s schema registry and serialization layer. This enables you to stream data (from databases and other sources) into Confluent and leverage their recently released Stream Governance features.

Challenge #2: Loose permissions for data access

Often, organizations grant extremely broad permissions to their data teams for data access. As a result, the lines between the responsibilities of data governance roles like chief data officer, data custodian, data steward, data trustee, data owner, and data user are blurred. This lack of access control can also make it difficult to minimize data privacy risks and maximize accountability.

Some organizations manage their data governance by tracking data access through access logs, but this presents its own set of challenges. That’s because each data technology comes with its own log system that stores varying information. You also need context to understand these logs, such as who accessed the data and what they did with it. This context is often stored in various tools that are incompatible with access logs. There’s a clear need for a better solution.

Solution: Striim offers role-based access control to enforce better control over data

Roles and responsibilities form the cornerstone of an effective data governance strategy. Data governance holds people accountable for performing the right set of actions at the right time. To do this, Striim can help with the definition and deployment of roles that are suited to the organizational structure and culture. This is done through role-based access control (RBAC), allowing you to control what your business users can do at both granular and broad levels.

For example, you can designate whether the user is a data custodian, data trustee, or data user and assign roles and data access permissions based on employees’ positions in your organization.

The main objective of RBAC is to provide a framework that lets organizations set and enforce access control policies for their data, which helps to streamline data governance. It grants permissions so that employees receive adequate access: enough to do their jobs, and no more.

With Striim, you can set roles and privileges to access all objects. An object in Striim can be many things, including sources, targets, streams, flows, and so on.

Your admins can define roles with different access levels and controls on objects, such as:

  • A group of users who can create and edit any type of object.
  • A group of users who can copy and read data from objects but aren’t allowed to edit them.

For example, you may have a connector that reads data containing PII (personally identifiable information). You can create a specific permission to read the objects that contain PII and assign it only to users with that degree of authorization.

In Striim, user permissions control which actions given users can take on different object types (for example, streams or sources).
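
As a rough sketch of the idea, here is what such a policy could look like in generic SQL-style syntax (this is illustrative only, not Striim’s exact console syntax; the role, user, and object names are invented):

-- A role that may only read the objects carrying PII
CREATE ROLE pii_reader;
GRANT SELECT ON OrdersPIIStream TO pii_reader;

-- A role that may also create and edit pipeline objects
CREATE ROLE pipeline_editor;
GRANT SELECT, INSERT, UPDATE ON OrdersPIIStream TO pipeline_editor;

-- Assign roles based on each employee's position
GRANT pii_reader TO alice;
GRANT pipeline_editor TO bob;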

Challenge #3: Teams share the same data implementation

A common data governance challenge arises when different departments use the same application for their data-related tasks. The data team configures a data infrastructure or system that several other departments rely on to collect accurate data, but a lack of communication can lead to one employee inadvertently breaking the system for everyone.

For instance, suppose multiple teams share the same website and data analytics infrastructure. The goal of the IT team will be to use the data from analytics to fix the website’s functionality and security. On the other hand, a marketing team will be crunching the numbers to find ways to improve customer experience. Unfortunately, the difference in these team goals can lead to ongoing blunders, such as double-tagging or interrupted customer journeys.

Solution: Striim uses apps and app groups to divide workloads

You can use apps and app groups in Striim to divide workloads between teams. Striim supports data orchestration: you can use Striim’s user interface and REST APIs to automate how data flows between your event tracking, data loader, modeling, and data integration tools.

Organizations can create a dedicated app for each business group to build a domain-specific view or transformation for analysis. That means that both your marketing and IT teams can have their own data workflows, empowering them to work more freely without one department affecting another.

For example, you can dedicate a group of Striim apps to collecting streaming data from your databases and transferring it to a data warehouse for analysis. Similarly, you can have a Striim app for data transformation that uses Python scripts to convert data from your sources into a standardized format.

A Striim app that streams data from MySQL (via Change Data Capture) to Kafka
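
Striim apps are defined in TQL, Striim’s SQL-like language. The sketch below mirrors the app in the figure above; the adapter and property names are approximations rather than verified syntax, so treat them as assumptions and check the Striim docs before use:

CREATE APPLICATION MySQLToKafka;

-- Read change events from MySQL via CDC (property names approximate)
CREATE SOURCE OrdersCDC USING MysqlReader (
  ConnectionURL: 'mysql://db-host:3306',
  Username: 'striim',
  Password: '********',
  Tables: 'sales.orders'
)
OUTPUT TO OrdersStream;

-- Deliver the change stream to a Kafka topic (property names approximate)
CREATE TARGET OrdersToKafka USING KafkaWriter (
  brokerAddress: 'kafka-host:9092',
  Topic: 'orders'
)
INPUT FROM OrdersStream;

END APPLICATION MySQLToKafka;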

Challenge #4: Lack of security for data in motion

Data in motion is exposed to a wide range of risks. Unlike data at rest, it travels both inside and outside the organization. Protecting it is also a legal matter: modern regulatory frameworks like HIPAA, GDPR, and PCI DSS mandate the protection of data in motion.

Solution: Striim offers advanced security features to protect data in motion

Striim protects data in motion with a number of security features, including:

  • Striim helps you to set an encryption policy to encrypt your data with encryption algorithms, such as RSA (Rivest–Shamir–Adleman), PGP (Pretty Good Privacy), and AES (Advanced Encryption Standard). This can especially come in handy in compliance-based industries (e.g., healthcare) and help protect data like PHI (protected health information) and PII (personally identifiable information).
  • Striim has multi-layered application security for exporting and importing data pipeline applications. For instance, during the import of these applications, you can set a passphrase for applications that contain passwords and other encrypted values. This way, you can incorporate an additional security layer into your application security.
  • Striim has a secure, centralized repository, Striim Vault, which can serve as a go-to tool for storing passwords and encryption keys. Striim’s vault also integrates seamlessly with third-party vaults such as HashiCorp Vault.

A reliable data governance program is key to addressing your data governance challenges

The value of data governance is often understated. A reliable data governance program increases people’s trust in your data analytics, business processes, and the systems that power your data-driven decision-making. It offers secure access, enabling IT to successfully oversee the management of data sources and analytical content and to meet policy, risk, and compliance requirements. Line-of-business employees can instantly locate the data they need and perform their jobs better.

Streamline your enterprise data governance framework with Striim. Learn more about how Striim can enhance your data governance initiatives by getting a technical demo.

 

What is a Data Engineer? A Brief Guide to Pursuing This High-Demand Career


Data engineer roles have gained significant popularity in recent years. This study by Dice shows that the number of data engineering job listings increased by 15% between Q1 2021 and Q2 2021, and is up 50% from 2019.

In addition to being an in-demand role, working as a data engineer can allow you to solve problems, experiment with large datasets, and understand patterns in our world. Students and professionals looking for a switch to a technology role should consider a career in data engineering.

To help you understand the requirements of a data engineer, we’ve compiled the roles and responsibilities of data engineers, the tools they use, and what you need to get started as a data engineer.

  1. What is a Data Engineer?
  2. Data Engineers vs Data Scientists vs Data Architects: What are the differences?
  3. What Tools do Data Engineers Use?
  4. What Skills do I Need to Learn to be a Data Engineer?
  5. Should I Pursue a Career in Data Engineering?

What is a Data Engineer: An Overview of the Responsibilities

Data engineers are responsible for designing, maintaining, and optimizing data infrastructure for data collection, management, transformation, and access. They are in charge of creating pipelines that convert raw data into usable formats for data scientists and other data consumers to utilize. The data engineer role evolved to handle the core data aspects of software engineering and data science; they use software engineering principles to develop algorithms that automate the data flow process. They also collaborate with data scientists to build machine learning and analytics infrastructure from testing to deployment.

Data engineers help organizations structure and access their data with the speed and scalability they need and provide the infrastructure to enable teams to deliver great insights and analytics from that data. Kevin Wylie, a data engineer with Netflix, says his work is about making the lives of data consumers easier and enabling these consumers to be more impactful.

Often, the format or structure that is optimal for storing an application’s data is far from optimal for data science, reporting, or analytics. For example, your application may need to serve one million concurrent requests for individual records, while your data science team needs to access billions of records at a time. The two scenarios call for different approaches, and this is where data engineers can help.

The primary responsibility of a data engineer is ensuring that data is readily available, secure, and accessible to stakeholders when they need it. Data engineering responsibilities can be grouped into two main categories:

Data structure and management

Data engineers are responsible for implementing and maintaining the underlying infrastructure and architecture for data generation, storage, and processing. Their responsibilities include:

  • Building and maintaining data infrastructure for optimal extraction, transformation, and loading of data from a wide variety of sources such as Amazon Web Services (AWS) and Google Cloud big data platforms.
  • Ensuring data accessibility at all times and implementing company data policies with respect to data privacy and confidentiality.
  • Improving data systems reliability, speed, and performance.
  • Creating optimal data warehouses, pipelines, and reporting systems to solve business problems.

Data analysis and insight

Data engineers play an important role in building platforms that enable data consumers to analyze and gain insights from data. They are responsible for:

  • Cleaning and wrangling data from primary and secondary sources into formats that can be easily utilized by data scientists and other data consumers.
  • Developing data tools and APIs for data analysis.
  • Deploying and monitoring machine learning algorithms and statistical methods in production environments.
  • Collaborating with engineering teams, data scientists, and other stakeholders to understand how data can be leveraged to meet business needs.

Although every organization has slightly different requirements, data engineering job listings from top tech companies’ career sites (such as Netflix and Google) and articles from job sites such as Indeed can provide more information on what data engineers are commonly responsible for in an organization.

Data Engineers vs. Data Scientists vs. Data Architects: What are the Differences?

From a thankful data scientist to data engineers. Original post here.

These roles vary significantly from company to company and often overlap since their work usually revolves around the same key component: data. Larger companies tend to have separate departments for these roles, and in smaller companies, it’s not uncommon to have one person acting as all three.

This table gives a brief overview of the differences between the three roles.

Data Architect: Data architects plan and design the framework the data engineers build. They create the organization’s logical and physical data assets, as well as the data management resources, and they set data policies based on company requirements.

Data Engineer: Data engineers are responsible for gathering, collecting, and processing data. They also build systems, algorithms, and APIs to expose datasets to data consumers.

Data Scientist: Data scientists are responsible for performing statistical analysis using machine learning and artificial intelligence on collated data in order to gain insight and form new hypotheses.

Unless a company has a large data/engineering team, it’s unlikely to have all three of these roles and will likely employ some combination of the above based on engineering, data, and business needs. Read more: For a deeper dive into how data architects and data engineers differ in responsibilities, skill sets, and career paths, see our comparison: Data Architect vs. Data Engineer.

What Tools Do Data Engineers Use?

There is no one-size-fits-all toolset for data engineers. Instead, each organization leverages tools based on business needs. However, below are some of the popular tools data engineers use. You don’t necessarily have to master all of them, but we recommend learning the fundamentals of each core tool.

Databases

In our fast-paced world where tools and technologies are constantly evolving, SQL remains central to it all and is a foundational tool for data engineers. SQL is the standard programming language for creating and managing relational database systems (a collection of tables that consist of rows and columns).

Popular SQL databases include MySQL, PostgreSQL, and Oracle. NoSQL databases, by contrast, are non-tabular and can take the form of a graph or a document, depending on their data model; MongoDB, Cassandra, and Redis are popular examples.
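
As a toy illustration of the kind of SQL a data engineer writes daily (the table and columns are invented for this example):

-- Define a table in a relational database
CREATE TABLE orders (
  order_id     INT PRIMARY KEY,
  customer_id  INT NOT NULL,
  total_amount DECIMAL(10, 2),
  created_at   TIMESTAMP
);

-- Query it: total spend per customer, highest first
SELECT customer_id, SUM(total_amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC;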

Data processing

Today’s businesses recognize the importance of processing data in real-time to enhance business decisions. As a result, data engineers are in charge of building real-time data streaming and processing pipelines. Apache Spark is an analytics engine used for real-time stream processing; Apache Kafka is a popular tool for building streaming pipelines and is used by more than 80% of Fortune 500 companies.

For example, Netflix uses Kafka to process over 500 billion events per day, ranging from user viewing activities to error logs.

Programming languages

Data engineers are typically fluent in at least one programming language to create software solutions to data challenges. Python is regarded as the most popular and widely used programming language in the data engineering community. It’s easy to learn and features a simple syntax and an abundance of third-party libraries geared toward data needs.

Data migration and integration

As more companies leverage cloud-based computing to meet business demands, migrating mission-critical applications can introduce several challenges, of which migrating the underlying database is often the most difficult. Data migration and integration refer to the processes involved in moving data from one system (or several) to another without compromising its integrity. Data integration, specifically, is the process of consolidating data from various sources and combining it in a meaningful and valuable way.

Striim is a popular real-time data integration platform used by data engineers for both data integration and migration; it provides modern, reliable data integration and migration across the public and private cloud.

Distributed systems

Because of the massive amount of data in circulation today, a single machine or system cannot meet data processing and storage requirements. A distributed system is a collection of machines that work together toward a common goal yet appear to the end user as a single system.

Hadoop is a popular data engineering framework for storing and computing large amounts of data using a network of computers.

Data science and machine learning

Data engineers need a basic understanding of popular data science tools because it enables them to better understand the needs of data scientists and other data consumers. PyTorch is an open-source machine learning library used for deep learning applications on GPUs and CPUs. TensorFlow is a free, open-source machine learning platform that provides tools for teams to create and deploy machine learning-powered applications.

What Skills Do I Need to Learn to be a Data Engineer?

Data engineering is a developing field that sits at the intersection of software engineering and data science. While there are no fixed steps to becoming a data engineer, that doesn’t mean you can’t do it.

Here are some of the necessary skills and knowledge you need to become a successful data engineer.

  • Understand databases (SQL and NoSQL): An essential skill for data engineers is learning how databases work and how to write queries to manipulate and retrieve data. This free database systems course by freeCodeCamp and Cornell University is an excellent resource to learn how database systems work.
  • Understand data processing techniques and tools: LinkedIn Learning provides fantastic resources to learn Apache Kafka – a popular tool for data processing.
  • Know a programming language: Knowing how to program is a must-have skill for data engineers. Programming languages such as Python and Scala are popular with data engineers. The complete Python Bootcamp on Udemy is a popular resource for getting started with Python.
  • Understand how distributed systems work: Designing Data-Intensive Applications is a great resource to understand the fundamental challenges companies face when designing large data applications.
  • Learn about cloud computing: With more companies relying on cloud providers for data infrastructure needs, learning how to design and engineer data solutions using popular cloud providers such as Amazon Web Services, Google Cloud, and Azure will help you stand out as a data engineer. Online courses, official tutorials, and certifications from cloud providers (like this one from Google Cloud) are excellent ways to learn cloud computing.

Many data engineers teach themselves through free and low-cost online learning programs. The Data Engineering Career Learning Path by Coursera and the Learn Data Engineering Academy provide practical resources to get you started. If you prefer a more degree-oriented approach, Udacity offers a specialized track dedicated to data engineering.

Should I Pursue a Career in Data Engineering?

Research from Domo estimates that humans generate about 2.5 quintillion bytes of data per day through social media, video sharing, and other means of communication. Furthermore, the World Economic Forum predicts that by 2025, the world will generate 463 exabytes of data per day, the equivalent of 212,765,957 DVDs per day. With the copious amount of data generated, there will be an increase in the demand for data engineers to manage it.


If you love experimenting with data and using it to discover patterns in technology, or you enjoy building systems that organize and process data to help companies make data-driven decisions, you might consider a career in data engineering. Further, data engineering is a lucrative field, with a median base salary of $102,472. While the work can be difficult and complex, and you may need to learn new skills and technologies, it is also a rewarding career in a growing field.

An Introduction to Database Migration Strategy and Best Practices

Database migration refers to transferring your data from one platform to another. An organization can opt for database migration for a multitude of reasons. For example, an organization might feel that a specific database (e.g., Oracle) has features that can offer more benefits than their existing database (e.g., MySQL). Or, they might want to cut costs by moving their on-premises legacy system to the cloud (e.g., Amazon RDS).

Having said that, moving data from one place to another isn’t a simple endeavor, and the stats back that up. According to Gartner, 50% of all data migration projects exceed their predetermined budgets and affect the overall business negatively. The lack of an adequate database migration strategy and flawed execution are most often the culprits. That’s because database migration projects:

  • Involve a greater degree of complexity than other IT projects 
  • Often involve databases that host mission-critical applications (which requires careful coordination of downtime, as well as data loss prevention measures)
  • Tend to take a great deal of time and effort (from manual schema changes to post-migration database validations)
  • Involve several systems, technologies, and IT teams to work properly

Simply put, an effective database migration strategy can prevent companies from blowing budgets and deadlines.

Why a Database Migration Strategy Is Key to Your Organization’s Success

A database migration strategy is a plan that facilitates your data transfer from one platform to another. There is a wide range of complexities that go into the data migration process. It’s much more than simply copying and pasting data. Such a plan takes certain factors into account, such as a data audit, data cleanup, data maintenance, protection, and governance.

A well-defined database migration strategy can reduce the business impact of database migration. It helps the data migration team avoid a poor experience that generates more issues than it solves. A subpar strategy can cause your team to miss deadlines, exceed budgets, and ultimately cause the entire project to fail. According to one study, database migration can lead to more than $250,000 in cost overruns.

Legacy data doesn’t always align with the new system, and bringing unnecessary data into that system wastes resources. A database migration strategy addresses these issues by identifying the core data requirements and guiding you toward the right decisions.

3 Types of Database Migration Strategies

There are three main approaches to database migration: big bang data migration, trickle data migration, and zero downtime migration.

1. Big Bang Database Migration

A big bang migration transfers all data from one source system to a target database in a single operation at a single point in time. Often it’s performed during a weekend or a scheduled downtime period.

The benefit of this strategy is its simplicity, as everything occurs in a time-boxed event. The tradeoff is downtime. This can be undesirable for organizations that run their systems 24/7.

2. Trickle Database Migration

A trickle migration follows an agile-style approach to database migration. It breaks the migration down into small sub-migrations, each with its own scope, deadlines, and goals. This makes it easier for the database migration team to confirm the success of each phase, and if any sub-process falters, only that process needs to be re-worked. The lessons from each failure can then be applied to improve subsequent runs, which is one reason this approach is less prone to unexpected failures.

The drawback is that trickle database migration takes more time. Since you have to run two systems simultaneously, it consumes more resources and effort.

3. Zero-Downtime Database Migration

A zero-downtime migration replicates data from the source database to the target database. It allows the client to access and operate on the source database while the migration is in process.

Benefits include less business disruption, faster migration, and minimal cost, especially when considering business impact and all-hands migration efforts.

Database Migration Best Practices

Sticking to the best practices can increase the likelihood of successful database migration. Some of the practices followed by a well-planned database migration strategy include:

Set Up the Database Migration Project Scope

First, set the parameters (e.g., object types, source objects in scope, connection parameters) of your database migration project. Like other IT initiatives, this process is prone to scope creep.

According to a published study, specifications change in 90% of data migration projects. In addition, 25% of such projects tackle more than one specification change. Therefore, it’s better to start small. For instance, if you have multiple databases, then move data from only one of them. Once you succeed with this single migration, you can extend the project scope.

Analyze Your Current Data

Database migration projects deal with plenty of factors, such as:

  • The type of the data
  • The size of the data
  • The operating systems
  • The source and target systems
  • The database platform

Before you initiate the migration process, you have to determine how much data you need to move. For example, you might have records that are no longer required and better left behind. Or you might think about compatibility issues, such as when moving data from a relational database (e.g., Oracle) to a non-relational database (e.g., MongoDB).

Communicate the Process

Multiple teams need to give their input to the data migration process. Communicating the entire data migration process to them is vital. They should know what they’re expected to do. For that, you have to assign responsibilities and tasks.

Set a list of deliverables and tasks and assign roles to activities. Some of the questions you need to answer include:

  • Who is the chief decision-maker of the migration process?
  • Who has the authority to determine whether the migration was successful?
  • After database migration, who will validate data?

Lack of a proper division of tasks and responsibilities can cause organizational chaos and your project to fail.

Strengthen Your Database Migration Strategy with StreamShift

Part of building a database migration strategy will include making sure your team has the right tools in place. Using a tool like StreamShift helps to ensure a successful data migration.


StreamShift is a fully managed SaaS database migration tool that can simplify database migration from on-premise or cloud databases to your desired target databases.

With StreamShift, you can fulfill the core requirements needed for zero downtime database migration. All you need to do is choose source and target connections and provide the credentials. After that, StreamShift will handle everything, including creating initial schemas, bulk loading historical data, and performing continuous synchronization between the source and target.

StreamShift is run by a team with years of deep data management and migration experience working with global enterprise customers. That experience shows in a user-friendly interface that addresses common and complex database migration pain points alike.

Schedule a demo and we’ll give you a personalized walkthrough, or try Striim at production scale for free! Working with small data volumes, or hoping to get hands-on quickly? Striim also offers a free developer version.

Oracle Database Replication Methods (and their Pros and Cons)

Modern computing demands have forced businesses to ensure high availability and data accessibility while tackling various networks simultaneously. That’s why it’s crucial to get 24/7 and real-time access to key business data from your database—and for many businesses, that database is Oracle. Whether a business is looking to process millions of transactions or build a data warehouse, Oracle is the go-to option for handling critical enterprise workloads. 

In today’s digital age, organizations must scale up their systems and build an ecosystem that helps them seamlessly access data from their Oracle database to improve the efficiency of their operations. To achieve this, they can use database replication, a technique that makes data accessible across a wide range of systems (e.g., servers and sites). Because it allows real-time data access, database replication maintains high data availability and addresses a major concern for enterprises.

Why Replicate an Oracle Database?

Replicating an Oracle database makes it easy to distribute, share, and consolidate data. With replication, businesses can synchronize data across different locations, share data with vendors and partners, and aggregate data from their branches — both international and local.

Companies use Oracle database replication to create multiple copies of an organization’s database. These synchronized copies pave the way for distributed data processing, backups for disaster recovery, testing, and business reporting.

Some benefits of Oracle database replication include the following:

  • Enhance application availability. Since database replication copies data to several machines, it’s easier to maintain access to your application’s data. Even if one of your machines is compromised due to a malware attack, faulty hardware, or another issue, your application data will remain available 24/7.
  • Enhance server performance. It’s a common practice in database replication to direct data read operations to a replica. This allows system administrators to minimize processing cycles on the primary server and prioritize it for write operations.
  • Enhance network performance. Maintaining multiple copies of the same data is convenient for minimizing data access latency. That’s because you can fetch the relevant data from the location where the transaction is being executed.

For instance, users from Europe might face latency problems while trying to access Australian-based data centers. You can address this challenge via Oracle database replication, so a replica of your data is placed closer to the user.

A common example of Oracle database replication can be found in money transfers and ATM withdrawals. For example, if you withdraw $150 from an ATM, the transaction will be immediately copied to each of your bank’s servers. As a result, your information will be updated instantaneously in all branches to display $150 less in your account.

Similarly, an e-commerce website that uses an Oracle database has to ensure visitors from different countries can view the same product information at each site. Database replication helps them to achieve this goal by copying their product details for each site.

4 Ways to Replicate An Oracle Database

Choosing the right approach to replicating your Oracle database depends on several factors, including the goal of your replication, the size of the database, how the performance of the source systems is affected, and whether you need synchronous or asynchronous replication.

Here are some of the common ways to replicate an Oracle database.

1. Full Dump and Load

In this approach, you start by choosing a table you want to replicate. Next, you define a replication interval (e.g., 4, 8, or 12 hours) per your requirements. At each interval, the replicated table is queried and a snapshot is generated. This snapshot (also known as a dump) replaces the previous snapshot.

This approach is effective for small tables (usually less than 100 million rows). However, once the table grows in size, you’ll need a more scalable replication strategy, because performing the full dump takes a considerable amount of time.
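
In SQL terms, a dump-and-load cycle can be as simple as the following sketch (Oracle-style syntax; the table names are invented for illustration):

-- Take a fresh snapshot of the source table
CREATE TABLE orders_snapshot_new AS SELECT * FROM orders;

-- Swap it in for the previous snapshot
DROP TABLE orders_snapshot;
ALTER TABLE orders_snapshot_new RENAME TO orders_snapshot;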

2. Incremental Approach (Table Differencing)

Table differencing is an approach in which a copy of the source table is periodically compared to an older version of the table and the differences are extracted. 

For example, use the following command to get the difference between the two tables, named new_version and old_version.

SELECT * FROM new_version
MINUS SELECT * FROM old_version;

This command gives you the inserts and updates present in the new_version of the table. However, when the time comes to load data into the target database, make sure to replace the table’s old version with the new version (so you can compare it against a future new version of the table).
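
Note that the query above won’t surface deletes. Rows that were removed from the table show up only when you flip the operands:

SELECT * FROM old_version
MINUS SELECT * FROM new_version;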

The incremental approach provides an accurate view of changed data while only using native SQL scripts. However, this method can lead to high computation and transport costs. In addition, it isn’t ideal for restoring data. For instance, if you want to go through the files that were backed up incrementally on Wednesday, you’ll first have to restore the full backup from Tuesday.

3. Trigger-Based Approach

This approach depends on triggers — a function you can set up to execute automatically whenever a data change occurs in your database system.

For example, you can set a trigger that inserts a record into another table (the “change table”) whenever the source table changes. You can then replicate your data from the Oracle database to another database.
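
Here is a minimal sketch of such a trigger in Oracle PL/SQL (the table, change table, and columns are invented for illustration):

CREATE OR REPLACE TRIGGER trg_orders_capture
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  -- Record each change in a separate change table for replication
  INSERT INTO orders_changes (order_id, change_type, changed_at)
  VALUES (
    COALESCE(:NEW.order_id, :OLD.order_id),
    CASE WHEN INSERTING THEN 'I' WHEN UPDATING THEN 'U' ELSE 'D' END,
    SYSTIMESTAMP
  );
END;
/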

Oracle comes with stored procedures for setting up triggers and monitoring the source table for updates. Triggers help achieve synchronous replication. However, this approach can affect the performance of the source database, because each trigger adds work to the transaction that fires it and delays the commit.

4. Change Data Capture


Change data capture (CDC) is a software process that identifies the rows in source tables that have been modified since the last replication. This makes it more efficient and faster than other methods, especially those that copy entire tables at every replication cycle and replicate even the rows that weren’t changed.

CDC replicates create-update-delete (CUD) operations, written in SQL via the following commands: INSERT, UPDATE, and DELETE.

Here’s why Oracle CDC is a better approach for replicating your Oracle database:

  • Since CDC only works with the rows that have changed, it sends less data from the source to the replica, putting a minimal load on the network.
  • Proper CDC implementation ensures that replication operations don’t affect your production database. In this way, you can free up resources for transactions.
  • With CDC, you can achieve real-time data integration, helping you build streaming analytics.

For example, consider HomeServe, a leading home assistance provider. HomeServe wanted to send detailed reports to its insurer partners that could give an in-depth overview of water leaks. To do this, they needed a technology that could help them move the operational data to Google BigQuery without impacting their database.

Ultimately, they went with Striim’s enterprise-grade CDC to go through binary logs, JSON columns, and other sources. This allowed them to move all the changes from the transactional database. More importantly, CDC ensured no overhead was caused on the source system and performance remained unaffected. Learn more about it here.

Oracle GoldenGate is another tool that can replicate data from one Oracle database to another by using CDC. GoldenGate can be useful for a broad array of use cases. These include multi-cloud ingestion, data lake ingestion, high availability (peer-to-peer, unidirectional, bi-directional, etc.), and online transactional processing (OLTP) data replication.

Although Oracle GoldenGate is more convenient than the above methods, configuring it requires assistance from an Oracle administrator.

Simplify Oracle Database Replication with Striim


CDC is the best approach for replicating Oracle databases in many scenarios, but which tool should you choose to implement it? Consider taking a look at Striim.

Using Striim’s Oracle CDC reader, you can turn your Oracle database into a streaming source and migrate your critical transactional data to cloud environments or real-time applications.

Striim’s log-based CDC method fetches insert, update, and delete operations from the Oracle redo logs. Even better, Striim’s CDC reader can buffer large transactions to disk with minimal performance overhead. Throughout this process, the source systems remain unaffected.

Another thing that makes Striim stand out is that, unlike other tools, it can work without LogMiner’s continuous mining feature, which was deprecated in Oracle 19c. Additionally, Striim can read from GoldenGate trail files.

Schedule a demo and we’ll give you a personalized walkthrough, or try Striim at production scale for free! Working with small data volumes, or hoping to get hands-on quickly? Striim also offers a free developer version.

 

MySQL Replication: Your Guide to Getting Started

Here’s the thing: getting a 500 Server Error is never good news for a business, regardless of its source. Granted, the error could stem from any number of issues, but what happens when a problem with your database is the cause? Would you panic because you don’t have a backup, or stay calm, assured that your replica is up and running?

This is just one of the uses of MySQL replication. Replication is the process of copying data from one database (source) to another (replica). The data copied can be part or all of the information stored in the source database. MySQL database replication is carried out according to the business needs — and if you’re considering having a replica, here’s what you need to know.

Why Replication?

Replication may seem expensive from afar, and if you count the time and effort expended, it usually is. Take a closer look, though, and you’ll see that in the long run it delivers great value to your business, saving you time, and more money, in the future.

Most of the benefits of replication revolve around the availability of the database. However, it also ensures the durability and integrity of the data stored.

Replication Improves the Server’s Performance

HTTP requests to the server ultimately become either read (SELECT) or write (INSERT, UPDATE, DELETE) queries. An application may issue a large volume of these queries per minute, which can slow down the server. With replicas, you can distribute this load across your databases.

For example, if the bulk of your requests are read queries, you can have the source server handle all write operations on the server, and then the replicas can be read-only. As a result, your server becomes more efficient as this load is spread across the databases. Doing this also helps avoid server overload.

Replication Allows for Easier Scalability

As a business grows, there is often a heavier load on the server. You can tackle this extra load by either scaling vertically by getting a bigger database or scaling horizontally by distributing your database’s workload among its replicas. However, scaling vertically is often a lengthy process because it takes a while — from several hours to a few months — to fully migrate your data from one database to another.

When scaling horizontally, you spread the load without worrying about reaching the load limit of any one database — at least for a while. You also do not need to pause or restart your server to accommodate more load as you would have done when moving your data to another database.

Replication Improves Application Availability (Backup)

Without replicas, you might have to do backups on tape, and in the event of a failover, it could take hours before the system is restored. And when it’s fully restored, it’ll only contain data from the last backup; the latest data snapshot is often lost. Having replicas means you can easily switch between databases when one fails without shutting down the server.

You can automate the switch so that once a source stops responding, queries are redirected to a replica in the shortest time possible. Then, when the source is back online, it can quickly catch up on the updates made to the replica.

Replication Provides an Easy Way to Pull Analytics

Pulling analytics on data stored in the database is a common procedure, and replication provides a hassle-free way to do this. You can draw insights for analytics from replicas, so you avoid overloading the source. Doing this also helps preserve the data integrity of the source — the information remains untouched, preventing any tampering with the data (whether by accident or otherwise).

Meanwhile, the source can always sync up with the replica to get the latest data snapshot as needed.

Whatever the size of your database, MySQL replication is beneficial — especially if you intend to grow your business. First, figure out if and how replication affects your business and go from there to decide how many replicas to have and which replication method to use.

Use Cases For MySQL Replication

MySQL replication allows you to make exact copies of your database and update them in near real-time, as soon as updates to your source database are made. There are several cases where MySQL replication is helpful, and a few of them include:

Backups

Data backups are usually performed on a replica so as not to affect the uptime and performance of the source. These backups can take three forms: using mysqldump, copying raw data files, or backing up the source while it’s read-only. For small- to medium-sized databases, mysqldump is most suitable.

Using mysqldump involves three steps:

  1. Stop the replica from processing queries using mysqladmin stop-replica
  2. Dump selected or all databases by running mysqldump
  3. Restart the replication process once the dump is completed. mysqladmin start-replica does that for you.
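
Expressed as the SQL statements behind those mysqladmin commands (MySQL 8.0.22+ replica syntax; the database name is invented), the flow looks roughly like this:

-- 1. On the replica, pause replication
STOP REPLICA;

-- 2. From the shell, dump the selected database(s), e.g.:
--    mysqldump --databases shop > shop_backup.sql

-- 3. Back on the replica, resume replication
START REPLICA;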

Since information backed up using mysqldump takes the form of SQL statements, this method is impractical for larger databases. Instead, raw data files are backed up for databases with large data sets.

Scaling Out

Using replication to scale out is best for servers with more reads and fewer writes. An example is a server for e-commerce where more users look through your product catalog than users adding to your catalog. You’re then able to distribute the read load from the source to the replicas.

To make this work, have an integrated platform that manages read and write connections to the servers and routes each request to the right one; that platform could be a library that carries out these functions for you. A typical scale-out layout sends all writes to the source and distributes reads across the replicas.

Switching Sources

MySQL allows you to switch your source database when need be (e.g., during failover) using the CHANGE REPLICATION SOURCE TO statement. For example, say you have a source and three replicas, and the source fails for some reason. You can direct all read/write requests to one of the replicas and have the other two replicate from your new source. Remember to run STOP REPLICA and RESET MASTER on the new source.

Once the initial source is back, it can replicate from the active source using the same CHANGE REPLICATION SOURCE TO statement. To promote the initial source back to being the source, follow the same switchover procedure described above.
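
A failover sketch using those statements might look like this (MySQL 8.0.23+ syntax; the hosts, credentials, and binary log coordinates are invented placeholders):

-- On the replica being promoted to the new source:
STOP REPLICA;
RESET MASTER;

-- On each remaining replica, point replication at the new source:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'new-source-host',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = '********',
  SOURCE_LOG_FILE = 'binlog.000001',
  SOURCE_LOG_POS = 4;
START REPLICA;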

Whatever your reason for adopting MySQL replication, there’s extensive documentation on MySQL statements that can help you achieve your replication goal.

The Common Types of MySQL Replication

You can set up MySQL replication in a number of ways, depending on many factors. These factors include the type of data, the quantity of the data, the location, and the type of machines involved in the replication.

To help you determine the right type of MySQL replication for your needs, let’s review the most common MySQL replication types:

Snapshot Replication

As the name implies, snapshot replication involves taking a picture-like replica of a database. It makes an exact replication of the database at the time of the data snapshot. Because it’s a snapshot, the replica database does not track updates to the source. Instead, you take snapshots of the source at intervals with the updates included.

Snapshot replication is simple to set up and easy to maintain and is most useful for data backup and recovery. And the good news is that it’s included by default in a lot of database services.

Merge Replication


The word “merge” means to combine or unite to form a single entity. That’s the basis for merge replication — it unites data from the source database with the replica. Merge replication is bidirectional: changes to the source are synchronized with the replicas and vice versa.

You can use merge replication when there are multiple replicas working autonomously, and the data on each database needs to be synchronized at certain operational levels of the server. It is mostly used in server-client environments. 

An example of a merge replication use case would be a retail company that maintains a warehouse and tracks the stock levels in a central (source) database. Stores in different locations have access to replica databases, where they can make changes based on products sold (or returned). These changes can be synchronized with the source database. In turn, changes to the source database are synchronized with the replicas, so that everyone has up-to-date inventory information.

Transactional Replication


Transactional replication begins with a snapshot of the source database. Subsequent changes to the source are replicated in the replica as they occur in near real-time and in the same order as the source. Thus, updates to the replica usually happen as a response to a change in the source.

Replicas in transactional replication are typically read-only, although they can also be updated. Therefore, transactional replication is mainly found in server-server environments, for instance, multi-region/zone databases.

There are a few other MySQL replication types, but one feature is common to all, including those discussed above. Each replication typically starts with an initial sync with the source database. After that, they branch out to their respective processes and use cases.

The Power of Leveraging Your MySQL Replicas

We’ve established the importance and advantages of replicating your databases, but keeping up with changes or corruption can be a struggle for many production teams. On top of that, you need to know how your data is ingested and delivered and, more importantly, ensure that this process is continuous.

With Striim, you can easily see and understand the state of your data workflow at any given time. Whether your server is on-premise, cloud-based or hybrid, you can still create data flows and have control over how your data is gathered, processed, and delivered. We’re ready when you are.

 

Tutorial: Migrate and Replicate Data from MySQL to Snowflake Using Striim

In this article, we’ll go over how to use Striim to migrate and replicate data from an existing MySQL database to Snowflake.

What is Striim?

Striim is a next-generation cloud data integration product that offers change data capture (CDC), enabling continuous replication from popular databases (such as MySQL, Oracle, SQL Server, and PostgreSQL) to data warehouses, databases, messaging systems, and more.

In addition to CDC connectors, Striim has hundreds of automated adapters for file-based data (logs, XML, CSV), IoT data (OPC UA, MQTT), and applications such as Salesforce and SAP. Our SQL-based stream processing engine makes it easy to enrich and normalize data before it’s written to Snowflake.

In this tutorial, we’ll show you how to use Striim to migrate schemas and data from an existing MySQL database into Snowflake.

This tutorial uses the Sakila sample database to illustrate the steps.

Tutorial: Migrate and Replicate Data from MySQL to Snowflake

Prerequisites

The following are required to complete this tutorial:

  • Striim Account. You can sign up for a free trial account.
  • Snowflake Account. You can sign up for a free 30-day trial account to get started.
  • MySQL database that meets the following requirements:
    • Accessible over the Internet.
    • SHA256 password authentication enabled.
    • A user with full privileges to the database that Striim will authenticate with. Construct the name by concatenating the user’s name, the @ symbol, and Striim’s IP address (34.66.168.224). For example: john@34.66.168.224.

Note: Ensure you have the database’s name, hostname, port, username, and password as they will be required in the steps below to connect to your database.

  • At least one table in the MySQL database
  • (Optional) Sakila sample database.

Prepare a Snowflake Database and Launch Striim

Before migrating your data from MySQL, you must first create a database within Snowflake to store the migrated data. After that database has been created, you can launch Striim as a Snowflake partner service directly from within Snowflake.

Follow the steps below to prepare a database and launch Striim in Snowflake:

  • Launch Snowflake in a web browser.
  • Click on Databases > Create:

Create Snowflake database

  • Enter a unique name for the database and click Finish:

Name database

  • Click on Partner Connect in the top right corner of the navigation bar.
  • Locate and click on Striim in the list of Snowflake partners. Note: you may need to first switch your user role to ACCOUNTADMIN in order to launch Striim from Snowflake:

Locate Striim Partner Connect

  • Activate the partner account if the account has not been previously activated:

Ready to activate partner account

  • Confirm that the database you created above is listed in Database(s) with USAGE privilege granted and click Connect.
  • Note: Once activation has been completed for the first time, subsequent launches will simply prompt you to launch:

Account previously created

Create a Striim App

In Striim, an app performs the migration from MySQL to a Snowflake database.

Follow the steps below to create a new Striim App:

  • Ensure you’re on the App screen in Striim and click Create App:

Create Striim app

  • Locate the Build using Wizards section on the Create a New App screen and select MySQL Database to Snowflake:

MySQL to snowflake wizard

  • Enter a unique name for your app and click Save:

Name Striim app

  • The data migration wizard is displayed:

Migration wizard displayed

Prepare for Data Migration to Snowflake

In this section you will configure your app to access your source MySQL database. As you proceed through Striim’s data migration wizard, Striim will validate that it can access and fetch the metadata and data of your source MySQL database.

  • Enter the host/port for your MySQL database and the login credentials for the Striim user who has been granted full privileges to your MySQL database:

Enter host port

  • Striim will verify that it can connect to your database and obtain metadata:

Verify connection

  • Click Next to advance to the Select Databases to move screen.
  • Select the databases to migrate from your MySQL database to Snowflake and click Next:

Select database to move

  • Striim will fetch and validate metadata for each table in your database:

Validate and fetch

  • Click Next to advance to the Select Tables screen. Navigate through each schema on the left-hand side, and select the table(s) to migrate:

Select tables to migrate

  • Click Next to complete the wizard. The target creation screen is displayed:

Target creation screen 1

Prepare Your Target and Migrate Your Data to Snowflake

Now that Striim can read from your source MySQL database, you must configure Striim to write to your target Snowflake database.

Follow the steps below to prepare a Snowflake target and start the migration process:

  • Enter a unique name for the target in the Target Name field on the Create Snowflake Target(s) screen.
  • Ensure Input From is set to the stream you created using the steps in the previous sections. Note that the name will be in the form of <your application name> + _OutputStream.
  • Prepare the URL of the target Snowflake database: copy the following URL into the Connection URL field and replace YOUR_HOST with the base host domain assigned by Snowflake to your account, and YOUR_DATABASE with the name of your database:

jdbc:snowflake://YOUR_HOST.snowflakecomputing.com/?db=YOUR_DATABASE_NAME&schema=public

For example, the following URL has a base host of xr86987.ca-central-1.aws and the database name set to SAKILA:

jdbc:snowflake://xr86987.ca-central-1.aws.snowflakecomputing.com/?db=SAKILA&schema=public

  • Enter your credentials corresponding to your Snowflake account into the Username and Password fields.
  • (Optional) Modify which tables to migrate by configuring the table name(s) listed in the Tables field. By default, the tables listed will be based on those specified in the steps from the previous section and include the % as a wildcard character:

Create target 2

Click Next. Striim will recreate the schema(s) in your Snowflake database:

Create Schema

Click Next after target creation is complete. Striim will begin migrating your data to Snowflake and will provide a detailed Application Progress popup showing how the migration is progressing:

monitor data flow

The amount of time required for migration will vary depending on how much data is being migrated from your source MySQL database and on your internet connection speed.

Conclusion: Start Your Free Trial

Our tutorial showed you how easy it is to migrate data from MySQL to Snowflake, a leading cloud data warehouse. Once your data has been migrated, Striim enables continuous, real-time updates via change data capture (CDC). For source databases where changes continue to be made during the migration, this allows zero-downtime, zero-data-loss migrations to Snowflake.

As always, feel free to reach out to our integration experts to schedule a demo, or try Striim for free here.

Data Transformation 101: The What, Why, and How

An estimated ninety percent of the data generated daily is unstructured and, on its own, almost meaningless to organizations. The continuous increase in the volume of data is a potential gold mine for businesses. Like gold, data has to be carefully mined and refined, or in this case, extracted and transformed, to get the most value.

Businesses collect tremendous volumes of data from various sources to make better decisions. But raw data can be complex, unwieldy, and almost meaningless to the decision-makers in an organization. By transforming data, businesses can fully maximize the value of their data and use it to make more in-depth strategic decisions.

In this post, we’ll share an overview of data transformation, how it works, its benefits and challenges, and different data transformation approaches:

  1. What is Data Transformation?
  2. Why Transform Data?
  3. How Data Transformation Works
  4. Data Transformation Methods
  5. Data Transformation Use Cases
  6. Data Transformation Challenges
  7. Data Transformation Best Practices
  8. Use Data Transformation Tools Over Custom Coding

What is Data Transformation?

Data transformation is the process of converting data from a complex form to a more straightforward, usable format. Depending on the desired result, it can involve actions such as cleaning data, changing data types, deleting duplicate records, integrating data, and replicating data.

Data transformation is an integral part of any data pipeline, including ETL (Extract, Transform, and Load) and ELT (Extract, Load, Transform) pipelines. ETL involves extracting data from multiple sources, transforming it into a more usable structure, and loading it into a data warehouse. In contrast, ELT shifts the bulk of the transformations to the destination data warehouse.

ETL vs ELT Infographic

Data transformation is essential to standardize data quality across an organization and create consistency in the type of data shared between systems in the organization.

Why Transform Data? (Benefits)

With data transformation, businesses can increase efficiencies, make better decisions, generate more revenue, and gain many other benefits, including:

  • Higher data quality: Businesses are very concerned with data quality because it is crucial for making accurate decisions. Data transformation activities like removing duplicate data and deleting null values can reduce inconsistencies and improve data quality.
  • Improved data management: Organizations rely on data transformation to handle the tremendous amounts of data generated from emerging technologies and new applications. By transforming data, organizations can simplify their data management and reduce the dreaded feeling of information overload.
  • Seamless data integration: It is normal for a business to run on more than one technology or software system. Some of these systems need to transfer data between one another. With data transformation, the data sent can be converted into a usable format for the receiving system, making data integration a seamless process.
  • Obfuscated sensitive data: To abide by GDPR, HIPAA, and other regulations, companies need to be able to mask or remove sensitive information, such as PII or credit card numbers, when transferring data from one system to another (see the figure and sketch below).
masking data with Striim
An example showing how Striim can be used to mask credit card numbers during data transfer from one system to another.
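
To make the idea concrete, here is a minimal masking sketch in PostgreSQL-flavored SQL. The table and column names (payments_raw, card_number, and so on) are illustrative assumptions, not Striim’s actual syntax:

-- A minimal masking sketch (illustrative names; not Striim's own syntax):
-- keep only the last four digits of the card number in the outbound copy.
CREATE TABLE payments_masked AS
SELECT
    payment_id,
    customer_id,
    'xxxx-xxxx-xxxx-' || RIGHT(card_number, 4) AS card_number,  -- masked value
    amount,
    paid_at
FROM payments_raw;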

How Data Transformation Works

When data is extracted from the source, it is raw and nearly impossible to use. The data transformation process involves identifying the data, structuring it, and generating a workflow that can be executed to convert the data. Sometimes the data must be cleaned first to make identification and mapping easier.

The steps to transform raw data into an intelligent format include:

  1. Identify the data: This is the data discovery phase, and it involves identifying and understanding the data in its source/extracted format. Data discovery is usually best accomplished with the help of a data profiling tool. By the end of this phase, you have an idea of what needs to be done to get the data into the desired format.
  2. Structure the data: This is the data mapping stage, where the actual transformation process is planned. Here, you define how the fields in the data are connected and the type of transformation they will need. This stage is also where you consider whether any data would be lost in the transformation process. For example, if we have a simple Excel spreadsheet with a date column in an incorrect format, we would create a ‘mapping’ to determine the type of transformation needed to change the date to a correct format.
  3. Generate a workflow: For transformation to occur, a workflow or code needs to be generated. You can write custom code or use a data transformation tool. Python and R are the most common languages for custom code, but it can be done in almost any language, including an Excel macro, depending on the transformation needs (a minimal sketch follows this list). When developing a transformation workflow, consider factors like scalability (will the transformation needs change over time?) and usability (will other people need this workflow?).
  4. Execute the workflow: Here, data is restructured and converted to the desired format.
  5. Verify the data: After transformation, it is best to check if the output is in the expected format. If it isn’t, review the generated workflow, make necessary changes, and try again.
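
As a simple illustration of steps 3 and 4, here is a minimal workflow sketch in PostgreSQL-flavored SQL that applies the date-format mapping from step 2. The table and column names are illustrative assumptions:

-- Convert a text date column to a proper DATE type (illustrative names).
CREATE TABLE orders_transformed AS
SELECT
    order_id,
    TO_DATE(order_date_text, 'MM/DD/YYYY') AS order_date  -- corrected format
FROM orders_raw
WHERE order_date_text IS NOT NULL;  -- rows that cannot be mapped are excluded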

Data Transformation Methods

Organizations can transform data in different ways to better understand their operations. Data aggregation, data filtering, and data integration, for example, are all forms of data transformation, and each can be carried out using any of the following approaches:

Transformation with scripting

This type of transformation involves manually coding the data transformation process from start to finish in Python, R, SQL, or any other language. It is an excellent approach for customization, but it often results in unintentional errors, as developers can misinterpret the exact requirements when building custom-coded solutions.

Transformation with on-site ETL tools

These tools work through on-site servers to extract, transform, and load information into an on-site data warehouse. However, an on-site transformation solution can be expensive to set up and manage as data volume increases, so many companies working with big data have moved to more advanced cloud-based ETL tools.

Transformation with cloud-based ETL tools

Cloud-based ETL tools have simplified the process of data transformation. Instead of working on an on-site server, they work through the cloud. In addition, these tools make it easier to link cloud-based platforms with any cloud-based data warehouse.

Transformations within a data warehouse

Instead of ETL (where data is transformed before it’s loaded into a destination), ELT shifts the transformations to the data warehouse. dbt (data build tool) is a popular development framework that empowers data analysts to transform data using familiar SQL statements in their data warehouse of choice, as the sketch below illustrates.
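
As a rough sketch of what this looks like in practice, a dbt model is simply a SELECT statement saved as a .sql file, which dbt materializes as a table or view in your warehouse. The source and column names below are illustrative assumptions:

-- models/stg_orders.sql: a hypothetical dbt staging model (illustrative names)
SELECT
    order_id,
    CAST(order_total AS NUMERIC(10, 2))  AS order_total,    -- normalize the type
    LOWER(customer_email)                AS customer_email,  -- standardize casing
    CAST(ordered_at AS TIMESTAMP)        AS ordered_at
FROM {{ ref('raw_orders') }}
WHERE order_id IS NOT NULL  -- drop records that would break downstream joins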

Data Transformation Use Cases

Data Preparation and Cleaning for ELT Pipelines

In ELT, users load data into the data warehouse in its raw form and apply transformations in the warehouse layer using stored procedures and tools like dbt. However, the raw data can have many issues that prevent it from being actionable.

Data transformation can address issues like the following (a SQL sketch follows the list):

  • The raw data may have lookup IDs instead of human-readable names, for example like this:

    User ID     | Page Viewed    | Timestamp
    12309841234 | Homepage       | 3:00 pm UTC, July 30, 2021
    12309841235 | Request a demo | 4:00 pm UTC, July 30, 2021

    Instead of this:

    Username    | Page Viewed    | Timestamp
    Jane Strome | Homepage       | 3:00 pm UTC, July 30, 2021
    John Smith  | Request a demo | 4:00 pm UTC, July 30, 2021
  • De-duplication of redundant records
  • Removing null or invalid records that would break reporting queries
  • Transforming your data to fit machine learning training models
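
For the de-duplication and null-record issues above, a minimal cleanup sketch in PostgreSQL-flavored SQL might look like the following (table and column names are illustrative assumptions):

-- De-duplicate records and remove rows that would break reporting queries.
CREATE TABLE page_views_clean AS
SELECT DISTINCT
    user_id,
    page_viewed,
    viewed_at
FROM page_views_raw
WHERE user_id IS NOT NULL    -- invalid records break joins to user tables
  AND viewed_at IS NOT NULL; -- null timestamps break time-series reports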

Business Data Transformations for Operational Analytics

Business and analytical groups may have north star metrics that can only be tracked by correlating multiple data sets. Data transformation jobs can join data across multiple data sets and aggregate it into a unified view. Examples include:

  • Joining unique customer records across multiple tables – your customer Jane Smith exists in your CRM, your support ticketing system, your website analytics system (e.g. Google Analytics), and your invoicing system. You can write a transform to create a unified view of all her sales and marketing touchpoints and support tickets to get a 360 view of her customer activity.

Creating a 360 view of the customer

  • Creating aggregate values from raw data – for instance, you can take all your invoices and transform them to build a table of monthly revenue categorized by location or industry.
  • Creating analytical tables – your data analysts could write complex queries to answer even simple questions. Alternatively, your data engineering team can make life easy for analysts by pre-creating analytical tables. That makes generating a new report as easy as writing a ‘select * from x limit 100’ and drives down compute costs in your warehouse. For example:

Sales Data Table:

Customer | Invoice Amount | Invoice Date
ACME     | $100           | 3:00 pm UTC, July 30, 2021
MyShop   | $200           | 1:00 pm UTC, July 29, 2021

And Monthly Sales (analytical table):

Month  | Invoice Amount
July   | $500
August | $200
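
Here is a minimal sketch of how such an analytical table could be pre-created in PostgreSQL-flavored SQL (table and column names are illustrative assumptions):

-- Pre-build the Monthly Sales analytical table from the raw sales data.
CREATE TABLE monthly_sales AS
SELECT
    TO_CHAR(invoice_date, 'Month') AS month,
    SUM(invoice_amount)            AS invoice_amount
FROM sales_data
GROUP BY 1;

-- Analysts can then answer the question with a trivial, inexpensive query:
SELECT * FROM monthly_sales LIMIT 100;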

Data Transformation Challenges

Data transformation has many benefits to an organization, but it is also important to note that certain hurdles make data transformation difficult.

Data transformation can be a costly process. ETL tools come at a price, and training staff on data management processes is not exactly a walk in the park either. The cost of data transformation also depends on infrastructure size: extensive infrastructure will require hiring a team of data experts to oversee the transformation process.

To transform data effectively, an organization would have to set up tools, train its current employees, and/or hire a new set of experts to oversee the process. Either way, it is going to cost time and resources to make that happen.

Data Transformation Best Practices

The data transformation process seems like an easy step-by-step workflow, but there are certain things to keep in mind to avoid running into blockers or carrying out the wrong type of data transformation. The following are data transformation best practices:

  • Start by designing the target format. Jumping right into the nitty-gritty of data transformation without understanding the end goal is not a good idea. Communicate with business users to understand the process you are trying to analyze and design the target format before transforming data into insights.
  • Profile the data. In other words, get to know the data in its native form before converting it. It helps to understand the state of the raw data and the type of transformation required. Data profiling enables you to know the amount of work required and the workflow to generate for transformation.
  • Cleanse before transforming. Data cleaning is an essential pre-transformation step. It reduces the risk of having errors in the transformed data. Data can have missing values or information that is irrelevant to the desired format. Cleansing your data first increases the accuracy of the transformed data.
  • Audit the data transformation process. At every stage of the transformation process, track the data and changes that occur. Auditing the data transformation process makes it easier to identify the problem source if complications arise.

Use Data Transformation Tools Over Custom Coding

Data transformation tools are more cost-effective and more efficient than custom coding. Writing code for data transformation functions carries a higher risk of inefficiencies, human error, and excessive use of time and resources.

Striim data integration platform
Striim is a data integration solution that offers real-time, high-speed SQL-based transformations.

Data transformation tools are usually designed to execute the entire ETL process. If you are not using tools for the transformation process, you’ll also need to worry about the “extraction” and “loading” steps. Custom coding allows for a fully customized data transformation solution, but as data sources, volumes, and other complexities increase, scaling and managing this becomes increasingly difficult.

Striim is an end-to-end data integration solution that offers scalable in-memory transformations, enrichment and analysis, using high-speed SQL queries. To get a personalized walkthrough of Striim’s real-time data transformation capabilities, please request a demo. Alternatively, you can try Striim for free.

Oracle LogMiner Continuous Mining Deprecation: What You Need to Know

Striim for Oracle CDC

In this era of digital transformation, companies are shifting to the cloud to increase their agility and shorten the time to business insights. Oracle databases have long been entrusted with operational data, and various methods have emerged to replicate or migrate data to cloud targets.

Change Data Capture (CDC) tracks database changes as they happen and generates a continuous stream of change data. Oracle CDC methods allow companies to stream database changes to their cloud targets in real time, enabling timely business intelligence and zero-downtime cloud migrations.

Various CDC approaches have been developed over the years to facilitate real time business operations.

“Efficient and reliable log mining techniques are critical for real time business operations. As database log-mining approaches continue to evolve, data architects need to incorporate low latency methods to stream changes from mission critical database workloads.”
-Alok Pareek, Founder at Striim, former CTO at GoldenGate

Oracle CDC methods include LogMiner-based CDC, trigger-based CDC, and third-party tools like Striim. Here’s a brief overview of these methods:

Oracle LogMiner-based CDC

LogMiner provides a SQL-based interface to Oracle database redo log files. LogMiner is designed to help users track data corruption (and identify any required recovery actions) and carry out performance tuning and capacity planning. 

The LogMiner “continuous mine” feature was introduced to allow automatic, continuous mining of the redo logs. This feature provides near real-time access to redo log changes, which has made it the target for many third-party data integration tools that offer CDC from Oracle.

Recently, Oracle announced that continuous mine would be deprecated in Oracle Database 19c. For that reason, companies using newer versions of Oracle can no longer use the continuous mine feature for CDC.

Trigger-based Oracle CDC

Trigger-based Oracle CDC, also known as “synchronous” CDC, starts with defining database triggers that fire before or after INSERT, UPDATE, or DELETE commands (that is, whenever a change occurs). The change data is used to populate a change table, and each source table requires its own change table (see the sketch after the figure below).

Trigger based Oracle CDC
Image source: Oracle documentation
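
For illustration, here is a minimal sketch of what a trigger-based setup looks like for a single table, written in Oracle-style PL/SQL. The table and column names are illustrative assumptions:

-- Each source table needs its own change table.
CREATE TABLE orders_change_log (
    change_id   NUMBER GENERATED ALWAYS AS IDENTITY,
    order_id    NUMBER,
    operation   VARCHAR2(10),
    changed_at  TIMESTAMP DEFAULT SYSTIMESTAMP
);

-- The trigger fires on every row change, adding work to each transaction.
CREATE OR REPLACE TRIGGER orders_cdc_trg
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
    INSERT INTO orders_change_log (order_id, operation)
    VALUES (COALESCE(:NEW.order_id, :OLD.order_id),
            CASE WHEN INSERTING THEN 'INSERT'
                 WHEN UPDATING  THEN 'UPDATE'
                 ELSE 'DELETE' END);
END;
/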

Triggers introduce overhead to operational tables while changes are being made and can have significant performance impacts on any applications accessing those tables. Furthermore, maintaining the triggers as the application evolves adds ongoing management burden.

Striim for Oracle CDC

Striim is a real-time data integration platform that natively performs change data capture by mining Oracle redo logs while managing additional metadata.

Striim for Oracle CDC

Striim’s Oracle CDC solution offers a variety of benefits:

  • It does not require the LogMiner continuous mining feature (and has a 3x faster capture rate than LogMiner-based CDC)
  • It can also read Oracle GoldenGate Trail files (for companies who are using Oracle GoldenGate)
  • It is RAC-cluster compatible
  • It offers high-speed, reliable change data capture (with exactly-once processing and support for long-running transactions)
  • It’s an end-to-end data integration tool with real time monitoring, analysis, and enrichment; and connectors to over 100 sources/targets 
  • It’s designed for mission critical, high-volume environments

Striim offers Oracle CDC to Snowflake, Kafka, BigQuery, Azure SQL Database, and many more targets.

If you’re looking for a high-performance alternative to LogMiner-based or trigger-based Oracle CDC, one of Striim’s CDC experts would be happy to give you a personalized walkthrough of the Striim platform. Alternatively, you can give Striim a try for free. 

What is Data Ingestion and Why This Technology Matters

data ingestion

  1. Introduction
  2. Types of Data Ingestion
  3. Benefits of Data Ingestion
  4. Data Ingestion Challenges
  5. Data Ingestion Tools
  6. Finding a Key Differentiator

Introduction

Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices, on-premises databases, and SaaS apps, and end up in different target environments, such as cloud data warehouses or data marts.

Data ingestion is a critical technology that helps organizations make sense of an ever-increasing volume and complexity of data. To help businesses get more value out of data ingestion, we’ll dive deeper into this technology. We’ll cover types of data ingestion, how data ingestion is done, the difference between data ingestion and ETL, data ingestion tools, and more.

Types of Data Ingestion

There are three ways to carry out data ingestion: in real time, in batches, or using a combination of both in a setup known as lambda architecture. Companies can opt for one of these types depending on their business goals, IT infrastructure, and financial limitations.

Real-time data ingestion

Real-time data ingestion is the process of collecting and transferring data from source systems in real time using solutions such as change data capture (CDC). CDC constantly monitors transaction or redo logs and moves changed data without interfering with the database workload. Real-time ingestion is essential for time-sensitive use cases, such as stock market trading or power grid monitoring, when organizations have to rapidly react to new information. Real-time data pipelines are also vital when making rapid operational decisions and identifying and acting on new insights.

Batch-based data ingestion

Batch-based data ingestion is the process of collecting and transferring data in batches according to scheduled intervals. The ingestion layer may collect data based on simple schedules, trigger events, or any other logical ordering. Batch-based ingestion is useful when companies need to collect specific data points on a daily basis or simply don’t need data for real-time decision-making.

Lambda architecture-based data ingestion

Lambda architecture is a data ingestion setup that combines real-time and batch methods. It consists of batch, serving, and speed layers: the first two index data in batches, while the speed layer instantaneously indexes data that has yet to be picked up by the slower batch and serving layers. This ongoing hand-off between layers ensures that data is available for querying with low latency, as the serving-query sketch below illustrates.
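
A serving-layer query in this setup typically merges the precomputed batch view with recent events the batch layer has not indexed yet. Here is a minimal sketch in SQL, with all table and column names as illustrative assumptions:

-- Merge the batch view with not-yet-indexed speed-layer events.
SELECT user_id, SUM(event_count) AS event_count
FROM (
    SELECT user_id, event_count
    FROM batch_view                     -- built by the batch/serving layers
    UNION ALL
    SELECT user_id, COUNT(*) AS event_count
    FROM speed_events                   -- maintained by the speed layer
    WHERE event_time > (SELECT batch_high_watermark FROM batch_metadata)
    GROUP BY user_id
) combined
GROUP BY user_id;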

Benefits of Data Ingestion

Data ingestion technology offers various benefits, enabling teams to manage data more efficiently and gain a competitive advantage. Some of these benefits include:

  • Data is readily available: Data ingestion helps companies gather data stored across various sites and move it to a unified environment for immediate access and analysis.
  • Data is less complex: Advanced data ingestion pipelines, combined with ETL solutions, can transform various types of data into predefined formats and then deliver it to a data warehouse.
  • Teams save time and money: Data ingestion automates some of the tasks that previously had to be manually carried out by engineers, whose time can now be dedicated to other more pressing tasks.
  • Companies make better decisions: Real-time data ingestion allows businesses to quickly notice problems and opportunities and make informed decisions.
  • Teams create better apps and software tools: Engineers can use data ingestion technology to ensure that their apps and software tools move data quickly and provide users with a superior experience.

multi-cloud data integration

Data Ingestion Challenges

Setting up and maintaining data ingestion pipelines might be simpler than before, but it still involves several challenges:

  • The data ecosystem is increasingly diverse: Teams have to deal with an ever-growing number of data types and sources, making it difficult to create a future-proof data ingestion framework.
  • Legal requirements are more complex: From GDPR to HIPAA to SOC 2, data teams have to familiarize themselves with various data privacy and protection regulations to ensure they’re acting within the boundaries of the law.
  • Cyber-security challenges grow in size and scope: Data teams have to fend off frequent cyber-attacks launched by malicious actors in an attempt to intercept and steal sensitive data.

Data Ingestion Tools

Data ingestion tools are software products that gather and transfer structured, semi-structured, and unstructured data from source to target destinations. These tools automate otherwise laborious and manual ingestion processes. Data is moved along a data ingestion pipeline, which is a series of processing steps that take data from one point to another.

Data ingestion tools come with different features and capabilities. To select the tool that fits your needs, you’ll need to consider several factors and decide accordingly:

  • Format: Is data arriving as structured, semi-structured, or unstructured?
  • Frequency: Is data to be ingested and processed in real time or in batches?
  • Size: What’s the volume of data an ingestion tool has to handle?
  • Privacy: Is there any sensitive data that needs to be obfuscated or protected?

And data ingestion tools can be used in different ways. For instance, they can move millions of records into Salesforce every day. Or they can ensure that different apps exchange data on a regular basis. Ingestion tools can also bring marketing data to a business intelligence platform for further analysis.

Data ingestion vs. ETL

Data ingestion tools may appear similar in function to ETL platforms, but there are some differences. For one, data ingestion is primarily concerned with extracting data from the source and loading it into the target site. ETL, however, is a type of data ingestion process that involves not only the extraction and transfer of data but also the transformation of that data before its delivery to target destinations.

ETL platforms, such as Striim, can perform various types of transformation, such as aggregation, cleansing, splitting, and joining. The goal is to ensure that the data is delivered in a format that matches the requirements of the target location.

Finding a Key Differentiator

Data ingestion is a vital technology that helps companies extract and transfer data in an automated way. With data ingestion pipelines established, IT and other business teams can focus on extracting value from data and finding new insights. Automated data ingestion can become a key differentiator in today’s increasingly competitive marketplaces.

Schedule a demo and we’ll give you a personalized walkthrough, or try Striim at production scale for free! Working with small data volumes, or hoping to get hands-on quickly? Striim also offers a free developer version.

What to Look for in Data Replication Software

Reliable access to data is vital for companies to thrive in this digital age. But businesses struggle with various risk factors, like hardware failures, cyberattacks, and geographical distances, that can block access to data or corrupt valuable data assets. Left without access to data, teams may struggle to carry out day-to-day tasks and deliver on important projects.

One way to safeguard your data from those risks is using data replication solutions. This technology is indispensable for teams that want to replicate and protect their mission-critical data and use it as a source of competitive advantage.

To help businesses explore data replication, we’ll dive into this technology and explore what features you should look for in data replication software.

What is Data Replication?

Data replication is the process of copying data from an on-premises or cloud server and storing it on another server or site. The result is a multitude of exact data copies residing in multiple locations.

These data replicas support teams in their disaster recovery and business continuity efforts. If data is compromised at one site (for example by a system failure or a cyberattack), teams can pull replicated data from other servers and resume their work.

Replication also allows users to access data stored on servers close to their offices, reducing network latency. For instance, users in Asia may experience a delay when accessing data stored in North America-based servers. But the latency will decrease if a replica of this data is kept on a node that’s closer to Asia.

Data replication also plays an important role in analytics and business intelligence efforts, in which data is replicated from operational databases to data warehouses.

Types and Methods of Data Replication

Depending on their needs, companies can choose among several types of data replication:

  • Transactional replication: Users receive a full copy of their data sets, and updates are continuously replicated as data in the source changes.
  • Snapshot replication: A snapshot of the database is sent to replicated sites at a specific moment.
  • Merge replication: Data from multiple databases is replicated into a single database.

In tactical terms, there are several methods for replicating data, including:

  • Full-table replication: Every piece of new, updated, and existing data is copied from the source to the destination site. This method copies all data every time and requires a lot of processing power, which puts networks under heavy stress.
  • Key-based incremental replication: Only data changed since the previous update is replicated (see the sketch after this list). This approach uses less processing power but can’t replicate hard-deleted data.
  • Log-based incremental replication: Data is replicated based on information in database log files. This is an efficient method, but it works only with database sources that support log-based replication (such as Microsoft SQL Server, Oracle, and PostgreSQL).
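
As a quick illustration of the key-based approach, here is a minimal sketch in SQL. The table, the columns, and the :last_synced_at placeholder are illustrative assumptions:

-- Copy only rows changed since the previous run, tracked by a high-water mark.
SELECT *
FROM customers
WHERE updated_at > :last_synced_at  -- high-water mark saved from the last run
ORDER BY updated_at;

-- After a successful copy, persist the new high-water mark for the next run.
-- Note that rows hard-deleted from the source never match this filter,
-- which is why this method can't replicate deletes.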

What to Look for in Data Replication Software

Data replication software: key features

Data replication software should ideally contain the following features:

A large number of connectors: A replication tool should allow you to replicate data from various sources and SaaS tools to data warehouses and other targets.

Log-based capture: An ideal replication software product should capture streams of data using log-based change data capture.

Data transformation: Data replication solutions should also allow users to clean, enrich, and transform replicated data.

Built-in monitoring: Dashboards and monitoring enable you to see the state of your data flows in real time and easily identify any bottlenecks. For mission-critical systems that have data delivery Service Level Agreements (SLAs), it’s also important to have visibility into end-to-end lag.

Custom alerts: Data replication software should offer alerts that can be configured for a variety of metrics, keeping you up to date on the status and performance of your data flows.

Ease of use: A drag-and-drop interface is an ideal solution for users to quickly set up replication processes.

Data replication software vs. writing code internally

Of course, users can set up the replication process by writing code internally. But managing yet another in-house app is a major commitment of energy, staff, and money. The app may also require the team to handle error logging, code refactoring, alerting, and more. It comes as no surprise that many teams are opting for third-party data replication software.

Use Striim to replicate data in real time

There are also real-time database replication solutions such as Striim. Striim is a unified streaming and real-time data integration platform that connects over 150 sources and targets. Striim provides real-time data replication by extracting data from databases using log-based change data capture and replicating it to targets in real time.

Striim enables real time data replication
Striim is a unified real-time data integration and streaming platform that connects clouds, data, and applications. With log-based change data capture from a range of databases, Striim supports real time data replication.

Striim’s data integration and replication capabilities support various use cases. The platform can, for instance, enable financial organizations to replicate transactions and new balance data to customer accounts near instantaneously. Inspyrus, a San Francisco-based fintech startup, uses Striim to replicate invoicing data from its private cloud operational databases to other cloud targets such as Snowflake for real-time analytics.

Striim can also be used to replicate obfuscated sensitive data to Google Cloud while original data is safely kept in an on-premises environment. Furthermore, Striim supports mission-critical use cases with data delivery and latency SLAs. Striim customer Macy’s uses Striim to streamline retail operations and provide a unified customer experience. Even at Black Friday traffic levels, Striim is able to deliver data from Macy’s on-premises data center to Google Cloud with less than 200ms latency.

Have More Time to Analyze Data

Striim for data replication
Striim replicates data from databases using high-performance log-based Change Data Capture.

Reliable access to data is of vital importance for today’s companies. But that access can often be blocked or limited, which is why data replication solutions are increasingly important. They enable teams to replicate and protect valuable data assets, and support disaster recovery efforts. And with data secured, teams can have more time and energy to analyze data and find insights that will provide a competitive edge.

Ready to see how Striim can help you simplify data integration and replication? Request a demo with one of our data replication experts, or try Striim for free.
