Comparing Snowflake Data Ingestion Methods with Striim

Introduction

In the fast-evolving world of data integration, Striim’s collaboration with Snowflake stands as a beacon of innovation and efficiency. This comprehensive overview delves into the sophisticated capabilities of Striim for Snowflake data ingestion, spanning from file-based initial loads to the advanced Snowpipe streaming integration.

Quick Compare: File-based loads vs Streaming Ingest

We’ve provided a simple overview of the ingestion methods below:

Data Freshness SLAs
  – File-based loads: 5 minutes to 1 hour
  – Streaming Ingest: under 5 minutes; a benchmark demonstrated P95 latency of 3 seconds at 158 GB/hr of Oracle CDC ingest

Use Cases
  – File-based loads: ideal for batch processing and reporting; suitable where near real-time data is not critical; bulk data uploads at periodic intervals
  – Streaming Ingest: critical for operational intelligence, real-time analytics, AI, and reverse ETL; necessary where data must be immediately actionable; continuous data capture and immediate processing

Data Volume Handling
  – File-based loads: efficiently handles large volumes of data in batches
  – Streaming Ingest: best for high-velocity, continuous data streams

Flexibility
  – File-based loads: limited flexibility in data freshness; good for static, predictable workloads
  – Streaming Ingest: high flexibility to handle varying data rates and immediate data requirements; adaptable to dynamic workloads, AI-driven insights, and reverse ETL processes

Operation Modes
  – File-based loads: supports both Append Only and Merge modes
  – Streaming Ingest: primarily supports Append Only mode

Network Utilization
  – File-based loads: higher data transfer in bulk, but less frequent; can be more network-efficient in certain scenarios
  – Streaming Ingest: continuous data transfer, which may lead to higher network utilization

Performance Optimization
  – File-based loads: batch size and frequency can be optimized; easier to manage for predictable workloads
  – Streaming Ingest: requires fine-tuning of MaxRequestSizeInMB, MaxRecordsPerRequest, and MaxParallelRequests; cost optimization is a key benefit, especially in high-traffic scenarios

File-based uploads: Merge vs Append Only

Striim’s approach to loading data into Snowflake is marked by its intelligent use of file-based uploads. This method is particularly adept at handling large data sets securely and efficiently. A key aspect of this process is the choice between ‘Merge’ and ‘Append Only’ modes.

Merge Mode: In this mode, Striim allows for a more traditional approach where updates and deletes in the source data are replicated as such in the Snowflake target. This method is essential for scenarios where maintaining the state of the data as it changes over time is crucial.

Append Only Mode: By contrast, the ‘Append Only’ setting treats all operations (including updates and deletes) as inserts into the target. This mode is particularly useful for audit trails or scenarios where preserving the historical sequence of data changes is important. Append Only mode also delivers higher performance in workloads such as initial loads, where you simply want to copy all existing data from a source system into Snowflake.
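To make the difference concrete, here is a minimal Python sketch (illustrative only, not Striim’s implementation) of how the same CDC event stream lands in a target under each mode:

```python
# Illustrative sketch: how identical CDC events land under
# Merge vs. Append Only semantics.

def apply_merge(table, event):
    """Merge mode: replay the operation against the target state."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("INSERT", "UPDATE"):
        table[key] = row          # upsert: target mirrors the source
    elif op == "DELETE":
        table.pop(key, None)      # delete is propagated

def apply_append_only(log, event):
    """Append Only mode: every operation becomes an insert (audit trail)."""
    log.append({"op": event["op"], "key": event["key"], "row": event.get("row")})

events = [
    {"op": "INSERT", "key": 1, "row": {"name": "Ada"}},
    {"op": "UPDATE", "key": 1, "row": {"name": "Ada L."}},
    {"op": "DELETE", "key": 1},
]

merged, audit = {}, []
for e in events:
    apply_merge(merged, e)
    apply_append_only(audit, e)

print(merged)      # {} -- the row was inserted, updated, then deleted
print(len(audit))  # 3  -- all three operations preserved as rows
```

Merge leaves the target mirroring the source’s final state, while Append Only keeps every intermediate change, which is exactly why it suits audit and initial-load workloads.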

Snowflake Writer: Technical Deep Dive on File-based uploads

The SnowflakeWriter in Striim is a robust tool that stages events to local storage, AWS S3, or Azure Storage, then writes to Snowflake according to the defined Upload Policy. Key features include:

  • Secure Connection: Utilizes JDBC with SSL, ensuring secure data transmission.
  • Authentication Flexibility: Supports password, OAuth, and key-pair authentication.
  • Customizable Upload Policy: Allows defining batch uploads based on event count, time intervals, or file size.
  • Data Type Support: Comprehensive support for various data types, ensuring seamless data transfer.

SnowflakeWriter efficiently batches incoming events per target table, optimizing the data movement process. The batching is controlled via a BatchPolicy property, where batches expire based on event count or time interval. This feature significantly enhances the performance of bulk uploads or merges.

Batch tuning in Striim’s Snowflake integration is a critical aspect that can significantly impact the efficiency and speed of data transfer. Properly tuned batches ensure that data is moved to Snowflake in an optimized manner, balancing between throughput and latency. Here are key considerations and strategies for batch tuning:

  1. Understanding Batch Policy: Striim’s SnowflakeWriter allows customization of the batch policy, which determines how data is grouped before being loaded into Snowflake. The batch policy can be configured based on event count (eventcount), time intervals (interval), or both.
  2. Event Count vs. Time Interval:
    • Event Count (eventcount): This setting determines the number of events that will trigger a batch upload. A higher event count can increase throughput but may add latency. It’s ideal for scenarios with high-volume data where latency is less critical.
    • Time Interval (interval): This configures the time duration after which data is batched and sent to Snowflake. A shorter interval ensures fresher data in Snowflake but might reduce throughput. This is suitable for scenarios requiring near real-time data availability.
    • Both: in this scenario, the batch will load when either eventcount or interval threshold is met.
  3. Balancing Throughput and Latency: The key to effective batch tuning is finding the right balance between throughput (how much data is being processed) and latency (how fast data is available in Snowflake).
    • For high-throughput requirements, a larger eventcount might be more effective.
    • For lower latency, a shorter interval might be better.
  4. Monitoring and Adjusting: Continuously monitor the performance after setting the batch policy. If you notice delays in data availability or if the system isn’t keeping up with the data load, adjustments might be necessary. You can do this by going to your Striim Console and entering ‘mon <target name>’ which will give you a detailed view of your batch upload monitoring metrics.
  5. Considerations for Diverse Data Types: If your data integration involves diverse data types or varying sizes of data, consider segmenting data into different streams with tailored batch policies for each type.
  6. Handling Peak Loads: During times of peak data load, it might be beneficial to temporarily adjust the batch policy to handle the increased load more efficiently.
  7. Resource Utilization: Keep an eye on the resource utilization on both Striim and Snowflake sides. If the system resources are underutilized, you might be able to increase the batch size for better throughput.
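As a minimal illustration of the batch-policy mechanics described above (the eventcount/interval names mirror the properties discussed; this is a sketch, not Striim code), a batcher that flushes when either threshold is met might look like:

```python
import time

# Illustrative sketch of an eventcount/interval batch policy:
# a batch is flushed when EITHER threshold is met first.

class Batcher:
    def __init__(self, eventcount, interval_sec, flush):
        self.eventcount = eventcount
        self.interval = interval_sec
        self.flush = flush
        self.buffer = []
        self.started = None

    def add(self, event, now=None):
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.started = now        # clock starts with the first event
        self.buffer.append(event)
        if (len(self.buffer) >= self.eventcount
                or now - self.started >= self.interval):
            self.flush(self.buffer)
            self.buffer = []

batches = []
b = Batcher(eventcount=3, interval_sec=60, flush=batches.append)
for i in range(7):
    b.add(i, now=float(i))            # inject timestamps for determinism

print(batches)  # [[0, 1, 2], [3, 4, 5]] -- event 6 is still buffered
```

Raising eventcount trades latency for throughput; shortening the interval does the reverse, which is the balance described in point 3 above.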

Snowpipe Streaming Explanation and Terminology

Snowpipe Streaming is an innovative streaming ingest API released by Snowflake. It is distinct from classic Snowpipe with some core differences:

Form of Data to Load
  – Snowpipe Streaming: rows
  – Snowpipe: files

Third-Party Software Requirements
  – Snowpipe Streaming: a custom Java application wrapper around the Snowflake Ingest SDK
  – Snowpipe: none

Data Ordering
  – Snowpipe Streaming: ordered insertions within each channel
  – Snowpipe: not supported

Load History
  – Snowpipe Streaming: recorded in the SNOWPIPE_STREAMING_FILE_MIGRATION_HISTORY view (Account Usage)
  – Snowpipe: recorded in the LOAD_HISTORY view (Account Usage) and the COPY_HISTORY function (Information Schema)

Pipe Object
  – Snowpipe Streaming: does not require a pipe object
  – Snowpipe: requires a pipe object that queues and loads staged file data into target tables

Snowpipe Streaming supports ordered, row-based ingest into Snowflake via Channels.

Channels in Snowpipe Streaming:

  • Channels represent logical, named streaming connections to Snowflake for loading data into a table. Each channel maps to exactly one table, but multiple channels can point to the same table. These channels preserve the ordering of rows and their corresponding offset tokens within a channel, but not across multiple channels pointing to the same table.

Offset Tokens:

  • Offset tokens are used to track ingestion progress on a per-channel basis. These tokens are updated when rows with a provided offset token are committed to Snowflake. This mechanism enables clients to track ingestion progress, check if a specific offset has been committed, and enable de-duplication and exactly-once delivery of data.
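The offset-token mechanism can be sketched as follows. This is an illustrative Python model, not the Snowflake Ingest SDK API: a client resumes after a restart by checking the channel’s last committed offset and skipping rows at or below it:

```python
# Illustrative model of per-channel offset-token tracking and the
# de-duplication it enables (hypothetical names, not the real SDK).

class Channel:
    """One channel: ordered rows, one latest committed offset token."""
    def __init__(self):
        self.rows = []
        self.committed_offset = None

    def insert_rows(self, rows, offset_token):
        self.rows.extend(rows)
        self.committed_offset = offset_token  # committed with the batch

def resume_ingest(channel, source_rows):
    """Replay source rows, skipping offsets at or below the committed one."""
    last = channel.committed_offset
    for offset, row in source_rows:
        if last is not None and offset <= last:
            continue                          # already committed: de-duplicated
        channel.insert_rows([row], offset)

ch = Channel()
ch.insert_rows(["r1", "r2"], offset_token=2)  # first run commits offsets 1-2
# ...client restarts and replays the full source from the beginning...
resume_ingest(ch, [(1, "r1"), (2, "r2"), (3, "r3")])
print(ch.rows)              # ['r1', 'r2', 'r3'] -- no duplicates
print(ch.committed_offset)  # 3
```

Because the token is committed atomically with the rows, a replay after failure cannot double-insert, which is how exactly-once delivery is achieved per channel.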

Migration to Optimized Files:

  • Initially, streamed data written to a target table is stored in a temporary intermediate file format. An automated process then migrates this data to native files optimized for query and DML operations.

Replication:

  • Snowflake supports replication and failover of tables populated by Snowpipe Streaming, along with their associated channel offsets, from one account to another, even across regions and cloud platforms.

Snowpipe Streaming: Unleashing Real-Time Data Integration and AI

Snowpipe Streaming, when teamed up with Striim, is kind of like a superhero for real-time data needs. Think about it: the moment something happens, you know about it. This is a game-changer in so many areas. For instance, in banking, it’s like having a super-fast guard dog that barks the instant it smells a hint of fraud. Or in online retail, imagine adjusting prices on the fly, just like that, to keep up with market trends. Healthcare? It’s about getting real-time updates on patient stats, making sure everyone’s on top of their game when lives are on the line. And let’s not forget the guys in manufacturing and logistics – they can track their stuff every step of the way, making sure everything’s ticking like clockwork. It’s about making decisions fast and smart, no waiting around. Snowpipe Streaming basically makes sure businesses are always in the know, in the now.

Striim’s integration with Snowpipe Streaming represents a significant advancement in real-time data ingestion into Snowflake. This feature facilitates low-latency loading of streaming data, optimizing both cost and performance, which is pivotal for businesses requiring near-real-time data availability.

Choosing the Right Streaming Configuration in Striim’s Integration with Snowflake

The performance of Striim’s Snowflake writer in a streaming context can be significantly influenced by the correct configuration of its streaming parameters. Understanding and adjusting these parameters is key to achieving the optimum balance between throughput and responsiveness. Let’s delve into the three critical streaming parameters that Striim’s Snowflake writer supports:

  1. MaxRequestSizeInMB:
    • Description: This parameter determines the maximum size in MB of a data chunk that is submitted to the Streaming API.
    • Usage Notes: It should be set to a value that:
      • Maximizes throughput with the available network bandwidth.
      • Packs incoming data into the minimum number of requests.
      • Matches the inflow rate of data.
  2. MaxRecordsPerRequest:
    • Description: Defines the maximum number of records that can be included in a data chunk submitted to the Streaming API.
    • Usage Notes: This parameter is particularly useful:
      • When the record size for the table is small, requiring a large number of records to meet the MaxRequestSizeInMB.
      • When the rate at which records arrive takes a long time to accumulate enough data to reach MaxRequestSizeInMB.
  3. MaxParallelRequests:
    • Description: Specifies the number of parallel channels that submit data chunks for integration.
    • Usage Notes: Best utilized for real-time streaming when:
      • Parallel ingestion on a single table enhances performance.
      • There is a very high inflow of data, allowing chunks to be uploaded by multiple worker threads in parallel as they are created.

The integration of these parameters within the Snowflake writer needs careful consideration. They largely depend on the volume of data flowing through the pipeline and the network bandwidth between the Striim server and Snowflake. It’s important to note that each Snowflake writer creates its own instance of the Snowflake Ingest Client, and within the writer, each parallel request (configured via MaxParallelRequests) utilizes a separate streaming channel of the Snowflake Ingest Client.

Illustration of Streaming Configuration Interaction:

Consider an example where the UploadPolicy is set to Interval=2sec and the streaming configuration is (MaxParallelRequests=1, MaxRequestSizeInMB=10, MaxRecordsPerRequest=10000). As records flow into the event stream, a streaming chunk is created as soon as either 10 MB of data has accumulated or 10,000 records have arrived, whichever condition is satisfied first. Any remaining events that arrive before the 2-second UploadPolicy interval expires are packed into another streaming chunk when it does.
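A minimal sketch of this chunking behavior (the parameter names mirror the properties above; this is illustrative, not Striim’s implementation):

```python
# Illustrative chunking: a chunk is cut when either the size limit or
# the record limit is reached; whatever remains is flushed when the
# UploadPolicy interval expires.

def make_chunks(records, max_request_size_mb, max_records_per_request):
    chunks, current, current_mb = [], [], 0.0
    for rec_size_mb in records:               # each record = its size in MB
        current.append(rec_size_mb)
        current_mb += rec_size_mb
        if (current_mb >= max_request_size_mb
                or len(current) >= max_records_per_request):
            chunks.append(current)
            current, current_mb = [], 0.0
    if current:                                # interval expiry: flush remainder
        chunks.append(current)
    return chunks

# 7 records of 3 MB each, against limits of 10 MB / 10,000 records:
chunks = make_chunks([3.0] * 7, max_request_size_mb=10,
                     max_records_per_request=10000)
print([len(c) for c in chunks])  # [4, 3] -- size limit trips first, rest flushed
```

With small records, MaxRecordsPerRequest would trip first instead, which is why both knobs exist.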

Real-world application and what customers are saying

The practical application of Striim’s Snowpipe Streaming integration can be seen in the experiences of joint customers like Ciena. Their global head of Enterprise Data & Analytics reported significant satisfaction with Striim’s capabilities in handling large-scale, real-time data events, emphasizing the platform’s scalability and reliability.

Conclusion and Exploring Further

Striim’s data integration capabilities for Snowflake, encompassing both file-based uploads and advanced streaming ingest, offer a versatile and powerful solution for diverse data integration needs. The integration with Snowpipe Streaming stands out for its real-time data processing, cost efficiency, and low latency, making it an ideal choice for businesses looking to leverage real-time analytics.

For those interested in a deeper exploration, we provide detailed resources, including a comprehensive eBook on Snowflake ingest optimization and a self-service, free tier of Striim, allowing you to dive right in with your own workloads!

Data Mesh Architecture: Revolutionizing Event Streaming with Striim

Data Mesh is revolutionizing event streaming architecture by enabling organizations to quickly and easily integrate real-time data, streaming analytics, and more. With the help of Striim’s enterprise-grade platform, companies can now deploy and manage a data mesh architecture with automated data mapping, cloud-native capabilities, and real-time analytics. In this article, we will explore the advantages and limitations of data mesh, while also providing best practices for building and optimizing a data mesh with Striim. By exploring the benefits of using Data Mesh for your event streaming architecture, this article will help you decide if it’s the right solution for your organization.

What is a Data Mesh and how does it work?

Data Mesh is a decentralized approach to data architecture that helps organizations quickly and easily integrate real-time data, stream analytics, and more. It enables data to be accessed, transferred, and used in various ways, such as creating dashboards or running analytics. The Data Mesh architecture is based on four core principles: domain-oriented decentralized ownership, data as a product, self-serve data infrastructure, and federated computational governance.

Data mesh technology also employs event-driven architectures and APIs to facilitate the exchange of data between different systems. This allows for two-way integration so that information can flow from one system to another in real-time. Striim is a cloud-native Data Mesh platform that offers features such as automated data mapping, real-time data integration, streaming analytics, and more. With Striim’s enterprise-grade platform, companies can deploy and manage their data mesh with ease.

Moreover, common mechanisms for implementing the input port, which consumes data from collaborating operational systems, include asynchronous, event-driven data sharing in modern systems like Striim’s Data Mesh platform, as well as change data capture (Dehghani, 220). With these mechanisms in place, organizations can ensure a secure yet fast exchange of important information across their networks, helping them maintain quality standards while gaining insight into customer behavior for better decision making.
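As a minimal illustration of an asynchronous, event-driven input port (hypothetical names throughout, not a Striim API), an operational system can publish change events that a data-domain consumer ingests without coupling to the source:

```python
from collections import deque

# Illustrative event-driven input port: a producer publishes CDC-style
# change events to a bus; the consuming domain stores them on delivery.

class EventBus:
    def __init__(self):
        self.queue = deque()
        self.subscribers = []

    def publish(self, event):
        self.queue.append(event)       # producer returns immediately

    def deliver(self):                 # asynchronous delivery step
        while self.queue:
            event = self.queue.popleft()
            for handler in self.subscribers:
                handler(event)

bus = EventBus()
domain_store = []                      # the consuming domain's own store
bus.subscribers.append(domain_store.append)

# The operational system emits a change event from its transaction log:
bus.publish({"table": "orders", "op": "INSERT", "id": 42})
bus.deliver()

print(domain_store)  # [{'table': 'orders', 'op': 'INSERT', 'id': 42}]
```

The producer never waits on the consumer, which is what keeps the operational system decoupled from the analytical domain.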

What are the four principles of a Data Mesh, and what problems do they solve?

A data mesh is technology-agnostic and is underpinned by four main principles, described in depth in this blog post by Zhamak Dehghani. The four data mesh principles aim to solve major difficulties that have long plagued data and analytics applications, so it is worth understanding both the principles and the problems they were created to tackle.

Domain-oriented decentralized data ownership and architecture

This principle means that each organizational data domain (i.e., customer, inventory, transaction domain) takes full control of its data end-to-end. Indeed, one of the structural weaknesses of centralized data stores is that the people who manage the data are functionally separate from those who use it. As a result, the notion of storing all data together within a centralized platform creates bottlenecks where everyone is mainly dependent on a centralized “data team” to manage, leading to a lack of data ownership. Additionally, moving data from multiple data domains to a central data store to power analytics workloads can be time consuming. Moreover, scaling a centralized data store can be complex and expensive as data volumes increase.

There is no centralized team managing one central data store in a data mesh architecture. Instead, a data mesh entrusts data ownership to the people (and domains) who create it. Organizations can have data product managers who control the data in their domain. They’re responsible for ensuring data quality and making data available to those in the business who might need it. Data consistency is ensured through uniform definitions and governance requirements across the organization, and a comprehensive communication layer allows other teams to discover the data they need. Additionally, the decentralized data storage model reduces the time to value for data consumers by eliminating the need to transport data to a central store to power analytics. Finally, decentralized systems provide more flexibility, are easier to work on in parallel, and scale horizontally, especially when dealing with large datasets spanning multiple clouds.

Data as a product

This principle can be summarized as applying product thinking to data. Product thinking advocates that organizations treat data with the same care and attention they give the products they build for customers. However, because most organizations think of data as a by-product, there is little incentive to package and share it with others. For this reason, it is not surprising that 87% of data science projects never make it to production.

Data becomes a first-class citizen in a data mesh architecture with its development and operations teams behind it. Building on the principle of domain-oriented data ownership, data product managers release data in their domains to other teams in the form of a “product.” Product thinking recognizes the existence of both a “problem space” (what people require) and a “solution space” (what can be done to meet those needs). Applying product thinking to data will ensure the team is more conscious of data and its use cases. It entails putting the data’s consumers at the center, recognizing them as customers, understanding their wants, and providing the data with capabilities that seamlessly meet their demands. It also answers questions like “what is the best way to release this data to other teams?” “what do data consumers want to use the data for?” and “what is the best way to structure the data?”

Self-serve data infrastructure as a platform

The principle of creating a self-serve data infrastructure is to provide tools and user-friendly interfaces so that generalist developers (and non-technical people) can quickly get access to data or develop analytical data products speedily and seamlessly. In a recent McKinsey survey, organizations reported spending up to 80% of their data analytics project time on repetitive data pipeline setup, which ultimately slowed down the productivity of their data teams.

The idea of the self-serve data infrastructure as a platform is that there should be an underlying infrastructure for data products that the various business domains can leverage in an organization to get to the work of creating the data products rapidly. For example, data teams should not have to worry about the underlying complexity of servers, operating systems, and networking. Marketing teams should have easy access to the analytical data they need for campaigns. Furthermore, the self-serve data infrastructure should include encryption, data product versioning, data schema, and automation. A self-service data infrastructure is critical to minimizing the time from ideation to a working data-driven application.

Federated computational governance

This principle advocates that data is governed where it is stored. The problem with centralized data platforms is that they do not account for the dynamic nature of data, its products, and its locations. In addition, large datasets can span multiple regions, each having its own data laws, privacy restrictions, and governing institutions. As a result, implementing data governance in this centralized system can be burdensome.

The data mesh more readily acknowledges the dynamic nature of data and allows for domains to designate the governing structures that are most suitable for their data products. Each business domain is responsible for its data governance and security, and the organization can set up general guiding principles to help keep each domain in check.

While the data mesh model is prescriptive in many ways about how organizations should leverage technology to implement its principles, perhaps the more significant implementation challenge is how data flows between business domains.

Deploy an API spec in low-code for your Data Mesh with Striim

For businesses looking to leverage the power of Data Mesh, Striim is an ideal platform to consider. It provides a comprehensive suite of features that make it easy to develop and manage applications in multiple cloud environments. The low-code, SQL-driven platform allows developers to quickly deploy data pipelines while a comprehensive API spec enables custom and scalable management of data streaming applications. Additionally, Striim offers resilience and elasticity that can be adjusted depending on specific needs, as well as best practices for scalability and reliability.

The data streaming capabilities provided by Striim are fast and reliable, making it easy for businesses to get up and running quickly. Its cloud-agnostic features let users take advantage of multiple cloud environments for wider accessibility, and its comprehensive set of connectors makes it simple to integrate external systems into your data mesh.

While monolithic data operations have accelerated adoption of analytics within organizations, centralized data pipelines can quickly grow into bottlenecks due to lack of domain ownership and focus on results.

To address this problem, data mesh architectures are rising in popularity. A data mesh is an approach to designing modern distributed data architectures that embraces decentralized data management.

Benefits of using a Data Mesh

A domain-oriented, decentralized approach to data enables faster and more efficient real-time cross-domain analysis. A data mesh is based on four fundamental principles that make it a uniquely productive way to extract value from real-time data. The first principle is domain ownership, which allows domain teams to take ownership of their data and puts decision making in the hands of domain experts. The second principle treats data as a product: teams outside the domain can use the data when required, and the product philosophy ensures its quality. The third principle is a self-serve data infrastructure platform, where a dedicated team provides tools to maintain interoperable data products, easing their creation and enabling seamless consumption by all domains. The final principle is federated governance, which sets global policies on the standardization of data; representatives of every domain agree on policies such as interoperability (e.g., source file formats) and role-based access for security, privacy, and compliance.

In short, Striim is an excellent choice for companies looking to implement a data mesh solution thanks to its fast data streaming capabilities, low-code development platform, comprehensive APIs, resilient infrastructure options, cloud-agnostic features, and support for distributed data architectures. By leveraging these features, businesses can ensure their data mesh runs smoothly, allowing them to take advantage of real-time analytics and event-driven architectures for their operations!

Example of a data mesh for a large retailer using Striim. Striim continuously reads the operational database transaction logs from disjointed databases in their on-prem data center, continuously syncing data to a unified data layer in the cloud. From there, streaming data consumers (e.g. a mobile shopping app and a fulfillment speed analytics app) consume streaming data to support an optimal customer experience and enable real-time decision making.

Benefits of using Striim for Data Mesh Architecture

Using Striim for Data Mesh architecture provides a range of benefits to businesses. As an enterprise-grade platform, Striim enables the quick deployment and management of data meshes, with automated data mapping and real-time analytics capabilities, making it an ideal solution for businesses looking to build their own Data Mesh.

Striim’s low-code development platform allows businesses to rapidly set up their data mesh without needing extensive technical knowledge or resources. Additionally, they can make use of comprehensive APIs to easily integrate external systems with their data mesh across multiple cloud environments. Automated data mapping capabilities help streamline the integration process by eliminating the need for manual processing or complex transformations when dealing with large datasets from different sources.

Real-time analytics are also facilitated by Striim with its robust event-driven architectures that provide fast streaming between systems as well as secure authentication mechanisms for safeguarding customer data privacy during transmission over networks. These features offer businesses an optimal foundation on which they can confidently construct a successful data mesh solution using Striim’s best practices.

Best practices for building and optimizing a Data Mesh with Striim

Building and optimizing a data mesh with Striim requires careful planning and implementation. It’s important to understand the different use cases for a data mesh and choose the right tool for each one. For example, if data is being exchanged between multiple cloud environments, it would make sense to leverage Striim’s cloud-agnostic capabilities. It’s also important to ensure that all components are properly configured for secure and efficient communication.

Properly monitoring and maintaining a data mesh can help organizations avoid costly downtime or data loss due to performance issues. Striim provides easy-to-use dashboards that provide real-time insights into your event streams, allowing you to quickly identify potential problems. It’s also important to plan for scalability when building a data mesh since growth can often exceed expectations. Striim makes this easier with its automated data mapping capabilities, helping you quickly add new nodes as needed without disrupting existing operations.

Finally, leveraging Striim’s real-time analytics capabilities can help organizations gain greater insight into their event streams. By analyzing incoming events in real time, businesses can quickly identify trends or patterns they might have otherwise missed by simply relying on historical data. This information can then be used to improve customer experiences or develop more efficient business processes. With these best practices in mind, companies can ensure their data mesh is secure, efficient, and optimized for maximum performance.

Conclusion – Is a Data Mesh architecture the right solution for your event stream solution?

When it comes to optimizing your event stream architecture, data mesh is a powerful option worth considering. It offers numerous advantages over traditional architectures, including automated data mapping, cloud-native capabilities, scalability, and elasticity. Before committing resources towards an implementation, organizations should carefully evaluate its suitability based on their data processing needs, dataset sizes, and existing infrastructure.

Organizations that decide to implement a Data Mesh solution should use Striim as their platform of choice to reap the maximum benefits of this revolutionary architecture. With its fast data streaming capabilities, low-code development platform, and comprehensive APIs, businesses can make sure their Data Mesh runs smoothly and take advantage of real-time analytics capabilities and event-driven architectures.

 

Using Kappa Architecture to Reduce Data Integration Costs

Kappa Architectures are becoming a popular way of unifying real-time (streaming) and historical (batch) analytics giving you a faster path to realizing business value with your pipelines.

Treating batch and streaming as separate pipelines for separate use cases drives up complexity, cost, and ultimately deters data teams from solving business problems that truly require data streaming architectures.

Kappa Architecture combines streaming and batch while simultaneously turning data warehouses and data lakes into near real-time sources of truth.

Showing how Kappa unifies batch and streaming pipelines

The development of Kappa architecture has changed data processing by giving teams a quick, cost-effective way to reduce data integration costs. Kappa architecture enables near-real-time data processing, making it ideal for companies that need to process large amounts of data quickly. Striim offers an easy-to-use platform with drag-and-drop functionality and pre-built components that make it simple to build a kappa architecture. In this article, we will look at the benefits and drawbacks of kappa architecture, how Striim makes it easier to use, what infrastructure you need for your kappa architecture, and how you can start designing your own with a free version of Striim’s unified data integration and streaming platform.

Overview of kappa architecture

Kappa architecture is a powerful data processing architecture that enables near-real-time data processing. Unlike Lambda architecture, which requires two different systems (one for streaming data and another for batch processing), kappa handles both workloads through a single stream-processing pipeline, so companies can process large volumes of data quickly and efficiently, even with frequent changes in the data structure. Stream processors, storage layers, message brokers, and databases make up the basic components of this architecture.

The goal of kappa architecture is to reduce the cost of data integration by providing an efficient and real-time way of managing large datasets. By eliminating manual processes such as ETL (extract-transform-load) systems, companies can save time and money while still leveraging advanced technologies like machine learning and artificial intelligence (AI). Striim offers an intuitive UI with drag-and-drop functionality as well as prebuilt components to help users design their own custom kappa architectures. With its free version also available, businesses can start building their own system right away without needing expensive consultants or weeks spent configuring complex systems.
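The core Kappa idea, one processing function serving both a replay of the retained log (“batch”) and live events (“streaming”), can be sketched minimally in Python (illustrative, not Striim code):

```python
# Illustrative Kappa sketch: ONE codebase processes both a replayed log
# and live events, instead of two separate batch/stream pipelines.

def process(event, state):
    """The single codebase: maintain a running (count, sum) per key."""
    key = event["key"]
    count, total = state.get(key, (0, 0))
    state[key] = (count + 1, total + event["value"])
    return state

retained_log = [{"key": "a", "value": 10}, {"key": "b", "value": 5}]
live_events = [{"key": "a", "value": 7}]

state = {}
for e in retained_log:      # "batch" = replaying the log from offset 0
    process(e, state)
for e in live_events:       # "streaming" = the same code on new arrivals
    process(e, state)

print(state)  # {'a': (2, 17), 'b': (1, 5)}
```

Because replay and live ingest share one code path, there is no second ETL pipeline to build, reconcile, or maintain, which is where the cost savings come from.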

Kappa architectures have changed the way businesses approach big data solutions, allowing them to take advantage of cutting-edge technologies while reducing the costs associated with manual processes like ETL systems. With Striim’s unified platform making it easier than ever to build a custom Kappa architecture tailored to your business needs, you can start designing your own system today.

Benefits of kappa architecture for data integration

Kappa architecture is quickly gaining popularity due to its ability to enable near-real-time data processing and reduce the complexity associated with data integration. By utilizing a single codebase for both streaming and batch processing, businesses can reap multiple benefits from this solution. This simplification drastically cuts down on development resources needed as well as infrastructure setup and maintenance costs. Additionally, it allows for efficient processing of both real-time and historical data which eliminates the need for multiple versions of the same dataset or manually managed systems.

The versatility of Kappa architectures makes them suitable for many industries, such as healthcare, finance, retail, telecom, and energy. Companies can leverage this technology to create analytics solutions tailored to their individual needs, capable of handling substantial amounts of streaming data in real time without latency issues. Moreover, users can design their own system with Striim’s unified platform, which features an intuitive UI with drag-and-drop functionality, plus a free version so businesses can get started straight away.

In sum, Kappa architectures offer real advantages for those looking to reduce data integration costs while using cutting-edge technologies. With Striim’s unified platform, businesses have access to a range of features that make designing their own system easy and straightforward, at an affordable cost or even for free.

Drawbacks of kappa architecture

Kappa architecture has revolutionized the way businesses process and store data, allowing them to take advantage of cutting edge technologies while reducing costs associated with manual processes. However, this technology is not without its drawbacks.

The complexity of setting up and maintaining a kappa architecture can be very high, requiring specialized engineers to ensure that all components are properly configured and functioning correctly. Additionally, without a centralized system for managing data, it can be difficult for businesses to maintain data governance across their organization. This lack of centralization also means that each component must be independently managed, leading to higher costs in terms of additional computing resources.

Another limitation of kappa architecture is scalability. As more data is processed through the system, it will require more computing resources in order to remain efficient and effective. This makes scaling the architecture complex and costly, as businesses will need to invest in additional hardware or cloud computing services in order to handle larger volumes of data processing.

Finally, kappa architectures are not suitable for all types of data processing tasks. While they are well suited for near-real-time analytics applications, they may not be the best choice for batch processing jobs or those that require intensive computation or machine learning algorithms. It’s important for businesses to assess their individual needs before deciding if kappa architectures are the right choice for reducing their data integration costs.

How Striim overcomes these drawbacks to make Kappa simple and affordable

Kappa architecture is an incredibly powerful tool for businesses looking to quickly and cost-effectively reduce data integration costs, but it does have some drawbacks that can make it difficult to use. Striim’s platform overcomes these drawbacks by making it easy and affordable to build a Kappa architecture.

Striim’s real-time streaming capabilities allow users to capture data from over 150 sources in near real time, which eliminates the need for manual processes. Striim users can also see cost reductions of over 90% when using its smart data pipelines.

In addition, Striim has a range of pricing plans available, so businesses can find the plan that best suits their needs, from the free Striim Developer tier to the Mission Critical offering, the industry’s only horizontally scalable, unified data streaming platform delivered as a managed service for maximum uptime SLAs and performance.

The intuitive UI, drag-and-drop functionality, and pre-built components make building a Kappa architecture quick and easy. This reduces the complexity associated with configuration and maintenance, allowing users to get up and running in no time. Plus, Striim’s free version allows users to start designing their Kappa architecture without any upfront cost, making it a fit for businesses of all sizes. The platform also provides granular control over data contracts for data delivery and schema SLAs. With its real-time streaming capabilities, cloud integration options, pricing plans that fit various budgets, intuitive UI with drag-and-drop functionality, pre-built components, and free version, Striim makes building a Kappa architecture simple and affordable, an ideal tool for businesses looking to reduce their data integration costs while taking advantage of cutting-edge technologies.

Choosing the right infrastructure for kappa architecture

When setting up a kappa architecture, businesses have to choose between cloud and on-premise solutions. Cloud-based architectures are more cost-effective but lack the control of an on-premise setup. On the other hand, an on-premise architecture provides more control but can be more expensive and difficult to manage. Each option has its own advantages and disadvantages, so companies should carefully weigh their needs before deciding which type of infrastructure is right for them.

The components needed to create a successful kappa architecture vary depending on the setup chosen, but generally include storage, compute, networking resources, and some form of data integration software. Companies should ensure they have enough resources available in order to avoid any performance issues as data volumes increase over time. Additionally, businesses should plan for scalability and high availability in order to ensure that their system can handle large amounts of data without disruption or loss of service.

Cost optimization is also an important consideration when building a kappa architecture. Companies need to balance performance requirements with financial constraints in order to get the most out of their investment while still ensuring reliability and stability. Additionally, they should follow industry best practices such as using containerized workloads for portability and leveraging managed services such as databases and message brokers whenever possible. Finally, companies should keep abreast of emerging trends in kappa architectures such as serverless computing or streaming automation tools that could help them further reduce costs while improving efficiency and scalability.

Ultimately, choosing the right infrastructure for a kappa architecture requires careful consideration of individual needs while keeping cost optimization in mind. Businesses should assess their performance requirements alongside financial constraints in order to build a reliable system that meets both goals while taking advantage of industry best practices and emerging trends wherever possible.

Leveraging Striim’s unified data integration and streaming platform to build your kappa architecture

Building a kappa architecture with Striim’s unified data integration and streaming platform is an easy and cost-effective solution that can help businesses reduce their data integration costs. With its intuitive UI, drag-and-drop functionality and pre-built components, Striim’s platform makes it simple to construct the architecture quickly.

The platform is optimized to support a wide range of data sources, including both structured and unstructured data. This allows users to easily manage all their data in one place, while also allowing them to scale up or down as needed for peak performance. Additionally, Striim’s platform provides cloud integration options for popular cloud platforms like Amazon Web Services and Microsoft Azure.

Striim’s platform is designed with scalability in mind, making it easy for businesses to handle large volumes of real-time streaming data without any latency issues or downtime. Additionally, the platform provides automated monitoring capabilities that enable companies to ensure their architecture remains reliable and stable. Furthermore, the platform also offers several other features that make it easier for businesses to manage their kappa architectures such as advanced analytics tools, machine learning algorithms, security features and more.

In addition to its powerful features, Striim’s unified data integration and streaming platform comes with a free version that allows users to get started quickly and cost-effectively – without having to pay any upfront costs. This makes it an ideal choice for businesses looking for ways to reduce their data integration costs while taking advantage of cutting edge technologies like kappa architectures.

Start architecting your Kappa Architecture today by talking to one of our specialists or trying Striim for free.

Striim Achieves Google Cloud Ready — Cloud SQL Designation

We are proud to announce that Striim has achieved the Google Cloud Ready – Cloud SQL designation for Google Cloud’s fully managed relational database service for MySQL, PostgreSQL, and SQL Server. This designation recognizes Striim’s partnership with Google Cloud and the joint commitment to being part of customers’ cloud adoption and app modernization journeys and to becoming instrumental in their business innovations.

Alok Pareek, Co-founder and Executive Vice President of Products and Engineering at Striim, shared: “Striim is excited to be part of the Google Cloud Ready — Cloud SQL designation. Major enterprise customers leverage Striim to continuously move data from on-premise and cloud-based mission-critical databases into Google Cloud SQL for digital transformation. Striim seamlessly connects to Cloud SQL and enables operational data to be synced via snapshot and incremental CDC workloads in real time. This helps our joint customers innovate for example by feeding ML models in real time and leveraging Cloud SQL’s generative AI capabilities such as using the new pgvector PostgreSQL extension for storing vector embeddings.”

The Google Cloud Ready – Cloud SQL designation is designed to help businesses get started quickly with their cloud-based projects. Through this program, customers can deploy applications on the cloud with confidence knowing that they are backed by a trusted partner who has been through rigorous testing and certification processes. Our team is excited about this opportunity to continue to work closely with Google Cloud —and we’re eager to help customers leverage their existing investments in cloud technologies while leveraging our expertise in data streaming to Cloud SQL targets.

As part of the program, Striim continues to collaborate closely with Google Cloud partner engineering and Cloud SQL teams to develop joint roadmaps and provide Google-approved, industry-standard solutions for integration use cases.

Striim is committed to providing comprehensive support for Google Cloud services across all industries. Our team of experienced engineers will work closely with customers to ensure successful deployments on Google Cloud while preserving their current data architecture. We are thrilled about this new partnership with Google Cloud and look forward to helping our customers take advantage of all its features for efficient database management.

If you’re interested in learning more about Striim’s Google Cloud Ready – Cloud SQL designation, please visit us at booth 532 during Google Next 2023, August 29–31 in San Francisco!

 

Real-Time Data Stories Powering Gen AI & Large Language Models (LLM)

Striim may be pronounced ‘stream’, but the way Striim streams data is more than just classic streaming. ‘Striiming’ data ensures that the ever-changing story within data is perpetually told, at any time. This story is ‘Real-Time’.

HOW FAR HAVE WE COME?

In 1687, Newton observed in his third law that every action has a reaction. Apple falling on head = Ouch! Genius. Yet more than 300 years later, time delays are still endured between business-related actions and a business’s reaction.

Jump forward to Henry Ford in 1913: his moving assembly line cut the time to build a car from over 12 hours to about 1.5 hours. “Batch” production was valuable then; today, however, it is the dated principle of ‘batch processing of data’ that holds back many businesses from essential, real-time, business-critical insights. And this comes at the expense of business results, not to mention the increasing cost of trying to make batch processing run quicker. Nobody likes finding out what is going on once it is too late to do anything about it. Ouch.

WHY REAL-TIME MATTERS

Life happens in real time. Business customers and consumers expect organizations to respond, react, and manage business in real time, whether across omnichannel consumer environments, supply chain decision-making, life-critical medical decisions, or responses to changes in the weather, exchange rates, stock prices, whims, needs, wants, power outages, and more. All of these are examples of where real-time processing of data can save and enrich lives and be the powerhouse for the AI automation and LLMs that are transforming business operations.

LLM?

By LLM we of course mean Large Language Models: you know, where you get a human-like response to a human-like question from a machine. A world where the machines learn (ML) and get smarter at making predictions that help us. This, along with ChatGPT and NLP (Natural Language Processing), all comes under the generic banner of Generative AI, the modern-day evolution of good old Artificial Intelligence (AI).

MODERN GENERATIVE EXPECTATIONS

Many of us have experienced the real-time effect: targeted offers find us on our devices just seconds after we utter a product word. However, not many people are aware of the underlying advances in data streaming that power these instant, accurately targeted outcomes. It is something to do with the continuous, real-time, simultaneous homogenization of petabytes of data from numerous sources, modeled to drive automated reactions at the right (real) time thanks to clever algorithms. Or it’s pretty much that. Actions happen, and a real-time, appropriate reaction can be generated. Real-time intelligent streaming of data is allowing Generative AI to enact Newton’s third law: action, reaction.

ACTIONS, REACTIONS IN REAL-TIME

The old IT world is still rife with “batch” processing of data, which entails multiple individual data-source transfers to (usually) one cloud or data lake, where different treatments are then applied, one after the other, to clean, dedupe, curate, govern, merge, wrangle, and scrub the data as best as possible to make it passable for AI. Often this happens in a mysterious, inexplicable way, where dubious data can create “hallucinated” results or impose a bias. Too many people suffer from the delusion that ‘if a result appears on a pretty dashboard, it must be true’, rather than facing the reality of ‘garbage in, garbage out’.

THE RISE OF SMART STREAMING

Streaming is not new; Striiming data is. Streaming is the continuous flow of data from a source to a target. It is fast and can ingest from databases via Change Data Capture (CDC). The difference with “Intelligent Streaming”, or “Striiming”, data is that multiple sources of data are extracted via CDC, intelligently integrated (ii), and Striimed simultaneously. And in the same split second, the data is cleaned, cross-connected, and correlated, with data science models applied to the data in transit. It arrives in the flexible cloud environment action-ready. AI-ready. That helps explain how, when things happen in the real world, there can be a real-time response. Action, reaction. Genius!

It is the rocket fuel for agile cloud environments such as Snowflake and Snowpark, Google BigQuery and Vertex AI, Databricks, Microsoft Azure ML, Amazon SageMaker, and data science platforms like Dataiku.

STRIIMING FOR LARGE LANGUAGE MODELS (LLM)

Large Language Models are like a professional business version of ChatGPT. An LLM is not just a trawl and regurgitation of the internet: the difference is that it can be anchored to vast sets of real organizational data that is defined, structured, and can be openly challenged for provenance, legacy, and validity. Striiming data can differentiate in these LLM contexts thanks to the ability to access huge volumes of historic and newly generated real-time data. This reveals new, true stories that human brainpower could never fathom and that batch processing struggles to cater for, allowing predictions and actions to be output within seconds of events occurring.

NEW “ONLINE MACHINE LEARNING”

The word ‘online’ here does not mean ‘being online’; it refers to a fresh-feedback style of ML. Continuously Striiming data is a real-time enabler of Online Machine Learning, a superior form of model training that perpetually feeds fresh, continuously Striimed data to the training models, as opposed to conventional ML, which trains from an initial, static data set.

Online ML facilitates significantly higher prediction rates and accuracy from the machines, and helps explain the breathtaking speed and accuracy with which answers to questions appear to us, in perfectly articulated words and figures, as the outputs of LLMs.

DATA TELLS A STORY – NOW YOU CAN READ IT

Before Striiming, it was thought too complex to interrogate vast oceans of deeply tangled and submerged data. Not so now. The in-memory compute power of Striim and its Striiming approach can add the relevance of this historic data, combining its story and meaning with other carefully selected, continuously changing, current-day data from live events and actions. Hence online ML served by Striiming data can yield better forecasts, predictions, and reactions.

SO WHAT? 

Well, saving lives, for one. But let’s look at some other real-life scenarios. Picture hundreds of cameras at an airport capturing gigabytes of intel on file: how many people, where, how many suitcases, their size, drivers, pilots, engines, parts, fuel, brakes, failures, fixes, stock, threats, tourists, delays. This airport ballet plays out every day: seemingly unrelated scenarios, actions, reactions, and stories captured and recorded within yottabytes of file data. The cameras capture patterns and meaning far beyond human comprehension, yet the story, in the context of other cross-referenced real-time data, is of huge importance to those who can extract meaning from the data in real time. So what? It means getting staff, passengers, luggage, and parts to the right place at the right time. For at least one airline, it ensures hundreds more planes take off safely and on time, saving an estimated $1 million each time a plane is not delayed waiting for a part or a person.

HUMANS ARE THINKING MORE LIKE COMPUTERS

Humans are getting smarter, and data science expertise grows at an impressive rate, but arguably what is fueling the greatest impact on LLMs and Gen AI is the speed and quality of data prepared ready-made for the new models, algorithms, and ML recipes. Sure, AI teaches itself from legacy and new data oceans. But remember: humans are the creators of these new data Striiming methods and the models that yield the results. Humans have learned to think like computers (actions). So no wonder the computers seem to be thinking more like humans (reactions).

CONCLUSIONS DRAWN

In my humble and biased opinion, this is finally the best evidence, application, and observation of Newton’s third law of motion within Generative AI: actions and instant AI reactions for large enterprises, solving old problems in a new, real-time way, saving lives and money, making money, and mitigating risk. The same problems we have always had. Only now, Striiming solutions, by virtue of CDC and “ii” (intelligent integration), are certainly a next-generation, powerful way to solve them.

Remember. Don’t stream data when you can Striim data.  

Aye Aye…  (ii). Roger over and out. Email me Roger.Nash@Striim.com.

 

When Change Data Capture Wins

A guide on when real-time data pipelines are the most reliable way to keep production databases and warehouses in sync.

Photo by American Public Power Association on Unsplash

 

Co-written with John Kutay of Striim

Data warehouses emerged after analytics teams slowed down the production database one too many times. Analytical workloads aren’t meant for transactional databases, which are optimized for low-latency reads, writes, and data integrity. Similarly, there’s a reason production applications run on transactional databases.

Definition: Transactional (OLTP) data stores are databases that guarantee ACID (atomicity, consistency, isolation, and durability) properties for transactions. Examples include PostgreSQL and MySQL, which can scale to 20 thousand transactions per second.

Analytics teams aren’t quite so concerned with inserting 20 thousand rows in the span of a second — instead, they want to join, filter, and transform tables to get insights from data. Data warehouses optimize for precisely this using OLAP.

Definition: OLAP (online analytical processing) databases optimize for multidimensional analysis of large volumes of data. Examples include popular data warehouses like Snowflake, Redshift, and BigQuery.

Different teams, different needs, different databases. The question remains: if analytics teams use OLAP data warehouses, how do they get populated?

Image by authors

Use CDC to improve data SLAs

Let’s back up a step. A few examples of areas analytics teams own:

  • Customer segmentation data, sent to third party tools to optimize business functions like marketing and customer support
  • Fraud detection, to alert on suspicious behavior on the product

If these analyses are run on top of a data warehouse, the baseline amount of data required in the warehouse is just from the production database. Supplemental data from third party tools is very helpful but not usually where analytics teams start. The first approach usually considered when moving data from a database to a data warehouse is batch based.

Definition: Batch process data pipelines involve checking the source database on scheduled intervals and running the pipeline to update data in the target (usually a warehouse).

There are technical difficulties with this approach, most notably the logic required to know what has changed in the source and what needs to be updated in the target. Batch ELT tools have largely taken this burden off of data professionals. No batch ELT tool, however, has solved the biggest caveat of them all: data SLAs. Consider a data pipeline that runs every three hours. Any pipeline that runs independently on top of that data, even if it also runs every three hours, would in the worst case serve data six hours out of date. For many analyses, a six-hour delay doesn’t move the needle. This raises the question: when should teams care about data freshness and SLAs?
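The worst-case math in the example above is simple: when batch pipelines run on independent schedules, their intervals add up. A quick sanity check:

```shell
#!/usr/bin/env sh
# Worst-case staleness of chained batch pipelines: a row can land just
# after stage 1 fires, and its output just after stage 2 fires, so the
# per-stage intervals add up.
stage1_interval_hours=3   # source database -> warehouse
stage2_interval_hours=3   # warehouse -> downstream models

worst_case_hours=$((stage1_interval_hours + stage2_interval_hours))
echo "Worst-case data staleness: ${worst_case_hours} hours"
```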

Definition: An SLA (service level agreement) is a contract between a vendor and its customers as to what they can expect from the vendor when it comes to application availability and downtime. A data SLA is an agreement between the analytics team and its stakeholders around how fresh the data is expected to be.

When fresh data makes a meaningful impact on the business, that’s when teams should care. Going back to the examples of analytics team projects, if a fraudulent event happens (like hundreds of fraudulent orders) time is of the essence. A data SLA of 3 hours could be what causes the business to lose thousands of dollars instead of less than $100.

When freshness can’t wait — cue CDC, or change data capture. CDC tools read change logs on databases and mimic those changes in the target data. This happens fairly immediately, with easy reruns if a data pipeline encounters errors.

With live change logs, CDC tools keep two data stores (a production database and an analytics warehouse) identical in near real time. The analytics team is then running analyses on data as fresh as a daisy.
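Concretely, each committed change arrives as a structured event the CDC tool can replay against the target. With PostgreSQL’s wal2json logical decoding plugin, for example, an update looks roughly like this (table and values are illustrative):

```json
{
  "change": [
    {
      "kind": "update",
      "schema": "public",
      "table": "orders",
      "columnnames": ["id", "status", "updated_at"],
      "columnvalues": [42, "shipped", "2023-05-01 12:00:05"],
      "oldkeys": { "keynames": ["id"], "keyvalues": [42] }
    }
  ]
}
```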

Getting started with CDC

Image by authors

The most common production transactional databases are PostgreSQL and MySQL, which have both been around for decades. Being targets more often than sources, warehouses don’t usually support CDC in the same way (although even this is changing).

To set up a source database for CDC, you need to:

  • Make sure write-ahead logging (WAL) is enabled and the WAL timeout is high enough. This is configured in the database settings directly.
  • Make sure archive logs are stored on the source based on the CDC tool’s current specifications.
  • Create a replication slot, where a CDC tool can subscribe to change logs.
  • Monitor source and target database infrastructure to ensure neither is overloaded.
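On PostgreSQL, for example, these steps boil down to a couple of server settings and a little SQL. The slot name and decoding plugin below are illustrative; your CDC tool’s documentation names the exact ones it expects:

```sql
-- postgresql.conf (server restart required):
--   wal_level = logical
--   max_replication_slots = 4
--   max_wal_senders = 4

-- Create a logical replication slot the CDC tool can subscribe to.
SELECT pg_create_logical_replication_slot('cdc_slot', 'wal2json');

-- Monitor the slot: an inactive slot retains WAL indefinitely and can
-- fill the source database's disk.
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
```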

On the source database, if a row of data changes to A, then to value B, then to A, this behavior is replayed on the target warehouse. The replay ensures data integrity and consistency.

While open source CDC solutions like Debezium exist, hosted CDC solutions allow users to worry less about infrastructure and more about the business specifications of the pipeline, unique to their business.

As a consultant in analytics and go-to-market for dev tools, I previously led the data engineering function at Perpay, where I built out a change data capture stack. From my perspective, change data capture isn’t just about real-time analytics. It’s simply the most reliable and scalable way to copy data from an operational database to analytical systems, especially when downstream latency requirements are at play.

How to Build and Deploy a Custom Striim Image to Google Cloud Platform with HashiCorp Packer


In this article, I’ll share how to create a custom Striim CentOS image with Packer, deploy it, and incorporate it into your infrastructure and DevOps stack. Before we set up our environment and start building and deploying the image, let’s go through the definitions of both tools.

What is HashiCorp Packer?

HashiCorp Packer is an open-source Infrastructure-as-Code (IaC) tool that enables you to quickly build and deploy custom images for cloud and on-premises environments. With Packer, you can create custom image builds for a variety of platforms, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and more. You can use Packer to automate the process of building images from scratch, including creating, configuring, and optimizing them. Additionally, Packer can be used to efficiently deploy those images in multiple cloud locations or on-premises.

Packer is great for automating the creation of machine images so that they can be deployed quickly and easily. The tool supports a variety of image formats and has built-in support for various configuration management tools, such as Chef and Puppet. By using Packer, you can ensure that your images are always up-to-date and properly configured. This makes it easier to manage and maintain your infrastructure in the cloud or on-premises.

What is Striim?

Striim is an end-to-end streaming platform for real-time data integration, complex event processing, and analytics. It is designed to ingest, analyze, and deliver massive volumes of data from multiple sources, including databases, files, messaging systems, and IoT devices.

By integrating Packer into the Striim deployment process, we can quickly create an automated environment that enables your analytics team to immediately replicate real-time data to data warehouses and/or RDBMS databases.

Pre-requisites:

  1. An available Linux (CentOS, Ubuntu, or SUSE) machine.
  2. Install Packer in the machine. More information: https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli
  3. Create a Service Account in GCP with Compute Instance Admin (v1) and Service Account User roles attached to it and generate the key. More info: https://blog.knoldus.com/how-to-create-a-custom-image-using-packer-in-gcp/#create-a-service-account-in-gcp
  4. A Striim license. 

 

Setting Up Your Environment

Log in to your VM and verify Packer is installed by running this command:

$ packer --version
1.8.6

Once verified, copy the JSON key from the GCP service account to the home directory:

$ ls
account_key.json

Export the following environment variables for later configuration use:

# Values from the Striim license
export company_name=<company_name_from_striim_license>
export cluster_name=<cluster_name_from_striim_license>
export license_key=<license_key_from_striim_license>
export product_key=<product_key_from_striim_license>

# Setting up the passwords for Keystore, admin, and sys users
export keystore_pass=<keystore_pass_for_striim_config>
export admin_pass=<admin_pass_for_striim_config>
export sys_pass=<sys_pass_for_striim_config>
export mdr_type=<mdr_type_for_striim_config>

 

Creating Your Striim Image

Create a shell script named striim_install.sh and add the following code to it:

 

This shell script installs Striim (v4.1.2) and the Java JDK (v1.8.0), and configures Striim using the environment variables that we exported in the previous section. Note: the script installs Striim and the Java JDK only on CentOS, Red Hat, Amazon Linux 2, and SUSE Linux machines.
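As a rough sketch of what such an installer does (the download URL, file paths, and property names below are placeholders for illustration, not Striim’s official ones):

```shell
#!/usr/bin/env bash
# striim_install.sh -- illustrative sketch only. The download URL, file
# paths, and property names are placeholders, not Striim's official ones.
set -eu

STRIIM_VERSION="4.1.2"                         # per the article
JDK_PACKAGE="java-1.8.0-openjdk"               # JDK 1.8.0 per the article
STRIIM_RPM="striim-node-${STRIIM_VERSION}.rpm" # hypothetical package name

install_striim() {
  # 1. Install the JDK from the distro repositories (CentOS/RHEL-style).
  sudo yum install -y "${JDK_PACKAGE}"

  # 2. Fetch and install the Striim node package (placeholder URL).
  curl -fsSL -o "/tmp/${STRIIM_RPM}" "https://downloads.example.com/${STRIIM_RPM}"
  sudo rpm -ivh "/tmp/${STRIIM_RPM}"

  # 3. Write the cluster configuration from the environment variables
  #    exported earlier (company_name, cluster_name, license_key, ...).
  sudo tee /opt/striim/conf/startUp.properties >/dev/null <<EOF
CompanyName=${company_name}
WAClusterName=${cluster_name}
LicenceKey=${license_key}
ProductKey=${product_key}
EOF
}

# Only install when invoked with --install, so the script can be sourced
# or inspected without side effects.
if [ "${1:-}" = "--install" ]; then
  install_striim
fi
```

The Packer shell provisioner would then execute `/tmp/striim_install.sh --install` inside the build VM.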

Create a JSON file named packer_striim_image.json and copy the following code:

The variables section declares variables for Packer from the environment variables that we exported in the previous section. More info: https://developer.hashicorp.com/packer/guides/hcl/variables

The builders section allows us to define the cloud provider and provide information about our GCP project. More info: https://developer.hashicorp.com/packer/plugins/builders/googlecompute

The provisioners section uses built-in and third-party software to install and configure the image after booting. In our case, we are copying our striim_install.sh to the machine’s /tmp/ directory and executing the script to install Striim and its dependencies. More info: https://developer.hashicorp.com/packer/docs/provisioners
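Putting those three sections together, a minimal template might look like the following. The project ID, zone, and image family are placeholders; see the `googlecompute` builder documentation for the full option list:

```json
{
  "variables": {
    "company_name": "{{env `company_name`}}",
    "cluster_name": "{{env `cluster_name`}}",
    "license_key": "{{env `license_key`}}",
    "product_key": "{{env `product_key`}}"
  },
  "builders": [
    {
      "type": "googlecompute",
      "account_file": "account_key.json",
      "project_id": "my-gcp-project",
      "source_image_family": "centos-7",
      "zone": "us-central1-a",
      "ssh_username": "packer",
      "image_name": "gcp-custom-striim-image-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "file",
      "source": "striim_install.sh",
      "destination": "/tmp/striim_install.sh"
    },
    {
      "type": "shell",
      "environment_vars": [
        "company_name={{user `company_name`}}",
        "cluster_name={{user `cluster_name`}}",
        "license_key={{user `license_key`}}",
        "product_key={{user `product_key`}}"
      ],
      "inline": [
        "chmod +x /tmp/striim_install.sh",
        "sudo -E /tmp/striim_install.sh"
      ]
    }
  ]
}
```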

Once these files are created, we should see a file structure like the one below in our home directory:

home/
  account_key.json
  packer_striim_image.json
  striim_install.sh

Deploying Your Image to Google Cloud Platform

Let’s run the following commands to validate our Packer JSON file:

$ packer validate packer_striim_image.json
The configuration is valid.

Once we’ve verified that it is valid, we can build our image:

$ packer build packer_striim_image.json

...
==> Wait completed after 3 minutes 36 seconds

==> Builds finished. The artifacts of successful builds are:
--> googlecompute: A disk image was created: gcp-custom-striim-image-167888819

We can copy the image name gcp-custom-striim-image-167888819 and use it to build a VM in GCP to check that Striim is correctly installed on this image.


We can access the Striim UI by navigating to <public_or_private_ip>:9080 in the browser once the VM is in the “Running” state.


In conclusion, Packer is a powerful tool that can be used to create a more efficient infrastructure-as-code approach. With Packer, image builds and deployments can be automated in a secure and consistent manner. This automation allows you to quickly build and deploy Striim images without worrying about manual configuration errors. In addition, Striim Cloud can fully automate this entire process. Visit the Striim Cloud page for a fully managed SaaS solution with a Pay As You Go option to reduce total cost of ownership.

How to Use Terraform to Automate the Deployment of a Striim Server

Deploying a server can be a time-consuming process, but with the help of Terraform, it’s easier than ever. Terraform is an open-source tool that automates the deployment and management of infrastructure, making it an ideal choice for quickly and efficiently setting up a Striim server in the cloud or on-premise. With the help of Striim’s streaming Extract, Transform, and Load (ETL) data platform, data can be replicated and transformed in real time, with zero downtime, from a source database to one or more target database systems. Striim enables your analytics team to work more efficiently and migrate critical database systems.

In this blog post, we’ll walk through the steps of how to use Terraform to automate the deployment of a Striim server in AWS.

Pre-requisites

  1. Access to an AWS account including the Access Key ID and Secret Access Key. 
  2. Have an available Linux machine.
  3. General understanding of what Striim is.
  4. A Striim license. For free trials, go to https://signup-developer.striim.com/.

Install and Setup Terraform

In order to automate the deployment of a Striim server, we’ll first need to install Terraform on our CentOS Linux machine. 

Let’s log in to it and enter the following commands into the terminal:

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo yum -y install terraform

If you’re using a different operating system, please find the appropriate instructions in this link: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli

We’ll be using Terraform version 1.3.6 in this tutorial. Please verify the version by running this command:

terraform -version

Terraform v1.3.6
on linux_amd64

Once the installation is successful, we can authenticate to our AWS account by exporting the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION environment variables:

export AWS_ACCESS_KEY_ID=123456789-1234-1234-222
export AWS_SECRET_ACCESS_KEY=123456789-234-234-444
export AWS_REGION=us-west-2

For more information about getting your AWS access keys from an IAM user, please visit this link: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html

Configure Terraform

After the installation process, we can create a directory named striim_server_tf and add the following files inside:

  • main.tf — contains the primary configuration for your module. You can also create additional configuration files and arrange them in a way that makes sense for your project.
  • variables.tf — contains the variable definitions for your module.
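Neither file’s full contents are reproduced in this copy. As a hedged sketch (the AMI ID, instance type, and tag wiring are assumptions, and the variable names mirror the TF_VAR_* values exported in the next step), the two files might look like this:

```hcl
# variables.tf -- sketch; declare one variable per TF_VAR_* export.
variable "striim_product_key" {
  type      = string
  sensitive = true
}

variable "striim_license_key" {
  type      = string
  sensitive = true
}

variable "striim_company_name" {
  type = string
}

# ...plus striim_cluster_name, striim_sys_password,
# striim_keystore_password, striim_admin_password, and
# striim_mdr_database_type, declared the same way.

# main.tf -- sketch; region is read from the AWS_REGION environment variable.
provider "aws" {}

resource "aws_instance" "striim_server" {
  ami           = "ami-0123456789abcdef0" # placeholder: a Linux AMI in your region
  instance_type = "t3.large"              # placeholder size

  tags = {
    Name = "striim-server" # the name we search for in the EC2 console later
  }
}
```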

As mentioned in the “Striim Credentials and License Information” section of the variables.tf file, we need to set Striim’s license information and user passwords as environment variables, since they are confidential values:


export TF_VAR_striim_product_key=123456-123456-123456
export TF_VAR_striim_license_key=123456-123456-123456-123456-123456-123456-123456-04C
export TF_VAR_striim_company_name=striim
export TF_VAR_striim_cluster_name=striim_cluster_name
export TF_VAR_striim_sys_password=my_awesome_password
export TF_VAR_striim_keystore_password=my_awesome_password
export TF_VAR_striim_admin_password=my_awesome_password
export TF_VAR_striim_mdr_database_type=Derby

The TF_VAR_ prefix instructs Terraform to look up each variable’s value in the corresponding environment variable. More information: https://developer.hashicorp.com/terraform/cli/config/environment-variables

Once we have these files created, we should see a directory and file structure like this:

striim_server_tf
|
|-- main.tf
|
|-- variables.tf

Run Terraform

At this point, we have configured our Terraform environment to deploy a Striim server to our AWS account and written Terraform code to define the server. To deploy it, run terraform init once inside the striim_server_tf directory to download the AWS provider plugin, then execute the two Terraform commands, terraform plan and terraform apply.

  • The terraform plan command lets the user preview the changes (create, destroy, and modify) that Terraform plans to make to your overall infrastructure.
  • The terraform apply command executes the actions proposed in a Terraform plan. 
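A typical session inside the striim_server_tf directory looks like this (terraform init is required once per working directory to download the AWS provider plugin):

```shell
cd striim_server_tf
terraform init    # one-time: downloads the AWS provider plugin
terraform plan    # preview the resources Terraform will create
terraform apply   # create them; type "yes" when prompted
```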

If these commands execute successfully, you should see the following message at the end of the terminal output:


Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Verify the Deployment

To verify the Striim server deployment, navigate to the AWS EC2 console and search for striim-server.


Make sure it’s in the Running state and its status check shows “2/2 checks passed.”

Next, in a web browser, enter the public IP address of the server with :9080 appended to the URL and check whether Striim is up and running.


Enter your credentials and verify that you can log in to the Striim console.


By leveraging Terraform and its Infrastructure-as-Code approach, deploying a Striim server can be automated with ease. It allows organizations to save time and money by quickly spinning up Striim servers, which can be used for data migration or zero-downtime replication. This blog post provided an overview of how to use Terraform to set up and deploy a Striim server, as well as how to verify that the deployment was successful. With Terraform, it is possible to automate the entire process, making it easier than ever to deploy and manage cloud infrastructure. In addition, Striim Cloud can fully automate this entire process. Visit the Striim Cloud page for a fully managed SaaS solution with a Pay As You Go option to reduce total cost of ownership.

 

Democratizing Data Streaming with Striim Developer

Everyone wants real-time data…in theory. You see real-time stock tickers on TV, you use a real-time speedometer to gauge your speed while driving, and you check real-time conditions in your weather app.

Yet the “Modern Data Stack” is largely focused on delivering batch processing and reporting on historical data with cloud-native platforms. While these cloud analytics platforms have transformed business operations, we are still missing the real-time piece of the puzzle, and many data engineers feel inclined to think real-time is simply out of their organization’s reach. As a result, companies don’t have a real-time, single source of truth for their business, nor can they take in-the-moment actions on customer behavior.

Why? Real-time data is currently synonymous with spinning up complex infrastructure, cobbling together multiple projects, and figuring out the integrations to internal systems yourself.  The more valuable work of delivering fresh data to enable real-time data-driven applications in the business seems like an afterthought compared to the engineering prerequisites. 

Now there is another way…

Striim is a simple unified data integration and streaming platform that uniquely combines change data capture, application integration, and  Streaming SQL as a fully managed service that is used by the world’s top enterprises to truly deliver real-time business applications. 

With Striim Developer, we’ve opened up the core piece of Striim’s Streaming SQL and Change Data Capture engine as a free service to stream up to 10 million events per month with an unlimited number of Streaming SQL queries. Striim Developer includes:

  • CDC connectors for PostgreSQL, MongoDB, SQLServer, MySQL, and MariaDB
  • SaaS connectors for Slack, MS Teams, Salesforce, and others
  • Streaming SQL, Sliding and Jumping Windows, Caches to join data from databases and data warehouses like Snowflake 
  • Source and Target connectors for BigQuery, Snowflake, Redshift, S3, GCS, ADLS, Kafka, Kinesis, and more

Now any data engineer can quickly get started prototyping streaming use cases for production use with no upfront cost. You can even use Striim’s synthetic continuous data generator and plug it into your targets to see how real-time data behaves in your environment. 

What happens when you hit your monthly 10 million event quota? We simply pause your account, and you can resume using it the following month without losing your pipelines. You can also download your pipelines as code and upgrade to Striim Cloud in a matter of clicks. No effort wasted.

Use cases you can address in Striim Developer:

  • Act on anomalous customer behavior by comparing real-time data with their historical norms, then alert internally in Slack or Teams
  • Implement data contracts on database schemas and freshness SLAs with Striim’s CDC, Streaming SQL, and schema evolution rules
  • Compute moving averages, aggregations, and run regressions on streaming data from Kafka or Kinesis using SQL. 
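As a brief illustration of the last point (the stream and window names are hypothetical, and the exact TQL syntax may vary by Striim version), a five-minute moving average over a Kafka-fed stream could be written roughly like this:

```sql
-- Keep the most recent 5 minutes of events from an incoming stream.
CREATE WINDOW OrdersLast5Min
OVER OrdersStream
KEEP WITHIN 5 MINUTE;

-- Continuously compute a per-symbol moving average over that window.
CREATE CQ ComputeMovingAvg
INSERT INTO AvgOrdersStream
SELECT symbol, AVG(price) AS avgPrice
FROM OrdersLast5Min
GROUP BY symbol;
```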

If you’d like to join our first cohort of Striim Developers, you can sign up here.

If you’d like to get an overview from a data streaming expert first, request a demo here. 

Three Real-world Examples of Companies Using Striim for Real-Time Data Analytics

According to a recent study by KX, US businesses could see a total revenue uplift of $2.6 trillion through investment in real-time data analytics. From telecommunication to retail, businesses are harnessing the power of data analytics to optimize operations and drive growth. 

Striim is a data integration platform that connects data from different applications and services to deliver real-time data analytics. These three companies successfully harnessed data analytics through Striim and serve as excellent examples of the practical applications of this valuable tool across industries and use cases.

1. Ciena: Enabling Fast Real-time Insights to Telecommunication Network Changes 

Ciena is an American telecommunications networking equipment and software services supplier. It provides networking solutions to support the world’s largest telecommunications service providers, submarine network operators, data and cloud operators, and large enterprises. 

How Ciena uses Striim for real-time data analytics

Use cases

Ciena’s data team wanted to build a modern, self-serve data and analytics ecosystem that:

  • Improves the customer experience by enabling real-time insights and intelligent automation to network changes as they occur.
  • Facilitates data access across the enterprise by removing silos and empowering every team to make data-driven decisions quickly.

To meet its goals, Ciena chose Snowflake as its data warehousing platform for operational reporting and analytics and Striim as its data integration and streaming solution to replicate changes from its Oracle database to Snowflake. The company used Striim to collect, filter, aggregate, and deliver (in real time) 40–90 million business events to Snowflake daily, across systems that manage manufacturing, sales, and dozens of other crucial business functions, to enable advanced real-time analytics.

With its real-time analytics platform, Ciena has offered customers up-to-date insights as changes occur in its network, thus improving the customer experience. Additionally, operators can begin experimenting with machine learning by using real-time analytics to identify network events that could impact performance.

Finally, with its self-serve analytics platform, everyone in the organization can now access the data they need to make faster data-driven decisions. With real-time analytics, Ciena’s customers no longer have to wait to see their updated data because it is displayed instantly after any changes are made in the source platforms.

“Because of Striim, we have so much customer and operational data at our fingertips. We can build all kinds of solutions without worrying about how we’ll provide them with timely data,” Rajesh Raju, director of data engineering at Ciena, explains.

2. Macy’s: Improving Digital and Mobile Shopping Experiences 

Macy’s, Inc. is one of America’s largest retailers, delivering quality fashion to customers in more than 100 international destinations through the leading e-commerce site macys.com. Macy’s, Inc. sells a wide range of products, including men’s, women’s, and children’s clothes and accessories, cosmetics, home furnishings, and more. 

Use cases

Macy’s real-time analytics use cases were to:

  • Achieve real-time visibility into customer and inventory orders to optimize operational costs, especially during peak holiday events like Black Friday and Cyber Monday.
  • Leverage artificial intelligence and machine learning to personalize customer shopping experiences.
  • Quickly turn data into actionable insights that help Macy’s deliver quality digital customer experiences and improve operational efficiencies.

Macy’s migrated its on-premise inventory and order data to Google Cloud Storage to reach its objectives. The company decided to move to the cloud based on the benefits of cost efficiency, flexibility, and improved data management. To facilitate the data integration process, it used Striim, which allowed it to:

  • Import historical and real-time on-premise data from its Oracle and DB2 mainframe databases.
  • Process the data in flight, including detecting and transforming mismatched timestamp fields.
  • Continuously deliver data to its BigQuery data warehouse for scalable analysis of petabytes of information.

Real-time data analytics has been a critical factor in Macy’s ability to understand customer behaviors and improve the shopping experience for its customers. Data analytics has enabled the company to increase customer purchases and loyalty and optimize its operations to minimize costs. As a result, Macy’s has been able to offer its customers a seamless and personalized shopping experience.

3. MineralTree: Facilitating Real-time Customer Invoice Reporting

MineralTree, formerly Inspyrus, is a fintech SaaS company specializing in automating the accounts payable (AP) process of invoice capture, invoice approval, payment authorization, and payment completion. To do this, the company connects with hundreds of different ERP and accounting systems companies and streamlines the entire AP process into a unified system.

How MineralTree (formerly Inspyrus) uses Striim for real-time data analytics

Use cases

MineralTree wanted to build a real-time data analytics system to:

  • Provide customers with a real-time view of all their invoicing reports as they occur. 
  • Help customers visualize their data using a business intelligence tool.

MineralTree used Striim to seamlessly integrate customer data from various ERP and accounting systems into its Snowflake cloud data warehouse. Striim’s data integration connector enabled the company to generate real-time operational data from Snowflake and use it to power the business intelligence reports it provides to customers through Looker.

MineralTree’s updated data stack, consisting of Striim, Snowflake, dbt, and Looker, has enhanced the invoicing operations of its customers through rich, value-added reports.

According to Prashant Soral, CTO, the real-time data integration provided by Striim from operational systems to Snowflake has been particularly beneficial in generating detailed, live reports for its customers.

Transform How Your Company Operates Using Real-time Analytics With Striim

Real-time analytics transforms how your business operates by providing accurate, up-to-date information that can help you make better decisions and optimize your operations. 

Striim offers an enterprise-grade platform that allows you to easily build continuous, streaming data pipelines to support real-time cloud integration, log correlation, edge processing, and analytics. Request a demo today.
