Striim 5.0 Release: Unleash Real-Time HubSpot Integration with Our Latest Connector

As businesses increasingly rely on HubSpot’s customer platform for marketing, sales, customer service, and more, the ability to seamlessly move data in real time is a game changer. Striim 5.0 introduces the HubSpot Reader, a powerful connector that integrates your HubSpot CRM data with any target system. With Striim’s robust real-time data movement capabilities, you can unlock the full potential of your HubSpot data to enhance analytics, streamline operations, and improve customer experiences.

What Does It Do?

Striim’s HubSpot Reader uses HubSpot APIs to extract data from your CRM platform, emitting WAEvents that can be processed with continuous queries or directed to a Striim target. This connector supports three distinct modes:

  • Initial Load: Pulls all existing data from HubSpot, ideal for creating a foundational dataset.
  • Incremental Load: Captures and replicates new source data in near real-time, ensuring your systems are always up to date.
  • Automated Mode: Combines both approaches, completing an initial load before transitioning to incremental updates automatically.

This flexibility allows businesses to tailor data movement workflows to their specific needs, whether for one-time migrations or ongoing synchronization.

How Do You Use It?

Using the HubSpot Reader is simple and adaptable. After configuring the mode of operation, you can seamlessly connect to HubSpot to read data and direct it to any target supported by Striim, including leading data warehouses and data lakes.

For example:

  • Begin with an Initial Load to build a comprehensive dataset from your HubSpot environment.
  • Switch to Incremental Load to enable continuous replication, capturing changes like new leads, updated deals, or customer interactions.
  • Use Automated Mode to eliminate manual intervention, ensuring uninterrupted, real-time updates.
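
To make the three modes concrete, here is a minimal, hand-rolled sketch of the polling pattern the HubSpot Reader automates, written directly against the HubSpot CRM v3 API. The endpoint paths, the lastmodifieddate property, and the private-app token are assumptions for illustration only; in Striim you simply pick the mode and the connector handles all of this for you.

```python
import time
import requests

BASE = "https://api.hubapi.com"
HEADERS = {"Authorization": "Bearer <private-app-token>"}  # assumed auth method

def initial_load():
    """Page through all existing contacts (what 'Initial Load' mode does)."""
    after = None
    while True:
        params = {"limit": 100, **({"after": after} if after else {})}
        page = requests.get(f"{BASE}/crm/v3/objects/contacts",
                            headers=HEADERS, params=params).json()
        yield from page.get("results", [])
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            break

def incremental_changes(since_ms):
    """Fetch contacts modified after a timestamp (what 'Incremental Load' mode does)."""
    body = {"filterGroups": [{"filters": [{"propertyName": "lastmodifieddate",
                                           "operator": "GT",
                                           "value": str(since_ms)}]}],
            "limit": 100}
    resp = requests.post(f"{BASE}/crm/v3/objects/contacts/search",
                         headers=HEADERS, json=body).json()
    return resp.get("results", [])

# 'Automated Mode': run the initial load once, then poll for changes continuously.
for contact in initial_load():
    print("loaded:", contact["id"])
cursor = int(time.time() * 1000)
while True:
    for contact in incremental_changes(cursor):
        print("changed:", contact["id"])
    cursor = int(time.time() * 1000)
    time.sleep(60)
```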

Want to dive deeper? Check out the doc and explore more.

How Does Striim Add Value?

Striim enhances the value of your HubSpot data by integrating it with various systems to support advanced analytics, improve customer support, and boost operational efficiency. With real-time data migration, businesses gain faster decision-making and up-to-date insights. Its seamless integration capabilities connect HubSpot to data warehouses, lakes, or operational systems, providing a unified view of your business. Additionally, Striim offers enhanced flexibility for diverse use cases, from creating analytical reports and optimizing marketing campaigns to elevating customer support workflows.

Experience the Power of Striim 5.0

Striim 5.0 takes HubSpot data integration to the next level with a connector that’s as flexible as it is powerful. Whether you’re laying the groundwork for data-driven initiatives or fine-tuning your existing workflows, the HubSpot Reader ensures real-time, reliable data movement.

Ready to power your business with real-time data? Try Striim today with a free trial or book a demo to see it in action.

Start Your Free Trial | Schedule a Demo

Striim 5.0 Release: Supercharge Customer Service with the Zendesk Reader

Real-time access to data is essential for delivering outstanding customer experiences. Striim’s 5.0 release introduces the Zendesk Reader, a powerful tool that enables businesses to seamlessly integrate their Zendesk data into their broader data ecosystem. This integration enhances decision-making and helps teams improve customer service efficiency by providing timely insights from their help desk management system.

What Does It Do?

The Striim Zendesk Reader ingests data from Zendesk’s cloud-based help desk platform and emits WAEvents, which can be processed through continuous queries or directed to any supported Striim target. By leveraging the Zendesk API, the reader reads the user’s objects and tables, delivering data directly to the Striim platform. This provides a streamlined way to access and use critical customer service data for business analytics and decision-making.

How Do You Use It?

The Zendesk Reader can be used in two modes: Initial Load and Incremental Load. For an initial load, set the mode accordingly in the Zendesk Reader to extract all relevant Zendesk data for the first time. After the initial load, you can switch to Incremental Load mode for near real-time continuous replication. This mode enables the adapter to read new source data at regular intervals, ensuring that you always have the latest updates flowing through your systems.

To use the Zendesk Reader, the user should have access to a Zendesk instance or an access token for the OAuth client registered to the instance. This ensures the necessary permissions are in place for data extraction and integration.
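
For a sense of what the incremental mode does under the hood, here is a small hand-written sketch that polls Zendesk's incremental ticket export endpoint with an OAuth access token. The subdomain and polling interval are placeholders; the Zendesk Reader performs the equivalent work for you once the mode is configured.

```python
import time
import requests

SUBDOMAIN = "yourcompany"                                  # placeholder Zendesk subdomain
HEADERS = {"Authorization": "Bearer <oauth-access-token>"}

def poll_tickets(start_time):
    """Return tickets created or updated since start_time (Unix seconds), plus a new cursor."""
    url = f"https://{SUBDOMAIN}.zendesk.com/api/v2/incremental/tickets.json"
    resp = requests.get(url, headers=HEADERS, params={"start_time": start_time}).json()
    return resp.get("tickets", []), resp.get("end_time", start_time)

cursor = int(time.time()) - 3600        # start an hour back for the example
while True:
    tickets, cursor = poll_tickets(cursor)
    for ticket in tickets:
        print("ticket changed:", ticket["id"], ticket.get("status"))
    time.sleep(120)                     # read new source data at regular intervals
```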

Want to dive deeper? Check out the doc and explore more.

How Does Striim Add Value?

Striim’s Zendesk Reader delivers immense value by enabling real-time data flow with high throughput and low latency. This ensures the seamless handling of large-scale data, giving businesses immediate access to valuable insights. By writing data in real time to a data warehouse, you can build a comprehensive Customer Data Platform (CDP) to enhance your customer insights and decision-making processes.

Plus, Striim empowers businesses to integrate Zendesk data with machine learning (ML) and analytics systems for advanced workflows like Next Best Action, LTV (Lifetime Value) Analysis, and churn analysis. These integrations allow you to anticipate customer needs and make data-driven decisions that improve customer satisfaction and retention.

Transform Your Business Today!

Ready to power your business with real-time data? Try Striim today with a free trial or book a demo to see it in action.

Start Your Free Trial | Schedule a Demo

 

5.0 Release: Unlocking the Power of Snowflake CDC for Real-Time Data Replication

What is Snowflake CDC?

Snowflake CDC (Change Data Capture) is a method that enables real-time data replication from Snowflake databases by tracking and capturing changes made to tables. Using a specialized Snowflake Reader, it enables continuous replication after an initial load, ensuring that any data manipulation language (DML) changes like inserts, updates, and deletes are identified and captured in near real-time.

What Does It Do?

The Snowflake Reader is designed to monitor and read changes occurring in a Snowflake database. It identifies changes in tables through a “CHANGES” clause, querying the table at successive time intervals to ensure up-to-date information. This process is ideal for scenarios where keeping track of ongoing data modifications is essential for accurate analytics, reporting, or operational use cases.
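
To see what such a query looks like, here is a sketch that runs a CHANGES query with the Snowflake Python connector. The account, table, and time window are placeholders; the Snowflake Reader issues equivalent queries for you on a schedule.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="PUBLIC",
)
cur = conn.cursor()

# Change tracking must be enabled on the table before its changes can be queried.
cur.execute("ALTER TABLE ORDERS SET CHANGE_TRACKING = TRUE")

# Read the DML changes (inserts, updates, deletes) recorded since a point in time.
cur.execute("""
    SELECT *
    FROM ORDERS
      CHANGES (INFORMATION => DEFAULT)
      AT (TIMESTAMP => DATEADD(minute, -5, CURRENT_TIMESTAMP()))
""")
for row in cur:
    # Metadata columns such as METADATA$ACTION and METADATA$ISUPDATE describe each change.
    print(row)

cur.close()
conn.close()
```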

The Snowflake Reader can capture both DML changes and certain limited DDL (Data Definition Language) changes, keeping your data in sync and allowing you to confidently use Snowflake as a dynamic, continuously updated data source.

How Do You Use It?

  1. Initial Load: Start by using the standard Database Reader to perform the initial load of your existing Snowflake data into the target.
  2. Continuous Replication: Once the initial load is complete, the Snowflake Reader takes over, enabling CDC to maintain ongoing updates in real time. This setup is beneficial for applications that require near real-time data synchronization, reducing latency and ensuring the data stays fresh.

Want to dive deeper? Check out the doc and explore more.

How Does Striim Add Value?

Striim’s Snowflake CDC functionality supports several high-impact use cases:

  • Reverse ETL: Many organizations need to read analytics results from Snowflake and apply those insights directly in operational systems like CRM, SCM, or other transactional databases. With Snowflake CDC, Striim enables this seamless reverse ETL process, allowing data like customer lifetime value (LTV) or churn predictions to be easily updated across systems.
  • Data Warehouse Consolidation: Companies with multiple departmental data warehouses can use Snowflake CDC to continuously sync data across these instances, ensuring a consistent and consolidated view at the corporate level.

Additional Highlights

  • Snowflake CDC Reader supports all Snowflake data types, except for the Vector type, making it flexible enough to handle diverse data requirements.

Ready to power your business with real-time data? Try Striim today with a free trial or book a demo to see it in action.

Start Your Free Trial | Schedule a Demo

 

Securing Data in Striim: Cryptographic Key Management in Striim, Field-Level Encryption, and Vault Integration

How Striim Keeps Enterprise Grade Streaming Workloads Safe

The Streaming Data Security Challenge

At Striim, our approach to security encompasses protecting data both in motion and at rest, utilizing advanced encryption techniques and integrating with powerful tools.

In this blog, I’ll walk you through two small pieces of this architecture that show how Striim secures sensitive data and metadata (a) in motion and (b) at rest.

The Striim Shield: Protecting Data In Motion

One of the standout features gaining traction in this landscape is field-level encryption within data pipelines. While some startups have made this their sole focus, Striim takes it a step further by offering this capability at no extra cost through our innovative practice known as encrypted streaming.

With Shield, you can easily encrypt one or more fields within a data stream with your own self-managed encryption key. This means that as your data flows downstream—eventually reaching its external target—those sensitive fields remain securely encrypted.

In the above, you can clearly see that the SSN, or social security number, is being encrypted through the Shield component.

So, how does this work?

The system architecture is centered around a Key Management Service (KMS) that maintains a master Key Encryption Key (KEK). This KEK is utilized to encrypt the Data Encryption Key (DEK), which is generated by the Striim server using the Tink cryptographic library.

The encryption process is optimized through a batching mechanism: a Tink client reuses the same encryption key for a predetermined number of events flowing through the data pipeline before it is refreshed. This approach balances computational efficiency with security requirements.

The implementation serves two primary technical objectives:

  1. Encryption of data prior to warehouse ingestion, ensuring protection during network transit.
  2. Maintenance of data encryption within the warehouse, providing security for data at rest.

So, what about decryption? To decrypt the data, customers must use Tink in conjunction with the same KMS. This ensures that decryption is only possible for authorized entities possessing the correct cryptographic keys.
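
As a rough illustration of the KEK/DEK relationship described above, here is a minimal envelope-encryption sketch using the Tink library with a KEK held in Google Cloud KMS. The key URI is a placeholder and this is not Striim's internal implementation; it only shows the pattern Shield builds on.

```python
from tink import aead
from tink.integration import gcpkms

aead.register()

# The KEK never leaves the KMS; only its URI is referenced by the application.
KEK_URI = ("gcp-kms://projects/<project>/locations/<location>/"
           "keyRings/<ring>/cryptoKeys/<key>")

kms_client = gcpkms.GcpKmsClient(KEK_URI, "")   # empty path -> application default credentials
remote_kek = kms_client.get_aead(KEK_URI)

# Envelope AEAD: Tink generates a local DEK and asks the KMS to wrap it with the KEK.
envelope = aead.KmsEnvelopeAead(aead.aead_key_templates.AES256_GCM, remote_kek)

ciphertext = envelope.encrypt(b"123-45-6789", b"ssn")   # field value + associated data
plaintext = envelope.decrypt(ciphertext, b"ssn")        # only works with access to the same KEK
```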

Here’s how you can get your hands dirty with it in Striim

  1. Create a connection profile which connects to the KMS
  2. Create a shield component in the flow designer which uses this connection profile
  3. Deploy & Run your application, and sit back and enjoy encrypted real-time data streaming! 

It’s important to note that since we send the data encryption key over the network to be encrypted by the KMS, there are performance implications in terms of speed while using this feature. 

For simpler encryption than Shield, Striim offers the WITH ENCRYPTION option during application creation. This secures data streams between Striim servers or from Forwarding Agents, especially useful when data sources are external to the cluster or private network. You can apply encryption at the application or flow level to balance security and performance, encrypting only sensitive data streams as needed.

The Striim Vault: Protecting Customer Metadata 

Now that we’ve discussed how data in motion is kept safe in Striim, we’ll dive into how data is kept safe at rest. Striim maintains customer metadata in our Metadata Repository (MDR). This includes information such as pipeline source and target metadata, user information, pipeline deployment plans, and more. While this repository is crucial for storing various configuration details, we understand that certain information—such as passwords and table names—requires an extra layer of security.

When users input passwords for database connections or secure JDBC connections, Striim doesn’t simply store this sensitive information in plain text. Instead, we’ve developed a sophisticated system to protect these credentials: Striim integrates seamlessly with external vaults, providing customers with a secure way to manage their sensitive data. This integration allows customers to avoid storing sensitive information directly in the MDR, although anything that is stored there is encrypted before being persisted as well.

Our vault integration includes several advanced security features:

  • Token Auto-renewal: Customers can configure settings for Striim to automatically renew the vault token periodically. This ensures that even if a token is leaked, it becomes invalid after a certain time.
    The Striim server operates a timer thread within the JVM that regulates the frequency at which the HashiCorp Renew Endpoint is accessed.
  • Reference-based Storage: In Striim, we maintain a reference to the connection in the vault and the key that the customer wants to use as the password for their database or any other sensitive field.
  • Runtime Connection: Striim makes the connection to the vault at runtime. This means the sensitive value is only stored in memory and never persisted to the MDR. 

One of the many vaults we integrate with is the HashiCorp Vault. Here’s how it works:

  1. Customers maintain their keys and values on their HashiCorp server.

  2. In Striim, customers create a reference to their vault. You provide only an access token to connect to the HashiCorp Server, rather than the actual credentials.

  3. Once the vault is created, customers can see their HashiCorp Vault keys, as well as the usage of the vault in Striim Application components such as Sources & Targets.

  4. Striim users can use this vault key as the password in their source/target (e.g., Oracle Reader)  – this ensures that the raw password is never stored in our MDR, and only maintained in-memory for this connection.

  5. Deploy and Run Pipeline – This is the final step which allows you to run your data pipeline and drive mission critical business applications securely, and in real-time!
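
Outside of Striim, the same reference-based lookup and token renewal can be sketched with the hvac client for HashiCorp Vault. The URL, secret path, and key name below are assumptions; Striim performs the equivalent lookups internally at deploy and run time.

```python
import hvac

# Connect with a token only; the real database password stays in the vault at rest.
client = hvac.Client(url="https://vault.example.com:8200", token="<vault-token>")
assert client.is_authenticated()

# Reference-based storage: only the secret path ("striim/oracle") and key name
# ("password") would be stored as the reference; the value itself is fetched at runtime.
secret = client.secrets.kv.v2.read_secret_version(path="striim/oracle")
db_password = secret["data"]["data"]["password"]      # held in memory only

# Token auto-renewal: periodically extend the token's TTL so a leaked token ages out.
client.auth.token.renew_self(increment="1h")
```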

Benefits of Using Vaults in Striim

  1. Enhanced Security: Sensitive data is never stored at rest in Striim’s MDR.
  2. Flexibility: Customers can use their preferred vault solution. We integrate with Google’s Secret Manager, HashiCorp Vault, Azure Key Vault, and even provide our own homegrown Striim Vault. 
  3. Reduced Risk: Keeping sensitive credentials in a completely segregated system designed for data protection allows for heightened security.

For this post we focused on protecting data in motion with Shield and data at rest with vaults. In an era where data breaches can cost companies millions and erode customer trust, Striim’s commitment to secure data processing provides peace of mind. By addressing security at every stage of the data lifecycle, from ingestion to processing to storage, we enable businesses to harness the power of their data without compromising on safety.

This article only scratches the surface in how we keep your data safe. There are so many more amazing features in our new release, including automated PII detection, OAuth connectivity, SSO/SAML support, and more to safely activate your siloed data in real-time.

Ready to learn more? Sign up for a demo today

Reinventing Data Governance for the AI Era: Embracing Automation and Intelligent Data Protection

As organizations increasingly rely on AI to drive innovation and efficiency, protecting sensitive data has become both a strategic necessity and a regulatory mandate. Traditional security measures, often reactive and manual, no longer suffice. Instead, we now stand at the cusp of a new era where data governance is automatic, intelligent, and built to match the speed of AI. 

Let’s explore how AI-driven sensitive data protection is transforming data security. Then, discover how Striim’s AI agents are leading the way in this revolution. 

The New Age of Data Governance 

Despite the widespread deployment of multiple API security products, recent surveys reveal a staggering statistic: 92% of organizations experienced an API-related security incident in the last year, with 57% encountering multiple incidents. This alarming reality underscores the limitations of traditional security measures and highlights the urgent need for more intelligent, automated solutions.

Historically, safeguarding sensitive data required a labor-intensive process—manual audits, constant monitoring, and a reactive approach to threats. However, the reality of today’s fast-moving data environment demands a radical shift. With the advent of AI-driven security, sensitive data can be detected, classified, and protected in real time. This proactive stance eliminates the need for constant manual oversight, helps organizations work towards compliance, and reduces the risk of human error.

Imagine a world where sensitive data moves through systems effortlessly, but never without oversight. Striim’s AI-powered approach ensures this by detecting and classifying data before it even reaches storage. Continuous scanning identifies sensitive data the moment it’s created—not after it’s stored—while proactive security mechanisms like real-time masking, encryption, or redaction safeguard the information from exposure. Striim enables businesses to instantly manage and protect sensitive data, making it possible to adhere to regulations like GDPR, CCPA, and HIPAA. The result? Data flows freely and securely, empowering businesses to focus on what matters most.

Enter Striim’s AI Agents Sentinel and Sherlock: Pioneering AI-Powered Data Governance

Striim’s AI agents, Sentinel and Sherlock, are pioneering tools that bring real-time, AI-powered governance to your data pipelines, increasing security without compromising performance.

Sherlock AI offers: 

  • Source Operation: Identifies sensitive data before it enters data pipelines—even in third-party-managed databases and SaaS environments.
  • Early Detection: Finds sensitive data before it moves, eliminating risk at the earliest stage.
  • Comprehensive Visibility: Works seamlessly across SaaS, cloud, and third-party environments to ensure full visibility.
  • Lightweight Scanning: Operates with zero performance impact, ensuring databases aren’t overloaded.
  • Automated Classification: Classifies financial, health, and identity-related PII automatically, providing real-time security insights.
  • Data Quality Monitoring: Detects data quality issues in real time, alerting teams when sensitive data appears in unintended locations.

Sentinel AI provides: 

  • In-Motion Protection: Provides real-time detection and protection of sensitive data as it moves across systems.
  • Accurate Detection: Spots PII anywhere in a record—even if it’s misplaced or mislabeled—beyond the scope of rules-based controls.
  • Exposure Prevention: Prevents data exposure when transferring information from internal systems to external platforms for analytics or exchange.
  • Compliance Support: Covers 25+ sensitive data types across the USA, Canada, UK, and India to address various compliance requirements.
  • Automated Actions: Executes policy-based actions such as encryption and masking (partial, full, regex-based) automatically.
  • Plug-and-Play UX: Easily integrated into your pipeline with a plug-and-play setup that requires only a few clicks.
  • Regulatory Governance: Supports businesses on their journey to meet GDPR, CCPA, HIPAA, and other regulatory requirements.

Together, Sherlock AI and Sentinel AI work to prevent sensitive data exposure before it happens, ensuring your operations remain secure and that your team is in full control of its data.

How AI-Powered Data Governance Works

Our process begins with Sherlock AI, which proactively identifies sensitive data at its source—before it moves. By scanning both structured and unstructured data across SQL, NoSQL, SaaS, and cloud databases, it detects and automatically classifies financial, health, and identity-related information that may present compliance challenges. 

As data moves, Sentinel AI validates it in real time using advanced pattern recognition and NLP, catching any mislabeled or misplaced data that traditional rules-based systems might overlook. Sentinel AI then applies automated protection measures—encrypting, masking, or blocking data based on business policies—to secure its movement between internal and external systems and prevent unintended processing of regulated information. 

Sentinel delivers live reporting via real-time dashboards that continuously monitor sensitive data exposure, security actions, and compliance. It uses predefined identifiers to detect, log, and protect sensitive information, while AI-driven metadata tags each event for effective tracking and auditing. With support for schema evolution, Sentinel easily adapts to new data sources, ensuring ongoing AI-powered data governance.

This continuous monitoring helps organizations stay audit-ready and compliant. Real-time dashboards provide complete visibility into data protection efforts, and Sentinel generates audit logs that align with regulations like GDPR, CCPA, HIPAA, and the EU AI Act. Additionally, it integrates with enterprise security tools such as SIEM, DLP, Datadog, and Snowflake Security to ensure a unified security framework.

The Impact of AI-Powered Automation on Data Governance 

By automating these processes, organizations no longer need to scramble after a potential data breach. Instead, security becomes a built-in feature of data management. Sensitive information is automatically shielded by AI agents as it moves through the enterprise ecosystem, whether in production environments, during testing, or throughout analytics workflows.

Automated authentication and connection processes also reduce strain on IT teams. This allows security professionals to shift their focus from routine monitoring to strategic initiatives, such as threat intelligence and proactive risk management. With Sentinel AI operating silently in the background, businesses can innovate without fear of compromising their sensitive data.

By ensuring that sensitive data is protected, organizations can also enhance customer trust. In addition, streamlined security processes translate into improved operational efficiency. Data flows remain uninterrupted, and the risk of security incidents is drastically minimized.

Moving Forward in the AI Era

The AI era requires businesses to rethink traditional approaches to data security. With the speed at which data moves and the sophistication of modern cyber threats, it’s clear that reactive measures are no longer sufficient. Automated, intelligent solutions are not just an option—they are a necessity. 

Get a demo today and discover how Striim can help you better protect your data. 

 

Scaling Databases in the AI Era: Insights from Andy Pavlo (Carnegie Mellon University)

Get More Insights In Your Inbox

Join us for a deep dive into the world of databases with CMU professor Andy Pavlo. We discuss everything from OLTP vs. OLAP to the challenges of distributed databases and why cloud-native databases require a fundamentally different approach than legacy systems. We also discuss modern vector databases, RAG, embeddings, text-to-SQL, and industry trends.

You can follow Andy’s work on:

What’s New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What’s New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data, and analytics success stories.

Real-Time RAG: Streaming Vector Embeddings and Low-Latency AI Search

Imagine searching for products on an online store by simply typing “best eco-friendly toys for toddlers under $50” and getting instant, accurate results—while the inventory is synchronized seamlessly across multiple databases. This blog dives into how we built a real-time AI-powered hybrid search system to make that vision a reality. Leveraging Striim’s advanced data streaming and real-time embedding generation capabilities, we tackled challenges like ensuring low-latency data synchronization, efficiently creating vector embeddings, and automating inventory updates.

We’ll walk you through the design decisions that balanced consistency, efficiency, and scalability and discuss opportunities to expand this solution to broader Retrieval-Augmented Generation (RAG) use cases. Whether you’re building cutting-edge AI search systems or optimizing hybrid cloud architectures, this post offers practical insights to elevate your projects.

What is RAG?

Retrieval-Augmented Generation (RAG) enhances the capabilities of large language models by incorporating external data retrieval into the generation process. It allows the model to fetch relevant documents or data dynamically, ensuring responses are more accurate and context-aware. RAG bridges the gap between static model knowledge and real-time information, which is crucial for applications that require up-to-date insights. This hybrid approach significantly improves response relevance, especially in domains like e-commerce and customer service.

Why Vector Embeddings and Similarity Search?

Vector embeddings translate natural language text into numerical representations that capture semantic meaning. This allows for efficient similarity searches, enabling the discovery of products even if queries differ from stored descriptions. Embedding-based search supports fuzziness, matching results that aren’t exact but are contextually relevant. This is essential for natural language search, as it interprets user intent beyond simple keyword matching. The combination of embeddings and similarity search improves the user experience by providing more accurate and diverse search results.

Why Real-time RAG Instead of Batch-Based Data Sync?

Real-time RAG ensures that inventory changes are reflected instantly in the search engine, eliminating stale or outdated results. Unlike batch-based sync, which can introduce latency, real-time pipelines offer continuous updates, improving accuracy for fast-moving inventory. This minimizes the risk of selling unavailable products and enhances customer satisfaction. Real-time synchronization also supports dynamic environments where product data changes frequently, aligning search capabilities with the latest inventory state.

How We Designed the Embedding Generator for Performance

In designing the Vector Embedding Generator, we addressed the challenges associated with token estimation, handling oversized input data, and managing edge cases such as null or empty input strings. These design considerations ensure that the embedding generation process remains robust, efficient, and compatible with various AI models.

Token Estimation and Handling Large Data

Google Vertex AI

Vertex AI simplifies handling large data inputs by silently truncating input data that exceeds the token limit and returning an embedding based on the truncated input. While this approach ensures that embeddings are always generated regardless of input size, it raises concerns about data loss affecting embedding quality. We have an ongoing effort to analyze how this truncation impacts embeddings and whether improvements can be made to mitigate potential quality issues.

OpenAI

OpenAI enforces strict token limits, returning an error if input data exceeds the threshold (e.g., 2048 or 3092 tokens). To handle this, we integrated a tokenizer library into the Embedding Generator’s backend to estimate token counts before sending data to the API. The process involves:

  1. Token Count Estimation: Input strings are tokenized to determine the estimated token count.
  2. Iterative Truncation: If the token count exceeds the model’s limit, we truncate the input to 75% of its current size and recalculate the token count. This loop continues until the token count falls within the model’s threshold.
  3. Submission to Model: The truncated input is then sent to OpenAI for embedding generation.

For instance, if an OpenAI model has a token limit of 3092 and the estimated token count for incoming data is 4000, the system will truncate the input to approximately 3000 tokens (75%) and re-estimate. This iterative process ensures compliance with the token limit without manual intervention.
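
A sketch of this estimate-and-truncate loop, using the tiktoken library and the OpenAI embeddings endpoint (the 3092-token limit is just the example figure from above):

```python
import tiktoken
from openai import OpenAI

client = OpenAI()                        # reads OPENAI_API_KEY from the environment
MODEL = "text-embedding-ada-002"
TOKEN_LIMIT = 3092                       # example threshold from the text above
encoder = tiktoken.encoding_for_model(MODEL)

def embed_with_truncation(text: str) -> list[float]:
    # 1. Token count estimation.
    tokens = encoder.encode(text)
    # 2. Iterative truncation: cut the input to 75% of its current size until it fits.
    while len(tokens) > TOKEN_LIMIT:
        text = text[: int(len(text) * 0.75)]
        tokens = encoder.encode(text)
    # 3. Submission to the model.
    response = client.embeddings.create(model=MODEL, input=text)
    return response.data[0].embedding
```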

Handling Null or Empty Input

When generating embeddings, edge cases like null or empty input can result in API errors or undefined behavior. To prevent such scenarios, we adopted a solution inspired by discussions in the OpenAI developer forum: the use of a default vector.

Characteristics of the Default Vector:

  • Dimensionality: Matches the size of embeddings generated by the specific model (e.g., 1536 dimensions for OpenAI’s text-embedding-ada-002).
  • Structure: The vector contains a value of 1.0 at the first index, with all other indices set to 0.0.
    • Example: [1.0, 0.0, 0.0, …, 0.0]

By returning this default vector, we ensure the system gracefully handles cases where input data is invalid or missing, enabling downstream processes to continue operating without interruptions.
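
In code, the fallback is a one-liner; the 1536 dimension below matches text-embedding-ada-002 and would change with the chosen model:

```python
def default_vector(dimensions: int = 1536) -> list[float]:
    """Fallback embedding for null/empty input: 1.0 at the first index, 0.0 elsewhere."""
    return [1.0] + [0.0] * (dimensions - 1)

def safe_embed(text: str | None) -> list[float]:
    if text is None or not text.strip():
        return default_vector()              # graceful handling of missing input
    return embed_with_truncation(text)       # from the previous snippet
```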

Summary of Implementation:

  1. Preprocessing
    • Estimate token counts and handle truncation for models with strict token limits (OpenAI).
    • Allow silent truncation for models like Google Vertex AI but analyze its impact.
  2. Error Handling
    • For null or empty data, return a default vector matching the model’s embedding dimensions.
  3. Scalability
    • These mechanisms are integrated seamlessly into the Embedding Generator, ensuring compatibility across multiple data streams and models without manual intervention.

This design enables developers to generate embeddings confidently, knowing that token limits and edge cases are managed effectively.

Tutorial: Using the Embeddings Generator for Search

An e-commerce company aims to build an AI-powered hybrid search that enables users to describe their needs in natural language.

Their inventory management system is in an Oracle database, and the storefront search database is maintained in Azure PostgreSQL.

Current problem statements are:

  1. Data Synchronization: Inventory data from Oracle must be replicated in real-time to the storefront’s search engine to ensure data consistency and avoid stale information.
  2. Vector Embedding Generation: Product descriptions need vector embeddings to facilitate similarity searches. The storefront must support storing and querying these embeddings.
  3. Real-Time Updates: Ongoing changes in the inventory (such as product details or stock updates) need to be reflected immediately in the search engine.
  4. Embedding Updates: Updates to products in the inventory should trigger real-time embedding regeneration to prevent outdated or inaccurate similarity search results.

Solution

Striim has all the necessary features for the use case and the problem statements described:

  1. Readers: Capture the initial snapshot and ongoing real-time changes from the Oracle database.
  2. Embedding Generator: Creates vector embeddings for product descriptions.
  3. Writers: Deliver updated data and embeddings to the PostgreSQL database.

PostgreSQL with pgvector:

  1. Supports storing vector embeddings as a specific data type.
  2. Enables similarity search functionality directly within the database.

Note: The search engine for this solution is implemented in Python for demonstration purposes. OpenAI is used for embedding generation and for summarisation of the results.

Design Choices and Rationale

  1. Striim for Data Integration: Chosen for its seamless real-time change data capture (CDC) from Oracle to PostgreSQL.
    1. The alternative would be an independent application or script run periodically, but that would be inefficient and expensive to develop for both the initial load and change data capture, and difficult to maintain over time as more data sources need to be synced to the storefront database (Postgres).
  2. OpenAI Embeddings: Ensures high-quality embeddings compatible across pipeline stages.
  3. Striim Embedding Generator: Enables in-flight embedding generation while the data is being synced, instead of having to move the data separately and then generate and update the embeddings.
    1. The alternative is to have separate listeners, triggers, or scripts in the storefront database (Postgres) that generate and update the embeddings every time the data changes. This could be expensive and inaccurate, since the data changes and the embedding updates can drift out of sync.
  4. pgvector: Facilitates native vector searches, reducing system complexity.
    1. An alternative design choice was a standalone vector database, but that does not provide the flexibility pgvector offers: with pgvector, the actual data and the embeddings live together in a relational setup, whereas a separate vector database requires maintaining metadata for each vector store and looking up the actual data from a different database or source while summarising the similarity search results. For example, the embeddings would live in one place queried via the similarity search, while price- or rating-based filters would need to be applied elsewhere.
  5. Python Search Engine: Provides flexibility and integration simplicity.
    1. This is a convenient choice that makes use of Python libraries; more details are included in the upcoming sections.

Future Work

  1. Expand embedding model options beyond OpenAI and make the solution generic.
    1. This is already done for the Striim Embedding Generator, as it supports Vertex AI as well. We could also consider supporting self-hosted models.
  2. Expand the support for non-textual input.
  3. Expand the implementation to generically cover any use case and have the application interface integrated with the Striim UI.
    1. The current implementation, as a proof of concept, is tightly coupled with the e-commerce use case and its data set.

Step-by-step instructions

Set up Striim Developer 5.0

  1. Sign up for Striim developer edition for free at https://signup-developer.striim.com/.
  2. Select Oracle CDC as the source and Database Writer as the target in the sign-up form.

Prepare the dataset

The dataset can be downloaded from this link:  https://raw.githubusercontent.com/GoogleCloudPlatform/python-docs-samples/main/cloud-sql/postgres/pgvector/data/retail_toy_dataset.csv

(selected set of 800 toys from the Kaggle dataset – https://www.kaggle.com/datasets/promptcloud/walmart-product-details-2020)

Have the above data imported into Oracle. Please use the following table definition to import the data:
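
A representative definition and CSV load (column names and sizes are assumptions based on the referenced sample dataset, using the python-oracledb driver) might look like this:

```python
import csv
import oracledb

conn = oracledb.connect(user="aidemo", password="<password>", dsn="<host>:1521/<service>")
cur = conn.cursor()

# Assumed column set, mirroring the fields in the sample CSV.
cur.execute("""
    CREATE TABLE AIDEMO.PRODUCTS (
        PRODUCT_ID   VARCHAR2(64) PRIMARY KEY,
        PRODUCT_NAME VARCHAR2(1000),
        DESCRIPTION  VARCHAR2(4000),
        LIST_PRICE   NUMBER(10, 2)
    )
""")

with open("retail_toy_dataset.csv", newline="", encoding="utf-8") as f:
    rows = [(r["product_id"], r["product_name"], r["description"], r["list_price"])
            for r in csv.DictReader(f)]

cur.executemany(
    "INSERT INTO AIDEMO.PRODUCTS (PRODUCT_ID, PRODUCT_NAME, DESCRIPTION, LIST_PRICE) "
    "VALUES (:1, :2, :3, :4)",
    rows,
)
conn.commit()
```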

Create Embedding generator

Go to StriimAI -> Vector Embeddings Generator and create a new Embeddings Generator

  1. Provide a unique name for the embedding generator.
  2. Choose OpenAI as the model provider.
  3. Enter the API key copied over from your OpenAI platform account – https://platform.openai.com/settings/organization/api-keys
  4. Enter “text-embedding-ada-002” as the model.
    1. The same model should be used in the Python code for converting the user query into an embedding before the similarity search.
  5. Later in this blog, the above embedding generator will be used in the pipeline to generate vectors using a Striim built-in function – generateEmbeddings().

Set up the automated data pipeline from Oracle to Postgres

Go to Apps -> Create An App -> Choose Oracle as Source and Azure Postgres as Target

Follow the guided wizard flow to configure the pipeline with the source and target connection details and the table selection. In the review screen, make sure to choose the “Save & Exit” option.

Note: Please follow the prerequisites for Oracle CDC from Striim doc – https://striim.com/docs/en/configuring-oracle-to-use-oracle-reader.html

The table used in Postgres is shown below. Please note that the embedding column is created additionally in Postgres.
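
A representative definition (assuming the same columns as the Oracle table plus a 1536-dimension pgvector column for the OpenAI embeddings; your column names should mirror whatever the wizard created) could be applied with asyncpg like this:

```python
import asyncio
import asyncpg

async def create_table():
    conn = await asyncpg.connect("postgresql://<user>:<password>@<azure-host>:5432/<database>")
    await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    await conn.execute("CREATE SCHEMA IF NOT EXISTS aidemo")
    await conn.execute("""
        CREATE TABLE IF NOT EXISTS aidemo.products (
            product_id   VARCHAR(64) PRIMARY KEY,
            product_name TEXT,
            description  TEXT,
            list_price   NUMERIC(10, 2),
            embedding    VECTOR(1536)   -- extra column that Striim populates via the ColumnMap
        )
    """)
    await conn.close()

asyncio.run(create_table())
```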

Customize the Striim pipeline

  1. Go back to the pipeline using Apps -> View All Apps
  2. Open the IL app RealTimeAIDemo_IL
  3. Open the reader configuration -> Disable “Create Schema” under “Schema and Data Handling” as we are creating the schema on Postgres already.

Click the output stream of the reader and add a Continuous Query component (CQ) to it to generate embeddings for the description column using the embedding generator we created above.

The CQ essentially puts the embeddings into the userdata section of the source event, which can then be used to write to the embedding column in the Postgres table.

Here the output of the generateEmbeddings() function will be placed under the ‘embedding’ key (as part of the user data) in the source event.

Click the target and change the input stream to the output of the CQ (OutputWithEmbedding). In the Tables property, also add a mapping for the embedding column under Data Selection:

AIDEMO.PRODUCTS,aidemo.products ColumnMap(embedding=@USERDATA(embedding))

This will insert or update the ‘embedding’ column of the Postgres table with the generated vectors from the generateEmbeddings() call. This column holds the vector value of the ‘description’ column.

Perform the same steps for the CDC app in the pipeline as well.

  1. Go back to the pipeline using Apps -> View All Apps
  2. Open the CDC app RealTimeAIDemo_CDC and perform the same steps as done for the RealTimeAIDemo_IL application (i.e., calling the generateEmbeddings() function and adding the ColumnMap).

Create a Gen AI application using Python

The next step is to build a Python application, which does the following:

  1. Accepts user query in natural text for searching a product
  2. Converts the Query to embeddings using Open AI
  3. Performs similarity search using the pgvector extension in Azure Database for PostgreSQL
  4. Returns a response generated using LLM service with the top matching product details

  1. Please click here to download this Python application.
  2. asyncpg is used to connect to Postgres.
  3. OpenAIEmbeddings (langchain.embeddings) is used to generate embeddings for the user query.
    1. Please note that you need to use the same model as in the Striim embedding generator.
  4. langchain.chains.summarize is used for summarisation (model used: gpt-3.5-turbo-instruct).
  5. The query used to perform the similarity search is shown in the sketch after this list.
    1. A similarity threshold of 0.7 and a maximum of 5 matches were used for experimentation, but only the single closest result is picked for summarisation.
  6. gradio is used to present a simple UI.
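
Under the assumptions above (table and column names from the earlier definitions), the core similarity search with asyncpg and pgvector's cosine-distance operator looks roughly like this:

```python
import asyncio
import asyncpg

DSN = "postgresql://<user>:<password>@<azure-host>:5432/<database>"

async def search(query_embedding, threshold=0.7, max_matches=5):
    conn = await asyncpg.connect(DSN)
    # Inline the query vector as a pgvector literal; in production, prefer
    # pgvector's register_vector() codec and proper parameter binding.
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    rows = await conn.fetch(f"""
        SELECT product_id, product_name, description, list_price,
               1 - (embedding <=> '{vec}'::vector) AS similarity
        FROM aidemo.products
        WHERE 1 - (embedding <=> '{vec}'::vector) > $1   -- similarity threshold (0.7)
        ORDER BY similarity DESC
        LIMIT $2                                         -- max matches (5)
    """, threshold, max_matches)
    await conn.close()
    return rows
```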

Start the pipeline to perform the initial snapshot load

Once the initial load is complete, the pipeline will automatically transition to the CDC phase. Verify the data in the Postgres table and confirm that the embeddings are stored as well.

Run the similarity search Python application and verify that it fetches results using pgvector’s similarity search.

Capture real-time changes from the Oracle database and generate on-the-fly vector conversions

Perform a simple change in your source Oracle database to make a better product description for the product with id ’20e597d8836d9e5fa29f2bd877fc3e0a’.
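
For example, an update along these lines (column names assumed from the earlier table definition) is enough to exercise the CDC path:

```python
import oracledb

conn = oracledb.connect(user="aidemo", password="<password>", dsn="<host>:1521/<service>")
with conn.cursor() as cur:
    cur.execute(
        """UPDATE AIDEMO.PRODUCTS
           SET DESCRIPTION = :new_description
           WHERE PRODUCT_ID = :product_id""",
        new_description="Award-winning eco-friendly wooden stacking toy for toddlers, "
                        "BPA-free, durable, and sized for small hands.",
        product_id="20e597d8836d9e5fa29f2bd877fc3e0a",
    )
conn.commit()
```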


					
				

The Striim pipeline will instantly capture the change and deliver it to the Postgres table along with the updated vector embedding.

Run the same query in the search interface and notice that a fresh result shows up:

Now the same query returns a different product, since the search runs against the data recently updated in the Oracle database and replicated to Postgres.

There we go! Real-time integration and consistent data everywhere!

Conclusion

Experience the power of real-time RAG with Striim. Get a demo or start your free trial today to see how we enable context-aware, natural language-based search with real-time data consistency—delivering smarter, faster, and more responsive customer experiences!

References and credits

Inspired by the Google notebook that showcased the use case and referenced the curated dataset: Building AI-powered data-driven applications using pgvector, LangChain and LLMs.

Real-Time Analytics: Upleveling the Modern Customer Experience

Customer expectations have evolved beyond simply receiving timely responses. Consumers now expect personalized experiences that make every interaction with a brand feel personal and relevant. 

To meet these rising expectations, businesses are investing in real-time customer analytics—a strategic approach that enables them to understand, predict, and respond to customer behavior as it happens. In fact, according to a Gartner survey, nearly 80% of companies are increasing their investments in customer experience initiatives to stay competitive in the digital age. The result? An enhanced customer experience that drives loyalty, revenue growth, and sustainable success.

The Importance of Delivering Instant, Personalized Experiences

Generic messaging isn’t appealing to today’s customers — they expect more. They want to feel understood, valued, and personally connected to brands. Imagine visiting a website that recognizes your unique preferences and offers suggestions that truly resonate with your lifestyle. Instead of encountering one-size-fits-all content, the experience adapts to you—highlighting products that complement your previous choices or even tailoring messages to suit your local context and current environment.

This personalized touch transforms the way you interact with a brand. It creates a sense of ease and relevance, making you feel like the brand truly “gets” you. When every interaction feels thoughtfully designed around your needs, it not only enhances your shopping journey but also builds trust and fosters loyalty. In a world where time is precious and options are abundant, these tailored experiences become the key to turning a casual browser into a dedicated customer.

Now, let’s dive into how real-time data and analytics tie in. 

How Real-Time Data Directly Contributes to Customer Experience 

Real-time analytics is only as effective as the data it relies on. To truly transform customer interactions, brands must harness up-to-the-minute information that reflects every nuance of customer behavior. Without this dynamic input, any attempt at personalization risks being outdated by the time it reaches the customer. Real-time data empowers companies to analyze interactions across various channels—whether online or in-store—and immediately adjust the experience to meet individual needs. This agility can be the difference between a one-size-fits-all approach and a truly engaging, bespoke customer journey.

This instant personalization is built on a well-structured data strategy that combines three key types of data:

  • First-Party Data: This is data directly collected from your owned channels, such as your website and mobile apps.
  • Second-Party Data: Sourced from trusted partners who share insights from their interactions with customers, this data helps broaden your understanding while reinforcing direct customer feedback.
  • Third-Party Data: Acquired from data aggregators, this information can enrich your insights, offering a broader market perspective. However, it must be used judiciously, especially in light of evolving privacy regulations.

By integrating these diverse data sources, companies can transform raw information into actionable insights. Every customer touchpoint—be it browsing a website, receiving an email, or visiting a store—can be optimized in real time, ensuring that each interaction is as engaging and relevant as possible.

Yet, while the benefits of real-time data are clear, many companies still struggle with the necessary infrastructure. Legacy systems, siloed databases, and outdated analytics tools often impede the swift collection, processing, and application of data. 

Without a modern, agile data infrastructure, even the best personalization strategies can falter, resulting in delayed interactions and missed opportunities to connect with customers when it matters most. To fully leverage real-time data for a superior customer experience, businesses must invest in robust, scalable systems that can keep pace with the rapid flow of information in today’s digital landscape.

Enhancing the Entire Customer Journey

A holistic view of the customer journey is crucial in today’s competitive landscape. Real-time analytics offers a comprehensive look at every step a customer takes—from initial awareness to post-purchase engagement. This continuous flow of data allows companies to identify bottlenecks, understand how customers interact with various touchpoints, and make immediate improvements where needed.

For example, if analytics reveal that a particular webpage is causing customers to drop off during the checkout process, a real-time alert can prompt the team to investigate and optimize the page—whether by simplifying the form, improving the user interface, or even offering a live support chat. Similarly, journey reports and attribution analyses help trace the paths that lead to successful conversions, enabling brands to replicate positive experiences across other channels.

By continuously monitoring the customer journey and making data-driven adjustments, companies can ensure a smoother, more engaging experience that evolves alongside customer needs.

How to Implement Real-Time Analytics to Improve Customer Experience 

Transitioning to real-time analytics might seem like a daunting, resource-intensive task, but a strategic, phased approach can make the process manageable and highly effective.

Here’s how to begin. 

Start with High-Impact Use Cases

Focus initially on the areas where real-time data can make the most significant impact—such as personalization and loyalty. This allows your team to see immediate benefits and build internal support for broader initiatives.

Integrate Across Channels

Ensure your data infrastructure can handle inputs from various sources—online interactions, in-store purchases, mobile app engagements, and more. A unified view of customer behavior is key to delivering truly personalized experiences.

Leverage Scalable Platforms

Platforms like Striim offer robust solutions that combine data ingestion, processing, and analytics in one place. These tools are designed to grow with your needs, helping you integrate third-party data where appropriate and maintain compliance with evolving privacy standards.

Continuous Optimization

Use the insights gained from real-time data not just to react, but to proactively enhance the customer journey. Experiment with different loyalty strategies, test new personalization tactics, and refine your approach based on what the data tells you.

Looking Ahead: The Future of Customer Analytics

As technology advances, real-time analytics is poised to become even more integral to customer experience strategies. The evolution of AI and machine learning is enabling businesses to not only react to customer behavior but also predict it. This predictive capability means that brands are starting to anticipate customer needs before they arise, offering proactive recommendations and solutions that further enhance satisfaction and loyalty.

Emerging technologies, such as the Internet of Things (IoT), are also broadening the spectrum of available data. By integrating IoT devices, companies can gain insights into customer behavior in physical spaces—such as tracking in-store movements or monitoring product interactions—thereby adding another layer of depth to the customer experience.

In this new era, success is defined by the ability to blend data-driven insights with human creativity, crafting experiences that feel both personalized and authentic.

The Role of AI in Real-Time Analytics

By combining AI and real-time analytics through integration platforms like Striim, alongside AI-ready cloud data warehouses like Snowflake, businesses can create hyper-personalized, adaptive experiences that drive deeper customer connections and long-term loyalty.

Real-World Example: Morrisons 

Morrisons, one of the UK’s largest supermarket chains, has embraced real-time analytics to elevate its customer experience. By integrating critical data from its Retail Management System (RMS) and Warehouse Management System (WMS) into Google BigQuery via Striim, Morrisons now gains immediate visibility into stock levels and product availability. 

 

 

This shift from batch processing to real-time data access enables the company to promptly identify and resolve inventory issues, optimize replenishment, and ensure that shelves are consistently stocked. As a result, customers enjoy a more reliable and satisfying shopping experience—whether they’re shopping in-store or online—with up-to-date product information and timely promotions that cater to their needs.

The Future of Customer Experience is Here

Real-time analytics is no longer a futuristic concept—it is the foundation of modern customer engagement. By enabling instantaneous personalization and a continuously optimized customer journey, real-time analytics helps brands build lasting, meaningful relationships with their customers. 

For companies looking to embark on this journey, starting small and building on high-impact use cases can pave the way for a comprehensive transformation. With strategic tools and platforms available today, the path to delivering truly exceptional customer experiences is clearer than ever. Ready to discover how Striim can help your business leverage real-time data and analytics to enhance customer experience? Get a demo today

Combining Change Data Capture with Streaming to Drive AI-Powered Real-Time Analytics

AI thrives on real-time data. In a world where businesses generate massive volumes of data every second, success hinges on the ability to process, analyze, and act on that data instantly. Change Data Capture (CDC) and streaming technologies form the foundation for AI-driven analytics, ensuring data is always fresh, accurate, and actionable.

Together, CDC and streaming empower businesses to:

  • Supercharge AI models with real-time data: Provide AI with up-to-the-second insights to improve predictions and drive smarter decisions.
  • Adapt operations with AI-powered agility: Real-time processing enables immediate responses to market shifts, customer behaviors, and operational changes.
  • Deliver hyper-personalized experiences: AI leverages real-time streams to create tailored interactions that enhance engagement and satisfaction.
  • Streamline critical processes: From fraud detection to predictive maintenance, AI acts on live data to mitigate risks and improve outcomes.
  • Power agentic AI frameworks: Enable AI systems to operate autonomously by continuously ingesting and responding to real-time data.

Streaming Salesforce Data into Google BigQuery to Build Business Reports

Introduction

At Striim, we use our Salesforce Reader to read from our Salesforce account and write into Google BigQuery, where we join the data with data from HubSpot to create Looker reports that multiple internal teams (Sales, Customer Success, and Finance) use for reporting, analysis, and driving action items for their departments.

This recipe shows how you can build a data pipeline to read data from Salesforce and write to BigQuery. Striim’s Salesforce Reader will first read the existing tables from the configured Salesforce dataset and then write them to the target BigQuery project using the BigQuery Writer, a process called “initial load” in Striim and “historical sync” or “initial snapshot” by others. After completing the initial load, the Salesforce Reader will automatically transition to continuously reading updates to the configured Salesforce datasets and writing these source updates to the target BigQuery project using the BigQuery Writer. You can use the recipe to write into any of Striim’s supported targets.

Benefits

  • Act in Real Time – Predict, automate, and react to business events as they happen, not minutes or hours later.
  • Empower Your Teams – Give teams across your organization a real-time view into operational data.

Step – 1 – Prep Work

Setting up Salesforce as a source
Make sure you have the permissions to access the objects in the Salesforce account that you would like to read data from. These are the permissions that will be required for Automated OAuth:

  • Access the identity URL service
  • Manage Salesforce services
  • Manage user data via APIs
  • Perform requests at any time

Google BigQuery Target

Striim setup details

  • Get started on your journey with Striim by signing up for free on Striim’s Developer Edition.

Step – 2 – Create Striim Application

In Striim, an App (application) is the component that holds the details of the data pipeline – source and target details, plus other logical components organized into one or more flows.

The steps below will help you create an application (refer to Screenshot 1):

  1. Click on Apps (left-hand panel) to create your application.
  2. Enter Source: Salesforce and Target: BigQuery (as shown in Screenshots 1, 2, and 3 below).
  3. For this recipe, we are going to use the Salesforce Reader with the App type set to Automated (Screenshot 3).
  4. Click on the “Get Started” button.

(Screenshot- 1 – App selection based on source & target)
(Screenshot- 2 – Select the Salesforce Reader (first in the list as shown below))
(Screenshot- 3 – Target selected is BigQuery and App type should be “Automated”)

  1. Provide a name for your application
  2. Create a new namespace
  3. Click the “Next” button

(Screenshot- 4 – Striim App Creation)

Step – 3 – Configuring Salesforce as Source

Before we jump into the connection to the source, let’s understand the Connection Profile and Target Schema creation that are required to create our pipeline.

  1. Connection Profile – A connection profile allows you to specify the properties required to connect to an external data source once and use that set of properties in multiple sources and/or targets in multiple applications. The authentication types supported by connection profiles are OAuth for Salesforce and ServiceAccount for BigQuery.
  2. Target Schema creation – Keep this enabled to experience the power of Striim, where all the required schemas and tables are created for you by Striim (if they don’t exist already). Note – permissions are the only key requirement that you need to make sure of. For this recipe you will need to provide the service account key, which is also mentioned in the next step.

(Screenshot- 5 Gathering source details to connect)

Enable “Use Connection Profile”

(Screenshot- 6 – Connection Profile creation)

In the “New Salesforce Connection Profile” dialog:

  1. Connection Profile Name – Provide a name to identify this connection.
  2. Namespace – Select the namespace. In this case we have used the namespace where the App is created, and you can do the same.
  3. Host – We are connecting to the Prod instance, so the host is not required. If you are connecting to a non-prod account like a sandbox, then provide the host (for example: striim--ferecipe.sandbox.my.salesforce.com). Note: Please do not specify https:// in the host field.

(Screenshot-7- Creation of Connection Profile for Salesforce)

Click on “Sign in using OAuth”

  1. You will be redirected to the Salesforce login page where you can provide your credentials.
  2. After successfully logging in, you will see the screen below (Screenshot 8).

(Screenshot- 8 – Salesforce account authenticated)

  1. The Connection Profile dialog should show success messages for the connection and the test (refer to Screenshot 9 below).
  2. Click on Save.

(Screenshot- 9 –  Successful creation of Connection profile for Source)

  1. Striim will check on the source and environment access and then enable the “Next” button.
  2. In the next screen, select the Salesforce object(s) that you want to move into BigQuery and click “Next”.

(Screenshot- 10 –  Source Object selection)

Step – 4 – Configuring BigQuery as Target

  1. Choose the service account key.
  2. The “Project ID” will get auto-populated from the service account key. (Screenshot-11, 12)

(Screenshot-11 – BigQuery credential upload)

  1. Either select an existing dataset or create a new one. For this recipe we have created a new dataset.
  2. Click on the “Next” button.

(Screenshot-12 – Target Configuration)

Striim will validate the target connectivity and enable the “Next” button.

(Screenshot- 13 – Target checks and validation)

Review all the details of the data pipeline that we just created and click on “Save & Start”.

(Screenshot- 14 – Pipeline Review)

You have successfully created an app that will move your Salesforce data into Google BigQuery!

(Screenshot- 15 – Successful App Creation)

Step – 5 – Running and Monitoring your application

As the Striim App starts running, the dashboards and monitors (Screenshots 16 and 17) show the real-time data movement along with various metrics (e.g., memory and CPU usage) that we capture. Refer to our documentation for more details on monitoring.

(Screenshot- 16 – Monitoring Dashboard)

(Screenshot-17 – Metrics overview)

In the application that we just created, you will be able to experience real-time data movement into the target, allowing you to predict, automate, and react to business events as they happen, not minutes or hours later. The data in BigQuery is then joined with our HubSpot data to create Looker reports.

Related Information

  • In addition to the Salesforce Reader, Striim offers other Salesforce-related adapters – Salesforce CDC, Salesforce Writer, Salesforce Pardot, Salesforce Platform Event Reader, and Salesforce Push Topic Reader.
  • You can also look into another Salesforce recipe where we read from Salesforce and write into Azure Synapse.
  • Learn more about data streaming using Striim through our other Tutorials and Recipes.
  • More details about increasing throughput using parallel threads and recovery are here.

Conclusion

While this recipe has provided the steps to create a pipeline for your Salesforce data, do check out all the application adapters that Striim supports reading from and writing to.

If you have any questions regarding the Salesforce adapter or any other application adapters, reach out to us at applicationadapters_support@striim.com.
