Revolutionizing Data Queries with TextQL: Insights from Co-Founder Ethan Ding

Can AI really make your data analysis as easy as talking to a friend? Join us for an enlightening conversation with Ethan Ding, the co-founder and CEO of TextQL, as he shares his journey from Berkeley graduate to pioneering the text-to-SQL technology that’s transforming how businesses interact with their data. Discover how natural language queries are breaking down barriers, making data analysis accessible to everyone, regardless of technical skill. Ethan delves into the historical hurdles and the game-changing advancements that are pushing the boundaries of AI and large language models in data querying.

Ever wondered how the quest for full autonomy in self-driving cars relates to data querying? We draw fascinating parallels between these two cutting-edge fields, emphasizing the importance of structured systems over chaotic, AI-driven approaches. This chapter reveals the often-overlooked limitations of current data management practices and underscores the critical need for high-quality data and robust modeling. Through a comparison of traditional business intelligence tools and advanced AI-driven solutions, we explore what truly makes data querying effective and insightful.

Hear from Ethan Ding, co-founder and CEO of TextQL, as he explains how their innovative tool integrates seamlessly with existing BI infrastructures, boosting productivity without the need for disruptive overhauls. Tune in to find out how TextQL is making data-driven decisions faster and smarter, paving the way for a future where data is everyone’s best friend.

Follow Ethan Ding and TextQL at:

Training and Calling SGDClassifier with Striim for Financial Fraud Detection

In today’s fast-paced financial landscape, detecting transaction fraud is essential for protecting institutions and their customers. This article explores how to leverage Striim and SGDClassifier to create a robust fraud detection system that utilizes real-time data streaming and machine learning.

Problem

Transaction fraud detection is a critical responsibility for the IT teams of financial institutions. According to the 2024 Global Financial Crime Report from Nasdaq, an estimated $485.6 billion was lost to fraud scams and bank fraud schemes globally in 2023.

AI and ML help detect fraud, while real-time streaming frameworks like Striim play a key role in delivering financial data to reference and train classification models, enhancing customer protection.

Solution

In this article, I will demonstrate how to use Striim to perform key tasks for fraud detection with machine learning:

  • Ingest data in real time using a Change Data Capture (CDC) reader, call the model, and deliver alerts to a target such as email, Slack, Teams, or any other target supported by Striim
  • Train the model using a Striim initial load app, and re-train it through REST API automation if its accuracy score decreases

Fraud Detection Approach

In typical credit card transactions, a financial institution’s data science team uses supervised learning to label data records as either fraudulent or legitimate. By carefully analyzing the data, engineers can extract key features that define a fraudulent user profile and behavior, such as personal information, number of orders, order content, payment history, geolocation, and network activity.

For this example, I’m using a dataset from Kaggle, which contains credit card transactions collected from EU retailers approximately 10 years ago. The dataset is already labeled with two classes representing fraudulent and normal transactions. Although the dataset is imbalanced, it serves well for this demonstration. Key fields include purchase value, age, browser type, source, and the class parameter, which indicates normal versus fraudulent transactions.

Picking Classification Model

There are many possibilities for classification using ML. In this example, I evaluated logistic regression and SGDClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html. The main difference is the training method: SGDClassifier fits linear models (including logistic regression, when configured with the log loss) using stochastic gradient descent, which updates the model one sample at a time, whereas standard logistic regression solves the full optimization problem over the whole dataset at once. Many experts consider SGD better suited to larger datasets, which is why it was selected for this application.

Accuracy Measurement

The accuracy score is a metric that measures how often a model correctly predicts the desired outcome. It is calculated by dividing the total number of correct predictions by the total number of predictions. In an ideal scenario, the best possible accuracy is 100% (or 1). However, due to the challenges of obtaining and diagnosing a high-quality dataset, data scientists typically aim for an accuracy greater than 90% (or 0.9).
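To make the definition concrete, here is a minimal sketch of the calculation; the labels are made up for illustration:

```python
# Accuracy = correct predictions / total predictions.
# The labels below are made up for illustration (1 = fraud, 0 = normal).
from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]   # one fraud case missed

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))                # 9 correct out of 10 -> 0.9
print(accuracy_score(y_true, y_pred))       # same value via scikit-learn
```

Note that with an imbalanced dataset like the one used here, accuracy can look high even when fraud cases are missed, so the 0.9 target is a floor rather than a guarantee of quality.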

Training Step

Striim provides the ability to read historical data from various sources, including databases, messaging systems, files, and more. In this case, the historical data is stored in a MySQL database, a highly popular data source in the FinTech industry. Here’s what the architecture, with real-time data streaming augmented by training of the ML model, looks like:

You can achieve this in Striim with an Initial Load application that has a Database reader pointed at the transactions table in MySQL and a file target. With Striim’s flexible adapters, data can be loaded from virtually any database of choice into a local file system, ADLS, S3, or GCS.

Once the data load is completed, the application changes its status from RUNNING to COMPLETED. A script, or in this case a PS-made Open Processor (OP), can capture the status change and call the training Python script.

Additionally, I added a step with a CQ (Continuous Query) that allows data scientists to apply any transformation to the data in order to prepare it in a form satisfactory for the training process. This step can be easily implemented using Striim’s Flow Designer, which features a drag-and-drop interface along with the ability to code data modifications using a combination of a SQL-like language and utility function calls.


Model Reference Step

Once the model is trained, we can deploy it in a real-time data CDC application that streams user financial transactions from an operational database. The application calls the model’s predict method, and if fraud is detected, it generates and sends an alert. Additionally, it will check the model accuracy and, if needed, initiate the retraining step described above.


Model Reference App Structure

The flow begins with Striim’s CDC reader, which streams financial transactions directly from the database binary log. It then invokes the classification model trained in the previous step via a REST call. In this case, I am using an OP that executes REST POST calls containing the parsed transaction values needed for predictions. The model service returns the prediction, which is parsed by a query. If fraud is detected, the application generates an alert. At the same time, if the model accuracy dips below 90 percent, the Application Manager function can restart a training application called IL MySQL App using an internal management REST API.

Final Thoughts on Leveraging SGDClassifier and Striim for Financial Fraud Detection

This example illustrates how a real-world data streaming application can detect fraud by interacting with a classification model. The application sends alerts when fraud is detected using various Striim alert adapters, including email, web, Slack, or database. Furthermore, if the model’s quality deteriorates, it can initiate retraining of the model.

For reference TQL sources:

Small Data, Big Impact: Insights from MotherDuck’s Jacob Matson

What makes MotherDuck and DuckDB a game-changer for data analytics? Join us as we sit down with Jacob Matson, a renowned expert in SQL Server, dbt, and Excel, who recently became a developer advocate at MotherDuck.

During this episode, Jacob shares his compelling journey to MotherDuck, driven by his frequent use of DuckDB for solving data challenges. We explore the unique attributes of DuckDB, comparing it to SQLite for analytics, and uncover its architectural benefits, such as utilizing multi-core machines for parallel query execution. Jacob also sheds light on how MotherDuck is pushing the envelope with their innovative concept of multiplayer analytics.

Our discussion takes a deep dive into MotherDuck’s innovative tenancy model and how it impacts database workloads, highlighting the use of DuckDB format in Wasm for enhanced data visualization. Jacob explains how this approach offers significant compression and faster query performance, making data visualization more interactive. We also touch on the potential and limitations of replacing traditional BI tools with Mosaic, and where MotherDuck stands in the modern data stack landscape, especially for organizations that don’t require the scale of BigQuery or Snowflake. Plus, get a sneak peek into the upcoming Small Data Conference in San Francisco on September 23rd, where we’ll explore how small data solutions can address significant problems without relying on big data. Don’t miss this episode packed with insights on DuckDB and MotherDuck innovations!

Small Data SF Signup
Discount Code: MATSON100

 

Harnessing Continuous Data Streams: Unlocking the Potential of Online Machine Learning

The world is generating an astonishing amount of data every second of every day. It reached 64.2 zettabytes in 2020, and is projected to mushroom to over 180 zettabytes by 2025, according to Statista.

Modern problems require modern solutions — which is why businesses across industries are moving away from batch processing and towards real-time data streams, or streaming data. Moreover, the concept of ‘online machine learning’ has emerged as a potential solution for organizations working with data that arrives in a continuous stream or when the dataset is too large to fit into memory.

Today, we’ll walk you through the close connection between successful machine learning and streaming data. You’ll learn potential applications and why online machine learning is an excellent idea.

What is Online Machine Learning? 

Online machine learning is an approach that feeds data to the machine learning model incrementally, which makes it a natural fit for continuous streams. Instead of being trained on a complete data set all at once, the model receives data points one at a time or in small batches. This method is especially helpful in scenarios where data is generated continuously, as it enables the model to learn and adapt in real time. 

Applying machine learning to streaming data can help organizations with a wide range of applications. These include fraud detection from real-time financial transactions, real-time operations management (e.g., stock monitoring in the supply chain), or sentiment analysis over live social media trends on Facebook, Twitter, etc. 

“Online ML is the only way forward as old ways of using schedules to run batches do not fit with the growing data volumes and real time expectations,” shares Dmitriy Rudakov, Director of Solution Architecture at Striim. 

Simson Chow, Sr. Cloud Solutions Architect, adds, “Online machine learning allows models to continuously learn from new data and adapt in real-time. This will allow models to rapidly adjust to changing environments and produce accurate, up-to-date predictions. This dynamic approach is crucial in a constantly changing environment, where static models can quickly become outdated and ineffective.” 

What are Potential Use Cases for Online Machine Learning? 

Some instances where online machine learning is particularly impactful include: 

  • When your data has no end and is effectively continuous
  • When your training data is sensitive due to privacy issues, and you are unable to move it to an offline environment
  • When you can’t transfer training data to an offline environment due to device or network limitations
  • When the size of training datasets is too large, making it impossible to fit into the memory of a single machine at a specific time

Online vs Offline Machine Learning: Why Offline Machine Learning Is Not Ideal for Streaming Data

Traditional batch processing methods fall short when it comes to effectively utilizing streaming data for machine learning. 

These methods, usually referred to as offline or batch learning, can handle static datasets, processing them all at once. However, they’re not equipped to deal with the continuous flow of data in real time. Due to this, taking such an approach is not only resource-intensive but also time-consuming, making it unsuitable for dynamic environments where timely updates are crucial. Let’s dive deeper. 

Online vs Offline Machine Learning: Offline Learning Limitations 

Offline learning systems are limited by their inability to learn incrementally. Each time new data becomes available, the entire model must be retrained from scratch, incorporating both the old and new data into a single dataset. 

“Because traditional batch processing relies on frequently updating models with massive batches of data, it can result in redundant predictions and inadequate responses to new patterns, changes in the data, and more costs as a result of the model’s retraining and re-deployment, requiring significant infrastructure and compute resources,” says Chow. “This makes it unsuitable for various machine learning use cases. Because of this latency, it is not appropriate for real-time applications like online personalization, fraud detection, or autonomous systems where quick decisions are necessary.” 

This process consumes significant computational resources and can result in prolonged downtime as the model is retrained, re-evaluated, and redeployed. While automated tools can streamline this process, the delay in retraining limits the model’s responsiveness, particularly in time-sensitive applications such as financial forecasting.

“There are 2 main reasons traditional batch systems don’t work for customers anymore,” says Dmitriy Rudakov. “The first one is the growing need to act in real time. For example, can you imagine using Uber without a fast real-time response today?” Dmitriy Rudakov also adds that, while traditionally data administrators have tried to time this process to occur at night so it doesn’t interfere with daily operations, “Growing volumes of data [means] batch based training just doesn’t fit the time windows provided.” 

Online vs Offline Machine Learning: Online Learning Advantages

On the contrary, online machine learning can handle streaming data by feeding the model data incrementally. This approach allows the model to update itself in real time as new data arrives, making it highly adaptable to changes and reducing the latency associated with batch learning. For example, in stock price forecasting, where real-time data is crucial, an online learning model can continuously refine its predictions without the need for complete retraining, ensuring that forecasts are always based on the most current information.
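As a hedged sketch of that forecasting pattern, the snippet below updates a regressor with each new observation of a synthetic “price” series; modeling price changes rather than raw levels keeps the gradient updates stable. All numbers are illustrative:

```python
# Hedged sketch of online forecasting: update a regressor as each new point
# of a synthetic "price" series arrives; all numbers are illustrative.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0.0, 0.5, size=1000))
changes = np.diff(prices)                  # model changes, not raw levels

model = SGDRegressor(random_state=1)
window = 5                                 # last 5 changes as features
for t in range(window, len(changes)):
    x = changes[t - window:t].reshape(1, -1)
    y = [changes[t]]                       # the next change to predict
    model.partial_fit(x, y)                # incremental update, no retraining

next_change = model.predict(changes[-window:].reshape(1, -1))[0]
print("next price forecast:", prices[-1] + next_change)
```

Each new observation costs one `partial_fit` call, so the forecast always reflects the latest data without ever rebuilding the model from scratch.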

 

How Does Online Machine Learning Work? 

Now that you know why online machine learning is the better option, here’s how it works from a technical perspective — and how stream processing plays a role. 

Think of stream processing as the backbone that enables online machine learning to function effectively. It provides the infrastructure to ingest, process, and manage continuous data flows in real-time. This is where Striim comes into play, offering a robust platform designed to handle the complexities of stream processing and real-time data integration.

Striim also captures and processes real-time data from various sources, such as databases, IoT devices, and cloud environments. By leveraging the platform, organizations can seamlessly feed this real-time data into their online machine learning models, allowing them to learn and adapt continuously. Striim’s low-latency data streaming ensures that the online learning models are always working with the most current data, enabling timely and accurate decision-making.

How Online Machine Learning Can Make a Difference

Online machine learning is an approach in which training occurs incrementally by feeding the model data continuously as it arrives from the source. The data from real-time streams are broken down into mini-batches and then fed to the model. Here’s how it can make a difference. 

 

Save Computing Resources 

Online learning is accessible regardless of computing resources. If you have minimal computing resources and a lack of space to store streaming data, you can still leverage it successfully. 

Once an online learning system is done learning from a data stream, it can discard the data or move it to a storage medium, saving your business a significant amount of money and space. Online machine learning doesn’t require powerful, high-end hardware to process streaming data. That’s because only one mini-batch is processed in memory at a time, unlike offline machine learning, where everything has to be processed at once. As a result, you can even use an affordable piece of hardware like a Raspberry Pi to perform online machine learning.

“ML can be applied with data streaming systems in two ways,” shares Dmitriy Rudakov. “Model inference, i.e., calling the model in real time, can be done via different CDC techniques. This process does not require a lot of computing resources as the model is already trained, and the real-time app is just accessing it to generate some useful insights. Incidentally, if there is a change of properties in time (drift), the real-time system can make calls to calculate model accuracy scores and initiate retraining via automation. 

“Alternatively, training models can be done via the initial load phase, where, for a short period, the system can read and process all relevant data or subsets of data to train the model of choice. Training can also be done in real-time by sending event batches broken into chunks, according to use case needs, to the training modules, which will save computing resources and ensure freshness of models, thus addressing the drift problem.” 

Prevent Concept Drift

Online machine learning can also address concept drift — a known problem in machine learning. In machine learning, a ‘concept’ refers to a variable or a quantity that a machine learning model is trying to predict.

The term ‘concept drift’ refers to the phenomenon in which the target concept’s statistical properties change over time. This can be a sudden change in variance, mean, or any other characteristics of data. In online machine learning, the model computes one mini-batch of data at a time and can be updated on the fly. This can help to prevent concept drift as new streams of data are continuously used to update the model.
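A simple way to act on drift, in the spirit of the retraining automation quoted earlier, is to track accuracy over a rolling window of recent labeled predictions and flag when it dips below a threshold. The class below is an illustrative sketch, not a Striim component; the threshold and window size are arbitrary:

```python
# Illustrative drift monitor (not a Striim component): track accuracy over a
# rolling window of labeled predictions and flag when it drops below a
# threshold, at which point retraining can be triggered.
from collections import deque

class DriftMonitor:
    def __init__(self, threshold=0.9, window=200):
        self.threshold = threshold
        self.recent = deque(maxlen=window)   # 1 = correct, 0 = wrong

    def record(self, predicted, actual):
        self.recent.append(int(predicted == actual))

    def needs_retrain(self):
        if len(self.recent) < self.recent.maxlen:
            return False                     # not enough evidence yet
        return sum(self.recent) / len(self.recent) < self.threshold
```

In a streaming setting, `record()` would be called whenever a prediction can be compared against a later ground-truth label, and a `True` from `needs_retrain()` could kick off retraining via automation.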

Learning from large amounts of data streams can help with applications that deal with forecasting, spam filtering, and recommender systems. For example, if a user buys multiple products (e.g., a winter coat and gloves) within a space of minutes on an e-commerce website, an online machine learning model can use this real-time information to recommend products that can complement their purchase (e.g., a scarf). 

Online learning is closely connected to another concept called operationalizing machine learning, as both involve the continuous updating and adaptation of models with real-time data. Online learning enables models to refine their predictions on-the-fly, which is essential for maintaining accuracy in live environments. With this connection in mind, let’s explore how Striim supports these processes to enhance decision-making and operational efficiency.

Operationalizing Machine Learning with Striim

Operationalizing machine learning involves integrating models into live environments to leverage real-time data for continuous predictions and decision-making. This approach tackles challenges like handling high volumes of data, managing the speed at which data is generated and collected, and addressing the variety of data formats. For businesses, operationalizing machine learning translates into real-time insights, agility, improved accuracy, and enhanced operational efficiency.

Striim is an ideal platform for this task, offering comprehensive data movement capabilities crucial for digital transformation. It ingests and processes streaming data in real-time, performing essential transformations, filtering, and enrichment before the data is fed into online learning models. “The only way to keep the model fresh is leveraging data provided in real time,” shares Dmitriy Rudakov. By continuously feeding these models with fresh data, Striim ensures they can adapt in real-time, keeping predictions and decisions accurate as conditions change.

The connection between operationalizing machine learning and online machine learning is crucial. Online machine learning, which incrementally updates models with new data, ensures continuous learning and adaptation—exactly what’s needed for operationalizing machine learning in dynamic, real-world environments.

To address the challenges of data variety and ensure models stay current, Striim can help you with:

  • Event-driven data capture and processing to train models incrementally.
  • Capturing schema changes from source systems and managing data drift.
  • Handling large volumes of streaming data from multiple sources.
  • Performing filtering, enriching, and data preparation on streaming data.
  • Providing data-driven insights and predictions by integrating trained models with real-time data streams.
  • Tracking data evolution and assessing model performance, enabling automatic retraining with minimal human intervention.

With these capabilities, Striim provides a robust foundation for operationalizing machine learning, supporting continuous, real-time learning and adaptation. Learn more in our guide to operationalizing machine learning.

Leverage Striim for Online Machine Learning Use Cases

By combining the strengths of Striim’s real-time data integration with online machine learning, your organization can effectively tackle the challenges of modern data environments. Striim’s platform not only supports seamless data streaming but also enhances the accuracy and relevance of your machine learning models by providing continuous, up-to-date insights. Whether you need to adapt to shifting data patterns or optimize resource usage, Striim equips you with the tools to maintain a competitive edge. Get a demo today to learn how Striim can empower your online machine learning initiatives and drive smarter, faster decisions.
