In this video, Striim Founder and CTO, Steve Wilkes, discusses streaming integration, the need for stream processing and streaming SQL, and why they’re essential to real-world real-time solutions.
You’ve heard about streaming integration, the need for stream processing, and often hear the term streaming SQL. But what is streaming SQL, and why is it so essential to real-world real-time solutions?
IBM created the Structured Query Language, or SQL, in the 1970s as a declarative mechanism for working with relational data. It has been used for four decades as a way of creating, modifying and querying data in almost every database on the planet. However, because databases store data before it is available for querying, this data is invariably old.
In the world of real-time data and streaming systems there is also a need to work with data, and Striim chose 5 years ago to use a variant of SQL for stream processing. This streaming SQL looks very much like the static database variant, but needs new constructs to deal with the differences between stored and real-time continuous data.
Database SQL works against an existing set of data and produces a result set. If the data changes, the SQL needs to be run again. Streaming SQL receives a continuous and never-ending amount of data, and continually produces new results as new data arrives.
The simplest things that can be done with this data are filtering and transformation. These operations work event-by-event with every input potentially creating zero or one output.
For example, if we want to limit data moving from one stream to another to a certain location, we could write a simple WHERE clause.
SELECT *
FROM OrderStream
WHERE zip = 94301
And if we want to combine first and last names into full name, we can use concatenation, with other, more complex, functions of course available.
SELECT *,
FirstName + ' ' + LastName AS FullName
FROM OrderStream
WHERE zip = 94301
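The event-by-event behavior of the two queries above can be mimicked outside Striim. The following Python sketch is only an illustration and not how the platform is implemented; the `filter_and_transform` helper and the sample order dictionaries are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def filter_and_transform(orders: Iterable[Dict], zip_code: int) -> Iterator[Dict]:
    """Yield zero or one output event per input event, like SELECT ... WHERE."""
    for order in orders:
        if order["zip"] != zip_code:
            continue  # filtered out: this input produces no output
        # Transformation: keep every field and add a derived FullName field.
        yield {**order, "FullName": order["FirstName"] + " " + order["LastName"]}


# Hypothetical sample events standing in for OrderStream.
sample_orders = [
    {"FirstName": "Ada", "LastName": "Lovelace", "zip": 94301},
    {"FirstName": "Alan", "LastName": "Turing", "zip": 10001},
]
results = list(filter_and_transform(sample_orders, 94301))
```

Because the function is a generator, it never needs the whole input at once, which is the property that lets the same logic run over an unbounded stream.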
However, because streaming queries receive events one-by-one, additional constructs are required for aggregate queries that work against a set of data, so windows and event tables need to be introduced.
A window contains a set of events bounded by some criteria. This could be the last 5 minutes worth of data, last 100 events, or hold events until no more arrive within a certain time. Windows can also be partitioned, so the sets are based on the criteria per some data value, for example last 100 actions carried out per customer. Event tables hold the last event that occurred for some key, for example the last temperature reading per room.
Streaming SQL can work against windows and event tables and will output results whenever there is any change. Aggregate queries against windows will recalculate whenever the window is updated, giving running counts, sums over micro-batches, or activity within a session.
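The event table idea described above, holding only the last event per key, can be sketched in a few lines of Python. This is a simplified illustration under the assumption that events arrive in order; the `EventTable` class and the room readings are hypothetical and are not part of any Striim API.

```python
from typing import Any, Dict


class EventTable:
    """Keeps only the most recent event for each key (hypothetical helper)."""

    def __init__(self) -> None:
        self._latest: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, event: Dict[str, Any]) -> None:
        self._latest[key] = event  # a newer event replaces the older one

    def get(self, key: str) -> Dict[str, Any]:
        return self._latest[key]

    def __len__(self) -> int:
        return len(self._latest)


# Last temperature reading per room, as in the example in the text.
readings = EventTable()
for room, temp in [("kitchen", 20.5), ("lab", 18.0), ("kitchen", 21.0)]:
    readings.put(room, {"room": room, "temperature": temp})
```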
For example, to create a running count and sum of purchases per item in the last hour from a stream of orders, you would use a window and the familiar GROUP BY clause.
CREATE WINDOW OrderWindow
OVER OrderStream
KEEP WITHIN 1 HOUR
PARTITION BY itemId;
SELECT itemId, itemName,
COUNT(*) as itemCount,
SUM(price) as totalAmount
FROM OrderWindow
GROUP BY itemId
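To show what the window and aggregate above compute, here is a minimal Python sketch of a partitioned sliding time window with a running count and sum. It assumes events arrive in timestamp order and only expires old events when a new one arrives for the same item; the `PartitionedTimeWindow` class is a hypothetical illustration, not Striim's implementation.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class PartitionedTimeWindow:
    """Sliding time window partitioned by item id, with running count and sum."""

    def __init__(self, keep_within_seconds: float) -> None:
        self.keep_within = keep_within_seconds
        # One deque of (timestamp, price) pairs per item id.
        self.events: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)

    def add(self, item_id: str, price: float, ts: float) -> Tuple[int, float]:
        """Add one order and return (itemCount, totalAmount) for that item."""
        window = self.events[item_id]
        window.append((ts, price))
        # Evict events that fall outside the retention period.
        while window and window[0][0] <= ts - self.keep_within:
            window.popleft()
        return len(window), sum(p for _, p in window)


window = PartitionedTimeWindow(keep_within_seconds=3600)
first = window.add("A1", 10.0, ts=0)
second = window.add("A1", 5.0, ts=1800)
third = window.add("A1", 2.0, ts=4000)  # the order from ts=0 has expired
other = window.add("B7", 7.5, ts=4000)  # separate partition, separate totals
```

Every call returns a fresh result, which mirrors the way an aggregate query over a window recalculates whenever the window is updated.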
Enriching data is just as easy: it uses the standard notion of a JOIN. The Striim platform supports all types of joins familiar to database users, including inner, outer, cross, and self-joins through nested queries. Striim enables users to load large amounts of data into in-memory caches and event tables from databases, files, HDFS, and other sources. This can be reference, context, or historical data, and can be updated through the incorporation of CDC.
For example, if we want to enrich the orders stream to include details about customer and location, we can join with reference data loaded into caches from the customer table and location database.
SELECT o.orderid, o.itemname,
o.custid, o.price, o.quantity,
c.name, c.age, c.gender, c.zip,
z.city, z.state, z.country
FROM OrderStream o,
CustInfo c, ZipInfo z
WHERE o.custid = c.id
AND c.zip = z.zip
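The join above can be pictured as a pair of in-memory lookups per order. The Python sketch below illustrates that idea with plain dictionaries standing in for the caches; the `enrich_orders` function and all sample values are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def enrich_orders(orders: Iterable[Dict[str, Any]],
                  cust_info: Dict[int, Dict[str, Any]],
                  zip_info: Dict[int, Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Inner-join each order with cached customer and location reference data."""
    for order in orders:
        customer = cust_info.get(order["custid"])
        if customer is None:
            continue  # inner join semantics: no matching customer, no output
        location = zip_info.get(customer["zip"])
        if location is None:
            continue  # no matching location, no output
        yield {**order,
               "name": customer["name"], "zip": customer["zip"],
               "city": location["city"], "state": location["state"]}


# Dictionaries keyed by id and zip stand in for the CustInfo and ZipInfo caches.
cust_info = {1: {"name": "Ada", "zip": 94301}}
zip_info = {94301: {"city": "Palo Alto", "state": "CA"}}
orders = [
    {"orderid": 100, "custid": 1, "price": 9.99},
    {"orderid": 101, "custid": 2, "price": 5.00},  # unknown customer
]
enriched = list(enrich_orders(orders, cust_info, zip_info))
```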
Of course, this just scratches the surface of what can be achieved through Streaming SQL. Production queries can be much more complex, utilizing case statements and even pattern matching syntax.
In this cloud migration monitoring demo, we will show how, by collecting change data from source and target and matching transactions applied to each in real time, you can ensure your cloud database is completely synchronized with on-premise, and detect any data divergence when migrating from an on-premise database.
Migrating applications to AWS requires more than just being able to run in VMs or cloud containers. Applications rely on data, and that data needs to be migrated as well. In most cases, the original applications are essential to the business and cannot be stopped during this process. Since it takes time to migrate the data and time to verify the application after migration, it is essential that data changes are collected and delivered during and after that initial load. As the data is so crucial to the business and change data will be continually applied for a long time, mechanisms that verify that the data is delivered correctly are an important aspect of any cloud migration. This migration monitoring demo will show how, by collecting change data from source and target and matching transactions applied to each in real time, you can ensure your cloud database is completely synchronized with on-premise, and detect any data divergence when migrating from an on-premise database.
The key challenges with monitoring cloud database migrations include enabling data migration without a production outage, with monitoring during and after migration; detecting out-of-sync data should any divergence occur, with this detection happening immediately at the time of divergence; preventing further data corruption; running the monitoring solution non-intrusively with low overhead; and obtaining sufficient information to enable fast resynchronization. In our scenario, we’re monitoring the migration of an on-premise application to AWS. A Striim dashboard shows real-time status, complete with alerts, and is powered by a continuously running data pipeline. The on-premise application uses an Oracle database and cannot be stopped. The database transactions are continually replicated to an Amazon Aurora MySQL database. The underlying migration solution could be either Striim’s migration solution or other solutions such as AWS DMS. The objective is to monitor the ongoing migration of transactions and alert when any transactions go out of sync, indicating a potential data discrepancy.
This is achieved in the Striim platform through its continuous query processing layer. Transactions are continuously collected from the source and target databases in real time and matched within a time window. If matching transactions do not occur within a period of time, they’re considered long running. If no match occurs in an additional time period, the transaction is considered missing. Alerts are generated in both cases. The numbers of alerts for missing transactions and long-running transactions are displayed in the dashboard. Transaction rates and operation activity are also available in the dashboard and can be displayed for all tables or just for critical tables and users. You can immediately see live updates and alerts when transactions do not get propagated to the target within a user-configured window, with long-running transactions that eventually make it to the target also tracked. The dashboard is user customizable, making it easy to add additional visualizations for specific monitoring as necessary. You’ve seen how Striim can be used for continuous monitoring of your on-premise to cloud migrations. Talk to us today about this solution and get started immediately using a download from our website, or test out Striim in the AWS Marketplace.
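The matching rule described above (match within a window, otherwise long running, and after an additional period missing) can be sketched as follows. This Python example is a simplified illustration and not the demo's actual continuous queries; the function name, the `pending` status, and the threshold values are assumptions.

```python
from typing import Dict


def classify_transactions(source_commits: Dict[str, float],
                          target_applies: Dict[str, float],
                          now: float,
                          long_running_after: float,
                          additional_wait: float) -> Dict[str, str]:
    """Classify each source transaction by whether the target has applied it."""
    statuses: Dict[str, str] = {}
    for txn_id, committed_at in source_commits.items():
        if txn_id in target_applies:
            statuses[txn_id] = "matched"
            continue
        age = now - committed_at
        if age > long_running_after + additional_wait:
            statuses[txn_id] = "missing"       # no match even after the extra period
        elif age > long_running_after:
            statuses[txn_id] = "long_running"  # late, but may still arrive
        else:
            statuses[txn_id] = "pending"       # still inside the matching window
    return statuses


# Hypothetical commit and apply times, in seconds.
statuses = classify_transactions(
    source_commits={"t1": 0.0, "t2": 10.0, "t3": 50.0, "t4": 95.0},
    target_applies={"t1": 2.0},
    now=100.0, long_running_after=30.0, additional_wait=30.0)
alerts = sorted(t for t, s in statuses.items() if s in ("long_running", "missing"))
```

In a streaming system this classification would be re-evaluated continuously as new source and target events arrive, rather than once against a fixed `now`.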
In this video, Striim Founder and CTO, Steve Wilkes, talks about moving data to Amazon Web Services in real-time and explains why streaming data integration to AWS – with change data capture (CDC) and stream processing – is a necessary part of the solution.
Adopting Amazon Web Services is important to your business, but why are real-time data movement through streaming integration, change data capture, and stream processing necessary parts of this process? You’ve already decided that you want to adopt Amazon Web Services. This could be Amazon RDS or Aurora, Amazon Redshift, Amazon S3, Amazon Kinesis, Amazon EMR, or any number of other technologies. You may want to migrate existing applications to AWS, scale elastically as necessary, or use the cloud for analytics or machine learning, but running applications in AWS as VMs or containers is only part of the problem. You also need to consider how to move data to the cloud, ensure your applications and analytics are always up to date, and make sure the data is in the right format to be valuable. The most important starting point is ensuring you can stream data to the cloud in real time. Batch data movement can cause unpredictable load on cloud targets and has a high latency, meaning the data is often hours old.
For modern applications, having up-to-the-second information is essential, for example, to provide current customer information, accurate business reporting, or for real-time decision making. Streaming data from on-premise to Amazon Web Services requires making use of appropriate data collection technologies. For databases, this is change data capture, or CDC, which directly and continuously intercepts database activity and collects all the inserts, updates, and deletes as events as they happen. Log data requires file tailing, which reads at the end of one or more files across potentially multiple machines and streams the latest records as they are written. Other sources like IoT data or third-party SaaS applications also require specific treatment in order to ensure data can be streamed in real time. Once you have streaming data, the next consideration is what processing is necessary to make that data valuable for your specific AWS destination, and this depends on the use case.
For database migration or elastic scalability use cases, where the target schema is similar to the source, moving raw data from on-premise databases to Amazon RDS or Aurora may be sufficient. The important consideration here is that the source applications typically cannot be stopped, and it takes time to do an initial load. This is why collecting and delivering database change during and after the initial load is essential for zero-downtime migrations. For real-time applications sourcing from Amazon Kinesis, or analytics use cases built on Amazon Redshift or Amazon EMR, it may be necessary to perform stream processing before the data is delivered to the cloud. This processing can transform the data structure and enrich it with additional context information while the data is in flight, adding value to the data and optimizing downstream analytics. Striim’s streaming integration platform can continuously collect data from on-premise or other cloud sources and deliver to all of your Amazon Web Services endpoints. Striim can take care of initial loads as well as CDC for the continuous application of change, and these data flows can be created rapidly, and monitored and validated continuously, through our intuitive UI. With Striim, your cloud migration, scaling, and analytics can be built and iterated on at the speed of your business, ensuring your data is always where you want it, when you want it.
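The file tailing mentioned above, reading at the end of a file and streaming the latest records as they are written, can be illustrated with a small Python sketch. It is a simplified, single-file example that polls on demand; the `FileTailer` class and the sample log lines are hypothetical, and a real collector would also handle rotation and multiple machines.

```python
import os
import tempfile
from typing import List


class FileTailer:
    """Returns only the complete records appended since the previous read."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offset = 0  # byte position of the first unread record

    def read_new_records(self) -> List[str]:
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
        # Consume only complete lines; a partial last line waits for the next call.
        end = chunk.rfind(b"\n") + 1
        self.offset += end
        return chunk[:end].decode("utf-8").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("login user=1\nlogout user=1\n")
    tailer = FileTailer(log_path)
    first_batch = tailer.read_new_records()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("login user=2\nlogin us")  # second record is only partly written
    second_batch = tailer.read_new_records()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("er=3\n")
    third_batch = tailer.read_new_records()
```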
Adopting Google Cloud Platform is important to your business, but why are real-time data movement, change data capture, and stream processing necessary parts of this process? You’ve already decided that you want to adopt Google Cloud Platform. This could be Google BigQuery, Cloud Pub/Sub, Cloud SQL, Dataproc, or any number of other technologies. You may want to migrate existing applications to the cloud, scale elastically as necessary, or use the cloud for analytics or machine learning, but running applications in the cloud as VMs or containers is only part of the problem. You also need to consider how to move data to the cloud, ensure your applications or analytics are always up to date, and make sure the data is in the right format to be valuable. The most important starting point is ensuring you can stream data to the cloud in real time. Batch data movement can cause unpredictable load on cloud targets and has a high latency, meaning the data is often hours old.
For modern applications, having up-to-the-second information is essential, for example, to provide current customer information, accurate business reporting, or for real-time decision making. Streaming data from on-premise to Google Cloud Platform requires making use of appropriate data collection technologies. For databases, this is change data capture, or CDC, which directly and continuously intercepts database activity and collects all the inserts, updates, and deletes as events as they happen. Log data requires file tailing, which reads at the end of one or more files across potentially multiple machines and streams the latest records as they are written. Other sources like IoT data or third-party SaaS applications also require specific treatment in order to ensure data can be streamed in real time. Once you have streaming data, the next consideration is what processing is necessary to make the data valuable for your specific Google Cloud destination, and this depends on the use case.
For database migration and elastic scalability use cases, where the target schema is similar to the source, moving raw change data from on-premise databases to Google Cloud SQL may be sufficient. However, for real-time applications sourcing from Google Cloud Pub/Sub, or analytics use cases built on Google BigQuery or Dataproc, it may be necessary to perform stream processing before the data is delivered to the cloud. This processing can transform the data structure and enrich it with additional context information while the data is in flight, adding value to the data and optimizing downstream analytics. Striim’s streaming integration platform can continuously collect data from on-premise or other cloud sources and deliver to all of your Google Cloud endpoints. Striim can take care of initial loads as well as CDC for the continuous application of change, and these data flows can be created rapidly, and monitored and validated continuously, through our intuitive UI. With Striim, your cloud migration, scaling, and analytics can be built and iterated on at the speed of your business, ensuring your data is always where you want it, when you want it.
In this demo, you’re going to see how you can utilize Striim to do real-time collection of change data capture from Oracle Database and deliver that, in real-time, into Microsoft Azure SQL Server. I’m also going to build a custom monitoring solution of the whole end-to-end data flow. (The demo starts at the 1:43 mark.)
Unedited Transcript:
Today, we’ll see how you can move data from Oracle to Azure SQL Server running in the cloud in real time using Striim and change data capture. So you have data in lots of Oracle tables on-premise and you want to move this into Microsoft Azure SQL Server in real time. How do you go about doing this without affecting your production databases? You can’t use SQL queries, because typically these would be queries against a timestamp, like table scans that you do over and over again, and that puts a load on the Oracle database, and you can also skip important transactions. You need change data capture, and CDC enables non-intrusive collection of streaming database change. Striim provides change data capture as a collector out of the box. This enables real-time collection of change data from Oracle, SQL Server, and MySQL. CDC works because databases write all the operations that occur into transaction logs. Change data capture listens to those transaction logs and, instead of using triggers or timestamps, directly reads these logs to collect operations.
This means that every DML operation, every insert, update, and delete, is written to the logs, captured by change data capture, and turned into events by our platform. So in this demo you’re going to see how you can utilize Striim to do real-time collection of change data from your Oracle database and deliver that in real time into Microsoft Azure SQL Server. I’m also going to build a custom monitoring solution of the whole end-to-end data flow. First of all, we connect to Microsoft Azure SQL Server. In this instance we have two tables, t customer and t cost odd, that, as we can show here, are currently completely empty.
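To make the idea of change events concrete, the sketch below replays a short sequence of insert, update, and delete events against an initially empty target. It is a minimal Python illustration only; the event layout (`op`, `key`, `data`) and the `apply_change` function are assumptions made for this example and do not reflect Striim's actual event format.

```python
from typing import Any, Dict


def apply_change(table: Dict[int, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one change event to an in-memory copy of the target table."""
    op, key = event["op"], event["key"]
    if op == "INSERT":
        table[key] = dict(event["data"])
    elif op == "UPDATE":
        # Merge the changed columns into the existing row.
        table[key] = {**table.get(key, {}), **event["data"]}
    elif op == "DELETE":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


target: Dict[int, Dict[str, Any]] = {}  # starts empty, like the target tables
change_events = [
    {"op": "INSERT", "key": 1, "data": {"name": "Ada", "city": "London"}},
    {"op": "INSERT", "key": 2, "data": {"name": "Alan", "city": "Wilmslow"}},
    {"op": "UPDATE", "key": 1, "data": {"city": "Paris"}},
    {"op": "DELETE", "key": 2},
]
for change in change_events:
    apply_change(target, change)
```

Applying the events in commit order is what keeps the target consistent with the source, which is why the order recorded in the transaction log matters.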
We’re going to use a data flow that we’ve built in Striim to capture data from an on-premise Oracle database using change data capture. You can see some of the configuration properties here. We then deliver that, after doing some processing, into Microsoft Azure SQL Server, and you can see the properties for configuring that here. To show this, we’re going to run some SQL against Oracle, and the SQL does a combination of inserts, updates, and deletes against our two Oracle tables. When we run this, you can see the data immediately in the initial stream. That data stream is then split into multiple processing steps and then delivered into Azure SQL Server. If we redo the query against the Azure tables here, you can see that the previously empty tables now have data in them, and that data was delivered and will continue to be delivered live as long as changes are happening in the Oracle database. In addition to the data movement, we’ve also built a monitoring application, complete with a dashboard, that shows you the data flowing through the various tables, the types of operations that are occurring, and the entire end-to-end transaction lag. This is the difference between when a transaction was committed on the source system and when it was captured and applied to the target. It also shows some of the most recent transactions. This was built, again, using a data flow within the Striim platform.
This data flow uses the original streaming change data from the Oracle database and then applies some processing, in the form of SQL queries, to generate statistics. In addition to generating data for the dashboard, you can also use this as rules to generate alerts for thresholds, etc. And the dashboard itself is not hard-coded. It’s generated using our dashboard builder, which utilizes queries to connect to the back end. Each visualization you’re seeing here is powered by a query against the back-end data, and there are lots of visualizations to choose from. So I hope you have enjoyed seeing how to move Oracle data on-premise into the cloud using Striim.
This is how to build change data capture into Kafka, do some processing on that, and then do some delivery into other things. So this is a pure integration play. You start off by doing change data capture from MySQL. In this case, we build the initial application and then configure how we get data from the source, so we configure the information to connect into MySQL. When you do this, we’ll check and make sure everything is going to work, that you already have change data capture configured properly, and, if you don’t, show you what you have to fix and how to do it. You then select the tables that you’re interested in. We’re going to collect the change data, and this is going to create a data stream that we then deliver to Kafka.
So we’re going to configure how we want to write into Kafka, and that’s basically setting up what the broker configuration is, what the topic is, and how we want to format the data. In this case, we’re going to write it out as JSON. When we save this, this is going to create a data flow. And the data flow is very simple. In this case, it’s two components: we go from a MySQL CDC source into a Kafka writer. We can test this by deploying the application, and it’s a two-stage process. You deploy it first, which will put all the components out over the cluster, and then you run it. And now we can see the data that’s flowing in between. So if I click on this, I can actually see the real-time data, and you can see there’s a data field and there’s also a before field.
That’s basically for updates: you get the before image as well, so you can see what’s actually changed. So this is real-time data flowing through the MySQL application. The raw data may not be that useful. One of the pieces of data in here is a product ID, and that probably doesn’t contain enough information, so what we’re going to do first is extract the various fields from this, and those various fields include the location ID, product ID, how much stock there is, et cetera. This is an inventory monitoring table, and we just turned that from kind of a raw array format into a set of named fields, so it’ll be easier to work with later on. And you can see the structure is very different now in what we’re actually seeing in that data stream. If we then add additional context to this, what we’ll be able to do is join that data with something else. So first of all, we’ll just configure this so that, instead of writing the raw data out to Kafka, we’ll write that processed data out. And you can see all we have to do is change the input stream. So that will change the data flow, and now we write processed data into Kafka.
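The step described above, turning a raw array of column values into named fields and using the before image to see what changed, might look like the following in Python. The field names, their order, and the event layout are all hypothetical; they are chosen only to match the inventory example in the transcript.

```python
from typing import Any, Dict

# Assumed column order of the inventory table (hypothetical).
FIELD_NAMES = ("location_id", "product_id", "stock")


def to_named_fields(event: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a raw change event with positional arrays into named fields."""
    named = dict(zip(FIELD_NAMES, event["data"]))
    before = event.get("before")
    if before is not None:
        # Updates carry a before image, so the previous value is available too.
        named["previous_stock"] = before[FIELD_NAMES.index("stock")]
    return named


update_event = {"op": "UPDATE", "data": [10, 42, 7], "before": [10, 42, 9]}
named = to_named_fields(update_event)
```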
But now we’re going to add a cache, and this is a distributed in-memory data grid that’s going to contain additional information that we want to join with our raw data. And so this is product information. So for every product ID there is a description and price and some other stuff. So first of all, we just create a data type that corresponds to our database table, and configure what the key is, and the key in this case is the product ID. Then we specify how we are going to get the data. It could be from files, it could be from HDFS. We’re going to use a database reader to load it from a MySQL table. So we specify all the connections and the query we’re going to use. And we now have a cache of product information. To use this, we modify our SQL to just join in the cache. So anyone that’s ever written any SQL before knows what a join looks like. We’re just joining on the product ID. So now, instead of just the raw data, we have these additional fields that we’re pulling in, in real time, from the product information. So if we start this and look at the data again, you will actually be able to see the additional fields like description and brand and category and price that came from that other table, and that’s all joined in memory. There are no database lookups going on, so it is really, really fast.
If you already have data on Kafka, or another message bus, or anywhere else for that matter, such as in files, you may want to read it and push it out to some other targets. So what we’re going to do now is take that data we just wrote to Kafka. We’re going to use a Kafka reader in this case. So we’ll just search for that and drag in the source, and then we can configure that with the properties to connect to the broker that we just used. Because we know this is JSON data, we’re going to use a JSON parser, which is going to break it up into a JSON object structure and then create this data stream. When we deploy this and start this application, it’ll start reading from that Kafka topic.
We can look at that data, and we can see this is the data that we were writing previously, with all the information in it, and it’s in JSON format. You can see the JSON structure. For the other targets that we’re going to write to, the JSON structure might not work. So what we’re going to do now is build a query that’s going to pull the various fields out of that JSON structure and create a well-defined data stream that has various individual fields in it. So we write a query to do that. It’s directly accessing the JSON data, and we save that. And now, instead of the original data stream that we had with the JSON in it, we deploy this, start it up, and look at the data, and this is incidentally how you would build applications, looking at the data all the time as you’re adding additional components. If we look at the data stream now, you’ll be able to see that we have those individual fields, which is what we had before on the other side of Kafka. But don’t forget that it may not be Striim writing to Kafka. It could be anything else. And if you were doing something like we just did, with CDC into Kafka, then Kafka into additional targets, you don’t have to have Kafka in between. You can just take the CDC and push it out to the targets directly.
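Pulling individual fields out of a JSON message into a well-defined record, as described above, can be sketched in Python with the standard `json` module. The `InventoryRecord` type, its field names, and the sample message are hypothetical and stand in for whatever the Kafka topic actually contains.

```python
import json
from dataclasses import dataclass


@dataclass
class InventoryRecord:
    """A well-defined record with individual typed fields (hypothetical schema)."""
    location_id: int
    product_id: int
    stock: int
    description: str


def parse_message(payload: str) -> InventoryRecord:
    """Extract and type-convert the fields of one JSON message."""
    doc = json.loads(payload)
    return InventoryRecord(
        location_id=int(doc["location_id"]),
        product_id=int(doc["product_id"]),
        stock=int(doc["stock"]),
        description=str(doc.get("description", "")),
    )


message = '{"location_id": 10, "product_id": "42", "stock": 7, "description": "Widget"}'
record = parse_message(message)
```

Converting to a fixed set of typed fields up front means downstream targets such as delimited files or Avro do not need to understand the JSON structure.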
So what we’re going to do now is add a simple target, which is going to write to a file. We do this by choosing the file adapter, so the file writer, and specifying the format we want. We’re going to write this out in CSV format. We actually call it DSV, because it’s delimiter separated, and the delimiter could be anything. It doesn’t have to be a comma. We save that, and now we have something that’s going to write out to the file. So if we deploy this and start this up, then we’ll be creating a file with real-time data.
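The point that the delimiter could be anything is easy to show with Python’s standard `csv` module, which accepts any single-character delimiter. This is only a sketch of delimiter-separated output in general, not of Striim’s file writer; the `write_dsv` helper and the sample record are hypothetical.

```python
import csv
import io
from typing import Any, Dict, List, Sequence


def write_dsv(records: List[Dict[str, Any]], fieldnames: Sequence[str],
              delimiter: str = "|") -> str:
    """Render records as delimiter-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [{"location_id": 10, "product_id": 42, "stock": 7}]
fields = ["location_id", "product_id", "stock"]
pipe_text = write_dsv(rows, fields)                   # pipe-delimited
comma_text = write_dsv(rows, fields, delimiter=",")   # ordinary CSV
```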
And after a while it’s got some data in it, and then we can use something like Microsoft Excel to actually view the data, to check that it’s what we wanted. So let’s take a look in Excel, and we can see the data that we initially collected from MySQL being written to Kafka, being read from Kafka, and then being written back out into the CSV file. You don’t have to have just one target in a single data flow; you can have multiple targets if you want. We’re going to add two, writing into Hadoop and into Azure Blob Storage. In the case of Hadoop, we don’t want all the data to go to Hadoop, so we add a simple CQ to restrict the data, and we do this by location ID. So only location 10 is going to be written to Hadoop, so there’s some filtering going on there.
And now we will add in the Hadoop target. So we’re going to write to HDFS as a target, and drag that into the data flow. As you can see, there are many ways of working with the platform. We also have a scripting language, by the way, that enables you to do all of this from vi or emacs or whatever your favorite text editor is. We’re going to write to HDFS in Avro format, so we’ll specify the schema file. And then, when this is started up, we’ll be writing into HDFS as well as to the local file system. Similarly, if we want to write into Azure Blob Storage, we can take the adapter for that: we just search for that and drag that in from the targets. And we’re going to do that on the original source data, not that query, so we’ll drag it into that original data stream.
And now we just configure this with information from Azure. So you need to find out what the URL is, and you should know what your key is, and the username and password and things like that. You’re going to collect that information if you don’t have it already, and then add that into the target definition for your Azure Blob Storage. We’re going to write that out in JSON format. So that’s, very quickly, how you can do real-time streaming data integration with our platform. And all of that data was streaming. It was being created by doing changes to MySQL.
In this webinar, join Alex Woodie, Editor-in-Chief of Datanami, and Steve Wilkes, founder and CTO of Striim, for an in-depth look at some strategies that companies can begin to implement today to address these needs. Learn ways to start the migration toward a data architecture capable of handling the oncoming tsunami of data through a “streaming-first” approach. Topics include data processing and integration at the Edge; migrating existing systems and processes toward a streaming architecture; and strategies for avoiding data storage altogether, without losing information or intelligence.
Welcome and thank you for joining us for today’s webinar. My name is Katherine and I will be serving as your moderator. The presentation today is entitled Strategies for Managing the Oncoming Tsunami of Data. We all know that over the next several years, data volumes will skyrocket. What has not been made clear until recently, though, is that there simply won’t be enough storage on earth to store it all. Over the next 55 minutes, we will discuss strategies for addressing this data deluge by leveraging a streaming data architecture. We are honored to have as our first speaker today Alex Woodie, managing editor of Datanami. Alex has covered the high tech and IT industry as a technology journalist for more than a decade, focusing on emerging trends in systems, storage, software, business intelligence, cloud, and mobility.
Joining Alex is Steve Wilkes, co-founder and CTO of Striim. Steve has served in several technology leadership roles prior to founding Striim, including heading up the advanced technology group at GoldenGate Software and leading Oracle’s cloud data integration strategy. Throughout the event, please feel free to submit your questions in the Q and A panel located on the left-hand side of your screen. Alex and Steve will be answering questions throughout the presentation, and we’ll address as many as possible after the presentation wraps up. With that, it is my pleasure to introduce Alex Woodie.
Thank you for that introduction, Katherine. We’re all tired of the term big data, but it accurately describes the problem as well as the technology solutions that we’ve created to deal with it. Striim asked me to provide some background on the history of where we are today, and that’s what I intend to do. So it all started about 25 years ago, in the mid nineties, with the growth of the Internet. It was clear at the time that it would eventually have a major effect on our lives. We didn’t know exactly how it would all transpire, but we knew that it would be big. Looking back on the past 25 years, a pattern has emerged. As more people built things on the Internet, it drove a need for bigger and better technology: bigger disks to hold all the data, better processors to process the data, and faster networks to move all the data.
As the hardware improved, so did the software, much of which was open source, from web servers and databases to operating systems and search indexers. A unified stack began to emerge, starting with the LAMP stack of Linux, the Apache web server, the MySQL database, and PHP, Python, and Perl, in addition to JavaScript, HTML5, CSS, and JSON, and these technologies supercharged the productivity of web developers to meet our demand for even bigger and better websites. As the web development tools solidified and got easier to use, the web exploded. In 2000, when Doug Cutting released his Lucene search engine as an open source project, there were on the order of about a million websites in the world, and it was a great increase, to be sure, from the early nineties, when there were basically zero. But it’s a far cry from where we are today, as the number of websites surpassed 1 billion, or one unique website for every seven people on the planet.
By 2014, that’s where we had got to. In retrospect, this time presented a great flowering of technology, and it was, in my view, the origin of today’s big data tools. As the ecommerce engine started to move ahead, it exposed gaps in the technology of the day. The web giants of Silicon Valley saw it first, and it all started with an extension to Cutting’s flagship search engine. Doug Cutting and his colleague Mike Cafarella set out to build an automated web crawler to index the Internet to improve search results. The resulting product, called Nutch, could crawl a hundred pages a second. That was lightning fast at the time, but because it was limited to running on a single machine with about a terabyte of disk and a hundred gigs of RAM, which was pretty beefy for the time, it had a hard limit of about a hundred million webpages. It soon became clear that that wouldn’t be nearly enough.
So Cutting and Cafarella decided to parallelize it. They managed to expand Nutch to four nodes before the system became too complex to handle, and it still wasn’t enough to handle the expected growth of the web. The developers weren’t sure how to proceed until they finally stumbled across an obscure paper written by Google that described the Google File System. Here they found the blueprint for solving the same problem that they were dealing with in 2004. Using the paper as their guide, Cutting and Cafarella developed a Java version of the Google File System, which they called the Nutch Distributed File System. Guided by another Google paper, which described a system for parallel processing called MapReduce, Cutting and Cafarella created the first processing engine to work with the new file system, which Cutting would soon rename Hadoop. When Yahoo caught wind of the developers’ work in 2006, it hired Cutting to help it compete against the upstart Google.
Yahoo finally went live with Hadoop in 2007. Hadoop’s early days were a mix of successes and failures. The new system could scale like nothing before it, but it was a different beast entirely and it required constant attention. While there were other distributed computing frameworks in active development at the time, Hadoop gained the lion’s share of the attention from prospective users. It also had a liberal license at the Apache Software Foundation, which probably helped it spread. By late 2007, most of the web giants in Silicon Valley had heard of Yahoo’s success with Hadoop and were starting to use it: Facebook, LinkedIn, Twitter. They all innovated atop Hadoop and developed products that would soon become top-level Apache projects of their own, including Hive, HBase, and Storm. Facebook also developed Cassandra, a document-oriented database based on Dynamo, a key-value database created by Amazon.com in 2004 to handle holiday shopping traffic.
The development of NoSQL databases has largely paralleled the rise of Hadoop, which flourished in the ensuing years. Cloudera was founded in 2008, and with it, the big data concept of bringing the compute to the data migrated out of Silicon Valley and into enterprises around the world. Cutting would join Cloudera a year later to help guide the development of the fledgling computing system that was so green and yet had so much potential. The Hadoop dream of providing a central data store for a variety of engines was a tantalizing one. Since the dawn of time, we’ve been struggling with the need to move data to the computing resources. Storage servers remained separate from processing servers, and never the two would meet, except perhaps over expensive 10-gig networks or, if you happened to work in the HPC field, InfiniBand networks. Hadoop appeared on the scene like a silver bullet, prepped to tackle every data storage and processing task we could throw at it.
Its appearance in 2007 occurred just in the nick of time to handle the huge uptick in unstructured data driven by the mobile web, which occurred just after Apple introduced the iPhone in 2007, and by the surging success of social media sites like Facebook, Twitter, and LinkedIn. Or perhaps its presence enabled this explosion of video, chat logs, media files, and cat pictures. It’s really a chicken-and-egg problem, and nobody will ever know if the explosion of data and consumer-based information sharing would ever have happened had we not suddenly been graced with the ideal combination of cheap storage and compute embodied by Hadoop. In any event, while the web giants were the biggest early buyers of Hadoop-style computing and the biggest contributors to the growing Hadoop stack, they were soon followed by their enterprise brethren. Banks, retailers, travel companies, insurance companies, and manufacturers soon wanted to process data like the web giants did.
After all, why should Google and Facebook get the consumer data and therefore drive all the transactions? The idea that we’re all software companies now soon took hold, and the big data explosion continued to roll. Armed with nearly unlimited data storage, measuring into the petabytes, and data scientists trained on machine learning techniques, companies were determined to find hidden insights in the form of unexpected correlations or anomalies that they could turn into business advantage. Harvard Business Review declared data scientist the sexiest job of the 21st century. Hadoop, for better or for worse, was basically synonymous with big data. If you were, quote, doing big data, unquote, then you were probably using it. With its novel schema-on-read approach, Hadoop was hailed as a new data warehouse that didn’t penalize you for ingesting and processing huge sums of semi-structured and unstructured data that would never fit into a relational data warehouse like the ones from the old guard, companies like Teradata, Oracle, Microsoft, IBM, and HP.
However, chinks started to appear in the Hadoop armor. Although the platform could ingest huge sums of unstructured data like nothing else and survive failures of multiple nodes, the design of HDFS never supported real-time or near-real-time workloads. It was, for better or for worse, a batch-only paradigm. Apache Hive and other SQL-based warehouses for Hadoop introduced interactive SQL processing into the ball game, but that didn’t eliminate all performance concerns in the new data lakes, and early interactive processing simply didn’t work for certain types of workloads. Fraud detection on credit card transactions, for example, requires subsecond responses to queries. Surfacing a product recommendation on an e-commerce site also requires having an answer within a certain amount of time. Early versions of Hive, with their MapReduce batch dependency, were ill suited for these types of workloads.
The demand for fast data analysis spurred the creation of a new class of data processing engine separate from Hadoop. The pack was led by Apache Storm, which Twitter developed to analyze huge numbers of tweets in something closer to real time. LinkedIn had its own real-time processing engine called Samza, which it developed alongside its real-time data pipeline called Kafka. Yahoo developed its own project called SAMOA, and so forth. The parallel development of these real-time big data engines created a problem, of course. If Hadoop was the single version of the truth, as it was advertised to be, then how do you manage the separate streaming platform? The problem gave rise to a new architecture by Nathan Marz, the creator of Storm, called Lambda. Put simply, the Lambda architecture splits all data and simultaneously feeds it into two separate systems: one that flowed into the real-time streaming system for real-time decision making, and another that flowed into Hadoop for end-of-the-day batch processing, to account for late-arriving files and any errors that cropped up with the real-time system.
Needless to say, stitching together two separate systems based on different frameworks that use different programming paradigms increased the complexity level immensely. Despite that fact, Lambda was seen as the only way to satisfy the competing requirements of processing data in a way that was simultaneously fast, thorough, and efficient. Any history of big data storage would be remiss if NoSQL databases weren’t mentioned. NoSQL databases have some of the same characteristics as Hadoop, in that both give developers flexible schemas and different query mechanisms besides SQL, while helping administrators by being distributed and fault-tolerant and running on cheap clusters of x86 computers. NoSQL databases, however, provide functions above and beyond what you get in a flat-file storage system, which Hadoop mostly is. NoSQL databases are mainly used as operational data stores for structured and semi-structured data like JSON, as opposed to the data lake for semi-structured and unstructured data, which is Hadoop’s biggest use case.
Just as we’ve seen a proliferation of engines that plug into Hadoop, we’ve seen the rise of specialized NoSQL databases designed to handle specific tasks. We’ve seen key-value stores like Memcached and Redis excel at serving read-heavy workloads such as travel websites, while document stores like MongoDB and Couchbase excel at serving as back ends to most of the world’s most popular web and mobile apps. Wide-column stores like Cassandra dominate the most intensive scale-out use cases, while graph databases like Neo4j and TitanDB provide an entirely new twist on data processing through degrees of connectedness. At the same time, other NoSQL databases such as Splunk, Elasticsearch, and MarkLogic serve even more specialized use cases. And in addition to the Hadoop stack, the stream processing systems, and the NoSQL databases, you have object storage to contend with as well. An object store treats every piece of data as an object, as opposed to a file system like Hadoop or a block storage method like SAN storage.
It is sometimes considered the simplest and most scalable data storage method. Each record is given an identifier, which is stored in a metadata store, while the object itself is stored in the cluster. Amazon’s S3 today is the most dominant object storage system by far. The S3 API is the standard adopted by other storage systems as well. Together, the NoSQL databases, the Hadoop stack, the stream processing systems, and object stores seemed poised to upend the status quo in the $2 trillion IT industry. While NoSQL databases process transactions, Hadoop provides the insights through machine learning workloads. Stream processors deliver the instantaneous insights, while huge data lakes filled with videos and pictures would be efficiently stored in object stores. With all the big data pieces in place, it was just a matter of putting them all together. And yet the innovation didn’t end as the big datasets continued to grow.
So did the number of big data projects designed to help process it all. The hype surrounding Hadoop peaked around 2014, which incidentally is the same year that Apache Spark emerged from incubator status to replace large parts of the Hadoop stack. The death of MapReduce as the primary engine of Hadoop was relatively swift, and today Spark is the de facto standard processing engine for a range of big data workloads, including stream processing and machine learning. Also, Spark’s graph and SQL processing capabilities are quickly maturing. And yet innovation still hasn’t slowed. Today, we’re seeing the rise of other frameworks like Apache Flink and Apache Beam that advocate a stream-first approach to big data. Instead of running separate Hadoop infrastructure, the architects of Flink and Beam propose that we process all data, even batch-oriented data, as if it were real time. This approach has a number of benefits, including the elimination of the Lambda architecture and the simplification of the stack. And considering the huge amount of data that the Internet of Things is predicted to generate in the coming years, it may be the only way to keep our heads above water. As much as people seem to hate the term big data, the term still has legs. And why? Because the term aptly describes the core of the problem we face, because managing the data, including storing it, accessing it, governing it, cleansing it, securing it, and ultimately turning it into useful information, ultimately is a big data problem.
Consider the growth of data that we’ve experienced up to this point. In 2003, the world generated on the order of five exabytes of data. By 2006, two years after Facebook ignited the social media revolution, data generation exploded to 161 exabytes, according to the IDC. By 2010, three years after the first iPhone reached consumers’ hands, people and their devices cracked the zettabyte barrier for the first time. The exponential growth continued in 2014, when we created 4.4 zettabytes of data. That was about as many bits generated as there are stars in the physical universe. According to the IDC, by 2016 the world was generating 16 zettabytes of data per year. Those are huge numbers, to be sure, but here’s the rub: most of that data is never stored. According to the IDC, our ability to manufacture storage capacity trails far behind our ability to generate data.
The key number here is 15%. That’s the fraction of data that we generate that ultimately gets written to disk or tape or flash drive or optical or cloud or any other permanent storage mechanism. Most of the data we have created up to this point has been ephemeral: the bits that make up our telephone calls, the TV and radio signals that are broadcast and never written down, the HTTP requests that fetch data into our apps and browsers that we eventually close and forget about. In many ways, our world has always resembled Snapchat. Data is fleeting, appearing and disappearing in our lives to suit our whims and needs. The Snapchat-like trend will continue with the Internet of Things. The IoT is widely expected to supercharge our data generation capability to a mind-boggling 160 zettabytes per year by 2025, according to IDC’s most recent Data Age report. Storage manufacturers are doing their best to keep up with the huge growth. Hard drive makers are building bigger and better hard drives, including five-terabyte drives today and 30-terabyte drives on the roadmap, while tape manufacturers are shipping six-terabyte LTO cartridges today, with 48-terabyte cartridges on the roadmap.
Despite the innovation happening in disk, solid-state formats, and cloud storage, the IDC predicts our storage capacity to hold steady at roughly 15% of the data we generate. That number of 15% appears to be some sort of magic number, reflecting in a way the ratio between the raw data of questionable value and the actionable information that we’re willing to invest in.
The question then becomes how best to extract that 15% kernel of value from the other 85% of chaff. Will we be writing 100% of it temporarily to disk with the hopes of winnowing it down to the 15% that we value through some sort of MapReduce-style method? That approach seems unlikely. While it had a shot of working in 2004, when Google researchers Jeffrey Dean and Sanjay Ghemawat published the seminal MapReduce paper, there’s almost no chance of it working with the large amounts of data today. While Google was the source of inspiration for both core components of early Hadoop, including the Google File System that Cutting would implement in Java as HDFS and the original MapReduce method, it has long since moved beyond that form of computing. So what did the mighty Google replace MapReduce with? Stream processing, of course. In 2014, the company revealed Cloud Dataflow, a new big data software development and execution paradigm designed to enable developers to build data flows, or pipelines, that process exabytes’ worth of data.
At this point, I think I would like to ask our first polling question. We would like to know what the state of stream processing is in your organization. Do you have plans to implement it within a year, within one to two years, or beyond two years? Do you have no plans to implement it, or have you implemented it already? We’re going to wait just a little bit here while everybody takes the poll and we can collect the results. All right, well, it looks like about 15% of you have already implemented stream processing. About 15 and a half percent plan to implement within a year, about 23 percent within two years, nobody has long-term implementation plans, and roughly half of you have no current plans. One of the core tenets of Google Dataflow, which has turned into Apache Beam, is the unification of batch and real-time data processing.
While Dataflow ostensibly enables developers to process data as soon as it comes off the wire, usually from Apache Kafka or Amazon’s Kinesis, Google realized that the same technique can also be used for batch processing, thereby eliminating the need to build and maintain separate systems, as with the Lambda architecture. With Apache Beam, developers can write batch and streaming applications using a single API. What’s more, through the notion of runners, developers can access other execution frameworks from within the Beam API, including Flink, Spark, and Apex, as well as the Cloud Dataflow engine living on the Google Cloud. This brings us up to the present. The state of big data is still in flux and evolving at a tremendous pace. The data generated by the IoT is exploding.
Instead of petabytes of data, which we used to think was big data, individual companies are now talking about storing more than an exabyte of data, while worldwide storage of data is measured in the zettabytes. To keep up with this massive generation, we’re moving beyond the batch-based methods embodied by Hadoop and its file system. Today, the future is clearly focused squarely on real-time data processing methods, at the edge, upon mobile devices, and using new hardware form factors, because it’s really the only chance that we have to keep our heads above the digital tsunami. That completes my presentation, and I’m going to hand it on over to Steve.
Thank you, Alex. So I just want to reiterate what Alex was saying and show it to you in graphical form, so it really hits home, and hopefully those of you who aren’t thinking about stream processing will be convinced that you may need to. So today we’re at around 16 zettabytes of data annually, as Alex mentioned, and by 2025 IDC is estimating 160 zettabytes of data. Just to put that into perspective, in an exponential graph, that means that in every two-year period in this graph, more data is generated during those two years than was generated in the entirety of mankind’s life on earth up until that point. So every two years represents more data than was ever generated before, which is to me just quite amazing. Right now, IDC is saying around 5% of it needs to be dealt with in real time.
By 2025, around 25% of that data will need to be real time, and of that real-time data, 95% will be generated by IoT. That’s a massive increase. You can see the exponential curve. It wasn’t really starting to hit for the real-time data until now, and by 2025 we’re going to be overwhelmed with the amount of data being generated. As Alex mentioned, only a small percentage of that data can be stored. Now, if only a small percentage of that data can be stored, then you have no choice. The only logical conclusion is that you have to process and analyze data in memory, in a streaming fashion, in order to deal with the huge amount of data being generated. Now, it’s not just IoT data that is streaming. As I’ve mentioned, there is this trend to move towards a streaming-first architecture, and if you think about it, it’s quite natural, and batch is actually quite artificial.
Everything that happens within the enterprise happens because of some event, because something happens. It could be a user typing something into a form that puts stuff into a database. It could be a web application that’s writing to a database. It could be web applications that generate web logs. It could be machines doing stuff that generates logs. It could be devices that are sending things as events as those things happen. But all of it is happening one event at a time. Log files aren’t created a whole file at a time. They don’t wait until everything’s done and then write everything to a log. Databases don’t always work on huge numbers of rows at a time. They typically work row by row: insert by insert, update by update. So databases can be streaming through change data capture. Logs can also be streaming, by reading at the end of the log and taking things as they are written.
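To make "reading at the end of the log and taking things as they are written" concrete, here is a minimal sketch in Python. It is only an illustration, not how Striim or any particular product reads logs: the function name `read_new_lines`, the file name, and the log line format are all made up for the example. The reader attaches at the end of the file and hands back each complete line as one event, holding back a line that has only been partly written.

```python
import os
import tempfile
from typing import Iterator, List, TextIO


def read_new_lines(log: TextIO) -> Iterator[str]:
    """Yield every complete line appended since the last call, one event at a time."""
    while True:
        position = log.tell()
        line = log.readline()
        if not line.endswith("\n"):
            # Nothing new, or a partially written line: rewind and try again later.
            log.seek(position)
            return
        yield line.rstrip("\n")


def demo() -> List[str]:
    """Simulate an application writing a log while a reader follows it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")  # hypothetical log file
        with open(path, "w") as writer, open(path, "r") as reader:
            writer.write("written before the reader attached\n")
            writer.flush()
            reader.seek(0, os.SEEK_END)  # start at the end of the log

            seen: List[str] = []
            writer.write("user=42 action=login\n")
            writer.flush()
            seen.extend(read_new_lines(reader))

            writer.write("user=42 action=checkout\n")
            writer.write("user=7 action=lo")  # partial write, not an event yet
            writer.flush()
            seen.extend(read_new_lines(reader))

            writer.write("gin\n")  # the partial line is completed
            writer.flush()
            seen.extend(read_new_lines(reader))
            return seen


events = demo()
```

In a long-running reader you would call `read_new_lines` in a loop with a short sleep between polls; the point is only that the log is consumed event by event rather than a whole file at a time.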
And devices obviously can also be streaming. So hopefully that helps convince you that stream processing is a major infrastructure requirement, and it can also be the precursor to everything else you’re doing. There is only one real limitation on taking a streaming architecture and applying batch concepts within it, because some things have to be batch, right? Your end-of-day report needs to be done at the end of the day. The only real limitation to that is memory. The more RAM you have, the more data you can contain in a streaming fashion, and the larger your in-memory batches can be that you’re doing through stream processing. But if you have enough memory, there’s no real requirement to store all of this stuff on disk. So for some of the things that you might be thinking about doing, we’ll talk about how you can think about them in a streaming fashion, right?
If you’re creating a data lake, the first thing you have to think about is what the end goal is. You know, what is the point of creating this data lake? What’s the information that’s going into it? Alex has reported, and other people have reported, on some of the failures that have happened with Hadoop, and some of the approaches that have been taken that were just completely wrong. And the overall completely wrong approach is to throw everything in there in raw fashion and kind of hope for the best, hope you’re going to get some value out of it later. So thinking about your end goal is the first thing. And the second thing is, how does it scale? Hadoop queries are slow. If you compare Hadoop queries with a well-architected data warehouse, the queries are incredibly slow.
In order to speed that up, you need to think about what the data looks like when you put it into Hadoop. Typically, raw data isn’t the best thing to put into Hadoop from a querying perspective. You need to be able to preprocess and enrich that data, and denormalize it, in database terminology, so that when you query it you can get your query results back fast. And you also have to think about, do we need all of the raw data? Is it in the best form? Should we instead be doing aggregates and writing aggregates into Hadoop that also facilitate fast queries? So you think, is the raw data actually useful? And then, how does it scale? How do you scale feeding it, putting things into Hadoop? You need to think about the overall architecture.
The uses of big data really do need to be considered. And how do you scale storing it? Is this something that you want to do on premise, or do you want to do it in the cloud? It turns out that AWS is currently the top Hadoop distribution, and it’s in the cloud. So that’s also something to consider. The other thing you might be considering is providing streaming data as a service. Organizations are investing in Kafka, for example. If you’re investing in Kafka, again you ask yourself the question, what is the end goal? Because Kafka is just a message queue, and message queues have been around for a long time. MQSeries has been around for decades. So if you’re putting stuff onto a message queue, why are you doing it? Do you want to do real-time analytics? Is it something that you’re using to feed your data lake? Are you expecting people to do self-service analytics on the stuff that you’ve written into Kafka?
And again, you ask yourself the question, is the raw data itself useful? If you’re putting data onto Kafka, imagine you’re doing change data capture from a database and you’re feeding that raw data into Kafka. If you have a nicely normalized database, then the majority of your data is going to consist of IDs. If your customer order detail table changes, because someone added a row to it, the change is going to contain your customer ID, order ID, item ID, some timestamps, something else. All of those IDs are not going to be useful for someone doing analytics on a message queue, or even when delivered into a data lake. So you need to do enrichment of that data before it even makes sense. And then how do you integrate? If Kafka is just a message queue, how do you get data in and out of it? How do you integrate it with your existing investments, like databases, and how do you get value out of it?
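The enrichment step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Striim’s implementation: the change-event layout, the `CUSTOMERS` and `ITEMS` lookup tables, and the `enrich` function are all hypothetical. The idea is simply that reference data is held in memory and joined to each raw change event as it flows past, so that what lands downstream carries names and descriptions rather than only IDs.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference data, loaded into memory once from the source tables.
CUSTOMERS: Dict[int, dict] = {101: {"name": "Ada Lopez", "zip": "94301"}}
ITEMS: Dict[int, dict] = {7: {"description": "USB-C cable"}}


def enrich(changes: Iterable[dict]) -> Iterator[dict]:
    """Join each raw change event with the in-memory reference data."""
    for change in changes:
        customer = CUSTOMERS.get(change["customer_id"], {})
        item = ITEMS.get(change["item_id"], {})
        yield {
            **change,  # keep the original IDs and fields
            "customer_name": customer.get("name"),  # None if the ID is unknown
            "customer_zip": customer.get("zip"),
            "item_description": item.get("description"),
        }


# Hypothetical raw change-data-capture events from an order detail table.
raw_changes = [
    {"op": "INSERT", "order_id": 5001, "customer_id": 101, "item_id": 7, "quantity": 2},
    {"op": "INSERT", "order_id": 5002, "customer_id": 999, "item_id": 7, "quantity": 1},
]
enriched = list(enrich(raw_changes))
```

The second event shows the choice you have to make for IDs that are missing from the reference data. Here the event is passed through with empty fields; dropping it or routing it to a separate stream would be equally reasonable.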
How do you perform the in-memory analytics and processing, and even create dashboards from it, to actually give value to your end audience? So we recommend that if you try to do any of these things, you don’t boil the ocean. You don’t try to do everything in one go. You don’t throw all of your raw data into Hadoop. You don’t throw all of your raw data onto a message queue. You literally take things one stream at a time, and do this as part of your overall business goal. You identify business use cases that make sense for these types of architectures. And the streaming piece is part of your overall data architecture. It doesn’t immediately replace everything that you have; it co-exists. You probably already have an operational data store and an enterprise data warehouse. You probably already have ETL jobs that run in batch mode that are taking things from databases and/or files and putting those into a data warehouse, putting those into Hadoop, right?
So this streaming architecture, yes, eventually it can supplant a lot of those things, but that’s not something you want to do in one go, because these things are already working, and they’re already part of your enterprise, already part of your decision-making process. So streaming integration can be applied one stream at a time. You can connect it into your databases using change data capture. You can read log files in real time. You can also access other machine data, message queues, sensors, etc., and do all of this in a streaming architecture. And those can drive real-time applications. Those can allow you to do real-time analytics. They can also feed data to machine learning, so the machine learning can learn, but then they can also use machine learning models and do real-time scoring as part of the streaming integration piece. They can also read and write data from data warehouses and Hadoop.
And also use data in data warehouses, databases, and Hadoop for enrichment purposes. So you load all the reference data that you may have lying around in these things into memory and make that part of your streaming architecture. And so this is kind of an enhanced version of Lambda. Other people have talked about a Kappa architecture, where you’re keeping time windows in memory that kind of replace batch. But if you are moving everything towards a streaming architecture, we may as well call it kind of the Omega architecture, because you can’t go higher than that in the Greek alphabet. Striim is a complete end-to-end platform. I’ll just introduce what Striim does a little bit to give you some context around how some of these things can be done. We do both streaming integration and analytics, and all these things apply to Striim.
These are things you should be considering if you’re looking at moving towards streaming. So the first thing is you need to do continuous data collection, and not just from IoT devices but also from message queues you may have. That could be Kafka, it could be Flume, it could be MQSeries, it could be JMS. You probably already have investments in those things. And log files, as I mentioned, can also be read in a streaming fashion. Sensors typically are inherently streaming; they’re sending data continually. But databases can also be streaming through change data capture. Stream processing is an essential part of this architecture. You need to be able to do in-memory filtering, transformation, and aggregation of data. And as I mentioned, enrichment is key. You need to be able to load context and reference data into memory and join that with the streaming data in real time.
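As a rough illustration of in-memory filtering and aggregation, the Python sketch below drops invalid orders event by event and then totals order amounts per zip code over fixed one-minute windows. Everything here is an assumption made for the example: the event layout, the function names, and the choice of a tumbling window keyed on event timestamps that arrive in order. A streaming SQL engine would express the same thing declaratively with a WHERE clause and a window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event: {"ts": seconds, "zip": zip code, "cents": order amount}
Order = Dict[str, int]


def valid_orders(orders: Iterable[Order]) -> Iterator[Order]:
    """Filtering: each input event produces zero or one output event."""
    return (order for order in orders if order["cents"] > 0)


def totals_per_window(
    orders: Iterable[Order], window_seconds: int = 60
) -> Iterator[Tuple[int, Dict[int, int]]]:
    """Aggregation: sum order amounts per zip over fixed (tumbling) windows.

    Assumes events arrive in timestamp order. A window's totals are emitted
    when the first event of a later window arrives; empty windows emit nothing.
    """
    window_start = None
    totals: Dict[int, int] = defaultdict(int)
    for order in orders:
        start = order["ts"] - order["ts"] % window_seconds
        if window_start is not None and start != window_start:
            yield window_start, dict(totals)
            totals.clear()
        window_start = start
        totals[order["zip"]] += order["cents"]
    if window_start is not None:
        # Only reached for a finite input; a live stream never gets here.
        yield window_start, dict(totals)


order_stream = [
    {"ts": 5, "zip": 94301, "cents": 1000},
    {"ts": 20, "zip": 94301, "cents": 0},  # filtered out
    {"ts": 50, "zip": 10001, "cents": 250},
    {"ts": 61, "zip": 94301, "cents": 400},
    {"ts": 130, "zip": 94301, "cents": 100},
]
windows = list(totals_per_window(valid_orders(order_stream)))
```

Only the totals for the current window are held in memory, which is why the amount of RAM bounds how large a window you can keep.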
In order to try to get value out of your streaming data, you need to do streaming analytics: anomaly detection, and what used to be called complex event processing, looking for patterns, sequences of events over time that might indicate something interesting is going on. Do statistical analysis and compare real-time data with statistics, integrate machine learning and do that in real time, and be able to visualize all of this and create dashboards around it to actually get value out of it. And then, of course, you need to be able to deliver the results of this processing. That could be putting aggregated data or enriched data into Hadoop or into Kafka, or delivering into the cloud, putting it into S3 or Redshift or Azure SQL. So you need to think about all of these pieces if you’re embarking on a streaming architecture, and you need to think about how all these pieces fit together and how they make sense.
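A very small example of looking for a sequence of events over time: the Python sketch below raises an alert when the same user has three failed logins within sixty seconds, with no success in between. The event layout, the thresholds, and the function name `repeated_failures` are all invented for the illustration, and real complex event processing engines support far richer patterns than this.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

# Hypothetical event: (timestamp in seconds, user, outcome), in time order.
LoginEvent = Tuple[int, str, str]


def repeated_failures(
    events: Iterable[LoginEvent], threshold: int = 3, within_seconds: int = 60
) -> Iterator[Tuple[int, str]]:
    """Yield (timestamp, user) when a user reaches `threshold` consecutive
    failed logins inside a sliding window of `within_seconds`."""
    recent: Dict[str, Deque[int]] = defaultdict(deque)
    for ts, user, outcome in events:
        failures = recent[user]
        if outcome != "fail":
            failures.clear()  # a success breaks the sequence
            continue
        failures.append(ts)
        while ts - failures[0] > within_seconds:
            failures.popleft()  # forget failures that fell out of the window
        if len(failures) >= threshold:
            yield ts, user
            failures.clear()  # start counting afresh after an alert


login_events = [
    (0, "ann", "fail"),
    (10, "bob", "fail"),
    (20, "ann", "fail"),
    (30, "ann", "fail"),  # third failure for ann within 60 seconds
    (40, "bob", "ok"),  # resets bob
    (50, "bob", "fail"),
    (100, "cat", "fail"),
    (170, "cat", "fail"),  # the failure at 100 is now too old
    (180, "cat", "fail"),
]
alerts = list(repeated_failures(login_events))
```

As with the windowed totals, the only state kept is a handful of recent timestamps per user, so the pattern can be evaluated in memory as each event arrives.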
There are different categories of business use case that you can solve using a streaming architecture. We break those down into real-time data integration, detecting patterns, and monitoring and building metrics and KPIs. Here are just some examples. You could be feeding your data lake in real time, or feeding your Kafka or other message queue infrastructure in real time. You might be migrating data into the cloud, or setting up a real-time hybrid cloud, where data on premise is mirrored in the cloud for reporting or scalability purposes. Or you might be looking at IoT edge processing: doing preprocessing of data, doing aggregation, redundancy removal, extracting the signal from the noise, turning data into information, and doing that at the edge. On pattern detection and anomalies, there are a lot of things you could be doing here, whether it’s fraud detection or active cybersecurity monitoring, anti-money laundering, doing things based on locations, doing predictive maintenance in real time, and a lot of the IoT analytics.
And then, building on that, if you’re looking at building monitoring, metrics, and KPIs, then you’d think about real-time call center or network quality monitoring, sales monitoring in general, or, if you’re providing APIs or other SLAs to your customers, monitoring those in real time and seeing if you or your customers are breaking them. There are a lot of things that you could be doing here as well. So I’m going to take a quick pause here, and we’re going to ask the next survey question, which is around what types of use cases you think of as being useful for streaming architectures.
We’ll wait a short time for those survey results to come in. And I would encourage you to answer this. It’s interesting to see, and it’s going to help target some of the things we talk about later as we go through the rest of the presentation. So while those results are coming in: you hear a lot of talk about stream processing. You hear it coming from a lot of different vendors, even Kafka vendors, and people that focus on talking about stream processing. Well, there are a lot of different components to a stream processing architecture. Yes, a message queue like Kafka is an important aspect of that, but it’s only one tiny sliver of the entire architecture. You need to consider how you load large amounts of data into memory. That’s an in-memory data grid that you need in order to actually do that side of things.
You need to think about how you do your stream processing. How do you do your analytics? How do you integrate with machine learning? How do you visualize? Where do you store the results? Where do you push them? That’s something you need to consider as well. And then, of course, how do you collect the data, whether it’s in real time from devices, message queues, log files, and also databases through change data capture? And how do you deliver that data in some way, whether it’s in the cloud or on premise? So yes, Kafka is an important aspect of your stream processing, but there are a lot of other pieces of technology that you’ll need as well. So we’ve got a lot of the results through now for what you are considering stream processing for, and interestingly, but not unexpectedly, IoT and machine data seems to be the top. People are also looking at it for log intelligence, real-time monitoring of machine logs, recommendations, and fraud detection. And then we have some others, which we would obviously be interested in drilling down into.
If you have one of those others, then you can email us and let us know what the others are. That’d be great. So let’s go back to the slides. I’m going to zoom through the next one and get on to the IoT portion. This is just a reiteration of the pieces that you need in a streaming architecture. And this is the Striim platform. One of the things that we do that’s key to success, I think, is we lower the bar drastically for people wanting to build streaming data flows. And we do that by doing all of the stream processing through SQL. So you can write continuous queries, continuous processing in memory, without having to learn how to code in JavaScript or Java or C# or any of the other programming languages that you may need.
And we’re also an end-to-end platform that contains all of these pieces, so you don’t have to choose and evaluate and work out which message infrastructure you’re going to use, which caching system you’re going to use, which storage you’re going to use, how you get data in and out of all of this, and then how you do the processing and analytics. So those are the key aspects of a streaming architecture, and this is how we put them together at Striim. And we also add this overall consistent UI that allows you to build things very easily utilizing our platform, and also allows you to build dashboards to visualize your streaming data in real time as well. This is an example of a data flow. So when you think about stream processing, it is not just a single query running on Striim; it is an entire pipeline, and those pipelines contain multiple data processing steps, and streams are generated at each point.
Those intermediate streams may well be useful for other purposes as well. And typically we find that organizations that start with a source of streaming data for one particular application use that for a lot of other use cases as well. So not just the raw streams but the intermediate streams that you are creating as part of this processing can also be reused. And as I mentioned, we also allow you to build dashboards and visualizations. Let’s move on to IoT. IoT is crucial to a lot of industries already. It is going to be crucial to almost every industry going forward. That huge growth in IoT data doesn’t just affect one industry, it affects everyone. And it amazes me: take an industry as old and established as the insurance industry, so established that they use 500 years’ worth of data to determine where to build their headquarters in the safest place in the country.
That industry is adopting IoT to monitor your driving habits, and I’m sure they’ll use it to monitor other aspects of your life in order to provide the most tailored, most suitable policy going forward. And if it’s affecting insurance, it’s going to affect almost everyone. Agriculture, I think, will take big advantage of IoT. It’s essential to have a smart architecture, and a smart architecture means that you are collecting, processing, storing and analyzing the data where it makes sense. And that could be within the device. It could be at the edge, so a collection of devices together, a small number of devices managed by an edge server. It could be on premise. And it could be in the cloud, and all of these pieces have to integrate together. The scale is obviously larger.
On this side you have a huge number of devices, but the depth of the view that you have into things increases as you move over to the right side, because you are aggregating and looking at the combined value of a large number of devices. That could involve correlating device data together, but most likely it’s going to involve correlating device data with other enterprise data and doing that in real time to get the best value. It could be through enrichment of that device data by adding context to it. Say you have a device ID and you want to know what that device is and where it comes from; maybe you have to look that up in your asset database or integrate with the ERP system in real time. Or it could be that you are correlating it: you are looking for events that are happening concurrently across your devices and logs from other machines.
Maybe security logs or other logs that might indicate something interesting is happening. So you need this generic, scalable IoT architecture that incorporates edge processing. And yet one of the things you have to remember with IoT is that a lot of the devices out there right now are not Internet enabled. There’s a world of things, and a large number of those things still need connecting. And so pretty often you need a physical gateway, a box that you have to plug the wires into in order to actually connect to these devices. A lot of manufacturing devices, a lot of things like air conditioning systems and hospital, healthcare and medical monitoring devices need to be wired. So you have to ask, how do you get that data? And that’s where you need a protocol translation gateway to turn the communication with those old-school wired bus, BACnet or two-port devices into something that works for the Internet.
You need to do edge processing and analytics, and you may also want to do machine learning model scoring at the edge: build a model in the cloud or on premise in your own data center, then move the results of that model to the edge and do real-time scoring, and maybe do real-time predictive maintenance or quality monitoring or any other aspect that you can model, where you can ask, is this behaving normally? Is this behaving as my model would predict? You may also want to do processing and analytics on premise and move some of that through a hub into the cloud, where you do more processing and analytics and maybe feed your machine learning, moving that model back. And we built something like this recently with some partners, where we used the Dell EMC gateways, which are pretty beefy gateways.
They have quite a lot of processing power and memory. We integrated with the Azure IoT Gateway SDK to act as a translation gateway that was talking to Bluetooth and OPC UA devices. We had the Striim edge server that was actually doing edge processing and analytics, and we used Statistica for machine learning, where we built a model in the cloud using Statistica that was predicting product quality and then scored that model at the edge to do real-time analytics. So that’s an example of the architecture. There are lots of benefits to an architecture like this. If you can switch out your protocol translation gateway, you can connect to anything. You can react immediately because you’re doing it at the edge. You can do aggregation, turning data into information and seeing the signal from the noise, doing all of that at the edge.
You can limit the data sent to the cloud or into your big data storage. You can scale it as required by adding more edge nodes, more on-premise nodes and more cloud nodes, and control everything centrally. So how do you handle this oncoming tsunami of data? Just to reiterate some of the guidelines that we’ve given you: first, transition away from batch. It is an artificial construct. The world is not batch. The world is event driven. Batch was something that was a limitation of technology. So you need to move towards a streaming-first architecture where you are at least collecting data in a streaming fashion, but move towards in-memory processing and analytics, especially edge processing for IoT. Don’t store all of your data. Data is not information. If you have a Nest thermometer sending you the temperature in the room every second, that’s roughly three and a half thousand data points in an hour. If your room stays at 70 degrees, that’s one piece of information: it was 70 degrees for an hour.
You don’t need those three and a half thousand data points. So process at the edge: do filtering and aggregation, and remove redundancy before sending that to the cloud. And you do need a complete end-to-end platform. Open source provides lots and lots of pieces, but it doesn’t provide an overall solution. You’ll still have to glue all that together. And most enterprises that we’ve spoken to focus on solving their business problems. I know Alex said that all businesses are software companies. That is true, but you have to choose how much of it you want to build yourself. So I think we can go into questions and a discussion.
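As an editorial illustration of the thermometer example above, here is a minimal Python sketch of edge-side redundancy removal. It is not Striim’s implementation: the names `Reading`, `Summary` and `collapse_unchanged` and the `tolerance` parameter are assumptions made up for this sketch. It collapses a run of near-identical readings into one summary record before anything would be sent onward.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Reading:
    ts: int          # seconds since the start of the stream (assumed unit)
    temp_f: float    # temperature in degrees Fahrenheit


@dataclass
class Summary:
    start_ts: int
    end_ts: int
    temp_f: float    # temperature at the start of the run
    count: int       # how many raw readings this summary replaces


def collapse_unchanged(readings: Iterable[Reading],
                       tolerance: float = 0.5) -> Iterator[Summary]:
    """Emit one Summary per run of readings that stay within `tolerance`
    of the first reading in the run (hypothetical edge-side filter)."""
    current: Optional[Summary] = None
    for r in readings:
        if current is not None and abs(r.temp_f - current.temp_f) <= tolerance:
            # Same piece of information: extend the current run.
            current.end_ts = r.ts
            current.count += 1
        else:
            # Temperature moved: flush the previous run and start a new one.
            if current is not None:
                yield current
            current = Summary(r.ts, r.ts, r.temp_f, 1)
    if current is not None:
        yield current


# One reading per second for an hour at a steady 70 degrees.
hour = [Reading(t, 70.0) for t in range(3600)]
summaries = list(collapse_unchanged(hour))
```

With a steady room, the hour of per-second readings becomes a single record covering the whole hour, which is the "one piece of information" described above. A real deployment would also want a maximum run length so that a summary is still emitted periodically.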
Thanks so much, Steve. I’d like to remind everyone to submit your questions via the Q&A panel on the left-hand side of your screen. At this time I’ll mention that the slides from today’s presentation, as well as a recording of the webinar, will be made available for download. We will be sending a follow-up email with links to these assets within the next few days. So let’s turn to our questions. Steve, you mentioned Statistica. One of our participants was asking, do we have a machine learning capability in Striim, or do we have to bring in something third-party to use machine learning?
So we have some degree of machine learning, only focusing on real time. For example, we have real-time linear, polynomial and multivariable regression algorithms in our platform that you can train over a training window and then use to predict out into the future, and similarly some real-time clustering. But you’re limited by the amount of data that you can store in memory. That’s why we’ve partnered with companies like Statistica, and we’ve had customers also work with machine learning software like H2O, where we gather and enrich data, getting it into a form for machine learning. Then trained analysts work their magic and derive value out of that data. They create a model, which they then export, and typically they’ve been exporting them as JAR files, which we can then incorporate directly into our SQL. So you can then basically write queries within Striim that reference the machine learning functions and do real-time scoring. And that’s how we’ve gone about it to date. And if you have any particular machine learning software in mind, we can talk about ways that we can integrate with that.
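To make "train over a training window and then predict out into the future" concrete, here is a minimal Python sketch of ordinary least-squares linear regression over a sliding window. It is an editorial illustration only, not Striim’s algorithm or API; the class name `WindowedLinearRegression` and its methods are hypothetical.

```python
from collections import deque


class WindowedLinearRegression:
    """Fit y = slope * x + intercept over the most recent `window_size`
    points only, so memory use stays bounded (hypothetical helper)."""

    def __init__(self, window_size: int) -> None:
        # deque(maxlen=...) silently drops the oldest point when full.
        self.points = deque(maxlen=window_size)

    def add(self, x: float, y: float) -> None:
        self.points.append((x, y))

    def predict(self, x: float) -> float:
        n = len(self.points)
        if n < 2:
            raise ValueError("need at least two points in the window")
        mean_x = sum(px for px, _ in self.points) / n
        mean_y = sum(py for _, py in self.points) / n
        sxx = sum((px - mean_x) ** 2 for px, _ in self.points)
        if sxx == 0:
            # All x values identical: fall back to the mean of y.
            return mean_y
        sxy = sum((px - mean_x) * (py - mean_y) for px, py in self.points)
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        return slope * x + intercept


# Train on a stream where y = 2x + 1, keeping only the last 5 events.
model = WindowedLinearRegression(window_size=5)
for t in range(10):
    model.add(float(t), 2.0 * t + 1.0)
forecast = model.predict(12.0)
```

The bounded window reflects the memory limit mentioned above: the model only ever sees what fits in the window, which is one reason to train larger models offline and score them in the stream.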
Excellent. Thank you. This question is also for you, Steve. You mentioned change data capture. One of our participants is asking, can you compare Oracle GoldenGate streaming with Striim?
We are very fond of GoldenGate, because the four of us that founded Striim were on the executive team of GoldenGate prior to the acquisition in 2009 by Oracle. GoldenGate is great at doing database replication; it’s great at moving data from one place to another. And they also do have big data adapters that can write the raw change data somewhere else. We are a full streaming integration platform, and what that means is that you can do quite complex data flows and quite complex processing of the data, and, very importantly, we can enrich the data by loading large amounts of reference data into memory, and do that in real time. And I mentioned earlier why that would be important for, say, normalized data coming from a database. So I think it’s crucial to recognize that different software is suited for different things. We are a full integration platform that can also do analytics on that streaming data and can also build dashboards and visualizations over that streaming data. We’re not limited to replicating data from one place to another.
Perfect. Thank you. This next question is for Alex. Many of the computing enhancements of the last decade have been made in software. Do you think we’re entering a period where hardware innovation will again impact the industry?
Oh, thanks Katherine. I do. We’re starting to see some smaller processors come out. Google came out with its TPU last week. Intel has had its knuck quadrant. Qualcomm is building a Snapdragon. People want to be able to do the scoring part of the machine learning out on the edge. And that’s a big part of the stream processing story here that Steve’s been talking about. A lot of the work that people are trying to do on big clusters in the data lake, on Hadoop and other platforms, they’re realizing they need to do out on the edge. And we’re seeing a wave of innovation now with some of these smaller processors that are going into devices out on the edge. So I definitely think that we’re going to see more hardware innovation out there, especially with deep learning and artificial intelligence becoming such a big driver of technology these days.
Excellent. And in fact, Alex, there was a related question to that, and that was, what will be the long-term impact of deep learning on the big data space, and what’s the intersection of deep learning and stream processing?
That’s a great question. I’m not exactly sure what the answer is going to be. Deep learning right now, from what I can tell, is mostly focused on solving a few problems: image recognition, you know, pulling still photos off of video feeds and trying to determine, is that a sidewalk? Is that a stop sign? It’s being driven by a lot of the autonomous car developments that are going on. And also voice recognition and natural language processing. Those, from what I can tell, are the main uses for deep learning. But I don’t know, maybe Steve can talk about how, if at all, any of that stuff is going to make its way into stream processing.
This also segues into another question that came up, which was, how might data streaming be utilized in the healthcare industry, specifically our hospital? Well, I don’t know which hospital is your hospital, but in general there are a lot of possibilities around streaming data. Part of it would be around patient monitoring. And there’s an interesting story: what’s the commonality between hospitals and airports? It sounds like a bad joke, but the answer is wheelchairs. Wheelchairs have a big impact on how airlines actually run on schedule and how hospitals move patients in and out. So doing real-time tracking of wheelchairs and other equipment, like crash carts and portable blood pressure monitors and all those things inside the hospital, could also be very, very useful. You know where these things are immediately, and you can optimize your flow.
For monitoring patients, you put a bracelet on them when they enter, and you know where they are at all times; whether they’ve wandered out onto the fire escape or into the restroom, wherever you’ve lost them, you can find them. So real-time tracking is important, but the other piece would be, imagine a world where you combine real-time biometric monitoring, all of the things you’re hooked up to when you’re in a hospital. You can anonymize that data, you can send it into a cloud, and you can also enrich that data with additional context like the patient’s symptoms and treatment and other things. Again, it’s all anonymized; there’s no specific patient information in it. Do deep learning on that en masse, and you can then apply what you’ve learned in that deep learning to the real-time signals that you’re getting from patients.
Maybe that machine learning can spot something that might be a potential risk that isn’t just an obvious sign that a single biometric monitor would spot. So I think there’s a lot of potential, and a lot of things that people are looking at, to use real-time data and real-time data streaming in the healthcare industry, and that goes across all industries. Imagine you’re streaming all of the data from your fields in agriculture: the soil quality, water content, sunlight, a whole bunch of things. Even monitoring with video cameras for pests and sending drones to zap them with lasers. So there are a lot of things that you could do in almost every industry, and it all relies on having up-to-the-second information and being able to react to it immediately.
Great. That was a very thorough response. Thank you both. Steve, this next question is for you: what data sources does Striim work with? A couple of examples the person gave were Mongo and CouchDB.
So we haven’t, to date, built real-time data collection for Mongo and CouchDB. On the database side, we support change data capture, so real-time streaming of database changes, the inserts, updates and deletes as they happen, for Oracle, SQL Server, MySQL and HP NonStop databases. That said, you can source data from Mongo and CouchDB through other means, which would basically be in the form of queries. You might have to build something to work with some of these things. In addition, we read from log files in real time, from HDFS and Hive, and we also work with message queues like JMS, Kafka, AMQP and Flume, and with devices through a variety of protocols: TCP, UDP, HTTP, AMQP, MQTT, all those kinds of wire protocols. And we have some adapters for the protocol translation gateways as well.
So sources of data means different things. We can source data, for example, from JDBC databases in the form of queries, but that is typically for loading in-memory data, loading our in-memory data grid for enrichment purposes. You wouldn’t want to do that for real-time data, because a JDBC query isn’t real time; it returns a single result set. So you have to differentiate between loading static data, which we use for reference and context information, or if you want to do batch in a streaming fashion, and real streaming data, which would go through change data capture. So as things are happening within a database, you are streaming that.
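The distinction above, static reference data loaded once into memory versus events that arrive continuously, can be sketched in a few lines of Python. This is an editorial illustration under stated assumptions, not Striim’s API: `load_reference`, `enrich`, the `device_id` key and the sample rows are all hypothetical, and plain lists stand in for a query result set and a change stream.

```python
from typing import Dict, Iterable, Iterator


def load_reference(rows: Iterable[dict]) -> Dict[str, dict]:
    """One-time load of static reference rows (for example, the single
    result set of a database query) into an in-memory lookup."""
    return {row["device_id"]: row for row in rows}


def enrich(events: Iterable[dict],
           reference: Dict[str, dict]) -> Iterator[dict]:
    """Join each streaming event with its reference row as it arrives.
    Events for unknown devices pass through without added context."""
    for event in events:
        context = reference.get(event["device_id"], {})
        # Event fields win if a key appears on both sides.
        yield {**context, **event}


# Stand-in for a query result set from an asset database.
assets = load_reference([
    {"device_id": "d1", "model": "pump", "site": "plant-a"},
    {"device_id": "d2", "model": "valve", "site": "plant-b"},
])

# Stand-in for a continuous stream of device events.
stream = [
    {"device_id": "d1", "temp": 71.5},
    {"device_id": "d9", "temp": 64.0},
]
enriched = list(enrich(stream, assets))
```

Because `enrich` is a generator, it handles events one at a time and never needs the whole stream in memory; only the reference data is held, which is why its size is the practical limit.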
Excellent. Shall we squeeze in just one more question? How does today’s stream processing relate to older messaging techniques developed by TIBCO, Software AG, IBM, SAP and Oracle that are still in widespread use? Let’s take a stab at that one, Alex. You know, I think you might be better for this one, Steve. Obviously, with your history at GoldenGate, you’re pretty well steeped in this stuff.
So there are definitely messaging technologies out there. JMS has been around for years. MQSeries has been around for even more years, and a lot of those messaging technologies were typically applied to application integration and SOA, service-oriented architectures. And so they are inherently not necessarily designed with throughput in mind, or with some of the other requirements in mind, from scalability, clusterability and other things, that will take the load of huge amounts of IoT data. And that’s possibly part of the reason why Kafka has had such a sudden rise in popularity. But that’s just half the story. The message queue is half the story. The other piece is the stream processing, and as I mentioned, something like complex event processing has been around for quite some time, and it used to be that it was not designed for specific purposes and the barriers to entry were really high.
It was difficult to get data in, difficult to get data out, difficult to visualize and build dashboards over it, and difficult to integrate it with other data, to integrate it with reference data, for example, that you load in memory. So I think one of the major things that is happening, and we are seeing it more and more, is the integration of multiple in-memory components. So the combination of a high-speed message infrastructure, in-memory data grid, in-memory stream processing and analytics, in-memory databases, in-memory visualization, in-memory machine learning scoring and in-memory transaction processing, all those things coming together to do a lot more stuff in memory. And part of the reason why that is possible now is because memory is getting cheaper and more available. And as we start to see new, interesting forms of memory, like Intel’s XPoint memory, the speed of RAM with persistent storage, the amount of in-memory processing that you’ll be able to do will increase astronomically, almost exponentially, which should help that keep up with the growth in data that we saw earlier.
Thank you so much Steve and Alex. Unfortunately we are out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Alex and Steve, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.
Ovum analyst, Tony Baer and Striim Co-founder and CTO, Steve Wilkes discuss the need for hybrid open source data platforms, which combine open source with proprietary IP for a more cost-effective data management solution.
Welcome and thank you for joining us for today’s webinar. My name is Katherine and I will be serving as your moderator. The presentation today is entitled The Real Costs and Benefits of Open Source Data Platforms. We are honored to have as our first speaker Tony Baer, principal analyst at Ovum. Ovum is a market-leading research and consulting firm focused on helping digital service providers thrive in the connected digital economy. Tony leads Ovum’s big data research area, focusing on how big data must become a first-class citizen in the data center, the IT organization and the business. Joining Tony is Steve Wilkes, Co-founder and CTO of Striim. Steve has served in several technology leadership roles prior to founding Striim, including heading up the advanced technology group at GoldenGate Software and leading Oracle’s cloud data integration strategy. Throughout the event, please feel free to submit your questions in the Q&A panel located on the right-hand side of your screen. Tony and Steve will address all questions after the presentation wraps up. With that, it is my pleasure to introduce Tony Baer.
Okay. Thank you, Katherine, and thank you everybody for taking time out of your day to join us in our discussion on the real costs and benefits of open source data platforms. This is one of those perennial topics that we get questions on from our clients, and I have to express my gratitude to Striim for giving us the opportunity to share this discussion with you. Open source is obviously not a new phenomenon in the software market, but the fact is that in the areas that I personally cover, which are data management and big data, it’s just almost impossible to avoid running into open source. And so very frequently I get questions from clients about the reliability, the value and the role of open source versus proprietary software.
Now, of course, this issue is as old, for instance, as Linux itself. I remember covering the emergence of Linux almost 20 years ago. And it proved the viability of this new alternative model of software development that today is becoming more and more, at least in my area, the norm. In fact, very often one of the first questions I ask when I come across new software firms is: is your product available as open source? So with that said, let’s take a look at what we’ll be talking about over the next hour or so. First of all, we’ll take a look at, essentially, why we are having this discussion. Why open source? What’s the draw? And then we’ll cut to the chase, which is really looking at the costs and benefits.
And then we’ll look at some real-life examples. Where have these costs and benefits played out? And then we’ll conclude with the takeaways. And, spoiler alert, our take on this is that a hybrid model that combines the innovation of open source with the reliability and last mile of proprietary really is the most successful model and the most viable model for enterprise software. OK, let’s go to the races here. So the first part is why we’re having this conversation. There’s no question that open source is becoming more and more of a routine occurrence in the software world. Black Duck Software is a software firm that provides services that help enterprises track their licenses so they don’t have any IP violations when they utilize open source code. They conduct an annual survey on what they call the future of open source, and they’ve been doing this for probably the better part of a decade or so.
And they do this with their partner organization, North Bridge partners, and they survey a bunch of folks to look at how open source is being used and some of the key issues around it. And these were some of the results from the most recent survey. It was the 2016 survey, so it was published earlier this year. And it found that the use of open source grew 65% among this sample group over the past year. So a pretty significant uptick. And so where was most of the open source? What types of software tend to be open source? What tends to be used the most? What they found, and this is going to be a key thought carrying on through this discussion, is that it was basically in commodity building blocks.
So in areas like operating systems. I mean, vendors used to compete on operating systems, and then the advent of Linux really showed that the value add is further up the stack. And so of course today, for instance, Microsoft is no longer defined as a Windows company. So operating systems was a key area, for probably obvious reasons, but also data platforms and development tools; that’s where we saw most of the use of open source. And then they asked the respondents what was the driver: why do you use open source? And the biggest reason was freedom from vendor lock-in. Now this next question, which is on participation in the open source community, to me is an outlier, which showed that the group that Black Duck surveyed was probably not representative of enterprises in general, because about two thirds of this group reported that they actually contributed to open source projects.
From our research, we’ve found that most enterprises that use open source do not necessarily contribute; you really couldn’t scale this number out. It’s not like two thirds of the Global 2000 contribute. But certainly amongst this group there was a high contribution rate, a higher active participation rate. But this next point I thought was very interesting and actually potentially a cause for concern, which is, they asked about governance, and basically half of the companies that responded have no formal selection or approval policies for open source software. The question is, would you see the same thing with conventional proprietary software? And so that’s an area where obviously I think there are still a lot of best practices and a lot of lessons to be learned. So what has made open source successful?
And so the key watchword on this slide is community: successful communities breed success. That means that when you have a community of critical mass, you have enough folks that are participating in it. It’s broad enough, and the activity has sufficient critical mass, that good things happen. For instance, among the benefits when you have a successful open source community is that community members get to steer where the software goes; it’s a meritocracy. It’s not a matter of hoping that a vendor realizes their needs or feels their pain and responds. No, the community in essence votes on this, at least the people who are contributors.
And when you have a critical mass community, the velocity of code commits is very high. And along with that, with a large community, because again you’re tapping into what is theoretically the world’s largest virtual software R&D organization, bugs get tested and caught and solved a lot more quickly. So that’s what happens when things work out right. I mean, not all open source projects are successful. Not all open source communities are successful. So on this page, we have a few examples, some of the poster children that really made this myth reality. Linux is the obvious one; that’s really the granddaddy, although it’s actually the exception that proves the rule, in that Linux’s community had someone at the top who had, I won’t say unquestioned moral authority, but very widely accepted moral authority.
It’s pretty unusual. But the rest of these projects, like Apache Hadoop and Apache Spark, are also very broad-based communities and tend to be better examples of this model in practice. So, the great debate here: why open source, why proprietary? Now, the why open source. Those points on the left were the chief points that were identified in that Black Duck Software survey. And again it reinforces what we said on the previous page, which is, why do companies, why do enterprises, use open source? Because they perceive that when an open source project is successful, the code quality will be better. They also feel that the features will be competitive because they are vetted by the community. So there is definitely a groundswell, a critical mass market that wants and needs these features and will therefore be motivated and incentivized to improve and fix them.
And of course, along with that, because with open source you’re not restricted by a vendor license where they own the source code, you have the ability to fix it and customize it. Well, of course it’s going to depend on the open source license. There are dozens of them out there, but in general, some of the most popular licenses today tend to be patterned off the Apache license, which allows you to add value on top of the open source code. So that pretty much is the case. Now, why proprietary? Well, in many cases, open source projects by their very nature are going to be very narrow and specific. And so they’re not necessarily unified solutions. You have to put the pieces together. That’s what vendors do. And in turn they have their unique intellectual property.
But also, what’s very important is that they provide accountability. Essentially they give you one throat to choke, so to speak. Also, as they put together a solution, they’re ultimately accountable for security, and chances are that with a vendor’s software, the security should theoretically cover all the functionality in that particular software product. There’s also perceived customer focus, in that successful software companies follow their customers. Now again, when you look at these points, there are always going to be exceptions to every rule. The world is not black and white. But these are essentially the debating points between open source and proprietary. One other point, though, that was not actually among the top responses from the Black Duck survey, is that a lot of folks believe that open source is going to provide them cost savings.
A good example of this is that I fielded a query from one of our enterprise clients who are based out in Singapore. I think they were a big banking institution. This was about four or five years ago, when Hadoop was still new and the perception was that Hadoop was open source. And so this client was looking at what types of packages would work on open source. And I went on a call with them, and I asked them what was driving this, and they said, well, we would like to get rid of our Oracle software because Hadoop is free. I said, well, not exactly. And I said, who are you going to have maintain this? They said, well, we’re going to hire consultants.
And so a lot of the rest of this discussion is going to be about the perceived cost savings, and then we’ll look at what savings are real and what costs are real. So, to cut to the chase: what do we think is the answer? Well, what we’ve found from our experience in looking at open source software products is that the best recipe for success typically is a hybrid type of model. Very often it used to be called open core, where the core of the technology, or the kernel, would be open source, but the vendor would then surround it with their own value add. And some of the advantages of that? Well, of course, there’s a value to the vendor in that they don’t have to reinvent the wheel.
A good example of that is the folks who are sponsoring this, Striim. They have basically a hybrid solution. When it came time to choose or develop a messaging system, they realized that the technology already existed in open source and that it was not part of their core value add. So rather than spend their time reinventing a messaging system, they chose ZeroMQ. This model also gives you the chance to leverage commodity infrastructure, as a lot of open source software is typically designed to do today. Another advantage of this model is that it gives both the vendor and the customer the chance to harness the innovation that’s coming from the open source community.
Especially with the latest common building blocks. Also, and this is very important, not just to the vendor but also to the customer, there’s commercial viability. The metaphor I’m using here is that the surgery was successful but the patient died. You may have great software, but if the vendor gives you such a great deal that they can’t make money on it, ultimately it’s not going to be of much value to you, because if that vendor is not viable, you’re not going to have anybody to support it. So it’s not just in the vendor’s interest to be commercially viable; it’s also in the customer’s interest. This is where the role of unique IP becomes very critical, because the vendor is best situated to deliver that to the enterprise.
Then there’s also what we call the last mile of functionality. That’s where, for instance, you have these common building blocks, but at the end you need to do that integration. With Striim, for example, they did that integration with Oracle at the log level; they do that change data capture. That’s not the type of thing that’s going to be very viable for an open source community, because it’s going to be a very narrow-purpose project. When you need to go narrow but deep, that does not lend itself to open source. So that, again, is why we see the hybrid model being most viable, and we’ll return to this. It really does give you the best of both worlds. Now, not all open source projects are alike, and at the risk of oversimplifying, well, guilty as charged: we are oversimplifying here.
We’ve basically shown two ends of the spectrum, because there are many different types of models. On one end is vendor led; on the other end is the community side. Vendor led is basically where the vendor puts the source code out someplace like GitHub, but the vendor ultimately leads that project. It’s not governed by any type of community. Then there’s the other extreme, which is community led, where the project is hosted by a foundation such as Apache, which is pretty much the gold standard of open source communities. So what’s the difference? In the vendor-led project, and a good example would be something like MongoDB, the vendor essentially makes the roadmap decisions on where the product goes.
Whereas on the community side, it’s a meritocracy. Now, that being said, reality is not black and white, in that we’ve noticed that a number of vendors that have led their own projects have also started to dabble on the community side as well. And the same is true of vendors who deal with community projects; a vendor might lead a project initially, in its own kind of incubation phase. Vendor-led open source products are in many ways like proprietary software products. The difference here, though, is that the code and the roadmaps are publicly available. That’s basically the difference. Again, it’s not to say that one model is better than the other.
For instance, for MongoDB and its customers, the vendor-led model works very well; for Spark, Hadoop, and Linux, the community model has proved quite successful. So now let’s look in a little more detail at the cost savings, and we’ll start with the good news. I’m kind of paraphrasing the old Meineke muffler commercial here, and yeah, I originally had a better picture, but the quality was crappy, so we’ll go with the screaming babies: “I’m not going to pay a lot for this software.” The thing with open source is that the cost model changes. You don’t pay for the software itself. So it doesn’t matter how much software you use or download; you’re not paying a perpetual license or subscription for the software itself.
On the other hand, just because you can get that software for free doesn’t mean open source software is free. It’s freely available, but it’s not free; one way or another, you’re going to pay for it. The preferred model is where you go with a commercial open source provider that packages a distribution, does all the integration of the open source modules, and hopefully does some of that last-mile stuff. The typical model is that you pay a subscription for support. That’s actually a fairly familiar model, because it’s kind of like the annual maintenance you pay with proprietary software. The only difference is that you’re not paying the upfront capital cost of a perpetual license. For vendors, the savings are that they avoid reinventing wheels, as we mentioned before.
And we gave the example of Striim, with their use of an open source messaging technology as part of their solution. For enterprises, as mentioned, there’s no perpetual licensing, and so you eliminate that upfront capital cost. You’re typically taking advantage of commodity technology. Most open source gets popular because the technology is affordable and it works on affordable hardware, typically x86 machines, for instance. And along with that, it has altered pricing expectations, and that’s where that picture comes in. I talk to Hadoop folks, and they would love to be able to charge the multiples that enterprise software folks like the Microsofts and the Oracles of the world have historically charged, but realistically their market is not going to put up with that.
And so we see the Hadoop market as being in the few hundreds of millions; it’s nowhere close to the existing enterprise database market, which is well north of 10 billion. And even as Hadoop matures, it will never get to that 10 billion mark, basically because the customer base expects lower-cost software. However, what we need to point out, and we’re going to talk about this more on the next slide in terms of cost, is that the savings picture is going to differ between whether you use what we call raw open source, where you’re going directly to the community website and downloading those packages or projects, versus whether you subscribe to a vendor with supported distributions, which we mention down here. When you subscribe to an open source software vendor, the vast majority of them are actually following the hybrid open core model.
By the way, there is some proprietary technology there anyway. Okay, so let’s go into those costs. We’re going to look at it from the standpoint of raw open source, where you go to the project site; you don’t bother going through a vendor, and you don’t pay a subscription for a supported distribution. Here the picture is very similar to that of implementing your own homegrown software. The only difference is that you didn’t write the original software. You’re probably going to be wearing a lot of other people’s stuff as a result of this, but that’s another story. The advantage, of course, is that you get flexibility. It’s in essence a best-of-breed strategy, because you’re picking and choosing your open source projects, but you’re also bearing the integration costs.
Security is going to vary by open source project. It’s going to be more complete in some than in others. Some security projects may not necessarily support all the open source projects that you want to implement. So the key headache for organizations that go this raw, open-source-in-the-wild download route is that they need to harmonize the security and integrate all the software. By the way, there’s also the fear of obsolescence. I’ll admit it’s not exclusive to open source, because you can get a vendor product and that vendor goes out of business; well, that’s all she wrote. The same thing can happen in open source. Just because the software is still up there on a website doesn’t mean that it’s not been put on extended life support or end of life.
And so this is very typical of the growing pains of maturing technologies, because some are going to be winners and some are not. In that sense, you’re making bets that are very similar to those with proprietary software. But it’s not something that people, I think, really think very heavily about with open source, because they think, well, it’s free. So there is something at risk there. And then there’s the question of extensibility, which is that because many open source projects are very narrow in scope, they’ll require additional functionality. So we’re going to go on to a few real-life examples here to bear out what we’re talking about. This is the case of a bank that implemented a cybersecurity solution, and they went the route of going through open source projects in the wild.
And so they implemented an open source project for data flow routing, Storm for streaming, Metron for security analytics, and Kibana, which is related to Elasticsearch and is for visualizing log analytics. They also had a GUI, an alert UI. What I’m actually surprised I don’t see up here is Elasticsearch; I would have assumed that would be part of it as well. But anyway, this was essentially putting together a bunch of these projects. The only difference between this and homegrown software is that they didn’t write the original projects, but they otherwise had to bear the full load of having to integrate all this stuff and patch it. And this cybersecurity solution was not trivial.
It required about 45 engineers, and it cost about 20 to 30 million to maintain and keep working over about a five- or six-year period. The big pain points were gaps at the last mile, especially with end-to-end security, which required custom last-mile development. Another example is a communications service provider that was trying to extend a call center application. They used several open source projects: Flume for data in motion, essentially for routing data flow; Logstash, which is for collecting and transforming log data; and Elasticsearch. Hey, found it here; they didn’t leave it off the list. This was actually a much more modest solution: it required just five practitioners and cost in the neighborhood of 3 million, plus or minus, over five or six years.
The key gaps here, though, again were at that last mile, which is change data capture integration with the customer databases, because in a call center you want to know what’s happening with the customer. That required a lot of costly extra development and a lot of costly maintenance. Our last example is dealing with unplanned obsolescence. This is a credit card processing firm that had a real-time transaction processing application. In this case, on the surface, it looked like a very successful quick hit. They used Spring XD, which is a component for building data pipelines. It was very quick to implement: just a couple of months, not that many engineers. So on the face of it, it served their purposes for what they were doing.
The problem is that the vendor pulled the plug on it when they went to a different strategy. The vendor placed the Spring XD project on end of life, and as a result, the credit card company was back at square one. And again, I will say the same problem can happen with proprietary software as well. But I think the important point here is that just because it’s open source doesn’t mean it’s going to be successful. Just as you depend on a vendor for product roadmaps with proprietary software, this was the case of a vendor-led open source project, and in this case the vendor pulled the plug.
So what tends to work with open source? As we said, it works best with commodity technology, which draws a critical mass target audience of developers and prospective customers. It has to serve a wide enough market to build a big enough community. The technology also has to be extensible, and the licensing has to be extensible, and that’s where the Apache license has proven very popular, because it does allow you to add your own value on top of the open source. It doesn’t require you to give it back to the open source community, unlike a lot of the earlier, gen-one licenses such as the GPL licenses, which were the original licenses in the open source world. And the ideal, again, is that it’s critical mass technology, so it’s not written for an overly narrow use case and it’s not hardwired to any specific platform.
And the APIs are published, freely available, and open. Ideally, if you have those open APIs, hopefully you avoid best-of-breed integration issues. But again, that’s another reason why we believe the hybrid model is best; we’ll get back to that in a sec. But where do we see proprietary software? Where does proprietary IP really come in? Well, where we really see it happening is especially at the application business logic level. Now of course, to every rule there are exceptions. Yes, there is SugarCRM, a customer relationship management system, but for the most part we’ve not seen a lot of open source at the application level. And there’s really a good reason for that, which is that that’s where businesses want to differentiate themselves in terms of how they do business.
And so even though, obviously, application software has commoditized quite a bit, the differentiation is too specific to really make it well suited for an open source project. I mean, you don’t want to get a lowest-common-denominator solution for your business. We’ve also found proprietary is very good for niche technologies and solutions where the addressable market is too small, and therefore the addressable community of developers is going to be too small to really support an open source project, such as, let’s say, a connector to the logging system of a database. And that again ties into this last one, unique and custom integration use cases. So what we see here is that proprietary IP is best for differentiating enterprise solutions. It’s also good for end-to-end security, because a vendor can take charge of making sure all those gaps are filled.
That’s going to be kind of hit or miss when you have a community. There’s also polished user experiences and UIs. And I think it’s not because you can’t open source that. I think it’s probably because of the nature of developers, who really think in terms of functionality, and maybe I’m building some stereotypes here, but we’ve just not seen user experiences as being the type of thing that’s open sourced. And yes, we do see some GUIs on various open source projects, but for really polished ones, that’s really where the vendor comes in. With the exception of, let’s say, certain categories of products like the ones where Apple made its name, the UI is usually not the showstopper. And also, again, there’s last-mile integration and consistent SLAs.
That’s where you’re really going to count on a vendor to make sure that, essentially, the trains run on time and all the tracks connect. So in general, we believe in using hybrid open core for solutions where the last mile is differentiated intellectual property. At the same time, you get the best of both worlds by leveraging community building blocks. It’s essentially where commodity meets enterprise grade. And so the takeaway in all this is that we’re not saying don’t use open source; we’re saying use open source where it’s appropriate, but when you do so, keep your eyes open. Open source is not necessarily free, so look at the real costs. You’re either going to pay that cost through commercially supported annual subscriptions, which is pretty much like paying maintenance for conventional software, or you’re going to be paying for it in terms of all the spadework you have to do when you implement raw open source out in the wild.
That’s a lot like doing, again, your own homegrown software. The hybrid open core approach we’ve found to be the most reliable and the most viable model for enterprise software because, as we said, it combines the best of both worlds. It keeps the vendor in business, which gives you the assurance that you’re going to have a vendor there, a throat to choke, and someone to support your software. Yet it also taps the rapid innovation of the open source community and the economies of commodity software, while at the same time you get that unique IP and last-mile integration and security. So that pretty much wraps up my part of the conversation. At this point, I’d like to turn it over to Steve.
Well, thank you, Tony. I’m going to start by introducing Striim and talk a little bit about our platform, to provide context for the discussion around open source and this hybrid open source model. Because for Striim, as a company building our platform, as for any intelligent software company, if it’s been built already, then why build it again? We’ll go into the discussions around that and how important it was to integrate pieces of open source into the platform. So Striim has been around for about five years now. We have a mature technology that’s been in production with customers for more than three years. And we are continually evolving and releasing new updates to the platform, adding new functionality, new connections, etc. Striim provides a full end-to-end streaming data integration and analytics platform. What the platform does, in a nutshell, is enable you to collect data continually from a whole variety of enterprise sources: things that may be inherently streaming, like message buses and sensors that continually push data in real time.
And then things that you may not think of as streaming, like files, for example. With files, we will collect the data at the end of the file as it’s written and stream that in real time. And with databases, most people think of those as a historical record of what’s happened in the past. But we use change data capture technology to see what’s happening in the database in real time, the inserts, updates, and deletes as they’re happening, and stream those out. So once you’ve done continuous data collection, you have real-time in-memory data streams. And the simplest thing you can do with the platform is just deliver that somewhere else. So from a streaming integration perspective, you can take stuff that’s being written into a database, for example, and stream that onto Kafka, or take your web log files from on premise and stream those out into Amazon S3 or Azure SQL DB in the cloud.
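To make the change data capture idea concrete, here is a minimal sketch in Python of a consumer that replays a stream of inserts, updates, and deletes against an in-memory copy of a table. The `ChangeEvent` shape and the `apply_changes` helper are hypothetical names chosen for illustration; they are not Striim’s event format or API, and a real CDC reader would obtain these events from the database’s transaction log.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record: one insert, update, or delete."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # changed column values; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> None:
    """Replay change events, in order, against an in-memory replica."""
    for ev in events:
        if ev.op == "insert":
            target[ev.key] = dict(ev.row or {})
        elif ev.op == "update":
            # Merge only the changed columns into the existing row.
            target.setdefault(ev.key, {}).update(ev.row or {})
        elif ev.op == "delete":
            target.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")


replica: Dict[int, Dict[str, Any]] = {}
apply_changes(replica, [
    ChangeEvent("insert", 1, {"customer": "A", "total": 10}),
    ChangeEvent("insert", 2, {"customer": "B", "total": 20}),
    ChangeEvent("update", 1, {"total": 15}),
    ChangeEvent("delete", 2),
])
print(replica)
```

The point of the sketch is only that the target stays in step with the source by applying each change as it arrives, rather than by re-querying the source table.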
That’s the kind of thing that you can do with the platform. But typically our customers are doing more than that, and that requires some degree of processing. So we have this ability to do in-memory, SQL-based continuous queries that can process and analyze that data. And this, in conjunction with time series windows and an in-memory data grid, allows you to do things like transforming the data from one form to another, filtering it, and aggregating it, so looking at the last minute’s worth of data, or the last hundred events, or the last entry for each one of these records, etc., and also enriching the data. And that’s where the in-memory data grid comes in, because you need to load large amounts of reference data into memory and then join that with the streaming data in real time as things are passing through.
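As a rough illustration of the two window types mentioned here (the last N events, and the last minute’s worth of data), the following Python sketch keeps a running sum over each. The class names `CountWindow` and `TimeWindow` are made up for this example, and the time window assumes events arrive in timestamp order; a streaming SQL engine would express the same thing declaratively.

```python
from collections import deque
from typing import Deque, Tuple


class CountWindow:
    """Sliding window over the last `size` events."""

    def __init__(self, size: int) -> None:
        # deque with maxlen drops the oldest value automatically.
        self.values: Deque[float] = deque(maxlen=size)

    def add(self, value: float) -> float:
        self.values.append(value)
        return sum(self.values)


class TimeWindow:
    """Sliding window over the last `seconds` of data (in-order timestamps assumed)."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self.events: Deque[Tuple[float, float]] = deque()

    def add(self, ts: float, value: float) -> float:
        self.events.append((ts, value))
        # Evict everything that is `seconds` or more older than the newest event.
        while self.events and self.events[0][0] <= ts - self.seconds:
            self.events.popleft()
        return sum(v for _, v in self.events)


last_three = CountWindow(3)
count_totals = [last_three.add(v) for v in [10, 20, 30, 40]]

last_minute = TimeWindow(60)
time_totals = [last_minute.add(ts, v)
               for ts, v in [(0, 10), (30, 20), (61, 5), (95, 1)]]

print(count_totals)  # [10, 30, 60, 90]
print(time_totals)   # [10, 30, 25, 6]
```

Each new event produces a new aggregate, which is the continuous-query behavior described above: results are updated as data arrives instead of being recomputed on demand.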
So that’s how you can process data and get it into the form you want before delivering it. An example would be if you’re reading change data from your nicely normalized database. Say it’s an order detail table: you’re just going to see a whole bunch of IDs, you know, order ID this, customer ID this, item ID this. And if you’re just delivering that to, say, Kafka, that’s not going to mean much to the people reading the data from Kafka. So for enrichment, typically you’d load reference data into memory and then join that. And so now, instead of those IDs, you get the order information, the customer information, the item information, all written out. And so you can now do more intelligent analytics. And talking about analytics, we have the capability of doing in-memory statistical analysis, anomaly detection, pattern detection through a complex event processing syntax, and correlation across data streams and across windows.
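The enrichment step in the order detail example can be sketched as a lookup join against reference data held in memory. In this Python sketch the `CUSTOMERS` and `ITEMS` dictionaries stand in for the in-memory data grid, and all names and values are invented for illustration.

```python
from typing import Dict, Iterable, Iterator

# Reference data loaded into memory ahead of time (illustrative values).
CUSTOMERS: Dict[int, str] = {101: "Acme Corp", 102: "Globex"}
ITEMS: Dict[int, str] = {7: "Widget", 8: "Gadget"}


def enrich(events: Iterable[Dict[str, int]]) -> Iterator[Dict[str, object]]:
    """Join each streaming event with reference data as it passes through."""
    for ev in events:
        yield {
            **ev,
            # Fall back to "unknown" when the reference data has no match.
            "customer_name": CUSTOMERS.get(ev["customer_id"], "unknown"),
            "item_name": ITEMS.get(ev["item_id"], "unknown"),
        }


order_details = [
    {"order_id": 1, "customer_id": 101, "item_id": 7},
    {"order_id": 2, "customer_id": 999, "item_id": 8},
]
enriched = list(enrich(order_details))
for row in enriched:
    print(row)
```

Because the lookup happens in memory per event, the downstream consumer sees readable names alongside the IDs without any extra round trip to the source database.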
And this enables you to look for things that happen at the same point in time or within the same time range, and also to look for things that are co-located in the same geographic location, for example. So doing this correlation is an important aspect of almost all of the analytics use cases that we see. Then, on top of all of this processing, which, as I mentioned, is happening through SQL-based queries, you can build visualizations, all using our platform. So you can build real-time dashboards, you can trigger workflows, you can generate alerts to notify people that things are happening, and we have ways of integrating third-party machine learning directly into the streaming data flows as well, for real-time scoring and inference. That’s basically what the platform does. It’s an in-memory streaming integration and analytics platform; it’s everything from continuous data collection to processing, analytics, and delivery.
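One simple reading of correlating events “within the same time range” is to pair events from two streams that share a key and whose timestamps fall within a tolerance of each other. The Python sketch below does this with a nested loop over small in-memory lists purely for clarity; the `correlate` function and the sample streams are hypothetical, and a real engine would bound the work with windows rather than compare every pair.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, correlation key)


def correlate(a: Iterable[Event], b: Iterable[Event],
              within: float) -> List[Tuple[Event, Event]]:
    """Pair events from two streams with the same key and close timestamps."""
    b_list = list(b)
    matches: List[Tuple[Event, Event]] = []
    for ts_a, key_a in a:
        for ts_b, key_b in b_list:
            if key_a == key_b and abs(ts_a - ts_b) <= within:
                matches.append(((ts_a, key_a), (ts_b, key_b)))
    return matches


logins = [(100.0, "alice"), (200.0, "bob")]
transfers = [(105.0, "alice"), (400.0, "bob"), (102.0, "carol")]
pairs = correlate(logins, transfers, within=10.0)
print(pairs)
```

Here only the login and transfer for the same user within ten seconds are paired; the same pattern extends to geographic co-location by swapping the key or adding a distance test.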
And this suits different categories of use cases. It’s a piece of middleware that enables you to do a whole bunch of different things. So on the real-time data integration side, we have customers that are delivering data in real time into a data lake, for example, after processing it and getting it into the form that they want, or integrating on premise and cloud, keeping a cloud database up to date with an on-premise database for hybrid cloud initiatives, or doing IoT edge processing. On top of that, we have the analytics-type applications, where customers are doing things like fraud detection, or cybersecurity monitoring where you’re analyzing lots of different data feeds, and anti-money laundering, and location-based solutions where people change locations really, really quickly. So you need to have real-time insight into what’s happening. And that’s where a streaming platform really comes into play, because you can process the data as it’s being produced.
And then we have the customers that have built dashboards, and this is typically used for building and monitoring real-time metrics and key performance indicators, for real-time quality monitoring, SLA monitoring, etc. And these use cases span many, many industries. I’m not going to go into all of these because that would take up another couple of hours, but we have use cases across lots of different industries, and we touch lots of different departments within those organizations as well. The key differentiators of our platform are that it is a full end-to-end platform that does everything from collection, processing, analytics, delivery, and visualization of real-time data, and that it is designed to be easy to use. So it’s very fast to build and deploy these applications. We have a UI for drag-and-drop building of data flows. We have the SQL-like language that enables you to use a whole bunch of different types of internal resources, whether they’re developers or business analysts or data scientists, to actually build out these data flows.
We are enterprise grade, and that means that we are inherently clustered, scalable, and reliable, with fault tolerance, exactly-once processing, and recovery built into the platform. We have end-to-end security, and we can integrate with a whole bunch of things. We work with the top three cloud platforms and the top three big data platforms. We have change data capture for the major databases. We have deep integration with Apache Kafka and other open source solutions. So that’s the platform in a nutshell. And it’s important to understand that within the context of how this works with open source. So now imagine you didn’t have Striim and you needed to build a streaming data framework or platform from open source. This is the process that we went through as well, so we’re talking from experience here. So how do you build such a platform?
Well, inherently you’re going to be taking data from sources, moving it to targets, and doing stuff in the middle. As I mentioned, that stuff can be quite complex, and it requires a lot of different categories of software. In order to move data around a cluster, you need a high-speed messaging infrastructure. In order to load large amounts of reference data into memory, you need a distributed in-memory data grid or cache. To store the results, you need distributed results storage. And then you have data collection, delivery, and the processing of the data. You have to have some way of developing the solutions and visualizing the solutions. Those are all different categories of open source that you need. And we’ll just add a few in here so you can get an idea, but every single one of these categories has a large number of different pieces of open source that you can choose from.
And that isn’t the end game, right? Those are just some pieces that you need. In order to get it all to work, you need some glue code around all of this to handle all of the enterprise-grade stuff: the clustering, scalability, reliability, security, and management that enable all these pieces to work together, to scale together, and to be reliable together, with a single security policy across all of them. And then you need a layer that enables your developers to actually build the applications, because this is just the framework, the platform; there are no solutions yet. The goal of this, as an enterprise, would be that you’re building something that enables people to build analytics or integration much more quickly. And so you need an abstraction layer, API connectivity, a web server, and things like that. So that’s all the pieces.
And those are the pieces we looked at as well when we were building our platform. So if you look at the process involved, if you’re going to build this from open source, first of all, you need to design it. We just did that: we had a diagram that showed the various pieces that we were looking at. You’d go deeper than that in reality, but that’s the idea. And then for each component of open source that you’re interested in, you have to look at the different options. So identify what open source is available in each of the categories and evaluate each one. You may have performance requirements, scalability requirements, overhead requirements, even software language requirements: I want everything to be Java, or I want everything to be JavaScript, and it’s kind of hard to mix the two.
Then you need to build the integration, and build the glue code and the layers around all of that. As you’re integrating things, you might find that some things don’t work together, so you may have to go and identify different components to use. Then you go into testing, and of course testing likely results in changes; you might have to change things up because things don’t work. Once you have it built, you’re going to have to maintain it. And there are some of the things that Tony mentioned: the open source software that you’ve chosen may be upgraded in some way and change its APIs. Thankfully, Kafka is now finally out of beta and is at 1.0 after all this time, but up until now, the APIs were changing all the time. And so with every upgrade, every version, you may have had to modify all the integration code to work with the new API.
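One common way to limit the rework described here is to keep application code behind a thin interface, so that an upstream API change only touches a single adapter. The Python sketch below shows the idea with a hypothetical `MessageBus` protocol and an in-memory stand-in; it does not call any real messaging client, and a production adapter would wrap whichever client library and version you had chosen.

```python
from typing import List, Protocol, Tuple


class MessageBus(Protocol):
    """Hypothetical interface the application codes against."""

    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryBus:
    """Stand-in adapter; a real one would wrap a specific messaging client."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []

    def publish(self, topic: str, payload: bytes) -> None:
        self.sent.append((topic, payload))


def deliver_order(bus: MessageBus, order_id: int) -> None:
    # Application logic depends only on MessageBus, not on a vendor API.
    bus.publish("orders", str(order_id).encode("utf-8"))


bus = InMemoryBus()
deliver_order(bus, 42)
print(bus.sent)
```

If the underlying library changes its API between versions, only the adapter class needs to be rewritten; `deliver_order` and the rest of the application stay as they are. This reduces the maintenance cost but does not remove it, which is the speaker’s point.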
And the other case that Tony mentioned was that the open source is deprecated. Maybe they’re not going to fix any bugs in it anymore. Its contributors have left; they’ve all moved on to the next big thing. In that case, you may just have to replace your piece of open source, which means you now have to identify a new one and go through the evaluation process and the reintegration process. So you have to do all of that, and also get support if there are any bugs that you find, going to the community or the vendor to fix things. And you need to do all of that before you can start to build your applications. So that’s quite an involved process, and that’s why you see these large timeframes that are involved in a lot of these open source projects. You’re talking, even with something basic, six months to a year before you can even start to get any results out of it.
Whereas if you download our platform, we’ve done all this already, and so you’re literally just installing Striim, and then you can start building the applications. You’ll talk to us for support, and we will manage all the issues with all the open source that we include. That enables you to build your applications faster and get to deployment faster. If we look at what our platform looks like and how we’ve integrated open source: we have things at the edge, so we have sources and targets in our platform that integrate with open source pieces including Kafka, HDFS, Flume, HBase, Hive, et cetera, all of the things you’d expect if you’re reading data or writing data somewhere. But then we also have pieces of open source within our platform, and we went through that identification and evaluation process in order to choose the best of breed in all of these cases and integrate them together.
We have two different versions of messaging. We have high-speed messaging that runs at network speed and utilizes a Java version of ZeroMQ and Kryo. We have a persistent high-speed messaging infrastructure for recovery purposes and application decoupling purposes. We use Kafka for that. It’s built into the product and ships with the product. We have Hazelcast, which manages clustering, metadata management, and control. We have an implementation of JCache, an in-memory data grid, for high-speed lookups and very fine-grained control over where data is stored across the cluster. And we utilize Elasticsearch (Tony will be happy, as I mentioned Elastic again) for distributed results storage. So those are all the pieces of open source that we have chosen. There are a lot more supporting classes I couldn’t possibly fit in here.
A JSON parser, for example. There are lots more of those, but they’re not major components that are their own software category by themselves. Then of course we built all of the glue code. So we had to work out how we do scalability, distributed technology, clustering, and failover across all of these pieces so they all work together. How do we manage reliability and exactly-once processing all the way from sources to targets? How do we have a single security policy, role-based security, and encryption across all of these things? And a full management and monitoring interface, APIs and UI, that enables you to control this as a whole platform rather than individual pieces. We have a full set of APIs, whether through our scripting language, JDBC, ODBC, REST APIs, or WebSocket APIs, that enable you to connect with the platform.
But then we also have a whole bunch of secret sauce: all the pieces that enable continuous data collection, whether it’s from devices, big data, or databases through change data capture, and continuous data delivery, all those things that you see on the other side. And then the real key is SQL-based processing and analytics, which is our own intellectual property. That allows you to do filtering, transformation, aggregation, and enrichment, and to do complex event processing, anomaly detection, and correlation. And that’s a piece of the platform that we currently hold a patent for. As Tony mentioned, on top of that you very rarely see UIs, either in open source software or even within the enterprise. Not many enterprises that are building a data processing platform are going to take the time to build a drag-and-drop UI or a command line interface or some other easy way to actually build the applications.
They’re going to rely on developers to write code to build the applications. So we provide a drag-and-drop UI for building data flows, doing all the analytics, and building the dashboards. And so we’ve done all of that. So that’s how we incorporate open source into our platform, but our customers don’t have to worry about it. And if one of these pieces was decommissioned, was no longer supported, or had major bugs in it, then we handle that for the customer. They don’t have to have developers on call, the $3 million spent to keep people maintaining and upgrading the platform continually. Now, some of the unique IP it includes is change data capture, which enables you to get data from databases in real time as it’s happening, and also enables you to handle things like changes in schema. So if the table structure changes, we can modify how we write things out to Hadoop or Kafka, for example.
Okay. In-memory distributed processing, which is a patented technology, enables you to have the SQL-based processing happening across the cluster, intelligently route things across the cluster, join things with this distributed cache, and handle fault tolerance and exactly-once processing with rollback and recovery. That enables you to scale the applications and also trust that they’re going to work and pick up where they left off if, for example, you’ve lost the entire cluster. So these are some key things that we’ve added in, and on top of that there is the UI and dashboard builder that, as we mentioned, you rarely get from open source.
We are recognized by the industry for doing a lot of innovative work, both on streaming analytics and on IoT, and also as a great place to work, and we’re very happy about that one. Just very quickly, let’s drill down into a couple of these customer stories. We have a customer who’s built out an anti-piracy solution. It’s used for video feeds, etc., by media customers, and it enables real-time monitoring of the usage of media feeds and correlates multiple logs in real time to identify whether someone is really a subscriber or not. And why did they choose us? Well, they looked at a number of different open source log analytics products. They had some concerns about the number of people it would take to maintain it. They estimated that if they hadn’t used our platform, they would have had to triple the size of the team to actually do the development and to maintain the platform and keep it up to date on an ongoing basis.
There were also some limitations with the single open source solutions that they could have chosen for this, and with the integration that it needed to do. So we were chosen because, in addition to the log capture, we had change data capture from the Oracle database. We have the SQL processing language, which enabled them to use people they wouldn’t ordinarily have thought of as developers, people in the analytics team, to build out some of the data flows. We had great visualization that they could use for monitoring. And we could easily integrate with the existing code that they had for a machine learning solution. A second case is a leading credit card network. What they needed to do was to very quickly identify potential threats. If you have lots and lots of security applications out there, a large number of different types of security logs, you’re going to end up with alerts from all of them.
And if you get an alert, security analysts will have to manually drill down and correlate and look at what else is happening. So if you get a port scan from a certain IP address, what else is that IP address doing? And they’d have to manually correlate and look across a whole bunch of different logs. So they’re using our platform to pre-correlate across all the logs and identify things that have activity in more than one place, with certain rules around that. That enabled them to identify high-priority potential threats and act on those much more quickly, and it reduced the amount of time that they have to spend looking at data, because we pre-correlate all the data. It really enables them to very quickly see everything that they need in order to make decisions. And they chose us because they had tried other things before. They built a Python application that did things with a four-hour window, so it was four hours behind.
And they wanted this information as soon as possible, in real time. They looked at two different types of open source. The one that they worked on first was the story that Tony told: it got de-supported after they had spent a year of development, unfortunately. So then they looked again at options. With the other piece of open source, they did the identification and evaluation and testing, and that one didn’t scale as they needed. They were talking 10 billion events a day that needed to be processed, and they just couldn’t get that performance from this other piece of open source. It also didn’t have all the required features they needed. They chose us because we have the necessary scalability and also the SQL-based processing and analytics, so they could build out these data flows and update them and build additional applications really quickly.
So the key takeaways really are that this blended approach does all the work that you would need to do to identify, evaluate, choose, integrate, and maintain all of those bits of open source, and provides it all for you without you having to worry about it, abstracted away so that you can use it very easily. But it also integrates with open source you may have chosen already. So if you already have your own Kafka cluster, you don’t have to use the one we ship with; we can integrate with that as if it was ours. If you already have Hadoop, we can read from and write to that for you. Our solution is also enterprise grade, and you can get started much faster and more cost-efficiently. We basically take the best of open source that we have chosen, having gone through all that process, and bundle it with our unique IP that offers patented technology for real-time integration and analytics, provides you the UI for building everything, the ease of use, end-to-end security, reliability, and scalability, and gives you this ability to build dashboards and visualizations in a single platform. So that’s the end of my part of the presentation, and we will now open it up for the Q&A.
Thanks so much, Steve and Tony. I’d like to remind everyone to submit your questions via the Q&A panel on the right-hand side of your screen. While we’re waiting, I’ll mention that a link to the recording of today’s webinar will be emailed to you within the next day or so. Feel free to share this link with your colleagues. Now let’s turn to our questions. Our first question is, does CDC need to be turned on in the database for Striim to handle CDC from it?
That’s a great question, and the answer is yes, but we can help you with that. Different databases have different requirements for enabling change data capture. Oracle, for example, needs supplemental logging turned on. MySQL requires you to have the binlog enabled, etc. But if you’re using our platform, using the UI and our wizards, when you make that initial connection into the database we will check all of that for you, and we will tell you if you don’t have the correct things configured and, if you don’t, what you need to do in order to achieve that. So yes, it does have to be turned on, but we can help you with it.
Great. Thanks. Our next question: when it comes to real-time data integration, how does Striim differ from other products in the market?
Okay. As you’ve just seen in the presentation, hopefully, we are a full end-to-end platform. So we can do real-time integration, and other people talk about real-time integration as well. There are solutions out there that can do change data capture into Kafka. But if you look a little bit underneath the covers, what they mean is that they are doing change data capture and writing exactly that data into Kafka. There’s no processing available. There’s no enrichment. There are no advanced features available within that. So just because people talk about real-time integration or streaming integration as a marketing piece, it doesn’t mean that they’re necessarily doing everything that you need to achieve that. So it’s really our completeness that is the big differentiator: we have all these data sources to turn anything into a data stream.
We have all of the processing in SQL. You don’t need developers to write Java code or JavaScript or C# to actually do all of the processing of the data. We have a lot of different data targets, and you can take one CDC source and push that into Kafka, a cloud database, blob storage, and Hadoop, all within a single data flow. It’s very easy to build these things out, and we have some videos on that as well. So it’s really the completeness of the platform. Plus, because you now have things streaming, if you want to move to real-time analytics, you’re perfectly positioned to do that, and our platform can help you with that too.
Excellent. Thanks, Steve. Our next question: does Striim take care of upgrades of all the underlying open source technologies? For example, if I upgrade to the latest Striim version, do I get the latest compatible open source technologies?
So it depends on the integration point. If you’re talking about sources and targets, things that we connect to, then we’re always trying to keep up to date with whatever it is our customers are requiring. And that may not be the latest version of everything. It could be the previous one, or the last three; it depends on what has got market traction. But we will always support our customers and ensure that they can connect with whatever they have already. If it’s something that is an integral part of our platform and completely hidden from customers, we have our own mechanisms by which we choose when to upgrade. It obviously depends on stability, security, the amount of API changes, integration effort, etc. So we may not always be shipping with the latest release of something within our platform, but we try to keep things current because we obviously want to take advantage of any bug fixes and security fixes that have gone in.
And if a customer points something out to us, a concern, for example, about a security hole in one of the pieces that we incorporate, then that’s obviously something that we can patch and fix quickly. So that was a long answer. The short answer is: it depends.
Perfect. I think that also addresses our next question. I regret that we’re out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Tony and Steve, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.
Today we’re going to talk about what a streaming-first architecture is and why it is important to your data modernization efforts. We’ll talk about the Striim platform and give you some examples of what customers are doing and the solutions they are building using our platform, then take some time to give you a demonstration of how it all works, and then open things up at the end for questions and answers.
So most of you are probably aware of this: the world moves fast, and it’s not just the world, it’s the data that moves fast. Data is being generated largely by machines now, and so businesses need to run at machine speed. You need to be able to understand what’s happening right now and react immediately, before it’s too late. At the same time, customers’ and employees’ expectations are rapidly rising, and part of the reason for this is the smartphone boom and people’s access to real-time information: the ability to see what’s happening to your friends, what’s happening in the world, instant access to news, and instant access to messages and communication. Because the consumer world is instant, and those consumers are also employees and executives, they expect instant responses and insight into what’s happening within the enterprise. And the quality of the applications consumers are used to dealing with is also driving the desire for similar-quality business applications.
The other side of the coin is that businesses need to compete. Technology has always been a source of that competition, and as technology is rapidly changing and we’re getting more and more data, businesses that are more and more data-driven have a competitive edge. This is now true of almost all departments, ranging from engineering and manufacturing all the way through to marketing, being incredibly data-driven. And so the survival of most businesses depends on the ability to innovate and utilize new technologies. Data itself is also massively increasing. Almost everything can generate data. It could be trucks or your refrigerator, TVs, wearable devices that you have, health care devices that are becoming more and more portable. Even things like a salmon making its way from being caught to a restaurant: it’s tracked and produces large amounts of data. And this data is growing exponentially.
IDC did this study a couple of months ago, and they’re estimating that today there are 16 zettabytes of data, and by 2025 that is going to increase tenfold. Of all that data, around 5% now is real time, increasing to 25% in 2025, and by real time they mean it is produced and needs to be processed and analyzed in real time. So that’s 40 zettabytes (a zettabyte has 21 zeros) of data that will need to be processed in real time. And of that, 95% will be generated by devices. The kicker really is that only a small percentage of this data can ever be stored. There are physically not enough hard drives being produced to store all this data. So if you can’t store it, what can you do with it? Well, the only logical conclusion is that you need to process and analyze this data in memory, in a streaming fashion, close to where the data is generated. It may be that you’re turning the raw data, a thousand data points a second, into aggregated data that is less frequent but still contains the same information content. That kind of thing is what people talk about as edge processing, and it’s really trying to handle these huge volumes of data that people see coming down the line.
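To make the edge-aggregation idea concrete, here is a minimal Python sketch, not Striim code, of collapsing a thousand raw readings per second into one summary record per second. The function name, the field names, and the sample readings are all assumptions made for illustration; a real deployment would choose whichever aggregates preserve the information it actually needs.

```python
from collections import defaultdict
from statistics import mean

def aggregate_per_second(readings):
    """Collapse (timestamp_seconds, value) readings into one summary per second."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[int(ts)].append(value)  # group by whole second
    return [
        {"second": sec, "count": len(vals), "min": min(vals),
         "max": max(vals), "avg": mean(vals)}
        for sec, vals in sorted(buckets.items())
    ]

# 2,000 hypothetical raw readings spread over two seconds (1,000 per second)
raw = [(i / 1000.0, 20.0 + (i % 10)) for i in range(2000)]
summary = aggregate_per_second(raw)
```

Here two thousand raw points become two summary records, which is the kind of volume reduction that makes it practical to forward only the aggregates from the edge.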
And it’s not just IoT data that is driving the rise in streams. Every piece of data is generated because something happened, some kind of event. Someone was working in an enterprise application, someone was doing stuff on a website or using a web application, and machines were generating logs based on what they were doing. Applications are generating logs, databases are generating logs, network devices, everything is generating logs. But they’re all based on what’s happening, based on events. So if the data is created based on events, in a streaming fashion, then it needs to be processed and analyzed in a streaming fashion. If you’re collecting things in batches, then there’s no way you’ll ever get to a real-time architecture and real-time insights into what’s happening. But if you collect things as streams, then you can do these other things. You can do batch processing on the streaming data, you can deliver it somewhere else, but at least the data needs to be streaming.
So stream processing has emerged as a major infrastructure requirement and is helping drive enterprise modernization. Moving to a streaming-first data architecture means that you’re transitioning your data collection to real-time collection. You’re not doing batch collection of data. You’re doing real-time collection of data, whether it’s from devices, files, databases, or wherever it’s originating, and you’re doing this incrementally. You’re not trying to boil the ocean and replace everything in one go. You’re doing it use case by use case, as part of your data modernization projects. And this means starting with the things that have a high priority to become real time: to give you real-time insights and a potential competitive edge, or better support for your customers, or to reduce the amount of money you’re spending on manufacturing by improving product quality. Any of these things can drive your data modernization, and you’re doing it use case by use case, replacing pieces of it, bridging the old and new worlds of data.
Now, some of the things that our customers are telling us: they have these legacy systems, and by legacy that can mean anything that was installed over a year ago, and they can’t keep up, or they don’t predict that they’ll be able to keep up, with the large volumes of data that they’re expecting to see, with the requirements for low-latency, real-time insights into data, and with the need to rapidly innovate and rapidly produce new types of analysis and new types of processing that give you new types of insights into what’s happening in your business. They’re also telling us that they can’t just rip and replace these systems, and that they need to have the new systems and the old systems work together, with potentially failover from one to the other while you’re doing this replacement. Striim has been around for around five years now.
We are the providers of a platform, the Striim platform, that does streaming integration and analytics. The platform is mature. It’s been in production with customers for more than three years now. These customers are in a range of industries: financial services, telco, healthcare, retail, and we’re seeing a lot of activity in IoT. Striim is a complete end-to-end platform that does streaming integration and analytics across the enterprise, cloud, and IoT. We have an architecture that’s very flexible and allows you to deploy data flows that bridge enterprise, cloud, and IoT. You can deploy pieces of an application at the edge, close to where the data is generated. And that doesn’t have to be just IoT data. It can be any data that’s generated, but close to that data. Other pieces are running on premise, doing some processing, and other pieces are in the cloud.
Also doing processing and analytics. It’s very flexible how you can actually deploy applications using our platform. Applications consist of continuous real-time data collection. And that can be from things that you may think of as real time: sensors sending events, message queues, et cetera. Or from things like files, which you may think of as batch processing. We can read at the end of the file and, as new records are written to the file, stream them out immediately, turning files, with rollover, et cetera, into a source of streaming data. Similarly with databases: most people think of databases as a historical record of what’s happened in the past. But by using a technology called change data capture, you can see the inserts, updates, and deletes, everything that’s happening in that database, in real time. So you can collect that non-intrusively from the database and stream it.
So now you have a stream of all the changes happening in the database. So all of the applications built with our platform use some form of continuous data collection. On top of that, you can then do real-time stream processing, and this is through SQL-based queries. There’s no programming involved: no Java, no C#, no JavaScript. You can build everything using SQL, and this allows you to do filtering, transformation, and aggregation of the data by utilizing data windows. So you can ask what’s happened in the last minute, or you can look for a change in data and only send that out, et cetera. And then there’s enrichment of data, which is also very important. That is the ability to load large amounts of reference data into memory across the distributed cluster and join that in real time with streaming data to add additional context to it.
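The two window-style operations mentioned above, asking what has happened in the last minute and sending out a value only when it changes, can be sketched in plain Python as follows. This is an illustration of the concepts rather than the platform’s SQL syntax; the `TimeWindow` class, the `changes_only` function, and the sample events are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque

class TimeWindow:
    """Keeps only the events whose timestamp falls within the last `seconds`."""
    def __init__(self, seconds):
        self.seconds = seconds
        self.events = deque()

    def add(self, ts, value):
        self.events.append((ts, value))
        # Evict events that have aged out, relative to the newest timestamp
        while self.events and self.events[0][0] <= ts - self.seconds:
            self.events.popleft()

    def count(self):
        return len(self.events)

    def total(self):
        return sum(v for _, v in self.events)

def changes_only(values):
    """Yield a value only when it differs from the previous one."""
    last = object()  # sentinel that equals nothing in the stream
    for v in values:
        if v != last:
            yield v
            last = v

window = TimeWindow(seconds=60)
for ts, amount in [(0, 10), (30, 20), (59, 5), (61, 1), (125, 7)]:
    window.add(ts, amount)
```

After each `add`, `count()` and `total()` answer the question for the most recent minute only, so the aggregate is continually updated as new events arrive instead of being recomputed over stored data.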
So an example would be if you have device data coming in and it’s device XYZ, value 123. That doesn’t mean much to things downstream that might be trying to analyze it. But if you join that with some context and you say, well, device XYZ is this sensor on this particular motor, on this particular machine, now you have more context. And if you include that data, you can do much better analytics. On top of the stream processing, you can actually do streaming analytics. That can be correlating data together: joining data from multiple different data streams and looking for things that match in some way. So maybe you’re using web logs and network logs and you’re trying to join by IP address, and you’re looking for things that have happened on either side in the last 30 seconds. That’s correlation. Then there’s complex event processing, which is looking for sequences of events over time that match some kind of pattern.
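As a rough illustration of the web-log and network-log correlation mentioned above, the following Python sketch pairs events from two streams that share an IP address and occur within 30 seconds of each other. It is a conceptual sketch, not the platform’s implementation; the event tuple layout, the `correlate_by_ip` name, and the sample addresses are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30

def correlate_by_ip(events):
    """Pair web and network events that share an IP and occur within 30 seconds.

    `events` is an iterable of (timestamp, source, ip) tuples in timestamp
    order, where source is "web" or "network". Each match is returned as
    (ip, earlier_timestamp, later_timestamp).
    """
    recent = {"web": defaultdict(deque), "network": defaultdict(deque)}
    matches = []
    for ts, source, ip in events:
        other = "network" if source == "web" else "web"
        candidates = recent[other][ip]
        # Drop events from the other stream that are too old to match
        while candidates and candidates[0] < ts - WINDOW_SECONDS:
            candidates.popleft()
        for other_ts in candidates:
            matches.append((ip, other_ts, ts))
        recent[source][ip].append(ts)
    return matches

events = [
    (0, "web", "10.0.0.1"),
    (10, "network", "10.0.0.1"),   # within 30 s of the web event at t=0
    (20, "network", "10.0.0.2"),   # no web activity for this IP
    (50, "web", "10.0.0.1"),       # 40 s after the network event, too late
]
matches = correlate_by_ip(events)
```

Only the first pair is reported, because it is the only case where both streams saw the same IP address inside the 30-second window.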
So if this happens, followed by this, followed by this, then it’s important. You can do statistical analysis and anomaly detection and integrate with third-party machine learning. We can also generate alerts, trigger external systems, and build these really rich streaming dashboards to visualize the results of your analytics. And any of the data that’s initially collected, the results of processing, the results of analytics, can all be delivered somewhere, and you can deliver to lots of different targets in a single application. So you can push stuff to enterprise and cloud databases, files, Hadoop, Kafka, et cetera. As a new breed of middleware that supports streaming integration and analytics, it’s very important that we integrate with your existing software choices. So we have lots of data collectors and data delivery that work with systems you may already have. Whether it’s big data systems, enterprise databases, or open source pieces, we can integrate with them and do all of this in an enterprise-grade fashion that is inherently clustered, distributed, scalable, reliable, and secure, as a general-purpose piece of middleware.
We support lots of different types of use cases, from real-time data integration to analytics to being able to build dashboards and monitor things. These use cases span all different industries, and they can range from building your data lake and preparing the data before you land it, to doing migrations, to doing IoT edge processing. And then on the analytics and patterns side, there are things like fraud detection, predictive maintenance, and anti-money laundering, which are some of the things that we’ve seen from customers. And then if you want to build these dashboards and monitor things in real time and look to see whether things are, for example, meeting SLAs or meeting expectations, we’ve done things like call center quality monitoring, SLA monitoring, looking at the network from a customer perspective, and being able to alert when things aren’t running normally. There’s a lot of text on here. The takeaway is that we have use cases in a lot of different industries.
One of the examples is using Striim for hybrid cloud integration. That’s really where you have a database on premise and you want to move it or copy it to the cloud. It’s one thing to just take the database and put it in the cloud, but that will miss anything that’s happening while you’re doing it, or miss things that have happened since you’ve done it. So it’s really important that you include change data capture in this to continually feed your cloud database with new information. By using a set of wizards, you can build this really quickly. That allows you to take an on-premise database, for example Oracle, and deliver real-time data from that into, for example, Azure SQL DB. So you now have an exact copy of the on-premise database that is always up to date.
Another, totally different example is using us for security monitoring, which is where you have lots of different logs being produced by VPNs, firewalls, network routers, individual machines, microcontrollers, essentially anything that can produce a log. And you recognize that unusual behavior is most often seen by its affecting multiple systems. Now, security analysts get a lot of alerts from all these logs and all these systems all the time, but a lot of those are false positives. So the goal of this was to identify things that were really high priority for them to look at first, by seeing what activity was happening that was affecting multiple things. So for example, if you have a port scan from a network router, is the same guy looking at other stuff? Is there any activity on the other machines that he’s looked at? Are they doing port scans?
Are they connecting to external sites and downloading malware? So by doing this correlation in memory, in real time, you can spot threats that are a higher priority. And also, by pre-correlating all the data together and providing that to the analysts, they can see immediately the information they need rather than having to go and manually look for it across a whole bunch of different logs. And this really increases the analysts’ productivity. So, a couple of other examples from our customers. One is a very simple real-time data movement, where data from HP NonStop and SQL Server databases is being pushed out into multiple targets, whether it’s Hadoop, HDFS, Kafka, or HBase, and they’re using that as an analytics hub for their communities. So basically ensuring that wherever they want to put the data, it is always up to date and always contains real-time information from their other databases.
And then a glucose monitoring company is using us to see events coming in from devices, these implantable devices that do real-time monitoring of glucose. And it’s really important that these things work. So they are looking at whether the device is having any errors, whether it’s suddenly going offline, and being able to see in real time any of these devices not working properly. This is really important to their patients, because their patients rely on these devices to check their glucose levels. So this has really reduced the time to detect that there’s an issue going on and has improved patient safety massively. We have been recognized by a lot of the analysts in both the in-memory computing and the streaming analytics landscapes. We’re also getting a lot of recognition from various publications and trade show organizers, and also, very importantly, as one of the best places to work, which is really vindication of us being a really great company. So, key differentiators.
Striim’s end-to-end platform does everything from collection and processing to analytics, delivery, and visualization of streaming data. It is easy to use, with the SQL language for building processing and analysis, which allows you to build and deploy applications in days. We’re enterprise grade, which means that we are inherently scalable, in a distributed architecture, reliable, and secure. And we’re easy to integrate with your existing technology choices. So those are the key things to remember about why we’re different. So with that, we’re going to go into a demonstration. In the first part of this, we’re basically going to show you how to do the integration. Rather than type a lot of things, we’re just going to go through how to build change data capture into Kafka, do some processing on that, and then do some delivery into other things.
So this is a pure integration play. You start off by doing change data capture from a database, in this case MySQL. You build the initial application and then configure how you get data from the source, so we configure the information to connect into MySQL. When you do this, we’ll check and make sure everything is going to work, that you already have change data capture configured properly. And if it wasn’t, we’d show you what you had to fix and how to do it. You then select the tables that you’re interested in, the ones we’re going to collect the change data from, and this is going to create a data stream. That data stream we’re then going to deliver to Kafka. So we’re going to configure how we want to write into Kafka, and that’s basically setting up what the broker configuration is, what the topic is, and how we want to format the data.
In this case we’re going to write it out as JSON. When we save this, it’s going to create a data flow, and the data flow is very simple. In this case it’s two components: we’re going from a MySQL CDC source into a Kafka writer. We can test this by deploying the application, and it’s a two-stage process. You deploy first, which will put all the components out over the cluster, and then you run it. And now we can see the data that’s flowing in between. So if I click on this, I can actually see the real-time data. And you see there’s a data field and there’s a before field. That’s basically for updates: you get the before image as well, so you can see what’s actually changed. So this is real-time data flowing through from the MySQL application. But it doesn’t usually end there.
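To illustrate why the before image is useful for updates, here is a small Python sketch of an update event that carries both images, a helper that works out which columns changed, and the event serialized as JSON for writing to a topic. The event layout, the field names, and the values are hypothetical and chosen only for illustration; they are not the platform’s actual event format.

```python
import json

def changed_fields(event):
    """Return {column: (old, new)} for columns whose value changed in an update."""
    before, after = event["before"], event["data"]
    return {
        col: (before.get(col), new)
        for col, new in after.items()
        if before.get(col) != new
    }

# Hypothetical update event: field names are illustrative, not Striim's format
update_event = {
    "table": "inventory",
    "operation": "UPDATE",
    "before": {"location_id": 7, "product_id": 1001, "stock": 40},
    "data": {"location_id": 7, "product_id": 1001, "stock": 38},
}

diff = changed_fields(update_event)
# JSON text that a writer could publish as the message payload
message = json.dumps(update_event, sort_keys=True)
```

Comparing the two images shows that only the stock level changed, which is exactly the information a downstream consumer would otherwise have to reconstruct by keeping its own copy of the previous row.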
The raw data may not be that useful. One of the pieces of data in here is a product ID, and that probably doesn’t contain enough information. So what we’re going to do first is extract the various fields from this. Those fields include the location ID, product ID, how much stock there is, et cetera. This is an inventory monitoring table, and we’ve just turned that from a raw format into a set of named fields, so it’s easier to work with later on. You can see the structure is very different now in what we’re seeing in that data stream. If we then want to add additional context to this, what we’ll be able to do is join that data with something else. So, first of all, we’ll configure this so that instead of writing the raw data out to Kafka, we write the processed data out, and you can see all we have to do is change the input stream. That changes the data flow to write the processed data out into Kafka.
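The step from a raw row to named fields can be sketched in plain Python as follows. This is only an illustration: the column order and the `StockLevel` type are assumptions, not the schema used in the demo.

```python
from typing import NamedTuple

class StockLevel(NamedTuple):
    """Named, typed view of one inventory row (hypothetical schema)."""
    location_id: int
    product_id: int
    stock: int

def to_named(raw_row):
    # Assumes the raw row arrives as positional values in this order.
    return StockLevel(int(raw_row[0]), int(raw_row[1]), int(raw_row[2]))

row = to_named(["10", "501", "40"])
print(row.product_id, row.stock)
```

Downstream steps can then refer to `row.product_id` instead of a position in an array, which is what makes the later join easier to write.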
But now we’re going to add a cache. This is a distributed in-memory data grid that’s going to contain additional information that we want to join with the raw data. This is product information: every product ID has a description, a price, and some other details. First of all, we create a data type that corresponds to our database table and configure what the key is. The key in this case is the product ID. Then we specify how we are going to get the data. It could be from files, it could be from HDFS. We’re going to use a database reader to load it from a MySQL table, so we specify all the connection details and the query we’re going to use. We now have a cache of product information. To use this, we modify the SQL to join in the cache.
Anyone who has ever written any SQL before knows what a join looks like. We’re just joining on the product ID. So now, instead of just the raw data, we have these additional fields that we’re pulling in in real time from the product information. If we start this and look at the data again, you’ll be able to see the additional fields like description, brand, category, and price that came from that other type. That’s all joined in memory. There are no database lookups going on, so it’s really, really fast. So that’s CDC to Kafka. If you already have data on Kafka, or another message bus, or anywhere else for that matter, even in files, you may want to read it and push it out to some other targets. So what we’re going to do now is take that data that we just wrote to Kafka.
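The in-memory join against the product cache can be approximated in plain Python with a dictionary keyed by product ID. This is a sketch under assumptions: the cache contents and field names are invented for illustration, and a real distributed data grid would be partitioned across the cluster rather than held in one dictionary.

```python
# Hypothetical product cache, keyed by product ID as described in the talk.
product_cache = {
    501: {"description": "USB cable", "brand": "Acme",
          "category": "Accessories", "price": 4.99},
}

def enrich(events, cache):
    """Join each event with its cached product record (inner-join semantics)."""
    for event in events:
        product = cache.get(event["product_id"])
        if product is not None:
            # Merge the stream fields with the cached context fields.
            yield {**event, **product}

stock_events = [
    {"location_id": 10, "product_id": 501, "stock": 40},
    {"location_id": 10, "product_id": 999, "stock": 5},  # no cache entry
]
enriched = list(enrich(stock_events, product_cache))
print(enriched)
```

Because the lookup is a dictionary access in memory, no database round trip happens per event, which is the point made above about speed.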
We’re going to use a Kafka reader in this case. We’ll just search for that and drag in the Kafka source. Then we configure that with the properties to connect to the broker that we just used. Because we know it’s JSON data, we’re going to use a JSON parser, which is going to break it up into a JSON object structure and then create this data stream. When we deploy and start this application, it’ll start reading from that Kafka topic. We can look at that data and see this is the data that we were writing previously, with all the information in it, and it’s in JSON format. You can see the JSON structure. For the other targets that we’re going to write to, the JSON structure might not work.
So what we’re going to do now is add in a query that’s going to pull the various fields out of that JSON structure and create a well-defined data stream that has individual fields in it. We’ll write a query to do that, directly accessing the JSON data, and save that. Now, instead of the original data stream that we had with the JSON in it, we deploy this, start it up, and look at the data. This is incidentally how you would build applications, looking at the data all the time as you’re building and adding additional components. If we look at the data stream now, you’ll be able to see that we have those individual fields, which is what we had before on the other side of Kafka. But don’t forget that it may not be Striim writing into Kafka. It could be anything else. And if you were doing something like we just did, with CDC into Kafka and then Kafka into additional targets, you don’t have to have Kafka in between. You can just take the CDC and push it out to the targets directly.
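Pulling individual typed fields out of a JSON message can be sketched with the Python standard library as below. The message layout and field names are assumptions for illustration, not the actual payload written in the demo.

```python
import json

def extract_fields(message):
    """Parse one JSON message and return a flat record with typed fields."""
    doc = json.loads(message)
    return {
        "location_id": int(doc["location_id"]),
        "product_id": int(doc["product_id"]),
        "stock": int(doc["stock"]),
        "price": float(doc["price"]),  # tolerate prices encoded as strings
    }

# Hypothetical message as it might be read from the topic.
message = '{"location_id": 10, "product_id": 501, "stock": 40, "price": "4.99"}'
record = extract_fields(message)
print(record)
```

A flat record like this is easier to hand to targets that expect fixed columns, such as delimited files.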
So what we’re going to do now is add a simple target, which is going to write to a file. We do this by choosing the file writer and specifying the format we want. We’re going to write this out in CSV format. We actually call it DSV, because it’s delimiter separated, and the delimiter can be anything. It doesn’t have to be a comma. We save that, and now we have something that’s going to write out to the file. If we deploy this and start it up, we’ll be creating a file with the real-time data.
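The idea that CSV is just one case of delimiter-separated output can be shown with the standard `csv` module in Python. The `write_dsv` helper and the sample rows are hypothetical; an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_dsv(rows, out, delimiter=","):
    """Write rows to a file-like object using an arbitrary delimiter."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow(row)

buf = io.StringIO()  # stands in for the target file
write_dsv([(10, 501, 40), (12, 502, 7)], buf, delimiter="|")
print(buf.getvalue())
```

Passing `delimiter=","` gives ordinary CSV; any other single character gives the same layout with a different separator.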
After a while it’s got some data in it, and then we can use something like Microsoft Excel to view the data and check that it is what we wanted. So let’s take a look in Excel, and you can see the data that we initially collected from MySQL being written to Kafka, read from Kafka, and then written back out into this CSV file. But you don’t have to have just one target and a single data flow. You can have multiple targets if you want. We’re going to add two, writing into Hadoop and into Azure Blob Storage. In the case of Hadoop, we don’t want all the data to go to Hadoop, so we’ll add a simple CQ to restrict the data, and do this by location ID.
So only location 10 is going to be written to Hadoop. There’s some filtering going on there. Now we add in the Hadoop target, so we’re going to write to HDFS as a target and drag that into the data flow. As you see, there are many ways of working with the platform. We also have a scripting language, by the way, that enables you to do all of this from vi or emacs or whatever your favorite text editor is. We’re going to write to HDFS in Avro format, so we specify the schema file. Then when this is started up, we’ll be writing into HDFS as well as to the local file system. Similarly, if we want to write into Azure Blob Storage, we can take the adapter for that: just search for it and drag it in from the targets. We want to do that on the original source data, not that query, so we drag it into the original data stream.
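The fan-out described above, where one target receives only location 10 and another receives the original stream, can be sketched in plain Python. The event fields and the two list "targets" are placeholders; real targets would be HDFS and blob storage writers.

```python
def only_location(events, location_id):
    """Pass through only the events for one location (the filtering query)."""
    return (e for e in events if e["location_id"] == location_id)

# Hypothetical events on the original data stream.
events = [
    {"location_id": 10, "product_id": 501, "stock": 40},
    {"location_id": 12, "product_id": 502, "stock": 7},
    {"location_id": 10, "product_id": 503, "stock": 0},
]

to_hadoop = list(only_location(events, 10))  # filtered branch
to_blob = list(events)                       # unfiltered branch
print(len(to_hadoop), len(to_blob))
```

The filter sits on one branch only, so adding it does not change what the other targets receive.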
Now we just configure this with information from Azure. You need to find out what the server URL is, and you should know what your key is, the username and password, and things like that. You’re going to collect that information if you don’t have it already, and then add it into the target definition for Azure Blob Storage. We’re going to write that out in JSON format. So that’s, very quickly, how you can do real-time streaming data integration with our platform. All of that data was streaming. It was being created by doing changes to MySQL. Now let’s see some analytics. I have a couple of applications I’ll show you very quickly. The applications are defined through data flows. Data flows typically start at [inaudible], the data source.
They’re doing a whole bunch of processing, and you can have them in subflows as well. Each one of these subflows can be doing reasonably complex things with nested data flows. If I deploy this application and then we go and take a look at a dashboard, you’ll be able to see how you can start visualizing some of this data. This data is coming from ATM machines and other cash points, other ways of getting cash. The goal of the application is to try to spot whether the decline rate for credit card transactions, et cetera, is going up. So it’s taking the raw transactions, slicing and dicing them by a whole bunch of different dimensions, and trying to spot whether the decline rate has increased by more than 10% in the last five minutes. It’s doing that generally and across all of the different dimensions as well.
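One way to read the decline-rate check is as a comparison between two adjacent five-minute windows. The Python sketch below takes that reading and treats "more than 10%" as a rise of more than 10 percentage points; both choices are assumptions, since the talk does not say how the application defines the increase. Timestamps are plain seconds and the transaction fields are invented.

```python
WINDOW_SECONDS = 300  # five minutes

def decline_rate(txns, start, end):
    """Fraction of declined transactions with start <= ts < end, or None if empty."""
    in_window = [t for t in txns if start <= t["ts"] < end]
    if not in_window:
        return None
    return sum(1 for t in in_window if t["declined"]) / len(in_window)

def rate_jumped(txns, now, threshold=0.10):
    """True if the last window's decline rate exceeds the previous one by > threshold."""
    current = decline_rate(txns, now - WINDOW_SECONDS, now)
    previous = decline_rate(txns, now - 2 * WINDOW_SECONDS, now - WINDOW_SECONDS)
    if current is None or previous is None:
        return False
    return current - previous > threshold

# Previous window: 1 of 10 declined. Current window: 4 of 10 declined.
txns = [{"ts": i, "declined": i < 1} for i in range(10)]
txns += [{"ts": 300 + i, "declined": i < 4} for i in range(10)]
alert = rate_jumped(txns, now=600)
print(alert)
```

To slice by a dimension such as card type or location, the same check would be run per group of transactions rather than over the whole list.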
Nothing is hard-coded in any of these visualizations. It was all built using our dashboard editor, where you can drag and drop the visualizations into the dashboard. Each visualization is configured with a query that tells it how to get the data from the back end, a set of properties that tell it how to get the data from the query into the visualization, and obviously other configuration information. So that’s an example of one analytics application built using our platform. We’ll go and take a look at a totally different one that does something completely different. I’ll just stop this one. This one is tracking passengers and employees at an airport. The data is coming from location monitoring devices that track Wi-Fi. If we take a look at a dashboard for this, you can see it’s a rich dashboard that has lots of information on it.
The data here is coming from location information joined with zones that have been set up. These zones represent different airline ticketing areas. What we’re doing is tracking the number of employees and passengers in the different zones. If the number of passengers goes up too much and you need additional employees, it’s going to flag it by turning red and send out a request for more employees. The red dots here are the employees and the white dots are the passengers. As more red dots arrive in this location, it will notice that new employees have arrived, and the flag will go away because things are okay now. The other thing it’s tracking is the individual locations of all the passengers. This passenger over here has just walked into a prohibited zone.
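The zone staffing check can be sketched in plain Python by mapping each position to a zone and comparing passenger and employee counts. The rectangular zones, the coordinates, and the limit of three passengers per employee are all assumptions for illustration; the talk does not state the actual rule.

```python
# Hypothetical rectangular zones: name -> (x_min, y_min, x_max, y_max).
ZONES = {"ticketing_a": (0, 0, 10, 10), "ticketing_b": (10, 0, 20, 10)}
MAX_PASSENGERS_PER_EMPLOYEE = 3  # assumed staffing rule

def zone_of(x, y):
    """Return the name of the zone containing the point, or None."""
    for name, (x0, y0, x1, y1) in ZONES.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

def understaffed_zones(positions):
    """List the zones where passengers outnumber what the employees can serve."""
    counts = {name: {"passenger": 0, "employee": 0} for name in ZONES}
    for p in positions:
        zone = zone_of(p["x"], p["y"])
        if zone is not None:
            counts[zone][p["role"]] += 1
    return [
        name for name, c in counts.items()
        if c["passenger"] > MAX_PASSENGERS_PER_EMPLOYEE * c["employee"]
    ]

positions = [{"role": "passenger", "x": 1 + i, "y": 5} for i in range(4)]
positions += [{"role": "employee", "x": 2, "y": 2}]
positions += [{"role": "passenger", "x": 12, "y": 5}, {"role": "employee", "x": 15, "y": 5}]
flagged = understaffed_zones(positions)
print(flagged)
```

When a second employee position appears inside the flagged zone, the same function stops reporting it, which mirrors the flag clearing on the dashboard.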
So that will send out an alert, and now an employee will try to track that person down and remove him from the prohibited zone. So that’s a totally different analytics application, but again, this one was also built using our dashboard builder. I hope that gives you an idea of the variety of different things that you can do using the Striim platform. To finalize things: Striim can be easily fitted into your existing data architecture. You can start to take some of the load away from what were maybe existing ETL jobs and move those into real time. We can integrate with all of the sources that you may be using for existing ETL, but we can also integrate with a lot of other sources as well, and pull data out of things you may not have had access to before.
We can also integrate with the things you have already. Maybe you have an operational data store or an enterprise data warehouse, or maybe you’re already writing data into Hadoop. We can integrate with that data and maybe use it for context information. We can also write to those, so you can use us to populate your operational data store, enterprise data warehouse, or Hadoop. And we can integrate with your machine learning choices as well. So these real-time applications play a really important part. They’re a new class of application that gives you real-time insights into things, but we can also be part of driving your big data applications and legacy applications.
As I mentioned before, the platform consists of real-time data collection across very many different types of data sources, and then real-time data delivery into a lot of different targets. The example you saw originally, CDC to Kafka, was just that, very simple, with no processing. We then added in processing, and that’s all done through in-memory continuous queries written in a SQL-like language that can incorporate time-series analysis through windowing. That allows you to do things like the transformation of data that you saw; filtering; enrichment of data by loading large amounts of data into memory as external context; and aggregation of data in order to get, say, what happened in the last minute, or aggregating by dimensions as you saw in the transaction analytics. The other piece is the analytics, where we can do anomaly detection, pattern matching through complex event processing, and, very importantly, correlation, which is especially important to security applications. On top of this you can trigger alerts and external workflows, run ad hoc queries against the platform if you want to see what’s going on right now in a data stream, and build these real-time dashboards. We can also integrate with your choices for machine learning and do real-time scoring very close to where the data is generated.
We integrate with most of your existing enterprise software choices, being able to connect to a whole bunch of different sources in a whole bunch of different formats and deliver to a whole bunch of different targets. We also integrate with your choices of big data platforms, whether that’s MapR, Hortonworks, or Cloudera, and your choice of cloud, whether it’s Amazon, Microsoft, or Google. And we run on operating systems and virtual machines, so we can run on premise, at the edge, and in the cloud. We’re also very well fitted for IoT. We have a separate cut-down edge server that’s designed to run on gateway hardware that may not be as powerful as what you’d be running for a Striim cluster. Processing and analytics can happen at the edge, and we can deliver that data directly into the cloud or into the Striim server on premise.
You can have these applications that span edge processing, on-premise analytics through the Striim server, and also the cloud. Typically the amount of data and the speed of the data is greater on the left-hand side, and you’re reducing that data down to the important information, but it’s covering a lot more territory. So you may have a single edge server for a single machine in the factory, then a Striim server that covers the factory, and then the cloud that covers all the factories that an organization may own.
We believe that IoT is not a siloed technology. It shouldn’t be thought of separately, especially IoT data. IoT data needs to be integrated with your existing enterprise data, and it is part of your enterprise data assets. So as part of your data modernization, as you’re thinking about IoT, don’t think about it separately. Think about how to integrate this IoT data and get the most value out of it, because it will be much more valuable if it has more meaning, and it can have more meaning by correlating and joining it with your other enterprise data. We are one consistent, easy-to-use platform. Not only do we have a converged in-memory architecture that combines in-memory caches, high-speed messaging, and in-memory processing, but we also have a single UI that allows you to design, analyze, deploy, visualize, and monitor your streaming applications.
The key takeaway from this, I hope, is that you really need to start thinking about a streaming-first architecture right now. You need to start thinking about how to get continuous data collection from your important data sources, and consider that from the point of view of what you require immediate insight into: how to increase the competitiveness of your company, or operational efficiency, or whatever reason you may have for doing it, whatever pushes you toward real-time applications. Think about how to go about doing this on a piece-by-piece basis, and start by streaming the important data. Also consider sources where your data volumes are going to be growing and you may need to do preprocessing in flight before you store any data. That’s another area where streaming first is essential. We believe that Striim is the right solution for this because we have a streaming architecture that addresses both of these concerns, and other concerns you may have as well, especially being enterprise grade and able to run mission-critical applications. And you shouldn’t be ripping and replacing everything.
This has to be use-case driven, right? Probably everyone out there has a use case where they need to get some real-time insights into something, and that’s a really good place to start. If you want to find out more about Striim, go to the striim.com website. You can contact us through the email or the support link on there, tweet us, or check out our Facebook and LinkedIn pages as well. And with that, I will open it up for any questions.
Thanks so much, Steve. I’d like to remind everyone to submit your questions via the Q&A panel on the right-hand side of your screen. While we’re waiting, I’ll mention that we will be sending out a follow-up email with the link to this recording within the next few days. Now let’s turn to our questions. Our first question is: what’s the difference between streaming first and an event-driven architecture?
Okay, that’s a great question. Event-driven architecture has been talked about for a long time. I know it became kind of a category all the way back in 2002, 2003. The emphasis there was on the data movement. It was on the enterprise data bus that moved things around, and that was where you put your events. So with the whole SOA, event-driven architecture, that bus was the crucial thing. That technology has matured. We have had message buses around for a long time, and we have new ones that have come out recently, like Kafka, that have really caught everyone’s imagination. That technology is kind of the foundation. The important thing we’re talking about with streaming first is the data collection piece: you want to put as much of your data as you can into data streams so that you can get to real-time insights.
If you want to, you could just take those data streams that you’ve collected in real time and deliver them out to your data warehouse or cloud database or wherever, so they just become storage, if that’s what you want to do with the data. But unless you’re collecting the data in real time, you’re not going to get real-time insights. After you put it in a database or in Hadoop, there is always going to be some latency involved in reading from storage. But as you start to identify applications where you do need real-time insights, you can move them to acting in memory, straight on the data streams. So what we really mean by streaming first is that you are employing an event-driven architecture, but you’re focusing on ensuring that you at least do the data collection first.
That’s great. Thanks so much, Steve. The second question is: how do you work with machine learning?
So there are a couple of different ways in which we’ve worked with machine learning, and I can explain this by way of example. We actually have a customer who is integrating machine learning as part of their overall data flow and use of Striim. The first piece that is essential is being able to prepare data and continuously feed some storage that you’re going to do machine learning on. Machine learning requires data, and it requires a lot of data, but it also needs that data to be refreshed, so that if you need to rebuild the model, if you’ve started to identify that the model’s no longer working, you have up-to-the-minute data. With Striim you can collect the data, maybe join it, enrich it, and pivot it so that you end up with a multivariate structure that is suitable for training machine learning, before you write that out to files, Hadoop, or a database.
Then, outside of Striim, you can use your machine learning software. With this customer, the data they were collecting was web activity, VPN activity, and other data around users. They pushed all of this out into a file store and then used H2O, their choice of machine learning software, to build a model. The model was modeling user behavior: what do users normally do, what is the usual pattern of activity for each one of our users? When are they logging in, when are they accessing things, what applications are they using, and in what order? They built this machine learning model that expressed all of that. They then exported that model as a JAR file and incorporated it into a data flow straight from the raw data in our platform. So the raw data was going through the processing into the store, but we were also taking that raw data in memory, in a streaming fashion, pushing it through the model, checking to see whether it matched the model, and then alerting on any anomalous behavior. So the two places where we really work with machine learning are delivering data into the stores that machine learning trains on, and then, once it’s learned, taking the model and doing real-time inference or scoring in order to detect anomalies, make predictions, et cetera.
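The two roles described above, feeding a store for training and then scoring events in the stream, can be sketched in plain Python. The "model" here is a deliberately trivial stand-in (the set of hours at which each user has logged in before), not the H2O model the customer exported; the field names are assumptions as well.

```python
def train_profile(history):
    """Offline step: learn each user's usual login hours from stored data."""
    profile = {}
    for event in history:
        profile.setdefault(event["user"], set()).add(event["hour"])
    return profile

def anomalies(events, profile):
    """Streaming step: yield events that do not match the learned profile."""
    for event in events:
        usual_hours = profile.get(event["user"], set())
        if event["hour"] not in usual_hours:
            yield event

# Hypothetical stored history and live events.
history = [
    {"user": "alice", "hour": 9},
    {"user": "alice", "hour": 10},
    {"user": "bob", "hour": 14},
]
live = [
    {"user": "alice", "hour": 9},
    {"user": "alice", "hour": 3},   # unusual hour for this user
    {"user": "carol", "hour": 12},  # user never seen in training
]
profile = train_profile(history)
alerts = list(anomalies(live, profile))
print(alerts)
```

Retraining means rerunning `train_profile` on refreshed history and swapping the result into the streaming step, which is why the store has to be fed continuously.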
That’s a great example. Thanks. I think we have time for just one more question. This one is: Confluent just announced SQL on Kafka. How does that compare to what you do?
Okay. Well, SQL on data streams is a great idea, and obviously we’ve been doing that since the inception of the company.
I like it. It’s brand new, right? So it’s not going to be mature for a while, and it’s only looking at a small part of everything you need in order to actually do all the things I showed you today. Being able to run SQL against a stream, with, I think, windows, is one thing. But there are other types of things you need to incorporate. For example, we can incorporate data in a distributed data grid, so caching information, as well as data and results storage, feeding results back into processing, and a whole bunch of other things. I think the primary thing that I see is that they’re focusing on interactive ad hoc queries against streams, and that’s good, being able to just see what’s going on in the stream and analyze it.
But the power of our platform is combining your sources, queries, targets, caches, results, everything, into a data flow that becomes an application that you can deploy as a whole. It’s going to take a while for them to get to all of the things that we spent the last five years working out, like security, with a role-based security model for these types of queries. How do you integrate them into a whole application? How do you deploy the application across a cluster? All of those things are essential for the mission-critical applications utilizing SQL that we support for our customers. So I think, being an end-to-end platform, we can do all of that. Having to combine all the pieces together yourself, so that the SQL is useful, might be harder with the KSQL that was announced earlier this week.
Great. Thanks so much, Steve. I regret that we’re out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Steve Wilkes and the Striim team, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.