The Real Costs and Benefits of Open Source Data Platforms

Ovum analyst Tony Baer and Striim co-founder and CTO Steve Wilkes discuss the need for hybrid open source data platforms, which combine open source with proprietary IP for a more cost-effective data management solution.

Download the full paper here: The Real Costs and Benefits of Open Source Data Platforms

To learn more about the Striim platform, go here.

 

Unedited Transcript: 

Welcome and thank you for joining us for today’s webinar. My name is Katherine and I will be serving as your moderator. Today’s presentation is entitled The Real Costs and Benefits of Open Source Data Platforms. We are honored to have as our first speaker Tony Baer, principal analyst at Ovum. Ovum is a market-leading research and consulting firm focused on helping digital service providers thrive in the connected digital economy. Tony leads Ovum’s big data research area, focusing on how big data must become a first-class citizen in the data center, the IT organization, and the business. Joining Tony is Steve Wilkes, co-founder and CTO of Striim. Steve has served in several technology leadership roles prior to founding Striim, including heading up the advanced technology group at GoldenGate Software and leading Oracle’s cloud data integration strategy. Throughout the event, please feel free to submit your questions in the Q&A panel located on the right-hand side of your screen. Tony and Steve will address all questions after the presentation wraps up. With that, it is my pleasure to introduce Tony Baer.

Thank you, Katherine, and thank you everybody for taking time out of your day to join us in our discussion on the real costs and benefits of open source data platforms. This is one of those perennial topics that we get questions about from our clients, and I have to express my sincere gratitude to Striim for giving us the opportunity to share this discussion with you. Open source is obviously not a new phenomenon in the software market, but in the areas I personally cover, which are data management and big data, it’s almost impossible to avoid running into open source. And so very frequently I get questions from clients about the reliability, the value, and the role of open source versus proprietary software.

Of course, this issue is as old as Linux itself. I remember covering the emergence of Linux back almost 20 years ago, and it proved the viability of this alternative model of software development that today is becoming, at least in my area, the norm. In fact, one of the first questions I ask when I come across a new software firm is: is your product available as open source? With that said, let’s take a look at what we’ll be talking about over the next hour or so. First, why are we having this discussion? Why open source, what’s the draw? Then we’ll cut to the chase, which is really looking at the costs and benefits.

Then we’ll look at some real-life examples of where these costs and benefits played out, and we’ll conclude with the takeaways. Spoiler alert: our take is that a hybrid model, which combines the innovation of open source with the reliability and last mile of proprietary software, really is the most successful and most viable model for enterprise software. Okay, off to the races. The first part is why we’re having this conversation. There’s no question that open source is becoming a more and more routine occurrence in the software world. Black Duck Software, a firm that provides services to help enterprises track their licenses so they don’t have any IP violations when they utilize open source code, conducts an annual survey on what they call the Future of Open Source; they’ve been doing this for the better part of a decade.

They do this with a partner organization, North Bridge, and they survey a broad group of folks to look at how open source is being used and some of the key issues around it. These were some of the results from the most recent survey, conducted in 2016 and published earlier this year. It found that the use of open source grew 65% among this sample group over the past year, a pretty significant uptick. So where was most of the open source? What types of software tended to be open source, and what tended to be used the most? What they found, and this is going to be a key thought carrying through this discussion, was that it was in commodity building blocks.

Take operating systems: vendors used to compete on operating systems, and then the advent of Linux really showed that the value-add is further up the stack. Today, for instance, Microsoft is no longer defined as a Windows company. So operating systems were a key area, but also data platforms and development tools; that is where we saw most of the use of open source. Then they asked the respondents what the driver was: why do you use open source? The biggest reason was freedom from vendor lock-in. Now, this next question, on participation in the open source community, is to me an outlier which showed that the group Black Duck surveyed was probably not representative of enterprises in general, because about two-thirds of this group reported that they actually contributed to open source projects.

From our research, we’ve found that most enterprises that use open source do not necessarily contribute; you really couldn’t scale this number out. It’s not as if two-thirds of the Global 2000 contribute. But among this surveyed group there certainly was a high contribution rate, a high active participation rate. This next point I thought was very interesting, and actually potentially a cause for concern: they asked about governance, and basically half of the companies that responded have no formal selection or approval policies for open source software. The question is, would you see the same thing with conventional proprietary software? So that’s an area where I think there are still a lot of best practices to be established and a lot of lessons to be learned. So, what has made open source successful?

The key watchword on this slide is community: a successful community breeds success. That means that when you have a community of critical mass, with enough folks participating in it and a broad enough base of activity, good things happen. Among the benefits when you have a successful open source community is that community members get to steer where the software goes; it’s a meritocracy. It’s not a matter of hoping that a vendor realizes their needs or feels their pain and responds; the community in essence votes on this, at least the people who are contributors.

When you have a critical-mass community, the velocity of code commits is very high. And along with that, because you’re tapping into what is theoretically the world’s largest virtual software R&D organization, bugs get detected and solved a lot more quickly. That’s what happens when things work out right. Not all open source projects are successful, and not all open source communities are successful. On this page we have a few examples, some of the poster children that really made this model a reality. Linux is the obvious one; it’s really the granddaddy of them all, and actually the exception that proves the rule, in that the Linux community had someone at the top who had, I won’t say unquestioned moral authority, but very widely accepted moral authority.

That’s pretty unusual. The rest of these projects, like Apache Hadoop and Apache Spark, have very broad-based communities and tend to be better examples of this model in practice. So, the great debate: why open source, why proprietary? The points on the left are the chief points identified in that Black Duck Software survey, and they reinforce what we said on the previous page. Why do enterprises use open source? Because they perceive that when an open source project is successful, the code quality will be better. They also feel that the features will be competitive because the software is vetted by the community; there is a groundswell, a critical-mass market, that wants and needs these features and will therefore be motivated and incentivized to improve and fix them.

And along with that, because with open source you’re not restricted by a vendor license where the vendor owns the source code, you have the ability to fix it and customize it. Of course, that’s going to depend on the open source license; there are dozens of them out there. But in general, the most popular licenses today tend to be patterned off the Apache license, which allows you to add your own value-add on top of the open source code. Now, why proprietary? Well, in many cases open source projects by their very nature are going to be very narrow and specific, and therefore they’re not necessarily unified solutions; you have to put the pieces together. That’s what vendors do, and in turn they have their unique intellectual property.

Also, what’s very important is that vendors provide accountability; essentially, they give you one throat to choke, so to speak. And as they put together a solution, they’re ultimately accountable for security, and chances are that with a vendor’s software, the security should theoretically cover all the functionality in that particular software product. There’s also perceived customer focus: successful software companies follow their customers. Now again, when you look at these points, there are always going to be exceptions to every rule; the world is not black and white. But these are essentially the debating points between open source and proprietary. One other point, though, that was not actually among the top responses in the Black Duck survey, is that a lot of folks believe open source is going to provide them cost savings.

A good example of this: I fielded a query from one of our enterprise clients based out in Singapore, I think a big banking institution. This was about four or five years ago, when Hadoop was still new, and the perception was simply that Hadoop is open source. This client was looking at what types of packages would work with open source, and when I got on the call with them, I asked what was driving this. They said, well, we would like to get rid of our Oracle software because Hadoop is free. I said, well, not exactly; who are you going to have to maintain this? And they said, we’re going to hire consultants.

So a lot of the rest of this discussion is going to be about the perceived cost savings, and then we’ll look at which savings are real and which costs are real. So, cutting to the chase, what do we think is the answer? What we’ve found from our experience looking at open source software products is that the best recipe for success typically is a hybrid type of model. It used to be called open core: the core of the technology, the kernel, would be open source, but then the vendor would surround it with their own value-add. One advantage of that is the value to the vendor of not having to reinvent the wheel.

A good example of that is the folks who are sponsoring this webinar, Striim. They have a hybrid solution. When it came time to choose a messaging system, or to develop messaging, they realized that the technology already existed in open source and that it was not part of their core value-add. So rather than spend their time reinventing a messaging system, they chose ZeroMQ. This model also gives you the chance to leverage commodity infrastructure, as a lot of open source software is typically designed to do today. Another advantage of this model is that it gives both the vendor and the customer the chance to harness the innovation coming from the open source community.

That’s especially true with the latest common building blocks. Also, and this is very important not just to the vendor but also to the customer, there’s commercial viability. The metaphor I’m taking up here is "the surgery was successful, but the patient died": you may have great software, but if the vendor gives you such a great deal that they can’t make money on it, ultimately it’s not going to be of much value to you, because if that vendor is not viable, you’re not going to have anybody to support it. So it is not just in the vendor’s interest to be commercially viable; it’s also in the customer’s interest. This is where the role of unique IP becomes very critical, because the vendor is best situated to deliver that enterprise-grade functionality.

Then there’s also what we call the last mile of functionality. You have these common building blocks, but at the end you need to do the integration. For example, Striim did the integration with Oracle at the log level to do change data capture. That’s not the type of thing that’s going to be very viable for an open source community, because it’s a very narrow-purpose project; you need to go narrow but deep, and that does not suit itself to open source. So that, again, is why we see the hybrid model as the most viable; it really does give you the best of both worlds. Now, not all open source projects are alike, and at the risk of oversimplifying, guilty as charged, we are oversimplifying here.

We’ve shown two ends of the spectrum, because there are many different types of models: on one end is vendor-led, on the other is community-led. Vendor-led is where the vendor essentially owns the project; they put the source code out someplace like GitHub, but the vendor ultimately leads that project, and it’s not governed by any type of community. The other extreme is community-led, where the project is hosted by a foundation such as Apache, which is perhaps the gold standard of open source communities. So what’s the difference with a vendor-led project? A good example, say MongoDB, is that the vendor essentially makes the roadmap decisions on where the product goes.

Whereas on the community side, the community is basically a meritocracy. That being said, reality is not black and white: we’ve noticed that a number of vendors that lead their own projects have also started to dabble on the community side, and the same is true of vendors who deal with community projects. A vendor might lead a project initially, incubating it at their own pace and in their own manner. The difference is that vendor-led open source projects in many ways are like proprietary software products, except that the code and the roadmaps are publicly available. That’s basically the difference. Again, it’s not to say that one model is better than the other.

For instance, for MongoDB and its customers, the vendor-led model works very well; for Spark, Hadoop, and Linux, the community model has proven quite successful. Now let’s look in a little more detail at the costs and savings, and we’ll start with the good news. I’m paraphrasing the old Meineke Muffler commercial here, and I originally had a better picture, but the quality was crappy, so we’ll go with the screaming baby: I’m not going to pay a lot for this software. The thought with open source is that the cost model changes. You don’t pay for the software itself, so it doesn’t matter how much software you use or download; you’re not paying a perpetual license or subscription for the software itself.

On the other hand, just because you can get the software free doesn’t mean open source software is free. It’s freely available, but it’s not free; one way or another you’re going to pay for it. The preferred model is where you go with a commercial open source provider that packages a distribution, does all the integration of the open source modules, and hopefully does some of that last-mile work. The typical model is that you pay a subscription for support. That’s actually a fairly familiar model, because it’s like the annual maintenance portion of what you pay for proprietary software. The only difference is that you’re not paying the upfront capital cost of a perpetual license. For vendors, the savings are in avoiding reinventing wheels, as we mentioned before.

We gave the example of Striim with their use of open source messaging technology as part of their solution. For enterprises, as mentioned, there’s no perpetual licensing, so you eliminate that upfront capital cost, and you’re typically taking advantage of commodity technology. Most open source gets popular because the technology is affordable and it runs on affordable hardware, typically x86 machines for instance. Along with that, open source has altered pricing expectations, and that’s where that picture comes in. I talk to Hadoop folks, and they would love to be able to charge the multiples that the enterprise software folks, the Microsofts and the Oracles of the world, have historically charged, but realistically their market is not going to put up with that.

So we see the Hadoop market as being in the few hundreds of millions of dollars; it’s nowhere close to the existing enterprise database market, which is well north of $10 billion, and even as the Hadoop business matures, it will never get to that $10 billion mark, basically because the customer base expects lower-cost software. However, what we need to point out, and we’ll talk about this more on the next slide in terms of cost, is that the savings picture is going to differ depending on whether you use what we call raw open source, where you go directly to the community website and download those packages or projects, or whether you subscribe to a vendor with supported distributions. When you subscribe to an open source software vendor, the vast majority of them are actually following the hybrid open-core model.

By the way, that means there is some proprietary technology in there anyway. Okay, so let’s go into those costs. We’re going to look at it from the standpoint of raw open source, where you go to the project site yourself; you don’t bother going through a vendor, and you don’t pay a subscription for a supported distribution. Here the picture is very similar to implementing your own homegrown software. The only difference is that you didn’t write the original software; you’re probably going to be writing a lot of other stuff as a result, but that’s another story. The advantage, of course, is that you get the flexibility of what is in essence a best-of-breed strategy, because you’re picking and choosing your open source projects, but you’re also bearing the integration costs.

Security is going to vary by open source project; it’s going to be more complete in some than in others. For instance, some security projects may not necessarily support all the open source projects that you want to implement. So the key headache for organizations that go this raw, download-it-in-the-wild route is that they need to harmonize the security and integrate all of the software. By the way, there’s also the fear of obsolescence. I’ll admit it’s not exclusive to open source, because you can buy a vendor product and that vendor goes out of business; well, that’s all she wrote. The same thing can happen with open source. Just because the software is still up there on a website doesn’t mean it hasn’t been put on extended life support or end of life.

This is very typical of the growing pains of maturing technologies, because some are going to be winners and some are not. In that sense you’re making bets that are very similar to those with proprietary software. But I’m not sure people think very hard about this with open source, because they think, well, it’s free. So there is something at risk there. Then there’s the question of extensibility: because many open source projects are very narrow in scope, they’ll require additional functionality. So now we’re going to go through a few real-life examples to bear out what we’re talking about. This is the case of a bank that implemented a cybersecurity solution, and they went the route of using open source projects in the wild.

They implemented several open source projects: Apache Storm for streaming and data flow routing, Apache Metron for security analytics, and Kibana, which is related to Elasticsearch, for visualizing log analytics. They also had a GUI alerting layer. What I’m actually surprised not to see up here is Elasticsearch itself; I would have assumed that would be part of it as well. Anyway, this was essentially putting together a bunch of these projects. The only difference between this and homegrown software is that they didn’t write the original projects; otherwise they had to bear the full load of integrating and patching all this stuff. And this cybersecurity solution was not trivial.

It required about 45 engineers and cost about $20-30 million to maintain and keep working over about a five- or six-year period. The big pain points were the gaps at the last mile, especially with end-to-end security, which required custom last-mile development. Another example is a communications service provider trying to extend a call center application. They used several open source projects: Flume for routing data in motion, Logstash for collecting and transforming log data, and Elasticsearch (it made the list this time). This was actually a much more modest solution: five practitioners, costing in the neighborhood of $3 million, plus or minus, over five or six years.

The key gaps here, again, were at the last mile: change data capture integration with the customer databases, because in a call center you want to know what’s happening with the customer. That required a lot of costly extra development and a lot of costly maintenance. Our last example deals with unplanned obsolescence. This is a credit card processing firm that had a real-time transaction processing application. In this case, on the surface it looked like a very successful quick hit. They used Spring XD, which is a component for building data pipelines, and it was very quick to implement, just a couple of months with not that many engineers. So on the face of it, it served their purposes.

The problem is that the vendor pulled the plug on it when they went to a different strategy. The vendor placed the Spring XD project on end of life, and as a result the credit card company was back at square one. Again, the same problem can happen with proprietary software as well. But the important point here is that just because it’s open source doesn’t mean it’s going to be successful. Just as you depend on a vendor for product roadmaps with proprietary software, this was a vendor-led open source project, and the vendor pulled the plug.

So what tends to work with open source? As we said, it works best with commodity technology, which draws a critical mass of developers and prospective customers; does it serve a wide enough market to build a big enough community? It also helps when the technology is extensible and the licensing is permissive, and that’s where the Apache license has proven very popular: it allows you to add your own value-add on top of the open source without requiring you to give it back to the project, unlike a lot of the earlier Gen-1 licenses such as the GPL licenses, which were the original licenses in the open source world. Again, it’s got to be critical-mass technology, not written for an overly narrow use case and not hardwired to any specific platform.

And the APIs are published, freely available, and open; ideally, with those open APIs, you avoid best-of-breed integration issues. Again, that’s another reason why we believe the hybrid model is best; we’ll get back to that in a second. But where does proprietary software, where does proprietary IP, really come in? Where we really see it is at the application and business logic level. Of course, to every rule there are exceptions; yes, there is open source CRM, customer relationship management. But for the most part we’ve not seen a lot of open source at the application level, and there’s a good reason for that: that’s where businesses want to differentiate themselves in terms of how they do business.

Even though application software has commoditized a bit, the differentiation is too specific to really make it well suited to an open source project; you don’t want a lowest-common-denominator solution for your business. We’ve also found proprietary IP is very good for niche technologies and solutions, where the addressable market, and therefore the addressable community of developers, is too small to really support an open source project: say, a connector to a logging system or a database. And that ties into this last one: unique and custom integration use cases. So what we see here is that proprietary IP is best for differentiating enterprise solutions. It’s also good for end-to-end security, because a vendor can take charge of making sure all those gaps are filled.

With a community, that’s going to come out kind of hit or miss. The same goes for polished user experiences and UI. It’s not that you can’t open source that; I think it’s probably because open source developers tend to think in terms of functionality rather than user experience, and maybe I’m indulging in some stereotypes here, but we’ve just not seen user experience as the type of thing that gets open sourced. Yes, we do see GUIs on various open source projects, but rarely really polished ones. That’s really where the vendor comes in, and, with the exception of certain categories of products, like those where Apple made its name, UI is usually not the showstopper. And again, there’s the last mile of integration and consistent SLAs.

That’s where you’re really going to count on a vendor to make sure that essentially the trains run on time and all the tracks connect. So in general, we believe in using hybrid open core for solutions where the last mile is differentiated intellectual property, while at the same time getting the best of both worlds by leveraging community building blocks; it’s essentially where commodity meets enterprise grade. The takeaway in all this is that we’re not saying don’t use open source; we’re saying use open source where it’s appropriate, but when you do so, keep your eyes open. Open source is not necessarily free. Look at the real costs: you’re either going to pay through commercially supported annual subscriptions, which is pretty much like paying maintenance for conventional software, or you’re going to pay in terms of all the spadework you have to do when you implement raw open source out in the wild.

It’s a lot like doing your own homegrown software again. We’ve found the hybrid open-core approach to be the most reliable and most viable model for enterprise software because, as we said, it combines the best of both worlds. It keeps the vendor in business, which gives you the assurance that you’re going to have a vendor there, a throat to choke, and someone to support your software. Yet it also taps the rapid innovation of the open source community and the economies of commodity software, while at the same time you get that unique IP, last-mile integration, and security. So that pretty much wraps up my part of the conversation. At this point I’d like to turn it over to Steve.

Well, thank you, Tony. I’m going to start by introducing Striim and talk a little bit about our platform, to provide context for the discussion around open source and this hybrid open source model. As a company building a platform, like any intelligent software company, we asked: if it’s been built already, then why build it again? We’ll get into the discussions around that and how important it was to integrate pieces of open source into the platform. Striim has been around for around five years now. We have a mature technology that’s been in production with customers for more than three years, and we are continually evolving and releasing new updates to the platform, adding new functionality, new connections, et cetera. Striim provides a full end-to-end streaming data integration and analytics platform. What the platform does, in a nutshell, is let you collect data continually from a whole variety of enterprise sources: things that are inherently streaming, like message buses and sensors that send data continuously in real time.

And then there are things that you may not think of as streaming, like files, for example. With files, we will collect the data at the end of the file as it’s written and stream it in real time. And with databases — most people think of those as a historical record of what’s happened in the past — we use change data capture technology to see what’s happening in the database in real time: the inserts, updates, and deletes as they’re happening, and we stream those out. So once you’ve done continuous data collection, you have real-time, in-memory data streams. And the simplest thing you can do with the platform is just deliver that somewhere else. From a streaming integration perspective, you can take what’s being written into a database, for example, and stream that onto Kafka, or take your web log files from on premise and stream those out into Amazon S3 or Azure SQL DB in the cloud.

That’s kind of the simplest thing you can do with the platform. But typically our customers are doing more than that, and that requires some degree of processing. So we have this ability to do in-memory, SQL-based continuous queries that can process and analyze that data. This, in conjunction with time-series windows and an in-memory data grid, allows you to do things like transforming the data from one form to another, filtering it, and aggregating it — looking at, say, the last minute’s worth of data, or the last hundred events, or the last entry for each one of these records, etc. — and also enriching the data. That’s where the in-memory data grid comes in, because you need to load large amounts of reference data into memory and then join it with the streaming data in real time as things are passing through.
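As a rough sketch of what a count-based window does — keeping, say, the last N events and aggregating over them — here is a few lines of Python (illustrative only; in Striim you would express this in its SQL-like query language rather than code):

```python
from collections import deque

class CountWindow:
    """Keep the last `size` events and expose simple aggregates,
    mimicking a count-based window in a streaming engine."""

    def __init__(self, size):
        self.events = deque(maxlen=size)  # oldest events fall off automatically

    def add(self, value):
        self.events.append(value)

    def aggregate(self):
        # Roughly the spirit of SELECT COUNT(*), SUM(x), AVG(x) over the window
        n = len(self.events)
        total = sum(self.events)
        return {"count": n, "sum": total, "avg": total / n if n else None}
```

Each new event both enters the window and pushes the oldest one out, so the aggregate always reflects only the most recent events.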

So that’s how you can process data and get it into the form you want before delivering it. An example would be if you’re reading change data from your nicely normalized database — say it’s an order detail table — you’re just going to see a whole bunch of IDs: this order ID, this customer ID, this item ID. If you’re just delivering that as-is, it’s not going to mean much to the people reading the data from Kafka. So enrichment typically means you load reference data into memory and then join with it. Now, instead of those IDs, you get the order information, the customer information, and the item information all written out, and you can do more intelligent analytics. And talking about analytics, we have the capability of doing in-memory statistical analysis, anomaly detection, pattern detection through a complex event processing syntax, and correlation across data streams and across windows.
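A toy Python version of that enrichment step might look like the following — joining a change event full of IDs against reference data cached in memory. The record and field names here are made up for illustration:

```python
# In-memory "data grid": reference data loaded up front, keyed by ID.
customers = {101: {"name": "Acme Corp", "tier": "gold"}}
items = {7: {"description": "Widget", "price": 9.99}}

def enrich(order_event):
    """Replace bare IDs in a change event with full reference records,
    so downstream consumers (e.g. Kafka readers) see meaningful data."""
    enriched = dict(order_event)
    enriched["customer"] = customers.get(order_event["customer_id"])
    enriched["item"] = items.get(order_event["item_id"])
    return enriched

event = {"order_id": 555, "customer_id": 101, "item_id": 7, "qty": 3}
```

The key point is that the lookup tables live in memory, so the join happens at streaming speed rather than requiring a round trip to the source database per event.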

And this enables you to look for things that happen at the same point in time, or within a time range, and also to look for things that are co-located in the same geographic location, for example. Doing this correlation is an important aspect of almost all of the analytics use cases that we see. Then, on top of all of this processing — which, as I mentioned, is happening through the SQL-based queries — you can build visualizations, all using our platform. So real-time dashboards; you can trigger workflows; you can generate alerts to notify people that things are happening; and we have ways of integrating third-party machine learning directly into the streaming data flows as well, for real-time scoring and inference. That’s basically what the platform does. It’s an in-memory streaming integration and analytics platform, covering everything from continuous data collection through processing, analytics, and delivery.

And this suits different categories of use cases — it’s a piece of middleware that enables you to do a whole bunch of different things. On the real-time data integration side, we have customers that are delivering data in real time into a data lake, for example, after processing it and getting it into the form they want; or integrating on-premise and cloud, keeping a cloud database up to date with an on-premise database for hybrid cloud initiatives; or doing IoT edge processing. On top of that, we have the analytics-type applications, where our customers are doing things like fraud detection or cybersecurity monitoring, where you’re analyzing lots of different data feeds; anti-money laundering; and location-based solutions, where people move locations really, really quickly, so you need real-time insight into what’s happening. That’s where a streaming platform really comes into play, because you can process the data as it’s being produced.

And then we have the customers that have built dashboards, typically for building and monitoring real-time metrics and key performance indicators, real-time quality monitoring, SLA monitoring, etc. These use cases span many, many industries. I’m not going to go into all of these because that would take up another couple of hours, but we have use cases across lots of different industries, and we touch lots of different departments within those organizations as well. The key differentiators of our platform are that it is a full end-to-end platform that does everything from collection, processing, analytics, delivery, and visualization of real-time data, and that it is designed to be easy to use, so it’s very fast to build and deploy these applications. We have a UI for drag-and-drop building of data flows, and we have the SQL-like language that enables a whole bunch of different types of your internal resources — whether they’re developers, business analysts, or data scientists — to actually build out these data flows.

We are enterprise-grade, and that means we are inherently clustered, scalable, and reliable, with fault tolerance, exactly-once processing, and recovery built into the platform, along with end-to-end security. And we can integrate with a whole bunch of things: we work with the top three cloud platforms and the top three big data platforms, we have change data capture for the major databases, and we have deep integration with Apache Kafka and other open source solutions. So that’s the platform in a nutshell, and it’s important to understand it within the context of how this works with open source. Now imagine you didn’t have Striim and you needed to build a streaming data framework or platform from open source. This is the process that we went through as well, so we’re talking from experience here about how to build such a platform.

Well, inherently you’re going to be taking data from sources, moving it to targets, and doing stuff in the middle. As I mentioned, that stuff can be quite complex, and it requires a lot of different categories of software. In order to move data around the cluster, you need a high-speed messaging infrastructure. In order to load large amounts of reference data into memory, you need a distributed in-memory data grid or cache. To store the results, you need distributed results storage. And then you have data collection, delivery, and processing of the data, and you have to have some way of developing the solutions and visualizing the results. Each of those will have different categories of open source that you need. And we’ll just add a few in here so you can get an idea, but every single one of these categories has a large number of different pieces of open source that you can choose from.

And that isn’t the end game, right? Those are just some pieces that you need. In order to get it all to work, you need some glue code around all of this to handle all of the enterprise-grade stuff — the clustering, scalability, reliability, security, and management that enables all these pieces to work together, scale together, and be reliable together, with a single security policy across all of them. And then you need a layer that enables your developers to actually build the applications, because this is just the framework, the platform — there are no solutions yet. The goal of this, as an enterprise, would be to build something that enables people to build analytics or integration much more quickly, and so you need an abstraction layer, API connectivity, a web server, and things like that. So those are all the pieces.

And those are the pieces we looked at as well when we were building our platform. So look at the process involved if you’re going to build this from open source. First of all, you need to design it. We just did that — we had a diagram showing the various pieces we were looking at; you’d go deeper than that in reality, but that gives you the idea. Then, for each component of open source that you’re interested in, you have to look at the different options. So identify what open source is available in each of the categories and evaluate each one. You may have performance requirements, scalability requirements, overhead requirements, even software language requirements — "I want everything to be Java," or "I want everything to be JavaScript" — and it’s kind of hard to mix the two.

Then you need to build the integration and build the glue code and the layers around all of that. As you’re integrating things, you might find that some things don’t work together, so you may have to go and identify different components to use. Then you go into testing, and of course testing results in changes — things don’t work, so you have to change things up. Once you have it built, you’re going to have to maintain it. And that gets to some of the things that Tony mentioned: if the open source software that you’ve chosen is upgraded in some way and it changed its APIs — and, thankfully, Kafka is now finally out of beta with a 1.0 after all this time, but up until now the APIs were changing all the time — then with every version upgrade you may have to modify all your integration code to work with the new API.

And the other case that Tony mentioned is when the open source is deprecated. Maybe they’re not going to fix any bugs in it anymore; the contributors have left, they’ve all moved on to the next big thing. In that case, you may just have to replace that piece of open source, which means you now have to identify a new one and go through the evaluation process and the re-integration process. So you have to do all of that, and also get support — if there are any bugs that you find, go to the community or the vendor to fix things. And you need to do all of that before you can start to build your applications. So that’s quite an involved process, and that’s why you see these large timeframes in a lot of these open source projects. Even with something basic, you’re talking six months to a year before you can even start to get any results out of it.

Whereas if you download our platform, we’ve done all this already, so you’re literally just installing Striim and then you can start building the applications. You talk to our support, and we will manage all the issues with all the open source that we include. That enables you to build your applications faster and get to deployment faster. If we look at what our platform looks like and how we’ve integrated open source: we have things at the edge — sources and targets in our platform that integrate with open source pieces including Kafka, HDFS, Flume, HBase, Hive, et cetera — all of the things you’d expect if you’re reading data from or writing data to somewhere. But then we also have pieces of open source within our platform, and we went through that identification and evaluation process in order to choose the best of breed in all of these cases and integrate them together.

We have two different versions of messaging. We have high-speed messaging that runs at network speed, which utilizes a Java version of ZeroMQ and Kryo. And we have the persistent high-speed messaging infrastructure, for recovery purposes and application decoupling purposes — we use Kafka for that; it’s built into the product and ships with the product. We have Hazelcast, which manages clustering, metadata management, and control. We have an implementation of JCache as an in-memory data grid for high-speed lookups and very fine-grained control over where data is stored across the cluster. And we utilize Elasticsearch — Tony will be happy, as he mentioned Elastic earlier — for distributed results storage. So those are all the pieces of open source that we have chosen. There are a lot more supporting libraries that I couldn’t possibly fit in here.

Like a JSON parser, for example — there are lots more of those, but they’re not major components that are their own software category by themselves. Then, of course, we built all of the glue code. We had to work out how we do scalability, distributed technology, clustering, and failover across all of these pieces, so they all work together; how we manage reliability and exactly-once processing all the way from sources to targets; how we have a single security policy, role-based security, and encryption across all of these things; and a full management and monitoring interface — APIs and UI — that enables you to control this as a whole platform rather than as individual pieces. We have a full set of APIs, whether through our scripting language, JDBC/ODBC, REST APIs, or WebSocket APIs, that enable you to connect with the platform.

But then we also have a whole bunch of secret sauce: all the pieces that enable continuous data collection, whether from devices, big data, databases, etc., through change data capture, plus continuous data delivery — all those things that you see on the other side. And then the real key is the SQL-based processing and analytics, which is our own intellectual property. That enables you to do the filtering, transformation, aggregation, and enrichment; to do complex event processing, anomaly detection, and correlation. And that’s a piece of the platform that we currently hold a patent for. As Tony mentioned, on top of that you rarely see UIs, either in open source software or even within the enterprise. Not many enterprises that are building a data processing platform are going to take the time to build a drag-and-drop UI or a command line interface or some other easy way to actually build the applications.

They’re going to rely on developers to write code to build the applications. So we provide a drag-and-drop UI for building data flows, doing all the analytics, and building the dashboards. So that’s how we incorporate open source into our platform — but our customers don’t have to worry about it. If one of these pieces were decommissioned, no longer supported, or had major bugs in it, then we handle that for the customer. They don’t have to have developers on call, or the $3 million spent to keep people maintaining and upgrading the platform continually. Unique IP that we include is change data capture, which enables you to get data from databases in real time as it’s happening, and also enables you to handle things like changes in schema — so if the table structure changes, we can modify how we write things out accordingly.

There is also in-memory distributed processing, which is a patented technology that enables the SQL-based processing to happen across the cluster, intelligently routing things across the cluster and joining things with the distributed cache, and handles fault tolerance and exactly-once processing with rollback and recovery. That enables you to scale the applications and also trust that they’re going to work and pick up where they left off if, for example, you’ve lost the entire cluster. So these are some key things that we’ve added in, and on top of that, the UI and dashboard builder that, as we mentioned, you rarely get from open source.

We are recognized by the industry for doing a lot of innovative work, both on streaming analytics and on IoT, and also as a great place to work — and we’re very happy about that one. Let me just very quickly drill down into a couple of these customer stories. We have a customer who has built out an anti-piracy solution. It’s used for video feeds, etc., for media customers, and enables real-time monitoring of the usage of media feeds, correlating multiple logs in real time to identify whether this is really a subscriber or not. And why did they choose us? Well, they looked at a number of different open source log analytics products. They had some concerns about the number of people it would take to maintain them. They estimated that if they hadn’t used our platform, they would have had to triple the size of the team to actually do the development and to maintain and keep the platform up to date on an ongoing basis.

There were also some limitations with the single open source solutions they could have chosen for this, given the integration it needed to do. So we were chosen because, in addition to the log capture, we had change data capture from the Oracle database; we have the SQL processing language that enabled them to use people they wouldn’t ordinarily have thought of as developers — people in the analytics team — to build out some of the data flows; we had the visualization that they could use for monitoring; and we could easily integrate with the existing code that they had for a machine learning solution. A second case is a leading credit card network. What they needed to do was very quickly identify potential threats. If you have lots and lots of security applications out there — a large number of different types of security logs — you’re going to end up with alerts from all of them.

And if you get an alert, security analysts will have to manually drill down and correlate and look at what else is happening. So if you get a port scan from a certain IP address — what else is that IP address doing? They’d have to manually correlate and look across a whole bunch of different logs. So they’re using our platform to pre-correlate across all the logs and identify things that have activity in more than one place, with certain rules around that, which enabled them to identify high-priority potential threats and act on those much more quickly, and reduced the amount of time they have to spend looking at data, because we pre-correlate all the data. It really enables them to pretty quickly see everything they need in order to make decisions. And why they chose us: they had tried things before — they built a Python application that did things with a four-hour window, so it was four hours behind.
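The pre-correlation described above can be sketched in a few lines of Python — grouping alerts from different log feeds by source IP within a time window and flagging IPs that show up in more than one feed. The field names and the rule itself are hypothetical, just to show the shape of the idea:

```python
from collections import defaultdict

def correlate(alerts, window_seconds=300):
    """Flag source IPs seen in more than one log source within the window.
    Each alert is a dict: {"ts": seconds, "src_ip": str, "source": log name}."""
    by_ip = defaultdict(list)
    for alert in alerts:
        by_ip[alert["src_ip"]].append(alert)

    flagged = {}
    for ip, events in by_ip.items():
        events.sort(key=lambda a: a["ts"])
        span = events[-1]["ts"] - events[0]["ts"]
        sources = {a["source"] for a in events}
        # Activity in multiple feeds, close together in time -> high priority
        if len(sources) > 1 and span <= window_seconds:
            flagged[ip] = sorted(sources)
    return flagged
```

An analyst then starts from the flagged IPs instead of manually chasing each raw alert across separate logs.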

And they wanted this information as soon as possible, in real time. They looked at two different pieces of open source. One of them — the one they worked on first — was the story that Tony told, the one that got de-supported after they had spent a year of development on it, unfortunately. So then they looked again at options. With the other piece of open source, they did the identification, evaluation, and testing, and that one didn’t scale as they needed — we’re talking 10 billion events a day that needed processing, and they just couldn’t get that performance out of it. It also didn’t have all the required features they needed. They chose us because we have the necessary scalability and also the SQL-based processing and analytics, so they could build out these data flows, update them, and build additional applications really quickly.

So the key takeaways really are that this blended approach does all the work that you would otherwise need to do to identify, evaluate, choose, integrate, and maintain all of those bits of open source, and provides it all for you — abstracted away, without you having to worry about it — so that you can use it very easily. But it also integrates with open source you may have chosen already. So if you already have your own Kafka cluster, you don’t have to use the one we ship with; we can integrate with that as if it were ours. If you already have Hadoop, we can read from and write to that for you. Our solution is also enterprise-grade, and you can get started much faster and more cost-efficiently. We basically take the best of open source — having gone through all that process — and bundle it with our unique IP. That offers you patented technology for real-time integration and analytics, provides you the UI for building everything, the ease of use, end-to-end security, reliability, and scalability, and gives you the ability to build dashboards and visualizations in a single platform. So that’s the end of my part of the presentation, and we will now open it up for the Q&A.

Thanks so much, Steve and Tony. I’d like to remind everyone to submit your questions via the Q&A panel on the right-hand side of your screen. While we’re waiting, I’ll mention that a link to the recording of today’s webinar will be emailed to you within the next day or so. Feel free to share this link with your colleagues. Now let’s turn to our questions. Our first question is: does CDC need to be turned on in the database for Striim to handle CDC from it?

That’s a great question, and the answer is yes — but we can help you with that. Different databases have different requirements for enabling change data capture. Oracle, for example, needs supplemental logging turned on; MySQL requires that you have the binlog enabled, etc. But if you’re using our platform — using our UI, using our wizards — when you make that initial connection into the database, we will check all of that for you, and we will tell you if you don’t have the correct things configured and, if you don’t, what you need to do in order to achieve that. So yes, it does have to be turned on, but we can help you with it.

Great, thanks. Our next question: when it comes to real-time data integration, how does Striim differ from other products in the market?

Well, hopefully you’ve just seen it in the presentation: we are a full end-to-end platform, right? We can do real-time integration, and other people talk about real-time integration as well. There are solutions out there that can do change data capture into Kafka, but if you look a little bit underneath the covers, what they mean is that they are doing change data capture and writing exactly that data into Kafka. There’s no processing available, no enrichment, no advanced features available within that. Just because people talk about real-time integration or streaming integration as a marketing piece, it doesn’t mean that they’re necessarily doing everything you need to achieve it. So it’s really our completeness that is the big differentiator — we have all these data sources to turn anything into a data stream.

We have all of the processing in SQL — you don’t need developers to write Java code or JavaScript or C# to actually do all of the processing of the data. We have a lot of different data targets, and you can take one CDC source and push that into Kafka, a cloud database, blob storage, and Hadoop, all within a single data flow. It’s very easy to build these things out — we have some videos on that as well. So it’s really the completeness of the platform. Plus, because you now have things streaming, if you want to move on to real-time analytics, you’re perfectly positioned to do that, and our platform can help you with that too.

Excellent, thanks Steve. Our next question: does Striim take care of upgrades of all the underlying open source technologies? For example, if I upgrade to the latest Striim version, do I get the latest compatible open source technologies?

So, it depends on the integration point. If you’re talking about sources and targets — things that we connect to — then we’re always trying to keep up to date with whatever our customers are requiring. That may not be the very latest version of everything; it could be the previous one, or the last three — it depends on what has market traction — but we will always support our customers and ensure that they can connect with whatever they have already. If it’s something that’s an integral part of our platform, that is completely hidden from customers. We have our own mechanisms by which we choose when to upgrade, and it obviously depends on stability, security, any API changes, integration effort, etc. So we may not always be shipping with the latest release of something within our platform, but we aim to keep things current, because we obviously want to take advantage of any bug fixes and security fixes that have gone in.

And if customers point something out to us — a concern, for example, about a security hole in one of the pieces that we incorporate — then that’s obviously something that we can patch and fix quickly. So — that was a long answer; the short answer is, it depends.

Perfect. I think that also addresses our next question. I regret that we’re out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Tony and Steve, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.

The Critical Role of a Streaming First Data Architecture

Steve Wilkes, Striim Co-founder and CTO, discusses the need for a “streaming first” data architecture and walks through a demo of Striim’s enterprise-grade streaming integration and streaming analytics platform.

To learn more about the Striim platform, go here.

 

Unedited Transcript:

Today we’re going to talk about what a streaming-first architecture is and why it’s important to your data modernization efforts. We’ll talk about the Striim platform, give you some examples of what customers are doing and the solutions they’re building using our platform, take some time to give you a demonstration of how it all works, and then open things up at the end for questions and answers.

So most of you are probably aware of this: the world moves fast, and it’s not just the world — it’s the data that moves fast. Data is being generated largely by machines now, and so businesses need to run at machine speed. You need to be able to understand what’s happening right now and react immediately, before it’s too late. At the same time, customers’ and employees’ expectations are rapidly rising, and a big reason for this is the smartphone boom and people’s access to real-time information — the ability to see what’s happening with your friends, what’s happening in the world, instant access to news, instant access to messages and communication. Because the consumer world is instant, and those consumers are also employees and executives, they expect instant responses and insight into what’s happening within the enterprise. And the quality of the applications they’re used to dealing with is also driving desire for similar-quality business applications.

The other side of the coin is that businesses need to compete. Technology has always been a source of that competition, and as technology is rapidly changing, we’re getting more and more data. Businesses that are more data-driven have a competitive edge, and this now applies to almost all departments, ranging from engineering and manufacturing all the way through to marketing, all becoming incredibly data-driven. So the survival of most businesses depends on the ability to innovate and utilize new technologies. And data itself is also massively increasing. Almost everything can generate data — it could be trucks, or your refrigerator, TVs, the wearable devices that you have, healthcare devices that are becoming more and more portable. Even things like fish, making their way from being caught to a restaurant, are tracked and produce large amounts of data, and this data is growing exponentially.

IDC did a study a couple of months ago, estimating that today’s 16 zettabytes of data is going to increase ten-fold by 2025, and that of all that data, around 5% now is real time, increasing to 25% in 2025 — and by real time they mean it is produced and needs to be processed and analyzed in real time. So that’s 40 zettabytes — a zettabyte has 21 zeros — of data that will need to be processed in real time, and of that, 95% will be generated by devices. The kicker really is that only a small percentage of this data can ever be stored — there are physically not enough hard drives being produced to store all this data. So if you can’t store it, what can you do with it? Well, the only logical conclusion is that you need to process and analyze this data in memory, in a streaming fashion, close to where the data is generated. It may be that you’re turning the raw data — a thousand data points a second — into aggregated data that is less frequent but still contains the same information content. That kind of thing is what people talk about as edge processing, which is really intended to handle these huge volumes of data that people see coming down the line.
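That kind of edge aggregation — collapsing thousands of raw readings per second into less frequent summaries that preserve the information content — might look like this in Python. This is purely an illustration of the idea, with made-up reading shapes:

```python
from collections import defaultdict

def downsample(readings):
    """Collapse raw (timestamp, value) readings into one summary per
    whole second, keeping count/min/max/mean instead of every raw point."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[int(ts)].append(value)  # bucket by whole second

    summaries = []
    for second in sorted(buckets):
        vals = buckets[second]
        summaries.append({
            "second": second,
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
            "mean": sum(vals) / len(vals),
        })
    return summaries
```

A thousand readings in a second become a single record, so only the summaries need to travel upstream or be stored.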

And it’s not just IoT data that gives rise to streams. Every piece of data is generated because something happened — some kind of event. Someone was working in an enterprise application; someone was doing things on a website or using a web application, and machines were generating logs based on what they were doing. Applications are generating logs, databases are generating logs, network devices — everything is generating logs, and they’re all based on what’s happening, based on events. So if the data is created based on events, in a streaming fashion, then it needs to be processed and analyzed in a streaming fashion. If you’re collecting things in batches, then there’s no way you’ll ever get to a real-time architecture and real-time insights into what’s happening. But if you collect things as streams, then you can do those other things — you can do batch processing on the streaming data, you can deliver it somewhere else — but at the very least, the data needs to be streaming.

So stream processing has emerged as a major infrastructure requirement and is helping drive enterprise modernization. Moving to a streaming-first data architecture means that you’re transitioning your data collection to real-time collection. You’re not doing batch collection of data; you’re doing real-time collection, whether it’s from devices, files, databases — wherever it’s originating. And you’re doing this incrementally: you’re not trying to boil the ocean and replace everything in one go; you’re doing it use case by use case, as part of your data modernization projects. This means the things that have high priority become real time first — to give you real-time insights and a potential competitive edge, or better support for your customers, or to reduce the amount of money you’re spending on manufacturing by improving product quality. Any of these things can drive data modernization, and you’re doing it use case by use case, replacing pieces of it, bridging the old and new worlds of data.

Now, some of the things that our customers are telling us: they have these legacy systems — and by legacy, that can mean anything that was installed over a year ago — and they can’t keep up, or they don’t predict that they’ll be able to keep up, with the large volumes of data they’re expecting to see, with the requirements for low-latency, real-time insights into data, and with the need to rapidly innovate and rapidly produce new types of analysis and processing that give new insights into what’s happening in the business. They’re also telling us that they can’t just rip and replace these systems — they need to have the new systems and the old systems work together, potentially with failover from one to the other, while they’re doing this replacement. Striim has been around for around five years now.

We are the providers of a platform, the Striim platform, that does streaming integration and analytics. The platform is mature; it’s been in production with customers for more than three years now, with customers in a range of industries from financial services to telco, healthcare, and retail, and we’re seeing a lot of activity in IoT. Striim is a complete end-to-end platform that does streaming integration and analytics across the enterprise, cloud, and IoT. We have a very flexible architecture that allows you to deploy data flows that bridge enterprise, cloud, and IoT. You can deploy pieces of an application at the edge, close to where the data is generated, and that doesn’t have to be just IoT data; it can be any data, processed close to where it’s generated. Other pieces can run on premise, doing some processing, and other pieces in the cloud,

also doing processing or analytics. It’s very flexible how you can deploy applications using our platform. Applications start with continuous real-time data collection, and that can be from things you may already think of as real time: sensors sending events, message queues, et cetera. Then there are things like files that you may think of as batch. We can read at the end of the file, and as new records are written we stream them immediately, turning files, file rollovers, et cetera into a source of streaming data. Similarly with databases: most people think of a database as a historical record of what’s happened in the past. But by using a technology called change data capture, you can see the inserts, updates, and deletes, everything that’s happening in that database, in real time. So you can collect that non-intrusively from the database and stream it.

So now you have a stream of all the changes happening in the database. All of the applications built with the platform use some form of continuous data collection. On top of that, you can then do real-time stream processing, and this is through SQL-based queries. There’s no programming involved: no Java, no C#, no JavaScript. You can build everything using SQL, and this allows you to do filtering, transformation, and aggregation of the data by utilizing data windows. So you can ask what’s happened in the last minute, you can look for a change in the data and only send that out, and so on. And then there’s enrichment of data, which is also very important. That is the ability to load large amounts of reference data into memory across the distributed cluster and join it in real time with streaming data to add additional context.
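To make the windowing idea concrete, here is a minimal sketch in Python of a time-based data window. This is purely an illustration of the concept; in Striim you would express this declaratively in the SQL-like query language, not in code.

```python
from collections import deque

class SlidingWindow:
    """Keep only events from the last `span_seconds` seconds (illustrative)."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value), oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict anything strictly older than the window span
        while self.events and self.events[0][0] < timestamp - self.span:
            self.events.popleft()

    def aggregate(self):
        """Count and average over whatever is currently in the window."""
        values = [v for _, v in self.events]
        return len(values), (sum(values) / len(values) if values else None)

# "What happened in the last minute?" over a stream of readings
window = SlidingWindow(span_seconds=60)
for t, reading in [(0, 10.0), (30, 20.0), (90, 30.0)]:
    window.add(t, reading)

print(window.aggregate())  # (2, 25.0): the t=0 event has aged out
```

The point is that the aggregate is always over the most recent slice of the stream, never the whole history.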

An example would be device data coming in that says device XYZ, value 123. That doesn’t mean much to things downstream that might be trying to analyze it. But if you join it with some context and say, well, device XYZ is this sensor on this particular motor on this particular machine, now you have more context, and if you include that data you can do much better. On top of the stream processing, you can actually do streaming analytics. That can be correlating data together, joining data from multiple different data streams and looking for things that match in some way. Maybe you’re using web logs and network logs and you’re trying to join by IP address, looking for things that have happened on either side in the last 30 seconds. That kind of correlation extends to complex event processing, which is looking for sequences of events over time that match some kind of pattern.

So if this happens, followed by this, followed by this, and that’s important, you can act on it. You can do statistical analysis and anomaly detection and integrate with third-party machine learning. We can also generate alerts, trigger external systems, and build rich streaming dashboards to visualize the results of your analytics. And any of the data that’s initially collected, the results of processing, the results of analytics, can all be delivered somewhere, and you can deliver to lots of different targets in a single application. So you can push data to enterprise and cloud databases, files, Hadoop, Kafka, et cetera. As a new breed of middleware that supports streaming integration and analytics, it’s very important that we integrate with your existing software choices. So we have lots of data collectors and delivery adapters that work with systems you may already have, whether it’s big data systems, enterprise databases, or open source pieces, and we do all of this in an enterprise-grade fashion that is inherently clustered, distributed, scalable, reliable, and secure, as a general-purpose piece of middleware.

We support lots of different types of use cases, from real-time data integration to analytics and being able to build dashboards and monitor things, and these use cases cut across all different industries. They can range from building your data lake and preparing the data before you land it, to doing migrations, to doing IoT edge processing. On the analytics and patterns side, there are things like fraud detection, predictive maintenance, and anti-money-laundering, which are some of the things we’ve heard from customers. And then if you want to build dashboards and monitor things in real time, to see whether things are, for example, meeting SLAs or meeting expectations, we’ve done things like call center quality monitoring, SLA monitoring, looking at the network from a customer perspective, and being able to alert when things aren’t running normally. There’s a lot of text on this slide, but the takeaway is that we have use cases in a lot of different industries.

One of the examples is using Striim for hybrid cloud integration. That’s where you have a database on premise and you want to move it or copy it to the cloud. It’s one thing to just take the database and put it in the cloud, but that will miss anything that happens while you’re doing it, or anything that’s happened since you did it. So it’s really important that you include change data capture to continually feed your hybrid cloud database with new information. By using a set of wizards, you can build this really quickly, connecting an on-premise Oracle database, for example, and delivering real-time data from it into, say, Azure SQL Database. So you now have an exact, always up-to-date copy of the on-premise database.

Another totally different example is using us for security monitoring. Here you have lots of different logs being produced by VPNs, firewalls, network routers, individual machines, anything that can produce a log, and you recognize that unusual behavior is most often seen affecting multiple systems. Security analysts get a lot of alerts from all these logs and all these systems all the time, but a lot of those are false positives. So the goal was to identify the things that were really high priority for them to look at first, by seeing which activity was affecting multiple systems. For example, if you have a port scan from a network router and the same source is looking at other machines, is there any activity on the other machines that it’s looked at? Are they doing port scans?

Are they connecting to external sites and downloading malware? By doing this correlation in memory, in real time, you can spot threats at a higher priority. And by pre-correlating all the data together and providing it to the analysts, they can see immediately the information they need rather than having to go and manually look for it across a whole bunch of different logs, and this really increases the analysts’ productivity. A couple of other examples from our customers: one is a very simple real-time data movement where data from HP NonStop and SQL Server databases is being pushed out into multiple targets, whether it’s Hadoop HDFS, Kafka, or HBase, and they’re using that as an analytics hub. So they’re basically ensuring that wherever they want to put the data, it’s always up to date and always contains real-time information from the source databases.
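The correlation just described, joining two log streams on a shared key inside a time window, can be sketched roughly like this. The field names (`ip`, `ts`) are invented for illustration; this is not Striim’s actual API, which expresses the join in SQL.

```python
# Pair web and network events that share an IP and occurred close in time.
def correlate(web_events, net_events, window_seconds=30):
    """Match events from two streams on IP within the time window."""
    matches = []
    for w in web_events:
        for n in net_events:
            if w["ip"] == n["ip"] and abs(w["ts"] - n["ts"]) <= window_seconds:
                matches.append((w["event"], n["event"]))
    return matches

web = [{"ip": "10.0.0.5", "ts": 100, "event": "login_failure"}]
net = [{"ip": "10.0.0.5", "ts": 110, "event": "port_scan"},   # within 30s
       {"ip": "10.0.0.5", "ts": 400, "event": "port_scan"}]   # too late

print(correlate(web, net))  # [('login_failure', 'port_scan')]
```

A production engine would do this incrementally in memory rather than with nested loops, but the matching logic is the same idea.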

And then a glucose monitoring company is using us to see events coming in from implantable devices doing real-time monitoring of glucose. It’s really important that these things work, so they are looking at whether a device is having any errors or has suddenly gone offline, and are able to see in real time if any of these devices are not working properly. This is really important to their patients, because the patients rely on these devices to check their glucose levels. So this has really reduced the time to detect that there’s an issue and has improved patient safety massively. We are recognized generally by a lot of the analysts in both the in-memory computing and the streaming analytics landscapes, and we’re also getting a lot of recognition from various publications and trade show organizers. Also, very importantly, we won a Best Places to Work award, which is really validation of us being a great company. On to our key differentiation:

Striim’s end-to-end platform does everything from collection and processing to analytics, delivery, and visualization of streaming data. It is easy to use, with the SQL language for building the processing and analysis, which allows you to build and deploy applications in days. And we’re enterprise grade, which means we are inherently scalable in a distributed architecture, reliable, and secure, and we’re easy to integrate with your existing technology choices. Those are the key things to remember about why we’re different. So with that, we’re going to go into a demonstration. In the first part, rather than type a lot of things, we’re just going to go through how to build change data capture into Kafka, do some processing on that, and then do some delivery into other things.

This is a pure integration play. You start off by doing change data capture from a database, in this case MySQL. You build the initial application and then configure how to get data from the source, so we configure the information to connect to MySQL. When you do this, we’ll check and make sure everything is going to work, that you already have change data capture configured properly, and if you don’t, we’ll tell you how to fix it. You then select the tables you’re interested in, the ones we’re going to collect the change data from, and this is going to create a data stream. That data stream will then go to Kafka, so we configure how we want to write into Kafka: setting up the broker configuration, what the topic is, and how we want to format the data.

In this case we’re going to write it out as JSON. When we save this, it’s going to create a data flow, and the data flow is very simple: in this case it’s two components. We’re going from a MySQL CDC source into a Kafka writer. We can test this by deploying the application, which is a two-stage process: you deploy first, which puts all the components out over the cluster, and then you run it. Now we can see the data that’s flowing in between. If I click on this, I can actually see the real-time data, and you see there’s the data and there’s the before image. For updates, you get the before image as well, so you can see what’s actually changed. So that’s real-time data flowing through from the MySQL application. But it doesn’t usually end there.
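To show why the before image is useful, here is a hypothetical shape for a CDC update event (this is not Striim’s exact wire format, and the table and column names are invented): a consumer can diff the new data against the before image to see exactly which columns changed.

```python
# Hypothetical CDC update event carrying both the new data and the before image.
cdc_event = {
    "operation": "UPDATE",
    "table": "PRODUCT_INV",
    "data":   {"location_id": 10, "product_id": "XYZ", "stock": 42},
    "before": {"location_id": 10, "product_id": "XYZ", "stock": 50},
}

def changed_columns(event):
    """Return only the columns whose values differ from the before image."""
    before = event.get("before")
    if before is None:           # inserts carry no before image
        return dict(event["data"])
    return {k: v for k, v in event["data"].items() if before.get(k) != v}

print(changed_columns(cdc_event))  # {'stock': 42}
```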

The raw data may not be that useful. One of the pieces of data in here is a product ID, and that probably doesn’t contain enough information. So what we’re going to do first is extract the various fields from this, and those fields include the location ID, the product ID, how much stock there is, et cetera. This is an inventory monitoring table, and we’ve just turned it from a raw format into a set of named fields, which makes it easier to work with later on. You can see the structure of what we’re seeing in that data stream is very different now. If we then want to add additional context to this, we’ll be able to join that data with something else. But first of all, we’ll just configure this so that instead of writing the raw data to Kafka, we’ll write the processed data out, and you can see all we have to do is change the input stream. That changes the data flow so that it now writes the processed data into Kafka.

Now we’re going to add a cache, which is a distributed in-memory data grid that’s going to contain additional information we want to join with the raw data. In this case it’s product information: for every product ID there’s a description, a price, and some other fields. First we create a data type that corresponds to our database table and configure what the key is; the key in this case is the product ID. Then we specify how we are going to get the data. It could be from files, it could be from HDFS; here we’re going to use a database reader to load it from a MySQL table, so we specify the connection details and the query we’re going to use. We now have a cache of product information. To use it, we modify the SQL to just join in the cache.
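The cache-join step can be sketched like this. The product IDs and fields are invented for illustration; in Striim the cache is a distributed in-memory data grid and the join is written in SQL, but the effect is the same: reference data merged into each event with no database lookup.

```python
# In-memory reference data keyed by product ID (illustrative values).
product_cache = {
    "P-123": {"description": "10mm bearing", "brand": "Acme", "price": 4.99},
}

def enrich(event, cache):
    """Merge cached reference data into the streaming event; no DB lookup."""
    return {**event, **cache.get(event["product_id"], {})}

raw = {"location_id": 10, "product_id": "P-123", "stock": 42}
enriched = enrich(raw, product_cache)
print(enriched["description"], enriched["price"])  # 10mm bearing 4.99
```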

Anyone that’s ever written any SQL before knows what a join looks like; we’re just joining on the product ID. So now, instead of just the raw data, we have these additional fields that we’re pulling in, in real time, from the product information. If we start this and look at the data again, you’ll actually be able to see the additional fields, like description, brand, category, and price, that came from that other source. That’s all joined in memory; there are no database lookups going on, so it’s actually really, really fast. So that’s writing streams into Kafka. If you already have data on Kafka, or another message bus, or in files for that matter, you may want to read it and push it to some other targets. So what we’re going to do now is take that data that we just wrote to Kafka,

and we’re going to use a Kafka reader in this case. We’ll just search for that and drag in the Kafka source, then configure it with the properties to connect to the broker that we just used. And because we know it’s JSON data, we’re going to use a JSON parser that’s going to break it up into a JSON object structure and then create this data stream. When we deploy and start this application, it’ll start reading from that Kafka topic, and we can look at the data and see that this is the data we were writing previously, with all the information in it, in JSON format. You can see the JSON structure, though, and for the other targets we’re going to write to, the JSON structure might not work. So what we’re going to do now

is add in a query that’s going to pull the various fields out of that JSON structure and create a well-defined data stream that has individual fields in it. So we’ll write a query to do that, directly accessing the JSON data, and save it. Now, instead of the original data stream with the JSON in it, when we deploy this, start it up, and look at the data (and this is incidentally how you build applications, looking at the data all the time as you add components), you’ll be able to see that we have those individual fields, which is what we had before on the other side of Kafka. But don’t forget, it doesn’t have to be stream writes into Kafka; it could be anything else. And if you were doing something like we just did, CDC into Kafka and then Kafka into additional targets, you don’t have to have Kafka in between. You can just take the CDC and push it out to the targets directly.
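As a sketch of the field-extraction step just described: a JSON record read back from Kafka gets parsed and flattened into individual, well-defined fields. The field names here are invented, and Striim does this with a query rather than Python, but the transformation is the same.

```python
import json

# A JSON payload as it might come off a Kafka topic (illustrative).
message = b'{"location_id": 10, "product_id": "P-123", "stock": 42}'

def to_fields(raw_bytes):
    """Parse the JSON payload and pull out named, typed fields."""
    obj = json.loads(raw_bytes)
    return obj["location_id"], obj["product_id"], obj["stock"]

location_id, product_id, stock = to_fields(message)
print(location_id, product_id, stock)  # 10 P-123 42
```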

So what we’re going to do now is add a simple target, which is going to write to a file. We do this by choosing the file writer and specifying the format we want. We’re going to write this out in the CSV format; we actually call it DSV, for delimiter-separated values, because the delimiter can be anything, it doesn’t have to be a comma. We save that, and now we have something that’s going to write to the file. If we deploy this and start it up, we’ll be creating a file with the real-time data.
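The DSV idea, CSV with a configurable delimiter, can be sketched in a few lines. This is an illustration of the format, not Striim’s actual writer.

```python
# Render rows of values as delimiter-separated lines; the delimiter is
# configurable, which is what distinguishes DSV from plain CSV.
def format_dsv(rows, delimiter=","):
    """Render each row of values as one delimited line."""
    return "\n".join(delimiter.join(str(v) for v in row) for row in rows)

rows = [(10, "P-123", 42), (11, "P-456", 7)]
print(format_dsv(rows))       # default comma delimiter
print(format_dsv(rows, "|"))  # any other delimiter works just as well
```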

After a while it’s got some data in it, and then we can use something like Microsoft Excel to view the data and check that it’s what we wanted. So let’s take a look in Excel, and you can see the data that we initially collected from MySQL, wrote to Kafka, read from Kafka, and then wrote back out into this CSV file. But you don’t have to have just one target in a single data flow; you can have multiple targets if you want. We’re going to add two: writing into Hadoop and into Azure Blob Storage. In the case of Hadoop, we don’t want all the data to go to Hadoop, so we’ll add a simple CQ, a continuous query, to restrict the data, and we’ll do this by location ID.

So only location 10 is going to be written to Hadoop; there’s some filtering going on there. Now we add in the Hadoop target. We’re going to write to HDFS, so we drag that into the data flow. As you can see, there are many ways of working with the platform; we also have a scripting language, by the way, that enables you to do all of this from vi or emacs or whatever your favorite text editor is. We’re going to write to HDFS in the Avro format, so we’ll specify the schema file, and when this is started up we’ll be writing into HDFS as well as to the local file system. Similarly, if we want to write into Azure Blob Storage, we can take the adapter for that, search for it, and drag it in from the targets. We’re going to do that on the original source data, not the query, so we’ll drag it onto that original data stream.

Now we just configure this with information from Azure. You need to find out what the server URL is, and you should know what your key, username, and password are, things like that. You collect that information, if you don’t have it already, and add it into the target definition for Azure Blob Storage. We’re going to write that out in JSON format. So that’s, very quickly, how you can do real-time streaming data integration with our platform, and all of that data was streaming; it was being created by making changes to MySQL. Now let’s see some analytics. I have a couple of applications I’ll show you very quickly. The applications are defined through data flows, and data flows typically start with a data source.

They then do a whole bunch of processing, and you can have subflows as well; each of the subflows can be doing reasonably complex things, with nested data flows. If I deploy this application and then we go and take a look at a dashboard, you’ll be able to see how you can start visualizing some of this data. This data is coming from ATMs and other cash points, other ways of getting cash, and the goal of the application is to try to spot whether the decline rates for credit card transactions and so on are going up. What it’s doing is taking the raw transactions, slicing and dicing them by a whole bunch of different dimensions, and trying to spot whether the decline rate has increased by more than 10% in the last five minutes, both overall and across all of the different dimensions as well.
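As a rough illustration of that check (the talk doesn’t say whether the 10% is a relative or an absolute rise, so this sketch uses a simple percentage-point comparison between consecutive windows, and the field names are invented):

```python
def decline_rate(transactions):
    """Fraction of transactions in a window that were declined."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if t["declined"]) / len(transactions)

def rate_spiked(previous_window, current_window, threshold=0.10):
    """Flag when the decline rate rises by more than the threshold."""
    return decline_rate(current_window) - decline_rate(previous_window) > threshold

prev = [{"declined": False}] * 95 + [{"declined": True}] * 5    # 5% declined
curr = [{"declined": False}] * 80 + [{"declined": True}] * 20   # 20% declined
print(rate_spiked(prev, curr))  # True: the rate jumped 15 points
```

The same comparison would be run per dimension (per ATM, per region, per card type), which is what the slicing and dicing provides.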

For each of these visualizations, nothing’s hard-coded in here. It was all built using our dashboard editor, where you can drag and drop the visualizations into the dashboard. Each visualization is configured with a query that tells it how to get the data from the backend, a set of properties that tell it how to get the data from the query into the visualization, and other configuration information. So that’s an example of one analytics application built using our platform. We’ll go and take a look at a totally different one that does something completely different; I’ll just stop this one. This one is tracking passengers and employees at an airport, and the data is coming from location monitoring devices that track WiFi. If we take a look at the dashboard for this, you can see it’s a rich dashboard with lots of information on it.

The data here is location information joined with zones that have been set up, and these zones represent different airline ticketing areas. What we’re doing is tracking the number of passengers and employees that are in the different zones. If the number of passengers goes up too much and you need additional employees, it’s going to flag that by turning red, and it will send out a request for more employees. The red dots here are the employees and the white dots are the passengers. So as more red dots arrive in this location, it will notice that new employees are there, and the alert will go away, because things are actually okay now. The other thing it’s tracking is the individual locations of all the passengers, and this passenger over here just walked into a prohibited zone.

So that will send out an alert, and now an employee will try to track that person down and move them out of the prohibited zone. That’s a totally different analytics application, but again, this one was also built using our dashboard builder. I hope that gives you an idea of the variety of different things you can do using the Striim platform. To finalize things: Striim can be easily fitted into your existing data architecture. You can start to take some of the load away from what were maybe existing ETL jobs and move those into real time, and we can integrate with all of the sources you may be using for existing ETL. But we can also integrate with a lot of other sources as well, and pull data out of things you may not have had access to before.

We can also integrate with things you have already. Maybe you have an operational data store or enterprise data warehouse, or maybe you’re already writing data into Hadoop; we can integrate with that data, maybe use it for context information. We can also write to those systems, so you can use us to populate your operational data store, enterprise data warehouse, or Hadoop, and we can integrate with your machine learning choices as well. So these real-time applications play a really important part. They’re a new class of application that gives you real-time insights into things, but we can also be part of driving your big data applications and legacy applications.

As I mentioned before, the platform consists of real-time data collection across very many different types of data sources, and then real-time data delivery into a lot of different targets. The example you saw originally, CDC into Kafka, was just that, very simple, with no processing. We then added in processing, and that’s all done through in-memory continuous queries written in a SQL-like language. Those queries can incorporate time-series analysis through windowing, which allows you to do things like the transformation of data that you saw, filtering, enrichment of data by loading large amounts of data into memory as external context, and aggregation of data, to get, say, what’s happened in the last minute, or aggregation by dimensions as you saw in the transaction analytics. The other piece is the analytics, where we can do anomaly detection, pattern matching through complex event processing, and, very importantly, the correlation that was so important to the security application. On top of this, you can trigger alerts and external workflows, you can run ad hoc queries against the platform if you want to see what’s going on right now in a data stream, and you can build real-time dashboards. And we can integrate with your choices for machine learning and do real-time scoring very close to where the data is generated.

We integrate with most of your existing enterprise software choices, being able to connect to a whole bunch of different sources in a whole bunch of different formats and deliver to a whole bunch of different targets, but also integrating with your choices of big data platforms, whether it’s MapR, Hortonworks, or Cloudera, your choice of cloud, whether it’s Amazon, Microsoft, or Google, and running on operating systems and virtual machines so that we can run on premise, at the edge, and in the cloud. We are also well suited for IoT: we have a separate cut-down edge server that’s designed to run on gateway hardware that may not be as powerful as what you’d run a Striim cluster on. Processing and analytics can happen at the edge, and we can deliver that data directly into the cloud or into a Striim server on premise.

And you can have applications that span edge processing, on-premise analytics through the Striim server, and also the cloud. Typically the volume and speed of the data is greater on the left-hand side, and you’re reducing that data down to the important information, but covering a lot more territory as you move right. So you may have a single edge server on a single machine in the factory, but then a Striim server that covers the factory, and then the cloud covering all the factories that an organization may own.

And we believe that IoT is not a siloed technology; it shouldn’t be thought of separately, especially IoT data. IoT data needs to be integrated with your existing enterprise data and become part of your enterprise data assets. So as part of your data modernization, as you’re thinking about IoT, don’t think about it separately. Think about how to integrate that IoT data and get the most value out of it, because it will be much more valuable if it has more meaning, and it can have more meaning if you correlate and join it with your other enterprise data. We are one consistent, easy-to-use platform: not only do we have a converged in-memory architecture that combines in-memory caches, high-speed messaging, and in-memory processing, but we also have a single UI that allows you to design, analyze, deploy, visualize, and monitor your streaming applications.

The key takeaways from this, I hope, are that you really need to start thinking about a streaming-first architecture right now. You need to start thinking about how to get continuous data collection from your important data sources, and consider that from the point of view of what you require immediate insight into: how do I increase the competitiveness of my company, or operational efficiency, or whatever reason you may have for doing it, whatever drivers you have for real-time applications. How do I go about doing this on a piece-by-piece basis? Start by streaming that important data. Also consider sources where your data volumes are going to be growing and where you may need to do pre-processing in flight before you store any data; that’s another area where streaming first is absolutely essential. We believe that Striim is the right solution for this, because we have a streaming architecture that addresses both of these concerns and others you may have as well, especially being enterprise grade and able to run mission-critical applications. And you shouldn’t be ripping and replacing everything.

This has to be use-case driven, and probably everyone out there has a use case where they need to get real-time insights into something; that’s a really good place to start. So if you want to find out more about Striim, go to the striim.com website. You can contact us via the email or support links on there, tweet at us, and check out our Facebook and LinkedIn pages as well. And with that, I will open it up for any questions.

Thanks so much, Steve. I’d like to remind everyone to submit your questions via the Q&A panel on the right-hand side of your screen. While we’re waiting, I’ll mention that we will be sending out a follow-up email with a link to this recording within the next few days. Now let’s turn to our questions. Our first question is: what’s the difference between streaming first and an event-driven architecture?

That’s a great question. Event-driven architecture has been talked about for a long time; I think it became a category all the way back in 2002, 2003. The emphasis there was on the data movement, on the enterprise data bus that moved things around, and that was where you put your events. So in the whole SOA, event-driven-architecture world, the bus was the crucial thing. That technology has matured. We’ve had message buses around for a long time, and new ones have come out recently, like Kafka, that have really caught everyone’s imagination. The bus technology has found its place. The important piece we’re talking about with streaming first is the data collection: you want to put as much of your data as you can into data streams, so that you can get to real-time insights.

If you want to, you could just take those data streams that you’ve collected in real time and deliver them out, say, to your data warehouse or a cloud database, where they just become storage, if that’s what you want to do with the data. But at least you’re collecting the data in real time. You’re not going to get real-time insights after you put data into a database or into Hadoop; there’s always going to be some latency involved in reading things back from storage. But as you start to identify applications where you do need real-time insights, you can move them to acting in memory, directly on the data streams. So what we really mean by streaming first is that you are employing an event-driven architecture, but you’re focusing on ensuring that you at least do the data collection first.

That’s great. Thanks so much, Steve. The second question is: how do you work with machine learning? So, there are a couple of different ways in which we work with machine learning, and I can explain this by way of example. We actually have a customer who is integrating machine learning as part of their overall data flow and use of Striim. The first essential piece is being able to prepare data and continuously feed some storage that you’re going to do machine learning on. Machine learning requires data, a lot of data, but it also needs that data to be refreshed, so that if you need to rebuild the model, if you’ve identified that the model is no longer working, you have up-to-the-minute data. With Striim you can collect the data, maybe join it and enrich it, and pivot it so that you end up with a multi-variable structure suitable for training machine learning, before you write it out to files, Hadoop, or a database.

And then, outside of Striim, you can use your usual machine learning software. For this customer, the data they were collecting was web activity, VPN activity, and other data around users. They pushed all of this into a file store and then used H2O, their choice of machine learning software, to build a model. The model was modeling user behavior: what do users normally do, what is the usual pattern of activity for each one of our users? When do they log in, when are they accessing things, what applications are they using, and in what order? They built this machine learning model, expressed all of that, then exported the model as a JAR file and incorporated it into a data flow straight from the raw data in our platform. So the raw data was going through the processing into the store, but we were also taking that raw data in memory, in a streaming fashion, pushing it through the model, checking to see whether it matched the model, and then alerting on any anomalous behavior. So the two places where we really work with machine learning are delivering data into the storage that machine learning trains from, and then, once it’s learned, taking the model and doing real-time inference or scoring in order to detect anomalies, make predictions, et cetera.

That’s a great example. Thanks. I think we have time for just one more question. This one is: Confluent just announced SQL on Kafka. How does that compare to what you do?

Okay. Well, SQL on data streams is a great idea, and obviously we’ve been doing that almost since the inception of the company.

I like it. It’s brand new, right? So it’s not going to be mature for a while, and it’s only looking at a small part of everything you need in order to actually do all the things I showed you today. Being able to run SQL against a stream within a window is one thing, but there are other types of things you need to incorporate. For example, we can incorporate data in a distributed data grid, so you can cache information, data, and results, store them, and feed results back into processing, among a whole bunch of other things. But I think the primary thing I see is that it’s focusing on interactive, ad hoc queries against streams, and that’s good: being able to just see what’s going on in a stream and analyze it. The power of our platform, though, is combining your sources,

queries, targets, caches, and results, everything, into a data flow that becomes an application that you can deploy as a whole. And so it’s going to take a while to work out all of the things that we spent the last five years working out. How do you secure it with a role-based security model for these types of queries? How do you integrate them into a whole application? How do you deploy the application across a cluster? All of those kinds of things are essential for the mission-critical, SQL-utilizing applications that we support for our customers. So, being an end-to-end platform, we can do all of that; having to combine all the pieces together yourself, even though the SQL may be useful, might be harder with the KSQL that was announced earlier this week.

Great. Thanks so much, Steve. I regret that we’re out of time. If we did not get to your specific question, we will follow up with you directly within the next few hours. On behalf of Steve Wilkes and the Striim team, I would like to thank you again for joining us for today’s discussion. Have a great rest of your day.


Making the Most of Apache Kafka – Streaming Analytics for Kafka

In Part 4 of this blog series, we shared how the Striim platform facilitates data processing and preparation, both as data streams into Kafka, and as it streams out of Kafka to enterprise targets. In this fifth and final post in the “Making the Most of Apache Kafka” series, we will focus on enabling streaming analytics for Kafka data, and wrap up with a discussion of some of Striim’s enterprise-grade features: scalability, reliability (including exactly-once processing), and built-in security.

Streaming Analytics for Kafka

To perform analytics on streaming Kafka data, you probably don’t want to deliver the data to Hadoop or a database before analyzing it, because you will lose the real-time nature of Kafka. You need to do the analytics in-memory, as the data is flowing through, and be able to surface the results of the analytics through visualizations in a dashboard.

Kafka analytics can involve correlation of data across data streams, looking for patterns or anomalies, making predictions, understanding behavior, or simply visualizing data in a way that makes it interactive and queryable.

The Striim platform enables you to perform Kafka analytics in-memory, in the same way as you do processing – through SQL-based continuous queries. These queries can join data streams together to perform correlation, and look for patterns (or specific sequences of events over time) across one or more data streams utilizing an extensive pattern-matching syntax.
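
The continuous-query idea can be sketched in plain Python. This is not Striim's SQL syntax; it is a minimal in-memory illustration, with hypothetical names and simplified window semantics, of what a windowed join across two data streams does conceptually:

```python
from collections import deque

class WindowJoin:
    """Join two event streams on a key within a sliding time window.

    A rough stand-in for a continuous query like
    'SELECT ... FROM streamA JOIN streamB ON key WITHIN 60 SECONDS';
    the class and method names are illustrative, not a real API.
    """
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.left = deque()    # buffered (ts, key, event) from stream A
        self.right = deque()   # buffered (ts, key, event) from stream B

    def _expire(self, buf, now):
        # drop buffered events older than the window
        while buf and now - buf[0][0] > self.window:
            buf.popleft()

    def on_left(self, ts, key, event):
        return self._join(ts, key, event, self.left, self.right)

    def on_right(self, ts, key, event):
        return self._join(ts, key, event, self.right, self.left)

    def _join(self, ts, key, event, own, other):
        self._expire(own, ts)
        self._expire(other, ts)
        own.append((ts, key, event))
        # emit one joined pair per matching event still in the window
        return [(event, e) for (t, k, e) in other if k == key]
```

For example, a payment arriving 30 seconds after its order joins successfully, while one arriving after the 60-second window has passed finds nothing to pair with.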

Continuous statistical functions and conditional logic enable anomaly detection, while built-in regression algorithms enable predictions into the future based on current events.
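
As a rough illustration of both techniques, here is a Python sketch (function names and parameters are our own, not Striim's built-ins): a z-score check over a sliding window for anomaly detection, and a least-squares fit extrapolated forward for prediction:

```python
import statistics

def detect_anomalies(values, window=20, threshold=3.0):
    """Flag values more than `threshold` standard deviations away from
    the mean of the preceding `window` values."""
    flagged = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            continue  # not enough history to estimate spread
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(v - mean) / stdev > threshold:
            flagged.append((i, v))
    return flagged

def linear_forecast(points, steps_ahead):
    """Least-squares line over (t, value) points, extrapolated forward."""
    n = len(points)
    sum_t = sum(t for t, _ in points)
    sum_v = sum(v for _, v in points)
    sum_tt = sum(t * t for t, _ in points)
    sum_tv = sum(t * v for t, v in points)
    slope = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_v - slope * sum_t) / n
    last_t = points[-1][0]
    return intercept + slope * (last_t + steps_ahead)
```

In a streaming engine the same logic runs incrementally over windows rather than over a list, but the statistics are the same.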

Of course, analytics can also be rooted in understanding large datasets. Striim customers have integrated machine learning into data flows to perform real-time inference and scoring based on existing models. This utilizes Striim in two ways.

Firstly, as mentioned previously, you can prepare and deliver data from Kafka (and other sources) into storage in your desired format. This enables the real-time population of raw data used to generate machine learning models.

Secondly, once a model has been constructed and exported, you can easily call the model from our SQL, passing real-time data into it, to infer outcomes continuously. The end result is a model that can be frequently updated from current data, and a real-time data flow that matches new data to the model, spots anomalies or unusual behavior, and enables proactive responses.
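
A minimal sketch of that second step, assuming a generic model object with a `predict` method (all names here are illustrative, not Striim's actual API):

```python
def score_stream(events, model, is_anomaly, on_alert):
    """Push each in-flight event through a trained model and alert on
    anomalous scores, while the event continues through the flow."""
    for event in events:
        score = model.predict(event)
        if is_anomaly(score):
            on_alert(event, score)
        yield event, score  # downstream steps still see every event

# Hypothetical model: scores a user by login rate.
class ThresholdModel:
    def predict(self, event):
        return event["logins_per_min"]

alerts = []
events = [{"logins_per_min": 2}, {"logins_per_min": 50}]
scored = list(score_stream(events, ThresholdModel(),
                           is_anomaly=lambda s: s > 10,
                           on_alert=lambda e, s: alerts.append(e)))
# only the second event trips the alert; both are passed downstream
```

Because scoring happens in memory on the stream, the alert fires as the anomalous event arrives rather than after a batch load.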

The final piece of analytics is visualizing and interacting with data. The Striim platform UI includes a complete dashboard builder that enables custom, use-case-specific dashboards to be rapidly created to effectively highlight real-time data and the results of analytics. With a rich set of visualizations, and simple query-based integration with analytics results, dashboards can be configured to continually update and enable drill-down and in-page filtering.

Putting It All Together 

Building a platform that makes the most of Kafka by enabling true stream processing and analytics is not easy. There are multiple major pieces of in-memory technology that have to be integrated seamlessly and tuned in order to be enterprise-grade. This means you have to consider the scalability, reliability and security of the complete end-to-end architecture, not just a single piece.

Joining streaming data with data cached in an in-memory data grid, for example, requires careful architectural consideration to ensure all pieces run in the same memory space, and joins can be performed without expensive and time-consuming remote calls. Continually processing and analyzing hundreds of thousands, or millions, of events per second across a cluster in a reliable fashion is not a simple task, and can take many years of development time.
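
To illustrate why co-locating the cache matters, here is a small Python sketch, with invented names, in which the stream-to-reference join is a local dictionary lookup instead of a network round trip:

```python
class LocalCacheEnricher:
    """Enrich stream events from reference data held in the same
    process, so each join is an O(1) in-memory lookup rather than
    a remote call. Field names are illustrative."""
    def __init__(self, reference_rows, key_field):
        # materialize the reference data locally, keyed for fast joins
        self.cache = {row[key_field]: row for row in reference_rows}
        self.key_field = key_field

    def enrich(self, event):
        ref = self.cache.get(event[self.key_field], {})
        return {**ref, **event}  # event fields win on conflict
```

In a clustered engine the equivalent structure is a distributed data grid partitioned so that each node holds the slice of reference data its streams will join against.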

The Striim Platform has been architected from the ground up to scale, and Striim clusters are inherently reliable with failover, recovery and exactly-once processing guaranteed end-to-end, not just in one slice of the architecture.

Security is also treated holistically, with a single role-based security model protecting everything from individual data streams to complete end-user dashboards.

If you want to make the most of Kafka, you shouldn’t have to architect and build a massive infrastructure, nor should you need an army of developers to craft your required processing and analytics. The Striim Platform enables Data Scientists, Business Analysts and other IT and data professionals to get the most value out of Kafka without having to learn, and code to, APIs.

For more information on Striim’s latest enhancements relating to Kafka, please read this week’s press release, “New Striim Release Further Bolsters SQL-based Streaming and Database Connectivity for Kafka.” Or download the Striim platform for Kafka and try it for yourself.
