If you are following streaming analytics, you have likely observed a movement in big data systems toward event stream processing that continuously analyzes events, scores them, and makes decisions in real time. This movement is also adopting online machine learning (ML) and predictive analytics based on incremental algorithms that learn from, and adapt to, changes reflected in the event data.
My contribution to this movement has been to architect streaming analytics platforms that are better at managing stateful information. As the time for ingesting and analyzing each event shrinks from weeks and days to minutes and seconds, purely stateless event processing that relies on external systems to handle stateful information becomes less practical. With state management platforms specifically designed to support streaming analytics, development of analytic solutions accelerates rapidly and solution complexity is reduced.
When I refer to a streaming analytics platform, I mean a platform whose core services for processing, messaging, coordination, and state persistence operate in concert. The platform executes solutions as a set of jobs, either scheduled or long-lived, and often some of these jobs are stateful.
State management allows the stateful jobs to define states configured for the specific needs of a solution. This includes profile sets containing profiles with name/value properties, which may also be based on moving windows defined by time intervals or event counts. The state management of these profile sets can be configured to provide immediate or delayed persistence, guaranteeing state retention between jobs or after system failures. It can also be configured to provide consistency of states in a distributed process where states may be shared between processes. Without expanding on this here, let me say that state consistency, coupled with guarantees of exactly-once, at-least-once, or at-most-once processing in a distributed data flow architecture, is one of the key features of a well-designed state management system.
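To make the idea of a profile concrete, here is a minimal sketch of what a profile with name/value properties and a count-based moving window might look like. The `Profile` class and its methods are hypothetical illustrations, not the platform's actual API:

```python
from collections import deque

class Profile:
    """Hypothetical profile: name/value properties plus a count-based moving window."""
    def __init__(self, entity_id, window_size=100):
        self.entity_id = entity_id
        self.properties = {}                      # name/value pairs
        self.window = deque(maxlen=window_size)   # count-based moving window

    def set(self, name, value):
        self.properties[name] = value

    def observe(self, value):
        # Appending beyond window_size automatically evicts the oldest value.
        self.window.append(value)

    def window_sum(self):
        return sum(self.window)

p = Profile("customer-42", window_size=3)
for amount in (10, 20, 30, 40):   # the 10 falls out of the 3-event window
    p.observe(amount)
print(p.window_sum())  # 90
```

A time-interval window would work the same way, except eviction would be driven by event timestamps rather than by count.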
In my last two blogs, Stateful Event Processing and The Need for States and Models in Event Processing, I outlined the methods and benefits of adding state management to event stream processing solutions designed as data flows on distributed architectures. In this blog, I look at the creation and use of models in streaming analytics as it relates to stateful processing.
For the most part, analytics models are trained periodically and offline, using batch processing systems that analyze entire datasets. Data is first collected into datasets over a period of time, and a model's configuration is identified using a variety of ML algorithms. This results in models that encapsulate statistical distributions, cluster geometry, constraint-based rule sets, decision trees, scorecards, and so on. These models are then periodically embedded in long-lived event processing solutions to classify, score, and make decisions about events as they arrive. This two-phase approach is convenient given that the ML algorithms typically require multiple passes over large datasets, which is hard to do directly in an event processing flow.
However, for some solutions, it is becoming possible to embed this training process directly into the event stream processing data flow. Reworking multi-pass offline ML algorithms to learn and adapt incrementally lowers computational cost and shortens processing times. Redesigning the offline algorithms to use incremental techniques can render the collection and storage of entire datasets unnecessary, since the same models can be derived in the moment as new events arrive.
Consider the (slightly oversimplified) example of computing a moving average or standard deviation of a particular value, such as the average age of customers or the standard deviation of each customer's purchases as tracked by their bank. While these measures alone are not models, they may be used by models, and they must be updated continuously to keep the models relevant. This is achieved by tracking, for each entity, three running quantities: the number of events, the sum of the values, and the sum of the squares of the values. From these three quantities, the measures of interest may be computed and refreshed at every new event without needing the entire dataset.
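The three running quantities can be sketched as follows. This uses the textbook identity that variance equals the mean of the squares minus the square of the mean; the class name and structure are illustrative, not the platform's API:

```python
import math

class RunningStats:
    """Incremental mean and standard deviation from three running quantities:
    event count, sum of values, and sum of squared values."""
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, x):
        # One update per event; no need to retain the dataset itself.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

stats = RunningStats()
for amount in (10.0, 20.0, 30.0):
    stats.update(amount)
print(stats.mean())    # 20.0
print(stats.stddev())  # ~8.165
```

In a production setting, one would typically use a numerically stabler formulation (such as Welford's method) and, for a true moving average, also subtract the contributions of values leaving the window.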
[Figure: contextual information flowing in and out of stateful event processing.]
Even without states, the data in the events themselves provide real-time context: events readily offer information such as location, weather, or time. But decisions also need stateful information based on historical context, added to profiles from in-the-moment observations, such as what a customer spent in the last 12 months. Contextual profiles must also encompass derived context in addition to the real-time and historical contexts. Derived context is obtained by analyzing many related events over a period of time: say, how many times have we called this customer in the last month, or what is the ratio of failed transactions in a one-hour interval. Examining all three types of contextual information enables organizations to build decision solutions for use cases such as marketing, customer service, origination, fraud, or risk, and beyond.
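As a sketch of derived context, consider the failed-transaction ratio over a one-hour interval mentioned above. The class below is a hypothetical illustration using a time-based sliding window; timestamps are in seconds:

```python
from collections import deque

class FailureRatioWindow:
    """Hypothetical derived-context measure: ratio of failed transactions
    observed within a sliding one-hour time window."""
    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def observe(self, timestamp, failed):
        self.events.append((timestamp, failed))
        # Evict events that have aged out of the window.
        while self.events and timestamp - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def ratio(self):
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)

w = FailureRatioWindow()
w.observe(0, False)
w.observe(1800, True)
w.observe(4000, False)  # the event at t=0 ages out of the one-hour window
print(w.ratio())  # 0.5
```

A platform-managed version of this state would additionally handle persistence and consistency across distributed processes, which this sketch omits.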
Now, consider that the platform offers stateful APIs that allow the steps in the processing data flow to identify the entity of interest from an event and understand the entity's state through a profile that holds such contextual information. If this can be made performant, reliable, and consistent across a distributed system, solution and online algorithm developers need not waste time solving such problems themselves.
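In rough outline, such a stateful API lets a processing step look up an entity's profile by key, read and update contextual state, and move on. The store and event shapes below are assumptions for illustration; a real platform would add persistence, consistency, and distribution behind the same interface:

```python
class StateStore:
    """Hypothetical in-memory stand-in for a platform-provided stateful API."""
    def __init__(self):
        self._profiles = {}

    def get_profile(self, entity_id):
        # Create an empty profile on first access.
        return self._profiles.setdefault(entity_id, {})

    def update_profile(self, entity_id, updates):
        self.get_profile(entity_id).update(updates)

def process_event(store, event):
    # Identify the entity of interest from the event and consult its state.
    profile = store.get_profile(event["customer_id"])
    spend = profile.get("spend_12m", 0.0) + event["amount"]
    store.update_profile(event["customer_id"], {"spend_12m": spend})
    return spend

store = StateStore()
process_event(store, {"customer_id": "c1", "amount": 50.0})
total = process_event(store, {"customer_id": "c1", "amount": 25.0})
print(total)  # 75.0
```

The point of the abstraction is that `process_event` stays this simple even when the store is distributed, persistent, and consistent under failure.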
One issue in architecting a platform arises when deciding how much of the state management support should be left to the design and implementation of the solution developers, and how much should be provided out-of-the-box as stateful APIs in the underlying platform. In platforms that adopt data flow architectures based on distributed computing to increase throughput through parallelism, while offering fault tolerance and processing guarantees such as exactly-once processing, such efforts are non-trivial.
One thing I have learned at FICO is that having solution developers correctly design and implement their own state management in a distributed stream processing environment is time consuming for them and hard to get right. This argues that the support should be provided by the platform rather than left to the solution developers. It is my belief that enhancing the underlying platforms with state management will accelerate this movement toward online ML and incremental algorithms in streaming analytics.
FICO’s DMP Streaming (DMP-S) is an evolving platform that offers stateful APIs to its hosted solutions. The platform thus allows solutions to extract and analyze contextual information from events, persist that information as profiles, and improve analytics whether events arrive in real time, near time, or batches.
With this perspective, we see that state management provided in event stream processing platforms such as FICO DMP Streaming reduces the overall complexity of stateful solutions, allowing developers to focus on their designs and their use of algorithms, analytics, and models. What currently depends on several phases and different tools, technologies, and methodologies to separately analyze data, train and retrain models, and process live streams may be converging into simpler big data ecosystems. From a more distant perspective, the ecosystems behind data discovery, data mining, machine learning, online machine learning, and the use of models in real-time event filtering, cleansing, analysis, scoring, and complex event determination are slowly coalescing into effective suites of interoperable tools and execution environments. With stateful event processing platforms providing streaming analytics with contextual information and online analytics, solution developers can start at a much higher level of abstraction and reduce the number of tools and execution environments they need.
Yet as the abstraction layer rises, the burden falls on us platform architects to address performance, scalability, latency constraints, resource costs, reliability, and consistency, to name a few challenges. For more on that, stay tuned for my next blog.