Processing events faster and with better analytics is Big Data’s next milestone. In this blog, I will outline how stateful event processing adds descriptive and contextual information that enables organizations to make better business decisions.
An organization generates events. In our connected economy, billions of events flow between people, machines, and organizations. Every human-related activity, every connected machine, and every part of an organization generates events. Using big data analytics, organizations can turn these events into better decisions. The more promptly they can process these events, the more effective they become.
The events may be analyzed daily, every minute, or as they arrive. Whether they are delivered all at once or one at a time, an event processing system can increase the throughput of generated decisions using data flow pipelines. With the right architecture and sufficient resources, it is possible to process events at arbitrary volume and velocity.
However, managing state becomes an issue. Stateless systems require that all needed data be contained in the ingested events and that processing produce no side effects. This relegates any state management to peripheral systems, to be handled either before or after processing, and limits the system’s ability to perform complex business logic. A system that manages state internally can handle arbitrarily complex business logic. It should be noted, however, that this places a greater burden on the system, especially when processing is distributed and the system must remain fault tolerant and consistent.
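To make the contrast concrete, here is a minimal sketch (field names like `amount` and `customer_id` are illustrative, not tied to any particular product): a stateless handler can only look at the event in hand, while a stateful handler also consults history it has accumulated internally.

```python
def stateless_flag(event):
    # A stateless decision must rely solely on fields carried in the event.
    return event["amount"] > 1000


class StatefulFlagger:
    """A stateful decision can also draw on history kept between events."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = {}  # per-customer running event count, kept internally

    def flag(self, event):
        cid = event["customer_id"]
        self.counts[cid] = self.counts.get(cid, 0) + 1
        # Combine the event itself with the accumulated history.
        return event["amount"] > 1000 or self.counts[cid] > self.threshold
```

In a distributed deployment, `self.counts` is exactly the kind of internal state that must be partitioned, replicated, or checkpointed to stay consistent and fault tolerant.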
Imagine a point-of-sale event: A customer buys a product, and a merchant generates an event. This event associates relationships between entities that are known by the processing organization. The event will contain references to those entities as they relate to the transaction, such as the product ID and the merchant ID, along with the date of sale and the sale total. This information is necessary but hardly sufficient to make intelligent business decisions, such as: Should the customer be sent a promotional incentive? Is this transaction fraudulent? Should more items be preordered to meet demand?
Indeed, not all information is captured or known at the time of the event. This is where the notion of an entity’s state comes in. The stateful information of any entity must be deduced or inferred from information extracted from previous events, and that information must be kept by the organization. What stateful data is needed, how it is computed and stored, and where it is used depends on the solution and its processing system. In general, however, design patterns have emerged for how events, stateful data, and models are used in stream-based analytic solutions.
FICO’s DM Suite (Decision Management Suite) provides a suite of tools and processing platforms that cover the full lifecycle and scope of big data, with a particularly strong focus on enabling solutions around event processing. Let’s look at the general pattern related to stateful event processing.
Our design pattern uses a data flow paradigm to define a primary flow as a series of conceptual stages, as shown in the diagram below, to create a processing pipeline. The emerging solution may also contain side flows that handle various visualization, management, and logging needs. The seven stages of the primary flow are colored blue in the diagram. Each stage may consist of one or more consecutive tasks that perform specific transformations or analytics as each event flows through the pipeline.
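The data flow paradigm can be sketched in a few lines: each stage is a function (or chain of tasks) that takes the event’s variables, transforms them, and passes them on. The stage functions below are placeholders, not the actual DM Suite stages.

```python
def run_pipeline(event, stages):
    """Apply each stage in order; a stage may drop the event by returning None."""
    for stage in stages:
        event = stage(event)
        if event is None:  # event was filtered out mid-pipeline
            return None
    return event


# Placeholder stages standing in for the seven conceptual ones:
def tag_event(event):
    event["tag"] = "seen"
    return event


def drop_negative(event):
    # Filter out events with a negative total.
    return event if event.get("total", 0) >= 0 else None
```

Side flows for visualization or logging would branch off this primary flow rather than sit in the `stages` list.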
INGEST. The ingestion stage pulls events from data sources and injects them into the pipeline. A message containing an event is then parsed to identify and isolate a set of variables that describe the event. As the event passes through each stage, more variables are added (or removed) to provide necessary information to later stages.
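As a rough sketch of ingestion, assuming the messages arrive as JSON (the field names here are invented for illustration), parsing isolates the variables that describe the point-of-sale event:

```python
import json


def ingest(raw_message):
    """Parse a raw message and isolate the variables describing the event."""
    data = json.loads(raw_message)
    return {
        "merchant_id": data["merchant_id"],
        "product_id": data["product_id"],
        "sale_date": data["date"],
        "sale_total": float(data["total"]),
    }
```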
NORMALIZE, VALIDATE, FILTER. In stage two, a series of tasks may prepare the event variables for efficient processing downstream. This includes normalizing their values, such as mapping dates, phone numbers, and addresses to the organization’s declared canonical forms; validating those values; and filtering out events that are malformed or incomplete.
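A minimal sketch of this stage, assuming an incoming MM/DD/YYYY date and a digits-only canonical phone form (both conventions are assumptions for illustration):

```python
import re
from datetime import datetime


def normalize(event):
    # Map an assumed MM/DD/YYYY date to a canonical ISO form.
    event["sale_date"] = (
        datetime.strptime(event["sale_date"], "%m/%d/%Y").date().isoformat()
    )
    # Assumed canonical phone form: digits only.
    event["phone"] = re.sub(r"\D", "", event.get("phone", ""))
    return event


def filter_event(event):
    # Validate, then drop malformed or incomplete events.
    valid = event.get("merchant_id") and event.get("sale_total", -1) >= 0
    return event if valid else None
```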
ENRICH. The third stage locates and injects descriptive information about the entities from external sources into the pipeline. This is entity information that changes infrequently, but is necessary to understand the event. In this stage, for example, the customer ID is looked up, and information such as their age and zip code is added to the event. This enriches the event with information that will be relevant at the later decision-making stage.
DESCRIBE. In addition to an organization having descriptive information about its entities, it must also track each entity’s recent history of activity and behavior. This is the entity’s contextual information tracked in the fourth step. For instance, the number of purchases the customer has made today, or how many other customers also visited that store today. Unlike the descriptive information, the contextual information is constantly changing as events arrive. Understanding the customer’s behavior, both in its context and as it fits into a series of similar transactions, improves the effectiveness of the remaining analytics and decision-making stages.
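The two contextual examples above (purchases the customer made today, other customers who visited the store today) can be sketched as running aggregates that the stage updates on every event:

```python
from collections import defaultdict


class ContextTracker:
    """Tracks fast-changing, per-entity context across events (a sketch)."""

    def __init__(self):
        self.purchases_today = defaultdict(int)   # per customer
        self.store_visitors = defaultdict(set)    # per merchant

    def describe(self, event):
        cid, sid = event["customer_id"], event["merchant_id"]
        self.purchases_today[cid] += 1
        self.store_visitors[sid].add(cid)
        # Unlike descriptive data, these values change with every event.
        event["purchases_today"] = self.purchases_today[cid]
        event["store_visitors_today"] = len(self.store_visitors[sid])
        return event
```

In a real deployment this state would live in a store that is partitioned and fault tolerant, and the counters would be windowed so that “today” actually expires.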
PREDICT, RECOMMEND, or DECIDE. With the descriptive and contextual information, models are used to predict, recommend, or make a decision. This takes place in stages five and six which will be described in later blogs.
EMIT. The end of the processing pipeline emits the analyzed and processed events to subscribing services that may act on received predictions, recommendations, or decisions, or store the enriched events for later use. Recall that while there is usually a primary processing pipeline, auxiliary outflows and back flows may produce data for desired services such as stream visualization, error tracking, usage tracking, or model retraining.
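The emit stage amounts to fanning each processed event out to whatever services have subscribed. A minimal, illustrative publish/subscribe sketch:

```python
class Emitter:
    """Fan processed events out to subscribing services (illustrative)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        # A subscriber might act on a decision, or store the enriched event.
        self.subscribers.append(callback)

    def emit(self, event):
        for callback in self.subscribers:
            callback(event)
```

In practice the callbacks would be replaced by message-queue topics or service endpoints, with separate subscriptions for the auxiliary flows such as visualization or model retraining.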
Next in this series, I’ll continue with the roles that data and models play in stateful event processing as it occurs in these enriched data flow pipelines. I'll expand on why models are needed and how models are created, updated, and used in the streams to classify and score events, and how the processing is scaled to be performant.