Speed & Scale
EmcienPatterns is designed to deliver enterprise-grade speed — 1,000 predictions per second on a single core — and can be scaled horizontally to surpass the speed of streaming data. This performance is achieved by a unique and innovative combination of 3 strategic components:
- A knowledge graph, which compresses data and decreases traversal time
- A patented algorithm, which automates noise reduction and feature selection, and generates an optimal model
- Vertical and horizontal scaling, which ensures speed for high volume data
Each component builds on the advantages of the others, so the benefits of the combined set are far greater than the sum of the individual parts. These components are detailed below.
A Knowledge Graph Compresses Data and Decreases Data Traversal Time
Emcien’s knowledge graph delivers both data compression and decreased data traversal time, significantly boosting data analysis speeds.
Note: Like Google and Facebook, Emcien employs a smart data structure known as a graph data model or “knowledge graph.” If you’re not familiar with graph analytics, it may be helpful to learn more about it, including how it differs from graph databases.
Graphs Compress Data
On a graph, data values are represented by nodes, relationships between values are represented as edges, and the number of times a relationship occurs is represented with a weight.
For example, a customer transaction in which the customer Joe bought a green product for $100 would be represented on a graph with three nodes and three edges, with all weights set to “1.”
An example of a graph with 3 nodes, 3 edges, and weights at “1”
As thousands of additional transactions are analyzed and added to the graph, only the new and unique data values – not the repeat values – are added as nodes, expanding the size of the graph as little as possible.
New instances of a connection between two existing data values are represented with increases in the weight of edges, which does not expand the size of the graph at all.
A new customer transaction in which Beth purchases a green product for $100 expands the graph by only one new node (Beth), because the repeat data values green product and $100 are already represented. The transaction adds two new edges, connecting Beth to green product and to $100. The existing edge between $100 and green product is not duplicated; its weight simply increases to 2, as shown below:
The new transaction expands the graph by only 1 node (from 3 nodes to 4 nodes)
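The node, edge, and weight bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not Emcien's implementation: the class name `CooccurrenceGraph` and its methods are assumptions introduced here, and the sketch simply connects every pair of values that appear in the same transaction.

```python
from collections import Counter
from itertools import combinations


class CooccurrenceGraph:
    """Toy weighted graph: nodes are unique values, edge weights count co-occurrences."""

    def __init__(self):
        self.nodes = set()
        self.edge_weights = Counter()  # frozenset({a, b}) -> number of co-occurrences

    def add_transaction(self, values):
        unique_values = sorted(set(values))
        self.nodes.update(unique_values)  # repeat values add no new nodes
        for a, b in combinations(unique_values, 2):
            self.edge_weights[frozenset((a, b))] += 1  # repeat pairs only add weight


graph = CooccurrenceGraph()
graph.add_transaction(["Joe", "green product", "$100"])
nodes_after_first = len(graph.nodes)
edges_after_first = len(graph.edge_weights)

graph.add_transaction(["Beth", "green product", "$100"])
nodes_after_second = len(graph.nodes)
edges_after_second = len(graph.edge_weights)
shared_weight = graph.edge_weights[frozenset(("green product", "$100"))]

print(nodes_after_first, edges_after_first)
print(nodes_after_second, edges_after_second, shared_weight)
```

In this sketch the second transaction contributes one node and two edges for Beth, while the edge between green product and $100 keeps its place in the graph and its weight rises from 1 to 2.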
In this way, a knowledge graph compresses data to just the unique data values. Other representations of the same two transactions are less compact because they store every single instance of a data value, including repeat instances, which adds redundant data, as shown below:
A traditional representation includes redundant data unnecessarily
Because graphs represent only unique data elements, the size of data requiring analysis is reduced:
The size difference between a graph and a traditional representation grows as data is added
For example, a national retailer with 2,000 stores and 25,000 product SKUs in each store may have over 25 terabytes of sales data in table form. That sales data can be represented on a graph with 500,000 nodes, and just 5 gigabytes of data.
Ultimately, this accelerates the data analysis process considerably, because finding all the relevant connections in 5GB of data is much faster than finding those same connections in 25TB of data.
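To make the size difference concrete, the sketch below compares the number of cells in a table with the number of nodes and edges in a co-occurrence graph as rows are added. The customer, product, and price vocabularies are hypothetical and generated deterministically for illustration; they are not the retailer figures quoted above.

```python
from itertools import combinations

# Hypothetical vocabularies, chosen only for illustration.
CUSTOMERS = [f"customer_{i}" for i in range(50)]
PRODUCTS = [f"product_{i}" for i in range(20)]
PRICES = [f"${p}" for p in (10, 25, 50, 100)]


def make_transactions(n):
    """Deterministically generate n (customer, product, price) rows."""
    return [
        (CUSTOMERS[i % len(CUSTOMERS)],
         PRODUCTS[(i * 7) % len(PRODUCTS)],
         PRICES[(i * 3) % len(PRICES)])
        for i in range(n)
    ]


def table_cells(rows):
    """A table stores every value of every row, repeats included."""
    return sum(len(row) for row in rows)


def graph_size(rows):
    """A graph stores each unique value and each unique pair only once."""
    nodes, edges = set(), set()
    for row in rows:
        nodes.update(row)
        edges.update(frozenset(pair) for pair in combinations(row, 2))
    return len(nodes), len(edges)


sizes = {}
for n in (100, 1_000, 10_000):
    rows = make_transactions(n)
    sizes[n] = (table_cells(rows),) + graph_size(rows)
    cells, nodes, edges = sizes[n]
    print(f"{n:>6} rows: {cells:>6} table cells, {nodes} nodes, {edges} edges")
```

The table grows linearly with the number of rows, while the node count in this toy data set is capped by the number of distinct values (74 here) and the edge count by the number of distinct pairs.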
Graphs Decrease Data Traversal Time
Because knowledge graphs are very compact, they have the unique ability to bring related data values into close proximity with one another, making it easy to identify patterns quickly. This is how Google serves up search engine results to users so quickly in response to their searches.
Related data values may be located very far apart in a traditional representation of a large data set. For example, the first and the last sales transaction on a table with tens of thousands of rows may be connected to one another in some way, but there’s so much data to traverse between them that discovering this pattern is a very time-consuming process.
Related transactions can be very far apart on a traditional representation
Graphs index data by relatedness. This means related transactions are directly connected, which eliminates most data traversal and results in near-instantaneous information retrieval.
Related transactions are directly connected on a graph representation
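The contrast between traversing a table and following a direct connection can be sketched as follows. This is an illustrative assumption about how such an index might look, using a plain Python dictionary; the rows and the function names `related_by_scan` and `build_index` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the first and last share the value "green product".
rows = [
    ("Joe", "green product", "$100"),
    ("Ann", "red product", "$40"),
    ("Raj", "blue product", "$75"),
    ("Beth", "green product", "$100"),
]


def related_by_scan(rows, value):
    """Table-style search: inspect every row to find those containing value."""
    return [i for i, row in enumerate(rows) if value in row]


def build_index(rows):
    """Graph-style index: each value node points at the rows it appears in."""
    index = defaultdict(list)
    for i, row in enumerate(rows):
        for value in row:
            index[value].append(i)
    return index


index = build_index(rows)
scan_hits = related_by_scan(rows, "green product")  # touches every row
index_hits = index["green product"]                 # one dictionary lookup
print(scan_hits, index_hits)
```

Both approaches return the same related rows, but the scan does work proportional to the size of the table on every query, whereas the index pays that cost once when it is built.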
A Patented Algorithm Automates Noise Reduction and Feature Selection and Generates an Optimal Model
Emcien’s patented pattern extraction algorithm applies innovative mathematics to the compact knowledge graph to automate noise reduction and feature selection, and to generate an optimal model, quickening analysis and prediction.
The Algorithm Automates Noise Reduction and Feature Selection
Traditionally, zeroing in on only the most important, relevant connections in your data is hard. Data expertise, software knowledge, and time are required to produce results that are imperfect.
Emcien’s patented algorithm, powered by the latest innovations in combinatorial math, quickly and expertly finds the complex signal hiding in a sea of noise.
First, the algorithm connects all data values that co-occur.
Then, it automatically defocuses the “noise” – the myriad irrelevant connections that block or blur the signal.
For example, over 90% of grocery store transactions include milk, bread and eggs. Emcien’s algorithm immediately recognizes that flowers and lightbulbs do not connect with milk, bread and eggs.
Because Emcien can quickly identify what doesn’t matter, it can focus on the relevant connections that do matter.
Finally, it identifies and selects the important, relevant features (or variables) that should be included in the predictive model, and extracts the predictive patterns that lead to an outcome of interest.
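The patented algorithm itself is not described in this document, so the sketch below swaps in a much simpler, well-known technique to illustrate the idea of defocusing noise: a lift filter. Items that appear in almost every basket (such as milk) have a lift close to 1.0 for any outcome and are dropped, while items strongly associated with the outcome are kept. The baskets, the outcome "vase", and the threshold `min_lift` are all hypothetical.

```python
def lift(transactions, item, outcome):
    """P(outcome | item) / P(outcome); values near 1.0 carry little signal."""
    with_item = [t for t in transactions if item in t]
    p_outcome = sum(outcome in t for t in transactions) / len(transactions)
    if not with_item or p_outcome == 0:
        return 0.0
    p_outcome_given_item = sum(outcome in t for t in with_item) / len(with_item)
    return p_outcome_given_item / p_outcome


def select_features(transactions, outcome, min_lift=1.5):
    """Keep only items whose lift for the outcome reaches min_lift."""
    items = set().union(*transactions) - {outcome}
    scores = {item: lift(transactions, item, outcome) for item in items}
    return {item: score for item, score in scores.items() if score >= min_lift}


# Hypothetical baskets: milk is in every basket, so it cannot explain anything.
transactions = [
    {"milk", "bread", "flowers", "vase"},
    {"milk", "bread", "flowers", "vase"},
    {"milk", "eggs", "flowers", "vase"},
    {"milk", "bread", "flowers"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs", "lightbulbs"},
    {"milk", "bread", "lightbulbs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread"},
]

selected = select_features(transactions, outcome="vase")
print(selected)
```

On this toy data only flowers survives the filter; milk, bread, and eggs are treated as noise because buying them says almost nothing about whether a vase is also bought.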
The Algorithm Generates an Optimal Model
The patterns Emcien extracts are essentially a predictive model in the form of a set of predictive rules. The rule set represents the critical relationships between data values that predict an outcome of interest:
A rule set describes relationships between data values related to an outcome
Most rules engines create an explosion of rules that grinds data analysis to a halt.
Emcien’s rule generation methodology creates the smallest set of rules capable of high-accuracy prediction.
Emcien’s methodology achieves this by judiciously deciding which predictive rules to include in the predictive model and which to exclude. Only rules that improve prediction accuracy without duplicating intelligence provided by other rules are added to the model. Rules that don’t improve accuracy, or that duplicate existing intelligence, are not added.
The rule set is typically only about 0.01% of the size of the data set it describes, and much smaller than an iTunes song. As a result, when the software applies the predictive rule set to new data to generate predictions, the process is extremely fast.
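The inclusion criterion described above (add a rule only if it improves accuracy and does not duplicate what other rules already provide) can be illustrated with a simple greedy loop. This is a sketch under stated assumptions, not Emcien's rule generation methodology: the rule format (a set of conditions plus a predicted outcome), the first-match prediction scheme, and the sample records are all hypothetical.

```python
def predict(rules, record, default):
    """Return the outcome of the first rule whose conditions all appear in record."""
    for conditions, outcome in rules:
        if conditions <= record:
            return outcome
    return default


def accuracy(rules, data, default):
    hits = sum(predict(rules, record, default) == label for record, label in data)
    return hits / len(data)


def build_rule_set(candidates, data, default):
    """Greedily keep a candidate rule only if it strictly improves accuracy."""
    rules = []
    best = accuracy(rules, data, default)
    for rule in candidates:
        score = accuracy(rules + [rule], data, default)
        if score > best:  # redundant or unhelpful rules leave the score unchanged
            rules.append(rule)
            best = score
    return rules, best


# Hypothetical labeled records and candidate rules.
data = [
    (frozenset({"late_payment", "high_balance"}), "churn"),
    (frozenset({"late_payment"}), "churn"),
    (frozenset({"high_balance"}), "stay"),
    (frozenset({"new_customer"}), "stay"),
    (frozenset({"late_payment", "new_customer"}), "churn"),
    (frozenset({"support_call"}), "churn"),
]
candidates = [
    (frozenset({"late_payment"}), "churn"),
    (frozenset({"late_payment", "high_balance"}), "churn"),  # duplicates the rule above
    (frozenset({"new_customer"}), "stay"),                   # same as the default
    (frozenset({"support_call"}), "churn"),
]

rules, best = build_rule_set(candidates, data, default="stay")
print(len(rules), best)
```

Of the four candidates, only two are kept: the second and third add no accuracy beyond what the first rule and the default prediction already provide, so they are left out of the model.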
Vertical and Horizontal Scaling Ensures Speed for High Volume Data
EmcienPatterns leverages vertical scaling in its Analysis module and horizontal scaling in its Prediction module to accelerate critical processes and deliver even greater speed and scalability.
Vertical Scaling Speeds Data Analysis
The Analysis module analyzes historical data to create a predictive model for an outcome of interest.
The data analysis algorithm leverages parallelism through symmetric multiprocessing. The computational work required for analysis is subdivided into smaller units, with each unit assigned to a separate processor, so data can be analyzed in parallel across multiple cores.
As a result of this parallel computing, the analysis process scales vertically, increasing speed.
Then the Analysis module delivers the product of that analysis – the predictive model in the form of a predictive rule set – to the Prediction module so that it can generate predictions.
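The split-and-merge pattern behind this kind of parallel analysis can be sketched as follows. The example counts co-occurring value pairs per chunk and then merges the partial counts; the function names and sample transactions are assumptions for illustration. To keep the sketch self-contained it uses a thread pool by default; on a multi-core machine, `concurrent.futures.ProcessPoolExecutor` could be passed as `executor_cls` to spread the chunks across separate processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_pairs(chunk):
    """Count co-occurring value pairs within one chunk of transactions."""
    counts = Counter()
    for transaction in chunk:
        counts.update(combinations(sorted(set(transaction)), 2))
    return counts


def split(items, parts):
    """Divide items into at most `parts` roughly equal contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division, never zero
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_pairs_parallel(transactions, workers=4, executor_cls=ThreadPoolExecutor):
    """Assign each chunk to a worker, then merge the partial counts."""
    chunks = split(transactions, workers)
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        for partial in pool.map(count_pairs, chunks):
            total.update(partial)
    return total


# Hypothetical transactions.
transactions = [
    ("Joe", "green product", "$100"),
    ("Beth", "green product", "$100"),
    ("Ann", "red product", "$40"),
    ("Raj", "green product", "$75"),
    ("Joe", "red product", "$40"),
]

parallel_counts = count_pairs_parallel(transactions)
serial_counts = count_pairs(transactions)
print(parallel_counts == serial_counts)
```

Because pair counts from separate chunks can simply be added together, the merged result matches what a single worker would have produced over the whole data set.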
Horizontal Scaling Accelerates Predictions
The Prediction module then applies the predictive model to new data to generate predictions.
To predict on high volume data at high speed, the Prediction module can be spun out into multiple prediction instances.
Because the model generated by the algorithm is small and can be exported from the Analysis module, it can be copied and distributed to an unlimited number of these prediction instances.
Then, the new data can be split into several units and each can be fed to a prediction instance so that predictions are made on the many data units simultaneously.
This horizontal scaling ensures predictions on high volume data are generated and delivered to the user in an accelerated fashion.
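The distribution scheme described above can be sketched as follows: each prediction instance receives its own copy of the rule set, the new data is split into units, and each unit is scored by a different instance. The class `PredictionInstance`, the rule format, and the records are hypothetical. In this sketch the instances are called one after another inside a single process; in a real deployment each would run as a separate process or machine.

```python
import copy


class PredictionInstance:
    """Hypothetical stand-in for one prediction instance holding its own model copy."""

    def __init__(self, rules, default):
        self.rules = copy.deepcopy(rules)  # the model is small, so copying is cheap
        self.default = default

    def predict_batch(self, records):
        return [self._predict(record) for record in records]

    def _predict(self, record):
        for conditions, outcome in self.rules:
            if conditions <= record:
                return outcome
        return self.default


def predict_sharded(instances, records):
    """Deal records round-robin across instances, then restore the original order."""
    k = len(instances)
    shards = [records[i::k] for i in range(k)]
    outputs = [inst.predict_batch(shard) for inst, shard in zip(instances, shards)]
    results = [None] * len(records)
    for i, shard_output in enumerate(outputs):
        results[i::k] = shard_output
    return results


# Hypothetical rule set and incoming records.
rules = [
    (frozenset({"late_payment"}), "churn"),
    (frozenset({"support_call"}), "churn"),
]
records = [
    frozenset({"late_payment", "high_balance"}),
    frozenset({"new_customer"}),
    frozenset({"support_call"}),
    frozenset({"high_balance"}),
    frozenset({"late_payment"}),
]

instances = [PredictionInstance(rules, default="stay") for _ in range(3)]
sharded = predict_sharded(instances, records)
single = PredictionInstance(rules, default="stay").predict_batch(records)
print(sharded)
```

Since every instance holds an identical copy of the model and each record is scored independently, splitting the data changes how fast the predictions arrive but not what they are.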
As the volume of data increases, all entities – prediction module instances, copies of the predictive model, data set units – can also be increased.
This ensures the fast response time required for every use case is achieved, regardless of data volume.
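As a rough sizing aid, the arithmetic behind scaling out can be written down directly. The sketch below takes the figure of 1,000 predictions per second on a single core quoted at the top of this document and assumes, purely for illustration, that throughput grows linearly with the number of single-core prediction instances; real deployments may see overhead that this ignores.

```python
import math

PREDICTIONS_PER_SECOND_PER_CORE = 1_000  # figure quoted at the top of this document


def instances_needed(incoming_per_second, per_instance=PREDICTIONS_PER_SECOND_PER_CORE):
    """Smallest instance count whose combined throughput meets the incoming rate.

    Assumes linear scaling and one core per instance (an illustrative assumption).
    """
    if incoming_per_second <= 0:
        return 0
    return math.ceil(incoming_per_second / per_instance)


for rate in (800, 25_000, 25_001):
    print(rate, instances_needed(rate))
```

Under these assumptions, a stream of 25,000 records per second would call for 25 instances, and anything above that rate would call for one more.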