At the core:
A Knowledge Graph
At the core of EmcienPatterns software is a knowledge graph – an information model that directly connects related data values – shown below:
The Analysis module converts raw data to a knowledge graph before layers of algorithms are applied to identify predictive patterns. This conversion to a graph confers significant technical advantages and business benefits:
- Ingestion of many data formats and types, which maximizes the value of collected data
- Resilience to dirty data, which eliminates the need for data cleansing
- Data compression and decreased data traversal time, which help maximize speed and scale and minimize hardware requirements
- Acceptance of unlimited values, which helps minimize RAM and compute time while maximizing prediction accuracy and business impact
These advantages are explored in detail below, but first, what is a knowledge graph?
There are many ways to represent data. The most common way – and the way traditional databases represent data – is using a table with columns and rows. For example, a customer transaction where the customer Joe bought a green product for $100 would be represented in this way using a table:
On a knowledge graph, data values are represented by nodes, relationships between values are represented as edges, and the number of times a relationship occurs is represented with a weight. The same customer transaction captured in the table above can be represented using a graph in this way:
Note: Knowledge graphs – also referred to as graph representations, graph data models, and graph analytics – are often confused with graph databases because of their similar naming, but knowledge graphs and graph databases are entirely different concepts.
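To make the node/edge/weight structure concrete, here is a minimal Python sketch of a weighted graph holding the Joe/green/$100 transaction. The class and method names are hypothetical illustrations, not EmcienPatterns' actual implementation.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy weighted graph: nodes are data values, edges connect values
    that co-occur in a record, and weights count the co-occurrences."""
    def __init__(self):
        self.edges = defaultdict(int)  # frozenset({a, b}) -> weight

    def add_record(self, values):
        # Connect every pair of values appearing together in one record
        vals = list(values)
        for i, a in enumerate(vals):
            for b in vals[i + 1:]:
                self.edges[frozenset((a, b))] += 1

    @property
    def nodes(self):
        return {v for edge in self.edges for v in edge}

g = KnowledgeGraph()
g.add_record(["Joe", "green", "$100"])
print(sorted(g.nodes))  # ['$100', 'Joe', 'green']
```

Each value becomes exactly one node no matter how often it appears; a repeat co-occurrence only raises an edge weight.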
Graphs Ingest Many Data Types & Formats, Maximizing the Value of Collected Data
Organizations collect and store data in order to turn it into insight and business value. This data is diverse in type (numerical, categorical, etc.) and format (JSON, POS/transactional, wide, long, etc.).
However, much of the data collected cannot be analyzed as-is with traditional representations and analysis methods. Categorical data must first be converted to numbers before it can be analyzed alongside numerical data. POS/transactional data must first be converted to rows and columns (wide format) before it can be analyzed by standard statistical computing software.
As a result, the analysis either takes longer because conversion must first take place, or it is never performed at all.
Knowledge graphs are data-agnostic, meaning they are compatible with diverse data rather than being compatible with a single type or format of data.
This is because when data is converted to a graph representation, every data value in the data set is replaced with a unique identification symbol, or “token.” These tokens are random identifiers that carry none of the meaning of the original data values.
For example, an original data value of “Joe” becomes token “A.” The data value “I love it” becomes token “B” and the data value “Total cost $90-$110” becomes token “C” as shown below:
The primary purpose of this tokenization is to simplify data values so that EmcienPatterns can focus on the connections and patterns between values instead of the values themselves, speeding analysis and prediction processes.
However, one significant additional benefit of tokenization is that every data value is converted to a “thing.” When every data value is a “thing,” it’s the same type. This makes analysis of diverse data easy, so insight and value from the data collected can be maximized.
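A minimal sketch of such a tokenizer follows. The sequential letters A, B, C mirror the example above but are illustrative only; the tokens could be any meaningless identifiers.

```python
class Tokenizer:
    """Replace raw data values with opaque tokens. Only identity is
    preserved; the meaning of the original value is discarded."""
    def __init__(self):
        self.token_of = {}

    def token(self, value):
        if value not in self.token_of:
            # Assign the next unused letter; real tokens could be any
            # meaningless identifier.
            self.token_of[value] = chr(ord("A") + len(self.token_of))
        return self.token_of[value]

t = Tokenizer()
print(t.token("Joe"))                  # 'A'
print(t.token("I love it"))            # 'B'
print(t.token("Total cost $90-$110"))  # 'C'
print(t.token("Joe"))                  # 'A' again: same value, same token
```

Once tokenized, a number, a category label, and a free-text comment are all the same kind of “thing,” so they can sit side by side in one graph.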
Graphs Are Resilient to Dirty Data, Eliminating Data Cleansing
For many organizations, a significant percentage of their data is dirty – containing redundancies, missing values, mismatched formatting or other imperfections – because data entry and collection processes can introduce errors.
A “dirty” data set containing missing values and incorrect values in table form
Unfortunately, some algorithms won’t work if the data set contains dirty data. Therefore, all the imperfections in the dirty data must be identified and fixed through a manual, time-intensive data cleansing process before it can be analyzed.
As a result, analysts may discard dirty data from the data set to avoid that time and effort. Or the data is fully cleansed, but the manual process slows or halts the predictive analytics project for which the data is intended, lengthening time-to-value.
The knowledge graph’s data agnosticism is once again helpful, as it means the graph can accept and connect dirty data with normal, non-dirty data – including the diverse data types mentioned previously – eliminating the need for data cleansing.
When data is converted to a graph, missing values are simply not added.
Values that are present but imperfect, such as misspellings and mismatched formatting, are simply added to the graph as nodes, just like normal values:
The same dirty data set in graph form (dirty data highlighted with red nodes)
Then, when algorithms are applied to the graph to analyze the data, any low relevance values and connections – which typically includes imperfect data values – do not have significant patterns and hence do not become part of the predictive model.
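The ingestion behavior described above can be sketched as follows; `ingest` is a hypothetical helper, not part of EmcienPatterns:

```python
from collections import defaultdict

edges = defaultdict(int)  # frozenset({a, b}) -> co-occurrence weight

def ingest(record):
    """Add one record to the graph. Missing values are simply skipped;
    imperfect values (misspellings, odd formatting) become nodes
    like any other value."""
    present = [v for v in record if v not in (None, "")]
    for i, a in enumerate(present):
        for b in present[i + 1:]:
            edges[frozenset((a, b))] += 1

ingest(["Joe", None, "gren"])  # missing value dropped; misspelled "gren" kept
```

Because a value like "gren" occurs rarely, its edges keep low weights, so it never forms a significant pattern and falls out of the predictive model without any manual cleansing.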
When data cleansing is not needed, predictive analytics projects can progress and produce value unimpeded by manual data cleansing processes.
Graphs Compress Data & Decrease Data Traversal Time, Helping Deliver Enterprise-Grade Speed & Scale
The knowledge graph delivers both data compression and decreased data traversal time, significantly boosting data analysis speeds.
Graphs Compress Data
As new data is analyzed and added to the graph, only the new and unique data values – not the repeat values – are added as nodes, expanding the size of the graph as little as possible.
New instances of a connection between two existing data values are represented with increases in the weight of edges, which does not expand the size of the graph at all.
For example, a new customer transaction in which Beth purchases a green product for $100 only expands the graph by one new node – Beth – because the repeat data values green product and $100 are already represented. And, the transaction only adds one new edge between $100 and green, as shown below:
In this way, a knowledge graph compresses data to just the unique data values.
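A short sketch of this incremental update, using hypothetical names: in this fully pairwise model, adding Beth's transaction contributes one new node and Beth's own new edges, while the repeat green/$100 connection only raises an existing weight.

```python
from collections import defaultdict

edges = defaultdict(int)

def add_transaction(values):
    # Repeat values add no nodes; repeat connections only raise a weight.
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            edges[frozenset((a, b))] += 1

add_transaction(["Joe", "green", "$100"])
add_transaction(["Beth", "green", "$100"])

nodes = {v for e in edges for v in e}
print(len(nodes))                           # 4 nodes, not 6 table cells
print(edges[frozenset(("green", "$100"))])  # 2: the repeat connection
```

Two table rows hold six cells; the graph holds only the four unique values, with repetition captured in the weights.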
Other representations of the same two transactions are less compact. This is because they represent every single instance of a data value – including repeat instances – which adds redundant data unnecessarily, as shown below:
Because graphs represent only unique data elements, the size of data requiring analysis is reduced:
For example, a national retailer with 2,000 stores and 25,000 product SKUs in each store may have over 25 terabytes of sales data in table form. That sales data can be represented on a graph with 500,000 nodes, and just 5 gigabytes of data.
Ultimately, this accelerates the data analysis process considerably. This is because finding all the relevant connections in 5GB of data is much faster than finding those same connections in 25TB of data.
Graphs Decrease Data Traversal Time
Because knowledge graphs are very compact, they have the unique ability to bring related data values into close proximity with one another, making it easy to identify patterns quickly. This is how Google serves search results to users so quickly.
Related data values may be located very far apart in a traditional representation of a large data set. For example, the first and the last sales transaction on a table with tens of thousands of rows may be connected to one another in some way, but there’s so much data to traverse between them that discovering this pattern is a very time-consuming process.
Graphs index the data based on their relatedness. This means related transactions are directly connected, eliminating data traversal, and resulting in near-instantaneous information retrieval.
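One way to picture this indexing is an adjacency map, where each value points directly at its related values. This is a sketch with hypothetical names, not EmcienPatterns' internal structure:

```python
from collections import defaultdict

adjacency = defaultdict(dict)  # value -> {related value: weight}

def connect(a, b):
    adjacency[a][b] = adjacency[a].get(b, 0) + 1
    adjacency[b][a] = adjacency[b].get(a, 0) + 1

for pair in [("Joe", "green"), ("green", "$100"), ("Beth", "green")]:
    connect(*pair)

# Everything related to "green" is one lookup away; no rows to scan.
print(adjacency["green"])  # {'Joe': 1, '$100': 1, 'Beth': 1}
```

Retrieval cost depends only on how many neighbors a value has, not on how many rows the original table contained.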
Graphs Help Minimize Hardware Requirements
As described previously, knowledge graphs are supremely effective at compressing the raw data to just the unique values.
Therefore, the knowledge graph representation of a data set is small – much smaller than a more traditional table representation of the same data set.
This knowledge graph compression combines with the additional compression performed during model generation and the simplicity of the Prediction module to ensure that EmcienPatterns has minimal hardware requirements.
As a result, EmcienPatterns runs optimally on almost any commodity laptop, eliminating the cost and time of the complex, multi-cluster compute setups that many hardware-intensive analytics tools require.