Last week we met with a national retail company. They have about 5,000 brick-and-mortar locations across the country, more than 20,000 unique products, and a new loyalty program that includes a store credit card for customers.

The Business Problem

But they had one big problem: some of their customers weren’t paying their credit card bills, and that was eating a large hole in the profitability of the loyalty program. The finance team wanted a better way to see early on which customers would become delinquent and be sent to collections, so they could try interventions to maximize payments and boost profitability.

So, they did what any smart company would do: they turned to a team of data scientists and business analysts, hoping the team could use the wealth of customer data they had collected to build a predictive delinquency model. And they wanted it within a week, so they could boost their loyalty program numbers before the next quarterly report went out.

So, the company gave the data team that wealth of customer data, all in the POS/transactional format they use day to day (a format common in retail and elsewhere). And then they waited.

The Data Problem

The data team planned to use R – open-source statistical computing software widely used by data scientists and data miners – to analyze the data and build the predictive model.

R’s standard algorithm packages require that the data first be organized into rows and columns (known as “wide” format), where each unique feature (a.k.a. category or variable) like “household income” or “zip code” is its own column. So the data team needed to convert the POS data to this wide format – a pesky but simple task that can be automated with a script – before R could analyze it.
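As a rough sketch of that conversion step (in Python with pandas rather than the team’s actual R script, and with made-up column names like `customer_id`, `feature`, and `value`), a long-to-wide pivot looks like this:

```python
# Sketch: reshaping POS/transactional ("long") records into wide format,
# where each feature becomes its own column. Column names and data are
# hypothetical, for illustration only.
import pandas as pd

pos = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "feature":     ["household_income", "zip_code", "household_income", "zip_code"],
    "value":       ["75000", "30301", "52000", "30318"],
})

# One row per customer, one column per feature.
wide = pos.pivot(index="customer_id", columns="feature", values="value").reset_index()
print(wide.columns.tolist())  # customer_id, household_income, zip_code
```

A few lines like this are why the team considered the conversion pesky but routine, as long as the resulting columns stay within what the modeling tools can handle.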

Unfortunately, R – like many other common statistics and analysis tools – has a critical limitation that soon surfaced to create massive problems: many of its standard modeling packages cap the number of levels (unique values) a categorical variable can have. The popular randomForest package, for example, cannot handle categorical predictors with more than 53 levels.

The data team wanted the predictive model to take into account which store a customer purchased from, but the company had 5,000 different stores, not 50. And they wanted the model to consider what products each customer had purchased. But they had 20,000 possible products, not 50.

They had only two options – neither of them any good.

They could throw these features out entirely. But doing so would mean the predictive model might be missing critical information, making its predictions shaky and its business impact slight.

Or they could pivot the data in these particular columns, blowing each data value out into its own column. So instead of 1 “product” column with 20,000 data values, they would have 20,000 product columns – each column representing a single product. And instead of having 1 “store” column with 5,000 data values, they’d have 5,000 store columns – each column representing a single store.
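The second option – blowing each value out into its own 0/1 column – is what pandas calls one-hot encoding. A tiny sketch (store and product names here are invented) shows how quickly columns multiply:

```python
# Sketch of the "pivot" option: one-hot encoding high-cardinality columns.
# Data is made up for illustration; the real data set had 5,000 stores
# and 20,000 products.
import pandas as pd

txns = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "store":       ["store_17", "store_902", "store_17"],
    "product":     ["prod_A", "prod_B", "prod_C"],
})

# Every unique value in "store" and "product" becomes its own 0/1 column.
onehot = pd.get_dummies(txns, columns=["store", "product"])
print(onehot.shape[1])  # 6 columns even for this tiny example
```

With two stores and three products, three rows already produce six columns; with 5,000 stores and 20,000 products, the same operation yields the 25,000-plus columns described above.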

Doing so would turn a typical retail data set that originally had as many as 20 columns into an enormous data set with more than 25,000 columns. And, analyzing a 25,000-column data set is an extraordinarily difficult computational problem. The computation is so difficult that it would take an impossible amount of RAM and an intolerable amount of compute time (weeks or even months) to solve.

The Solution – A Knowledge Graph

So, the data team called us in. They thought our predictive analytics software might be able to transcend the limitations of standard statistical tools because, like Google and Facebook, our software uses a knowledge graph:



What is a knowledge graph?

A knowledge graph is a way to represent and directly connect related data. In a graph, data values are represented by nodes, relationships between values are represented as edges, and the number of times a relationship occurs is recorded as a weight on the edge.

For example, a single customer transaction in which the customer Joe bought a green product for $100 would be represented on the graph with three nodes and three edges, all with a weight of 1.
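A minimal sketch of that structure, using plain Python dictionaries (a toy model, not the vendor’s actual implementation):

```python
# Toy knowledge graph: nodes are unique data values, edges connect values
# that appear in the same transaction, and each edge carries a weight.
from itertools import combinations

nodes = set()
edges = {}  # (value_a, value_b) -> weight

def add_transaction(values):
    """Each value becomes a node; every pair of values in the same
    transaction gets an edge, with the weight counting co-occurrences."""
    nodes.update(values)
    for a, b in combinations(sorted(values), 2):
        edges[(a, b)] = edges.get((a, b), 0) + 1

# Joe bought a green product for $100: 3 nodes, 3 edges, all weights 1.
add_transaction(["Joe", "green", "$100"])
print(len(nodes), len(edges))  # 3 3
```

Sorting the pair keys just gives each undirected edge one canonical key, so the same relationship always maps to the same weight counter.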



Knowledge graphs can ingest data in its native format – JSON, wide, long, tagged, POS/transactional – so the team wouldn’t need to convert their POS-format data to wide format.

But more importantly, knowledge graphs impose no limit on the number of unique values per column in your data set. A data set with a “products” column containing 20,000 products and a “store” column containing 5,000 stores can be ingested by our software – and converted to a graph – quickly and easily.

As thousands of additional transactions are analyzed and added to the graph, only unique data values are added as new nodes, expanding the size of the graph as little as possible. Additional instances of a connection between two existing data elements are represented with increases in the weight of edges, which does not expand the size of the graph.

A purchase by Beth of a green product for $100 expands the graph by only one new node – Beth – because the repeat data elements green product and $100 are already present. It adds two new edges connecting Beth to those existing nodes, and increases the weight of the existing edge between $100 and green from 1 to 2, as shown below:



In this way, a knowledge graph compresses data to just the unique data elements, and ensures that complex computational problems can be solved with minimal RAM and compute time.
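Extending the toy dictionary sketch makes the compression concrete: Beth’s transaction adds one node and two edges, and bumps one existing weight (again, an illustrative model, not the product’s internals):

```python
# Toy model of incremental ingestion: repeated values add no nodes, and a
# repeated relationship only increases an edge weight.
from itertools import combinations

nodes, edges = set(), {}

def add_transaction(values):
    nodes.update(values)
    for a, b in combinations(sorted(values), 2):
        edges[(a, b)] = edges.get((a, b), 0) + 1

add_transaction(["Joe", "green", "$100"])   # 3 nodes, 3 edges
add_transaction(["Beth", "green", "$100"])  # +1 node (Beth), +2 edges;
                                            # the $100-green weight rises to 2

print(len(nodes), len(edges))        # 4 5
print(edges[("$100", "green")])      # 2
```

Two transactions of three values each (six raw data points) are stored as four nodes and five edges, and the node count grows only with unique values, not with transaction volume.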


The data team wondered if our software could get the predictive model built in just a week like the finance team had requested.

We uploaded their POS data in its native form to our software, EmcienPatterns. We waited a couple of minutes as the software converted the raw data to a knowledge graph, and another couple of minutes as it identified all the connections and predictive patterns and generated a predictive model.

Voila! The data team had their predictive model in under 5 minutes.

Now all they had to do was apply the predictive model to new customer data to predict which loyalty program members would default on their payments.

They uploaded their new customer data to EmcienPatterns, which took another handful of minutes to apply the model and generate a prediction – with a likelihood – for every customer with a credit card.

Now the finance team knew exactly which customers were likely to default on their credit card payments, and it took minutes – not days, not a week.

Our Point of View

Enterprise data is messy and diverse. It’s dirty, in different formats, from different systems and departments. And you should be able to use it all without having to spend all of your time cleaning it, contorting it, or shoehorning it.

That’s why we created EmcienPatterns. It’s predictive analytics software for the real world: it solves the everyday problems that make data analysis and prediction hard, costly, and time-consuming, and that stand in the way of big business value.

To learn more about the other real-world problems we’re solving, visit our product page. And stay tuned for more tales of woe from our real-world meetings with real-world businesses.


Emily Gay

Marketing Director

Emily helps companies understand how new data technologies can solve their biggest challenges. In-house and agency-side, she's spent nearly a decade helping brands use data to make smarter decisions and optimize KPIs.

How is automation transforming analysis? Download Redefining Analytics to find out.