Businesses everywhere are clamoring to capitalize on the enormous potential of predictive analytics, but a set of common real-world problems is holding them back. One of the biggest obstacles is dirty data. In fact, in almost every meeting we have with company executives, there comes a point in the conversation where they admit that their data is a mess and worry that imperfect data will impede progress.

So what is dirty data, why is it a problem, and what’s the solution?

What is dirty data?

Dirty data is any data that is imperfect: inaccurate, incomplete, or erroneous. Common problems include missing values, redundant or duplicate records, spelling and punctuation errors, and mismatched formatting.

Take CRM data, for example. Sales reps may fill in only certain fields for their accounts, leaving customer demographics, revenue figures, and other details blank. The result is empty cells or even entire blank rows. Missing values are typical of enterprise data because data entry processes and requirements are rarely perfect.

Here’s another example. One manufacturing company replaces certain fields in its ERP system with a special character after an order has been fulfilled. While this may have seemed like a good idea at the time, the result is data littered with meaningless special characters that make it hard to analyze and uncover predictive patterns.
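To make this concrete, here is a small Python sketch, using made-up records, of the kinds of imperfections described above: missing values, exact duplicates, and placeholder special characters.

```python
# Hypothetical CRM-style records showing common dirty-data problems.
records = [
    {"name": "Acme Corp", "state": "GA",      "revenue": 120000},
    {"name": "Acme Corp", "state": "GA",      "revenue": 120000},  # exact duplicate
    {"name": "Globex",    "state": "Georgia", "revenue": None},    # inconsistent format, missing value
    {"name": "Initech",   "state": None,      "revenue": "#"},     # missing value, placeholder character
]

# Count missing or placeholder cells.
dirty_cells = sum(1 for r in records for v in r.values() if v in (None, "#"))

# Count exact duplicate rows.
seen, duplicates = set(), 0
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates += 1
    seen.add(key)

print(dirty_cells, duplicates)  # prints: 3 1
```

Even in four rows there are three unusable cells and one duplicate; real enterprise tables multiply these problems by millions of rows.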

Why is it a problem?

Traditional analytical tools and methods cannot analyze dirty data. As a result, businesses must employ data specialists to cleanse the data before running an analysis. Cleansing is typically a manual, expensive, time-consuming process of fixing each imperfection: filling in missing data, standardizing formats, and correcting errors. Sometimes specialists simply throw out imperfect data, which robs the business of potential value.
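As a sketch of what that manual effort involves, here is a hypothetical cleansing pass; the field names and the normalization map are invented purely for illustration.

```python
def cleanse(records, state_map, default_revenue=0):
    """Illustrative manual cleansing: normalize formats, fill gaps, drop duplicates."""
    cleaned, seen = [], set()
    for r in records:
        name = r["name"].strip()                       # fix stray whitespace
        state = state_map.get(r["state"], r["state"])  # standardize formatting
        revenue = r["revenue"] if isinstance(r["revenue"], int) else default_revenue
        key = (name, state, revenue)
        if key not in seen:                            # drop exact duplicates
            seen.add(key)
            cleaned.append({"name": name, "state": state, "revenue": revenue})
    return cleaned

dirty = [
    {"name": "Acme Corp ", "state": "Georgia", "revenue": None},
    {"name": "Acme Corp",  "state": "GA",      "revenue": None},
]
print(cleanse(dirty, {"Georgia": "GA"}))  # both rows collapse into one clean record
```

Every rule here had to be written and maintained by hand for one specific table, which is exactly why cleansing at enterprise scale is slow and expensive.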

What’s the solution?

Because dirty data is one of the biggest obstacles to a successful predictive analytics project, Emcien set out to create software that can analyze dirty data without the need for cleansing.

Emcien’s machine learning software leverages graph analytics – a technology that’s highly resilient to dirty data – to convert every data set into a graph. The conversion process is fully automated. Here’s an illustration of the conversion process:

In graph form, your data set is organized according to how data values are connected, and missing values and all other imperfections are ignored. As a result, you can analyze your data without cleansing or throwing out imperfect data.
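The source doesn’t detail Emcien’s conversion internals, but a simple co-occurrence graph illustrates the general idea: link values that appear together in the same record, and simply skip missing or placeholder cells. All record and field names below are made up.

```python
from collections import defaultdict
from itertools import combinations

def to_graph(records, missing=frozenset({None, "", "#"})):
    """Link field values that co-occur in the same record; skip imperfect cells."""
    edges = defaultdict(int)
    for r in records:
        values = sorted(f"{k}={v}" for k, v in r.items() if v not in missing)
        for a, b in combinations(values, 2):
            edges[(a, b)] += 1   # edge weight = how often the pair co-occurs
    return edges

orders = [
    {"product": "widget", "region": "South", "channel": "#"},    # "#" cell skipped
    {"product": "widget", "region": "South", "channel": "web"},
    {"product": "gadget", "region": None,    "channel": "web"},  # missing cell skipped
]
graph = to_graph(orders)
print(graph[("product=widget", "region=South")])  # prints: 2
```

Note that an absent or junk cell just means one edge fewer; nothing has to be filled in or thrown away before the connections can be analyzed.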

And because graphs are data-agnostic, Emcien can analyze all of your data, regardless of type or source or industry, including unstructured and categorical data, so you get maximum value from the data you have.

Additionally, the graph method scales to very large data sets, as systems built on graph models at Google, Facebook, and LinkedIn have proven.

It is still worthwhile for your business to improve data quality with better processes; better data yields better results. But because of Emcien’s innovative use of graph technology, imperfect data is no longer a stumbling block to succeeding with predictive analytics. With Emcien, you can get your predictive analytics project off the ground quickly and easily.


Radhika Subramanian


Radhika helps executives adopt breakthrough technologies to outperform the competition. A serial entrepreneur, Radhika has been taking leading brands to new heights with groundbreaking applications of math and data for decades.

How is automation transforming analysis? Download Redefining Analytics to find out.