When we hear stories of analysts and data scientists, the infamous data janitor work often takes a back seat to their successes. The visible side of data analysis is successful discoveries and valuable insights, while what goes on behind the scenes stays hidden from public view. Hours, days, and even longer stretches spent on exploratory data analysis are common, and preparing data can represent as much as 80% of total analysis time.

This is in large part because statistical modeling tools like R, Python’s analytics packages, SAS, and others can be quite picky about the quality and completeness of the data they ingest. Just getting to data exploration can require extensive cleansing, so data professionals of all stripes spend a lot of time cleansing and then exploring their data before they can even begin their analysis.

Mixed, messy, and incomplete data.

This workflow means that much of the exploratory data analysis ends up being fruitless, and only in hindsight is it clear that a lot of the cleansed data wasn’t helpful for the final analysis. EmcienScan users turn this workflow on its head by automating data discovery at the very beginning of the analysis. Before they work with any raw data, a quick scan boils the data’s connections down to a concrete predictability score, shown as the data’s predictability signal.

Helping the data janitor with automated data discovery.

Users can then drill down to the data they need, using the predictability scores as a guide. With the relevant data isolated, creating a useful model or making predictions on new data becomes a much more orderly and straightforward process. For the first time, Emcien customers are able to take unknown data and, in seconds, know what that data can tell them about a given outcome. They know which data needs to be cleansed and which data needs to go into a Patterns analysis or into their model. Flipping that workflow makes the entire analytics process faster and more efficient, and with more time left for analysis, the results improve as well.
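To make the idea of scoring columns before cleansing them concrete, here is a minimal Python sketch. It is not Emcien's algorithm, which this post does not describe. It swaps in a simple stand-in: a one-rule accuracy lift, meaning how much better you can guess the outcome from a single column than by always guessing the most common outcome. The function names, the column names, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_accuracy(labels):
    """Accuracy of always guessing the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def one_rule_score(rows, column, outcome):
    """Accuracy of predicting `outcome` from `column` alone.

    For each distinct value of the column, guess the outcome that
    appears most often alongside it.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[column]][row[outcome]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)


def rank_columns(rows, outcome):
    """Rank columns by accuracy lift over the majority-class baseline."""
    baseline = majority_accuracy([r[outcome] for r in rows])
    columns = [c for c in rows[0] if c != outcome]
    scores = {c: one_rule_score(rows, c, outcome) - baseline for c in columns}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: which column says more about churn?
rows = [
    {"plan": "basic", "region": "east", "churned": "yes"},
    {"plan": "basic", "region": "west", "churned": "yes"},
    {"plan": "basic", "region": "east", "churned": "yes"},
    {"plan": "pro", "region": "west", "churned": "no"},
    {"plan": "pro", "region": "east", "churned": "no"},
    {"plan": "pro", "region": "west", "churned": "yes"},
]

ranking = rank_columns(rows, "churned")
for column, lift in ranking:
    print(f"{column}: {lift:+.3f}")
```

In this toy data, `plan` ranks above `region`, so `plan` is the column worth cleansing first. One caveat with this stand-in: a column with a unique value per row, such as an ID, scores perfectly without being useful, so a real tool would need to guard against that.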