EmcienScan’s drag-and-drop capability isn’t technologically difficult, but it’s worth pointing out exactly why the feature stands out.
The reason we’re excited about EmcienScan, and in part the drag-and-drop feature, is that you can go straight to the UCI Machine Learning Repository, Kaggle.com, or anywhere else you find open data, download large files, drag them to EmcienScan, and get real and valuable insights as fast as you can download the data.
What makes EmcienScan the most advanced data discovery tool available is its flexibility. Before EmcienScan, data discovery meant extensive data cleansing and eyeballing the data with Excel or a text tool. There was no easy way to quickly discover information about a new data set, and analysts and data scientists famously spend as much as 80% of their time cleansing data rather than gaining insights from it.
EmcienScan lets analysts cut out a lot of that tedious data cleansing by finding the predictive relationships in even particularly dirty data and identifying the portion that contains value. Scan transforms the discovery process by moving discovery ahead of cleansing, so no time goes into cleansing data that isn’t relevant to the outcome of interest.
For a quick demonstration, I can go to kaggle.com and download a data set. The public education dataset is a good example because it has so many columns. In this case, Kaggle has created a mashup of student debt and college performance data.
Download the data, find the 1.3 GB .csv file, and drag it to the Scan box. After a few minutes, Scan will alert you that the job is complete.
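For comparison, here is roughly what a first manual look at a file this size involves without a tool like Scan. This is a minimal sketch, not anything EmcienScan does internally: the function name `profile_csv` and the sample columns are hypothetical, and it only streams the file once to count rows and measure how empty each column is.

```python
import csv
import io

def profile_csv(lines):
    """Stream a CSV and return (row_count, {column: fraction of empty cells}).

    `lines` is any iterable of text lines, such as an open file, so the
    whole file never has to fit in memory.
    """
    reader = csv.DictReader(lines)
    empty = {name: 0 for name in reader.fieldnames or []}
    rows = 0
    for row in reader:
        rows += 1
        for name in empty:
            # Short rows yield None for missing cells; treat those as empty too.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    fractions = {name: (count / rows if rows else 0.0) for name, count in empty.items()}
    return rows, fractions

# Tiny in-memory stand-in for a real download (hypothetical columns).
sample = io.StringIO("veteran,debt,notes\nyes,1000,\nno,,\n")
row_count, empty_fraction = profile_csv(sample)
```

With a real file you would pass an open file handle instead, for example `open(path, newline="")`, where `path` is wherever you saved the download. Even then, this only tells you which columns are sparse, not which ones matter.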
At first glance, predictability doesn’t look great:
Click on the Scan and you’ll get a deeper view. Immediately, the scan has identified 62% of the columns in our data as disconnected. In this case we (Kaggle) have collected a lot of data that isn’t likely to help us much.
But for this data set we already have a specific objective: to understand more about veteran students. In the Predictable Columns section we can see that the column “veteran” is highly predictable. This is a great indicator because we’re likely to be able to successfully analyze veterans’ success without cleansing and prepping the entire data set.
Clicking the column shows us, in order of predictive strength, every column that relates to our “veteran” category.
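To make the idea of ranking columns by how strongly they relate to a target more concrete, here is a small sketch that scores each column by its mutual information with a chosen column. This is an assumption for illustration only: we are not describing how EmcienScan computes predictive strength, and the names `mutual_information`, `rank_columns`, and the sample columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi

def rank_columns(rows, target):
    """Return [(column, score)] for every column except `target`, strongest first.

    `rows` is a list of dicts that all share the same keys.
    """
    ys = [row[target] for row in rows]
    scores = [
        (col, mutual_information([row[col] for row in rows], ys))
        for col in rows[0]
        if col != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy data: "gi_bill" tracks "veteran" exactly, "campus" is unrelated.
rows = [
    {"veteran": "yes", "gi_bill": "y", "campus": "A"},
    {"veteran": "yes", "gi_bill": "y", "campus": "B"},
    {"veteran": "no", "gi_bill": "n", "campus": "A"},
    {"veteran": "no", "gi_bill": "n", "campus": "B"},
]
ranked = rank_columns(rows, "veteran")
```

On the toy data, `gi_bill` ranks first and `campus` scores zero. On real data a raw score like this tends to favor columns with many distinct values, which is one reason a sketch like this is no substitute for a purpose-built tool.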
Without any data querying or data science skills, we’ve narrowed a data set of over 1,000 columns down to the 411 columns that relate to veteran students, all in about five minutes.
Have something to share on data discovery or have your own preferred method? Let us know.
EmcienScan can also connect directly to a long list of data sources for automatic and recurring data scans. We’ll follow up with more detail, but for now this is best summed up by Adrian Lane, Security Strategist at Securosis:
“When you have a production database schema with 40,000 tables, most of which are undocumented by the developers who created them, finding information within a single database is cumbersome. Now multiply that problem across financial, HR, business processing, testing, and decision support databases—and you have a big mess.”