There are two broad categories of data – numerical and categorical. Numerical data is data that is measurable, such as time, height, weight, amount, and so on. Categorical data is data that is divided into groups. Examples of categorical variables are race, sex, age group, and educational level.
While numerical data is easy to relate to – it is how we measure the natural world – it is usually more difficult to handle from a predictive modeling point of view, because of the sheer number of ways in which it can be treated.
Let’s take a very simple example – if you want to analyze sports participation by gender, you can simply look at the percentage of males and females playing sports. Now, what if you want to analyze participation by age? How many possible ways can you think of to analyze this? It’s not obvious. Should it be analyzed by individual age or by age groups? If by age groups, then how should the ages be grouped? Which grouping will yield the best results?
AI and ML algorithms are fairly naive by themselves and cannot work out of the box on raw data. Hence, engineering meaningful features from raw data – features that these algorithms can understand and consume – is of the utmost importance.
Emcien’s software manages the entire process of creating meaningful variables from your raw data in a fully automated way. This is typically a very labor-intensive manual step; automating it enables an end-to-end analytics pipeline.
EmcienPatterns’s Analysis module ingests raw data and then automatically “bins” numerical data values into the most predictive ranges of values before converting the data to a knowledge graph.
Data binning – also known as “bucketing,” “bucketizing,” “quantization,” and “banding” – is a data pre-processing technique that the software uses to maximize the predictive power of a data set and thereby heighten the business impact of predictions.
To illustrate data binning, here is an example data set containing the ages and education levels of a population, which could be used to predict outcomes like income level:
The Name column contains categorical values, but both Age and Years of Education columns contain numerical values.
The software “bins,” or converts, each numerical value in the Age column (29, 51, 73, 55, 34, 19, 24, 45, 61) into a value range (19-29, 34-51, 55-73), also known as a “bin,” “band,” or “bucket”:
Simultaneously, the software bins each numerical value in the Years of Education column (16, 15, 10, 10, 18, 13, 16, 16, 17) into these value ranges (13-14, 15-16, 17-18):
This is the data set after the software has binned all numerical values:
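The Age binning above can be sketched in a few lines of code. This is a minimal illustration that hard-codes the bin edges shown in the example; the `bin_age` helper is hypothetical and not part of the product:

```python
def bin_age(age):
    """Map an age to one of the example bins (edges are assumed from the article)."""
    if age <= 29:
        return "19-29"
    elif age <= 51:
        return "34-51"
    else:
        return "55-73"

# The Age values from the example data set
ages = [29, 51, 73, 55, 34, 19, 24, 45, 61]
binned = [bin_age(a) for a in ages]
print(binned)
```

In the product these edges are learned automatically; here they are supplied by hand purely to show what the transformation looks like.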
The Software Determines the Optimal Binning Method
These value ranges might appear random, but they are not. The software bins the numerical values into the value ranges that maximize predictive power.
To determine which value ranges will maximize predictive power, the software first trials every binning method on the data and then applies the optimal method to each variable, so that the results yield the highest predictive accuracy for the selected outcome.
Below are the binning methods the software trials and can apply.
When using the equal frequency method, the software creates bins that contain approximately the same number of instances. The number of instances that fall into a band is the population of the band.
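A minimal sketch of equal-frequency binning (the `equal_frequency_bins` helper is hypothetical, not the vendor’s implementation): sort the values, then split them into k slices of roughly equal population:

```python
def equal_frequency_bins(values, k):
    """Split sorted values into k bins with approximately equal populations."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic spreads any remainder across the bins
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

ages = [29, 51, 73, 55, 34, 19, 24, 45, 61]
for band in equal_frequency_bins(ages, 3):
    print(band)  # each band holds 3 of the 9 values
```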
The software divides the range of the variable into equal width bands when it’s using the equal width binning method. The bands may then have very different populations.
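Equal-width binning can be sketched just as briefly (again a hypothetical helper, not the product’s code): divide the span from the minimum to the maximum into k bands of identical width:

```python
def equal_width_edges(values, k):
    """Return k+1 edges dividing [min, max] into k equal-width bands."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

ages = [29, 51, 73, 55, 34, 19, 24, 45, 61]
print(equal_width_edges(ages, 3))  # → [19.0, 37.0, 55.0, 73.0]
```

Note that with these edges the band populations differ, unlike in the equal-frequency case.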
In the statistical banding method, the software computes the mean (mu) and standard deviation (sigma) of the variable. It then creates bands like: [mu-2*sigma, mu-sigma), [mu-sigma, mu), [mu, mu+sigma), [mu+sigma, mu+2*sigma].
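A sketch of the statistical banding edges using the standard library (the helper is hypothetical; whether the software uses the population or sample standard deviation is not stated, so the population form is assumed here):

```python
import statistics

def statistical_band_edges(values):
    """Band edges at mu - 2*sigma, mu - sigma, mu, mu + sigma, mu + 2*sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std dev (an assumption)
    return [mu + i * sigma for i in (-2, -1, 0, 1, 2)]

ages = [29, 51, 73, 55, 34, 19, 24, 45, 61]
print(statistical_band_edges(ages))
```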
Using the unique value binning method, the software creates a bin for each unique value of the variable. This is only viable if there are not too many unique values. If values of a variable are sorted, then the unique values appear as runs of the same value.
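Unique-value binning is straightforward to illustrate with a counter over the Years of Education values from the example (this is an illustration, not the product’s code); each distinct value becomes its own bin, and the count is that bin’s population:

```python
from collections import Counter

years = [16, 15, 10, 10, 18, 13, 16, 16, 17]
bins = Counter(years)  # value -> population of its bin
print(sorted(bins.items()))
```

With only six distinct values this is viable; for a variable with thousands of distinct values it would not be.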
Information banding is an algorithm that finds the banding with the maximum possible mutual information. Mutual information is a measure of how much one random variable tells us about another. For example, knowing a person’s age can tell you something about their income level.
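Mutual information between two discrete variables can be computed from joint counts; the sketch below uses the standard textbook formula (it is not the vendor’s code, and the `mutual_information` name is ours):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits, from the joint and marginal counts of two discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), rearranged to count form
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # identical variables → 1.0 bit
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent variables → 0.0
```

A banding that preserves more of this quantity between the binned variable and the outcome is more predictive.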
After trying the alternative banding methods for a numeric variable X, a meta-algorithm selects the banding method that yields the greatest mutual information between the resulting bands for X and the dependent variable Y.
EmcienPatterns then applies another meta-algorithm to select a subset of the independent variables to include in the model, according to their mutual information with the dependent variable Y. This is an automated process of feature selection.
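A simple way to picture MI-based feature selection is to rank each candidate feature by its mutual information with the outcome and keep those above a threshold. This is an assumed sketch of the general idea, not the product’s meta-algorithm, and the threshold value is arbitrary:

```python
import math
from collections import Counter

def mi(xs, ys):
    """Mutual information in bits between two discrete variables."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Toy data: feature "a" predicts y perfectly, feature "b" is independent of y
y = [0, 1, 0, 1, 0, 1]
features = {
    "a": [0, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 0],
}

# Keep features whose MI with the outcome exceeds an arbitrary threshold
selected = [name for name, col in features.items() if mi(col, y) > 0.1]
print(selected)  # → ['a']
```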
One of the most important criteria for the success of AI and ML algorithms is the set of features selected. EmcienPatterns delivers consistent results by automating the feature selection process.