Machine Learning Algorithms

The Current Landscape

There are many statistical procedures and AI/machine learning algorithms now available as packages in the open-source R ecosystem. These include decision trees, gradient boosting, regression, logistic regression, neural networks, and many others.

This has led several companies to provide platforms on which data scientists can easily mix and match these R packages to build custom predictive models.

More recently, advances in this sector aim to eliminate manual effort, and the need for data science skills, through “automated machine learning.” The idea is to have the software take a data set and run every appropriate algorithm simultaneously, in parallel. The one that gets the best result, on some criterion, is declared the winner. The goal is to produce a model without human intervention: one click, data in, model out.

But many solutions still require some manual data modeling and manual model re-builds/updates when the patterns in the data change. What’s more, the computational complexity is burdensome and time-consuming.

Emcien’s Difference: Many Algorithms

Emcien’s automated machine learning includes parsing the data, feature selection, model selection, model building, analysis, and finally generating results that the business can act on. All of this is accomplished as an end-to-end data factory that is completely automated. Emcien uses a sequence of algorithms for the many intermediate steps. The software selects the best algorithms, all of which are designed to work together to deliver answers with the highest accuracy.

Data Binning Algorithms

Emcien’s algorithms are not from the R library, because the basic mathematical approach is different. The difference between Emcien and other approaches is that all numeric data is “binned,” or converted into the numeric value ranges that maximize the predictive power of the data set.

For example, the numeric variable age is converted to the set of bands: [0_to_13), [13_to_21), [21_to_40), [40_to_65), [65+]. This means that all of the “atomic” data consists of tokens (character strings), not numbers.

On the other hand, classical methods convert categorical data into numbers. A gender variable {M,F} becomes a zero/one variable where 0 means M and 1 means F. An American state variable with 50 values in {AL,…,WY} becomes 50 zero/one variables: X(AL)=1 if the state is Alabama or 0 otherwise, and so forth to X(WY)=1 if the state is Wyoming or 0 otherwise. A ZIP code variable would turn into hundreds of zero/one variables.
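As a minimal sketch of the classical one-hot encoding just described (not Emcien's approach, and using only the Python standard library), each categorical value becomes a set of zero/one indicator variables:

```python
def one_hot(values, categories):
    """Map each categorical value to a dict of 0/1 indicator variables,
    one indicator per category, as in the state example above."""
    return [{f"X({c})": int(v == c) for c in categories} for v in values]

# Two of the 50 state categories, for brevity.
states = ["AL", "WY", "AL"]
encoded = one_hot(states, ["AL", "WY"])
# encoded[0] -> {"X(AL)": 1, "X(WY)": 0}
```

With 50 states this produces 50 indicator columns per row, which is exactly the variable explosion the passage describes for ZIP codes.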

To determine which value ranges maximize predictive power, the software first trials a variety of binning methods on the data and then applies whichever method delivered the highest predictive accuracy for the selected outcome. The binning algorithm also delivers the first level of feature selection; more advanced feature selection is done later in the process.

The binning algorithms the software trials and can apply include: Equal Frequency, Equal Width, Statistical Banding, Unique Values, and Information Banding.
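To illustrate two of the methods named above, here is a sketch of equal-width binning (bins of equal numeric span) and equal-frequency binning (bins holding roughly equal counts), with numbers converted to band tokens. The cut points and token format are illustrative assumptions, not Emcien's actual output:

```python
def equal_width_edges(values, k):
    """Split the range [min, max] into k bins of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equal_frequency_edges(values, k):
    """Pick cut points so each bin holds roughly the same number of values."""
    s = sorted(values)
    return [s[int(i * (len(s) - 1) / k)] for i in range(k + 1)]

def to_token(x, edges):
    """Convert a number to a band token such as '21_to_40'."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= x < hi:
            return f"{lo:g}_to_{hi:g}"
    return f"{edges[-2]:g}_to_{edges[-1]:g}"  # top edge falls in the last band

ages = [3, 15, 22, 35, 44, 70]
edges = equal_width_edges(ages, 3)
tokens = [to_token(a, edges) for a in ages]
```

After this step every numeric value has become a character-string token, which is the property the next section relies on.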

Algorithm Parses Data and Generates the Knowledge Graph

Once the numeric variables have been binned, all of the atomic data elements are character strings or tokens, which we will refer to as items. Now suppose we have a data set that is a set of observations or events, each of which is a set of items.

The next step is to construct a co-occurrence knowledge graph. This graph has a node (circle) for each item and an arc (line) for each pair of items that appear together in at least one observation. Each arc has a numeric weight that is the number of observations in which that pair of items appear together.
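The co-occurrence counting described above can be sketched in a few lines of standard-library Python (a naive version, without the sparse-matrix machinery or pruning rules mentioned next):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(observations):
    """Build arc weights: for each pair of items, count the observations
    in which both items appear together."""
    arcs = Counter()
    for obs in observations:
        for pair in combinations(sorted(set(obs)), 2):
            arcs[pair] += 1
    return arcs

obs = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
graph = cooccurrence_graph(obs)
# graph[("a", "b")] -> 2, graph[("a", "c")] -> 1, graph[("b", "c")] -> 2
```

The nodes are the items themselves; any pair absent from the counter has no arc.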

Emcien’s software applies a sparse-matrix algorithm for producing a representation of the knowledge graph from the data set. This algorithm has been extended with rules for leaving out arcs that will not be useful for making predictions and/or recommendations.

Several Clustering Algorithms

A cluster is a set of items. Each item is a cluster of size 1, referred to as a 1-cluster. Each arc in the knowledge graph corresponds to a 2-cluster. A 3-cluster, say {a, b, c}, exists if {a, b}, {a, c}, and {b, c} are all 2-clusters, and if a, b, and c all appear together in at least one observation.

Apriori Algorithm

Emcien uses a version of the Apriori algorithm to generate clusters of size 3 and greater.

There may be an astronomical number of clusters of size 3 and greater, so Emcien uses several algorithms to filter out the noise and prune clusters that are not helpful.
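The 3-cluster condition defined earlier can be checked directly. This sketch generates 3-clusters the Apriori way, keeping a triple only if all three of its 2-item subsets are arcs and the triple co-occurs in at least one observation; it is a brute-force illustration, not Emcien's pruned implementation:

```python
from itertools import combinations

def three_clusters(pairs, observations):
    """pairs: set of sorted 2-tuples (the 2-clusters / arcs).
    Return all triples whose three pairs are arcs and which
    appear together in at least one observation."""
    items = sorted({i for p in pairs for i in p})
    result = []
    for trio in combinations(items, 3):
        subsets_ok = all(tuple(sorted(p)) in pairs
                         for p in combinations(trio, 2))
        observed = any(set(trio) <= set(obs) for obs in observations)
        if subsets_ok and observed:
            result.append(trio)
    return result

pairs = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}
obs = [{"a", "b", "c"}, {"b", "d"}]
# three_clusters(pairs, obs) -> [("a", "b", "c")]
```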

Conditional Probability Algorithm

The software applies a conditional probability algorithm to compute the probability of each dependent outcome (value of Y) given a particular cluster.
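A minimal sketch of that computation, assuming the straightforward empirical estimate of P(Y = y | cluster) over the observations that contain the cluster (the source does not specify the exact estimator):

```python
from collections import Counter

def outcome_probabilities(cluster, observations, outcomes):
    """Estimate P(Y = y | cluster) from the observations
    that contain every item in the cluster."""
    counts = Counter(y for obs, y in zip(observations, outcomes)
                     if set(cluster) <= set(obs))
    total = sum(counts.values())
    return {y: counts[y] / total for y in counts} if total else {}

obs = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
ys = ["YES", "NO", "YES", "NO"]
# outcome_probabilities(("a", "b"), obs, ys) -> {"YES": 0.5, "NO": 0.5}
```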

Bayesian Scoring Algorithm

Emcien also uses a Bayesian scoring algorithm to assign a numeric score to each dependent outcome, for each cluster. These scores are based on the conditional probabilities.

Mutual Information Algorithm

The software applies a mutual information algorithm to assign a number to each cluster, based on how much it tells us about the dependent variable, Y.
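One plausible reading, sketched here as an assumption: treat "the cluster is present in the observation" as a binary variable C and compute the standard mutual information I(C; Y) in bits. A cluster that perfectly separates the outcomes scores 1 bit; an uninformative cluster scores 0:

```python
from collections import Counter
from math import log2

def mutual_information(cluster, observations, outcomes):
    """I(C; Y) where C = 1 if the observation contains the cluster."""
    n = len(observations)
    joint = Counter((set(cluster) <= set(obs), y)
                    for obs, y in zip(observations, outcomes))
    pc = Counter(c for c, _ in joint.elements())  # marginal over C
    py = Counter(y for _, y in joint.elements())  # marginal over Y
    mi = 0.0
    for (c, y), k in joint.items():
        p_cy = k / n
        mi += p_cy * log2(p_cy / ((pc[c] / n) * (py[y] / n)))
    return mi
```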

Loss Function Algorithm

Emcien also uses a loss function algorithm to assign a number to each cluster based on how well it works with other clusters to create a representation of the whole data set.

Pruning Algorithm

Finally, the software uses a pruning algorithm to combine mutual information and loss to decide which clusters to keep and which to discard.

When the software arrives at a final set of clusters, each cluster has an initial score for each outcome. We can then explore the space of possible scores in order to further reduce our loss function. This can be done by Steepest Descent, Conjugate Gradients, or any other nonlinear optimization algorithm.
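The score-refinement step can be sketched as plain steepest descent. The quadratic loss below is purely illustrative (the source does not specify Emcien's loss function); the point is the update rule, step against the gradient:

```python
def steepest_descent(scores, grad, lr=0.1, steps=100):
    """Repeatedly step each score against its gradient to reduce the loss."""
    s = list(scores)
    for _ in range(steps):
        g = grad(s)
        s = [si - lr * gi for si, gi in zip(s, g)]
    return s

# Illustrative loss: sum((s_i - t_i)^2), whose gradient is 2 * (s_i - t_i).
targets = [1.0, -2.0, 0.5]
grad = lambda s: [2 * (si - ti) for si, ti in zip(s, targets)]
refined = steepest_descent([0.0, 0.0, 0.0], grad)
# refined converges toward [1.0, -2.0, 0.5]
```

Conjugate gradients or any other nonlinear optimizer could replace the inner loop, as the text notes.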

Algorithms to Generate Predictive Rules

The clusters become rules. For example, suppose Y has 2 outcomes, {NO, YES}. The 3-cluster {a, b, c} becomes a rule of the form:

If a new observation has items a, b, and c, then
Y = NO with probability p0, and
Y = YES with probability p1,

…where p0 and p1 are the conditional probabilities computed for cluster {a, b, c}.

Each surviving cluster becomes a rule. The complete set of rules is our representation of the data set. We use these rules to make predictions about future observations.

Predict it. What Happens Next?

The final set of predictive rules is the predictive model. This model is typically very small, about half the size of an iTunes song. The small size of the predictive model makes it very suitable for deploying at the edge.

When a new observation arrives, many of the rules may be satisfied. If an observation is an instance of a rule, we say that the rule “fires” or “lights up.” If thousands of rules light up, how do we make a final prediction? More algorithms are applied:

Scoring Algorithm

An algorithm converts the conditional probabilities into scores.

Combination Algorithm

A second algorithm combines the scores from all of the rules that light up.

Voting Algorithm

Finally, a voting algorithm picks the winner: the outcome with the biggest combined score.
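The three steps above can be sketched together. Additive score combination is an assumption here (the source does not say how scores are merged); the voting step, highest combined score wins, is as described. Returning the fired rules alongside the winner is what makes the prediction explainable:

```python
def predict(observation, rules):
    """rules: list of (items, {outcome: score}) pairs.
    A rule 'lights up' when all of its items are in the observation;
    scores from fired rules are summed per outcome, and the outcome
    with the largest total wins."""
    fired = [r for r in rules if set(r[0]) <= set(observation)]
    totals = {}
    for _, scores in fired:
        for outcome, s in scores.items():
            totals[outcome] = totals.get(outcome, 0.0) + s
    winner = max(totals, key=totals.get) if totals else None
    return winner, fired

rules = [
    (("a", "b"), {"YES": 0.8, "NO": 0.2}),
    (("c",), {"YES": 0.1, "NO": 0.9}),
]
# predict({"a", "b", "c"}, rules) -> ("NO", [both rules fired])
```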

When a prediction is made for a new observation, we can display the rules that lit up and their importance in making the final prediction. The predictions are explainable, or interpretable. This is not a “black box,” unlike neural networks and other methods that cannot explain their output.        