When an analyst or data scientist builds a prediction model, selecting which features to include is critical to making accurate predictions. This can be a long process, since it typically means evaluating features independently and weighing how much each contributes to the outcome. Additionally, while including more data creates the potential for greater accuracy, it also increases the complexity of the model. Finding the middle ground between complexity and accuracy can be time-consuming.
With approaches like regression modeling, much of this complexity is simply discarded. Invariably, the process relies on human intuition to decide what is included in the model. While intuition is valuable, starting from a cognitive bias about how we think or feel the data interacts means potentially valuable data points get tossed out of consideration.
Unfortunately, considering the interaction effects variables have on one another is a cumbersome and labor-intensive task. It can be difficult to work backwards and deconstruct the different combinations of relationships, but hiding within these complex interactions are the truly predictive properties of the data. Without a process for sorting through these combinations, most analysts are left tweaking weights and overfitting to eke out slightly more value from the data.
But why is extracting this extra meaning difficult?
- It is difficult to completely remove personal bias from feature selection
- Knowing which features come together to form valuable interaction variables is nearly impossible without extensive experimentation
- Combining features is computationally difficult and time-consuming with traditional querying methods
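To get a sense of the scale behind that last point, a quick back-of-the-envelope count (the feature count of 30 here is an arbitrary example, not from any particular data set) shows how fast the candidate interactions pile up:

```python
from math import comb

# Number of candidate interaction variables that can be formed from a
# modest feature set: every pair, triple, and quadruple of base features.
n_features = 30
for k in range(2, 5):
    print(f"{k}-way combinations of {n_features} features: {comb(n_features, k)}")

# Even restricting attention to pairs and triples, 30 features yield
# 435 + 4060 = 4495 candidate interactions to evaluate by hand.
```

Evaluating thousands of candidate combinations one query at a time is exactly the kind of work that overwhelms a manual, intuition-driven process.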
Thankfully, there are machine learning approaches that are uniquely suited to identifying complex interactions in data. Employing a computational engine to crawl these connections, we can let the algorithms measure the interactions across data points.
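A minimal illustration of why an algorithmic pass over interactions matters, using a synthetic XOR-style data set (this is a teaching sketch, not any vendor's algorithm): neither feature predicts the outcome on its own, but the pair together predicts it perfectly.

```python
# Toy data set illustrating an interaction effect: the target is 1 exactly
# when features a and b disagree (XOR). Neither feature alone predicts it.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(25)]

def accuracy(predict):
    """Fraction of rows where the given prediction function is correct."""
    return sum(predict(a, b) == y for a, b, y in rows) / len(rows)

best_single = max(
    accuracy(lambda a, b: a),       # predict from feature a alone
    accuracy(lambda a, b: b),       # predict from feature b alone
)
joint = accuracy(lambda a, b: a ^ b)  # predict from the interaction

print(best_single)  # 0.5 -- each feature alone is no better than chance
print(joint)        # 1.0 -- the pair together predicts perfectly
```

A feature-by-feature screen would discard both columns as useless; only a method that measures interactions across data points surfaces their combined predictive power.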
To illustrate, imagine a data scientist has been tasked with modeling a customer's annual spend on his or her credit card, in order to optimize offers and better understand the customer base. The first step would be to run the data set through EmcienScan to see which data points should or should not be considered for analysis:
EmcienScan indicates that many of the variables within the data set should be considered for analysis, but that a few of the columns should not, as they are weakly correlated and may add noise. The data scientist could take this information and begin finding the correlations these variables have with customer spend, but that would take time and wouldn't account for the interactions the variables have with one another. EmcienPatterns makes this easy by finding all of the interactions within the data set and outputting them as predictive rules:
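EmcienPatterns' internals aren't public, but the shape of such conjunctive rules can be sketched with a naive exhaustive search over a toy data set. Everything below (the column names, values, and spend bands) is invented for illustration only:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical customer records: attributes -> annual spend band.
# All names and values here are made up for the sake of the sketch.
records = [
    ({"tier": "gold",  "channel": "online", "region": "east"}, "high"),
    ({"tier": "gold",  "channel": "online", "region": "west"}, "high"),
    ({"tier": "gold",  "channel": "retail", "region": "east"}, "low"),
    ({"tier": "basic", "channel": "online", "region": "east"}, "low"),
    ({"tier": "basic", "channel": "retail", "region": "west"}, "low"),
    ({"tier": "gold",  "channel": "online", "region": "east"}, "high"),
]

def mine_rules(records, min_confidence=1.0):
    """Enumerate every 2-condition conjunction and keep those whose
    majority spend band meets the confidence threshold."""
    stats = defaultdict(lambda: defaultdict(int))  # conditions -> label counts
    for attrs, label in records:
        for pair in combinations(sorted(attrs.items()), 2):
            stats[pair][label] += 1
    rules = []
    for conds, labels in stats.items():
        total = sum(labels.values())
        label, count = max(labels.items(), key=lambda kv: kv[1])
        if count / total >= min_confidence:
            rules.append((conds, label, count))
    return rules

for conds, label, support in mine_rules(records):
    cond_text = " AND ".join(f"{k}={v}" for k, v in conds)
    print(f"IF {cond_text} THEN spend={label} (support={support})")
```

On this toy data the search surfaces rules like "IF channel=online AND tier=gold THEN spend=high", an interaction that no single-column correlation screen would rank highly on its own. A production system would, of course, prune this search rather than enumerate every combination.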
The result is a list of highly targeted, complex rules that, when used together, capture the analyzed data set with much greater accuracy than stats-based approaches.