The Emcien real-time prediction engine makes it easy to generate fast and accurate predictions. This article covers everything from the process of automatically analyzing data to applying detected signatures to new data in real time.
How the real-time prediction engine works
Historical data is analyzed automatically by Emcien, extracting predictive signatures from the data. The real-time prediction engine then applies those signatures to incoming data streams.
Exercise: Predicting Diabetes for Patients
We will demonstrate the capabilities of the real-time prediction engine using a public dataset that contains patient metrics and a diagnosis of Diabetic or Not Diabetic. This example assumes you already know how to upload a dataset for analysis. It walks you through the steps to apply Emcien's automatically generated signatures to predict diabetes.
Getting Started
- Download the diabetes data files.
- Load the banded train and test data sets into Emcien. Check out the Loading Your Data article for more information.
- Begin a new analysis by clicking Analyze Data on the Emcien home page.
- On the Analyze Data page, select the sample diabetes data set you just uploaded.
- To predict the outcome of the Diabetes column in the data set, type Diabetes in the Prediction Category field.
- Click Start Analysis.
- When the analysis is complete, View Results is displayed.
- Click View Results. The Emcien dashboard is displayed for the sample data set.
Making Predictions
After analyzing the sample data set, you can begin making predictions using new data.
- Click the Predictions heading in the upper navigation. The Predictions page is displayed.
- On the Predictions page, click Predict at the top of the page. The New Prediction page is displayed.
- On the New Prediction page, select the sample diabetes data set and click Predict.
- The Emcien real-time prediction engine will then predict the outcome of the Diabetes column for each transaction. The predictions are made using the predictive patterns from Emcien’s initial analysis.
- When finished, the Predictions Details page is displayed. The Predictions Details page contains a report summarizing the prediction results.
Hover over the headers and the software will explain each metric. The table below contains a summary of those explanations.
Metric | Description |
---|---|
Accuracy | Percent of the time a transaction is correctly classified as either having or not having that specific Outcome Item. (True Pos + True Neg) / Total Predictions |
True Positive (True Pos) | Outcome Item correctly predicted. For example, Outcome Item is ‘Sick’ and sick people were correctly identified as sick. |
False Positive (False Pos) | Outcome Item was predicted, but should not have been. For example, Outcome Item is ‘Sick’ and healthy people were incorrectly identified as sick. |
Positive Accuracy (Pos Acc) | Percentage of transactions that were correctly classified as having this Outcome Item. True Pos / (True Pos + False Pos) |
True Negative (True Neg) | Outcome Item was not predicted, and should not have been. For example, Outcome Item is ‘Sick’ and healthy people were not identified as sick. |
False Negative (False Neg) | Outcome Item was not predicted, but should have been. For example, Outcome Item is ‘Sick’ and sick people were identified as healthy. |
Negative Accuracy (Neg Acc) | Percentage of transactions that were not classified as having this Outcome Item and did not have the Outcome Item. True Neg / (True Neg + False Neg) |
Transaction numbers (Trans #) | Total number of transactions in the Prediction file that contained this Outcome Item. |
Predicted numbers (Prdct #) | Total number of predictions made for this Outcome Item. |
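
The formulas in the table can be checked by hand with a few lines of Python. The sketch below is illustrative only: the class and function names are hypothetical, the counts are made up, and it assumes that every transaction receives a prediction, so that Total Predictions is the sum of the four counts. It also reads Trans # as True Pos + False Neg and Prdct # as True Pos + False Pos, which is one interpretation of the descriptions above rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class OutcomeCounts:
    """Confusion counts for one Outcome Item (hypothetical helper type)."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(c: OutcomeCounts) -> dict:
    """Apply the formulas from the metrics table to one set of counts."""
    # Assumption: every transaction gets a prediction, so the total is the sum.
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    return {
        "accuracy": safe_ratio(c.true_pos + c.true_neg, total),
        "pos_acc": safe_ratio(c.true_pos, c.true_pos + c.false_pos),
        "neg_acc": safe_ratio(c.true_neg, c.true_neg + c.false_neg),
        # Assumption: transactions that actually contained the Outcome Item.
        "trans_count": c.true_pos + c.false_neg,
        # Assumption: transactions predicted to contain the Outcome Item.
        "predicted_count": c.true_pos + c.false_pos,
    }


# Made-up counts for illustration; these are not results from the diabetes data.
metrics = summarize(OutcomeCounts(true_pos=40, false_pos=10, true_neg=45, false_neg=5))
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With these made-up counts, Accuracy works out to 85 / 100, Pos Acc to 40 / 50, and Neg Acc to 45 / 50.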
When the test data does not include the known outcome values, a much simpler table is shown that reports how many times each outcome was predicted. Without the known outcomes, however, no accuracy information is available.
Below the summary is a series of links to the real-time prediction engine's output files. When the real-time prediction engine is used on streaming data, these files can drive a downstream system or be saved to a database for further action.
The available files include:
File | Description |
---|---|
Summary.txt | The same data shown in the summary on the results screen, including the accuracy figures. |
Results.csv | The row-by-row predictions, along with the top 5 reasons for each. Reasons are the rules (or clusters) from Emcien that matched the new test row being predicted. |
Results.txt | Intended for those who want to understand the details of how the real-time prediction engine made each prediction. |
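
As a rough illustration of how a downstream system might consume Results.csv, the sketch below reads rows with Python's standard `csv` module and tallies how many times each outcome was predicted. The header names (`Basket number`, `Predicted outcome`) and the sample rows are assumptions made for this example; check the headers in your own Results.csv before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for Results.csv. The headers and rows are invented
# for illustration and may not match the file Emcien actually produces.
SAMPLE = """Basket number,Predicted outcome
1,Diabetic
2,Not Diabetic
3,Diabetic
"""


def tally_predictions(handle) -> Counter:
    """Count how many times each outcome value was predicted."""
    reader = csv.DictReader(handle)
    return Counter(row["Predicted outcome"] for row in reader)


counts = tally_predictions(io.StringIO(SAMPLE))
for outcome, n in counts.most_common():
    print(f"{outcome}: {n}")
```

To run the same tally against a real file, open it with `open("Results.csv", newline="")` and pass the file handle to `tally_predictions`.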
The Results.csv file provides the predictions and the reasons for each prediction. Note that the rules are presented in order of predictive strength and that all of the rules collectively are needed to maximize the software's predictive capabilities.
The following is a breakdown of each column within the Results.csv file:
Column | Description |
---|---|
Basket number | The identifier that matches the prediction to a specific transaction in the data set being tested. |
Predicted outcome | The outcome predicted by the real-time prediction engine for that transaction. |
Actual outcome | The actual outcome as it appears in the test data set. |
Confidence | A metric used by the real-time prediction engine to determine how well the matching rules described the tested data. It is primarily an internal metric for the engine; it is not the equivalent of statistical confidence and is not intended to be used as a statistical metric. |
Winning Score | Highest score, based on the probability and frequency for all the rules that matched the row. For details on the other scores for an outcome, see the Results.txt file. This score is primarily an internal mechanism for the real-time prediction engine to identify the best prediction. |
Winning matches | The number of rules that matched that transaction. |
Reasons | The top five rules that provided the best score to make the prediction. Each reason has one or more items that together had a high conditional probability for the predicted outcome value. The format of a reason includes the reason score, the conditional probability, the coverage, and the matched item(s), for example: (Reason Score: 0.08 = Cond Prob: 0.92 x Coverage: 0.3) Plasma glucose concentration::[0.0_to_96.0] |
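
If you need to process reasons programmatically, a reason string can be split into its parts with a regular expression. The sketch below is written against the single example reason shown in this article, so the pattern, the `Reason` type, and the `parse_reason` function are all assumptions rather than a documented format. It reads the three numbers exactly as printed and does not recompute the score, and reasons containing several items may need additional handling.

```python
import re
from typing import NamedTuple


class Reason(NamedTuple):
    """Parts of one reason string (hypothetical helper type)."""
    score: float
    cond_prob: float
    coverage: float
    items: str  # raw item text, for example "column::[band]"


# Pattern inferred from the one example in this article; it is an assumption.
REASON_RE = re.compile(
    r"\(Reason Score:\s*(?P<score>[\d.]+)\s*=\s*"
    r"Cond Prob:\s*(?P<prob>[\d.]+)\s*x\s*"
    r"Coverage:\s*(?P<cov>[\d.]+)\)\s*(?P<items>.+)"
)


def parse_reason(text: str) -> Reason:
    """Split a reason string into its numeric fields and item text."""
    match = REASON_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized reason format: {text!r}")
    return Reason(
        score=float(match["score"]),
        cond_prob=float(match["prob"]),
        coverage=float(match["cov"]),
        items=match["items"].strip(),
    )


example = (
    "(Reason Score: 0.08 = Cond Prob: 0.92 x Coverage: 0.3) "
    "Plasma glucose concentration::[0.0_to_96.0]"
)
reason = parse_reason(example)

# In the example, "::" separates the column name from the value band.
column, _, band = reason.items.partition("::")
print(reason.score, reason.cond_prob, reason.coverage)
print(column, band)
```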