4. Evaluate model

The Evaluate section of the Design and Evaluate worksheet looks like this:


The starting point: The Confusion Matrix

Whenever a predictive model is developed under the Design & Evaluate view, using any of the available methods, its performance is automatically displayed on the right in the form of a 2 x 2 truth table, known as a Confusion Matrix, as shown above.

The number displayed in each cell is the number of cases (e.g. projects) that fall into that category.

  • In the TP (True Positive) cell are all the cases where the model attributes are present and the expected outcome is also present.
  • In the FP (False Positive) cell are all the cases where the model attributes are present but the expected outcome is not present.
  • In the FN (False Negative) cell are all the cases where the model attributes are not present but the expected outcome is present.
  • In the TN (True Negative) cell are all the cases where the model attributes are not present and the expected outcome is also not present.

Another way of viewing the results is in the form of two overlapping sets of cases: (a) those with the model attributes (TP & FP) and (b) those with the outcome of interest (TP & FN). The overlap between the two sets is the TP cases. Outside of these two sets is a third set of cases, which have neither the model attributes nor the expected outcome (TN).
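The four cells can be counted directly from case data. The following is a minimal sketch, not EvalC3's own implementation: it assumes each case has already been reduced to a pair of 0/1 flags (model attributes present, outcome present), and the function name and example data are hypothetical.

```python
from collections import Counter


def confusion_matrix(cases):
    """Count TP, FP, FN and TN from (model_present, outcome_present) pairs."""
    counts = Counter()
    for model_present, outcome_present in cases:
        if model_present and outcome_present:
            counts["TP"] += 1
        elif model_present:
            counts["FP"] += 1
        elif outcome_present:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    # Always return all four cells, even when a count is zero.
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}


# Hypothetical data: one tuple per case (e.g. per project).
example_cases = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
matrix = confusion_matrix(example_cases)
print(matrix)
```

In this example the set of cases with the model attributes is TP + FP = 3, and the set with the outcome is TP + FN = 3.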


For more background information see

Model status

Below the Confusion Matrix are some descriptions of the model.

The first of these is a table telling us whether the attributes in the model are Sufficient and/or Necessary for the outcome. Whether an attribute, or a configuration of attributes, is Necessary and/or Sufficient for an outcome to be present (or absent) can be read off the Confusion Matrix by checking for the following patterns:

  • Where outcome is present then attributes are …
    • Sufficient but not Necessary (Sn) if FP = 0
    • Necessary but not Sufficient (Ns) if FN = 0
    • Necessary and Sufficient (NS) if FP = 0 & FN = 0
    • Neither Necessary nor Sufficient (ns) if TP > 0, FP > 0, TN > 0, FN > 0
  • Where outcome is absent then attributes are …
    • Sufficient but not Necessary (Sn) if FN = 0
    • Necessary but not Sufficient (Ns) if FP = 0
    • Necessary and Sufficient (NS) if FP = 0 & FN = 0
    • Neither Necessary nor Sufficient (ns) if TP > 0, FP > 0, TN > 0, FN > 0
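The patterns above can be expressed as a small lookup. This is a sketch only: the function name and the `outcome_present` flag are hypothetical, and it simplifies the "neither" case to "both FP and FN are non-zero" without guarding against degenerate matrices (for example, one with no TP cases at all).

```python
def model_status(tp, fp, fn, tn, outcome_present=True):
    """Return 'NS', 'Sn', 'Ns' or 'ns' following the patterns listed above.

    tp and tn are accepted for completeness; the listed Sufficient and
    Necessary patterns depend only on FP and FN.
    """
    if not outcome_present:
        # When the absence of the outcome is of interest,
        # the roles of FP and FN are swapped.
        fp, fn = fn, fp
    if fp == 0 and fn == 0:
        return "NS"  # Necessary and Sufficient
    if fp == 0:
        return "Sn"  # Sufficient but not Necessary
    if fn == 0:
        return "Ns"  # Necessary but not Sufficient
    return "ns"      # Neither Necessary nor Sufficient


print(model_status(tp=5, fp=0, fn=3, tn=4))
print(model_status(tp=5, fp=0, fn=3, tn=4, outcome_present=False))
```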

For more background information see  http://en.wikipedia.org/wiki/Necessity_and_sufficiency

There are two other descriptions of the model in this section:

  • Simplicity: The proportion of all available attributes that are used in the model
  • Support: The proportion of all cases that have the model attributes
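Both descriptions are simple proportions. A minimal sketch, assuming that "cases that have the model attributes" means the TP and FP cells together; the function names are hypothetical.

```python
def simplicity(attributes_in_model, attributes_available):
    """Proportion of all available attributes that are used in the model."""
    return attributes_in_model / attributes_available


def support(tp, fp, fn, tn):
    """Proportion of all cases that have the model attributes (TP + FP)."""
    return (tp + fp) / (tp + fp + fn + tn)


# Hypothetical example: a 2-attribute model drawn from 10 available attributes.
print(simplicity(2, 10))
print(support(tp=2, fp=1, fn=1, tn=2))
```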

Model performance

Overall measures

The first section lists a number of overall performance measures. These measure, in different ways, the extent to which the model has maximised the number of TPs and TNs and minimised the number of FPs and FNs.

  • Accuracy: The proportion of all cases which are True Positives or True Negatives, i.e. (TP+TN)/(TP+FP+FN+TN). This is the default performance measure to use. However, Accuracy is not a good measure to use when the Prevalence of the outcome is relatively small or relatively large. In these cases the Accuracy measure gives too much weight to the column with the more prevalent outcome.
  • Balanced accuracy: This takes into account both the prevalence of the outcome and the prevalence of the absence of the outcome: ((TP/(TP+FN)) + (TN/(TN+FP)))/2. This performance measure should be used when the presence of the outcome is either very common or very uncommon.
  • Gini Index: This measure is used in Decision Tree algorithms as an alternative to Accuracy. It is a measure of inequality in the distribution of cases across all four categories. It is perhaps not immediately relevant, but data mining packages like Rapid Miner provide this measure alongside Accuracy.
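To see why the choice between Accuracy and Balanced accuracy matters, here is a short sketch using the two formulas above. The function names and the example counts are hypothetical, and the code assumes both outcome columns contain at least one case.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all cases that are True Positives or True Negatives."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the true positive rate and the true negative rate."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Hypothetical data set where the outcome is rare (10 of 100 cases).
rare = dict(tp=1, fp=0, fn=9, tn=90)
print(round(accuracy(**rare), 2), round(balanced_accuracy(**rare), 2))
```

With these counts the model misses 9 of the 10 cases where the outcome is present, yet Accuracy is 0.91, whereas Balanced accuracy is 0.55.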

The next two measures try to capture good performance in the form of minimised numbers of both FPs and FNs, rather than just one or the other. In QCA terms, they measure the extent to which Coverage and Consistency have both been improved by a model, rather than just one or the other.

For more information see https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

Specific measures

  • True positive rate / Sensitivity / Recall / Coverage (QCA): The proportion of all cases with the outcome present that are correctly identified by the model. More is better!
  • Positive Predictive Value / Precision / Consistency (QCA): The proportion of True Positives among all the cases where all the attributes of the model are present. More is better.

Sometimes it may be preferable to optimize just one of these rather than both at once (optimizing both is done, for example, via the F1 score above).
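As an illustration, the two specific measures and a combined score can be computed from the Confusion Matrix counts as follows. This is a sketch under assumptions: the F1 score is taken in its conventional form (the harmonic mean of Precision and Recall), which is not confirmed here as EvalC3's exact calculation, and the function names are hypothetical. Denominators are assumed to be non-zero.

```python
def recall(tp, fn):
    """True positive rate / Sensitivity / Coverage (QCA)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Positive Predictive Value / Consistency (QCA)."""
    return tp / (tp + fp)


def f1_score(tp, fp, fn):
    """Conventional F1: harmonic mean of Precision and Recall (assumed)."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts.
tp, fp, fn = 6, 2, 4
print(recall(tp, fn), precision(tp, fp), round(f1_score(tp, fp, fn), 3))
```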

Relative measures

  • Lift: The ratio of two percentages: the percentage of correct positive classifications made by the model to the percentage of actual positive classifications in the test data
  • Null error rate: This is how often you would be wrong if you always predicted the majority class
  • Likelihood ratio+: The likelihood that the outcome would be expected in a case with the model attributes compared to the likelihood that that outcome would be expected in a case without the model attributes
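The relative measures can be sketched in the same way. Assumptions to note: Lift is computed here as Precision divided by Prevalence, and Likelihood ratio+ uses the conventional definition (true positive rate divided by false positive rate), which may not match EvalC3's calculation exactly. The function names and counts are hypothetical, and denominators are assumed to be non-zero.

```python
def lift(tp, fp, fn, tn):
    """Precision of the model relative to the prevalence of the outcome."""
    precision = tp / (tp + fp)
    prevalence = (tp + fn) / (tp + fp + fn + tn)
    return precision / prevalence


def null_error_rate(tp, fp, fn, tn):
    """Error rate from always predicting the majority class."""
    outcome_present = tp + fn
    outcome_absent = fp + tn
    return min(outcome_present, outcome_absent) / (outcome_present + outcome_absent)


def likelihood_ratio_positive(tp, fp, fn, tn):
    """Conventional LR+: true positive rate / false positive rate (assumed)."""
    true_positive_rate = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return true_positive_rate / false_positive_rate


# Hypothetical counts.
counts = dict(tp=6, fp=2, fn=4, tn=8)
print(lift(**counts), null_error_rate(**counts))
print(round(likelihood_ratio_positive(**counts), 2))
```

A Lift above 1 means the model identifies cases with the outcome more often than the base rate alone would.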

Data profile

  • % of unique cases: The proportion of all cases with unique configurations of attributes. It is a measure of the diversity present within the set of cases.
  • % of all possible unique cases: The number of cases with unique configurations, as a proportion of the total number of configurations possible given the number of the attributes (n) in the data set (which equals 2 to the power of n). The higher the proportion the better, because the performance of a predictive model is less likely to be challenged by the arrival of new cases with new configurations of attributes. This can be seen as a measure of probable external validity.
  • Prevalence of outcome: The proportion of all cases where the outcome is present. This could influence the choice of performance measure, e.g. Accuracy or Balanced accuracy.
  • % of missing data: The percentage of all cells in a data set that have no values. See below for more on this.
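The data profile measures can be approximated with a few lines of code. This sketch makes several assumptions that are not confirmed for EvalC3: "unique cases" is read as the number of distinct attribute configurations, missing values are marked with `None` and compared as 0, and the outcome column is counted among the cells when calculating missing data. The function name and data are hypothetical, and results are returned as proportions between 0 and 1.

```python
def data_profile(attribute_rows, outcomes):
    """Summarise a data set of binary attributes and a binary outcome."""
    n_cases = len(attribute_rows)
    n_attributes = len(attribute_rows[0])
    # Distinct configurations, with missing values compared as absent (0).
    configurations = {tuple(v or 0 for v in row) for row in attribute_rows}
    all_cells = [v for row in attribute_rows for v in row] + list(outcomes)
    return {
        "unique_cases": len(configurations) / n_cases,
        # 2 ** n is the number of configurations possible with n attributes.
        "possible_unique_cases": len(configurations) / 2 ** n_attributes,
        "prevalence": sum(1 for o in outcomes if o == 1) / n_cases,
        "missing": sum(1 for v in all_cells if v is None) / len(all_cells),
    }


# Hypothetical data: 4 cases, 3 attributes, None marks a missing value.
rows = [[1, 0, 1], [1, 0, 1], [0, 1, None], [0, 0, 0]]
outcomes = [1, 1, 0, None]
profile = data_profile(rows, outcomes)
print(profile)
```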

Interpretation of results

Two points to note:

  1. It is possible that more than one model (i.e. configuration of attributes) will produce the same level of performance on one or more of the above measures, including the same numbers of cases distributed across the four cells of the Confusion Matrix. This is more likely when the number of attributes is large relative to the number of cases.
    1. These alternate models can be discovered by trying both exhaustive and evolutionary searches, and by manually tweaking the models produced by both methods.
    2. See the Reviewing Models page for advice on how to make choices between these models.
  2. When an existing model is manually tweaked, it is possible that performance will be only marginally improved or reduced. This highlights that it is not a black-and-white world out there, where things either work or don’t work. This is a “feature not a bug” because it suggests that experimentation with project design is not necessarily a high cost “either win or lose” proposition.

Missing data

This is how EvalC3 treats missing data:

  1. If a case has no data on whether the outcome of interest is present (1), the outcome is treated as absent (0). The case will then be counted as either a False Positive or a True Negative. The same applies if it is the absence of the outcome which is of interest.
  2. If a case has no data on whether an attribute that is part of a predictive model is present (1), the attribute is treated as absent (0). The case will then be counted as either a False Negative or a True Negative. The same applies if it is the absence of the outcome which is of interest.
  3. Where a case has no data on either the outcome or the attributes that form part of the predictive model, the case will be counted as a True Negative.

The net result is that the measured performance of a prediction model constructed using a data set with missing data is likely to be a conservative estimate, i.e. at the low end of its likely performance.
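The three rules can be illustrated with a short sketch. It is not EvalC3's own code: it assumes a case is a dictionary of 0/1 values in which a missing value is either `None` or simply not recorded, and that the model requires all of its attributes to be present. The function name, key names and example cases are hypothetical.

```python
def classify_case(case, model_attributes, outcome_key="outcome"):
    """Assign a case to TP, FP, FN or TN, treating missing values as absent (0)."""
    # case.get() returns None for unrecorded values, which never equals 1,
    # so missing attributes and missing outcomes both count as absent.
    model_present = all(case.get(name) == 1 for name in model_attributes)
    outcome_present = case.get(outcome_key) == 1
    if model_present:
        return "TP" if outcome_present else "FP"
    return "FN" if outcome_present else "TN"


model = ["a", "b"]  # hypothetical two-attribute model

# Rule 1: missing outcome, attributes present.
print(classify_case({"a": 1, "b": 1, "outcome": None}, model))
# Rule 2: missing attribute, outcome present.
print(classify_case({"a": 1, "b": None, "outcome": 1}, model))
# Rule 3: both the outcome and a model attribute are missing.
print(classify_case({"a": None, "b": 1}, model))
```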
