Analysis sequence

Warning, this page has been frequently re-edited 🙂

There are two main stages:

  1. Manual testing of pre-existing hypotheses. This is done by entering the attributes hypothesized to be important into the Design menu and observing the model’s performance in the Confusion Matrix.
    • Make sure you save any models you value.
  2. Algorithmic search for better models, as described in detail below.
    • All of these models are saved automatically.

In both stages the overall aim is to find a model whose attributes maximise the number of True Positives (TPs) and True Negatives (TNs) and minimise the number of False Positives (FPs) and False Negatives (FNs). The relevant model performance measure here is Accuracy, i.e. (TP + TN) / (TP + TN + FP + FN). Note, however, that there are nuances you will often want to explore, relating to the proportion of False Negatives to False Positives. For more, see the Model Performance section.
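As a rough illustration of the Accuracy measure described above, here is a minimal Python sketch. The function name `accuracy` and the counts in the example are hypothetical and chosen only to show the arithmetic; they are not taken from the tool or from any real data set.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("The Confusion Matrix contains no cases.")
    return (tp + tn) / total


# Hypothetical Confusion Matrix with 25 cases in total.
example = accuracy(tp=12, tn=10, fp=2, fn=1)
print(round(example, 2))  # 22 correct out of 25 cases -> 0.88
```

Note that the same Accuracy can come from quite different mixes of False Positives and False Negatives, which is why the split between the two is worth inspecting separately.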

Re algorithmic search

The best strategy may depend on the objective of the analysis.

  • When trying to understand what has happened, as part of a research or evaluation exercise, we may need to find a number of models which, as a set, do the best job of accounting for all the outcomes.
  • When trying to work out what best to do next, a much less comprehensive analysis may be all that is needed. We just need to find one or more models which seem to work well, and which we can have some confidence in.

The advice below is oriented towards finding a smaller number of models that best account for the outcome of interest.

  1. Start by searching for one or more attributes which are Necessary and Sufficient. These are by definition unambiguous and essential, so they need to be found if they exist. But bear in mind that they are uncommon.
    • Click on the Necessary and Sufficient button in the Explore section, then use the first or third algorithm. Use the first if your data set is small or you have plenty of time. Otherwise, use the third.
  2. Then search for Necessary but Insufficient attributes, using the button of the same name. These attributes are necessary for the outcome, but not sufficient by themselves.
  3. Then search for Sufficient but Unnecessary attributes, using the button of the same name. A model of this kind is one way of achieving the outcome, which will work, but it is not the only way. The search algorithms will try to find the best Sufficient model, i.e. the one with the largest coverage (fewest False Negatives).
  4. Then search for one or more attributes which may be Unnecessary and Insufficient, but which are still a good predictor of the outcome. Use the “Most predictive of any kind” button.
    • Bear in mind that a model with only 1 False Negative and 1 False Positive will have a higher Accuracy than a Sufficient model with 5 False Negatives.
      • Which of these two kinds of model is preferred will depend, in part at least, on the acceptability of having any False Positives at all. A surgeon would want zero, but a gambler would typically tolerate a proportion of False Positives.
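The four kinds of model named in the steps above, and the Accuracy comparison in step 4, can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes a model counts as Sufficient when it has no False Positives and as Necessary when it has no False Negatives, and the function names and the 30-case data set are hypothetical.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def model_kind(fp: int, fn: int) -> str:
    """Label a model from its error counts (assumed definitions, see lead-in)."""
    sufficient = fp == 0  # assumption: no False Positives
    necessary = fn == 0   # assumption: no False Negatives
    if necessary and sufficient:
        return "Necessary and Sufficient"
    if necessary:
        return "Necessary but Insufficient"
    if sufficient:
        return "Sufficient but Unnecessary"
    return "Unnecessary and Insufficient"


# Hypothetical data set: 30 cases, 15 of which have the outcome.
model_a = dict(tp=14, tn=14, fp=1, fn=1)  # 1 False Positive, 1 False Negative
model_b = dict(tp=10, tn=15, fp=0, fn=5)  # Sufficient, but 5 False Negatives

for name, m in (("A", model_a), ("B", model_b)):
    print(name, model_kind(m["fp"], m["fn"]), round(accuracy(**m), 3))
# A Unnecessary and Insufficient 0.933
# B Sufficient but Unnecessary 0.833
```

In this made-up example, model A has the higher Accuracy, but model B is the only one with no False Positives, so the choice between them depends on how costly a False Positive is.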

When well-performing models have been identified, consider doing a simple sensitivity analysis of each model.

Then proceed to within-case investigations, after identifying relevant cases from the View Cases worksheet using the guidance provided on Selecting Cases.

A pdf copy of this web page is available here: Analysis sequence

