Testing models with new data

In predictive analytics (a subset of data mining methods in the more general sense), it is good practice to test predictive models against new data, that is, a data set that was not used to develop the model. The data used to develop the model is called the “training” data set, and the new data is the “test” data set.

To do this, the data set needs to be large enough to split into two sections, one for training and another for testing. [There are various methods for training and testing a model with the same data set, e.g. k-fold cross-validation, but they are too technically demanding for EvalC3.]

The best time to do this may well be when EvalC3 is struggling to deal with the size of a data set that you are trying to use and throws up “Out of memory” error messages. This happened to me recently when I tried loading data on 129 attributes of 400+ cases, i.e. >50,000 data points.

The simplest way to do this is to assign a random number (0 or 1) to each case, then sort the cases into these two groups. To check whether the training and test data sets are comparable (which they probably will be), you could try developing a predictive model of which cases belong to which group. Ideally, the best model you develop will have around 50% balanced accuracy. In other words, it will be hard to find any attributes that consistently predict membership of one of the two groups.
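For readers who prefer to prepare the split outside EvalC3, the random assignment and the comparability check can be sketched in Python. This is only an illustration: the function names `assign_groups` and `balanced_accuracy` are made up for this example, and the check shown here tests one binary attribute at a time rather than running a full model search.

```python
import random

def assign_groups(n_cases, seed=0):
    """Give each case a random 0 or 1 (hypothetical helper, not part of EvalC3)."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    return [rng.randint(0, 1) for _ in range(n_cases)]

def balanced_accuracy(predicted, actual):
    """Average of sensitivity and specificity for two lists of 0/1 values."""
    positives = sum(1 for a in actual if a == 1)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both groups must contain at least one case")
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    true_neg = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return (true_pos / positives + true_neg / negatives) / 2

# Example: 400 cases, each assigned to group 0 or group 1 at random.
groups = assign_groups(400)

# Treat a single binary attribute as a "model" of group membership.
# Here the attribute is made-up illustrative data, not real case data.
attribute = [random.Random(1).randint(0, 1) for _ in range(400)]
score = balanced_accuracy(attribute, groups)
# A score close to 0.5 suggests the attribute does not distinguish the groups.
# A score far from 0.5 in either direction would suggest that it does.
```

Note that random 0/1 assignment produces two groups of roughly equal size.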

Views on the ideal size of each data set vary and depend partly on how comparable the two data sets are. Two-thirds training and one-third test are commonly used proportions.
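If you want the two-thirds and one-third proportions rather than a roughly even split, one option is to shuffle the case identifiers and cut the list at the chosen point. The sketch below assumes cases are identified by the numbers 1 to 400; `split_cases` and its parameters are hypothetical names used only for this example.

```python
import random

def split_cases(case_ids, train_fraction=2 / 3, seed=0):
    """Shuffle case IDs and split them into training and test lists."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example with 400 cases numbered 1 to 400 (an assumption for illustration).
train_ids, test_ids = split_cases(range(1, 401))
print(len(train_ids), len(test_ids))
```

The two lists of identifiers can then be used to filter the original spreadsheet into a training file and a test file.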

Test data sets are different from control groups (as used in experimental designs). Test data sets are expected to have cases with and without the outcome, whereas control data sets, where an intervention was not present, are expected not to have the outcome.


