Testing models with new data

[Updated 2018 10 07] In the field of predictive analytics (a subset of data mining methods in the more general sense), it is common good practice to test predictive models against new data, i.e. a data set that was not used as the basis for developing the model. The data used to develop the model and the data used to test it are typically called the “training” and “test” datasets.

Why doesn’t EvalC3 address this issue?

Some may notice that there is no provision within EvalC3 for this sort of case separation. There is a reason. Testing a model on a test data set is important if you want to generalize and use your model in new settings. This is typically the case with many commercial applications of predictive modeling, e.g. being able to find likely loan defaulters among new clients.

But EvalC3 was designed with a different set of users in mind: those engaged, one way or another, with development aid programmes. These often have small rather than big data sets, and external validity may not always be the top priority. Internal validity may be more important, i.e. working out what is going on within the existing (and often small) data set.

How can you do this, if you want to?

There are simple and complex ways of doing this. A large dataset can be split into two sections, one for training and another for testing purposes. Commonly used proportions are two-thirds of cases in the training data set and one-third in the test data set. Cases need to be assigned randomly, e.g. by giving every case a random number of 1, 2 or 3, then assigning the cases numbered 1 and 2 to the training set and those numbered 3 to the test set.
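
The random assignment described above can be sketched in a few lines of Python. This is only an illustration, not part of EvalC3: the function name `split_cases` and the `seed` parameter are assumptions made for the example. Because each case gets its own random number, the split will be close to two-thirds and one-third, but not exactly so.

```python
import random


def split_cases(cases, seed=0):
    """Randomly assign cases to a training set and a test set.

    Each case is given a random number 1, 2 or 3. Cases numbered
    1 or 2 go into the training set, cases numbered 3 into the test set.
    `seed` is an assumption added here so the split can be repeated.
    """
    rng = random.Random(seed)
    training, test = [], []
    for case in cases:
        label = rng.randint(1, 3)  # 1, 2 or 3, each equally likely
        if label == 3:
            test.append(case)
        else:
            training.append(case)
    return training, test


# Hypothetical example: 30 cases identified by the numbers 0..29
training, test = split_cases(list(range(30)), seed=1)
print(len(training), "training cases;", len(test), "test cases")
```

Every case ends up in exactly one of the two sets, so no case is used both to develop the model and to test it.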

When datasets are small, other more complex methods can be used. These go by the generic name of cross-validation. Basically, a small part of the data set is withheld as a test set while the model is developed on the rest. The model is tested on the withheld part, which is then returned to the data set and replaced by another small part, and so on. There are many variants of this practice. You need other tools, like Rapid Miner Studio, to do this kind of testing. Rapid Miner Studio is free and module based, so you don’t need to know how to code. It can use the same kind of data set as used in EvalC3.
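
For readers who do code, one common variant of this practice (often called k-fold cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not what Rapid Miner Studio or EvalC3 does internally: the function name `k_fold_splits` and the parameters `k` and `seed` are made up for the example.

```python
import random


def k_fold_splits(cases, k=5, seed=0):
    """Yield (training, test) pairs for k-fold cross-validation.

    The cases are shuffled and dealt into k parts ("folds"). Each fold
    is withheld once as the test set while the other folds together
    form the training set.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled cases into k folds of (nearly) equal size
    folds = [shuffled[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield training, test_fold


# Hypothetical example: 10 cases, each withheld once across 5 rounds
for training, test in k_fold_splits(list(range(10)), k=5, seed=1):
    print("train on", len(training), "cases; test on", sorted(test))
```

In each round a model would be developed on `training` and checked against `test`. The results are then usually summarised across all rounds.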
