Usable data

The current version

A data set has to have the following structure:

  • Rows = cases, such as individual projects, households or people
  • Columns = aspects of those cases, which include
    • At least one ID column, uniquely identifying each row of data
    • At least one Outcome measure
    • Multiple Attributes of the cases, that may or may not be good predictors of the outcome, by themselves or in combinations with others

The cells contain binary data. Here the values of 0 and 1 are used to code the absence or presence of an attribute or outcome.
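As an illustration of the structure described above, the following is a minimal sketch of a check you could run on a data set before importing it. It is not part of EvalC3; the column names, the sample rows and the `check_structure` helper are all hypothetical, and `None` is used here to stand for a missing value.

```python
# Hypothetical data set: one ID column, one outcome column, several attributes.
rows = [
    {"id": "P1", "outcome": 1, "attr_a": 1, "attr_b": 0},
    {"id": "P2", "outcome": 0, "attr_a": 0, "attr_b": 1},
    {"id": "P3", "outcome": 1, "attr_a": 1, "attr_b": 1},
]


def check_structure(rows, id_col, outcome_col):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []

    # Every row needs its own ID value.
    ids = [row[id_col] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("ID values are not unique")

    # There must be an outcome column and more than one attribute column.
    columns = list(rows[0].keys())
    if outcome_col not in columns:
        problems.append("no outcome column")
    attribute_cols = [c for c in columns if c not in (id_col, outcome_col)]
    if len(attribute_cols) < 2:
        problems.append("fewer than two attribute columns")

    # Every cell outside the ID column must be 0, 1 or missing (None).
    for row in rows:
        for col, value in row.items():
            if col != id_col and value not in (0, 1, None):
                problems.append(f"{row[id_col]}/{col}: non-binary value {value!r}")

    return problems


problems = check_structure(rows, "id", "outcome")
print(problems)
```

An empty list means the sample passed all three checks; any text returned points to the row and column that needs recoding.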

Nuance: If you are concerned that 0 or 1 is too crude a description of a case attribute then the alternative is to break that attribute down into a number of subsidiary attributes, and then code for the presence or absence of each of these. If there are five subsidiary attributes this means there can be 2 to the power of 5 (i.e. 32) different forms of the original attribute, which should be more than sufficient in many situations.
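The arithmetic in the nuance above can be checked with a short sketch. The five subsidiary attribute names are invented for illustration only; any five binary columns would give the same count.

```python
from itertools import product

# Hypothetical subsidiary attributes replacing one crude 0/1 attribute.
sub_attributes = ["training", "funding", "mentoring", "equipment", "follow_up"]

# Every possible combination of presence (1) and absence (0).
profiles = list(product([0, 1], repeat=len(sub_attributes)))

print(len(profiles))  # 32, i.e. 2 to the power of 5
```

Each tuple in `profiles` is one possible form of the original attribute, from all five absent to all five present.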

Missing data: EvalC3 manages missing data values in a predictable way: a missing value is always interpreted as a “0”. If the attribute in a predictive model is a “1”, i.e. is expected to be present in a case, then a case with a missing value is treated as not matching the model on that attribute. On the other hand, if the attribute in a predictive model is a “0”, i.e. is expected to be absent in a case, then a case with a missing value is treated as matching. In the first of these two instances, the model is “pessimistic”, i.e. it assumes cases with missing values do not have the model attributes. In the second instance, the model is “optimistic”, i.e. it assumes the cases with the missing values do have the model attributes. But if the predictive model combines multiple attributes, some of which are expected to be present and some absent, then it will be more challenging to identify in which net direction the model is biased.
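The missing-data rule described above can be sketched in a few lines. This is only an illustration of the rule as stated here, not EvalC3's own implementation; the `case_matches_model` function, the attribute names and the use of `None` for a missing value are all assumptions.

```python
def case_matches_model(case, model):
    """Return True if the case has every attribute value the model expects.

    A missing value (None, or an absent key) is treated as 0.
    """
    for attribute, expected in model.items():
        value = case.get(attribute)
        if value is None:
            value = 0  # missing is read as "attribute absent"
        if value != expected:
            return False
    return True


# Hypothetical case with a missing value for attr_a.
case = {"attr_a": None, "attr_b": 1}

# Model expects attr_a to be present: the missing value counts against the case.
pessimistic = case_matches_model(case, {"attr_a": 1})

# Model expects attr_a to be absent: the missing value counts in favour of the case.
optimistic = case_matches_model(case, {"attr_a": 0})

print(pessimistic, optimistic)  # False True
```

With a model that mixes expected-present and expected-absent attributes, the same case can be helped by one missing value and hurt by another, which is why the net direction of bias is harder to judge.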

Please also pay attention to point 6 here on data preparation

Size of dataset: The largest dataset I have used had 597 cases and 35 attributes. On this scale the Decision Tree algorithm worked quite slowly, taking about 5 minutes to generate a tree. In the transition from Select Data to Design and Evaluate, EvalC3 would sometimes seize up and display an error message.
