Usable data

 

The current version

A data set has to have the following structure:

  • Rows = cases, such as individual projects, households, or people
  • Columns = aspects of those cases, which include
    • At least one ID column, uniquely identifying each row of data
    • At least one Outcome measure
    • Multiple Attributes of the cases, which may or may not be good predictors of the outcome, by themselves or in combination with others

The cells contain binary data. Here the values of 0 and 1 are used to code the absence or presence of an attribute or outcome.
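As a rough illustration of the structure described above, the following Python sketch checks a small data set before it is imported. It is not part of EvalC3; the function name `check_dataset`, the column names and the use of `None` to stand for a missing value are all assumptions made for this example.

```python
def check_dataset(rows, id_col, outcome_col):
    """Return a list of problems found in a list-of-dicts data set.

    Hypothetical helper: one dict per case, one key per column.
    """
    problems = []

    # The ID column must uniquely identify each row.
    ids = [row[id_col] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("ID column contains duplicate values")

    # Every case needs an outcome column.
    if any(outcome_col not in row for row in rows):
        problems.append("Outcome column is missing from at least one row")

    # All other cells must be 0 or 1; None is treated as a missing value.
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            if value is not None and value not in (0, 1):
                problems.append(
                    f"{row[id_col]}: {col} has non-binary value {value!r}"
                )
    return problems


# Illustrative data only: three cases, two attributes, one outcome.
rows = [
    {"case_id": "P1", "trained_staff": 1, "local_partner": 0, "outcome": 1},
    {"case_id": "P2", "trained_staff": 0, "local_partner": 1, "outcome": 0},
    {"case_id": "P3", "trained_staff": 1, "local_partner": 2, "outcome": 1},
]
problems = check_dataset(rows, "case_id", "outcome")
print(problems)
```

Here the third case is flagged because `local_partner` holds a 2 rather than a 0 or 1.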

Nuance: If you are concerned that 0 or 1 is too crude a description of a case attribute, the alternative is to break that attribute down into a number of subsidiary attributes and then code for the presence or absence of each of these. With five subsidiary attributes there can be 2 to the power of 5 (i.e. 32) different forms of the original attribute, which should be more than sufficient in many situations.
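The arithmetic can be checked with a short Python sketch that enumerates every possible coding of five binary subsidiary attributes. The names `sub_a` to `sub_e` are placeholders invented for this example.

```python
from itertools import product

# Hypothetical subsidiary attributes replacing one crude 0/1 attribute.
sub_attributes = ["sub_a", "sub_b", "sub_c", "sub_d", "sub_e"]

# Each subsidiary attribute is either absent (0) or present (1),
# so the number of distinct forms is 2 ** len(sub_attributes).
combinations = list(product((0, 1), repeat=len(sub_attributes)))

print(len(combinations))   # 32
print(combinations[0])     # (0, 0, 0, 0, 0): none of the five present
print(combinations[-1])    # (1, 1, 1, 1, 1): all five present
```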

Missing data: EvalC3 is tolerant of missing data values, which is probably a good thing. When a search is made for cases having the attributes of a given model, those cases with these attributes are classed as either True Positives or False Positives. Those without these attributes, or with no data to indicate either way, are classed as False Negatives or True Negatives. This means that where there is a significant amount of missing data, EvalC3 could be under-reporting the incidence of cases that fit a given model, compared with what would be found if all the missing data were available. It is in effect providing the most conservative estimate of a model's performance. This possibility has implications for the selection of cases for within-case investigations, which are now described in the page on Selecting Cases.
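The classification rule described above can be sketched in Python as follows. This is an interpretation of the paragraph, not EvalC3's own code; the function `classify`, the attribute names `x` and `y`, and the use of `None` for a missing value are assumptions made for illustration.

```python
def classify(case, model, outcome_col="outcome"):
    """Classify one case against a model (a dict of attribute -> required value).

    Assumption: a missing value (None or an absent key) counts as
    "does not have the attribute", so the case cannot match the model.
    """
    matches = all(case.get(attr) == value for attr, value in model.items())
    has_outcome = case.get(outcome_col) == 1
    if matches:
        return "TP" if has_outcome else "FP"
    return "FN" if has_outcome else "TN"


# Hypothetical model: outcome is predicted when both x and y are present.
model = {"x": 1, "y": 1}

cases = [
    {"id": "A", "x": 1, "y": 1, "outcome": 1},
    {"id": "B", "x": 1, "y": None, "outcome": 1},  # y is missing
    {"id": "C", "x": 1, "y": 1, "outcome": 0},
    {"id": "D", "x": 0, "y": 1, "outcome": 0},
]

results = {case["id"]: classify(case, model) for case in cases}
print(results)
```

Case B shows the conservative effect: if its missing value for `y` were really a 1 it would be a True Positive, but with the value missing it is counted as a False Negative.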

Please also pay attention to point 6 on the data preparation page: https://evalc3.net/data-sets/data-preparation/

Size of dataset: The largest dataset I have used had 597 cases and 35 attributes. At this scale the Decision Tree algorithm worked quite slowly, taking about 5 minutes to generate a tree. In the transition from Select Data to Design and Evaluate, it would sometimes seize up and display an error message.

 
