The current version
A data set has to have the following structure:
- Rows = cases, such as individual projects, households or people
- Columns = aspects of those cases, which include
  - At least one ID column, uniquely identifying each row of data
  - At least one Outcome measure
  - Multiple Attributes of the cases, which may or may not be good predictors of the outcome, either alone or in combination with others
The cells contain binary data. The value 0 codes the absence of an attribute or outcome, and the value 1 codes its presence.
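The structure described above can be checked mechanically before any analysis. The following is a minimal sketch in Python, not part of EvalC3 itself. The function name check_dataset, the column names and the use of None to stand for a missing value are all assumptions made for illustration.

```python
def check_dataset(rows, id_col, outcome_col):
    """Return a list of problems; an empty list means the layout looks valid."""
    problems = []

    # Every row needs an ID, and no ID may repeat.
    ids = [row.get(id_col) for row in rows]
    if any(i is None for i in ids) or len(ids) != len(set(ids)):
        problems.append("ID column has missing or duplicate values")

    for row in rows:
        if outcome_col not in row:
            problems.append(f"case {row.get(id_col)!r} has no outcome column")
        # Every non-ID cell must be 0, 1 or None (None stands for missing data).
        for col, value in row.items():
            if col != id_col and value not in (0, 1, None):
                problems.append(
                    f"case {row.get(id_col)!r}, column {col!r}: {value!r} is not binary"
                )
    return problems


# Hypothetical example: three cases, two attributes, one outcome.
cases = [
    {"id": "P1", "training": 1, "funding": 0, "outcome": 1},
    {"id": "P2", "training": 0, "funding": 1, "outcome": 0},
    {"id": "P3", "training": 1, "funding": None, "outcome": 1},
]
problems = check_dataset(cases, "id", "outcome")
```

Here problems is an empty list, because every ID is unique and every other cell is 0, 1 or missing.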
Nuance: If you are concerned that a value of 0 or 1 is too crude a description of a case attribute, the alternative is to break that attribute down into a number of subsidiary attributes and then code for the presence or absence of each of these. With five subsidiary attributes there can be 2 to the power of 5 (i.e. 32) different forms of the original attribute, which should be more than sufficient in many situations.
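The arithmetic behind the figure of 32 can be confirmed by listing every possible coding of five subsidiary attributes. This is an illustrative sketch only; the sub-attribute names are hypothetical placeholders.

```python
from itertools import product

# Hypothetical names for the five subsidiary attributes.
sub_attributes = ["sub_a", "sub_b", "sub_c", "sub_d", "sub_e"]

# Each profile is one way of coding the five sub-attributes
# as absent (0) or present (1).
profiles = list(product([0, 1], repeat=len(sub_attributes)))

print(len(profiles))  # 2 ** 5 = 32
```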
Missing data: EvalC3 is tolerant of missing data values. When a search is made for cases having the attributes of a given model, the cases with these attributes are classed as either True Positives or False Positives. Cases without these attributes, or with no data to indicate either way, are classed as False Negatives or True Negatives. This means that where there is a significant amount of missing data, EvalC3 could be under-reporting the number of cases that would fit a given model if all the missing data were actually available. In effect it is providing the most conservative estimate of a model’s performance. This possibility has implications for the selection of cases for within-case investigations, which are now described in the page on Selecting Cases.
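The classification rule described above can be sketched as follows. This is an illustration of the stated rule, not EvalC3's own code. The names classify, model and cases are hypothetical, None stands for a missing value, and the sketch assumes the outcome itself is always recorded.

```python
from collections import Counter


def classify(case, model, outcome_col="outcome"):
    """Class a case as TP, FP, FN or TN against a model of required attribute values."""
    # A missing value (None) never equals the required value,
    # so a case with missing data is treated as not fitting the model.
    matches = all(case.get(attr) == required for attr, required in model.items())
    has_outcome = case.get(outcome_col) == 1
    if matches:
        return "TP" if has_outcome else "FP"
    return "FN" if has_outcome else "TN"


# Hypothetical model: the outcome is expected where both attributes are present.
model = {"training": 1, "funding": 1}

cases = [
    {"id": "P1", "training": 1, "funding": 1, "outcome": 1},
    {"id": "P2", "training": 1, "funding": 1, "outcome": 0},
    {"id": "P3", "training": 1, "funding": None, "outcome": 1},  # funding unknown
    {"id": "P4", "training": 0, "funding": 1, "outcome": 0},
]

counts = Counter(classify(case, model) for case in cases)
```

Case P3 is counted as a False Negative because its funding value is missing. If that value were in fact 1, P3 would be a True Positive, which is the sense in which missing data can lead to under-reporting of cases that fit a model.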