Usable data


The current version

A data set has to have the following structure:

  • Rows = cases, such as individual projects
  • Columns = aspects of those cases, which include
    • At least one ID column, uniquely identifying each row of data
    • At least one Outcome measure
    • Multiple Attributes of the cases, which may or may not be good predictors of the outcome, by themselves or in combination with others

The cells contain binary data. Here the values of 0 and 1 are used to code the absence or presence of an attribute or outcome.
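The layout above can be checked before any analysis is attempted. The following is only a minimal sketch in Python, not part of EvalC3 itself: the column names, the example rows and the check_structure helper are all hypothetical, and None is assumed to stand for a missing value.

```python
# Hypothetical example data: one ID column, one outcome column, three attributes.
rows = [
    {"id": "P1", "outcome": 1, "attr_a": 1, "attr_b": 0, "attr_c": 1},
    {"id": "P2", "outcome": 0, "attr_a": 0, "attr_b": 0, "attr_c": 1},
    {"id": "P3", "outcome": 1, "attr_a": 1, "attr_b": 1, "attr_c": 0},
]


def check_structure(rows, id_col, outcome_col):
    """Return a list of problems found; an empty list means the layout fits."""
    problems = []
    # Every row must be uniquely identified by its ID value.
    ids = [row[id_col] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("ID values are not unique")
    for row in rows:
        # Every row needs an outcome column.
        if outcome_col not in row:
            problems.append(f"{row[id_col]}: no {outcome_col} column")
        # Every cell outside the ID column must be 0, 1 or missing (None).
        for column, value in row.items():
            if column != id_col and value not in (0, 1, None):
                problems.append(
                    f"{row[id_col]}: {column} is {value!r}, expected 0 or 1"
                )
    return problems


print(check_structure(rows, "id", "outcome"))
```

An empty list means that every row has a unique ID and that every outcome and attribute cell holds 0, 1 or no value.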

Nuance: If you are concerned that 0 or 1 is too crude a description of a project attribute, the alternative is to break that attribute down into a number of subsidiary attributes and then code for the presence or absence of each of these. With five subsidiary attributes there are 2 to the power of 5 (i.e. 32) possible forms of the original attribute, which should be more than sufficient in many circumstances.
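The arithmetic can be confirmed by listing every presence/absence pattern for five subsidiary attributes. This is an illustrative sketch only; the attribute names are hypothetical.

```python
from itertools import product

# Hypothetical subsidiary attributes that together describe one broad attribute.
subsidiary = ["sub_1", "sub_2", "sub_3", "sub_4", "sub_5"]

# Each subsidiary attribute is either absent (0) or present (1).
patterns = list(product((0, 1), repeat=len(subsidiary)))

print(len(patterns))   # 32 distinct forms of the original attribute
print(patterns[0])     # (0, 0, 0, 0, 0): none of the subsidiary attributes present
print(patterns[-1])    # (1, 1, 1, 1, 1): all of them present
```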

Missing data: EvalC3 is tolerant of missing data values, which is probably a good thing. When a search is made for cases having the attributes of a given model, the cases with these attributes are classed as either True Positives or False Positives. Those without these attributes, or with no data to say either way, are classed as False Negatives or True Negatives. This means that where there is a significant amount of missing data, EvalC3 could be under-reporting the incidence of cases that fit a given model, compared with what would be found if all the missing data were available. It is in effect providing the most conservative estimate of a model’s performance. This possibility has implications for the selection of cases for within-case investigations, which are described in the page on Selecting Cases.
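The classification rule described above can be sketched in a few lines of Python. This is an illustration of the rule, not EvalC3's own implementation. It assumes that a model is a set of attributes that must all be present, that None marks a missing value, and that the case data, column names and helper functions are hypothetical.

```python
def fits_model(case, model):
    """True only if every model attribute is recorded with the required value.

    A missing value (None) never equals the required value, so a case with
    no data on a model attribute is treated as not fitting the model.
    """
    return all(case.get(attr) == value for attr, value in model.items())


def confusion_counts(cases, model, outcome_col="outcome"):
    """Count True/False Positives and Negatives for one model."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for case in cases:
        predicted = fits_model(case, model)
        actual = case[outcome_col] == 1
        if predicted:
            counts["TP" if actual else "FP"] += 1
        else:
            counts["FN" if actual else "TN"] += 1
    return counts


# Hypothetical cases; None marks a missing attribute value.
cases = [
    {"id": "P1", "attr_a": 1, "attr_b": 1, "outcome": 1},
    {"id": "P2", "attr_a": 1, "attr_b": None, "outcome": 1},
    {"id": "P3", "attr_a": 1, "attr_b": 1, "outcome": 0},
    {"id": "P4", "attr_a": 0, "attr_b": 1, "outcome": 0},
    {"id": "P5", "attr_a": None, "attr_b": 1, "outcome": 0},
]
model = {"attr_a": 1, "attr_b": 1}

print(confusion_counts(cases, model))
```

Case P2 has the outcome but no data on attr_b, so it is counted as a False Negative. If the missing value were in fact 1, it would have been a True Positive, which is the under-reporting described above.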

Version N+1

This version will be able to use polynominal data, where multiple categories of a given type of attribute or outcome can each be coded as present or not, such as the types of NGO partners associated with a project.
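One way to picture this kind of coding is to turn each category into its own present/absent column. The sketch below is an assumption about how such data might be laid out, not a description of how the future version will store it; the partner types, project IDs and column names are hypothetical.

```python
# Hypothetical categories of NGO partner.
partner_types = ["local", "national", "international"]

# Hypothetical projects, each listing the partner types it works with.
projects = {
    "P1": {"local"},
    "P2": {"local", "international"},
    "P3": {"national"},
}

# One 0/1 column per category: 1 if that partner type is present.
coded = {
    project_id: {
        f"partner_{ptype}": int(ptype in partners) for ptype in partner_types
    }
    for project_id, partners in projects.items()
}

for project_id, columns in coded.items():
    print(project_id, columns)
```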

This version will also be able to use numerical data, for example rating or ranking scales used to rate achievement of an outcome, or absolute values of an attribute or outcome such as project cost or number of beneficiaries.

This version will be able to work with all types of data sets analysed by QCA (binary, multi-value, and fuzzy set).

A caveat: Using specific numerical values in predictive models only makes sense when the number of unique values is not large. A high number of unique values means that the number of cases with any specific value is likely to be small, and the findings much less reliable. Attributes with a wide range of values should either be re-coded into a smaller number of groups, or a greater than (>) or less than (<) operator should be used to indicate which part of the whole range is being used to predict an outcome of interest.
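Both options in the caveat can be sketched briefly. This is a minimal illustration only: the project costs, the band boundaries and the threshold are hypothetical values chosen for the example, not recommendations.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical project costs: every value is unique, so no value is shared.
costs = {
    "P1": 120_000, "P2": 450_000, "P3": 80_000,
    "P4": 300_000, "P5": 95_000, "P6": 610_000,
}

# Option 1: re-code into a small number of groups.
# Assumed upper limits: band 0 up to 100,000, band 1 up to 400,000, band 2 above.
boundaries = [100_000, 400_000]
bands = {pid: bisect_left(boundaries, cost) for pid, cost in costs.items()}
print(Counter(bands.values()))  # number of cases per band

# Option 2: use a greater than (>) operator against an assumed threshold.
threshold = 250_000
above = {pid: int(cost > threshold) for pid, cost in costs.items()}
print(above)
```

With six unique cost values each value describes only one case, whereas after re-coding each band holds two cases, and the threshold splits the cases three and three.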
