The attributes of cases are the field names given at the top of each column in a data set.
Choices when importing data
When a data set is imported, choices can be made about the status of each column of data. These choices affect the kind of models that can subsequently be developed using this data. They can be revisited and changed. Each column can be given one of four statuses:
- ID: Any easily recognizable identifier for a case, for example the name of a project.
- Attribute: These will be the attributes of the cases that will be considered when predictive models are being developed. They are the possible “predictors” or independent variables.
- Ignore: These are the attributes that are not seen as relevant to the current modelling exercise.
- Outcome: One attribute must be selected as the outcome of interest, to be predicted by the models being developed. There may be more than one column of outcome data in the data set. If so, the others should be set to “Ignore”. Later on they can be re-assigned “Outcome” status and used as the basis for developing a new model.
The choices made about the status of each column of data have consequences which should be borne in mind.
The more attributes that remain in an imported data set, the larger the number of possible combinations of these, and any one of these may be the most accurate model. The number of possible combinations rises exponentially, i.e. it doubles every time an additional attribute is included.
This has three consequences:
- The required computation time increases. This matters most for the exhaustive search option, which works best with small numbers of attributes.
- When there are many attributes relative to cases, it is likely that there will be more than one well-performing predictive model, and in some cases it will not be possible to choose between them simply on the basis of their performance measures. However, subsequent within-case analyses may provide a basis for choosing between these.
- For a given number of cases available, any increase in the number of attributes (and combinations thereof) reduces the probability that this set of cases will be a comprehensive representation of all those possible combinations. This means that a model may not perform so well when applied to new cases. These new cases may have new configurations of attributes that do not produce the outcome as previously predicted.
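The doubling described above, and its effect on how well a fixed set of cases can represent all possible configurations, can be sketched in a few lines of Python. This is an illustration only, not part of EvalC3. It assumes every attribute is binary (present/absent), and the function names `possible_configurations` and `max_coverage` are hypothetical.

```python
def possible_configurations(n_attributes: int) -> int:
    """Number of distinct configurations of n binary attributes (assumed binary)."""
    return 2 ** n_attributes


def max_coverage(n_cases: int, n_attributes: int) -> float:
    """Upper bound on the share of possible configurations that n_cases can represent.

    The bound is reached only if every case has a different configuration.
    """
    total = possible_configurations(n_attributes)
    return min(n_cases, total) / total


# With 30 cases, each added attribute halves the best achievable coverage
# once the number of configurations exceeds the number of cases.
for n in (4, 8, 12):
    print(n, possible_configurations(n), round(max_coverage(30, n), 4))
```

With the number of cases held fixed, the best achievable coverage falls quickly as attributes are added, which is why new cases are more likely to present configurations the model has not seen.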
In EvalC3 a sub-set of a larger set of attributes can now be identified which optimizes the consistency and/or diversity of the configurations in its associated data set. This is done via the Find Optimal Attributes button, which uses the Solver Add-In (more specifically, its genetic algorithm).
- Consistency is the extent to which all cases sharing a given configuration have the same outcome rather than mixed outcomes (e.g. both present and absent).
- Maximizing consistency is important if the aim is to identify/develop predictive models that have minimal levels of False Positives. Maximizing the consistency will improve the internal validity of the model.
- Calculation: Consistency is the percentage of all configurations having only one type of outcome, i.e. 1 - ((“# configurations … including outcome” - “# configurations … excluding outcome”) / (“# configurations … excluding outcome”))
- Diversity is the extent to which cases represent unique configurations rather than duplicating one or more configurations.
- Maximizing the diversity will reduce the number of models which best fit the data. It also means that when the model is applied to new cases not in the current data set it is less likely to fail, because there are fewer surprises, i.e. configurations which don’t fit the model. The external validity of the model will be improved.
- Calculation: % Diversity = # of configurations/(2^# of attributes)
- Consistency and diversity can both be maximized, though neither is likely to be perfect.
- Calculation: % Maximization = (Diversity*Consistency)/(Diversity+Consistency)
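The three calculations above can be sketched in Python as follows. This is not how EvalC3 implements them (it works in Excel through the Solver Add-In). The sketch rests on two assumptions: attributes are binary, and the elided terms in the consistency formula are read as "number of distinct configurations counted with the outcome column" and "number of distinct configurations counted without it". The function names and the example data are hypothetical.

```python
from typing import List, Tuple

# A case is (attribute values, outcome); all values are assumed to be 0 or 1.
Case = Tuple[Tuple[int, ...], int]


def consistency(cases: List[Case]) -> float:
    """1 - (configs incl. outcome - configs excl. outcome) / configs excl. outcome."""
    with_outcome = {(attrs, outcome) for attrs, outcome in cases}
    without_outcome = {attrs for attrs, _ in cases}
    return 1 - (len(with_outcome) - len(without_outcome)) / len(without_outcome)


def diversity(cases: List[Case]) -> float:
    """Observed configurations as a share of the 2^n possible configurations."""
    without_outcome = {attrs for attrs, _ in cases}
    n_attributes = len(cases[0][0])
    return len(without_outcome) / 2 ** n_attributes


def maximization(cases: List[Case]) -> float:
    """(Diversity * Consistency) / (Diversity + Consistency), as given in the text."""
    d, c = diversity(cases), consistency(cases)
    return (d * c) / (d + c)


# Hypothetical data: 3 attributes, 5 cases. The configuration (0, 1, 0)
# appears with both outcomes, so it is inconsistent.
example_cases: List[Case] = [
    ((1, 0, 1), 1),
    ((1, 0, 1), 1),
    ((1, 1, 0), 0),
    ((0, 1, 0), 1),
    ((0, 1, 0), 0),
]

print(consistency(example_cases), diversity(example_cases), maximization(example_cases))
```

In this example two of the three observed configurations are consistent, and three of the eight possible configurations are observed. Under this reading, a configuration with mixed outcomes is counted twice in the "including outcome" total, which is what makes the difference between the two counts equal to the number of inconsistent configurations when the outcome is binary.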
A large set of attributes can also be reduced in size by removing redundant or irrelevant attributes. This can be done by either of these approaches:
- Data centered: Using “feature selection” methods developed as an integral part of data mining work. See Chapter 12 in Kotu, V., Deshpande, B., 2014. Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann. The simplest of these methods is to identify attributes whose values correlate highly with each other across all the cases available, i.e. redundant measures. One of these can then be removed. This particular approach is not available specifically within EvalC3 but can be done using normal Excel functions.
- Theory centered: In its simplest form, this is using prior theory to inform choices about what attributes are likely to be more relevant than others. A more sophisticated version involves conceptualizing a broader causal chain, then analysing each segment within this chain using a smaller number of attributes thought to be relevant to that segment. A further option is to then take the attributes making up each of the constituent models, pool them into a new smaller set, and analyse these as a whole. In QCA this kind of process is called a two-step analysis.
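The simplest data-centered check described above, finding pairs of attributes whose values correlate highly across cases, can also be done outside Excel. The following Python sketch is one possible way to do it and is not an EvalC3 feature. The 0.9 threshold, the function names and the attribute names are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treated as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def redundant_pairs(
    columns: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Return attribute pairs whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical attribute columns, one value per case.
columns = {
    "funding": [1, 1, 0, 0, 1, 0],
    "staff_trained": [1, 1, 0, 0, 1, 0],
    "rural": [1, 0, 1, 0, 0, 1],
}

for name_a, name_b, r in redundant_pairs(columns):
    print(f"{name_a} and {name_b} look redundant (r = {r:.2f})")
```

A flagged pair is only a candidate for removal. Which of the two attributes to drop, if either, remains a judgement that prior theory can inform.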