[Updated 2018 10 07] The attributes of cases are the field names given at the top of each column in a dataset. These are sometimes called “features” in predictive analytics, or “conditions” in Qualitative Comparative Analysis (QCA).
Choices when importing data
When a data set is imported, choices can be made about the status of each column of data. These choices affect the kinds of models that can subsequently be developed using this data at any one time. They can be revisited and changed. Each column can be given one of four statuses:
- ID: For example, the name of a project. Basically, any easily recognizable identifier for a case.
- Attribute: These will be the attributes of the cases that will be considered when predictive models are being developed. They are the possible “predictors” or independent variables.
- Ignore: These are the attributes that will not be seen as relevant to the current modelling exercise.
- Outcome: One attribute must be selected as the outcome of interest, to be predicted by the models being developed. There may be more than one column of outcome data in the data set. If so, the others should be set to “Ignore”. Later on they can be re-assigned “Outcome” status and used as the basis for developing a new model. Or they can be assigned “Attribute” status if you are looking for relationships between outcomes.
The choices made about the status of each column of data have consequences which should be borne in mind.
The more attributes that remain in an imported data set, the larger the number of possible combinations of these, and any one of these may be the most accurate model. The number of possible combinations rises exponentially, i.e. it doubles every time an additional attribute is included.
This has three consequences:
- The required computation time increases. This is of greatest significance for the exhaustive search option. Exhaustive search works best with small numbers of attributes, or when the model size is specified in advance to be relatively small.
- When there are many attributes relative to cases, it is likely that there will be more than one well-performing predictive model, and in some cases it will not be possible to choose between them simply on the basis of their performance measures. However, subsequent within-case analyses may provide a basis for choosing between these.
- For a given number of cases available, any increase in the number of attributes (and combinations thereof) reduces the probability that this set of cases will be a comprehensive representation of all those possible combinations. This means that a model may not perform so well when applied to new cases. These new cases may have new configurations of attributes that do not produce the outcome as previously predicted.
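The doubling effect, and its consequence for how well a fixed set of cases can represent all possible combinations, can be illustrated with a short Python sketch. It assumes binary attributes (each is either present or absent), and the function names and the figure of 30 cases are hypothetical; none of this is part of EvalC3.

```python
def count_combinations(n_attributes: int) -> int:
    """Number of possible combinations of n binary attributes (2^n)."""
    return 2 ** n_attributes


def max_coverage(n_cases: int, n_attributes: int) -> float:
    """Upper bound on the share of possible combinations that n_cases can represent.

    Each case can represent at most one combination, so the bound is
    min(n_cases, 2^n) / 2^n.
    """
    total = count_combinations(n_attributes)
    return min(n_cases, total) / total


if __name__ == "__main__":
    n_cases = 30  # illustrative number of cases, not taken from the source
    for n in range(4, 13, 2):
        print(n, count_combinations(n), f"{max_coverage(n_cases, n):.1%}")
```

With the number of cases held constant, the best possible coverage falls quickly as attributes are added, which is why new cases are increasingly likely to show configurations the model has not seen.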
In EvalC3, a sub-set of a larger set of attributes can now be identified which optimizes the consistency and/or diversity of the configurations in its associated data set. This is done via the Find Optimal Attributes button, which uses the Solver Add-In (more specifically, its genetic algorithm).
- Consistency is the extent to which all cases with a given configuration have the same outcome, rather than mixed outcomes (e.g. both present and absent).
- Maximizing consistency is important if the aim is to identify/develop predictive models that have minimal levels of False Positives. Maximizing the consistency will improve the internal validity of the model.
- Calculation: Consistency is the percentage of all configurations having only one type of outcome, i.e. 1 – ((“# configurations … including outcome” – “# configurations … excluding outcome”) / (“# configurations … excluding outcome”))
- Diversity is the extent to which the cases represent unique configurations, rather than duplicating one or more configurations.
- Maximizing the diversity will reduce the number of models which best fit the same data. It also means that when the model is applied to new cases not in the current data set it is less likely to fail, because there are fewer surprises, i.e. configurations which don’t fit the model. The external validity of the model will be improved.
- Calculation: % Diversity = # of configurations/(2^# of attributes)
- Consistency and diversity can both be maximized, though neither is likely to be perfect.
- Calculation: % Maximization = (Diversity*Consistency)/(Diversity+Consistency)
- This form of optimisation has similarities to what are known as Quality-Diversity algorithms. In the EvalC3 implementation, consistency of cases is the quality dimension and diversity of cases is the diversity dimension.
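The three measures above can be illustrated with a minimal Python sketch. It is not EvalC3's implementation (which runs in Excel via the Solver Add-In), and it rests on some assumptions: attributes and outcomes are binary, consistency follows the prose definition (the share of observed configurations whose cases all have one type of outcome), and results are returned as fractions rather than percentages. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def consistency(cases):
    """Share of observed configurations whose cases all have the same outcome."""
    outcomes_by_config = defaultdict(set)
    for attributes, outcome in cases:
        outcomes_by_config[tuple(attributes)].add(outcome)
    consistent = sum(1 for outcomes in outcomes_by_config.values() if len(outcomes) == 1)
    return consistent / len(outcomes_by_config)


def diversity(cases):
    """Observed configurations as a share of the 2^n possible ones."""
    configs = {tuple(attributes) for attributes, _ in cases}
    n_attributes = len(next(iter(configs)))
    return len(configs) / 2 ** n_attributes


def maximization(d, c):
    """Combined score as given in the text: (D * C) / (D + C)."""
    return (d * c) / (d + c) if (d + c) else 0.0


# Hypothetical data: (attribute values, outcome), all binary.
cases = [
    ((1, 0, 1), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 0),
    ((0, 1, 1), 1),  # same configuration as the previous case, different outcome
    ((1, 1, 0), 0),
]
c = consistency(cases)
d = diversity(cases)
print(c, d, maximization(d, c))
```

Here two of the three observed configurations have a single type of outcome, and three of the eight possible configurations are observed. An optimizer searching for the best sub-set of attributes would recompute these scores for each candidate sub-set.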
A large set of attributes can also be reduced in size by removing redundant or irrelevant attributes, using either of these approaches:
- Data centered: Using “feature selection” methods developed as an integral part of data mining work. See Chapter 12 in Kotu, V., Deshpande, B., 2014. Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann. The simplest of these methods is to identify attributes whose values correlate highly with each other across all the cases available, i.e. redundant measures. One of these can then be removed. This particular approach is not available specifically within EvalC3 but can be done using normal Excel functions.
- Theory centered: In its simplest form, this means using prior theory to inform choices about which attributes are likely to be more relevant than others. Another approach, called two-step analysis in QCA, is to divide the attributes into two or more groups and use one group at a time, e.g. context attributes and intervention attributes. A further option is then to take the attributes making up the models that fitted both groups, pool them into a new smaller set, and analyse these as a whole.
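The simplest data-centered check, looking for pairs of attributes that correlate highly across the cases, can be sketched in Python as follows. This is an illustration only, standing in for what would be done with Excel functions. The column names, the sample values and the 0.9 threshold are all hypothetical, and treating a constant column as having zero correlation is a simplifying assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for attribute pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical binary attributes, one value per case.
columns = {
    "training": [1, 1, 0, 0, 1, 0],
    "mentoring": [1, 1, 0, 0, 1, 0],  # identical to "training", so redundant
    "funding": [0, 1, 1, 0, 0, 1],
}
print(redundant_pairs(columns))
```

For each flagged pair, one attribute could be set to “Ignore”. Which one to drop is a judgement call that theory or data quality may help to inform.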