Clicking the Select Data button takes you to the Select Data worksheet. An example is shown below, using the same example data set.
- Reading the characteristics of the data set. Above the data set itself is a series of measures that describe it:
- Configurations: The number of unique configurations of attributes in the data set. In this example data set there are 14 configurations among a total of 26 cases.
- Click on Sort by Configuration to show the cases grouped by configuration
- Consistency: The number of configurations that have consistent outcomes, i.e. the outcome is absent in all of their cases or present in all of their cases, never a mix of both.
- Diversity: The number of configurations of attributes present in this data set, as a percentage of the total number that is possible given the number of attributes. In this example there are 5 attributes, so Diversity = 14 / (2 to the power of 5) = 14 / 32, or about 44%.
- Missing data: The percentage of all the cells in the data set that have no values (0 or 1)
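The four measures described above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the tool's own code: the function name `describe_dataset`, the row layout, and the toy data are all assumptions, and the sketch assumes that "all the cells" in the Missing data measure means attribute and outcome cells together.

```python
from collections import defaultdict

# Hypothetical toy data: each row is (attribute values, outcome), coded 0/1.
# None would stand for a missing value.
rows = [
    ((1, 0, 1), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 0),
    ((0, 1, 1), 1),
    ((0, 0, 0), 0),
    ((1, 1, 0), 1),
]

def describe_dataset(rows):
    """Return the four summary measures for a list of (attributes, outcome) rows."""
    n_attributes = len(rows[0][0])

    # Group the outcomes seen for each unique configuration of attributes.
    outcomes_by_config = defaultdict(set)
    for attributes, outcome in rows:
        outcomes_by_config[tuple(attributes)].add(outcome)

    configurations = len(outcomes_by_config)
    # A configuration is consistent when all its cases share one outcome.
    consistent = sum(1 for seen in outcomes_by_config.values() if len(seen) == 1)
    # Diversity: observed configurations as a percentage of the 2^n possible.
    diversity_pct = 100 * configurations / 2 ** n_attributes
    # Missing data: percentage of cells (assumed: attributes and outcome) with no value.
    cells = [v for attributes, outcome in rows for v in (*attributes, outcome)]
    missing_pct = 100 * sum(v is None for v in cells) / len(cells)

    return {
        "configurations": configurations,
        "consistent": consistent,
        "diversity_pct": diversity_pct,
        "missing_pct": missing_pct,
    }

result = describe_dataset(rows)
```

For the example data set in the text, the same diversity arithmetic gives `100 * 14 / 2 ** 5`, which is 43.75 and rounds to 44%.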
- Select Column Types and Choose Rows
- By default, the leftmost column is automatically labeled as ID. To change this, click on that cell. A drop-down menu will appear with options to Ignore that column, to leave it as ID, or to change it to Attribute or Outcome.
- By default, the rightmost column is automatically labeled as Outcome. If you want to change that, click on that cell, and choose Ignore or Attribute. You will then need to click on another column heading in the same way and change that to Outcome.
- There must be one ID column and one Outcome column in any data set being prepared for use at this stage. There may be more than one outcome of interest in the data set, but only one can be labeled as Outcome at this stage, prior to going to Design and Evaluate.
- All the columns between ID on the left and Outcome on the right are by default labeled as Attribute, i.e. potential predictors of the outcome. By clicking on any of these labels you can change it to Ignore, Outcome, or ID.
- The status of any of the columns can be re-assigned later on. When you do this you are in effect loading a new data set. One consequence is that the findings from the analysis of the previous data selection will no longer be accessible in the View Models view – so keep a record of those findings somewhere, if they are important.
- Click on Design & Evaluate, which will take you to that worksheet.
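The default labeling and the "one ID, one Outcome" rule described above can be expressed as a small check. This is a minimal sketch for illustration only: `default_roles` and `roles_are_valid` are hypothetical helper names, the column names are made up, and the sketch assumes a data set with at least two columns.

```python
from collections import Counter

def default_roles(columns):
    """Assign the default labels: leftmost ID, rightmost Outcome, the rest Attribute."""
    roles = {name: "Attribute" for name in columns}
    roles[columns[0]] = "ID"
    roles[columns[-1]] = "Outcome"
    return roles

def roles_are_valid(roles):
    """Check that exactly one column is labeled ID and exactly one Outcome."""
    counts = Counter(roles.values())
    return counts["ID"] == 1 and counts["Outcome"] == 1

# Hypothetical column names, used only for illustration.
roles = default_roles(["case", "funding", "training", "success"])
valid_by_default = roles_are_valid(roles)

# Relabeling the Outcome column as Ignore leaves no Outcome,
# so another column would have to be changed to Outcome.
roles["success"] = "Ignore"
valid_after_change = roles_are_valid(roles)
```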
- Optimizing the set of attributes being used
- This is an optional step to take before proceeding to Design and Evaluate. It can be useful when there are a large number of attributes in the data set, relative to the number of cases, and where there is no theory-led basis for removing some.
- Clicking the Find Optimal Attributes button opens a pop-up menu with three options:
- Maximize the consistency of the configurations in the data set. A high percentage means most cases with a given configuration will have the same outcome. A low percentage means that cases with the same configuration will often have a mix of outcomes, i.e. both present and absent.
- Maximize the diversity of the configurations in the data set. A high percentage means most of the possible configurations of the attributes are represented in the data set. A low percentage means that only a few of the possible configurations are represented.
- Maximize both the consistency and diversity of configurations. Neither measure may reach 100% but the highest possible measure on both will be found.
- For more information on when these different optimization strategies will be useful, see Selecting attributes and outcomes
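To make the three optimization options concrete, here is a brute-force sketch that scores every subset of attributes. It is not a description of how the Find Optimal Attributes button works internally, which this page does not state. The names `score` and `find_best` and the toy data are hypothetical, consistency is assumed to be the percentage of configurations with a single outcome, and the "both" option is assumed to be the simple average of the two percentages.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each row is (attribute values, outcome), coded 0/1.
rows = [
    ((1, 0, 1), 1),
    ((1, 0, 1), 1),
    ((0, 1, 1), 0),
    ((0, 1, 1), 1),
    ((0, 0, 0), 0),
    ((1, 1, 0), 1),
]

def score(rows, keep):
    """Return (consistency %, diversity %) using only the attribute indexes in keep."""
    outcomes_by_config = defaultdict(set)
    for attributes, outcome in rows:
        outcomes_by_config[tuple(attributes[i] for i in keep)].add(outcome)
    n_configs = len(outcomes_by_config)
    n_consistent = sum(len(seen) == 1 for seen in outcomes_by_config.values())
    consistency = 100 * n_consistent / n_configs
    diversity = 100 * n_configs / 2 ** len(keep)
    return consistency, diversity

def find_best(rows, objective):
    """Try every non-empty attribute subset; objective is 'consistency', 'diversity' or 'both'."""
    n_attributes = len(rows[0][0])
    candidates = []
    for size in range(1, n_attributes + 1):
        for keep in combinations(range(n_attributes), size):
            consistency, diversity = score(rows, keep)
            candidates.append((keep, consistency, diversity))

    def key(candidate):
        _, consistency, diversity = candidate
        if objective == "consistency":
            return consistency
        if objective == "diversity":
            return diversity
        return (consistency + diversity) / 2  # assumed way of combining the two

    # max() keeps the first candidate on ties, so smaller subsets win ties.
    return max(candidates, key=key)

best_both = find_best(rows, "both")
```

An exhaustive search like this examines 2^n - 1 subsets, so it is only practical for a modest number of attributes. It also shows the tension between the two measures: adding attributes tends to raise consistency but lower diversity, which is why neither may reach 100% when both are maximized together.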