Data preparation

Before loading data into EvalC3, it is useful to take the following steps:

  1. Check each attribute column for missing data values. Prioritise attributes that have no missing data. EvalC3 can work with cases that have missing data, but the models developed will be conservative, i.e. they will assume that all cases with missing data do not fit the best-performing model.
    1. For ideas on how to deal with missing values see
      1. http://www.missingdata.org.uk/
      2. http://www.measuringu.com/blog/handle-missing-data.php
  2. Check each attribute column to ensure there is some variation in cell values. If the values are all the same, the attribute will be of no value as a potential predictor. Outcome columns must include both cases where the outcome is present and cases where it is absent.
  3. If attribute or outcome values are originally in numerical form and need to be dichotomised into binary form (1s and 0s), take care to ensure some degree of balance between the number of presence and absence cases. Where presence (for example) is either rare or very common, be aware that there will be a greater than normal risk of False Positives or False Negatives respectively.
  4. Try to minimise the use of attributes that are highly correlated with each other across the cases in the data set. Having more than one such attribute will not improve the predictive power of the models that can be developed.
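
The four checks above can also be run outside EvalC3 before loading the data. The following is a minimal sketch in Python, not part of EvalC3 itself. It assumes the data set is already dichotomised and held as a list of cases (one dictionary per case), with missing cells stored as None or an empty string. The function names, the sample data and the 0.9 correlation threshold are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

MISSING = (None, "")  # assumption: how missing cells are represented


def column(rows, name):
    """Return one column of the case-by-attribute table as a list."""
    return [row.get(name) for row in rows]


def missing_counts(rows, names):
    """Step 1: number of missing cells in each column."""
    return {n: sum(v in MISSING for v in column(rows, n)) for n in names}


def constant_columns(rows, names):
    """Step 2: columns whose non-missing values never vary."""
    return [n for n in names
            if len({v for v in column(rows, n) if v not in MISSING}) <= 1]


def presence_share(rows, names):
    """Step 3: share of 1s among the non-missing cells of each binary column."""
    shares = {}
    for n in names:
        values = [v for v in column(rows, n) if v not in MISSING]
        shares[n] = sum(values) / len(values) if values else None
    return shares


def pearson(xs, ys):
    """Pearson correlation; for two 0/1 columns this equals the phi coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when a column does not vary
    return cov / sqrt(var_x * var_y)


def correlated_pairs(rows, attributes, threshold=0.9):
    """Step 4: attribute pairs with |r| >= threshold, using cases complete on both."""
    pairs = []
    for a, b in combinations(attributes, 2):
        complete = [(row[a], row[b]) for row in rows
                    if row.get(a) not in MISSING and row.get(b) not in MISSING]
        if len(complete) < 2:
            continue
        coef = pearson(*zip(*complete))
        if coef is not None and abs(coef) >= threshold:
            pairs.append((a, b, coef))
    return pairs


# Hypothetical data set: four attributes (A-D) and one outcome column.
rows = [
    {"A": 1, "B": 1, "C": 1, "D": 0, "outcome": 1},
    {"A": 1, "B": 1, "C": 1, "D": None, "outcome": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1, "outcome": 0},
    {"A": 0, "B": 0, "C": 1, "D": 1, "outcome": 0},
    {"A": 1, "B": 1, "C": 1, "D": 0, "outcome": 0},
    {"A": 0, "B": 0, "C": 1, "D": 0, "outcome": 1},
]
names = ["A", "B", "C", "D", "outcome"]

print("Missing cells:", missing_counts(rows, names))
print("No variation:", constant_columns(rows, names))
print("Share of 1s:", presence_share(rows, names))
print("Highly correlated:", correlated_pairs(rows, ["A", "B", "C", "D"]))
```

In this sample, attribute D has one missing cell, attribute C never varies, and attributes A and B are identical across cases, so one of the pair could be dropped. What counts as "highly correlated" or "balanced" remains a judgement call; the threshold here is only a starting point.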

You may also find it useful to plan the types of analysis to be carried out once you have uploaded the data, especially if you have a data set with many attributes and outcomes of interest. One way of planning an analysis is to use a data analysis matrix, as described in detail here.