Data preparation

The following steps are worth taking before loading data into EvalC3:

  1. Check each attribute column for missing data values. Prioritise the use of attributes that have no missing data. EvalC3 can work with cases that have missing data, but the models that are developed will be conservative, i.e. they will assume that all cases with missing data do not fit the best-performing model.
    1. For ideas on how to deal with missing values, see:
      1. http://www.missingdata.org.uk/
      2. http://www.measuringu.com/blog/handle-missing-data.php
  2. Check each attribute column to ensure there is some variation in cell values. If all the values are the same, the attribute will be of no value as a potential predictor. Outcome columns must include both cases where the outcome is present and cases where it is absent.
  3. If attribute or outcome values are originally in numerical form and need to be dichotomised into binary form (1s and 0s), take care to ensure that there is some degree of balance between the number of presence and absence cases. Where presence (for example) is either rare or very common, be aware that there will be a greater than normal risk of False Positives or False Negatives respectively.
  4. Try to minimise the use of attributes that are highly correlated with each other in the way they appear across cases in the data set. Having more than one such attribute will not improve the predictive power of the models that can be developed.
  5. Think about timing: when each attribute was collected or when it happened. You don’t want a predictive model that shows X leads to Y when in fact X happened after Y.
  6. Be careful when coding qualitative data from participatory or found sources:
    1. Coding events of interest as 1/0 can be problematic, because typically we may have evidence that an event happened, but evidence that it did not happen may or may not be available. The event may not have happened, or it may have happened but not been reported.
    2. In this situation, instead of using a single 1/0 column for the presence and absence of an attribute, use two columns: in the first, 1/0 represents known presence / unknown status of the attribute, and in the second, 1/0 represents known absence / unknown status.
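
Several of the checks above can be run on a data set before it is loaded. The following is a minimal Python sketch, not part of EvalC3 itself. The case records, the column names, the status labels ("present", "absent", "not reported") and all of the helper functions are hypothetical and only illustrate the idea. It counts missing values, checks for variation, reports the balance of 1s and 0s, measures the correlation between two attributes, dichotomises numerical values at a chosen cutoff, and splits an uncertain attribute into the two columns described in step 6.

```python
from math import sqrt

# Hypothetical data: one dict per case; 1 = present, 0 = absent, None = missing.
cases = [
    {"training": 1, "funding": 1, "funding_copy": 1, "urban": 1, "outcome": 1},
    {"training": 1, "funding": 0, "funding_copy": 0, "urban": 1, "outcome": 1},
    {"training": 0, "funding": 1, "funding_copy": 1, "urban": 1, "outcome": 0},
    {"training": None, "funding": 0, "funding_copy": 0, "urban": 1, "outcome": 0},
    {"training": 1, "funding": 1, "funding_copy": 1, "urban": 1, "outcome": 1},
    {"training": 0, "funding": 0, "funding_copy": 0, "urban": 1, "outcome": 0},
]

def column(name):
    """Return all values of one attribute or outcome column."""
    return [case[name] for case in cases]

def missing_count(values):
    """Step 1: number of cases with no value for this column."""
    return sum(v is None for v in values)

def has_variation(values):
    """Step 2: True if the known values include both 1s and 0s."""
    return len({v for v in values if v is not None}) > 1

def presence_share(values):
    """Step 3: share of known values that are 1 (None if nothing is known)."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None

def pearson(xs, ys):
    """Step 4: Pearson correlation over cases where both values are known.
    Returns None if it cannot be computed (too few cases or no variation)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)

def dichotomise(values, cutoff):
    """Step 3: convert numerical values to 1 (at or above cutoff) or 0."""
    return [1 if v >= cutoff else 0 for v in values]

def split_coding(status):
    """Step 6: turn one reported status into two columns,
    (known presence / unknown, known absence / unknown)."""
    known_present = 1 if status == "present" else 0
    known_absent = 1 if status == "absent" else 0
    return known_present, known_absent

# Print a short report for every column in the hypothetical data set.
for name in cases[0]:
    values = column(name)
    print(name, "| missing:", missing_count(values),
          "| varies:", has_variation(values),
          "| share of 1s:", presence_share(values))
```

In this made-up data set, "urban" has no variation and "funding_copy" duplicates "funding", so both would be candidates for removal before loading. The cutoff passed to `dichotomise` is a judgment call; checking `presence_share` on the result shows how balanced the resulting column is.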

Analysis planning: You may also find it useful to plan the types of analysis to be carried out once you have uploaded the data, especially if you have a data set with many attributes and outcomes of interest. One way of planning an analysis is to use a data analysis matrix, as described in detail here.
