Dichotomising variable data

Technical terms

Dichotomisation is the process of converting variable data into binary data. For example, we might have a series of variable measurements, such as the numbers of participants in a set of events: 7, 15, 23, 45, 63, 75, 84, 93. These can then be converted into binary values representing the lower and upper values: 0, 0, 0, 0, 1, 1, 1, 1.
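As a minimal sketch, the conversion above could be written in Python as follows. The function name `dichotomise` and the cut-off of 50 are assumptions made for illustration only; any cut-off between 45 and 63 would give the same result for these numbers.

```python
def dichotomise(values, cutoff):
    """Return 1 for each value above the cut-off and 0 for each value at or below it."""
    return [1 if value > cutoff else 0 for value in values]


participants = [7, 15, 23, 45, 63, 75, 84, 93]

# Hypothetical cut-off, chosen only to separate the lower four values from the upper four.
binary = dichotomise(participants, cutoff=50)
print(binary)  # [0, 0, 0, 0, 1, 1, 1, 1]
```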

There are different technical terms describing this process. In the machine learning field it is described as “binning”, but in the QCA literature it is partially covered by the term “calibration”.

Information and noise

If you Google “dichotomising data” you will find lots of warnings that this is basically a bad idea. Why so? Because if you do so you will lose information. All those fine details of differences between observations will be lost.

But what if you are dealing with something like responses to an attitude survey? Typically these have five-point scales ranging from disagree to neutral to agree, or the like. Quite a few of the fine differences in ratings on this scale may well be nothing more than “noise”, i.e. variations unconnected with the phenomenon you are trying to measure. A more likely explanation is that they reflect differences in respondents’ “response styles”, or something more random still.

Different methods

In order to dichotomise a variable measure, a cut-off point needs to be chosen, above which one value will be assigned (1) and equal to or below which another value will be assigned (0). The choice of a cut-off point can be made by a range of methods.

1. The analyst may have some prior theory in mind which suggests that values above a certain point will have different consequences to those below.

2. Prior experience with similar programs might have already shown that a certain threshold has to be passed before an intervention can have noticeable effects.

3. There may be no prior theory or experience, but an examination of the data might show a significant gap in the distribution, which could be used as the basis for the cut-off point.

4. There might not be such a gap in the distribution of values, in which case the choice might be made to simply use the median value as the cut-off point.

5. The choice of cut-off value might be driven by a value concern, rather than any empirical observations or theories about what the consequences are. For example, it might be decided that all participants should receive at least X amount of an expected benefit.

This last method seems particularly appropriate for dichotomisation of an outcome variable, whereas the theory and experience-based methods (1 & 2 above) seem more appropriate to the dichotomisation of a variable which might have some causal role, i.e. have some consequences for an expected outcome.
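The two data-driven methods (3 and 4 above) can be sketched in Python. This is only one possible reading: “a significant gap” is interpreted here as the widest gap between neighbouring values, with the cut-off placed at its midpoint, and the function names are hypothetical. The data are the participant numbers from the example at the top of this page.

```python
from statistics import median


def median_cutoff(values):
    """Method 4: use the median value as the cut-off point."""
    return median(values)


def widest_gap_cutoff(values):
    """Method 3 (one interpretation): midpoint of the widest gap between sorted values."""
    ordered = sorted(set(values))
    # Pair each value with its next-highest neighbour and record (gap width, midpoint).
    gaps = [(high - low, (low + high) / 2) for low, high in zip(ordered, ordered[1:])]
    return max(gaps, key=lambda gap: gap[0])[1]


participants = [7, 15, 23, 45, 63, 75, 84, 93]
print(median_cutoff(participants))      # 54.0
print(widest_gap_cutoff(participants))  # 34.0 (the gap between 23 and 45)
```

The two methods can give different cut-offs for the same data, as they do here, so the choice between them remains a judgement call.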

An inductive approach

This involves looking at the relationship between the variable data that you need to dichotomise and the outcome of interest. Let’s assume the outcome has already been dichotomised, on the basis of some level of performance that we think is necessary.

What we then want to do is construct a 2 x 2 table, like this:

X is a cut-off value, selected within the range of values that the variable of interest has. Let’s start off with the median value. Based on that we fill in the cells with the number of cases that meet the row and column criteria.

We then calculate the Chi-square statistic, which is a measure of how different the cell values are from what would be expected if the variable and the outcome were unrelated. The bigger the Chi-square value, the more unequal the distribution. The example above has this value: 8.1.
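For readers who want to see the arithmetic, here is a minimal Python sketch of the Chi-square calculation for a 2 x 2 table, without a continuity correction. The counts are hypothetical: they were chosen because they reproduce the value of 8.1, and they are not necessarily the counts in the original table. The row and column layout is also an assumption.

```python
def chi_square_2x2(table):
    """Pearson Chi-square (no continuity correction) for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        return 0.0  # an empty row or column means the split tells us nothing
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the variable and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts for a median split of 18 cases.
# Rows: variable above X, variable at or below X.
# Columns: outcome present, outcome absent.
median_split = [[8, 1],
                [2, 7]]
print(round(chi_square_2x2(median_split), 2))  # 8.1
```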

Then manually vary the X value, choosing a value somewhere above or below the median. Here is another example with a different cut-off value. In this example the Chi-square value is 11.52. The distribution of cases is more unequal.

We could continue varying the X value until we cannot find any other one that has a higher Chi-square value. That is the one we will choose to keep and use. This is because this cut-off value is in effect a good single attribute predictor of the outcome of interest. In the above example it has an 88% accuracy (Accuracy = (TP+TN)/(TP+FP+FN+TN)). This single attribute model now provides us with a good building block for building more complex configurational models, along with other attribute data, when using EvalC3.
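The search described above can also be automated. The following Python sketch tries every observed value (except the highest) as a cut-off and keeps the one with the largest Chi-square value. The data set is invented for illustration: it was constructed so that the median split gives a Chi-square of 8.1 and the best split gives 11.52, matching the values quoted in the text, but it is not the original data. All function names are hypothetical.

```python
def chi_square_2x2(table):
    """Pearson Chi-square (no continuity correction) for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        return 0.0
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(2) for j in range(2)
    )


def cross_tab(values, outcomes, cutoff):
    """Return [[TP, FP], [FN, TN]] for the rule 'value above cut-off predicts outcome 1'."""
    tp = fp = fn = tn = 0
    for value, outcome in zip(values, outcomes):
        if value > cutoff:
            tp, fp = tp + (outcome == 1), fp + (outcome == 0)
        else:
            fn, tn = fn + (outcome == 1), tn + (outcome == 0)
    return [[tp, fp], [fn, tn]]


def best_cutoff(values, outcomes):
    """Try each observed value except the highest; keep the largest Chi-square."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda c: chi_square_2x2(cross_tab(values, outcomes, c)))


# Invented data: 18 cases, each with a measured value and a dichotomised outcome.
values = [5, 8, 12, 15, 19, 22, 26, 30, 33, 37, 41, 46, 52, 58, 63, 71, 78, 90]
outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

cutoff = best_cutoff(values, outcomes)
(tp, fp), (fn, tn) = cross_tab(values, outcomes, cutoff)
chi2 = chi_square_2x2([[tp, fp], [fn, tn]])
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(cutoff, round(chi2, 2), round(accuracy, 2))  # 37 11.52 0.89
```

With these invented data the best cut-off classifies 16 of 18 cases correctly. Note that the Chi-square value does not show the direction of the relationship, so it is worth checking the accuracy as well: a high Chi-square value with low accuracy would mean that values at or below the cut-off, rather than above it, predict the outcome.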

Go here to find an Excel file that you can use to do either a manual or automated search for the best cut-off point with your data.

This method fits with my preferred definition of information, which is ‘a difference that makes a difference’, an idea suggested by Gregory Bateson some decades ago. The first difference is between the upper and lower values on either side of a cut-off point. And the difference it makes is its ability to predict/classify the status of the outcome variable (already dichotomised).
