This is a speculative posting on this subject, prompted by my reading of this useful paper:
Stuart, Elizabeth A. "Matching Methods for Causal Inference: A Review and a Look Forward." Statistical Science: A Review Journal of the Institute of Mathematical Statistics 25, no. 1 (February 1, 2010): 1–21.
When cases (e.g. households) are assigned to a treatment or control group we can treat these labels as outcomes. Using household attribute data we could then develop a predictive model, which would tell us what combination of attributes best predicted whether a household was in the control or the treatment group. We would then know that the two groups differed significantly, at least in the attributes that formed part of that model.
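To make this idea concrete, here is a minimal Python sketch using invented household data and hypothetical attribute names. For each attribute it scores a very simple prediction rule, "predict treatment when the attribute is present", against the actual group labels. An accuracy well above 0.5 signals that the groups differ on that attribute; a real analysis would of course use a proper classifier and cross-validation.

```python
# Invented data: each tuple is (has_electricity, owns_land, group).
# Attribute names and values are hypothetical, purely for illustration.
households = [
    (1, 1, "treatment"),
    (1, 0, "treatment"),
    (1, 1, "treatment"),
    (0, 1, "control"),
    (0, 0, "control"),
    (1, 0, "control"),
]

def single_attribute_accuracy(data, attr_index):
    """Accuracy of the rule: predict 'treatment' when the attribute is 1."""
    correct = 0
    for row in data:
        predicted = "treatment" if row[attr_index] == 1 else "control"
        if predicted == row[-1]:
            correct += 1
    return correct / len(data)

# Chance performance here is 0.5, since there is one household per label
# slot on average; attributes scoring well above that mark a difference.
for i, name in enumerate(["has_electricity", "owns_land"]):
    print(name, single_attribute_accuracy(households, i))
```

In this toy data `has_electricity` predicts membership much better than chance, which is exactly the warning sign the post describes: the two groups are not comparable on that attribute.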
But when designing control and treatment groups we want these two groups to be as similar as possible, not different! So, ideally, in these circumstances, the best performing prediction model will perform no better (and no worse) than chance.
How could we get there, assuming we start off with two groups whose attributes do differ and a best prediction model that does perform better than chance? Well, after discovering such a model, we could adjust group membership by making sure households with the model attributes were evenly assigned across both groups. Then we could seek the next best performing prediction model. If it still does well, adjust group membership again on the same basis. And keep iterating this process until the best model performs no better than chance.
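The iterative procedure just described can be sketched roughly as follows. This is an illustrative Python sketch with invented data, not EvalC3's actual mechanism; it also shows the caveat below in action, since balancing one attribute can create a fresh imbalance on another, which a later iteration then removes.

```python
CHANCE = 0.5  # expected accuracy of guessing group membership at random

def accuracy(data, attr):
    """Accuracy of the rule: predict 'treatment' whenever attr == 1."""
    hits = sum(
        1 for h in data
        if ("treatment" if h[attr] == 1 else "control") == h["group"]
    )
    return hits / len(data)

def best_predictor(data, attrs):
    """Return (attribute, accuracy) for the best single-attribute rule."""
    return max(((a, accuracy(data, a)) for a in attrs), key=lambda t: t[1])

def rebalance(data, attr):
    """Spread households with and without the attribute evenly across groups."""
    for value in (0, 1):
        subset = [h for h in data if h[attr] == value]
        for i, h in enumerate(subset):
            h["group"] = "treatment" if i % 2 == 0 else "control"

# Invented households: binary attributes plus an initial group assignment.
households = [
    {"electricity": 1, "land": 1, "group": "treatment"},
    {"electricity": 1, "land": 0, "group": "treatment"},
    {"electricity": 1, "land": 1, "group": "treatment"},
    {"electricity": 0, "land": 1, "group": "control"},
    {"electricity": 0, "land": 0, "group": "control"},
    {"electricity": 0, "land": 0, "group": "control"},
]
attrs = ["electricity", "land"]

# Iterate: find the best predictor, rebalance on it, repeat until no
# attribute beats chance. Capped in case balance is unreachable.
for _ in range(10):
    attr, acc = best_predictor(households, attrs)
    if acc <= CHANCE:
        break
    rebalance(households, attr)

print(best_predictor(households, attrs))
```

Here the first pass balances `electricity` but leaves `land` lopsided; the second pass fixes `land`, after which no single-attribute rule beats chance and the loop stops.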
Caveat: Removing the cases with the model attributes may be better than re-assigning them, though it would come at the cost of reducing the group size. Re-assigning cases to the other group may generate other differences between the groups, so the process of finding a best but still poorly performing model could take longer. The exclusion strategy would also work where the assignment of treatment and control roles has already been made and outcomes are on their way.
Postscript: It may be that, despite all efforts, it is not possible to make a control and treatment group so similar that no prediction model can be developed that identifies which is which with better than chance performance. But at least we will know how they differ most significantly, i.e. in the contents of the prediction model. With that information we could, with some further investigation, either rule those attributes out as confounding factors or recognise their causal role.
An alternative approach:
Soon it will be possible, within EvalC3, to find matching pairs of cases, one with and one without an outcome of interest. They might not be identical matches, but they will be the best available given the data set at hand. Once a set of these matches has been accumulated, an average measure of similarity can easily be calculated (using Hamming distance). The Excel add-in Solver can then be used to find a sub-set of these matches which best maximises the average similarity measure. Solver uses a genetic, a.k.a. evolutionary, search algorithm.
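As an illustration of the matching and similarity calculation (not EvalC3's actual implementation), this Python sketch pairs each with-outcome case to its nearest without-outcome case by Hamming distance over invented binary attributes, then reports the average similarity. Solver's role would be the further step of searching for the sub-set of pairs that maximises this average.

```python
def hamming(a, b):
    """Number of attribute positions where two cases differ."""
    return sum(x != y for x, y in zip(a, b))

# Invented cases: each tuple is a profile of four binary attributes.
with_outcome = [(1, 0, 1, 1), (0, 1, 1, 0)]
without_outcome = [(1, 0, 0, 1), (1, 1, 1, 0), (0, 0, 0, 0)]

# Pair each with-outcome case to its closest without-outcome case.
pairs = []
for case in with_outcome:
    best = min(without_outcome, key=lambda c: hamming(case, c))
    pairs.append((case, best, hamming(case, best)))

# Convert distances to similarities (1 = identical, 0 = all attributes
# differ) and average across the matched pairs.
n_attrs = 4
avg_similarity = sum(1 - d / n_attrs for _, _, d in pairs) / len(pairs)
print(avg_similarity)
```

In this toy data each case finds a match differing on only one of the four attributes, giving an average similarity of 0.75.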
The normal way to identify members of a control group is randomisation. But randomisation works best when there are large numbers of cases. With smaller numbers of cases, it is less certain that both groups will be comparable on all attributes. So the methods described above may be more useful when working with small numbers of cases.
There may also be times when randomisation has taken place, but you want to check how comparable the membership of each group really is. Developing a predictive model for the membership is one way of doing this.