This is a second post in a series on splitting samples. In this case, say you have a very small sub-group of a large sample. You want to look at that subgroup and controls, but you don’t want your sample to be 90% controls. Instead, you want the subgroup and a sub-sample of controls matched on some demographic variables. As a further complication, lets make one variable (age) continuous, and lets make age and sex correlated with subgroup membership. This example is heavily cribbed from a post by Norbert Köhler.

library(sn)
library(tidyverse); library(ggthemes); theme_set(theme_tufte())
library(ggExtra)
library(pander)
library(MatchIt)
library(simstudy)

Simulation

set.seed(31453)

simdef<-defData(varname="age",
                dist="uniformInt",
                formula="120;144") #age in months between 10-12
simdef<-defData(simdef,varname="sex",
                dist="binary",
                formula=".5")
simdef<-defData(simdef,varname="parent.ed",
                dist="categorical",
                formula=genCatFormula(n=6))
simdef<-defData(simdef,varname="missingdata",
                dist="binary",
                formula=".2")
simdef<-defData(simdef,varname="inSubGroup",
                dist="binary",
                formula=".005/12 * (age-132) + .005*sex + .0175")


df<-genData(12000,simdef)

df$income<-rsn(nrow(df),alpha=3)
numbers_of_bins = 10
df <- df %>%
    mutate(
        # bin i:
        i.bin = cut(income,
                    breaks = unique(quantile(
                        income,
                        probs = seq.int(0, 1, by = 1 / numbers_of_bins)
                    )),
                    include.lowest = TRUE,
                    labels=FALSE)
    )

df<-as.data.frame(df)

factorialize<-c("sex","missingdata","parent.ed","inSubGroup")
df[factorialize] <- lapply(df[factorialize], factor)
levels(df$inSubGroup)<-c("control","treatment")

pander(head(df))

id	age	sex	parent.ed	missingdata	inSubGroup	income	i.bin
1	128	0	4	1	control	1.392	9
2	138	1	5	0	control	-0.02661	1
3	130	1	6	0	control	1.911	10
4	127	1	5	0	control	0.3791	4
5	121	0	5	0	control	0.909	7
6	132	1	1	0	control	1.695	10

pander(table(df$inSubGroup))

control	treatment
11761	239

Is there an imbalance?

We only need to match on age, sex, and one other categorical variable.

imbalance_model <-
    matchit(
        inSubGroup ~ age +sex + i.bin,
        data = df,
        method = NULL,
        distance = "glm"
    )

summary(imbalance_model)

## 
## Call:
## matchit(formula = inSubGroup ~ age + sex + i.bin, data = df, 
##     method = NULL, distance = "glm")
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.0202        0.0199          0.1269     0.9939    0.0346
## age           132.4812      131.8617          0.0850     1.0289    0.0266
## sex0            0.4686        0.5021         -0.0671          .    0.0335
## sex1            0.5314        0.4979          0.0671          .    0.0335
## i.bin           5.3096        5.5039         -0.0672     1.0122    0.0195
##          eCDF Max
## distance   0.0856
## age        0.0600
## sex0       0.0335
## sex1       0.0335
## i.bin      0.0367
## 
## 
## Sample Sizes:
##           Control Treated
## All         11761     239
## Matched     11761     239
## Unmatched       0       0
## Discarded       0       0

Yes, age and sex are imbalanced (which we simulated). So is income!

Nearest Neighbor Matching

Note that it is important to code variable type correctly, i.e. that factors are factors and not numeric.

Sub-sampling

We want 2 controls per ‘treatment’ participant.

matching_model <-
    matchit(
        inSubGroup ~ age + sex + i.bin,
        data = df,
        method = "nearest",
        distance = "glm",
        ratio= 2
    )

summary(matching_model,un=FALSE)

## 
## Call:
## matchit(formula = inSubGroup ~ age + sex + i.bin, data = df, 
##     method = "nearest", distance = "glm", ratio = 2)
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.0202        0.0202               0     1.0021         0
## age           132.4812      132.4812               0     1.0021         0
## sex0            0.4686        0.4686               0          .         0
## sex1            0.5314        0.5314               0          .         0
## i.bin           5.3096        5.3096               0     1.0021         0
##          eCDF Max Std. Pair Dist.
## distance        0               0
## age             0               0
## sex0            0               0
## sex1            0               0
## i.bin           0               0
## 
## Sample Sizes:
##           Control Treated
## All         11761     239
## Matched       478     239
## Unmatched   11283       0
## Discarded       0       0

The distribution parameters should be similar, and the control n should be twice the treated n. Then we save the new data frame:

df.match<-match.data(matching_model)[,1:ncol(df)]

Checking

df.match$inSubGroup<-factor(df.match$inSubGroup,labels=c("Control","Treatment"))
histbygroup<-function(plotdata,xvar="i") {
    ggplot(plotdata,aes_string(x=xvar)) +
        geom_density() +
        facet_grid(~inSubGroup)
}

histbygroup(df.match,"i.bin")

histbygroup(df.match,"age")

barbygroup<-function(plotdata,xvar="i") {
    ggplot(plotdata,aes_string(x=xvar)) +
        geom_bar() +
        facet_grid(~inSubGroup)
}

barbygroup(df.match,"sex")

Nearest neighbor is doing pretty well! Here’s another method for comparison:

Optimal Matching

Sub-sampling

We want 2 controls per ‘treatment’ participant.

matching_model2 <-
    matchit(
        inSubGroup ~ age + sex + i.bin,
        data = df,
        method = "optimal",
        distance = "glm",
        ratio= 2
    )

summary(matching_model2,un=FALSE)

## 
## Call:
## matchit(formula = inSubGroup ~ age + sex + i.bin, data = df, 
##     method = "optimal", distance = "glm", ratio = 2)
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.0202        0.0201          0.0546     1.0202    0.0177
## age           132.4812      132.3891          0.0126     1.0956    0.0176
## sex0            0.4686        0.5209         -0.1048          .    0.0523
## sex1            0.5314        0.4791          0.1048          .    0.0523
## i.bin           5.3096        5.2573          0.0181     1.0224    0.0073
##          eCDF Max Std. Pair Dist.
## distance   0.0607          0.2131
## age        0.0460          0.7938
## sex0       0.0523          0.7085
## sex1       0.0523          0.7085
## i.bin      0.0167          0.9767
## 
## Sample Sizes:
##           Control Treated
## All         11761     239
## Matched       478     239
## Unmatched   11283       0
## Discarded       0       0

The distribution parameters should be similar, and the control n should be twice the treated n. Then we save the new data frame:

df.match2<-match.data(matching_model2)[,1:ncol(df)]

Checking

histbygroup<-function(plotdata,xvar="i") {
    ggplot(plotdata,aes_string(x=xvar)) +
        geom_density() +
        facet_grid(~inSubGroup)
}

histbygroup(df.match2,"i.bin")

histbygroup(df.match2,"age")

barbygroup<-function(plotdata,xvar="i") {
    ggplot(plotdata,aes_string(x=xvar)) +
        geom_bar() +
        facet_grid(~inSubGroup)
}

barbygroup(df.match2,"sex")

Nearest neighbor appears to do a better job