The empirical question of whether a weaker supervisory model can reliably elicit aligned behavior from a stronger model it cannot fully evaluate