We should replace calls to `testing` with individual tests defined with `deftest` (e.g. [here](https://github.com/InferenceQL/inferenceql.query/blob/main/test/inferenceql/query/plan_test.cljc#L25)). Calls to `testing` make pinpointing test failures harder.