Did you ever train with the ViTC?

Hey,
Thanks for sharing such great work. Digging into the code, I noticed that the vision transformer implements a ConvEmbed layer from this [paper](https://arxiv.org/abs/2106.14881). I was wondering whether you have had a chance to train with this layer, given that masking is not as straightforward as with the standard PatchEmbed: the stacked convolutions overlap, so each token can see pixels from neighboring patches.
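
To make the concern concrete, here is a rough sketch comparing the receptive field of a single token under both embeddings. The layer configuration is an assumption on my side (four 3x3 stride-2 convolutions followed by a 1x1 convolution, which is my reading of the paper's stem), not necessarily what ConvEmbed does in this repo, and `receptive_field` is just a hypothetical helper:

```python
def receptive_field(layers):
    """Return (receptive field, total stride) in input pixels
    for a stack of (kernel, stride) convolution layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current jump
        jump *= stride             # distance between adjacent outputs, in input pixels
    return rf, jump


# Standard PatchEmbed: one 16x16 convolution with stride 16.
patch_embed = [(16, 16)]

# Assumed conv stem: four 3x3 stride-2 convs, then a 1x1 projection.
conv_stem = [(3, 2)] * 4 + [(1, 1)]

print(receptive_field(patch_embed))  # (16, 16)
print(receptive_field(conv_stem))    # (31, 16)
```

Under that assumption, both produce one token per 16 pixels, but a conv-stem token covers 31 pixels, so dropping a token after the embedding does not hide the underlying pixels from its neighbors.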
Thanks,
Carlos