In the sample code provided, features are concatenated before being processed by the encoder:
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)
However, when I ran the tokenizers for different modalities, the tokenized shapes were not identical.
For example, an image is tokenized as [B, HW, C] and text is tokenized as [B, 1, C], where C is the embedding dimension (768).
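For what it's worth, a minimal sketch (with made-up shapes, not the repo's actual tokenizers) suggests that concatenating along dim=1 should still work here, since torch.cat on the token axis only requires the batch and embedding dimensions to match; the sequence lengths may differ:

```python
import torch

B, C = 2, 768
image_tokens = torch.randn(B, 16 * 16, C)  # [B, HW, C], e.g. a 16x16 patch grid
text_tokens = torch.randn(B, 1, C)         # [B, 1, C], a single pooled text token

# Concatenation along dim=1 (the token axis) tolerates differing token
# counts as long as batch size and embedding dimension agree.
features = torch.cat([image_tokens, text_tokens], dim=1)
print(features.shape)  # torch.Size([2, 257, 768])
```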
How are we supposed to process this? Also, is there sample code that uses the text tokenizer? It seems like the text_encoder is loading the wrong tokenizer: CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14").