In the sample code provided, features are concatenated before being processed by the encoder:
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)
However, when I ran the tokenizers for different modalities, the tokenized shapes were not identical.
For example, an image is tokenized as [B, HW, C] and text is tokenized as [B, 1, C], where C is the embedding dimension (768).
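For what it's worth, a minimal sketch (with made-up shapes, not the repo's actual tokenizers) suggests that concatenating along dim=1 should still work here, since torch.cat on the token axis only requires the batch and embedding dimensions to match; the sequence lengths may differ:

```python
import torch

B, C = 2, 768
image_tokens = torch.randn(B, 16 * 16, C)  # [B, HW, C], e.g. a 16x16 patch grid
text_tokens = torch.randn(B, 1, C)         # [B, 1, C], a single pooled text token

# Concatenation along dim=1 (the token axis) tolerates differing token
# counts as long as batch size and embedding dimension agree.
features = torch.cat([image_tokens, text_tokens], dim=1)
print(features.shape)  # torch.Size([2, 257, 768])
```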
How are we supposed to process this? Also, is there sample code that uses the text tokenizer? It seems like the text_encoder is loading the wrong tokenizer: CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14").