
Question on concatenating the features #68

@memesoo99

Description


In the sample code provided, features are concatenated before being processed by the encoder.
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)

However, when I ran tokenizers for different modalities, the tokenized shapes were not identical.
For example, an image is tokenized as [B, HW, C] while text is tokenized as [B, 1, C], where C is the embedding dimension (768).
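For concreteness, here is a minimal sketch of the shapes I am seeing (the random tensors stand in for the actual tokenizer outputs, and the sequence lengths are illustrative):

import torch

B, C = 2, 768                            # batch size and embedding dimension
image_tokens = torch.randn(B, 196, C)    # e.g. a 14x14 patch grid -> [B, HW, C]
text_tokens = torch.randn(B, 1, C)       # a single pooled text token -> [B, 1, C]

# torch.cat along dim=1 (the token dimension) only requires the batch and
# embedding dimensions to match, so differing sequence lengths are allowed.
features = torch.cat([image_tokens, text_tokens], dim=1)
print(features.shape)                    # torch.Size([2, 197, 768])

So the concatenation itself goes through; my question is whether stacking sequences of different lengths like this is the intended usage.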

How are we supposed to process this? Also, is there sample code using the text_tokenizer? It seems like the text_encoder is loading the wrong tokenizer: CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14").
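For reference, a minimal sketch of what I assume the text path is doing, using HuggingFace's CLIP text model (this pairing is an assumption on my side, inferred from the tokenizer above); the pooled output would explain the [B, 1, C] shape I observe, while the per-token output has a sequence dimension like the image tokens:

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
outputs = text_model(**inputs)

print(outputs.last_hidden_state.shape)           # [B, seq_len, 768] per-token features
print(outputs.pooler_output.unsqueeze(1).shape)  # [B, 1, 768] pooled, matching what I see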
