segment type ids should be zeros instead of ones (minor update suggestion) #13

Open

caspillaga opened this issue Dec 22, 2021 · 0 comments
Just for the record, in case someone finds it useful or plans to extend it.

Line 48 currently reads:

segment_ids = [1 for x in tokenized_text]

Segment type ids should be zeros, not ones, as implemented there (BERT's convention is sentence A = 0, sentence B = 1). I believe this will not make much difference in practice anyway. Moreover, in the new huggingface library's API this parameter can be omitted entirely and the library creates it automatically, as seen in the code below.
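For context, segment (token type) ids just mark which sentence each token belongs to. A minimal sketch building them by hand, with a hypothetical `build_segment_ids` helper (no library needed), shows why a single-sentence input should be all zeros:

```python
def build_segment_ids(tokens_a, tokens_b=None):
    # Sentence A tokens, plus the leading [CLS] and its [SEP], get segment id 0.
    segment_ids = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        # Sentence B tokens, plus the trailing [SEP], get segment id 1.
        segment_ids += [1] * (len(tokens_b) + 1)
    return segment_ids

# Single-sentence input: all zeros, matching BERT's convention.
print(build_segment_ids(["hello", "world"]))  # [0, 0, 0, 0]
```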

In case someone finds it useful, I also updated the code to a version compatible with the updated library (transformers).

The relevant lines that changed are these (some lines ignored for clarity):

import torch
import numpy as np

from transformers import BertTokenizerFast, BertModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizerFast.from_pretrained(...)
model = BertModel.from_pretrained(....)
LAYER_COUNT = 12+1 # 24+1 for bert-large
FEATURE_COUNT = 768 # 1024 for bert-large
model.eval()

# Tokenize text, preserving PTB tokenized words
# (batch_encode_plus is the public counterpart of _batch_encode_plus)
indexed_tokens = tokenizer.batch_encode_plus(line.split(), add_special_tokens=False, return_token_type_ids=False, return_attention_mask=False)
indexed_tokens = [item for sublist in indexed_tokens['input_ids'] for item in sublist]
indexed_tokens = tokenizer.build_inputs_with_special_tokens(indexed_tokens) # Add [CLS] and [SEP]

# Build batch and run the model
tokens_tensor = torch.tensor([indexed_tokens])
with torch.no_grad():
    encoded_layers = model(input_ids=tokens_tensor, output_hidden_states=True)['hidden_states']

# Note that index and fout come from the loop in the original code, omitted here for clarity
dset = fout.create_dataset(str(index), (LAYER_COUNT, len(indexed_tokens), FEATURE_COUNT))
dset[:,:,:] = np.vstack([np.array(x) for x in encoded_layers])
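To see why the np.vstack line yields the (LAYER_COUNT, tokens, FEATURE_COUNT) array the dataset expects, here is a small sketch with dummy hidden states (zero arrays standing in for the model output):

```python
import numpy as np

LAYER_COUNT = 13     # 12 transformer layers + embedding layer for bert-base
FEATURE_COUNT = 768
seq_len = 5          # hypothetical sentence length

# Fake hidden states: one (batch=1, seq_len, FEATURE_COUNT) array per layer,
# mimicking the tuple returned under 'hidden_states'.
encoded_layers = [np.zeros((1, seq_len, FEATURE_COUNT)) for _ in range(LAYER_COUNT)]

# vstack concatenates along axis 0, so the batch dimension of 1 per layer
# collapses into the layer axis.
features = np.vstack([np.array(x) for x in encoded_layers])
print(features.shape)  # (13, 5, 768)
```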