## Description
- Use the `Tensor` class from openvino-node in `Tokenizer` encode and decode
- Extended the Tensor API
- Added constructors for Tensor
- Moved Tokenizer to a separate TS file
- Use `BigInt` for token ids
- Store the openvino-node addon in `AddonData` so entities can be manipulated from the core part
- Aligned tests with the new API and verified tokenizer and binding behavior
- Updated the benchmark sample to use Tensor-based encode
- Updated documentation: https://retribution98.github.io/openvino.genai/
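
The switch to `BigInt` for token ids matters because int64 tensor data surfaces in JavaScript as a `BigInt64Array`, whose elements are `BigInt`, not `Number`. A minimal sketch of handling such ids in plain Node.js (no openvino-node dependency; the values are made up for illustration):

```js
// Token ids read from an int64 tensor arrive as BigInt (BigInt64Array),
// so comparisons and arithmetic with plain Numbers would throw or mismatch.
const inputIds = new BigInt64Array([1n, 450n, 8991n, 338n, 13328n]);

// Convert to plain numbers where a JS API expects them
// (safe for token ids, which are far below 2^53).
const asNumbers = Array.from(inputIds, Number);
console.log(asNumbers); // [ 1, 450, 8991, 338, 13328 ]
```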
## Ticket
CVS-174909
## Checklist:
- [x] Tests have been updated or added to cover the new code. <!-- If
the change isn't maintenance related, update the tests at
https://github.com/openvinotoolkit/openvino.genai/tree/master/tests or
explain in the description why the tests don't need an update. -->
- [x] This patch fully addresses the ticket. <!--- If follow-up pull
requests are needed, specify in description. -->
- [x] I have made corresponding changes to the documentation. <!-- Run
github.com/\<username>/openvino.genai/actions/workflows/deploy_gh_pages.yml
on your fork with your branch as a parameter to deploy a test version
with the updated content. Replace this comment with the link to the
built docs. -->
---------
Signed-off-by: Kirill Suvorov <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Alicja Miloszewska <[email protected]>
`Tokenizer` has `encode()` and `decode()` methods which support the following arguments: `add_special_tokens`, `skip_special_tokens`, `pad_to_max_length`, `max_length`.
@@ -51,6 +65,11 @@ It can be initialized from the path, in-memory IR representation or obtained fro
 auto tokens = tokenizer.encode("The Sun is yellow because", ov::genai::add_special_tokens(false));
 ```
 </TabItemCpp>
+<TabItemJS>
+```js
+const tokens = tokenizer.encode("The Sun is yellow because", { add_special_tokens: false });
+```
+</TabItemJS>
 </LanguageTabs>

The `encode()` method returns a [`TokenizedInputs`](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.TokenizedInputs.html) object containing `input_ids` and `attention_mask`, both stored as `ov::Tensor`.
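
To show what code consuming that structure looks like, here is a hedged sketch using a stand-in object shaped like `TokenizedInputs` (the literal values and the plain-object layout are illustrative, not the real binding; `getShape()` is the tensor shape accessor used elsewhere in these docs):

```js
// Stand-in for a TokenizedInputs result: input_ids and attention_mask
// behave like tensors exposing a shape and int64 data.
const tokenized = {
  input_ids: {
    getShape: () => [1, 6],
    data: new BigInt64Array([1n, 450n, 8991n, 338n, 13328n, 1363n]),
  },
  attention_mask: {
    getShape: () => [1, 6],
    data: new BigInt64Array([1n, 1n, 1n, 1n, 1n, 0n]),
  },
};

// Count non-padding tokens via the attention mask.
const realTokens = tokenized.attention_mask.data.reduce((n, v) => n + Number(v), 0);
console.log(tokenized.input_ids.getShape(), realTokens); // [ 1, 6 ] 5
```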
@@ -121,4 +140,40 @@ If `pad_to_max_length` is set to true, then instead of padding to the longest se
 // out_shape: [1, 128]
 ```
 </TabItemCpp>
+<TabItemJS>
+```js
+import { Tokenizer } from 'openvino-genai-node';
+
+const tokenizer = new Tokenizer(models_path);
+const prompts = ["The Sun is yellow because", "The"];
+let tokens;
+
+// Since the prompts are shorter than the maximal length (taken from the IR), it does not affect the shape.
+// The resulting shape is defined by the length of the longest token sequence.
+// Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="longest", truncation=True)
+tokens = tokenizer.encode(["The Sun is yellow because", "The"]);
+// which is equivalent to
+tokens = tokenizer.encode(["The Sun is yellow because", "The"], { pad_to_max_length: false });
+console.log(tokens.input_ids.getShape());
+// out_shape: [2, 6]
+
+// The resulting tokens tensor will be padded to 1024; sequences exceeding this length will be truncated.
+// Equivalent of HuggingFace hf_tokenizer.encode(prompt, padding="max_length", truncation=True, max_length=1024)
+tokens = tokenizer.encode([
+  "The Sun is yellow because",
+  "The",
+  "The longest string ever".repeat(2000),
+], {
+  pad_to_max_length: true,
+  max_length: 1024,
+});
+console.log(tokens.input_ids.getShape());
+// out_shape: [3, 1024]
+
+// For single string prompts, truncation and padding are also applied.
+tokens = tokenizer.encode("The Sun is yellow because", { pad_to_max_length: true, max_length: 128 });
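
The shape rules exercised above can be summarized in a few lines of plain JavaScript. This is a sketch of the described padding/truncation semantics, not the library's implementation; `paddedShape` and its parameters are hypothetical names:

```js
// pad_to_max_length=false → pad to the longest sequence in the batch
//                           (still truncated at max_length);
// pad_to_max_length=true  → pad or truncate every sequence to max_length.
function paddedShape(tokenLengths, { pad_to_max_length = false, max_length = 1024 } = {}) {
  const target = pad_to_max_length
    ? max_length
    : Math.max(...tokenLengths.map((len) => Math.min(len, max_length)));
  return [tokenLengths.length, target]; // [batch, sequence]
}

console.log(paddedShape([6, 2]));                                    // [ 2, 6 ]
console.log(paddedShape([6, 2, 9000], { pad_to_max_length: true })); // [ 3, 1024 ]
```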