1 file changed: +4 −19 lines changed

@@ -1,7 +1,6 @@
 {
  "cells": [
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -35,8 +34,9 @@
     "\n",
     "## Tokenizer libraries by language\n",
     "\n",
-    "For `cl100k_base` and `p50k_base` encodings, `tiktoken` is the only tokenizer available as of March 2023.\n",
+    "For `cl100k_base` and `p50k_base` encodings:\n",
     "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
+    "- .NET / C#: [SharpToken](https://github.com/dmitry-brazhenko/SharpToken)\n",
     "\n",
     "For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages.\n",
     "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
@@ -54,7 +54,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -88,7 +87,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -105,7 +103,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -126,7 +123,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -143,7 +139,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -152,7 +147,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -180,7 +174,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -221,15 +214,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 4. Turn tokens into text with `encoding.decode()`"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -257,15 +248,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -293,15 +282,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "(The `b` in front of the strings indicates that the strings are byte strings.)"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -424,7 +411,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -549,7 +535,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "openai",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -563,9 +549,8 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-    "version": "3.9.9"
+    "version": "3.7.3"
   },
-  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"