1 file changed: +4 −19 lines changed

@@ -1,7 +1,6 @@
 {
  "cells": [
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -35,8 +34,9 @@
     "\n",
     "## Tokenizer libraries by language\n",
     "\n",
-    "For `cl100k_base` and `p50k_base` encodings, `tiktoken` is the only tokenizer available as of March 2023.\n",
+    "For `cl100k_base` and `p50k_base` encodings:\n",
     "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)\n",
+    "- .NET / C#: [SharpToken](https://github.com/dmitry-brazhenko/SharpToken)\n",
     "\n",
     "For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages.\n",
     "- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))\n",
@@ -54,7 +54,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -88,7 +87,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -105,7 +103,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -126,7 +123,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -143,7 +139,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -152,7 +147,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -180,7 +174,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -221,15 +214,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 4. Turn tokens into text with `encoding.decode()`"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -257,15 +248,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries."
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -293,15 +282,13 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "(The `b` in front of the strings indicates that the strings are byte strings.)"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -424,7 +411,6 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -549,7 +535,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "openai",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -563,9 +549,8 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-    "version": "3.9.9"
+    "version": "3.7.3"
   },
-  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"