Ktoken

Ktoken is a BPE tokenizer designed for seamless integration with OpenAI's models.

📦 Setup

Install Ktoken by adding the dependency to your build.gradle file:

repositories {
    mavenCentral()
}

dependencies {
    implementation "com.aallam.ktoken:ktoken:0.4.0"
}

⚡️ Getting Started

val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE)
// For a specific model in the OpenAI API:
val tokenizer = Tokenizer.of(model = "gpt-4")

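// Encode text into token ids, and decode token ids back into text:
val tokens = tokenizer.encode("hello world")      // [15339, 1917]
val text = tokenizer.decode(listOf(15339, 1917))  // "hello world"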
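
A common use of encode is counting tokens before sending text to a model. The following is a minimal sketch of that idea, not part of Ktoken: `countTokens` and `fitsBudget` are hypothetical helpers, and a stand-in encoder replaces the real tokenizer so the snippet runs without dependencies. It assumes only that encode maps a string to a list of token ids, as in the example above.

```kotlin
// Hypothetical helper, not part of Ktoken: counts tokens using any encoder
// that maps text to a list of token ids.
fun countTokens(text: String, encode: (String) -> List<Int>): Int =
    encode(text).size

// Hypothetical helper: true when the text fits within maxTokens.
fun fitsBudget(text: String, maxTokens: Int, encode: (String) -> List<Int>): Boolean =
    countTokens(text, encode) <= maxTokens

fun main() {
    // Stand-in encoder so the sketch runs on its own: one fake id per
    // whitespace-separated word. With Ktoken you would pass
    // { tokenizer.encode(it) } instead.
    val fakeEncode: (String) -> List<Int> = { s ->
        s.split(Regex("\\s+")).filter { it.isNotEmpty() }.indices.toList()
    }

    println(countTokens("hello world", fakeEncode))    // 2 with the stand-in encoder
    println(fitsBudget("hello world", 1, fakeEncode))  // false: 2 tokens exceed a budget of 1
}
```

Passing the encoder as a function keeps the helpers independent of how the tokenizer was created (by encoding or by model name).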

⚙️ Usage Modes

Ktoken operates in two modes: Local (default for JVM) and Remote (default for JS/Native).

📍 Local Mode

Utilize LocalPbeLoader to retrieve encodings from local files:

val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.SYSTEM))
// For a specific model in the OpenAI API:
val tokenizer = Tokenizer.of(model = "gpt-4", loader = LocalPbeLoader(FileSystem.SYSTEM))

JVM Specifics:

Artifacts for JVM include encoding files. Use FileSystem.RESOURCES to load them:

val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = LocalPbeLoader(FileSystem.RESOURCES))

Note: this is the default behavior for JVM.

🌐 Remote Mode

  1. Add Engine: Add one of Ktor's client engines to your dependencies.
  2. Use RemoteBpeLoader: Load encodings from remote sources:

val tokenizer = Tokenizer.of(encoding = Encoding.CL100K_BASE, loader = RemoteBpeLoader())

// For a specific model in the OpenAI API:
val tokenizer = Tokenizer.of(model = "gpt-4", loader = RemoteBpeLoader())

📋 BOM Usage

Alternatively, you can use ktoken-bom to keep dependency versions aligned. Add the following to your build.gradle file:

dependencies {
    // Import the Ktoken BOM
    implementation platform('com.aallam.ktoken:ktoken-bom:0.4.0')

    // Define dependencies without versions
    implementation 'com.aallam.ktoken:ktoken'
    runtimeOnly 'io.ktor:ktor-client-okhttp'
}

🔀 Multiplatform Projects

For multiplatform projects, add the ktoken dependency to commonMain, and select an engine for each target.
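
The setup described above might look like the following in a build.gradle.kts file. This is a sketch under stated assumptions, not the project's official configuration: it uses the Gradle Kotlin DSL rather than the Groovy syntax shown elsewhere in this README, the choice of jvm and js targets is illustrative, and the Ktor version is a placeholder to replace with the one your project uses.

```kotlin
// build.gradle.kts (fragment): assumes the Kotlin Multiplatform plugin is already applied.

// Assumption: placeholder version, replace with the Ktor version your project uses.
val ktorVersion = "2.3.12"

kotlin {
    // Illustrative targets; declare the ones your project needs.
    jvm()
    js(IR) { browser() }

    sourceSets {
        val commonMain by getting {
            dependencies {
                // Ktoken goes into the shared source set.
                implementation("com.aallam.ktoken:ktoken:0.4.0")
            }
        }
        val jvmMain by getting {
            dependencies {
                // One Ktor client engine per target: OkHttp on the JVM.
                implementation("io.ktor:ktor-client-okhttp:$ktorVersion")
            }
        }
        val jsMain by getting {
            dependencies {
                // Ktor's JS engine for the browser target.
                implementation("io.ktor:ktor-client-js:$ktorVersion")
            }
        }
    }
}
```

An engine is only needed on targets that load encodings remotely; a JVM target relying on the bundled encoding files may not need one.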

📄 License

Ktoken is open-source software distributed under the MIT license. This project is not affiliated with or endorsed by OpenAI.
