fix(feature-extraction): support multilingual / XLM-R sentence embedders #33

Merged
s-zx merged 1 commit into main from fix/multilingual-sentence-embedder on May 12, 2026
@s-zx s-zx commented May 12, 2026

Two changes that unblock loading non-English sentence embedders such as paraphrase-multilingual-MiniLM-L12-v2:

  1. PipelineConfig.tokenizerUrl — make the tokenizer URL overridable. The constructor previously hard-coded all-MiniLM-L6-v2's English-only tokenizer.json regardless of which model URL the user passed in.

  2. Conditional token_type_ids — only emit this input when the loaded ONNX model declares it (looked up from session.inputNames via ModelMetadata). XLM-R, RoBERTa and multilingual MiniLM omit token_type_ids; unconditionally feeding it tripped onnxruntime-web with "invalid input" errors.
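The two changes can be sketched as follows. This is a minimal illustration, not the library's actual code: `PipelineConfigSketch`, `DEFAULT_TOKENIZER_URL`, `resolveTokenizerUrl`, `EncodedInput`, and `buildFeeds` are hypothetical names, the default URL is a placeholder, and plain number arrays stand in for the int64 tensors a real onnxruntime-web session would expect.

```typescript
// Hypothetical config shape; the real PipelineConfig may carry more fields.
interface PipelineConfigSketch {
  modelUrl: string;
  tokenizerUrl?: string; // optional override, as described in change 1
}

// Placeholder default, NOT the library's real URL.
const DEFAULT_TOKENIZER_URL =
  "https://example.invalid/all-MiniLM-L6-v2/tokenizer.json";

// Change 1: use the caller's tokenizer URL when given, else fall back.
function resolveTokenizerUrl(config: PipelineConfigSketch): string {
  return config.tokenizerUrl !== undefined
    ? config.tokenizerUrl
    : DEFAULT_TOKENIZER_URL;
}

// Hypothetical tokenizer output; real code would hold typed tensors.
interface EncodedInput {
  inputIds: number[];
  attentionMask: number[];
  tokenTypeIds: number[];
}

// Change 2: only emit token_type_ids when the loaded model declares it.
// `inputNames` stands in for the list read from session.inputNames.
function buildFeeds(
  inputNames: readonly string[],
  encoded: EncodedInput
): Record<string, number[]> {
  const feeds: Record<string, number[]> = {
    input_ids: encoded.inputIds,
    attention_mask: encoded.attentionMask,
  };
  const wantsTokenTypes = inputNames.some((n) => n === "token_type_ids");
  if (wantsTokenTypes) {
    feeds.token_type_ids = encoded.tokenTypeIds;
  }
  return feeds;
}
```

With an XLM-R style model whose inputs are only `input_ids` and `attention_mask`, `buildFeeds` returns two entries and the session never sees an undeclared input; a BERT style model that lists `token_type_ids` still receives all three.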

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vercel Bot commented May 12, 2026

Project Deployment Actions Updated (UTC)
edge-flow-js Error Error May 12, 2026 11:39am

@s-zx s-zx merged commit ea82d43 into main May 12, 2026
1 of 3 checks passed
s-zx added a commit that referenced this pull request May 12, 2026
Bumps to 0.2.0 (minor — new PipelineConfig.tokenizerUrl field shipped
in #33 is additive backwards-compatible API surface).

Also clears two pre-existing build/test failures that were blocking the
release pipeline (unrelated to #33):

- src/pipelines/question-answering.ts — remove dead private method
  tokenOffsetToCharOffset; TS6133 under noUnusedLocals.
- tests/unit/runtime.test.ts — registerAllBackends() is sync void, so
  expect(...).resolves on its return value crashed vitest. Switch to
  expect(() => registerAllBackends()).not.toThrow().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
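
The test fix described in the commit message above can be illustrated without a test runner. This is a sketch under stated assumptions: `registerAllBackends` here is a hypothetical stand-in with the same synchronous `void` signature, and `isThenable` and `doesNotThrow` are illustrative helpers that mirror the intent of `resolves` (which needs a promise) and of `expect(fn).not.toThrow()` (which needs a function to call).

```typescript
// Hypothetical stand-in for the library function: synchronous, returns void.
function registerAllBackends(): void {
  // Backend registration side effects would happen here.
}

// True only for promise-like values, which is what `resolves` requires.
function isThenable(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { then?: unknown }).then === "function"
  );
}

// Mirrors the intent of `expect(fn).not.toThrow()`: call fn, report success.
function doesNotThrow(fn: () => void): boolean {
  try {
    fn();
    return true;
  } catch (err) {
    return false;
  }
}

// The return value is undefined, so there is no promise to await.
const returned: unknown = registerAllBackends();
const returnedIsPromise = isThenable(returned);

// Wrapping the call in a function lets the check observe a thrown error.
const registeredCleanly = doesNotThrow(() => registerAllBackends());
```

The design point is that a synchronous function is checked by invoking it inside a wrapper, not by inspecting its return value.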