@GregoryComer GregoryComer commented Aug 27, 2025

This PR adds support for building with Clang-CL on Windows, as well as CI to cover this.

Changes:

  • Update CMake to not pass -Wno-attributes on Windows. MSVC and Clang-CL don't accept this option.
  • Alias ssize_t to SSIZE_T on Windows.
  • Update the CMake build in setup.py to use Clang-CL and a namespaced extension name (this makes it work without playing with paths).
  • Explicitly call .string() on std::filesystem::paths. The implicit conversion works on Clang/GCC but not with MSVC/Clang-CL.
  • Add a Windows job to build and run tests.
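
To illustrate the std::filesystem::path change above: with MSVC and Clang-CL, the path's native string type is std::wstring, so the implicit conversion yields a wide string and cannot bind to a std::string parameter. A minimal sketch, where describe_file and describe_model are hypothetical helpers for illustration and not functions from this PR:

```cpp
#include <filesystem>
#include <string>

// Hypothetical stand-in for any API that accepts a path as std::string.
std::string describe_file(const std::string& path) {
  return "loading " + path;
}

// Hypothetical caller that builds a path and passes it along.
std::string describe_model(const std::filesystem::path& dir) {
  const std::filesystem::path model = dir / "tokenizer.model";
  // Writing describe_file(model) relies on path's implicit conversion to its
  // native string type. That type is std::string on Linux/macOS but
  // std::wstring with the MSVC standard library, so the call fails to compile
  // there. Calling .string() explicitly is portable.
  return describe_file(model.string());
}
```

Calling describe_model("data") produces a message containing the joined path; the separator in that path is platform-dependent.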

Fixes #111.

Test Plan:
I've added CI to run native and Python unit tests on Windows. I've also verified locally that editable and non-editable installs work on both Linux and Windows.

@meta-cla meta-cla bot added the CLA Signed label Aug 27, 2025
@GregoryComer GregoryComer force-pushed the windows-support branch 14 times, most recently from ded06d3 to c060226 Compare August 29, 2025 21:31
@@ -49,7 +51,7 @@ endif()

add_subdirectory(
${CMAKE_CURRENT_SOURCE_DIR}/third-party/sentencepiece
${CMAKE_CURRENT_BINARY_DIR}/sentencepiece-build
Contributor Author

This is to resolve path length limitations in CI. Despite having long paths enabled on the system, there are limitations in the VS build tools. The extension build has a deeply nested structure that exceeds the allowable length by a few characters.

@@ -124,7 +124,11 @@ TEST_F(TiktokenTest, ConstructionWithInvalidBOSIndex) {
std::vector<std::string>{"<|end_of_text|>"}),
1,
0),
#if !GTEST_OS_WINDOWS
::testing::KilledBySignal(SIGABRT),
Contributor Author

::testing::KilledBySignal is not available on Windows, because death tests there do not use signals.

@GregoryComer GregoryComer force-pushed the windows-support branch 5 times, most recently from f46a3e8 to b7e47a5 Compare August 29, 2025 23:38
@GregoryComer GregoryComer marked this pull request as ready for review August 30, 2025 00:01
@GregoryComer
Contributor Author

CI is now green and the infra issue with Windows runners is resolved, so I'm opening this for review.

#ifdef _WIN32
// ssize_t isn't available on Windows. Alias it to the Windows SSIZE_T value.
#include <BaseTsd.h>
typedef SSIZE_T ssize_t;
Contributor Author

In hindsight, this should probably be typedefed as TK_SSIZE_T or something similar, as this is a public header. I'll update this before landing.

Contributor Author

I added a new compiler.h header for platform- and toolchain-specific defs, following the pattern used in ET. I was originally going to introduce a new typedef or macro for TK_SSIZE_T, but looking at other projects, it seems like just directly typedefing ssize_t is common, so I've done that. If anyone has strong opinions, I'm happy to change it.
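
For reference, the aliasing pattern discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions and not the actual contents of compiler.h; find_byte is a hypothetical helper included to show a signed return value in use.

```cpp
#include <cstddef>
#include <type_traits>

#ifdef _WIN32
// ssize_t is a POSIX type and is not provided by the Windows SDK headers.
// SSIZE_T from BaseTsd.h is the pointer-sized signed equivalent.
#include <BaseTsd.h>
typedef SSIZE_T ssize_t;
#else
// POSIX platforms declare ssize_t here.
#include <sys/types.h>
#endif

// Sanity checks: the alias should be signed and as wide as size_t.
static_assert(std::is_signed<ssize_t>::value, "ssize_t must be signed");
static_assert(sizeof(ssize_t) == sizeof(std::size_t),
              "ssize_t should match size_t in width");

// Hypothetical helper: index of the first match, or -1 if absent.
ssize_t find_byte(const char* data, std::size_t len, char needle) {
  for (std::size_t i = 0; i < len; ++i) {
    if (data[i] == needle) {
      return static_cast<ssize_t>(i);
    }
  }
  return -1;
}
```

Because the alias lives in the global namespace of a public header, it can collide with another library that defines the same name on Windows, which is the trade-off against a prefixed name such as TK_SSIZE_T.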

setup(
name="pytorch-tokenizers",
version="0.1.0",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/meta-pytorch/tokenizers",
packages=find_packages(),
-ext_modules=[CMakeExtension("pytorch_tokenizers_cpp")],
+ext_modules=[CMakeExtension("pytorch_tokenizers.pytorch_tokenizers_cpp")],
Contributor

What is this for? Is it another Windows nuance?

Contributor Author

@GregoryComer GregoryComer Sep 2, 2025

From what I understand, this is the way it should be, as it handles the namespacing correctly. The extension name should be the fully qualified name of the extension module. We were doing something slightly weird in that we were updating the CMake build to tell it to dump the library into a different directory (appending to extdir). But with this change, it just works on all platforms.

@GregoryComer GregoryComer merged commit 4ed91cc into meta-pytorch:main Sep 2, 2025
6 checks passed
Successfully merging this pull request may close these issues.

Tokenizers library does not build on Windows with Clang