
Commitment to keep package size down #1942

Open
MarcoGorelli opened this issue Feb 5, 2025 · 3 comments
Labels: build, community, help wanted

Comments

@MarcoGorelli
Member

Narwhals started off with the objective of being a lightweight compatibility layer. As we've been adding features and supported backends, the package size has been growing.

There's a lot of essential dataframe functionality, and a lot of libraries that people want to support, so some increase in size since the earliest days is expected. But we do need to monitor it: the size needs to stay under control, and Narwhals needs to stay lightweight.

Commitment: I'd like to suggest a hard commitment that:

  • The Narwhals wheel size will never go above 500 kB. It's currently 305 kB.
  • Narwhals' size on disk will never go above 5000 kB (i.e. the difference in a virtual environment's size before vs. after running pip install narwhals; this includes some cached files which Python generates, but I still think it's good to monitor the overall size). It's currently 3789 kB.
  • Never introduce any required dependencies or compiled code.
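The second bullet (on-disk footprint) can also be measured without diffing a virtual environment, by summing the sizes of an installed distribution's files. A minimal sketch using only the standard library; `"pip"` is just a stand-in distribution name, since any installed package works:

```python
# Sketch: report the on-disk size of an installed distribution,
# using importlib.metadata (stdlib, Python 3.8+).
from importlib.metadata import distribution

def installed_size_kb(dist_name: str) -> float:
    dist = distribution(dist_name)
    total = 0
    # dist.files can be None for some install layouts, hence the fallback.
    for f in dist.files or []:
        path = dist.locate_file(f)
        if path.is_file():
            total += path.stat().st_size
    return total / 1000

# "pip" as a stand-in; swap in "narwhals" once it is installed.
print(f"{installed_size_kb('pip'):.0f} kB")
```

Note this counts only the files recorded in the distribution's metadata, so it slightly undercounts the venv-diff number from the bullet above (which also picks up generated `.pyc` caches).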

#1886 will probably increase our size a bit more. I think that's OK, as Ibis is a library that a few maintainers have said that they want to support. But it does bring us closer to the limits.

Some strategies to reduce size are:

Any help towards this goal would be appreciated - thank you, and thank you to everyone who has contributed in any way to Narwhals 🙏

@dangotbanned
Member

dangotbanned commented Feb 7, 2025

This might only be a negligible win, but # type: ignore[no-any-return] shows up 159 times in narwhals.

I think you could remove these by adding warn_return_any = false here:

narwhals/pyproject.toml

Lines 211 to 213 in 78f8c0a

[tool.mypy]
strict = true

That rule seems like a poor fit for narwhals anyway.
You'd need to remove all those comments afterwards, though, to avoid 159 errors like:

narwhals/stable/v1/__init__.py:2325: error: Unused "type: ignore" comment  [unused-ignore]
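For concreteness, the suggested change is a one-line addition to the quoted `[tool.mypy]` table (a sketch; the option name is from mypy's configuration reference, and the surrounding settings are assumed unchanged):

```toml
[tool.mypy]
strict = true
# Disable the strict-mode check that flags returning Any from a typed
# function; this is what currently forces the 159
# "# type: ignore[no-any-return]" comments.
warn_return_any = false
```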

@adrinjalali

This is an interesting one. Thinking about it, I'm not sure how much trying to save on whitespace characters and docstrings would help, compared to other packages which take about 100 MB after installation.

I very much appreciate Narwhals being very lightweight, but I'd say as long as it doesn't have any dependencies, it's VERY lightweight, almost independently of savings on little things.

As for including features only if a use case is there, I think of it as a double-edged sword:

  • on one hand, you're right that it makes sense to have some sort of a compass on what to include and what not to include
  • on the other hand, if we include features only when they come up, then we'd be in the situation where, if package X needs a feature in Narwhals, they'd need to trigger its inclusion, then wait for a release, and even then they'd need to "depend" on a very recent Narwhals release. Let's say the feature involves support for library Y: does the feature support only the latest release of library Y, or does every feature support multiple versions of the corresponding library from the start? Requiring very recent minimum dependency versions makes dependency resolution hell.

So I guess the tl;dr here for me is:

  • yes to no dependency
  • yes to having some sort of an "inclusion criteria"
  • yes to not being a 100MB kinda package

but I wouldn't worry about the package having a 5MB download size.

@MarcoGorelli
Member Author

Thanks for your comments, much appreciated!

Maybe we don't indeed need a strict cap, but I would like to keep closely monitoring size - I don't think any dataframe library started thinking they'd get 400MB wheels, but it is where things tend to go if unchecked (seriously, the PySpark 4.0 wheel is >400MB, wut 🤯 https://pypi.org/project/pyspark/4.0.0.dev2/#files)

I'd still like to suggest slowing down on new features, so that we can focus on:

  • static typing, which has helped find useful issues
  • making tests more consistent, so it's easier to add backends
  • refactoring ExprKind parsing, so there's less duplication of logic for when expressions need broadcasting
  • setting up performance tracking, and checking in CI that overhead for some common operations is below x %
  • better docs, better organised API reference, more varied tutorials
  • setting things up for stable.v2 (especially, determining how to support order-dependent operations for SQL-like backends). We should at least have a POC ready for SQLFrame / PySpark

We can then resume adding features (filling out the .list namespace would be very nice, for example). But for the past year, things have moved very fast, and we only have limited attention and time, so I think we should spend some focused time on "important but not urgent" tasks like the ones above before expanding further.
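The "overhead below x %" CI check mentioned above could look something like the following sketch. Everything here is hypothetical: `direct` and `wrapped` are stand-ins for, say, a native pandas call vs. the same call routed through a compatibility layer, and the threshold logic is only illustrative:

```python
# Sketch of a CI overhead check: time an operation directly and through a
# thin wrapper, then compare the relative overhead against a budget.
import timeit

def direct(xs):
    return sum(xs)

def wrapped(xs):
    # Extra indirection standing in for a compatibility-layer code path.
    return direct(list(xs))

data = list(range(1_000))
# min() over repeats is the conventional way to reduce timing noise.
t_direct = min(timeit.repeat(lambda: direct(data), number=1_000, repeat=5))
t_wrapped = min(timeit.repeat(lambda: wrapped(data), number=1_000, repeat=5))

overhead_pct = 100 * (t_wrapped - t_direct) / t_direct
print(f"overhead: {overhead_pct:.1f}%")
# In CI one would fail the job when overhead_pct exceeds the chosen budget.
```

Microbenchmarks like this are noisy on shared CI runners, so in practice the budget would need generous headroom or a dedicated tracking service.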
