
Commitment to keep package size down #1942

Open
MarcoGorelli opened this issue Feb 5, 2025 · 3 comments
Labels: build, community, help wanted

Comments

@MarcoGorelli
Member

Narwhals started off with the objective of being a lightweight compatibility layer. As we've been adding features and supported backends, the package size has been growing.

There's a lot of essential dataframe functionality, and a lot of libraries that people want to support, so some increase in size since the earliest days is expected. But we do need to monitor it: the size needs to stay under control, and Narwhals needs to stay lightweight.

Commitment: I'd like to suggest a hard commitment that:

  • The Narwhals wheel size will never go above 500 kB. It's currently 305 kB.
  • Narwhals' size on disk will never go above 5000 kB (i.e. the difference in a virtual environment's size before vs. after running pip install narwhals; this includes some cached files which Python generates, but I still think it's good to monitor the overall size). It's currently 3789 kB.
  • Never introduce any required dependencies or compiled code.
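The second bullet (on-disk footprint) can also be measured without diffing a virtual environment, by summing the sizes of an installed distribution's files. A minimal sketch using only the standard library; `"pip"` is just a stand-in distribution name, since any installed package works:

```python
# Sketch: report the on-disk size of an installed distribution,
# using importlib.metadata (stdlib, Python 3.8+).
from importlib.metadata import distribution

def installed_size_kb(dist_name: str) -> float:
    dist = distribution(dist_name)
    total = 0
    # dist.files can be None for some install layouts, hence the fallback.
    for f in dist.files or []:
        path = dist.locate_file(f)
        if path.is_file():
            total += path.stat().st_size
    return total / 1000

# "pip" as a stand-in; swap in "narwhals" once it is installed.
print(f"{installed_size_kb('pip'):.0f} kB")
```

Note this counts only the files recorded in the distribution's metadata, so it slightly undercounts the venv-diff number from the bullet above (which also picks up generated `.pyc` caches).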

#1886 will probably increase our size a bit more. I think that's OK, as Ibis is a library that a few maintainers have said that they want to support. But it does bring us closer to the limits.

Some strategies to reduce size are:

Any help towards this goal would be appreciated - thank you, and thank you to everyone who has contributed in any way to Narwhals 🙏

@dangotbanned
Member

dangotbanned commented Feb 7, 2025

This might only be a negligible win, but # type: ignore[no-any-return] shows up 159 times in narwhals.

I think you could remove these by adding warn_return_any = false here:

narwhals/pyproject.toml

Lines 211 to 213 in 78f8c0a

[tool.mypy]
strict = true

That rule seems like a poor fit for narwhals anyway.
You'd need to remove all those comments afterwards, though, to avoid 159 errors like:

narwhals/stable/v1/__init__.py:2325: error: Unused "type: ignore" comment  [unused-ignore]
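For concreteness, the suggested change is a one-line addition to the quoted `[tool.mypy]` table (a sketch; the option name is from mypy's configuration reference, and the surrounding settings are assumed unchanged):

```toml
[tool.mypy]
strict = true
# Disable the strict-mode check that flags returning Any from a typed
# function; this is what currently forces the 159
# "# type: ignore[no-any-return]" comments.
warn_return_any = false
```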

@adrinjalali

This is an interesting one. Thinking about it, I'm not sure how much trying to save on whitespace characters and docstrings would help, compared to other packages which take about 100 MB after installation.

I very much appreciate Narwhals being very lightweight, but I'd say as long as it doesn't have any dependencies, it's VERY lightweight, almost independently of savings on little things.

As for including features only if a use case is there, I think of it as a double-edged sword:

  • on one hand, you're right that it makes sense to have some sort of a compass on what to include and what not to include
  • on the other hand, if we include features only when they come up, then we'd be in the situation where, if package X needs a feature in Narwhals, they'd need to trigger its inclusion, then wait for a release, and even then they'd need to "depend" on a very recent Narwhals release. Let's say the feature involves support for library Y: does the feature support only the latest release of library Y, or does every feature support multiple versions of the corresponding library from the start? Requiring very recent minimum dependency versions makes dependency resolution hell.

So I guess the tl;dr here for me is:

  • yes to no dependency
  • yes to having some sort of an "inclusion criteria"
  • yes to not being a 100MB kinda package

but I wouldn't worry about the package having a 5MB download size.

@MarcoGorelli
Member Author

Thanks for your comments, much appreciated!

Maybe we don't indeed need a strict cap, but I would like to keep closely monitoring size - I don't think any dataframe library started thinking they'd get 400MB wheels, but it is where things tend to go if unchecked (seriously, the PySpark 4.0 wheel is >400MB, wut 🤯 https://pypi.org/project/pyspark/4.0.0.dev2/#files)

I'd still like to suggest slowing down on new features, so that we can focus on:

  • static typing, which has helped find useful issues
  • making tests more consistent, so it's easier to add backends
  • refactoring ExprKind parsing, so there's less duplication of logic for when expressions need broadcasting
  • setting up performance tracking, and checking in CI that overhead for some common operations is below x %
  • better docs, better organised API reference, more varied tutorials
  • setting things up for stable.v2 (especially, determining how to support order-dependent operations for SQL-like backends). We should at least have a POC ready for SQLFrame / PySpark

We can then resume adding features (filling out the .list namespace would be very nice, for example). But for the past year, things have moved very fast, and we only have limited attention and time, so I think we should spend some focused time on "important but not urgent" tasks like the ones above before expanding further.
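The "overhead below x %" CI check mentioned above could look something like the following sketch. Everything here is hypothetical: `direct` and `wrapped` are stand-ins for, say, a native pandas call vs. the same call routed through a compatibility layer, and the threshold logic is only illustrative:

```python
# Sketch of a CI overhead check: time an operation directly and through a
# thin wrapper, then compare the relative overhead against a budget.
import timeit

def direct(xs):
    return sum(xs)

def wrapped(xs):
    # Extra indirection standing in for a compatibility-layer code path.
    return direct(list(xs))

data = list(range(1_000))
# min() over repeats is the conventional way to reduce timing noise.
t_direct = min(timeit.repeat(lambda: direct(data), number=1_000, repeat=5))
t_wrapped = min(timeit.repeat(lambda: wrapped(data), number=1_000, repeat=5))

overhead_pct = 100 * (t_wrapped - t_direct) / t_direct
print(f"overhead: {overhead_pct:.1f}%")
# In CI one would fail the job when overhead_pct exceeds the chosen budget.
```

Microbenchmarks like this are noisy on shared CI runners, so in practice the budget would need generous headroom or a dedicated tracking service.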
