
Conversation

@tylerriccio33
Contributor

Summary

This method checks the percentage of null values in a column of a dataset. I added some tests and documentation. I'm seeking review, since navigating tolerance can be confusing.

I can add examples for doctests, if you guys use those?

The big thing missing is the icon, which prohibits usage with get_tabular_report. See the xfail test I added, which will pass once the icon is in place. Is there software you use to generate those? Thanks.

Related GitHub Issues and PRs

Checklist

  • [x] I understand and agree to the Code of Conduct.
  • [x] I have followed the Style Guide for Python Code as best as possible for the submitted code.
  • [x] I have added pytest unit tests for any new functionality.

@tylerriccio33
Contributor Author

CI passed except for pre-commit, which failed on formatting; you can just override that. Maybe my ruff version is out of sync.

@tylerriccio33
Contributor Author

Any update here?

@rich-iannone
Member

Sorry again for the delay in responding. After reviewing the PR, I want to discuss an alternative approach that I think might work better. Essentially, Pointblank already has the tools to accomplish what col_vals_pct_null() does (existing validation methods combined with our composable features).

Option 1: Using col_vals_not_null() with thresholds

Since you want to check that a certain percentage of values are null/None, we can flip this around and check that a certain percentage are not null, using the threshold system:

import pointblank as pb
import polars as pl

# Sample data with 20% null values
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10]
})

# Check that AT MOST 20% are null (i.e., at least 80% are not null)
# We set the threshold to allow up to 20% failure
validation = (
    pb.Validate(data=data)
    .col_vals_not_null(
        columns="value",
        thresholds=0.20,  # Allow up to 20% to fail (be null)
        brief="Value column should have at most 20% null values"
    )
    .interrogate()
)

validation

Option 2: Using pre= with existing comparison methods

For more fine-grained control (like checking if the null/None percentage is within a specific range, below a fixed value, etc.), you can use a preprocessing function with existing validation methods:

def get_null_pct(tbl):
    """Calculate percentage of null values in 'value' column."""
    import narwhals as nw
    df = nw.from_native(tbl)
    total = df.select(nw.len()).item()
    n_null = df.select(nw.col("value").is_null().sum()).item()
    
    # Return a single-row, single-column table with the percentage
    return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))

validation = (
    pb.Validate(data=data)
    .col_vals_lt(
        columns="null_pct",
        value=0.10,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be less than 10%"
    )
    .col_vals_between(
        columns="null_pct",
        left=0.15,
        right=0.25,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be between 15% and 25%"
    )
    .interrogate()
)

validation

The composable nature of pre= combined with our existing validation methods gives us a lot of flexibility:

  • It prevents method proliferation: we could create specialized methods for every possible column statistic (.pct_null(), .pct_duplicate(), .pct_outliers(), etc.), but this would lead to an explosion of similar methods.
  • It's more expressive: the pre= approach makes it crystal clear what's being calculated and validated.
  • It's consistent with the R version of Pointblank: we're working toward feature parity with the R version first (in terms of validation methods), then we can explore additional validation types.
  • It's flexible: need to check the null percentage against dynamic thresholds, or combine it with other metrics? The pre= approach handles it all.
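
To make the "dynamic thresholds" point concrete, here is a dependency-free sketch of how an allowed null budget could be derived from the data itself rather than hard-coded. The function name and the small-sample rule are made up purely for illustration; they are not part of Pointblank's API:

```python
def dynamic_null_budget(n_rows, base_pct=0.05, min_rows=100):
    """Allow a larger null budget for small samples, where a handful of
    missing values can swing the percentage disproportionately.
    (Hypothetical helper, not part of Pointblank.)"""
    if n_rows < min_rows:
        return base_pct * 2  # relax the budget for small tables
    return base_pct

column = [1, None, 3, 4, None, 6, 7, 8, 9, 10]  # 10 rows, 20% null
budget = dynamic_null_budget(len(column))       # small table -> 10% budget
pct_null = sum(v is None for v in column) / len(column)
print(pct_null <= budget)  # False: 20% null exceeds the 10% budget
```

A function like this could feed the thresholds= argument, which is the kind of composition a fixed-signature method can't easily offer.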

Thanks for putting the work into the PR (I really value your contributions to the project), but I think we should leave this particular one out. Please let me know your thoughts on this approach; I'm hoping you'll find it reasonable!

@tylerriccio33
Contributor Author

So I've thought about this some more and I actually disagree for several reasons.

  1. Coding a 10-line preprocessor that requires constructing an intermediate dataframe, collecting multiple times, and using an interim column is undesirable. It opens the user up to mistakes, makes the package less useful as a high-level declarative QC API, and breaks DRY.
  2. Using methods like col_vals_between or col_vals_lt on a manually calculated aggregate detracts from the method's goal. I'd also argue it confuses the user and anyone reading the report, since the report will say col_vals_between when in reality we're interrogating something completely different. Additionally, this could affect autobrief; if the generated brief isn't clear in each case, I may want to disable it. And the intermediate column may confuse report readers; it certainly would confuse me.

A more philosophical point: I don't see an issue with method proliferation if the proliferation spawns helpful methods. You already have col_vals_expr, which I've been using for custom/modular validators instead of pre=. I can only speak from experience, but I see a lot of col_vals_expr usage simply because we want to check aggregates.
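
For what it's worth, the core logic of the proposed check is tiny. Here is a rough, dependency-free sketch of what a percentage-null check boils down to; the helper name and the tolerance handling are illustrative, not the actual implementation:

```python
def pct_null_within_tolerance(values, max_pct, tol=1e-9):
    """Return True when the share of None values in `values` does not
    exceed `max_pct` (plus a small numeric tolerance for float rounding).
    (Hypothetical helper sketching the check's logic.)"""
    if not values:
        raise ValueError("cannot compute a null percentage of an empty column")
    n_null = sum(v is None for v in values)
    pct = n_null / len(values)
    return pct <= max_pct + tol

column = [1, 2, None, 4, 5, None, 7, 8, 9, 10]  # 2 of 10 null -> 20%
print(pct_null_within_tolerance(column, max_pct=0.20))  # True
print(pct_null_within_tolerance(column, max_pct=0.10))  # False
```

Expressing this once inside the library, rather than re-deriving it with pre= at every call site, is exactly the DRY argument above.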

Let me know what you think, but I do understand your opinion!

@rich-iannone
Member

Okay... I'm coming around to your point of view! You're right that a library should contain helpful methods (if that's a lot, that's fine: it's a big problem space). And I want this library to be useful, so I'm in agreement now.

So, yeah, let's get this method into the package. Would you be okay with changing the method name a little? Maybe to col_pct_null() (anything to make it different from col_vals_*()), which would set us up for other col_*() methods that operate on a column aggregate.

For the icon, I'm using Sketch and I could totally draw one up for this! For now (to move the PR along), just borrow another method's icon; I'll pass the finished one over before merging.

We don't currently use doctest, but I totally wouldn't mind if you got that going for this PR (up to you!).

@tylerriccio33
Contributor Author

Awesome, I'll get this going!

@tylerriccio33
Contributor Author

@rich-iannone This should be good to go, but I'd appreciate some criticism of the logic :)

Member

@rich-iannone rich-iannone left a comment


LGTM and thanks so much for the contribution!

@rich-iannone rich-iannone merged commit 1e6a9aa into posit-dev:main Dec 1, 2025
9 checks passed
