
Conversation

@tylerriccio33
Contributor

Summary

This method checks the percentage of null values in a column of a dataset. I added some tests and documentation. I'm seeking review, since navigating tolerance can be confusing.

I can add examples for doctests, if you guys use those?

The big thing missing is the icon, which prohibits usage with get_tabular_report. See the xfail test I added, which will pass once the icon is in place. Is there software you use to generate those? Thanks.

Related GitHub Issues and PRs

Checklist

  • [x] I understand and agree to the Code of Conduct.
  • [x] I have followed the Style Guide for Python Code as best as possible for the submitted code.
  • [x] I have added pytest unit tests for any new functionality.

@tylerriccio33
Contributor Author

CI passed except for pre-commit, which failed on formatting; you can just override that. Maybe my ruff version is out of sync.

@tylerriccio33
Contributor Author

Any update here?

@rich-iannone
Member

Sorry again for the delay in responding. After reviewing the PR, I want to discuss an alternative approach that I think might work better. Essentially, Pointblank already has the tools to accomplish what col_vals_pct_null() does (existing validation methods combined with our composable features).

Option 1: Using col_vals_not_null() with thresholds

Since you want to check that a certain percentage of values are null/None, we can flip this around and check that a certain percentage are not null, using the threshold system:

import pointblank as pb
import polars as pl

# Sample data with 20% null values
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [1, 2, None, 4, 5, None, 7, 8, 9, 10]
})

# Check that AT MOST 20% are null (i.e., at least 80% are not null)
# We set the threshold to allow up to 20% failure
validation = (
    pb.Validate(data=data)
    .col_vals_not_null(
        columns="value",
        thresholds=0.20,  # Allow up to 20% to fail (be null)
        brief="Value column should have at most 20% null values"
    )
    .interrogate()
)

validation

Option 2: Using pre= with existing comparison methods

For more fine-grained control (like checking if the null/None percentage is within a specific range, below a fixed value, etc.), you can use a preprocessing function with existing validation methods:

def get_null_pct(tbl):
    """Calculate percentage of null values in 'value' column."""
    import narwhals as nw
    df = nw.from_native(tbl)
    total = df.select(nw.len()).item()
    n_null = df.select(nw.col("value").is_null().sum()).item()
    
    # Return a single-row, single-column table with the percentage
    return nw.from_native(pl.DataFrame({"null_pct": [n_null / total]}))

validation = (
    pb.Validate(data=data)
    .col_vals_lt(
        columns="null_pct",
        value=0.10,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be less than 10%"
    )
    .col_vals_between(
        columns="null_pct",
        left=0.15,
        right=0.25,
        pre=get_null_pct,
        thresholds=pb.Thresholds(critical=1),
        brief="Null percentage should be between 15% and 25%"
    )
    .interrogate()
)

validation

The composable nature of pre= combined with our existing validation methods gives us a lot of flexibility:

  • It prevents method proliferation: we could create specialized methods for every possible column statistic (.pct_null(), .pct_duplicate(), .pct_outliers(), etc.), but this would lead to an explosion of similar methods.
  • It's more expressive: the pre= approach makes it crystal clear what's being calculated and validated.
  • It's consistent with the R version of Pointblank: we're working toward feature parity with the R version first (in terms of validation methods), then we can explore additional validation types.
  • It's flexible: need to check the null percentage against dynamic thresholds, or combine it with other metrics? The pre= approach handles it all.
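
To make the "dynamic thresholds" point concrete, here is a dependency-free sketch of how an allowed null budget could be derived from the data itself rather than hard-coded. The function name and the small-sample rule are made up purely for illustration; they are not part of Pointblank's API:

```python
def dynamic_null_budget(n_rows, base_pct=0.05, min_rows=100):
    """Allow a larger null budget for small samples, where a handful of
    missing values can swing the percentage disproportionately.
    (Hypothetical helper, not part of Pointblank.)"""
    if n_rows < min_rows:
        return base_pct * 2  # relax the budget for small tables
    return base_pct

column = [1, None, 3, 4, None, 6, 7, 8, 9, 10]  # 10 rows, 20% null
budget = dynamic_null_budget(len(column))       # small table -> 10% budget
pct_null = sum(v is None for v in column) / len(column)
print(pct_null <= budget)  # False: 20% null exceeds the 10% budget
```

A function like this could feed the thresholds= argument, which is the kind of composition a fixed-signature method can't easily offer.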

Thanks for putting the work into the PR (I really value your contributions to the project), but I think we should leave this particular one out. Please let me know your thoughts on this approach; I'm hoping you'll find it reasonable!

@tylerriccio33
Contributor Author

So I've thought about this some more and I actually disagree for several reasons.

  1. Coding a 10-line preprocessor that requires constructing an intermediate dataframe, collecting multiple times, and using an interim column is undesirable. It opens the user up to mistakes, makes the package less useful as a high-level declarative QC API, and breaks DRY.
  2. Using methods like col_vals_between or col_vals_lt on a manually calculated aggregate detracts from the method's goal. I'd also argue it confuses the user and anyone reading the report, since the report will say col_vals_between when in reality we're interrogating something completely different. Additionally, this could affect autobrief; if the generated brief isn't clear in each case, I may want to disable it. And the intermediate column may confuse report readers; it certainly would confuse me.

A more philosophical point: I don't see an issue with method proliferation if the proliferation spawns helpful methods. You already have col_vals_expr, which I've been using for custom/modular validators instead of pre=. I can only speak from experience, but I see a lot of col_vals_expr usage simply because we want to check aggregates.
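
For what it's worth, the core logic of the proposed check is tiny. Here is a rough, dependency-free sketch of what a percentage-null check boils down to; the helper name and the tolerance handling are illustrative, not the actual implementation:

```python
def pct_null_within_tolerance(values, max_pct, tol=1e-9):
    """Return True when the share of None values in `values` does not
    exceed `max_pct` (plus a small numeric tolerance for float rounding).
    (Hypothetical helper sketching the check's logic.)"""
    if not values:
        raise ValueError("cannot compute a null percentage of an empty column")
    n_null = sum(v is None for v in values)
    pct = n_null / len(values)
    return pct <= max_pct + tol

column = [1, 2, None, 4, 5, None, 7, 8, 9, 10]  # 2 of 10 null -> 20%
print(pct_null_within_tolerance(column, max_pct=0.20))  # True
print(pct_null_within_tolerance(column, max_pct=0.10))  # False
```

Expressing this once inside the library, rather than re-deriving it with pre= at every call site, is exactly the DRY argument above.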

Let me know what you think, but I do understand your opinion!

@rich-iannone
Member

Okay... I'm coming around to your point of view! You're right that a library should contain helpful methods (if that's a lot, that's fine: it's a big problem space). And I want this library to be useful, so I'm in agreement now.

So, yeah, let's get this method into the package. Would you be okay with changing the method name a little? Maybe to col_pct_null() (anything to make it different from col_vals_*()), which would set us up for other col_*() methods that operate on a column aggregate.

For the icon, I'm using Sketch and I could totally draw one up for this! For now (to move the PR along), just borrow another method's icon; I'll pass the finished one over before merging.

We don't currently use doctest, but I totally wouldn't mind if you got that going for this PR (up to you!).

@tylerriccio33
Contributor Author

Awesome, I'll get this going!

@tylerriccio33
Contributor Author

@rich-iannone This should be good to go, but I'd appreciate some criticism of the logic :)

Member

@rich-iannone rich-iannone left a comment


LGTM and thanks so much for the contribution!

@rich-iannone rich-iannone merged commit 1e6a9aa into posit-dev:main Dec 1, 2025
9 checks passed
