new feature: data validation rules #81

Open
thibautjombart opened this issue Jun 18, 2019 · 5 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

thibautjombart (Contributor) commented Jun 18, 2019

This may already be implemented in other packages, so it may just be a matter of writing a wrapper or documentation. The idea is to create validation rules for a given data.frame, "à la" testthat.

Specific use cases (examples):

  • column xxx is a Date and should be greater or less than a given date
  • the difference xxx - yyy must be less than a given number (e.g. the delay from yyy to xxx must be less than 30 days)
  • column sex should be either male, female, or unknown
  • column age should be strictly positive, less than 150
  • column xxx should be of specific class

Really it seems to all boil down to:

  • entries in column xxx must fulfill a logical condition, e.g. xxx < whatever, xxx %in% something
  • entries in column xxx and yyy must fulfill a logical condition, e.g. xxx > yyy or xxx - yyy > something
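
To make this concrete, each of the use cases above can be written as a vectorised logical expression in base R. This is only a sketch: the data frame `linelist` and its columns (`onset`, `report`, `sex`, `age`) are made up for illustration and are not part of the package.

```r
# Hypothetical example data; all column names are assumptions.
linelist <- data.frame(
  onset  = as.Date(c("2019-05-01", "2019-05-10", "2019-06-20")),
  report = as.Date(c("2019-05-03", "2019-06-15", "2019-06-21")),
  sex    = c("male", "female", "other"),
  age    = c(34, 151, 27),
  stringsAsFactors = FALSE
)

# Single-column conditions
ok_class <- inherits(linelist$onset, "Date")               # class check
ok_onset <- linelist$onset < as.Date("2019-06-18")         # date bound
ok_sex   <- linelist$sex %in% c("male", "female", "unknown")
ok_age   <- linelist$age > 0 & linelist$age < 150

# Two-column condition: delay from onset to report must be under 30 days
ok_delay <- as.numeric(linelist$report - linelist$onset) < 30

# Row numbers that fail each rule
failing <- list(
  onset = which(!ok_onset),
  sex   = which(!ok_sex),
  age   = which(!ok_age),
  delay = which(!ok_delay)
)
```

Every rule here reduces to a logical vector with one entry per row, except the class check, which yields a single value for the whole column.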

I suspect we can use testthat as a backend, with an interface similar to that of clean_spelling, e.g.:

  • validate_variable(x, rule): validates a single variable
  • validate_data(x, rules = list(variable_xxx = rule_xxx, variable_yyy = rule_yyy)): applies validate_variable to a bunch of variables
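
One possible shape for this interface is sketched below in base R, without testthat. `validate_variable()` and `validate_data()` are the names proposed in this issue, not existing functions, and the sketch assumes that a rule is a function taking a vector and returning a logical vector of the same length.

```r
# Sketch only: these functions do not exist in any package yet.
# Assumption: a rule is a function(vector) -> logical vector of equal length.
validate_variable <- function(x, rule) {
  ok <- rule(x)
  stopifnot(is.logical(ok), length(ok) == length(x))
  # Treat NA results as failures so that missing values get flagged
  which(is.na(ok) | !ok)
}

validate_data <- function(x, rules = list()) {
  stopifnot(is.data.frame(x), all(names(rules) %in% names(x)))
  # Apply each rule to the column of the same name
  lapply(
    stats::setNames(names(rules), names(rules)),
    function(variable) validate_variable(x[[variable]], rules[[variable]])
  )
}

# Hypothetical data and rules
dat <- data.frame(
  age = c(25, -1, 200, NA),
  sex = c("male", "female", "m", "unknown"),
  stringsAsFactors = FALSE
)
rules <- list(
  age = function(v) v > 0 & v < 150,
  sex = function(v) v %in% c("male", "female", "unknown")
)

# Named list of failing row numbers, one element per validated variable
res <- validate_data(dat, rules)
```

This version only covers element-wise rules on a single column. Rules involving two columns, or whole-column checks such as class, would need a different rule signature.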

Ideally, validation rules could be provided in a table outside R, e.g. in an Excel spreadsheet, as we did for the cleaning rules in clean_spelling.
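
As a rough sketch of how a rule table might work, the example below stores each rule as a text expression and evaluates it against the data. The table layout (columns `variable` and `rule`), the helper `apply_rules()`, and the example data are all assumptions for illustration. The table is inlined as CSV text to keep the example self-contained; in practice it would be read from a file.

```r
# Hypothetical rule table: one row per rule, with the rule stored as R code.
rules_table <- read.csv(
  text = "variable,rule
age,age > 0 & age < 150
sex,\"sex %in% c('male', 'female', 'unknown')\"
delay,as.numeric(report - onset) < 30",
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the failing row numbers for each rule.
apply_rules <- function(x, rules_table) {
  out <- lapply(rules_table$rule, function(rule_text) {
    # Parse the rule text and evaluate it with the columns of `x` in scope;
    # anything not found in `x` is looked up in base R only.
    ok <- eval(parse(text = rule_text), envir = x, enclos = baseenv())
    # Treat NA results as failures
    which(is.na(ok) | !ok)
  })
  stats::setNames(out, rules_table$variable)
}

# Hypothetical example data
linelist <- data.frame(
  onset  = as.Date(c("2019-05-01", "2019-05-10")),
  report = as.Date(c("2019-05-03", "2019-06-15")),
  sex    = c("male", "m"),
  age    = c(-3, 40),
  stringsAsFactors = FALSE
)

failures <- apply_rules(linelist, rules_table)
```

Because the rules are evaluated as code, a rule table from an untrusted source could run arbitrary R. A real implementation would probably want to restrict or check the expressions it accepts.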

thibautjombart added the enhancement, help wanted, and high priority labels Jun 18, 2019

zkamvar (Member) commented Jun 18, 2019

The assertr package is very good for this

zkamvar (Member) commented Jun 18, 2019

In fact, I have used this for my own analysis: https://github.com/everhartlab/sclerotinia-366/blob/master/results/data-comparison.md

thibautjombart (Contributor, Author) commented

Looks great indeed. Maybe it would still be useful to build a wrapper around it? Being able to specify rules in a separate file would be cool; this proved tremendously useful for dictionary-based data cleaning. Thoughts?

zkamvar (Member) commented Jun 18, 2019

I'll see what I can template.

thibautjombart (Contributor, Author) commented

> I'll see what I can template.

I'll see what you contemplate.

zkamvar self-assigned this Jun 18, 2019
thibautjombart removed the high priority label Nov 12, 2019