Overhaul and recreate data cleaning process for activities & entities #5

@dani-ajah

Description

As a developer of the data, I would like to make sure that a robust data cleaning process is in place, so that all data is clean when uploaded and no records go missing due to insufficient data cleaning.

More details:
Up to 200k activity records are missing from the hub because they did not upload correctly to Postgres. As a result they are also absent from the search engine (they were left out to avoid pages redirecting to nowhere). Instead, a robust data cleaning effort should make sure those 200k records can be uploaded correctly, ideally in bulk from a single CSV rather than through the current process, which uploads them line by line and is particularly slow on the DigitalOcean-hosted database.

Deliverable: A fully cleaned & uploadable CSV of all 565k activities
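
A minimal sketch of what the cleaning pass could look like, assuming the raw activities are read as dictionaries (for example via `csv.DictReader`). The column names `activity_id` and `name` and the helper names `clean_rows` and `write_csv` are hypothetical placeholders, not the project's actual schema or code. Rows that fail a check are returned with a reason instead of being dropped, so the missing records can be accounted for.

```python
import csv
import io

# Hypothetical required columns; the real activity schema may differ.
REQUIRED = ("activity_id", "name")


def clean_rows(rows, required=REQUIRED):
    """Split rows into (clean, rejected) so nothing is dropped silently."""
    clean, rejected = [], []
    for row in rows:
        # csv.DictReader stores surplus fields under the key None.
        if None in row:
            rejected.append((row, "extra columns"))
            continue
        # Strip surrounding whitespace and NUL bytes
        # (Postgres text columns cannot store NUL bytes).
        fixed = {k: (v or "").replace("\x00", "").strip() for k, v in row.items()}
        missing = [c for c in required if not fixed.get(c)]
        if missing:
            rejected.append((fixed, "missing: " + ", ".join(missing)))
        else:
            clean.append(fixed)
    return clean, rejected


def write_csv(rows, fieldnames, out):
    """Write the cleaned rows as one CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    raw = [
        {"activity_id": " 1 ", "name": "Food bank\x00"},
        {"activity_id": "", "name": "No id"},
    ]
    clean, rejected = clean_rows(raw)
    buffer = io.StringIO()
    write_csv(clean, ["activity_id", "name"], buffer)
    print(buffer.getvalue())
    print(len(rejected), "row(s) rejected")
```

The resulting file could then be loaded in a single statement with Postgres `COPY ... FROM STDIN WITH (FORMAT csv, HEADER true)`, for example through psycopg2's `cursor.copy_expert`, instead of one `INSERT` per line. The target table and column list would need to match the real schema.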

Another minor consideration that is related to data cleaning:
There is a particular case where the program data point output is "Charity provided description when other program areas are not applicable". This output should be changed to "Not Available" (unless we can find this 'description' somewhere else?)
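
As a rough sketch, the replacement could be a small normalization step applied to each program value during cleaning. The function name `normalize_program_value` is hypothetical, and matching case-insensitively after trimming whitespace is an assumption rather than something the data is known to require.

```python
PLACEHOLDER = "Charity provided description when other program areas are not applicable"
REPLACEMENT = "Not Available"


def normalize_program_value(value):
    """Replace the placeholder text with "Not Available"; leave other values as-is."""
    if value is None:
        return value
    # Assumption: ignore case and surrounding whitespace when matching.
    if value.strip().casefold() == PLACEHOLDER.casefold():
        return REPLACEMENT
    return value
```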

Metadata

Assignees

No one assigned

    Labels

    improvement: Improve an existing feature

    Type

    No type

    Projects

    Status

    Data Modeling & Processing

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
