TopNotch is a system for quality-controlling large-scale data sets. It addresses the following three problems:
- How to define and measure data quality
- How to efficiently ensure data quality across many data sets
- How to institutionalize existing knowledge of data sets
TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Together, the rules provide a complete definition of, and metrics for, the quality of a data set. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
To run TopNotch, you need:
- The java command and the JAVA_HOME environment variable pointing to Java 8
- Spark 2.0.2
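The Java prerequisite can be checked before going further. The following is a minimal sketch, not part of TopNotch: check_prereqs is a hypothetical helper that only confirms the java command is on the PATH and that JAVA_HOME points to an existing directory. It does not verify the Java or Spark version.

```shell
#!/usr/bin/env bash
# Sketch of a prerequisite check. check_prereqs is a hypothetical helper,
# not something shipped with TopNotch.

check_prereqs() {
    local status=0

    # The java command must be on the PATH.
    if ! command -v java >/dev/null 2>&1; then
        echo "java command not found on PATH" >&2
        status=1
    fi

    # JAVA_HOME must be set and must point to an existing directory.
    if [ -z "${JAVA_HOME:-}" ] || [ ! -d "${JAVA_HOME}" ]; then
        echo "JAVA_HOME is unset or is not a directory" >&2
        status=1
    fi

    return "$status"
}

# Usage: check_prereqs && echo "Java prerequisites look fine"
```

If the check passes, running java -version and spark-submit --version from your Spark installation's bin folder shows which versions are actually installed.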
To get started:
- Clone this repo.
- Get the latest JAR, TopNotch-assembly-0.2.1.jar, either by building this project (see docs/DEVELOPMENT.md for guidance) or by downloading it from the releases section of TopNotch's GitHub page. Place it in this project's top-level bin folder.
- Create the configuration files to test your data set.
  - See the example folder for a sample data set and configuration files.
- Run bin/TopNotchRunner.sh with the plan file passed in as an argument.
  - To try the example, run: bin/TopNotchRunner.sh --planPath example/plan.json
  - Note that you must set the SPARK_HOME variable, either in the script or as an external environment variable.
  - Note that if you have configured your Spark installation to use an existing HDFS system, you will need to upload example/exampleAssertionInput.parquet to that HDFS system. Make an example folder in your home folder on HDFS and upload the file to that folder.
- View the resulting report and Parquet file in the topnotch folder in your home directory on HDFS.
  - To view the results of the example, look at the JSON file topnotch/plan.json and the Parquet file example/exampleAssertionOutput.parquet. Note that if you have configured your Spark installation to use an existing HDFS system, the JSON and Parquet files will appear in the topnotch and example folders in your home directory on HDFS.
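The example run described above can be sketched as a single shell function, called from the repository root. This is an illustration under stated assumptions rather than a script that ships with TopNotch: run_example and the USE_HDFS switch are hypothetical names, and /opt/spark is only a placeholder for your own Spark location. The hdfs dfs commands rely on relative HDFS paths resolving against your HDFS home directory.

```shell
#!/usr/bin/env bash
# Sketch of the example run, meant to be called from the repository root.
# run_example, USE_HDFS, and the /opt/spark fallback are illustrative
# assumptions, not part of TopNotch.

run_example() {
    # TopNotchRunner.sh needs SPARK_HOME; /opt/spark is only a placeholder.
    export SPARK_HOME="${SPARK_HOME:-/opt/spark}"

    # Set USE_HDFS=1 only if Spark is configured to use an existing HDFS system.
    if [ "${USE_HDFS:-0}" = "1" ]; then
        # Create the example folder in the HDFS home directory and upload the input.
        hdfs dfs -mkdir -p example || return 1
        hdfs dfs -put -f example/exampleAssertionInput.parquet example/ || return 1
    fi

    # Run the example plan.
    bin/TopNotchRunner.sh --planPath example/plan.json || return 1

    # List the report and Parquet output folders on HDFS.
    if [ "${USE_HDFS:-0}" = "1" ]; then
        hdfs dfs -ls topnotch example
    fi
}

# Usage: USE_HDFS=1 run_example
```

Without USE_HDFS=1 the function only sets SPARK_HOME and runs the plan, which matches a purely local Spark installation.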
Please note that you must change bin/TopNotchRunner.sh in order to run TopNotch with a master other than local. It is currently recommended that you run TopNotch in local or client mode.
The docs folder contains the documentation. Which documents you should read depends on whether you want to use, deploy, or further develop TopNotch:
- CONCEPTS.md
  - Target Audience: All
  - Content: An overview of the parts of TopNotch and what they should be used for.
- USER_GUIDE.md
  - Target Audience: Users
  - Content: A guide to writing the TopNotch JSON input and the specific options available for each feature.
- DEVELOPMENT.md
  - Target Audience: Developers
  - Content: A guide on how to set up TopNotch on your local computer for development and how to run the unit tests.
- CLUSTER_INSTALL.md
  - Target Audience: Developers/DevOps/ProdOps
  - Content: A guide on how to install TopNotch on your cluster.
Copyright © 2017 BlackRock, Inc. All Rights Reserved.