TopNotch

What Is TopNotch?

TopNotch is a system for quality-controlling large-scale data sets. It addresses the following three problems:

How to define and measure data quality
How to efficiently ensure data quality across many data sets
How to institutionalize existing knowledge of data sets

TopNotch uses rules to verify individual components of a data set. Each rule defines and measures some small component of data quality. Together, the rules provide a complete definition of quality for a data set, along with metrics for it. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
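To make the idea of small, reusable rules concrete, here is a minimal Python sketch. It is not TopNotch's API or its JSON configuration format: the `Rule` class, the `measure` function, the rule names, and the sample rows are all hypothetical and exist only to illustrate how independent rules can each define and measure one component of quality.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A hypothetical rule: a name plus a per-row check (not TopNotch's API)."""
    name: str
    check: Callable[[Row], bool]


def measure(rules: List[Rule], rows: List[Row]) -> Dict[str, float]:
    """Return, for each rule, the fraction of rows that pass it."""
    report = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row))
        # An empty data set has no failing rows, so it is reported as fully passing.
        report[rule.name] = passed / len(rows) if rows else 1.0
    return report


# Two small rules, each covering one component of quality.
rules = [
    Rule("price_is_positive",
         lambda row: row.get("price") is not None and row["price"] > 0),
    Rule("ticker_is_present", lambda row: bool(row.get("ticker"))),
]

# Made-up rows; the same rules could be applied to any data set with these columns.
trades = [
    {"ticker": "ABC", "price": 10.5},
    {"ticker": "", "price": 3.0},
    {"ticker": "XYZ", "price": -1.0},
    {"ticker": "DEF", "price": 7.25},
]

report = measure(rules, trades)
print(report)
```

Because each rule is independent of the data set it runs against, the same `rules` list could be passed to `measure` with a different set of rows, which is the kind of reuse described above.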

Getting Started

Requirements

The java command and the JAVA_HOME environment variable pointing to Java 8
Spark 2.0.2
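As a rough preflight check for the Java requirement, the following Python sketch reports what appears to be missing. The function name `missing_requirements` is hypothetical and not part of TopNotch. The sketch only checks that a `java` command is on the PATH and that JAVA_HOME is set; it does not verify that either one points to Java 8, and it does not check the Spark version.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def missing_requirements(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if which("java") is None:
        problems.append("the java command is not on PATH")
    if not env.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")
    return problems


# Check the current environment and print anything that looks missing.
for problem in missing_requirements():
    print("Missing requirement:", problem)
```

The `env` and `which` parameters default to the real environment, but they can be replaced with a plain dictionary and a stub function to try the check without changing your shell.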

Quick Start Steps

Clone this repo.
Get the latest JAR, TopNotch-assembly-0.2.1.jar, either by building this project (see docs/DEVELOPMENT.md for guidance) or by downloading it from the releases section of TopNotch's GitHub page. Place it in this project's top-level bin folder.
Create the configuration files to test your data set.
1. See the example folder for a sample data set and configuration files.
Run bin/TopNotchRunner.sh with the plan file passed in as an argument.
1. To try the example, run bin/TopNotchRunner.sh --planPath example/plan.json.
2. Note that you must set the SPARK_HOME variable, either in the script or as an external environment variable.
3. Note that if you have configured your Spark installation to use an existing HDFS system, you will need to upload example/exampleAssertionInput.parquet to that HDFS system. You should make an example folder in your home folder on HDFS and upload example/exampleAssertionInput.parquet to that folder on HDFS.
View the resulting report and parquet file in the topnotch folder in your home directory on HDFS.
1. To view the results of the example, look at the JSON file topnotch/plan.json and the Parquet file example/exampleAssertionOutput.parquet. Note that if you have configured your Spark installation to use an existing HDFS system, the JSON and Parquet files will appear in the topnotch and example folders in your home directory on HDFS.
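The commands for the example run can be collected in one place, as in the Python sketch below. The function name `example_commands` is hypothetical. The `bin/TopNotchRunner.sh --planPath example/plan.json` invocation comes from the steps above, while the `hdfs dfs -mkdir` and `hdfs dfs -put` commands are an assumption about how you might perform the upload; the steps only say that the file must be uploaded to an example folder in your HDFS home folder.

```python
from typing import List

EXAMPLE_INPUT = "example/exampleAssertionInput.parquet"


def example_commands(uses_hdfs: bool) -> List[List[str]]:
    """Build the command lines for the example run, in the order to execute them."""
    commands = []
    if uses_hdfs:
        # Assumed upload commands: relative HDFS paths resolve against the
        # user's home directory on HDFS, so "example" is the home-level folder.
        commands.append(["hdfs", "dfs", "-mkdir", "-p", "example"])
        commands.append(["hdfs", "dfs", "-put", EXAMPLE_INPUT, "example/"])
    # The run command itself, as given in the quick start steps.
    commands.append(["bin/TopNotchRunner.sh", "--planPath", "example/plan.json"])
    return commands


# Print the commands for a Spark installation that uses an existing HDFS system.
for command in example_commands(uses_hdfs=True):
    print(" ".join(command))
```

To execute the commands rather than print them, each list could be passed to `subprocess.run(command, check=True)` from the root of the repository, with SPARK_HOME set as noted above.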

Please note that you must change bin/TopNotchRunner.sh in order to run TopNotch with a master other than local. It is currently recommended that you run TopNotch in local or client mode.

What To Read Next

The docs folder contains the documentation. Which documents you should read depends on whether you want to use, deploy, or further develop TopNotch:

CONCEPTS.md
1. Target Audience: All
2. Content: An overview of the parts of TopNotch and what they should be used for.
USER_GUIDE.md
1. Target Audience: Users
2. Content: A guide for how to write the TopNotch JSON input and the specific options available for each feature.
DEVELOPMENT.md
1. Target Audience: Developers
2. Content: A guide on how to set up TopNotch on your local computer for development and how to run the unit tests.
CLUSTER_INSTALL.md
1. Target Audience: Developers/DevOps/ProdOps
2. Content: A guide on how to install TopNotch on your cluster.
