Skip to content

User Guide

Hossam Hammady edited this page Jan 14, 2015 · 8 revisions

Setup

Code structure

The NADEEF release has the code structure described as below:

	- .                     - NADEEF root directory
	|- console              - source code for the NADEEF console
	|- core                 - source code for the NADEEF core
	    |- datamodel        - datamodel in the core, it contains the entities used in NADEEF e.g. cell, tuple
	    |- exception        - exception defined in NADEEF
	    |- pipeline         - NADEEF pipeline/operator implementation
	    |- util             - utility code for NADEEF core
	|- ruleext              - extension for abstract rule, currently it has FD and CFD implementation
	|- test                 - NADEEF test code
    |- tools                - common utility code for NADEEF
    |- service              - Thrift service layer
	|- vendors              - 3rd party libraries (e.g. JUnit, postgres JDBC driver)
	|- out                  - output directory (also the default location of nadeef.jar)
	|- nadeef.bat           - start script for windows
	|- nadeef.sh            - start script for mac and *nix
	|- nadeef.conf          - NADEEF configuration file

Build NADEEF from scratch

Prerequisite

To be able to run and compile NADEEF user needs to have the following software installed.

Compile

The build script is in the root folder named +build.xml+, to do a full compile user can use the command

    ant all

There are also other commands available and user can find the details in the +build.xml+ file.

Configuration

The configuration is NADEEF is based on the configuration file +nadeef.conf+. A sample of configuration file is shown below

{
    "database" : {
        "url" : "localhost/unittest",
        "username" : "tester",
        "password" : "tester",
        "type" : "postgres"
    },
    
    "general" : {
    	"maxIterationNumber" : 5
    },

    "ruleext" : {
        "fd" : "qa.qcri.nadeef.ruleext.FDRuleBuilder",
        "cfd" : "qa.qcri.nadeef.ruleext.CFDRuleBuilder"
    }
}

The configuration file tells the following information of your NADEEF system.

  • database: The place where NADEEF metadata is stored (e.g. violation).

    • url: dataabse url
    • username: user name
    • password: password
    • type: the type of the database, e.g. Postgres (currently only Postgres is supported).
  • general

    • maxIterationNumber: set the maximum detection iteration number.
  • ruleext

    • fd: plugin for FD rule implementation.
    • cfd: plugin for CFD rule implementation.

Start NADEEF

Start NADEEF console

Once NADEEF is compiled user can start NADEEF console via command +nadeef.sh+ (on Windows it is +nadeef.bat+). A successful start of NADEEF console is shown as below

   _  __        __        _____
  / |/ /__ ____/ /__ ___ /   _/
 /    / _ `/ _  / -_) -_)   _/
/_/|_/\_,_/\_,_/\__/\__/ __/
Data Cleaning solution (Build null, using Java 1.7.0_07).
Copyright (C) Qatar Computing Research Institute, 2013 (http://da.qcri.org).

Type 'help' to see what commands we have.

Your NADEEF started in 1004 ms.
:>

NADEEF console is the interactive console where user can start NADEEF data cleaning rules. Type 'help' in the console to list all the available commands as below:

:> help
 |Nadeef console usage:
 |----------------------------------
 |help : Print out this help information.
 |
 |load <input CleanPlan file> :
 |    load a NADEEF clean plan.
 |
 |detect [rule id] :
 |    start the violation detection with a given rule id number.
 |
 |list :
 |    list available rules.
 |
 |repair [rule id] :
 |    repair the data source with a given rule id number.
 |
 |run [rule id]:
 |    run both detect and clean with a given rule id number.
 |
 |schema [table name]:
 |    list the table schema from the data source.
 |
 |fd [table name] [fd rule value]:
 |    create FD rule with specified table name on the source data.
 |
 |exit :
 |    exit the console.

:>

What is CleanPlan?

NADEEF is driven by CleanPlan. CleanPlan is a JSON file which contains information of which data to be cleaned and which rule to be used.

A sample of CleanPlan is described as below

{
    "source" : {
        "url" : "localhost/unittest",
        "username" : "tester",
        "password" : "tester",
        "type" : "postgres"
    },
    "rule" : [
        {
            "type" : "fd",
            "table" : ["location"],
            "target" : ["location_copy"],
            "value" : ["zip|city, state"]
        }
    ]
}

In this CleanPlan there are the following two attributes

  • source: Where is the data to be cleaned.
    • type: the type of the source, it could be a database(Postgres SQL) or a CSV file(csv).
    • url: the url of the database
    • username: the user name of the database
    • password: the password of the user name
  • rule: Which rule is going to be used.
    • type: the type of the rule. It can be an user defined rule (UDR) or abstract rule, e.g. (FD, CFD).

Load, Detect, Repair, and Run

A typical user scenario of NADEEF is to first load the +CleanPlan+ and then runn the rule. In the following text we will through this scenario to introduce some of the most frequent commands in NADEEF.

First we load a CleanPlan called 'sample.json'.

:> load sample.json
:INFO:CSVDumper : Successfully created table csv_bank1
:INFO:CSVDumper : Dumped 2 rows in 276 ms.
:INFO:CSVDumper : Successfully created table csv_tran1
:INFO:CSVDumper : Dumped 4 rows in 201 ms.
1 rules loaded in 605 ms.
:>

Once it is loaded user can start to use +detect+, +repair+, and +run+ (which is the combination of +detect+ and +repair+) commands. For each command there will be a summary printed as below.

:> detect
:> [DETECT][==================================================]100 %
Detection summary:
Rule :Rule1563440042
----------------------------------------------------------------
HScope time                                       0 ms
VScope time                                       0 ms
Blocks                                            0
Original tuple count                              0
Iterator time                                     8 ms
DB load time                                      1 ms
Detect time                                      59 ms
Detect call time                                  0 ms
Detect thread count                               1
Detect tuple count                                8
Violation                                         1
Violation export time                            37
----------------------------------------------------------------
Detection finished in 59 ms and found 1 violations.

:>
:> repair
:> [REPAIR][==================================================]100 %
Repair summary:
Repair time                                      10 ms
Rule : Rule1563440042
----------------------------------------------------------------
Repair perCall time                               1 ms
EQ time                                          44 ms
Cell updated                                      1
----------------------------------------------------------------
Repair finished in 10 ms with 1 cells changed.

Violation and Audit table

After each 'detect' iteration all the violation data are stored in the 'violation' table in the database which is specified in the +nadeef.conf+. After each 'repair' all the candidate fixes are stored in the 'repair' table and final update information is stored in the 'audit' table.

An example of violations could be like

 vid |      rid       |    tablename     | tupleid | attribute |    value    
-----+----------------+------------------+---------+-----------+-------------
   0 | Rule1949358975 | csv_hospital_10k |    7663 | zipcode   | 90028
   0 | Rule1949358975 | csv_hospital_10k |    7421 | zipcode   | 90028
   0 | Rule1949358975 | csv_hospital_10k |    7421 | city      | LOS ANGELES
   0 | Rule1949358975 | csv_hospital_10k |    7663 | city      | HOLLYWOOD
   1 | Rule1949358975 | csv_hospital_10k |    7663 | zipcode   | 90028
   1 | Rule1949358975 | csv_hospital_10k |    7422 | zipcode   | 90028
   1 | Rule1949358975 | csv_hospital_10k |    7422 | city      | LOS ANGELES
   1 | Rule1949358975 | csv_hospital_10k |    7663 | city      | HOLLYWOOD

Here the violations are clustered by 'vid' and each row represents a potential violated cell.

The +repair+ table lists the candidate fixes. It is stored as an equation statement, so (cell 1 = cell 2). An example can be as below:

 id  | vid | c1_tupleid | c1_tablename | c1_attribute |  c1_value   | op | c2_tupleid | c2_tablename | c2_attribute |  c2_value   
-----+-----+------------+--------------+--------------+-------------+----+------------+--------------+--------------+-------------
   0 |   0 |       7663 | csv_hospital | state        | CA          |  0 |       7421 | csv_hospital | state        | CA
   0 |   0 |       7421 | csv_hospital | city         | LOS ANGELES |  0 |       7663 | csv_hospital | city         | HOLLYWOOD
   1 |   1 |       7663 | csv_hospital | state        | CA          |  0 |       7422 | csv_hospital | state        | CA
   1 |   1 |       7663 | csv_hospital | city         | HOLLYWOOD   |  0 |       7422 | csv_hospital | city         | LOS ANGELES
   2 |   2 |       7423 | csv_hospital | state        | CA          |  0 |       7663 | csv_hospital | state        | CA
   2 |   2 |       7423 | csv_hospital | city         | LOS ANGELES |  0 |       7663 | csv_hospital | city         | HOLLYWOOD
   3 |   3 |       7663 | csv_hospital | city         | HOLLYWOOD   |  0 |       7424 | csv_hospital | city         | LOS ANGELES

The audit table shows which cell and what data has been changed. An example can be like

 id | vid | tupleid |  tablename   | attribute |   oldvalue    |  newvalue   |            time            
----+-----+---------+--------------+-----------+---------------+-------------+----------------------------
  1 | 603 |    7424 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  2 | 618 |    7439 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  3 | 600 |    7421 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  4 | 612 |    7433 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  5 | 611 |    7432 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  6 | 619 |    7440 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998
  7 | 609 |    7430 | csv_hospital | city      | 'LOS ANGELES' | 'HOLLYWOOD' | 2013-06-13 16:53:44.277998

Write User Defined Rule

Detect and Repair

The core of NADEEF is rule driven. Following the NADEEF rule interface user is able to write any User defined rule (UDF) in Java and give to NADEEF to run. The most important interfaces of NADEEF rule are +detect+ and +repair+.

+Detect+ is the place where user defines what is a violation in the context of the source data. The signature of the detect function is defined as

    /**
     * Detect interface.
     * @param tuples input tuple.
     * @return Violation set.
     */
    public abstract Collection<Violation> detect(E tuples);

The input of the detect is a generic type +E+, it could be one tuple, pair tuple, or many tuples (a table). The output of the +detect+ is a collection of Violations. Given a simple rule of

A tuple, which has Country code (CC) of 31, should have the country name of either Netherlands or Holland

can be rewritten in the +detect+ method as

    @Override
    public Collection<Violation> detect(Tuple tuple) {
        List<Violation> result = Lists.newArrayList();
        if (tuple.getCell("CC").getValue().equals("31") &&
            (
                !tuple.getCell("country").getValue().equals("Netherlands") ||
                !tuple.getCell("country").getValue().equals("Holland"))
            )
        {
            Violation violation = new Violation(ruleName);
            violation.addCell(tuple.getCell("cc"));
            violation.addCell(tuple.getCell("country"));
            result.add(violation);
        }
        return result;
    }

The +repair+ function is to specify possible ways to resolve a given violation. Its interface is as follows:

    /**
     * Repair.
     * @param violation violation input.
     * @return a candidate fix.
     */
    public abstract Collection<Fix> repair(Violation violation);

The input is a Violation and the output is a Collection of Fixes. For the same example above user can write the following +repair+ method to fix tuples which has country code '31' but do not have country value 'Holland'.

    @Override
    public Collection<Fix> repair(Violation violation) {
        Fix.Builder builder = new Fix.Builder();
        Cell countryCell = violation.getCell("myTable", "country");
        Fix fix = builder.left(countryCell).right("Netherlands").build();
        return Lists.newArrayList(fix);
    }

VerticalScope, HorizontalScope, Block, and Iterator

In addition to +detect+ and +repair+ functions there are 4 extra optional interfaces user can override for optimization purpose. Why? Imaging that you have a file of 100k rows and you would like to do pairwise detection. Without any optimization the result pair number would be a simple N^2 which ends up 10 billions pairs to work on and it probably will take days to finish on the current commodity hardware.

  • VerticalScope : VerticalScope is to let user do scoping on columns. So choosing the columns which will be needed for the +detect+ to reduce the memory footprint during execution. The signature of VerticalScope is like
    /**
     * Vertical scope.
     * @param table input tables.
     * @return scoped tables.
     */
    public abstract Collection<Table> verticalScope(Collection<Table> table);
  • HorizontalScope : HorizontalScope is to let user do scoping on rows. So choosing the rows which is relevant to the +detect+ and reduce the total number of pairs generated. The signature of HorizontalScope is like
    /**
     * Horizontal scope operator.
     * @param table input tables.
     * @return scoped tables.
     */
    public abstract Collection<Table> horizontalScope(Collection<Table> table);
  • Block : Block is to let user group a table on columns. For example, given a table with country code information user may want to write +Block+ function to return a collection of tables based on the same country code value. In that case the total tuple pair number is reduced.
    /**
     * Block operator.
     * @param table input tuple.
     * @return a generator of tuple collection.
     */
    public abstract Collection<Table> block(Collection<Table> table);
  • Iterator : Iterator is to let user write how to generate a tuple, a pair of tuples, or a collection of tuples to serve the detection. The output of the iterator will be sent to the +detect+ function and the signature of Iterator is
    /**
     * Iterator.
     * @param tables a collection of tables.
     * @param iteratorOutputStream Iterator output object.
     */
    public abstract void iterator(
        Collection<Table> tables,
        IteratorStream<E> iteratorOutputStream
    );

User needs to output the tuples via iteratorOutputStream instead of returning from the function. The reason is that the output of iterator could be huge and it may run out-of-memory if we generate the pairs first (thinking about the N^2 case above). To deal with this concern NADEEF has a streaming implementation which provides kind of producer-consumer pattern to let detect consume the tuple pairs while generating them.

How to load UDF rule?

Once user completes coding the rule the next step is to compile and put it in the CleanPlan. To compile the rule user needs to put +NADEEF+ class files in the Java classpath as the following example:

javac MyRule.java -classpath out/nadeef.jar

Here the +nadeef.jar+ file is the file which is generated after a full build.

Afterwards user can put the full class name as the rule value in the CleanPlan so that NADEEF is able to load it. Make sure that the class can be found in the JAVA classpath. An example can be like this

{
    "source" : {
        "url" : "localhost/unittest",
        "username" : "tester",
        "password" : "tester",
        "type" : "postgres"
    },
    "rule" : [
        {
            "name" : "myRule",
			"type" : "udf",
            "table" : ["location"],
            "target" : ["location_copy"],
            "value" : ["qa.qcri.nadeef.test.udf.MyRule1"],
        }
    ]
}

Pipelines, The Whole Picture

NADEEF has a very flexible pipeline framework. A pipeline is data workflow which consists of several operators and data is flowing from one operator to the other. User is able to add / remove / design new operator and pipeline based on existing ones. NADEEF has currently 3 pipelines implemented. They are

  • Detect pipeline. Detect pipeline feeds the +detect+ function with source data and exports the violations into the violation table. The detect pipeline is designed as
Source Deserializer -> Scope -> Block -> Iterator -> Detect -> ViolationExport
  • Repair pipeline. Repair pipeline repairs the source data using the +repair+ function of the rule. The repair pipeline is designed as
Violation Import -> Repair -> FixExport
  • Updater pipeline. Updater pipeline fetches all the candidate fixes from repair and choose which cell eventually will be updated. The update pipeline is designed as
FixImport -> EquivalentClass -> Updater