
Commit e2bb0ac: initial commit (1 parent: a1410c8)

File tree: 3 files changed, +343 / -0 lines

README.md

Lines changed: 197 additions & 0 deletions
# ob_file_validation
The purpose of this API is to validate CSV files for general compliance with established norms such as [RFC4180](https://tools.ietf.org/html/rfc4180). Imagine pouring a gallon of maple syrup into your car's gas tank. That is what bad CSV files do to data pipelines. This API helps you determine the quality of CSV data before it enters one.

# Background
Comma-separated values (CSV) is a common format for exchanging data between systems. While the format is ubiquitous, it can present difficulties. Why? Different tools and export processes often generate output that is not a CSV file at all, or that carries variations which are not considered "valid" according to [RFC4180](https://tools.ietf.org/html/rfc4180).

Invalid CSV files create challenges for those building data pipelines. Pipelines value consistency, predictability, and testability, as these qualities ensure uninterrupted operation from source to target destination.

## What does a valid CSV look like?
Here is an example of a valid CSV file.

It has a header row with `foo`, `bar`, and `buzz`, and a corresponding data row with `aaa`, `bbb`, and `ccc`:

| foo | bar | buzz |
|---|---|---|
| aaa | bbb | ccc |

The raw CSV will look something like this:

```bash
foo,bar,buzz
aaa,bbb,ccc
```

However, what if one day something changed and the file now looks like this:

```bash
foo,bar,buzz
aaa,zzz,bbb,ccc
```

So what is wrong with this? RFC 4180 says:

*Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.*

Notice that an additional `zzz` now sits between `aaa` and `bbb`. This file would be marked as invalid because of this misalignment. Is `zzz` correct? What about `ccc`? Maybe there is a missing header value for `ccc`? Regardless, the file has issues.

For other examples, please [take a look at the RFC document](https://tools.ietf.org/html/rfc4180) for guidance on proper formats for Comma-Separated Values (CSV) files.

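The rule quoted above boils down to a simple invariant: every record must have the same number of fields as the header row. A minimal sketch of that check using Python's `csv` module (the function and variable names here are illustrative, not part of this API):

```python
import csv
import io

def inconsistent_rows(text):
    """Return (line_number, field_count) for every record whose field
    count differs from the header row."""
    reader = csv.reader(io.StringIO(text))
    expected = len(next(reader))  # the header defines the expected width
    return [(num, len(row)) for num, row in enumerate(reader, start=2)
            if len(row) != expected]

valid = "foo,bar,buzz\naaa,bbb,ccc\n"
invalid = "foo,bar,buzz\naaa,zzz,bbb,ccc\n"
```

Running `inconsistent_rows(invalid)` flags line 2 as having 4 fields where 3 were expected, while the valid sample produces an empty list.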
# API Usage
There are two steps to the validation process. The first step is to post the file(s) to the API. If a file is accepted, the API returns a polling endpoint that will contain the results of the validation process.

In the examples below we will assume you have a CSV file called `your.csv` that you want to test.

## Step 1: Post `your.csv` to validation API
We will run a simple `curl` command that sends the data to the API. It looks like this:

```bash
curl -F 'file=@/path/to/your.csv' 'https://validation.openbridge.io/dryrun' -H 'Content-Type: multipart/form-data' -D -
```

For the sake of this example, we will assume that the file posted successfully. You will see an `HTTP/2 302` response like this:

```bash
HTTP/2 302
content-type: application/json
content-length: 2
location: https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
date: Fri, 15 Feb 2019 22:36:57 GMT
x-amzn-requestid: 33316acc-3172-11e9-b824-ddbb1dd27279
x-amz-apigw-id: VKbJWGPtIAMF5Wg=
x-amzn-trace-id: Root=1-5c673f07-252f6e91546d64f019526eb6;Sampled=0
x-cache: Miss from cloudfront
via: 1.1 ab6d050b627b51ed631842034b2e298b.cloudfront.net (CloudFront)
x-amz-cf-id: XLW--LDgeqe7xpLvuoKxaGvKfvNcB4BNkyvx62P99N3qfgeuAvT7EA==
```

## Step 2: Get Your Results
Note the `location` URL endpoint:

```bash
https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
```

This is your polling endpoint. You use it to obtain the results of the validation process. You can use `curl` again:

```bash
curl -s -w "%{http_code}" "https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac"
```

If the file was properly formatted, with no errors, you will get an `HTTP/2 200` response from your polling endpoint. This indicates success! As a bonus, the response also provides an inferred schema. For our `your.csv` file the schema looks like this:

```json
{"data": {"rules": {"configuration": {"load": {"prepend_headers": false, "schema": {"fields": [{"default": null, "include_in_checksum": true, "name": "ob_mws_seller_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "ob_mws_marketplace_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_name", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_description", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "listing_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "seller_sku", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "price", "type": "DOUBLE PRECISION"}, {"default": null, "include_in_checksum": true, "name": "quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "open_date", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "image_url", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_is_marketplace", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id_type", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_shipping_fee", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_note", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_condition", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_category1", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_browse_path", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_storefront_feature", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin1", "type": "VARCHAR (1024)"}, {"default": null, 
"include_in_checksum": true, "name": "asin2", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin3", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "will_ship_internationally", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "expedited_shipping", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_boldface", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "bid_for_featured_placement", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "add_delete", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "pending_quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "fulfillment_channel", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "merchant_shipping_group", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": false, "name": "ob_transaction_id", "type": "varchar(256)"}, {"default": null, "include_in_checksum": false, "name": "ob_file_name", "type": "varchar(2048)"}, {"default": null, "include_in_checksum": false, "name": "ob_processed_at", "type": "varchar(256)"}, {"default": "getdate()", "include_in_checksum": false, "name": "ob_modified_date", "type": "datetime"}]}}}, "destination": {"tablename": "sample"}, "dialect": {"__doc__": null, "__module__": "csv", "_name": "sniffed", "delimiter": ",", "doublequote": true, "encoding": "UTF-8", "lineterminator": "\r\n", "quotechar": "\"", "quoting": 0, "skipinitialspace": false}, "meta": {"creation_date": null, "version": null}}}}
```

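The two steps above can be combined into a small Python sketch using the `requests` library (an illustrative client, not an official one; the HTTP callables are injectable only so the logic can be exercised without network access):

```python
import time
import requests

ENDPOINT = 'https://validation.openbridge.io/dryrun'

def validate(path, post=requests.post, get=requests.get, sleep=time.sleep):
    """Post a CSV to the validation API, then poll the returned
    Location URL until a final (non-302) status arrives."""
    with open(path, 'rb') as f:
        resp = post(ENDPOINT, files={'file': f}, allow_redirects=False)
    if resp.status_code != 302:
        raise RuntimeError('unexpected response: {}'.format(resp.status_code))
    poll_url = resp.headers['Location']
    while True:
        resp = get(poll_url, allow_redirects=False)
        if resp.status_code != 302:  # 302 means "still processing"
            return resp.status_code  # 200 = valid, 502 = invalid
        sleep(2)
```

A return value of `200` means the file validated successfully; anything else should be treated per the status-code tables later in this document.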
## Client Code: `Bash` and `Python`
See `validation_client.sh` and `validation_client.py` in this repository for working client examples.

# Troubleshooting

## Use Case 1: Unquoted commas
If your file is comma delimited, confirm that there are no extra unquoted commas in your data. An unquoted comma will be treated as a field delimiter, with the result that your data is 'shifted' and one or more extra columns are created.

In the example below, there is an unquoted comma (Openbridge, Inc.) in the second row of the file named `company_sales_20190207.csv`, which is being loaded to a warehouse table named `company_sales` defined with three fields: `comp_name (string)`, `comp_no (integer)`, and `sales_amt (decimal)`.

File: `company_sales_20190207.csv`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc., 101, 100.00
```

| comp_name | comp_no | sales_amt |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge,Inc. | 101 | 100.00 |

A typical error for this use case would be: `Error: could not convert string to integer`.

As you can see, the unquoted comma in the 'Openbridge, Inc.' value is treated as a delimiter, and the text after the comma is 'shifted' into the next column (`comp_no`). In this example, the file will fail because there is a field type mismatch between the value 'Inc.' (string) and the table field type (integer).

### Resolution
The first step to resolving this failure is to determine which field value(s) have the unquoted comma. Excel can help with this, as it will place the field values in spreadsheet columns based on the comma delimiter. Once the file is opened in Excel, you can filter the columns to identify values that do not belong (e.g. a value of 'openbridge' in a field named `sales_amt` that should only contain decimal values).

Once identified, there are a couple of options for 'fixing' this data:

1. Surround the field value in double quotes (e.g. `"Openbridge, Inc."`). This tells the system that the comma is part of the field value and not a delimiter. As long as the field in question is defined properly (in this case as a string), the data for that row will load successfully.
2. You can also remove the comma from the field value (e.g. change the value `'Openbridge, Inc.'` to `'Openbridge Inc.'`). While viable, the preferred approach is to use the double quotes in (1).

Once this issue is resolved in your file, re-post the file to the API for testing.

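Option (1) is exactly what a standards-compliant CSV writer does for you. A quick sketch with Python's `csv` module, whose default `QUOTE_MINIMAL` setting quotes only the fields that need it:

```python
import csv
import io

def to_csv_line(row):
    """Serialize one record; fields containing the delimiter are
    automatically wrapped in double quotes."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator='\n').writerow(row)
    return buf.getvalue()

line = to_csv_line(['Openbridge, Inc.', '101', '100.00'])
```

Here `line` comes back as `"Openbridge, Inc.",101,100.00` followed by a newline: only the field with the embedded comma is quoted.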
## Use Case 2: Quotation Marks
As described in the previous use case, quotation marks can be used to surround values containing commas so they are not treated as delimiters. However, a lone quotation mark is problematic: the system will consider everything after it, until the end of the row, to be part of the field, and the file will fail to load.

### Example
In the example below, there is a single quotation mark in the second row of the file named `company_sales_20190208.csv`, which is being loaded to a warehouse table named `company_sales`.

File: `company_sales_20190208.csv`
Table: `company_sales`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc.,"101,100.00
```

| comp_name (string) | comp_no (integer) | sales_amt (decimal) |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge,Inc. | 101,100.00 | |

As you can see, the unmatched quotation mark before `101` caused the system to treat the rest of the record as the value for the `comp_no` field, and the record will once again fail to load because the data types do not align.

### Resolution

There are a couple of options to resolve this issue, depending on whether the quotation mark was intended to surround a value or was included in error:

1. If quotation marks were intended, add the closing quotation mark at the end of the respective value.
2. If the single quotation mark was added in error, delete it.

Once this issue is resolved in your file, re-post the file to the API.

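A cheap way to catch this class of problem before posting: in RFC 4180 data, quotes inside quoted fields are escaped by doubling, so a well-formed single-line record always contains an even number of double-quote characters. A sketch (the helper name is illustrative, and the check assumes no fields with embedded line breaks):

```python
def has_unbalanced_quote(raw_line):
    """True when a physical record contains an odd number of double
    quotes, which indicates an unmatched (lone) quotation mark."""
    return raw_line.count('"') % 2 == 1
```

Applied line by line, this would flag the `"101` row from the example above while passing properly quoted values such as `"Openbridge, Inc."`.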
## Use Case 3: Leading or trailing spaces in field values
Sometimes source systems emit leading or trailing spaces in data fields. According to RFC 4180, "Spaces are considered part of a field and should not be ignored." As a result, **we will not treat spaces as a failure** in our validation tests, since such files conform to the specification.

However, data containing leading or trailing spaces can still cause problems. In particular, issues can arise if a file contains empty or null values. If those null values include spaces, a system may treat them as strings, which may not be compatible with the field type defined in your destination table (e.g. integer).

Spaces can also cause issues with joins between tables in your warehouse. If trimming spaces from your source data is not possible, you can typically remove them by creating views in your warehouse that use `TRIM` functions.

In many cases you will want to remove (trim) these spaces prior to posting.

### Example
In the example below, there is an empty value consisting of spaces for the field `sales_amt` in the second row of the file named `company_sales_20190209.csv`, which is being loaded to a warehouse table named `company_sales`. That record will fail because the string field type associated with the spaces does not match the decimal field type in the table definition.

File: `company_sales_20190209.csv`
Table: `company_sales`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge Inc., 101, 
```

| comp_name (string) | comp_no (integer) | sales_amt (decimal) |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge Inc. | 101 | ' ' |

### Resolution

There are a couple of options to resolve this issue, depending on whether the null value included in the file is a valid value or not:

1. If the value is valid, remove the spaces in the field value to indicate a true null; a field containing spaces is not null.
2. Replace the value with a valid field value that matches the field data type in the table (in our example, a decimal).

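If you control the file before posting, trimming is straightforward. A sketch that strips every field so that values consisting only of spaces become true empty (null) values:

```python
import csv
import io

def trim_fields(text):
    """Parse CSV text and strip leading/trailing spaces from every
    field; all-whitespace values collapse to empty strings."""
    reader = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in reader]

rows = trim_fields('comp_name,comp_no,sales_amt\nOpenbridge Inc., 101,  \n')
```

The second row comes back as `['Openbridge Inc.', '101', '']`: the all-space `sales_amt` value is now genuinely empty rather than a string of spaces.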
# Status Codes
## Validation Request Endpoint

* Status Code: `HTTP/2 302; Success - Processing has begun`
* Status Code: `HTTP/2 404; Failure - no sample file was provided`
* Status Code: `HTTP/2 400; Failure - the rules API was unable to initialize the request`

## Results Polling Endpoint
* Status Code: `HTTP/2 302; Success - still processing`
* Status Code: `HTTP/2 200; Success - processing completed, file validated successfully`
* Status Code: `HTTP/2 502; Failure - file determined to be invalid by rules API`
* Status Code: `HTTP/2 404; Failure - invalid request ID in polling URL (expired, etc.)`
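
For scripting against the polling endpoint, the table above maps naturally onto a lookup (an illustrative helper, not part of the API itself):

```python
POLL_STATUS = {
    302: 'Success - still processing',
    200: 'Success - processing completed, file validated successfully',
    502: 'Failure - file determined to be invalid by rules API',
    404: 'Failure - invalid request ID in polling URL',
}

def describe(status_code):
    """Translate a polling endpoint status code into its meaning."""
    return POLL_STATUS.get(status_code, 'unknown status')
```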

validation_client.py

Lines changed: 81 additions & 0 deletions
'''
Created on Feb 15, 2019

@author: Devin
'''
import csv
import optparse
import os
from time import sleep

import requests

endpoint = 'https://validation.openbridge.io/dryrun'


def main(file_path):
    # Split the input into row-limited parts so each POST stays small.
    with open(file_path, newline='') as f:
        files = split(f)

    invalid_parts = []
    print("Processing file...")
    for floc in files:
        with open(floc, 'rb') as part:
            resp = requests.post(url=endpoint,
                                 json={'data': {'attributes': {'is_async': True}}},
                                 files={'file': part},
                                 allow_redirects=False)
        if resp.status_code != 302:
            return "Received an unexpected response from validation API: {}".format(resp.status_code)
        poll_endpoint = resp.headers['Location']
        # Poll until the API reports a final (non-302) status.
        while True:
            resp = requests.get(url=poll_endpoint, allow_redirects=False)
            if resp.status_code != 302:
                break
            sleep(2)
        if resp.status_code != 200:
            invalid_parts.append(floc)

    if invalid_parts:
        response = "ERROR: Received errors for parts: {}".format(', '.join(invalid_parts))
    else:
        response = "SUCCESS: The file passed validation tests"

    # Clean up the temporary part files.
    for floc in files:
        os.remove(floc)
    return response


def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    """Split a CSV into parts of at most row_limit data rows, repeating
    the header row in each part. Returns the list of part file paths."""
    files = []
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_file = open(current_out_path, 'w', newline='')
    current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            files.append(current_out_path)
            current_out_file.close()
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_file = open(current_out_path, 'w', newline='')
            current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
    current_out_file.close()
    files.append(current_out_path)
    return files


if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-f', '--file', dest='file_path', help='Path to the file')

    options, args = parser.parse_args()
    print(main(options.file_path))

validation_client.sh

Lines changed: 65 additions & 0 deletions
#!/usr/bin/env bash

# Set and test variables
args=("$@")
[[ -z ${args[0]} ]] && echo "ERROR: Was expecting the path and file to be passed" && exit 1
file="${args[0]}"
if [[ ! -f ${file} ]]; then echo "ERROR: You supplied ${file} but the file can not be located by this script. Please check the path or filename." && exit 1; else echo "OK: The file being sent for validation is located here: ${file}"; fi
size=$(wc -c < "${file}")
base="${file##*/}"
name="${base%.*}"
runtime=$(date +%Y%m%d_%H%M%S)
workdir=${name}_${runtime}

# Check for curl, we require it for this script
if [[ ! -x "$(command -v curl)" ]]; then echo "Error: curl is not installed and is required for this script to function properly" >&2; exit 1; fi

# Setup the workspace; work on a copy so the original file is untouched
mkdir -p ./"${workdir}" && cp "$file" "$workdir" && cd "$workdir" || exit

# Split the file so we can post it in blocks.
if (( size > 999 )); then
  awk -v l=2000 '(NR==1){header=$0;next}
    (NR%l==2) {
      c=sprintf("%0.5d",c+1);
      close(file); file=FILENAME; sub(/csv$/,c".csv",file)
      print header > file
    }
    {print $0 > file}' "$base"
  # Remove the unsplit copy prior to testing as we don't want it sent
  rm -f "$base"
else
  echo "OK: The file being sent for validation is small enough to post without splitting"
fi

for i in ./*.csv; do
  # Submit the file for validation
  location=$(curl -s -w "%{http_code}" -F file=@"$i" 'https://validation.openbridge.io/dryrun' -H 'Content-Type: multipart/form-data' -D -)
  code="${location:(-3)}"
  if [[ ${code} = '400' ]]; then
    echo "ERROR: The validator was unable to process the file" && exit 1
  elif [[ ${code} = '404' ]]; then
    echo "ERROR: No sample file was provided" && exit 1
  elif [[ ${code} = '302' ]]; then
    echo "PENDING: The file $i was submitted for testing. Processing..."
  else
    echo "ERROR: An unknown error occurred" && exit 1
  fi
  # Check the polling URL for the results
  response=$(echo "${location}" | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq)
  while [[ "$(curl -s -o /dev/null -w "%{http_code}" "${response}")" == "302" ]]; do echo "PENDING: Waiting for the validation of file $i to complete..." && sleep 5; done
  res=$(curl -s -w "%{http_code}" "${response}")
  loc=${res:(-3)}
  if [[ ${loc} = "200" ]]; then
    echo "SUCCESS: File $i passed validation tests"
  elif [[ ${loc} = "502" ]]; then
    echo "ERROR: File $i determined to be invalid." && exit 1
  elif [[ ${loc} = "404" ]]; then
    echo "ERROR: Invalid request ID in polling URL (expired, malformed...)" && exit 1
  else
    echo "ERROR: An unknown error occurred" && exit 1
  fi
done

exit 0
