
Commit e2bb0ac: initial commit (1 parent: a1410c8)

File tree: 3 files changed, +343 / -0 lines

README.md

Lines changed: 197 additions & 0 deletions
# ob_file_validation
The purpose of this API is to validate CSV files for general compliance with established norms such as [RFC4180](https://tools.ietf.org/html/rfc4180). Imagine pouring a gallon of maple syrup into your car's gas tank. That is what bad CSV files do to data pipelines. This API helps you determine the quality of CSV data before it enters one.

# Background
Comma-separated values (CSV) is a common format for exchanging data between systems. While the format is ubiquitous, it can present difficulties. Why? Different tools and export processes often generate output that is not a CSV file at all, or that carries variations which are not considered "valid" according to [RFC4180](https://tools.ietf.org/html/rfc4180).

Invalid CSV files create challenges for those building data pipelines. Pipelines value consistency, predictability, and testability, as these qualities ensure uninterrupted operation from source to target destination.

## What does a valid CSV look like?
Here is an example of a valid CSV file.

It has a header row with `foo`, `bar`, and `buzz`, and a corresponding data row with `aaa`, `bbb`, and `ccc`:

| foo | bar | buzz |
|---|---|---|
| aaa | bbb | ccc |

The raw CSV will look something like this:

```bash
foo,bar,buzz
aaa,bbb,ccc
```

However, what if one day something changed and the file now looks like this:

```bash
foo,bar,buzz
aaa,zzz,bbb,ccc
```

So what is wrong with this? RFC 4180 says:

*Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.*

Notice that an additional `zzz` now sits between `aaa` and `bbb`. This file would be marked as invalid because of this misalignment. Is `zzz` correct? What about `ccc`? Maybe there is a missing header value for `ccc`? Regardless, the file has issues.

For other examples, please [take a look at the RFC document](https://tools.ietf.org/html/rfc4180) for guidance on proper formats for Comma-Separated Values (CSV) files.

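The rule quoted above boils down to a simple invariant: every record must have the same number of fields as the header row. A minimal sketch of that check using Python's `csv` module (the function and variable names here are illustrative, not part of this API):

```python
import csv
import io

def inconsistent_rows(text):
    """Return (line_number, field_count) for every record whose field
    count differs from the header row."""
    reader = csv.reader(io.StringIO(text))
    expected = len(next(reader))  # the header defines the expected width
    return [(num, len(row)) for num, row in enumerate(reader, start=2)
            if len(row) != expected]

valid = "foo,bar,buzz\naaa,bbb,ccc\n"
invalid = "foo,bar,buzz\naaa,zzz,bbb,ccc\n"
```

Running `inconsistent_rows(invalid)` flags line 2 as having 4 fields where 3 were expected, while the valid sample produces an empty list.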
# API Usage
There are two steps to the validation process. The first step is to post the file(s) to the API. If a file is accepted, the API returns a polling endpoint that will contain the results of the validation process.

In the examples below we will assume you have a CSV file called `your.csv` that you want to test.

## Step 1: Post `your.csv` to validation API
We will run a simple `curl` command that sends the data to the API. It looks like this:

```bash
curl -F 'file=@/path/to/your.csv' 'https://validation.openbridge.io/dryrun' -H 'Content-Type: multipart/form-data' -D -
```

For the sake of this example, we will assume that the file posted successfully. You will see an `HTTP/2 302` response like this:

```bash
HTTP/2 302
content-type: application/json
content-length: 2
location: https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
date: Fri, 15 Feb 2019 22:36:57 GMT
x-amzn-requestid: 33316acc-3172-11e9-b824-ddbb1dd27279
x-amz-apigw-id: VKbJWGPtIAMF5Wg=
x-amzn-trace-id: Root=1-5c673f07-252f6e91546d64f019526eb6;Sampled=0
x-cache: Miss from cloudfront
via: 1.1 ab6d050b627b51ed631842034b2e298b.cloudfront.net (CloudFront)
x-amz-cf-id: XLW--LDgeqe7xpLvuoKxaGvKfvNcB4BNkyvx62P99N3qfgeuAvT7EA==
```

## Step 2: Get Your Results
Note the `location` URL endpoint:

```bash
https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
```

This is your polling endpoint. You use it to obtain the results of the validation process. You can use `curl` again:

```bash
curl -s -w "%{http_code}" "https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac"
```

If the file was properly formatted, with no errors, you will get an `HTTP/2 200` response from your polling endpoint. This indicates success! As a bonus, the response also provides an inferred schema. For our `your.csv` file the schema looks like this:

```json
{"data": {"rules": {"configuration": {"load": {"prepend_headers": false, "schema": {"fields": [{"default": null, "include_in_checksum": true, "name": "ob_mws_seller_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "ob_mws_marketplace_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_name", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_description", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "listing_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "seller_sku", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "price", "type": "DOUBLE PRECISION"}, {"default": null, "include_in_checksum": true, "name": "quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "open_date", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "image_url", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_is_marketplace", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id_type", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_shipping_fee", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_note", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_condition", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_category1", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_browse_path", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_storefront_feature", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin1", "type": "VARCHAR (1024)"}, {"default": null, 
"include_in_checksum": true, "name": "asin2", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin3", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "will_ship_internationally", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "expedited_shipping", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_boldface", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "bid_for_featured_placement", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "add_delete", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "pending_quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "fulfillment_channel", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "merchant_shipping_group", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": false, "name": "ob_transaction_id", "type": "varchar(256)"}, {"default": null, "include_in_checksum": false, "name": "ob_file_name", "type": "varchar(2048)"}, {"default": null, "include_in_checksum": false, "name": "ob_processed_at", "type": "varchar(256)"}, {"default": "getdate()", "include_in_checksum": false, "name": "ob_modified_date", "type": "datetime"}]}}}, "destination": {"tablename": "sample"}, "dialect": {"__doc__": null, "__module__": "csv", "_name": "sniffed", "delimiter": ",", "doublequote": true, "encoding": "UTF-8", "lineterminator": "\r\n", "quotechar": "\"", "quoting": 0, "skipinitialspace": false}, "meta": {"creation_date": null, "version": null}}}}
```

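The two steps above can be combined into a small Python sketch using the `requests` library (an illustrative client, not an official one; the HTTP callables are injectable only so the logic can be exercised without network access):

```python
import time
import requests

ENDPOINT = 'https://validation.openbridge.io/dryrun'

def validate(path, post=requests.post, get=requests.get, sleep=time.sleep):
    """Post a CSV to the validation API, then poll the returned
    Location URL until a final (non-302) status arrives."""
    with open(path, 'rb') as f:
        resp = post(ENDPOINT, files={'file': f}, allow_redirects=False)
    if resp.status_code != 302:
        raise RuntimeError('unexpected response: {}'.format(resp.status_code))
    poll_url = resp.headers['Location']
    while True:
        resp = get(poll_url, allow_redirects=False)
        if resp.status_code != 302:  # 302 means "still processing"
            return resp.status_code  # 200 = valid, 502 = invalid
        sleep(2)
```

A return value of `200` means the file validated successfully; anything else should be treated per the status-code tables later in this document.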
## Client Code: `Bash` and `Python`
See `validation_client.sh` and `validation_client.py` in this repository for working client examples.

# Troubleshooting

## Use Case 1: Unquoted commas
If your file is comma delimited, confirm that there are no extra unquoted commas in your data. An unquoted comma will be treated as a field delimiter, with the result that your data is 'shifted' and one or more extra columns are created.

In the example below, there is an unquoted comma (Openbridge, Inc.) in the second row of the file named `company_sales_20190207.csv`, which is being loaded to a warehouse table named `company_sales` defined with three fields: `comp_name (string)`, `comp_no (integer)`, and `sales_amt (decimal)`.

File: `company_sales_20190207.csv`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc., 101, 100.00
```

| comp_name | comp_no | sales_amt |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge,Inc. | 101 | 100.00 |

A typical error for this use case would be: `Error: could not convert string to integer`.

As you can see, the unquoted comma in the 'Openbridge, Inc.' value is treated as a delimiter, and the text after the comma is 'shifted' into the next column (`comp_no`). In this example, the file will fail because there is a field type mismatch between the value 'Inc.' (string) and the table field type (integer).

### Resolution
The first step to resolving this failure is to determine which field value(s) have the unquoted comma. Excel can help with this, as it will place the field values in spreadsheet columns based on the comma delimiter. Once the file is opened in Excel, you can filter the columns to identify values that do not belong (e.g. a value of 'openbridge' in a field named `sales_amt` that should only contain decimal values).

Once identified, there are a couple of options for 'fixing' this data:

1. Surround the field value in double quotes (e.g. `"Openbridge, Inc."`). This tells the system that the comma is part of the field value and not a delimiter. As long as the field in question is defined properly (in this case as a string), the data for that row will load successfully.
2. You can also remove the comma from the field value (e.g. change the value `'Openbridge, Inc.'` to `'Openbridge Inc.'`). While viable, the preferred approach is to use the double quotes in (1).

Once this issue is resolved in your file, re-post the file to the API for testing.

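Option (1) is exactly what a standards-compliant CSV writer does for you. A quick sketch with Python's `csv` module, whose default `QUOTE_MINIMAL` setting quotes only the fields that need it:

```python
import csv
import io

def to_csv_line(row):
    """Serialize one record; fields containing the delimiter are
    automatically wrapped in double quotes."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator='\n').writerow(row)
    return buf.getvalue()

line = to_csv_line(['Openbridge, Inc.', '101', '100.00'])
```

Here `line` comes back as `"Openbridge, Inc.",101,100.00` followed by a newline: only the field with the embedded comma is quoted.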
## Use Case 2: Quotation Marks
As described in the previous use case, quotation marks can be used to surround values containing commas so they are not treated as delimiters. However, a lone quotation mark is problematic: the system will consider everything after it, until the end of the row, to be part of the field, and the file will fail to load.

### Example
In the example below, there is a single quotation mark in the second row of the file named `company_sales_20190208.csv`, which is being loaded to a warehouse table named `company_sales`.

File: `company_sales_20190208.csv`
Table: `company_sales`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc.,"101,100.00
```

| comp_name (string) | comp_no (integer) | sales_amt (decimal) |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge,Inc. | 101,100.00 | |

As you can see, the unmatched quotation mark before `101` caused the system to treat the rest of the record as the value for the `comp_no` field, and the record will once again fail to load because the data types do not align.

### Resolution

There are a couple of options to resolve this issue, depending on whether the quotation mark was intended to surround a value or was included in error:

1. If quotation marks were intended, add the closing quotation mark at the end of the respective value.
2. If the single quotation mark was added in error, delete it.

Once this issue is resolved in your file, re-post the file to the API.

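A cheap way to catch this class of problem before posting: in RFC 4180 data, quotes inside quoted fields are escaped by doubling, so a well-formed single-line record always contains an even number of double-quote characters. A sketch (the helper name is illustrative, and the check assumes no fields with embedded line breaks):

```python
def has_unbalanced_quote(raw_line):
    """True when a physical record contains an odd number of double
    quotes, which indicates an unmatched (lone) quotation mark."""
    return raw_line.count('"') % 2 == 1
```

Applied line by line, this would flag the `"101` row from the example above while passing properly quoted values such as `"Openbridge, Inc."`.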
## Use Case 3: Leading or trailing spaces in field values
Sometimes source systems emit leading or trailing spaces in data fields. According to RFC 4180, "Spaces are considered part of a field and should not be ignored." As a result, **we will not treat spaces as a failure** in our validation tests, since such files conform to the specification.

However, data containing leading or trailing spaces can still cause problems. In particular, issues can arise if a file contains empty or null values. If those null values include spaces, a system may treat them as strings, which may not be compatible with the field type defined in your destination table (e.g. integer).

Spaces can also cause issues with joins between tables in your warehouse. If trimming spaces from your source data is not possible, you can typically remove them by creating views in your warehouse that use `TRIM` functions.

In many cases you will want to remove (trim) these spaces prior to posting.

### Example
In the example below, there is an empty value consisting of spaces for the field `sales_amt` in the second row of the file named `company_sales_20190209.csv`, which is being loaded to a warehouse table named `company_sales`. That record will fail because the string field type associated with the spaces does not match the decimal field type in the table definition.

File: `company_sales_20190209.csv`
Table: `company_sales`

```bash
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge Inc., 101, 
```

| comp_name (string) | comp_no (integer) | sales_amt (decimal) |
|---|---|---|
| Acme Corp. | 100 | 500.00 |
| Openbridge Inc. | 101 | ' ' |

### Resolution

There are a couple of options to resolve this issue, depending on whether the null value included in the file is a valid value or not:

1. If the value is valid, remove the spaces in the field value to indicate a true null; a field containing spaces is not null.
2. Replace the value with a valid field value that matches the field data type in the table (in our example, a decimal).

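If you control the file before posting, trimming is straightforward. A sketch that strips every field so that values consisting only of spaces become true empty (null) values:

```python
import csv
import io

def trim_fields(text):
    """Parse CSV text and strip leading/trailing spaces from every
    field; all-whitespace values collapse to empty strings."""
    reader = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in reader]

rows = trim_fields('comp_name,comp_no,sales_amt\nOpenbridge Inc., 101,  \n')
```

The second row comes back as `['Openbridge Inc.', '101', '']`: the all-space `sales_amt` value is now genuinely empty rather than a string of spaces.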
# Status Codes
## Validation Request Endpoint

* Status Code: `HTTP/2 302; Success - Processing has begun`
* Status Code: `HTTP/2 404; Failure - no sample file was provided`
* Status Code: `HTTP/2 400; Failure - the rules API was unable to initialize the request`

## Results Polling Endpoint
* Status Code: `HTTP/2 302; Success - still processing`
* Status Code: `HTTP/2 200; Success - processing completed, file validated successfully`
* Status Code: `HTTP/2 502; Failure - file determined to be invalid by rules API`
* Status Code: `HTTP/2 404; Failure - invalid request ID in polling URL (expired, etc.)`
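
For scripting against the polling endpoint, the table above maps naturally onto a lookup (an illustrative helper, not part of the API itself):

```python
POLL_STATUS = {
    302: 'Success - still processing',
    200: 'Success - processing completed, file validated successfully',
    502: 'Failure - file determined to be invalid by rules API',
    404: 'Failure - invalid request ID in polling URL',
}

def describe(status_code):
    """Translate a polling endpoint status code into its meaning."""
    return POLL_STATUS.get(status_code, 'unknown status')
```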

validation_client.py

Lines changed: 81 additions & 0 deletions
'''
Created on Feb 15, 2019

@author: Devin
'''
import csv
import optparse
import os
from time import sleep

import requests

endpoint = 'https://validation.openbridge.io/dryrun'


def main(file_path):
    # Split the input into row-limited parts so each POST stays small.
    with open(file_path, newline='') as f:
        files = split(f)

    invalid_parts = []
    print("Processing file...")
    for floc in files:
        with open(floc, 'rb') as part:
            resp = requests.post(url=endpoint,
                                 json={'data': {'attributes': {'is_async': True}}},
                                 files={'file': part},
                                 allow_redirects=False)
        if resp.status_code != 302:
            return "Received an unexpected response from validation API: {}".format(resp.status_code)
        poll_endpoint = resp.headers['Location']
        # Poll until the API reports a final (non-302) status.
        while True:
            resp = requests.get(url=poll_endpoint, allow_redirects=False)
            if resp.status_code != 302:
                break
            sleep(2)
        if resp.status_code != 200:
            invalid_parts.append(floc)

    if invalid_parts:
        response = "ERROR: Received errors for parts: {}".format(', '.join(invalid_parts))
    else:
        response = "SUCCESS: The file passed validation tests"

    # Clean up the temporary part files.
    for floc in files:
        os.remove(floc)
    return response


def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    """Split a CSV into parts of at most row_limit data rows, repeating
    the header row in each part. Returns the list of part file paths."""
    files = []
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_file = open(current_out_path, 'w', newline='')
    current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            files.append(current_out_path)
            current_out_file.close()
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_file = open(current_out_path, 'w', newline='')
            current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
    current_out_file.close()
    files.append(current_out_path)
    return files


if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-f', '--file', dest='file_path', help='Path to the file')

    options, args = parser.parse_args()
    print(main(options.file_path))

validation_client.sh

Lines changed: 65 additions & 0 deletions
#!/usr/bin/env bash

# Set and test variables
args=("$@")
[[ -z ${args[0]} ]] && echo "ERROR: Was expecting the path and file to be passed" && exit 1
file="${args[0]}"
if [[ ! -f ${file} ]]; then echo "ERROR: You supplied ${file} but the file can not be located by this script. Please check the path or filename." && exit 1; else echo "OK: The file being sent for validation is located here: ${file}"; fi
size=$(wc -c < "${file}")
base="${file##*/}"
name="${base%.*}"
runtime=$(date +%Y%m%d_%H%M%S)
workdir=${name}_${runtime}

# Check for curl, we require it for this script
if [[ ! -x "$(command -v curl)" ]]; then echo "Error: curl is not installed and is required for this script to function properly" >&2; exit 1; fi

# Setup the workspace; work on a copy so the original file is untouched
mkdir -p ./"${workdir}" && cp "$file" "$workdir" && cd "$workdir" || exit

# Split the file so we can post it in blocks.
if (( size > 999 )); then
  awk -v l=2000 '(NR==1){header=$0;next}
    (NR%l==2) {
      c=sprintf("%0.5d",c+1);
      close(file); file=FILENAME; sub(/csv$/,c".csv",file)
      print header > file
    }
    {print $0 > file}' "$base"
  # Remove the unsplit copy prior to testing as we don't want it sent
  rm -f "$base"
else
  echo "OK: The file being sent for validation is small enough to post without splitting"
fi

for i in ./*.csv; do
  # Submit the file for validation
  location=$(curl -s -w "%{http_code}" -F file=@"$i" 'https://validation.openbridge.io/dryrun' -H 'Content-Type: multipart/form-data' -D -)
  code="${location:(-3)}"
  if [[ ${code} = '400' ]]; then
    echo "ERROR: The validator was unable to process the file" && exit 1
  elif [[ ${code} = '404' ]]; then
    echo "ERROR: No sample file was provided" && exit 1
  elif [[ ${code} = '302' ]]; then
    echo "PENDING: The file $i was submitted for testing. Processing..."
  else
    echo "ERROR: An unknown error occurred" && exit 1
  fi
  # Check the polling URL for the results
  response=$(echo "${location}" | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq)
  while [[ "$(curl -s -o /dev/null -w "%{http_code}" "${response}")" == "302" ]]; do echo "PENDING: Waiting for the validation of file $i to complete..." && sleep 5; done
  res=$(curl -s -w "%{http_code}" "${response}")
  loc=${res:(-3)}
  if [[ ${loc} = "200" ]]; then
    echo "SUCCESS: File $i passed validation tests"
  elif [[ ${loc} = "502" ]]; then
    echo "ERROR: File $i determined to be invalid." && exit 1
  elif [[ ${loc} = "404" ]]; then
    echo "ERROR: Invalid request ID in polling URL (expired, malformed...)" && exit 1
  else
    echo "ERROR: An unknown error occurred" && exit 1
  fi
done

exit 0
