
I used the urllib2 Python library to read the SEC filing document from the defined URL. The entire document was read into a string buffer and then cleaned of unnecessary details (i.e., the headers and trailing HTML markers). I had to open the page in Internet Explorer in debug mode so I could examine the HTML tags and build the search strings needed to grab the required data.
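The fetch-and-clean step can be sketched roughly as below. This is not the code in SEC_Filing.py: it uses Python 3's urllib.request in place of urllib2, the function names fetch_document and strip_wrapper are made up for illustration, and it assumes the useful content sits inside the page's body element.

```python
import re
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def fetch_document(url):
    """Read the whole page at `url` into one string buffer."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def strip_wrapper(buffer):
    """Keep only the text inside <body>...</body> (assumed layout).

    Drops the header section and the trailing HTML markers; if no body
    element is found, the buffer is returned unchanged.
    """
    match = re.search(r"<body[^>]*>(.*)</body>", buffer,
                      flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else buffer
```

A caller would pass the filing URL to fetch_document and hand the result to strip_wrapper before any pattern matching.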

The entire script took 15 hours to write, test, and debug. The script takes no arguments at this time, so it can be executed directly as 'python SEC_Filing.py'.

Apart from urllib2, the only library used was 're', for regular expressions. The output files are stored in the current directory for the time being.

Every effort was made to extract as much information as possible, even though the source details were not uniform.

process:

set up a search pattern to grab the *
use re.findall to look for the pattern in the buffer (variable link)
loop through the tuples in the search results
set up a further pattern to retrieve the details embedded between tags; if there are tables, try to gather details from the table
make an entry in DOC_LINE_ARRAY for each specific pattern
make an entry in PARAGRAPH_ARRAY for each paragraph that contains '$'
after processing completes, dump the arrays (DOC_LINE_ARRAY and PARAGRAPH_ARRAY) to separate files
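The steps above might look something like the sketch below. The real search strings are not given here, so the div, table, p, and td patterns are assumptions, as are the function names process and dump; only re.findall and the two array names come from the notes above.

```python
import re

# Hypothetical patterns; the actual script built its own from the page's HTML.
BLOCK_PATTERN = re.compile(r"<(div|table)[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
PARA_PATTERN = re.compile(r"<p[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
CELL_PATTERN = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean(fragment):
    """Remove any remaining tags and collapse runs of whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", fragment).split())


def process(buffer):
    """Return (DOC_LINE_ARRAY, PARAGRAPH_ARRAY) built from the buffer."""
    doc_line_array, paragraph_array = [], []
    # findall with two groups yields (tag name, inner markup) tuples.
    for tag, body in BLOCK_PATTERN.findall(buffer):
        is_table = tag.lower() == "table"
        inner = CELL_PATTERN if is_table else PARA_PATTERN
        for piece in inner.findall(body):
            text = clean(piece)
            if not text:
                continue
            doc_line_array.append(text)
            # Only paragraphs that mention a dollar amount are kept separately.
            if not is_table and "$" in text:
                paragraph_array.append(text)
    return doc_line_array, paragraph_array


def dump(lines, path):
    """Write one entry per line to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling process on the cleaned buffer and then dump once per array, each with its own file name, covers the final step. The non-greedy patterns do not handle nested blocks of the same tag, which is one reason non-uniform source pages are hard to extract from.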

files:

README.md
SEC_Filing.py
document.txt
paragraph.txt


sorkan/rajeshac
