Write a prototype that cleans & divides three separate regulatory documents into sections, subsections & paragraphs.
The regulatory documents are provided as PDF files in the docs folder.
The prototype should classify each paragraph by type and by subject matter. You can find the classes here.
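As a rough illustration of the dividing step, here is a minimal Python sketch. It assumes the PDF text has already been extracted to a plain string, that blank lines separate blocks, and that headings are numbered like "1 Scope" (section) and "1.1 Definitions" (subsection). The names `Paragraph`, `split_document` and `MAX_HEADING_LEN` are hypothetical, and the real documents may well need different heading rules.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed heading conventions (not taken from the actual documents):
#   "1 Scope"          -> section
#   "1.1 Definitions"  -> subsection
SECTION_RE = re.compile(r"^\d+\.?\s+\S")
SUBSECTION_RE = re.compile(r"^\d+\.\d+\.?\s+\S")
MAX_HEADING_LEN = 80  # heuristic: longer blocks are treated as body text


@dataclass
class Paragraph:
    section: str
    subsection: str
    text: str


def split_document(text: str) -> list[Paragraph]:
    """Split extracted text into paragraphs tagged with their headings."""
    section, subsection = "", ""
    paragraphs: list[Paragraph] = []
    # Blank lines are assumed to separate headings and paragraphs.
    for raw_block in re.split(r"\n\s*\n", text):
        # Cleaning step: re-join lines that the PDF layout wrapped.
        block = " ".join(
            line.strip() for line in raw_block.splitlines() if line.strip()
        )
        if not block:
            continue
        is_short = len(block) <= MAX_HEADING_LEN
        if is_short and SUBSECTION_RE.match(block):
            subsection = block
        elif is_short and SECTION_RE.match(block):
            section, subsection = block, ""  # a new section resets the subsection
        else:
            paragraphs.append(Paragraph(section, subsection, block))
    return paragraphs
```

The heuristic is deliberately simple: a short body paragraph that starts with a number would be mistaken for a heading, so a real implementation would probably also use layout cues such as font size or boldness.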
- A script that does the task described above (i.e., given a set of documents, it returns the divided documents and the classified paragraphs).
- Brief description of how you would deploy and prepare this prototype for production.
- Brief description of how you would evaluate & QA this prototype.
- List & prioritization of potential improvements.
The output should be a list of paragraphs, including the section, subsection and classes. The specific output format can be up to you.
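Since the format is open, one possible shape is a JSON list with one record per paragraph. The sketch below is only an illustration: the `ClassifiedParagraph` record, its field names and the `to_json` helper are assumptions, and the label values must come from the provided class list.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClassifiedParagraph:
    # All field names are illustrative assumptions, not a required schema.
    document: str        # source PDF file name
    section: str         # heading of the enclosing section
    subsection: str      # heading of the enclosing subsection ("" if none)
    text: str            # cleaned paragraph text
    paragraph_type: str  # "type" label taken from the provided class list
    subject_matter: str  # "subject matter" label from the same list


def to_json(records):
    """Serialize a list of ClassifiedParagraph records as a JSON array."""
    return json.dumps([asdict(r) for r in records], indent=2, ensure_ascii=False)
```

A flat list like this keeps each paragraph self-describing, which makes it easy to load into a table for evaluation and QA.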
You will receive an API key to use for this. Please use it responsibly :)