175 commits
da2c191
update north carolina
nanjiangwill Jul 30, 2024
d395e5d
update file
nanjiangwill Jul 30, 2024
4e04eb7
update file
nanjiangwill Jul 30, 2024
29f9034
update district extraction
nanjiangwill Jul 31, 2024
6f1e8db
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into dev
nanjiangwill Jul 31, 2024
4594ef7
update code
nanjiangwill Jul 31, 2024
c3986c8
update code for district extraction and entire pipeline
nanjiangwill Aug 1, 2024
29e446a
update code for district extraction and entire pipeline
nanjiangwill Aug 4, 2024
eecc217
update code for district extraction and entire pipeline
nanjiangwill Aug 4, 2024
100af80
update viz code
nanjiangwill Aug 5, 2024
e84187e
update eval logic
nanjiangwill Aug 5, 2024
f73bae0
update district extraction logic
nanjiangwill Aug 5, 2024
d0120a6
update code for heroku deploy
nanjiangwill Aug 8, 2024
d1db355
update code for heroku deploy
nanjiangwill Aug 8, 2024
4932ab7
update code for heroku deploy
nanjiangwill Aug 8, 2024
e9ad833
update data
nanjiangwill Aug 8, 2024
264b1ca
update git submodule for heroku
nanjiangwill Aug 8, 2024
6ad10b6
update git submodule for heroku
nanjiangwill Aug 8, 2024
26d1702
update git submodule for heroku
nanjiangwill Aug 8, 2024
2ad8eee
update git submodule for heroku
nanjiangwill Aug 8, 2024
8fc9a36
update git submodule for heroku
nanjiangwill Aug 8, 2024
8117b44
update git submodule for heroku
nanjiangwill Aug 8, 2024
74ab329
update files
nanjiangwill Aug 8, 2024
6e284a2
update large file
nanjiangwill Aug 8, 2024
b53f49d
viz code
nanjiangwill Aug 8, 2024
975e27d
viz code
nanjiangwill Aug 8, 2024
10215fd
viz code
nanjiangwill Aug 8, 2024
d1164e7
viz code
nanjiangwill Aug 8, 2024
3a9b9ac
update viz code, adding box
nanjiangwill Aug 8, 2024
3310ed7
update viz code, adding box
nanjiangwill Aug 8, 2024
10a89c4
update viz code, adding box
nanjiangwill Aug 8, 2024
bbc3a01
fix small typo
nanjiangwill Aug 8, 2024
71ecd1c
udpate for texas
nanjiangwill Aug 8, 2024
891c487
udpate for texas
nanjiangwill Aug 8, 2024
a4cbec3
udpate for texas
nanjiangwill Aug 8, 2024
b1951ec
udpate for texas
nanjiangwill Aug 8, 2024
edd6010
udpate for texas
nanjiangwill Aug 8, 2024
ca22f19
udpate for texas
nanjiangwill Aug 8, 2024
7c757e6
udpate for texas
nanjiangwill Aug 8, 2024
edd927c
udpate for texas
nanjiangwill Aug 8, 2024
c013b34
udpate district extraction code
nanjiangwill Aug 9, 2024
236ab83
update streamlit for deployment
nanjiangwill Aug 14, 2024
5cf0e40
update streamlit for deployment
nanjiangwill Aug 14, 2024
17a5d48
update streamlit for deployment
nanjiangwill Aug 14, 2024
7d03849
update streamlit for deployment
nanjiangwill Aug 14, 2024
af86f66
update streamlit for deployment
nanjiangwill Aug 14, 2024
50de469
update streamlit for deployment
nanjiangwill Aug 14, 2024
cbdf731
update streamlit for deployment
nanjiangwill Aug 14, 2024
f1db98c
update streamlit for deployment
nanjiangwill Aug 14, 2024
b1f2b64
update streamlit for deployment
nanjiangwill Aug 15, 2024
7c0b56d
update streamlit for deployment
nanjiangwill Aug 15, 2024
dcbc7cf
update streamlit for deployment
nanjiangwill Aug 15, 2024
0d27ccb
update streamlit for deployment
nanjiangwill Aug 15, 2024
d519e03
update streamlit for deployment
nanjiangwill Aug 15, 2024
5710e9a
update streamlit for deployment
nanjiangwill Aug 15, 2024
fe449de
update streamlit for deployment
nanjiangwill Aug 15, 2024
0b30eab
update streamlit for deployment
nanjiangwill Aug 15, 2024
0a11129
update streamlit for deployment
nanjiangwill Aug 15, 2024
30a0a5b
update streamlit for deployment
nanjiangwill Aug 15, 2024
f3c7696
update streamlit for deployment
nanjiangwill Aug 15, 2024
faf0216
update streamlit for deployment
nanjiangwill Aug 15, 2024
9a982aa
update streamlit for deployment
nanjiangwill Aug 15, 2024
e3da026
update streamlit for deployment
nanjiangwill Aug 15, 2024
deed333
update streamlit for deployment
nanjiangwill Aug 15, 2024
c2643e1
update streamlit for deployment
nanjiangwill Aug 15, 2024
9e8b2fa
update streamlit for deployment
nanjiangwill Aug 15, 2024
7fe1b8b
update streamlit for deployment
nanjiangwill Aug 15, 2024
9256f6f
update streamlit for deployment
nanjiangwill Aug 15, 2024
f6bfe0f
change f string
srush Aug 15, 2024
cd7e611
Change style
srush Aug 15, 2024
7622a04
html design branchj
srush Aug 15, 2024
19fa703
update streamlit for deployment
nanjiangwill Aug 15, 2024
af81945
Merge branch 'html_design' into dev
nanjiangwill Aug 15, 2024
1872269
update viz code and nc target name
nanjiangwill Aug 30, 2024
274f7e0
update nc target list
nanjiangwill Aug 30, 2024
713bfd0
update target file
nanjiangwill Aug 31, 2024
910808f
Update all examples template (excluding min_lot_size_examples.pmpt)
ethan-yz-hao Aug 31, 2024
0d8e392
Update es run config for textract_es_gpt4_north_carolina_search_range_3
ethan-yz-hao Aug 31, 2024
479886b
merge es run config for textract_es_gpt4_north_carolina_search_range_…
ethan-yz-hao Aug 31, 2024
74dbcfb
invalid json fix in min_lot_size_examples.pmpt.tpl
ethan-yz-hao Aug 31, 2024
f23b098
Base prompt modification: improving the clarity on handling general r…
ethan-yz-hao Aug 31, 2024
b713408
Add popup window to prompt user enter the name when entering the webs…
ethan-yz-hao Aug 31, 2024
e2c4734
Setup timer and store elapsed_sec to database
ethan-yz-hao Aug 31, 2024
f80958c
Fix steps number
ethan-yz-hao Aug 31, 2024
6b081d5
Add synonym for thesaurus.json (max density and min_parking_spaces)
ethan-yz-hao Sep 1, 2024
6d09768
example template fix / adding data range
ethan-yz-hao Sep 4, 2024
7035880
udpate template
nanjiangwill Sep 4, 2024
64920dd
udpate merge
nanjiangwill Sep 4, 2024
d2fb08b
udpate merge
nanjiangwill Sep 4, 2024
9c1cdcc
update files
nanjiangwill Sep 4, 2024
9dd5917
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into dev
nanjiangwill Sep 4, 2024
0c38d38
adding page output (main prompt and examples) / bugfix in prompt
ethan-yz-hao Sep 4, 2024
e803123
update files
nanjiangwill Sep 5, 2024
fe8ee2d
Merge branch 'ethan' into dev
nanjiangwill Sep 5, 2024
0db32e2
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into ethan
ethan-yz-hao Sep 7, 2024
1095332
update files
nanjiangwill Sep 7, 2024
384fbec
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into ethan
ethan-yz-hao Sep 7, 2024
9063498
update files
nanjiangwill Sep 7, 2024
1593819
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into ethan
ethan-yz-hao Sep 7, 2024
c224ac6
new viz / enable popup name
ethan-yz-hao Sep 7, 2024
a34dc13
bugfix start_time
ethan-yz-hao Sep 7, 2024
edd93fb
bigger, centered title
ethan-yz-hao Sep 7, 2024
46fdad6
condense eval term and district to the same level, ordered by page
ethan-yz-hao Sep 7, 2024
0587012
displaying next item
ethan-yz-hao Sep 7, 2024
4c63e6f
add progress bar
ethan-yz-hao Sep 7, 2024
dd7a03c
formatting
ethan-yz-hao Sep 7, 2024
f8d0d10
sync with db / progress bar based on db / skipped labelled based on d…
ethan-yz-hao Sep 8, 2024
b1046cd
update files
nanjiangwill Sep 10, 2024
fd200a2
Merge branch 'ethan' into dev
nanjiangwill Sep 10, 2024
f2eb595
time / cache
ethan-yz-hao Sep 10, 2024
4a28284
update files
nanjiangwill Sep 10, 2024
d278591
avoid duplicate reload
ethan-yz-hao Sep 10, 2024
ad18bf6
remove sidebar
ethan-yz-hao Sep 10, 2024
a712c30
fix pyright issue
ethan-yz-hao Sep 10, 2024
ba929d0
fix displaying next item
ethan-yz-hao Sep 10, 2024
fb4819c
fix displaying next item
ethan-yz-hao Sep 10, 2024
b301c00
use onclick for button to avoid rerun / add model for next town / for…
ethan-yz-hao Sep 11, 2024
8ca665a
add download button
ethan-yz-hao Sep 11, 2024
fea8f4a
store ocr info in session state
ethan-yz-hao Sep 11, 2024
586ae63
interface v1
nanjiangwill Sep 11, 2024
0c260e1
interface and result v2
nanjiangwill Sep 12, 2024
3e4f6d8
remove files
nanjiangwill Sep 12, 2024
96ef537
add results
nanjiangwill Sep 12, 2024
323875b
remove files
nanjiangwill Sep 12, 2024
530ee8e
add files with git lfs
nanjiangwill Sep 12, 2024
6327e50
Added Dev Container Folder
nanjiangwill Sep 12, 2024
801f2d2
add files
nanjiangwill Sep 12, 2024
35d2bf3
Merge branch 'dev' of https://github.com/nanjiangwill/zoning into dev
nanjiangwill Sep 12, 2024
4d46b95
add files
nanjiangwill Sep 12, 2024
eb27e86
add files
nanjiangwill Sep 12, 2024
8459d62
add files
nanjiangwill Sep 12, 2024
3cf36d4
add files
nanjiangwill Sep 12, 2024
a884e1e
add files
nanjiangwill Sep 12, 2024
16a1a10
remove results
nanjiangwill Sep 12, 2024
1b38485
all downloading remove big file
nanjiangwill Sep 12, 2024
3ab740a
production interface v1
nanjiangwill Sep 12, 2024
1178e6d
duplicate
ethan-yz-hao Oct 6, 2024
bba19d2
init - batched
ethan-yz-hao Oct 6, 2024
9138971
group by eval_term within each batch
ethan-yz-hao Oct 6, 2024
4f51075
fix showing next item
ethan-yz-hao Oct 6, 2024
e1d3999
group all pdfs together
ethan-yz-hao Oct 6, 2024
989c78d
use box instead of fill to accommodate more overlap
ethan-yz-hao Oct 7, 2024
028a451
put result in sidebar / styling
ethan-yz-hao Oct 7, 2024
f36307c
formatting
ethan-yz-hao Oct 7, 2024
c08e4a1
batch
ethan-yz-hao Oct 7, 2024
d9ddc93
update
nanjiangwill Oct 10, 2024
c943769
Merge remote-tracking branch 'origin/dev' into ethan
ethan-yz-hao Oct 10, 2024
77d746e
update
nanjiangwill Oct 10, 2024
7b6fd22
init
ethan-yz-hao Oct 11, 2024
8f4b97d
init
ethan-yz-hao Oct 11, 2024
c7516eb
test
ethan-yz-hao Oct 11, 2024
862dda6
handle empty db
ethan-yz-hao Oct 11, 2024
59da36b
bugfix in get edit page
ethan-yz-hao Oct 11, 2024
5798d1d
bugfix in using pdf_data
ethan-yz-hao Oct 11, 2024
6f11986
add pre-computed list: sorted_all_results_with_search_batched
ethan-yz-hao Oct 11, 2024
0204231
remove sidebar
nanjiangwill Oct 31, 2024
fa06487
remove sidebar and batch write to firestore, make entire pipeline faster
nanjiangwill Oct 31, 2024
fe179f0
update checkbox logic
nanjiangwill Nov 1, 2024
601992b
update checkbox logic
nanjiangwill Nov 1, 2024
ea3887f
update checkbox logic
nanjiangwill Nov 1, 2024
dcf0f14
update checkbox logic
nanjiangwill Nov 1, 2024
8249c08
update checkbox logic
nanjiangwill Nov 1, 2024
8083acf
update viz
nanjiangwill Nov 4, 2024
fd775f4
update viz
nanjiangwill Nov 4, 2024
ad404e9
update viz
nanjiangwill Nov 6, 2024
54bd99c
update viz
nanjiangwill Nov 6, 2024
e8880bb
update viz
nanjiangwill Nov 6, 2024
150c7cd
update viz
nanjiangwill Nov 6, 2024
1c0e583
update viz
nanjiangwill Nov 6, 2024
d071ac4
update viz
nanjiangwill Nov 6, 2024
b687348
update viz
nanjiangwill Nov 6, 2024
7a3777d
update viz
nanjiangwill Nov 6, 2024
5f091c4
update viz
nanjiangwill Nov 6, 2024
14f7cba
update viz
nanjiangwill Nov 7, 2024
1c292c1
update viz
nanjiangwill Nov 7, 2024
3 changes: 3 additions & 0 deletions .env.example
@@ -6,3 +6,6 @@ export AWS_SECRET_ACCESS_KEY=
export OPENAI_API_KEY=
export ANTHROPIC_API_KEY=
export WANDB_API_KEY=

export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
15 changes: 14 additions & 1 deletion .gitignore
@@ -1,19 +1,32 @@
.diskcache
.env
.idea
.mypy_cache
.streamlit
.venv
.vscode
Procfile
data/*/eval
data/*/format_ocr
data/*/index
data/*/llm
data/*/normalization
data/*/ocr
data/*/original_ocr
data/*/pdfs
data/*/prompt
data/*/search
data/.DS_Store
hydra_outputs
key.json
old_eval.py
results/
old_results
results/*/*/*.json
results/*/district_extraction
results/*/district_extraction_verification
results/*/page_embedding
runtime.txt
setup.sh
wandb/
zoning.egg-info
zoning.egg-info/
4 changes: 3 additions & 1 deletion .pre-commit-config.yaml
@@ -1,3 +1,5 @@
exclude: '^results/'

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: 'v4.6.0'
@@ -18,7 +20,7 @@ repos:
- id: black
- id: black-jupyter
- repo: https://github.com/pre-commit/mirrors-mypy
rev: 'v1.10.0' # Use the sha / tag you want to point at
rev: 'v1.11.0' # Use the sha / tag you want to point at
hooks:
- id: mypy
# args: ['--explicit-package-bases']
1 change: 1 addition & 0 deletions Procfile
@@ -0,0 +1 @@
web: sh setup.sh && streamlit run viz/viz_user_mode_batch.py
25 changes: 20 additions & 5 deletions config/base.yaml
@@ -6,7 +6,7 @@ global_config:
experiment_dir: results/${global_config.experiment_name} # helper variable, just used to parse

target_state: connecticut
eval_terms: ["min_lot_size", "min_unit_size", "max_height"] # all available eval terms ['floor_to_area_ratio', 'max_height', 'max_lot_coverage', 'max_lot_coverage_pavement', 'min_lot_size', 'min_parking_spaces', 'min_unit_size']
eval_terms: ["min_lot_size", "min_unit_size", "max_height"] # all available eval terms ['floor_to_area_ratio', 'max_height', 'max_lot_coverage', 'max_lot_coverage_pavement', 'min_lot_size', 'min_parking_spaces', 'min_unit_size', 'units_per_acre']

data_dir: data/${global_config.target_state}
target_town_file: ${global_config.data_dir}/target_towns_names.json
@@ -18,9 +18,12 @@ global_config:
result_output_dir: results/${global_config.target_state}/${global_config.experiment_name} # helper variable, just used to parse

pdf_dir: ${global_config.data_dir}/pdfs # normally we dont redo pdf collection, we just save them in data
ocr_dir: ${global_config.data_dir}/ocr # normally we dont redo ocr collection, we just save them in data, not in experiment results
ocr_dir: ${global_config.data_dir}/original_ocr # normally we dont redo ocr collection, we just save them in data, not in experiment results

format_ocr_dir: ${global_config.experiment_dir}/format_ocr
page_embedding_dir: ${global_config.experiment_dir}/page_embedding
district_extraction_dir: ${global_config.experiment_dir}/district_extraction
district_extraction_verification_dir: ${global_config.experiment_dir}/district_extraction_verification
index_dir: ${global_config.experiment_dir}/index
search_dir: ${global_config.experiment_dir}/search
prompt_dir: ${global_config.experiment_dir}/prompt
@@ -40,13 +43,25 @@ global_config:
ocr_config:
method: textract
run_ocr: false
input_document_s3_bucket:
pdf_name_prefix_in_s3_bucket: zoning/${global_config.target_state}/
textract_region_name: us-east-2
input_document_s3_bucket: zoning-nan
pdf_name_prefix_in_s3_bucket: ${global_config.target_state}
feature_types: ["TABLES"] # allowed ["TABLES", "FORMS", "QUERIES", "SIGNATURES", "LAYOUT"]

format_ocr_config:
temp: x

district_extraction_config:
run_district_extraction: false
embedding_model: text-embedding-3-small
llm_model: ${llm_config.llm_name}
templates_dir: ${prompt_config.templates_dir}
system_prompt_file: district_extraction_system
user_prompt_file: district_extraction_user
verification_es_endpoint: ${global_config.es_endpoint}
target_districts_file: ${global_config.target_district_file}
district_page_mapping_file: ${global_config.data_dir}/district_page_mapping.json

index_config:
method: keyword # allowed keyword/embedding
index_key: town
@@ -70,7 +85,7 @@ prompt_config:

llm_config:
llm_name: gpt-4-1106-preview
max_tokens: 256
max_tokens: 512
formatted_response: false
cache_dir: .diskcache
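
The `${...}` references in `config/base.yaml` chain through one another, so the `ocr_dir` change above only takes effect after `data_dir` and `target_state` are resolved. The sketch below is a toy, standard-library resolver that mimics that dotted-path interpolation to show what the new `ocr_dir` expands to. It is not the resolver the project uses, and the names `lookup` and `resolve` are hypothetical.

```python
import re

# Matches ${some.dotted.path} references inside a string value.
_REF = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg: dict, dotted: str):
    """Walk a nested dict using a dotted key such as 'global_config.data_dir'."""
    node = cfg
    for part in dotted.split("."):
        node = node[part]
    return node


def resolve(cfg: dict, value, depth: int = 0):
    """Recursively expand ${...} references in a string value (toy version)."""
    if depth > 20:
        raise ValueError("interpolation is nested too deeply or is circular")
    if not isinstance(value, str):
        return value
    return _REF.sub(
        lambda m: str(resolve(cfg, lookup(cfg, m.group(1)), depth + 1)), value
    )


# Values copied from the diff above; only a subset of the keys is shown.
cfg = {
    "global_config": {
        "target_state": "connecticut",
        "data_dir": "data/${global_config.target_state}",
        "ocr_dir": "${global_config.data_dir}/original_ocr",
    }
}

print(resolve(cfg, cfg["global_config"]["ocr_dir"]))  # data/connecticut/original_ocr
```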

7 changes: 7 additions & 0 deletions config/templates/district_extraction_system.pmpt.tpl
@@ -0,0 +1,7 @@
You are an expert information extraction system. You are given a
passage that shows the zoning districts of a town and their
abbreviations. Your job is to list the zoning districts and their
abbreviations. Only output districts that have abbreviations.
Please output the answer only with JSON (no text) in the format:

[{"T": "district type", "Z": "district abbreviation with number"}].
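
The system prompt asks for a JSON list of objects with keys `"T"` and `"Z"`, but a model reply can still be malformed. A minimal sketch of a defensive parser follows, assuming the raw reply arrives as a string. `parse_districts` is a hypothetical helper, not a function from this repository.

```python
import json


def parse_districts(raw: str) -> list[dict[str, str]]:
    """Parse a reply shaped like [{"T": ..., "Z": ...}], dropping bad entries."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # reply was not valid JSON at all
    if not isinstance(data, list):
        return []  # the prompt asks for a top-level list
    districts = []
    for item in data:
        # Keep only objects whose "T" and "Z" fields are both strings.
        if (
            isinstance(item, dict)
            and isinstance(item.get("T"), str)
            and isinstance(item.get("Z"), str)
        ):
            districts.append({"T": item["T"].strip(), "Z": item["Z"].strip()})
    return districts
```

Returning an empty list on any parse failure matches the first few-shot example, where a passage without districts maps to `[]`. A caller that needs to distinguish "no districts" from "bad reply" would want a different return type.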
68 changes: 68 additions & 0 deletions config/templates/district_extraction_user.pmpt.tpl
@@ -0,0 +1,68 @@
Passage:

Some text about buildings

Output:

[]

Passage:

* Residential (R) districts

CELL
Residential
CELL
R-10
CELL
R-20
CELL

Output:

[{"T": "Residential", "Z": "R-10"}, {"T": "Residential", "Z": "R-20"}]

Passage:

* Business (C) districts:

(C19) Commercial 19
(C29) Commercial 29

Output:

[{"T": "Commercial 19", "Z": "C19"}, {"T": "Commercial 29", "Z": "C29"}]

Passage:

CELL
Residential Districts
CELL
R-5 District
R-10 District
R-20 District

Output:

[{"T": "R-5 Residential", "Z": "R-5"}, {"T": "R-10 Residential", "Z": "R-10"}, {"T": "R-20 Residential", "Z": "R-20"}]

Passage:

Residence AAA District
Residence B District
Historic Design District (HDD)

Output:

[{"T": "Residence AAA", "Z": "AAA"}, {"T": "Residence B", "Z": "B"}, {"T": "Historic Design", "Z": "HDD"}]

Passage:

{% macro showdocs(docs) -%}
{% for doc in docs %}
* {{doc}}
{% endfor %}
{% endmacro %}
{{showdocs(docs) | truncate(1200*4)}}

Output:
33 changes: 20 additions & 13 deletions config/templates/few_shot.pmpt.tpl
@@ -1,25 +1,32 @@
# Instructions

You are an expert architectural lawyer. You are looking for facts inside a
document about a Zoning District with the name "{{zone_name}}" and with an
abbreviated name "{{zone_abbreviation}}".
You are an expert architectural lawyer tasked with extracting specific zoning information from a
document. Your goal is to find facts about a particular Zoning District with the name "{{zone_name}}" and with an
abbreviated name "{{zone_abbreviation}}".

You are looking to find the value for "{{term}}", which also goes by the
You are looking to find the value for "{{term}}", which may also be referred to by the
following other names: {{synonyms}}. Only output values that are seen in the
input and do not guess! Output MUST be valid JSON, and should follow the schema
detailed below. Ensure that the field "extracted_text" does not span multiple
lines and that it is a real substring of the input. You CANNOT make up a value
for "extracted_text", and it MUST be a substring! "extracted_text" will be used
in the python statement `extracted_text in input` and if that returns False, the
universe will be destroyed! If you cannot extract reasonable text, then you
should not return an answer. For {{term}} in residential districts, we are only
interested in the answer as it pertains to single-family homes.
detailed below. Ensure that, in the field "extracted_text", the first element of
the inner list does not span multiple lines and that it is a real substring of the input.
You CANNOT make up a value for "extracted_text", and it MUST be a substring!
"extracted_text" will be used in the python statement `extracted_text in input`
and if that returns False, the universe will be destroyed! If you cannot extract
reasonable text, then you should not return an answer. If {{zone_name}}
({{zone_abbreviation}}) is referring to a general residential district,
we are only interested in the requirement of {{term}} for single-family homes.
However, if it is referring to a specific district, like Multi Family Residential (MFR),
General Commercial (GC), etc., we are still interested in the requirement of {{term}}
for {{zone_name}} ({{zone_abbreviation}}). Remember, the text given to you is an
excerpt from a larger document, which means you might find an answer that is
not for the zone "{{zone_name}} ({{zone_abbreviation}})" but for other zones.
Double-check your answer to ensure it corresponds to the correct zone district "{{zone_name}}".

# Schema
{
"extracted_text": list[str], // The verbatim text from which the result was extracted. ONLY USE VALUES EXTRACTED DIRECTLY FROM THE TEXT. Make sure to include "\n" and any type of special characters.
"extracted_text": List[List[str, int]], // A list of lists. Each inner list must contain exactly two elements: The first element is a string representing the verbatim text from which the result was extracted. ONLY USE VALUES EXTRACTED DIRECTLY FROM THE TEXT. Make sure to include \n and any special characters and DO NOT span multiple lines. The second element is an integer representing the page where the verbatim text is found. Multiple extracted texts from different pages may correspond to the answer, so the extracted_text field should always be a list of lists, even if only one inner list is present."
"rationale": str, // A string containing a natural language explanation for the following answer
"answer": str // The value of {{term}} extracted from the text. Answer must include units and must be normalized, e.g. (sqr. ft. becomes sq ft)
"answer": str // A string representing the value of {{term}} extracted from the text. Answer must include units and must be normalized, e.g. (sqr. ft. becomes sq ft)
}

{% include term + "_examples.pmpt.tpl" %}
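
The revised schema makes `extracted_text` a list of `[text, page]` pairs, and the prompt says each text will be tested with `extracted_text in input`. Below is a minimal sketch of that check, assuming the page text is available as a dict keyed by page number. `check_response` and the `pages` argument are hypothetical names, not part of the pipeline.

```python
import json
from typing import Any


def check_response(raw: str, pages: dict[int, str]) -> list[str]:
    """Return a list of problems found in a reply; an empty list means it passed."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems: list[str] = []
    for key in ("rationale", "answer"):
        if not isinstance(data.get(key), str):
            problems.append(f"{key} is missing or not a string")

    spans = data.get("extracted_text")
    if not isinstance(spans, list):
        return problems + ["extracted_text is not a list"]

    for i, span in enumerate(spans):
        # Each inner list must be exactly [verbatim text, page number].
        if not (
            isinstance(span, list)
            and len(span) == 2
            and isinstance(span[0], str)
            and isinstance(span[1], int)
        ):
            problems.append(f"extracted_text[{i}] is not a [str, int] pair")
            continue
        text, page = span
        # The substring test the prompt describes, applied to the cited page.
        if not text or text not in pages.get(page, ""):
            problems.append(f"extracted_text[{i}] not found on page {page}")
    return problems
```

Checking the substring against the cited page rather than the whole input is a design choice made here for illustration. It also catches replies that quote real text but attribute it to the wrong page.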