USPTO-LLM is an information-enriched chemical reaction dataset that provides more side information (reaction conditions and reaction steps division) for developing new reaction prediction and retrosynthesis methods and inspires new problems, such as reaction condition prediction.
The dataset comprises over 247K chemical reactions extracted from the patent documents of USPTO (United States Patent and Trademark Office), encompassing abundant information on reaction conditions.
We employ large language models to expedite the data collection procedures automatically with a reliable quality control process. The extracted chemical reactions are organized as heterogeneous directed graphs, allowing us to formulate a series of prediction tasks, such as reaction prediction, retrosynthesis, and reaction condition prediction, in a unified graph-filling framework.
The chemical reactions are organized as heterogeneous directed graphs along with their reaction condition information, such as:
id | class | rs>>ps | solvents | catalyst | temperature | time |
---|---|---|---|---|---|---|
20160114-US20160007601A1-0113 | -1 | [Cl-].C(C)N=C=NCCC[NH+](C)C.C1(=CC=CC=C1)CCC(=O)O.CC(\C=C/C)O>>C1(=CC=CC=C1)C#CC(=O)OC(C)\C=C/C |
['C(Cl)Cl '] |
['CN(C)C=1C=CN=CC1' ] |
['room temperature' ] |
['25200' ] |
This repository contains the code for generating the USPTO-LLM dataset.
In 'USPTO-LLM/Code', there are several python files and jupyter notebooks.
-
api_request_parallel_processor.py
is used to call the API to generate the heterogeneous directed graph of USPTO-LLM. Our default API is gpt-4-1106-preview. -
generate_request.py
is used to generate the json fileuspto_request.json
for each API submission based on the reaction data inuspto_full.json
and theFIXED_PROMPT
. -
extract_api_outcome.ipynb
is used to extract the reaction data from theresult_uspto_request.json
generated by the API, including the "pattern repair" technique mentioned in the paper. -
prompt_to_json.py
is used to convert the written prompt into json format and store it in the string variableFIXED_PROMPT
.
The running order is: prompt_to_json.py
→ generate_request.py
→ api_request_parallel_processor.py
→ extract_api_outcome.ipynb
Prompt is used to guide the large language model to generate the standardized heterogeneous directed graph of USPTO-LLM. The prompt is stored in the Prompt
folder, including the .txt
and .json
versions of the prompt.
In the Supplement
folder, annotation_process.pdf
provides the detailed annotation process of the USPTO-LLM dataset, including the detailed design of prompt, the process of the pattern repair technique, and the mechanism of our two-round generation strategy.
The following table shows the performance of different large language models used in this work to generate heterogeneous directed graphs in the USPTO-LLM dataset.
LLM Type | gpt-4-0613 | gpt-4-1106-preview | gpt-4-0125-preview | gpt-3.5-turbo-0125 |
---|---|---|---|---|
Dataset | USPTO_full | USPTO_full | USPTO_full | |
Reactions per Request |
3 | 3 | 3 | |
1st-round | API calls | 1030 | 2000 | 2000 |
Success Rate (Before Pattern Repair) |
0.67 | 0.65 | 0.66 | |
Success Rate (After Pattern Repair) |
0.88 | 0.84 | 0.86 | |
Valid HeteroGraphs per API call (Before Pattern Repair) |
2.00 | 1.94 | 1.98 | |
Valid HeteroGraphs per API call (After Pattern Repair) |
2.64 | 2.53 | 2.57 | |
2nd-round | API calls | 434 | 233 | 285 |
Success Rate (Before Pattern Repair) |
0.26 | 0.20 | 0.24 | |
Success Rate (After Pattern Repair) |
0.51 | 0.31 | 0.32 | |
Valid HeteroGraphs per API call (Before Pattern Repair) |
0.78 | 0.59 | 0.71 | |
Valid HeteroGraphs per API call (After Pattern Repair) |
1.54 | 0.92 | 0.95 | |
Total | API calls | 1464 | 2233 | 2285 |
Success Rate (Before Pattern Repair) |
0.78 | 0.67 | 0.70 | |
Success Rate (After Pattern Repair) |
0.88 | 0.90 | 0.90 | |
Valid HeteroGraphs per API call (Before Pattern Repair) |
1.64 | 1.80 | 1.83 | |
Valid HeteroGraphs per API call (After Pattern Repair) |
1.86 | 2.42 | 2.37 |