MatWISE, a multi-tiered framework designed to revolutionize the process of material discovery. Our mission is to help researchers and scientists navigate the vast ocean of scientific literature with ease and efficiency.
MatWISE is a tool that leverages the power of state-of-the-art natural language processing, semantic relationship mapping, and machine learning to automate the complex process of material discovery.
Our objective is to assist chemists in simplifying the process of extracting valuable chemical entities from an overwhelming amount of scientific literature.
-
Step1. Named Entity Recognition: The first step in the MatWISE involves the use of advanced Named Entity Recognition models. We utilize models such as
Stanza
to recognize and categorize entities in our large body of scientific literature. This allows us to extract material entities from the text and use it in the subsequent steps of the process. -
Step2. Semantic Relationship Mapping: In the second step, we employ the
PubChemPy
tool to combine disparate names that reference the same chemical entity. This process involves documenting metadata such as the corresponding IUPAC (International Union of Pure and Applied Chemistry) nomenclature and Compound ID (CID) designation. This ensures that all information related to a specific chemical entity is unified and easily accessible. -
Step3. Word Embedding & Caculate similarity: In the third step, we leverage the
BioGPT-Large
model to provide valuable word embeddings. Then computing similarities to capture semantic relationships within a high-dimensional vector space. By doing so, we can understand and analyze the relationships between different chemical entities in a more nuanced and detailed manner. -
Step4. Automated chemical entity screening powere by GPT-4: In the fourth step, we utilize a tool-enhanced
GPT-4
model, which operates on theLangChain
framework and incorporates WebSearch tool and PubMedQuery tool. This advanced model allows us to automatically exclude materials with undesired physicochemical properties. By providing feedback on the physical and chemical characteristics of the materials, we can successfully filter out those that do not conform to our desired criteria. This ensures that only the most suitable materials are selected for further analysis and potential use.
Note: We recommend deploying the MatWISE framework on a Linux-based system for optimal performance and compatibility. Detailed instructions will be provided in the subsequent sections.
If you are using MatWISE for the first time, you will need to download the necessary models for the NER.To download the English model with the 'BioNLP13CG' NER processor, run the following Python code:
stanza.download('en',package='mimic', processors={'ner': 'BioNLP13CG'})
Then you can run:
Named_entity_Recognition.py
Run Chemical_Entity_IUPAC_Unifying.py
to get IUPAC name and CID.
Run BioGPT.py
to get word embeddings and similarities.
You need to apply for a GPT-4 API account . Based on our tests, the performance of GPT-4 is far superior to GPT-3.5.
Note: We use GPT 4.0 in our experiment. GPT-3.5 model still support in this code,but we don't recommend using GPT-3.5
To quickly run the code, follow the steps below:
-
Clone the repository to your local machine:
git clone https://github.com/Junhang0202/chemical_gpt.git
-
Import your OPENAI_API_KEY and SERP_API_KEY:
vim ~/.bashrc #Openai_api_key export OPENAI_API_KEY="sk-xxxxx" #SERP_API export SERP_API_KEY="your own SERP_API_KEYxxxx"
-
Run Automated_chemical_screening.py:
cd xxx\chemical_material_gpt nohup python Automated_chemical_screening.py >finish_material_screening.txt 2>&1 &echo $! > chemicalmaterial_output_id.txt
You should now be able to run the code successfully. If you encounter any issues, please refer to open an issue on the repository for further assistance.