
Commit 497d59b

evaluation of track1
1 parent 465db53 commit 497d59b


56 files changed: +3883 -3 lines

.gitignore

+3 -1

```diff
@@ -2,13 +2,15 @@
 __pycache__/
 *.py[cod]
 *$py.class
-
+*.zip
 # C extensions
 *.so
 
 # Distribution / packaging
 .Python
 build/
+results/
+models/
 develop-eggs/
 dist/
 downloads/
```

README.md

+87 -2

The previous two-line README ("# mesinesp2 / Use of Probabilistic Topic Models to Medical Semantic Indexing in Spanish") is replaced with:

# Supervised Graph-based Topic Model for Medical Semantic Indexing in Spanish

[![GitHub Issues](https://img.shields.io/github/issues/librairy/mesinesp2.svg)](https://github.com/librairy/mesinesp2/issues)
[![License](https://img.shields.io/badge/license-Apache2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Data-DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4701973.svg)](https://doi.org/10.5281/zenodo.4701973)

## Task

The [MESINESP2](https://temu.bsc.es/mesinesp2/) task of the [9th BioASQ Workshop](http://www.bioasq.org/workshop2021) aims to create an automatic semantic indexing system for Spanish medical documents based on structured vocabularies. In particular, texts have to be annotated with [DeCS headings](https://temu.bsc.es/mesinesp2/decs-headings/). These *health sciences descriptors* form a trilingual, structured vocabulary created by BIREME to serve as a single common language for indexing articles from scientific journals, books, conference proceedings, technical reports and other materials, and for searching and retrieving subjects in the scientific literature available on the Virtual Health Library (VHL), including sources such as LILACS and MEDLINE.

Three types of [documents](https://temu.bsc.es/mesinesp2/datasets/) are proposed for the task: [scientific literature](https://temu.bsc.es/mesinesp2/sub-track-1-scientific-literature/), [clinical trials](https://temu.bsc.es/mesinesp2/sub-track-2-clinical-trials/) and [patents](https://temu.bsc.es/mesinesp2/sub-track-3-patents/). They are **not long texts** and are usually assigned **several categories**. On average, scientific articles contain 1,332 characters and 10 categories, clinical trials contain 7,283 characters and 15 categories, and patents contain 1,640 characters and 10 categories.

## Proposal

We propose a probabilistic topic-based representation of DeCS categories built from previously annotated texts. Each category is described by a probability distribution over the vocabulary of the training texts. The resulting topic model allows inferring the presence of DeCS categories in texts not seen during training, as sketched below.
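
A minimal sketch of that inference idea, not the project's actual implementation: each DeCS code is reduced to a word distribution, and a new text is assigned the codes under which its bag of words is most likely. All codes, words and probabilities below are illustrative.

```python
import math
from collections import Counter

# Toy category-to-word distributions; real ones would be learned from the
# annotated training texts. Values here are illustrative only.
topics = {
    "D001714": {"bipolar": 0.05, "maniaco": 0.03, "episodio": 0.02},
    "D011570": {"prolactina": 0.04, "tratamiento": 0.02, "paciente": 0.01},
}
EPSILON = 1e-6  # smoothing for words a category has never seen

def score(tokens, word_dist):
    """Log-likelihood of a bag of words under one category's word distribution."""
    counts = Counter(tokens)
    return sum(n * math.log(word_dist.get(w, EPSILON)) for w, n in counts.items())

def infer_categories(tokens, topics, top_n=2):
    """Return the top_n DeCS codes whose distributions best explain the text."""
    ranked = sorted(topics, key=lambda c: score(tokens, topics[c]), reverse=True)
    return ranked[:top_n]

print(infer_categories(["episodio", "maniaco", "bipolar", "paciente"], topics))
```
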
## Challenges

The characteristics of the documents proposed for the task and the assumptions of probabilistic topic models lead to several challenges:
1. Since texts are short, word frequency may not be an adequate measure of relevance, yet topic models are built on bags of words (i.e., word order does not matter, but word repetition does).
2. Short-text-oriented topic models assume a single topic per text, whereas the documents proposed for the task may have more than one category.
3. Topic creation must be supervised to force each topic to map to a DeCS category, since annotations must match the DeCS headings.
4. Topic inference should return only the most relevant topics, i.e. one or several, since each text may have several categories.

## Corpora

A [Solr index](http://librairy.linkeddata.es/data/#/mesinesp/core-overview) has been created to process and annotate the texts proposed for the task. The structure of the documents is as follows (a minimal query sketch follows the example document below):
* **id**: unique identifier
* **title_s**: document name (*string*)
* **abstract_t**: text paragraph (*terms*)
* **journal_s**: publication journal (*string*)
* **size_i**: number of characters (*integer*)
* **year_i**: publication year (*integer*)
* **db_s**: document database (*string*)
* **codes**: list of DeCS categories (*list-of-string*)
* **scope_s**: training, development or test (*string*)
* **diseases**: list of diseases retrieved from the abstract (*list-of-string*)
* **medications**: list of medications retrieved from the abstract (*list-of-string*)
* **procedures**: list of procedures retrieved from the abstract (*list-of-string*)
* **symptoms**: list of symptoms retrieved from the abstract (*list-of-string*)
* **sentences**: list of lists of words after pre-processing the abstract (*list-of-string*)
* **tokens_t**: base text for creating bags of words (*terms*)

This is an example document:

````json
{
  "id": "ibc-ET1-3794",
  "title_s": "Caso clínico: Manejo clínico de la hiperprolactinemia secundaria al tratamiento de un episodio maníaco con características psicóticas y mixtas en una paciente con un inicio posparto de trastorno bipolar tipo I",
  "abstract_t": "Se presenta el caso de una paciente que ingresa por un primer episodio maníaco con sintomatología psicótica y mixta. El tratamiento inicial instaurado permitió un control parcial de los síntomas agudos y ocasionó una intensa elevación de los niveles séricos de prolactina. Ante esta situación, se planteó una solución terapéutica basada en la evidencia",
  "journal_s": "Psiquiatr. biol. (Internet)",
  "size_i": 352,
  "year_i": 2015,
  "db_s": "IBECS",
  "codes": ["D006966", "D001714", "D005260", "D000068105", "D011388",
            "D006801", "D011570", "D011618", "D049590"],
  "scope_s": "Development",
  "diseases": ["maníaco_con_sintomatología_psicótica"],
  "medications": ["prolactina"],
  "sentences": ["presentar", "casar", "paciente", "...."],
  "tokens_t": " ingresar ingresar ..",
  "_version_": 1699080371235192832
}
````
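
As a usage sketch, the index can be queried by any of these fields with pysolr, using the same client and search calls as the scripts in `code/`; the query below is an illustrative combination, assuming the public core is reachable.

```python
import pysolr

# Public core used throughout the code/ scripts
solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', timeout=50)

# Development documents that already have DeCS codes assigned
hits = solr.search(q="scope_s:Development AND codes:[* TO *]", rows=5, sort="id asc")
for doc in hits.docs:
    print(doc["id"], doc.get("codes", []))
```
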
## Algorithms

*more details coming soon*.

## Results

Our models are publicly available as Web REST services through [Docker](https://www.docker.com/) images. A service can be started with `docker run -p 8080:7777 <model-as-a-service name>`, and a Swagger-based interface is then available at [http://localhost:8080](http://localhost:8080); a minimal client sketch follows.
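
A client sketch, assuming a container started as above; the `/settings` and `/topics` paths are the ones `code/evaluation.py` queries, but each model's response fields may vary.

```python
import requests

base = "http://localhost:8080"

settings = requests.get(base + "/settings").json()  # model metadata and stats
topics = requests.get(base + "/topics").json()      # one topic per DeCS category

print("vocabulary size:", settings.get("stats", {}).get("vocabulary"))
print("number of topics:", len(topics))
```
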
| Algorithm  | Reference            | Bag-of-Words                    | Model-as-a-Service                   | Precision | Recall | F-Measure |
| ---------- | :------------------: | :-----------------------------: | :----------------------------------: | :-------: | :----: | :-------: |
| LabeledLDA | Ramage et al. (2009) | Frequency                       | librairy/llda-mesinesp:latest\*      | TBD       | TBD    | TBD       |
| TR-LLDA    | novel                | TextRank + linear normalization | librairy/tr-llda-mesinesp:latest\*   | TBD       | TBD    | TBD       |
| TR?-LLDA   | novel                | TextRank + ? normalization      | librairy/tr?-llda-mesinesp:latest\*  | TBD       | TBD    | TBD       |
| R-LLDA     | novel                | Rake + linear normalization     | librairy/r-llda-mesinesp:latest\*    | TBD       | TBD    | TBD       |
| R?-LLDA    | novel                | Rake + ? normalization          | librairy/r?-llda-mesinesp:latest\*   | TBD       | TBD    | TBD       |

\* *not available yet*
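
One reading of the "Bag-of-Words" column, sketched under stated assumptions: the TR-/R- variants replace raw term frequencies with keyword scores (from TextRank or Rake) linearly rescaled to integer pseudo-counts. Both the scores and the rescaling below are illustrative, not the released implementation.

```python
def linear_bow(scores, max_count=10):
    """Linearly rescale keyword scores to integer pseudo-counts in 1..max_count."""
    top = max(scores.values())
    return {w: max(1, round(max_count * s / top)) for w, s in scores.items()}

# Hypothetical TextRank scores for one abstract
scores = {"prolactina": 0.91, "episodio": 0.55, "paciente": 0.23}
print(linear_bow(scores))  # {'prolactina': 10, 'episodio': 6, 'paciente': 3}
```
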

code/add_track.py

+45
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri May 14 10:23:00 2021

@author: cbadenes
"""
import worker.annotator as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Cursor-based pagination over the whole index, window_size docs at a time
            articles = solr.search(q="*:*", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            # Annotate the window in parallel and write the results back to Solr
            results = pool.map(workers.add_track, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/annotate_desc_levels.py

+53
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 10 18:18:23 2021

@author: cbadenes
"""
import worker.categorize as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Only documents that already have DeCS codes assigned
            articles = solr.search(q="codes:[* TO *]", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.categorize, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/create_bows.py

+61
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue May 11 10:14:00 2021

@author: cbadenes
"""
import model.workers as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Only documents whose pre-processed tokens are already available
            articles = solr.search(q="tokens_t:[* TO *]", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.create_bow, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/evaluation.py

+113
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 10 11:32:27 2021

@author: cbadenes
"""

import worker.eval as workers
import pysolr
import multiprocessing as mp
import time
import json
import requests


def get_content(url):
    """GET a JSON resource, returning an empty dict on any error."""
    content = {}
    try:
        response = requests.get(url)
        content = response.json()
    except Exception:
        print("Error getting from url: ", url)
    return content


if __name__ == '__main__':

    # One model-as-a-service instance runs per DeCS level on ports 8000-8014
    categories_as_topics = []
    f = open('../results/topic-models.jsonl', mode='w')
    for port in range(8000, 8015):
        level = port % 1000
        print("getting settings from", port, "..")
        settings = get_content("http://localhost:" + str(port) + "/settings")
        if 'stats' in settings:
            print("getting topics from", port, "..")
            topics = get_content("http://localhost:" + str(port) + "/topics")
            for topic in topics:
                topic['level'] = level
                categories_as_topics.append(topic)
            num_topics = len(topics)
            stats = settings['stats']
            row = {'level': level, 'docs': int(stats['corpus']), 'topics': num_topics,
                   'vocabulary': int(stats['vocabulary']), 'loglikelihood': float(stats['loglikelihood'])}
            f.write(json.dumps(row))
            f.write("\n")
            print(row)
    f.close()

    print("writing categories as topics...")
    f = open('../results/categories_as_topics.jsonl', mode='w')
    for topic_description in categories_as_topics:
        row = {'category': topic_description['name'], 'level': str(topic_description['level']),
               'topic': str(topic_description['id']), 'words': topic_description['description']}
        f.write(json.dumps(row))
        f.write("\n")
        print(row)
    f.close()

    print("reading development documents..")
    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    # Evaluate every development document and write one row per (doc, strategy)
    f = open('../results/dev-results.jsonl', mode='w')
    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(6)

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    while True:
        old_counter = counter
        solr_query = "scope_s:Development"
        try:
            t = time.time()
            articles = solr.search(q=solr_query, rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.evaluate, articles.docs)
            for result in results:
                doc = result['article_id']
                doc_results = result['results']
                for strategy in doc_results.keys():
                    eval_result = doc_results[strategy]
                    row = {'doc': doc, 'strategy': strategy,
                           'tp': eval_result['tp'], 'fp': eval_result['fp'], 'fn': eval_result['fn'],
                           'precision': eval_result['precision'], 'recall': eval_result['recall'],
                           'fmeasure': eval_result['fmeasure'],
                           'ref-labels': eval_result['ref-labels'], 'inf-labels': eval_result['inf-labels']}
                    f.write(json.dumps(row))
                    f.write("\n")
            counter += len(results)
            print(counter, "docs evaluated")
            print('Time to evaluate docs: {} mins'.format(round((time.time() - t) / 60, 2)))
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    f.close()
```
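
The per-document rows written to `dev-results.jsonl` can then be rolled up per strategy. This micro-averaging sketch is one reasonable aggregation over the `tp`/`fp`/`fn` fields above, not necessarily the task's official metric.

```python
import json
from collections import defaultdict

# Sum true/false positives and false negatives per strategy
totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
with open("../results/dev-results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        for k in ("tp", "fp", "fn"):
            totals[row["strategy"]][k] += row[k]

# Micro-averaged precision, recall and F-measure per strategy
for strategy, c in sorted(totals.items()):
    p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print(f"{strategy}: precision={p:.3f} recall={r:.3f} f-measure={f1:.3f}")
```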
