
Commit 497d59b

evaluation of track1
1 parent 465db53 commit 497d59b


56 files changed: +3883 -3 lines

.gitignore

+3 -1

```diff
@@ -2,13 +2,15 @@
 __pycache__/
 *.py[cod]
 *$py.class
-
+*.zip
 # C extensions
 *.so
 
 # Distribution / packaging
 .Python
 build/
+results/
+models/
 develop-eggs/
 dist/
 downloads/
```

README.md

+87 -2

The previous two-line README ("# mesinesp2 / Use of Probabilistic Topic Models to Medical Semantic Indexing in Spanish") is replaced with:

# Supervised Graph-based Topic Model for Medical Semantic Indexing in Spanish

[![GitHub Issues](https://img.shields.io/github/issues/librairy/mesinesp2.svg)](https://github.com/librairy/mesinesp2/issues)
[![License](https://img.shields.io/badge/license-Apache2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Data-DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4701973.svg)](https://doi.org/10.5281/zenodo.4701973)

## Task

The [MESINESP2](https://temu.bsc.es/mesinesp2/) task of the [9th BioASQ Workshop](http://www.bioasq.org/workshop2021) aims to create an automatic semantic indexing system for Spanish medical documents based on structured vocabularies. In particular, texts have to be annotated with [DeCS headings](https://temu.bsc.es/mesinesp2/decs-headings/). These *health sciences descriptors* form a trilingual, structured vocabulary created by BIREME to serve as a single common language for indexing articles from scientific journals, books, conference proceedings, technical reports and other materials, and for searching and retrieving subjects in the scientific literature available on the Virtual Health Library (VHL), including sources such as LILACS and MEDLINE.

Three types of [documents](https://temu.bsc.es/mesinesp2/datasets/) are proposed for the task: [scientific literature](https://temu.bsc.es/mesinesp2/sub-track-1-scientific-literature/), [clinical trials](https://temu.bsc.es/mesinesp2/sub-track-2-clinical-trials/) and [patents](https://temu.bsc.es/mesinesp2/sub-track-3-patents/). They are **not long texts** and are usually assigned **several categories**. On average, scientific articles contain 1,332 characters and 10 categories, clinical trials contain 7,283 characters and 15 categories, and patents contain 1,640 characters and 10 categories.

## Proposal

We propose a probabilistic topic-based representation of DeCS categories built from previously annotated texts. Each category is described by a probability distribution over the vocabulary of the training texts. The resulting topic model allows inferring the presence of DeCS categories in texts not seen during training, as sketched below.
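
A minimal sketch of that inference idea, not the project's actual implementation: each DeCS code is reduced to a word distribution, and a new text is assigned the codes under which its bag of words is most likely. All codes, words and probabilities below are illustrative.

```python
import math
from collections import Counter

# Toy category-to-word distributions; real ones would be learned from the
# annotated training texts. Values here are illustrative only.
topics = {
    "D001714": {"bipolar": 0.05, "maniaco": 0.03, "episodio": 0.02},
    "D011570": {"prolactina": 0.04, "tratamiento": 0.02, "paciente": 0.01},
}
EPSILON = 1e-6  # smoothing for words a category has never seen

def score(tokens, word_dist):
    """Log-likelihood of a bag of words under one category's word distribution."""
    counts = Counter(tokens)
    return sum(n * math.log(word_dist.get(w, EPSILON)) for w, n in counts.items())

def infer_categories(tokens, topics, top_n=2):
    """Return the top_n DeCS codes whose distributions best explain the text."""
    ranked = sorted(topics, key=lambda c: score(tokens, topics[c]), reverse=True)
    return ranked[:top_n]

print(infer_categories(["episodio", "maniaco", "bipolar", "paciente"], topics))
```
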
## Challenges

The characteristics of the documents proposed for the task and the assumptions of probabilistic topic models lead to several challenges:
1. Since texts are short, word frequency may not be an adequate measure of relevance, yet topic models are built on bags of words (i.e., word order does not matter, but word repetition does).
2. Short-text-oriented topic models assume a single topic per text, whereas the documents proposed for the task may have more than one category.
3. Topic creation must be supervised to force each topic to map to a DeCS category, since annotations must match the DeCS headings.
4. Topic inference should return only the most relevant topics, i.e. one or several, since each text may have several categories.

## Corpora

A [Solr index](http://librairy.linkeddata.es/data/#/mesinesp/core-overview) has been created to process and annotate the texts proposed for the task. The structure of the documents is as follows (a minimal query sketch follows the example document below):
* **id**: unique identifier
* **title_s**: document name (*string*)
* **abstract_t**: text paragraph (*terms*)
* **journal_s**: publication journal (*string*)
* **size_i**: number of characters (*integer*)
* **year_i**: publication year (*integer*)
* **db_s**: document database (*string*)
* **codes**: list of DeCS categories (*list-of-string*)
* **scope_s**: training, development or test (*string*)
* **diseases**: list of diseases retrieved from the abstract (*list-of-string*)
* **medications**: list of medications retrieved from the abstract (*list-of-string*)
* **procedures**: list of procedures retrieved from the abstract (*list-of-string*)
* **symptoms**: list of symptoms retrieved from the abstract (*list-of-string*)
* **sentences**: list of lists of words after pre-processing the abstract (*list-of-string*)
* **tokens_t**: base text for creating bags of words (*terms*)

This is an example document:

````json
{
  "id": "ibc-ET1-3794",
  "title_s": "Caso clínico: Manejo clínico de la hiperprolactinemia secundaria al tratamiento de un episodio maníaco con características psicóticas y mixtas en una paciente con un inicio posparto de trastorno bipolar tipo I",
  "abstract_t": "Se presenta el caso de una paciente que ingresa por un primer episodio maníaco con sintomatología psicótica y mixta. El tratamiento inicial instaurado permitió un control parcial de los síntomas agudos y ocasionó una intensa elevación de los niveles séricos de prolactina. Ante esta situación, se planteó una solución terapéutica basada en la evidencia",
  "journal_s": "Psiquiatr. biol. (Internet)",
  "size_i": 352,
  "year_i": 2015,
  "db_s": "IBECS",
  "codes": ["D006966", "D001714", "D005260", "D000068105", "D011388",
            "D006801", "D011570", "D011618", "D049590"],
  "scope_s": "Development",
  "diseases": ["maníaco_con_sintomatología_psicótica"],
  "medications": ["prolactina"],
  "sentences": ["presentar", "casar", "paciente", "...."],
  "tokens_t": " ingresar ingresar ..",
  "_version_": 1699080371235192832
}
````
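
As a usage sketch, the index can be queried by any of these fields with pysolr, using the same client and search calls as the scripts in `code/`; the query below is an illustrative combination, assuming the public core is reachable.

```python
import pysolr

# Public core used throughout the code/ scripts
solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', timeout=50)

# Development documents that already have DeCS codes assigned
hits = solr.search(q="scope_s:Development AND codes:[* TO *]", rows=5, sort="id asc")
for doc in hits.docs:
    print(doc["id"], doc.get("codes", []))
```
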
## Algorithms

*more details coming soon*.

## Results

Our models are publicly available as Web REST services through [Docker](https://www.docker.com/) images. A service can be started with `docker run -p 8080:7777 <model-as-a-service name>`, and a Swagger-based interface is then available at [http://localhost:8080](http://localhost:8080); a minimal client sketch follows.
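
A client sketch, assuming a container started as above; the `/settings` and `/topics` paths are the ones `code/evaluation.py` queries, but each model's response fields may vary.

```python
import requests

base = "http://localhost:8080"

settings = requests.get(base + "/settings").json()  # model metadata and stats
topics = requests.get(base + "/topics").json()      # one topic per DeCS category

print("vocabulary size:", settings.get("stats", {}).get("vocabulary"))
print("number of topics:", len(topics))
```
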
| Algorithm  | Reference            | Bag-of-Words                    | Model-as-a-Service                   | Precision | Recall | F-Measure |
| ---------- | :------------------: | :-----------------------------: | :----------------------------------: | :-------: | :----: | :-------: |
| LabeledLDA | Ramage et al. (2009) | Frequency                       | librairy/llda-mesinesp:latest\*      | TBD       | TBD    | TBD       |
| TR-LLDA    | novel                | TextRank + linear normalization | librairy/tr-llda-mesinesp:latest\*   | TBD       | TBD    | TBD       |
| TR?-LLDA   | novel                | TextRank + ? normalization      | librairy/tr?-llda-mesinesp:latest\*  | TBD       | TBD    | TBD       |
| R-LLDA     | novel                | Rake + linear normalization     | librairy/r-llda-mesinesp:latest\*    | TBD       | TBD    | TBD       |
| R?-LLDA    | novel                | Rake + ? normalization          | librairy/r?-llda-mesinesp:latest\*   | TBD       | TBD    | TBD       |

\* *not available yet*
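
One reading of the "Bag-of-Words" column, sketched under stated assumptions: the TR-/R- variants replace raw term frequencies with keyword scores (from TextRank or Rake) linearly rescaled to integer pseudo-counts. Both the scores and the rescaling below are illustrative, not the released implementation.

```python
def linear_bow(scores, max_count=10):
    """Linearly rescale keyword scores to integer pseudo-counts in 1..max_count."""
    top = max(scores.values())
    return {w: max(1, round(max_count * s / top)) for w, s in scores.items()}

# Hypothetical TextRank scores for one abstract
scores = {"prolactina": 0.91, "episodio": 0.55, "paciente": 0.23}
print(linear_bow(scores))  # {'prolactina': 10, 'episodio': 6, 'paciente': 3}
```
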

code/add_track.py

+45
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri May 14 10:23:00 2021

@author: cbadenes
"""
import worker.annotator as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Cursor-based pagination over the whole index, window_size docs at a time
            articles = solr.search(q="*:*", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            # Annotate the window in parallel and write the results back to Solr
            results = pool.map(workers.add_track, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/annotate_desc_levels.py

+53
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 10 18:18:23 2021

@author: cbadenes
"""
import worker.categorize as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Only documents that already have DeCS codes assigned
            articles = solr.search(q="codes:[* TO *]", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.categorize, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/create_bows.py

+61
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue May 11 10:14:00 2021

@author: cbadenes
"""
import model.workers as workers
import pysolr
import multiprocessing as mp
import time


if __name__ == '__main__':
    print("annotating documents..")

    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(mp.cpu_count())

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    t = time.time()
    while True:
        old_counter = counter
        try:
            # Only documents whose pre-processed tokens are already available
            articles = solr.search(q="tokens_t:[* TO *]", rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.create_bow, articles.docs)
            solr.add(results)
            counter += len(results)
            print(counter, "docs annotated")
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    print('Time to annotate documents: {} mins'.format(round((time.time() - t) / 60, 2)))
```

code/evaluation.py

+113
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 10 11:32:27 2021

@author: cbadenes
"""

import worker.eval as workers
import pysolr
import multiprocessing as mp
import time
import json
import requests


def get_content(url):
    """GET a JSON resource, returning an empty dict on any error."""
    content = {}
    try:
        response = requests.get(url)
        content = response.json()
    except Exception:
        print("Error getting from url: ", url)
    return content


if __name__ == '__main__':

    # One model-as-a-service instance runs per DeCS level on ports 8000-8014
    categories_as_topics = []
    f = open('../results/topic-models.jsonl', mode='w')
    for port in range(8000, 8015):
        level = port % 1000
        print("getting settings from", port, "..")
        settings = get_content("http://localhost:" + str(port) + "/settings")
        if 'stats' in settings:
            print("getting topics from", port, "..")
            topics = get_content("http://localhost:" + str(port) + "/topics")
            for topic in topics:
                topic['level'] = level
                categories_as_topics.append(topic)
            num_topics = len(topics)
            stats = settings['stats']
            row = {'level': level, 'docs': int(stats['corpus']), 'topics': num_topics,
                   'vocabulary': int(stats['vocabulary']), 'loglikelihood': float(stats['loglikelihood'])}
            f.write(json.dumps(row))
            f.write("\n")
            print(row)
    f.close()

    print("writing categories as topics...")
    f = open('../results/categories_as_topics.jsonl', mode='w')
    for topic_description in categories_as_topics:
        row = {'category': topic_description['name'], 'level': str(topic_description['level']),
               'topic': str(topic_description['id']), 'words': topic_description['description']}
        f.write(json.dumps(row))
        f.write("\n")
        print(row)
    f.close()

    print("reading development documents..")
    solr = pysolr.Solr('http://librairy.linkeddata.es/data/mesinesp', always_commit=True, timeout=50)

    # Evaluate every development document and write one row per (doc, strategy)
    f = open('../results/dev-results.jsonl', mode='w')
    print("Number of processors: ", mp.cpu_count())
    pool = mp.Pool(6)

    print("reading from solr..")
    counter = 0
    window_size = 50
    cursor = "*"

    while True:
        old_counter = counter
        solr_query = "scope_s:Development"
        try:
            t = time.time()
            articles = solr.search(q=solr_query, rows=window_size, cursorMark=cursor, sort="id asc")
            cursor = articles.nextCursorMark
            results = pool.map(workers.evaluate, articles.docs)
            for result in results:
                doc = result['article_id']
                doc_results = result['results']
                for strategy in doc_results.keys():
                    eval_result = doc_results[strategy]
                    row = {'doc': doc, 'strategy': strategy,
                           'tp': eval_result['tp'], 'fp': eval_result['fp'], 'fn': eval_result['fn'],
                           'precision': eval_result['precision'], 'recall': eval_result['recall'],
                           'fmeasure': eval_result['fmeasure'],
                           'ref-labels': eval_result['ref-labels'], 'inf-labels': eval_result['inf-labels']}
                    f.write(json.dumps(row))
                    f.write("\n")
            counter += len(results)
            print(counter, "docs evaluated")
            print('Time to evaluate docs: {} mins'.format(round((time.time() - t) / 60, 2)))
            if old_counter == counter:
                # An empty window means the cursor is exhausted
                print("done!")
                break
        except Exception:
            print("Solr query error. Wait for 5secs..")
            time.sleep(5.0)
    f.close()
```
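
The per-document rows written to `dev-results.jsonl` can then be rolled up per strategy. This micro-averaging sketch is one reasonable aggregation over the `tp`/`fp`/`fn` fields above, not necessarily the task's official metric.

```python
import json
from collections import defaultdict

# Sum true/false positives and false negatives per strategy
totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
with open("../results/dev-results.jsonl") as f:
    for line in f:
        row = json.loads(line)
        for k in ("tp", "fp", "fn"):
            totals[row["strategy"]][k] += row[k]

# Micro-averaged precision, recall and F-measure per strategy
for strategy, c in sorted(totals.items()):
    p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print(f"{strategy}: precision={p:.3f} recall={r:.3f} f-measure={f1:.3f}")
```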
