
Conversation

@percevalw
Member

@percevalw percevalw commented Sep 30, 2025

Description

  • New DocToMarkupConverter to convert documents to markdown and improved MarkupToDocConverter to allow overlapping markup annotations (e.g., This is a <a>text <b>with</a> overlapping</b> tags).
  • New helper edsnlp.utils.fuzzy_alignment.align to map the entities of an annotated document to another document with similar but not identical text (e.g., after some text normalization or minor edits).
  • We now support span_getter="sents" to apply various pipes on sentences instead of entities or spans.
  • New generic LLM extractor pipe eds.llm_markup_extractor, which can be used to extract entities using a large language model served through an OpenAPI-style API.
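
To illustrate the idea behind the alignment helper, here is a minimal sketch built on the standard library's difflib. It is not the implementation of edsnlp.utils.fuzzy_alignment.align, whose signature is not shown in this description: the function name project_spans, its arguments, and the (begin, end, label) span format are assumptions made for the example.

```python
from difflib import SequenceMatcher


def project_spans(source, target, spans):
    """Map (begin, end, label) character spans from `source` onto `target`.

    Hypothetical helper: a character-level alignment is computed with
    difflib, then each span is projected through the aligned characters.
    """
    # autojunk=False disables difflib's popularity heuristic on long texts.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    # char_map[i] is the offset in `target` of source character i, or None
    # when that character has no counterpart (it was edited or removed).
    char_map = [None] * len(source)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            char_map[a + k] = b + k
    projected = []
    for begin, end, label in spans:
        mapped = [char_map[i] for i in range(begin, end) if char_map[i] is not None]
        if not mapped:
            continue  # nothing of this span survives in `target`
        projected.append((mapped[0], mapped[-1] + 1, label))
    return projected


source = "The patient has  type-2 diabetes ."
target = "The patient has type 2 diabetes."
begin = source.index("type-2 diabetes")
projected = project_spans(
    source, target, [(begin, begin + len("type-2 diabetes"), "disease")]
)
for b, e, label in projected:
    print(label, repr(target[b:e]))
```

Spans whose characters all disappear from the target are dropped in this sketch; a real helper may prefer to report them instead.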

Checklist

  • If this PR is a bug fix, the bug is documented in the test suite.
  • Changes were documented in the changelog (pending section).
  • If necessary, changes were made to the documentation (e.g., new pipeline).

@percevalw percevalw force-pushed the llm-extraction branch 7 times, most recently from dc3bc9f to 03c99ee, September 30, 2025 17:04
@github-actions

Docs preview URL

https://edsnlp-llm-extraction.vercel.app/

@percevalw percevalw force-pushed the llm-extraction branch 3 times, most recently from 4fb01b3 to 48bd61e, September 30, 2025 21:21
@github-actions

github-actions bot commented Sep 30, 2025

Coverage Report

Name | Stmts | Miss | ∆ Miss | Cover
edsnlp/pipes/llm/llm_markup_extractor/llm_markup_extractor.py

New missing coverage at line 352 !

             if span is None:
-                 continue
             spans.append(span)
New missing coverage at line 464 !
             if result is None:
-                 buffer[i] = ctx
             else:
New missing coverage at line 471 !
                     )
-                     buffer[i] = ctx
                 else:
New missing coverage at line 489 !
                     if len(in_flight) >= self.max_concurrent_requests:
-                         break
                     i2, d2 = next(ctx_iter)
New missing coverage at lines 493-497 !
                     break
-                 messages2 = self.build_prompt(d2)
-                 task_id2 = worker.submit(self._llm_request_coro(messages2))
-                 in_flight[task_id2] = i2
-                 pending_docs[i2] = d2

Stmts: 148 | Miss: 8 | ∆ Miss: 8 | Cover: 94.59%
edsnlp/utils/fuzzy_alignment.py

New missing coverage at line 70 !

         if len(other.begins) == 0:
-             return self
         begins = self.unapply(other.begins, side="left")

Stmts: 191 | Miss: 1 | ∆ Miss: 1 | Cover: 99.48%
edsnlp/core/stream.py

New missing coverage at line 155 !

             else:
-                 yield res
             return
Was already missing at lines 203-205
                 if isinstance(batch, StreamSentinel):
-                     yield batch
-                     continue
                 results = []
Was already missing at lines 1030-1032
                 elif op.batch_fn is None:
-                     batch_size = op.size
-                     batch_fn = batchify
                 else:

Stmts: 368 | Miss: 5 | ∆ Miss: 1 | Cover: 98.64%
edsnlp/utils/span_getters.py

New missing coverage at lines 73-75 !

     if span_getter is None:
-         yield doclike[:], None
-         return
     if callable(span_getter):
New missing coverage at lines 76-78 !
     if callable(span_getter):
-         yield from span_getter(doclike)
-         return
     for key, span_filter in span_getter.items():
Was already missing at lines 99-102
         else:
-             for span, group in candidates:
-                 if span.label_ in span_filter:
-                     yield span, group
Was already missing at line 107
     if callable(span_setter):
-         span_setter(doc, matches)
     else:
Was already missing at line 138
     if callable(value):
-         return value
     if isinstance(value, str):
Was already missing at line 187
             elif isinstance(v, str):
-                 new_value[k] = [v]
             elif isinstance(v, list) and all(isinstance(i, str) for i in v):

Stmts: 239 | Miss: 10 | ∆ Miss: -1 | Cover: 95.82%
edsnlp/data/converters.py

Was already missing at line 428

                 elif key == "XPOS":
-                     word.tag_ = value
                 elif key == "FEATS":
Was already missing at line 836
         if self.keep_raw_attribute_values:
-             return value
         try:
New missing coverage at line 898 !
                 if not attr:
-                     continue
                 if "=" in attr:
Was already missing at line 929
             if span is None:
-                 continue
             for k, v in attrs.items():
New missing coverage at line 999 !
         if isinstance(value, (bool, int, float)):
-             return repr(value)
         s = str(value)
Was already missing at line 1116
     if isinstance(converter, type):
-         return converter(**kwargs), {}
     return converter, validate_kwargs(converter, kwargs)

Stmts: 377 | Miss: 6 | ∆ Miss: -6 | Cover: 98.41%
TOTAL | Stmts: 12262 | Miss: 252 | ∆ Miss: 3 | Cover: 97.94%
Files without new missing coverage
Name | Stmts | Miss | ∆ Miss | Cover
edsnlp/utils/torch.py

Was already missing at line 102

 def load_pruned_obj(obj, _):
-     return obj
Was already missing at line 118
     def save_align_devices_hook(pickler, obj):
-         pickler.save_reduce(load_align_devices_hook, (obj.__dict__,), obj=obj)
Was already missing at lines 121-128
     def load_align_devices_hook(state):
-         state["execution_device"] = MAP_LOCATION
  ...
-     AlignDevicesHook = None
Was already missing at line 143
             if torch.Tensor in copyreg.dispatch_table:
-                 old_dispatch[torch.Tensor] = copyreg.dispatch_table[torch.Tensor]
             copyreg.pickle(torch.Tensor, reduce_empty)

Stmts: 83 | Miss: 9 | ∆ Miss: 0 | Cover: 89.16%
edsnlp/utils/resources.py

Was already missing at line 33

     if not verbs:
-         return conjugated_verbs

Stmts: 24 | Miss: 1 | ∆ Miss: 0 | Cover: 95.83%
edsnlp/utils/numbers.py

Was already missing at line 34

     else:
-         string = s
     string = string.lower().strip()
Was already missing at lines 38-41
         return int(string)
-     except ValueError:
-         parsed = DIGITS_MAPPINGS.get(string, None)
-         return parsed

Stmts: 16 | Miss: 4 | ∆ Miss: 0 | Cover: 75.00%
edsnlp/utils/filter.py

Was already missing at line 206

     if isinstance(label, int):
-         return [span for span in spans if span.label == label]
     else:

Stmts: 74 | Miss: 1 | ∆ Miss: 0 | Cover: 98.65%
edsnlp/tune.py

Was already missing at line 169

             )
-         except RuntimeError as e:
             if "zero total variance" in str(e):  # pragma: no cover
Was already missing at line 684
         else:
-             n_trials = compute_n_trials(
                 gpu_hours, compute_time_per_trial(study, ema=True)

Stmts: 287 | Miss: 2 | ∆ Miss: 0 | Cover: 99.30%
edsnlp/training/trainer.py

Was already missing at line 57

     if result is None:
-         result = {}
     if isinstance(x, dict):
Was already missing at lines 365-371
         if self.sub_batch_size and self.sub_batch_size[1] == "splits":
-             data = data.batchify(
  ...
-             data = data.map(lambda b: [nlp.collate(sb, device=device) for sb in b])
         elif self.sub_batch_size:
Was already missing at lines 883-890
                         raise
-                     except Exception:
  ...
-                         raise
Was already missing at lines 917-919
                     ) > grad_max_dev * math.sqrt(grad_var):
-                         spike = True
-                         spikes += 1
                     else:
Was already missing at line 926
                     if spike and grad_dev_policy == "clip_mean":
-                         torch.nn.utils.clip_grad_norm_(
                             grad_params, grad_mean, norm_type=2
Was already missing at line 930
                     elif spike and grad_dev_policy == "clip_threshold":
-                         torch.nn.utils.clip_grad_norm_(
                             grad_params,

Stmts: 333 | Miss: 12 | ∆ Miss: 0 | Cover: 96.40%
edsnlp/training/loggers.py

Was already missing at line 109

                 if col not in values and col != "step":
-                     row.append("")
                 else:
Was already missing at line 278
     def tracker(self):
-         return self.printer
Was already missing at lines 369-388
         """
-         env_logging_dir = os.environ.get("AIM_LOGGING_DIR", None)
  ...
-         accelerate.tracking.logger.debug(
             f"Initialized Aim run {self.writer.hash} in project {project_name}"
Was already missing at lines 392-394
     def log(self, values: dict, step: Optional[int], **kwargs):
-         values = flatten_dict(values)
-         return super().log(values, step, **kwargs)

Stmts: 140 | Miss: 11 | ∆ Miss: 0 | Cover: 92.14%
edsnlp/reducers.py

Was already missing at line 115

     if not hasattr(module, "__file__"):
-         return True
     if module.__file__ is None:
Was already missing at line 117
     if module.__file__ is None:
-         return False
     # Hack to avoid copying the full module dict

Stmts: 67 | Miss: 2 | ∆ Miss: 0 | Cover: 97.01%
edsnlp/processing/spark.py

Was already missing at line 50

         getActiveSession = SparkSession.getActiveSession
-     except AttributeError:

Stmts: 47 | Miss: 1 | ∆ Miss: 0 | Cover: 97.87%
edsnlp/processing/multiprocessing.py

Was already missing at lines 222-230

                     return re.findall(r"/[^\s]+\.so[^\s]*", f.read())
-             except Exception:
  ...
-             return []
Was already missing at lines 233-235
         loaded = loaded_libs()
-     except Exception:
-         return False
     return any(any(k in os.path.basename(p).lower() for k in libs) for p in loaded)
Was already missing at line 254
         )
-         method = "spawn"
Was already missing at lines 258-264
     if has_hdfs and method == "fork":
-         safe = "forkserver" if "forkserver" in methods else "spawn"
  ...
-         method = safe
Was already missing at lines 453-455
                     pass
-             except StopSignal:
-                 pass
             for name, queue in self.consumer_queues(stage):
Was already missing at lines 668-670
             if isinstance(docs, StreamSentinel):
-                 self.active_batches[stage].append([None, None, None, docs])
-                 continue
             batch_id = str(hash(tuple(id(x) for x in docs)))[-8:] + "-" + self.uid
Was already missing at lines 1193-1199
                 if out[0].kind == requires_sentinel:
-                     missing_sentinels -= 1
  ...
-                         missing_sentinels = len(self.cpu_worker_names)
                 continue

Stmts: 657 | Miss: 22 | ∆ Miss: 0 | Cover: 96.65%
edsnlp/processing/deprecated_pipe.py

Was already missing at lines 207-209

         def converter(doc):
-             res = results_extractor(doc)
-             return (
                 [{"note_id": doc._.note_id, **row} for row in res]

Stmts: 57 | Miss: 2 | ∆ Miss: 0 | Cover: 96.49%
edsnlp/pipes/trainable/span_linker/span_linker.py

Was already missing at lines 402-404

             if self.reference_mode == "synonym":
-                 embeds = embeds.to(new_lin.weight)
-                 new_lin.weight.data = embeds
             else:

Stmts: 173 | Miss: 2 | ∆ Miss: 0 | Cover: 98.84%
edsnlp/pipes/trainable/span_classifier/span_classifier.py

Was already missing at line 379

         if not all(keep_bindings):
-             logger.warning(
                 "Some attributes have no labels or values and have been removed:"

Stmts: 164 | Miss: 1 | ∆ Miss: 0 | Cover: 99.39%
edsnlp/pipes/trainable/ner_crf/ner_crf.py

Was already missing at line 301

         if self.labels is not None and not self.infer_span_setter:
-             return
Was already missing at lines 309-311
             if callable(self.target_span_getter):
-                 for span in get_spans(doc, self.target_span_getter):
-                     inferred_labels.add(span.label_)
             else:

Stmts: 173 | Miss: 3 | ∆ Miss: 0 | Cover: 98.27%
edsnlp/pipes/trainable/layers/crf.py

Was already missing at line 21

     # out: 2 * N * O
-     return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).logsumexp(-2)
Was already missing at line 29
     # out: 2 * N * O
-     return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).max(-2)
Was already missing at line 98
         if learnable_transitions:
-             self.transitions = torch.nn.Parameter(
                 torch.zeros_like(forbidden_transitions, dtype=torch.float)
Was already missing at line 108
         if learnable_transitions and with_start_end_transitions:
-             self.start_transitions = torch.nn.Parameter(
                 torch.zeros(num_tags, dtype=torch.float)
Was already missing at line 117
         if learnable_transitions and with_start_end_transitions:
-             self.end_transitions = torch.nn.Parameter(
                 torch.zeros(num_tags, dtype=torch.float)

Stmts: 137 | Miss: 5 | ∆ Miss: 0 | Cover: 96.35%
edsnlp/pipes/trainable/embeddings/transformer/transformer.py

Was already missing at line 166

         if quantization is not None:
-             kwargs["quantization_config"] = quantization
Was already missing at line 189
         if self.cls_token_id is None:
-             [self.cls_token_id] = self.tokenizer.convert_tokens_to_ids(
                 [self.tokenizer.special_tokens_map["bos_token"]]
Was already missing at line 193
         if self.sep_token_id is None:
-             [self.sep_token_id] = self.tokenizer.convert_tokens_to_ids(
                 [self.tokenizer.special_tokens_map["eos_token"]]

Stmts: 168 | Miss: 3 | ∆ Miss: 0 | Cover: 98.21%
edsnlp/pipes/qualifiers/reported_speech/reported_speech.py

Was already missing at lines 24-28

         return "REPORTED"
-     elif token._.rspeech is False:
-         return "DIRECT"
-     else:
-         return None

Stmts: 100 | Miss: 3 | ∆ Miss: 0 | Cover: 97.00%
edsnlp/pipes/qualifiers/negation/negation.py

Was already missing at line 28

     else:
-         return None

Stmts: 101 | Miss: 1 | ∆ Miss: 0 | Cover: 99.01%
edsnlp/pipes/qualifiers/hypothesis/hypothesis.py

Was already missing at line 27

     else:
-         return None

Stmts: 98 | Miss: 1 | ∆ Miss: 0 | Cover: 98.98%
edsnlp/pipes/qualifiers/history/history.py

Was already missing at lines 26-32

 def history_getter(token: Union[Token, Span]) -> Optional[str]:
-     if token._.history is True:
-         return "ATCD"
-     elif token._.history is False:
-         return "CURRENT"
-     else:
-         return None
Was already missing at lines 353-359
                 )
-             except ValueError:
  ...
-                 note_datetime = None
Was already missing at lines 368-374
                 )
-             except ValueError:
  ...
-                 birth_datetime = None
Was already missing at lines 440-443
                         )
-                     except ValueError as e:
-                         absolute_date = None
-                         logger.warning(
                             "In doc {}, the following date {} raises this error: {}. "

Stmts: 180 | Miss: 14 | ∆ Miss: 0 | Cover: 92.22%
edsnlp/pipes/qualifiers/family/family.py

Was already missing at line 27

     else:
-         return None

Stmts: 83 | Miss: 1 | ∆ Miss: 0 | Cover: 98.80%
edsnlp/pipes/ner/tnm/model.py

Was already missing at line 147

     def __str__(self):
-         return self.norm()
Was already missing at line 171
             )
-             exclude_unset = skip_defaults

Stmts: 112 | Miss: 2 | ∆ Miss: 0 | Cover: 98.21%
edsnlp/pipes/ner/scores/sofa/sofa.py

Was already missing at line 32

             if not assigned:
-                 continue
             if assigned.get("method_max") is not None:
Was already missing at line 40
             else:
-                 method = "Non précisée"

Stmts: 25 | Miss: 2 | ∆ Miss: 0 | Cover: 92.00%
edsnlp/pipes/ner/scores/elston_ellis/patterns.py

Was already missing at line 26

         if x <= 5:
-             return 1
Was already missing at lines 32-36
         else:
-             return 3
- 
-     except ValueError:
-         return None

Stmts: 21 | Miss: 4 | ∆ Miss: 0 | Cover: 80.95%
edsnlp/pipes/ner/scores/charlson/patterns.py

Was already missing at lines 21-23

             return int(extracted_score)
-     except ValueError:
-         return None

Stmts: 13 | Miss: 2 | ∆ Miss: 0 | Cover: 84.62%
edsnlp/pipes/ner/disorders/solid_tumor/solid_tumor.py

Was already missing at lines 131-137

         for span in spans:
-             span.label_ = "solid_tumor"
  ...
-             yield span

Stmts: 38 | Miss: 6 | ∆ Miss: 0 | Cover: 84.21%
edsnlp/pipes/ner/disorders/peripheral_vascular_disease/peripheral_vascular_disease.py

Was already missing at line 108

                 if "peripheral" not in span._.assigned.keys():
-                     continue

Stmts: 16 | Miss: 1 | ∆ Miss: 0 | Cover: 93.75%
edsnlp/pipes/ner/disorders/diabetes/diabetes.py

Was already missing at line 131

                 # Mostly FP
-                 continue
Was already missing at line 134
             elif self.has_far_complications(span):
-                 span._.status = 2
Was already missing at line 145
         if next(iter(self.complication_matcher(context)), None) is not None:
-             return True
         return False

Stmts: 30 | Miss: 3 | ∆ Miss: 0 | Cover: 90.00%
edsnlp/pipes/ner/disorders/connective_tissue_disease/connective_tissue_disease.py

Was already missing at line 104

                 # Huge change of FP / Title section
-                 continue

Stmts: 15 | Miss: 1 | ∆ Miss: 0 | Cover: 93.33%
edsnlp/pipes/ner/disorders/ckd/ckd.py

Was already missing at lines 121-124

             dfg_value = float(dfg_span.text.replace(",", ".").strip())
-         except ValueError:
-             logger.trace(f"DFG value couldn't be extracted from {dfg_span.text}")
-             return False

Stmts: 30 | Miss: 3 | ∆ Miss: 0 | Cover: 90.00%
edsnlp/pipes/ner/disorders/cerebrovascular_accident/cerebrovascular_accident.py

Was already missing at lines 112-114

             if span._.source == "ischemia":
-                 if "brain" not in span._.assigned.keys():
-                     continue

Stmts: 18 | Miss: 2 | ∆ Miss: 0 | Cover: 88.89%
edsnlp/pipes/ner/adicap/models.py

Was already missing at line 15

     def norm(self) -> str:
-         return self.code
Was already missing at line 18
     def __str__(self):
-         return self.norm()

Stmts: 16 | Miss: 2 | ∆ Miss: 0 | Cover: 87.50%
edsnlp/pipes/misc/split/split.py

Was already missing at lines 186-188

         if max_length <= 0 and self.regex is None:
-             yield doc
-             return

Stmts: 73 | Miss: 2 | ∆ Miss: 0 | Cover: 97.26%
edsnlp/pipes/misc/sections/sections.py

Was already missing at line 126

         if sections is None:
-             sections = patterns.sections
         sections = dict(sections)

Stmts: 45 | Miss: 1 | ∆ Miss: 0 | Cover: 97.78%
edsnlp/pipes/misc/quantities/quantities.py

Was already missing at lines 147-149

     def __getitem__(self, item: int):
-         assert isinstance(item, int)
-         return [self][item]
Was already missing at lines 160-163
     def __eq__(self, other: Any):
-         if isinstance(other, SimpleQuantity):
-             return self.convert_to(other.unit) == other.value
-         return False
Was already missing at line 166
         if other.unit == self.unit:
-             return SimpleQuantity(self.value + other.value, self.unit, self.registry)
         return SimpleQuantity(
Was already missing at line 193
             return self.convert_to(other_unit)
-         except KeyError:
             raise AttributeError(f"Unit {other_unit} not found")
Was already missing at line 198
     def verify(cls, ent):
-         return True
Was already missing at line 264
     def __lt__(self, other: Union[SimpleQuantity, "RangeQuantity"]):
-         return max(self.convert_to(other.unit)) < min((part.value for part in other))
Was already missing at line 275
             return self.convert_to(other.unit) == other.value
-         return False
Was already missing at line 289
     def verify(cls, ent):
-         return True
Was already missing at line 888
         if snippet.end != last and doclike.doc[last: snippet.end].text.strip() == "":
-             pseudo.append("w")
         pseudo = "".join(pseudo)
Was already missing at line 1069
                             if start_line is None:
-                                 continue
Was already missing at lines 1100-1102
                         unit_norm = self.unit_followers[unit_before.label_]
-                 except (KeyError, AttributeError, IndexError):
-                     pass
Was already missing at line 1145
             ):
-                 ent = doc[unit_text.start: number.end]
             else:
Was already missing at lines 1152-1154
                 dims = self.unit_registry.parse_unit(unit_norm)[0]
-             except KeyError:
-                 continue
Was already missing at lines 1260-1262
                     last._.set(last.label_, new_value)
-                 except (AttributeError, TypeError):
-                     merged.append(ent)
             else:

Stmts: 440 | Miss: 20 | ∆ Miss: 0 | Cover: 95.45%
edsnlp/pipes/misc/dates/models.py

Was already missing at line 165

                     else:
-                         d["month"] = note_datetime.month
                 if self.day is None:
Was already missing at lines 169-175
             else:
-                 if self.year is None:
  ...
-                     d["day"] = default_day
Was already missing at lines 183-185
                 return dt
-             except ValueError:
-                 return None
Was already missing at line 201
         else:
-             return None
Was already missing at line 217
         if self.second:
-             norm += f"{self.second:02}s"

Stmts: 206 | Miss: 11 | ∆ Miss: 0 | Cover: 94.66%
edsnlp/pipes/misc/dates/dates.py

Was already missing at line 249

         if isinstance(absolute, str):
-             absolute = [absolute]
         if isinstance(relative, str):
Was already missing at line 251
         if isinstance(relative, str):
-             relative = [relative]
         if isinstance(duration, str):
Was already missing at line 253
         if isinstance(duration, str):
-             relative = [duration]
         if isinstance(false_positive, str):
Was already missing at lines 357-366
             if self.merge_mode == "align":
-                 alignments = align_spans(matches, spans, sort_by_overlap=True)
  ...
-                         matches.append(span)
Was already missing at lines 462-464
                 if v1.mode == Mode.DURATION:
-                     m1 = Bound.FROM if v2.bound == Bound.UNTIL else Bound.UNTIL
-                     m2 = v2.mode or Bound.FROM
                 elif v2.mode == Mode.DURATION:

Stmts: 153 | Miss: 14 | ∆ Miss: 0 | Cover: 90.85%
edsnlp/pipes/misc/consultation_dates/consultation_dates.py

Was already missing at line 131

         else:
-             self.date_matcher = None
Was already missing at line 134
         if not consultation_mention:
-             consultation_mention = []
         elif consultation_mention is True:

Stmts: 48 | Miss: 2 | ∆ Miss: 0 | Cover: 95.83%
edsnlp/pipes/core/normalizer/__init__.py

Was already missing at line 7

 def excluded_or_space_getter(t):
-     return t.is_space or t.tag_ == "EXCLUDED"

Stmts: 5 | Miss: 1 | ∆ Miss: 0 | Cover: 80.00%
edsnlp/pipes/core/endlines/endlines.py

Was already missing at lines 160-164

         if end_lines_model is None:
-             path = build_path(__file__, "base_model.pkl")
- 
-             with open(path, "rb") as inp:
-                 self.model = pickle.load(inp)
         elif isinstance(end_lines_model, str):
Was already missing at lines 167-169
                 self.model = pickle.load(inp)
-         elif isinstance(end_lines_model, EndLinesModel):
-             self.model = end_lines_model
         else:
Was already missing at line 200
         ):
-             return "ENUMERATION"
Was already missing at line 287
         if np.isnan(sigma):
-             sigma = 1

Stmts: 89 | Miss: 7 | ∆ Miss: 0 | Cover: 92.13%
edsnlp/pipes/core/contextual_matcher/contextual_matcher.py

Was already missing at lines 241-243

             ):
-                 to_keep = False
-                 break

Stmts: 130 | Miss: 2 | ∆ Miss: 0 | Cover: 98.46%
edsnlp/patch_spacy.py

Was already missing at lines 67-69

             # if module is reloaded.
-             existing_func = registry.factories.get(internal_name)
-             if not util.is_same_func(factory_func, existing_func):
                 raise ValueError(

Stmts: 31 | Miss: 2 | ∆ Miss: 0 | Cover: 93.55%
edsnlp/package.py

Was already missing at lines 474-476

             version = version or pyproject["project"]["version"]
-         except (KeyError, TypeError):
-             version = "0.1.0"
         name = name or pyproject["project"]["name"]
Was already missing at line 480
         else:
-             main_package = None
         model_package = snake_case(name.lower())

Stmts: 214 | Miss: 3 | ∆ Miss: 0 | Cover: 98.60%
edsnlp/metrics/span_attribute.py

Was already missing at lines 67-69

         )
-         assert attributes is None
-         attributes = kwargs.pop("qualifiers")
     if attributes is None:

Stmts: 73 | Miss: 2 | ∆ Miss: 0 | Cover: 97.26%
edsnlp/matchers/simstring.py

Was already missing at line 280

     if custom:
-         attr = attr[1:].lower()
Was already missing at line 295
             if custom:
-                 token_text = getattr(token._, attr)
             else:

Stmts: 146 | Miss: 2 | ∆ Miss: 0 | Cover: 98.63%
edsnlp/language.py

Was already missing at line 103

             if last != begin:
-                 logger.warning(
                     "Missed some characters during"

Stmts: 51 | Miss: 1 | ∆ Miss: 0 | Cover: 98.04%
edsnlp/data/standoff.py

Was already missing at line 38

     def __init__(self, ann_file, line):
-         super().__init__(f"File {ann_file}, unrecognized Brat line {line}")
Was already missing at line 192
                         )
-                 except Exception:
                     raise Exception(

Stmts: 186 | Miss: 2 | ∆ Miss: 0 | Cover: 98.92%
edsnlp/data/polars.py

Was already missing at line 36

         if hasattr(data, "collect"):
-             data = data.collect()
         assert isinstance(data, pl.DataFrame)

Stmts: 55 | Miss: 1 | ∆ Miss: 0 | Cover: 98.18%
edsnlp/data/json.py

Was already missing at line 81

                 return records
-         except Exception as e:
             raise Exception(f"Cannot read {file}: {e}")

Stmts: 112 | Miss: 1 | ∆ Miss: 0 | Cover: 99.11%
edsnlp/data/conll.py

Was already missing at lines 81-83

             )
-         except StopIteration:
-             cols = DEFAULT_COLUMNS
             warnings.warn(
Was already missing at lines 92-96
         if not line:
-             if doc["words"]:
-                 yield doc
-                 doc = {"words": []}
-             continue
         if line.startswith("#"):

Stmts: 76 | Miss: 6 | ∆ Miss: 0 | Cover: 92.11%
edsnlp/core/torch_component.py

Was already missing at line 407

             if hasattr(self, "compiled"):
-                 res = self.compiled(batch)
             else:
Was already missing at line 453
         """
-         return self.preprocess(doc)

Stmts: 189 | Miss: 2 | ∆ Miss: 0 | Cover: 98.94%
edsnlp/core/registries.py

Was already missing at line 129

         if isinstance(obj, DraftPipe):
-             return obj
         elif isinstance(obj, dict):
Was already missing at line 134
                 if result is not None:
-                     return result
         elif isinstance(obj, (tuple, list, set)):
Was already missing at line 139
                 if result is not None:
-                     return result
         return None

Stmts: 185 | Miss: 3 | ∆ Miss: 0 | Cover: 98.38%
edsnlp/core/pipeline.py

Was already missing at line 607

             if name in exclude:
-                 continue
             if name not in components:
Was already missing at lines 718-721
         """
-         res = Stream.ensure_stream(docs)
-         res = res.map(functools.partial(self.preprocess, supervision=supervision))
-         return res

Stmts: 448 | Miss: 4 | ∆ Miss: 0 | Cover: 99.11%
edsnlp/connectors/omop.py

Was already missing at line 69

         if not isinstance(row.ents, list):
-             continue
Was already missing at line 87
             else:
-                 doc.spans[span.label_].append(span)
Was already missing at line 127
     if df.note_id.isna().any():
-         df["note_id"] = range(len(df))
Was already missing at line 171
         if i > 0:
-             df.term_modifiers += ";"
         df.term_modifiers += ext + "=" + df[ext].astype(str)

Stmts: 84 | Miss: 4 | ∆ Miss: 0 | Cover: 95.24%

281 files skipped due to complete coverage.

Coverage success: total of 97.94% is above 97.90% 🎉
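
The percentages in this report follow directly from the statement and miss counts. As a small worked check, reading the TOTAL line as 12262 statements with 252 misses, and taking the 97.90% threshold from the success message, the arithmetic can be sketched as follows (the function name coverage_percent is only illustrative):

```python
def coverage_percent(statements, missed):
    """Share of executed statements, as a percentage."""
    return 100.0 * (statements - missed) / statements


# TOTAL row, read as 12262 statements and 252 misses.
total = coverage_percent(12262, 252)
print(f"{total:.2f}%")

# The report succeeds when the total is above the configured threshold.
print(total > 97.90)
```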

@percevalw percevalw force-pushed the llm-extraction branch 6 times, most recently from fd5056b to dd197bc, October 1, 2025 19:55
@percevalw percevalw force-pushed the llm-extraction branch 3 times, most recently from 1783c8e to fcd534a, October 3, 2025 00:17
@marconaguib

Really nice work here! A few thoughts and questions:

  • I really like the design, especially the fuzzy_alignment.align() part and the fact that prompt can be a callable. Thanks also for the citation!
  • I still struggle with async logic, so unfortunately I can’t be of much help on that side.
  • The example prompt says "Tags can be nested, but they must not overlap". However, if I’m not mistaken, MarkupToDocConverter()._parse() does handle overlapping tags. Are there other parts of the pipeline that don’t? If not, this wording might be misleading for users.
  • It might be worth specifying in the documentation how to use LlmMarkupExtractor with preset="md".
  • In DocToMarkup and MarkupToDoc, the default preset is "md", while in LlmMarkupExtractor it is "xml". This could be a bit unintuitive for users.
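
For readers wondering what "handling overlapping tags" involves, here is a minimal sketch. It is not the code of MarkupToDocConverter()._parse(): the function name parse_overlapping_markup is hypothetical, and it assumes simple XML-like tags without attributes. The key point is that each closing tag is matched against the latest opening tag with the same label, rather than against a single shared stack, so spans may cross each other.

```python
import re

# Matches <label> and </label>, without attributes (an assumption).
TAG = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def parse_overlapping_markup(markup):
    """Return (plain_text, spans), with spans as (begin, end, label)."""
    text_parts = []
    spans = []
    open_tags = {}  # label -> list of begin offsets in the plain text
    offset = 0  # length of the plain text emitted so far
    last = 0  # position in `markup` right after the previous tag
    for match in TAG.finditer(markup):
        chunk = markup[last:match.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        last = match.end()
        closing, label = match.group(1) == "/", match.group(2)
        if not closing:
            open_tags.setdefault(label, []).append(offset)
        elif open_tags.get(label):
            # Close the latest opening of the same label, whatever other
            # labels were opened since: this is what permits overlaps.
            spans.append((open_tags[label].pop(), offset, label))
        # A closing tag without a matching opening is ignored here.
    text_parts.append(markup[last:])
    return "".join(text_parts), sorted(spans)


text, spans = parse_overlapping_markup(
    "This is a <a>text <b>with</a> overlapping</b> tags"
)
for begin, end, label in spans:
    print(label, repr(text[begin:end]))
```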

Some broader remarks (that might be better addressed in a future PR):

  • It seems likely that users will want to do prompt engineering. Would it make sense to design another class (something closer to a trainable_component) that can optimize the prompt based on annotated data?
  • I’d be happy to run some benchmarks to compare this to the results I reported in my article.
  • A more modular retriever (e.g. TF-IDF, “most entity-containing”, etc.) could be very useful. Having a dedicated Retriever class might be a good abstraction in general.

@percevalw
Member Author

percevalw commented Oct 3, 2025

@marconaguib thank you for your review! All good points, I'll fix that :)

It seems likely that users will want to do prompt engineering. Would it make sense to design another class (something closer to a trainable_component) that can optimize the prompt based on annotated data?

Are you referring to fine-tuning (where it would make sense to have a trainable component, although our lib might not be as optimized as other work out there, like unsloth) or prompt search (more akin to tuning)? For the latter, yes, it would be great! I wonder if there is some way we could offer a unified interface using edsnlp.tune, which uses optuna to optimize float/int/categorical variables as it does now, but also textual hyperparameters as in this component...

I’d be happy to run some benchmarks to compare this to the results I reported in my article.

That would be great!

@marconaguib

Yes, I was thinking prompt search indeed, and the edsnlp.tune track sounds very promising 👍
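
As a rough illustration of the prompt-search idea discussed in this thread, the sketch below treats the prompt as a categorical choice and keeps the candidate with the best mean F1 on annotated examples. Everything here is hypothetical: it does not use edsnlp.tune or optuna, and the extract callable is a stand-in for whatever would actually query the LLM.

```python
def f1_score(predicted, gold):
    """F1 between two sets of (begin, end, label) spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def search_prompt(candidates, extract, annotated):
    """Return the candidate prompt with the best mean F1.

    `extract(prompt, text)` is a hypothetical callable returning a set of
    spans; `annotated` is a list of (text, gold_spans) pairs.
    """

    def mean_f1(prompt):
        scores = [f1_score(extract(prompt, text), gold) for text, gold in annotated]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_f1)


# Toy usage with a fake extractor: only the second prompt finds the entity.
annotated = [("fever since Monday", {(0, 5, "symptom")})]


def fake_extract(prompt, text):
    return {(0, 5, "symptom")} if "symptom" in prompt else set()


best = search_prompt(["Tag the dates.", "Tag every symptom."], fake_extract, annotated)
print(best)
```

An exhaustive loop like this only suits a handful of candidates; a sampler-based search would be needed for larger prompt spaces.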

@percevalw percevalw force-pushed the llm-extraction branch 5 times, most recently from 100fb19 to aab40a5, October 4, 2025 00:20
@sonarqubecloud

sonarqubecloud bot commented Oct 4, 2025

@percevalw percevalw merged commit 7f2c576 into master Oct 4, 2025
3 of 16 checks passed
@percevalw percevalw deleted the llm-extraction branch October 4, 2025 00:21


3 participants