In our study, we evaluated the performance of LynxCare’s clinical NLP pipeline against two open-source alternatives. Our goal was to assess out-of-the-box performance on multilingual biomedical text, particularly electronic health record (EHR) data.
Read on to learn more about our methodology and findings. We compared the following pipelines:
• QuickUMLS: A widely used concept-matching tool based on approximate string matching (a usage sketch follows this list).
• Mistral 7B Instruct (4-bit): We used this LLM for entity recognition via few-shot prompting (see the sketch after this list).
o It exhibited inconsistent output, requiring repeated prompting to obtain optimal results.
o Its biomedical counterpart, BioMistral-7B, showed a higher tendency to hallucinate.
• SapBERT: A multilingual language model, pre-trained on the UMLS 2020AB knowledge base, used to embed textual concept mentions and to perform entity linking (EL) by similarity with embedded representations of codes (sketched after this list).
• LynxCare’s custom discontinuous NER (C-DNER) combined with SapBERT for entity linking.
• The Mistral LLM combined with LynxCare’s custom entity linking (C-EL).
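To make the baselines concrete, the sketches below illustrate each open-source component. First, a minimal QuickUMLS invocation; the index path is a placeholder for an index built from a local UMLS installation, and the threshold, similarity measure, and window size shown are the library's defaults rather than our study's settings.

```python
from quickumls import QuickUMLS  # pip install quickumls

# "/path/to/quickumls_index" is a placeholder: QuickUMLS requires an index
# built beforehand from a local UMLS installation.
matcher = QuickUMLS(
    "/path/to/quickumls_index",
    threshold=0.7,              # minimum approximate string similarity
    similarity_name="jaccard",  # similarity measure for candidate spans
    window=5,                   # maximum tokens per candidate span
)

text = "Patient admitted with acute myocardial infarction."
for span_candidates in matcher.match(text, best_match_only=True):
    for c in span_candidates:
        print(c["ngram"], "->", c["cui"], f'({c["similarity"]:.2f})')
```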
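Next, a minimal sketch of few-shot entity recognition with a 4-bit Mistral model via Hugging Face transformers and bitsandbytes. The checkpoint name, prompt wording, and few-shot examples are illustrative assumptions, not the exact configuration used in the study.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights
    device_map="auto",
)

# Few-shot prompt: two worked examples, then the sentence to annotate.
prompt = (
    "Extract all medical concepts mentioned in the text.\n\n"
    "Text: The patient was treated for type 2 diabetes.\n"
    "Concepts: type 2 diabetes\n\n"
    "Text: MRI revealed a lesion in the left temporal lobe.\n"
    "Concepts: lesion; left temporal lobe\n\n"
    "Text: He was admitted with an acute myocardial infarction.\n"
    "Concepts:"
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```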
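Finally, SapBERT-style entity linking: embed each mention and each concept name with the model's [CLS] representation, then link to the nearest concept by cosine similarity. The checkpoint name and the two-concept "ontology" below are placeholders; in practice every UMLS code is embedded once and searched with a nearest-neighbor index.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def embed(texts):
    """Embed strings with the [CLS] token representation, as SapBERT does."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0, :]

# Toy ontology: CUI -> preferred name. Real linking embeds all UMLS codes.
concepts = {"C0027051": "myocardial infarction", "C0011849": "diabetes mellitus"}
concept_vecs = embed(list(concepts.values()))

mention_vec = embed(["hartinfarct"])  # a Dutch mention, linked cross-lingually
scores = torch.nn.functional.cosine_similarity(mention_vec, concept_vecs)
print(list(concepts)[scores.argmax().item()])  # expected: C0027051
```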
We selected four datasets spanning three languages (English, Dutch, and French):
• EHR dataset (Dutch) – 35 patient records (oncology & cardiology).
• Mantra GSC corpus (Dutch) – EMEA subset (362 concepts).
• Quaero medical corpus (French) – EMEA subset (1,970 concepts).
• E3C corpus (English) – Clinical records (2,389 concepts).
We evaluated each pipeline in two stages:
• Named entity recognition (NER) was evaluated on span overlap and concept identification rather than label assignment, because annotation schemes vary across the datasets.
• Entity linking was measured by the precision, recall, and F1-score of code assignments, comparing model-extracted codes to the ground-truth annotations for all overlapping NER extractions (a minimal example follows this list).
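As a concrete illustration of the second stage, here is a minimal sketch that assumes gold and predicted annotations have already been reduced to (span, code) pairs for the overlapping extractions; the overlap matching itself is omitted.

```python
def prf1(gold, predicted):
    """Precision, recall, and F1 over (span, code) pairs."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # correct code assignments
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two gold annotations; the model links the first correctly, the second not.
gold = [((22, 43), "C0027051"), ((50, 58), "C0011849")]
pred = [((22, 43), "C0027051"), ((50, 58), "C0020538")]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```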
Our key findings are summarized below.
• Our pipeline outperforms all others on clinical narratives, highlighting the value of domain-specific fine-tuning.
• The Mistral + SapBERT pipeline slightly outperforms our model on precision, because Mistral prefers shorter, flat concepts (often single words) that are easier to map but do not capture all of the information.
• Our custom DNER model identifies nested and discontinuous concepts, which is more expressive but leads to lower precision in some cases.
• The highest-scoring pipeline combines our custom DNER model with SapBERT, outperforming Mistral + SapBERT by nearly 8 percentage points in F1-score.
• This underscores the importance of custom entity recognition models even when combined with a generic entity linker.
• Interestingly, on the Quaero corpus QuickUMLS performs best, benefiting from its close alignment with UMLS content.
• Based on automated metrics, the DNER + C-EL pipeline underperforms QuickUMLS here, likely because it often extracts longer, more specific concepts while 74% of Quaero’s annotations are unigrams.
• This highlights that differences in annotation and ranking standards across datasets significantly impact results.
• Our Dutch fine-tuned model outperforms the alternatives, demonstrating strong cross-lingual generalization on clinical narratives.
• The best-performing approach also involved combining custom and open-source components, further emphasizing the importance of domain-specific adaptation.
• Out-of-the-box LLM performance varies significantly depending on dataset language, annotation guidelines, and domain.
• Clinical NLP models require constant adaptation due to evolving medical knowledge, making static benchmarks insufficient.
• Expert validation remains crucial: final scores change based on medical experts' feedback and ontology adjustments.
• This study highlights the necessity of domain-specific customization in clinical NLP to ensure high-quality results in real-world healthcare applications.
Download our full technical research paper by completing the form.