bionlp

A Clojure library of tools for biomedical NLP tasks, such as Named Entity Recognition (NER) for diseases, chemicals, genes, and procedures. bionlp uses Hugging Face transformers models to perform NER, and the UMLS Lexical Tools to resolve named entities to UMLS CUIs (Concept Unique Identifiers).

Usage

Add the dependency in Leiningen:

[md.datum/bionlp "0.1.0"]
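For context, here is a minimal sketch of where that vector sits in a Leiningen project.clj. The project name bionlp-proj (borrowed from the example path used later in this README) and the Clojure version are assumptions; the libpython-clj coordinate is the one listed under Additional Dependencies below.

```clojure
;; project.clj (sketch; project name and Clojure version are assumptions)
(defproject bionlp-proj "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.3"]            ; assumed version
                 [md.datum/bionlp "0.1.0"]
                 ;; required alongside bionlp, see Additional Dependencies
                 [clj-python/libpython-clj "2.00-beta-15"]]
  ;; "resources" is Leiningen's default resource path; shown for clarity
  :resource-paths ["resources"])
```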

Required Python Packages

  • transformers >= 3.1.0
$> pip install --user transformers==3.1.0
  • onnxruntime >= 1.8.1
$> pip install --user onnxruntime
  • onnx_transformers
$> pip install --user git+https://github.com/patil-suraj/onnx_transformers

Additional Dependencies

  • bionlp depends on libpython-clj, a Clojure interface to Python. Add it as a dependency:
[clj-python/libpython-clj "2.00-beta-15"]
  • bionlp uses clojure-opennlp. You do not need to add it as a dependency, but you do need to download a POS model, en-pos-maxent.bin, into the resources/models directory.
  • bionlp uses the UMLS RRF files, in particular MRCONSO.RRF and MRSTY.RRF. Copy these into the resources folder.
  • bionlp uses the UMLS Lexical Tools to perform term inflection and variant generation. After downloading them, install the jars into your local Maven repository. Note: run these commands from the root folder of the download, i.e. lvg2021:
$> mvn install:install-file -Dfile=lib/lvg2021api.jar -DpomFile=pom.xml
$> mvn install:install-file -Dfile=lib/lvg2021dist.jar -DpomFile=pom.xml

Note: You will also need to copy the lvg.properties file from data/config to your resources folder and rename it to 'data.config.lvg':

 $> cp data/config/lvg.properties ~/projects/bionlp-proj/resources/data.config.lvg
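Since several files have to be placed by hand, a quick check from the REPL can catch a missing one before the library is loaded. The sketch below is not part of bionlp: missing-files and required-files are hypothetical names, and the relative paths are inferred from the setup steps above, so adjust them if your layout differs.

```clojure
(require '[clojure.java.io :as io])

;; Paths relative to the resources folder, inferred from the steps above
;; (an assumption, not something bionlp exposes).
(def required-files
  ["models/en-pos-maxent.bin"
   "MRCONSO.RRF"
   "MRSTY.RRF"
   "data.config.lvg"])

(defn missing-files
  "Returns the entries of `paths` that do not exist under directory `dir`."
  [dir paths]
  (remove #(.exists (io/file dir %)) paths))

;; An empty result means every expected file is in place.
(missing-files "resources" required-files)
```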

Basic Usage (From REPL)

To run BioBERT NER, first instantiate a transformers NER pipeline, then pass it to the batched-ner function along with the text you want to classify.

user> (require '[bionlp.biobert :as biobert])
nil

user> (def condition-nlp (biobert/nlp-pipeline "resources/models/output/NCBI-disease"))
#'user/condition-nlp

user> (def results (biobert/batched-ner condition-nlp "The objective of this study was to provide more accurate frequency estimates of breast cancer susceptibility gene 1 (BRCA1) germline alterations in the ovarian cancer population"))
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
#'user/results

user> results
({:token "breast cancer", :index 15} {:token "ovarian cancer", :index 38})

Once you've identified tokens, you can look up the UMLS CUI for each matching concept as follows:

user> (require '[bionlp.umls :as umls])
nil

user> ;; First create a concept trie based on the TUIs of semantic groups
(def concept-trie (umls/create-concept-lookup-trie :disease))
#'user/concept-trie

user> ;; Do a lookup for each token
(map #(umls/lookup-concept concept-trie (:token %)) results)
("C0678222" "C1140680")
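Because map returns the CUIs in the same order as the NER results, it can be handy to keep each CUI next to its token. The following is a minimal sketch, not part of bionlp: attach-cuis is a hypothetical helper, and example-lookup is a stand-in built from the output shown above so that the snippet runs without UMLS data.

```clojure
(defn attach-cuis
  "Adds a :cui key to each NER result. `lookup-fn` maps a token string
  to a CUI, or nil when there is no match."
  [lookup-fn results]
  (mapv #(assoc % :cui (lookup-fn (:token %))) results))

;; Stand-in lookup (Clojure maps are callable as functions of their keys).
(def example-lookup
  {"breast cancer"  "C0678222"
   "ovarian cancer" "C1140680"})

(attach-cuis example-lookup
             [{:token "breast cancer", :index 15}
              {:token "ovarian cancer", :index 38}])
;; => [{:token "breast cancer", :index 15, :cui "C0678222"}
;;     {:token "ovarian cancer", :index 38, :cui "C1140680"}]
```

With the real trie, the lookup function would be #(umls/lookup-concept concept-trie %), passed together with results from batched-ner.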

Changelog

Release 0.1.1

  • Added option to exclude UMLS sources from concept lookup

License

Copyright © 2021 datum.md

This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.
