A Clojure library of tools for biomedical NLP tasks, such as Named Entity Recognition (NER) for diseases, chemicals, genes, and procedures. Bionlp uses Hugging Face transformers models to perform NER, and the UMLS lexical tools to resolve named entities to UMLS CUIs (Concept Unique Identifiers).
Add the dependency in Leiningen:
[md.datum/bionlp "0.1.0"]
$> pip install --user transformers==3.1.0
$> pip install --user onnxruntime
$> pip install --user git+https://github.com/patil-suraj/onnx_transformers
[clj-python/libpython-clj "2.00-beta-15"]
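Taken together, the two Clojure coordinates above might sit in a project.clj like the one below. This is only a sketch: the project name bionlp-proj is borrowed from the example path used further down, and the project version and Clojure version are assumptions rather than requirements stated by bionlp.

```clojure
;; project.clj (sketch). Only the two coordinates marked "from this README"
;; come from the source; everything else is an assumption.
(defproject bionlp-proj "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.3"]              ; assumed Clojure version
                 [md.datum/bionlp "0.1.0"]                   ; from this README
                 [clj-python/libpython-clj "2.00-beta-15"]]  ; from this README
  ;; Leiningen's default resource directory, listed explicitly for clarity
  :resource-paths ["resources"])
```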
$> mvn install:install-file -Dfile=lib/lvg2021api.jar -DpomFile=pom.xml
$> mvn install:install-file -Dfile=lib/lvg2021dist.jar -DpomFile=pom.xml
Note: You will also need to copy the lvg.properties file from data/config to your project's resources folder and rename it to 'data.config.lvg':
$> cp data/config/lvg.properties ~/projects/bionlp-proj/resources/data.config.lvg
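To confirm the copied file is visible to the running project, a quick REPL check along these lines may help. It assumes the file is picked up from the classpath (which copying it into resources suggests, though this README does not say so explicitly); resource-present? is a hypothetical helper, not part of bionlp.

```clojure
(require '[clojure.java.io :as io])

(defn resource-present?
  "Returns true when resource-name can be found on the classpath.
  Hypothetical helper, not part of bionlp."
  [resource-name]
  (some? (io/resource resource-name)))

;; Should be true once the file has been copied into resources/
;; and the REPL has been restarted.
(resource-present? "data.config.lvg")
```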
To run BioBERT NER, first instantiate a transformers NER pipeline, then pass it to the batched-ner function along with the text you want to classify.
user> (require '[bionlp.biobert :as biobert])
nil
user> (def condition-nlp (biobert/nlp-pipeline "resources/models/output/NCBI-disease"))
#'user/condition-nlp
user> (def results (biobert/batched-ner condition-nlp "The objective of this study was to provide more accurate frequency estimates of breast cancer susceptibility gene 1 (BRCA1) germline alterations in the ovarian cancer population"))
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
#'user/results
user> results
({:token "breast cancer", :index 15} {:token "ovarian cancer", :index 38})
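Judging by the output above, batched-ner returns a sequence of maps with :token and :index keys, so ordinary sequence functions apply to it. The sketch below re-creates that output as literal data so it runs without the model; entity-names is a name chosen here for illustration.

```clojure
;; Literal copy of the REPL output above, so the sketch runs without the model
(def results
  [{:token "breast cancer", :index 15}
   {:token "ovarian cancer", :index 38}])

;; Distinct entity strings, in order of first appearance
(def entity-names (distinct (map :token results)))

entity-names
;; => ("breast cancer" "ovarian cancer")
```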
Once you've identified tokens, you can look up the UMLS CUI for each matching concept as follows:
user> (require '[bionlp.umls :as umls])
nil
user> ;; First create a concept trie based on the TUIs of a semantic group
(def concept-trie (umls/create-concept-lookup-trie :disease))
#'user/concept-trie
user> ;; Do a lookup for each token
(map #(umls/lookup-concept concept-trie (:token %)) results)
("C0678222" "C1140680")
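It is often handy to keep each CUI next to the entity it came from rather than in a separate list. A minimal sketch follows: annotate-with-cui and stub-lookup are hypothetical names, and the stub map simply repeats the values shown above so the code runs without the UMLS data.

```clojure
(defn annotate-with-cui
  "Adds a :cui key to each entity map. lookup-fn takes a token string
  and returns a CUI (or nil). Hypothetical helper, not part of bionlp."
  [lookup-fn entities]
  (map #(assoc % :cui (lookup-fn (:token %))) entities))

;; Literal copy of the NER output shown earlier
(def results
  [{:token "breast cancer", :index 15}
   {:token "ovarian cancer", :index 38}])

;; Stand-in for the real lookup, built from the CUIs shown above
(def stub-lookup
  {"breast cancer"  "C0678222"
   "ovarian cancer" "C1140680"})

(annotate-with-cui stub-lookup results)
;; => ({:token "breast cancer", :index 15, :cui "C0678222"}
;;     {:token "ovarian cancer", :index 38, :cui "C1140680"})
```

In a live session, the stub would be swapped for #(umls/lookup-concept concept-trie %), matching the call used above.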
Copyright © 2021 datum.md
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.