Beagle is a detector of interesting things in text. Its intended use is in stream search applications. Let us say you need to monitor a stream of text documents (web crawl results, chat messages, corporate documents) for mentions of various keywords. Beagle will help you quickly set up such a system and start monitoring your documents.
The implementation is based on the Lucene monitor library, which is in turn based on Luwak.
(require '[beagle.phrases :as phrases])

;; Exact phrase matching
(let [dictionary [{:text "to be annotated" :id "1"}]
      annotator (phrases/annotator dictionary :type-name "LABEL")]
  (annotator "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; Case-insensitive matching
(let [dictionary [{:text "TO BE ANNOTATED" :id "1" :case-sensitive? false}]
      annotator (phrases/annotator dictionary :type-name "LABEL")]
  (annotator "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; ASCII folding
(let [dictionary [{:text "TÖ BE ÄNNÖTÄTED" :id "1" :case-sensitive? false :ascii-fold? true}]
      annotator (phrases/annotator dictionary :type-name "LABEL")]
  (annotator "before annotated to be annotated after annotated"))
=> ({:text "to be annotated", :type "LABEL", :dict-entry-id "1", :meta {}, :begin-offset 17, :end-offset 32})

;; Stemming support for multiple languages
(let [dictionary [{:text "Kaunas" :id "1" :stem? true :stemmer :lithuanian}]
      annotator-fn (phrases/annotator dictionary)]
  (annotator-fn "Kauno miestas"))
=> ({:text "Kauno", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 0, :end-offset 5})

;; Phrases also support slop (i.e. the edit distance between term positions)
(let [txt "before start and end after"
      dictionary [{:text "start end" :id "1" :slop 1}]
      annotator-fn (phrases/annotator dictionary)]
  (annotator-fn txt))
=> ({:text "start and end", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 7, :end-offset 20})
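To make the slop option more concrete, here is a minimal Java sketch of the idea. It is not how Beagle or Lucene implement it: it assumes plain whitespace tokenization and only allows extra tokens between phrase terms that appear in order, while Lucene's slop also permits reordering. The class `SlopSketch` and its `match` method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlopSketch {
    // Returns the matched stretch of text, or null when the phrase
    // does not occur within the allowed slop. Illustration only.
    static String match(String text, String phrase, int slop) {
        List<String> tokens = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
            offsets.add(new int[] {m.start(), m.end()});
        }
        String[] terms = phrase.trim().split("\\s+");
        for (int start = 0; start < tokens.size(); start++) {
            if (!tokens.get(start).equals(terms[0])) continue;
            int pos = start; // position of the most recently matched term
            boolean matched = true;
            for (int t = 1; t < terms.length && matched; t++) {
                // Take the earliest later occurrence of the next term.
                int next = pos + 1;
                while (next < tokens.size() && !tokens.get(next).equals(terms[t])) next++;
                if (next == tokens.size()) matched = false; else pos = next;
            }
            // Tokens inside the window that are not phrase terms.
            int extraTokens = pos - start - (terms.length - 1);
            if (matched && extraTokens <= slop) {
                return text.substring(offsets.get(start)[0], offsets.get(pos)[1]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(match("before start and end after", "start end", 1)); // start and end
        System.out.println(match("before start and end after", "start end", 0)); // null
    }
}
```

With a slop of 1 the single extra token "and" is tolerated; with a slop of 0 the phrase terms must be adjacent, so nothing matches.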
Example of using the annotator from Java:
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;

// DictionaryEntry, Annotator and Annotation are provided by Beagle.
DictionaryEntry dictionaryEntry = new DictionaryEntry("test phrase");
dictionaryEntry.setSlop(1);
// Options used when constructing the annotator
HashMap<String, Object> annotatorOptions = new HashMap<>();
annotatorOptions.put("type-name", "LABEL");
annotatorOptions.put("validate-dictionary?", true);
Annotator annotator = new Annotator(Arrays.asList(dictionaryEntry), annotatorOptions);
// Options used for each annotate call
HashMap<String, Object> annotationOptions = new HashMap<>();
annotationOptions.put("merge-annotations?", true);
Collection<Annotation> annotations = annotator.annotate("This is my test phrase", annotationOptions);
annotations.forEach(s -> System.out.println("Annotated: '" + s.text() + "' at offset: " + s.beginOffset() + ":" + s.endOffset()));
// => Annotated: 'test phrase' at offset: 11:22
All the options available in the Clojure interface are also available in Java. To translate them, convert the Clojure keywords in both the annotator and the annotation options maps to strings, e.g.
:case-sensitive? => "case-sensitive?"
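The keyword-to-string translation can be sketched in a few lines of Java. The helper `toOptionKey` and the class `OptionKeys` are hypothetical and not part of Beagle; the sketch only assumes that the string key is the keyword name without its leading colon, as in the example above.

```java
import java.util.HashMap;
import java.util.Map;

class OptionKeys {
    // Hypothetical helper: ":case-sensitive?" becomes "case-sensitive?".
    static String toOptionKey(String clojureKeyword) {
        return clojureKeyword.startsWith(":") ? clojureKeyword.substring(1) : clojureKeyword;
    }

    public static void main(String[] args) {
        // Build an options map using the option names from the Clojure examples.
        Map<String, Object> annotatorOptions = new HashMap<>();
        annotatorOptions.put(toOptionKey(":type-name"), "LABEL");
        annotatorOptions.put(toOptionKey(":validate-dictionary?"), true);
        System.out.println(annotatorOptions.get("type-name")); // LABEL
    }
}
```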
Add the Clojars repository to your pom.xml:
<repositories>
    <repository>
        <id>clojars.org</id>
        <url>https://repo.clojars.org</url>
    </repository>
</repositories>
and then add the dependency on beagle:
<dependency>
    <groupId>lt.tokenmill</groupId>
    <artifactId>beagle</artifactId>
    <version>0.1.6-SNAPSHOT</version>
</dependency>
Three file formats are supported: csv, edn, json.
CSV dictionaries use "," as the separator and the double quote character (") as the escape character.
The first line MUST be a header.
Supported header keys: ["text" "type" "id" "synonyms" "case-sensitive?" ":ascii-fold?" "meta"]
The order of the columns is not important.
Under synonyms, there should be a list of strings separated by ";".
Under meta, there should be a list of strings separated by ";". An even number of strings is expected. If the number is odd, the last one is ignored.
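The rule for the meta column can be illustrated with a short Java sketch. It assumes the strings are read as alternating keys and values, which the text only implies by expecting an even count; `MetaField` and `parse` are hypothetical names and this is not Beagle's own reader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetaField {
    // Splits a meta cell on ";" and pairs the strings up as key, value.
    // A trailing string without a partner is ignored.
    static Map<String, String> parse(String cell) {
        Map<String, String> meta = new LinkedHashMap<>();
        if (cell == null || cell.isEmpty()) {
            return meta;
        }
        String[] parts = cell.split(";");
        for (int i = 0; i + 1 < parts.length; i += 2) {
            meta.put(parts[i], parts[i + 1]);
        }
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(parse("source;crawl;lang;en"));  // {source=crawl, lang=en}
        System.out.println(parse("source;crawl;dangling")); // {source=crawl}
    }
}
```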
The dictionary validator accepts any number of dictionaries, as long as they are provided in pairs as "/path/to/dictionary/file" "file-type".
To use the validator directly, execute the command:
clj -m beagle.validator "/path/to/dictionary/file" "file-type" "/path/to/dictionary/file2" "file-type" & ...
For example:
clj -m beagle.validator "your-dict.csv" "csv" "your-other-dict.json" "json"
Example in GitLab CI:
validate-dictionaries:
  stage: dictionary-validation
  when: always
  image: registry.gitlab.com/tokenmill/clj-luwak/dictionary-validator:2
  script:
    - >
      dictionary-validator
      /path/to/dict.csv csv
      /path/to/dict.json json
      /path/to/dict.edn edn
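The dictionary optimizer supports the following optimizations: removing duplicate entries, merging synonyms, and a synonyms and text equality check.
There are cases when dictionary entries can't be merged, such as when their text analysis options (e.g. :case-sensitive?) differ.
Examples: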
(require '[beagle.dictionary-optimizer :as optimizer])

; Remove duplicates
(let [dictionary [{:text "TO BE ANNOTATED" :id "1"}
                  {:text "TO BE ANNOTATED"}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :id "1"})

; Merge synonyms
(let [dictionary [{:text "TO BE ANNOTATED" :synonyms ["ONE"]}
                  {:text "TO BE ANNOTATED" :synonyms ["TWO"]}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :synonyms ("TWO" "ONE")})

; Synonyms and text equality check
(let [dictionary [{:text "TO BE ANNOTATED" :synonyms ["TO BE ANNOTATED"]}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :synonyms ["TO BE ANNOTATED"]})

; Can't be merged because of differences in text analysis
(let [dictionary [{:text "TO BE ANNOTATED" :case-sensitive? true}
                  {:text "TO BE ANNOTATED" :case-sensitive? false}]]
  (optimizer/optimize dictionary))
=> ({:text "TO BE ANNOTATED", :case-sensitive? true} {:text "TO BE ANNOTATED", :case-sensitive? false})
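The annotation merger only merges annotations of the same type.
Handled cases include nested annotations, where an annotation that lies inside another annotation of the same type is dropped in favour of the enclosing one.
Examples: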
(require '[beagle.phrases :as phrases])
(require '[beagle.annotation-merger :as merger])

; "TEST" is nested inside "This TEST is", so only the longer annotation is kept
(let [dictionary [{:text "TEST"}
                  {:text "This TEST is"}]
      annotator (phrases/annotator dictionary)
      annotations (annotator "This TEST is")]
  (println "Annotations: " annotations)
  (merger/merge-same-type-annotations annotations))
Annotations: ({:text TEST, :type PHRASE, :dict-entry-id 0, :meta {}, :begin-offset 5, :end-offset 9} {:text This TEST is, :type PHRASE, :dict-entry-id 1, :meta {}, :begin-offset 0, :end-offset 12})
=> ({:text "This TEST is", :type "PHRASE", :dict-entry-id "1", :meta {}, :begin-offset 0, :end-offset 12})
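The behaviour shown above can be approximated with a small Java sketch. This is an assumption about the general idea, not Beagle's merger: it drops any annotation whose offsets lie inside an already kept annotation of the same type, and leaves partially overlapping annotations untouched. `MergeSketch`, `Span` and `mergeSameType` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeSketch {
    // Simplified stand-in for an annotation: a type plus character offsets.
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    static List<Span> mergeSameType(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Earliest start first; for equal starts, the longest span first.
        sorted.sort(Comparator.comparingInt((Span s) -> s.begin)
                .thenComparingInt(s -> -s.end));
        List<Span> kept = new ArrayList<>();
        for (Span candidate : sorted) {
            boolean covered = false;
            for (Span k : kept) {
                if (k.type.equals(candidate.type)
                        && k.begin <= candidate.begin && candidate.end <= k.end) {
                    covered = true; // duplicate of, or nested inside, a kept span
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> merged = mergeSameType(Arrays.asList(
                new Span("PHRASE", 5, 9),
                new Span("PHRASE", 0, 12)));
        for (Span s : merged) {
            System.out.println(s.type + " " + s.begin + ":" + s.end); // PHRASE 0:12
        }
    }
}
```

Because the type is part of the containment check, a nested annotation of a different type would survive, which mirrors the rule that only annotations of the same type are merged.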
Copyright © 2019 TokenMill UAB.
Distributed under the Apache License, Version 2.0.