A simple wrapper around the Boilerpipe text extraction library.
boilerpipe-clj is released on Clojars.
With Leiningen, add it to the dependencies in project.clj:
[run.avelino/boilerpipe-clj "0.3.1"]
The main namespace for Boilerpipe operations in the Clojure wrapper is boilerpipe-clj.core.
user=> (use 'boilerpipe-clj.core)
The main function for extracting human-readable text from an HTML document is get-text.
user=> (def article (slurp "https://help.github.com/articles/open-source-licensing"))
#'user/article
user=> (get-text article)
"all\nPublic repositories on GitHub are often used to share open source software. Open source software is software that is licensed so that others are free to use, change, [...]"
get-text expects the HTML as a String for its first argument. To use a different extraction strategy, pass an extractor instance as the second argument.
user=> (get-text article boilerpipe-clj.extractors/default-extractor)
"Open source licensing\nWhich license is right for me?!\nDon't fret! Choosing an open source license can be confusing. That's why we created choosealicense.com , a website that helps you make decisions about how to license your code. [...]"
The most frequently used extraction strategies are defined in boilerpipe-clj.extractors. These are:
- ArticleExtractor
- DefaultExtractor
- ArticleSentenceExtractor
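As a rough sketch of the two calling forms of get-text described above, the snippet below uses require with aliases instead of use, and an inline HTML string so that no network access is needed. The namespace name example.extract and the var html are hypothetical names chosen for illustration; get-text and boilerpipe-clj.extractors/default-extractor are the only library names used, both taken from the examples above.

```clojure
(ns example.extract
  ;; `example.extract` is a hypothetical namespace name for this sketch.
  (:require [boilerpipe-clj.core :as boilerpipe]
            [boilerpipe-clj.extractors :as extractors]))

;; A small inline HTML document (hypothetical content), passed as a String.
(def html
  (str "<html><body>"
       "<h1>Example title</h1>"
       "<p>Some paragraph text that reads like the body of an article.</p>"
       "</body></html>"))

;; One-argument form: get-text picks its own extraction strategy.
(boilerpipe/get-text html)

;; Two-argument form: pass an extractor instance explicitly.
(boilerpipe/get-text html extractors/default-extractor)
```

Both calls return a String. On very short documents like this one the extracted text may differ between strategies or come back nearly empty, so real pages are a better basis for comparing extractors.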
Defining your own strategies is not currently possible from Clojure. Please refer to the Boilerpipe documentation for more info on implementing them in Java.
boilerpipe-clj is provided under the ASL 2.0 license.
The full license is available in LICENSE.md
- de.l3s.boilerpipe/boilerpipe to com.robbypond/boilerpipe: Google Code deprecated
- run.avelino/boilerpipe-clj on Clojars
Contributors: Curtis Gagliardi, Avelino & Nick Barnwell