A simple wrapper around the Boilerpipe text extraction library.
boilerpipe-clj is released on Clojars.
With Leiningen, add it to the dependencies in project.clj:
[run.avelino/boilerpipe-clj "0.3.1"]
The main namespace for Boilerpipe operations in the Clojure wrapper is
boilerpipe-clj.core.
user=> (use 'boilerpipe-clj.core)
The main function for extracting human-readable text from an HTML document is
get-text.
user=> (def article (slurp "https://help.github.com/articles/open-source-licensing"))
#'user/article
user=> (get-text article)
"all\nPublic repositories on GitHub are often used to share open source software. Open source software is software that is licensed so that others are free to use, change, [...]"
get-text expects the HTML as a String in its first argument. To use a different extraction strategy, pass an extractor instance as the second argument.
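As a rough sketch of the two call shapes outside the REPL, the snippet below uses `require` with aliases instead of `use`. The aliases `bp` and `ex` and the `sample-html` string are illustrative assumptions, not part of the library; only `get-text` and `default-extractor` come from this README.

```clojure
(require '[boilerpipe-clj.core :as bp]
         '[boilerpipe-clj.extractors :as ex])

;; Hypothetical sample input; any HTML held in a String works here.
(def sample-html
  (str "<html><head><title>Sample</title></head><body>"
       "<div id=\"nav\">Home | About</div>"
       "<p>Open source software is software that is licensed so that "
       "others are free to use, change, and share it.</p>"
       "</body></html>"))

;; One-argument form: the wrapper picks its default strategy.
(def text-default (bp/get-text sample-html))

;; Two-argument form: the strategy is passed explicitly.
(def text-explicit (bp/get-text sample-html ex/default-extractor))
```

Both calls return a String. What ends up in it depends on the strategy and on the page, so short inputs like this one may yield little or no text.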
user=> (get-text article boilerpipe-clj.extractors/default-extractor)
"Open source licensing\nWhich license is right for me?!\nDon't fret! Choosing an open source license can be confusing. That's why we created choosealicense.com , a website that helps you make decisions about how to license your code. [...]"
The most frequently used extraction strategies are defined in
boilerpipe-clj.extractors. These are:
- ArticleExtractor (the default)
- DefaultExtractor
- ArticleSentenceExtractor
Defining your own strategies is not currently possible from Clojure. Please refer to the Boilerpipe documentation for more info on implementing them in Java.
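Only `default-extractor` is named explicitly in this README, so one cautious way to find the Clojure names of the other strategies is to list the public vars of the extractors namespace. The helper names `extractor-names` and `compare-strategies` below are hypothetical, and the sketch assumes the namespace exposes its strategies as plain public vars.

```clojure
(require '[boilerpipe-clj.core :as bp]
         '[boilerpipe-clj.extractors :as ex])

;; Hypothetical helper: sorted symbols of every public var in
;; boilerpipe-clj.extractors, to see what this version exposes.
(defn extractor-names []
  (sort (keys (ns-publics 'boilerpipe-clj.extractors))))

;; Hypothetical helper: run the one-argument form and the explicit
;; default-extractor form on the same HTML for side-by-side comparison.
(defn compare-strategies [html]
  {:one-arg           (bp/get-text html)
   :default-extractor (bp/get-text html ex/default-extractor)})

(extractor-names)
```

Any other var returned by `extractor-names` can be passed to `get-text` in the same way as `ex/default-extractor`, provided it holds an extractor instance.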
boilerpipe-clj is provided under the Apache License 2.0 (ASL 2.0).
The full license is available in LICENSE.md.
- de.l3s.boilerpipe/boilerpipe to com.robbypond/boilerpipe: Google Code deprecated
- run.avelino/boilerpipe-clj on Clojars
Contributors: Curtis Gagliardi, Avelino & Nick Barnwell