jTokeniser
Overview
Tokenising a string into its constituent words/tokens can prove tricky for non-trivial examples. In particular, when dealing with natural language, you must also take punctuation into account in order to isolate the words. The jTokeniser package combines a set of tokenisers that range from basic whitespace tokenisers to more complex ones that deal intuitively with natural language.
Each of the tokenisers adopts a structure similar to java.util.StringTokenizer in how the classes are instantiated and how tokens are extracted, which makes them simple to use.
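For readers unfamiliar with that pattern, here is a minimal sketch of the java.util.StringTokenizer idiom (construct with a string, then loop on hasMoreTokens()/nextToken()). It uses only the JDK class; the wrapper class StringTokenizerPattern and its tokens method are hypothetical names chosen for illustration, and the exact jTokeniser constructors and method names should be checked against the Javadoc rather than inferred from this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class; not part of jTokeniser.
class StringTokenizerPattern {

    // Collects tokens with the JDK's hasMoreTokens()/nextToken() loop.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        // With no delimiter argument, StringTokenizer splits on
        // space, tab, newline, carriage return and form feed.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [The, quick, brown, fox]
        System.out.println(tokens("The quick\tbrown fox"));
    }
}
```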
jTokeniser is free and open source. It is released under the LGPL licence.
News
Features
jTokeniser comprises six tokenisers that all extend an abstract Tokeniser class:
- WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace, which includes spaces, newlines, tabs and linefeeds.
- StringTokeniser - this is basically the same as java.util.StringTokenizer with some extra methods (and it extends Tokeniser). By default it acts as a WhiteSpaceTokeniser, but you can specify a set of characters to be used as word delimiters.
- RegexTokeniser - this tokeniser is much more flexible, as you can use a regular expression to define what a token is. For example, "\\w+" treats every run of one or more word characters (letters, digits or underscores) as a token. By default, it uses a regular expression equivalent to a whitespace tokeniser.
- RegexSeparatorTokeniser - this can be thought of as an advanced StringTokeniser. Whereas StringTokeniser is limited to defining delimiters as a set of individual characters, RegexSeparatorTokeniser can utilise regular expressions for a richer and more flexible approach.
- BreakIteratorTokeniser - one of the most sophisticated tokenisers in the library, although it should only be used on natural language strings to isolate words. It comes with built-in rules about how to find words, such as how to disregard punctuation.
- SentenceTokeniser - this also uses a BreakIterator like the above, but is tuned towards finding sentence boundaries. The "tokens" in this tokeniser are in fact individual sentences.
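The techniques behind the list above can be sketched with JDK classes alone: a regular expression that describes the tokens, a regular expression that describes the separators, and java.text.BreakIterator for words and sentences. This is an illustration of the underlying ideas, not jTokeniser's implementation; the class TokenisingApproaches and all of its method names are hypothetical, and the punctuation filter in words() is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of jTokeniser.
class TokenisingApproaches {

    // The regex describes what a token looks like (e.g. "\\w+").
    static List<String> byTokenPattern(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // The regex describes what separates tokens (e.g. "[,;]\\s*").
    static List<String> bySeparatorPattern(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.split(separatorRegex)) {
            if (piece.length() > 0) { // skip empty pieces between adjacent separators
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Word boundaries from BreakIterator; pieces that start with
    // punctuation or whitespace are dropped (an assumption of this sketch).
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<String>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Sentence boundaries from BreakIterator; each token is a trimmed sentence.
    static List<String> sentences(String text) {
        List<String> tokens = new ArrayList<String>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (sentence.length() > 0) {
                tokens.add(sentence);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(byTokenPattern("Hello, world! It's 9am.", "\\w+")); // [Hello, world, It, s, 9am]
        System.out.println(bySeparatorPattern("a, b;c", "[,;]\\s*"));          // [a, b, c]
        System.out.println(words("Hello, world!"));
        System.out.println(sentences("Hello there. How are you?"));
    }
}
```

Note how the token pattern "\\w+" splits "It's" at the apostrophe, which is one reason a rule-based word iterator tends to suit natural language better than a simple regular expression.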
Screenshots
Examples
JTokeniserExample.java is a Java program illustrating how to utilise the various tokenisers.
Download
Please select the version you wish to use:
More details can be seen in the changelog.
NOTE: Before you can use jTokeniser on your system, you must ensure that the Java Runtime Environment (version 1.5/5.0 or greater) is installed on your computer. It can be downloaded from http://java.sun.com.
Documentation
Javadoc API for jTokeniser is here
The README also contains details about how to make your classes access the jTokeniser package.
Documentation for older versions:
| Version | File |
| --- | --- |
| 2.0 | jTokeniser-2.0_docs.zip |
| 1.2 | jTokeniser_docs-1.2.zip |
| 1.0 | jTokeniser_docs-1.0.zip |
