Tokenizing a string into its constituent words or tokens can prove tricky for non-trivial examples. With natural language in particular, punctuation must also be taken into account in order to isolate the words. The jTokenizer package combines a set of tokenizers that range from basic whitespace tokenizers to more complex ones that deal intuitively with natural language.
Each tokenizer adopts a structure similar to java.util.StringTokenizer in how it is instantiated and how its tokens are extracted. This makes the tokenizers simple to use, especially for those already familiar with Java's StringTokenizer.
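For readers who have not used it, the instantiate-then-iterate pattern of java.util.StringTokenizer looks roughly like the sketch below. It uses only the JDK class; the jTokenizer classes are described as following a similar structure, but their exact constructors and method names are not shown here, so this should not be read as jTokenizer's own API. The class name StringTokenizerSketch and the helper method tokens are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch using the JDK's StringTokenizer only (not jTokenizer code).
class StringTokenizerSketch {

    // Collects every token produced by java.util.StringTokenizer.
    // Passing null for delimiters selects the JDK default set " \t\n\r\f".
    static List<String> tokens(String text, String delimiters) {
        StringTokenizer st = (delimiters == null)
                ? new StringTokenizer(text)
                : new StringTokenizer(text, delimiters);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {   // ask whether another token is available
            out.add(st.nextToken());   // then pull it
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The quick\tbrown\nfox", null)); // [The, quick, brown, fox]
        System.out.println(tokens("a,b;c", ",;"));                 // [a, b, c]
    }
}
```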
jTokenizer comprises six tokenizers, all of which extend an abstract Tokenizer class:
- WhiteSpaceTokenizer - this splits a string on all occurrences of whitespace, which includes spaces, newlines, tabs and linefeeds.
- StringTokenizer - this is essentially the same as java.util.StringTokenizer with some extra methods (and it extends Tokenizer). Its default behaviour is to act as a WhiteSpaceTokenizer; however, you can specify a set of characters to be used as word delimiters.
- RegexTokenizer - this tokenizer is much more flexible, as you can use a regular expression to define what a token is. For example, "\\w+" treats every run of one or more word characters (letters, digits or underscores) as a token. By default, it uses a regular expression equivalent to a whitespace tokenizer.
- RegexSeparatorTokenizer - this can be thought of as an advanced StringTokenizer. Whereas StringTokenizer is limited to defining delimiters as a set of individual characters, RegexSeparatorTokenizer can utilise regular expressions for a richer and more flexible approach.
- BreakIteratorTokenizer - one of the most sophisticated tokenizers in the library, although it should only be used on natural language strings to isolate words. It comes with built-in rules for finding words, such as knowing how to disregard punctuation.
- SentenceTokenizer - this also uses a BreakIterator, like BreakIteratorTokenizer, but is tuned towards finding sentence boundaries. The "tokens" in this tokenizer are in fact individual sentences.
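The difference between these approaches can be sketched with plain JDK classes: a regular expression that describes the tokens themselves, a regular expression that describes the separators between tokens, and java.text.BreakIterator for word and sentence boundaries. This is only an approximation of the ideas in the list above, not jTokenizer's implementation; the class TokenizingSketch, its method names, and the choice to drop pieces that do not start with a letter or digit are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch built on JDK classes only (not jTokenizer code).
class TokenizingSketch {

    // Token-matching style: every match of the pattern is a token.
    static List<String> matchTokens(String text, String tokenRegex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Separator style: the pattern describes what lies between tokens.
    static List<String> splitOnSeparator(String text, String separatorRegex) {
        List<String> out = new ArrayList<>();
        for (String piece : Pattern.compile(separatorRegex).split(text)) {
            if (!piece.isEmpty()) {   // skip the empty piece a leading separator produces
                out.add(piece);
            }
        }
        return out;
    }

    // Word boundaries from BreakIterator. The iterator also reports spaces and
    // punctuation as pieces, so (as an assumption of this sketch) only pieces
    // that start with a letter or digit are kept.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    // Sentence boundaries from BreakIterator; each "token" is a whole sentence.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matchTokens("Hello, world! It's 2024.", "\\w+")); // [Hello, world, It, s, 2024]
        System.out.println(splitOnSeparator("a, b;c", "[,;]\\s*"));          // [a, b, c]
        System.out.println(words("Hello, world!", Locale.ENGLISH));
        System.out.println(sentences("Is it ready? Yes! Ship it.", Locale.ENGLISH));
    }
}
```

Note how "\\w+" splits "It's" into two tokens because the apostrophe is not a word character; this is the kind of case where a tokenizer with built-in natural language rules is the better fit.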
jTokenizer also bundles a GUI. You can type text in, paste it, or load it from a text file. Select your tokenizer of choice (and any options of interest), then press the Tokenize button. The results are displayed as soon as they have been processed, and you have the option of saving them to a file.
The GUI is particularly useful for experimenting with tokenization methods in a teaching environment (such as an NLP course). It will also be of interest to those who wish to use the jTokenizer library but lack the Java programming experience to use the code directly.