Andrew Roberts @ School of Computing

jTokeniser

Overview

Tokenising a string into its constituent words/tokens can prove tricky for non-trivial examples. In particular, when dealing with natural language, you must also take punctuation into account in order to isolate the words. The jTokeniser package combines a set of tokenisers that range from basic whitespace tokenisers to more complex ones that deal intuitively with natural language.

Each of the tokenisers adopts a similar structure to java.util.StringTokenizer in terms of how to instantiate the class and extract the tokens. This means they are simple to use.
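For readers unfamiliar with that structure, the sketch below shows the instantiate-then-iterate pattern using the standard java.util.StringTokenizer only. It does not call any jTokeniser class: the class name StringTokenizerPattern and the helper tokens() are made up for illustration, and the exact method names offered by the jTokeniser classes should be checked against the Javadoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; it uses the JDK's StringTokenizer, not jTokeniser.
class StringTokenizerPattern {

    // Collects every token using the usual hasMoreTokens()/nextToken() loop.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        // With no delimiter argument, StringTokenizer splits on whitespace
        // (space, tab, newline, carriage return and form feed).
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word ("tricky," and "indeed.").
        System.out.println(tokens("Tokenising can be tricky, indeed."));
    }
}
```

Note how a plain whitespace split leaves the comma and full stop glued to the words, which is the problem the more sophisticated tokenisers are meant to address.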

jTokeniser is free and open source. It is released under the LGPL licence.

News

16th July 2006 - jTokeniser 2.0 released
After a long time resting, jTokeniser finally has a new release. The library itself remains as flexible and powerful as it always has, but I can now unveil a GUI front-end to the package.
You can type in, copy and paste, or even load a text file into the application. You must select your tokeniser of choice (and any options of interest) and then hit the Tokenise button. Your results will be displayed as soon as they are processed and you have the option to save the results to file, if you choose.
The GUI is particularly useful for experimenting with tokenisation methods in a teaching environment (such as an NLP course). It will also be of interest to those who wish to use the jTokeniser library but lack the Java programming experience to utilise the code directly.
See the screenshots below.
9th Aug 2005 - jTokeniser 1.2.1 released
Someone suggested that the regular expression based classes, RegexTokeniser and RegexSeparatorTokeniser, should offer the option of keeping the delimiters. In the former case, this means the strings in between the matched tokens. In the latter case, this means keeping the strings that were matched. The JavaDocs should make this a little clearer thanks to some examples.
4th Aug 2005 - jTokeniser 1.2 released
Almost 6 months after the initial release, I have added some more functionality to provide a more flexible library. A RegexSeparatorTokeniser class allows you to define complex delimiters that separate tokens. Also, the SentenceTokeniser should be very useful for those wishing to segment a piece of text into a set of sentences, where each "token" is in fact an individual sentence.
10th March 2005 - jTokeniser 1.0 released
Initial release of the jTokeniser library which contains a set of useful tokenisers that all adhere to a simple and uniform API.
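The idea of "keeping the delimiters" described in the 1.2.1 note can be illustrated with the standard java.util.regex classes. This is only a sketch of the concept, not the jTokeniser implementation: the class KeepDelimitersSketch and the method matchesAndGaps() are hypothetical names. When the pattern describes tokens, the delimiters are the gaps between matches; when the pattern describes separators, the delimiters are the matches themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class built on java.util.regex, not on jTokeniser.
class KeepDelimitersSketch {

    // Returns the matches of regex interleaved with the text between them,
    // in the order they occur in text. Empty gaps are skipped.
    static List<String> matchesAndGaps(String text, String regex) {
        List<String> parts = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        int last = 0; // end offset of the previous match
        while (m.find()) {
            if (m.start() > last) {
                parts.add(text.substring(last, m.start())); // text between matches
            }
            parts.add(m.group()); // the match itself
            last = m.end();
        }
        if (last < text.length()) {
            parts.add(text.substring(last)); // trailing text after the final match
        }
        return parts;
    }

    public static void main(String[] args) {
        // Pattern describes tokens: the gaps (", " and "!") are the delimiters.
        System.out.println(matchesAndGaps("Hello, world!", "\\w+"));
        // Pattern describes separators: the match (", ") is the delimiter.
        System.out.println(matchesAndGaps("a, b", ",\\s*"));
    }
}
```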

Features

jTokeniser comprises six tokenisers that all extend an abstract Tokeniser class:

  • WhiteSpaceTokeniser - this splits a string on all occurrences of whitespace, which includes spaces, newlines, tabs and linefeeds.
  • StringTokeniser - this is basically the same as java.util.StringTokenizer with some extra methods (and extends from Tokeniser). Its default behaviour is to act as a WhiteSpaceTokeniser. However, you can specify a set of characters that are to be used as word delimiters.
  • RegexTokeniser - this tokeniser is much more flexible as you can use regular expressions to define what a token is. So, "\\w+" means that whenever it matches one or more word characters (letters, digits or underscores), it will consider that a word. By default, it uses a regular expression equivalent to a whitespace tokeniser.
  • RegexSeparatorTokeniser - this can be thought of as an advanced StringTokeniser. Whereas StringTokeniser is limited to defining delimiters as a set of individual characters, RegexSeparatorTokeniser can utilise regular expressions for a richer and more flexible approach.
  • BreakIteratorTokeniser - one of the most sophisticated tokenisers in the library, although it should only be used on natural language strings to isolate words. It comes with built-in rules about how to find words, knowing how to disregard punctuation, etc.
  • SentenceTokeniser - this also uses a BreakIterator like the above, but tuned towards finding sentence boundaries. The "tokens" in this tokeniser are in fact individual sentences.
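To give a feel for what the last two tokenisers build on, here is a minimal sketch that uses the JDK's java.text.BreakIterator directly to find words and sentences. It is an illustration under assumptions rather than the library's own code: the class BreakIteratorSketch and its methods are hypothetical, and the rule of keeping only segments that start with a letter or digit is one simple way to disregard punctuation, not necessarily the rule BreakIteratorTokeniser applies.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class built on java.text.BreakIterator, not on jTokeniser.
class BreakIteratorSketch {

    // Cuts the text at every boundary reported by the given BreakIterator.
    static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<String>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // Word segments, keeping only those that begin with a letter or digit
    // (an assumed filter that drops whitespace and punctuation segments).
    static List<String> words(String text) {
        List<String> out = new ArrayList<String>();
        for (String s : segments(BreakIterator.getWordInstance(Locale.ENGLISH), text)) {
            if (Character.isLetterOrDigit(s.charAt(0))) {
                out.add(s);
            }
        }
        return out;
    }

    // Sentence segments with surrounding whitespace trimmed.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<String>();
        for (String s : segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text)) {
            out.add(s.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world! How are you?";
        System.out.println(words(text));
        System.out.println(sentences(text));
    }
}
```

The same segments() helper serves both cases; only the BreakIterator instance changes, which mirrors how the sentence tokeniser is described as a differently tuned use of the same mechanism.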

Screenshots

Examples

JTokeniserExample.java is a Java program illustrating how to utilise the various tokenisers.

Download

jTokeniser is free and open source. It is released under the LGPL licence.

Please select the version you wish to use:

2.0 jTokeniser-2.0.zip README jTokeniser-2.0_src.zip [source code]
1.2.1 jTokeniser-1.2.1.jar README jTokeniser-1.2.1_src.zip [source code]
1.2 jTokeniser-1.2.jar README jTokeniser-1.2_src.zip [source code]
1.0 jTokeniser-1.0.jar README jTokeniser-1.0_src.zip [source code]

More details can be seen in the changelog.

NOTE: Before you can utilise jTokeniser on your system, you must ensure that the Java Runtime Environment (version 1.5/5.0 or greater) is installed on your computer. It can be downloaded from http://java.sun.com.

Documentation

Javadoc API for jTokeniser is here

The README also contains details about how to make your classes access the jTokeniser package.

Documentation for older versions:

2.0 jTokeniser-2.0_docs.zip
1.2 jTokeniser_docs-1.2.zip
1.0 jTokeniser_docs-1.0.zip
