jTokeniser - v1.0 - README

Andrew Roberts (11-Mar-2005)
dev [at] andy-roberts [dot] net


Overview
========

Tokenising a string into its constituent words/tokens can prove tricky
for non-trivial examples. In particular, when dealing with natural
language, you must also take punctuation into consideration in order to
isolate the words.

The jTokeniser package was designed to combine a set of tokenisers that
range from basic whitespace tokenisers to more complex ones that deal
intuitively with natural language.

Each of the tokenisers adopts a similar structure to
java.util.StringTokenizer in terms of how to instantiate the classes and
extract the tokens. This means they are simple to use.


Features
========

jTokeniser comprises four tokenisers that all extend an abstract
Tokeniser class:

  * WhiteSpaceTokeniser - this splits a string on all occurrences of
    whitespace, which includes spaces, newlines, tabs and linefeeds.

  * StringTokeniser - this is basically the same as
    java.util.StringTokenizer with some extra methods (and extends
    Tokeniser). Its default behaviour is to act as a
    WhiteSpaceTokeniser; however, you can specify a set of characters
    that are to be used as word delimiters.

  * RegexTokeniser - this tokeniser is much more flexible, as you can
    use regular expressions to define what a token is. So, "\\w+"
    (written as a Java string literal) means that whenever it matches
    one or more word characters (letters, digits or underscores), it
    will consider that a word. By default, it uses a regular expression
    equivalent to a whitespace tokeniser.

  * BreakIteratorTokeniser - the most sophisticated of the four,
    although it should only be used on natural language strings to
    isolate words. It comes with built-in rules about how to find
    words, knowing how to disregard punctuation, etc.


Installation
============

The jTokeniser package doesn't need installing as such. You simply have
to download it to your computer, and then make sure that the Java
compiler and virtual machine can "find" it.

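If you are unsure what the virtual machine can currently "find", one
option is to print its class path and look for the JAR there. The sketch
below is only an illustration: ClassPathCheck and listsJar are
hypothetical names, not part of jTokeniser, and the file name
jTokeniser-1.0.jar is assumed to match the one used in the commands
later in this section.

```java
import java.io.File;
import java.util.regex.Pattern;

// Illustration only: ClassPathCheck and listsJar are hypothetical names
// and are NOT part of the jTokeniser package.
class ClassPathCheck {

    // Returns true if any entry in classPath has the file name jarName.
    // Entries are separated by ':' on Linux and ';' on Windows, which is
    // what File.pathSeparator holds on each platform.
    static boolean listsJar(String classPath, String jarName) {
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (new File(entry).getName().equals(jarName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "java.class.path" is the standard system property holding the
        // class path the running virtual machine was started with.
        String classPath = System.getProperty("java.class.path");
        System.out.println("Class path: " + classPath);
        System.out.println("jTokeniser-1.0.jar listed: "
                + listsJar(classPath, "jTokeniser-1.0.jar"));
    }
}
```

This only checks that an entry with that file name is listed; it does
not check that the file exists or that it is a valid JAR.
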
The package is bundled together as a JAR file, which is a Java archive
containing all the classes. A JAR is actually compressed using the
well-known zip algorithm. The advantage of using JARs is that you can
keep lots of related classes together in a single file, without having
to uncompress them.

All Java needs to know is where the JAR file is, and there are a couple
of ways of achieving this. Imagine you have a class called
ClassThatTokenises.java that uses a tokeniser from this package. To
compile and run:

1. Specifying at the command-line

     javac -classpath /path/to/jTokeniser-1.0.jar ClassThatTokenises.java
     java -classpath .:/path/to/jTokeniser-1.0.jar ClassThatTokenises

   The leading "." adds the current directory to the class path, so that
   the virtual machine can also find the compiled ClassThatTokenises
   class.

   NB in Windows, the path would be more like
   c:\path\to\jTokeniser-1.0.jar and class path entries are separated
   by ";" rather than ":".

2. Setting the CLASSPATH environment variable.

   In Linux:

     export CLASSPATH=$CLASSPATH:/path/to/jTokeniser-1.0.jar    (for bash)
     setenv CLASSPATH $CLASSPATH:/path/to/jTokeniser-1.0.jar    (for csh)

     javac ClassThatTokenises.java
     java ClassThatTokenises

   In Windows:

     set CLASSPATH=%CLASSPATH%;c:\path\to\jTokeniser-1.0.jar

   NB you can also set the CLASSPATH via
   Control Panel/System/Advanced/Environment Variables


Support
=======

The package API documentation (as generated by JavaDoc) can be found at:

  http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/releases/1.0/doc/index.html

An example Java file which illustrates common usage of each of the
tokenisers is available at:

  http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/releases/1.0/JTokeniserExample.java


Contact
=======

If you wish to contact the developer about jTokeniser to suggest future
features, report bugs or raise anything else, please email me at:

  dev [at] andy-roberts [dot] net

* Anti-spam format. Please remove all spaces, and replace '[at]' with
  the '@' symbol (no quotes), etc.
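
To make the four approaches listed under Features more concrete, here is
a minimal sketch of the same ideas using only standard JDK classes
(java.util.StringTokenizer, java.util.regex and java.text.BreakIterator).
It does not use jTokeniser's own classes: TokenisingSketch and its
method names are hypothetical, and whether the package is implemented
this way is an assumption. For the real class and method signatures, see
the API documentation listed under Support.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: standard JDK classes, NOT the jTokeniser API.
// All names in this class are hypothetical.
class TokenisingSketch {

    // Whitespace tokenising: StringTokenizer's default delimiters are
    // space, tab, newline, carriage return and form feed.
    static List<String> byWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Delimiter tokenising: every character in delims separates tokens.
    static List<String> byDelimiters(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Regex tokenising: each match of tokenRegex is one token.
    static List<String> byRegex(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Natural-language tokenising: BreakIterator reports word boundaries;
    // pieces that do not start with a letter or digit (spaces,
    // punctuation) are skipped.
    static List<String> byBreakIterator(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        System.out.println(byWhitespace(text));
        System.out.println(byDelimiters("a,b;c", ",;"));
        System.out.println(byRegex(text, "\\w+"));
        System.out.println(byBreakIterator(text));
    }
}
```

Note how the whitespace version leaves punctuation attached to the words
("Hello," and "world!"), while the regex and BreakIterator versions
return just "Hello" and "world". This is the difference the Overview
refers to when it mentions taking punctuation into consideration.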