andyr.jtokeniser
Class RegexTokeniser

java.lang.Object
  extended by andyr.jtokeniser.Tokeniser
      extended by andyr.jtokeniser.RegexTokeniser

public class RegexTokeniser
extends Tokeniser

The RegexTokeniser class uses regular expressions to define a word, and tokenises according to that expression. All matching is performed via Java's Pattern and Matcher classes.

The following is one example of the use of the tokeniser. The code:

 RegexTokeniser ret = new RegexTokeniser("the cat sat on the mat", "\\w+");
 while (ret.hasMoreTokens()) {
     System.out.println(ret.nextToken());
 }
 

prints the following output:

      the
      cat
      sat
      on
      the
      mat
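
Since all matching is delegated to Pattern and Matcher, the same output can be reproduced with those classes directly. The following is a minimal sketch of that idea, not the library's actual source; the class name RegexTokeniseSketch and the method tokenise are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of andyr.jtokeniser.
public class RegexTokeniseSketch {

    // Collects every non-overlapping match of regex in input, in order.
    static List<String> tokenise(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each run of word characters becomes one token.
        for (String token : tokenise("the cat sat on the mat", "\\w+")) {
            System.out.println(token);
        }
    }
}
```

Here the regular expression describes the token itself rather than the delimiter, which is the opposite of how String.split interprets its pattern.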
 
It is also possible to keep the strings in between tokens should it be necessary. By default these are discarded. Note that nothing before the first match or after the last match is kept. For example, take the string "123abc456def789" and the regular expression "\\D+" (one or more non-digits), with keepDelim set to true. The code:

 RegexTokeniser ret = new RegexTokeniser("123abc456def789", "\\D+", true);
 while (ret.hasMoreTokens()) {
     System.out.println(ret.nextToken());
 }
 

prints the following output:

      abc
      456
      def
 

Version:
1.2.1 (09-Aug-2005)
Author:
Andrew Roberts

Field Summary
 
Fields inherited from class andyr.jtokeniser.Tokeniser
currentTokenPosition, tokens
 
Constructor Summary
RegexTokeniser(java.lang.String input)
          Creates a RegexTokeniser that tokenises the input.
RegexTokeniser(java.lang.String input, java.lang.String regex)
          Creates a RegexTokeniser that tokenises the input according to a regular expression that defines a "word" or token.
RegexTokeniser(java.lang.String input, java.lang.String regex, boolean keepDelim)
          Creates a RegexTokeniser that tokenises the input according to a regular expression that defines a "word" or token.
 
Method Summary
 
Methods inherited from class andyr.jtokeniser.Tokeniser
countTokens, getTokens, hasMoreTokens, nextToken, numberOfTokens
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegexTokeniser

public RegexTokeniser(java.lang.String input,
                      java.lang.String regex,
                      boolean keepDelim)
Creates a RegexTokeniser that tokenises the input according to a regular expression that defines a "word" or token. If keepDelim is true then all the strings in between the tokens are kept as tokens too.

Parameters:
input - a string from which the tokens will be extracted.
regex - the regular expression.
keepDelim - flag indicating whether to return the delimiters as tokens.
See Also:
Pattern

RegexTokeniser

public RegexTokeniser(java.lang.String input,
                      java.lang.String regex)
Creates a RegexTokeniser that tokenises the input according to a regular expression that defines a "word" or token.

Parameters:
input - a string from which the tokens will be extracted.
regex - the regular expression.
See Also:
Pattern

RegexTokeniser

public RegexTokeniser(java.lang.String input)
Creates a RegexTokeniser that tokenises the input.

Parameters:
input - a string from which the tokens will be extracted.