andyr.jtokeniser
Class RegexSeparatorTokeniser

java.lang.Object
  extended by andyr.jtokeniser.Tokeniser
      extended by andyr.jtokeniser.RegexSeparatorTokeniser

public class RegexSeparatorTokeniser
extends Tokeniser

The RegexSeparatorTokeniser class uses regular expressions to define the separation between tokens. Whereas RegexTokeniser uses regular expressions to define a "word" or token, RegexSeparatorTokeniser uses regular expressions to define what delimits tokens. All matching is performed via Java's Pattern and Matcher classes.
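
To make the distinction concrete, the following is a minimal sketch that uses only java.util.regex and not this library. The class TokenVsSeparator and its two methods are hypothetical names introduced for illustration. The same expression "\\d+" is applied once as a token definition and once as a separator definition:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of andyr.jtokeniser.
public class TokenVsSeparator {

    // The regex describes the tokens themselves: every match is a token.
    public static List<String> matchTokens(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // The regex describes the delimiters: tokens are the text between matches.
    public static List<String> splitOnSeparator(String input, String regex) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        String input = "abc123def456ghi";
        System.out.println(matchTokens(input, "\\d+"));      // [123, 456]
        System.out.println(splitOnSeparator(input, "\\d+")); // [abc, def, ghi]
    }
}
```

The first call mirrors the idea behind RegexTokeniser and the second the idea behind RegexSeparatorTokeniser. Whether either class is implemented this way internally is an assumption.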

The following is one example of the use of the tokeniser, which defines the token delimiter as one or more whitespace characters. The code:

 RegexSeparatorTokeniser rest = new RegexSeparatorTokeniser(
         "the cat sat on the mat", "\\s+");
 while (rest.hasMoreTokens()) {
     System.out.println(rest.nextToken());
 }
 

prints the following output:

      the
      cat
      sat
      on
      the
      mat
 
It is also possible to keep the strings in between tokens (the delimiters) should it be necessary. By default these are discarded. For example, take the string "abc123def456ghi" and the separator regular expression "\\d+" (one or more digits), with the keepDelim flag set to true. The code:

 RegexSeparatorTokeniser rest = new RegexSeparatorTokeniser(
         "abc123def456ghi", "\\d+", true);
 while (rest.hasMoreTokens()) {
     System.out.println(rest.nextToken());
 }
 

prints the following output:

      abc
      123
      def
      456
      ghi
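
The keepDelim behaviour described above can be approximated with Pattern and Matcher directly. This is a sketch under stated assumptions, not the library's actual implementation: the class SeparatorSketch and its tokenise method are hypothetical, and the handling of empty matches and empty tokens is a choice made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of andyr.jtokeniser.
public class SeparatorSketch {

    // Splits input on every match of regex. When keepDelim is true,
    // the matched delimiter text is also returned as a token.
    public static List<String> tokenise(String input, String regex, boolean keepDelim) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int last = 0; // index just past the previous delimiter
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // assumption: zero-length matches are ignored
            }
            if (m.start() > last) {
                tokens.add(input.substring(last, m.start())); // text before the delimiter
            }
            if (keepDelim) {
                tokens.add(m.group()); // the delimiter itself
            }
            last = m.end();
        }
        if (last < input.length()) {
            tokens.add(input.substring(last)); // trailing text after the last delimiter
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("abc123def456ghi", "\\d+", true));  // [abc, 123, def, 456, ghi]
        System.out.println(tokenise("abc123def456ghi", "\\d+", false)); // [abc, def, ghi]
    }
}
```

With keepDelim set to false the same call yields only abc, def and ghi, which corresponds to the default of discarding delimiters.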
 

Version:
1.2.1 (09-Aug-2005)
Author:
Andrew Roberts

Field Summary
 
Fields inherited from class andyr.jtokeniser.Tokeniser
currentTokenPosition, tokens
 
Constructor Summary
RegexSeparatorTokeniser(java.lang.String input)
          Creates a RegexSeparatorTokeniser that tokenises the input using the default separator "\\s+" (one or more whitespace characters).
RegexSeparatorTokeniser(java.lang.String input, java.lang.String regex)
          Creates a RegexSeparatorTokeniser that tokenises the input according to a regular expression that defines what separates "words" or tokens.
RegexSeparatorTokeniser(java.lang.String input, java.lang.String regex, boolean keepDelim)
          Creates a RegexSeparatorTokeniser that tokenises the input according to a regular expression that defines what separates "words" or tokens, optionally returning the delimiters as tokens.
 
Method Summary
 
Methods inherited from class andyr.jtokeniser.Tokeniser
countTokens, getTokens, hasMoreTokens, nextToken, numberOfTokens
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegexSeparatorTokeniser

public RegexSeparatorTokeniser(java.lang.String input,
                               java.lang.String regex,
                               boolean keepDelim)
Creates a RegexSeparatorTokeniser that tokenises the input according to a regular expression that defines what separates "words" or tokens, optionally returning the delimiters as tokens.

Parameters:
input - a string from which the tokens will be extracted.
regex - the regular expression that matches the delimiters between tokens.
keepDelim - flag indicating whether to return the delimiters as tokens.
See Also:
Pattern

RegexSeparatorTokeniser

public RegexSeparatorTokeniser(java.lang.String input,
                               java.lang.String regex)
Creates a RegexSeparatorTokeniser that tokenises the input according to a regular expression that defines what separates "words" or tokens.

Parameters:
input - a string from which the tokens will be extracted.
regex - the regular expression that matches the delimiters between tokens.
See Also:
Pattern

RegexSeparatorTokeniser

public RegexSeparatorTokeniser(java.lang.String input)
Creates a RegexSeparatorTokeniser that tokenises the input. The default separator regular expression is "\\s+", which defines one or more whitespace characters as the token delimiter.

Parameters:
input - a string from which the tokens will be extracted.
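
As a rough illustration of what the default "\\s+" separator matches, the sketch below uses Pattern.split from java.util.regex rather than this library. The class WhitespaceDefault and its tokenise method are hypothetical. A run of mixed whitespace (spaces, tabs, newlines) counts as a single delimiter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of andyr.jtokeniser.
public class WhitespaceDefault {

    // Same expression as the documented default separator.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static List<String> tokenise(String input) {
        return Arrays.asList(WHITESPACE.split(input));
    }

    public static void main(String[] args) {
        // Two spaces, a tab and a newline each act as one delimiter.
        System.out.println(tokenise("the  cat\tsat\non the mat"));
        // [the, cat, sat, on, the, mat]
    }
}
```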