edu.mayo.informatics.indexer.lucene.tokenizers
Class CustomWhiteSpaceTokenizer

java.lang.Object
  extended by org.apache.lucene.analysis.TokenStream
      extended by org.apache.lucene.analysis.Tokenizer
          extended by org.apache.lucene.analysis.CharTokenizer
              extended by edu.mayo.informatics.indexer.lucene.tokenizers.CustomWhiteSpaceTokenizer

public class CustomWhiteSpaceTokenizer
extends org.apache.lucene.analysis.CharTokenizer

A whitespace tokenizer that treats a caller-supplied set of additional characters as whitespace.

Author:
Dan Armbrust

Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
CustomWhiteSpaceTokenizer(java.io.Reader in, java.util.Set whiteSpaceChars)
          Constructs a new CustomWhiteSpaceTokenizer.
 
Method Summary
protected  boolean isTokenChar(char c)
          Collects only characters that do not satisfy Character.isWhitespace(char) and are not in the whiteSpaceCharsToRemove set.
static java.util.Set makeCharWhiteSpaceSet(char[] charsToTreatAsWhiteSpace)
          Builds a Set from an array of chars to treat as whitespace, appropriate for passing into the CustomWhiteSpaceTokenizer constructor.
 
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
next, normalize, reset
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
next, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CustomWhiteSpaceTokenizer

public CustomWhiteSpaceTokenizer(java.io.Reader in,
                                 java.util.Set whiteSpaceChars)
Constructs a new CustomWhiteSpaceTokenizer that reads from in and treats the characters in whiteSpaceChars as additional whitespace.

Method Detail

makeCharWhiteSpaceSet

public static final java.util.Set makeCharWhiteSpaceSet(char[] charsToTreatAsWhiteSpace)
Builds a Set from an array of chars to treat as whitespace, appropriate for passing into the CustomWhiteSpaceTokenizer constructor.
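The element type of the returned Set is not stated on this page. As a minimal sketch, assuming the set holds boxed Character values, a helper of this kind could look like the following. CharSetSketch and toCharSet are hypothetical names used for illustration only and are not part of this class.

```java
import java.util.HashSet;
import java.util.Set;

public class CharSetSketch {
    // Hypothetical stand-in for makeCharWhiteSpaceSet.
    // Assumption: each char is boxed to a Character before being stored.
    public static Set<Character> toCharSet(char[] charsToTreatAsWhiteSpace) {
        Set<Character> set = new HashSet<>();
        for (char c : charsToTreatAsWhiteSpace) {
            set.add(Character.valueOf(c));
        }
        return set;
    }

    public static void main(String[] args) {
        // The duplicate '-' collapses because a Set keeps one copy of each element.
        Set<Character> extra = toCharSet(new char[] {'-', '/', '-'});
        System.out.println(extra.size());        // prints 2
        System.out.println(extra.contains('/')); // prints true
    }
}
```

Building the set once and reusing it avoids rescanning the char array for every character the tokenizer examines.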


isTokenChar

protected boolean isTokenChar(char c)
Collects only characters which do not satisfy Character.isWhitespace(char), and are not in the whiteSpaceCharsToRemove set.

Specified by:
isTokenChar in class org.apache.lucene.analysis.CharTokenizer
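The rule described above can be sketched without Lucene on the classpath. The example below is an illustration under stated assumptions, not the class's actual source: TokenCharSketch, its two-argument isTokenChar, and tokenize are hypothetical, and the loop only imitates how a CharTokenizer groups adjacent token characters into one token.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenCharSketch {
    // A character belongs to a token only if it is not standard whitespace
    // and is not in the caller-supplied set of extra whitespace characters.
    public static boolean isTokenChar(char c, Set<Character> extraWhiteSpace) {
        return !Character.isWhitespace(c) && !extraWhiteSpace.contains(c);
    }

    // Simplified imitation of CharTokenizer: emit each maximal run of token chars.
    public static List<String> tokenize(String text, Set<Character> extraWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraWhiteSpace)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Character> extra = new HashSet<>();
        extra.add('-');
        extra.add('/');
        // prints [heart, attack, and, or, stroke]
        System.out.println(tokenize("heart-attack and/or stroke", extra));
    }
}
```

With an empty set the sketch splits on standard whitespace only, which corresponds to the behavior of a plain whitespace tokenizer.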

Copyright: (c) 2004-2006 Mayo Foundation for Medical Education and Research (MFMER). All rights reserved. MAYO, MAYO CLINIC, and the triple-shield Mayo logo are trademarks and service marks of MFMER.