edu.mayo.informatics.indexer.lucene.analyzers
Class WhiteSpaceLowerCaseAnalyzer

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by edu.mayo.informatics.indexer.lucene.analyzers.WhiteSpaceLowerCaseAnalyzer

public class WhiteSpaceLowerCaseAnalyzer
extends org.apache.lucene.analysis.Analyzer

This analyzer uses the WhiteSpaceTokenizer, LowerCaseFilter, and StopFilter.

Author:
Dan Armbrust

Constructor Summary
WhiteSpaceLowerCaseAnalyzer()
          Construct the WhiteSpaceLowerCase analyzer, using the stop words from the Standard Analyzer.
WhiteSpaceLowerCaseAnalyzer(java.util.Set stopWords, java.util.Set charsToRemove, java.util.Set charsToTreatAsWhitespace)
          Construct the WhiteSpaceLowerCase analyzer, using the provided stop words, characters to remove, and characters to treat as whitespace.
WhiteSpaceLowerCaseAnalyzer(java.lang.String[] stopWords, char[] charsToRemove, char[] charsToTreatAsWhitespace)
          Construct the WhiteSpaceLowerCase analyzer, using the provided stop words, characters to remove, and characters to treat as whitespace.
 
Method Summary
 java.util.Set getCurrentCharRemovalTable()
           
 java.util.Set getCurrentStopWordTable()
           
 java.util.Set getCurrentWhiteSpaceEquivalentTable()
           
static char[] getDefaultCharRemovalSet()
          Default characters to remove from indexed content: , . / \ ` ' " + * = @ # $ % ^ & ? ! Note that the underscore ('_') is not included.
static char[] getDefaultWhiteSpaceSet()
          Default characters to treat as whitespace (in addition to standard whitespace characters): - : ; ( ) { } [ ] < > | Note that the underscore ('_') is not included.
 org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldname, java.io.Reader reader)
           
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
getPositionIncrementGap, getPreviousTokenStream, reusableTokenStream, setPreviousTokenStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WhiteSpaceLowerCaseAnalyzer

public WhiteSpaceLowerCaseAnalyzer()
Construct the WhiteSpaceLowerCase analyzer, using the stop words from the Standard Analyzer. Uses the default character removal set and the default whitespace-equivalent set.

See Also:
getDefaultCharRemovalSet(), getDefaultWhiteSpaceSet()

WhiteSpaceLowerCaseAnalyzer

public WhiteSpaceLowerCaseAnalyzer(java.lang.String[] stopWords,
                                   char[] charsToRemove,
                                   char[] charsToTreatAsWhitespace)
Construct the WhiteSpaceLowerCase analyzer, using the provided stop words, characters to remove, and characters to treat as whitespace.

Parameters:
stopWords - Stop words to use. Null or empty disables stop word filtering.
charsToRemove - Characters to strip from input. Null or empty means no characters are removed. See getDefaultCharRemovalSet() for a recommended set of characters to remove from input.
charsToTreatAsWhitespace - Characters to treat as whitespace (that is, as split points in the tokenization). Null or empty means input is split on whitespace only.
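The effect of the three parameters can be illustrated with a small plain-Java sketch. It is not the analyzer's source code and does not use Lucene; the class name AnalysisSketch and the method analyze are hypothetical. It assumes that a removed character is dropped without splitting the token, and that stop words are compared against already lowercased tokens (the order in which the class description lists the filters).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in only: this is NOT the analyzer's implementation.
final class AnalysisSketch {

    // Mirrors the documented rule that null and empty behave the same way.
    static Set<Character> toCharSet(char[] chars) {
        Set<Character> set = new HashSet<>();
        if (chars != null) {
            for (char c : chars) {
                set.add(c);
            }
        }
        return set;
    }

    static List<String> analyze(String text, String[] stopWords,
                                char[] charsToRemove, char[] charsToTreatAsWhitespace) {
        Set<Character> remove = toCharSet(charsToRemove);
        Set<Character> split = toCharSet(charsToTreatAsWhitespace);
        Set<String> stops = new HashSet<>();
        if (stopWords != null) {
            for (String w : stopWords) {
                stops.add(w);
            }
        }

        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            char c = atEnd ? ' ' : text.charAt(i);
            if (atEnd || Character.isWhitespace(c) || split.contains(c)) {
                // Token boundary: emit the pending token unless it is a stop word.
                if (current.length() > 0) {
                    String token = current.toString();
                    if (!stops.contains(token)) {
                        tokens.add(token);
                    }
                    current.setLength(0);
                }
            } else if (!remove.contains(c)) {
                // Kept characters are lowercased; removed ones are skipped (assumption:
                // removal does not split the token).
                current.append(Character.toLowerCase(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = analyze("The Heart-Attack (acute) isn't U.S.A. only",
                new String[] {"the"}, new char[] {'.', '\''}, new char[] {'-', '(', ')'});
        System.out.println(tokens); // prints [heart, attack, acute, isnt, usa, only]
    }
}
```

Passing null for all three arrays in this sketch reduces it to whitespace splitting plus lowercasing, which matches the null-or-empty behavior described for each parameter.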

WhiteSpaceLowerCaseAnalyzer

public WhiteSpaceLowerCaseAnalyzer(java.util.Set stopWords,
                                   java.util.Set charsToRemove,
                                   java.util.Set charsToTreatAsWhitespace)
Construct the WhiteSpaceLowerCase analyzer, using the provided stop words, characters to remove, and characters to treat as whitespace.

Parameters:
stopWords - Stop words to use. Null or empty disables stop word filtering.
charsToRemove - Characters to strip from input. Null or empty means no characters are removed. See getDefaultCharRemovalSet() for a recommended set of characters to remove from input.
charsToTreatAsWhitespace - Characters to treat as whitespace (that is, as split points in the tokenization). Null or empty means input is split on whitespace only.
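This page declares the Set parameters as raw java.util.Set and does not state their element types. The sketch below shows one way to build such sets from arrays, assuming (not confirmed by this page) that stop words are held as String and characters as Character. The class SetArguments and its methods are hypothetical helpers, not part of the library.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helpers for preparing Set arguments; element types are an assumption.
final class SetArguments {

    // Collects stop words into a set; duplicates collapse automatically.
    static Set<String> stopWordSet(String... words) {
        return new LinkedHashSet<>(Arrays.asList(words));
    }

    // Boxes each char into a Character, since a Set cannot hold primitives.
    static Set<Character> charSet(char... chars) {
        Set<Character> set = new LinkedHashSet<>();
        for (char c : chars) {
            set.add(c);
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(stopWordSet("a", "the", "of")); // prints [a, the, of]
        System.out.println(charSet('-', ':', '-'));        // prints [-, :]
    }
}
```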
Method Detail

getDefaultCharRemovalSet

public static char[] getDefaultCharRemovalSet()
Default characters to remove from indexed content: , . / \ ` ' " + * = @ # $ % ^ & ? ! Note that the underscore ('_') is not included.


getDefaultWhiteSpaceSet

public static char[] getDefaultWhiteSpaceSet()
Default characters to treat as whitespace (in addition to standard whitespace characters): - : ; ( ) { } [ ] < > | Note that the underscore ('_') is not included.
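The two default sets can be written out as Java string constants to check how a given character would be classified. The constants below are copied from the descriptions on this page rather than read from the library, and the class DocumentedDefaults with its methods is hypothetical. It shows in particular that the underscore falls into neither set, so an identifier such as snake_case would presumably stay a single token under the defaults.

```java
// Hypothetical lookup built from the character lists documented on this page.
final class DocumentedDefaults {

    // , . / \ ` ' " + * = @ # $ % ^ & ? !
    static final String REMOVED = ",./\\`'\"+*=@#$%^&?!";

    // - : ; ( ) { } [ ] < > |
    static final String WHITESPACE_EQUIVALENT = "-:;(){}[]<>|";

    static boolean isRemoved(char c) {
        return REMOVED.indexOf(c) >= 0;
    }

    static boolean isSplitPoint(char c) {
        return Character.isWhitespace(c) || WHITESPACE_EQUIVALENT.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isRemoved('_'));    // prints false
        System.out.println(isSplitPoint('_')); // prints false
        System.out.println(isSplitPoint('-')); // prints true
        System.out.println(isRemoved('!'));    // prints true
    }
}
```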


tokenStream

public final org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldname,
                                                                java.io.Reader reader)
Specified by:
tokenStream in class org.apache.lucene.analysis.Analyzer

getCurrentCharRemovalTable

public java.util.Set getCurrentCharRemovalTable()

getCurrentWhiteSpaceEquivalentTable

public java.util.Set getCurrentWhiteSpaceEquivalentTable()

getCurrentStopWordTable

public java.util.Set getCurrentStopWordTable()

Copyright: (c) 2004-2006 Mayo Foundation for Medical Education and Research (MFMER). All rights reserved. MAYO, MAYO CLINIC, and the triple-shield Mayo logo are trademarks and service marks of MFMER.