Jacson

de.spieleck.app.lang
Class StemmerDE

java.lang.Object
  extended by de.spieleck.app.lang.StemmerDE
All Implemented Interfaces:
Stemmer

public class StemmerDE
extends java.lang.Object
implements Stemmer

A stemmer for German words. The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns@isst.fhg.de). This implementation is based on code by Gerhard Schwarz for the Apache Lucene project.

Author:
fsn

Field Summary
static char CH_TOKEN
           
static char EI_TOKEN
           
static char IE_TOKEN
           
static char IG_TOKEN
           
static char REP_TOKEN
           
static char SCH_TOKEN
           
static char ST_TOKEN
           
 
Constructor Summary
StemmerDE()
           
 
Method Summary
protected  void deleteParticleDenotion(java.lang.StringBuffer term)
          Removes a participle marker ("ge") from a term.
protected  boolean isStemmable(java.lang.String term)
          Checks if a term could be stemmed.
protected  void optimizations(java.lang.StringBuffer term)
          Does some optimizations on the term.
protected  void resubstituteSpecialChars(java.lang.StringBuffer term)
          Undoes the changes made by substituteSpecialChars().
 java.lang.String stem(java.lang.String term)
          Stems the given term to a unique discriminator.
protected  void stripSuffixes(java.lang.StringBuffer term, boolean lowerCase)
          Performs suffix stripping (stemming) on the current term.
protected  void substituteSpecialChars(java.lang.StringBuffer term)
          Does some substitutions on the term to reduce overstemming (umlauts, "ß", doubled characters, and common character combinations).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

REP_TOKEN

public static final char REP_TOKEN
See Also:
Constant Field Values

SCH_TOKEN

public static final char SCH_TOKEN
See Also:
Constant Field Values

CH_TOKEN

public static final char CH_TOKEN
See Also:
Constant Field Values

EI_TOKEN

public static final char EI_TOKEN
See Also:
Constant Field Values

IE_TOKEN

public static final char IE_TOKEN
See Also:
Constant Field Values

IG_TOKEN

public static final char IG_TOKEN
See Also:
Constant Field Values

ST_TOKEN

public static final char ST_TOKEN
See Also:
Constant Field Values
Constructor Detail

StemmerDE

public StemmerDE()
Method Detail

stem

public java.lang.String stem(java.lang.String term)
Stems the given term to a unique discriminator.

Specified by:
stem in interface Stemmer
Parameters:
term - The term that should be stemmed.
Returns:
Discriminator for term

isStemmable

protected boolean isStemmable(java.lang.String term)
Checks if a term could be stemmed.

Returns:
true if, and only if, the given term consists only of letters.
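
The letters-only check described above could look roughly like the following. This is a minimal sketch, not the class's actual code: the class name LetterCheckSketch and the method name consistsOfLetters are hypothetical, and treating an empty term as stemmable is an assumption.

```java
final class LetterCheckSketch {
    // Hypothetical helper: returns true only if every character is a letter.
    // Assumption: an empty term passes the check, since it has no non-letters.
    static boolean consistsOfLetters(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (!Character.isLetter(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(consistsOfLetters("Haus"));
        System.out.println(consistsOfLetters("Haus1"));
    }
}
```

Character.isLetter accepts umlauts and "ß" as letters, so German words pass the check without special handling.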

stripSuffixes

protected void stripSuffixes(java.lang.StringBuffer term,
                             boolean lowerCase)
Performs suffix stripping (stemming) on the current term. The stripping is reduced to the seven "base" suffixes "e", "s", "n", "t", "em", "er" and "nd", from which all regular suffixes are built. This simplification causes some overstemming and many more irregular stems, but still provides unique discriminators in most of those cases. The algorithm is context-free, except for the length restrictions.
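
To illustrate the idea, here is a minimal sketch of repeated stripping of the seven base suffixes. It is not the code of this class: the class name SuffixStripSketch is hypothetical, the length thresholds (3, 4 and 5) are assumptions chosen for illustration, and the lowerCase parameter is left out because its exact meaning is not documented here.

```java
final class SuffixStripSketch {
    // Strips the base suffixes "nd", "em", "er", "e", "s", "n", "t" repeatedly,
    // modifying the buffer in place. Length limits are illustrative assumptions.
    static void stripSuffixes(StringBuffer term) {
        boolean changed = true;
        while (changed && term.length() > 3) {
            int len = term.length();
            String lastTwo = term.substring(len - 2);
            char last = term.charAt(len - 1);
            if (len > 5 && lastTwo.equals("nd")) {
                term.setLength(len - 2);
            } else if (len > 4 && (lastTwo.equals("em") || lastTwo.equals("er"))) {
                term.setLength(len - 2);
            } else if (last == 'e' || last == 's' || last == 'n' || last == 't') {
                term.setLength(len - 1);
            } else {
                changed = false; // no base suffix left to strip
            }
        }
    }

    public static void main(String[] args) {
        StringBuffer term = new StringBuffer("laufend");
        stripSuffixes(term);
        System.out.println(term);
    }
}
```

Because the loop only looks at the end of the term and its length, the sketch is context-free in the same sense as the description above, and it shows how short stems can be overstemmed once the length limit is the only guard.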


optimizations

protected void optimizations(java.lang.StringBuffer term)
Does some optimizations on the term. These optimizations are contextual. The method returns nothing; the term is modified in place.

deleteParticleDenotion

protected void deleteParticleDenotion(java.lang.StringBuffer term)
Removes a participle marker ("ge") from a term.


substituteSpecialChars

protected void substituteSpecialChars(java.lang.StringBuffer term)
Does some substitutions on the term to reduce overstemming:
- Substitute umlauts with their corresponding vowel: äöü -> aou; "ß" is substituted by "ss".
- Substitute the second character of a pair of equal characters with an asterisk: ?? -> ?*
- Substitute some common character combinations with a token: sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
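
The substitutions listed above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this class: the class name SubstituteSketch is hypothetical, the input is assumed to be lowercase, the sketch returns a new String instead of editing a StringBuffer in place, and the order in which the rules are tried is an assumption. The token characters are the ones given in the description.

```java
final class SubstituteSketch {
    // Masks umlauts, sharp s, doubled characters and common letter combinations.
    // Assumption: the term is already lowercase.
    static String substitute(String term) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < term.length()) {
            char c = term.charAt(i);
            if (i > 0 && c == term.charAt(i - 1)) {
                out.append('*');            // second char of an equal pair
                i++;
            } else if (c == '\u00e4') {     // a-umlaut
                out.append('a'); i++;
            } else if (c == '\u00f6') {     // o-umlaut
                out.append('o'); i++;
            } else if (c == '\u00fc') {     // u-umlaut
                out.append('u'); i++;
            } else if (c == '\u00df') {     // sharp s
                out.append("ss"); i++;
            } else if (term.startsWith("sch", i)) {
                out.append('$'); i += 3;
            } else if (term.startsWith("ch", i)) {
                out.append('\u00a7'); i += 2;   // section sign token
            } else if (term.startsWith("ei", i)) {
                out.append('%'); i += 2;
            } else if (term.startsWith("ie", i)) {
                out.append('&'); i += 2;
            } else if (term.startsWith("ig", i)) {
                out.append('#'); i += 2;
            } else if (term.startsWith("st", i)) {
                out.append('!'); i += 2;
            } else {
                out.append(c); i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("stein"));
        System.out.println(substitute("bett"));
    }
}
```

Masking a combination such as "st" as a single token keeps a later suffix-stripping pass from cutting it apart, which is how these substitutions reduce overstemming.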


resubstituteSpecialChars

protected void resubstituteSpecialChars(java.lang.StringBuffer term)
Undoes the changes made by substituteSpecialChars(), that is, it restores character pairs and character combinations. Umlauts remain as their corresponding vowel, and "ß" remains as "ss".
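
A minimal sketch of the reverse mapping is shown below. It is an illustration only, not the code of this class: the class name ResubstituteSketch is hypothetical, the sketch returns a new String rather than editing a StringBuffer in place, and dropping an asterisk that has no preceding character is an assumption. The tokens are those listed for substituteSpecialChars().

```java
final class ResubstituteSketch {
    // Expands tokens back to letter combinations and asterisks back to the
    // preceding character. Umlauts and sharp s are not restored.
    static String resubstitute(String masked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < masked.length(); i++) {
            char c = masked.charAt(i);
            switch (c) {
                case '*':
                    // Repeat the previous output character; a leading
                    // asterisk is dropped (assumption).
                    if (out.length() > 0) {
                        out.append(out.charAt(out.length() - 1));
                    }
                    break;
                case '$': out.append("sch"); break;
                case '\u00a7': out.append("ch"); break;  // section sign token
                case '%': out.append("ei"); break;
                case '&': out.append("ie"); break;
                case '#': out.append("ig"); break;
                case '!': out.append("st"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resubstitute("be*"));
        System.out.println(resubstitute("!%n"));
    }
}
```

Since the umlaut and "ß" replacements are not reversed, a masked form of "schön" comes back as "schon", matching the behaviour described above.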


Spieleck

Copyleft 2002 spieleck.de.