marf.nlp.Parsing
Class ProbabilisticParser

java.lang.Object
  extended by marf.Storage.StorageManager
      extended by marf.nlp.Parsing.ProbabilisticParser
All Implemented Interfaces:
java.lang.Cloneable, IStorageManager, java.io.Serializable

public class ProbabilisticParser
extends StorageManager

A probabilistic parser parses a natural language (e.g. English) according to a given probabilistic grammar. Because natural language sentences are ambiguous and a single sentence may have more than one parse, each grammar rule is assigned a probability, and the parse with the highest probability is chosen. This class implements the well-known CYK probabilistic parsing algorithm.

The CYK algorithm is cited below. The main reference is here.

 function CYK(words,grammar) returns The most probable parse
                             and its probability
 
     create and clear pi[num_words, num_words, num_nonterminals]
     create and clear back[num_words, num_words, num_nonterminals]
 
     # base case
     for i <-- 1 to num_words
         for A <-- 1 to num_nonterminals
             if (A --> wi) is in grammar then
                 pi [i, i, A] = P(A --> wi)
 
     # recursive case
     for span <-- 2 to num_words
         for begin <-- 1 to num_words - span + 1
             end <-- begin + span - 1
             for m <-- begin to end - 1
 
                 for A <-- 1 to num_nonterminals
                     for B <-- 1 to num_nonterminals
                         for C <-- 1 to num_nonterminals
 
                             prob = pi [begin, m, B] * pi [m + 1, end, C] * P(A --> BC)
 
                             if (prob > pi[begin, end, A]) then
                                 pi [begin, end, A] = prob
                                 back[begin, end, A] = {m, B, C}
 
     return build_tree(back[1, num_words, 1]), pi[1, num_words, 1]
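
For illustration, the pseudocode above can be sketched in Java roughly as follows. This is a minimal sketch, not the MARF implementation: the class CykSketch and its nested types Lexical, Binary, and Chart are hypothetical names, indices are 0-based rather than 1-based, non-terminal 0 is assumed to be the start symbol, and the loops over A, B, and C are replaced by a loop over the binary rules actually present in the grammar (rules that are absent would contribute a probability of 0 anyway).

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative probabilistic CYK; all names here are hypothetical, not MARF's. */
final class CykSketch {
    /** Lexical rule A --> word with probability p. */
    static final class Lexical {
        final int a; final String word; final double p;
        Lexical(int a, String word, double p) { this.a = a; this.word = word; this.p = p; }
    }

    /** Binary rule A --> B C with probability p. */
    static final class Binary {
        final int a, b, c; final double p;
        Binary(int a, int b, int c, double p) { this.a = a; this.b = b; this.c = c; this.p = p; }
    }

    /** pi[begin][end][A] is the best probability; back[begin][end][A] is {m, B, C} or null. */
    static final class Chart {
        final double[][][] pi;
        final int[][][][] back;
        Chart(int n, int k) { pi = new double[n][n][k]; back = new int[n][n][k][]; }
    }

    static Chart parse(String[] words, int numNonterminals,
                       List<Lexical> lexicon, List<Binary> rules) {
        int n = words.length;
        Chart chart = new Chart(n, numNonterminals); // arrays start zeroed / null
        double[][][] pi = chart.pi;

        // Base case: a one-word span is covered by a lexical rule A --> w_i.
        for (int i = 0; i < n; i++) {
            for (Lexical r : lexicon) {
                if (r.word.equals(words[i]) && r.p > pi[i][i][r.a]) {
                    pi[i][i][r.a] = r.p;
                }
            }
        }

        // Recursive case: try every split point m of every span of two or more words.
        for (int span = 2; span <= n; span++) {
            for (int begin = 0; begin + span <= n; begin++) {
                int end = begin + span - 1;
                for (int m = begin; m < end; m++) {
                    for (Binary r : rules) {
                        double prob = pi[begin][m][r.b] * pi[m + 1][end][r.c] * r.p;
                        if (prob > pi[begin][end][r.a]) {
                            pi[begin][end][r.a] = prob;
                            chart.back[begin][end][r.a] = new int[] {m, r.b, r.c};
                        }
                    }
                }
            }
        }
        return chart;
    }

    /** Toy grammar for demonstration: 0 = S, 1 = NP, 2 = VP, 3 = V. */
    static Chart toyChart() {
        List<Lexical> lexicon = Arrays.asList(
            new Lexical(1, "dogs", 0.5), new Lexical(1, "cats", 0.5),
            new Lexical(3, "chase", 1.0));
        List<Binary> rules = Arrays.asList(
            new Binary(0, 1, 2, 1.0),  // S --> NP VP
            new Binary(2, 3, 1, 1.0)); // VP --> V NP
        return parse(new String[] {"dogs", "chase", "cats"}, 4, lexicon, rules);
    }

    public static void main(String[] args) {
        System.out.println(toyChart().pi[0][2][0]); // prints 0.25
    }
}
```

In the toy grammar the only parse of "dogs chase cats" has probability 0.5 * 1.0 * 0.5 = 0.25, and the back-pointer stored for the whole sentence is {0, 1, 2}: split after word 0, with NP on the left and VP on the right.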
 
$Id: ProbabilisticParser.java,v 1.30 2006/01/30 03:43:17 mokhov Exp $

Since:
0.3.0.2
Version:
$Revision: 1.30 $
Author:
Serguei Mokhov
See Also:
Serialized Form

Field Summary
 
Fields inherited from class marf.Storage.StorageManager
bDumpOnNotFound, iCurrentDumpMode, oObjectToSerialize, strFilename
 
Fields inherited from interface marf.Storage.IStorageManager
DUMP_BINARY, DUMP_CSV_TEXT, DUMP_GZIP_BINARY, DUMP_HTML, DUMP_SQL, DUMP_XML, MARF_INTERFACE_CODE_REVISION, STORAGE_FILE_EXTENSIONS
 
Constructor Summary
ProbabilisticParser()
          Initializes default probabilistic parser with empty grammar.
ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer)
          Initializes probabilistic parser with the specified tokenizer.
ProbabilisticParser(java.lang.String pstrGrammarFilename)
          Initializes probabilistic parser with the grammar filename.
 
Method Summary
 void backSynchronizeObject()
          Implements StorageManager interface.
 void dumpBackPointersContents()
          Dumps back-pointers to the STDOUT.
 void dumpParseMatrix()
          Dumps parse matrix to the STDOUT.
 void dumpParseTree()
          Dumps parse tree to the STDOUT.
 void dumpParseTree(int piLevel, int i, int j, int piA)
          Dumps a parse sub-tree to the STDOUT. The initial level of the S non-terminal should be 0.
static java.lang.String getMARFSourceCodeRevision()
          Retrieves class' revision.
protected  java.lang.String getSentencePart(int i, int j)
          Gets a sentence span given indices.
protected  void indent(int piTabSize)
          Indents by the specified number of tabs.
 boolean parse()
          Performs parse of a natural language sentence using the CYK algorithm.
 void setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer)
          Allows setting the desired stream tokenizer.
 boolean train()
          Performs training of the parser by compiling the source probabilistic grammar and then dumping it onto disk as a precompiled binary file for future reloading.
 
Methods inherited from class marf.Storage.StorageManager
clone, dump, dumpBinary, dumpCSV, dumpGzipBinary, dumpHTML, dumpSQL, dumpXML, enableDumpOnNotFound, equals, getDefaultExtension, getDefaultExtension, getDumpMode, getFilename, hashCode, restore, restoreBinary, restoreCSV, restoreGzipBinary, restoreHTML, restoreSQL, restoreXML, setDumpMode, setFilename, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ProbabilisticParser

public ProbabilisticParser(java.lang.String pstrGrammarFilename)
Initializes probabilistic parser with the grammar filename.

Parameters:
pstrGrammarFilename - the filename of the probabilistic grammar

ProbabilisticParser

public ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer)
Initializes probabilistic parser with the specified tokenizer. By default, sets the tokenizer not to fold tokens into lower case.

Parameters:
poStreamTokenizer - the stream tokenizer to read the tokens from
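
As an illustration of the case handling described above, the following sketch configures a java.io.StreamTokenizer so that tokens keep their original case and collects the word tokens it produces. The class TokenizerSketch and its words method are hypothetical helpers written for this example; they are not part of MARF, and how the parser itself consumes the tokenizer is not shown here.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; demonstrates a tokenizer that does not fold case. */
final class TokenizerSketch {
    static List<String> words(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        // Keep "John" distinct from "john"; grammar terminals are case-sensitive here.
        tokenizer.lowerCaseMode(false);

        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    result.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("John loves Mary")); // prints [John, loves, Mary]
    }
}
```

A tokenizer configured this way is the kind of object one would presumably hand to this constructor or to setStreamTokenizer().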

ProbabilisticParser

public ProbabilisticParser()
Initializes default probabilistic parser with empty grammar.

Method Detail

parse

public boolean parse()
              throws SyntaxError
Performs parse of a natural language sentence using the CYK algorithm.

Returns:
true if the parse was successful
Throws:
SyntaxError - in case of some unusual syntax breakage

dumpBackPointersContents

public void dumpBackPointersContents()
Dumps back-pointers to the STDOUT.


dumpParseMatrix

public void dumpParseMatrix()
Dumps parse matrix to the STDOUT.


train

public boolean train()
              throws StorageException
Performs training of the parser by compiling the source probabilistic grammar and then dumping it onto disk as a precompiled binary file for future reloading.

Returns:
true if the training was successful
Throws:
StorageException - in case of any GrammarCompiler error

dumpParseTree

public void dumpParseTree()
Dumps parse tree to the STDOUT.


dumpParseTree

public void dumpParseTree(int piLevel,
                          int i,
                          int j,
                          int piA)
Dumps a parse sub-tree to the STDOUT. The initial level of the S non-terminal should be 0.

Parameters:
piLevel - starting level (depth) of the tree; also acts as indentation marker
i - left index of the span
j - right index of the span
piA - the non-terminal index
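
To make the role of these parameters concrete, here is a minimal sketch of a recursive dump that walks back-pointers of the form {m, B, C} and indents one tab per tree level. Everything in it is an assumption for illustration: the class TreeDumpSketch, the 0-based indices, the layout of the back array, and the output format are not taken from MARF, and the sketch returns a string instead of printing to the STDOUT so that it can be checked easily.

```java
import java.util.Arrays;

/** Hypothetical recursive tree dump driven by CYK back-pointers. */
final class TreeDumpSketch {
    /**
     * Renders the sub-tree rooted at non-terminal a over words i..j (inclusive).
     * back[i][j][a] is {m, B, C} for a binary split, or null for a single word.
     */
    static String dump(int level, int i, int j, int a,
                       int[][][][] back, String[] names, String[] words) {
        StringBuilder out = new StringBuilder();
        for (int t = 0; t < level; t++) {
            out.append('\t'); // the level doubles as the indentation depth
        }
        int[] pointer = back[i][j][a];
        if (pointer == null) {
            // Leaf: the non-terminal rewrites directly to a word.
            out.append(names[a]).append(" --> ").append(words[i]).append('\n');
            return out.toString();
        }
        String span = String.join(" ", Arrays.copyOfRange(words, i, j + 1));
        out.append(names[a]).append(" [").append(span).append("]\n");
        out.append(dump(level + 1, i, pointer[0], pointer[1], back, names, words));
        out.append(dump(level + 1, pointer[0] + 1, j, pointer[2], back, names, words));
        return out.toString();
    }

    /** Hand-built back-pointers for "dogs chase cats": S --> NP VP, VP --> V NP. */
    static String demo() {
        String[] names = {"S", "NP", "VP", "V"};
        String[] words = {"dogs", "chase", "cats"};
        int[][][][] back = new int[3][3][4][];
        back[0][2][0] = new int[] {0, 1, 2}; // S over words 0..2, split after word 0
        back[1][2][2] = new int[] {1, 3, 1}; // VP over words 1..2, split after word 1
        return dump(0, 0, 2, 0, back, names, words);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The top-level call starts at level 0 with the start symbol and the full sentence span, mirroring the note that the initial level of the S non-terminal should be 0.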

indent

protected void indent(int piTabSize)
Indents by the specified number of tabs.

Parameters:
piTabSize - the number of tab characters to indent by

getSentencePart

protected java.lang.String getSentencePart(int i,
                                           int j)
Gets a sentence span given indices.

Parameters:
i - leftmost word index
j - rightmost word index
Returns:
the sentence span string

setStreamTokenizer

public void setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer)
Allows setting the desired stream tokenizer.

Parameters:
poStreamTokenizer - the NLP stream tokenizer to read tokens from

backSynchronizeObject

public void backSynchronizeObject()
Implements StorageManager interface.

Overrides:
backSynchronizeObject in class StorageManager
See Also:
StorageManager.backSynchronizeObject()

getMARFSourceCodeRevision

public static java.lang.String getMARFSourceCodeRevision()
Retrieves class' revision.

Returns:
revision string

