ProbabilisticParser

marf.nlp.Parsing
Class ProbabilisticParser

java.lang.Object
  marf.Storage.StorageManager
      marf.nlp.Parsing.ProbabilisticParser

All Implemented Interfaces:: java.io.Serializable, java.lang.Cloneable, IStorageManager

public class ProbabilisticParser
extends StorageManager

A probabilistic parser parses a natural language (e.g. English) sentence given a probabilistic grammar. Since natural language sentences are ambiguous and a single sentence may have more than one parse, each grammar rule is assigned a probability, and the most probable parse is chosen according to the probabilities of the rules it uses. This class implements the well-known probabilistic CYK parsing algorithm.

The CYK algorithm is cited below. The main reference is here.

 function CYK(words,grammar) returns The most probable parse
                             and its probability
 
     create and clear pi[num_words, num_words, num_nonterminals]
 
     # base case
     for i <-- 1 to num_words
         for A <-- 1 to num_nonterminals
             if (A --> wi) is in grammar then
                 pi [i, i, A] = P(A --> wi)
 
     # recursive case
     for span <-- 2 to num_words
         for begin <-- 1 to num_words - span + 1
             end <-- begin + span - 1
             for m <-- begin to end - 1
 
                 for A <-- 1 to num_nonterminals
                     for B <-- 1 to num_nonterminals
                         for C <-- 1 to num_nonterminals
 
                             prob = pi [begin, m, B] * pi [m + 1, end, C] * P(A --> BC)
 
                             if (prob > pi[begin, end, A]) then
                                 pi [begin, end, A] = prob
                                 back[begin, end, A] = {m, B, C}
 
     return build_tree(back[1, num_words, 1]), pi[1, num_words, 1]
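
The pseudocode above can be sketched in Java roughly as follows. This is an illustrative, self-contained sketch and not the MARF implementation: the class name CykSketch, the lexicon map, the rules[A][B][C] table, and the bracketed tree output are all assumptions made for the example. Indices are 0-based here, whereas the pseudocode is 1-based.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal probabilistic CYK sketch for a grammar in Chomsky normal form (hypothetical names). */
final class CykSketch {
    private final String[] words;
    final double[][][] pi;   // pi[begin][end][A]: best probability of A deriving words begin..end
    final int[][][][] back;  // back[begin][end][A] = {m, B, C} for the best split, null otherwise

    /**
     * @param k       number of non-terminals
     * @param lexicon maps a word to an array of length k holding P(A --> word) for each A
     * @param rules   rules[A][B][C] = P(A --> B C), 0 if the rule is absent
     */
    CykSketch(String[] words, int k, Map<String, double[]> lexicon, double[][][] rules) {
        this.words = words;
        int n = words.length;
        pi = new double[n][n][k];
        back = new int[n][n][k][];

        // Base case: spans of length one are filled from the lexical rules.
        for (int i = 0; i < n; i++) {
            double[] probs = lexicon.get(words[i]);
            if (probs != null) {
                System.arraycopy(probs, 0, pi[i][i], 0, k);
            }
        }

        // Recursive case: try every split point m and every rule A --> B C.
        for (int span = 2; span <= n; span++) {
            for (int begin = 0; begin + span <= n; begin++) {
                int end = begin + span - 1;
                for (int m = begin; m < end; m++) {
                    for (int a = 0; a < k; a++) {
                        for (int b = 0; b < k; b++) {
                            for (int c = 0; c < k; c++) {
                                double prob = pi[begin][m][b] * pi[m + 1][end][c] * rules[a][b][c];
                                if (prob > pi[begin][end][a]) {
                                    pi[begin][end][a] = prob;
                                    back[begin][end][a] = new int[] {m, b, c};
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    /** Probability of the best parse rooted at the start symbol; 0 means no parse was found. */
    double bestProbability(int start) {
        return words.length == 0 ? 0.0 : pi[0][words.length - 1][start];
    }

    /** Rebuilds the best tree for non-terminal a over words begin..end as a bracketed string. */
    String tree(String[] names, int begin, int end, int a) {
        if (begin == end) {
            return "(" + names[a] + " " + words[begin] + ")";
        }
        int[] bp = back[begin][end][a];
        if (bp == null) {
            throw new IllegalStateException("no parse for " + names[a] + " over " + begin + ".." + end);
        }
        return "(" + names[a] + " " + tree(names, begin, bp[0], bp[1])
                + " " + tree(names, bp[0] + 1, end, bp[2]) + ")";
    }

    public static void main(String[] args) {
        String[] names = {"S", "NP", "VP", "V"};          // toy grammar, S is index 0
        Map<String, double[]> lexicon = new HashMap<>();
        lexicon.put("she", new double[] {0, 0.5, 0, 0});   // NP --> she
        lexicon.put("fish", new double[] {0, 0.5, 0, 0});  // NP --> fish
        lexicon.put("eats", new double[] {0, 0, 0, 1.0});  // V --> eats
        double[][][] rules = new double[4][4][4];
        rules[0][1][2] = 1.0;                              // S --> NP VP
        rules[2][3][1] = 1.0;                              // VP --> V NP

        CykSketch parser = new CykSketch(new String[] {"she", "eats", "fish"}, 4, lexicon, rules);
        System.out.println(parser.bestProbability(0));     // prints 0.25
        System.out.println(parser.tree(names, 0, 2, 0));   // prints (S (NP she) (VP (V eats) (NP fish)))
    }
}
```

The dense rules[A][B][C] table mirrors the three nested non-terminal loops of the pseudocode; for a large grammar, iterating over a list of the rules that actually exist would avoid most of the wasted iterations.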

$Id: ProbabilisticParser.java,v 1.31 2007/12/18 21:37:54 mokhov Exp $

Since:: 0.3.0.2
Version:: $Revision: 1.31 $
Author:: Serguei Mokhov
See Also:: Serialized Form

Field Summary

Fields inherited from class marf.Storage.StorageManager
`bDumpOnNotFound, iCurrentDumpMode, oObjectToSerialize, strFilename`

Fields inherited from interface marf.Storage.IStorageManager
`DUMP_BINARY, DUMP_CSV_TEXT, DUMP_GZIP_BINARY, DUMP_HTML, DUMP_SQL, DUMP_XML, MARF_INTERFACE_CODE_REVISION, STORAGE_FILE_EXTENSIONS`

Constructor Summary
`ProbabilisticParser()` Initializes default probabilistic parser with empty grammar.
`ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer)` Initializes probabilistic parser with the specified tokenizer.
`ProbabilisticParser(java.lang.String pstrGrammarFilename)` Initializes probabilistic parser with the grammar filename.

Method Summary
`void`	`backSynchronizeObject()` Implements StorageManager interface.
`void`	`dumpBackPointersContents()` Dumps back-pointers to the STDOUT.
`void`	`dumpParseMatrix()` Dumps parse matrix to the STDOUT.
`void`	`dumpParseTree()` Dumps parse tree to the STDOUT.
`void`	`dumpParseTree(int piLevel, int i, int j, int piA)` Dumps a parse sub-tree to the STDOUT. Initial level of S non-terminal should be 0.
`static java.lang.String`	`getMARFSourceCodeRevision()` Retrieves class' revision.
`protected java.lang.String`	`getSentencePart(int i, int j)` Gets a sentence span given indices.
`protected void`	`indent(int piTabSize)` Indents by the specified number of tabs.
`boolean`	`parse()` Performs parse of a natural language sentence using the CYK algorithm.
`void`	`setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer)` Allows setting desired stream tokenizer.
`boolean`	`train()` Performs training of the parser by compiling the source probabilistic grammar and then dumping it onto disk as a precompiled binary file for future re-load.

Methods inherited from class marf.Storage.StorageManager
`clone, dump, dumpBinary, dumpCSV, dumpGzipBinary, dumpHTML, dumpSQL, dumpXML, enableDumpOnNotFound, equals, getDefaultExtension, getDefaultExtension, getDumpMode, getFilename, getObjectToSerialize, hashCode, restore, restoreBinary, restoreCSV, restoreGzipBinary, restoreHTML, restoreSQL, restoreXML, setDumpMode, setFilename, toString`

Methods inherited from class java.lang.Object
`finalize, getClass, notify, notifyAll, wait, wait, wait`

Constructor Detail

ProbabilisticParser

public ProbabilisticParser(java.lang.String pstrGrammarFilename)

Initializes probabilistic parser with the grammar filename.

Parameters:: pstrGrammarFilename - the filename of the probabilistic grammar

ProbabilisticParser

public ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer)

Initializes probabilistic parser with the specified tokenizer. By default sets the tokenizer not to fold anything into the lower case.

Parameters:: poStreamTokenizer - the stream tokenizer to read the tokens from
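
As an illustration of the case-folding remark, the following sketch builds a java.io.StreamTokenizer that preserves the original case and collects its word tokens. It uses only the standard library; the class TokenizerSketch and its methods are hypothetical helpers, and how ProbabilisticParser consumes the tokenizer internally is not shown on this page.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing a StreamTokenizer that does not fold words to lower case. */
final class TokenizerSketch {
    /** Creates a tokenizer over the given text with case folding switched on or off. */
    static StreamTokenizer tokenizer(String text, boolean foldToLowerCase) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        tokenizer.lowerCaseMode(foldToLowerCase); // false keeps "Dog" distinct from "dog"
        return tokenizer;
    }

    /** Reads all word tokens up to end of input, skipping numbers and punctuation. */
    static List<String> words(StreamTokenizer tokenizer) {
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    result.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words(tokenizer("The Dog barks", false))); // prints [The, Dog, barks]
        // A tokenizer configured this way could then be handed to the constructor above,
        // e.g. new ProbabilisticParser(tokenizer("The Dog barks", false)).
    }
}
```

Keeping the case matters when the grammar's lexical rules distinguish capitalized words, for example proper nouns, from their lower-case forms.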

ProbabilisticParser

public ProbabilisticParser()

Initializes default probabilistic parser with empty grammar.

Method Detail

parse

public boolean parse()
              throws SyntaxError

Performs parse of a natural language sentence using the CYK algorithm.

Returns:: true if the parse was successful
Throws:: SyntaxError - in case of some unusual syntax breakage

dumpBackPointersContents

public void dumpBackPointersContents()

Dumps back-pointers to the STDOUT.

dumpParseMatrix

public void dumpParseMatrix()

Dumps parse matrix to the STDOUT.

train

public boolean train()
              throws StorageException

Performs training of the parser by compiling the source probabilistic grammar and then dumping it onto disk as a precompiled binary file for future re-load.

Returns:: true if the training was successful
Throws:: StorageException - in case of any GrammarCompiler error
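
The dump-and-reload idea can be sketched with standard Java serialization. This is only an assumption about the general technique: the actual binary format written by train() is not described on this page, and the class GrammarDumpSketch, its methods, and the "grammar"/".bin" temporary file name are hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical sketch: write a compiled rule table to disk once, reload it later. */
final class GrammarDumpSketch {
    /** Serializes the object to the given file. */
    static void dump(Serializable object, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back whatever object was serialized to the given file. */
    static Object restore(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Dumps a rule table to a temporary file and loads it again. */
    static double[][][] roundTrip(double[][][] rules) {
        try {
            File file = File.createTempFile("grammar", ".bin");
            file.deleteOnExit();
            dump(rules, file);
            return (double[][][]) restore(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][][] rules = new double[2][2][2];
        rules[0][1][1] = 0.75; // P(A0 --> A1 A1) in a toy grammar
        System.out.println(roundTrip(rules)[0][1][1]); // prints 0.75
    }
}
```

Precompiling pays off when the source grammar is large: parsing the grammar text happens once during training, and later runs only deserialize the table.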

dumpParseTree

public void dumpParseTree()

Dumps parse tree to the STDOUT.

dumpParseTree

public void dumpParseTree(int piLevel,
                          int i,
                          int j,
                          int piA)

Dumps a parse sub-tree to the STDOUT. Initial level of S non-terminal should be 0.

Parameters:: piLevel - starting level (depth) of the tree; also acts as indentation marker; i - left index of the span; j - right index of the span; piA - the non-terminal index
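
A recursive dump of this kind might look like the sketch below, where the level doubles as the number of tabs to indent by. The class ParseTreeDumpSketch, the back-pointer layout back[i][j][A] = {m, B, C}, the 0-based indices, and writing into a StringBuilder instead of STDOUT are assumptions made so the example is self-contained; they are not taken from the MARF source.

```java
/** Hypothetical sketch of dumping a parse sub-tree from CYK back-pointers. */
final class ParseTreeDumpSketch {
    private final String[] words;
    private final String[] names;   // printable names of the non-terminals
    private final int[][][][] back; // back[i][j][A] = {m, B, C}, null for leaves

    ParseTreeDumpSketch(String[] words, String[] names, int[][][][] back) {
        this.words = words;
        this.names = names;
        this.back = back;
    }

    /** Appends the sub-tree of non-terminal a over words i..j, indented by level tabs. */
    void dump(StringBuilder out, int level, int i, int j, int a) {
        for (int t = 0; t < level; t++) {
            out.append('\t');
        }
        out.append(names[a]);
        if (i == j) {
            // A span of one word is a leaf: print the word next to its non-terminal.
            out.append(' ').append(words[i]).append('\n');
            return;
        }
        out.append('\n');
        int[] bp = back[i][j][a];
        if (bp == null) {
            return; // no recorded split for this span, nothing more to print
        }
        dump(out, level + 1, i, bp[0], bp[1]);     // left child B over i..m
        dump(out, level + 1, bp[0] + 1, j, bp[2]); // right child C over m+1..j
    }

    public static void main(String[] args) {
        String[] words = {"she", "eats", "fish"};
        String[] names = {"S", "NP", "VP", "V"};
        int[][][][] back = new int[3][3][4][];
        back[0][2][0] = new int[] {0, 1, 2}; // S --> NP VP, split after word 0
        back[1][2][2] = new int[] {1, 3, 1}; // VP --> V NP, split after word 1

        StringBuilder out = new StringBuilder();
        new ParseTreeDumpSketch(words, names, back).dump(out, 0, 0, 2, 0);
        System.out.print(out);
    }
}
```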

indent

protected void indent(int piTabSize)

Indents by the specified number of tabs.

Parameters:: piTabSize - the number of tab characters to indent by

getSentencePart

protected java.lang.String getSentencePart(int i,
                                           int j)

Gets a sentence span given indices.

Parameters:: i - leftmost word index; j - rightmost word index
Returns:: the sentence span string

setStreamTokenizer

public void setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer)

Allows setting desired stream tokenizer.

Parameters:: poStreamTokenizer - the NLP stream tokenizer to read tokens from

backSynchronizeObject

public void backSynchronizeObject()

Implements StorageManager interface.

Overrides:: backSynchronizeObject in class StorageManager

See Also:: StorageManager.backSynchronizeObject()

getMARFSourceCodeRevision

public static java.lang.String getMARFSourceCodeRevision()

Retrieves class' revision.

Returns:: revision string
