java.lang.Object
  marf.Storage.StorageManager
    marf.nlp.Parsing.ProbabilisticParser
public class ProbabilisticParser
A probabilistic parser parses a natural language (e.g. English) sentence given a probabilistic grammar. Because natural language sentences are ambiguous and a single sentence may have more than one parse, each grammar rule is assigned a probability, and the most probable parse is chosen according to those probabilities. This class implements the well-known CYK probabilistic parsing algorithm.
The CYK algorithm, as cited from the main reference, is reproduced below.
    function CYK(words, grammar) returns the most probable parse and its probability
        create and clear pi[num_words, num_words, num_nonterminals]

        # base case
        for i <-- 1 to num_words
            for A <-- 1 to num_nonterminals
                if (A --> wi) is in grammar then
                    pi[i, i, A] = P(A --> wi)

        # recursive case
        for span <-- 2 to num_words
            for begin <-- 1 to num_words - span + 1
                end <-- begin + span - 1
                for m <-- begin to end - 1
                    for A <-- 1 to num_nonterminals
                        for B <-- 1 to num_nonterminals
                            for C <-- 1 to num_nonterminals
                                prob = pi[begin, m, B] * pi[m + 1, end, C] * P(A --> BC)
                                if (prob > pi[begin, end, A]) then
                                    pi[begin, end, A] = prob
                                    back[begin, end, A] = {m, B, C}

        return build_tree(back[1, num_words, 1]), pi[1, num_words, 1]
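The pseudocode above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not MARF's implementation: the class `CykSketch`, its nested `Lexical` and `Binary` rule types, and the toy grammar in `demo()` are all hypothetical. Indices are 0-based rather than 1-based, non-terminal 0 plays the role of the start symbol, and the inner loops walk the list of binary rules instead of every (A, B, C) triple, which gives the same result because triples without a rule have probability zero.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of probabilistic CYK; none of these names come from MARF. */
final class CykSketch {
    /** Lexical rule A -> word with probability p; a is a non-terminal index. */
    static final class Lexical {
        final int a; final String word; final double p;
        Lexical(int a, String word, double p) { this.a = a; this.word = word; this.p = p; }
    }

    /** Binary rule A -> B C with probability p. */
    static final class Binary {
        final int a, b, c; final double p;
        Binary(int a, int b, int c, double p) { this.a = a; this.b = b; this.c = c; this.p = p; }
    }

    final String[] words;
    final String[] names;    // non-terminal names; index 0 is the start symbol
    final double[][][] pi;   // pi[begin][end][A]: best probability of A over words begin..end
    final int[][][][] back;  // back[begin][end][A] = {m, B, C}; null when no split was recorded

    CykSketch(String[] words, String[] names, List<Lexical> lexicon, List<Binary> rules) {
        int n = words.length;
        this.words = words;
        this.names = names;
        this.pi = new double[n][n][names.length];
        this.back = new int[n][n][names.length][];
        // Base case: spans of length one are filled from the lexical rules.
        for (int i = 0; i < n; i++) {
            for (Lexical r : lexicon) {
                if (r.word.equals(words[i]) && r.p > pi[i][i][r.a]) {
                    pi[i][i][r.a] = r.p;
                }
            }
        }
        // Recursive case: combine two adjacent sub-spans for each binary rule.
        for (int span = 2; span <= n; span++) {
            for (int begin = 0; begin + span <= n; begin++) {
                int end = begin + span - 1;
                for (int m = begin; m < end; m++) {
                    for (Binary r : rules) {
                        double prob = pi[begin][m][r.b] * pi[m + 1][end][r.c] * r.p;
                        if (prob > pi[begin][end][r.a]) {
                            pi[begin][end][r.a] = prob;
                            back[begin][end][r.a] = new int[] {m, r.b, r.c};
                        }
                    }
                }
            }
        }
    }

    /** Probability of the best parse rooted at the start symbol, or 0.0 if there is none. */
    double bestProbability() {
        return words.length == 0 ? 0.0 : pi[0][words.length - 1][0];
    }

    /** Bracketed best parse, or null when the sentence has no parse. */
    String bestTree() {
        return bestProbability() == 0.0 ? null : tree(0, words.length - 1, 0);
    }

    private String tree(int begin, int end, int a) {
        int[] bp = back[begin][end][a];
        if (bp == null) {
            return "(" + names[a] + " " + words[begin] + ")";
        }
        return "(" + names[a] + " " + tree(begin, bp[0], bp[1]) + " "
                + tree(bp[0] + 1, end, bp[2]) + ")";
    }

    /** Toy grammar: S -> NP VP, NP -> Det N, Det -> the, N -> dog, VP -> barks. */
    static CykSketch demo() {
        String[] names = {"S", "NP", "VP", "Det", "N"};
        List<Lexical> lexicon = Arrays.asList(
                new Lexical(3, "the", 1.0), new Lexical(4, "dog", 0.5), new Lexical(2, "barks", 0.4));
        List<Binary> rules = Arrays.asList(new Binary(0, 1, 2, 1.0), new Binary(1, 3, 4, 1.0));
        return new CykSketch(new String[] {"the", "dog", "barks"}, names, lexicon, rules);
    }

    public static void main(String[] args) {
        CykSketch cyk = demo();
        System.out.println(cyk.bestTree() + " p=" + cyk.bestProbability());
    }
}
```

For the toy grammar, the only parse is `(S (NP (Det the) (N dog)) (VP barks))`, and its probability is the product of the probabilities of the rules used.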
$Id: ProbabilisticParser.java,v 1.31 2007/12/18 21:37:54 mokhov Exp $
| Field Summary |
|---|

| Fields inherited from class marf.Storage.StorageManager |
|---|
| bDumpOnNotFound, iCurrentDumpMode, oObjectToSerialize, strFilename |

| Fields inherited from interface marf.Storage.IStorageManager |
|---|
| DUMP_BINARY, DUMP_CSV_TEXT, DUMP_GZIP_BINARY, DUMP_HTML, DUMP_SQL, DUMP_XML, MARF_INTERFACE_CODE_REVISION, STORAGE_FILE_EXTENSIONS |
| Constructor Summary | |
|---|---|
| ProbabilisticParser() | Initializes default probabilistic parser with empty grammar. |
| ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer) | Initializes probabilistic parser with the specified tokenizer. |
| ProbabilisticParser(java.lang.String pstrGrammarFilename) | Initializes probabilistic parser with the grammar filename. |
| Method Summary | | |
|---|---|---|
| void | backSynchronizeObject() | Implements StorageManager interface. |
| void | dumpBackPointersContents() | Dumps back-pointers to STDOUT. |
| void | dumpParseMatrix() | Dumps the parse matrix to STDOUT. |
| void | dumpParseTree() | Dumps the parse tree to STDOUT. |
| void | dumpParseTree(int piLevel, int i, int j, int piA) | Dumps a parse sub-tree to STDOUT. The initial level of the S non-terminal should be 0. |
| static java.lang.String | getMARFSourceCodeRevision() | Retrieves the class' revision. |
| protected java.lang.String | getSentencePart(int i, int j) | Gets a sentence span given indices. |
| protected void | indent(int piTabSize) | Indents by the specified number of tabs. |
| boolean | parse() | Performs a parse of a natural language sentence using the CYK algorithm. |
| void | setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer) | Allows setting the desired stream tokenizer. |
| boolean | train() | Performs training of the parser by compiling the source probabilistic grammar and then dumping it onto disk as a precompiled binary file for future re-load. |
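The compile-then-dump idea described for `train()` can be illustrated with standard Java serialization. This is only a sketch under assumptions: the text rule format (`LHS -> RHS... probability`), the class `GrammarDumpSketch`, and its `Rule` type are hypothetical, and nothing here reflects MARF's actual grammar compiler or its binary file layout.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: compile text rules once, dump them as binary, reload later. */
final class GrammarDumpSketch {
    /** One rule in an assumed text format: "LHS -> RHS... probability". */
    static final class Rule implements Serializable {
        private static final long serialVersionUID = 1L;
        final String lhs; final String[] rhs; final double p;
        Rule(String lhs, String[] rhs, double p) { this.lhs = lhs; this.rhs = rhs; this.p = p; }
    }

    /** "Compiles" lines such as "S -> NP VP 1.0" into rule objects. */
    static ArrayList<Rule> compile(List<String> lines) {
        ArrayList<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            // t[0] is the LHS, t[1] is "->", the last token is the probability.
            String[] rhs = Arrays.copyOfRange(t, 2, t.length - 1);
            rules.add(new Rule(t[0], rhs, Double.parseDouble(t[t.length - 1])));
        }
        return rules;
    }

    /** Writes the compiled rules to a binary file. */
    static void dump(ArrayList<Rule> rules, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(rules);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads previously dumped rules back without re-parsing the text grammar. */
    @SuppressWarnings("unchecked")
    static ArrayList<Rule> restore(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ArrayList<Rule>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Compiles two rules, dumps them to a temporary file, and restores them. */
    static ArrayList<Rule> demoRoundTrip() {
        try {
            Path file = Files.createTempFile("grammar", ".bin");
            try {
                dump(compile(Arrays.asList("S -> NP VP 1.0", "N -> dog 0.5")), file);
                return restore(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (Rule r : demoRoundTrip()) {
            System.out.println(r.lhs + " -> " + String.join(" ", r.rhs) + " [" + r.p + "]");
        }
    }
}
```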
| Methods inherited from class marf.Storage.StorageManager |
|---|
| clone, dump, dumpBinary, dumpCSV, dumpGzipBinary, dumpHTML, dumpSQL, dumpXML, enableDumpOnNotFound, equals, getDefaultExtension, getDefaultExtension, getDumpMode, getFilename, getObjectToSerialize, hashCode, restore, restoreBinary, restoreCSV, restoreGzipBinary, restoreHTML, restoreSQL, restoreXML, setDumpMode, setFilename, toString |

| Methods inherited from class java.lang.Object |
|---|
| finalize, getClass, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public ProbabilisticParser(java.lang.String pstrGrammarFilename)
Parameters: pstrGrammarFilename - the filename of the probabilistic grammar

public ProbabilisticParser(java.io.StreamTokenizer poStreamTokenizer)
Parameters: poStreamTokenizer - the stream tokenizer to read the tokens off

public ProbabilisticParser()
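One of the constructors accepts a `java.io.StreamTokenizer`, so a short sketch of how that standard class splits a sentence into words may help. How MARF configures its own tokenizer is not stated here; the settings below (lower-casing, treating `.` as an ordinary character) and the helper class `TokenizerSketch` are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing word extraction with java.io.StreamTokenizer. */
final class TokenizerSketch {
    /** Returns the word tokens of a sentence, lower-cased, skipping punctuation and numbers. */
    static List<String> words(String sentence) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(sentence));
        tokenizer.lowerCaseMode(true);
        // By default '.' counts as part of a number and would be glued onto
        // a preceding word ("barks."), so make it an ordinary character.
        tokenizer.ordinaryChar('.');
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    result.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("The dog barks."));
    }
}
```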
| Method Detail |
|---|
public boolean parse()
              throws SyntaxError
Returns: true if the parse was successful
Throws: SyntaxError - in case of some unusual syntax breakage

public void dumpBackPointersContents()

public void dumpParseMatrix()

public boolean train()
              throws StorageException
Returns: true if the training was successful
Throws: StorageException - in case of any GrammarCompiler error

public void dumpParseTree()
public void dumpParseTree(int piLevel,
                          int i,
                          int j,
                          int piA)
Parameters:
piLevel - starting level (depth) of the tree; also acts as indentation marker
i - left index of the span
j - right index of the span
piA - the non-terminal index

protected void indent(int piTabSize)
Parameters: piTabSize - the number of tab characters to indent by
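The interplay of a level-indexed tree dump, tab indentation, and sentence spans can be sketched as below. The class `TreeDumpSketch`, the hand-filled back-pointer table, the inclusive span indices, and the `NAME: words` output format are all assumptions for illustration; the text written by MARF's own `dumpParseTree` is not documented on this page. The sketch collects output in a `StringBuilder` rather than writing to STDOUT so that it is easy to inspect.

```java
import java.util.Arrays;

/** Hypothetical sketch of a recursive, indented parse-tree dump over back-pointers. */
final class TreeDumpSketch {
    final String[] words = {"the", "dog", "barks"};
    final String[] names = {"S", "NP", "VP", "Det", "N"};
    // back[i][j][A] = {m, B, C}: A over words i..j splits into B over i..m and C over m+1..j.
    // A null entry means A covers the span directly (a leaf in this sketch).
    final int[][][][] back = new int[3][3][5][];
    final StringBuilder out = new StringBuilder();

    TreeDumpSketch() {
        back[0][2][0] = new int[] {1, 1, 2}; // S  -> NP[0..1] VP[2..2]
        back[0][1][1] = new int[] {0, 3, 4}; // NP -> Det[0..0] N[1..1]
    }

    /** Appends the given number of tab characters. */
    void indent(int tabs) {
        for (int k = 0; k < tabs; k++) {
            out.append('\t');
        }
    }

    /** Joins words i..j (inclusive) with single spaces. */
    String sentencePart(int i, int j) {
        return String.join(" ", Arrays.copyOfRange(words, i, j + 1));
    }

    /** Dumps the sub-tree rooted at non-terminal a over words i..j, one level per tab. */
    void dump(int level, int i, int j, int a) {
        indent(level);
        out.append(names[a]).append(": ").append(sentencePart(i, j)).append('\n');
        int[] bp = back[i][j][a];
        if (bp != null) {
            dump(level + 1, i, bp[0], bp[1]);
            dump(level + 1, bp[0] + 1, j, bp[2]);
        }
    }

    public static void main(String[] args) {
        TreeDumpSketch sketch = new TreeDumpSketch();
        sketch.dump(0, 0, 2, 0); // level 0 for the S non-terminal
        System.out.print(sketch.out);
    }
}
```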
protected java.lang.String getSentencePart(int i,
                                           int j)
Parameters:
i - leftmost word index
j - rightmost word index

public void setStreamTokenizer(java.io.StreamTokenizer poStreamTokenizer)
Parameters: poStreamTokenizer - the NLP stream tokenizer to read off tokens from

public void backSynchronizeObject()
Overrides: backSynchronizeObject in class StorageManager
See Also: StorageManager.backSynchronizeObject()

public static java.lang.String getMARFSourceCodeRevision()