<!--{{{-->

<link rel='alternate' type='application/rss+xml' title='RSS' href='index.xml'/>

<!--}}}-->
Background: #fff

Foreground: #000

PrimaryPale: #8cf

PrimaryLight: #18f

PrimaryMid: #04b

PrimaryDark: #014

SecondaryPale: #ffc

SecondaryLight: #fe8

SecondaryMid: #db4

SecondaryDark: #841

TertiaryPale: #eee

TertiaryLight: #ccc

TertiaryMid: #999

TertiaryDark: #666

Error: #f88
/*{{{*/

body {background:[[ColorPalette::Background]]; color:[[ColorPalette::Foreground]];}



a {color:[[ColorPalette::PrimaryMid]];}

a:hover {background-color:[[ColorPalette::PrimaryMid]]; color:[[ColorPalette::Background]];}

a img {border:0;}



h1,h2,h3,h4,h5,h6 {color:[[ColorPalette::SecondaryDark]]; background:transparent;}

h1 {border-bottom:2px solid [[ColorPalette::TertiaryLight]];}

h2,h3 {border-bottom:1px solid [[ColorPalette::TertiaryLight]];}



.button {color:[[ColorPalette::PrimaryDark]]; border:1px solid [[ColorPalette::Background]];}

.button:hover {color:[[ColorPalette::PrimaryDark]]; background:[[ColorPalette::SecondaryLight]]; border-color:[[ColorPalette::SecondaryMid]];}

.button:active {color:[[ColorPalette::Background]]; background:[[ColorPalette::SecondaryMid]]; border:1px solid [[ColorPalette::SecondaryDark]];}



.header {background:[[ColorPalette::PrimaryMid]];}

.headerShadow {color:[[ColorPalette::Foreground]];}

.headerShadow a {font-weight:normal; color:[[ColorPalette::Foreground]];}

.headerForeground {color:[[ColorPalette::Background]];}

.headerForeground a {font-weight:normal; color:[[ColorPalette::PrimaryPale]];}



.tabSelected{color:[[ColorPalette::PrimaryDark]];

	background:[[ColorPalette::TertiaryPale]];

	border-left:1px solid [[ColorPalette::TertiaryLight]];

	border-top:1px solid [[ColorPalette::TertiaryLight]];

	border-right:1px solid [[ColorPalette::TertiaryLight]];

}

.tabUnselected {color:[[ColorPalette::Background]]; background:[[ColorPalette::TertiaryMid]];}

.tabContents {color:[[ColorPalette::PrimaryDark]]; background:[[ColorPalette::TertiaryPale]]; border:1px solid [[ColorPalette::TertiaryLight]];}

.tabContents .button {border:0;}



#sidebar {}

#sidebarOptions input {border:1px solid [[ColorPalette::PrimaryMid]];}

#sidebarOptions .sliderPanel {background:[[ColorPalette::PrimaryPale]];}

#sidebarOptions .sliderPanel a {border:none;color:[[ColorPalette::PrimaryMid]];}

#sidebarOptions .sliderPanel a:hover {color:[[ColorPalette::Background]]; background:[[ColorPalette::PrimaryMid]];}

#sidebarOptions .sliderPanel a:active {color:[[ColorPalette::PrimaryMid]]; background:[[ColorPalette::Background]];}



.wizard {background:[[ColorPalette::PrimaryPale]]; border:1px solid [[ColorPalette::PrimaryMid]];}

.wizard h1 {color:[[ColorPalette::PrimaryDark]]; border:none;}

.wizard h2 {color:[[ColorPalette::Foreground]]; border:none;}

.wizardStep {background:[[ColorPalette::Background]]; color:[[ColorPalette::Foreground]];

	border:1px solid [[ColorPalette::PrimaryMid]];}

.wizardStep.wizardStepDone {background:[[ColorPalette::TertiaryLight]];}

.wizardFooter {background:[[ColorPalette::PrimaryPale]];}

.wizardFooter .status {background:[[ColorPalette::PrimaryDark]]; color:[[ColorPalette::Background]];}

.wizard .button {color:[[ColorPalette::Foreground]]; background:[[ColorPalette::SecondaryLight]]; border: 1px solid;

	border-color:[[ColorPalette::SecondaryPale]] [[ColorPalette::SecondaryDark]] [[ColorPalette::SecondaryDark]] [[ColorPalette::SecondaryPale]];}

.wizard .button:hover {color:[[ColorPalette::Foreground]]; background:[[ColorPalette::Background]];}

.wizard .button:active {color:[[ColorPalette::Background]]; background:[[ColorPalette::Foreground]]; border: 1px solid;

	border-color:[[ColorPalette::PrimaryDark]] [[ColorPalette::PrimaryPale]] [[ColorPalette::PrimaryPale]] [[ColorPalette::PrimaryDark]];}



#messageArea {border:1px solid [[ColorPalette::SecondaryMid]]; background:[[ColorPalette::SecondaryLight]]; color:[[ColorPalette::Foreground]];}

#messageArea .button {color:[[ColorPalette::PrimaryMid]]; background:[[ColorPalette::SecondaryPale]]; border:none;}



.popupTiddler {background:[[ColorPalette::TertiaryPale]]; border:2px solid [[ColorPalette::TertiaryMid]];}



.popup {background:[[ColorPalette::TertiaryPale]]; color:[[ColorPalette::TertiaryDark]]; border-left:1px solid [[ColorPalette::TertiaryMid]]; border-top:1px solid [[ColorPalette::TertiaryMid]]; border-right:2px solid [[ColorPalette::TertiaryDark]]; border-bottom:2px solid [[ColorPalette::TertiaryDark]];}

.popup hr {color:[[ColorPalette::PrimaryDark]]; background:[[ColorPalette::PrimaryDark]]; border-bottom:1px;}

.popup li.disabled {color:[[ColorPalette::TertiaryMid]];}

.popup li a, .popup li a:visited {color:[[ColorPalette::Foreground]]; border: none;}

.popup li a:hover {background:[[ColorPalette::SecondaryLight]]; color:[[ColorPalette::Foreground]]; border: none;}

.popup li a:active {background:[[ColorPalette::SecondaryPale]]; color:[[ColorPalette::Foreground]]; border: none;}

.popupHighlight {background:[[ColorPalette::Background]]; color:[[ColorPalette::Foreground]];}

.listBreak div {border-bottom:1px solid [[ColorPalette::TertiaryDark]];}



.tiddler .defaultCommand {font-weight:bold;}



.shadow .title {color:[[ColorPalette::TertiaryDark]];}



.title {color:[[ColorPalette::SecondaryDark]];}

.subtitle {color:[[ColorPalette::TertiaryDark]];}



.toolbar {color:[[ColorPalette::PrimaryMid]];}

.toolbar a {color:[[ColorPalette::TertiaryLight]];}

.selected .toolbar a {color:[[ColorPalette::TertiaryMid]];}

.selected .toolbar a:hover {color:[[ColorPalette::Foreground]];}



.tagging, .tagged {border:1px solid [[ColorPalette::TertiaryPale]]; background-color:[[ColorPalette::TertiaryPale]];}

.selected .tagging, .selected .tagged {background-color:[[ColorPalette::TertiaryLight]]; border:1px solid [[ColorPalette::TertiaryMid]];}

.tagging .listTitle, .tagged .listTitle {color:[[ColorPalette::PrimaryDark]];}

.tagging .button, .tagged .button {border:none;}



.footer {color:[[ColorPalette::TertiaryLight]];}

.selected .footer {color:[[ColorPalette::TertiaryMid]];}



.sparkline {background:[[ColorPalette::PrimaryPale]]; border:0;}

.sparktick {background:[[ColorPalette::PrimaryDark]];}



.error, .errorButton {color:[[ColorPalette::Foreground]]; background:[[ColorPalette::Error]];}

.warning {color:[[ColorPalette::Foreground]]; background:[[ColorPalette::SecondaryPale]];}

.lowlight {background:[[ColorPalette::TertiaryLight]];}



.zoomer {background:none; color:[[ColorPalette::TertiaryMid]]; border:3px solid [[ColorPalette::TertiaryMid]];}



.imageLink, #displayArea .imageLink {background:transparent;}



.annotation {background:[[ColorPalette::SecondaryLight]]; color:[[ColorPalette::Foreground]]; border:2px solid [[ColorPalette::SecondaryMid]];}



.viewer .listTitle {list-style-type:none; margin-left:-2em;}

.viewer .button {border:1px solid [[ColorPalette::SecondaryMid]];}

.viewer blockquote {border-left:3px solid [[ColorPalette::TertiaryDark]];}



.viewer table, table.twtable {border:2px solid [[ColorPalette::TertiaryDark]];}

.viewer th, .viewer thead td, .twtable th, .twtable thead td {background:[[ColorPalette::SecondaryMid]]; border:1px solid [[ColorPalette::TertiaryDark]]; color:[[ColorPalette::Background]];}

.viewer td, .viewer tr, .twtable td, .twtable tr {border:1px solid [[ColorPalette::TertiaryDark]];}



.viewer pre {border:1px solid [[ColorPalette::SecondaryLight]]; background:[[ColorPalette::SecondaryPale]];}

.viewer code {color:[[ColorPalette::SecondaryDark]];}

.viewer hr {border:0; border-top:dashed 1px [[ColorPalette::TertiaryDark]]; color:[[ColorPalette::TertiaryDark]];}



.highlight, .marked {background:[[ColorPalette::SecondaryLight]];}



.editor input {border:1px solid [[ColorPalette::PrimaryMid]];}

.editor textarea {border:1px solid [[ColorPalette::PrimaryMid]]; width:100%;}

.editorFooter {color:[[ColorPalette::TertiaryMid]];}



#backstageArea {background:[[ColorPalette::Foreground]]; color:[[ColorPalette::TertiaryMid]];}

#backstageArea a {background:[[ColorPalette::Foreground]]; color:[[ColorPalette::Background]]; border:none;}

#backstageArea a:hover {background:[[ColorPalette::SecondaryLight]]; color:[[ColorPalette::Foreground]]; }

#backstageArea a.backstageSelTab {background:[[ColorPalette::Background]]; color:[[ColorPalette::Foreground]];}

#backstageButton a {background:none; color:[[ColorPalette::Background]]; border:none;}

#backstageButton a:hover {background:[[ColorPalette::Foreground]]; color:[[ColorPalette::Background]]; border:none;}

#backstagePanel {background:[[ColorPalette::Background]]; border-color: [[ColorPalette::Background]] [[ColorPalette::TertiaryDark]] [[ColorPalette::TertiaryDark]] [[ColorPalette::TertiaryDark]];}

.backstagePanelFooter .button {border:none; color:[[ColorPalette::Background]];}

.backstagePanelFooter .button:hover {color:[[ColorPalette::Foreground]];}

#backstageCloak {background:[[ColorPalette::Foreground]]; opacity:0.6; filter:'alpha(opacity:60)';}

/*}}}*/
/*{{{*/

* html .tiddler {height:1%;}



body {font-size:.75em; font-family:arial,helvetica; margin:0; padding:0;}



h1,h2,h3,h4,h5,h6 {font-weight:bold; text-decoration:none;}

h1,h2,h3 {padding-bottom:1px; margin-top:1.2em;margin-bottom:0.3em;}

h4,h5,h6 {margin-top:1em;}

h1 {font-size:1.35em;}

h2 {font-size:1.25em;}

h3 {font-size:1.1em;}

h4 {font-size:1em;}

h5 {font-size:.9em;}



hr {height:1px;}



a {text-decoration:none;}



dt {font-weight:bold;}



ol {list-style-type:decimal;}

ol ol {list-style-type:lower-alpha;}

ol ol ol {list-style-type:lower-roman;}

ol ol ol ol {list-style-type:decimal;}

ol ol ol ol ol {list-style-type:lower-alpha;}

ol ol ol ol ol ol {list-style-type:lower-roman;}

ol ol ol ol ol ol ol {list-style-type:decimal;}



.txtOptionInput {width:11em;}



#contentWrapper .chkOptionInput {border:0;}



.externalLink {text-decoration:underline;}



.indent {margin-left:3em;}

.outdent {margin-left:3em; text-indent:-3em;}

code.escaped {white-space:nowrap;}



.tiddlyLinkExisting {font-weight:bold;}

.tiddlyLinkNonExisting {font-style:italic;}



/* the 'a' is required for IE, otherwise it renders the whole tiddler in bold */

a.tiddlyLinkNonExisting.shadow {font-weight:bold;}



#mainMenu .tiddlyLinkExisting,

	#mainMenu .tiddlyLinkNonExisting,

	#sidebarTabs .tiddlyLinkNonExisting {font-weight:normal; font-style:normal;}

#sidebarTabs .tiddlyLinkExisting {font-weight:bold; font-style:normal;}



.header {position:relative;}

.header a:hover {background:transparent;}

.headerShadow {position:relative; padding:4.5em 0em 1em 1em; left:-1px; top:-1px;}

.headerForeground {position:absolute; padding:4.5em 0em 1em 1em; left:0px; top:0px;}



.siteTitle {font-size:3em;}

.siteSubtitle {font-size:1.2em;}



#mainMenu {position:absolute; left:0; width:10em; text-align:right; line-height:1.6em; padding:1.5em 0.5em 0.5em 0.5em; font-size:1.1em;}



#sidebar {position:absolute; right:3px; width:16em; font-size:.9em;}

#sidebarOptions {padding-top:0.3em;}

#sidebarOptions a {margin:0em 0.2em; padding:0.2em 0.3em; display:block;}

#sidebarOptions input {margin:0.4em 0.5em;}

#sidebarOptions .sliderPanel {margin-left:1em; padding:0.5em; font-size:.85em;}

#sidebarOptions .sliderPanel a {font-weight:bold; display:inline; padding:0;}

#sidebarOptions .sliderPanel input {margin:0 0 .3em 0;}

#sidebarTabs .tabContents {width:15em; overflow:hidden;}



.wizard {padding:0.1em 1em 0em 2em;}

.wizard h1 {font-size:2em; font-weight:bold; background:none; padding:0em 0em 0em 0em; margin:0.4em 0em 0.2em 0em;}

.wizard h2 {font-size:1.2em; font-weight:bold; background:none; padding:0em 0em 0em 0em; margin:0.4em 0em 0.2em 0em;}

.wizardStep {padding:1em 1em 1em 1em;}

.wizard .button {margin:0.5em 0em 0em 0em; font-size:1.2em;}

.wizardFooter {padding:0.8em 0.4em 0.8em 0em;}

.wizardFooter .status {padding:0em 0.4em 0em 0.4em; margin-left:1em;}

.wizard .button {padding:0.1em 0.2em 0.1em 0.2em;}



#messageArea {position:fixed; top:2em; right:0em; margin:0.5em; padding:0.5em; z-index:2000; _position:absolute;}

.messageToolbar {display:block; text-align:right; padding:0.2em 0.2em 0.2em 0.2em;}

#messageArea a {text-decoration:underline;}



.tiddlerPopupButton {padding:0.2em 0.2em 0.2em 0.2em;}

.popupTiddler {position: absolute; z-index:300; padding:1em 1em 1em 1em; margin:0;}



.popup {position:absolute; z-index:300; font-size:.9em; padding:0; list-style:none; margin:0;}

.popup .popupMessage {padding:0.4em;}

.popup hr {display:block; height:1px; width:auto; padding:0; margin:0.2em 0em;}

.popup li.disabled {padding:0.4em;}

.popup li a {display:block; padding:0.4em; font-weight:normal; cursor:pointer;}

.listBreak {font-size:1px; line-height:1px;}

.listBreak div {margin:2px 0;}



.tabset {padding:1em 0em 0em 0.5em;}

.tab {margin:0em 0em 0em 0.25em; padding:2px;}

.tabContents {padding:0.5em;}

.tabContents ul, .tabContents ol {margin:0; padding:0;}

.txtMainTab .tabContents li {list-style:none;}

.tabContents li.listLink { margin-left:.75em;}



#contentWrapper {display:block;}

#splashScreen {display:none;}



#displayArea {margin:1em 17em 0em 14em;}



.toolbar {text-align:right; font-size:.9em;}



.tiddler {padding:1em 1em 0em 1em;}



.missing .viewer,.missing .title {font-style:italic;}



.title {font-size:1.6em; font-weight:bold;}



.missing .subtitle {display:none;}

.subtitle {font-size:1.1em;}



.tiddler .button {padding:0.2em 0.4em;}



.tagging {margin:0.5em 0.5em 0.5em 0; float:left; display:none;}

.isTag .tagging {display:block;}

.tagged {margin:0.5em; float:right;}

.tagging, .tagged {font-size:0.9em; padding:0.25em;}

.tagging ul, .tagged ul {list-style:none; margin:0.25em; padding:0;}

.tagClear {clear:both;}



.footer {font-size:.9em;}

.footer li {display:inline;}



.annotation {padding:0.5em; margin:0.5em;}



* html .viewer pre {width:99%; padding:0 0 1em 0;}

.viewer {line-height:1.4em; padding-top:0.5em;}

.viewer .button {margin:0em 0.25em; padding:0em 0.25em;}

.viewer blockquote {line-height:1.5em; padding-left:0.8em;margin-left:2.5em;}

.viewer ul, .viewer ol {margin-left:0.5em; padding-left:1.5em;}



.viewer table, table.twtable {border-collapse:collapse; margin:0.8em 1.0em;}

.viewer th, .viewer td, .viewer tr,.viewer caption,.twtable th, .twtable td, .twtable tr,.twtable caption {padding:3px;}

table.listView {font-size:0.85em; margin:0.8em 1.0em;}

table.listView th, table.listView td, table.listView tr {padding:0px 3px 0px 3px;}



.viewer pre {padding:0.5em; margin-left:0.5em; font-size:1.2em; line-height:1.4em; overflow:auto;}

.viewer code {font-size:1.2em; line-height:1.4em;}



.editor {font-size:1.1em;}

.editor input, .editor textarea {display:block; width:100%; font:inherit;}

.editorFooter {padding:0.25em 0em; font-size:.9em;}

.editorFooter .button {padding-top:0px; padding-bottom:0px;}



.fieldsetFix {border:0; padding:0; margin:1px 0px 1px 0px;}



.sparkline {line-height:1em;}

.sparktick {outline:0;}



.zoomer {font-size:1.1em; position:absolute; overflow:hidden;}

.zoomer div {padding:1em;}



* html #backstage {width:99%;}

* html #backstageArea {width:99%;}

#backstageArea {display:none; position:relative; overflow: hidden; z-index:150; padding:0.3em 0.5em 0.3em 0.5em;}

#backstageToolbar {position:relative;}

#backstageArea a {font-weight:bold; margin-left:0.5em; padding:0.3em 0.5em 0.3em 0.5em;}

#backstageButton {display:none; position:absolute; z-index:175; top:0em; right:0em;}

#backstageButton a {padding:0.1em 0.4em 0.1em 0.4em; margin:0.1em 0.1em 0.1em 0.1em;}

#backstage {position:relative; width:100%; z-index:50;}

#backstagePanel {display:none; z-index:100; position:absolute; margin:0em 3em 0em 3em; padding:1em 1em 1em 1em;}

.backstagePanelFooter {padding-top:0.2em; float:right;}

.backstagePanelFooter a {padding:0.2em 0.4em 0.2em 0.4em;}

#backstageCloak {display:none; z-index:20; position:absolute; width:100%; height:100px;}



.whenBackstage {display:none;}

.backstageVisible .whenBackstage {display:block;}

/*}}}*/
/***

StyleSheet for use when a translation requires any css style changes.

This StyleSheet can be used directly by languages such as Chinese, Japanese and Korean which need larger font sizes.

***/

/*{{{*/

body {font-size:0.8em;}

#sidebarOptions {font-size:1.05em;}

#sidebarOptions a {font-style:normal;}

#sidebarOptions .sliderPanel {font-size:0.95em;}

.subtitle {font-size:0.8em;}

.viewer table.listView {font-size:0.95em;}

/*}}}*/
/*{{{*/

@media print {

#mainMenu, #sidebar, #messageArea, .toolbar, #backstageButton, #backstageArea {display: none ! important;}

#displayArea {margin: 1em 1em 0em 1em;}

/* Fixes a feature in Firefox 1.5.0.2 where print preview displays the noscript content */

noscript {display:none;}

}

/*}}}*/
<!--{{{-->

<div class='header' macro='gradient vert [[ColorPalette::PrimaryLight]] [[ColorPalette::PrimaryMid]]'>

<div class='headerShadow'>

<span class='siteTitle' refresh='content' tiddler='SiteTitle'></span>&nbsp;

<span class='siteSubtitle' refresh='content' tiddler='SiteSubtitle'></span>

</div>

<div class='headerForeground'>

<span class='siteTitle' refresh='content' tiddler='SiteTitle'></span>&nbsp;

<span class='siteSubtitle' refresh='content' tiddler='SiteSubtitle'></span>

</div>

</div>

<div id='mainMenu' refresh='content' tiddler='MainMenu'></div>

<div id='sidebar'>

<div id='sidebarOptions' refresh='content' tiddler='SideBarOptions'></div>

<div id='sidebarTabs' refresh='content' force='true' tiddler='SideBarTabs'></div>

</div>

<div id='displayArea'>

<div id='messageArea'></div>

<div id='tiddlerDisplay'></div>

</div>

<!--}}}-->
<!--{{{-->

<div class='toolbar' macro='toolbar closeTiddler closeOthers +editTiddler > fields syncing permalink references jump'></div>

<div class='title' macro='view title'></div>

<div class='subtitle'><span macro='view modifier link'></span>, <span macro='view modified date'></span> (<span macro='message views.wikified.createdPrompt'></span> <span macro='view created date'></span>)</div>

<div class='tagging' macro='tagging'></div>

<div class='tagged' macro='tags'></div>

<div class='viewer' macro='view text wikified'></div>

<div class='tagClear'></div>

<!--}}}-->
<!--{{{-->

<div class='toolbar' macro='toolbar +saveTiddler -cancelTiddler deleteTiddler'></div>

<div class='title' macro='view title'></div>

<div class='editor' macro='edit title'></div>

<div macro='annotations'></div>

<div class='editor' macro='edit text'></div>

<div class='editor' macro='edit tags'></div><div class='editorFooter'><span macro='message views.editor.tagPrompt'></span><span macro='tagChooser'></span></div>

<!--}}}-->
To get started with this blank TiddlyWiki, you'll need to modify the following tiddlers:

* SiteTitle & SiteSubtitle: The title and subtitle of the site, as shown above (after saving, they will also appear in the browser title bar)

* MainMenu: The menu (usually on the left)

* DefaultTiddlers: Contains the names of the tiddlers that you want to appear when the TiddlyWiki is opened

You'll also need to enter your username for signing your edits: <<option txtUserName>>
These InterfaceOptions for customising TiddlyWiki are saved in your browser



Your username for signing your edits. Write it as a WikiWord (eg JoeBloggs)



<<option txtUserName>>

<<option chkSaveBackups>> SaveBackups

<<option chkAutoSave>> AutoSave

<<option chkRegExpSearch>> RegExpSearch

<<option chkCaseSensitiveSearch>> CaseSensitiveSearch

<<option chkAnimate>> EnableAnimations



----

Also see AdvancedOptions
<<importTiddlers>>
* Git synchronization between local and remote (WING) repos
* Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0:
** two-level morphology method (Koskenniemi, 1983)
** Existing morphological analysers based on the two-level morphology method are language-dependent, expensive (requiring expert knowledge and labour), and must be revised as languages change
** Describes the baseline and frequency-based versions of Morfessor using MDL (publications 2000, 2002). Recent modifications using MAP (2004, 2005) are not included in the released version
* extract function words using frequency --> how to filter incorrect function words?
* read chapter 12 of the NLP books to search for a description of "function words". Happened to read about: CFG, TreeBank, Tgrep2

* To be done: talk to someone about function word extraction
* Read the Knight tutorial to write about IBM Model 3. Stopped at point 19
* LaTeX HYP style problems: not generating "et al." for many authors, and lowercasing titles (such as SMT, MT)

* To be done: make it clear in the thesis, when talking about source and target sentences, whether this is meant literally or through the framework of the source-channel approach
* To be done: check out the "Morphology-aware paper ..." on the linguistic analysis of Finnish, Spanish, and Danish
* Check if there's a translation for "aamiais" (or whether its probability is very low and there is a very similar word); otherwise find the closest translation

1       aaltovoima      (aalto voima)   (STM STM) aalto/STM + voima/STM

5       aamiaista       (aamiaista)     (STM) aamiaista/STM
1       aamiaisten      (aamiais ten)   (STM SUF) aamiais/STM + ten/SUF

Finnish:  	aamiaista (inflected form of aamiainen)
Finnish:  	aamiainen
# English: 	breakfast

1       aamiaisruokatehdas      (aamiais ruoka tehdas)  (STM PRE STM) aamiais/STM + ruoka/PRE + tehdas/STM
5       aamiaista       (aamiaista)     (STM) aamiaista/STM

Finnish:  	tehdas
# English: 	factory
# mill
# plant
# works

Finnish:  	ruoka
# English: 	board
# dish
# food
* Generate words list in English (total freq = 7885761), Spanish (total freq = 8814072)
* To be checked on the MOSES website: 
** factored training http://www.statmt.org/moses/?n=FactoredTraining.FactoredTraining
** GIZA++: http://www.statmt.org/moses/?n=FactoredTraining.RunGIZA
** align word: http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
** tuning: http://www.statmt.org/moses/?n=FactoredTraining.Tuning

* MOSES: what are the 7 distortion weights, 1 language model weight, 5 translation model weights, and 1 word penalty in tuning/?
http://www.statmt.org/moses

* In corpus/:
** en.vcb (es.vcb): English vocabulary. Format:
(index word freq)

** en-es-int-train.snt (es-en-int-train.snt): sentence-aligned corpus (in index form)
(
freq
es_1 ..... es_m
en_1 ..... en_n
)
freq: frequency of the sentence pair
where es_i refers to an index in es.vcb, and en_j refers to an index in en.vcb


** en.vcb.classes (es.vcb.classes): word classes
(word class_index)
*** what algorithm is used to classify the words?

** europarl.clean.en (europarl.clean.es): training corpus after the cleaning process performed by clean-corpus-n.perl, which:
*** removes empty lines
*** removes redundant spaces
*** drops lines that are empty, too short, too long, or violate the 9-1 sentence-length ratio limit required by GIZA++
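The vcb/snt layouts above can be read with a few lines of code; a minimal Python sketch (the exact column layout is assumed from the notes above, not checked against the GIZA++ sources):

```python
def parse_vcb(lines):
    """Parse vocabulary lines of the form: index word freq."""
    vocab = {}
    for line in lines:
        idx, word, freq = line.split()
        vocab[int(idx)] = (word, int(freq))
    return vocab

def parse_snt(lines):
    """Parse sentence-aligned triples: a freq line, then source indices, then target indices."""
    pairs = []
    it = iter(lines)
    for freq in it:
        src = [int(i) for i in next(it).split()]
        tgt = [int(i) for i in next(it).split()]
        pairs.append((int(freq), src, tgt))
    return pairs
```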

* In model/:
** phrase-table.0-0.gz
Currently, five different phrase translation scores are computed:
    * phrase translation probability phi(e|f)
    * lexical weighting lex(e|f)
    * phrase translation probability phi(f|e)
    * lexical weighting lex(f|e)
    * phrase penalty (always exp(1) = 2.718) 
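A phrase-table entry can be split into those five scores; a small Python sketch (the `|||` separator is standard in Moses, but the score ordering here simply follows the list above and is an assumption):

```python
def parse_phrase_table_line(line):
    """Split a 'src ||| tgt ||| s1 s2 s3 s4 s5' phrase-table line
    into the five translation scores listed above (order assumed)."""
    src, tgt, scores = [f.strip() for f in line.split("|||")]
    phi_e_f, lex_e_f, phi_f_e, lex_f_e, penalty = map(float, scores.split())
    return {"src": src, "tgt": tgt,
            "phi(e|f)": phi_e_f, "lex(e|f)": lex_e_f,
            "phi(f|e)": phi_f_e, "lex(f|e)": lex_f_e,
            "penalty": penalty}
```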
only politiikka is translated to politics. However, there are many different surface forms containing the STM "politiiko". How can that be resolved?
* check if there's a translation politiiko -> politics, and how the translation probability differs when the source word is politiikka vs. politiiko

how to learn the relationship among words that share a segmentation component?

* now understand what lex.0-0.f2n and lex.0-0.n2f are

* noon is translated as 12pm
* To be done:
** run the clean-corpus script for all datasets (done for dataset0)
** write scripts to train all datasets

* uncomment in bin/moses-scripts/scripts-20080213-1421/training/clean-corpus-n.perl
  # $e =~ s/\|//g;  # kinda hurts in factored input
  $e =~ s/\s+/ /g;  # collapse runs of whitespace into single spaces
  $e =~ s/^ //;     # strip leading space
  $e =~ s/ $//;     # strip trailing space
  # $f =~ s/\|//g;  # kinda hurts in factored input
* corpus after cleaning
Input sentences: 54645  Output sentences:  30995
Input sentences: 58024  Output sentences:  35365
Input sentences: 69858  Output sentences:  38075
Input sentences: 69770  Output sentences:  42870
Input sentences: 70877  Output sentences:  40772
Input sentences: 80739  Output sentences:  45251
Input sentences: 81425  Output sentences:  47572
Input sentences: 89708  Output sentences:  54764
Input sentences: 119285  Output sentences:  73434
Input sentences: 171170  Output sentences:  94647
* Modify TranslationOptionCollection to add translation option for OOV words by using their stems
** In CreateTranslationOptionsForRange, DecodeStepTranslation::ProcessInitial
In DecodeStepTranslation::ProcessInitialTranslation, print 
	[kotimaani perinteet ; 0-1]
	in my , pC=-1.041, c=-4.914
	in my own , pC=-2.398, c=-8.957
	in my own country , pC=-3.616, c=-12.862
	in my own country , , pC=-4.336, c=-14.176
(os << static_cast<const Phrase&>(tp) << ", pC=" << tp.m_transScore << ", c=" << tp.m_fullScore;  ''TargetPhrase'')

** TranslationOption printout

in my own c=-8.957 [ [0..1] ]<<0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, -7.374, 0.000, -4.820, -1.386, -6.782, 1.000>>

			out << possibleTranslation.GetTargetPhrase() 
			<< "c=" << possibleTranslation.GetFutureScore()
			<< " [" << possibleTranslation.GetSourceWordsRange() << "]"
			<< possibleTranslation.GetScoreBreakdown();

	//! in TranslationOption, m_scoreBreakdown is not complete.  It cannot,
	//! for example, know the full n-gram score since the length of the
	//! TargetPhrase may be shorter than the n-gram order.  But, if it is
	//! possible to estimate, it is included here.
	ScoreComponentCollection	m_scoreBreakdown; (in the printout, the last 5 figures come from the phrase table; the rest are 0)

scoreBreakdown in a translation option is from the target phrase, which is set during phrase table loading
PhraseDictionaryMemory::Load

** The future score is calculated in void TranslationOptionCollection::CalcFutureScore()

* Components needed for a simple direct MT system: 
** what programming language, Perl? What are the source and target languages to work on, Spanish to English?
** good test sentences in the source language: half development set, half unseen set
** bilingual dictionary
** stemming or simple morphological analysis (use Porter stemming)
-> errors -> general rules for correcting translation mistakes
** part-of-speech tagger to be run on English output
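Those components can be wired into a toy direct translator; a minimal sketch under the assumptions above (the stemmer is a crude stand-in for Porter stemming, and the tiny dictionary is hypothetical):

```python
def crude_stem(word):
    """Stand-in for a real stemmer: strip a few common endings."""
    for suf in ("mente", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def direct_translate(sentence, dictionary):
    """Word-for-word gloss: look up each token, fall back to its stem,
    else copy the token through unchanged (a likely source of the
    'general rules for correcting translation mistakes' above)."""
    out = []
    for tok in sentence.lower().split():
        out.append(dictionary.get(tok) or dictionary.get(crude_stem(tok)) or tok)
    return " ".join(out)
```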


* (Och and Ney 2003)
** first-order dependence between word positions and the fertility model improve performance
** smoothing and symmetrization: significant effect on alignment quality
** expect improvements from: adoption of cognates (Simard, Foster, and Isabelle 1992), a statistical alignment model based on word groups rather than single words (Och, Tillmann, and Ney 1999), and explicitly dealing with the hierarchical structure of natural language (Wu 1996; Yamada and Knight 2001)

what is Model 6? Does it differ only in the decomposition of Pr(f^J, a^J | e^J)? What is the linkage with the first-order dependence between word positions and the fertility model?

* (Och and Ney 2004)
** many-to-many relations
** log-linear model approach
* Morfessor understanding:
** handles compositionality (composition and perturbation) (no notion of composition in Morfessor Baseline)
** handles highly inflecting and compounding languages
** stems can alternate with prefixes and suffixes, but no suffix at the beginning and no prefix at the end; it is not possible to move directly from a prefix to a suffix
** Morfessor MAP utilizes information about word frequency while Morfessor ML does not
** existence of non-morphemes -> alleviates oversegmentation
** Morfessor MAP files:
*** alphabetprobs: probability for each character of the alphabet
*** baseline* -> joined.* -> resplit.*

* Read "INDUCING THE MORPHOLOGICAL LEXICON OF A NATURAL LANGUAGE FROM"; stopped at MATHEMATICAL FORMULATION OF THE MODEL & SEARCH ALGORITHM.

* To be done: 
** write script to read morphological analysis from Morfessor
** check Perl parsing XML docs
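A reader for the Morfessor segmentation lines shown earlier (e.g. `1  aamiaisten  (aamiais ten)  (STM SUF) aamiais/STM + ten/SUF`) could be sketched as follows; the column layout is assumed from those examples, not from the Morfessor documentation:

```python
def parse_morfessor_line(line):
    """Parse 'count word (morphs) (tags) morph/TAG + morph/TAG'
    into (count, word, [(morph, tag), ...])."""
    count, word, rest = line.split(None, 2)
    analysis = rest.rsplit(")", 1)[1]  # keep only the morph/TAG part
    morphs = []
    for part in analysis.split("+"):
        morph, tag = part.strip().split("/")
        morphs.append((morph, tag))
    return int(count), word, morphs
```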

12/01/08
* Statistical phrase-based post-editing

* To be done
** check out the Arabic-English corpus "Overview of the IWSLT 2005 evaluation campaign"
** fix bib entry

* http://www.qamus.org/morphology.htm: Buckwalter morphology
* http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02: Buckwalter morphological analysis tool 2.0
* Europarl: A Parallel Corpus for Statistical Machine Translation
"On the one side, you can find the Romance languages Spanish, French, Portuguese and Italian; on the other side, the Germanic languages Danish, Swedish, English, Dutch and German. The close languages Danish and Swedish, as well as Dutch and German, are grouped together first. The graph is not perfect: one would suspect Spanish and Portuguese to be joined first, but Spanish is first joined with French."
* finalize all scripts for handling corpus data: process, sentence, tokenize, lowercase
* generate word list for Finnish: total number of words = 517766, total freq = 21282796
for English: total number of words = 65015, total freq = 31438971
* To be done: 
** consult Min on handling Unicode for Finnish
** check Finnish morphological analysis
** sort the Finnish word list alphabetically to see if there are any duplicate entries, to confirm the overwhelming Finnish word count (517766) vs. that of English
Task for this week
* Cut-off for function words: consider the top 300 words, and cut off based on character length
** English: < 6 characters, see [[English function word cutoff]]
** Finnish: < 8 characters, see [[Finnish function word cutoff]]
-> func_extractor.pl
-> ''func_extractor.pl -in inFile -out outFile -n topNwords -l charLen''
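The cutoff above can be sketched in a few lines; a minimal illustration of the frequency-plus-length filter (not the actual func_extractor.pl, whose options are listed above):

```python
from collections import Counter

def extract_function_words(tokens, top_n=300, max_len=6):
    """Candidate function words: among the top_n most frequent tokens,
    keep those shorter than max_len characters (the English cutoff above)."""
    counts = Counter(tokens)
    return [w for w, _ in counts.most_common(top_n) if len(w) < max_len]
```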
* write scripts to remove function words given a corpus & word list -> modified word corpus
-> remove_funcword.pl script
-> ''remove_funcword.pl -in inFile -word wordFile -out outFile''
* write scripts to decompose words into morpheme sequences given a corpus and a morpheme dictionary -> modified morpheme corpus
-> word2morpheme.pl
-> ''word2morpheme.pl -in inFile -morph morphFile -out outFile''
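The word-to-morpheme substitution itself is simple; a minimal sketch (an illustration of the idea, not the actual word2morpheme.pl):

```python
def morphemize(sentence, morph_dict):
    """Replace each word by its space-separated morpheme sequence when the
    morpheme dictionary has an entry; unknown words pass through unchanged."""
    out = []
    for word in sentence.split():
        out.extend(morph_dict.get(word, [word]))
    return " ".join(out)
```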
* train morph language model
** Remove function words, and morphemize europarl.lowercased in /home/lmthang/HYP/acl07/lm -> europarl.morph
** /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
-> this is at the morpheme level; ''might need to consider a higher-order n-gram''
* train modified word corpora as well as modified morpheme corpora
../../src/scripts/remove_funcword.pl -in 100.train.en -word ../../src/scripts/my_data/funcword.en -out 100.modified.en
../../src/scripts/remove_funcword.pl -in 100.train.fi -word ../../src/scripts/my_data/funcword.fi -out 100.modified.fi

../../src/scripts/word2morpheme.pl -in 100.modified.en -morph ../../src/scripts/my_data/morph.proc.en -out 100.morph.en
../../src/scripts/word2morpheme.pl -in 100.modified.fi -morph ../../src/scripts/my_data/morph.proc.fi -out 100.morph.fi

../../bin/moses-scripts/scripts-20080213-1421/training/train-factored-phrase-model.perl -scripts-root-dir ../../bin/moses-scripts/scripts-20080213-1421 -root-dir result_10 -first-step 1 -last-step 4 -corpus 100.morph -f fi -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:../../acl07/lm/europarl.lm:0 1>stdout 2>stderr

-> obtain word-word (morpheme-morpheme) prob from modified bi-corpora (run only step 1->4 of train-factored-phrase-model.perl)
-> preliminary analysis for the translation of "council", "citizen", "europe", "commission" ''favors the morph-morph probs'' over the word-word (function words removed) probs and the normal word-word probs.
* how to inject these probabilities into MOSES
** See how phrases are picked out from input sentences
** See how phrases are looked up in phrase tables
*acl05_filtered_10 m-m
Hmm: Iteration 5
Reading more sentence pairs into memory ...
ERROR2: nan nan nanN:

*acl05_filtered_10 w-m
WARNING: The following sentence pair has source/target sentence length ratio more than
the maximum allowed limit for a source word fertility
 source length = 1 target length = 11 ratio 11 fertility limit : 9
Shortening sentence
Sent No: 253882 , No. Occurrences: 1

WARNING: already 61 iterations in hillclimb: 3.08741 2 52 23

PROBLEM: alignment is 0.
WARNING: Hill Climbing yielded a zero score viterbi alignment for the following pair:

WARNING: Model2 viterbi alignment has zero score.
Here are the different elements that made this alignment probability zero
Source length 25 target length 80
Source length 39 target length 75
* Just managed to get Perl working correctly in UTF-8 mode
so the statistics for Finnish are now: total number of words = 516102, total freq = 21180015
for English: total number of words = 105144, total freq = 30648429

* alphabet prob doesn't account for accent marks; check if it matters?
* in fi_words_utf, after the (sorted) words starting with z there are a lot of words with strange symbols; does it matter?

* Generating function word lists separately for each year and computing their overlap
Statistics
Year 01,        count = 62
./func_words/01
Year 05,        count = 54
./func_words/05
Year 97,        count = 69
./func_words/97
Year 04,        count = 50
./func_words/04
Year 02,        count = 61
./func_words/02
Year 99,        count = 63
./func_words/99
Year 98,        count = 73
./func_words/98
Year 03,        count = 59
./func_words/03
Year 00,        count = 70
./func_words/00
Year 06,        count = 46
./func_words/06

* Function words may be prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles, or particles, all of which belong to the group of closed-class words. Interjections are sometimes considered function words, but they belong to the group of open-class words. Function words may or may not be inflected, or may take affixes.
* ideas on a 3-way morpheme probabilistic model: a list of prefixes & suffixes derived from morpheme analysis; search for the best sequence of prefixes and suffixes using an affix recovery model (which doesn't care about order) as well as prefix and suffix models (for ordering)

* Consider translating "1 book, 3 book" from a non-inflected language into English: how do we recover "3 books"?

* what is case marking?
* Improving Statistical MT through Morphological Analysis
* Morphological Analysis for Statistical Machine Translation

Morphology papers in ACL07 shared task paper
* Analysis of statistical and morphological classes to
generate weighted reordering hypotheses on a statistical
machine translation system
* Morphology and reranking for the statistical parsing of Spanish
* The ‘noisier channel’: translation from morphologically complex languages
* To be done: 
** read chapter 8, "Distinguish language contact phenomena: evidence from Finnish-English bilingualism"
** sort morpheme freq, prefix freq, suffix freq, STM alone?
** try to predict the result above before carrying out the experiment
* Stop at remove_functionword, need to run for 100.modified
* generate phrase-phrase translation, extract only word-word
* incorporate to translationoption

-xml-input to add additional translation option
* straight/PRE forward/STEM ness/SUF: during parameter estimation, how do we incorporate PRE, STM, SUF tag knowledge?
* how to reduce the noisy prefixes and suffixes generated by Morfessor? (what if we keep only the stems for training?)
* after training, how can we be certain that the translation probabilities have improved? -> pick several candidate translations and observe the probability differences
+ Make some summarized graphs for SMT
+ Check Brown 1993 on Model 2, deficiency analysis (to be done)
+ Read Vogel et al. 1996: mixture model -> HMM model captures strong localization effects
+ Continue reading Och & Ney, 2003
22/12/07
* Inducing multilingual text analysis tools via robust projection across aligned corpora
Use morphological knowledge of English => morphological knowledge of new languages
MProj => pairs of (inflected, root) => Mtrie => inducing morphological rules on new language as well as clustering
* To-be-considered paper
Improving Word Alignment Quality using Morpho-Syntactic Information
Log-Linear Models for Word Alignment
Morphological Tagging: Data vs. Dictionaries
Morphological Annotation of Text: Automatic Disambiguation
Morphology: A study of the relation between meaning and form
Induction of First-Order Decision Lists: Results on Learning the Past Tense of English Verbs
Induction of a Stem Lexicon for Two-level Morphological Analysis 

* Not clear yet: 
** deficiency problem in model 3, 4
** efficient version of EM for IBM model 1, 2 and HMM
** decoding component: 
beam search using phrase-based table: for each partial alignment f, e, find the best (Viterbi) alignment, which also gives the probability of that pair f, e
** different scoring: WER, PER, 100-BLEU score, NIST
** meaning of backoff
* how to inject these probabilities into MOSES
** See how phrases are picked out from input sentences
** See how phrases are looked up in phrase tables
-> Successfully built Moses on Windows, http://www.statmt.org/moses/?n=Moses.LibrariesUsed
./moses-cmd.exe -config fi-en.10/filtered/moses.ini -input-file fi-en.10/100.test.fi > 100.output 1>moses_out 2>moses_err

 (''stop at compiling MOSES using SRILM 24/04/08'')
* crawl fincd.com website to obtain English-Finnish Finnish-English dictionaries
-> ''dict_crawl.pl'', crawling the 50000 most frequent words
Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information

Not clear: 
* how to pick up 2 words that are related by an inflectional transformation?

http://www.ling.umd.edu/~redpony/software/telugu.html: Telugu Morphological Analyzer
http://www.ling.umd.edu/~redpony/jbuck.jar: A Java implementation of the Buckwalter Morphological Analyzer v1
* Not clear yet:
** word lattice
** should we use Czech instead of Spanish, as Czech is a highly inflected language?

* To-be-considered paper
Word lattice parsing for SMT
* Sent email to Koehn checking on BLEU score
* To be done: ask Thai Phuong Nguyen about Vietnamese corpora
* Understand MOSES more:
** moses.ini
# distortion (reordering) weight
[weight-d]
0.3
0.3
0.3
0.3
0.3
0.3
0.3

# language model weights
[weight-l]
0.5000


# translation model weights
[weight-t]
0.2
0.2
0.2
0.2
0.2

** Why are there multiple values in weight-d and weight-t? (Presumably the seven weight-d values are the distortion weight plus the six msd-bidirectional-fe lexicalised reordering feature weights, and the five weight-t values are the four phrase/lexical translation scores plus the phrase penalty.)

* Try MOSES again: http://www.statmt.org/moses/?n=Moses.Tutorial
* How to divide the corpus into many parts?
** need to pick important keywords
** find the sentences in both corpora containing these keywords
** compute statistics of each subcorpus (number of words, number of tokens) to check its coverage
Question: if we use the Porter algorithm to stem all words in the corpus, knowing that the Porter algorithm fails in several cases, how do we test whether the effect is significant?
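
One standard way to test significance (an assumption here; the notes don't name a method) is paired bootstrap resampling over test sentences: resample the test set many times and count how often one system beats the other. A sketch using per-sentence scores as a stand-in for corpus-level BLEU:

```cpp
#include <cstdlib>
#include <vector>

// Paired bootstrap resampling: draw test sentences with replacement
// many times and count how often system A's total score beats system B's.
// a[i], b[i] are per-sentence scores of the two systems on sentence i.
double bootstrapWinRate(const std::vector<double> &a,
                        const std::vector<double> &b,
                        int samples, unsigned seed) {
    std::srand(seed);
    int wins = 0;
    const int n = static_cast<int>(a.size());
    for (int s = 0; s < samples; ++s) {
        double sumA = 0.0, sumB = 0.0;
        for (int i = 0; i < n; ++i) {
            int j = std::rand() % n;   // same sentence picked for both systems
            sumA += a[j];
            sumB += b[j];
        }
        if (sumA > sumB) ++wins;
    }
    return static_cast<double>(wins) / samples;
}
```

A win rate above, say, 0.95 would suggest the difference is significant; real BLEU is corpus-level, so each resample's score would be recomputed from the resampled n-gram counts rather than summed per sentence.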
* View morphology as an inference process
Inflectional stemmer: rule-based: plurals, past tenses, -ing

figuring out Morfessor to deal with morphological analysis http://www.cis.hut.fi/projects/morpho/index.shtml

To be done: 
* read papers about Morfessor
* think about the feasibility of lemmatization approach
Todo:
* Modify train-factored-phrase-model.perl to improve lexical score [[train-factored-phrase-model.perl]],  [[phrase-extract scoring]]
** modify the scripts\training\phrase-extract project to read in the lexical translation
* Modify TranslationOptionCollection to add translation option for OOV words by using their stems [[13 May 2008]]
+ install the varigram model trained with the VariKN language modelling toolkit. This is to handle language models whose token unit is the morpheme, which calls for longer n-grams. http://varikn.forge.pascal-network.org/
# counts2kn: Performs Kneser-Ney smoothed n-gram estimation and pruning for full counts.
# varigram_kn: Grows a Kneser-Ney smoothed n-gram model incrementally
# perplexity: Evaluate the model on a test data set
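
As a reminder, the perplexity the last tool reports follows the standard definition; a sketch from per-token log10 probabilities (not VariKN-specific code):

```cpp
#include <cmath>
#include <vector>

// Perplexity from per-token log10 probabilities: 10^(-average log10 prob).
// Lower is better; a uniform model over V outcomes has perplexity V.
double perplexity(const std::vector<double> &log10Probs) {
    double sum = 0.0;
    for (size_t i = 0; i < log10Probs.size(); ++i) sum += log10Probs[i];
    return std::pow(10.0, -sum / log10Probs.size());
}
```

Note that morpheme-level and word-level perplexities are not directly comparable, since the token counts differ.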

''Allomorph'' is a linguistic term for a variant form of a morpheme: the concept arises when a unit of meaning varies in sound (phonologically) without changing meaning (Wikipedia).

To be done:
* write a program to generate a wordlist from a corpus. This wordlist is used as input to Morfessor

Moses:
* what are the weights in the tuning folder?
+ finish reading Morphology-Aware Statistical Machine Translation Based on Morphs Induced in an Unsupervised Manner.
+ update thesis
europarl.filtered : removed 620 sentence pairs from the acl05 corpora where the sentence length ratio > 4

similarly for the test & dev sets: removed 3 sentence pairs each

Here are the different elements that made this alignment probability zero
Source length 23 target length 84

Hill Climbing yielded a zero score viterbi alignment for the following pair
* In Parameter::Parameter(), 
	AddParam("weight-m", "m", "weight for morpheme penalty"); // Thang add

* In StaticData.h, declare
class MorphemePenaltyProducer; // Thang add

float m_weightMorphemePenalty; //Thang add
// Thang add
	float GetWeightMorphemePenalty() const
	{
		return m_weightMorphemePenalty;
	}

MorphemePenaltyProducer *m_mpProducer; // Thang add
const MorphemePenaltyProducer *GetMorphemePenaltyProducer() const { return m_mpProducer; } // Thang add

* In StaticData::LoadData(), 
if(staticData.MorphemePenalty()){ 
        m_weightMorphemePenalty				= Scan<float>( m_parameter->GetParam("weight-m")[0] ); // Thang add
}
// Thang add
if(staticData.MorphemePenalty()){ 
	m_mpProducer = new MorphemePenaltyProducer(m_scoreIndexManager);
	m_allWeights.push_back(m_weightMorphemePenalty);
}

* In StaticData::~StaticData(),  
if(staticData.MorphemePenalty()){ 
       delete m_mpProducer; // Thang add
}

* In StaticData::StaticData()
if(staticData.MorphemePenalty()){ 
  m_mpProducer = 0; // Thang add
}

* In DummyScoreProducers.cpp
// Added by Thang
/** Doesn't do anything but provide a key into the global
 * score array to store the Morpheme penalty in.
 */
class MorphemePenaltyProducer : public StatelessFeatureFunction {
public:
	MorphemePenaltyProducer(ScoreIndexManager &scoreIndexManager);

	size_t GetNumScoreComponents() const;
	std::string GetScoreProducerDescription() const;
	size_t GetNumInputScores() const;

	virtual void Evaluate(
		const TargetPhrase& phrase,
		ScoreComponentCollection* out) const;

};

* In DummyScoreProducers.cpp
// Thang add
MorphemePenaltyProducer::MorphemePenaltyProducer(ScoreIndexManager &scoreIndexManager)
{
	scoreIndexManager.AddScoreProducer(this);
}

size_t MorphemePenaltyProducer::GetNumScoreComponents() const
{
	return 1;
}

std::string MorphemePenaltyProducer::GetScoreProducerDescription() const
{
	return "MorphemePenalty";
}

size_t MorphemePenaltyProducer::GetNumInputScores() const { return 0;}

void MorphemePenaltyProducer::Evaluate(const TargetPhrase& tp, ScoreComponentCollection* out) const
{
  out->PlusEquals(this, -static_cast<float>(tp.GetSize()));
}

Note that tp.GetSize() is a morpheme count. Accordingly, the WordPenaltyProducer is modified as follows:
void WordPenaltyProducer::Evaluate(const TargetPhrase& tp, ScoreComponentCollection* out) const
{
	if(StaticData::Instance().UseMorphemeLM()){
		out->PlusEquals(this, -static_cast<float>(tp.GetNumWords()));// Thang add
	} else {
		out->PlusEquals(this, -static_cast<float>(tp.GetSize()));
	}
  
}
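
To make the word-vs-morpheme count distinction concrete, here is a sketch of counting words in a morpheme-segmented target phrase, under the (hypothetical) convention that a trailing '+' marks a morpheme continued by the next token; the actual GetNumWords() implementation is not reproduced here:

```cpp
#include <sstream>
#include <string>

// Count words in a morpheme-segmented phrase, assuming (hypothetically)
// that a trailing '+' marks a non-final morpheme of a word.
int countWords(const std::string &phrase) {
    std::istringstream in(phrase);
    std::string tok;
    int words = 0;
    while (in >> tok)
        if (tok[tok.size() - 1] != '+') ++words;   // word-final morpheme
    return words;
}
```

Under this convention "straight+ forward+ ness is" has 4 morphemes (GetSize-style) but 2 words (GetNumWords-style), which is the asymmetry the two penalty branches above account for.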

* In Hypothesis::CalcRemainingScore()
// WORD PENALTY
	if(staticData.UseMorphemeLM()){// Thang add
		m_scoreBreakdown.PlusEquals(staticData.GetWordPenaltyProducer(), - (float) m_targetPhrase.GetNumWords());
	} else {
		m_scoreBreakdown.PlusEquals(staticData.GetWordPenaltyProducer(), - (float) m_currTargetWordsRange.GetNumWordsCovered()); 
	}

	// MORPHEME PENALTY
	if(staticData.MorphemePenalty()){ // Thang add: each morpheme as a unit
		m_scoreBreakdown.PlusEquals(staticData.GetMorphemePenaltyProducer(), - (float) m_currTargetWordsRange.GetNumWordsCovered()); 
	}

* In IOWrapper::OutputNBestList()
		// morpheme penalty, Thang add
		if (labeledOutput)
			*m_nBestStream << "m: ";
		*m_nBestStream << path.GetScoreBreakdown().GetScoreForProducer(StaticData::Instance().GetMorphemePenaltyProducer()) << " ";

* In TargetPhrase::SetScore(const ScoreProducer* translationScoreProducer,
														const Scores &scoreVector,
														const vector<float> &weightT,
														float weightWP, const LMList &languageModels)
if(StaticData::Instance().UseMorphemeLM()){
		m_fullScore = m_transScore + totalFutureScore + totalFullScore
			- (this->GetNumWords() * weightWP);	 // word penalty
  } else {
	m_fullScore = m_transScore + totalFutureScore + totalFullScore
		- (this->GetSize() * weightWP);	 // word penalty
  }

  if(StaticData::Instance().MorphemePenalty()){
	  m_fullScore -= this->GetSize() * StaticData::Instance().GetWeightMorphemePenalty();
  }

* In TargetPhrase::SetScore()
	if(StaticData::Instance().MorphemePenalty()){
		m_fullScore = - StaticData::Instance().GetWeightMorphemePenalty();
	}

* In TranslationOption::CalcScore()
size_t phraseSize = GetTargetPhrase().GetSize();
// future score
	if(StaticData::Instance().UseMorphemeLM()){ // Thang add
		m_futureScore = retFullScore - ngramScore 
			+ m_scoreBreakdown.InnerProduct(StaticData::Instance().GetAllWeights()) 
			- GetTargetPhrase().GetNumWords() * StaticData::Instance().GetWeightWordPenalty();
	} else {		
		m_futureScore = retFullScore - ngramScore 
			+ m_scoreBreakdown.InnerProduct(StaticData::Instance().GetAllWeights()) 
			- phraseSize * StaticData::Instance().GetWeightWordPenalty();
	}

	// Thang add
	if(StaticData::Instance().MorphemePenalty()){
		m_futureScore -= phraseSize * StaticData::Instance().GetWeightMorphemePenalty();
	}

* in train-factored-phrase-model.perl
** declare option
my $_MORPHEME_PENALTY = undef; #added by Thang
'morpheme-penalty' => \$_MORPHEME_PENALTY, # added by Thang

** in sub create_ini:
if(defined $_MORPHEME_PENALTY){
  print INI "\n# morpheme penalty\n[weight-m]\n-1";
}

* in mert-moses.pl
add to additional_triples
    "m" => [ [ 0.0, -1.0, 1.0 ] ],  # Thang add: morpheme penalty

my $ABBR_FULL_MAP = "d=weight-d lm=weight-l tm=weight-t w=weight-w g=weight-generation m=weight-m";

Compare
* w system: more, but less accurate, phrase translations
* m system: shows correct alignments but many incorrect boundary phrases
* m-realign system: fewer, but accurate, phrase translations

Problem:
* reordering problem

* lack of coverage in the m-realign system: compare the phrase tables & add additional entries from the word table

* Fix m-realign, combining scheme
* Filtering out incorrect phrases of m-augment
use "main.pl"

* modify /home/lmthang/HYP/bin_02_June_08/moses-scripts/scripts-20080602-2352/training/mert-moses.pl, at line 522 remove the keyword "my"
processing ptree for
..................................................[phrase:500000]
..................................................[phrase:1000000]
......................distinct source phrases: 1223895 distinct first words of source phrases: 14281 number of phrase pairs (line count): 3202063
WARNING: there are src voc entries with no phrase translation: count 2639
There exists phrase translations for 11642 entries
 ### Enriching Morphologically Poor Languages ###
* English-Greek (main testing language)
use data in http://www.statmt.org/wmt07
** Language model: 5 gram SRILM
** Greek model: trained on 440,082 aligned sentences Europarl v.3
** Tuned with MERT: dev2006 set (Europarl - 2,000 sentences)
** Tested on two sets (Europarl & News Commentary 2,000 sentences each): devtest2006, test2007 

* English-Czech (testing extension)
** Czech model: trained on 57,464 aligned sentences
** Tuned over 1057 sentences of the
News Commentary
** Tested on two sets of 964 sentences and 2000 sentences respectively

* Note:
** Training sentences were trimmed to a length of 60 words to reduce perplexity; standard lexicalised reordering with the distortion limit set to 6
** Collins’ parser, Brill’s tagger for English
**  BLEU and NIST metrics

 ### Initial Explorations in English to Turkish Statistical Machine Translation ###
 ### Exploring Different Representational Units in English-to-Turkish Statistical ###
* English-Turkish
** Parallel data consists mainly of documents in international relations and legal documents from sources such as the Turkish Ministry of Foreign Affairs, EU, etc

* TreeTagger for English

 ### Generating Complex Morphology for Machine Translation ###
 ### Applying Morphology Generation Models to Machine Translation ###

* English-Russian, English-Arabic
** 1 million aligned sentence pairs for English-Russian, and 0.5 million pairs for English-Arabic; the data is from a technical (computer) domain
** 1,000 sentence pairs each for development and testing for both language pairs
* Buckwalter morphological analyser for Arabic
/***
|''Name:''|CryptoFunctionsPlugin|
|''Description:''|Support for cryptographic functions|
***/
//{{{
if(!version.extensions.CryptoFunctionsPlugin) {
version.extensions.CryptoFunctionsPlugin = {installed:true};

//--
//-- Crypto functions and associated conversion routines
//--

// Crypto "namespace"
function Crypto() {}

// Convert a string to an array of big-endian 32-bit words
Crypto.strToBe32s = function(str)
{
	var be = Array();
	var len = Math.floor(str.length/4);
	var i, j;
	for(i=0, j=0; i<len; i++, j+=4) {
		be[i] = ((str.charCodeAt(j)&0xff) << 24)|((str.charCodeAt(j+1)&0xff) << 16)|((str.charCodeAt(j+2)&0xff) << 8)|(str.charCodeAt(j+3)&0xff);
	}
	while (j<str.length) {
		be[j>>2] |= (str.charCodeAt(j)&0xff)<<(24-(j*8)%32);
		j++;
	}
	return be;
};

// Convert an array of big-endian 32-bit words to a string
Crypto.be32sToStr = function(be)
{
	var str = "";
	for(var i=0;i<be.length*32;i+=8)
		str += String.fromCharCode((be[i>>5]>>>(24-i%32)) & 0xff);
	return str;
};

// Convert an array of big-endian 32-bit words to a hex string
Crypto.be32sToHex = function(be)
{
	var hex = "0123456789ABCDEF";
	var str = "";
	for(var i=0;i<be.length*4;i++)
		str += hex.charAt((be[i>>2]>>((3-i%4)*8+4))&0xF) + hex.charAt((be[i>>2]>>((3-i%4)*8))&0xF);
	return str;
};

// Return, in hex, the SHA-1 hash of a string
Crypto.hexSha1Str = function(str)
{
	return Crypto.be32sToHex(Crypto.sha1Str(str));
};

// Return the SHA-1 hash of a string
Crypto.sha1Str = function(str)
{
	return Crypto.sha1(Crypto.strToBe32s(str),str.length);
};

// Calculate the SHA-1 hash of an array of blen bytes of big-endian 32-bit words
Crypto.sha1 = function(x,blen)
{
	// Add 32-bit integers, wrapping at 32 bits
	add32 = function(a,b)
	{
		var lsw = (a&0xFFFF)+(b&0xFFFF);
		var msw = (a>>16)+(b>>16)+(lsw>>16);
		return (msw<<16)|(lsw&0xFFFF);
	};
	// Add five 32-bit integers, wrapping at 32 bits
	add32x5 = function(a,b,c,d,e)
	{
		var lsw = (a&0xFFFF)+(b&0xFFFF)+(c&0xFFFF)+(d&0xFFFF)+(e&0xFFFF);
		var msw = (a>>16)+(b>>16)+(c>>16)+(d>>16)+(e>>16)+(lsw>>16);
		return (msw<<16)|(lsw&0xFFFF);
	};
	// Bitwise rotate left a 32-bit integer by 1 bit
	rol32 = function(n)
	{
		return (n>>>31)|(n<<1);
	};

	var len = blen*8;
	// Append padding so length in bits is 448 mod 512
	x[len>>5] |= 0x80 << (24-len%32);
	// Append length
	x[((len+64>>9)<<4)+15] = len;
	var w = Array(80);

	var k1 = 0x5A827999;
	var k2 = 0x6ED9EBA1;
	var k3 = 0x8F1BBCDC;
	var k4 = 0xCA62C1D6;

	var h0 = 0x67452301;
	var h1 = 0xEFCDAB89;
	var h2 = 0x98BADCFE;
	var h3 = 0x10325476;
	var h4 = 0xC3D2E1F0;

	for(var i=0;i<x.length;i+=16) {
		var j,t;
		var a = h0;
		var b = h1;
		var c = h2;
		var d = h3;
		var e = h4;
		for(j = 0;j<16;j++) {
			w[j] = x[i+j];
			t = add32x5(e,(a>>>27)|(a<<5),d^(b&(c^d)),w[j],k1);
			e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
		}
		for(j=16;j<20;j++) {
			w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
			t = add32x5(e,(a>>>27)|(a<<5),d^(b&(c^d)),w[j],k1);
			e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
		}
		for(j=20;j<40;j++) {
			w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
			t = add32x5(e,(a>>>27)|(a<<5),b^c^d,w[j],k2);
			e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
		}
		for(j=40;j<60;j++) {
			w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
			t = add32x5(e,(a>>>27)|(a<<5),(b&c)|(d&(b|c)),w[j],k3);
			e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
		}
		for(j=60;j<80;j++) {
			w[j] = rol32(w[j-3]^w[j-8]^w[j-14]^w[j-16]);
			t = add32x5(e,(a>>>27)|(a<<5),b^c^d,w[j],k4);
			e=d; d=c; c=(b>>>2)|(b<<30); b=a; a = t;
		}

		h0 = add32(h0,a);
		h1 = add32(h1,b);
		h2 = add32(h2,c);
		h3 = add32(h3,d);
		h4 = add32(h4,e);
	}
	return Array(h0,h1,h2,h3,h4);
};


}
//}}}
/***
|''Name:''|DeprecatedFunctionsPlugin|
|''Description:''|Support for deprecated functions removed from core|
***/
//{{{
if(!version.extensions.DeprecatedFunctionsPlugin) {
version.extensions.DeprecatedFunctionsPlugin = {installed:true};

//--
//-- Deprecated code
//--

// @Deprecated: Use createElementAndWikify and this.termRegExp instead
config.formatterHelpers.charFormatHelper = function(w)
{
	w.subWikify(createTiddlyElement(w.output,this.element),this.terminator);
};

// @Deprecated: Use enclosedTextHelper and this.lookaheadRegExp instead
config.formatterHelpers.monospacedByLineHelper = function(w)
{
	var lookaheadRegExp = new RegExp(this.lookahead,"mg");
	lookaheadRegExp.lastIndex = w.matchStart;
	var lookaheadMatch = lookaheadRegExp.exec(w.source);
	if(lookaheadMatch && lookaheadMatch.index == w.matchStart) {
		var text = lookaheadMatch[1];
		if(config.browser.isIE)
			text = text.replace(/\n/g,"\r");
		createTiddlyElement(w.output,"pre",null,null,text);
		w.nextMatch = lookaheadRegExp.lastIndex;
	}
};

// @Deprecated: Use <br> or <br /> instead of <<br>>
config.macros.br = {};
config.macros.br.handler = function(place)
{
	createTiddlyElement(place,"br");
};

// Find an entry in an array. Returns the array index or null
// @Deprecated: Use indexOf instead
Array.prototype.find = function(item)
{
	var i = this.indexOf(item);
	return i == -1 ? null : i;
};

// Load a tiddler from an HTML DIV. The caller should make sure to later call Tiddler.changed()
// @Deprecated: Use store.getLoader().internalizeTiddler instead
Tiddler.prototype.loadFromDiv = function(divRef,title)
{
	return store.getLoader().internalizeTiddler(store,this,title,divRef);
};

// Format the text for storage in an HTML DIV
// @Deprecated Use store.getSaver().externalizeTiddler instead.
Tiddler.prototype.saveToDiv = function()
{
	return store.getSaver().externalizeTiddler(store,this);
};

// @Deprecated: Use store.allTiddlersAsHtml() instead
function allTiddlersAsHtml()
{
	return store.allTiddlersAsHtml();
}

// @Deprecated: Use refreshPageTemplate instead
function applyPageTemplate(title)
{
	refreshPageTemplate(title);
}

// @Deprecated: Use story.displayTiddlers instead
function displayTiddlers(srcElement,titles,template,unused1,unused2,animate,unused3)
{
	story.displayTiddlers(srcElement,titles,template,animate);
}

// @Deprecated: Use story.displayTiddler instead
function displayTiddler(srcElement,title,template,unused1,unused2,animate,unused3)
{
	story.displayTiddler(srcElement,title,template,animate);
}

// @Deprecated: Use functions on right hand side directly instead
var createTiddlerPopup = Popup.create;
var scrollToTiddlerPopup = Popup.show;
var hideTiddlerPopup = Popup.remove;

// @Deprecated: Use right hand side directly instead
var regexpBackSlashEn = new RegExp("\\\\n","mg");
var regexpBackSlash = new RegExp("\\\\","mg");
var regexpBackSlashEss = new RegExp("\\\\s","mg");
var regexpNewLine = new RegExp("\n","mg");
var regexpCarriageReturn = new RegExp("\r","mg");

}
//}}}
* English function words http://www.marlodge.supanet.com/museum/funcword.html
A ABOUT ABOVE AFTER AGAIN AGO ALL ALMOST ALONG ALREADY ALSO ALTHOUGH ALWAYS AM AMONG AN AND ANOTHER ANY ANYBODY ANYTHING ANYWHERE ARE AREN'T AROUND AS AT BACK ELSE BE BEEN BEFORE BEING BELOW BENEATH BESIDE BETWEEN BEYOND BILLION BILLIONTH BOTH EACH BUT BY CAN CAN'T COULD COULDN'T DID DIDN'T DO DOES DOESN'T DOING DONE DON'T DOWN DURING EIGHT EIGHTEEN EIGHTEENTH EIGHTH EIGHTIETH EIGHTY EITHER ELEVEN ELEVENTH ENOUGH EVEN EVER EVERY EVERYBODY EVERYONE EVERYTHING EVERYWHERE EXCEPT FAR FEW FEWER FIFTEEN FIFTEENTH FIFTH FIFTIETH FIFTY FIRST FIVE FOR FORTIETH FORTY FOUR FOURTEEN FOURTEENTH FOURTH HUNDRED FROM GET GETS GETTING GOT HAD HADN'T HAS HASN'T HAVE HAVEN'T HAVING HE HE'D HE'LL HENCE HER HERE HERS HERSELF HE'S HIM HIMSELF HIS HITHER HOW HOWEVER NEAR HUNDREDTH I I'D IF I'LL I'M IN INTO IS I'VE ISN'T IT ITS IT'S ITSELF LAST LESS MANY ME MAY MIGHT MILLION MILLIONTH MINE MORE MOST MUCH MUST MUSTN'T MY MYSELF NEAR NEARBY NEARLY NEITHER NEVER NEXT NINE NINETEEN NINETEENTH NINETIETH NINETY NINTH NO NOBODY NONE NOONE NOTHING NOR NOT NOW NOWHERE OF OFF OFTEN ON OR ONCE ONE ONLY OTHER OTHERS OUGHT OUGHTN'T OUR OURS OURSELVES OUT OVER QUITE RATHER ROUND SECOND SEVEN SEVENTEEN SEVENTEENTH SEVENTH SEVENTIETH SEVENTY SHALL SHAN'T SHE'D SHE SHE'LL SHE'S SHOULD SHOULDN'T SINCE SIX SIXTEEN SIXTEENTH SIXTH SIXTIETH SIXTY SO SOME SOMEBODY SOMEONE SOMETHING SOMETIMES SOMEWHERE SOON STILL SUCH TEN TENTH THAN THAT THAT THAT'S THE THEIR THEIRS THEM THEMSELVES THESE THEN THENCE THERE THEREFORE THEY THEY'D THEY'LL THEY'RE THIRD THIRTEEN THIRTEENTH THIRTIETH THIRTY THIS THITHER THOSE THOUGH THOUSAND THOUSANDTH THREE THRICE THROUGH THUS TILL TO TOWARDS TODAY TOMORROW TOO TWELFTH TWELVE TWENTIETH TWENTY TWICE TWO UNDER UNDERNEATH UNLESS UNTIL UP US VERY WHEN WAS WASN'T WE WE'D WE'LL WERE WE'RE WEREN'T WE'VE WHAT WHENCE WHERE WHEREAS WHICH WHILE WHITHER WHO WHOM WHOSE WHY WILL WITH WITHIN WITHOUT WON'T WOULD WOULDN'T YES YESTERDAY YET YOU YOUR YOU'D YOU'LL YOU'RE YOURS YOURSELF YOURSELVES YOU'VE

* http://www2.lingsoft.fi/doc/fintwol/intro/tags.html: Finnish morphological tags

* http://www.cis.hut.fi/morphochallenge2007/datasets.shtml: dataset for unsupervised morpheme analysis
As in Finnish ([[Finnish function word cutoff]]), we have  
6	''a''	1
28	''s''	1
10	''i''	1
------
66	''my''	2
1	''of''	2
64	''no''	2
4	''in''	2
152	''me''	2
22	''by''	2
85	''am''	2
17	''as''	2
12	''on''	2
78	''up''	2
166	''he''	2
57	''if''	2
7	''is''	2
13	''it''	2
67	''us''	2
25	''mr''	2
54	''do''	2
8	''we''	2
2	''to''	2
81	''eu''	2
30	''an''	2
47	''or''	2
61	''so''	2
14	''be''	2
27	''at''	2
-----------------
41	''you''	3
201	''put''	3
193	''see''	3
251	''aid''	3
149	''mrs''	3
31	''all''	3
44	''can''	3
105	''way''	3
165	''may''	3
182	''why''	3
24	''has''	3
82	''out''	3
79	''now''	3
292	''few''	3
75	''new''	3
139	''how''	3
191	''use''	3
109	''say''	3
45	''was''	3
125	''two''	3
221	''far''	3
190	''own''	3
32	''but''	3
3	''and''	3
153	''too''	3
253	''law''	3
275	''ask''	3
180	''his''	3
18	''not''	3
36	''our''	3
73	''who''	3
51	''its''	3
60	''one''	3
254	''set''	3
95	''any''	3
0	''the''	3
265	''let''	3
9	''for''	3
233	''end''	3
15	''are''	3
130	''had''	3
--------------
170	''both''	4
111	''many''	4
222	''give''	4
55	''what''	4
103	''them''	4
74	''time''	4
23	''will''	4
158	''much''	4
86	''when''	4
146	''said''	4
141	''part''	4
114	''fact''	4
71	''only''	4
62	''very''	4
175	''know''	4
38	''must''	4
93	''into''	4
192	''case''	4
173	''good''	4
195	''able''	4
97	''work''	4
174	''then''	4
122	''even''	4
40	''been''	4
143	''want''	4
269	''come''	4
178	''view''	4
34	''also''	4
273	''wish''	4
183	''your''	4
276	''role''	4
177	''over''	4
216	''area''	4
186	''hope''	4
16	''have''	4
194	''last''	4
104	''were''	4
259	''next''	4
58	''like''	4
53	''they''	4
145	''same''	4
90	''make''	4
132	''well''	4
5	''that''	4
89	''take''	4
92	''some''	4
20	''with''	4
128	''here''	4
37	''from''	4
48	''more''	4
156	''year''	4
101	''made''	4
262	''done''	4
77	''need''	4
123	''does''	4
11	''this''	4
83	''such''	4
157	''most''	4
240	''long''	4
171	''vote''	4
107	''than''	4
120	''just''	4
-----------------------------
52	''these''	5
127	''issue''	5
184	''taken''	5
246	''again''	5
168	''still''	5
99	''being''	5
294	''going''	5
160	''where''	5
124	''human''	5
29	''would''	5
147	''order''	5
291	''quite''	5
207	''thank''	5
247	''women''	5
263	''terms''	5
250	''after''	5
204	''means''	5
39	''there''	5
286	''shall''	5
281	''agree''	5
218	''since''	5
155	''state''	5
289	''whole''	5
69	''about''	5
88	''those''	5
59	''their''	5
68	''other''	5
288	''three''	5
229	''rules''	5
119	''point''	5
19	''which''	5
129	''think''	5
135	''right''	5
136	''today''	5
287	''madam''	5
243	''areas''	5
179	''under''	5
167	''clear''	5
133	''years''	5
212	''legal''	5
209	''level''	5
43	''union''	5
126	''group''	5
187	''house''	5
210	''given''	5
138	''could''	5
164	''world''	5
198	''place''	5
252	''basis''	5
211	''great''	5
102	''first''	5
260	''trade''	5
--------------
56	''member''	6
280	''energy''	6
199	''system''	6
245	''treaty''	6
169	''common''	6
241	''health''	6
284	''behalf''	6
142	''future''	6
72	''policy''	6
227	''ladies''	6
134	''public''	6
131	''market''	6
116	''debate''	6
230	''matter''	6
185	''course''	6
238	''regard''	6
296	''second''	6
70	''people''	6
208	''number''	6
228	''budget''	6
285	''making''	6
113	''social''	6
223	''united''	6
46	''states''	6
33	''should''	6
290	''better''	6
154	''within''	6
63	''europe''	6
91	''rights''	6
137	''cannot''	6
278	''sector''	6
219	''ensure''	6
205	''before''	6
49	''report''	6
237	''issues''	6
200	''action''	6
-------------------------------
274	''subject''	7
112	''believe''	7
256	''general''	7
94	''support''	7
50	''council''	7
214	''problem''	7
268	''adopted''	7
255	''further''	7
257	''certain''	7
236	''whether''	7
159	''already''	7
161	''members''	7
277	''opinion''	7
231	''through''	7
76	''because''	7
96	''however''	7
144	''country''	7
249	''respect''	7
121	''against''	7
196	''process''	7
215	''example''	7
98	''between''	7
298	''working''	7
202	''without''	7
261	''present''	7
110	''economic''	8
162	''national''	8
108	''proposal''	8
21	''european''	8
220	''problems''	8
297	''continue''	8
117	''question''	8
224	''security''	8
148	''possible''	8
282	''progress''	8
172	''citizens''	8
163	''measures''	8
225	''position''	8
266	''services''	8
232	''decision''	8
267	''concerned''	9
80	''important''	9
217	''programme''	9
100	''political''	9
299	''procedure''	9
264	''framework''	9
270	''different''	9
242	''proposals''	9
150	''situation''	9
283	''something''	9
87	''therefore''	9
295	''principle''	9
226	''gentlemen''	9
140	''agreement''	9
239	''necessary''	9
65	''countries''	9
35	''president''	9
84	''committee''	9
197	''financial''	9
234	''amendment''	9
115	''community''	9
151	''directive''	9
26	''commission''	10
272	''employment''	10
181	''amendments''	10
248	''protection''	10
271	''regulation''	10
213	''government''	10
258	''resolution''	10
42	''parliament''	10
244	''presidency''	10
203	''rapporteur''	10
188	''particular''	10
118	''development''	11
279	''legislation''	11
189	''cooperation''	11
206	''information''	11
106	''commissioner''	12
235	''particularly''	12
293	''institutions''	12
176	''international''	13
* Figured out that the training errors are again due to corpus misalignment
* Need to remove file ep-00-04-12
* uncomment in bin/moses-scripts/scripts-20080213-1421/training/clean-corpus-n.perl
 # $e =~ s/\|//g; # kinda hurts in factored input
 $e =~ s/\s+/ /g;
 $e =~ s/^ //;
 $e =~ s/ $//;
 # $f =~ s/\|//g; # kinda hurts in factored input

* The correct corpus, after tokenizing, cleaning, and lowercasing, has 1021180 sentences and is stored in /home/lmthang/HYP/src/scripts/europarl.tok.clean.low.*
* Steps to divide data and measure training time:
** Prepare the whole corpus by running "prepare_combine_data.pl" script
** Create many partial corpora by running "corpus_extractor.pl" script
** Run "data_split.pl" to split each partial corpus into training, dev, and test corpora in the proportion of 80%, 20%, 20%
** Run "creat_train_scripts.pl" to create for each corpus the script that does training as well as measuring time
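
The data_split.pl step can be sketched as a sequential split (an illustration with generic fractions; the real script's selection scheme and exact proportions are as noted above):

```cpp
#include <string>
#include <vector>

struct CorpusSplit { std::vector<std::string> train, dev, test; };

// Sequential train/dev/test split by fraction (a sketch; the real
// data_split.pl may shuffle or interleave lines instead).
CorpusSplit splitCorpus(const std::vector<std::string> &lines,
                        double trainFrac, double devFrac) {
    CorpusSplit s;
    const size_t n = lines.size();
    const size_t nTrain = static_cast<size_t>(n * trainFrac + 0.5);
    const size_t nDev = static_cast<size_t>(n * devFrac + 0.5);
    for (size_t i = 0; i < n; ++i) {
        if (i < nTrain) s.train.push_back(lines[i]);
        else if (i < nTrain + nDev) s.dev.push_back(lines[i]);
        else s.test.push_back(lines[i]);
    }
    return s;
}
```

For a parallel corpus the same index ranges must be applied to both language sides so the pairs stay aligned.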

* The divided corpora are listed here (columns: subcorpus, #lines, #tokens, total corpus #lines, total corpus #tokens)
"european" en corpus    101110  2730232 1021180 22148907
"european" fi corpus    101110  1933257 1021180 16061093
"commission" en corpus  78274   2043349 1021180 22148907
"commission" fi corpus  78274   1438861 1021180 16061093
"president" en corpus   62572   1580517 1021180 22148907
"president" fi corpus   62572   1185946 1021180 16061093
"parliament" en corpus  51159   1318580 1021180 22148907
"parliament" fi corpus  51159   945735  1021180 16061093
"union" en corpus       48116   1304960 1021180 22148907
"union" fi corpus       48116   923293  1021180 16061093
"states" en corpus      42818   1171039 1021180 22148907
"states" fi corpus      42818   818545  1021180 16061093
"report" en corpus      47072   1151682 1021180 22148907
"report" fi corpus      47072   821529  1021180 16061093
"council" en corpus     40757   1088765 1021180 22148907
"council" fi corpus     40757   758511  1021180 16061093
"europe" en corpus      37492   962090  1021180 22148907
"europe" fi corpus      37492   696875  1021180 16061093
"countries" en corpus   32691   887397  1021180 22148907
"countries" fi corpus   32691   646216  1021180 16061093


Steps to obtain parallel corpus:
* ''IMPORTANT'': run the sentence aligner obtained from http://www.statmt.org/europarl/v3/tools.tgz
* run ''prepare_combine_data.pl'' to remove XML tags, tokenize, and lowercase:
./scripts_own/prepare_combine_data.pl -out europarl -l1 fi -l2 en

** After sentence aligning there are still mismatches [[Fi-en corpora mismatch]]
** Total parallel corpus size before tag removal: 1750290
After removing tag at europarl, count1 = 1228807, count2 = 1228807
Cleaning ...
clean-corpus.perl: processing europarl.fi & .en to europarl.clean, cutoff 1-40
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000)..........(1200000)..
Input sentences: 1228807  Output sentences:  1006279
After tokenizing, cleaning at europarl, count1 = 1228807, count2 = 1228807
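The cleaning pass keeps only sentence pairs whose lengths fall within the 1-40 token cutoff; a minimal sketch of that filter (clean-corpus.perl also applies length-ratio checks, which are omitted here):

```python
def clean_pairs(pairs, min_len=1, max_len=40):
    """Keep parallel sentence pairs where both sides have min_len..max_len tokens."""
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        if min_len <= ls <= max_len and min_len <= lt <= max_len:
            kept.append((src, tgt))
    return kept
```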

* Extract small corpora from the large corpus
./corpus_extractor.pl -corpus ../corpus/europarl.clean -word data_own/word_list -count 20

entire en corpus: 1006279 line  21757289 tokens
entire fi corpus: 1006279 line  15775216 tokens
** 0
"european" en corpus    102215  2754855
"european" fi corpus    102215  1953341
** 1
"commission" en corpus  100931  2591940
"commission" fi corpus  100931  1875048
** 2
"president" en corpus   68076   1720838
"president" fi corpus   68076   1285807
** 3
"parliament" en corpus  55949   1444512
"parliament" fi corpus  55949   1037343
** 4
"union" en corpus       47976   1300255
"union" fi corpus       47976   920094 
** 5
"states" en corpus      42206   1153597
"states" fi corpus      42206   806208 
** 6
"report" en corpus      52396   1285914
"report" fi corpus      52396   918920 
** 7
"council" en corpus     40647   1084681
"council" fi corpus     40647   756091 
** 8
"countries" en corpus   32160   872117 
"countries" fi corpus   32160   635117 
** 9
"policy" en corpus      29689   777086 
"policy" fi corpus      29689   542920 
** 10
"important" en corpus   30955   766474 
"important" fi corpus   30955   561101 
** 11
"committee" en corpus   22589   623272 
"committee" fi corpus   22589   419830 
** 12
"rights" en corpus      20949   567727 
"rights" fi corpus      20949   397068 
** 13
"support" en corpus     31455   786387 
"support" fi corpus     31455   566067 
** 14
"political" en corpus   21914   572321 
"political" fi corpus   21914   423311 
** 15
"proposal" en corpus    30843   787450 
"proposal" fi corpus    30843   555795 
** 16
"economic" en corpus    18961   511382 
"economic" fi corpus    18961   372590 
** 17
"social" en corpus      20251   546884 
"social" fi corpus      20251   394118 
** 18
"community" en corpus   16192   443651 
"community" fi corpus   16192   319037 
** 19
"debate" en corpus      22856   501206 
"debate" fi corpus      22856   359220 
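The extraction step above can be sketched as a keyword filter over the parallel corpus (a simplified stand-in for corpus_extractor.pl; matching on the English side only is an assumption):

```python
def extract_by_keyword(en_lines, fi_lines, keyword):
    """Collect the parallel sentence pairs whose English side contains the keyword."""
    en_out, fi_out = [], []
    for en, fi in zip(en_lines, fi_lines):
        if keyword in en.split():
            en_out.append(en)
            fi_out.append(fi)
    return en_out, fi_out
```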

* Break each small corpus into train, dev, test data (80%, 10%, 10%)
./data_split.pl -dir ../corpus -n 20
**
Count /home/lmthang/HYP/corpus/aligned/fi-en/fi/ep-01-05-16.txt 4578
Mismatch 2337:
<P>
(Murmurs of dissent)

Mismatch 2338:
Hyvät jäsenet, olemme noudattaneet kaikkia työjärjestyksen asiaan liittyviä artikloja, joiden mukaisesti jäsen Ribeiro e Castron esitys on hyväksytty.
<P>

Mismatch 2339:
<P>
Ladies and gentlemen, we have acted pursuant to all the relevant Rules, and Mr Ribeiro e Castro' s motion is approved.

Mismatch 2340:
(Vastalauseita)
<P>

**
Count /home/lmthang/HYP/corpus/aligned/fi-en/fi/ep-01-12-17.txt 1792
Mismatch 1007:
<P>
(Applause)

Mismatch 1012:
Lopulta asia hyväksyttiin, ja ansio siitä kuuluu alueiden komitean puheenjohtajalle Chabertille, jota olin pyytänyt lähettämään minulle kirjeen. Hänen minulle lähettämässään kirjeessä sanottiin, että alueiden komitean on oltava edustettuna ja että alueiden komitean tavoin edustusoikeutta on vaadittava eri kokonaisuuksille: alueille ja kaupungeille, mutta myös omaa lainsäädäntövaltaa käyttäville alueille.
<P>

**
Count /home/lmthang/HYP/corpus/aligned/fi-en/fi/ep-04-01-13.txt 3561
Mismatch 3499:
<P>
Mr Khanbhai asked what the woman who had these illnesses would think. What would she have thought, turning on her television last week, to see President Bush announcing a multi-billion-dollar space programme of truly astronomical amounts?

Mismatch 3500:
He olisivat nykyisillä lääkkeillä ja poisheitettävällä ruualla pelastettavissa. Edullisilla avustusohjelmilla voitaisiin vuosittain pelastaa kuuden miljoonan alle viisivuotiaan lapsen henki.
<P>

Mismatch 3501:
<P>
It is about time that we resolved the problems in this world before flying off in search of others.

Mismatch 3510:
Kapitalismi kehittyy ahneuden voimalla ja niin köyhyyskin. Ratkaiskaamme, hyvät kollegat, osattomuuden ongelma kuten sodat Euroopassa: yhteistyöllä.
<P>

**
ep-06-05-17.txt 5075    5075
Mismatch 3609:
<P>
Is the EU in a position to implement an effectively coordinated energy policy, to be converted into an EU common energy policy in the near future? Is the EU in a position to counter the Russian pipeline monopoly for transportation of oil and gas from Central Asia to Europe?

Mismatch 3613:
Venäjältä tuotavan öljyn ja kaasun aiheuttamasta energiariippuvuudesta on tullut kuuma puheenaihe EU:ssa erityisesti tänä vuonna, ja EU on selvästi ymmärtänyt, että tulevan vuosikymmenen aikana energiantoimituksilla on paljon tärkeämpi poliittinen rooli kuin aikaisemmin. Venäjää koskevan EU:n tehokkaan energiapolitiikan puuttumisen ansiosta Venäjän hallitus pystyy peluuttamaan Euroopan yrityksiä ja hallituksia toisiaan vastaan, kun ne haluavat saada kaasua ja päästä osallisiksi investoinneista.
<P>

Mismatch 3615:
<P>
Is the Council prepared to raise issues relating to reciprocity and the transparency of Russian energy companies during the nearest G8 Summit?

Mismatch 3616:
Onko EU:n mahdollista panna täytäntöön tehokasta koordinoitua energiapolitiikkaa, joka voitaisiin muuttaa yhteiseksi energiapolitiikaksi lähitulevaisuudessa? Onko EU:n mahdollista vastustaa Venäjän öljyputkimonopolia kuljetettaessa öljyä ja kaasua Keski-Aasiasta Eurooppaan?
<P>
http://www.fincd.com/

! 23/03/08
arvoisa = honored, esteemed (a top-frequency word in the europarl corpus)

!Last time
* onnea vuodelle 2000 =  a happy year 2000
* istuntokauden keskeyttäminen = adjournment of the session
* environment = ympäristö
* toivomukseni maanviljelijänä on = my wish as a farmer is
** toivomuksen = toivomus = desire, wish
* portugalilainen = Portuguese
* hallitustenvälisessä = (in/at) hallitustenvälinen = intergovernmental
* English: judgement
# Finnish: 	arvioiminen
# arviointi
# arvostelu
# tuomio
* Finnish:  	tammikuussa
# English: 	in January
* English:  	january
# Finnish: 	tammikuu

*English:  	amendments
# Finnish: 	korjaus
# lainlisäys
# lainmuutos
# lisäys
# muutos
# muutosehdotus
# oikaisu
# parannus
# uudistus

* English:  	The Netherlands
# Finnish: 	Alankomaat

* English:  	entrepreneur
# Finnish: 	urakoitsija
# yksityisyrittäjä
# yrittäjä

* English:  	entrepreneurship
# Finnish: 	yrittäjätoiminta

* innovaatiota ja yrittäjyyttä = innovation and entrepreneurship

* Toinen luku: kaksi kolmasosaa. = Second number: two thirds.



Top 300 frequent words
* It is quite clear that words of more than 7 characters are almost all open-class words, so let us check those longer Finnish words

153	''a''	1
26	''n''	1
-----------
42	''ne''	2
130	''of''	2
5	''se''	2
197	''in''	2
61	''me''	2
45	''jo''	2
1	''on''	2
95	''en''	2
108	''he''	2
285	''is''	2
3	''ei''	2
139	''to''	2
38	''eu''	2
0	''ja''	2
-----------------------
290	''te''	2
24	''tai''	3
54	''oli''	3
10	''ole''	3
133	''eri''	3
186	''osa''	3
34	''nyt''	3
29	''kun''	3
210	''yhä''	3
33	''voi''	3
176	''and''	3
189	''eli''	3
9	''sen''	3
31	''jos''	3
87	''hän''	3
51	''the''	3
124	''saa''	3
------------------------
229	''jopa''	4
118	''asia''	4
43	''niin''	4
94	''nämä''	4
22	''kuin''	4
11	''joka''	4
174	''aina''	4
2	''että''	4
37	''olen''	4
121	''itse''	4
7	''ovat''	4
107	''siis''	4
64	''olla''	4
109	''koko''	4
203	''yhtä''	4
71	''emme''	4
235	''onko''	4
85	''jota''	4
78	''vaan''	4
65	''näin''	4
173	''hyvä''	4
221	''enää''	4
52	''tätä''	4
255	''ajan''	4
137	''yksi''	4
17	''sitä''	4
264	''that''	4
104	''mikä''	4
13	''tämä''	4
6	''myös''	4
46	''vain''	4
134	''kyse''	4
50	''mitä''	4
81	''eikä''	4
27	''sekä''	4
---------------------------
265	''uuden''	5
73	''tällä''	5
96	''ettei''	5
131	''aivan''	5
41	''tässä''	5
59	''jotta''	5
93	''ollut''	5
80	''tähän''	5
292	''esiin''	5
168	''saada''	5
126	''tulee''	5
195	''kaksi''	5
250	''monet''	5
135	''antaa''	5
188	''herra''	5
40	''koska''	5
76	''joita''	5
202	''ollen''	5
162	''jolla''	5
119	''miten''	5
44	''hyvin''	5
16	''siitä''	5
170	''samaa''	5
36	''kuten''	5
57	''eivät''	5
110	''hänen''	5
211	''tukea''	5
179	''ilman''	5
55	''vielä''	5
84	''ennen''	5
181	''ottaa''	5
220	''voida''	5
69	''tästä''	5
252	''teitä''	5
226	''liian''	5
100	''tehdä''	5
200	''näitä''	5
248	''voisi''	5
289	''uskon''	5
28	''jäsen''	5
194	''siten''	5
15	''jotka''	5
47	''jonka''	5
19	''mutta''	5
113	''hyvät''	5
101	''siinä''	5
99	''niitä''	5
67	''jossa''	5
23	''olisi''	5
240	''josta''	5
263	''uusia''	5
48	''sillä''	5
261	''usein''	5
127	''viime''	5
161	''sanoa''	5
114	''pitää''	5
209	''asiaa''	5
256	''johon''	5
267	''suuri''	5
154	''minun''	5
90	''siksi''	5
112	''juuri''	5
231	''asian''	5
21	''tämän''	5
--------------------------
58	''siihen''	6 = ''there''
70	''vuoksi''	6 = ''because of, for the sake of, through''
260	''vuotta''	6 = vuosi = year
60	''niiden''	6 = (genitive plural of) ne / se = ''it, that, the''
86	''vuoden''	6 = vuosi = year
106	''osalta''	6 = (from) osa = destiny, luck, fortune
171	''alalla''	6 = (in/at) ala = area, domain
148	''avulla''	6 = ''by means of''
77	''mukaan''	6 = ''according to''
212	''täytyy''	6 = ''must''
138	''tänään''	6 = today
117	''voimme''	6 = (we) voida = ''be able to, can''
128	''näiden''	6 = (genitive plural of) nämä / tämä = ''this, this one''
217	''muiden''	6 = muu = ''additional, rest of''
160	''toivon''	6 = (I) toivoa = desire, hope, wish
172	''tärkeä''	6 = essential, important
83	''paljon''	6 = ''many, much''
218	''toinen''	6 = ''another, else, other''
155	''joilla''	6 = 
122	''vuonna''	6 = viime vuonna = last year
79	''mieltä''	6 = mieli = mind
232	''teille''	6 = ''you''
225	''varten''	6 = ''for, in order to''
56	''haluan''	6 = (I) haluta = desire, want, wish
273	''tavoin''	6 = kaikin tavoin = ''in every way''
140	''heidän''	6 = ''their''
149	''aikana''	6 = ''during, for, whereas''
277	''ryhmän''	6 = ryhmä = group, party
167	''voivat''	6
237	''kerran''	6 = ''once''
105	''täysin''	6 = ''absolutely, completely, entirely, wholly''
152	''maiden''	6 = maa = country, earth
91	''vaikka''	6
163	''meille''	6
165	''oltava''	6
272	''oikeus''	6
284	''syystä''	6
35	''kanssa''	6
183	''täällä''	6
145	''unioni''	6 = ''union''
123	''joiden''	6
156	''koskee''	6
18	''meidän''	6
159	''mitään''	6
192	''aikaan''	6
279	''joista''	6
206	''niistä''	6
53	''kaikki''	6
214	''joissa''	6
230	''olevan''	6
196	''kuinka''	6
111	''sitten''	6
151	''toimia''	6
291	''selvää''	6
116	''meillä''	6
39	''olemme''	6
------------------------
8)      arvoisa 7 = honored 
25)      unionin 7 = union
66)      voidaan 7 = voida = ''be able to, can, could''
75)      lisäksi 7 = ''furthermore, in addition to, moreover''
92)     pitäisi 7 = ''ought to, should''
97)     tärkeää 7 = tärkeä = important, essential
103)    jälkeen 7 = ''after, behind''
115)    vastaan 7 = ''against, opposite''
''129)    asiassa 7 = in/at asia''
136)    enemmän 7 = ''more''
141)    jäsenet 7 = (plural of) jäsen
142)    lopuksi 7 = ''finally''
164)    kysymys 7 = question
''182)    asiasta 7 = (from) asia''
184)    tehtävä 7 = affair, assignment, business, problem, task
187)    samalla 7 = ''at the same time''
190)    todella 7 = ''actually, indeed''
198)    kaikkea 7 = kaikki = ''altogether, everything''
199)    kiittää 7 = thank
201)    kaikkia 7 = kaikki = ''altogether, everything''
''207)    tavalla 7 = (in/at) tapa''
216)    välillä 7 = ''between''
223)    käyttää 7 = apply, employ
224)    esittää 7 = act, perform
227)    koskeva 7 : sähköä koskeva = electrical
236)    yhdessä 7 = ''together''
239)    otetaan 7
241)    tilanne 7 = circumstance, occasion
268)    silloin 7 = ''then''
276)    jälleen 7 = ''again, all over again''
''278)    tasolla 7 = (in/at) taso''
280)    asioita 7 = asia
286)    naisten 7 = nainen = woman
293)    ehdotus 7 = advice, proposal
295)    kohtaan 7 = ''to, towards''
298)    etenkin 7 = especially, mainly
-------------------------------------
251	''selvästi''	8
102	''neuvosto''	8
32	''komissio''	8 = ''commission''
213	''hetkellä''	8
193	''haluamme''	8
166	''koskevat''	8
283	''alueella''	8
294	''huomiota''	8
143	''kaikkien''	8
243	''käyttöön''	8
288	''korostaa''	8
158	''kannalta''	8
144	''koskevan''	8
14	''puhemies''	8
132	''eurooppa''	8 = ''Europe''
222	''olemassa''	8
254	''yhteisen''	8
269	''toiseksi''	8
82	''yhteisön''	8
89	''huomioon''	8
98	''puolesta''	8
259	''maailman''	8
296	''ongelmia''	8
185	''hyväksyä''	8
180	''mietintö''	8
204	''kollegat''	8
4	''euroopan''	8 = ''European''
62	''erittäin''	8
147	''edelleen''	8
281	''erityisen''	9
68	''haluaisin''	9
287	''toisaalta''	9
299	''liittyvät''	9
146	''mietinnön''	9
247	''uudelleen''	9
12	''komission''	9 = ''komissio''
258	''tarvitaan''	9
238	''tietenkin''	9
49	''neuvoston''	9
271	''nimittäin''	9
30	''kuitenkin''	9
253	''unionissa''	9 = ''(in/at) unioni''
191	''esittelijä''	10
257	''keskustelu''	10
120	''yhteydessä''	10
245	''todellakin''	10
228	''euroopassa''	10
63	''mielestäni''	10
219	''mukaisesti''	10
297	''komissiota''	10 = '' komissio''
274	''ehdotuksen''	10
150	''ainoastaan''	10
169	''kuitenkaan''	10
249	''prosenttia''	10
205	''sopimuksen''	10
262	''huolimatta''	10
275	''ensinnäkin''	10
246	''voitaisiin''	10
72	''erityisesti''	11
270	''direktiivin''	11
233	''valiokunnan''	11
74	''parlamentti''	11 = ''parliament''
20	''parlamentin''	11 = '' parlamentti''
178	''mietinnössä''	11
234	''mahdollista''	11
266	''yhteistyötä''	11
125	''esimerkiksi''	11
244	''perusteella''	11
177	''tapauksessa''	11
215	''jäsenvaltiot''	12
175	''kansalaisten''	12
157	''puheenjohtaja''	13
208	''parlamentissa''	13
242	''mahdollisimman''	14
282	''valitettavasti''	14
88	''jäsenvaltioiden''	15
As I mentioned, GIZA++ may have a bug in the HMM training stage: it adds
some random number to the count table, and maybe that is the reason. You may
check the archive of the mailing list for the description of the bug;
alternatively, you can simply comment out the lines marked with //*******// in
Array2.h to fix it.

inline T*begin(){
#ifdef __STL_DEBUG //*******//
if( h1==0||h2==0)return 0;
#endif //*******//
return &(p[0]);
}
inline T*end(){
#ifdef __STL_DEBUG //*******//
if( h1==0||h2==0)return 0;
#endif //*******//
return &(p[0])+p.size();
}

You may also be interested in trying a new version of multi-threaded
GIZA++, with the bug fixed and a much faster speed, here:

http://www.cs.cmu.edu/~qing/

The score for each hypothesis is computed as below:
        // DISTORTION COST
	CalcDistortionScore();
	
	// LANGUAGE MODEL COST
	CalcLMScore(staticData.GetAllLM());

	// WORD PENALTY
	m_scoreBreakdown.PlusEquals(staticData.GetWordPenaltyProducer(), - (float) m_currTargetWordsRange.GetNumWordsCovered()); 

	// FUTURE COST
	CalcFutureScore(futureScore);

	
	//LEXICAL REORDERING COST
	const std::vector<LexicalReordering*> &reorderModels = staticData.GetReorderModels();
	for(unsigned int i = 0; i < reorderModels.size(); i++)
	{
		m_scoreBreakdown.PlusEquals(reorderModels[i], reorderModels[i]->CalcScore(this));
	}

	// TOTAL
	m_totalScore = m_scoreBreakdown.InnerProduct(staticData.GetAllWeights()) + m_futureScore;

The vector of scores in m_scoreBreakdown is multiplied by the vector of weights obtained from moses.ini. For example:
weights: 0.016   0.068   1.000   -0.015  0.051   0.115   0.089   0.000   0.276   0.105   0.020   0.038   0.034   0.058   0.114
where
0.016: distortion weight
0.068: word penalty
1.000: unknown word penalty (default = 1)
-0.015  0.051   0.115   0.089   0.000   0.276: reordering weights
0.105: language model
0.020   0.038   0.034   0.058   0.114: weights for translation scores
which corresponds to the moses.ini below:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

# translation tables: source-factors, target-factors, number of scores, file
[ttable-file]
0 0 5 /tmp/Thang/test/w-w/filtered/phrase-table.0-0-0

# no generation models, no generation-file section

# language models: type(srilm/irstlm), factors, order, file
[lmodel-file]
0 0 5 /tmp/Thang/lm/binary/acl05.5-gram.fi


# limit on how many phrase translations e for each phrase f are loaded
# 0 = all elements loaded
[ttable-limit]
20
0
# distortion (reordering) files
[distortion-file]
0-0 msd-bidirectional-fe 6 /tmp/Thang/test/w-w/filtered/reordering-table.msd-bidirectional-fe.0.5.0-0

# distortion (reordering) weight
[weight-d]
0.016162
-0.014947
0.051103
0.115213
0.088628
0.000315
0.276050

# language model weights
[weight-l]
0.104619


# translation model weights
[weight-t]
0.020397
0.037620
0.033999
0.058153
0.114485

# no generation models, no weight-generation section

# word penalty
[weight-w]
0.068308

[distortion-limit]
6
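The total score is just the inner product of the score breakdown with the weights above, plus the future-cost estimate, as in m_scoreBreakdown.InnerProduct(...). A minimal numeric sketch (the score values here are made up for illustration):

```python
def total_score(scores, weights, future_score=0.0):
    """Inner product of feature scores and weights, plus the future-cost estimate."""
    assert len(scores) == len(weights)
    return sum(s * w for s, w in zip(scores, weights)) + future_score
```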

/***
 * calculate the logarithm of our total translation score (sum up components)
 */
void Hypothesis::CalcScore(const SquareMatrix &futureScore) 
{
	const StaticData &staticData = StaticData::Instance();

	// DISTORTION COST
	CalcDistortionScore();
	
	// LANGUAGE MODEL COST
	CalcLMScore(staticData.GetAllLM());

	// WORD PENALTY
	m_scoreBreakdown.PlusEquals(staticData.GetWordPenaltyProducer(), - (float) m_currTargetWordsRange.GetNumWordsCovered()); 

	// FUTURE COST
	CalcFutureScore(futureScore);

	
	//LEXICAL REORDERING COST
	const std::vector<LexicalReordering*> &reorderModels = staticData.GetReorderModels();
	for(unsigned int i = 0; i < reorderModels.size(); i++)
	{
		m_scoreBreakdown.PlusEquals(reorderModels[i], reorderModels[i]->CalcScore(this));
	}

	// TOTAL
	m_totalScore = m_scoreBreakdown.InnerProduct(staticData.GetAllWeights()) + m_futureScore;
}
cd moses
./regenerate-makefiles.sh
./configure --with-srilm=/path-to-srilm
make -j 4

path-to-srilm set to .../srilm/bin/cygwin
+ Check GIZA++
An Efficient Method for Determining Bilingual Word Classes (Och, 1999):

+ Install GIZA++ and mkcls; the files from the Google Code page require gcc >= 4.1

+ Install SRILM
* read the INSTALL notes first
* read the platform-specific notes in doc/
* modify the platform-specific makefile instructions in common/
Note: if no Tcl is available, insert NO_TCL = X and leave TCL_LIBRARY and TCL_INCLUDE empty (as instructed in the INSTALL file)

* make World
* to build for a non-default platform such as cygwin:
make MACHINE_TYPE=cygwin
* set up the environment:
export SRILM=
export MACHINE_TYPE= #i686, cygwin
export PATH=$PATH:$SRILM/bin:$SRILM/bin/$MACHINE_TYPE      # mentioned in INSTALL
export MANPATH=$MANPATH:$SRILM/man                  # mentioned in INSTALL
* go into test/ and run make test to check whether the output and the reference results are identical
* make all
* make cleanest
http://www.cs.tut.fi/~jkorpela/finnish-intro.html

* uses suffixes to express grammatical relations and also to derive new words:
** talossanikin = in my house, too. ''-ssa'': inessive case, roughly corresponding to the English preposition ''in''. ''-ni'' is a possessive suffix, corresponding to ''my'' in English. ''-kin'' is an enclitic particle corresponding to the English word ''too'' (and the Latin enclitic -que).
** an example of verb flexion: kirjoitettuasi = after you had written

* tendencies from synthetic to analytic expression
** In speech, ''mun talossa'' (with mun corresponding to English my) is more common than talossani; ''kirjoitettuasi'' usually appears only in written language

* Flexion. Originally, suffixes were “glued” to words by simple concatenation. Due to various phonetic changes, suffixes in Finnish very often cause changes in the word root, producing phenomena which resemble flexion (e.g. juon ‘I drink’, join ‘I drank’), and for several suffixes there are alternative forms
**   final ''-i'' in nouns often (but not in new loanwords like grilli) changes to ''-e-'' in inflected forms, e.g. the genitive of kivi 'stone' is kiven (with -n as the genitive case suffix)
** final ''-nen'' (which is rather common in adjectives and occurs in nouns, too) in the singular nominative changes to ''-se-'' (or -s-) in other words, e.g. hevonen ‘horse’, hevoset ‘horses’
** consonant gradation: double consonants kk, pp, tt are often (basically, before closed syllables) replaced by single k, p, t, e.g. the genitive of lakki ‘cap’ is lakin
** similar phenomenon for single consonants: single k, p, and t are often replaced by absence of a consonant, v, and d, respectively, e.g. laki : lain, lupa : luvan, katu : kadun. 
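The gradation rules above can be demonstrated with a toy genitive-former (a deliberately naive sketch that covers only the examples listed; real Finnish consonant gradation has many more conditions and exceptions):

```python
# Weak-grade replacements: kk/pp/tt -> k/p/t, and single k/p/t -> ""/v/d.
DOUBLE = {"kk": "k", "pp": "p", "tt": "t"}
SINGLE = {"k": "", "p": "v", "t": "d"}

def toy_genitive(noun):
    """Form the genitive: weaken the consonant before the final vowel, append -n."""
    if noun[-3:-1] in DOUBLE:
        stem = noun[:-3] + DOUBLE[noun[-3:-1]] + noun[-1]
    elif noun[-2] in SINGLE:
        stem = noun[:-2] + SINGLE[noun[-2]] + noun[-1]
    else:
        stem = noun
    return stem + "n"
```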

* Word derivation
Suffixes are also used for word derivation. Another word formation tool is composition: glueing two words together. The following list of derived and composite words should give some idea of the mechanisms:

    * talous ‘economy’, from talo ‘house’
    * taloudellinen ‘economical’
    * taloudellisuus ‘economicality’
    * kansantalous ‘national economy’, using kansa ‘people, nation’
    * kansantaloustiede ‘(study of) economics’, using tiede ‘science’, which is derived from tietää ‘to know’. 

* Word order
Word order is often said to be “free” in Finnish: one can change the word order without changing the basic meaning of the sentence, but the emphasis, side meanings, or style typically change.
    *  Pete rakastaa Annaa. This is the normal word order, the same as in English. The case suffix ''-a'' in Annaa designates the grammatical object, no matter what the word order is. If we wanted to say that Anna loves Pete, we would say Anna rakastaa Peteä.
    * Annaa Pete rakastaa. This emphasizes the word Annaa: the object of Pete’s love is Anna, not someone else.
    * Rakastaa Pete Annaa. This emphasizes the word rakastaa, and such a sentence might be used as a response to some doubt about Pete’s love; so one might say it corresponds to Pete does love Anna.
    * Pete Annaa rakastaa. This word order might be used, in conjunction with special stress on Pete in pronunciation, to emphasize that it is Pete and not someone else who loves Anna.
    * Annaa rakastaa Pete. This might be used in a context where we mention some people and tell about each of them who loves them. So this roughly corresponds to the English sentence Anna is loved by Pete.
    * Rakastaa Annaa Pete. This does not sound like a normal sentence, but it is quite understandable. 

* Peculiarities in grammar
ARPA format http://www.stanford.edu/class/cs224s/224s.07.lec11.6up.pdf
http://www.speech.sri.com/projects/srilm/manpages/ngram-format.html

\data\
ngram 1=n1
ngram 2=n2
...
ngram N=nN

\1-grams:
p	w		[bow]
...

\2-grams:
p	w1 w2		[bow]
...

\N-grams:
p(wN | w1..w(N-1))	w1 ... wN
...

\end\

bow: backoff weights
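A minimal sketch of how a decoder uses such a table: look up the n-gram's log10 probability directly, and if it is absent, back off by adding the (n-1)-gram's backoff weight to the lower-order probability. The tiny hand-made table below is purely illustrative, not a real LM:

```python
def bigram_logprob(w1, w2, unigrams, bigrams):
    """ARPA-style backoff: p(w2|w1) = p(w1 w2) if listed, else bow(w1) + p(w2). Log10 values."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)]
    p, bow = unigrams[w2][0], unigrams[w1][1]
    return bow + p

# toy table: word -> (log10 prob, backoff weight)
unigrams = {"the": (-1.0, -0.5), "cat": (-2.0, -0.3), "dog": (-2.2, -0.3)}
bigrams = {("the", "cat"): -0.7}
```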
/***
|''Name:''|LegacyStrikeThroughPlugin|
|''Description:''|Support for legacy (pre 2.1) strike through formatting|
|''Version:''|1.0.2|
|''Date:''|Jul 21, 2006|
|''Source:''|http://www.tiddlywiki.com/#LegacyStrikeThroughPlugin|
|''Author:''|MartinBudden (mjbudden (at) gmail (dot) com)|
|''License:''|[[BSD open source license]]|
|''CoreVersion:''|2.1.0|
***/

//{{{
// Ensure that the LegacyStrikeThrough Plugin is only installed once.
if(!version.extensions.LegacyStrikeThroughPlugin) {
version.extensions.LegacyStrikeThroughPlugin = {installed:true};

config.formatters.push(
{
	name: "legacyStrikeByChar",
	match: "==",
	termRegExp: /(==)/mg,
	element: "strike",
	handler: config.formatterHelpers.createElementAndWikify
});

} //# end of "install only once"
//}}}
[lmthang@pie model]$ /home/lmthang/HYP/scripts/find_pattern.pl ' environment' < ../../../m-m_1/result/model/lex.0-0.f2n | sort -nrk 3 | head
# Copyright 2008 © by Luong Minh Thang
Pattern:  environment
ympäristö environment 0.8543724
ympäristö environmental 0.7312253
ttavat environmentalist 0.3333333
s environment-friendly 0.3333333
piirit environmentalist 0.3333333
ase environmentalist 0.3333333
ä environmentally-friendly 0.1666667
ystäv environmentally-friendly 0.1666667
ympäristö environmentally-friendly 0.1666667
tervee environment-friendly 0.1666667
[lmthang@pie model]$ /home/lmthang/HYP/scripts/find_pattern.pl ' environment' < ../../../w-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
# Copyright 2008 © by Luong Minh Thang
Pattern:  environment
ympäristöpiirit environmentalists 1.0000000
ympäristöystävällinen environmentally-friendly 0.5000000
ympäristöjä environments 0.5000000
NULL environments 0.5000000
NULL environmental 0.3652968
NULL environment 0.3029466
mieluummin environmentally-friendly 0.2500000
ilmastoystävällinen environmentally-friendly 0.2500000
ympäristöystävällisistä environment-friendly 0.2000000
ympäristöystävällisestä environment-friendly 0.2000000
[lmthang@pie model]$
[lmthang@pie model]$ /home/lmthang/HYP/scripts/find_pattern.pl ' environment' < ../../../m-m_2/result/model/lex.0-0.f2n | sort -nrk 3 | head
# Copyright 2008 © by Luong Minh Thang
Pattern:  environment
ystäv/STM+ environmental/STM+ 0.6000000
ympäristö/PRE+ environmental/STM 0.5910931
ympäristö/PRE+ environment/STM 0.3072125
ympäristö/STM+ environment/STM 0.2955381
ä/SUF+ environmentally-friendly 0.2000000
ystäv/STM+ environmentally-friendly 0.2000000
ympäristö/PRE+ environmentally-friendly 0.2000000
llinen/STM environmentally-friendly 0.2000000
auton/STM environmentally-friendly 0.2000000
NULL environmental/STM+ 0.2000000
[lmthang@pie model]$ /home/lmthang/HYP/scripts/find_pattern.pl ' environment' < ../../../m-m_3/result/model/lex.0-0.f2n | sort -nrk 3 | head
# Copyright 2008 © by Luong Minh Thang
Pattern:  environment
ympäristö/PRE environmental/STM 0.6596639
ympäristö/STM environment/STM 0.4027875
ympäristö/PRE environment/STM 0.3611498
ttavat/STM environmentalist/STM 0.3333333
piirit/STM environmentalist/STM 0.3333333
ase/PRE environmentalist/STM 0.3333333
ympäristö/PRE environmentally-friendly 0.2000000
s/SUF environment-friendly 0.2000000
osta/STM environmentally-friendly 0.2000000
llinen/STM environmentally-friendly 0.2000000
!DecodeStep
const Dictionary *m_ptr; //! pointer to translation/generation table

2 inherited classes: DecodeStepGeneration, DecodeStepTranslation

/*** Implementation of a phrase table in a trie.  Looking up a phrase of
 * length n words requires n look-ups to find the TargetPhraseCollection.
 */
class PhraseDictionaryMemory : public PhraseDictionary

Each PhraseDictionaryNode represents a word, and contains the string ending at that word

! TranslationOptionCollection
/** Contains all phrase translations applicable to current input type (a sentence or confusion network).
 * A key insight into efficient decoding is that various input
 * conditions (trellis, factored input, normal text, xml markup)
 * all lead to the same decoding algorithm: hypotheses are expanded
 * by applying phrase translations, which can be precomputed.
 *
 * The precomputation of a collection of instances of such TranslationOption 
 * depends on the input condition, but they all are presented to
 * decoding algorithm in the same form, using this class.
 *
 * This class cannot, and should not be instantiated directly. Instantiate 1 of the inherited
 * classes instead, for a particular input type
 **/

2 inherited classes: TranslationOptionCollectionText (for Sentence-InputType) , TranslationOptionCollectionConfusionNet (for ConfusionNet - InputType)

/** create translation options that exactly cover a specific input span. 
 * Called by CreateTranslationOptions() and ProcessUnknownWord()
 * \param decodeStepList list of decoding steps
 * \param factorCollection input sentence with all factors
 * \param startPos first position in input sentence
 * \param lastPos last position in input sentence 
 * \param adhereTableLimit whether phrase & generation table limits are adhered to
 */

void TranslationOptionCollection::CreateTranslationOptionsForRange(
	const DecodeGraph &decodeStepList
	, size_t startPos
	, size_t endPos
	, bool adhereTableLimit)

/** Contains partial translation options, while these are constructed in the class TranslationOption.
 *  The factored translation model allows for multiple translation and 
 *  generation steps during a single Hypothesis expansion. For efficiency, 
 *  all these expansions are precomputed and stored as TranslationOption.
 *  The expansion process itself may still explode, so efficient handling
 *  of partial translation options during expansion is required.
 *  This class assists in this task by implementing pruning.
 *  This implementation is similar to the one in HypothesisStack. */

class PartialTranslOptColl
{
 protected:
	std::vector<TranslationOption*> m_list;
	float m_bestScore; /**< score of the best translation option */
	float m_worstScore; /**< score of the worst translation option */
	size_t m_maxSize; /**< maximum number of translation options allowed */
	size_t m_totalPruned; /**< number of options pruned */

! Load phrase table
* StaticData::LoadPhraseTables()
PhraseDictionaryMemory *pd=new PhraseDictionaryMemory(numScoreComponent);
				if (!pd->Load(input
								 , output
								 , filePath
								 , weight
								 , maxTargetPhrase[index]
								 , GetAllLM()
								 , GetWeightWordPenalty()))

* PhraseDictionaryMemory: where the translation table file is read
line = ( sv ||| a ||| () (0) ||| (1) ||| 0.0588235 0.000428816 1 1 2.718
** break into token at "|||"
** extract score vector: 0.0588235 ;  0.000428816  ; 1 ; 1 ; 2.718
** extract source phrase vector of the LHS: "(" ; "sv"
** extract target phrase vector of the RHS: "a"
** take the log of the score vector and assign the lowest score of -100 to those that fall below: -2.8332138, -7.7544827, 0.00000000, 0.00000000, 0.99989629
** Compute the score for target phrase:
*** translation score: inner product of score vector with weight vector
*** nGram score: using language model
*** total score: m_fullScore = m_transScore + totalFutureScore + totalFullScore
							- (this->GetSize() * weightWP);	 // word penalty
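The parsing steps above can be sketched as follows (a simplified stand-in for the Moses loader; natural log with a floor of -100, matching the numbers quoted above):

```python
import math

def parse_phrase_table_line(line, floor=-100.0):
    """Split a phrase-table line on '|||' and floor-log its score vector."""
    fields = [f.strip() for f in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(s) for s in fields[-1].split()]
    log_scores = [max(math.log(s), floor) if s > 0 else floor for s in scores]
    return source, target, log_scores
```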

What are the mapping steps in moses.ini?

!
* Start with Main.cpp
* Read parameters from the ''moses.ini'' file
Parameter *parameter = new Parameter();
	if (!parameter->LoadParam(argc, argv))

* Load data structure 
StaticData::LoadDataStatic(parameter)
** Loading lexical distortion models: read distortion (reordering) files ''reordering-table.msd-bidirectional-fe.0.5.0-0.gz'', StaticData::LoadLexicalReorderingModel()
) yhteismarkkinoilla ja ||| fair and right to require certain minimum ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| fair and right to require certain ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| fair and right to require ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| fair and right to ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| only fair and right to require certain ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| only fair and right to require ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000
) yhteismarkkinoilla ja ||| only fair and right to ||| 0.20000 0.20000 0.60000 0.20000 0.20000 0.60000

-> need to understand what these numbers mean (for an msd-bidirectional-fe model they should be the monotone/swap/discontinuous probabilities, one triple per translation direction)

** Creating lexical reordering: LexicalReordering::LexicalReordering

* Decoding: Manager.cpp. Start with a seed hypothesis and process the stacks one by one. For each stack, perform pruning, then go through each hypothesis and expand it
/**
 * Main decoder loop that translates a sentence by expanding
 * hypotheses stack by stack, until the end of the sentence.
 */
void Manager::ProcessSentence()

std::vector < HypothesisStack > m_hypoStackColl; /**< stacks to store hypothesis (partial translations) */ 

/** Find all translation options to expand one hypothesis, trigger expansion
 * this is mostly a check for overlap with already covered words, and for
 * violation of reordering limits. 
 * \param hypothesis hypothesis to be expanded upon
 */
void Manager::ProcessOneHypothesis(const Hypothesis &hypothesis)

* The decoding process involves generating possible translations from ''phrase tables'':
/** Create all possible translations from the phrase tables
 * for a particular input sentence. This implies applying all
 * translation and generation steps. Also computes future cost matrix.
 * \param decodeStepList list of decoding steps
 * \param factorCollection input sentence with all factors
 */
void TranslationOptionCollection::CreateTranslationOptions(const vector <list <const DecodeStep* > * > &decodeStepVL)

* void Manager::ProcessSentence()
// collect translation options for this sentence
-> m_transOptColl->CreateTranslationOptions(decodeStepVL);

// search for best translation with the specified algorithm
m_search->ProcessSentence();
m_search can be either SearchNormal or SearchCubePruning

! TranslationOptionCollection::CreateTranslationOptions
/** Create all possible translations from the phrase tables
 * for a particular input sentence. This implies applying all
 * translation and generation steps. Also computes future cost matrix.
 * \param decodeStepList list of decoding steps
 * \param factorCollection input sentence with all factors
 */
* void TranslationOptionCollection::CreateTranslationOptions(const vector <DecodeGraph*> &decodeStepVL)
-> CreateTranslationOptionsForRange( decodeStepList, startPos, endPos, true)

/** create translation options that exactly cover a specific input span.
 * Called by CreateTranslationOptions() and ProcessUnknownWord()
 * \param decodeStepList list of decoding steps
 * \param factorCollection input sentence with all factors
 * \param startPos first position in input sentence
 * \param lastPos last position in input sentence
 * \param adhereTableLimit whether phrase & generation table limits are adhered to
 */
* void TranslationOptionCollection::CreateTranslationOptionsForRange(
for (iterColl = partTransOptList.begin() ; iterColl != partTransOptList.end() ; ++iterColl)
			{
				TranslationOption *transOpt = *iterColl;
				transOpt->CalcScore();
				Add(transOpt);
			}


* void TranslationOption::CalcScore()
''1st place where LM model is called''
const LMList &allLM = StaticData::Instance().GetAllLM();

allLM.CalcScore(GetTargetPhrase(), retFullScore, ngramScore, &m_scoreBreakdown);

// future score
m_futureScore = retFullScore - ngramScore
								+ m_scoreBreakdown.InnerProduct(StaticData::Instance().GetAllWeights()) - phraseSize * StaticData::Instance().GetWeightWordPenalty();

to compute the future score.
retFullScore = ngramScore + the scores of the leading words whose n-gram context is still incomplete (shorter than the LM order). ''Why retFullScore - ngramScore?'' The subtraction isolates exactly that estimated part: those leading words will be rescored with their real context once the phrase is attached to a hypothesis, so only the estimate belongs in the future score.


! SearchNormal::ProcessSentence()
* SearchNormal::ProcessSentence() :
for (iterHypo = sourceHypoColl.begin() ; iterHypo != sourceHypoColl.end() ; ++iterHypo)
		{
			Hypothesis &hypothesis = **iterHypo;
			ProcessOneHypothesis(hypothesis); // expand the hypothesis
		}

/** Find all translation options to expand one hypothesis, trigger expansion
 * this is mostly a check for overlap with already covered words, and for
 * violation of reordering limits.
 * \param hypothesis hypothesis to be expanded upon
 */
* void SearchNormal::ProcessOneHypothesis(const Hypothesis &hypothesis)

-> ExpandAllHypotheses(hypothesis, startPos, endPos); -> ExpandHypothesis(hypothesis, **iter, expectedScore);

/**
 * Expand one hypothesis with a translation option.
 * this involves initial creation, scoring and adding it to the proper stack
 * \param hypothesis hypothesis to be expanded upon
 * \param transOpt translation option (phrase translation)
 *        that is applied to create the new hypothesis
 * \param expectedScore base score for early discarding
 *        (base hypothesis score plus future score estimation)
 */
* void SearchNormal::ExpandHypothesis(const Hypothesis &hypothesis, const TranslationOption &transOpt, float expectedScore)
-> newHypo->CalcScore(m_transOptColl.GetFutureScore());

/***
 * calculate the logarithm of our total translation score (sum up components)
 */
** void Hypothesis::CalcScore(const SquareMatrix &futureScore) 
-> ffs[i]->Evaluate(
			*this,
			m_prevHypo ? m_prevHypo->m_ffStates[i] : NULL,
			&m_scoreBreakdown);

* ''2nd place where LM involve''
FFState* LanguageModel::Evaluate(
		const Hypothesis& hypo,
		const FFState* ps,
		ScoreComponentCollection* out) const {

''what is the use of LMState''
LMState* res = new LMState(prevlm);
			if (hypo.GetCurrTargetLength() == 0)
				return res;

* In Parameter::Parameter(),
AddParam("uses-morpheme-lm", "using word-based LM to score morpheme input. Default is false");

* In StaticData.h, declare
bool m_UseMorphemeLM; // Thang: add parameter
bool UseMorphemeLM() const {	return m_UseMorphemeLM;} // Thang add
void UseMorphemeLM(bool a){ m_UseMorphemeLM=a; }; // Thang add

* In StaticData::LoadData(),
	// Thang: morpheme LM score
	SetBooleanParameter( &m_UseMorphemeLM, "uses-morpheme-lm", false );

Similarly, add a "-thang-print" option for my own debugging printouts

* TargetPhrase.h: add the following parameters to speed up concatenation of morpheme sequences
string m_wordRep; // Thang, use with option "-uses-morpheme-lm": word representation of the morpheme input
int beginFlag; // Thang, use with option "-uses-morpheme-lm": flags that the beginning of this phrase must not be concatenated with any phrase
int endFlag; // Thang, use with option "-uses-morpheme-lm": flags that the end of this phrase must not be concatenated with any phrase

	inline std::string GetWordRep() const
	{
		return m_wordRep;
	}

	inline int GetBeginFlag() const
	{
		return beginFlag;
	}

	inline int GetEndFlag() const
	{
		return endFlag;
	}
	inline void SetWordRep(const std::string &wordRepCopy)
	{
		m_wordRep = wordRepCopy;
	}

	inline void SetBeginFlag(int beginFlagCopy)
	{
		beginFlag = beginFlagCopy;
	}

	inline void SetEndFlag(int endFlagCopy)
	{
		endFlag = endFlagCopy;
	}
* in PDTAimp.h, modify:
void CreateTargetPhrase(TargetPhrase& targetPhrase,
		StringTgtCand::first_type const& factorStrings,
		StringTgtCand::second_type const& scoreVector,
		StringWordAlignmentCand::second_type const& swaVector,
		StringWordAlignmentCand::second_type const& twaVector,
		Phrase const* srcPtr=0) const
* Test with different values of b
* Test word list with/without punctuation within words (e.g. compound words with "-" as a separator)
Make sure the word list files don't contain digits, punctuation, /, *, #, and are not UTF-8 encoded
* what is reordering-table.msd-bidirectional-fe.0.5.0-0? inside "model" folder after factored training
* what is the value of 2.718 in phrase-table.0-0.gz for all sentences?

* Moses reordering
This reordering model is suitable for local reorderings: they are discouraged, but may occur with sufficient support from the language model. Large-scale reorderings, however, are often arbitrary and affect translation performance negatively.

By limiting reordering we not only speed up the decoder; translation performance often increases as well. Reordering can be limited to a maximum number of words skipped (maximum d) with the switch -distortion-limit, or -dl for short.

Setting this parameter to 0 means monotone translation (no reordering). To allow unlimited reordering, use the value -1.
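The same limit can also be set in the configuration file via the [distortion-limit] section of moses.ini; the value 6 below is only an illustrative choice, not a recommendation:

```
[distortion-limit]
6
```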
* realized that when using nohup, since all output is redirected to a file, everything printed to STDERR goes into the file as well. As a result, the BLEU scores obtained last time were not correct

* in recase/recase.perl
open(MODEL,"$MOSES -v 0 -f $RECASE_MODEL -i $INFILE -dl 1|"); What does this do? It pipes the input through Moses using the recasing model (with distortion limit 1) to restore case

* modify scripts/wrap-xml.perl: to remove the redundant line "Language: en" in the output

  Evaluation of any-to-en translation using:
    src set "devtest2006" (1 docs, 2000 segs)
    ref set "devtest2006" (1 refs)
    tst set "devtest2006" (1 systems)

NIST score = 7.4298  BLEU score = 0.3112 for system "my-system"

# ------------------------------------------------------------------------

Individual N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram
        ------   ------   ------   ------   ------   ------   ------   ------
 NIST:  5.2709   1.6129   0.4237   0.0965   0.0258   0.0094   0.0036   0.0024

 BLEU:  0.6271   0.3663   0.2436   0.1677   0.1175   0.0842   0.0613   0.0457

# ------------------------------------------------------------------------
Cumulative N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram
        ------   ------   ------   ------   ------   ------   ------   ------
 NIST:  5.2709   6.8839   7.3075   7.4040   7.4298   7.4392   7.4427   7.4452

 BLEU:  0.6271   0.4793   0.3825   0.3112   0.2561   0.2128   0.1781   0.1503
MT evaluation scorer ended on 2008 Jan 24 at 22:33:49

* POS tags:
** NN0: common noun, neutral for number
** NN1: singular common noun
** NN2: plural common noun
** NP0: proper noun
** VVD: past tense form of verb
** VVG: -ing form of verb
** VVI: infinitive form of verb
** VVN: past participle form of verb

* compute probabilities and add new phrases to the phrase table
* OOV analysis
* PER
* modify decoding process: for each OOV word input, search for best approximation

(remember that the test data contains many pairs with length ratio > 9)
! Monthly plan
* Jan
** Data collection + Understand MOSES
** Morphological model
* Feb:
** Integrate morphological feature to MOSES
* March: 
** 1st system (mid of March)
* April
* May: 
** mainly reading (back home 7 May-25 May)
* June
** 2nd system (mid June)
* July: (if do well, relax 1-2 weeks)
* Aug
** 3rd system (mid August)
* Sep: Eval (2/3)
* Oct: Eval (1/3) + writing 
* Nov (3 Nov – 14 Nov): Lag time

!Weekly plan
* Week 2 (21 Jan – 25 Jan)
** Data collection (Chinese + probably Arabic)
** Processing data + 
** Check on MOSES again, to figure out the BLEU score of the en-es SMT from last time
** Ask Hendra about a baseline English-Chinese system to experiment with some minor changes, say removing all appearances of "le" from the Chinese corpus
* Week 3: Mon 28 Jan – Fri 1 Feb 2008
** Understand MOSES more
** Morphological preprocessing model to add additional features
* Week 4: Mon 4 Feb – Fri 8 Feb 2008
** Back home for CNY
* Week 5: Mon 11 Feb – Fri 15 Feb 2008
** Morphological preprocessing model (cont.)
** Start figuring out how to add additional feature to MOSES
* Week 6: Mon 18 Feb – Fri 22 Feb 2008
** Read Moses code
** Measuring training time -> debugging 
* Recess Week: Sat 23 Feb – Sun 2 Mar 2008
** Finish debugging -> finish measuring time
** Generate modified corpora (remove high freq words)
** Figuring out how to get word-word probs
* Week 7: Mon 3 Mar – Fri 7 Mar 2008
** Figuring out decoder process to incorporate word-word probs
* Week 8: Mon 10 Mar – Fri 14 Mar 2008
** Lag
* Week 9: Mon 17 Mar – Fri 21 Mar 2008
** Start training modified corpora
* Week 10: Mon 24 Mar – Fri 28 Mar 2008
** Post processing
** BLEU scores for the systems
-> 1st system
* Week 11: Mon 31 Mar – Fri 4 Apr 2008
* Week 12: Mon 7 Apr – Fri 11 Apr 2008
* Week 13: Mon 14 Apr – Fri 18 Apr 2008
* Reading Week: Sat 19 Apr – Fri 25 Apr 2008
* Examination (2 weeks): Sat 26 Apr – Sat 10 May 2008

* Vacation (12 weeks): Sun 11 May – Sun 3 Aug 2008

* Orientation Week: Mon 4 Aug - Sat 9 Aug 2008
* Week 1: Mon 11 Aug – Fri 15 Aug 2008
* Week 2: Mon 18 Aug – Fri 22 Aug 2008
* Week 3: Mon 25 Aug – Fri 29 Aug 2008
* Week 4: Mon 1 Sep – Fri 5 Sep 2008
* Week 5: Mon 8 Sep – Fri 12 Sep 2008
* Week 6: Mon 15 Sep – Fri 19 Sep 2008
* Recess Week: Sat 20 Sep – Sun 28 Sep 2008
* Week 7: Mon 29 Sep – Fri 3 Oct 2008 (b)
* Week 8: Mon 6 Oct – Fri 10 Oct 2008
* Week 9: Mon 13 Oct – Fri 17 Oct 2008
* Week 10: Mon 20 Oct – Fri 24 Oct 2008
* Week 11: Mon 27 Oct – Fri 31 Oct 2008 (c)
* Week 12: Mon 3 Nov – Fri 7 Nov 2008
* Week 13: Mon 10 Nov – Fri 14 Nov 2008
* Reading Week: Sat 15 Nov - Fri 21 Nov 2008
* Examination (2 weeks): Sat 22 Nov - Sat 6 Dec 2008
* morpheme concatenation using frequent n-grams as a baseline for the surface recovery algorithm

** debug morpheme recovery algorithm
[lmthang@aye scripts]$ more word_seq1.fi
heille sanon , että ennen kuin pitää puheen , on tutkittava ja arvioitava senhetkinen tilanne , koska poliittisen puheen täytyy aina olla realistinen ja sidoksissa todelliseen tilanteeseen sekä päämääriin , joihin me kaikki yhdymme .
haluaisin , että aloitettaisiin tästä vahvistetusta yhteistyöstä ja annettaisiin sen avulla joitakin esimerkkejä euroopan uusista mahdollisuuksista .
äänestys toimitetaan tänään klo 12.30 .
[lmthang@aye scripts]$ more out1.txt
heillekin sanonta , käsitettä ennenkin jotakuinkin pitää puheenjohtaja , huomioon tutkittava kuluttajat arvioinnin va välisen tämänhetkinen tilanne , koskaan poliittisen puhemies en kieltäytyy ainakin ollaan realistinen tietoja sidoksissa todelliseen tilanteeseen sekään päämääriin , joihinkin menneeseen kaikkien yhdymme .
haluaisin , käsitettä aloitetta isiin tä s tä vahvistetusta yhteis työ s pöytäkirjan annettaisiin välisen avullaan joitakin esimerkkejä euroopan uusista mahdollisuuksista .
äänestys toimitetaan tänään klooneja 12.30 .
[lmthang@pie scripts]$ more morph_seq1.fi
heille sanon , että ennen kuin pit ä ä puhe en , on tutki ttava ja arvioi ta va sen hetkinen tilan ne , koska poliittis en puhe en täytyy aina olla realist inen ja sido ksi ssa todellis een tilan teeseen sekä pää määri in , joihin me kaikki yhdy mme .
halua isi n , että aloite tta isiin tä s tä vahviste tusta yhteis työ s tä ja an nettaisi in sen avulla joita kin esi merkke jä euroopa n uusi sta mahdollis uuksista .
äänesty s toimite ta an tä nä ä n klo 12.30 .
* MERT (''done'')
** Baseline
*** Without MERT: ''BLEU = 19.6491, 58.0/28.7/17.8/11.6 (BP=0.811, ratio=0.827)''
*** With MERT: ''BLEU = 21.4737, 53.0/25.9/15.7/10.3 (BP=0.989, ratio=0.989)''
** Morph system
*** Without MERT: ''BLEU = 22.7044, 57.5/28.7/17.1/10.7 (BP=0.969, ratio=0.970)''
*** With MERT: ''BLEU = 23.2175, 56.6/28.3/17.1/10.8 (BP=0.995, ratio=0.995)''
 
* # OOV
* other analysis
* MOSES trace (-translation-details)
* MAX_PHRASE_LENGTH

* replicate the baseline acl05 http://www.statmt.org/wpt05/mt-shared-task/ (''done'')
** without mert: "BLEU = 20.6243, 60.8/30.2/17.4/10.4 (BP=0.859, ratio=0.868)"
** with mert (''done''): ''BLEU = 22.4473, 57.4/28.5/16.2/9.6 (BP=1.000, ratio=1.003)''

** train a new Morfessor for the acl05 data due to non-Unicode encoding (write script to modify the makefile, send email to ask about parameters) (''done'')
* May: 
** wrap up for the 1st system (back home 7 May-28 May) -> Meet Fri 30 May
''Aim'': 
** having 3 translation results to compare (normal-normal, word-word, morph-morph)
* June: preparing for 2nd system
* July: aim for 2nd system (mid July)
* Aug: preparing for 3rd system
* Sep: 
** 1- mid Sep: aim for 3rd system
** mid Sep- end Sep: Experiment
* Oct: Experiment + writing 
* Nov (3 Nov – 14 Nov): Lag time

! Previously
* Jan
** Data collection + Understand MOSES
** Morphological model
* Feb:
** Integrate morphological feature to MOSES
* March: 
** 1st system (mid of March)
* April
* May: 
** mainly reading (back home 7 May-25 May)
* June
** 2nd system (mid June)
* July: (if do well, relax 1-2 weeks)
* Aug
** 3rd system (mid August)
* Sep: Eval (2/3)
* Oct: Eval (1/3) + writing 
* Nov (3 Nov – 14 Nov): Lag time
* Rerun mert on word-morpheme translation
* debug giza++ for WARNING: DIFFERENT SUMS: (1) (nan)
 /home/lmthang/HYP/bin_02_June_08/GIZA++ -CoocurrenceFile ./result/giza.fi-en/fi-en.cooc -c ./result/corpus/fi-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4 -o ./result/giza.fi-en/fi-en -onlyaldumps 1 -p0 0.999 -s ./result/corpus/en.vcb -t ./result/corpus/fi.vcb

* Understand MOSES weights
* remove lines with high ratio (especially on Finnish side) (''done'')
* extract small corpus with specified size (together with the alignment) (''done'')
* figure score formula, add additional entries to phrase table
* surface recovery scripts (''done'')
* test Finnish - English translation, where the morphemes in Finnish 
* check Moses analysis tool dealing with word alignment
* train recased model for morpheme
Feb:
''Tasks'': run on multiple corpora (40K, 80K, 160K, 320K & full (714K)). Target
* improve word alignment: 
** break into morpheme, realign STM, translate at word-level (''done, collecting result'')
** model consonant gradation: conflate variations of STMs into 1 representative
* improve morphological segmentation:
** fix tagging: using freq (''progressing'')
** fix segmentation boundary: use consonant gradation model

Mar: experiment with other language pairs
* consider paraphrase approach (check with Preslav in Feb) -> might need to consider the reverse translation direction
* reconsider morphological generation (if experiments in Feb show good results)

* Long-term goal: work toward morphological generation & capturing morphological translation rules
** performing morpheme segmentation
** understand affix role
* Check the real issue in the current baseline SMT & morph SMT
* morphological prediction using maxent
* soft-pattern idea: using place holders to extract translation rules English function words -> Finnish morphemes

* comprehensive experiments:
** need to test on various corpora to test language independent approach
** what is BLEU score with the 95% confidence intervals? (Hendra paper)

* things to consider
** edit moses decoding to add sentence number, to use in parallel decoding for identification purpose & merging result (moses_err, n-best-file)
** randLM
* Yang and Kirchhoff, 2006: Word-based SMT + morpheme-based SMT ?
* Corston-Oliver and Gamon, 2004: normalize inflectional morphology (German-English)
* Popovic and Ney, 2004: inflected languages using stems, suffixes part-of-speech tags
* Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages
* Analysis of statistical and morphological classes to generate weighted reordering hypotheses on a Statistical Machine Translation system (wmt07)
* Inducing the morphological lexicon of a natural language from unannotated text
Machine translation in CHIME

Ordering Phrases with Function Words
http://wing.comp.nus.edu.sg/chime/070425/CHIME25April2007-v3.htm

Word Sense Disambiguation Improves Statistical Machine Translation

A constituency-based approach to phrase-based statistical machine translation
http://wing.comp.nus.edu.sg/chime/060329/chime.htm

Hiero: Finding Structure in Statistical Machine Translation
http://wing.comp.nus.edu.sg/chime/060411/NUSslides.pdf

Phrase-based Statistical Machine Translation: A Level of Detail Approach
! To be read
* Modeling lexical redundancy for machine translation
* Generating complex morphology for machine translation

! Read
* Exploring different representational units in English-to-Turkish statistical machine translation
** MOSES decoder: distortion limit set to unlimited (-1) and distortion weight (-weight-d) set to a very low value of 0.1 (instead of the default 1). The idea is to allow the decoder to consider ''longer-range distortion'': the constituent orders of Turkish and English are very different, so root words may have to be scrambled to rather long distances, along with the translation of function words and tags on the English side to morphemes on the Turkish side.
** 5-gram morpheme LM -> word-based LM for rescoring
** Problem: Most verbal nominalization on the English side were just aligned to the verb roots on the Turkish side. Additional markers on the Turkish side indicating the nominalization and agreement markers etc., were mostly unaligned 
Solution: selectively segmented (morphological segmentation) (? manual)
** Problem: get the root word translated correctly but morphemes are either not applicable, or in morphologically wrong positions.
Solution: utilize a morpheme-level "spelling corrector" to fix words that are 1-2 morpheme edits away from the correct form (not done yet)

* Generating complex morphology for machine translation: stop at 4.
* to run in VS2005, go to project properties:
** General/Debugging, set command argument and current directory
** C/C++/Preprocessing: set preprocessor definition for either LM_SRI, or LM_IRST, remove definition for LM_INTERNAL
** build SRILM in visual studio : http://www.inference.phy.cam.ac.uk/kv227/srilm/

! Things to get srilm run with Moses:
* In srilm, project sri_oolm add file Bleu.cc, recompile
http://hi.baidu.com/lucene/blog/item/a5c1e8c497798dad8326ac7d.html
* Moses project, 
** Properties/ C/C++ / Preprocessor, change definition LM_INTERNAL to LM_SRI since in LanguageModelFactory.cpp
 #ifdef LM_SRI
	lm = new LanguageModelSRI(true, scoreIndexManager);
 #elif LM_INTERNAL
 	lm = new LanguageModelInternal(true, scoreIndexManager);
LM_INTERNAL only allows language models of order < 3
** Properties/ C/C++/General, set additional include directories to point to include directories of srilm, and include dir of vld.h, zlib.h
** Make sure file LanguageModelSRI.cpp is not excluded in compilation
* moses-cmd project
** Properties/ C/C++ / Preprocessor, change definition LM_INTERNAL to LM_SRI since in LanguageModelFactory.cpp
** Properties/ C/C++/General, set additional include directories to vld.h, zlib.h
** Properties Linker, set additional dependencies to "zlib.lib", and srilm lib
* processLexicalTable project
** Properties Linker, set additional dependencies to srilm libs

Notice: In Linker/General/ put the necessary lib directory under "Additional library directories"
Create an SMT system from fi-en

structure of $SMT_HOME
+ lm: language model in general
+ lm/binary: binarized version of language model

* make sure $SMT_HOME/lm has the necessary language models: create them by running $SMT_HOME/runLM.sh (the help message will list the required parameters)
nohup ./runLM.sh 4 en "" 2>lm.4.en &
nohup ./runLM.sh 5 en "100" 2>lm.5.en.100 &
* make sure $SMT_HOME/recaser has the necessary recaser model: create it by running $SMT_HOME/runRecaser.sh (the help message will list the required parameters)
nohup ./runRecaser.sh "" en 2>recaser.en.log &
modify moses.ini to specify the correct paths
This was previously discussed in [[train-factored-phrase-model.perl]] and [[phrase-extract scoring]]. However, it's not quite clear there, so I describe the scoring mechanism again here.

Translation: Finnish-English. In the scoring notation below, f = English, e = Finnish.
lex.0-0.f2n: each line has the form "finnish english p", with p = P(finnish | english)
lex.0-0.n2f: each line has the form "english finnish p", with p = P(english | finnish)
* in ''lex.0-0.f2n'', search for "ympäristöongelmat"
ympäristöongelmat cities 0.0769231
ympäristöongelmat environmental 0.0230415 -> P(ympäristöongelmat | environmental)
ympäristöongelmat NULL 0.0001484
ympäristöongelmat problems 0.0462963 -> P(ympäristöongelmat | problems)
* in ''lex.0-0.n2f'', search for "ympäristöongelmat"
cities ympäristöongelmat 0.0769231
environmental ympäristöongelmat 0.3846154 -> P(environmental | ympäristöongelmat)
NULL ympäristöongelmat 0.1538462
problems ympäristöongelmat 0.3846154 -> P(problems | ympäristöongelmat)
* ''extract.0-0.gz'' search for "environment"
environmental problems ||| ympäristöongelmat ||| 0-0 1-0
environmental problems ||| , joilla ympäristöongelmat ||| 0-2 1-2
environmental problems ||| joilla ympäristöongelmat ||| 0-1 1-1
environmental problems ||| ympäristöongelmat ||| 0-0 1-0
environmental problems ||| ympäristöongelmien ||| 0-0
environmental problems ||| ympäristöongelmat ||| 0-0 1-0
environmental problems ||| ympäristöalan ongelmiin ||| 0-0 1-1
environmental problems ||| samoin kuin ympäristöongelmat ||| 0-2 1-2
environmental problems ||| kuin ympäristöongelmat ||| 0-1 1-1
environmental problems ||| ympäristöongelmat ||| 0-0 1-0
environmental problems ||| ympäristöä koskeviin maailmanlaajuisiin ongelmiin ||| 0-2 1-3
environmental problems ||| koskeviin maailmanlaajuisiin ongelmiin ||| 0-1 1-2
environmental problems ||| maailmanlaajuisiin ongelmiin ||| 0-0 1-1

* ''extract.0-0.inv.gz'', search for "ympäristöongelmat"
ympäristöongelmat ||| environmental problems ||| 0-0 0-1
ympäristöongelmat ||| environmental problems ||| 0-0 0-1
ympäristöongelmat ||| environmental problems ||| 0-0 0-1
ympäristöongelmat ||| environmental problems ||| 0-0 0-1

* ''phrase-table.0-0''
environmental problems ||| , joilla ympäristöongelmat ||| (2) (2) ||| () () (0,1) ||| 1 0.147929 0.0769231 1.71819e-05 2.718
environmental problems ||| joilla ympäristöongelmat ||| (1) (1) ||| () (0,1) ||| 1 0.147929 0.0769231 8.23178e-05 2.718
environmental problems ||| koskeviin maailmanlaajuisiin ongelmiin ||| (1) (2) ||| () (0) (1) ||| 1 0.15 0.0769231 6.64876e-08 2.718
environmental problems ||| kuin ympäristöongelmat ||| (1) (1) ||| () (0,1) ||| 1 0.147929 0.0769231 0.000108042 2.718
environmental problems ||| maailmanlaajuisiin ongelmiin ||| (0) (1) ||| (0) (1) ||| 1 0.15 0.0769231 0.000128008 2.718
environmental problems ||| samoin kuin ympäristöongelmat ||| (2) (2) ||| () () (0,1) ||| 1 0.147929 0.0769231 5.61171e-08 2.718
environmental problems ||| ympäristöalan ongelmiin ||| (0) (1) ||| (0) (1) ||| 1 0.00333333 0.0769231 0.000128008 2.718
environmental problems ||| ympäristöongelmat ||| (0) (0) ||| (0,1) ||| 1 0.147929 0.307692 0.0346689 2.718
environmental problems ||| ympäristöongelmien ||| (0) () ||| (0) ||| 0.037037 0.00014798 0.0769231 0.0046083 2.718
environmental problems ||| ympäristöä koskeviin maailmanlaajuisiin ongelmiin ||| (2) (3) ||| () () (0) (1) ||| 1 0.15 0.0769231 1.97335e-11 2.718

! Let's investigate the score of environmental problems ||| ympäristöongelmat ||| (0) (0) ||| (0,1) ||| 1 0.147929 0.307692 0.0346689 2.718
Currently, five different phrase translation scores are computed:
    * phrase translation probability φ(f|e) 
    * lexical weighting lex(f|e)
    * phrase translation probability φ(e|f)
    * lexical weighting lex(e|f)
    * phrase penalty (always exp(1) = 2.718) 

* phrase translation probability φ(f|e) = φ(environmental problems | ympäristöongelmat)
In "extract.0-0.inv.gz" there are 4 translations of "ympäristöongelmat", and all are to "environmental problems". Thus, the ''translation score'' = 4/4 = ''1''
* phrase translation probability φ(e|f) = φ(ympäristöongelmat | environmental problems )
In "extract.0-0.gz" there are 13 translations of "environmental problems", out of which 4 are to "ympäristöongelmat". Thus the ''translation score'' = 4/13 = ''0.307692''.

Let's look at the lexical scores and the scoring mechanism:
For an English phrase e = ew_1, ..., ew_n, where each word ew_i is translated from a source phrase f_i, the ''lexical score'' = P(ew_1 | f_1) * P(ew_2 | f_2) * ... * P(ew_n | f_n),
where P(ew_i | f_i) is computed as follows:
+ if ew_i is aligned to nothing: P(ew_i | NULL) = lex(ew_i | NULL)
+ if ew_i is aligned to f_i = fw_1, ..., fw_m: P(ew_i | f_i) = (lex(ew_i | fw_1) + ... + lex(ew_i | fw_m)) / m
* lexical weighting lex(f|e) = lex (environmental problems | ympäristöongelmat)
the alignment (0,1) says that ympäristöongelmat aligns to both words of the English phrase "environmental problems". Thus the lexical score = ''P(environmental | ympäristöongelmat) * P(problems | ympäristöongelmat)'' = 0.3846154 * 0.3846154 = ''0.147929''
* lexical weighting lex(e|f) = lex (ympäristöongelmat | environmental problems)
the alignment (0) (0) says that both environmental and problems align to ympäristöongelmat. Thus the lexical score = ''P(ympäristöongelmat | environmental problems)'' = (lex(ympäristöongelmat | environmental) + lex(ympäristöongelmat | problems)) / 2 = (0.0230415 + 0.0462963) / 2 = ''0.0346689''
Honours year project
HYP
/***
|''Name:''|SparklinePlugin|
|''Description:''|Sparklines macro|
***/
//{{{
if(!version.extensions.SparklinePlugin) {
version.extensions.SparklinePlugin = {installed:true};

//--
//-- Sparklines
//--

config.macros.sparkline = {};
config.macros.sparkline.handler = function(place,macroName,params)
{
	var data = [];
	var min = 0;
	var max = 0;
	var v;
	for(var t=0; t<params.length; t++) {
		v = parseInt(params[t]);
		if(v < min)
			min = v;
		if(v > max)
			max = v;
		data.push(v);
	}
	if(data.length < 1)
		return;
	var box = createTiddlyElement(place,"span",null,"sparkline",String.fromCharCode(160));
	box.title = data.join(",");
	var w = box.offsetWidth;
	var h = box.offsetHeight;
	box.style.paddingRight = (data.length * 2 - w) + "px";
	box.style.position = "relative";
	for(var d=0; d<data.length; d++) {
		var tick = document.createElement("img");
		tick.border = 0;
		tick.className = "sparktick";
		tick.style.position = "absolute";
		tick.src = "data:image/gif,GIF89a%01%00%01%00%91%FF%00%FF%FF%FF%00%00%00%C0%C0%C0%00%00%00!%F9%04%01%00%00%02%00%2C%00%00%00%00%01%00%01%00%40%02%02T%01%00%3B";
		tick.style.left = d*2 + "px";
		tick.style.width = "2px";
		v = Math.floor(((data[d] - min)/(max-min)) * h);
		tick.style.top = (h-v) + "px";
		tick.style.height = v + "px";
		box.appendChild(tick);
	}
};


}
//}}}
bool StaticData::LoadPhraseTables() {

// get all the translation weights
		vector<float> weightAll									= Scan<float>(m_parameter->GetParam("weight-t"));

// ttable-limit
vector<size_t>	maxTargetPhrase					= Scan<size_t>(m_parameter->GetParam("ttable-limit"));

			string filePath= token[3]; // the phrase table path
			size_t numScoreComponent = Scan<size_t>(token[2]); // num of translation scores

// Load all the scores to weights
for (size_t currScore = 0 ; currScore < numScoreComponent; currScore++)
				weight.push_back(weightAll[weightAllOffset + currScore]);

// load normal phrase table
PhraseDictionaryMemory *pd=new PhraseDictionaryMemory(numScoreComponent);
				if (!pd->Load(input
								 , output
								 , filePath
								 , weight
								 , maxTargetPhrase[index]
								 , GetAllLM()
								 , GetWeightWordPenalty()))
				{
					delete pd;
					return false;
				}

// load binary phrase table
PhraseDictionaryTreeAdaptor *pd=new PhraseDictionaryTreeAdaptor(numScoreComponent,(currDict==0 ? m_numInputScores : 0));
				if (!pd->Load(input,output,filePath,weight,
									 maxTargetPhrase[index],
									 GetAllLM(),
									 GetWeightWordPenalty()))
				{
					delete pd;
					return false;
				}

* For binary phrase table loading
In PhraseDictionaryTreeAdaptor::Load, call
	imp->Create(input,output,filePath,
							weight,languageModels,weightWP);
Create() is from PDTAimp.h

/* horizontal main menu */

#displayArea { margin: 1em 15.5em 0em 1em; } /* use the full horizontal width */

#topMenu { background: [[ColorPalette::PrimaryMid]]; color: [[ColorPalette::PrimaryPale]]; padding: 0.2em 0.2em 0.2em 0.5em; border-bottom: 2px solid #000000; }

#topMenu br { display: none; }

#topMenu .button, #topMenu .tiddlyLink, #topMenu a { margin-left: 0.25em; margin-right: 0.25em; padding-left: 0.5em; padding-right: 0.5em; color: [[ColorPalette::PrimaryPale]]; font-size: 1.15em; }

#topMenu .button:hover, #topMenu .tiddlyLink:hover { background: [[ColorPalette::PrimaryDark]]; }

 .firstletter{ float:left; width:0.75em; font-size:400%; font-family:times,arial; line-height:60%; }

.viewer .FOO table tr.oddRow { background-color: #bbbbbb; }
.viewer .FOO table tr.evenRow { background-color: #fff; } 


/*Invisible table*/

.viewer .invisibletable table { 
border-color: white;
 }

.viewer .invisibletable table td { 
font-size: 1em;
font-family: Verdana;
border-color: white;
padding: 10px 20px 10px 0px;
text-align: left;
vertical-align: top;
} 

.viewer .invisibletable table th { 
color: #005566;
background-color: white;
border-color: white;
font-family: Verdana;
font-size: 1.2em;
font-weight: bold;
padding: 10px 20px 10px 0px;
text-align: left;
vertical-align: top;
} 

/* GIFFMEX TWEAKS TO STYLESHEETPRINT (so that nothing but tiddler title and text are printed) */


@media print {#mainMenu {display: none ! important;}}
@media print {#topMenu {display: none ! important;}}
@media print {#sidebar {display: none ! important;}}
@media print {#messageArea {display: none ! important;}} 
@media print {#toolbar {display: none ! important;}}
@media print {.header {display: none ! important;}}
@media print {.tiddler .subtitle {display: none ! important;}}
@media print {.tiddler .toolbar {display: none ! important;}}
@media print {.tiddler .tagging {display: none ! important;}}
@media print {.tiddler .tagged {display: none ! important;}}
@media print {#displayArea {margin: 1em 1em 0em 1em;}}
@media print {.pageBreak {page-break-before: always;}}

a.button{
 border: 0;

} 

/*Color changes*/


#sidebarOptions input {
	border: 1px solid [[ColorPalette::TertiaryPale]];
}

#sidebarOptions .sliderPanel {
	background: [[ColorPalette::TertiaryPale]];
}

#sidebarOptions .sliderPanel a {
	border: none;
	color: [[ColorPalette::PrimaryMid]];
}

#sidebarOptions .sliderPanel a:hover {
	color: [[ColorPalette::Background]];
	background: [[ColorPalette::TertiaryPale]];
}

#sidebarOptions .sliderPanel a:active {
	color: [[ColorPalette::PrimaryMid]];
	background: [[ColorPalette::TertiaryPale]];
}

/*Makes sliders bold*/

.tuduSlider .button{font-weight: bold;
}

/* (2) Adjusts the color for all headlines so they are both readable and match my color schemes. */

h1,h2,h3,h4,h5 {
 color: #000;
 background: [[ColorPalette::TertiaryPale]];
}

.title {
color: [[ColorPalette::PrimaryMid]];
}

/* (2) Makes text verdana. */

body {
/* font-family: verdana;*/
font-size: 9pt;
}

/* (4) Allows for Greek - one way */

   .greek {
      font-family: Palatino Linotype;
      font-style: normal;
      font-size: 150%;
   }

/* (5) Shortens the height of the Header */

.headerShadow {
 padding: 1.5em 0em 1em 1em;
}

.headerForeground {
 padding: 2em 0em 1em 1em;
}

/* (8) Makes ordered and unordered lists double-spaced between items but single-spaced within items. */

/*.viewer li {
   padding-top: 0.5em;
   padding-bottom: 0.5em;

} */

/*Makes block quotes line-less*/

.viewer blockquote {
border-left: 0px;
margin-top:0em;
margin-bottom:0em; 
}

/* Cosmetic fixes that probably should be included in a future TW... */

.viewer .listTitle { list-style-type:none; margin-left:-2em; }
.editorFooter .button { padding-top: 0px; padding-bottom:0px; }

Important stuff. See TagglyTaggingStyles and HorizontalMainMenuStyles

[[Styles TagglyTagging]]
[[Styles HorizontalMainMenu]]

Just colours, fonts, tweaks etc. See MessageTopRight and SideBarWhiteAndGrey

body { 
  background: #eee; }
.headerForeground a { 
  color: #6fc;}
.headerShadow { 
  left: 2px; 
  top: 2px; }
.siteSubtitle { 
  padding-left: 1.5em; }

.shadow .title {
  color: #999; }

.viewer pre { 
  background-color: #f8f8ff; 
  border-color: #ddf }

.tiddler {
  border-top:    1px solid #ccc; 
  border-left:   1px solid #ccc; 
  border-bottom: 3px solid #ccc; 
  border-right:  3px solid #ccc; 
  margin: 0.5em; 
  background:#fff; 
  padding: 0.5em; 
  -moz-border-radius: 1em; }

#messageArea { 
  background-color: #eee; 
  border-color: #8ab; 
  border-width: 4px; 
  border-style: dotted; 
  font-size: 90%; 
  padding: 0.5em; 
  -moz-border-radius: 1em; }

#messageArea .button { text-decoration:none; font-weight:bold; background:transparent; border:0px; }

#messageArea .button:hover {background: #acd; }

.editorFooter .button { 
  padding-top: 0px; 
  padding-bottom:0px; 
  background: #fff;
  color: #000; 
  border-top:    1px solid #ccc; 
  border-left:   1px solid #ccc; 
  border-bottom: 2px solid #ccc; 
  border-right:  2px solid #ccc; 
  margin-left: 3px;
  padding-top: 1px;
  padding-bottom: 1px;
  padding-left: 5px;
  padding-right: 5px; }
  
.editorFooter .button:hover { 
  border-top:    2px solid #ccc; 
  border-left:   2px solid #ccc; 
  border-bottom: 1px solid #ccc; 
  border-right:  1px solid #ccc; 
  margin-left: 3px;
  padding-top: 1px;
  padding-bottom: 1px;
  padding-left: 5px;
  padding-right: 5px; }

.tagged {
  padding: 0.5em;
  background-color: #eee;
  border-top:    1px solid #ccc; 
  border-left:   1px solid #ccc; 
  border-bottom: 3px solid #ccc; 
  border-right:  3px solid #ccc; 
  -moz-border-radius: 1em; }

.selected .tagged {
  padding: 0.5em;
  background-color: #eee;
  border-top:    1px solid #ccc; 
  border-left:   1px solid #ccc; 
  border-bottom: 3px solid #ccc; 
  border-right:  3px solid #ccc; 
  -moz-border-radius: 1em; }

Clint's fix for weird IE behaviour
body {position:static;}
.tagClear{margin-top:1em;clear:both;}
! 26 Aug 2008
Possible improvements:
* if 

!Steps:
* List of morphemes & total frequencies of each morpheme being a PRE, STEM, SUF
* 

!
* List of words & its dominant categories (PRE, STEM, SUF)
* Construct Morfessor morphological segmentation dictionary for Finnish
Each Morfessor entry M_i is of the form e_i -> e_i1 .. e_{ih_i}, where h_i is the number of morphemes
* Train English vs. Finnish (Finnish is morphologically segmented)
* Translate English -> Finnish
* Recover Finnish full forms: given a translated sentence having m stems e'_1, ..., e'_m
** for each stem e_i, consider window of size (2*k + 1):  [e'_(i-k), ..., e'_i, ..., e'_(i+k)]
let T(e'_i) be the set of Morfessor entries {M_j} satisfying e'_i \in {e_j1, .., e_{jh_j}}
each entry M_j in T(e'_i) has a frequency -> a probability P(M_j) for each entry
Find the best entry M_j using dynamic programming, based on the probability P(M_j) of each entry and its coverage {e_j1, ..., e_{jh_j}}
* Include related work on morphology-aware SMT

Finnish has fewer words
for the same text compared to Swedish or Danish, and thus
one word includes more information on average. One mistake
in one suffix of a word is enough to mark the word
as an error. This does not usually prevent understanding
the translation, but will drop the scores as much as more
“serious” mistakes do.

Number of sentences not fully translated out of 1 000 with word-based and morph-based phrases. The numbers were the same with all of the tested language models and maximum phrase length combinations.

An examination
of the untranslated words reveals that a higher
number of compound words and inflected word forms are
left untranslated by the word-based systems

We noticed that especially when translating into Finnish,
both the word and morph models experience difficulties in
getting the grammatical endings right. In order to achieve
better results it seems that more elaborate models of syntax
are needed, or the amount of training data must be increased.
However, since the morph model is capable of translating
previously unseen compound words by decomposing
them into parts, one may wonder whether the morph
model might outperform the word model if the grammatical
word endings are disregarded in the evaluation. That is,
how do the approaches compare if every word in the proposed
translations as well as the reference are restored to
their baseforms before the BLEU scores are calculated?

Compound words are common in the three languages studied.
Additionally, inflectional and derivational suffixes exist,
to a very high degree in Finnish, and to some extent in
Swedish and Danish

* analysis tables on token type of word/morph

* discussion on perflexity value of Morfessor

* appendix on Moses code understanding and trans score, lex score, hypothesis score
Available: recase model, morph language model
* train morph translation model
* test if tunning improve the translation quality
* prepare data without lower case
luongmin@access7:/research/trial/luongmin/result$ grep ' take ' acl05_full-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
NULL take 0.1956303
toimitetaan take 0.1000899
ottaa take 0.0593557
otettava take 0.0324816
ryhtyä take 0.0236470
käyttää take 0.0203671
tehdä take 0.0188330
huomioon take 0.0184098
ottamaan take 0.0181982
osallistua take 0.0102100
luongmin@access7:/research/trial/luongmin/result$ grep ' took ' acl05_full-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
NULL took 0.2149758
otti took 0.0381643
kesti took 0.0342995
pidettiin took 0.0270531
oli took 0.0265700
teki took 0.0251208
tapahtui took 0.0183575
ottanut took 0.0140097
käytiin took 0.0120773
tehnyt took 0.0111111
luongmin@access7:/research/trial/luongmin/result$ grep ' taken ' acl05_full-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
NULL taken 0.2294248
huomioon taken 0.0707769
ottanut taken 0.0250648
ottaa taken 0.0248728
toimitetaan taken 0.0199750
otettu taken 0.0179583
tehty taken 0.0174782
tehdään taken 0.0148852
tehnyt taken 0.0138289
tehdä taken 0.0115241

luongmin@access7:/research/trial/luongmin/result$ grep ' discuss ' acl05_full-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
keskustella discuss 0.2177686
NULL discuss 0.0706612
siitä discuss 0.0681818
keskustelemaan discuss 0.0661157
keskusteltava discuss 0.0603306
keskustelemme discuss 0.0458678
puhua discuss 0.0347107
käsitellä discuss 0.0305785
asiasta discuss 0.0301653
keskustellaan discuss 0.0148760
luongmin@access7:/research/trial/luongmin/result$ grep ' discussed ' acl05_full-w/result/model/lex.0-0.f2n | sort -nrk 3 | head
NULL discussed 0.0777696
keskusteltu discussed 0.0674982
keskustellaan discussed 0.0649303
keskustelleet discussed 0.0616288
keskustelimme discussed 0.0579604
keskusteltiin discussed 0.0520910
keskustella discussed 0.0476889
siitä discussed 0.0319149
keskustellut discussed 0.0249450
keskusteltava discussed 0.0245781
luongmin@access7:/research/trial/luongmin/result$ grep ' discuss/' acl05_full-m/result/model/lex.0-0.f2n | sort -nrk 3 | head
keskus/STM+ discuss/STM 0.2697410
keskus/STM+ discuss/STM+ 0.2604285
t/SUF+ discuss/STM 0.1396083
ella/SUF discuss/STM 0.1132870
t/SUF+ discuss/STM+ 0.1099840
tele/STM+ discuss/STM 0.0745420
tele/STM+ discuss/STM+ 0.0608616
mme/SUF discuss/STM+ 0.0368133
maan/STM discuss/STM 0.0357970
puhu/STM+ discuss/STM 0.0349547
* acl05
** original train size: 716960 sentences
we remove 2607 sentences where  english/finnish > 3 or finnish/english > 2
** original dev size: 2000 sentences
we remove 5 sentences where  english/finnish > 3 or finnish/english > 2
Remove line 274: I would like to congratulate the President of the Commission on his speech and I welcome his authoritative mention , in this Chamber yesterday , of the role , collegial nature and managerial capacity of the Commission .
Remove line 275: Lastly , we must provide the countries and peoples of South-East Europe with the hope that they will be able to penetrate the institutional bounds of the Union .
Remove line 966: The Commission has presented a paper on this to the Committee on Budgets .
Remove line 1438: Report ( A5-0344 / 2000 ) by Mr Hatzidakis , on behalf of the Committee on Regional Policy , Transport and Tourism , on the proposal for a European Parliament and Council regulation on the accelerated phasing-in of double-hull or equivalent design standards for single-hull oil tankers [ COM ( 2000 ) 142 - C5 ­ 0173 / 2000 - 2000 / 0067 ( COD ) ]
Remove line 1440: ( Parliament adopted the resolution )
Remove line 274: Näiden tavoitteiden toteuttaminen kokonaisuudessaan ja varhaisessa vaiheessa on yhtenäisyytemme koetinkivenä .
Remove line 275: Eurooppa kuitenkin ulottuu Välimeren alueellekin .
Remove line 966: Kun viime vuoden lopussa osoittautui , että humanitaarisen avun alkuperäisiä tavoitteita ei voitu saavuttaa kohtuullisten valvonta- ja tarkastustoimien avulla , komissio päätti keskeyttää toiminnan , vaikka käytettävissä oli vielä määrärahoja .
Remove line 1438: ( Parlamentti hyväksyi lainsäädäntöpäätöslauselman . )
Remove line 1440: Guy-Quintin laatima budjettivaliokunnan mietintö ( A5-0327 / 2000 ) valkoisesta kirjasta komission uudistamisesta ( budjettivaliokuntaa koskevat näkökohdat ) ( KOM ( 2000 ) 200 - C5 ­ 0447 / 2000 - 2000 / 2217 ( COS ) )

** original test size: 2000 sentences
we remove 5 sentences where  english/finnish > 3 or finnish/english > 2
Remove line 7: There are many of us who want a federation of nation states , which means that each state must find the position that best suits it .
Remove line 142: Report ( A5-0248 / 2000 ) by Mr Queiró , on behalf of the Committee on Foreign Affairs , Human Rights , Common Security and Defence Policy , on the European Parliament resolution on Hungary 's membership application to the European Union and the state of negotiations ( COM ( 1999 ) 505 - C5-0028 / 2000 - 1997 / 2175 ( COS ) )
Remove line 797: On behalf of the Committee on Fisheries , I should like to call attention to a fairly small number of points .
Remove line 1438: ( Parliament adopted the legislative resolution )
Remove line 1440: Report ( A5-0329 / 2000 ) by Mr Pomés Ruiz , on behalf of the Committee on Budgetary Control , on the White Paper on reforming the Commission ( aspects concerning the Committee on Budgetary Control ) [ COM ( 2000 ) 0200 - C5-0445 / 2000 - 2000 / 2215 ( COS ) ]
Remove line 7: Monet meistä haluavat kansallisten jäsenvaltioiden muodostamaa liittovaltiota .
Remove line 142: ( Parlamentti hyväksyi päätöslauselman . )
Remove line 797: Haluaisin kalatalousvaliokunnan puolesta korostaa muutamia kohtia .
Remove line 1438: Me odotamme , syntyykö joulukuussa liikenneasiain neuvostossa yhteistä kantaa , ja siitä eteenpäin sitoudumme siihen , että tärkeysjärjestyksessä meidän aivan ensimmäinen tehtävämme on saattaa päätökseen kaksi muuta mietintöä mahdollisimman nopeasti , kevääseen mennessä , jos mahdollista .
Remove line 1440: ( Parlamentti hyväksyi päätöslauselman . )
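The removal criterion above (drop sentence pairs where english/finnish > 3 or finnish/english > 2 in token counts) can be sketched as a toy Python filter; function names are illustrative, not the actual cleaning script:

```python
def keep_pair(en, fi, max_en_fi=3.0, max_fi_en=2.0):
    """Return True if the sentence pair passes the length-ratio filter."""
    en_len = len(en.split())
    fi_len = len(fi.split())
    if en_len == 0 or fi_len == 0:
        return False
    return en_len / fi_len <= max_en_fi and fi_len / en_len <= max_fi_en

def filter_corpus(en_sents, fi_sents):
    """Keep only parallel pairs passing the filter; returns two parallel lists."""
    kept = [(e, f) for e, f in zip(en_sents, fi_sents) if keep_pair(e, f)]
    return [e for e, _ in kept], [f for _, f in kept]
```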
* Lowercase the development test set:
lowercase.perl < test2000.fr > test2000.fr.lowercase

* Run the decoder with the given phrase table and language model:
pharaoh -f pharaoh.fr.ini < test2000.fr.lowercase > test2000.fr.out

* The output file should start with
mr president , what we will have to respond to biarritz , is looking a little further .
the elect us as have just as much duty-bound to encourage it to make progress , albeit with adversity , that of passing on the messages we receive from the public opinion in all our countries .
with regard to the events of recent times , the issue of the price of fuel i also think that particularly well .

* Get the reference translations, lowercase them, and evaluate the output with BLEU:
lowercase.perl < test2000.en > test2000.en.lowercase
multi-bleu.perl test2000.en.lowercase < test2000.fr.out

# You should get a score of 26.76% BLEU. Note that you may get a better score with a different parameter setting. See the Pharaoh manual for more details.
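multi-bleu.perl combines clipped n-gram precisions (n = 1..4) with a brevity penalty; a simplified single-reference Python sketch (with ad-hoc smoothing of zero counts, so it is not numerically identical to the script):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-reference BLEU: clipped n-gram precisions plus brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        # smoothing: avoid log(0) when a higher-order n has no matches
        log_prec += math.log(max(matches, 1e-9) / total) / max_n
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)
```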

* Finnish corpus: 11318863 tokens, 392311 types
* Finnish corpus (lowercased, remove punctuation): 11318863 tokens, 334893 types
1344	contract
2088	contribute
2564	encourage
2670	consult
3141	education
3600	adminis
3914	business
4083	farm
4532	demand
4653	promis
5207	success
9596	interest
10717	democra
10770	employ
10797	discuss
10830	environment
13461	govern
14762	produc
16493	politic
16818	econom
19312	develop
21222	support
21442	amend
22007	proposal
24095	agree
32050	nation
37979	report
53367	use
64637	european
84806	europe
 # Given a moses.ini file and an input text prepare minimized translation
 # tables and a new moses.ini, so that loading of tables is much faster.

 # consider phrases in input up to this length 
 # in other words, all phrase-tables will be truncated at least to 10 words per
 # phrase
Studies on machine translation between close languages are generally concentrated around certain Slavic languages (e.g., Czech→Slovak, Czech→Polish, Czech→Lithuanian (Hajic et al., 2003)) and languages spoken in the Iberian Peninsula (e.g., Spanish↔Catalan (Canals et al., 2000), Spanish↔Galician (Corbi-Bellot et al., 2003) and Spanish↔Portuguese (Garrido-Alenda et al., 2003)).
* change to binary phrase table, reordering phrase table, modify ini script before tuning 
top - 11:15:47 up 46 days, 21:55,  2 users,  load average: 11.31, 11.48, 10.61
Tasks: 197 total,   6 running, 191 sleeping,   0 stopped,   0 zombie
Cpu(s): 85.9%us,  0.5%sy,  0.0%ni,  1.8%id, 11.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8178368k total,  8130592k used,    47776k free,     1504k buffers
Swap: 66412688k total, 10292444k used, 56120244k free,   196780k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19774 lmthang   25   0 83780  80m  820 R  100  1.0   2:02.75 GIZA++
19686 lmthang   25   0 90792  87m  820 R   98  1.1   6:00.58 GIZA++
19483 lmthang   19   0  162m  12m  800 R   92  0.2  24:17.36 score-nbest.py
19521 lmthang   18   0 2011m 524m  816 R   67  6.6  10:16.51 moses
18924 lmthang   18   0  589m 307m  816 R   64  3.9  72:39.32 moses
19642 lmthang   18   0 3766m 2.5g 1080 D    1 31.8   6:50.91 moses
19482 lmthang   18   0 61624 2516  312 S    0  0.0   0:51.54 sort
19562 lmthang   18   0 3510m 1.0g  732 D    0 13.2   6:04.31 moses
19762 lmthang   15   0 30780 2192 1520 R    0  0.0   0:00.69 top



sh: line 1: 12100 Segmentation fault      /home/lmthang/HYP/src/moses/moses-cmd/src/moses -config filtered/moses.ini -inputtype 0 -w -0.687588 -lm 0.009388 -d 0.039359 0.083356 0.000572 -0.007485 0.003778 -0.023522 -0.046637 -tm -0.005330 0.024834 0.064540 0.000418 0.003193 -n-best-list run2.best100.out 100 -i /mnt/homes/lmthang/HYP/scripts/acl05_full_10/w-m_100/acl05_full.dev.processed.en > run2.out
Exit code: 139
The decoder died. at /home/lmthang/HYP/bin_02_June_08/moses-scripts/scripts-20080602-2352/training/mert-moses.pl line 772.



*** glibc detected *** /home/lmthang/HYP/src/moses/moses-cmd/src/moses: free(): invalid pointer: 0x3599f890 ***
/======= Backtrace: =========
/lib/libc.so.6[0xab3b16]
/lib/libc.so.6(cfree+0x90)[0xab7070]
/usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7f52151]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e977]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x806e965]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x8064c43]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x8061850]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x80621b8]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x809eb92]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses[0x8095b7a]
/lib/libc.so.6(exit+0xe9)[0xa769d9]
/lib/libc.so.6(__libc_start_main+0xe4)[0xa60df4]
/home/lmthang/HYP/src/moses/moses-cmd/src/moses(__gxx_personality_v0+0xbd)[0x804bc81]
/======= Memory map: ========
00a2d000-00a47000 r-xp 00000000 09:02 15171585                           /lib/ld-2.5.so
00a47000-00a48000 r--p 00019000 09:02 15171585                           /lib/ld-2.5.so
00a48000-00a49000 rw-p 0001a000 09:02 15171585                           /lib/ld-2.5.so
00a4b000-00b88000 r-xp 00000000 09:02 8716308                            /lib/libc-2.5.so
00b88000-00b8a000 r--p 0013c000 09:02 8716308                            /lib/libc-2.5.so
00b8a000-00b8b000 rw-p 0013e000 09:02 8716308                            /lib/libc-2.5.so
00b8b000-00b8e000 rw-p 00b8b000 00:00 0
00ba9000-00bce000 r-xp 00000000 09:02 8716346                            /lib/libm-2.5.so
00bce000-00bcf000 r--p 00024000 09:02 8716346                            /lib/libm-2.5.so
00bcf000-00bd0000 rw-p 00025000 09:02 8716346                            /lib/libm-2.5.so
00cd2000-00cdd000 r-xp 00000000 09:02 8717060                            /lib/libgcc_s-4.1.2-20080102.so.1
00cdd000-00cde000 rw-p 0000a000 09:02 8717060                            /lib/libgcc_s-4.1.2-20080102.so.1
00d12000-00d24000 r-xp 00000000 09:02 10961094                           /usr/lib/libz.so.1.2.3
00d24000-00d25000 rw-p 00011000 09:02 10961094                           /usr/lib/libz.so.1.2.3
08048000-08114000 r-xp 00000000 00:15 105283593                          /mnt/homes/lmthang/HYP/src/moses/moses-cmd/src/moses
08114000-08115000 rw-p 000cb000 00:15 105283593                          /mnt/homes/lmthang/HYP/src/moses/moses-cmd/src/moses
08115000-085ac000 rw-p 08115000 00:00 0
095db000-e57c7000 rw-p 095db000 00:00 0
f7d00000-f7d21000 rw-p f7d00000 00:00 0
f7d21000-f7e00000 ---p f7d21000 00:00 0
f7e9f000-f7ea1000 rw-p f7e9f000 00:00 0
f7ea1000-f7f7f000 r-xp 00000000 09:02 10965054                           /usr/lib/libstdc++.so.6.0.8
f7f7f000-f7f82000 r--p 000dd000 09:02 10965054                           /usr/lib/libstdc++.so.6.0.8
f7f82000-f7f84000 rw-p 000e0000 09:02 10965054                           /usr/lib/libstdc++.so.6.0.8
f7f84000-f7f8a000 rw-p f7f84000 00:00 0
f7f9f000-f7fa1000 rw-p f7f9f000 00:00 0
ffe9d000-ffea1000 rw-p ffe9d000 00:00 0                                  [stack]
ffffe000-fffff000 r-xp ffffe000 00:00 0
sh: line 1: 11579 Aborted                 /home/lmthang/HYP/src/moses/moses-cmd/src/moses -config filtered/moses.ini -inputtype 0 -w -0.427595 -lm 0.010922 -d 0.099354 0.032944 -0.001238 -0.170074 0.041642 0.081615 -0.033839 -tm 0.003272 0.025515 0.030671 0.029248 -0.012071 -n-best-list run2.best100.out 100 -i /mnt/homes/lmthang/HYP/scripts/acl05_full_10/w-m_10/acl05_full.dev.processed.en > run2.out
Exit code: 134
The decoder died. at /home/lmthang/HYP/bin_02_June_08/moses-scripts/scripts-20080602-2352/training/mert-moses.pl line 772.
* to score phrase-phrase translation given the 
** phrase-phrase alignment
tapahtuisi ||| this to be ||| 0-0
tapahtuisi ||| this to be possible ||| 0-0
näin tapahtuisi ||| this to be possible , ||| 1-0 0-4
jotta näin tapahtuisi ||| this to be possible , ||| 2-0 1-4

** lexical translation with probability
seekers eurooppaan 1.0000000
their turkki 0.5000000
turkey turkki 0.5000000
international asiaankuuluvia 1.0000000

* There are two scores, output to ''phrase-table.0-0.half.n2f..''

 " ||| ' ||| (0) ||| (0) ||| 0.5 1
 " ||| cyprus ' ||| (1) ||| () (0) ||| 0.166667 0.0011338
 " ||| from northern cyprus ' ||| (3) ||| () () () (0) ||| 0.166667 2.91501e-009
 " ||| northern cyprus ' ||| (2) ||| () () (0) ||| 0.166667 1.2855e-006

** Translation score (e.g. 0.5 + 0.166667 + 0.166667 + 0.166667 = 1)
" ||| ' ||| 0-0
" ||| ' ||| 0-0
" ||| ' ||| 0-0
" ||| cyprus ' ||| 0-1
" ||| from northern cyprus ' ||| 0-3
" ||| northern cyprus ' ||| 0-2

There are 6 translation options for " , and  " ||| ' repeats 3 times. Thus, its translation probability is 3/6 = 0.5
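The relative-frequency counting above (3 of the 6 extracted pairs for " are " ||| ' , giving 0.5) can be sketched as:

```python
from collections import Counter

def phrase_probs(extracted_pairs):
    """Relative-frequency P(english | foreign) from extracted phrase pairs.

    extracted_pairs: list of (foreign, english) tuples, one per extraction
    (duplicates are meaningful). Returns {(foreign, english): probability}.
    """
    pair_count = Counter(extracted_pairs)
    src_count = Counter(f for f, _ in extracted_pairs)
    return {(f, e): c / src_count[f] for (f, e), c in pair_count.items()}

# the six extractions for the foreign phrase " from the notes above
pairs = [('"', "'")] * 3 + [('"', "cyprus '"),
                            ('"', "from northern cyprus '"),
                            ('"', "northern cyprus '")]
probs = phrase_probs(pairs)   # probs[('"', "'")] is 3/6 = 0.5
```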

** Lexical score (e.g. 1, 0.0011338, 2.91501e-009, 1.2855e-006)
" kehottaa ||| ' calls on the
" kehottaa ||| ' calls on
lex0('|")=1  => 1
lex0(calls|kehottaa)=0.5  => 0.5
lex0(on|kehottaa)=0.5  => 0.25
 => 0.0325964
lex0('|")=1  => 1
lex0(calls|kehottaa)=0.5  => 0.5
lex0(on|kehottaa)=0.5  => 0.25

For an English phrase e_1, .., e_n, lex_score = g(e_1) . g(e_2) ... g(e_n)
where g(e_i) is computed as follows:
+ if e_i is aligned to nothing, g(e_i) = P(e_i | NULL)
+ if e_i is aligned to f_1, .., f_m, g(e_i) = (P(e_i | f_1) + ... + P(e_i | f_m)) / m
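The g(e_i) computation above can be sketched in Python (a minimal version; the real scorer is training/phrase-extract/score, and toy_p below is an invented toy probability table, not real lexicon values):

```python
def lex_score(e_words, f_words, alignment, p):
    """Lexical weight of an English phrase given a foreign phrase.

    alignment: {e_index: [f_indices]}; p(e, f) is the lexical translation
    probability, with f=None standing in for NULL.
    """
    score = 1.0
    for i, e in enumerate(e_words):
        links = alignment.get(i, [])
        if not links:                        # unaligned: use NULL translation
            g = p(e, None)
        else:                                # average over aligned foreign words
            g = sum(p(e, f_words[j]) for j in links) / len(links)
        score *= g
    return score

# toy probabilities for the " kehottaa ||| ' calls on fragment above
toy_p = lambda e, f: {("'", '"'): 1.0,
                      ("calls", "kehottaa"): 0.5,
                      ("on", "kehottaa"): 0.5}.get((e, f), 0.0)
# 1.0 * 0.5 * 0.5 = 0.25, matching the running product in the notes
```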
Steps: (--first-step to --last-step)
!(1) prepare corpus: generate vocabulary files, class files
* results in "corpus" folder
en-fi-int-train.snt  en.vcb  en.vcb.classes  en.vcb.classes.cats  fi-en-int-train.snt  fi.vcb  fi.vcb.classes  fi.vcb.classes.cats
* perform: reduce factor, make classes, get vocabulary (*.vcb files)

!(2) run GIZA: training
!(3) align words

!(4) learn lexical translation
''Require'' 3 files: aligned.0.en, aligned.0.fi, aligned.grow-diag-final-and
http://www.statmt.org/moses/?n=FactoredTraining.GetLexicalTranslationTable

f2n: each line is ''english foreign prob'' with prob = P(english | foreign)
n2f: each line is ''foreign english prob'' with prob = P(foreign | english)
 foreach my $f (keys %{$WORD_TRANSLATION}) {
	foreach my $e (keys %{$$WORD_TRANSLATION{$f}}) {
	    printf F2E "%s %s %.7f\n",$e,$f,$$WORD_TRANSLATION{$f}{$e}/$$TOTAL_FOREIGN{$f};
	    printf E2F "%s %s %.7f\n",$f,$e,$$WORD_TRANSLATION{$f}{$e}/$$TOTAL_ENGLISH{$e};
	}
    }

!(5) extract phrases
* phrase-extract solution
** extract.cpp: SentenceAlignment object with alignedToE[e], a vector containing all word indices of F aligned to e, while alignedCountF[f] is the number of times word f is aligned to some e
write results to model/extract.* files
E.g.:
näin tapahtuisi ||| this to be possible , ||| 1-0 0-4
jotta näin tapahtuisi ||| this to be possible , ||| 2-0 1-4
näin tapahtuisi ||| this to be possible , what ||| 1-0 0-4
jotta näin tapahtuisi ||| this to be possible , what ||| 2-0 1-4
näin tapahtuisi ||| this to be possible , what we ||| 1-0 0-4
jotta näin tapahtuisi ||| this to be possible , what we ||| 2-0 1-4
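The extraction criterion used by extract.cpp (a phrase pair is kept only if no word inside it is aligned to a word outside it) can be sketched as below; this toy version omits the expansion over unaligned boundary words that the real extractor performs:

```python
def extract_phrases(n_f, n_e, alignment, max_len=7):
    """Enumerate phrase-pair spans consistent with a word alignment.

    alignment: set of (f_index, e_index) links.
    Yields ((f_start, f_end), (e_start, e_end)) inclusive spans.
    """
    for f1 in range(n_f):
        for f2 in range(f1, min(f1 + max_len, n_f)):
            # target positions covered by links from [f1, f2]
            es = [e for f, e in alignment if f1 <= f <= f2]
            if not es:
                continue
            e1, e2 = min(es), max(es)
            if e2 - e1 >= max_len:
                continue
            # consistency: no link from inside [e1, e2] back outside [f1, f2]
            if all(f1 <= f <= f2 for f, e in alignment if e1 <= e <= e2):
                yield (f1, f2), (e1, e2)
```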

!(6) score phrases
* Read from ''extract.0-0.sorted''
 ! ||| strategy for other countries ! ||| 0-4
(0: position of ! in the foreign phrase; 4: its position in the English one)

 ! taustana ||| , the background ||| 0-0 1-2

* Read from ''extract.0-0.inv.sorted''
 strategy for other countries ! ||| ! ||| 4-0

 , the background ||| ! taustana ||| 0-0 2-1

* Use ''training/phrase-extract/score'' to score phrase (see [[phrase-extract scoring]])
../../bin/moses-scripts/scripts-20080213-1421/training/phrase-extract/score result_9/model/extract.0-0.sorted.part0000 result_9/model/lex.0-0.f2n result_9/model/phrase-table.0-0.half.f2n.part0000
../../bin/moses-scripts/scripts-20080213-1421/training/phrase-extract/score result_9/model/extract.0-0.inv.sorted.part0000 result_9/model/lex.0-0.n2f result_9/model/phrase-table.0-0.half.n2f.part0000  inverse

* Read from file ''phrase-table.0-0.half.n2f.sorted'' (see [[phrase-extract scoring]] for score meaning)
my ($english, $foreign , $alignEnglish,  $alignForeign,  $p) = split(/ \|\|\| /,$n2f);
 ! ||| strategy for other countries ! ||| (4) ||| () () () () (0) ||| 1 0.764706
 ! taustana ||| , the background ||| (0) (2) ||| (0) () (1) ||| 1 7.98622e-06

* Read from file ''phrase-table.0-0.half.f2n''  
my ($english2,$foreign2, $alignEnglish2, $alignForeign2, $p2) = split(/ \|\|\| /,$f2n);
 ! ||| strategy for other countries ! ||| (4) ||| () () () () (0) ||| 0.00877193 7.06945e-11

 ! taustana ||| , the background ||| (0) (2) ||| (0) () (1) ||| 0.5 0.00956885
* Print to file phrase-table.0-0.gz
print TABLE "$english ||| $foreign ||| $alignEnglish ||| $alignForeign ||| $p $p2 2.718\n"; 
 ! ||| strategy for other countries ! ||| (4) ||| () () () () (0) ||| 1 0.764706 0.00877193 7.06945e-11 2.718

 ! taustana ||| , the background ||| (0) (2) ||| (0) () (1) ||| 1 7.98622e-06 0.5 0.00956885 2.718


!(7) learn reordering model
!(8) learn generation model
!(9) create decoder config file
-lm 0:5:../../acl07/lm/europarl.lm:0
($f, $order, $filename, $type) = split /:/, $lm, 4;  # factor, order, filename, type (srilm/irstlm)

* generate
* Apophony http://en.wikipedia.org/wiki/Apophony
Apophony is exemplified in English as the internal vowel alternations that produce such related words as

    * sing, sang, sung, song
    * rise, raise
    * bind, bound
    * goose, geese

The difference in these vowels marks variously a difference in tense or aspect (e.g. sing/sang/sung), transitivity (rise/raise), part of speech (sing/song, bind/bound), or grammatical number (goose/geese).

Similarly, there are consonant alternations which are also used grammatically:

    * belief, believe
    * house (noun), house (verb)   (phonetically: [haʊs] (noun), [haʊz] (verb))

That these sound alternations function grammatically can be seen as they are often equivalent to grammatical suffixes (an external modification). Compare the following:

|Present Tense|Past Tense|
|jump|jumped|
|sing|sang|

|Singular|Plural|
|book|books|


* vowel harmonization http://en.wikipedia.org/wiki/Vowel_harmony
** In languages with vowel harmony, there are constraints on what vowels may be found near each other.
** Notion of group of vowels (e.g. front, neutral, back) -> front/back system (might have neutral vowels)

The vowel that causes the vowel assimilation is frequently termed the trigger while the vowels that assimilate (or harmonize) are termed targets. In most languages, the vowel triggers lie within the root of a word while the affixes added to the roots contain the targets. This may be seen in the Hungarian dative suffix:

|Root|Dative|Gloss|
|város|város-nak|"city"|
|öröm|öröm-nek|"joy"|

The dative suffix has two different forms -nak/-nek. The -nak form appears after the root with back vowels (a and o are both back vowels). The -nek form appears after the root with front vowels (ö and e are front vowels).

Another example: Turkish araba (car) pluralises to arabalar but tren (train) pluralises to trenler.
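The -nak/-nek choice can be illustrated with a toy Python rule (a simplification: it treats i/í/é as front, while in real Hungarian they are neutral, and it ignores rounding harmony entirely):

```python
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def dative(root):
    """Pick the Hungarian dative suffix from the root's last harmonic vowel (toy rule)."""
    for ch in reversed(root.lower()):
        if ch in BACK_VOWELS:
            return root + "-nak"
        if ch in FRONT_VOWELS:
            return root + "-nek"
    return root + "-nek"  # arbitrary default for vowel-less roots
```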