CS224n Final Project
====================

Alex Khomenko (homa@stanford.edu)
Nickolay Stanev (nbstanev@stanford.edu)

1. Problem statement.
=====================

We decided to investigate various methods of extracting collocations from a corpus and of finding relations between them. Collocations are important for applications such as natural language generation (making sure that the system's output sounds natural) and computational lexicography (identifying important collocations to be listed in a dictionary entry). Finding relations between collocations (finding collocations that are related to a given collocation, or clustering a group of collocations) could be used for automatic generation of cross-references in a collocation-based dictionary or for automatic acquisition of domain-specific vocabularies.

Our project accomplishes both tasks: we evaluate different methods of finding collocations, and we allow searching for collocations that occur in contexts similar to the context of a user-specified collocation. Our results suggest that the methods we are using are feasible and that corpus-based automatic discovery of collocation clusters is possible.

2. Implementation.
==================

Data processing:
----------------

We are using the data in the "ICAME-Browntag" directory. An instance of the TaggedBrownParser class can be initialized either without any arguments, in which case the data files are read, the data is processed, and the resulting data structures are saved to files in a format more convenient for our purposes, or with the name of a directory that already contains the written-out preprocessed data structures. The purpose of doing this is mostly to save time: parsing the whole corpus takes more than half an hour, while loading the preprocessed corpus is much faster. We have already generated preprocessed data, and the submitted program can read it from our directory. We have two data sets - one that contains the whole Brown corpus, and one that contains only the first five documents from each Brown data file. See below for instructions on how these are used.

The TaggedBrownParser module reads in the data and breaks it down into contexts (each context is the subset of the data file that comes from a particular original document). Another program can then obtain the contexts one by one via a simple call (getNextContext). The context itself is just an array of word structures, each containing the word and its tag. We are only interested in collocations that contain words of the following types:

NN  /* Noun. */
NP  /* Proper noun. */
VB  /* Verb. */
JJ  /* Adjective. */
RB  /* Adverb. */
IN  /* Preposition. */

Everything else is marked as OTHER, with the exception of the dot ("."), which is *always* an end-of-sentence marker in this particular corpus. We tag the dots with EOS and use them in the collocation finding process, since we are only interested in words co-occurring within a sentence's boundaries.

Note that we do not keep plural nouns as a separate category. We convert all nouns from plural to singular using the following simple conversions:

- if an NNS ends with "IES", replace the ending with "Y";
- if an NNS ends with "CHES", "OES", "SHES", "SSES", or "XES", strip the "ES";
- if an NNS ends with "S" and is not covered by one of the above, strip the "S".

Thus, we do not take care of irregular plurals (but we do not damage them either), and the only harm done is in the rare case when a word ends in "CHE", for example, in its singular form; then the "E" is incorrectly stripped.
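As an illustration, here is a minimal Python sketch of this stripping heuristic. The function name and the exact coding are ours, for illustration only; the project's code may differ in details.

    def singularize(noun):
        # Heuristic NNS -> NN conversion described above.
        w = noun.upper()
        if w.endswith("IES"):
            return w[:-3] + "Y"      # LADIES -> LADY
        for suffix in ("CHES", "OES", "SHES", "SSES", "XES"):
            if w.endswith(suffix):
                return w[:-2]        # CHURCHES -> CHURCH, BOXES -> BOX
        if w.endswith("S"):
            return w[:-1]            # WEAPONS -> WEAPON
        return w                     # anything else passes through unchanged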
This is quite an ad hoc approach that seemed to work well. However, it would have been useful to have data where the base forms of the nouns, and especially of the verbs, are known (an example of such a corpus is SUSANNE, which is a parsed subset of the Brown corpus). This would have enabled us to find more VB_NN collocations. Nouns with apostrophes are tagged as NN without change. The auxiliary verbs "BE", "DO" and "HAVE" are tagged as OTHER in all their forms.

Finding collocations:
---------------------

We find collocations using four different methods: simple frequency counting, the mean and variance method, the t test, and the chi-square test (all taken from the textbook). The mode is specified as an argument to the Collocations class constructor, along with a corpus parser object that is used for obtaining contexts.

a) Simple frequency counting.

We use this method to search for continuous collocations of length 2 or 3. While our program could be trivially extended to search for longer collocations as well, we decided that longer collocations are not that numerous and perhaps not that interesting. Thus, we process the text from the corpus in bigrams and trigrams as we go through it context by context. We first test each candidate collocation using a simple tag filter. Regardless of the method used, we accept only collocations that match one of the following patterns:

NN_NN
JJ_NN
VB_NN
NN_NN_NN
JJ_NN_NN
NN_JJ_NN
JJ_JJ_NN
NN_IN_NN

We initially had NP_NP and NP_NP_NP, but most collocations found were not really interesting (Mr. Smith, etc.). We also tried the pattern VB_RB, as in "move quickly", but we were getting only uninteresting collocations of the type "go now" or phrasal verbs like "set up" and "come by". While the latter could possibly be used for generating entries in a dictionary, they did not make good collocations for the purposes of our project.

If the n-gram matches one of the patterns, we record it in a table (or add to its count if it is already there). For each collocation we keep track of its count and the location within the corpus of each occurrence (we do this in all modes); the latter is used later when we measure similarity. Finally, we sort the collocations by frequency count and take the top 1000 of each length to use for the second part of the program. We also always record all the words (including the ones tagged as OTHER) with their frequency counts for further use (in some of the other collocation-finding modes).

b) Mean and variance method.

In this mode we look for discontinuous (or possibly continuous) collocations of length 2. As we go through the corpus, we consider a window of size 9 around every word, where that word combined with any of the surrounding 8 words is a potential collocation (we consider only neighboring words within the same sentence). Again, the candidate has to pass through the tag filter to be considered. The ordering of the words is taken to be the one in the first instance that we come upon. For each collocation, we build a histogram of the number of times the second word occurred at a certain distance from the first word.

After we have gone through the corpus, we iterate over the found collocations and do a "flat peak" filtering: we look for the largest distance count in the collocation's histogram, and if the sum of the surrounding 2 counts is larger than half of the maximum peak size, we remove the collocation from our list. Then we compute the mean and variance for the remaining collocations. They are sorted by increasing variance, and collocations with the same variance are sorted by decreasing frequency count. The top 1000 are kept for similarity measuring.
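To make this mode concrete, here is a minimal Python sketch of the histogram collection, the "flat peak" filter, and the mean/variance computation. For simplicity the sketch only pairs a word with the four words to its right (the project also pairs it with the words to its left, keeping the order of the first co-occurrence), and all names and the assumed data layout are ours rather than the project's.

    from collections import defaultdict

    # Bigram tag patterns accepted by the tag filter (see the list above).
    ALLOWED_BIGRAM_TAGS = {"NN_NN", "JJ_NN", "VB_NN"}

    def sentences(context):
        # Split a context (a list of (word, tag) pairs) into sentences at EOS markers.
        sentence = []
        for word, tag in context:
            if tag == "EOS":
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append((word, tag))
        if sentence:
            yield sentence

    def build_histograms(contexts, max_dist=4):
        # For every candidate pair, count how often the second word appears at
        # each distance (1..max_dist) after the first word within a sentence.
        hist = defaultdict(lambda: defaultdict(int))
        for context in contexts:
            for sentence in sentences(context):
                for i, (w1, t1) in enumerate(sentence):
                    for d in range(1, max_dist + 1):
                        if i + d >= len(sentence):
                            break
                        w2, t2 = sentence[i + d]
                        if t1 + "_" + t2 in ALLOWED_BIGRAM_TAGS:
                            hist[(w1, w2)][d] += 1
        return hist

    def passes_flat_peak_filter(histogram, cutoff=0.5):
        # Keep the pair only if the two neighbors of the highest peak together
        # hold no more than `cutoff` of the peak's count.
        peak_d = max(histogram, key=histogram.get)
        neighbors = histogram.get(peak_d - 1, 0) + histogram.get(peak_d + 1, 0)
        return neighbors <= cutoff * histogram[peak_d]

    def mean_and_variance(histogram):
        n = sum(histogram.values())
        mean = sum(d * c for d, c in histogram.items()) / n
        var = sum(c * (d - mean) ** 2 for d, c in histogram.items()) / n
        return mean, var

Collocations that pass the filter would then be sorted by increasing variance, as described above.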
c) t test.

In this mode we look for continuous collocations of length 2 or 3. We collect the collocations in exactly the same manner as in the simple frequency count mode. After that we compute the t value for each collocation. The collocations are sorted by decreasing t value and, again, the top 1000 of each length are kept.

d) Chi-square test.

Here we look for continuous collocations of length 2. We proceed in the same fashion as in the simple frequency count and t test modes, except that at the end we compute the chi-square values of the collocations and order them by decreasing chi-square value.

Looking for similar collocations:
---------------------------------

In this part of the program we take a list of the top N collocations found by one of the above methods and then try to measure the correlation between one of the top 200 collocations (chosen interactively by the user) and the rest of them. For each collocation we take 200 words of context around each occurrence and obtain counts of the words occurring in those context windows. We implement two variations of this method: in one we include all the words in the context window, in the other we only consider the words tagged as NN, VB, RB, or JJ. The word counts for a given collocation are normalized by the total number of words occurring in the context windows of that collocation. This gives us a distribution over the different words that occur in all context windows of a given collocation.

We then measure similarities between these distributions using the following 4 metrics:

- L1 (Manhattan distance)
- L2 (Euclidean distance)
- COS (Cosine)
- IRad (Information radius)

All of them are as described in the textbook (a minimal sketch of these metrics is given at the end of this section). We report the top 40 "closest" collocations for each metric.

Running the program:
--------------------

To run the program, start the "clc" script. It takes one argument - "small" or "large" - specifying whether to use the whole corpus or the smaller data set (5 documents from each data file). The program first reads in the preprocessed data from the files in our directory. Then the user has two options:

a) Find collocations using one of the methods described above. The user has to specify a file name in which the top N collocations are saved. The program appends "__" to the specified name.

b) Compare collocations. The user has to enter the name of the collocations file to be loaded. The top 200 collocations are printed out, and the user is asked for the target collocation and whether all words should be considered or only selected ones (omitting prepositions and words tagged as OTHER). The program then computes the similarity of the target collocation to all the other collocations using the metrics listed above and reports the top 50 most similar ones according to each of the 4 metrics. Typing -1 for the collocation number exits this mode.

The above two actions are performed in a loop, so multiple operations can be performed during one run of the program.

NOTE: The program requires about 800M of memory to run on the whole Brown corpus. We'd advise running the program on the small data set to save time. The results, of course, will not be as good as for the whole corpus.
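To round out the implementation description, here is a minimal Python sketch of the context distributions and the four similarity metrics described under "Looking for similar collocations" above. The names, the assumed data layout, and the reading of the 200-word context as 100 words on either side of an occurrence are our own illustration rather than the project's actual code; IRad is computed here with base-2 logarithms.

    import math
    from collections import Counter

    def context_distribution(occurrences, window=100):
        # `occurrences` is assumed to be a list of (context, position) pairs,
        # where a context is a list of (word, tag) pairs.  Words within
        # `window` positions of each occurrence are counted and the counts
        # are normalized into a distribution.
        counts = Counter()
        for context, pos in occurrences:
            lo, hi = max(0, pos - window), min(len(context), pos + window)
            counts.update(word for word, tag in context[lo:hi])
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def l1_distance(p, q):
        return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

    def l2_distance(p, q):
        return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in set(p) | set(q)))

    def cosine(p, q):
        dot = sum(p[w] * q[w] for w in p if w in q)
        return dot / (math.sqrt(sum(v * v for v in p.values())) *
                      math.sqrt(sum(v * v for v in q.values())))

    def information_radius(p, q):
        # IRad(p, q) = D(p || m) + D(q || m), where m is the average of p and q.
        m = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in set(p) | set(q)}
        def kl(a):
            return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
        return kl(p) + kl(q)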
3. Testing.
===========

Collocation finding.
--------------------

a) Simple frequency counting.

Even this simplest of all methods seemed to give good results. Below are the five most frequent collocations of length 2 and 3 found in the "small" corpus. Along with good collocations such as "SMALL BUSINESS ADMINISTRATION" and "UNITED STATES" we were also getting idiomatic expressions like "MATTER OF FACT" or "POINT OF VIEW", and word pairs that are commonly encountered together but do not necessarily stand in a strong relation to each other; examples of the latter are "OLD MAN" and "LONG TIME". Since this method does not take into consideration the frequencies of the collocation's composing words, such collocations tend to get high scores.

Length 3 (collocation, count):

SMALL BUSINESS CONCERN, 13
SMALL BUSINESS ADMINISTRATION, 9
RATE OF SHEAR, 8
SECRETARY OF STATE, 7
MATTER OF FACT, 5

Length 2 (collocation, count):

UNITED STATES, 47
SMALL BUSINESS, 37
OLD MAN, 21
ANODE HOLDER, 19
PERSONAL PROPERTY, 17

b) Mean and variance method.

This seemed to be the best of all the methods. The continuous collocations are among the best scoring, but good discontinuous collocations like "DESTROY ENEMY'S" and "SCIENCE TECHNOLOGY" were found too. This method copes better, although not completely, with the "common words" problem mentioned above. While "OLD MAN" is still considered a very good collocation, the score of something like "LONG TIME" dropped. The reason is that other instances of the co-occurrence of these common words ("...long after the time...", "...this time he took a long break...") are also considered, and they lower the likelihood that the two words are strongly related.

Length 2 (collocation, word 1, word 2, mean, variance, count):

UNITED STATES, UNITED, STATES, 1.0, 0.0, 47
OLD MAN, OLD, MAN, 1.0, 0.0, 21
RADIO EMISSION, RADIO, EMISSION, 1.0, 0.0, 16
NUCLEAR WEAPON, NUCLEAR, WEAPON, 1.0, 0.0, 13
WAVE LENGTH, WAVE, LENGTH, 1.0, 0.0, 13

A note on the parameters used for "flat peak" filtering. We initially decided to compare the size of the largest peak to the sum of its 4 closest neighbors, the cut-off condition being that this sum is more than 1/4 of the largest peak's size. But we noticed that good collocations were being removed. Here are some examples; the numbers shown are the distance counts in the collocation's histogram.

HELD MEETING      0 0 0 0 0 1 4 1 1
ATHLETIC PROGRAM  0 0 0 0 0 3 0 1 0
FIRED GUN         0 0 0 0 0 1 3 0 0
WHITE TEETH       0 0 0 0 0 4 1 2 0
FURNISHED ROOM    0 0 0 0 0 3 1 0 0

The method proved useful in eliminating bad collocations too, though:

SAID DAY          0 0 0 0 0 0 2 8 3
SUCH DEVELOPMENT  0 0 0 0 0 0 3 3 1
GAVE YEAR         0 0 0 0 0 0 1 3 1
SAID CAR          0 0 0 0 0 0 1 4 3

After observing examples like the ones presented above, we concluded that this cut-off is too low for the amount of data that we have (if we had encountered "ATHLETIC PROGRAM", for example, 20 times, it would probably have had a histogram that passes the test). Judging from the results above, we decided to compare the highest peak only to the sum of its two immediate neighbors and to set the cut-off to 0.5. Some good collocations like the ones below are still being cut off, but we could not think of a good way to deal with that.

Removing DRIVE CAR     0 0 0 0 0 1 4 3 1
Removing TOUCHED HAND  0 0 0 0 0 1 2 1 0
Removing DRAW LINE     0 0 0 0 0 0 3 2 0

c) t test.

This method was also slightly better than the simple frequency counting method in that it deals better with idiomatic expressions and common-word collocations. Since it uses the composing words' frequency counts, high-scoring collocations consist of words that are very likely to occur together, not just very likely in general.
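For reference, the t and chi-square scores reported in this and the following subsection can be computed from the bigram and unigram counts roughly as follows. This is a minimal sketch of the textbook formulas with variable names of our own choosing, not the project's actual code.

    import math

    def t_score(c12, c1, c2, n):
        # Bigram t value from the observed bigram count c12, the unigram
        # counts c1 and c2, and the corpus size n (with the sample variance
        # approximated by the sample mean, as in the textbook).
        expected = c1 * c2 / n
        return (c12 - expected) / math.sqrt(c12)

    def chi_square(c12, c1, c2, n):
        # Chi-square value from the 2x2 contingency table of the bigram.
        o11 = c12                 # w1 followed by w2
        o12 = c1 - c12            # w1 followed by some other word
        o21 = c2 - c12            # some other word followed by w2
        o22 = n - c1 - c2 + c12   # neither w1 nor w2
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den

Collocations are then ranked by decreasing score, as described in section 2.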
In contrast to the top 5 collocations found by the frequency counting method, "MATTER OF FACT" moved down the list, while the more interesting "HOME RULE CHARTER" climbed to the top.

Length 3 (collocation, t value, count):

SMALL BUSINESS CONCERN, 3.6056123197382504, 13
SMALL BUSINESS ADMINISTRATION, 2.9998899172660995, 9
RATE OF SHEAR, 2.8265620474375597, 8
SECRETARY OF STATE, 2.632356060656279, 7
HOME RULE CHARTER, 2.236093660779998, 5

Length 2 (collocation, t value, count):

UNITED STATES, 6.8303675830354, 47
SMALL BUSINESS, 5.987962457983314, 37
ANODE HOLDER, 4.339252842456542, 19
OLD MAN, 4.283453291332583, 21
PERSONAL PROPERTY, 4.082507525167258, 17

d) Chi-square test.

This method gave interesting results. It is similar to the t test but puts even more weight on the co-occurrence of the words composing the collocation. Thus, most of the top collocations had low counts and consisted of uncommon words that appeared very rarely, if at all, in the rest of the corpus. An example of such a collocation is "POST-ATTACK RECONNAISSANCE" below. We decided to remove the collocations with a count of 3 or below; there were a lot of them, but they were really uncommon, and we wanted to deal with more natural collocations. We felt that this method, more than the other three, needs more data. Given that, it would be able to pick out the good collocations among the uncommon word pairs and would have the potential to perform better than the mean and variance method.

Length 2 (collocation, chi-square value, count):

UNITED STATES, 6823.207158142358, 47
POST-ATTACK RECONNAISSANCE, 5789.036633730465, 4
PLASMA GENERATOR, 5789.036633730465, 4
MISPLACED MODIFIER, 5402.567754959485, 4
SURFACE-ACTIVE AGENT, 5206.106140377071, 6

Measuring similarity between collocations.
------------------------------------------

We evaluated four different metrics for measuring semantic similarity between our collocations' contexts: L1 distance, L2 distance, Cosine (COS), and Information Radius (IRad). We found that IRad was consistently better than the other three metrics, although they all worked well at identifying obvious matches. Our judgement was, of course, subjective, but we provide some characteristic results below.

Here are the 15 best collocations that occur in contexts similar to that of "NUCLEAR WEAPON", according to each of the four metrics:

a) L1.

DESTROY ENEMY'S (4): 0.83
MILITARY FORCE (8): 0.84
BALLISTIC MISSILE (16): 0.85
STRATEGIC FORCE (6): 0.89
BALLISTIC LONG-RANGE (4): 0.91
SOVIET UNION (19): 0.92
FOREIGN POLICY (31): 0.92
NATIONAL SECURITY (10): 0.92
INITIAL ATTACK (4): 0.93
HUMAN BEING (37): 0.93
DIFFERENT WAY (8): 0.94
GREAT DEAL (43): 0.96
FAR EAST (6): 0.98
SUCH THING (20): 0.98
ATOMIC ENERGY (10): 0.99

b) L2.

MILITARY FORCE (8): 0.0
HUMAN BEING (37): 0.0
FAR EAST (6): 0.0
SOVIET UNION (19): 0.0
NATIONAL SECURITY (10): 0.0
BALLISTIC MISSILE (16): 0.0
GREAT DEAL (43): 0.0
UNITED PRESIDENT (9): 0.0
PRESIDENT STATES (9): 0.0
REAL ESTATE (30): 0.0
SUCH THING (20): 0.0
DIFFERENT WAY (8): 0.0
DESTROY ENEMY'S (4): 0.0
GOOD WILL (12): 0.0
FOREIGN POLICY (31): 0.0
c) COS.

NUCLEAR WEAPON (23): 1.0
MILITARY FORCE (8): 0.93
SOVIET UNION (19): 0.92
BALLISTIC MISSILE (16): 0.92
HUMAN BEING (37): 0.92
NATIONAL SECURITY (10): 0.92
FAR EAST (6): 0.92
GOOD WILL (12): 0.91
COMMON EXPERIENCE (7): 0.91
FREE SOCIETY (7): 0.91
GREAT DEAL (43): 0.91
FOREIGN POLICY (31): 0.91
PRESIDENT STATES (9): 0.91
UNITED PRESIDENT (9): 0.91
REAL ESTATE (30): 0.91

d) IRad.

DESTROY ENEMY'S (4): 0.39
BALLISTIC MISSILE (16): 0.42
MILITARY FORCE (8): 0.43
STRATEGIC FORCE (6): 0.43
BALLISTIC LONG-RANGE (4): 0.47
NATIONAL SECURITY (10): 0.48
SOVIET UNION (19): 0.49
INITIAL ATTACK (4): 0.49
FOREIGN POLICY (31): 0.49
HUMAN BEING (37): 0.5
DIFFERENT WAY (8): 0.51
GREAT DEAL (43): 0.52
ATOMIC ENERGY (10): 0.53
GO CODE (5): 0.53
MANNED BOMBER (5): 0.53

While L1, L2, and COS place collocations such as "FAR EAST", "SUCH THING", "GOOD WILL", "REAL ESTATE", "COMMON EXPERIENCE", and "FREE SOCIETY" close to the top of the list, IRad does not.

4. Results.
===========

There are 6 sets of results here. We take three of our collocation-finding methods (mean and variance, chi-square test, and t test), choose two collocations that each method found, and report for each one the top 15 similarity matches found using the IRad metric. We report results for context models that include all of the words (ALL WORDS) and also for models that only use words tagged as NN, VB, JJ, and RB (SELECT WORDS). The two models are comparable, but sometimes using selected words helps (for example, the model that uses all words inexplicably deems "INDIAN PYTHON" to be close to "MICROMETEORITE FLUX", while the select-words model does not). The matches are reported in the following format: COLLOCATION (count): distance.

MEAN AND VARIANCE COLLOCATIONS

ALL WORDS

NUCLEAR WEAPON (23): 0.0
DESTROY ENEMY'S (4): 0.39
BALLISTIC MISSILE (16): 0.42
MILITARY FORCE (8): 0.43
STRATEGIC FORCE (6): 0.43
BALLISTIC LONG-RANGE (4): 0.47
NATIONAL SECURITY (10): 0.48
SOVIET UNION (19): 0.49
INITIAL ATTACK (4): 0.49
FOREIGN POLICY (31): 0.49
HUMAN BEING (37): 0.50
DIFFERENT WAY (8): 0.51
GREAT DEAL (43): 0.52
ATOMIC ENERGY (10): 0.53
GO CODE (5): 0.53

SELECT WORDS

NUCLEAR WEAPON (23): 0.0
DESTROY ENEMY'S (4): 0.7
STRATEGIC FORCE (6): 0.77
MILITARY FORCE (8): 0.78
BALLISTIC MISSILE (16): 0.79
BALLISTIC LONG-RANGE (4): 0.84
NATIONAL SECURITY (10): 0.9
INITIAL ATTACK (4): 0.91
SOVIET UNION (19): 0.94
FOREIGN POLICY (31): 0.95
SOVIET LEADER (9): 0.96
POST-ATTACK RECONNAISSANCE (4): 0.98
NEW EDITOR (6): 0.99
FAR EAST (6): 0.99
ACCIDENTAL WAR (5): 0.99

ALL WORDS

WAVE LENGTH (14): 0.0
RADIO EMISSION (16): 0.13
EMISSION MOON (6): 0.25
BRIGHTNESS TEMPERATURE (7): 0.36
LOW INTENSITY (5): 0.38
ANTENNA BEAM (6): 0.42
USING TECHNIQUE (4): 0.51
ORDER MAGNITUDE (6): 0.59
STEADY STATE (6): 0.6
SHOWN FIGURE (18): 0.61
TAKES PLACE (13): 0.62
MOLECULAR WEIGHT (10): 0.62
ECONOMIC INTEGRATION (15): 0.62
BALLISTIC MISSILE (16): 0.63
FILM THICKNESS (7): 0.63
STRAIGHT LINE (12): 0.63

SELECT WORDS

WAVE LENGTH (14): 0.0
RADIO EMISSION (16): 0.24
EMISSION MOON (6): 0.46
BRIGHTNESS TEMPERATURE (7): 0.63
LOW INTENSITY (5): 0.67
ANTENNA BEAM (6): 0.78
USING TECHNIQUE (4): 0.9
SHOWN FIGURE (18): 1.09
ORDER MAGNITUDE (6): 1.09
STEADY STATE (6): 1.1
ECONOMIC INTEGRATION (15): 1.11
FLOW RATE (6): 1.12
CARBON TETRACHLORIDE (18): 1.12
INCIDENT LIGHT (7): 1.12
ARC VOLTAGE (5): 1.13
LIGHT INTENSITY (9): 1.13

CHI-SQUARED TEST COLLOCATIONS

ALL WORDS

SKILLED MANPOWER (4): 0.0
DEVELOPING NATION (5): 0.36
PEACE CORPS (55): 0.42
VOCATIONAL TRAINING (9): 0.46
MUTUAL SECURITY (4): 0.48
VOCATIONAL EDUCATION (19): 0.53
EMPLOYMENT OPPORTUNITY (5): 0.55
FEDERAL GOVERNMENT (22): 0.55
INDUSTRIAL DEVELOPMENT (10): 0.56
TRAINING PROGRAM (12): 0.56
ECONOMIC DEVELOPMENT (17): 0.56
FEDERAL FUND (9): 0.56
HOST COUNTRY (6): 0.56
FOREIGN COUNTRY (11): 0.57
DEPRESSED AREA (5): 0.57
INDUSTRIAL PLANT (7): 0.57

SELECT WORDS

SKILLED MANPOWER (4): 0.0
DEVELOPING NATION (5): 0.6
PEACE CORPS (55): 0.7
VOCATIONAL TRAINING (9): 0.75
MUTUAL SECURITY (4): 0.78
VOCATIONAL EDUCATION (19): 0.83
TRAINING PROGRAM (12): 0.88
ECONOMIC DEVELOPMENT (17): 0.88
FOREIGN COUNTRY (11): 0.9
FEDERAL FUND (9): 0.91
INDUSTRIAL DEVELOPMENT (10): 0.91
FEDERAL GOVERNMENT (22): 0.93
DEPRESSED AREA (5): 0.93
HOST COUNTRY (6): 0.93
EMPLOYMENT OPPORTUNITY (5): 0.94
SCHOOL DISTRICT (21): 0.95

ALL WORDS

MICROMETEORITE FLUX (4): 0.0
PARTICLE SIZE (8): 0.46
KINETIC ENERGY (5): 0.54
WAVE LENGTH (14): 0.6
RADIO EMISSION (16): 0.6
PLANETARY RADIATION (4): 0.61
LOW INTENSITY (5): 0.62
POTENTIAL ENERGY (5): 0.63
INDIAN PYTHON (5): 0.63
SOLAR RADIATION (4): 0.63
SOLAR SYSTEM (6): 0.63
OPTIMAL POLICY (15): 0.63
AMETHYSTINE PYTHON (4): 0.63
OPERATING VARIABLE (8): 0.63
THERMAL RADIATION (4): 0.64
MOLECULAR WEIGHT (10): 0.64

SELECT WORDS

MICROMETEORITE FLUX (4): 0.0
PARTICLE SIZE (8): 0.88
KINETIC ENERGY (5): 0.99
PLANETARY RADIATION (4): 1.08
WAVE LENGTH (14): 1.09
SOLAR RADIATION (4): 1.09
RADIO EMISSION (16): 1.1
POTENTIAL ENERGY (5): 1.11
LOW INTENSITY (5): 1.13
SOLAR SYSTEM (6): 1.13
LIGHT INTENSITY (9): 1.15
INFRARED EMISSION (4): 1.16
INCIDENT LIGHT (7): 1.16
SAMPLING CENSUS (4): 1.16
NORMAL PRESSURE (8): 1.17
SURFACE TEMPERATURE (8): 1.17

T-TEST COLLOCATIONS

ALL WORDS

DEPARTMENT OF JUSTICE (13): 0.0
HEARING OFFICER (9): 0.2
FEDERAL BUREAU (8): 0.21
BUREAU OF INVESTIGATION (8): 0.21
HEARING OFFICER'S (10): 0.23
OFFICER'S REPORT (8): 0.25
HEARING OFFICER'S REPORT (8): 0.25
DUE PROCESS (8): 0.28
LOCAL BOARD (14): 0.3
APPEAL BOARD (16): 0.31
HEARING OFFICER'S NOTE (2): 0.35
UNIVERSAL MILITARY TRAINING (2): 0.36
SUCH CLAIM (11): 0.46
UNITED STATES (392): 0.47
SELECTIVE SERVICE (5): 0.48
SUPREME COURT (28): 0.5

SELECT WORDS

DEPARTMENT OF JUSTICE (13): 0.0
HEARING OFFICER (9): 0.36
BUREAU OF INVESTIGATION (8): 0.38
FEDERAL BUREAU (8): 0.38
HEARING OFFICER'S (10): 0.4
OFFICER'S REPORT (8): 0.42
HEARING OFFICER'S REPORT (8): 0.42
DUE PROCESS (8): 0.52
LOCAL BOARD (14): 0.54
APPEAL BOARD (16): 0.55
HEARING OFFICER'S NOTE (2): 0.58
UNIVERSAL MILITARY TRAINING (2): 0.61
SUCH CLAIM (11): 0.8
STATES SUPREME COURT (2): 0.86
COURT OF APPEALS (4): 0.87
SELECTIVE SERVICE (5): 0.88

ALL WORDS

ROMAN CATHOLIC CHURCH (4): 0.0
CATHOLIC CHURCH (9): 0.24
FAMILY PLANNING (6): 0.31
BIRTH CONTROL (8): 0.44
NECESSITY OF LIFE (3): 0.49
COUNCIL OF CHURCHES (5): 0.49
NATURAL LAW (8): 0.56
SUCH FACTOR (6): 0.57
CHRISTIAN FAITH (12): 0.58
WORLD COMMUNITY (6): 0.59
OLD TESTAMENT (7): 0.59
RELIGIOUS BELIEF (9): 0.59
LOCAL CHURCH (11): 0.59
SOCIAL STRUCTURE (6): 0.6
MARRIED LIFE (6): 0.6
LAW OF NATURE (5): 0.61

SELECT WORDS

ROMAN CATHOLIC CHURCH (4): 0.0
CATHOLIC CHURCH (9): 0.39
FAMILY PLANNING (6): 0.5
BIRTH CONTROL (8): 0.78
NECESSITY OF LIFE (3): 0.84
COUNCIL OF CHURCHES (5): 0.89
SUCH FACTOR (6): 0.97
NATURAL LAW (8): 1.01
OLD TESTAMENT (7): 1.03
MARRIED LIFE (6): 1.03
WORLD COMMUNITY (6): 1.04
CHRISTIAN FAITH (12): 1.04
RELIGIOUS BELIEF (9): 1.08
LAW OF NATURE (5): 1.1
SOCIAL STRUCTURE (6): 1.1
AMERICAN POLICY (7): 1.1

As can be seen from the above results, collocations that were selected by our program are in general related well to the target collocation.
Of course, we would expect that collocations that mostly occur together (physically close to one another) in the corpus will be considered strongly related by any metric. But, on the other hand, since the collocation counts in all of the above examples vary greatly, we believe that we are also able to relate collocations that occur in contexts that are similar, but not the same.

5. Conclusion.
==============

We experimented with four different existing collocation-finding methods and obtained satisfactory results with all of them. Most collocations occur in a fixed word order and without separation. Thus, even though three of the methods looked only at continuous bigrams or trigrams, they were able to find good selections of collocations. The mean and variance approach can be used for the discovery of collocations that are more loosely structured.

Each of the explored methods displayed some weaknesses which are hard to work around, because they are inherent to the approach. All these methods use only some sort of frequency-count-based evaluation formula, together with part-of-speech tags for simple filtering. For significantly better results, we would need to use other information as well. An example of such an approach is [Lin 1998], where syntactic dependencies between the words are used to find collocations.

What we failed to do very well was to determine which collocations are "good" and which are not. We basically just take the N highest-scoring collocations. We could not think of a good automated evaluation. [Lin 1998] suggests a way, but it is specific to the approach used.

As far as the problem of finding clusters of related collocations is concerned, we could not find any literature on it. A related problem is finding clusters of related words, and work has been done in that area by [Lin 1998] and others. Collocation clusters could be very useful in tasks that are based on word clusters. For example, an algorithm that compares the similarity of two documents using a "bag-of-words" type of approach could be augmented to include a "bag-of-collocations" similarity measure.

6. References.
==============

a) Ch. Manning & H. Schütze - Foundations of Statistical Natural Language Processing.
b) Dekang Lin - Extracting Collocations from Text Corpora (1998).