CS224n Final Project
====================

Alex Khomenko (homa@stanford.edu)
Nickolay Stanev (nbstanev@stanford.edu)

1. Problem statement.
=====================

We decided to investigate various methods of extracting collocations from a corpus and of finding relations between them. Collocations are important for applications such as natural language generation (making sure that the system's output sounds natural) and computational lexicography (identifying important collocations to be listed in a dictionary entry). Finding relations between collocations (finding collocations that are related to a given collocation, or clustering a group of collocations) could be used for automatic generation of cross-references in a collocation-based dictionary or for automatic acquisition of domain-specific vocabularies.

Our project accomplishes both tasks: we evaluate different methods of finding collocations, and we allow searching for collocations that occur in contexts similar to the context of a user-specified collocation. Our results suggest that the methods we are using are feasible and that corpus-based automatic discovery of collocation clusters is possible.

2. Implementation.
==================

Data processing:
----------------

We are using the data in the "ICAME-Browntag" directory. An instance of the TaggedBrownParser class can be initialized either without any arguments, in which case the data files are read, the data is processed, and the resulting data structures are saved to files in a format more convenient for our purposes, or with the name of a directory that already contains the written-out preprocessed data structures. The purpose of doing this is mostly to save time: parsing the whole corpus takes more than half an hour, while loading the preprocessed corpus is much faster. We have already generated preprocessed data, and the submitted program can read it from our directory. We have two data sets - one that contains the whole Brown corpus, and one that contains only the first five documents from each Brown data file. See below for instructions on how these are used.

The TaggedBrownParser module reads in the data and breaks it down into contexts (each context is the subset of the data file that comes from a particular original document). Another program can then obtain the contexts one by one via a simple call (getNextContext). The context itself is just an array of word structures, each containing the word and its tag. We are only interested in collocations that contain words of the following types:

NN  /* Noun. */
NP  /* Proper noun. */
VB  /* Verb. */
JJ  /* Adjective. */
RB  /* Adverb. */
IN  /* Preposition. */

Everything else is marked as OTHER, with the exception of the dot ("."), which is *always* an end-of-sentence marker in this particular corpus. We tag the dots with EOS and use them in the collocation finding process, since we are only interested in words co-occurring within a sentence's boundaries.

Note that we do not keep plural nouns as a separate category. We convert all nouns from plural to singular using the following simple conversions:

- if an NNS ends with "IES", replace the ending with "Y";
- if an NNS ends with "CHES", "OES", "SHES", "SSES", or "XES", strip the "ES";
- if an NNS ends with "S" and is not covered by one of the above, strip the "S".

Thus, we do not take care of irregular plurals (but we do not damage them either), and the only harm done is in the rare case when a word ends in "CHE", for example, in its singular form; then the "E" is incorrectly stripped.
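As an illustration, here is a minimal Python sketch of this stripping heuristic. The function name and the exact coding are ours, for illustration only; the project's code may differ in details.

    def singularize(noun):
        # Heuristic NNS -> NN conversion described above.
        w = noun.upper()
        if w.endswith("IES"):
            return w[:-3] + "Y"      # LADIES -> LADY
        for suffix in ("CHES", "OES", "SHES", "SSES", "XES"):
            if w.endswith(suffix):
                return w[:-2]        # CHURCHES -> CHURCH, BOXES -> BOX
        if w.endswith("S"):
            return w[:-1]            # WEAPONS -> WEAPON
        return w                     # anything else passes through unchanged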
This is quite an ad hoc approach that seemed to work well. However, it would have been useful to have data where the base forms of the nouns, and especially of the verbs, are known (an example of such a corpus is SUSANNE, which is a parsed subset of the Brown corpus). This would have enabled us to find more VB_NN collocations. Nouns with apostrophes are tagged as NN without change. The auxiliary verbs "BE", "DO" and "HAVE" are tagged as OTHER in all their forms.

Finding collocations:
---------------------

We find collocations using four different methods: simple frequency counting, the mean and variance method, the t test, and the chi-square test (all taken from the textbook). The mode is specified as an argument to the Collocations class constructor, along with a corpus parser object that is used for obtaining contexts.

a) Simple frequency counting.

We use this method to search for continuous collocations of length 2 or 3. While our program could be trivially extended to search for longer collocations as well, we decided that longer collocations are not that numerous and perhaps not that interesting. Thus, we process the text from the corpus in bigrams and trigrams as we go through it context by context. We first test each candidate collocation using a simple tag filter. Regardless of the method used, we accept only collocations that match one of the following patterns:

NN_NN
JJ_NN
VB_NN
NN_NN_NN
JJ_NN_NN
NN_JJ_NN
JJ_JJ_NN
NN_IN_NN

We initially had NP_NP and NP_NP_NP, but most collocations found were not really interesting (Mr. Smith, etc.). We also tried the pattern VB_RB, as in "move quickly", but we were getting only uninteresting collocations of the type "go now" or phrasal verbs like "set up" and "come by". While the latter could possibly be used for generating entries in a dictionary, they did not make good collocations for the purposes of our project.

If the n-gram matches one of the patterns, we record it in a table (or add to its count if it is already there). For each collocation we keep track of its count and the location within the corpus of each occurrence (we do this in all modes); the latter is used later when we measure similarity. Finally, we sort the collocations by frequency count and take the top 1000 of each length to use for the second part of the program. We also always record all the words (including the ones tagged as OTHER) with their frequency counts for further use (in some of the other collocation-finding modes).

b) Mean and variance method.

In this mode we look for discontinuous (or possibly continuous) collocations of length 2. As we go through the corpus, we consider a window of size 9 around every word, where that word combined with any of the surrounding 8 words is a potential collocation (we consider only neighboring words within the same sentence). Again, the candidate has to pass through the tag filter to be considered. The ordering of the words is taken to be the one in the first instance that we come upon. For each collocation, we build a histogram of the number of times the second word occurred at a certain distance from the first word.

After we have gone through the corpus, we iterate over the found collocations and do a "flat peak" filtering: we look for the largest distance count in the collocation's histogram, and if the sum of the surrounding 2 counts is larger than half of the maximum peak size, we remove the collocation from our list. Then we compute the mean and variance for the remaining collocations. They are sorted by increasing variance, and collocations with the same variance are sorted by decreasing frequency count. The top 1000 are kept for similarity measuring.
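To make this mode concrete, here is a minimal Python sketch of the histogram collection, the "flat peak" filter, and the mean/variance computation. For simplicity the sketch only pairs a word with the four words to its right (the project also pairs it with the words to its left, keeping the order of the first co-occurrence), and all names and the assumed data layout are ours rather than the project's.

    from collections import defaultdict

    # Bigram tag patterns accepted by the tag filter (see the list above).
    ALLOWED_BIGRAM_TAGS = {"NN_NN", "JJ_NN", "VB_NN"}

    def sentences(context):
        # Split a context (a list of (word, tag) pairs) into sentences at EOS markers.
        sentence = []
        for word, tag in context:
            if tag == "EOS":
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append((word, tag))
        if sentence:
            yield sentence

    def build_histograms(contexts, max_dist=4):
        # For every candidate pair, count how often the second word appears at
        # each distance (1..max_dist) after the first word within a sentence.
        hist = defaultdict(lambda: defaultdict(int))
        for context in contexts:
            for sentence in sentences(context):
                for i, (w1, t1) in enumerate(sentence):
                    for d in range(1, max_dist + 1):
                        if i + d >= len(sentence):
                            break
                        w2, t2 = sentence[i + d]
                        if t1 + "_" + t2 in ALLOWED_BIGRAM_TAGS:
                            hist[(w1, w2)][d] += 1
        return hist

    def passes_flat_peak_filter(histogram, cutoff=0.5):
        # Keep the pair only if the two neighbors of the highest peak together
        # hold no more than `cutoff` of the peak's count.
        peak_d = max(histogram, key=histogram.get)
        neighbors = histogram.get(peak_d - 1, 0) + histogram.get(peak_d + 1, 0)
        return neighbors <= cutoff * histogram[peak_d]

    def mean_and_variance(histogram):
        n = sum(histogram.values())
        mean = sum(d * c for d, c in histogram.items()) / n
        var = sum(c * (d - mean) ** 2 for d, c in histogram.items()) / n
        return mean, var

Collocations that pass the filter would then be sorted by increasing variance, as described above.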
c) t test.

In this mode we look for continuous collocations of length 2 or 3. We collect the collocations in exactly the same manner as in the simple frequency count mode. After that we compute the t value for each collocation. The collocations are sorted by decreasing t value and, again, the top 1000 of each length are kept.

d) Chi-square test.

Here we look for continuous collocations of length 2. We proceed in the same fashion as in the simple frequency count and t test modes, except that at the end we compute the chi-square values of the collocations and order them by decreasing chi-square value.

Looking for similar collocations:
---------------------------------

In this part of the program we take a list of the top N collocations found by one of the above methods and then try to measure the correlation between one of the top 200 collocations (chosen interactively by the user) and the rest of them. For each collocation we take 200 words of context around each occurrence and obtain counts of the words occurring in those context windows. We implement two variations of this method: in one we include all the words in the context window, in the other we only consider the words tagged as NN, VB, RB, or JJ. The word counts for a given collocation are normalized by the total number of words occurring in the context windows of that collocation. This gives us a distribution over the different words that occur in all context windows of a given collocation.

We then measure similarities between these distributions using the following 4 metrics:

- L1 (Manhattan distance)
- L2 (Euclidean distance)
- COS (Cosine)
- IRad (Information radius)

All of them are as described in the textbook (a minimal sketch of these metrics is given at the end of this section). We report the top 40 "closest" collocations for each metric.

Running the program:
--------------------

To run the program, start the "clc" script. It takes one argument - "small" or "large" - specifying whether to use the whole corpus or the smaller data set (5 documents from each data file). The program first reads in the preprocessed data from the files in our directory. Then the user has two options:

a) Find collocations using one of the methods described above. The user has to specify a file name in which the top N collocations are saved. The program appends "__" to the specified name.

b) Compare collocations. The user has to enter the name of the collocations file to be loaded. The top 200 collocations are printed out, and the user is asked for the target collocation and whether all words should be considered or only selected ones (omitting prepositions and words tagged as OTHER). The program then computes the similarity of the target collocation to all the other collocations using the metrics listed above and reports the top 50 most similar ones according to each of the 4 metrics. Typing -1 for the collocation number exits this mode.

The above two actions are performed in a loop, so multiple operations can be performed during one run of the program.

NOTE: The program requires about 800M of memory to run on the whole Brown corpus. We'd advise running the program on the small data set to save time. The results, of course, will not be as good as for the whole corpus.
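To round out the implementation description, here is a minimal Python sketch of the context distributions and the four similarity metrics described under "Looking for similar collocations" above. The names, the assumed data layout, and the reading of the 200-word context as 100 words on either side of an occurrence are our own illustration rather than the project's actual code; IRad is computed here with base-2 logarithms.

    import math
    from collections import Counter

    def context_distribution(occurrences, window=100):
        # `occurrences` is assumed to be a list of (context, position) pairs,
        # where a context is a list of (word, tag) pairs.  Words within
        # `window` positions of each occurrence are counted and the counts
        # are normalized into a distribution.
        counts = Counter()
        for context, pos in occurrences:
            lo, hi = max(0, pos - window), min(len(context), pos + window)
            counts.update(word for word, tag in context[lo:hi])
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def l1_distance(p, q):
        return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

    def l2_distance(p, q):
        return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in set(p) | set(q)))

    def cosine(p, q):
        dot = sum(p[w] * q[w] for w in p if w in q)
        return dot / (math.sqrt(sum(v * v for v in p.values())) *
                      math.sqrt(sum(v * v for v in q.values())))

    def information_radius(p, q):
        # IRad(p, q) = D(p || m) + D(q || m), where m is the average of p and q.
        m = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in set(p) | set(q)}
        def kl(a):
            return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
        return kl(p) + kl(q)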
3. Testing.
===========

Collocation finding.
--------------------

a) Simple frequency counting.

Even this simplest of all methods seemed to give good results. Below are the five most frequent collocations of length 2 and 3 found in the "small" corpus. Along with good collocations such as "SMALL BUSINESS ADMINISTRATION" and "UNITED STATES" we were also getting idiomatic expressions like "MATTER OF FACT" or "POINT OF VIEW", and word pairs that are commonly encountered together but do not necessarily stand in a strong relation to each other; examples of the latter are "OLD MAN" and "LONG TIME". Since this method does not take into consideration the frequencies of the collocation's composing words, such collocations tend to get high scores.

Length 3 (collocation, count):

SMALL BUSINESS CONCERN, 13
SMALL BUSINESS ADMINISTRATION, 9
RATE OF SHEAR, 8
SECRETARY OF STATE, 7
MATTER OF FACT, 5

Length 2 (collocation, count):

UNITED STATES, 47
SMALL BUSINESS, 37
OLD MAN, 21
ANODE HOLDER, 19
PERSONAL PROPERTY, 17

b) Mean and variance method.

This seemed to be the best of all the methods. The continuous collocations are among the best scoring, but good discontinuous collocations like "DESTROY ENEMY'S" and "SCIENCE TECHNOLOGY" were found too. This method copes better, although not completely, with the "common words" problem mentioned above. While "OLD MAN" is still considered a very good collocation, the score of something like "LONG TIME" dropped. The reason is that other instances of the co-occurrence of these common words ("...long after the time...", "...this time he took a long break...") are also considered, and they lower the likelihood that the two words are strongly related.

Length 2 (collocation, word 1, word 2, mean, variance, count):

UNITED STATES, UNITED, STATES, 1.0, 0.0, 47
OLD MAN, OLD, MAN, 1.0, 0.0, 21
RADIO EMISSION, RADIO, EMISSION, 1.0, 0.0, 16
NUCLEAR WEAPON, NUCLEAR, WEAPON, 1.0, 0.0, 13
WAVE LENGTH, WAVE, LENGTH, 1.0, 0.0, 13

A note on the parameters used for "flat peak" filtering. We initially decided to compare the size of the largest peak to the sum of its 4 closest neighbors, the cut-off condition being that this sum is more than 1/4 of the largest peak's size. But we noticed that good collocations were being removed. Here are some examples; the numbers shown are the distance counts in the collocation's histogram.

HELD MEETING      0 0 0 0 0 1 4 1 1
ATHLETIC PROGRAM  0 0 0 0 0 3 0 1 0
FIRED GUN         0 0 0 0 0 1 3 0 0
WHITE TEETH       0 0 0 0 0 4 1 2 0
FURNISHED ROOM    0 0 0 0 0 3 1 0 0

The method proved useful in eliminating bad collocations too, though:

SAID DAY          0 0 0 0 0 0 2 8 3
SUCH DEVELOPMENT  0 0 0 0 0 0 3 3 1
GAVE YEAR         0 0 0 0 0 0 1 3 1
SAID CAR          0 0 0 0 0 0 1 4 3

After observing examples like the ones presented above, we concluded that this cut-off is too low for the amount of data that we have (if we had encountered "ATHLETIC PROGRAM", for example, 20 times, it would probably have had a histogram that passes the test). Judging from the results above, we decided to compare the highest peak only to the sum of its two immediate neighbors and to set the cut-off to 0.5. Some good collocations like the ones below are still being cut off, but we could not think of a good way to deal with that.

Removing DRIVE CAR     0 0 0 0 0 1 4 3 1
Removing TOUCHED HAND  0 0 0 0 0 1 2 1 0
Removing DRAW LINE     0 0 0 0 0 0 3 2 0

c) t test.

This method was also slightly better than the simple frequency counting method in that it deals better with idiomatic expressions and common-word collocations. Since it uses the composing words' frequency counts, high-scoring collocations consist of words that are very likely to occur together, not just very likely in general.
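For reference, the t and chi-square scores reported in this and the following subsection can be computed from the bigram and unigram counts roughly as follows. This is a minimal sketch of the textbook formulas with variable names of our own choosing, not the project's actual code.

    import math

    def t_score(c12, c1, c2, n):
        # Bigram t value from the observed bigram count c12, the unigram
        # counts c1 and c2, and the corpus size n (with the sample variance
        # approximated by the sample mean, as in the textbook).
        expected = c1 * c2 / n
        return (c12 - expected) / math.sqrt(c12)

    def chi_square(c12, c1, c2, n):
        # Chi-square value from the 2x2 contingency table of the bigram.
        o11 = c12                 # w1 followed by w2
        o12 = c1 - c12            # w1 followed by some other word
        o21 = c2 - c12            # some other word followed by w2
        o22 = n - c1 - c2 + c12   # neither w1 nor w2
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den

Collocations are then ranked by decreasing score, as described in section 2.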
In contrast to the top 5 collocations found by the frequency counting method, "MATTER OF FACT" moved down the list, while the more interesting "HOME RULE CHARTER" climbed to the top.

Length 3 (collocation, t value, count):

SMALL BUSINESS CONCERN, 3.6056123197382504, 13
SMALL BUSINESS ADMINISTRATION, 2.9998899172660995, 9
RATE OF SHEAR, 2.8265620474375597, 8
SECRETARY OF STATE, 2.632356060656279, 7
HOME RULE CHARTER, 2.236093660779998, 5

Length 2 (collocation, t value, count):

UNITED STATES, 6.8303675830354, 47
SMALL BUSINESS, 5.987962457983314, 37
ANODE HOLDER, 4.339252842456542, 19
OLD MAN, 4.283453291332583, 21
PERSONAL PROPERTY, 4.082507525167258, 17

d) Chi-square test.

This method gave interesting results. It is similar to the t test but puts even more weight on the co-occurrence of the words composing the collocation. Thus, most of the top collocations had low counts and consisted of uncommon words that appeared very rarely, if at all, in the rest of the corpus. An example of such a collocation is "POST-ATTACK RECONNAISSANCE" below. We decided to remove the collocations with a count of 3 or below; there were a lot of them, but they were really uncommon, and we wanted to deal with more natural collocations. We felt that this method, more than the other three, needs more data. Given that, it would be able to pick out the good collocations among the uncommon word pairs and would have the potential to perform better than the mean and variance method.

Length 2 (collocation, chi-square value, count):

UNITED STATES, 6823.207158142358, 47
POST-ATTACK RECONNAISSANCE, 5789.036633730465, 4
PLASMA GENERATOR, 5789.036633730465, 4
MISPLACED MODIFIER, 5402.567754959485, 4
SURFACE-ACTIVE AGENT, 5206.106140377071, 6

Measuring similarity between collocations.
------------------------------------------

We evaluated four different metrics for measuring semantic similarity between our collocations' contexts: L1 distance, L2 distance, Cosine (COS), and Information Radius (IRad). We found that IRad was consistently better than the other three metrics, although they all worked well at identifying obvious matches. Our judgement was, of course, subjective, but we provide some characteristic results below.

Here are the 15 best collocations that occur in contexts similar to that of "NUCLEAR WEAPON", according to each of the four metrics:

a) L1.

DESTROY ENEMY'S (4): 0.83
MILITARY FORCE (8): 0.84
BALLISTIC MISSILE (16): 0.85
STRATEGIC FORCE (6): 0.89
BALLISTIC LONG-RANGE (4): 0.91
SOVIET UNION (19): 0.92
FOREIGN POLICY (31): 0.92
NATIONAL SECURITY (10): 0.92
INITIAL ATTACK (4): 0.93
HUMAN BEING (37): 0.93
DIFFERENT WAY (8): 0.94
GREAT DEAL (43): 0.96
FAR EAST (6): 0.98
SUCH THING (20): 0.98
ATOMIC ENERGY (10): 0.99

b) L2.

MILITARY FORCE (8): 0.0
HUMAN BEING (37): 0.0
FAR EAST (6): 0.0
SOVIET UNION (19): 0.0
NATIONAL SECURITY (10): 0.0
BALLISTIC MISSILE (16): 0.0
GREAT DEAL (43): 0.0
UNITED PRESIDENT (9): 0.0
PRESIDENT STATES (9): 0.0
REAL ESTATE (30): 0.0
SUCH THING (20): 0.0
DIFFERENT WAY (8): 0.0
DESTROY ENEMY'S (4): 0.0
GOOD WILL (12): 0.0
FOREIGN POLICY (31): 0.0
c) COS.

NUCLEAR WEAPON (23): 1.0
MILITARY FORCE (8): 0.93
SOVIET UNION (19): 0.92
BALLISTIC MISSILE (16): 0.92
HUMAN BEING (37): 0.92
NATIONAL SECURITY (10): 0.92
FAR EAST (6): 0.92
GOOD WILL (12): 0.91
COMMON EXPERIENCE (7): 0.91
FREE SOCIETY (7): 0.91
GREAT DEAL (43): 0.91
FOREIGN POLICY (31): 0.91
PRESIDENT STATES (9): 0.91
UNITED PRESIDENT (9): 0.91
REAL ESTATE (30): 0.91

d) IRad.

DESTROY ENEMY'S (4): 0.39
BALLISTIC MISSILE (16): 0.42
MILITARY FORCE (8): 0.43
STRATEGIC FORCE (6): 0.43
BALLISTIC LONG-RANGE (4): 0.47
NATIONAL SECURITY (10): 0.48
SOVIET UNION (19): 0.49
INITIAL ATTACK (4): 0.49
FOREIGN POLICY (31): 0.49
HUMAN BEING (37): 0.5
DIFFERENT WAY (8): 0.51
GREAT DEAL (43): 0.52
ATOMIC ENERGY (10): 0.53
GO CODE (5): 0.53
MANNED BOMBER (5): 0.53

While L1, L2, and COS place collocations such as "FAR EAST", "SUCH THING", "GOOD WILL", "REAL ESTATE", "COMMON EXPERIENCE", and "FREE SOCIETY" close to the top of the list, IRad does not.

4. Results.
===========

There are 6 sets of results here. We take three of our collocation-finding methods (mean and variance, chi-square test, and t test), choose two collocations that each method found, and report for each one the top 15 similarity matches found using the IRad metric. We report results for context models that include all of the words (ALL WORDS) and also for models that only use words tagged as NN, VB, JJ, and RB (SELECT WORDS). The two models are comparable, but sometimes using selected words helps (for example, the model that uses all words inexplicably deems "INDIAN PYTHON" to be close to "MICROMETEORITE FLUX", while the select-words model does not). The matches are reported in the following format: COLLOCATION (count): distance.

MEAN AND VARIANCE COLLOCATIONS

ALL WORDS

NUCLEAR WEAPON (23): 0.0
DESTROY ENEMY'S (4): 0.39
BALLISTIC MISSILE (16): 0.42
MILITARY FORCE (8): 0.43
STRATEGIC FORCE (6): 0.43
BALLISTIC LONG-RANGE (4): 0.47
NATIONAL SECURITY (10): 0.48
SOVIET UNION (19): 0.49
INITIAL ATTACK (4): 0.49
FOREIGN POLICY (31): 0.49
HUMAN BEING (37): 0.50
DIFFERENT WAY (8): 0.51
GREAT DEAL (43): 0.52
ATOMIC ENERGY (10): 0.53
GO CODE (5): 0.53

SELECT WORDS

NUCLEAR WEAPON (23): 0.0
DESTROY ENEMY'S (4): 0.7
STRATEGIC FORCE (6): 0.77
MILITARY FORCE (8): 0.78
BALLISTIC MISSILE (16): 0.79
BALLISTIC LONG-RANGE (4): 0.84
NATIONAL SECURITY (10): 0.9
INITIAL ATTACK (4): 0.91
SOVIET UNION (19): 0.94
FOREIGN POLICY (31): 0.95
SOVIET LEADER (9): 0.96
POST-ATTACK RECONNAISSANCE (4): 0.98
NEW EDITOR (6): 0.99
FAR EAST (6): 0.99
ACCIDENTAL WAR (5): 0.99

ALL WORDS

WAVE LENGTH (14): 0.0
RADIO EMISSION (16): 0.13
EMISSION MOON (6): 0.25
BRIGHTNESS TEMPERATURE (7): 0.36
LOW INTENSITY (5): 0.38
ANTENNA BEAM (6): 0.42
USING TECHNIQUE (4): 0.51
ORDER MAGNITUDE (6): 0.59
STEADY STATE (6): 0.6
SHOWN FIGURE (18): 0.61
TAKES PLACE (13): 0.62
MOLECULAR WEIGHT (10): 0.62
ECONOMIC INTEGRATION (15): 0.62
BALLISTIC MISSILE (16): 0.63
FILM THICKNESS (7): 0.63
STRAIGHT LINE (12): 0.63

SELECT WORDS

WAVE LENGTH (14): 0.0
RADIO EMISSION (16): 0.24
EMISSION MOON (6): 0.46
BRIGHTNESS TEMPERATURE (7): 0.63
LOW INTENSITY (5): 0.67
ANTENNA BEAM (6): 0.78
USING TECHNIQUE (4): 0.9
SHOWN FIGURE (18): 1.09
ORDER MAGNITUDE (6): 1.09
STEADY STATE (6): 1.1
ECONOMIC INTEGRATION (15): 1.11
FLOW RATE (6): 1.12
CARBON TETRACHLORIDE (18): 1.12
INCIDENT LIGHT (7): 1.12
ARC VOLTAGE (5): 1.13
LIGHT INTENSITY (9): 1.13

CHI-SQUARED TEST COLLOCATIONS

ALL WORDS

SKILLED MANPOWER (4): 0.0
DEVELOPING NATION (5): 0.36
PEACE CORPS (55): 0.42
VOCATIONAL TRAINING (9): 0.46
MUTUAL SECURITY (4): 0.48
VOCATIONAL EDUCATION (19): 0.53
EMPLOYMENT OPPORTUNITY (5): 0.55
FEDERAL GOVERNMENT (22): 0.55
INDUSTRIAL DEVELOPMENT (10): 0.56
TRAINING PROGRAM (12): 0.56
ECONOMIC DEVELOPMENT (17): 0.56
FEDERAL FUND (9): 0.56
HOST COUNTRY (6): 0.56
FOREIGN COUNTRY (11): 0.57
DEPRESSED AREA (5): 0.57
INDUSTRIAL PLANT (7): 0.57

SELECT WORDS

SKILLED MANPOWER (4): 0.0
DEVELOPING NATION (5): 0.6
PEACE CORPS (55): 0.7
VOCATIONAL TRAINING (9): 0.75
MUTUAL SECURITY (4): 0.78
VOCATIONAL EDUCATION (19): 0.83
TRAINING PROGRAM (12): 0.88
ECONOMIC DEVELOPMENT (17): 0.88
FOREIGN COUNTRY (11): 0.9
FEDERAL FUND (9): 0.91
INDUSTRIAL DEVELOPMENT (10): 0.91
FEDERAL GOVERNMENT (22): 0.93
DEPRESSED AREA (5): 0.93
HOST COUNTRY (6): 0.93
EMPLOYMENT OPPORTUNITY (5): 0.94
SCHOOL DISTRICT (21): 0.95

ALL WORDS

MICROMETEORITE FLUX (4): 0.0
PARTICLE SIZE (8): 0.46
KINETIC ENERGY (5): 0.54
WAVE LENGTH (14): 0.6
RADIO EMISSION (16): 0.6
PLANETARY RADIATION (4): 0.61
LOW INTENSITY (5): 0.62
POTENTIAL ENERGY (5): 0.63
INDIAN PYTHON (5): 0.63
SOLAR RADIATION (4): 0.63
SOLAR SYSTEM (6): 0.63
OPTIMAL POLICY (15): 0.63
AMETHYSTINE PYTHON (4): 0.63
OPERATING VARIABLE (8): 0.63
THERMAL RADIATION (4): 0.64
MOLECULAR WEIGHT (10): 0.64

SELECT WORDS

MICROMETEORITE FLUX (4): 0.0
PARTICLE SIZE (8): 0.88
KINETIC ENERGY (5): 0.99
PLANETARY RADIATION (4): 1.08
WAVE LENGTH (14): 1.09
SOLAR RADIATION (4): 1.09
RADIO EMISSION (16): 1.1
POTENTIAL ENERGY (5): 1.11
LOW INTENSITY (5): 1.13
SOLAR SYSTEM (6): 1.13
LIGHT INTENSITY (9): 1.15
INFRARED EMISSION (4): 1.16
INCIDENT LIGHT (7): 1.16
SAMPLING CENSUS (4): 1.16
NORMAL PRESSURE (8): 1.17
SURFACE TEMPERATURE (8): 1.17

T-TEST COLLOCATIONS

ALL WORDS

DEPARTMENT OF JUSTICE (13): 0.0
HEARING OFFICER (9): 0.2
FEDERAL BUREAU (8): 0.21
BUREAU OF INVESTIGATION (8): 0.21
HEARING OFFICER'S (10): 0.23
OFFICER'S REPORT (8): 0.25
HEARING OFFICER'S REPORT (8): 0.25
DUE PROCESS (8): 0.28
LOCAL BOARD (14): 0.3
APPEAL BOARD (16): 0.31
HEARING OFFICER'S NOTE (2): 0.35
UNIVERSAL MILITARY TRAINING (2): 0.36
SUCH CLAIM (11): 0.46
UNITED STATES (392): 0.47
SELECTIVE SERVICE (5): 0.48
SUPREME COURT (28): 0.5

SELECT WORDS

DEPARTMENT OF JUSTICE (13): 0.0
HEARING OFFICER (9): 0.36
BUREAU OF INVESTIGATION (8): 0.38
FEDERAL BUREAU (8): 0.38
HEARING OFFICER'S (10): 0.4
OFFICER'S REPORT (8): 0.42
HEARING OFFICER'S REPORT (8): 0.42
DUE PROCESS (8): 0.52
LOCAL BOARD (14): 0.54
APPEAL BOARD (16): 0.55
HEARING OFFICER'S NOTE (2): 0.58
UNIVERSAL MILITARY TRAINING (2): 0.61
SUCH CLAIM (11): 0.8
STATES SUPREME COURT (2): 0.86
COURT OF APPEALS (4): 0.87
SELECTIVE SERVICE (5): 0.88

ALL WORDS

ROMAN CATHOLIC CHURCH (4): 0.0
CATHOLIC CHURCH (9): 0.24
FAMILY PLANNING (6): 0.31
BIRTH CONTROL (8): 0.44
NECESSITY OF LIFE (3): 0.49
COUNCIL OF CHURCHES (5): 0.49
NATURAL LAW (8): 0.56
SUCH FACTOR (6): 0.57
CHRISTIAN FAITH (12): 0.58
WORLD COMMUNITY (6): 0.59
OLD TESTAMENT (7): 0.59
RELIGIOUS BELIEF (9): 0.59
LOCAL CHURCH (11): 0.59
SOCIAL STRUCTURE (6): 0.6
MARRIED LIFE (6): 0.6
LAW OF NATURE (5): 0.61

SELECT WORDS

ROMAN CATHOLIC CHURCH (4): 0.0
CATHOLIC CHURCH (9): 0.39
FAMILY PLANNING (6): 0.5
BIRTH CONTROL (8): 0.78
NECESSITY OF LIFE (3): 0.84
COUNCIL OF CHURCHES (5): 0.89
SUCH FACTOR (6): 0.97
NATURAL LAW (8): 1.01
OLD TESTAMENT (7): 1.03
MARRIED LIFE (6): 1.03
WORLD COMMUNITY (6): 1.04
CHRISTIAN FAITH (12): 1.04
RELIGIOUS BELIEF (9): 1.08
LAW OF NATURE (5): 1.1
SOCIAL STRUCTURE (6): 1.1
AMERICAN POLICY (7): 1.1

As can be seen from the above results, collocations that were selected by our program are in general related well to the target collocation.
Of course, we would expect that collocations that mostly occur together (physically close to one another) in the corpus will be considered strongly related by any metric. But, on the other hand, since the collocation counts in all of the above examples vary greatly, we believe that we are also able to relate collocations that occur in contexts that are similar, but not the same.

5. Conclusion.
==============

We experimented with four different existing collocation-finding methods and obtained satisfactory results with all of them. Most collocations occur in a fixed word order and without separation. Thus, even though three of the methods looked only at continuous bigrams or trigrams, they were able to find good selections of collocations. The mean and variance approach can be used for the discovery of collocations that are more loosely structured.

Each of the explored methods displayed some weaknesses which are hard to work around, because they are inherent to the approach. All these methods use only some sort of frequency-count-based evaluation formula, together with part-of-speech tags for simple filtering. For significantly better results, we would need to use other information as well. An example of such an approach is [Lin 1998], where syntactic dependencies between the words are used to find collocations.

What we failed to do very well was to determine which collocations are "good" and which are not. We basically just take the N highest-scoring collocations. We could not think of a good automated evaluation. [Lin 1998] suggests a way, but it is specific to the approach used.

As far as the problem of finding clusters of related collocations is concerned, we could not find any literature on it. A related problem is finding clusters of related words, and work has been done in that area by [Lin 1998] and others. Collocation clusters could be very useful in tasks that are based on word clusters. For example, an algorithm that compares the similarity of two documents using a "bag-of-words" type of approach could be augmented to include a "bag-of-collocations" similarity measure.

6. References.
==============

a) Ch. Manning & H. Schütze - Foundations of Statistical Natural Language Processing.
b) Dekang Lin - Extracting Collocations from Text Corpora (1998).