Cedergren data set: Logistic Regression a.k.a. Varbrul

In this assignment, we'll build logistic regression models using a version of Varbrul.

To do this assignment, you'll need a copy of Varbrul. Oct 2007 update: The most recent version of Varbrul for both Mac and Windows is now GoldVarb X (a.k.a. GoldVarb 3.0b). You can download it here. You can even run it under Linux using Wine. The instructions below were originally written for GoldVarb 2.1. You can still download GoldVarb 2.1 for the Mac from http://www.crm.umontreal.ca/~sankoff/GoldVarb_Eng.html. The Windows version from that time, GoldVarb 2001, no longer seems to be available at http://www.york.ac.uk/depts/lang/webstuff/goldvarb/ or elsewhere. These sites have online copies of the documentation for both packages (a copy also comes with the software you download, though the Mac version is in MacWrite format, which you may no longer be able to MacRead). The manual for Windows is no longer at http://www.york.ac.uk/depts/lang/webstuff/goldvarb/manual/manualOct2001.html, but the manual for the Mac is still at http://www.crm.umontreal.ca/~sankoff/GoldVarbManual.Dir/ or at http://individual.utoronto.ca/tagliamonte/Goldvarb/GoldVarb_MANUAL.htm . Although there have at times been versions for Unix and other larger computers (Sankoff discusses one such version in his article, and you can read some history at http://www.ling.upenn.edu/~ellen/varb.html), as far as I am aware, these are the only versions currently widely available.

You'll also need some data. Here's some data from Henrietta Cedergren's 1973 study of /s/-deletion in Panamanian Spanish (via Greg Guy and Scott Kiesling). It's available in Windows format as http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama-win.tkn and in Mac format as http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama-mac.tok (these two formats differ only in the end-of-line character used: the file I had was in Mac format from GoldVarb 2.1, and my memory is that the Mac version only works if the line ends are right, so I thought the same might be true for the PC version, but I haven't tested this thoroughly). Varbrul token files represent the individual data tokens one per line in a fairly bizarre format apparently left over from about the 1960s. Each data observation or token is on a separate line and is preceded by a ( character. Thereafter, each character position by default represents a factor group (either the response variable or an explanatory variable).
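As a rough illustration of the format just described, here is a minimal Python sketch that reads token lines. The sample lines and the single-character codes in them are invented for the example and are not taken from the Panama file; the function name parse_tokens and the choice to ignore anything after the first four characters are likewise assumptions of this sketch, not documented GoldVarb behavior.

```python
def parse_tokens(lines, n_groups=4):
    """Return one tuple of factor codes per token line.

    A token line starts with '('; each following character position is
    read as one factor group. This sketch ignores anything after the
    first n_groups characters.
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate Mac, Unix, and Windows line ends
        if not line.startswith("("):
            continue  # not a token line
        codes = line[1:1 + n_groups]
        if len(codes) == n_groups:
            tokens.append(tuple(codes))
    return tokens


# Invented sample lines; the real codes live in the token file.
sample = ["(1nC4", "(0vV1", "; not a token line", "(1dP2\r"]
print(parse_tokens(sample))
```

When reading the real file, opening it in Python's default text mode should normalize the line endings, so the same function can be applied to either the Mac or the Windows copy.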

You've been given a token file which represents data from Cedergren's 1973 study of final s-deletion in Panama City, Panama. Cedergren had noticed that speakers in Panama City, like speakers of many dialects of Spanish, variably deleted the s at the end of words. She undertook a study to find out if there was a change in progress: if final /s/ was systematically dropping out of Panamanian Spanish. The attached data are from interviews she performed across the city in several different social classes, to see how the variation was structured in the community. She also investigated the linguistic constraints on deletion, so she coded for a phonetic constraint (whether the following segment was a consonant, a vowel, or a pause) and for the grammatical category of the word that the /s/ is part of:

monomorpheme, where the s is part of the free morpheme (eg, menos)

verb, where the s is the second singular inflection (eg, tú tienes)

determiner, where s is plural marked on a determiner (eg, los, las)

adjective, where s is a nominal plural agreeing with the noun (eg, buenos)

noun, where s marks a plural noun (eg, amigos)

The codes are as follows:

FG1: s-deletion

      1=deleted

      0=not deleted

FG2: grammatical category

      v=verb (second singular inflection)

      d=nominal plural marker in a determiner

      a=nominal plural marker in an adjective

      n=nominal plural marker in a noun

FG3: following phonetic environment (consonant, vowel, or pause)

FG4: social class

      1=highest

      2=second highest

      3=second lowest

      4=lowest

Your task is to perform an analysis of this data using Varbrul as a tool: Under what conditions is /s/ deleted most often?

In this file, after the parenthesis, the first column will be whether /s/ was deleted (1) or not (0). The following columns are the other factor groups, as just laid out. Download this file on to your Mac/PC (it's probably safest to do any "Save as" as source/raw/all files rather than text).

I might add that there is work within the categorical data analysis community on logistic regression models that treat ordinal variables specially (you can get extra power by knowing that the values are ordered), but AFAIK, GoldVarb does not implement this. Such methods are described in Chapter 8 of Agresti, Categorical Data Analysis, Wiley, 1990, and implemented in some statistics programs, such as SPSSX or SAS. This would have been appropriate for social class.

"Statistics" can be defined as numerical summaries of the raw data, and so a statistics package would more commonly start with some kind of numerical summary. Varbrul generates one of these as it goes, but you can also get something roughly along those lines (still slightly imperfectly sorted, sorry) by doing grep '^(' panama-win.tkn | sort | uniq -c | sort -k 2.4 giving this data: http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama.cnts .
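If you would rather not use the shell, roughly the same summary as the grep pipeline can be sketched in Python. This is only an illustration: the token lines below are invented, count_cells is a hypothetical helper name, and the sort order (explanatory factors first, then the response) is my reading of what the pipeline is aiming for rather than an exact reproduction of its output.

```python
from collections import Counter


def count_cells(lines, n_groups=4):
    """Count how often each distinct token pattern occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("("):  # same filter as grep '^('
            counts[line[1:1 + n_groups]] += 1
    return counts


# Invented token lines standing in for the contents of the token file.
sample = ["(1nC4", "(0nC4", "(1nC4", "(0vV1"]
counts = count_cells(sample)

# Sort by the explanatory factors first, then by the response code,
# so that the two outcomes for one cell sit next to each other.
for pattern, n in sorted(counts.items(), key=lambda kv: (kv[0][1:], kv[0][0])):
    print(n, pattern)
```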

I started writing instructions for the PC version of GoldVarb below. However, if you’re using a Mac, the particulars of the interface are somewhat different, even though the essential functionality is the same. At any rate, it meant that I ended up starting to retype a lot of the manual, which also takes you through sample runs. So, I gave up. Therefore, as well as paying attention in class, you should read the manual for help in doing this assignment. In case it helps, here are copies of the sample Nepean files discussed in the manuals: http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nepean-mac.tok or http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nepean-win.tkn . And, finally, here are copies of the Nasal deletion data that I discussed in class: http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nasal-mac.tok

Run GoldVarb. From the main screen, ask to View | Tokens. In the Tokens window that then appears, do File | Load, and load the Panama data. Since we're lazy, let's just get the program to define the factor groups for us based on what is in the file. Do Action | Generate Factor Spec's. A Groups window should appear. It should list the 4 factor groups with the factors as discussed above. (Again, in Varbrul-speak, a factor group is a variable, and a factor is a categorical value of such a variable.) To make sure everything is hunky dory, you might want to do Action | Check Tokens in the Tokens window. This checks for things in the tokens file that are not listed in the Factor Groups specification. However, since we generated the Factors from the tokens rather than specifying them by hand, it would be rather unsettling if this check indicated any errors.

GoldVarb has extensive facilities for mapping from a coding in a data file to a derived coding scheme (by merging factors, or by doing more complicated things using ANDs and ORs of underlying factors). You will want to explore simple recodings in the assignment to see if factors are distinct, but initially, if we don't want to do any of that, we simply choose Actions | No Recode from the Tokens window. This produces a null recoding. In any conditions file, including this null one, the first condition is treated as the dependent variable. Felicitously, this is just what we want.

From the main GoldVarb 2001 window, choose View | Results, and then from the Results window choose Action | Load Cells to Memory. Click OK to anything that comes up. (GoldVarb does only binary logistic regression; the Multinomial menu option is an unimplemented no-operation. If you give two values of the dependent factor group, and there are only 2 here, then it does a binomial logistic regression between those two factors. If you mention only one, it does a binomial logistic regression between that factor and all other factors in the factor group combined.) Then wait until the Results window fills with text (this takes a few seconds even on my machine, and you don't seem to get an hourglass).

The result will show you marginals of the rate of application (here /s/ deletion) for each value of each explanatory variable. Looking at the marginals is useful from the point of view of exploratory data analysis, and might suggest a recoding of variables. Technically, the main thing to check for is that no factors are categorical (100% or 0% rule application). While categorical cases can be viewed as limit cases of a Varbrul model, Varbrul only analyzes data where there is variation.
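To make the idea of marginals concrete, here is a small Python sketch that tallies the rate of application for each factor and flags categorical factors. It is not GoldVarb's code: the tokens are invented, the function names are hypothetical, and the response is assumed to be the first character with 1 marking application, as in the description of the file above.

```python
from collections import defaultdict


def marginals(tokens, application="1"):
    """Map (group number, factor) -> (applications, total, rate).

    tokens: strings whose first character is the response and whose
    remaining characters are the explanatory factor groups (FG2, FG3, ...).
    """
    tallies = defaultdict(lambda: [0, 0])
    for tok in tokens:
        applied = tok[0] == application
        for group, factor in enumerate(tok[1:], start=2):
            tallies[(group, factor)][0] += applied
            tallies[(group, factor)][1] += 1
    return {key: (a, n, a / n) for key, (a, n) in tallies.items()}


def categorical_factors(table):
    """Factors with 0% or 100% application, which the text says to check for."""
    return sorted(key for key, (a, n, _) in table.items() if a == 0 or a == n)


# Invented tokens, far too few for a real analysis.
tokens = ["1nC4", "0nC4", "1nV4", "0vV1", "0vC1", "1dC1"]
table = marginals(tokens)
for (group, factor), (a, n, rate) in sorted(table.items()):
    print(f"FG{group} {factor}: {a}/{n} = {rate:.0%}")
print("categorical:", categorical_factors(table))
```

In this toy sample two factors come out categorical only because there are so few tokens; in real data a categorical factor is a sign that something needs recoding or excluding before the regression is run.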

1. Once you have generated your initial cell results, discuss the effect of factors just on the percentages (and report these percentages): Which class favors deletion most? Following segment? Grammatical category? Why do you think these factors pattern the way they do? Are there any patterns that surprise you?

2. Which factor group has the strongest effect? Which has the weakest?

3. Would you guess that all the factors in this analysis will be significantly different from one another?

From the Results window, choose Binomial, 1 level. This does a simple logistic regression analysis with all the variables and values as defined, and produces both a Binomial Varbrul results window and a Scattergram. The Scattergram shows the actual rate of 'application' of a variable rule (i.e., the proportion of one value of the dependent variable) vs. that predicted by the program. In a good model, this relation should be linear. If it is badly nonlinear, this suggests interactions or other necessary factors that are not being modeled. In the Mac version, you can click on points to identify them. That doesn't seem to work on the Windows version. Pity.

Then explore Binomial, Up and Down, which does a stepwise logistic regression to see which factor groups do or don't improve the model. Important: doing this only explores adding and deleting whole factor groups. You will also want to explore collapsing factors within factor groups. To do this, you have to use the Recode options and manually collapse together factors that you think might not have a substantially different effect. You then also have to compare log likelihoods by hand to see if the model gets significantly worse or not. See the manual, or remember what we did in class!
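For readers who want to see what a one-level binomial fit amounts to, here is a minimal Python sketch of maximum likelihood logistic regression on cell counts, followed by the observed-versus-predicted comparison that a scattergram plots. Everything here is an assumption made for illustration: the counts are invented, there is only one factor group, the fit uses plain gradient ascent, and the coefficient coding (a reference level fixed at 0) is a choice made for this sketch and is not how GoldVarb reports its factor weights.

```python
import math


def predict(coef, level):
    """Predicted probability of application for one factor level."""
    z = coef["intercept"] + coef.get(level, 0.0)
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(cells, levels, steps=2000, rate=1.0):
    """Fit logit(p) = intercept + b[level] by gradient ascent.

    cells: list of (level, applications, total). The first entry of
    levels is the reference level, whose coefficient stays at 0.
    Returns (coefficients, log likelihood).
    """
    coef = {"intercept": 0.0, **{lv: 0.0 for lv in levels[1:]}}
    n_tokens = sum(total for _, _, total in cells)
    for _ in range(steps):
        grad = dict.fromkeys(coef, 0.0)
        for level, apps, total in cells:
            resid = apps - total * predict(coef, level)  # observed - expected
            grad["intercept"] += resid
            if level in grad:
                grad[level] += resid
        for name in coef:
            coef[name] += rate * grad[name] / n_tokens
    loglik = sum(
        apps * math.log(predict(coef, lv))
        + (total - apps) * math.log(1.0 - predict(coef, lv))
        for lv, apps, total in cells
    )
    return coef, loglik


# Invented counts: (following segment, deletions, tokens).
cells = [("C", 30, 100), ("V", 60, 100)]
coef, loglik = fit_logistic(cells, levels=["C", "V"])
for level, apps, total in cells:
    print(level, "observed", apps / total, "predicted", round(predict(coef, level), 3))
print("log likelihood:", round(loglik, 3))
```

With a single factor group the model has one parameter per cell, so the predicted rates simply reproduce the observed ones. Departures between observed and predicted rates only become possible, and informative, once several factor groups are combined additively.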

Once you have run Varbrul on the Panama data:

4. Report the results of the one level and stepwise analysis. Look for badly modeled cells or signs of interaction, etc.

5. Within each factor group, are there any factors that you think should be combined? If so, combine them and present the findings of your reanalysis. The final analysis you present should be the one you think is most efficient, with the minimum number of predictive parameters required to model the data adequately.

Addendum on model comparison: 2 Mar 2002

Here's a slightly more detailed discussion of comparing models to find out the best logistic regression model for the data. There are two parts to this. One is the general idea of likelihood ratio tests, and the particular instance of that for logistic regression models. This is a useful general technique, and a good one to understand. The other half is the particularities of how this is implemented and realized in Varbrul.

Likelihood ratio tests: The likelihood ratio test here is exactly the same one we saw in week 3. It's just being used in a new context. The first fundamental idea behind the likelihood ratio test is that we would like to choose a model that gives high likelihood to the observed data. We have two different models of differing complexity: for example, one may seek to model nasal deletion based just on the type of the nasal, and the other might model nasal deletion based on the type of the nasal and the following context. For each, we will normally have set the numeric parameters so as to obtain the maximum likelihood model within that model class. Note that if one model is nested within the other (as in my example above), then the more complex model must score at least as well in the likelihood it assigns the data, and usually one would expect it to do at least a fraction better, since a model with more parameters can capture some of the random variation in the observed data which isn't statistically significant. Beyond this point, we are working in the traditional hypothesis testing framework of (frequentist) statistics. Our null hypothesis H0 is that the simpler model is adequate. We seek to disconfirm the null hypothesis by establishing whether there is sufficient evidence that the better fit of the more complex model cannot reasonably be attributed to modeling chance occurrences in the observed data. The likelihood ratio is:

Lambda = (maximum likelihood for model H0) / (maximum likelihood for more complex model)

The test statistic for a likelihood ratio is G2 = -2 log(Lambda) [using a natural logarithm]. The likelihood-ratio chi-squared statistic G2 will take a minimum value of 0 when the likelihood of the two models is identical, and will take high values as the more complex model becomes much more likely (i.e., by orders of magnitude). This statistic (for reasonably large data sets) is approximately chi-square distributed, so we test for significance against a standard chi-square distribution. The number of degrees of freedom to use is the difference in the number of estimated parameters between the two models.
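The arithmetic of the test can be sketched in a few lines of Python. This is an illustration rather than anything Varbrul provides: the function names are hypothetical, the log likelihoods in the example are invented, and the chi-square tail probability is computed with a standard recurrence for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_simple, loglik_complex, df):
    """Return (G2, p-value) for two nested maximum likelihood models."""
    g2 = -2.0 * (loglik_simple - loglik_complex)  # -2 log(Lambda)
    return g2, chi2_sf(g2, df)


# Invented log likelihoods for a simpler and a more complex model
# that differ by one estimated parameter.
g2, p = likelihood_ratio_test(-130.0, -127.0, df=1)
print(f"G2 = {g2:.2f}, p = {p:.4f}")
```

A small p-value here counts as evidence against H0, that is, against the claim that the simpler model is adequate.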

Varbrul: There are two basic cases to consider. One is dropping a complete factor group, and the other is dropping a distinction between two or more factors. (More complex things one can do include forming the crossproduct of two factor groups, to model factor group interaction, or totally reclassifying the data, perhaps according to new factors that are boolean combinations of old factors.) If you drop a factor group, the number of parameters you are no longer estimating is one less than the number of factors in the group (since the value of the final one is determined by the others). If you collapse k factors together, then you have removed k - 1 estimated parameters [actually, this one rule covers both cases!]. To test collapsing factors together, you need to jot down the log likelihood of the current model, l1, and then collapse the factors together by doing a Recode setup from the Tokens window to make a Conditions file. Then run Varbrul on the data again, and get a new log likelihood l2. Since we're already in log space, log(Lambda) is l2 - l1. We multiply that number by -2 [or simply do the subtraction the other way round, and double!], and then see if that number is big enough to be significant according to a chi-square distribution with the appropriate number of degrees of freedom. [For the simplest case, of collapsing together 2 factors at one step (1 degree of freedom), we require that number to be greater than 3.84 to have an argument against collapsing at the p = 0.05 significance level, or 6.63 at the p = 0.01 significance level.] To consider the hypothesis that a whole factor group is irrelevant to the response variable, you could do the same thing manually (choosing to just Exclude a factor group), but in this case (and only this case) you can get the program to do it for you by choosing Binomial, Up and Down, which will do a stepwise logistic regression to try to find the smallest set of factor groups that gives a model whose likelihood is as good as any, up to variations that aren't statistically significant.
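Here is the bookkeeping from the paragraph above as a short Python sketch. The two log likelihoods are invented, standing in for numbers you would copy down from two Varbrul runs, and the helper names are hypothetical; the critical values are the ones quoted in the text for 1 degree of freedom.

```python
def params_removed_by_collapsing(k):
    """Collapsing k factors into one removes k - 1 estimated parameters."""
    return k - 1


def params_removed_by_dropping(group_factors):
    """Dropping a whole group is the same rule applied to all its factors."""
    return len(group_factors) - 1


def g2_from_logliks(l1, l2):
    """l1: log likelihood before recoding; l2: after (the simpler model)."""
    return 2.0 * (l1 - l2)  # same as -2 * (l2 - l1)


# Critical values quoted in the text for 1 degree of freedom.
CRITICAL_1DF = {0.05: 3.84, 0.01: 6.63}


def keep_distinction(l1, l2, alpha=0.05):
    """True if merging two factors makes the model significantly worse."""
    return g2_from_logliks(l1, l2) > CRITICAL_1DF[alpha]


# Invented log likelihoods from two hypothetical runs.
l1, l2 = -1520.3, -1523.1
print("G2:", round(g2_from_logliks(l1, l2), 2))
print("argue against collapsing at 0.05?", keep_distinction(l1, l2))
print("argue against collapsing at 0.01?", keep_distinction(l1, l2, alpha=0.01))
print("df for dropping a 4-factor group:", params_removed_by_dropping(["1", "2", "3", "4"]))
```

When more than two factors are collapsed at once, or a whole group is dropped, the degrees of freedom exceed 1 and the two critical values above no longer apply; a chi-square table for the right number of degrees of freedom is needed instead.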

Christopher Manning

Last modified: Tue Feb 26 23:17:12 PST 2002