In this assignment, we’ll build logistic regression models using a version of Varbrul. To do this assignment, you’ll need a copy of Varbrul. You can download Goldvarb 2.1 for the Mac from http://www.crm.umontreal.ca/~sankoff/GoldVarb_Eng.html and you can download GoldVarb 2001, a version for Windows, at http://www.york.ac.uk/depts/lang/webstuff/goldvarb/ . These sites have online copies of the documentation for both packages (and a copy also comes in the software you download – though the Mac version is from MacWrite which you may no longer be able to MacRead). The manual for Windows is at http://www.york.ac.uk/depts/lang/webstuff/goldvarb/manual/manualOct2001.html and for the Mac at http://www.crm.umontreal.ca/~sankoff/GoldVarbManual.Dir/ . Although there have at times been versions for Unix and other larger computers (Sankoff discusses one such version in his article, and you can read some history at http://www.ling.upenn.edu/~ellen/varb.html), as far as I am aware, these are the only versions currently widely available.

You’ll also need some data.
Here’s some data from Henrietta Cedergren's 1973 study of /s/-deletion
in Panamanian Spanish (via Greg Guy and Scott Kiesling). It’s available in
Windows format as http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama-win.tkn
and in Mac format as http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama-mac.tok
(these two formats differ in the end of line character used: the file I had was
in Mac-format from GoldVarb 2.1, and my memory is that the Mac version only
works if the line ends are right, so I thought the same might be true for the
PC version, but I haven’t tested this all thoroughly...). Varbrul token files represent the
individual data tokens, one per line, in a fairly bizarre format apparently left over from about the 1960s.
Each data observation or *token* is on a separate line and is
preceded by a ( character.
Thereafter, each character position by default represents a *factor group* (the response variable or an explanatory variable).
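To make this concrete, here are a few lines in the shape a Varbrul token file takes. These particular tokens are invented for illustration (they use the coding scheme for the Panama data described below): each line opens with (, the first column is the response, and the remaining columns are the explanatory factor groups.

```
(1mC4
(0nV1
(1dP3
```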

You’ve been given a token file which
represents data from Cedergren's 1973 study of final *s*-deletion in Panama City, Panama. Cedergren had noticed that
speakers in Panama City, like in many dialects of Spanish, variably deleted the
*s* at the end of words. She undertook
a study to find out if there was a change in progress: if final /s/ was
systematically dropping out of Panamanian Spanish. The attached data are from
interviews she performed across the city in several different social classes,
to see how the variation was structured in the community. She also investigated
the linguistic constraints on deletion, so she coded for a phonetic constraint
— whether the following segment was consonant, vowel, or pause — and the
grammatical category of the word that the /s/ is part of:

monomorpheme, where the *s* is part of the free morpheme (eg, *menos*)

verb, where the *s* is the second singular inflection (eg, *tu tienes*, *el tienes*)

determiner, where *s* is plural marked on a determiner (eg, *los*, *las*)

adjective, where *s* is a nominal plural agreeing with the noun (eg, *buenos*)

noun, where *s* marks a plural noun (eg, *amigos*)

The codes are as follows:

FG1: s-deletion
1 = deletion
0 = non-deletion

FG2: grammatical category
m = monomorpheme
v = verb (second singular inflection)
d = nominal plural marker in a *determiner*
a = nominal plural marker in an *adjective*
n = nominal plural marker in a *noun*

FG3: following phonetic environment
C = consonant
V = vowel
P = pause

FG4: social class
1 = highest
2 = second highest
3 = second lowest
4 = lowest

Your task is to perform an analysis of these data using Varbrul as a tool: Under what conditions is /s/ deleted most often?

In this file, after the parenthesis, the first column will be whether /s/
was deleted (1) or not (0). The
following columns are the other factor groups, as just laid out. I might add that there is work within
the categorical data analysis community on logistic regression models that
treat ordinal variables specially (you can get extra power by knowing that the
values are ordered), but AFAIK, GoldVarb does not implement this (such methods are
described in Chapter 8 of Agresti, *Categorical Data Analysis*, Wiley, 1990, and implemented in some statistics programs, such as SPSS-X or SAS). This would have been appropriate for social class. Download this file onto your Mac/PC (it’s probably safest to do any “Save as” as source/raw/all files rather than text).
“Statistics” can be defined as numerical summaries of the raw data, and
so a statistics package would more commonly start with some kind of numerical
summary. Varbrul generates one of these as it goes, but you can also get something roughly along those lines (still slightly imperfectly sorted, sorry) by doing grep '^(' panama-win.tkn | sort | uniq -c | sort -k 2.4, giving this data: http://www-nlp.stanford.edu/~manning/courses/ling236/handouts/panama.cnts
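If you'd rather not rely on shell tools, the same counts are easy to reproduce in a few lines of Python. This is just a sketch of the counting step; the sample lines below are made up, not the real Panama data:

```python
from collections import Counter

def count_tokens(lines):
    """Tally identical token codes, like grep '^(' file | sort | uniq -c."""
    return Counter(line.strip() for line in lines if line.startswith("("))

# Made-up sample lines in the Panama coding scheme.
sample = ["(1mC4", "(1mC4", "(0nV1", "(1dP3", "(0nV1", "(0nV1"]
for token, n in sorted(count_tokens(sample).items()):
    print(n, token)
```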

I started writing instructions for the PC version of GoldVarb below. However, if you’re using a Mac, the particulars of the interface are somewhat different, even though the essential functionality is the same. At any rate, it meant that I ended up starting to retype a lot of the manual, which also takes you through sample runs. So, I gave up. Therefore, as well as paying attention in class, you should read the manual for help in doing this assignment. In case it helps, here are copies of the sample Nepean files discussed in the manuals: http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nepean-mac.tok or http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nepean-win.tkn . And, finally, here are copies of the Nasal deletion data that I discussed in class: http://nlp.stanford.edu/~manning/courses/ling236/handouts/Nasal-mac.tok

Run GoldVarb. From the main screen, ask to View | Tokens. In the Tokens window that then appears, do File | Load, and load the Panama data. Since we’re lazy, let’s just get the program to define the factor groups for us based on what is in the file. Do Action | Generate Factor Spec’s. A Groups window should appear. It should list the 4 factor groups with the factors as discussed in the preceding paragraph. (Again, in Varbrul-speak, a factor group is a variable, and a factor is a categorical value of such a variable.) To make sure everything is hunky dory, you might want to do Action | Check Tokens in the Tokens window. This checks for things in the tokens file that are not listed in the Factor Groups specification. However, since we generated the factors from the tokens rather than specifying them by hand, it would be rather unsettling if this check indicated any errors.

GoldVarb has extensive facilities for mapping from a coding in a data file
to a derived coding scheme (by merging factors, or by doing more complicated
things using ANDs and ORs of underlying factors). You will want to explore
simple recodings in the assignment to see if factors are distinct, but,
initially, if we don’t want to do any of that, we simply choose from the Tokens
window Actions | No Recode. This produces a null recoding. In any conditions
file, including this null one, the first condition is treated as the dependent
variable. Felicitously, this is just what we want. From the main Goldvarb 2001
window, choose View | Results, and then from the Results window choose Action |
Load Cells to Memory. Click OK to anything that comes up. (The first
condition/factor group is treated as the dependent variable. GoldVarb does only
binary logistic regression (the Multinomial menu option is an unimplemented
no-operation). If you give two values of the factor group – there are only 2
here – then it does a binomial logistic regression between those two factors.
If you mention only one, it does a binomial logistic regression between that
factor and all other factors in the factor group combined.) Then *wait*
until the Results window fills with text (this takes a few seconds even on my
machine, and you don’t seem to get an hourglass). The result will show you
marginals of the rate of application (here /s/ deletion) for each value of each
explanatory variable. Looking at the marginals is useful from the point of view
of exploratory data analysis, and might suggest a recoding of variables.
Technically, the main thing to check for is that no factors are categorical
(100% or 0% rule application). While categorical cases can be viewed as limit cases of a Varbrul model, Varbrul only analyzes data where there is variation.
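The marginals and the categorical-factor check are simple enough to compute yourself. Here is a small Python sketch, assuming the tokens have already been parsed into tuples following the coding scheme above (the rows here are invented, not the real data):

```python
from collections import defaultdict

# Hypothetical parsed tokens: (deletion, gram_cat, following, class).
# These few rows are made up for illustration.
tokens = [
    ("1", "m", "C", "4"), ("0", "m", "V", "1"), ("1", "n", "C", "4"),
    ("0", "n", "P", "2"), ("1", "d", "C", "3"), ("0", "a", "V", "1"),
]

GROUPS = ["deletion", "gram_cat", "following", "class"]

def marginals(tokens):
    """Rate of rule application (deletion = '1') for each factor value."""
    rates = {}
    for i, name in enumerate(GROUPS[1:], start=1):
        counts = defaultdict(lambda: [0, 0])  # factor -> [applications, total]
        for tok in tokens:
            counts[tok[i]][0] += tok[0] == "1"
            counts[tok[i]][1] += 1
        rates[name] = {f: apps / n for f, (apps, n) in counts.items()}
    return rates

for group, vals in marginals(tokens).items():
    for factor, rate in sorted(vals.items()):
        flag = "  <-- categorical!" if rate in (0.0, 1.0) else ""
        print(f"{group} {factor}: {rate:.2f}{flag}")
```

On real data, any factor flagged as categorical (0% or 100% application) needs recoding or exclusion before Varbrul can analyze it.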

**1.** Once you have generated your initial cell results, discuss the effect
of factors just on the percentages (and report these percentages): Which class
favors deletion most? Following segment? Grammatical category? Why do you think
these factors pattern the way they do? Are there any patterns that surprise
you?

**2.** Which factor group has the strongest effect? Which has the
weakest?

**3.** Would you guess that all the factors in this analysis will be
significantly different from one another?

From the Results window, choose Binomial, 1 level. This does a logistic
regression analysis with the variables and values as defined, and produces both
a Binomial Varbrul results window, and a Scattergram. The Scattergram shows the
actual rate of ‘application’ of a variable rule (i.e., the proportion of one
value of the dependent variable) vs. that predicted by the program. In a good
model, this relation should be linear. If it is badly nonlinear, this suggests
interactions or other necessary factors that are not being modeled. In the Mac
version, you can click on points to identify them. That doesn’t seem to work on
the Windows version. Pity. (Binomial, 1 level does a simple logistic regression with all the variables.) Then explore Binomial, up and down, which does a stepwise logistic regression to see which factor groups do or don’t improve the model.
*Important:* doing this only explores adding and deleting whole factor
groups. You will also want to explore collapsing factors within factor groups.
To do this, you have to use the Recode options and manually collapse together
factors that you think might not have a substantially different effect. You then also have to compare log
likelihoods *by hand* to see if the model gets significantly worse or not.
See the manual, or remember what we did in class!
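Under the hood, a Binomial, 1 level run is a maximum likelihood fit of a logistic regression with one-hot coded factor groups. As a rough illustration of what that involves (this is not GoldVarb's actual algorithm, which uses its own iterative fitting scheme and a centered "factor weight" parameterization), here is a self-contained gradient-ascent sketch on made-up tokens:

```python
import math

# Made-up tokens: (deleted?, gram_cat, following, class); not the real data.
tokens = [
    (1, "m", "C", "4"), (0, "m", "C", "4"), (1, "n", "C", "4"),
    (0, "n", "V", "2"), (1, "d", "C", "3"), (0, "a", "V", "1"),
    (1, "m", "P", "3"), (0, "v", "V", "2"), (1, "n", "C", "4"),
    (0, "m", "V", "1"),
]

# One-hot code every factor of every explanatory group, plus an intercept.
features = sorted({(i, tok[i]) for tok in tokens for i in (1, 2, 3)})

def encode(tok):
    return [1.0] + [1.0 if tok[i] == f else 0.0 for (i, f) in features]

def log_likelihood(w, data):
    ll = 0.0
    for tok in data:
        z = sum(wi * xi for wi, xi in zip(w, encode(tok)))
        p = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p if tok[0] == 1 else 1.0 - p)
    return ll

def fit(data, steps=2000, lr=0.1):
    """Maximize the log likelihood by plain gradient ascent."""
    w = [0.0] * (1 + len(features))
    for _ in range(steps):
        grad = [0.0] * len(w)
        for tok in data:
            x = encode(tok)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (tok[0] - p) * xj  # gradient of the log likelihood
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

w = fit(tokens)
print("log likelihood of fitted model:", log_likelihood(w, tokens))
```

The log likelihood printed at the end is exactly the quantity you jot down from Varbrul's results window when comparing models by hand.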

Once you have run Varbrul on the Panama data:

**4.** Report the results of the one level and stepwise analysis. Look
for badly modeled cells or signs of interaction, etc.

**5.** Within each factor group, are there any factors that you think
should be combined? If so, combine them and present the findings of your reanalysis.
The final analysis you present should be the one you think is most efficient,
with the minimum number of predictive parameters required to model the data
adequately.

I got a bit rushed at the end of last time. To assuage my guilt, here's a slightly more detailed discussion of comparing models to find the best logistic regression model for the data. There are two parts to this. One is the general idea of likelihood ratio tests, and the particular instance of that for logistic regression models. This is a useful general technique, and a good one to understand. The other half is the particularities of how this is implemented and realized in Varbrul.

**Likelihood ratio tests:** The likelihood ratio test here is
*exactly the same one* we saw in week 3. It's just being used in a
new context. The first fundamental idea behind the
likelihood ratio test is that we would like to choose a model that gives high
likelihood to the observed data. We have two different models of differing
complexity, for example, one may seek to model nasal deletion based just on the
type of the nasal, and the other might model nasal deletion based on the type
of the nasal and the following context. For each, we will normally have set the
numeric parameters so as to find the maximum likelihood model
within that model class. Note that if one model is a subset of the other one
(as in my example above), then the more complex model must score at least as
well in the likelihood it assigns the data, and usually one would expect it to
do at least a fraction better, since a model with more parameters can capture
some of the random variation in the observed data which isn't
statistically significant.
Beyond this point, we are working in the traditional hypothesis testing
framework of (frequentist) statistics. Our null hypothesis H_{0}
is that the simpler
model is adequate. We seek to disconfirm the null hypothesis by establishing
whether there is sufficient evidence that the better fit of the more
complex model cannot reasonably be attributed to modeling chance
occurrences in the observed data. The likelihood ratio is:

Lambda = (maximum likelihood for model H_{0}) / (maximum likelihood for the more complex model)

The test statistic for a likelihood ratio is G^{2} = -2 log(Lambda) [using a natural logarithm].
The likelihood-ratio chi-squared statistic *G*^{2} will take
a minimum value of 0 when the likelihood of the two models is identical,
and will take high values as the more complex model becomes much more
likely (i.e., by orders of magnitude). This statistic (for reasonably
large data sets) is approximately chi-square distributed, so we test for
significance against a standard chi-square distribution. The number of
degrees of freedom to use is the difference in the number of estimated
parameters between the two models.

**Varbrul:** There are two basic cases to consider. One is dropping a
complete factor group, and the other is dropping a distinction between
two or more factors. (More complex things one can do include forming the
crossproduct of two factor groups, to model factor group interaction, or
totally reclassifying the data, perhaps according to new factors that
are boolean combinations of old factors.) If you drop a factor group,
the number of parameters you are no longer estimating is one less than
the number of factors in the group (since the value of the final one is
determined by the others). If you collapse *k* factors together, then
you have removed *k* - 1 estimated parameters [actually, this one
rule covers both cases!]. To test collapsing factors together, you need to jot down the log likelihood of the current model *l*_{1}, then collapse the factors together by doing a Recode setup from the Tokens window to make a Conditions file. Then run Varbrul on the data again, and get a new
log likelihood *l*_{2}. Since we're already in log space,
log(Lambda) is *l*_{2} - *l*_{1}. We multiply
that number by -2 [or simply do the subtraction the other way round, and
double!], and then see if that number is big enough to be significant
according to a chi-square distribution. [For the simplest case, of collapsing together 2 factors at one step, we require that number to be greater than 3.84 to have an argument *against* collapsing at a *p* = 0.05 confidence level, or 6.63 at a *p* = 0.01 confidence level.] To consider the hypothesis that a whole factor group is irrelevant to the response variable, you could do the same thing manually (choosing to just Exclude a factor group), but in this case *only* you can get the program to do it for you by
choosing Binomial, Up and Down, which will do a stepwise logistic
regression to try to find the smallest set of factors that give a model
whose likelihood is as good as any up to variations that aren't
statistically significant.
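The by-hand comparison above is easy to script. Here is a small Python sketch of the likelihood ratio test for the one-degree-of-freedom case (collapsing two factors at one step); the log likelihood values are made up for illustration, and the chi-square tail probability for df = 1 uses the identity P(X > x) = erfc(sqrt(x/2)):

```python
import math

def lr_test_df1(ll_full, ll_reduced):
    """Likelihood ratio test with 1 degree of freedom.

    ll_full    -- log likelihood of the richer model (before collapsing)
    ll_reduced -- log likelihood of the simpler model (after collapsing)
    Assumes ll_full >= ll_reduced (the richer model never fits worse).
    Returns (G2, p): the statistic and its chi-square(df=1) tail probability.
    """
    g2 = -2.0 * (ll_reduced - ll_full)   # G^2 = -2 log(Lambda)
    p = math.erfc(math.sqrt(g2 / 2.0))   # chi-square survival fn., df = 1
    return g2, p

# Made-up log likelihoods, purely for illustration.
g2, p = lr_test_df1(ll_full=-320.5, ll_reduced=-322.8)
print(f"G2 = {g2:.2f}, p = {p:.3f}")
if p < 0.05:
    print("Significantly worse: keep the factors distinct.")
else:
    print("Not significantly worse: collapsing is justified.")
```

Note that G^2 crosses the 3.84 threshold from the text exactly where p drops below 0.05, so the two ways of stating the test agree.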

Another reference on getting started with Varbrul is available from Naomi Nagy.

For the statistically more hard core, there is a (partial) R implementation of
Varbrul by John Paolillo:
http://ella.slis.indiana.edu/~paolillo/projects/varbrul/rvarb/.
Also, Paolillo's website on his book *Analyzing Linguistic
Variation* has a number of useful resources:
http://ella.slis.indiana.edu/~paolillo/projects/varbrul/.

Last modified: Tue Feb 26 23:17:12 PST 2002