Vsevolod Kapatsinski, Indiana University
Frequency and cohesion: Evidence from repair
Bybee (2002) proposes that units used together fuse together. The present corpus study tested this hypothesis in the domains of repetition and replacement repair. Repetition repair, illustrated in (1)-(2), occurs when a speaker repeats a word s/he has just produced or started producing. Replacement repair, illustrated in (3)-(4), occurs when the speaker replaces a word that has just been (partially) produced with another one. As examples (1)-(4) show, there is variability in whether word production is interrupted or goes to completion before the repair. The hypothesis that high frequency leads to increased cohesion predicts that high-frequency words should be less likely to be interrupted, as in examples (2) and (4), and more likely to be produced to the end, as in examples (1) and (3), before they are repeated or replaced.
First, 1018 single-word repetitions found on Switchboard were analyzed. At every word length (in segments), words that were interrupted had significantly lower token frequency than words produced completely. Thus, high-frequency words are less likely to be split than low-frequency words. Longer words are also more likely to be split than shorter words. When number of segments, number of syllables and log word frequency were entered into a multinomial logistic regression with word interruption (interrupted vs. completed) as the dependent variable, number of segments and word frequency were significant at the .001 level. Another logistic regression analysis showed that the frequency effect was attributable to surface/wordform frequency and not base/root frequency, supporting the idea that words and not just morphemes are stored in the mental lexicon (Bybee 1985).
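The kind of model described above can be sketched as follows. This is only an illustration under stated assumptions, not the analysis actually run: the data are synthetic (generated so that longer words are interrupted more often and more frequent words less often), number of syllables is left out for brevity, and the fitting routine `fit_logistic` is a hypothetical plain gradient-ascent helper rather than a statistics package.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def standardize(values):
    """Rescale a predictor to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def fit_logistic(rows, labels, steps=300, rate=0.5):
    """Fit a binary logistic regression by gradient ascent on the log-likelihood.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    n, k = len(rows), len(rows[0])
    weights = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))
            err = y - p
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * x[j]
        weights = [w + rate * g / n for w, g in zip(weights, grad)]
    return weights


# Synthetic stand-in for the repair tokens (NOT the Switchboard data).
rng = random.Random(0)
segments = [rng.randint(2, 10) for _ in range(1000)]
log_freq = [rng.uniform(0.0, 10.0) for _ in range(1000)]
interrupted = [
    1 if rng.random() < sigmoid(-1.0 + 0.4 * s - 0.5 * f) else 0
    for s, f in zip(segments, log_freq)
]

rows = list(zip(standardize(segments), standardize(log_freq)))
intercept, b_segments, b_log_freq = fit_logistic(rows, interrupted)
print("segments:", round(b_segments, 2), "log frequency:", round(b_log_freq, 2))
```

With data of this shape, a positive coefficient for number of segments and a negative coefficient for log frequency correspond to the pattern reported here: longer words are more likely, and more frequent words less likely, to be interrupted.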
It is likely that some repetition repairs are actually interrupted replacement repairs, where the speaker considers replacing the repeated word but in the end decides not to. Interrupted replacement repairs could be argued to be more likely with low-frequency words, and more likely to result in word interruption, than genuine repetitions. To avoid this confound, we also tested the 2839 replacement repairs found on Switchboard. As with repetition repairs, interrupted words had lower frequency than those produced to the end at every word length (measured in either segments or syllables). Thus a high-frequency word is less likely to be interrupted than a low-frequency word even if it is to be replaced.
While these results are consistent with the cohesion hypothesis, they could also be explained by duration differences between high-frequency and low-frequency words. Since high-frequency words tend to be shorter in duration, even when number of segments is controlled, they provide less opportunity for the replacing word to be accessed before the production of the replaced word is complete. Durations of the replaced words were measured both in conversation, where ten tokens per word were used, and in citation form. Frequency remained a significant predictor of word interruption (p<.001) when duration was statistically controlled by entering duration and log frequency as covariates into a logistic regression.
The possibility remained that the frequency of the replacing word was responsible for the interruption of the replaced word. Since high-frequency words are more accessible, the replacing word would be more likely to become available during the production of the replaced word if the replacing word has high frequency. If high frequency of the replacing word correlated with low frequency of the replaced word, low-frequency words would be more likely to be interrupted simply because alternatives would come to mind faster. To test this hypothesis, we correlated the frequency of the replaced word with the frequency of the replacing word. The correlation was positive rather than negative, so the frequency of the replacing word could not be responsible for the observed effect.
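The logic of this check can be illustrated with a small sketch. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the log-frequency pairs below are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log frequencies, one pair per replacement repair
# (replaced word, replacing word). Not taken from Switchboard.
replaced_log_freq = [2.0, 3.5, 5.1, 6.8, 4.2]
replacing_log_freq = [2.4, 3.0, 5.5, 6.1, 4.9]

r = pearson_r(replaced_log_freq, replacing_log_freq)
print("r =", round(r, 3))
```

The alternative explanation requires a negative coefficient (frequent replacing words paired with infrequent replaced words). A positive coefficient, as in the study, means that infrequent replaced words tend to have infrequent replacements, so faster access to the replacing word cannot account for their higher interruption rate.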
Thus, the study provides strong evidence for the hypothesis that words are stored in the lexicon and differ in cohesion, such that high-frequency words are more cohesive than low-frequency words. Cohesion could affect interruption in two ways: interruption could be delayed in high-frequency words, or interruption could be dispreferred in high-frequency words. The delay hypothesis predicts that words interrupted early in their production should have lower frequency than those interrupted late. We found no frequency differences between words interrupted close to their beginnings and words interrupted close to their ends. Thus we argue that speakers prefer not to interrupt highly cohesive units, rather than being subject to a neuromotor inertia that prevents them from interrupting production as soon as the decision is made.
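One way to compare the frequencies of early-interrupted and late-interrupted words is a permutation test on the difference in mean log frequency. The abstract does not name the test that was used, so this is only a sketch of one possibility; the function name `permutation_p_value` and the example values are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign words to the "early" and "late" groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical log frequencies of words interrupted near their beginnings
# and near their ends. Not taken from Switchboard.
early_log_freq = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3]
late_log_freq = [2.5, 3.1, 2.0, 3.8, 3.0, 3.2]

p = permutation_p_value(early_log_freq, late_log_freq)
print("p =", round(p, 3))
```

Under the delay hypothesis, the early group should have reliably lower frequency and the p-value should be small. A large p-value, as with heavily overlapping groups like these, is in line with the absence of a frequency difference reported above, although a null result by itself does not prove that no difference exists.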
Data (Switchboard, Godfrey et al. 1992):
(1) We’ve gotten, gotten pulled into these superfund deals
(2) He is living now in Maryland but he li-, lived in Grapevine for a long time
(3) We were surprised to find ‘Toyota’ written, I mean, imprinted on the engine
(4) It was pathe-, it was horrible.
Bybee, J. 1985. Morphology: A Study of the Relation between Meaning and Form. Amsterdam: John Benjamins.
Bybee, J. 2002. Sequentiality as the basis of constituent structure. In: T. Givon and B. F. Malle (eds.) The Evolution of Language out of Pre-Language, 109-134. Amsterdam, Philadelphia: John Benjamins.
Godfrey, J. J., E. C. Holliman, and J. McDaniel. 1992. SWITCHBOARD: Telephone Speech Corpus for Research and Development. IEEE ICASSP: I517-I5.