Lund University, Dept. of Linguistics and Phonetics
Working Papers 53, 61-79

Lexical diversity and lexical density in speech and writing: a developmental perspective

Victoria Johansson

Introduction

Literature about early, pre-school lexical development often mentions vocabulary development.
As an example, the reader of Handbook of child language (Fletcher and MacWhinney 1.
is referred to 'vocabulary development' when looking up the term 'lexical development'.
The same term is used by e.g. Dromi 1999 in her overview of early lexical development, and the index in David Crystal's The Cambridge encyclopedia of language refers to 'vocabulary' from the index entry 'lexicon'.
This article will compare two measures that have often been used to describe later lexical development: lexical diversity and lexical density.
Lexical diversity is a measure of how many different words are used in a text, while lexical density provides a measure of the proportion of lexical items (nouns, verbs, adjectives and some adverbs) in the text.
Both measures have the advantage of being easy to operationalise, and also practical to apply in computer analyses of large data corpora.
Further, both lexical diversity and lexical density have been shown to be significantly higher in writing than in speaking (Ure 1971; Halliday 1985).
One conclusion from this could be that the two measures are interchangeable, and that we will encounter a similar developmental pattern independent of the measure used for describing lexical development.
It is, however, theoretically possible that a text has high lexical diversity (i.e. contains many different word types), but low lexical density (i.e. contains many pronouns and auxiliaries rather than nouns and lexical verbs), or, vice versa, that a text has low lexical diversity (i.e. the same words or phrases are repeated over and over) but high lexical density (i.e. the words that are repeated are nouns, adjectives or verbs).
Lexical diversity is often used as an equivalent to lexical richness (e.g. by Daller, van Hout & Treffers-Daller 2003).
However, Malvern et al. 2004 begin their book about lexical diversity by discussing the difference between lexical diversity and lexical richness, stating, along the lines of Read, that the lexical diversity measure is only one part of the multidimensional feature of lexical richness.
Other factors proposed by Read are lexical sophistication, number of errors, and lexical density (Read 2.
I side with Read and Malvern et al.: neither lexical diversity nor lexical density is the one and only measure.
However, both measures are easily accessible and easy to apply to corpora of different kinds.
No doubt they also provide important insights into the texts, and as long as the measures are not used as the only way to judge a text qualitatively, they are very useful.
The aim and outline of the study

This study focuses on developmental patterns in terms of the measures lexical diversity and lexical density.
I will examine whether these measures are sensitive to genre (narrative vs. expository) and modality (writing vs. speech).
Another goal is to investigate to what extent the two measures are interchangeable.

The article starts with a theoretical background on the two measures, followed by a presentation of the data, then moves on to statistical analyses presented measure by measure, age group by age group, and ends with a general discussion and a conclusion.
Lexical diversity

The more varied a vocabulary a text possesses, the higher its lexical diversity.
For a text to be highly lexically diverse, the speaker or writer has to use many different words, with little repetition of the words already used.
The type-token ratio

The traditional lexical diversity measure is the ratio of different words (types) to the total number of words (tokens), the so-called type-token ratio, or TTR (Lieven 1978; Bates, Bretherton & Snyder).
A problem with the TTR measure is that text samples containing large numbers of tokens give lower values for TTR and vice versa.
The reason for this is that the number of word tokens can increase infinitely, and although the same is true for word types, it is often necessary for the writer or speaker to re-use several function words in order to produce one new word.
This implies that a longer text in general has a lower TTR value than a shorter text, which makes it especially complicated to use TTR in developmental comparisons, e.g. between age groups, where the number of word tokens often increases with age.
Gayraud 2000 compares TTR and the number of word tokens and shows that although the number of word tokens increases substantially with the speaker/writer's age, the TTR drops.
One consequence of this is that TTR can only be used when comparing texts of equal length.
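The length effect described here can be illustrated with a minimal sketch. It assumes that a text has already been split into word tokens; the function name and the toy sentence are made up for illustration and are not taken from the study.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of word tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)


# Toy example (hypothetical): the same 9-token sentence, once and repeated 10 times.
sentence = "the cat saw the dog and the dog ran".split()
short_text = sentence          # 9 tokens, 6 types
long_text = sentence * 10      # 90 tokens, still 6 types

ttr_short = type_token_ratio(short_text)
ttr_long = type_token_ratio(long_text)

# The longer text adds tokens but no new types, so its TTR is lower.
print(f"short: {ttr_short:.3f}  long: {ttr_long:.3f}")
```

The repeated text is an extreme case, but it shows why two texts of different lengths cannot be compared directly on TTR.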
In spite of this, TTR is still used for comparing text production, for instance between children's texts, or between various groups with language impairment.
For instance, TTR is part of the SALT (Systematic Analysis of Language Transcripts) programs, a set of computer programs developed by Miller and Chapman in order to quantify developmental aspects of speech for typically as well as atypically developing children (Miller & Klee).
A variant of the TTR measure is the so-called index of Guiraud.
This measure divides the number of word types by the square root of the number of word tokens.
Other proposed variants are Advanced TTR and Guiraud Advanced, used for instance by Daller et al. 2003.
Vermeer 2000 discusses TTR and various other measures, and their use in both first and second language acquisition.
She concludes her discussion by proposing that lexical richness can be more successfully measured by exploring the degree of difficulty of the words in a text, as measured by their frequency in everyday life.
Theoretical vocabulary

Other ways around the TTR problem have been proposed and used.
One is the so-called theoretical vocabulary (see e.g. Broeder, Extra & van Hout).
The principle behind this measure is to pick a number of words (e.g. 100 words) from a text at random, and calculate the number of word types in the sample.
The theoretical vocabulary takes into account all possible ways of choosing 100 words from the text.
In this way, one can compare texts of different lengths, with the only restriction that the shortest text limits the maximal number of random words to be picked.
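One way to read "all possible ways of choosing 100 words" is as the expected number of word types in a random sample drawn without replacement. The sketch below computes that expectation exactly from the type frequencies. It is an assumption about the principle, not a description of how the Vocab program works, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def expected_types(tokens, sample_size=100):
    """Expected number of distinct types in a random sample of `sample_size`
    tokens drawn without replacement, averaged over all possible samples."""
    n_tokens = len(tokens)
    if not 0 < sample_size <= n_tokens:
        raise ValueError("sample_size must be between 1 and the text length")
    total_ways = comb(n_tokens, sample_size)
    expected = 0.0
    for freq in Counter(tokens).values():
        # Probability that none of this type's `freq` tokens is drawn.
        p_missing = comb(n_tokens - freq, sample_size) / total_ways
        expected += 1.0 - p_missing
    return expected


# Toy texts (made up): ten different words vs. one word repeated 20 times.
all_different = list("abcdefghij")
all_same = ["x"] * 20
print(expected_types(all_different, 5))  # every sampled token is a new type
print(expected_types(all_same, 5))       # only one type can ever be drawn
```

Because the sample size is fixed, the resulting values are comparable across texts of different lengths, provided that the sample size does not exceed the length of the shortest text.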
Johansson 1999 uses theoretical vocabulary for comparing spoken and written expository texts between a group of Swedish university students and 12-year-olds.
In this case the program Vocab (developed by Leif Gronqvist, Department of Linguistics, Goteborg University) was used for calculating theoretical vocabulary.
The result shows that the lexical diversity is higher in writing than in speech for both the adults and the 12-year-olds.
The adults have higher lexical diversity than the 12-year-olds.
Vocab was also used by Wengelin 2002 to compare written texts in various genres from three populations: a group of adult controls, a group of congenitally deaf adults, and a group of adults with reading and writing difficulties.
The adult controls had higher diversity than the other groups.
Some of the written texts had spoken equivalents, and Wengelin was able to show that the control group had a greater difference between their spoken and written texts than the group with reading and writing difficulties.
VocD

In order to compare texts of different lengths, a measure independent of sample size is required.
One such measure is the D measure developed by Brian Richards and David Malvern (Richards & Malvern 1997; Malvern et al. 2004; MacWhinney).
The D measure is based on the predicted decline of the TTR as the sample size increases.
This mathematical curve is compared with empirical data from a text sample.
For calculating D, information from the whole text sample is used (the minimum length of the text is 50 words, however).
A higher value of D indicates higher lexical diversity, and thus a richer vocabulary.
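The curve-fitting idea can be sketched as follows. The model equation TTR = (D/N)(sqrt(1 + 2N/D) - 1), the sample sizes of 35 to 50 tokens and the 100 random samples per size are assumptions based on common descriptions of the procedure; they are not stated in this article. The grid search and all function names are simplifications for illustration, not the CLAN implementation.

```python
import math
import random


def model_ttr(n_tokens, d):
    """TTR predicted for a sample of n_tokens under diversity parameter d
    (assumed model: TTR = (D/N) * (sqrt(1 + 2N/D) - 1))."""
    return (d / n_tokens) * (math.sqrt(1.0 + 2.0 * n_tokens / d) - 1.0)


def empirical_curve(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Mean TTR of random samples (without replacement) for each sample size."""
    rng = random.Random(seed)
    curve = []
    for size in sizes:
        total = sum(len(set(rng.sample(tokens, size))) / size for _ in range(trials))
        curve.append((size, total / trials))
    return curve


def fit_d(curve, max_d=200.0, step=0.1):
    """Grid search for the d that minimises the squared error to the curve."""
    best_d, best_err = None, float("inf")
    for k in range(1, int(round(max_d / step)) + 1):
        d = k * step
        err = sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in curve)
        if err < best_err:
            best_d, best_err = d, err
    return best_d


def estimate_d(tokens, seed=0):
    """Estimate D for a tokenised text of at least 50 tokens."""
    if len(tokens) < 50:
        raise ValueError("at least 50 tokens are needed")
    return fit_d(empirical_curve(tokens, seed=seed))


# Toy texts (made up): 5 types repeated vs. 60 types each used twice.
repetitive = ["a", "b", "c", "d", "e"] * 20
varied = [str(i % 60) for i in range(120)]
print(estimate_d(repetitive), estimate_d(varied))
```

In this sketch the more varied toy text receives the higher estimate, which matches the interpretation that a higher D indicates higher lexical diversity.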
The D measure is implemented in the most recent versions of CLAN (MacWhinney), under the name VocD.
The measure VocD is described at length in Malvern et al. 2004, with many examples and references to previous studies about lexical measures.
Although Malvern and Richards claim that VocD permits comparisons between texts of unequal length, not everybody is convinced that the text length factor is completely eliminated by using VocD.
The D measure has been criticised, for instance by Daller et al.
2003, who instead prefer the index of Guiraud.
Malvern and Richards' D measure is severely criticised by McCarthy and Jarvis 2007 for not being insensitive to text lengths.
McCarthy and Jarvis compare D to 13 alternative methods for measuring lexical diversity.
They conclude that D (or VocD) performs better than most alternatives, but that there are better options.
However, another conclusion is that the length of texts one wants to compare should determine which measure one uses, since some measures are more effective within certain ranges.
Their analysis shows that D is the second best of all measures within the text length of 100-400 word tokens, which is also what is claimed in Malvern et al.
McCarthy and Jarvis 2007:483 finish by questioning "whether a single index has the capacity to encompass the construct of lexical diversity".
Stromqvist et al.
2002 used VocD to compare spoken and written expository and narrative texts produced by adults from four countries.
The results show strong differences between speech and writing, where writing has a much higher lexical diversity.
However, a conclusion from this study is that one should be careful when using the measure to compare data from different languages.
The morphological structure of the language highly influences the outcome of the comparison.
The definition of lexical diversity in this article

To conclude, there are several ways to compare lexical diversity between texts of different lengths.
In spite of some criticism, VocD seems to be the most accurate instrument to use.
For the calculations of lexical diversity below, I will consequently use the measure D.
Lexical density

Lexical density is the term most often used for describing the proportion of content words (nouns, verbs, adjectives, and often also adverbs) to the total number of words.
By investigating this, we get a notion of how much information a text carries: a text with a high proportion of content words contains more information than a text with a high proportion of function words (prepositions, interjections, pronouns, conjunctions and count words).
Various variants of lexical density have been proposed.
A popular 'minor variant' is to calculate the noun density, the number of nouns divided by the total number of tokens in the text.
Other options are for instance verb or adjective or adverb types per total lexical words.
Various options are described and discussed in Wolfe-Quintero, Inagaki & Kim 1998.
Introducing the concept of lexical density, Ure 1971 distinguishes between words with lexical properties and those without.
According to Ure, items that do not have lexical properties can be described "purely in terms of grammar", meaning that such words (or items) possess a more grammatical-syntactic function than the lexical items.
Lexical density is then defined as the total number of words with lexical properties divided by the total number of orthographic words.
The result is a percentage for each text in the corpus.
Ure concludes that a large majority of the spoken texts have a lexical density of under 40%, while a large majority of the written texts have a lexical density of 40% or higher.
One remark here is that these numbers ought to be highly language dependent - a language with more bound morphology would probably show a higher proportion of lexical items.
In a later article, Ure defines lexical density as "the proportion of words carrying lexical values (members of open-ended sets) to the words with grammatical values (items representing terms in closed sets). Since all words have grammatical values, this is a part : whole relation" (Ure & Ellis 1977).
Ure and Ure & Ellis correctly maintain that the matter of lexicality is important when discussing the concept of lexical density.
Traditionally, nouns, verbs and adjectives are the three word classes considered to have lexical properties (although this is not stated clearly in Ure 1971 or Ure & Ellis 1977).
Often these items are called content words or open class words, because of the possibility to easily include new members of the class, while the more grammatical parts of speech are called closed classes, since new prepositions or pronouns seldom enter the language.
The concept of lexical density is developed, and further refined by Halliday 1985.
He points out the importance of discriminating between lexical items and grammatical items.
An item may consist of more than one word.
Thus, Halliday counts turn up as one lexical item, while Ure 1971 counts it as one lexical item (turn) and one grammatical item (up).
Halliday defines lexical items as items that "function in lexical sets not grammatical systems: that is to say, they enter into open not closed contrasts" (Halliday 1985).
The lexical item is part of an open set, that can be contrasted with a number of items in the world.
A grammatical item, on the other hand, enters into a closed system, according to Halliday.
Characteristic of the grammatical system is that the classes belonging to it have a fixed set of items, to which it is impossible to add new members.
According to Halliday, child language gives evidence for the existence of two classes, one with lexical and one with grammatical items.
In the beginning of their linguistic development, children often construct sentences where all grammatical items are missing.
Halliday further emphasises that there is a continuum from lexis into grammar, and that there are - and always will be - intermediate cases.
For instance, he claims that English prepositions and certain classes of adverbs are on the borderline between lexical and grammatical items.
The adverbs that he gives as examples are the modal adverbs, such as always and perhaps.
When comparing e.g. speech and writing, the important thing is to be consistent in drawing the line between 'lexical' adverbs and 'grammatical' adverbs, but it matters less where the line is drawn.
The definition of lexical density given by Halliday is thus "the number of lexical items, as a proportion of the number of running words" (Halliday 1985).
The difference between Halliday's and Ure's definitions of lexical density is that Halliday counts some adverbs as lexical items.
The definition of lexical density in this article

This article follows Halliday's definition of lexical density.
Thus, grammatical adverbs are included in the closed class items, while non-grammaticalised adverbs (including all adverbs derived from adjectives) are counted as lexical items.
In our data, lexical density was calculated by dividing the number of lexical items by the total number of words in each text.
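The calculation can be sketched as below, assuming each word has already been assigned a part-of-speech tag. The tag names (including "ADV_GRAM" for grammatical adverbs and "AUX" for auxiliaries) are hypothetical and do not reflect the tagging actually used for the data, and for simplicity each word is treated as one item.

```python
# Hypothetical coarse tag set: tags counted as lexical (open class) items.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Proportion of lexical items among all words.

    tagged_tokens: list of (word, tag) pairs. Grammatical adverbs and
    auxiliaries are assumed to carry tags outside LEXICAL_TAGS.
    """
    if not tagged_tokens:
        raise ValueError("lexical density is undefined for an empty text")
    lexical = sum(1 for _, tag in tagged_tokens if tag in LEXICAL_TAGS)
    return lexical / len(tagged_tokens)


# Made-up example sentence: "always" is treated as a grammatical adverb,
# "loudly" (derived from an adjective) as a lexical adverb.
example = [
    ("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("has", "AUX"),
    ("always", "ADV_GRAM"), ("barked", "VERB"), ("loudly", "ADV"),
]
density = lexical_density(example)
print(f"{density:.1%}")  # 4 lexical items out of 7 words
```

Where the line between lexical and grammatical adverbs is drawn only changes which tag a word receives; the calculation itself stays the same.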
Data

To compare lexical diversity and lexical density in a developmental perspective, I have used material from the Swedish part of an international study on developing literacy, the so-called Spencer project¹ (for more details on data collection, see Berman and Verhoeven 2002, or Johansson).
The Spencer study aimed at investigating the development of literacy in both speech and writing in two different genres: narrative and expository.
The Swedish data consist of 316 texts distributed evenly on written and spoken narrative and expository texts.
Four age groups participated in the study: 10-year-olds, 13-year-olds, 17-year-olds, and adults (university students with at least 2 years of university education, during which they had produced at least one major paper).
All participants were monolingual Swedish speakers², with no known reading or writing difficulties.
Each group consisted of 20 persons, except the adult group which had only 19 members.
The text length range was 50-650 words.
After watching a wordless elicitation movie showing scenes from a school-day (scenes of cheating, fighting, bullying and stealing), the participants were asked to produce four texts each.
The experimental tasks were balanced for order.
The text types and the topic for each task were as follows (with the elicitation question rephrased):

Spoken narrative (NS): Tell me about one time when you helped somebody in/was helped by somebody out of a predicament.
Written narrative (NW): Write about one time when you helped somebody in/was helped by somebody out of a predicament.
Spoken expository (ES), a speech: Give a speech, where you discuss the problems you just saw in the film. Don't describe the film, but instead say something about the cause of the problems, and possible solutions.
Written expository (EW), an essay: Write an essay where you discuss the problems you just saw in the film. Don't describe the film, but instead say something about the cause of the problems, and possible solutions.

¹ The project was supported by the Spencer Foundation Major Grant for the Study of Developing Literacy to Ruth Berman. Apart from Sweden, six other countries participated: Israel, the Netherlands, France, Spain, Iceland and California, USA.
² 'Monolingual speaker' here means that both parents had Swedish as their first language, and that Swedish was the main language used both at home and at school. At the time of the recording, however, all subjects had at least started to learn English in school, and some of the participants in the adult group might have spent a long time abroad.
Correlating lexical diversity and lexical density

Before exploring each lexical measure individually, a correlation test will give a hint on whether or not the two measures are connected in the data.
Not surprisingly, given that both measures have been proposed to show lexical development, there proved to be a highly significant correlation between lexical diversity and lexical density (r = 0.733).
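A correlation of this kind can be computed from one pair of values per text. The sketch below assumes a Pearson product-moment correlation, which the article does not state explicitly, and uses made-up numbers rather than the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, one pair per text (NOT the study's data):
# lexical diversity (D) and lexical density for five hypothetical texts.
d_values = [20.0, 35.0, 50.0, 65.0, 80.0]
density_values = [0.30, 0.36, 0.38, 0.45, 0.47]
r = pearson_r(d_values, density_values)
print(f"r = {r:.3f}")
```

A value close to 1 means that texts with high lexical diversity also tend to have high lexical density; it does not by itself show that the two measures capture the same property.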
Overall patterns of age, modality and genre

Having established that lexical diversity and lexical density are correlated in the data, I used multivariate ANOVA to explore overall patterns of age, modality and genre for each lexical measure.
To summarise the results below, the general effects were significant for almost all factors, including an interaction of genre, age and modality.
To investigate the main effects of genre and modality and the interactions between these factors, a within-subject factor test was used, while a between-subjects test was used to look for main effects of age.
Table 1 shows an overview of the results of the post hoc tests.
Lexical diversity: multivariate analyses

Multivariate analyses of lexical diversity show a significant main effect of genre (F = 4.236), of modality (p < 0.01), and of age (F = 3302.206).
Table 1. Results of the post hoc comparisons between lexical diversity and lexical density.

Lexical measure     Subset 1                     Subset 2               Subset 3
Lexical diversity   10-year-olds, 13-year-olds   17-year-olds           Adults
Lexical density     10-year-olds, 13-year-olds   17-year-olds, Adults

A significant interaction of modality and age is also found (F = 11.664, p < 0.01), as well as of genre and modality (p < 0.05).
However, there is no significant interaction of genre and age group.
Tukey's post hoc analyses show no significant difference between the two youngest age groups (10-year-olds and 13-year-olds), but a significant difference between the two youngest age groups and the two oldest ones.
Further, there was a significant difference between the two oldest groups, in that the adults had higher lexical diversity than the 17-year-olds (see the subsets from the post hoc tests in Table 1).
Lexical density: multivariate analyses

Multivariate analyses of lexical density show a main effect of modality (F = 651.744, p < 0.01) and of age (p < 0.01), but no effects of genre.
Further, a significant interaction of genre and age is found (p < 0.01), and of modality and age group.
Thus, there is a genre effect for the spoken texts, where the narrative spoken texts are more lexically dense than the expository ones.
17-year-olds

The 17-year-olds show a significant effect of modality (p < 0.01), where, again, the written texts have higher lexical density than the spoken texts.
There are no effects of genre.
10-year-olds

The 10-year-olds show a significant difference of modality, in that the written texts have higher lexical density than the spoken ones.
However, there are no genre effects.
[Figure: percentages by age group (10-year-olds, 13-year-olds, 17-year-olds, adults) for the four text types NS, NW, ES and EW]

13-year-olds

The 13-year-olds show a significant modality effect (F.
= 171.
p