Threshold Vocabulary - The magic key

Words are, in my not-so-humble opinion, our most inexhaustible source of magic. - Dumbledore in Harry Potter and the Deathly Hallows

Words form the basic building blocks of communication in any language, so vocabulary building plays a vital role in learning a new language. The more words a learner “knows”, the easier it is for her to read and understand a text in that language. Wouldn’t it be good, then, to get hold of the minimal set of words in a language, the magical set, learning which would allow one to read any text in that language comfortably well? That would be the key that opens the whole world of a new language!

Many researchers of language learning have asked this question, especially for English, and this magical set of words is indeed called the “Threshold Vocabulary”. Now that we know what Threshold Vocabulary is, let us ponder over it a bit. Though there is a lot to discuss on this topic, I will stick to introducing key concepts and initial thoughts in this post.

What does it mean to “know” a word? There are two parts to it.

  • One is to recognise the sound of the word just by seeing it, without having to decode each letter in it. For example, the moment we see the word “know”, we recognise that its sound is “no”. However, a word such as “schizophrenia” will require some processing of its parts - “schi”, “zo”, “phre” and “nia”. This is more apparent with people who have not encountered the word before. Of course, for a psychiatrist, it might be just as easy to recognise as the word “know”!
  • The second is to understand the meaning of the word. For example, it is easy to recognise the sound of a word like “deedy”, but one might not know its meaning. This gap is even more noticeable in Indian languages. If one recognises all the “Aksharas” of a language, he/she will be able to decode (read) any text reasonably well, without actually knowing its meaning. For example, I can pretty easily read a Kannada text without understanding it! The secret is that my native language is Telugu, and the scripts of the two languages are quite similar.

Now, let me introduce some more vocabulary about vocabulary! A word that one “knows” (in the above sense) is called a “sight word” for that person, i.e., a word the person understands just by its sight, without much cognitive processing.

The vocabulary of a person is the set of sight words acquired by that person over a period of time. As I mentioned above, this has a direct relation to reading comprehension. This relation has been studied extensively in the context of English. The main question asked by researchers is: what is the minimal set of words that should become a part of any person’s vocabulary (i.e., the threshold vocabulary) to enable reading of a general English text “reasonably well” (defined in terms of fluency and accuracy of reading)?

Ok, here is one more term! The percentage of sight words in a given text, for a given person, is called that person’s “Lexical Coverage” of the text. Studies have shown that a lexical coverage of 95% is required for achieving a reasonable comprehension level of the text.
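To make the definition concrete, here is a minimal sketch of how lexical coverage could be computed. It assumes coverage is counted over running words (every occurrence counts), and it uses a deliberately simple English-only tokenizer; the function names and the sample sentence are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and pull out English word tokens (a simplification)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_coverage(text, sight_words):
    """Percentage of running words in `text` that are in `sight_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in sight_words)
    return 100.0 * known / len(tokens)


# Hypothetical reader who knows six words; "attic" is the only unknown token.
sample_text = "The cat sat on the mat near the attic"
sample_sight_words = {"the", "cat", "sat", "on", "mat", "near"}
coverage = lexical_coverage(sample_text, sample_sight_words)
print(f"Lexical coverage: {coverage:.1f}%")
```

With this reader, 8 of the 9 running words are sight words, so the coverage falls short of the 95% mark mentioned above.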

Remember, the context here is comprehension of a “general” English text, i.e., a text that is not specific to any particular subject, but related to general (day to day) topics. So, the words that occur in such a text would typically be the most frequently occurring words in that language. In fact, studies have shown that (for English) the 3000 most frequently occurring words (some studies take this number to 4000) usually cover 95% of a general text. And so, the set of 3000 most frequently occurring words is the “Threshold Vocabulary” in English.
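The link between frequency and coverage can be sketched in a few lines. The snippet below is only an illustration, assuming the corpus is already tokenized: it counts how many of the most frequent words are needed before they account for a target share of all running words. The function name and the toy corpus are hypothetical, and a real study would use a far larger corpus.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target_percent=95):
    """Number of top-frequency words needed to cover `target_percent`
    of all running words in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= target_percent * total:
            return rank
    return len(counts)


# Toy corpus of 20 running words and 5 distinct words.
toy_tokens = ["the"] * 10 + ["cat"] * 5 + ["sat"] * 3 + ["mat"] + ["attic"]
needed = words_needed_for_coverage(toy_tokens, 95)
print(f"{needed} of {len(set(toy_tokens))} distinct words reach 95% coverage")
```

Run on a large general corpus, the same loop is what would produce a figure like the 3000 quoted above; the toy data here only shows the mechanics.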

However, “most frequently occurring words” is not a static or even a well-defined set! The volume of text in English, and for that matter in many other languages, is huge and ever growing. So there are several Threshold Vocabulary lists from different researchers. The following are some of the well-known lists:

  1. The General Service List - A collection of the 2000 most frequent words (enhanced to 3000 in the New General Service List) for learners of English as a foreign language. This list has had a wide influence on teaching and learning vocabulary for many years and has served as the basis for second language graded readers.
  2. Dolch Word List - The Dolch Word List consists of 220 basic English words. It is believed that these 220 words form 50% to 75% of the running words used in school books, library books, newspapers, and magazines.
  3. Ogden English Word List - This is a basic word list of 850 words that defines a simplified version of the English language. The purpose of this list was to establish an international language. This list was later extended to 2000 words that, according to Ogden, achieve a standard English level.

Threshold Vocabulary is a valuable resource for language learning, with several uses such as:

  • Evaluating the complexity and suitability of a text for a given grade (based on the % of Threshold words).
  • Assessing the reading fluency of learners.

The EkStep Language Model uses the New General Service List as the Threshold Vocabulary for English. However, it requires some fine-tuning to suit the Indian context. For example, words like “Church”, “Attic”, “Private” etc. occur in the first 1000 words of this list, which is unlikely to hold in the Indian context.

For Indian languages, the problem is bigger. There is very little to no research, and there are no standard vocabulary lists available for Indian languages. Building such lists requires thinking through various approaches, testing them and refining them. It is an iterative and evolving process. At EkStep, we have taken one step towards this. We analyzed the Grade 1 to 3 textbooks of Hindi (Rajasthan Govt) and Kannada (Karnataka Govt). The textbooks were parsed to obtain the list of words along with their frequencies across the three grades. Here are some interesting numbers from the analysis:

 

Word Occurrence                       Kannada   Hindi
Grade 1 only                              825     325
Grade 2 only                              273     629
Grade 3 only                             2001     863
Grade 1 and 2                              44      85
Grade 1 and 3                             272     102
Grade 2 and 3                              58     260
Grade 1, 2, 3                              82     243
Total Words                              3555    2507
Words that occurred at least twice       1424    1536

Overall, the Kannada textbooks contain more words than the Hindi ones (3555 against 2507). The Kannada textbooks seem to introduce a lot of words in Grades 1 and 3, whereas Hindi shows a more gradual increase. Around 1500 words occur at least twice in the textbooks, for both languages. This list would probably cover 50% of the Threshold Vocabulary in each language.
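The breakdown in the table can be sketched with plain set operations. This is not the actual EkStep analysis code, only an illustration under one assumption: the seven grade categories are disjoint, so “Grade 1 and 2” means words found in Grades 1 and 2 but not in Grade 3. The published numbers are consistent with that reading, since the seven rows add up to the totals. The function and variable names are hypothetical.

```python
def occurrence_breakdown(g1, g2, g3):
    """Split the word sets of three grades into seven disjoint categories."""
    return {
        "Grade 1 only": len(g1 - g2 - g3),
        "Grade 2 only": len(g2 - g1 - g3),
        "Grade 3 only": len(g3 - g1 - g2),
        "Grade 1 and 2": len((g1 & g2) - g3),
        "Grade 1 and 3": len((g1 & g3) - g2),
        "Grade 2 and 3": len((g2 & g3) - g1),
        "Grade 1, 2, 3": len(g1 & g2 & g3),
    }


def new_words_per_grade(b):
    """Words first met in Grade 1, 2 and 3, derived from the seven categories."""
    first_in_g1 = (b["Grade 1 only"] + b["Grade 1 and 2"]
                   + b["Grade 1 and 3"] + b["Grade 1, 2, 3"])
    first_in_g2 = b["Grade 2 only"] + b["Grade 2 and 3"]
    first_in_g3 = b["Grade 3 only"]
    return first_in_g1, first_in_g2, first_in_g3


# Toy word sets, only to show the set logic.
toy = occurrence_breakdown({"a", "b", "c"}, {"b", "d"}, {"b", "c", "e", "f"})

# Counts copied from the table above.
kannada = {"Grade 1 only": 825, "Grade 2 only": 273, "Grade 3 only": 2001,
           "Grade 1 and 2": 44, "Grade 1 and 3": 272, "Grade 2 and 3": 58,
           "Grade 1, 2, 3": 82}
hindi = {"Grade 1 only": 325, "Grade 2 only": 629, "Grade 3 only": 863,
         "Grade 1 and 2": 85, "Grade 1 and 3": 102, "Grade 2 and 3": 260,
         "Grade 1, 2, 3": 243}

for name, counts in (("Kannada", kannada), ("Hindi", hindi)):
    print(name, "total:", sum(counts.values()),
          "new words per grade:", new_words_per_grade(counts))
```

Read this way, the Kannada textbooks introduce 1223, 331 and 2001 new words in Grades 1, 2 and 3, while the Hindi textbooks introduce 755, 889 and 863, which is the more gradual progression noted above.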

This is just a starting point of the process. As more and more content across different grades comes into the EkStep platform, confidence in the Threshold Vocabulary definition will increase. This process is also extensible to other languages, including English. However, one assumption in this process is that textbooks (and other language learning content) reflect the general vocabulary of the language reasonably well. A possible way to validate this is to parse a different text corpus (like news articles or wiki pages), perform a similar frequency analysis and compare the results.
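One way such a comparison could look is sketched below. The post does not prescribe a particular measure, so the Jaccard overlap between the top-N word sets of two corpora is an assumption chosen for simplicity; the function names and the tiny token lists are hypothetical.

```python
from collections import Counter


def top_words(tokens, n):
    """Set of the n most frequent words in a tokenized corpus."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def top_n_overlap(tokens_a, tokens_b, n):
    """Jaccard overlap (0.0 to 1.0) between the top-n word sets of two corpora."""
    a, b = top_words(tokens_a, n), top_words(tokens_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy stand-ins for a textbook corpus and a news corpus.
textbook_tokens = "cow cow cow milk milk school school tree".split()
news_tokens = "school school school election election milk milk road".split()
overlap = top_n_overlap(textbook_tokens, news_tokens, 3)
print(f"Top-3 overlap: {overlap:.2f}")
```

A high overlap between the textbook list and the list from an independent corpus would support the assumption; a low overlap would suggest that the textbooks alone are not representative enough.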

I will end this post with a novel thought. This is about extending the definition of Threshold Vocabulary. As we have seen above, the context of Threshold Vocabulary is comprehension of “general” text in a language. Can we not extend this concept to different contexts? For example, if we take the context of learning Mathematics, there is a basic set of Maths vocabulary that a learner should know in order to do primary level Maths (up to Grade 5). This can be categorised as the Threshold Vocabulary for Maths learning. The same idea can be extended to other learning contexts. The Language Model of the EkStep platform supports this type of extension. So, it is not just about the keys that open worlds of new languages, but also about the keys that open worlds of new concepts!

Vocabularies are crossing circles and loops. We are defined by the lines we choose to cross or to be confined by. - A.S. Byatt