Words are like people; they rarely occur in isolation. Like members of a tribe, each word belongs to a large semantic domain. For instance, there are religious terms and health terms. And just as every tribe has its own makgotla, words within a specific domain cluster around certain themes and semantic concepts. For instance, amongst health terms there are HIV terms, TB terms, etc. Just as individuals have their own buddies, words also have their associates, that is, their preferred terms of association.
This linguistic reality is behind Firth’s 1957 claim that you shall know a word by the company it keeps, which echoes another maxim: you shall know a man by the company he keeps. It is not really possible to talk about the meaning of a word in isolation; it only has a particular meaning when it is in a particular environment. For instance, the word bank can mean a financial institution or the side of a river depending on its context, that is, on the company that it keeps. Multi-word expressions (MWEs) therefore include idioms, phrasal verbs, proverbs, compound words, etc. English examples of MWEs include by and large, kick the bucket, in step, take up, take off, shake up, telephone booth, pull strings, fresh air, fish and chips, salt and pepper, etc. Setswana examples are solegela molemo (benefit), kukega maikutlo (be upset), iphaga dikoro (involve oneself in other people’s business), tsholetsa maoto/dinaô (walk faster), opisa tlhogo (cause trouble), tsaya karolo (participate), tsaya tsia (pay attention), nna le seabe (take part), ja monate (enjoy), etc.
The teaching of new vocabulary must also be done in context so that learners can be made aware of the grammar and collocations of new words and phrases.
There are areas of linguistics which focus on the study of word clusters. Different strategies are used to study how words cluster together. The computational measure that is used in corpus linguistics is known as the Mutual Information (MI) measure.
A mutual information (MI) score relates one word to another. For example, if problem is often found with solve, they may have a high mutual information score. Usually, the word the will be found much more often near problem than solve is, so the procedure for calculating mutual information takes into account not just the most frequent words found near the word in question, but also whether each word is often found elsewhere, well away from the word in question. Since the is found very often indeed far away from problem, it will not tend to be related, that is, it will get a low MI score. This study of word clusters has been used extensively in dictionary-making to identify multi-word expressions such as mother-in-law.
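As a rough illustration of this reasoning, the calculation can be sketched in Python. The sketch assumes one common formulation of the score, the base-2 logarithm of how often two words are observed together divided by how often chance alone would put them together. The function name mi_scores, the window of two words on either side, and the toy English sentences are our own assumptions for illustration; they are not the software or the data used in the study.

```python
import math
from collections import Counter

def mi_scores(tokens, node, window=2):
    """Score every word found within `window` tokens of `node`.

    Assumed formula: MI = log2(joint * N / (f(node) * f(word))),
    where joint counts how often `word` appears near `node`.
    """
    n = len(tokens)
    freq = Counter(tokens)   # how often each word occurs anywhere
    near = Counter()         # how often each word occurs near the node
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start, stop = max(0, i - window), min(n, i + window + 1)
        for j in range(start, stop):
            if j != i:
                near[tokens[j]] += 1
    # A word that is frequent everywhere has a large freq[word],
    # which pulls its score down even if it is common near the node.
    return {
        word: math.log2(joint * n / (freq[node] * freq[word]))
        for word, joint in near.items()
    }

# Invented toy corpus, for illustration only.
text = ("we solve the problem . the cat sat on the mat . "
        "the dog saw the cat . they solve the problem . the sun is hot .")
tokens = text.split()

scores = mi_scores(tokens, "problem")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:.2f}")
```

In this toy corpus the appears near problem more often than solve does, yet solve receives the higher score, because the also occurs many times well away from problem. On a real corpus the same idea applies, although very rare words tend to receive inflated scores, so a minimum frequency threshold is usually imposed.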
Lately, we have been fascinated by the way Setswana words cluster together. Our fascination relates to how the results of our study could be applied to Setswana dictionaries, our principal area of interest. In our study, we found that the word pula is statistically associated with different kinds of words. First, there are high-frequency words that are found in the vicinity of the word under investigation but are nevertheless not immediately critical to the meaning of the headword. Second, multi-word units are unearthed (e.g. pula ya matlakadibe: a vicious rainstorm; pula ya sephai: the first rain of the season; pula ya kgogolamoko: the first rain after harvest, etc.). Third, a word’s valency is revealed. For instance, the noun pula ‘rain’ can take certain Setswana terms, such as verbs and adjectives, that characterise the type, intensity, end or beginning of the rain. Words that express the sense of heavy rain are ‘tsorotla’, ‘porotla’, ‘bokete’, ‘kgolo’, ‘tshologa’, ‘gosomana’, ‘maswe’ and ‘tsora’. The words ‘sarasara’, ‘komakoma’ and ‘rotha’ all express ‘light showers’. The word ‘thiba’ expresses impending rain, while ‘simolola’, ‘itelekela’ and ‘kgomoga’ all indicate the start of rain, with ‘kgomoga’ implying the beginning of a heavy or unexpected rain. The words ‘kgaotsa’, ‘didimala’ and ‘ema’ relate to the sense of ‘stop raining’.
Information relating to category one above may be treated in large Setswana dictionaries. Category-two information is lexicalised and should be included either as independent headwords or as dictionary subentries. Category-three collocations could be added as part of a dictionary’s usage notes to illustrate the natural collocates of a headword. This will help users, particularly users of an active dictionary, to produce ‘natural-sounding’ larger units of language.
More examples of the valency of the word pula include: na, nele, tla, kgolo, namagadi, ntsi, tshologa, tshweu, tswa, boutsana, simolotse, tona, kgaotsa, bokete, phaila, porotla, rotha, tsorotla, utlwala, duma, goroga, ntlha, tsheola, dikgadima, selemo, tla, tshologa, simolola, mariga, matlotlo, morago, morwalela, ngwaga, ditladi. The valency of the verb tshwara, on the other hand, includes the following terms: bothata, sentle, sepe, diphuthego, legodu, letsogo, thata, terena, boroko, phuthego, ntlha, phage, dithuto, dipuisano, pitso, mafoko, botlhaswa, kgaisano. The word associations of pula lead to the following multi-word expressions: maru ga se pula, mosi ke molelo, nesa ke pula, mosele wa pula o etšwa go sa le gale, kgole ya pula e bošwa e bofologa, pula ya medupe, pula ya tsheola, pula ya sephai, pula ya kgogolamoko, pula ya maebana, pula e namagadi, pula e tshweu. The word associations of the verb tshwara lead to the following multi-word expressions: tshwara ditlhapi, tshwara pelo, tshwara logaba, tshwara bothata, tshwara phage ka mangana, tshwara thipa ka fa bogaleng, tshwara mala ka letsogo and tshwara poo.
There is still much to learn about word clusters. What is not in dispute is that words are like people; they rarely occur in isolation.