A Corpus-Based Analysis of Mixed Code in Hong Kong Speech
2012 International Conference on Asian Language Processing

John Lee
Halliday Centre for Intelligent Applications of Language Studies
Department of Chinese, Translation and Linguistics
City University of Hong Kong
[email protected] edu. hk

Abstract: We present a corpus-based analysis of the use of mixed code in Hong Kong speech. From transcriptions of Cantonese television programs, we identify English words embedded within Cantonese utterances, and investigate the motivations for such code-switching. Among the many motivations observed in previous research, we found that four alone account for more than 95% of the use of English words in our speech data across genres, genders, and age groups. We performed analyses over more than 60 hours of transcribed speech, resulting in one of the largest empirical studies to date on this linguistic phenomenon.

Keywords: code-mixing; code-switching; Cantonese; English; corpus linguistics

I. INTRODUCTION

While Cantonese is the mother tongue of the vast majority of the people in Hong Kong, English is also spoken by 43% of the population [1], reflecting the city’s heritage as a British colony. A well-known feature of the speech in Hong Kong is code-switching, i.e., “the juxtaposition of passages of speech belonging to two different grammatical systems or sub-systems, within the same exchange” [2]. Specifically, in the case of Hong Kong, the two grammatical systems are Cantonese and English. The former serves as the ‘matrix language’, and the latter as the ‘embedded language’, resulting in Cantonese sentences with English segments such as (example taken from [3]):

heoi3 canteen jam2 caa4
‘let’s go to the canteen for lunch’

Here, the English segment contains only one word (‘canteen’), but in general, it can be a whole clause. We will use the general term ‘code-switching’ rather than the more specific term ‘code-mixing’, which refers to switching below the clause level, even though most English segments in our corpus indeed contain only one or two words (see Table 3).

There is already a large body of literature devoted to the study of Cantonese-English code-switching from the theoretical linguistic point of view [3,4,5]. This paper investigates the motivations behind the use of mixed code, on the basis of a large dataset of speech transcribed from television programs. In Section II, we outline previous research on the motivations of code-switching, and discuss how our investigation complements it. In Section III, we describe our methodology for corpus construction, in particular the design of the taxonomy of code-switching motivations. In Section IV, we present an analysis of these motivations according to genre, gender and age.

II. PREVIOUS RESEARCH

The first major framework for classifying code-switching motivations in Hong Kong consists of two categories: ‘expedient’ and ‘orientational’ [6]. Central to this framework is the distinction between words in ‘high Cantonese’ and ‘low Cantonese’. In everyday conversations, a speaker sometimes cannot find any word from ‘low Cantonese’ to describe an object, institution or idea (e.g., ‘application form’). Using a word from ‘high Cantonese’ (e.g., biu2 gaak3), however, would sound too formal and therefore stylistically inappropriate. In expedient mixing, the speaker resorts to an English word; the mixing is pragmatically motivated. In contrast, orientational mixing is socially motivated. The speaker chooses to use English (e.g., ‘barbecue’) despite the availability of equivalent words from both ‘low Cantonese’ (e.g., siu2 je5 sik6) and ‘high Cantonese’ (e.g., siu1 haau1), since he perceives the subject matter to be inherently more ‘western’.
This dichotomy has been criticized as overly simplistic, because of the ambiguity in defining lexical and stylistic equivalents among ‘low Cantonese’, ‘high Cantonese’, and English. Instead, a four-way taxonomy is proposed: euphemism, specificity, bilingual punning, and the principle of economy [7]. This taxonomy is then further extended, in a study of code-switching in text media [8], to include quotations, doubling, identity marking, and interjection. These categories will be explained in detail in Section III.

While these classification systems are comprehensive and well grounded, they do not per se convey any sense of the relative importance or distribution of the various motivations. Our goal is, first, to empirically verify the coverage of these classification systems on a large dataset of transcribed speech; and, second, to give quantitative answers to questions such as: Which types of motivations are the most prominent? Does the range of motivations vary according to the speech genre, or to the speaker’s gender or age? We now turn our attention to the methodology for constructing and annotating a speech corpus for these research purposes.

978-0-7695-4886-9/12 $26.00 © 2012 IEEE. DOI 10.1109/IALP.2012.10

III. DATA

A. Source Material

Our corpus is constructed from television programs broadcast in Hong Kong within the last four years by Television Broadcasts Limited (TVB). The programs belong to a variety of genres, including two drama series, three current-affairs shows, a news program, and a talk show. The news program, TVB News at Six-Thirty, carries the most formal register, containing mostly pre-planned speech by the anchor. The current-affairs shows, Tuesday Report, Sunday Report and Hong Kong Connection, are serious in tone but contain spontaneous discussions. The talk show, My Sweets, is about food and drink.
It also contains spontaneous discussions, but the topics are usually lighter. Although pre-planned, the speech in both drama series, Moonlight Resonance and Yes Sir, Sorry Sir, is arguably the least formal in register, designed to reflect natural speech in everyday life. Details of these TV programs are presented in Table 1.

Table 1: Television programs that serve as the source material of our corpus.

Genre             Program                                                Length
Current affairs   Tuesday Report, Sunday Report, Hong Kong Connection    135 episodes x 20 minutes
Talk show         My Sweets                                              24 episodes x 30 minutes
News              TVB News at Six-Thirty                                 5 episodes x 20 minutes
Drama             Moonlight Resonance, Yes Sir, Sorry Sir                4 episodes x 45 minutes

B. Data Processing

From the television programs listed in Table 1, all code-mixed utterances were transcribed, preserving the original languages, either Cantonese or English. Following standard practice, loan words are not considered to be mixed code; in our context, all English words (e.g., ‘taxi’) that have been adapted into Cantonese phonology (e.g., dik1 si2) were excluded. The TV captions corresponding to each of these utterances are also recorded as part of the corpus. These captions are in standard Chinese, rather than Cantonese. Furthermore, alignments between the Chinese word(s) in the caption and the English word(s) in the utterance are annotated. This information will be used in the classification of motivations. Finally, two types of metadata about the speaker are recorded: gender (male or female) and age group (teenager or adult).

C. Taxonomy of Code-Switching Motivations

Our goal is to quantitatively characterize the motivations behind code-switching; to this end, each English segment in the Cantonese sentences in our corpus is to be labeled with a motivation. Due to time constraints, this classification was performed only on the current-affairs and talk shows.

The ‘expedient’ vs. ‘orientational’ classification system is too coarse for our purpose. Instead, we adopted the taxonomy in [7,8] as our starting point, then introduced some new categories to accommodate our data. The categories in [7] are¹:

Euphemism: When a Cantonese word explicitly mentions something that the speaker finds embarrassing, s/he might opt for an English word that contains no such mention. For example, to avoid the female body part hung1 ‘breast’ in the word hung1 wai4 ‘bra’, the speaker might prefer to use the English ‘bra’ (all examples are taken from [7]):

tau3 bra gaak3 gaak3
‘A princess whose bra is visible’

Specificity: “Sometimes an English expression is preferred because its meaning is more general or specific compared with its near-synonymous counterparts,” [7] in either low or high Cantonese. For example, the verb ‘to book’ means ‘to make a reservation for which no money or deposit is required’, which is more specific than its closest equivalent in Cantonese, deng6 ‘to make a reservation’. It is typically used in sentences such as:

ngo5 soeng2 book saam1 dim2
‘I want to book three o’clock’

Principle of Economy: “An English expression may also be preferred because it is shorter and thus requires less linguistic effort compared with its Chinese/Cantonese equivalent.” [7] While the word ‘check-in’ has two syllables, its Cantonese equivalent baan6 lei5 dang1 gei1 sau2 zuk6 ‘check-in [on a plane]’ has six. The principle of economy is thus likely the reason behind mixed code such as:

nei5 check-in zo2 mei6 aa3
‘Have you checked in already?’

¹ A fourth category, ‘bilingual punning’, is excluded from our taxonomy. As may be expected, punning is rarer in speech, and is indeed not found in our corpus.

The taxonomy in [8] builds on the one in [7], further enriching it with the categories² below:

Quotation: When citing text or someone else’s speech, one often prefers to use the original code to avoid having to perform translation. An example is direct speech:

jau5 go3 pang4 jau5 man6 ngo5 what do you think
‘A friend asked me, “What do you think?”’

Doubling: Originally named ‘Emphasis or avoidance of repetition’ [8], it will be referred to as ‘Doubling’ [9] here to make it explicit, as this category refers to English words that are embedded alongside Cantonese words that have the same or nearly the same meaning. The purpose is to emphasize the idea or to avoid repetitions. In the following sentence, it serves as an emphasis:

excellent m4 co3 aa1
‘Superb, excellent!’

Interjection: English interjections may be inserted into the Cantonese sentence. For example:

anyway nei5 hou2 sai1 lei6 ak1
‘Anyway, you are awesome!’

² Among these categories is ‘identity marking’, for mixed code that marks “social characteristics such as social status, education status, occupation, as well as regional affiliation.” [8] We found it difficult to objectively identify this motivation, and excluded it from our taxonomy.

A significant amount of mixed code in our corpus, however, still does not fit into any of the above categories. Most fall under one of two reasons, ‘Personal Name’ and ‘Register’. We therefore added them to our taxonomy:

Register: This is roughly equivalent to the ‘expedient’ category in [6], but will be referred to as ‘Register’ in this paper to make the motivation explicit. Sometimes, the speaker cannot find any equivalent ‘low Cantonese’ word, but feels awkward using a more formal ‘high Cantonese’ word (e.g., paai1 deoi3 ‘party’). As a result, s/he resorts to an English equivalent instead. For example:

hoi1 ci2 laa1 ngo5 dei6 go3 party
‘Our party is starting’

Personal Name: It is common practice among Hong Kong people to adopt an English name. Although this phenomenon may be considered ‘orientational’ code-mixing in terms of the ‘western’ perception [6], it is given its own category, because it is very specific and accounts for a substantial amount of our data. A typical example is:

Teresa ngo5 dei6 zing2 dak1 leng3 m4 leng3
‘Teresa, did we make it well?’

D. Annotation Procedure

We thus have a total of eight categories in our taxonomy of code-switching motivations. Five of these categories (namely, ‘euphemism’, ‘quotation’, ‘doubling’, ‘interjection’, and ‘personal name’) can usually be unambiguously discerned. The annotator, however, has often found it difficult to distinguish between ‘specificity’, ‘register’, and ‘principle of economy’. To maintain consistency, we adopted the following procedure. When an English segment does not fit into any of the five “easy” categories, the annotator is to decide whether it has the same meaning as the Chinese word in the caption to which it is aligned. If it is deemed not to have the same meaning, then it is assigned ‘specificity’. If it is equivalent in meaning, and the annotator cannot think of any equivalent in ‘low Cantonese’, then it is labeled ‘register’. Finally, if there is a ‘low Cantonese’ equivalent, but its number of syllables is greater than that of the English segment, then the motivation is ‘principle of economy’.

IV. ANALYSIS

This section presents some preliminary analyses on this corpus. We first consider the frequency and length of English segments in Cantonese speech (Section A), then discuss the distribution of the categories of motivations, both overall and with respect to genres, genders, and age groups (Section B).

A. Density and Length of English Segments

It is well known that English words are sprinkled quite liberally in the Cantonese speech in Hong Kong. We measure how the frequency of English segments varies across different genres. As shown in Table 2, the frequency correlates with the register of the genre (see Section III.A). In the drama series, the most colloquial genre, about 1.4 English words are uttered per minute on average. The talk show occupies second place, and the current-affairs shows have slightly less frequent English words. In the news program, where the speech is pre-planned, the anchor did not utter any English word.

Table 2: The total number of Cantonese sentences containing English segments, and the total number of English words transcribed. The last row shows how often an English word is uttered.

Program genre               Drama   Talk show   Current affairs   News
# sentences with English    219     487         1495              0
# English words             259     625         1995              0
Frequency (words/min)       1.4     0.87        0.74              0

Second, we measure the length of the English segments. Table 3 shows that the vast majority of English segments contain no more than two words. Across all genres, more than 80% of the English segments contain only one English word. This figure is comparable to the 81.4% for text data reported in [8].

Table 3: Percentage of English segments with only one (e.g., “canteen”) or two words (e.g., “thank you”).

Program genre   Drama   Current affairs   Talk show
One-word        85%     85%               81%
Two-word        11%     11%               17%

B. Motivations for the Use of Mixed Code

A plethora of motivations have been posited for the use of mixed code in Hong Kong (see Section II). Applying our proposed classification system (see Section III.C) to our corpus of transcribed speech, we aim now to discern the relative prevalence of the various types of code-switching motivations.

Table 4 shows the distribution of these motivations in the current-affairs and the talk shows. Four dominant motivations (mainly ‘register’, but also ‘personal name’, ‘principle of economy’, and ‘specificity’) are attributed to more than 95% of the English segments. This trend is the same across genres (current-affairs and talk shows), genders (see Table 6), and age groups (see Table 5). All other categories, including quotations, euphemism, doubling, and interjection, are relatively infrequent.

Genres. Among the four dominant motivations, ‘register’ (the use of appropriately informal words) is the most frequent motivation in both the current-affairs and talk shows. Its proportion, however, is significantly more marked in the talk show (47.4%) than in current affairs (36.4%), reflecting the more informal nature of the former.

Table 4: Distribution of code-switching motivations, contrasted between genres.

Motivation             Current affairs   Talk show
Register               36.4%             47.4%
Personal Name          26.8%             24.5%
Principle of economy   19.0%             17.6%
Specificity            13.2%             8.2%
Quotation              2.1%              1.0%
Doubling               1.4%              0.4%
Interjection           0.9%              1.0%
Euphemism              0.3%              0%

Age groups. Table 5 contrasts the distributions of code-switching motivations between adults and teenagers in the current-affairs shows³. As mentioned above, the four major motivations remain constant. However, teenagers are much more likely than adults to use English words to achieve a more informal register (52.4% vs. 35.1%). They are also more inclined to opt for English to save effort (23.8% vs. 18.6%).
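As a rough cross-check of the figures reported so far in this section, the sketch below recomputes the frequency row of Table 2 from the episode counts in Table 1, and sums the four dominant motivations in Table 4. All numbers are copied from the tables. The function and variable names are hypothetical, and the formula used for frequency (English words divided by total broadcast minutes) is an assumption, since the way this row was computed is not spelled out; under that assumption the rounded values agree with Table 2.

```python
# Cross-check of figures in Tables 1, 2 and 4.
# All numbers are copied from the tables; all names are hypothetical.

# Table 1: (number of episodes, minutes per episode) for each genre.
BROADCAST = {
    "drama": (4, 45),
    "talk show": (24, 30),
    "current affairs": (135, 20),
    "news": (5, 20),
}

# Table 2: total number of English words transcribed for each genre.
ENGLISH_WORDS = {
    "drama": 259,
    "talk show": 625,
    "current affairs": 1995,
    "news": 0,
}

# Table 4: share (%) of each motivation in the two annotated genres.
MOTIVATIONS = {
    "current affairs": {
        "register": 36.4, "personal name": 26.8,
        "principle of economy": 19.0, "specificity": 13.2,
        "quotation": 2.1, "doubling": 1.4,
        "interjection": 0.9, "euphemism": 0.3,
    },
    "talk show": {
        "register": 47.4, "personal name": 24.5,
        "principle of economy": 17.6, "specificity": 8.2,
        "quotation": 1.0, "doubling": 0.4,
        "interjection": 1.0, "euphemism": 0.0,
    },
}

DOMINANT = ("register", "personal name", "principle of economy", "specificity")


def minutes(genre):
    """Total broadcast minutes for a genre, from Table 1."""
    episodes, minutes_per_episode = BROADCAST[genre]
    return episodes * minutes_per_episode


def words_per_minute(genre):
    """ASSUMPTION: frequency = English words / total broadcast minutes."""
    return ENGLISH_WORDS[genre] / minutes(genre)


def dominant_share(genre):
    """Combined share (%) of the four dominant motivations in Table 4."""
    return sum(MOTIVATIONS[genre][m] for m in DOMINANT)


if __name__ == "__main__":
    for genre in BROADCAST:
        print(f"{genre}: {words_per_minute(genre):.2f} English words/min")
    total_minutes = sum(minutes(g) for g in BROADCAST)
    print(f"total: {total_minutes} minutes")
    for genre in MOTIVATIONS:
        print(f"{genre}: {dominant_share(genre):.1f}% from the dominant four")
```

Under the same assumption, the programs in Table 1 total 3,700 minutes, or about 61.7 hours, in line with the "more than 60 hours" of speech stated in the abstract. The four dominant motivations sum to 95.4% for the current-affairs shows and 97.7% for the talk show.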
Somewhat surprisingly at first glance, teenagers address others by English names less often than adults (2.4% vs. 28.8%); it appears that in the conversations in our corpus, teenagers often prefer to address adults with the more formal Chinese names, possibly out of respect.

³ The speakers in the talk show are predominantly adults.

Table 5: Distribution of code-switching motivations, contrasted between age groups.

Motivation             Adults   Teenagers
Register               35.1%    52.4%
Personal Name          28.8%    2.4%
Principle of economy   18.6%    23.8%
Specificity            13.1%    14.3%
Quotation              1.9%     4.0%
Doubling               1.3%     2.4%
Interjection           0.9%     0%
Euphemism              0.3%     0.8%

Genders. Finally, we investigate whether code-switching motivations are biased according to gender. Aggregating statistics from both the current-affairs and talk shows, Table 6 compares the motivations of males and those of females. Females are shown to be more likely to use English names to address others (32.9% vs. 18.9%); males, on the other hand, more frequently use English words to reduce effort (22.9% vs. 14.8%).

Table 6: Distribution of code-switching motivations, contrasted between genders.

Motivation             Female   Male
Register               37.5%    40.7%
Personal Name          32.9%    18.9%
Principle of economy   14.8%    22.9%
Specificity            10.9%    13.2%
Quotation              1.9%     1.7%
Doubling               1.1%     1.3%
Interjection           0.7%     1.1%
Euphemism              0.3%     0.2%

V. CONCLUSIONS

We have described the construction of a corpus of Cantonese-English mixed code, based on speech transcribed from television programs in Hong Kong. Drawn from more than 60 hours of speech, this corpus is among the largest of its kind. A novel feature of the corpus is the annotation of the motivation behind each code-mixed utterance. Having proposed a classification system for these motivations, we applied it to our corpus, and reported variations in the use of mixed code between genres, genders and age groups. A key finding is that four main motivations (‘register’, ‘personal name’, ‘principle of economy’, and ‘specificity’) account for more than 95% of the embedded English segments.

ACKNOWLEDGMENT

This project was partially funded by a Small-Scale Research Grant from the Department of Chinese, Translation and Linguistics at City University of Hong Kong. We thank Man Chong Mak and Hiu Yan Wong for compiling the corpus and performing annotation.

REFERENCES

[1] K. H. Y. Chen, “The Social Distinctiveness of Two Code-mixing Styles in Hong Kong,” in Proceedings of the 4th International Symposium on Bilingualism, MA: Cascadilla Press, 2005, pp. 527–541.
[2] J. Gumperz, “The sociolinguistic significance of conversational code-switching,” in RELC Journal 8(2), 1977, pp. 1–34.
[3] J. Gibbons, “Code-mixing and koineizing in the speech of students at the University of Hong Kong,” in Anthropological Linguistics 21(3), 1979, pp. 113–123.
[4] B. H.-S. Chan, “How does Cantonese-English code-mixing work?,” in Language in Hong Kong at Century’s End, M. C. Pennington (ed.), 1998, pp. 191–216, Hong Kong: Hong Kong University Press.
[5] D. C. S. Li, “Linguistic convergence: Impact of English on Hong Kong Cantonese,” in Asian Englishes 2(1), 1999, pp. 5–36.
[6] K. K. Luke, “Why two languages might be better than one: motivations of language mixing in Hong Kong,” in Language in Hong Kong at Century’s End, M. C. Pennington (ed.), 1998, pp. 145–159, Hong Kong: Hong Kong University Press.
[7] D. C. S. Li, “Cantonese-English code-switching research in Hong Kong: a Y2K review,” in World Englishes 19(3), 2000, pp. 305–322.
[8] H. Cao, “Development of a Cantonese-English code-mixing speech recognition system,” PhD dissertation, Chinese University of Hong Kong, 2011.
[9] R. Appel and P. Muysken, Language contact and bilingualism. London: Arnold, 1987.