effort to increase diversity and reduce the model’s
propensity to default to certain words. Our goal is not to evaluate various prompting templates; however, we add linguistic guidelines to the prompts to further increase diversity and propose this as an approach towards language-agnostic prompting. We also test the performance of GPT-3.5 with few-shot in-context examples, specifically considering whether, and to what extent, GPT can support the generation of larger code-switched datasets.
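For illustration, the minimal sketch below shows one way such a guided few-shot prompt can be issued to GPT-3.5 through the OpenAI chat API; the guideline wording, the Afrikaans–English example and the sampling settings are illustrative assumptions rather than the exact prompts used in our experiments.

```python
# Minimal sketch of the prompting set-up, assuming the openai>=1.0 Python
# client. The guideline text, few-shot example and sampling settings are
# illustrative placeholders, not the exact prompts used in this work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINE = (
    "Produce intra-sentential code-switched sentences that mix both "
    "languages within one sentence. Vary the topic and sentence "
    "structure, and avoid repeating the same words across sentences."
)

# Hypothetical few-shot example for the target language pair.
FEW_SHOT = [
    {"role": "user", "content": "Language pair: Afrikaans-English. Generate one sentence."},
    {"role": "assistant", "content": "Ek gaan môre na die interview toe, wish my luck."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": GUIDELINE}]
    + FEW_SHOT
    + [{"role": "user", "content": "Language pair: Afrikaans-English. Generate five new sentences."}],
    temperature=1.0,  # higher temperature to encourage lexical diversity
)
print(response.choices[0].message.content)
```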
Our contributions are as follows: (i) we provide a
framework to increase the diversity of synthetically
generated code-switched data by prompting Ope-
nAI’s GPT; and (ii) we position GPT as a pivot to ad-
dress code-switched data scarcity in low-resource
languages while emphasising the need for native
speakers in the loop.
Increasing data availability is central to developing language models that serve multilingual communities. Our work is a step towards closing the gap for low-resource and under-represented languages.
2. Related Work
2.1. Code-Switching Research
Various types of code-switching have been identified, but the type that attracts the most academic research is intra-sentential code-switching, which can occur anywhere within a sentence boundary (Poplack, 1980) and, as a result, complicates evaluation (Poplack, 2001b). Another complex type is intra-word code-switching, in which a stem from one language is bound to morphology from another (Çetinoğlu et al., 2016; Van der Westhuizen and Niesler, 2018).
Over and above the issue of data diversity
(Winata et al., 2022), one of the major challenges in
code-switching studies is related to data availability
(Doğruöz et al., 2021). A survey by Winata et al. (2022) showed that, up until October 2022, a relatively small number of papers (ACL Anthology, 2023 and ISCA Proceedings, 2023) focused on code-switching research in African languages, with very few publicly available datasets. Eleven publications mention South African languages; the non-English South African languages referenced are isiZulu, isiXhosa, Setswana, Sesotho and Afrikaans. Only one of these publications includes Afrikaans code-switching (Niesler and De Wet, 2008), and no dataset was published. Van der Westhuizen and Niesler (2018) introduced the first corpus of isiZulu, isiXhosa, Setswana and Sesotho code-switching, curated from transcribed soap-opera speech data; eight of the papers make use of this dataset, focusing mainly on automatic speech recognition (ASR) systems.
Code-switching in Kiswahili–English is studied in
two papers but no datasets were made available
(Otundo and Grice, 2022; Piergallini et al., 2016).
In addition to the papers covered by the survey of Winata et al. (2022), one other paper was found that addresses Sepedi–English code-switching: Modipa et al. (2013) develop a corpus from a set of radio broadcasts to evaluate the implications of code-switching for ASR systems, and this dataset is publicly available. This
brief review of the state of code-switching research
in an African context motivates our work to develop
methods for addressing data scarcity.
A predominant approach to mitigating data avail-
ability issues involves augmenting existing datasets
through the generation of synthetic code-switched
data. Methods used to augment the South African speech corpus mentioned earlier include using word embeddings to find words similar to those in the sparse training data and thereby synthesise new code-switched bigrams (Westhuizen and Niesler, 2017). Biswas et al. (2018) evaluated augmenting the dataset with out-of-domain monolingual data and with code-switched data synthesised by an LSTM.
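As a concrete illustration of the embedding-substitution idea, the sketch below proposes new bigrams by replacing the embedded-language word of an observed code-switched bigram with its nearest neighbours in an embedding space; the embedding file, similarity threshold and example bigram are assumptions, and this shows the general pattern rather than the exact procedure of Westhuizen and Niesler (2017).

```python
# Sketch of the general embedding-substitution pattern: given an observed
# code-switched bigram, propose new bigrams by replacing the embedded-
# language word with its nearest neighbours in a word-embedding space.
# The embedding file and threshold are assumptions, not the exact
# procedure of Westhuizen and Niesler (2017).
from gensim.models import KeyedVectors

# Hypothetical pre-trained embeddings for the embedded language.
vectors = KeyedVectors.load_word2vec_format("embeddings.vec")

def expand_bigram(matrix_word, embedded_word, topn=5, min_sim=0.6):
    """Return synthetic bigrams (matrix_word, neighbour) for neighbours
    of embedded_word that are sufficiently similar to it."""
    if embedded_word not in vectors:
        return []
    return [
        (matrix_word, neighbour)
        for neighbour, sim in vectors.most_similar(embedded_word, topn=topn)
        if sim >= min_sim
    ]

# Example: expand a hypothetical code-switched bigram.
print(expand_bigram("ngiya", "meeting"))
```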
For non-African languages, Rizvi et al. (2021)
developed a toolkit that generates multiple code-
switched sentences using either the Equivalence
Constraint theory or the Matrix Language Frame model. Its limitations are that it relies on a good sentence aligner and parser, as well as on parallel translated sentences as input; in principle, though, the approach should work on any language pair.
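The gist of such alignment-based generation can be illustrated with a deliberately simplified sketch: given a word-aligned parallel sentence pair, candidate code-switched sentences are produced by swapping aligned words into the matrix-language sentence, whereas the actual toolkit additionally filters candidates with syntactic constraints. The function name, alignment format and example pair below are illustrative assumptions, not the toolkit's API.

```python
# Toy illustration of alignment-based code-switch generation: swap aligned
# words from the embedded language into the matrix-language sentence.
# Constraint-based toolkits such as that of Rizvi et al. (2021) additionally
# enforce syntactic constraints via parsers; everything here is a
# simplified assumption, not that toolkit's API.
from itertools import combinations

def generate_switches(matrix_tokens, embedded_tokens, alignment, max_swaps=2):
    """Yield candidate code-switched sentences by replacing matrix-language
    tokens with their aligned embedded-language counterparts."""
    indices = list(alignment)
    for k in range(1, max_swaps + 1):
        for subset in combinations(indices, k):
            tokens = list(matrix_tokens)
            for i in subset:
                tokens[i] = embedded_tokens[alignment[i]]
            yield " ".join(tokens)

# Hypothetical Spanish-English aligned pair; the alignment maps matrix-token
# indices to embedded-token indices.
matrix = ["yo", "quiero", "café"]
embedded = ["I", "want", "coffee"]
for sentence in generate_switches(matrix, embedded, {1: 1, 2: 2}):
    print(sentence)  # "yo want café", "yo quiero coffee", "yo want coffee"
```

Winata et al. (2019)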
implemented a sequence-to-sequence model for
English–Mandarin code-switched data. Although
the model does not require external knowledge
regarding word alignments, it still relies on an exist-
ing English–Mandarin code-switched dataset and
parallel corpora. Liu et al. (2020) introduced an attention-informed zero-shot adaptation
method that relies on a limited number of parallel
word pairs. The languages covered are German,
Italian, Spanish and Thai, the latter two for natu-
ral language understanding. A shortcoming shared by the above-mentioned approaches is the limited diversity of the data: most existing code-switched datasets were collected from social media platforms such as Twitter, which restricts the types of code-switching they contain (Doğruöz et al., 2021).
To address this issue, Riktika et al. (2022) developed an
encoder-decoder translation model for controlled
code-switched generation. It uses monolingual
Hindi and a publicly available Hindi–English code-
switched dataset as input to generate data that is
faithful to syntactic and lexical attributes.
Yong et al. (2023) proposed an approach, based on prompting LLMs, that is independent of existing code-switched datasets and parallel corpora. Their
objective was to test whether multilingual LLMs