effort to increase diversity and reduce the model’s
propensity to default to certain words. Our goal is not to evaluate various prompting templates; however, we add linguistic guidelines to the prompts to further increase diversity and propose this as an approach towards language-agnostic prompting. We also test the performance of GPT-3.5 with few-shot in-context examples, specifically considering whether, and to what extent, GPT can support the generation of larger code-switched datasets.
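For illustration, the minimal sketch below shows one way such a guided few-shot prompt can be issued to GPT-3.5 through the OpenAI chat API; the guideline wording, the Afrikaans–English example and the sampling settings are illustrative assumptions rather than the exact prompts used in our experiments.

```python
# Minimal sketch of the prompting set-up, assuming the openai>=1.0 Python
# client. The guideline text, few-shot example and sampling settings are
# illustrative placeholders, not the exact prompts used in this work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINE = (
    "Produce intra-sentential code-switched sentences that mix both "
    "languages within one sentence. Vary the topic and sentence "
    "structure, and avoid repeating the same words across sentences."
)

# Hypothetical few-shot example for the target language pair.
FEW_SHOT = [
    {"role": "user", "content": "Language pair: Afrikaans-English. Generate one sentence."},
    {"role": "assistant", "content": "Ek gaan môre na die interview toe, wish my luck."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": GUIDELINE}]
    + FEW_SHOT
    + [{"role": "user", "content": "Language pair: Afrikaans-English. Generate five new sentences."}],
    temperature=1.0,  # higher temperature to encourage lexical diversity
)
print(response.choices[0].message.content)
```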
Our contributions are as follows: (i) we provide a
framework to increase the diversity of synthetically
generated code-switched data by prompting Ope-
nAI’s GPT; and (ii) we position GPT as a pivot to ad-
dress code-switched data scarcity in low-resource
languages while emphasising the need for native
speakers in the loop.
Increasing data availability is central to developing language models that serve multilingual communities. Our work is a step towards closing the gap for low-resource and under-represented languages.
2. Related Work
2.1. Code-Switching Research
Various types of code-switching have been identified, but the type that attracts the most academic research is intra-sentential code-switching, which can occur anywhere within a sentence boundary (Poplack, 1980) and, as a result, complicates evaluation (Poplack, 2001b). Another complex type is intra-word code-switching, in which a stem from one language is bound to morphology from another (Çetinoğlu et al., 2016; Van der Westhuizen and Niesler, 2018).
Over and above the issue of data diversity
(Winata et al., 2022), one of the major challenges in
code-switching studies is related to data availability
(Doğruöz et al., 2021). A survey by Winata et al. (2022) showed that, up until October 2022, a relatively small number of papers (ACL Anthology, 2023 and ISCA Proceedings, 2023) focused on code-switching research in African languages, with very few publicly available datasets. Eleven publications mention South African languages; the non-English South African languages referenced are isiZulu, isiXhosa, Setswana, Sesotho and Afrikaans. Only one of these publications includes Afrikaans code-switching (Niesler and De Wet, 2008), and no dataset was published. Van der Westhuizen and Niesler (2018) introduced the first corpus of isiZulu, isiXhosa, Setswana and Sesotho code-switching, curated from transcribed soap-opera speech data; eight of the papers make use of this dataset, focusing mainly on automatic speech recognition (ASR) systems.
Code-switching in Kiswahili–English is studied in
two papers but no datasets were made available
(Otundo and Grice, 2022; Piergallini et al., 2016).
In addition to the papers covered by the survey of Winata et al. (2022), one other paper was found that addresses Sepedi–English code-switching: Modipa et al. (2013) develop a corpus from a set of radio broadcasts to evaluate the implications of code-switching for ASR systems, and this dataset is publicly available. This
brief review of the state of code-switching research
in an African context motivates our work to develop
methods for addressing data scarcity.
A predominant approach to mitigating data avail-
ability issues involves augmenting existing datasets
through the generation of synthetic code-switched
data. Methods used to augment the South African speech corpus mentioned earlier include using word embeddings to find words similar to those in the sparse training data and thereby synthesise new code-switched bigrams (Westhuizen and Niesler, 2017). Biswas et al. (2018) evaluated augmenting the dataset with out-of-domain monolingual data and with code-switched data synthesised by an LSTM.
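As a concrete illustration of the embedding-substitution idea, the sketch below proposes new bigrams by replacing the embedded-language word of an observed code-switched bigram with its nearest neighbours in an embedding space; the embedding file, similarity threshold and example bigram are assumptions, and this shows the general pattern rather than the exact procedure of Westhuizen and Niesler (2017).

```python
# Sketch of the general embedding-substitution pattern: given an observed
# code-switched bigram, propose new bigrams by replacing the embedded-
# language word with its nearest neighbours in a word-embedding space.
# The embedding file and threshold are assumptions, not the exact
# procedure of Westhuizen and Niesler (2017).
from gensim.models import KeyedVectors

# Hypothetical pre-trained embeddings for the embedded language.
vectors = KeyedVectors.load_word2vec_format("embeddings.vec")

def expand_bigram(matrix_word, embedded_word, topn=5, min_sim=0.6):
    """Return synthetic bigrams (matrix_word, neighbour) for neighbours
    of embedded_word that are sufficiently similar to it."""
    if embedded_word not in vectors:
        return []
    return [
        (matrix_word, neighbour)
        for neighbour, sim in vectors.most_similar(embedded_word, topn=topn)
        if sim >= min_sim
    ]

# Example: expand a hypothetical code-switched bigram.
print(expand_bigram("ngiya", "meeting"))
```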
For non-African languages, Rizvi et al. (2021)
developed a toolkit that generates multiple code-
switched sentences using either the Equivalence
Constraint theory or the Matrix Language Frame model. Its limitations are that it relies on a good sentence aligner and parser, as well as on parallel translated sentences as input; in principle, though, the approach should work on any language pair.
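The gist of such alignment-based generation can be illustrated with a deliberately simplified sketch: given a word-aligned parallel sentence pair, candidate code-switched sentences are produced by swapping aligned words into the matrix-language sentence, whereas the actual toolkit additionally filters candidates with syntactic constraints. The function name, alignment format and example pair below are illustrative assumptions, not the toolkit's API.

```python
# Toy illustration of alignment-based code-switch generation: swap aligned
# words from the embedded language into the matrix-language sentence.
# Constraint-based toolkits such as that of Rizvi et al. (2021) additionally
# enforce syntactic constraints via parsers; everything here is a
# simplified assumption, not that toolkit's API.
from itertools import combinations

def generate_switches(matrix_tokens, embedded_tokens, alignment, max_swaps=2):
    """Yield candidate code-switched sentences by replacing matrix-language
    tokens with their aligned embedded-language counterparts."""
    indices = list(alignment)
    for k in range(1, max_swaps + 1):
        for subset in combinations(indices, k):
            tokens = list(matrix_tokens)
            for i in subset:
                tokens[i] = embedded_tokens[alignment[i]]
            yield " ".join(tokens)

# Hypothetical Spanish-English aligned pair; the alignment maps matrix-token
# indices to embedded-token indices.
matrix = ["yo", "quiero", "café"]
embedded = ["I", "want", "coffee"]
for sentence in generate_switches(matrix, embedded, {1: 1, 2: 2}):
    print(sentence)  # "yo want café", "yo quiero coffee", "yo want coffee"
```

Winata et al. (2019)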
implemented a sequence-to-sequence model for
English–Mandarin code-switched data. Although
the model does not require external knowledge
regarding word alignments, it still relies on an exist-
ing English–Mandarin code-switched dataset and
parallel corpora. Liu et al. (2020) introduced an attention-informed zero-shot adaptation
method that relies on a limited number of parallel
word pairs. The languages covered are German,
Italian, Spanish and Thai, the latter two for natu-
ral language understanding. A shortcoming shared by the above-mentioned approaches is the limited diversity of the data: most existing code-switched datasets were collected from social media platforms such as Twitter, which restricts the types of code-switching they contain (Doğruöz et al., 2021).
To address this issue, Riktika et al. (2022) developed an
encoder-decoder translation model for controlled
code-switched generation. It uses monolingual
Hindi and a publicly available Hindi–English code-
switched dataset as input to generate data that is
faithful to syntactic and lexical attributes.
Yong et al. (2023) proposed an approach, based on prompting LLMs, that is independent of existing code-switched datasets and parallel corpora. Their
objective was to test whether multilingual LLMs