Hokkien-Indonesian Translator with Quantized Low-Rank Adaptation (QLoRA) Fine-Tuning on Large Language Model Meta AI (LLaMA)
Abstract
Hokkien is a dialect within the Southern Min language group and a low-resource language (LRL). Natural language processing (NLP) tasks such as neural machine translation (NMT) for LRLs can apply fine-tuning to mitigate the limited availability of parallel corpora. A previous study fine-tuned a large language model (LLM) for the Hokkien translation task but did not support Indonesian; this study therefore fine-tunes an LLM to develop a language model that supports the Hokkien-Indonesian translation task. The method used is quantized low-rank adaptation (QLoRA) fine-tuning, which requires substantially less computing than full fine-tuning. The LLM used is Taigi-Llama-2-Translator-7B, a LLaMA 2-based model with 7 billion parameters. The training data consist of dictionary, terminology, news, and transcription text datasets from the websites of the Taiwan Ministry of Education, Hugging Face, and National Yang Ming Chiao Tung University (NYCU). Model performance before and after fine-tuning is evaluated with BLEU and chrF++. The results show that after fine-tuning the model supports Indonesian translation, with Hokkien-Indonesian performance improving from an average BLEU score of 0.01 to 0.16 and an average chrF++ score of 3.48 to 23.77.
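For readers unfamiliar with QLoRA, the setup described above can be sketched with the Hugging Face Transformers, PEFT, and bitsandbytes libraries. This is a minimal illustration, not the thesis's actual configuration: the Hugging Face repository id and all hyperparameters (rank, alpha, dropout, target modules) are assumptions.

```python
# Minimal QLoRA sketch: 4-bit quantized base model + trainable low-rank adapters.
# Repository id and hyperparameters are illustrative assumptions, not thesis values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "Bohanlu/Taigi-Llama-2-Translator-7B"  # assumed repo id

# 4-bit NF4 quantization: the "Q" in QLoRA. Weights are stored in 4 bits,
# while computation runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters: only these small matrices are trained; the quantized
# base weights stay frozen, which is what keeps memory usage low.
lora_config = LoraConfig(
    r=16,            # adapter rank (assumed)
    lora_alpha=32,   # scaling factor (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the adapter matrices receive gradients, the memory footprint is dominated by the 4-bit base weights, which is what makes fine-tuning a 7B-parameter model feasible on a single consumer GPU.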
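The BLEU and chrF++ evaluation can likewise be sketched with the sacreBLEU toolkit, where chrF++ corresponds to chrF with word_order=2. The sentences below are placeholders. Note that sacreBLEU reports both metrics on a 0-100 scale, whereas the abstract's BLEU figures appear to use a 0-1 scale, so the thesis may rescale or use a different implementation.

```python
# Evaluation sketch with sacreBLEU; hypothesis/reference strings are placeholders.
import sacrebleu

hypotheses = ["Saya pergi ke pasar.", "Dia sedang makan."]
# references is a list of reference streams: references[0][i] is the
# reference translation for hypotheses[i] (single-reference case here).
references = [["Saya pergi ke pasar.", "Dia sedang makan nasi."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 -> chrF++

print(f"BLEU:   {bleu.score:.2f}")   # 0-100 scale
print(f"chrF++: {chrf.score:.2f}")   # 0-100 scale
```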