Show simple item record

dc.contributor.advisor	Sitompul, Opim Salim
dc.contributor.advisor	Mantoro, Teddy
dc.contributor.advisor	Nababan, Erna Budhiarti
dc.contributor.author	Amalia, Amalia
dc.date.accessioned	2022-10-26T08:42:42Z
dc.date.available	2022-10-26T08:42:42Z
dc.date.issued	2021
dc.identifier.uri	https://repositori.usu.ac.id/handle/123456789/51040
dc.description.abstract	Conventional word embedding methods treat the word as the smallest independent unit and ignore its internal structure. As a result, the embedding cannot handle out-of-vocabulary (OOV) words and cannot capture explicit morphological and syntactic relationships. One solution is to build a word's embedding from its subwords, such as characters or morphemes. This study aims to generate subword embeddings for bahasa Indonesia that take additional information about Indonesian morphology into account. The biggest challenges are tokenizing each word into its subwords and merging the subword encodings back into a full-word representation. The framework of this study consists of word segmentation using Byte Pair Encoding (BPE), subword encoding using the GloVe algorithm, merging the subword encodings into a whole-word encoding by addition, and subword embedding using Skip-gram word2vec. The training corpus is taken from the Indonesian Wikipedia and contains 729,260,836 words, of which 364,184 are unique, in 746 MB of text. The implemented model produces a pre-trained subword embedding of 385,446 tokens in 950.3 MB. Evaluation uses a series of analogy test sets measuring semantic and syntactic relations, adapted for bahasa Indonesia from the Google and BATS benchmark analogy test sets. The results show that the model handles OOV words and performs well in capturing semantic and syntactic relations. As additional contributions, this study also produced a combined corpus of Wikipedia and crawled newspaper text totalling 1,652,081,275 words, pre-trained word vectors for bahasa Indonesia built with the word2vec and fastText algorithms, and an intrinsic evaluation test set of 4,935 analogy questions for bahasa Indonesia. (An illustrative sketch of the subword-composition step follows this record.)	en_US
dc.language.iso	id	en_US
dc.publisher	Universitas Sumatera Utara	en_US
dc.subject	Byte Pair Encoding	en_US
dc.subject	Morphology of Bahasa Indonesia	en_US
dc.subject	Subword Embedding	en_US
dc.subject	Word Embedding	en_US
dc.title	Subword Embedding dengan Pendekatan Byte Pair Encoding dan Morfologi Bahasa Indonesia	en_US
dc.type	Thesis	en_US
dc.identifier.nim	MIM178123001
dc.identifier.nidn	NIDN0017086108
dc.identifier.nidn	NIDN0026106209
dc.identifier.kodeprodi	KODEPRODI55001#Ilmu Komputer
dc.description.pages	140 pages	en_US
dc.description.type	Doctoral Dissertation	en_US
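
The record above describes a pipeline of word segmentation with Byte Pair Encoding (BPE), subword encoding with GloVe, merging subword encodings into a whole-word vector by addition, and Skip-gram word2vec training. The Python sketch below is only a rough illustration of the subword-composition step under assumptions of my own: the toy subword vectors, the greedy longest-match segmentation (a stand-in for applying learned BPE merges), and the example word are hypothetical and do not come from the thesis.

import numpy as np

# Hypothetical toy subword vectors; in the study these would be GloVe
# encodings of BPE subwords trained on the Indonesian Wikipedia corpus.
DIM = 4
rng = np.random.default_rng(0)
subword_vectors = {
    "mem": rng.normal(size=DIM),
    "baca": rng.normal(size=DIM),
    "kan": rng.normal(size=DIM),
}

def segment(word, vocab, max_len=6):
    # Greedy longest-match segmentation into known subwords; a simplified
    # stand-in for applying BPE merge operations learned from a corpus.
    pieces, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character kept as a single piece
            i += 1
    return pieces

def word_vector(word, vocab):
    # Merge subword encodings into a whole-word vector by addition, the
    # composition rule named in the abstract; this is what lets an
    # out-of-vocabulary word still receive a vector.
    known = [vocab[p] for p in segment(word, vocab) if p in vocab]
    return np.sum(known, axis=0) if known else np.zeros(DIM)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# "membacakan" is treated here as an OOV full word: it is segmented into
# subwords and composed into a single vector; the script prints the
# segmentation and the cosine similarity to the "baca" subword vector.
v = word_vector("membacakan", subword_vectors)
print(segment("membacakan", subword_vectors))
print(cosine(v, subword_vectors["baca"]))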

