Show simple item record

dc.contributor.advisor	Sitompul, Opim Salim
dc.contributor.advisor	Mantoro, Teddy
dc.contributor.advisor	Nababan, Erna Budhiarti
dc.contributor.author	Amalia, Amalia
dc.date.accessioned	2022-10-26T08:42:42Z
dc.date.available	2022-10-26T08:42:42Z
dc.date.issued	2021
dc.identifier.uri	https://repositori.usu.ac.id/handle/123456789/51040
dc.description.abstract	Conventional word embedding methods treat the word as the smallest independent unit and ignore its internal structure. As a result, the embedding cannot handle out-of-vocabulary (OOV) words and cannot capture explicit morphological and syntactic relationships. One solution is to build a word's embedding from its subwords, such as characters or morphemes. This study aims to generate subword embeddings for bahasa Indonesia that take additional information about Indonesian morphology into account. The biggest challenges are tokenizing each word into its subwords and merging the subword encodings back into a full-word representation. The framework of this study consists of word segmentation using Byte Pair Encoding (BPE), subword encoding using the GloVe algorithm, merging the subword encodings into a whole-word encoding by addition, and subword embedding using Skip-gram word2vec. The training corpus is taken from the Indonesian Wikipedia and contains 729,260,836 words, of which 364,184 are unique, in 746 MB of text. The implemented model produces a pre-trained subword embedding of 385,446 tokens in 950.3 MB. Evaluation uses a series of analogy test sets measuring semantic and syntactic relations, adapted for bahasa Indonesia from the Google and BATS benchmark analogy test sets. The results show that the model handles OOV words and performs well in capturing semantic and syntactic relations. As additional contributions, this study also produced a combined corpus of Wikipedia and crawled newspaper text totalling 1,652,081,275 words, pre-trained word vectors for bahasa Indonesia built with the word2vec and fastText algorithms, and an intrinsic evaluation test set of 4,935 analogy questions for bahasa Indonesia. (An illustrative sketch of the subword-composition step follows this record.)	en_US
dc.language.iso	id	en_US
dc.publisher	Universitas Sumatera Utara	en_US
dc.subject	Byte Pair Encoding	en_US
dc.subject	Morphology of Bahasa Indonesia	en_US
dc.subject	Subword Embedding	en_US
dc.subject	Word Embedding	en_US
dc.title	Subword Embedding dengan Pendekatan Byte Pair Encoding dan Morfologi Bahasa Indonesia	en_US
dc.type	Thesis	en_US
dc.identifier.nim	MIM178123001
dc.identifier.nidn	NIDN0017086108
dc.identifier.nidn	NIDN0026106209
dc.identifier.kodeprodi	KODEPRODI55001#Ilmu Komputer
dc.description.pages	140 pages	en_US
dc.description.type	Doctoral Dissertation	en_US
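
The record above describes a pipeline of word segmentation with Byte Pair Encoding (BPE), subword encoding with GloVe, merging subword encodings into a whole-word vector by addition, and Skip-gram word2vec training. The Python sketch below is only a rough illustration of the subword-composition step under assumptions of my own: the toy subword vectors, the greedy longest-match segmentation (a stand-in for applying learned BPE merges), and the example word are hypothetical and do not come from the thesis.

import numpy as np

# Hypothetical toy subword vectors; in the study these would be GloVe
# encodings of BPE subwords trained on the Indonesian Wikipedia corpus.
DIM = 4
rng = np.random.default_rng(0)
subword_vectors = {
    "mem": rng.normal(size=DIM),
    "baca": rng.normal(size=DIM),
    "kan": rng.normal(size=DIM),
}

def segment(word, vocab, max_len=6):
    # Greedy longest-match segmentation into known subwords; a simplified
    # stand-in for applying BPE merge operations learned from a corpus.
    pieces, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character kept as a single piece
            i += 1
    return pieces

def word_vector(word, vocab):
    # Merge subword encodings into a whole-word vector by addition, the
    # composition rule named in the abstract; this is what lets an
    # out-of-vocabulary word still receive a vector.
    known = [vocab[p] for p in segment(word, vocab) if p in vocab]
    return np.sum(known, axis=0) if known else np.zeros(DIM)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# "membacakan" is treated here as an OOV full word: it is segmented into
# subwords and composed into a single vector; the script prints the
# segmentation and the cosine similarity to the "baca" subword vector.
v = word_vector("membacakan", subword_vectors)
print(segment("membacakan", subword_vectors))
print(cosine(v, subword_vectors["baca"]))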

