Coreference Resolution untuk Teks Bahasa Indonesia Menggunakan Random Forest Classifier

Sari, Nia Ulan

Coreference Resolution for Indonesian Text Using Random Forest Classifier

Date

2024

Advisor(s)

Purnamawati, Sarah

Rahmat, Romi Fadillah

Abstract

Coreference resolution is a subtask of Natural Language Processing (NLP) that identifies which expressions in a text refer to the same entity. In Indonesian text, and in novels in particular, coreference resolution is important because the language is complex and the entities and references are richly varied. Characters and entities in novels interact frequently, and references to the same character recur throughout the text. A further difficulty is that possessive pronouns in novel text are often written as affixes rather than as separate words, which can obscure which entity a reference points to. This research therefore performs coreference resolution for Indonesian text by detecting affixed possessive pronouns with a Random Forest Classifier. It also uses Part-of-Speech Tagging (POS Tag) and Named Entity Recognition (NER) to maximize the detection of person entities. With 18 novel texts as training data and 10 novel texts as test data, the pre-processing stage yields 109,306 entity-pronoun pairs in the training data and 4,938 pairs in the test data. RandomSearchCV is used to find the best hyperparameters for the Random Forest Classifier during training. Evaluated with a confusion matrix over all test data, the model achieves an accuracy of 85.5%, a precision of 85%, a recall of 82.2%, and an F1-score of 83.6%.
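The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the task is framed as a binary decision per entity-pronoun pair (coreferent or not), which the abstract suggests but does not spell out. The `ConfusionMatrix` class, the `f1_score` helper, and the counts in `toy` are hypothetical and are not taken from the thesis; only the precision and recall values in the last step are the reported ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary coreferent / not-coreferent decision (assumed framing)."""
    tp: int  # pairs correctly predicted as coreferent
    fp: int  # pairs wrongly predicted as coreferent
    fn: int  # coreferent pairs the classifier missed
    tn: int  # pairs correctly predicted as not coreferent

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not the thesis's confusion matrix).
toy = ConfusionMatrix(tp=40, fp=10, fn=10, tn=40)
toy_f1 = f1_score(toy.precision(), toy.recall())

# Consistency check using the precision and recall reported in the abstract.
reported_f1 = f1_score(0.85, 0.822)
print(f"{reported_f1:.3f}")  # prints 0.836
```

Applying the F1 formula to the reported precision (85%) and recall (82.2%) gives about 83.6%, which agrees with the F1-score stated in the abstract.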

URI

https://repositori.usu.ac.id/handle/123456789/96434

Collections

Undergraduate Theses [866]