Sistem Pencarian Dokumen Bahasa Indonesia Berbasis Konten Visual dan OCR Menggunakan Pendekatan Deep Learning

Saragih, Wina Octaria

dc.contributor.advisor	Amalia
dc.contributor.advisor	Ginting, Dewi Sartika
dc.contributor.author	Saragih, Wina Octaria
dc.date.accessioned	2025-07-19T02:17:22Z
dc.date.available	2025-07-19T02:17:22Z
dc.date.issued	2025
dc.identifier.uri	https://repositori.usu.ac.id/handle/123456789/105798
dc.description.abstract	The growth of digital technology has produced large collections of documents stored as images or scans, which are difficult to search with traditional keyword-based methods. This research develops an automatic, visual content-based search system for Indonesian-language documents using Optical Character Recognition (OCR) and a deep learning approach. The methodology has three main components. First, the Tesseract OCR module extracts text from document images, with preprocessing to improve accuracy. Second, the Word2Vec algorithm with the Skip-gram architecture (vector size 300, window 10, min_count 2) converts text into low-dimensional semantic vectors. Third, the Transformer model paraphrase-multilingual-MiniLM-L12-v2 serves as an initial filtering mechanism, with a threshold of 0.2 used to determine document relevance. The system was built with the Django framework as the backend and PostgreSQL for metadata management, with the data split 80% for training and 20% for testing. Evaluation was performed on 50 searches, with query lengths ranging from 16 to 472 tokens, against 1,000 PDF documents. The system achieved a Precision@10 of 1.0, an average Recall of 0.911692, an F1-score of 0.952634, and MAP@10, MRR@10, and NDCG@10 values of 1.0 each. Search time ranged from 1.95 to 12.86 seconds, with an average of 4.43 seconds. This variation was driven more by the complexity of the OCR and semantic matching steps than by query length. The system proved effective and consistent in finding and ranking relevant documents.	en_US
dc.language.iso	id	en_US
dc.publisher	Universitas Sumatera Utara	en_US
dc.subject	Cosine Similarity	en_US
dc.subject	Deep learning	en_US
dc.subject	Document Search	en_US
dc.subject	OCR (Optical Character Recognition)	en_US
dc.subject	Transformer	en_US
dc.subject	Word2Vec	en_US
dc.title	Sistem Pencarian Dokumen Bahasa Indonesia Berbasis Konten Visual dan OCR Menggunakan Pendekatan Deep Learning	en_US
dc.title.alternative	Indonesian Document Search System Based on Visual Content and OCR Using Deep Learning Approach	en_US
dc.type	Thesis	en_US
dc.identifier.nim	NIM201401110
dc.identifier.nidn	NIDN0121127801
dc.identifier.nidn	NIDN0104059001
dc.identifier.kodeprodi	KODEPRODI55201#Ilmu Komputer
dc.description.pages	161 Pages	en_US
dc.description.type	Skripsi Sarjana	en_US
dc.subject.sdgs	SDGs 4. Quality Education	en_US
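
The abstract describes a relevance filter with a threshold of 0.2 followed by ranking, evaluated with Precision@10 and MRR@10. The sketch below illustrates that idea under stated assumptions and is not the thesis implementation: it assumes the threshold is applied to cosine similarity (the record lists "Cosine Similarity" as a subject but does not state this explicitly), it uses tiny hand-written vectors in place of real Word2Vec or Transformer embeddings, and the function names `filter_and_rank`, `precision_at_k`, and `reciprocal_rank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_and_rank(query_vec, doc_vecs, threshold=0.2):
    """Keep documents whose similarity is at least `threshold`, best match first.

    `doc_vecs` maps a document id to its embedding. The 0.2 default mirrors the
    threshold in the abstract; applying it to cosine similarity is an assumption.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k positions occupied by relevant documents (divides by k)."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document within the top k, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings would have hundreds of dimensions.
    query = [1.0, 0.0]
    documents = {
        "doc_a": [1.0, 0.0],   # same direction as the query
        "doc_b": [1.0, 1.0],   # partially similar
        "doc_c": [0.0, 1.0],   # orthogonal, falls below the threshold
        "doc_d": [-1.0, 0.0],  # opposite direction, falls below the threshold
    }
    ranked = filter_and_rank(query, documents)
    ranked_ids = [doc_id for doc_id, _ in ranked]
    print(ranked_ids)
    print(precision_at_k(ranked_ids, {"doc_a", "doc_b"}, k=2))
    print(reciprocal_rank(ranked_ids, {"doc_b"}))
```

MRR@10 would then be the mean of `reciprocal_rank` over all evaluation queries, and Precision@10 the mean of `precision_at_k` with `k=10`.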

Files in this item

Name: Sistem Pencarian Dokumen Bahasa ...
Size: 676.7Kb
Format: PDF
Description: Cover

Name: Wina Octaria Saragih_Sistem ...
Size: 4.082Mb
Format: PDF
Description: Fulltext

This item appears in the following Collection(s)

Undergraduate Theses [1254]
Skripsi Sarjana
