
    Sistem Pencarian Dokumen Bahasa Indonesia Berbasis Konten Visual dan OCR Menggunakan Pendekatan Deep Learning

    Indonesian Document Search System Based on Visual Content and OCR Using Deep Learning Approach

    Date
    2025
    Author
    Saragih, Wina Octaria
    Advisor(s)
    Amalia
    Ginting, Dewi Sartika
    Abstract
    The growth of digital technology has produced large collections of documents, many of them stored as images or scans, which are difficult to search with traditional keyword-based methods. This research develops an automatic, visual content-based search system for Indonesian-language documents using Optical Character Recognition (OCR) and a deep learning approach. The methodology has three main components. First, the Tesseract OCR module extracts text from document images, with preprocessing applied to improve accuracy. Second, the Word2Vec algorithm with the Skip-gram architecture (vector size 300, window 10, min_count 2) converts the text into low-dimensional semantic vectors. Third, the Transformer model paraphrase-multilingual-MiniLM-L12-v2 serves as an initial filtering mechanism, with a threshold of 0.2 used to determine document relevance. The system was built with the Django framework as the backend and PostgreSQL for metadata management, and the data were split 80% for training and 20% for testing. Evaluation covered 50 searches, with query lengths ranging from 16 to 472 tokens, against 1,000 PDF documents. The results showed strong performance: Precision@10 of 1.0, average Recall of 0.911692, F1-score of 0.952634, and MAP@10, MRR@10, and NDCG@10 each reaching 1.0. Search time ranged from 1.95 to 12.86 seconds, with an average of 4.43 seconds; this variation is driven more by the complexity of the OCR and semantic matching processes than by query length. The system proved effective and consistent in finding and ranking relevant documents.
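The abstract reports ranking metrics without stating how they were computed. The following is a minimal sketch of one common set of binary-relevance definitions, not the thesis's own evaluation code. The function names, the document IDs, and the choice to normalize average precision by min(|relevant|, k) are assumptions made here for illustration.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)

def average_precision_at_k(ranked, relevant, k=10):
    """Mean of precision values at each relevant hit within the top k.
    Assumption: normalized by min(|relevant|, k); other variants exist."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Normalized discounted cumulative gain with binary gains."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical query result: every returned document is relevant,
# but one relevant document ("d5") is never retrieved.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d2", "d3", "d4", "d5"}

p = precision_at_k(ranked, relevant, k=3)
r = recall(ranked, relevant)
print(p, r, f1_score(p, r))
print(average_precision_at_k(ranked, relevant, k=3),
      reciprocal_rank_at_k(ranked, relevant, k=3),
      ndcg_at_k(ranked, relevant, k=3))
```

In this hypothetical case the top-3 precision is 1.0 while recall is 0.8, which illustrates how a system can reach perfect Precision@k, MAP@k, MRR@k, and NDCG@k while its recall stays below 1.0. MAP@k and MRR@k are the means of the per-query average precision and reciprocal rank over all queries.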
    URI
    https://repositori.usu.ac.id/handle/123456789/105798
    Collections
    • Undergraduate Theses [1235]

    Repositori Institusi Universitas Sumatera Utara - 2025

    Universitas Sumatera Utara


