Deteksi Deepfake pada Wajah dalam Video dengan Cross-Attention Multi-Scale Vision Transformer dan EfficientNet

Manurung, Gery Jonathan

Deepfake Detection on Faces in Video with Cross-Attention Multi-Scale Vision Transformer and EfficientNet

Date

2025

Advisor(s)

Nasution, Umaya Ramadhani Putri

Sawaluddin

Abstract

The proliferation of sophisticated deepfake videos poses a serious threat to digital trust and security, and it demands detection systems that are both accurate and computationally efficient enough for practical use. This research designs, implements, and evaluates a hybrid architecture that balances accuracy with inference efficiency for video-based deepfake detection. The proposed model integrates EfficientNet-B1 as a feature extractor with a Cross-Attention Multi-Scale Vision Transformer (Cross-ViT) for context modeling. The model was trained on a combination of the FaceForensics++ and Celeb-DF (v2) datasets and evaluated on an out-of-distribution dataset, DeepFakeDetection (DFD), to test its generalization. On DFD, it achieves an Area Under the Curve (AUC) of 92.35% and a video-level F1-score of 83.62%. The model's primary advantage is its computational efficiency: per-frame inference requires only 0.349 GFLOPs, despite a large parameter count (114.33 million). The study also finds that a small batch size, Face-Cutout augmentation, and a Binary Cross-Entropy (BCE) loss function contribute significantly to improved generalization and effective video-level aggregation. These results support the hybrid architecture as an efficient and scalable option for practical deepfake detection that balances accuracy, inference speed, and model size.
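
The abstract reports a video-level F1-score but does not state how per-frame predictions are combined into a video-level decision. As a minimal sketch only, assuming the per-frame fake probabilities are averaged and the mean is thresholded at 0.5 (both are assumptions for illustration, not the thesis's confirmed method), the aggregation and the F1 computation could look like this. The function names, video names, and probability values are hypothetical.

```python
from statistics import mean


def aggregate_video(frame_probs, threshold=0.5):
    """Average per-frame fake probabilities and threshold the mean.

    Mean aggregation and the 0.5 threshold are assumptions for this sketch.
    Returns (video_score, predicted_label), where label 1 means "fake".
    """
    score = mean(frame_probs)
    return score, int(score >= threshold)


def f1_score(y_true, y_pred):
    """Binary F1 with "fake" (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-frame model outputs for three videos (illustrative values).
videos = {
    "video_a": [0.91, 0.85, 0.40, 0.77],
    "video_b": [0.10, 0.22, 0.35, 0.05],
    "video_c": [0.55, 0.48, 0.30, 0.41],
}
# Hypothetical ground truth: 1 = fake, 0 = real.
labels = {"video_a": 1, "video_b": 0, "video_c": 1}

preds = {name: aggregate_video(probs)[1] for name, probs in videos.items()}
video_f1 = f1_score(
    [labels[name] for name in videos],
    [preds[name] for name in videos],
)
print(preds, round(video_f1, 4))
```

Averaging is only one possible choice; a maximum or a majority vote over frames would be equally plausible, and each trades sensitivity to a few manipulated frames against robustness to noisy per-frame scores.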

URI

https://repositori.usu.ac.id/handle/123456789/105815

Collections

Undergraduate Theses [873]