Detection of AI-Generated Images and Videos Using Vision Transformers

Dinesh M, Jeffrin Hannah

doi:10.21275/SR26227161330

Abstract: This study presents a deep learning framework for detecting AI-generated images and videos using transformer-based architectures. Image classification is performed using a Vision Transformer (ViT-B/16) model trained on standardized 224 × 224 inputs, while video detection employs a frame-based strategy with a ResNet50 backbone and prediction averaging across uniformly sampled frames. A structured preprocessing pipeline ensures input consistency, and the system is integrated into a secure web-based interface for real-time inference. Experimental evaluation on a binary real-versus-AI-generated dataset achieves 93.28% test accuracy, with precision of 0.9492, recall of 0.9145, and F1-score of 0.9316. The results demonstrate that transformer-based image representations combined with frame-level video aggregation provide an effective approach for reliable detection of AI-generated media.
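
The frame-based video strategy mentioned in the abstract (uniform frame sampling followed by prediction averaging) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the number of sampled frames, the decision threshold, or how frame indices are chosen, so the function names, the midpoint sampling rule, and the 0.5 threshold below are all assumptions made for illustration. The per-frame probabilities stand in for the outputs of a frame classifier such as the ResNet50 backbone.

```python
from typing import List, Sequence


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices spread evenly over a video (hypothetical helper).

    Assumption: the video is split into num_samples equal segments and the
    frame nearest the middle of each segment is taken.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def average_fake_probability(frame_probs: Sequence[float]) -> float:
    """Average the per-frame 'AI-generated' probabilities into one video score."""
    if not frame_probs:
        raise ValueError("at least one frame probability is required")
    return sum(frame_probs) / len(frame_probs)


def classify_video(frame_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Label a video from its averaged score (the threshold is an assumed value)."""
    score = average_fake_probability(frame_probs)
    return "ai_generated" if score >= threshold else "real"
```

In use, the indices from `uniform_frame_indices` would select which decoded frames are passed to the frame classifier, and the resulting probabilities would be passed to `classify_video`. Averaging keeps a few misclassified frames from deciding the label for the whole video, which is one plausible reason for aggregating at the frame level.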

Keywords: Deepfake detection, Vision Transformer (ViT), ViT-B/16, ResNet50, AI-generated media, Frame-based video analysis, Multimedia forensics, Transformer-based classification

How to Cite?: Dinesh M, Jeffrin Hannah, "Detection of AI-Generated Images and Videos Using Vision Transformers", Volume 15 Issue 3, March 2026, International Journal of Science and Research (IJSR), Pages: 371-380, https://www.ijsr.net/getabstract.php?paperid=SR26227161330, DOI: https://dx.doi.org/10.21275/SR26227161330
