Image Captioning for the Visually Impaired Using Transformer

Authors

  • Thịnh Nguyễn Văn, Ho Chi Minh City University of Education
  • Khiêm Nguyễn Thiên, Ho Chi Minh City University of Education
  • Đạt Đỗ Đức, Ho Chi Minh City University of Education
  • Trí Nguyễn Ngọc Hoài, Ho Chi Minh City University of Education
  • Thịnh Võ Văn, Ho Chi Minh City University of Education
  • Thích Văn Thịnh, Ho Chi Minh City University of Education

Keywords:

Automatic image captioning, Encoder-Decoder model, Swin Transformer, Transformer Decoder, assistive application for the visually impaired

Abstract

Visual impairment affects millions worldwide, posing significant challenges in accessing visual information. With the rapid development of mobile devices, particularly the Android platform, image-to-audio description applications are becoming increasingly popular. However, generating accurate, context-rich descriptions that are compatible with mobile deployment remains a considerable challenge. This study proposes an image captioning model based on the encoder-decoder architecture, in which the Swin Transformer is employed to extract hierarchical visual features and a Transformer Decoder is used to generate textual descriptions. The model is trained on standard datasets such as MS COCO and Flickr30k and fine-tuned on the Vietnamese-language KTVIC dataset to enhance its applicability in local contexts. Experimental results show that the model performs well on standard evaluation metrics including BLEU, METEOR, and CIDEr. In addition to the model, we developed an Android application that integrates image captioning and text-to-speech functionality, enabling real-time spoken descriptions to assist visually impaired users in accessing image content. The application demonstrates stable performance and responsive inference time under practical conditions. These results highlight the potential of the proposed approach in improving visual information accessibility for the visually impaired community.
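As an illustration of one of the metrics named above, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty). It is not the authors' evaluation code: the function names `ngrams`, `modified_precision`, and `sentence_bleu` are hypothetical, the example captions are invented for illustration, and no smoothing is applied. Reported BLEU scores are normally computed at corpus level with a standard toolkit, so values from this sketch should not be compared with published results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    c = len(candidate)
    # Brevity penalty uses the reference length closest to the candidate's
    # (ties resolved toward the shorter reference).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Invented example captions (not taken from MS COCO, Flickr30k, or KTVIC).
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
print(modified_precision(degenerate, references, 1))
print(sentence_bleu(references[0], references))
```

The clipping step is what keeps a degenerate caption that repeats one frequent word from scoring well on unigram precision, and the brevity penalty discourages captions that are much shorter than the references.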

Published

27-10-2025

How to Cite

Nguyễn Văn, T., Nguyễn Thiên, K., Đỗ Đức, Đ., Nguyễn Ngọc Hoài, T., Võ Văn, T., & Văn Thịnh, T. (2025). Chú thích ảnh cho người khiếm thị sử dụng Transformer [Image captioning for the visually impaired using Transformer]. HUFLIT Journal of Science, 9(3), 14. Retrieved from https://hjs.huflit.edu.vn/index.php/hjs/article/view/269

Section

Science and Technology
