Transformer-Based Image Captioning for the Visually Impaired
Keywords:
Automatic image captioning, encoder-decoder model, Swin Transformer, Transformer Decoder, assistive applications for the visually impaired

Abstract
Visual impairment affects millions of people worldwide, posing significant challenges in accessing visual information. With the rapid development of mobile devices, particularly the Android platform, applications that describe images through audio are becoming increasingly popular. However, generating accurate, context-rich descriptions with a model light enough for mobile deployment remains a considerable challenge. This study proposes an image captioning model based on the encoder-decoder architecture, in which a Swin Transformer extracts hierarchical visual features and a Transformer Decoder generates textual descriptions. The model is trained on the standard MS COCO and Flickr30k datasets and fine-tuned on the Vietnamese-language KTVIC dataset to improve its applicability in local contexts. Experimental results show that the model performs well on standard evaluation metrics, including BLEU, METEOR, and CIDEr. In addition to the model, we developed an Android application that integrates image captioning with text-to-speech, producing real-time spoken descriptions that help visually impaired users access image content. The application demonstrates stable performance and responsive inference times under practical conditions. These results highlight the potential of the proposed approach for improving access to visual information for the visually impaired community.
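To make the described architecture concrete, the sketch below shows one plausible PyTorch implementation of such an encoder-decoder captioner: a Swin Transformer backbone supplies a sequence of visual features that a standard Transformer Decoder attends to while predicting caption tokens. This is a minimal illustration under assumptions, not the authors' implementation: the framework (PyTorch with the timm library), the backbone variant (swin_tiny_patch4_window7_224), and all layer sizes and the vocabulary size are hypothetical choices.

import torch
import torch.nn as nn
import timm

class CaptionModel(nn.Module):
    """Illustrative Swin-encoder / Transformer-decoder captioning model."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        # Swin Transformer backbone; num_classes=0 drops the classification head.
        # Model name and sizes are assumptions, not the paper's settings.
        self.encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=True, num_classes=0)
        enc_dim = self.encoder.num_features            # 768 for swin_tiny
        self.proj = nn.Linear(enc_dim, d_model)        # match decoder width
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # Flatten the final Swin feature map into a sequence the decoder
        # can cross-attend to; timm returns (B, H, W, C) or (B, L, C)
        # depending on version, so handle both.
        feats = self.encoder.forward_features(images)
        if feats.dim() == 4:
            feats = feats.flatten(1, 2)                # -> (B, L, C)
        memory = self.proj(feats)
        # Teacher-forced autoregressive decoding with a causal mask.
        x = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        out = self.decoder(x, memory, tgt_mask=mask)
        return self.head(out)                          # (B, T, vocab_size) logits

In a setup like this, training would minimize cross-entropy between the logits and the caption tokens shifted by one position, while inference would run the decoder autoregressively (greedy or beam search) from a begin-of-sentence token; the finished caption could then be handed to the device's text-to-speech engine, as the application described above does.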
