Disclaimer: All code in VAVI was written by me, except for the submodules and the BLIP-related snippets in server.py. Record.js and Camera.js were written by me with the help of the React Native and Expo documentation.
Directions: Download, extract, and run.
Today, more than 295 million people worldwide live with moderate-to-severe visual impairment, and 43 million are blind [1]. The visually impaired face many difficulties in day-to-day tasks, which heavily affect their quality of life [2], and many seek to become more independent [3, 4]. One such difficulty is perceiving their surroundings and evaluating information about common household objects and scenes. Organizations such as the Vista Center for the Blind and Visually Impaired run independence immersion programs specifically designed to help the visually impaired regain control of their daily lives [5].

Assistive technologies for navigation and text reading also help the visually impaired become more independent. Numerous digital assistants have been built to help the visually impaired read text through voice recognition, along with navigation technologies that help them move about their environment [6, 7, 8]. For instance, there are canes that leverage a vision transformer model to help the visually impaired detect their surroundings [6]. Similarly, DAVID, a digital assistant software, is designed to recognize text on real-world objects and provide audio feedback in real time, using voice user interface technology such as speech recognition and speech synthesis for voice-driven interaction [7]. However, these digital assistants fail to address a critical need: the ability to obtain information about objects and surroundings without a human assistant to identify and describe the environment.

We therefore present VAVI, a Virtual Assistant for the Visually Impaired. VAVI is a mobile application for both iOS and Android that aids the visually impaired in their day-to-day tasks by answering questions about photos they take. The user takes a photo, then prompts VAVI by recording an audio question. These prompts are sent to a cloud server that performs few-shot, downstream vision-language tasks on household images using the SMMC framework, composing the dense image captioning capabilities of a vision-language model with the powerful reasoning abilities of a large language model [9].
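To make the pipeline concrete, below is a minimal sketch of what the cloud-server entry point could look like, assuming a Flask server. The route name, field names, and port are illustrative rather than taken from the actual server.py, and the helpers `caption_image` and `answer_question` are hypothetical; they are sketched after the next paragraph.

```python
# Minimal sketch of the cloud-server entry point (hypothetical names;
# the actual server.py is not reproduced here). Assumes Flask is installed.
from flask import Flask, request, jsonify

from vavi_models import caption_image, answer_question  # hypothetical module, sketched below

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    # Photo captured by Camera.js, uploaded as multipart form data.
    photo = request.files["photo"]
    # The user's question, e.g. a transcription of the audio from Record.js.
    question = request.form["question"]
    caption = caption_image(photo)               # VLM captioning step
    answer = answer_question(caption, question)  # LLM reasoning step
    return jsonify({"answer": answer})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```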
VAVI replaces the usual need for a human assistant to answer questions about the environment, and aims to match or exceed the capabilities of a human assistant. We adopt the SMMC method, a modular framework in which a language-based exchange is facilitated between pretrained models without any training or fine-tuning [9]. Through SMMC, we harness the few-shot capabilities of these pretrained models via prompt engineering and multi-model dialogue, performing joint inference on downstream vision-language tasks tailored to aiding the visually impaired.
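Below is a minimal sketch of this composition, assuming the Hugging Face transformers implementation of BLIP for the captioning step. The few-shot prompt and the `query_llm` stand-in are hypothetical placeholders for whichever prompt and LLM endpoint are actually used.

```python
# Sketch of the SMMC-style composition: a pretrained VLM (BLIP) produces a
# caption, and a few-shot prompt hands that caption to an LLM for reasoning.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(photo_file) -> str:
    """Run BLIP captioning on an uploaded image file."""
    image = Image.open(photo_file).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical few-shot prefix; the actual prompt engineering is not shown here.
FEW_SHOT_PREFIX = (
    "You answer questions for a visually impaired user based on an image caption.\n"
    "Caption: a red mug on a wooden table. Question: where is the mug?\n"
    "Answer: The mug is on the wooden table.\n"
)

def answer_question(caption: str, question: str) -> str:
    """Compose the caption and question into a prompt and let the LLM reason."""
    prompt = f"{FEW_SHOT_PREFIX}Caption: {caption}. Question: {question}\nAnswer:"
    return query_llm(prompt)

def query_llm(prompt: str) -> str:
    # Placeholder: wire this to an actual LLM client (e.g., a chat-completion API).
    raise NotImplementedError("Connect to your LLM of choice.")
```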
[1] “Vision Atlas,” The International Agency for the Prevention of Blindness. https://www.iapb.org/learn/vision-atlas/ (accessed Feb. 01, 2023).
[2] V. R. Cimarolli and K. Boerner, “Social Support and Well-being in Adults who are Visually Impaired,” Journal of Visual Impairment & Blindness, vol. 99, no. 9, pp. 521–534, Sep. 2005, doi: 10.1177/0145482X0509900904.
[3] S. K. West, B. Munoz, G. S. Rubin, K. Bandeen-Roche, and S. Zeger, “Function and Visual Impairment in a Population-Based Study of Older Adults,” Investigative Ophthalmology, vol. 38, no. 1, 1997.
[4] B. E. K. Klein, R. Klein, K. E. Lee, and K. J. Cruickshanks, “Performance-based and self-assessed measures of visual function as related to history of falls, hip fractures, and measured gait time: The Beaver Dam Eye Study,” Ophthalmology, vol. 105, no. 1, pp. 160–164, Jan. 1998, doi: 10.1016/S0161-6420(98)91911-X.
[5] “Programs & Services,” Vista Center for the Blind and Visually Impaired. https://vistacenter.org/programs-services/ (accessed Feb. 01, 2023).
[6] B. Kumar, “ViT Cane: Visual Assistant for the Visually Impaired,” arXiv, Sep. 25, 2021, doi: 10.48550/arXiv.2109.13857.
[7] E. Marvin, “Digital Assistant for the Visually Impaired,” in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Feb. 2020, pp. 723–728, doi: 10.1109/ICAIIC48513.2020.9065191.
[8] S. Treiman, “Visually Impaired by Dry or Wet AMD? Low-Vision Apps, Devices, and Virtual Assistants Can Expand Your View,” EverydayHealth.com, Dec. 13, 2022. https://www.everydayhealth.com/vision/low-vision-apps-devices-virtual-assistants-expand-the-view-for-the-visually-impaired/ (accessed Feb. 01, 2023).
[9] E. Liu, “Simplifying Multimodal Composition: A Novel Zero-shot Framework to Visual Question Answering and Image Captioning,” Jun. 13, 2023, doi: 10.21203/rs.3.rs-3027308/v1 (accessed Jun. 15, 2023).