ImViA leaders: Julien Dubois, Cyrille Migniot
Project leader: Maxime Ambard (LEAD)
Beginning: 01/09/2020
End: 31/08/2023
Abstract:
The project involves developing electronic glasses that help blind people avoid obstacles while walking. The system is based on a technology that converts visuo-spatial information, collected in three dimensions by a time-of-flight or stereoscopic camera, into a spatialised sound stream using a method known as sensory substitution. The three visuo-spatial dimensions (right-left, low-high and far-near) are translated into auditory cues (right ear-left ear balance, low-high pitch, flat-percussive timbre). To be operational, the hardware system must meet four criteria: it must be compact, it must not overheat, it must have sufficient battery life, and it must be responsive. These criteria can only be met if the algorithms are optimised for implementation on a mobile device.
The processing that converts the video stream acquired by a time-of-flight or stereoscopic camera into a spatialised sound stream takes place in two stages. First, the raw video stream from the camera is filtered to retain only the graphical elements of interest. Once the video stream has been simplified, a second stage transforms it into a stereophonic sound stream. To do this, each graphic pixel retained during the first stage (video filtering) is associated with a sound pixel. For example, a graphic pixel located on the right of the image is associated with a stereophonic tone that sounds louder in the right ear. If the graphic pixel is located at the top of the image, it is associated with a high-pitched tone. If it is detected as being close to the camera, a crackling sound effect is applied.
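To make this mapping concrete, the short Python sketch below shows one possible way of turning a retained pixel's coordinates and depth into the three auditory cues described above. The parameter ranges, the logarithmic pitch scale and the crackle effect are illustrative assumptions, not the project's actual implementation.

    import numpy as np

    # Illustrative ranges (assumptions, not values from the project).
    F_MIN, F_MAX = 200.0, 4000.0      # pitch range in Hz for bottom/top of the image
    SAMPLE_RATE = 44100

    def pixel_to_cue(x, y, depth, width, height, max_depth=5.0):
        """Map one retained graphic pixel to the three auditory cues:
        horizontal position -> interaural balance, vertical position -> pitch,
        distance -> 'percussiveness' of the tone."""
        pan = x / (width - 1)                      # 0.0 = far left, 1.0 = far right
        # Higher rows (small y) get a higher pitch, on a logarithmic scale.
        pitch = F_MIN * (F_MAX / F_MIN) ** (1.0 - y / (height - 1))
        # Closer objects get a more percussive (amplitude-modulated) sound.
        percussiveness = 1.0 - min(depth, max_depth) / max_depth
        return pan, pitch, percussiveness

    def render_cue(pan, pitch, percussiveness, duration=0.05):
        """Render one 'sound pixel' as a short stereo buffer (frames x 2)."""
        t = np.arange(int(duration * SAMPLE_RATE)) / SAMPLE_RATE
        tone = np.sin(2.0 * np.pi * pitch * t)
        # Crackle-like effect: gate the tone with a 40 Hz square wave,
        # more deeply the closer the object is.
        gate = 0.5 * (1.0 + np.sign(np.sin(2.0 * np.pi * 40.0 * t)))
        tone *= 1.0 - percussiveness * gate
        return np.stack([(1.0 - pan) * tone, pan * tone], axis=1)

    # Example: a pixel near the top right of a 320x240 image, one metre away.
    pan, pitch, perc = pixel_to_cue(x=300, y=20, depth=1.0, width=320, height=240)
    stereo_buffer = render_cue(pan, pitch, perc)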
Despite the remarkable progress made in mobile computing over the last decade, this type of real-time processing, combining video stream filtering on the one hand and audio synthesis on the other, remains a challenge. The work in this ‘computing’ work package will therefore focus on optimising these two algorithmic stages and implementing them on mobile devices.
First, 3D reconstruction of the information acquired by the camera is not trivial. If the camera is of the time-of-flight (infrared) type, the signal is quickly degraded by sunlight, restricting the system's field of use to the inside of buildings or to the outdoors at night. If the camera is stereoscopic, the usual 3D reconstruction algorithms are very demanding in terms of computing capacity and often generate artefacts that are visually tolerable but prove prohibitive once sonified. We can take advantage of the fact that, in our specific case, we do not need to reconstruct the entire 3D scene, but are only interested in moving contours. This should allow us to design specific, less demanding algorithms.
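As a simple illustration of why restricting the problem to moving contours can be cheaper than full reconstruction, the sketch below combines frame differencing with a depth-gradient test on the depth stream. The thresholds and the overall approach are assumptions made for the example, not the algorithms that will actually be developed.

    import numpy as np

    def moving_contours(depth_prev, depth_curr, motion_thresh=0.05, edge_thresh=0.10):
        """Keep only pixels that lie on a depth discontinuity *and* have moved
        between two consecutive depth frames (values in metres).
        Returns a boolean mask of the retained pixels."""
        # Temporal change: cheap frame differencing instead of full reconstruction.
        moved = np.abs(depth_curr - depth_prev) > motion_thresh
        # Spatial contours: depth discontinuities from finite differences.
        grad_y, grad_x = np.gradient(depth_curr)
        edges = np.hypot(grad_x, grad_y) > edge_thresh
        return moved & edges

    # Example with two synthetic 240x320 depth frames.
    rng = np.random.default_rng(0)
    prev = 3.0 + 0.01 * rng.standard_normal((240, 320))
    curr = prev.copy()
    curr[100:140, 150:200] = 1.0          # an object that has moved closer
    mask = moving_contours(prev, curr)    # True only around the object's moving edges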
Sound synthesis is also computationally demanding. For example, the activated sound pixels, although pre-calculated at application start-up using HRTF (Head-Related Transfer Function) audio spatialisation techniques, have to be summed in real time into the sound card's audio stream before being sent as a stereophonic output to the user. We have already worked on these aspects for video acquisition with a 2D camera, but the use of 3D information multiplies the computation and memory resources required. A 3D version of the algorithm already runs on a workstation, but it must be optimised to be light enough for a mobile implementation.
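A minimal sketch of this mixing stage is given below, assuming the HRTF-filtered grains have already been pre-computed into a table indexed by a coarse grid of sound-pixel positions. The grid size, grain length and normalisation are illustrative assumptions.

    import numpy as np

    GRAIN_LEN = 512                       # samples per pre-computed 'sound pixel'
    H, W = 60, 80                         # assumed (coarse) grid of sound pixels

    # Table of pre-computed, HRTF-spatialised stereo grains, one per grid position.
    # Filled with placeholders here; in practice it is computed at start-up.
    grain_table = np.zeros((H * W, GRAIN_LEN, 2), dtype=np.float32)

    def mix_active_pixels(active_mask, out_buffer):
        """Sum the grains of all currently active pixels into the stereo buffer
        that will be handed to the sound card (straightforward loop version)."""
        out_buffer[:] = 0.0
        for idx in np.flatnonzero(active_mask.ravel()):
            out_buffer += grain_table[idx]
        out_buffer /= max(active_mask.sum(), 1)   # crude normalisation
        return out_buffer

    active = np.zeros((H, W), dtype=bool)
    active[20:30, 40:60:5] = True                 # a handful of active pixels
    stereo_out = mix_active_pixels(active, np.zeros((GRAIN_LEN, 2), dtype=np.float32))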
The main algorithmic characteristic of these processes is that they can largely be parallelised. This is a major advantage: it points to the possibility of radical optimisation if the algorithms are designed for hardware architectures with high parallel computing capacity. We believe this is the key to meeting the four criteria listed in the first paragraph of this text.
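To illustrate this parallel structure, the sketch below recasts the mixing step of the previous example as a single matrix-vector product, a formulation that maps directly onto GPUs, DSPs or SIMD units. The target architecture and framework are deliberately left open, and the sizes are the same illustrative assumptions as above.

    import numpy as np

    H, W, GRAIN_LEN = 60, 80, 512
    # Same pre-computed grains as above, flattened so that mixing becomes a single
    # matrix-vector product that a parallel linear-algebra routine can execute.
    grain_matrix = np.zeros((H * W, GRAIN_LEN * 2), dtype=np.float32)

    def mix_active_pixels_parallel(active_mask):
        """Data-parallel formulation of the mixing stage: each output sample is an
        independent weighted sum over the active pixels."""
        weights = active_mask.ravel().astype(np.float32)
        weights /= max(weights.sum(), 1.0)
        mixed = weights @ grain_matrix            # one matrix-vector product
        return mixed.reshape(GRAIN_LEN, 2)

    active = np.zeros((H, W), dtype=bool)
    active[20:30, 40:60:5] = True
    stereo_out = mix_active_pixels_parallel(active)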