I was invited to the ALS Association / Prize4Life Assistive Technology Workshop last week to discuss the work the Microsoft Research Enable team is doing along with about two dozen other people doing similar work. Along with me, two pALS I've been working with were also invited - Gal Sont and Steve Gleason. Unfortunately, neither was able to fly out to Washington DC to attend the conference due to the logistical challenges they live with.
I'm pretty familiar with Suitable Technologies Beam telepresence robots. We've had one here at Microsoft Research (MSR) for well over a year that we use for visiting researchers, offsite employees, and the occasional sick day. I asked Suitable and they offered to send two Beams to the conference so our pALS could attend.
Familiarity & eye gaze interaction
The first challenge we needed to overcome was familiarity with the Beam and its user interface. Gal and Steve were new to the Beam. We practiced ahead of time for a few hours using the Beam we had here at MSR. Both used mouse emulation from their eye gaze software. Gal uses (and wrote) Click2Speak and controlling the Beam was straightforward. Steve uses Tobii PC Eye Go and in order to control the Beam he had to switch to 'Mouse Emulation' mode.
The second challenge we needed to overcome was getting speech synthesis to work over the Beam. Telepresence systems have a feature called acoustic echo cancellation (AEC), which keeps the remote speaker from hearing herself echoed in the call.
To keep this from happening, the Beam 'cancels out' sound emitted from the computer speakers from what is picked up by the microphone. A side effect is that if you generate your speech on the same computer you're Beaming in with, the speech doesn't get transmitted to the Beam.
It does this by essentially 'pairing up' a speaker and a microphone: as audio is sent to the speaker, it listens for and removes that same sound from the microphone signal. So the workaround is to pick up the speech with a second microphone, one that the echo canceller will ignore. We purchased an inexpensive USB microphone, the Kinobo SF-555B, mounted it next to the speaker, and that did the trick.
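To make the pairing idea concrete, here is a toy numerical sketch. This is not Suitable's actual algorithm (real AEC uses an adaptive filter to estimate the room echo), but it captures the core behavior: audio sent to the paired speaker is subtracted from the paired microphone, while an unpaired second microphone passes through untouched.

```python
# Toy sketch of acoustic echo cancellation (AEC). Hypothetical and
# simplified: real cancellers adaptively model the room's echo path,
# but the essence is subtracting known speaker output from the mic.

def cancel_echo(mic_signal, speaker_signal):
    """Remove the known speaker output from the paired mic's input."""
    return [m - s for m, s in zip(mic_signal, speaker_signal)]

# Synthesized speech is played through the paired speaker...
speech = [0.4, 0.6, 0.2, 0.5]
# ...so the paired mic picks it up (ignoring other room sound here)...
paired_mic = list(speech)
# ...and AEC cancels it completely: nothing reaches the far end.
print(cancel_echo(paired_mic, speech))   # [0.0, 0.0, 0.0, 0.0]

# A second USB mic is not paired with the speaker, so the canceller
# ignores it and the synthesized speech is transmitted as-is.
second_mic = list(speech)
print(second_mic)                        # [0.4, 0.6, 0.2, 0.5]
```

This is exactly why mounting the USB microphone next to the speaker works: the canceller only subtracts the speaker's output from its own paired microphone.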
Back channel chat & hand raising
Alright, at this point we have Gal and Steve controlling and speaking through the Beam to attend the workshop. However, we still had the challenge that even the best speech systems available today let a pALS speak at around 15 words per minute, compared to 150 words per minute for spoken speech. When you're speaking one on one or in a family setting, people learn to develop patience and politely wait for responses (I'm sure most pALS would laugh politely at this statement), alongside all the shortcuts that come from knowing a person well and reading facial expressions and body language. But now open the floodgates to a room full of animated speakers in a round table discussion, and all of that flies out the door.
So here we are at a workshop about ALS and technology and what are the important goals to pursue and ... Gal and Steve can't get a word in. At one point we all politely paused, asked a question to the pALS, and then a room full of people waited in complete silence for one of them to compose a reply. We lasted about 10 seconds before the conversation resumed :-(
Fortuitously, I had a Skype session running with all the pALS -- we had set Skype up earlier while getting the Beams configured, and we had a surprise (to me) visit from a third pALS who was attending the conference over Skype on my Surface.
So on a whim, I started transcribing the important questions to the pALS over the Skype chatroom and then they started providing their opinions via chat. The conversation might have moved on by then, so I would interject, restate the original question to the room, and then read out the answer the pALS provided. This worked really well!
As the session went on, the participants warmed to the approach. Eventually we found a rhythm where the room would ask a question for the pALS first, move into open discussion, and then when I saw an answer on the chatroom I would raise my hand, the room would pause, and the pALS could hit 'play' on the speech synthesizer, fully participating almost entirely independently.
How can we do better?
From my point of view, the pattern we found worked really well. However, it still had some limitations:
- We had to run two parallel communications systems -- Suitable Beams and a Skype chatroom.
- We needed an in-room moderator to transcribe questions directed to the pALS, read out answers when they were ready, and act as an 'attention proxy' by raising a hand to make space for the speech synthesizer to inject into the conversation.
- The pALS were beholden to the moderator being on top of his game in order to speak; they lost some independence.
I think we can do better. Some brainstorm ideas I had were:
- Create a queue for questions to the pALS. They have a lot to say but they need a little more time to prepare their answers.
- Implement a 'raised hand' signal they can turn on, like an industrial warning light, rather than relying on a moderator to raise his hand.
- Integrate speech synthesizer output into the audio stream of the Beam, rather than having to resort to the two microphone trick to defeat AEC.
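The first two ideas could be sketched together as a small piece of software. Everything below is hypothetical, names included; no such feature exists in the Beam today. The idea is a question queue paired with a 'raised hand' flag that lights up automatically whenever an answer is ready, so the room knows to pause without a moderator watching the chat.

```python
# Hypothetical sketch of a question queue plus a software 'raised
# hand' for a pALS in a group discussion. All class and method names
# are invented for illustration.

from collections import deque

class PALSChannel:
    def __init__(self):
        self.questions = deque()   # questions waiting for a composed answer
        self.answers = deque()     # answers ready for the speech synthesizer
        self.hand_raised = False   # the 'industrial warning light'

    def ask(self, question):
        # The room queues questions; the pALS answers at their own pace.
        self.questions.append(question)

    def submit_answer(self, answer):
        # When an answer is ready, raise the hand automatically so the
        # room pauses for the speech synthesizer.
        self.answers.append(answer)
        self.hand_raised = True

    def speak_next(self):
        # Play the next answer; lower the hand once the queue is empty.
        answer = self.answers.popleft()
        if not self.answers:
            self.hand_raised = False
        return answer

channel = PALSChannel()
channel.ask("What should assistive technology prioritize?")
channel.submit_answer("Faster eye-gaze text entry.")
print(channel.hand_raised)    # True
print(channel.speak_next())   # Faster eye-gaze text entry.
print(channel.hand_raised)    # False
```

Decoupling "an answer exists" from "the moderator noticed" is the point: the signal comes on the instant the pALS is ready, which is the part of the workflow that cost them independence.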