Jim Glass from MIT is giving a talk right now, which I am live-blogging. His talk is about Speech as Interface and Content: Advances and Challenges. Conversational systems combine human and computer sides of the interface. He just showed a clip from Saturday Night Live where a guy goes out on a blind date with someone who works in customer service and answers him exactly like an automated customer-service attendant would, which was pretty funny!
Here's the abstract of his talk.
Spoken interaction between humans and machines has long been a goal of
scientists and engineers. As computational devices continue to shrink in size,
speech-based interfaces are more relevant than ever. At the same time, audio
and video media are fast becoming significant data types themselves. Without
additional processing, however, searching these materials can be tedious.
Spoken language technology offers the opportunity to provide structure for
more effective browsing, summarization, and even translation.
In this talk I describe ongoing research in our group to enable
accessible, multimodal, customizable, and context-aware speech-based
interfaces. I will also present our recent activities in spoken lecture
processing which attempt to transcribe and index academic lecture recordings.
I will also discuss our related efforts to address the long-term challenge of
unsupervised word acquisition. Barring Murphy's Law, the talk will include
web-based demonstrations of recently developed interface and content
processing prototypes.
Bio:
James R. Glass obtained his S.M. and Ph.D. degrees in Electrical
Engineering and Computer Science from the Massachusetts Institute of
Technology in 1985 and 1988, respectively. Currently, he is a Principal
Research Scientist at the MIT Computer Science and Artificial Intelligence
Laboratory where he heads the Spoken Language Systems Group. He is also a
Lecturer in the Harvard-MIT Division of Health Sciences and Technology. His
primary research interests are in the area of speech communication and
human-computer interaction, centered on automatic speech recognition and
spoken language understanding.
Speech-based interfaces are a good way to interact with devices in environments where using a keyboard is difficult, and they are now being extended to the web. Right now, he is showing a demo of City Browser, where you can talk to the system and find restaurants on a geographic map. It's a Java applet; he's trying to show it in the web browser, but the Internet connection in the room is slow, so it's taking a while to load. Oh, now it is coming up. You ask for the type of restaurants you want in a particular city, and the system returns their locations on a Google map. If the system misunderstands your question, you can correct it through the visual interface, editing the query rather than re-speaking it. You can also ask for restaurants along a particular street, or get more information about restaurants by circling an area on the map. So the system combines audio and visual input and returns visual results along with corresponding audio.
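To make the interaction concrete, here's a rough sketch in Python of how a spoken constraint and a map gesture might be combined into a single query. This is my own illustration, not anything from the actual City Browser (which is a Java applet); the Restaurant, BoundingBox, and find_restaurants names are all made up.

```python
# Toy sketch: combine a spoken constraint (cuisine) with a gesture (circled map region).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Restaurant:
    name: str
    cuisine: str
    lat: float
    lon: float

@dataclass
class BoundingBox:
    """Stands in for a region the user circled on the map."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat and
                self.min_lon <= lon <= self.max_lon)

def find_restaurants(restaurants: List[Restaurant],
                     cuisine: Optional[str] = None,
                     region: Optional[BoundingBox] = None) -> List[Restaurant]:
    """Filter by the spoken constraint and/or the gestured region, if given."""
    results = restaurants
    if cuisine is not None:
        results = [r for r in results if r.cuisine == cuisine]
    if region is not None:
        results = [r for r in results if region.contains(r.lat, r.lon)]
    return results

# "Show me Italian restaurants" plus circling part of downtown Boston:
boston = [Restaurant("Trattoria A", "italian", 42.36, -71.06),
          Restaurant("Noodle Bar", "chinese", 42.35, -71.07)]
print(find_restaurants(boston, cuisine="italian",
                       region=BoundingBox(42.30, 42.40, -71.10, -71.00)))
```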
It makes me think about how a lot of research now is not strictly computational; it involves the user as well. And it's nice to see that, because there are certain things that humans do much better and more easily than computers. This is part of context-aware computing and definitely part of ubiquitous computing. You can also get a performance boost by restricting the recognizer's vocabulary to the specific domain, e.g., the vocabulary for Boston will be different from the vocabulary for San Francisco. Given the original query, the system can dynamically alter the vocabulary based on partial understanding.
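Here's a toy sketch of what that dynamic vocabulary idea might look like. The city lexicons and the activate_vocabulary function are made up for illustration, not how the actual system does it:

```python
# Toy sketch of dynamic vocabulary restriction: once a partial parse tells us
# which city the user is talking about, only that city's place names are added
# to the recognizer's active vocabulary.
BASE_VOCABULARY = {"show", "me", "restaurants", "near", "on", "street"}

CITY_LEXICONS = {
    "boston": {"newbury", "fenway", "charles", "cambridge"},
    "san francisco": {"mission", "embarcadero", "castro", "haight"},
}

def activate_vocabulary(partial_hypothesis: str) -> set:
    """Return the vocabulary to use for the rest of the utterance."""
    vocab = set(BASE_VOCABULARY)
    for city, lexicon in CITY_LEXICONS.items():
        if city in partial_hypothesis.lower():
            vocab |= lexicon  # add only the matching city's place names
    return vocab

print(sorted(activate_vocabulary("show me restaurants in Boston on ...")))
```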
Multimodal interaction is interesting because it enables more natural, flexible, efficient, and robust human-computer interaction, and it may reduce the complexity of the system and the interface. There are multiple research problems involved in achieving this, which he went into in more depth. Another interesting multimodal application is person verification using a combination of face recognition and speaker identification.
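One simple way to picture that kind of combination is score-level fusion: a weighted sum of a face score and a speaker score compared against a threshold. The weights, scores, and threshold below are purely illustrative, not from the talk:

```python
# Minimal sketch of score-level fusion for multimodal person verification.
def verify(face_score: float, speaker_score: float,
           face_weight: float = 0.5, threshold: float = 0.6) -> bool:
    """Accept the identity claim if the fused score clears the threshold."""
    fused = face_weight * face_score + (1.0 - face_weight) * speaker_score
    return fused >= threshold

print(verify(face_score=0.8, speaker_score=0.55))  # True: fused score 0.675
print(verify(face_score=0.4, speaker_score=0.5))   # False: fused score 0.45
```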
The next part of his talk is about speech as content. There is a lot of media content, such as video and audio, that is difficult and tedious to search. The motivating question is whether spoken language technology can be used to impose structure on speech content. Another interesting application area is using speech technology to disseminate and understand lecture recordings, and his group has built a system called Lecture Browser. One hard problem is that spoken lectures use a different vocabulary than written text, which makes them harder to analyze. Another neat direction is lecture structure induction, segmenting a lecture into its parts, along with summarization. They have a lecture processing prototype where people can submit videos of lectures, which can then be transcribed and integrated into the Lecture Browser.
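Here's a rough sketch of the kind of indexing that makes a transcribed lecture searchable: an inverted index from words to the timestamps of the transcript segments they occur in, so a keyword query jumps to the right spot in the recording. Again, this is my own toy illustration, not the actual Lecture Browser code:

```python
# Toy sketch: index time-stamped transcript segments for keyword lookup.
from collections import defaultdict
from typing import Dict, List, Tuple

def build_index(segments: List[Tuple[float, str]]) -> Dict[str, List[float]]:
    """segments: (start_time_seconds, transcript_text) pairs from the recognizer."""
    index: Dict[str, List[float]] = defaultdict(list)
    for start, text in segments:
        for word in set(text.lower().split()):
            index[word].append(start)
    return index

def search(index: Dict[str, List[float]], word: str) -> List[float]:
    """Return the start times of segments where the word was spoken."""
    return index.get(word.lower(), [])

lecture = [(0.0, "today we discuss speech recognition"),
           (45.0, "hidden markov models are central to recognition")]
idx = build_index(lecture)
print(search(idx, "recognition"))  # [0.0, 45.0]
```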
Overall, this is a very interesting talk, with relevant applications.
On Technorati: speech, multimodal interface