This essay attempts to summarize the current state of speech and pen input technology, to identify its strengths and limitations, and to report on the key multimodal research challenges.
Multimodal technology can be useful in many different environments, such as multimodal interaction for people with disabilities and multimodal interaction for distributed applications. A new class of multimodal systems is emerging in which the user will be able to employ natural communication modalities, including voice, hand and pen-based gesture, eye tracking, and body movement.
Multimodality allows the capacities of human communication to be exploited in an optimal way. Multimodal interfaces aim to integrate several communication means harmoniously, bringing computer behavior closer to human communication paradigms; as a result, multimodal interfaces are easy to learn and use.
Major advances in new input technologies and algorithms, hardware speed, distributed computing, and spoken language technology in particular have all supported the emergence of more transparent and natural communication with this new class of multimodal system (Landay, Larson, and Ferro, 2002, p. 422). Consequently, spoken language technology now appears in many products, including workstations, telephony applications, and even small palm computers.
Spoken language systems offer several capabilities: they support new training systems for learning foreign languages and basic reading skills, as well as automated dictation systems for applications such as word processing and legal records; for example, the LA Voice Navigator 3.1 software can control a workstation using voice alone.
The development of spoken language technology has therefore continued to expand, and steady advances have also occurred in pen-based hardware and software, which now provide handwriting and gesture recognition on handhelds, palm devices, and, recently, mobile phones. Pen input technology also supports sketching applications, such as picture design, user interface design, and circuit design, using a sketch pad.
The multimodal subfield involving speech and pen-based gestures has been able to explore a wider range of research issues and to advance more rapidly in its multimodal architectures and applications (Landay, Larson, and Ferro, 2002, p. 423).
Speech input systems provide computers with the ability to identify spoken words and phrases; they focus on word identification, not word understanding. The latter is part of natural language processing, which is a separate research area. Compare this to entering characters into a computer using a keyboard: the computer can identify the characters that are typed, but it has no implicit understanding of what those characters mean.
Speech input has several advantages: it offers speed, high-bandwidth information, and relative ease of use. It also permits the user's hands and eyes to remain busy with a task, which is particularly valuable when users are in motion or in natural field settings.
During pen or voice interaction, users tend to prefer entering descriptive information via speech, although their preference for pen input increases for digits, symbols, and graphic content.
Alongside these speech input advantages, pen input technology has advantages of its own. First, the pen can be used to write words that correspond to speech. Second, it can convey symbols and signs, gestures, simple graphics and artwork, and can render signatures. Finally, it can be used to point and to select visible objects, as the mouse does in a direct manipulation interface, and as a means of microphone engagement for speech input (Landay, Larson, and Ferro, 2002, p. 424).
Combining pen input with speech allows users to engage in more powerfully expressive and transparent information-seeking dialogues in human language. Speech is the preferred medium for subject, verb, and object expression. Compared with speech-only interaction, multimodal pen/voice interaction on visual-spatial tasks can result in 10 percent faster completion times, 36 percent fewer task-critical errors, shorter and simpler linguistic constructions, 50 percent fewer spontaneous disfluencies, and a 90 to 100 percent user preference for interacting this way.
Compared to unimodal recognition-based interfaces, multimodal interface design has a particularly advantageous feature: it can support superior error recovery. There are both user-centered and system-centered reasons why multimodal systems facilitate error recovery. First, in a multimodal interface users intuitively pick the mode that is less error-prone. Second, in a multimodal interface user language is often simplified. Third, users intuitively switch modes after an error, so the same problem is not repeated. Fourth, users report less subjective frustration with errors when interacting multimodally, even when errors are as common as in a unimodal interface. Finally, a well-designed multimodal architecture can support mutual disambiguation, in which each input mode helps resolve recognition ambiguity in the other.
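The idea of mutual disambiguation can be sketched in a few lines of code. This is an illustrative toy, not an implementation from the cited systems: each recognizer is assumed to return an n-best list of (hypothesis, score) pairs, and a hypothetical compatibility table stands in for real semantic integration. Fusing the two lists lets strong gesture evidence promote a lower-ranked but correct speech hypothesis.

```python
# Toy sketch of mutual disambiguation (illustrative only).
# Each recognizer returns an n-best list of (hypothesis, score);
# fusion keeps only semantically compatible joint interpretations,
# so one mode can pull up the other's lower-ranked hypothesis.

# Hypothetical n-best lists; the scores are made-up confidences.
speech_nbest = [("pan", 0.50), ("tank", 0.45), ("bank", 0.05)]
gesture_nbest = [("point_at_vehicle", 0.70), ("circle_area", 0.30)]

# Hypothetical compatibility table: which spoken commands make
# sense with which pen gestures in a map application.
compatible = {
    ("tank", "point_at_vehicle"),
    ("pan", "circle_area"),
}

def fuse(speech, gesture):
    """Score every compatible (speech, gesture) pair and rank them."""
    joint = [
        (s, g, s_score * g_score)
        for s, s_score in speech
        for g, g_score in gesture
        if (s, g) in compatible
    ]
    return sorted(joint, key=lambda t: t[2], reverse=True)

ranked = fuse(speech_nbest, gesture_nbest)
# The gesture evidence promotes "tank" over the acoustically
# better-scoring "pan".
print(ranked[0][:2])  # ('tank', 'point_at_vehicle')
```

In a real architecture the compatibility test would be a semantic unification step over typed feature structures rather than a lookup table, but the ranking principle is the same.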
Because individuals differ widely in how they communicate, a multimodal interface allows users to choose for themselves how to interact with the computer. Multimodal interfaces therefore have the potential to accommodate a broader range of users than traditional graphical user interfaces, including users of different ages and skill levels, as well as people with disabilities; modern multimodal technology offers important opportunities for people with disabilities because it creates new forms of communication. Disabilities can affect a wide range of people's abilities and functions.
Several kinds of impairment can cause communication problems. First, the problem may be due to sensory impairments, for example deafness, blindness, or muteness. Second, it may be due to physical impairments, for example a lack of motor control of the speech articulator system. Finally, it may be due to cognitive impairments. These impairments may affect face-to-face communication, human-computer interaction, and human-to-human communication mediated by a computer.
A multimodal system that recognizes speech and pen-based gestures was designed and studied in the early 1990s, resulting in the original QuickSet prototype. "QuickSet is a collaborative, handheld, multimodal system for interacting with distributed applications" (Cohen et al., 1999).
By virtue of its modular, agent-based design, QuickSet users with a handheld PC and 3D visualization can create entities in a modular Semi-Automated Forces distributed simulation, assign missions, and watch the simulation unfold. Users can also issue commands to the 2D visualization, for example: first, "Turn on heads up display"; second, "Take me to objective alpha"; third, "Fly me to this platoon", accompanied by a gesture on the QuickSet map; and fourth, "Fly me along this route at fifty meters", accompanied by drawing a route on the QuickSet map. Here, the first command controls the visualization, the second and third navigate to out-of-view locations and entities, and the fourth moves along a prescribed path (Cohen et al., 1999).
Although many performance advantages have been described for speech and pen input technology and users show a strong preference for interacting multimodally, not all system designs are compatible with, or best served by, a multimodal interface. Speech and pen input systems are still relatively expensive in terms of software, additional hardware, and memory requirements, so some care is needed before deciding that speech and pen input will benefit a particular user. Moreover, in speech input systems restricted to discrete utterances, parsing of the speech signal into word-by-word tokens is not done; in other words, the human speaker has to talk to the system in a word-by-word style (Bolt, 1980).
Finally, although a combined multimodal speech and pen input system is an attractive interface choice for next-generation systems because of its mobility and transparency, other modality combinations also need to be explored and will be preferred by users for certain applications.
However, "The market for multimodal devices is expected to rapidly expand as large software companies and enterprise software vendors supporting mobile field personnel enter this space and extend their marketing reach with multimodal applications. As carriers and enterprises seek integrated solutions that eliminate the need to carry multiple devices for voice and data connectivity, market leadership will help deliver these solutions sooner," said Elizabeth Herrell, Research Director at Giga Information Group ("New Multimodal Server Enables Voice, Text and Graphic on Smart Phones and PDAs," October 29, 2001, http://www.lobby7.com/press_121001.htm).
The robustness of multimodal systems needs to be improved. The first step is to plan future research on adaptive multimodal speech and gesture architectures: classifying the problem and then deciding when and how the multimodal system should adapt. There are two candidate sources of parameters for multimodal system adaptation: user-centered and environmental.
With respect to environmental parameters for adapting multimodal pen/voice systems, background noise level and speech signal-to-noise ratio (SNR) are two widely used audio dimensions that have an impact on system recognition rates (Landay, Larson, and Ferro, 2002, p. 451).
To design robust mobile pen/voice systems for field use, future research will need to experiment with adaptive processing that tracks the background noise level and dynamically adjusts the system's weightings for speech and gesture input from the less to the more reliable input mode (Rogozan and Deleglise, 1998; Landay, Larson, and Ferro, 2002, p. 424).
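One simple way such adaptive processing could work is to map the measured speech SNR to a fusion weight and shift confidence from the speech recognizer toward the pen recognizer as noise rises. The following is a minimal sketch, not taken from the cited papers; the 5 dB and 25 dB thresholds and the linear ramp between them are made-up assumptions for illustration.

```python
# Illustrative sketch of SNR-driven adaptive weighting (assumed
# thresholds, not from the cited research): as background noise
# rises and speech SNR drops, the fusion weight moves from the
# speech mode toward the pen/gesture mode.

def speech_weight(snr_db, low=5.0, high=25.0):
    """Map measured speech SNR (dB) to a fusion weight in [0, 1].

    Below `low` dB the speech channel is treated as unreliable
    (weight 0); above `high` dB it is fully trusted (weight 1);
    in between, the weight ramps up linearly.
    """
    if snr_db <= low:
        return 0.0
    if snr_db >= high:
        return 1.0
    return (snr_db - low) / (high - low)

def fuse_scores(speech_score, gesture_score, snr_db):
    """Weighted combination of per-mode confidence scores."""
    w = speech_weight(snr_db)
    return w * speech_score + (1.0 - w) * gesture_score

# In a quiet office (30 dB SNR) the speech score dominates;
# in a noisy field setting (5 dB SNR) the gesture score takes over.
print(fuse_scores(0.9, 0.4, snr_db=30.0))  # 0.9
print(fuse_scores(0.9, 0.4, snr_db=5.0))   # 0.4
```

A deployed system would estimate the noise level continuously from the audio stream and might use a learned rather than hand-set mapping, but the weighting principle is the same.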
In conclusion, interest in multimodal interface design is driven largely by the goal of supporting more transparent, flexible, efficient, easy-to-use, and powerfully expressive means of human-computer interaction. Multimodal interfaces are important today not only for users of different ages and skill levels and for people with disabilities, but also in business environments, where a multimodal interface can make work more efficient, for example by doing word processing with speech recognition or pen input technology.
However, multimodal systems have several limitations. Speech and pen input systems are not yet cost effective: they remain relatively expensive in terms of software, additional hardware, and memory requirements, so some care is needed before deciding that speech and pen input will benefit a particular user. In addition, multimodal interface systems need to adapt so that their robustness can be enhanced; the two candidate sources of parameters for system adaptation are user-centered and environmental.
* Bolt, R.A. (1980). Put-that-there: Voice and gestures at the graphics interface. Computer Graphics, 14(3), 262-270.
* Cohen, P.R., McGee, D., Oviatt, S., Wu, L., Clow, J., King, R., Julier, S., and Rosenblum, L. (1999). Multimodal interaction for 2D and 3D environments. IEEE Computer Graphics and Applications, 19(4), 10-13. IEEE Press.
* Landay, J., Larson, J., and Ferro, D. (2002). Designing the user interface for multimodal speech and pen-based gesture applications: State-of-the-art systems and future research directions. In Carroll, J.M. (Ed.), Human-Computer Interaction in the New Millennium. New York: ACM Press, Addison-Wesley.