We present Skinput, a technology that appropriates the human body for acoustic transmission, allowing the skin to be used as an input surface. In particular, we resolve the location of finger taps on the arm and hand by analyzing mechanical vibrations that propagate through the body. We collect these signals using a novel array of sensors worn as an armband. This approach provides an always-available, naturally portable, and on-body finger input system. We assess the capabilities, accuracy and limitations of our technique through a two-part, twenty-participant user study. To further illustrate the utility of our approach, we conclude with several proof-of-concept applications we developed.

Author Keywords
Bio-acoustics, finger input, buttons, gestures, on-body interaction, projected displays, audio interfaces.

ACM Classification Keywords
H.5.2 [User Interfaces]: Input devices and strategies; B.4.2 [Input/Output Devices]: Channels and controllers

General Terms
Human Factors

INTRODUCTION
Devices with significant computational power can now be easily carried on our bodies. However, their small size typically leads to limited interaction space (e.g., diminutive screens, buttons, and jog wheels) and consequently diminishes their usability and functionality. Since we cannot simply make buttons and screens larger without losing the primary benefit of small size, we consider alternative approaches that enhance interactions with small mobile systems.

One option is to opportunistically appropriate surface area from the environment for interactive purposes. For example, prior work describes a technique that allows a small mobile device to turn tables on which it rests into a gestural finger input canvas. However, tables are not always present, and in a mobile context, users are unlikely to want to carry such surfaces with them.

Appropriating the human body as an input device is appealing not only because we have roughly two square meters of external surface area, but also because much of it is easily accessible by our hands (e.g., arms, upper legs, torso). Furthermore, proprioception (our sense of how our body is configured in three-dimensional space) allows us to accurately interact with our bodies in an eyes-free manner. For example, we can readily flick each of our fingers, touch the tip of our nose, and clap our hands together without visual assistance. Few external input devices can claim this accurate, eyes-free input characteristic and provide such a large interaction area.

In this paper, we present our work on Skinput, a method that allows the body to be appropriated for finger input using a novel, non-invasive, wearable bio-acoustic sensor. The contributions of this paper are:
1) We describe the design of a novel, wearable sensor for bio-acoustic signal acquisition (Figure 1).
2) We describe an analysis approach that enables our system to resolve the location of finger taps on the body.
3) We assess the robustness and limitations of this system through a user study.
4) We explore the broader space of bio-acoustic input through prototype applications and additional experimentation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI 2010, April 10-15, 2010, Atlanta, Georgia, USA. Copyright 2010 ACM 978-1-60558-929-9/10/04…$10.00.

Figure 1. A wearable, bio-acoustic sensing array built into an armband. Sensing elements detect vibrations transmitted through the body. The two sensor packages shown above each contain five, specially weighted, cantilevered piezo films, responsive to a particular frequency range.

RELATED WORK
Always-Available Input
The primary goal of Skinput is to provide an always-available mobile input system, that is, an input system that does not require a user to carry or pick up a device. A number of alternative approaches have been proposed that operate in this space. Techniques based on computer vision are popular (e.g., [3,26,27]). These, however, are computationally expensive and error prone in mobile scenarios (where, e.g., non-input optical flow is prevalent). Speech input (e.g., [15]) is a logical choice for always-available input, but is limited in its precision in unpredictable acoustic environments, and suffers from privacy and scalability issues in shared environments.

Other approaches have taken the form of wearable computing. This typically involves a physical input device built in a form considered to be part of one’s clothing. For example, glove-based input systems allow users to retain most of their natural hand movements, but are cumbersome, uncomfortable, and disruptive to tactile sensation. Post and Orth present a “smart fabric” system that embeds sensors and conductors into fabric, but taking this approach to always-available input necessitates embedding technology in all clothing, which would be prohibitively complex and expensive.

The Sixth Sense project proposes a mobile, always-available input/output capability by combining projected information with a color-marker-based vision tracking system. This approach is feasible, but suffers from serious occlusion and accuracy limitations. For example, determining whether a finger has tapped a button, or is merely hovering above it, is extraordinarily difficult. In the present work, we briefly explore the combination of on-body sensing with on-body projection.

Bio-Sensing
Skinput leverages the natural acoustic conduction properties of the human body to provide an input system, and is thus related to previous work in the use of biological signals for computer input. Signals traditionally used for diagnostic medicine, such as heart rate and skin resistance, have been appropriated for assessing a user’s emotional state. These features are generally subconsciously driven and cannot be controlled with sufficient precision for direct input.

Similarly, brain sensing technologies such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIR) have been used by HCI researchers to assess cognitive and emotional state (e.g., [9,11,14]); this work also primarily looked at involuntary signals. In contrast, brain signals have been harnessed as a direct input for use by paralyzed patients (e.g., [8]), but direct brain-computer interfaces (BCIs) still lack the bandwidth required for everyday computing tasks, and require levels of focus, training, and concentration that are incompatible with typical computer interaction.

There has been less work relating to the intersection of finger input and biological signals. Researchers have harnessed the electrical signals generated by muscle activation during normal hand movement through electromyography (EMG). At present, however, this approach typically requires expensive amplification systems and the application of conductive gel for effective signal acquisition, which would limit the acceptability of this approach for most users.

The input technology most related to our own is that of Amento et al. [2], who placed contact microphones on a user’s wrist to assess finger movement. However, this work was never formally evaluated, and is constrained to finger motions in one hand. The Hambone system employs a similar setup, and through an HMM, yields classification accuracies around 90% for four gestures (e.g., raise heels, snap fingers). Performance of false positive rejection remains untested in both systems at present. Moreover, both techniques required the placement of sensors near the area of interaction (e.g., the wrist), increasing the degree of invasiveness and visibility.

Finally, bone conduction microphones and headphones, now common consumer technologies, represent an additional bio-sensing technology relevant to the present work. These leverage the fact that sound frequencies relevant to human speech propagate well through bone. Bone conduction microphones are typically worn near the ear, where they can sense vibrations propagating from the mouth and larynx during speech. Bone conduction headphones send sound through the bones of the skull and jaw directly to the inner ear, bypassing transmission of sound through the air and outer ear, leaving an unobstructed path for environmental sounds.

Acoustic Input
Our approach is also inspired by systems that leverage acoustic transmission through input surfaces, for example by measuring the arrival time of a sound at multiple sensors to locate hand taps on a glass window.
Ishii et al. use a similar approach to localize a ball hitting a table, for computer augmentation of a real-world game. Both of these systems use acoustic time-of-flight for localization, which we explored, but found to be insufficiently robust on the human body, leading to the fingerprinting approach described in this paper.

SKINPUT
To expand the range of sensing modalities for always-available input systems, we introduce Skinput, a novel input technique that allows the skin to be used as a finger input surface.
In our prototype system, we choose to focus on the arm (although the technique could be applied elsewhere). This is an attractive area to appropriate as it provides considerable surface area for interaction, including a contiguous and flat area for projection (discussed subsequently). Furthermore, the forearm and hands contain a complex assemblage of bones that increases acoustic distinctiveness of different locations. To capture this acoustic information, we developed a wearable armband that is non-invasive and easily removable (Figures 1 and 5).
In this section, we discuss the mechanical phenomena that enable Skinput, with a specific focus on the mechanical properties of the arm. Then we describe the Skinput sensor and the processing techniques we use to segment, analyze, and classify bio-acoustic signals.

Bio-Acoustics
Figure 2. Transverse wave propagation: Finger impacts displace the skin, creating transverse waves (ripples). The sensor is activated as the wave passes underneath it.

When a finger taps the skin, several distinct forms of acoustic energy are produced. Some energy is radiated into the air as sound waves; this energy is not captured by the Skinput system.
Among the acoustic energy transmitted through the arm, the most readily visible are transverse waves, created by the displacement of the skin from a finger impact (Figure 2). When shot with a high-speed camera, these appear as ripples, which propagate outward from the point of contact (see video). The amplitude of these ripples is correlated to both the tapping force and to the volume and compliance of soft tissues under the impact area. In general, tapping on soft regions of the arm creates higher amplitude transverse waves than tapping on bony areas (e.g., wrist, palm, fingers), which have negligible compliance.

In addition to the energy that propagates on the surface of the arm, some energy is transmitted inward, toward the skeleton (Figure 3). These longitudinal (compressive) waves travel through the soft tissues of the arm, exciting the bone, which is much less deformable than the soft tissue but can respond to mechanical excitation by rotating and translating as a rigid body. This excitation vibrates soft tissues surrounding the entire length of the bone, resulting in new longitudinal waves that propagate outward to the skin.
We highlight these two separate forms of conduction (transverse waves moving directly along the arm surface, and longitudinal waves moving into and out of the bone through soft tissues) because these mechanisms carry energy at different frequencies and over different distances. Roughly speaking, higher frequencies propagate more readily through bone than through soft tissue, and bone conduction carries energy over larger distances than soft tissue conduction. While we do not explicitly model the specific mechanisms of conduction, or depend on these mechanisms for our analysis, we do believe the success of our technique depends on the complex acoustic patterns that result from mixtures of these modalities.
Similarly, we also believe that joints play an important role in making tapped locations acoustically distinct. Bones are held together by ligaments, and joints often include additional biological structures such as fluid cavities. This makes joints behave as acoustic filters. In some cases, these may simply dampen acoustics; in other cases, these will selectively attenuate specific frequencies, creating location-specific acoustic signatures.

Figure 3. Longitudinal wave propagation: Finger impacts create longitudinal (compressive) waves that cause internal skeletal structures to vibrate.
This, in turn, creates longitudinal waves that emanate outwards from the bone (along its entire length) toward the skin.

Sensing
To capture the rich variety of acoustic information described in the previous section, we evaluated many sensing technologies, including bone conduction microphones, conventional microphones coupled with stethoscopes, piezo contact microphones, and accelerometers. However, these transducers were engineered for very different applications than measuring acoustics transmitted through the human body.
As such, we found them to be lacking in several significant ways. Foremost, most mechanical sensors are engineered to provide relatively flat response curves over the range of frequencies that is relevant to our signal. This is a desirable property for most applications where a faithful representation of an input signal, uncolored by the properties of the transducer, is desired. However, because only a specific set of frequencies is conducted through the arm in response to tap input, a flat response curve leads to the capture of irrelevant frequencies and thus to a low signal-to-noise ratio.
While bone conduction microphones might seem a suitable choice for Skinput, these devices are typically engineered for capturing human voice, and filter out energy below the range of human speech (whose lowest frequency is around 85 Hz). Thus most sensors in this category were not especially sensitive to lower-frequency signals (e.g., 25 Hz), which we found in our empirical pilot studies to be vital in characterizing finger taps. To overcome these challenges, we moved away from a single sensing element with a flat response curve, to an array of highly tuned vibration sensors.
Specifically, we employ small, cantilevered piezo films (MiniSense100, Measurement Specialties, Inc.). By adding small weights to the end of the cantilever, we are able to alter the resonant frequency, allowing the sensing element to be responsive to a unique, narrow, low-frequency band of the acoustic spectrum. Adding more mass lowers the range of excitation to which a sensor responds; we weighted each element to align with particular frequencies that pilot studies showed to be useful in characterizing bio-acoustic input.

Figure 4. Response curve (relative sensitivity) of the sensing element that resonates at 78 Hz.
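The relationship between added weight and resonance can be illustrated with an idealized, undamped spring-mass model, f = (1/2π)·sqrt(k/m). This is only a sketch under stated assumptions: the stiffness value is a hypothetical placeholder rather than a measured property of the sensing elements, and a real cantilevered film would call for a distributed-mass model.

```python
import math


def resonant_frequency_hz(stiffness_n_per_m: float, effective_mass_kg: float) -> float:
    """Natural frequency of an ideal, undamped spring-mass system."""
    return math.sqrt(stiffness_n_per_m / effective_mass_kg) / (2.0 * math.pi)


def mass_for_frequency_kg(stiffness_n_per_m: float, target_hz: float) -> float:
    """Effective mass that places the ideal system's resonance at target_hz."""
    return stiffness_n_per_m / (2.0 * math.pi * target_hz) ** 2


# Hypothetical stiffness in N/m (an assumption, not a datasheet value).
K_ASSUMED = 50.0

# Example targets: three of the resonant frequencies listed in Table 1.
for target in (25.0, 38.0, 78.0):
    mass_g = mass_for_frequency_kg(K_ASSUMED, target) * 1000.0
    print(f"{target:5.1f} Hz needs about {mass_g:.3f} g of effective mass")
```

In this idealization, lower target frequencies require more mass, which matches the qualitative statement that adding mass lowers the frequency range a sensor responds to.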
Figure 4 shows the response curve for one of our sensors, tuned to a resonant frequency of 78 Hz. The curve shows a steep drop-off in sensitivity on either side of the resonant frequency. Additionally, the cantilevered sensors were naturally insensitive to forces parallel to the skin (e.g., shearing motions caused by stretching). Thus, the skin stretch induced by many routine movements (e.g., reaching for a doorknob) tends to be attenuated. However, the sensors are highly responsive to motion perpendicular to the skin plane, which is well suited to capturing transverse surface waves (Figure 2) and longitudinal waves emanating from interior structures (Figure 3).
Finally, our sensor design is relatively inexpensive and can be manufactured in a very small form factor (e.g., MEMS), rendering it suitable for inclusion in future mobile devices (e.g., an arm-mounted audio player).

Armband Prototype
Figure 5. Prototype armband.

Processing
In our prototype system, we employ a Mackie Onyx 1200F audio interface to digitally capture data from the ten sensors (http://mackie.com). This was connected via FireWire to a conventional desktop computer, where a thin client written in C interfaced with the device using the Audio Stream Input/Output (ASIO) protocol. Each channel was sampled at 5.5 kHz, a sampling rate that would be considered too low for speech or environmental audio, but was able to represent the relevant spectrum of frequencies transmitted through the arm. This reduced sample rate (and consequently low processing bandwidth) makes our technique readily portable to embedded processors. For example, the ATmega168 processor employed by the Arduino platform can sample analog readings at 77 kHz with no loss of precision, and could therefore provide the full sampling power required for Skinput (55 kHz total). Data was then sent from our thin client over a local socket to our primary application, written in Java.
This program performed three key functions. First, it provided a live visualization of the data from our ten sensors, which was useful in identifying acoustic features (Figure 6). Second, it segmented inputs from the data stream into independent instances (taps). Third, it classified these input instances. The audio stream was segmented into individual taps using an absolute exponential average of all ten channels (Figure 6, red waveform). When an intensity threshold was exceeded (Figure 6, upper blue line), the program recorded the timestamp as a potential start of a tap.
If the intensity did not fall below a second, independent “closing” threshold (Figure 6, lower purple line) between a minimum and a maximum duration after the onset crossing (a range we found to be common for finger impacts), the event was discarded. If start and end crossings were detected that satisfied these criteria, the acoustic data in that period (plus a short buffer on either end) was considered an input event (Figure 6, vertical green regions).

Upper Array: 25 Hz, 27 Hz, 30 Hz, 38 Hz, 78 Hz
Lower Array: 25 Hz, 27 Hz, 40 Hz, 44 Hz, 64 Hz

Our final prototype, shown in Figures 1 and 5, features two arrays of five sensing elements, incorporated into an armband form factor.
The decision to have two sensor packages was motivated by our focus on the arm for input. With the armband on the upper arm (above the elbow), we hoped to collect acoustic information from the fleshy bicep area in addition to the firmer area on the underside of the arm, with better acoustic coupling to the humerus, the main bone that runs from shoulder to elbow. When the sensor was placed below the elbow, on the forearm, one package was located near the radius, the bone that runs from the lateral side of the elbow to the thumb side of the wrist, and the other near the ulna, which runs parallel to this on the medial side of the arm closest to the body.
Each location thus provided slightly different acoustic coverage and information, helpful in disambiguating input location. Based on pilot data collection, we selected a different set of resonant frequencies for each sensor package (Table 1). We tuned the upper sensor package to be more sensitive to lower frequency signals, as these were more prevalent in fleshier areas. Conversely, we tuned the lower sensor array to be sensitive to higher frequencies, in order to better capture signals transmitted through (denser) bones.

Table 1. Resonant frequencies of individual elements in the two sensor packages.

Features are computed on-the-fly for each segmented input. These are fed into the trained SVM for classification. We use an event model in our software: once an input is classified, an event associated with that location is instantiated. Any interactive features bound to that event are fired. As can be seen in our video, we readily achieve interactive speeds.

EXPERIMENT
Participants
Figure 6. Ten channels of acoustic data generated by three finger taps on the forearm, followed by three taps on the wrist. The exponential average of the channels is shown in red. Segmented input windows are highlighted in green.
Note how different sensing elements are actuated by the two locations.

To evaluate the performance of our system, we recruited 13 participants (7 female) from the Greater Seattle area. These participants represented a diverse cross-section of potential ages and body types. Ages ranged from 20 to 56 (mean 38.3), and computed body mass indexes (BMIs) ranged from 20.5 (normal) to 31.9 (obese).

Experimental Conditions

Although simple, this segmentation heuristic proved to be highly robust, mainly due to the extreme noise suppression provided by our sensing approach. After an input has been segmented, the waveforms are analyzed.
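The segmentation heuristic described earlier (an exponential average of absolute amplitude across channels, compared against separate onset and closing thresholds) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the smoothing factor `alpha`, and every threshold and duration are hypothetical parameters supplied by the caller, and the re-arming rule after a discarded event is an added assumption.

```python
def segment_taps(frames, sample_rate_hz, onset_threshold, closing_threshold,
                 min_duration_ms, max_duration_ms, buffer_ms, alpha=0.05):
    """Return (start, end) sample-index pairs for accepted tap events.

    frames: sequence of per-sample tuples, one value per sensor channel.
    """
    frames = list(frames)

    def to_samples(ms):
        return int(ms * sample_rate_hz / 1000.0)

    min_n = to_samples(min_duration_ms)
    max_n = to_samples(max_duration_ms)
    buf_n = to_samples(buffer_ms)

    envelope = 0.0   # exponential average of absolute amplitude, all channels
    onset = None     # index of a pending onset crossing, if any
    armed = True     # assumption: wait for quiet before re-arming after a discard
    events = []
    for i, frame in enumerate(frames):
        level = sum(abs(v) for v in frame) / len(frame)
        envelope = alpha * level + (1.0 - alpha) * envelope
        if onset is None:
            if not armed:
                armed = envelope < closing_threshold
            elif envelope > onset_threshold:
                onset = i                      # potential start of a tap
            continue
        elapsed = i - onset
        if envelope < closing_threshold:
            if min_n <= elapsed <= max_n:      # plausible tap duration: keep
                start = max(0, onset - buf_n)
                end = min(len(frames) - 1, i + buf_n)
                events.append((start, end))
            onset = None                       # otherwise the event is discarded
        elif elapsed > max_n:
            onset = None                       # stayed loud too long: discard
            armed = False
    return events
```

Each returned pair brackets one candidate tap, padded by the buffer on both sides, and could then be handed to feature extraction.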
The highly discrete nature of taps (i.e., point impacts) meant acoustic signals were not particularly expressive over time (unlike gestures, e.g., clenching of the hand). Signals simply diminished in intensity over time. Thus, features are computed over the entire input window and do not capture any temporal dynamics. We employ a brute force machine learning approach, computing 186 features in total, many of which are derived combinatorially. For gross information, we include the average amplitude, standard deviation and total (absolute) energy of the waveforms in each channel (30 features).
From these, we calculate all average amplitude ratios between channel pairs (45 features). We also include an average of these ratios (1 feature). We calculate a 256-point FFT for all ten channels, although only the lower ten values are used (representing the acoustic power from 0 Hz to 193 Hz), yielding 100 features. These are normalized by the highest-amplitude FFT value found on any channel. We also include the center of mass of the power spectrum within the same 0 Hz to 193 Hz range for each channel, a rough estimation of the fundamental frequency of the signal displacing each sensor (10 features).
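To make the feature count concrete, the sketch below assembles one possible 186-element vector (30 + 45 + 1 + 100 + 10) from a ten-channel window. Several choices are assumptions rather than details from the text: "average amplitude" is taken as the mean absolute value, "total (absolute) energy" as the sum of absolute values, windows are truncated or zero-padded to 256 samples, the ten lowest bins are computed with a direct DFT (which yields the same values as the corresponding FFT bins), and the spectral center of mass is reported in bin units.

```python
import cmath
import math
from itertools import combinations

N_FFT = 256   # transform length stated in the text
N_BINS = 10   # only the ten lowest bins are kept


def low_bins(signal, n_fft=N_FFT, n_bins=N_BINS):
    """Magnitudes of the n_bins lowest DFT bins of a padded/truncated signal."""
    x = list(signal[:n_fft]) + [0.0] * max(0, n_fft - len(signal))
    mags = []
    for k in range(n_bins):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n in range(n_fft))
        mags.append(abs(acc))
    return mags


def extract_features(window):
    """window: list of ten channels, each a list of float samples."""
    feats, avg_amp = [], []
    for ch in window:
        n = len(ch)
        mean = sum(ch) / n
        mean_abs = sum(abs(v) for v in ch) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in ch) / n)
        energy = sum(abs(v) for v in ch)
        avg_amp.append(mean_abs)
        feats += [mean_abs, std, energy]             # 3 per channel: 30

    eps = 1e-12                                      # guards against division by zero
    ratios = [avg_amp[i] / (avg_amp[j] + eps)
              for i, j in combinations(range(len(window)), 2)]
    feats += ratios                                  # all channel pairs: 45
    feats.append(sum(ratios) / len(ratios))          # mean ratio: 1

    spectra = [low_bins(ch) for ch in window]
    peak = max(max(s) for s in spectra) or 1.0       # highest bin on any channel
    for s in spectra:
        feats += [m / peak for m in s]               # normalized bins: 100

    for s in spectra:
        power = [m * m for m in s]
        total = sum(power) or 1.0
        feats.append(sum(k * p for k, p in enumerate(power)) / total)  # 10
    return feats
```

Multiplying a center-of-mass value by the bin spacing (sampling rate divided by 256) would convert it from bin units to hertz.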
Subsequent feature selection established the all-pairs amplitude ratios and certain bands of the FFT to be the most predictive features. These 186 features are passed to a Support Vector Machine (SVM) classifier. A full description of SVMs is beyond the scope of this paper. Our software uses the implementation provided in the Weka machine learning toolkit. It should be noted, however, that other, more sophisticated classification techniques and features could be employed. Thus, the results presented in this paper should be considered a baseline.
Before the SVM can classify input instances, it must first be trained to the user and the sensor position. This stage requires the collection of several examples for each input location of interest. When using Skinput to recognize live input, the same 186 acoustic features are computed on-the-fly.

We selected three input groupings from the multitude of possible location combinations to test. We believe that these groupings, illustrated in Figure 7, are of particular interest with respect to interface design, and at the same time, push the limits of our sensing capability.
From these three groupings, we derived five different experimental conditions, described below.

Fingers (Five Locations)
One set of gestures we tested had participants tapping on the tips of each of their five fingers (Figure 7, “Fingers”). The fingers offer interesting affordances that make them compelling to appropriate for input. Foremost, they provide clearly discrete interaction points, which are even already well-named (e.g., ring finger). In addition to five finger tips, there are 14 knuckles (five major, nine minor), which, taken together, could offer 19 readily identifiable input locations on the fingers alone.
Second, we have exceptional finger-to-finger dexterity, as demonstrated when we count by tapping on our fingers. Finally, the fingers are linearly ordered, which is potentially useful for interfaces like number entry, magnitude control (e.g., volume), and menu selection. At the same time, fingers are among the most uniform appendages on the body, with all but the thumb sharing a similar skeletal and muscular structure. This drastically reduces acoustic variation and makes differentiating among them difficult.
Additionally, acoustic information must cross as many as five (finger and wrist) joints to reach the forearm, which further dampens signals. For this experimental condition, we thus decided to place the sensor arrays on the forearm, just below the elbow. Despite these difficulties, pilot experiments showed measurable acoustic differences among fingers, which we theorize is primarily related to finger length and thickness, interactions with the complex structure of the wrist bones, and variations in the acoustic transmission properties of the muscles extending from the fingers to the forearm.
Whole Arm (Five Locations)
Another gesture set investigated the use of five input locations on the forearm and hand: arm, wrist, palm, thumb and middle finger (Figure 7, “Whole Arm”).

Design and Setup
We employed a within-subjects design, with each participant performing tasks in each of the five conditions in randomized order: five fingers with sensors below elbow; five points on the whole arm with the sensors above the elbow; the same points with sensors below the elbow, both sighted and blind; and ten marked points on the forearm with the sensors above the elbow.
Participants were seated in a conventional office chair, in front of a desktop computer that presented stimuli. For conditions with sensors below the elbow, we placed the armband a few centimeters away from the elbow, with one sensor package near the radius and the other near the ulna. For conditions with the sensors above the elbow, we placed the armband a few centimeters above the elbow, such that one sensor package rested on the biceps. Right-handed participants had the armband placed on the left arm, which allowed them to use their dominant hand for finger input.
For the one left-handed participant, we flipped the setup, which had no apparent effect on the operation of the system. Tightness of the armband was adjusted to be firm, but comfortable. While performing tasks, participants could place their elbow on the desk, tucked against their body, or on the chair’s adjustable armrest; most chose the latter.

Procedure
Figure 7. The three input location sets evaluated in the study.

We selected these locations for two important reasons. First, they are distinct and named parts of the body (e.g., “wrist”). This allowed participants to accurately tap these locations without training or markings.
Additionally, these locations proved to be acoustically distinct during piloting, with the large spatial spread of input points offering further variation. We used these locations in three different conditions. One condition placed the sensor above the elbow, while another placed it below. This was incorporated into the experiment to measure the accuracy loss across this significant articulation point (the elbow). Additionally, participants repeated the lower placement condition in an eyes-free context: participants were told to close their eyes and face forward, both for training and testing.
This condition was included to gauge how well users could target on-body input locations in an eyes-free context (e.g., driving).

Forearm (Ten Locations)
In an effort to assess the upper bound of our approach’s sensing resolution, our fifth and final experimental condition used ten locations on just the forearm (Figure 7, “Forearm”). Not only was this a very high density of input locations (unlike the whole-arm condition), but it also relied on an input surface (the forearm) with a high degree of physical uniformity (unlike, e.g., the hand). We expected that these factors would make acoustic sensing difficult. Moreover, this location was compelling due to its large and flat surface area, as well as its immediate accessibility, both visually and for finger input. Simultaneously, this makes for an ideal projection surface for dynamic interfaces. To maximize the surface area for input, we placed the sensor above the elbow. Rather than naming the input locations, as was done in the previously described conditions, we employed small, colored stickers to mark input targets.
This was both to reduce confusion (since locations on the forearm do not have common names) and to increase input consistency. As mentioned previously, we believe the forearm is ideal for projected interface elements; the stickers served as low-tech placeholders for projected buttons.

For each condition, the experimenter walked through the input locations to be tested and demonstrated finger taps on each. Participants practiced duplicating these motions for approximately one minute with each gesture set. This allowed participants to familiarize themselves with our naming conventions (e.g., “pinky”, “wrist”), and to practice tapping their arm and hands with a finger on the opposite hand. It also allowed us to convey the appropriate tap force to participants, who often initially tapped unnecessarily hard.

To train the system, participants were instructed to comfortably tap each location ten times, with a finger of their choosing. This constituted one training round. In total, three rounds of training data were collected per input location set (30 examples per location, 150 data points total).
An exception to this procedure was in the case of the ten forearm locations, where only two rounds were collected to save time (20 examples per location, 200 data points total). Total training time for each experimental condition was approximately three minutes. We used the training data to build an SVM classifier. During the subsequent testing phase, we presented participants with simple text stimuli (e.g., “tap your wrist”), which instructed them where to tap. The order of stimuli was randomized, with each location appearing ten times in total.
The system performed real-time segmentation and classification, and provided immediate feedback to the participant (e.g., “you tapped your wrist”). We provided feedback so that participants could see where the system was making errors (as they would if using a real application). If an input was not segmented (i.e., the tap was too quiet), participants could see this and would simply tap again. Overall, segmentation error rates were negligible in all conditions, and not included in further analysis.

Figure 8. Accuracy of the three whole-arm-centric conditions. Error bars represent standard deviation.
RESULTS
In this section, we report on the classification accuracies for the test phases in the five different conditions. Overall, classification rates were high, with an average accuracy across conditions of 87.6%.

Figure 9. Higher accuracies can be achieved by collapsing the ten input locations into groups. A-E and G were created using a design-centric strategy. F was created following analysis of per-location accuracy data.
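The idea in the Figure 9 caption, that merging nearby input locations into groups raises accuracy, can be illustrated by re-scoring a confusion matrix so that a prediction counts as correct whenever it falls in the same group as the true location. The function name, the data layout, and the toy counts below are hypothetical; they are not results from the study.

```python
def grouped_accuracy(confusion, groups):
    """Accuracy after collapsing locations into groups.

    confusion: dict mapping actual location -> {predicted location: count}.
    groups: dict mapping each location to its group label.
    """
    correct = total = 0
    for actual, row in confusion.items():
        for predicted, count in row.items():
            total += count
            if groups[actual] == groups[predicted]:
                correct += count
    return correct / total if total else 0.0


# Made-up counts for three locations, for illustration only.
toy_confusion = {
    "A": {"A": 8, "B": 2},
    "B": {"A": 1, "B": 9},
    "C": {"C": 10},
}
ungrouped = {name: name for name in toy_confusion}
merged = {"A": "AB", "B": "AB", "C": "C"}

print(grouped_accuracy(toy_confusion, ungrouped))
print(grouped_accuracy(toy_confusion, merged))
```

Because confusions between the two merged locations no longer count as errors, the grouped score can only stay the same or increase relative to the ungrouped one.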