Introduction

Does skin have sufficient information capacity to support high-bandwidth sensory substitution, such as the transmission of spoken language?

The idea of mapping sound to touch is not new (Gault 1924). In the late 1980s through the early 1990s, a number of vibrotactile aids were created (Milnes et al. 1996; Yuan et al. 2005; Galvin et al. 1999; Ranjbar et al. 2009; Ronnberg et al. 1998; Weisenberger et al. 1991a, b; Reed and Delhorne 2003; Scott and De Felippo 1977; Galvin et al. 2001; Phillips et al. 1994; Summers and Gratton 1995; Ellis and Robinson 1993; Weisenberger 1989; Traunmuller 1980; Rothenberg et al. 1977). Such aids all work in a similar fashion: by amplitude modulation of a vibrotactile stimulus of fixed frequency based on the envelope of the entire signal (single-channel) or a bandpassed version of the signal (multi-channel). While effective at conveying some degree of adjunct phonetic information (Brooks et al. 1986a, b; Galvin et al. 1999; Weisenberger et al. 1991a, b), they do not perform well in the absence of lipreading (Brooks et al. 1986b; Weisenberger et al. 1991a, b). These aids all implement a lossy method of both speech feature extraction and encoding to the skin.

Our long-term goal is to develop a sound-to-touch device that accurately extracts important features of speech and guarantees that the information is passed through the skin without loss (even though this information may itself be a lossy compression of the original sound). This is where prior implementations fall short; they are thus classified as hearing aids, not hearing solutions. To that end, we seek to determine whether the minimum bandwidth for compressing speech can be pushed low enough—and the maximum bandwidth across skin can be pushed high enough—to meet in the middle.

What is the bandwidth required for speech?

From a classic telecommunications perspective, a digitally sampled audio signal requires a bitrate of 64 kilobits per second (kbps) for intelligibility (Rodman 2006). This is further defined as the signal being sampled at a rate of 8 kHz, with a quantization of 8 bits (ITU 1993). This 64 kbps rate serves as an upper bound for several reasons: (1) The audio signal is uncoded—i.e., no processing is performed on the signal itself to extract the pertinent information required for representing speech. (2) Acceptable intelligibility is characterized as being able to comfortably understand speech (no effort or training is required). (3) The signal is reconstructed back into an analog signal to be sent to the ear. (4) The reconstructed speech signal is meant to represent speech in a manner with which we are familiar. By modifying these assumptions, we can derive much more parsimonious figures for the information rate of speech. For example, intelligible speech information can be compressed into data streams using codecs (encoders–decoders). Modern open-source codecs operate as low as 1.2 kbps (Rowe 1997, 2011). Proprietary codecs can be found that encode speech streams as low as 600 bps (Chamberlain 2001).

An even lower-bitrate approach is to encode only phonetic components, excluding the contextual information conveyed by speech (e.g., questions, sarcasm, emotion). In English, there are ~44 phonemes, thus requiring 5.5 bits to encode an arbitrary phoneme. The average number of phonemes per word in English is ~3 (Lamel et al. 1989), and the rate of spoken English typically varies between 100 and 400 words per minute (Grosjean 1979). Taken together, this translates to a low bitrate of ~110 bps for conveying purely phonetic information. Even this figure could be considered a liberal estimate as the distribution of phonemes in English is not uniformly random and follows a nonstationary process (past observed phonemes can predict future phonemes), with rates estimated closer to ~60 bps for spoken English (Reed and Durlach 1998).

Thus, depending on the approach to encoding the auditory signal, one arrives at a range of necessary bandwidths between 110 bps and 64 kbps for conveying speech information. We now turn to whether skin can support these datarates.

What is the achievable bandwidth of skin? As a first step in answering this question, we attempt to determine an optimal method of physically coding information constrained to an area of skin. The region and size of area used in this study are limited to a single case, but future studies will explore throughput trade-offs that may occur between different coding methods as the size of area varies. The region used for testing is the lower back, which we have chosen for its lack of fine spatial acuity (Weinstein 1968) to provide lower-bounded estimates for this work.

Although information can be coded to the skin in a variety of manners (stretch, temperature, vibration, etc.), we have chosen to explore coding with vibration for a number of reasons. First, vibrational interfaces are inexpensive and commonplace (e.g., vibratory motors, piezoelectrics, and solenoids) as opposed to interfaces for stretch and temperature. Second, temperature has both poor localization and temporal acuity (Cain 1973; Jones and Berris 2002). Coding information using stretch has promising temporal and spatial properties (Gleeson et al. 2009), and compact interfaces have been developed (Hayward and Cruz-Hernandez 2000). While there has been commercialization in this space (Tactical Haptics produces handheld stretch interfaces, for example), compact stretch interfaces have yet to be realized in this context.

Having decided to use vibration, estimating the rate of information that a given area of skin can support will depend on the encoding implementation, which we will explore in Experiment 1. Experiment 2 will determine how far apart a number of interfaces at separate sites on the skin should be separated.

Experiment 1: comparison of skin encoding methods

There are two elements required for estimating an achievable bandwidth: the maximum rate at which information can be presented and the maximum set-size (alphabet size) of the information. A fundamental factor that may affect both of these elements is how this information is encoded, which is examined in this experiment.

A large body of literature exists on deriving throughput estimates under a variety of conditions. One way to conservatively estimate the bitrate at a single site is to perform vibrotactile identification tasks for sequences of stimuli. In one study, researchers estimated achievable rates of 5 and 7 bps per channel on the fingertip and wrist, respectively, using a task of identifying a 160-ms-pulsed vibrational frequency-coded stimulus (Summers et al. 2005) embedded within a sequence of pulses. Here, throughput was calculated using information transfer (IT) (Rabinowitz et al. 1987; Tan 1996), divided by the stimulus duration. We suspect that better rates can be achieved, as 160 ms is an especially long pattern duration. For example, the same group found that participants could discriminate frequency and amplitude changes above chance at pattern durations of 80 ms (and did not test below this; Summers et al. 1997). Further, deriving a throughput estimate is limited by the pattern size that is tested.

One can also relax the assumption of identifying stimuli in sequence or rather treat the inter-stimulus interval (ISI) for identification in a sequence to be zero for calculating upper-bounded limits. One argument for warranting this conjecture is the lack of literature on the long-term effects of training (e.g. over weeks or months). Tactile pattern identification studies on the combined limits of ISI and pattern set-size have yet to be performed in this context. Another argument for such estimation is the question of whether or not the ability to consciously identify patterns is necessary for developing useful percepts. An example of this is intuiting speech, where phonetic perception does not require one to consciously identify and track formant patterns occurring on the cochlea. Given these arguments, work by Cholewiak and Craig (1984) indicates that identification for spatially coded vibrotactile patterns of 4 ms duration is possible. This implies at least 250 bps for binary-coded data and potentially much greater throughputs with larger set-sizes: Cholewiak and Craig (1984) limited the study to 10 patterns, which would imply a limit of ~830 bps. This surprisingly high estimate underscores the need to reiterate that this value is derived from stimuli presented in isolation. Indeed, the study further suggests a throughput of approximately 5 bits/s for sequentially presented stimuli—even for sequences of only two stimuli. This illustrates the limitations of using temporally isolated stimuli as a means of estimating throughput, as real sensory stimulation occurs as a persistent stream.

Another method for encoding tactile information is to use patterns that are encoded in both space and time (e.g., a fast sweep of vibration across the body, which we call a “spatiotemporal” pattern). Several studies have examined psychophysical characteristics of encoding information using space and time, such as the effect of stimulus frequency (Summers and Chanter 2002) or the ability to identify such stimuli (Craig 2002; Evans and Craig 1991; Jones 2011; Tan et al. 2003). With the exception of Craig (2002), this work did not explore identification performance of such spatiotemporal patterns as a function of presentation speed. Here, we expand on this work by comparing the identification performance of spatiotemporal patterns, patterns encoded in just space, and patterns encoded using intensity (defined as frequency monotonically coupled with force) with a single motor varied across presentation speed. We hypothesize that spatiotemporal patterns will yield superior identification over spatial patterns and patterns encoded by a single motor’s intensity for an area of skin. Generally, Tan (1996) writes, “…the most important thing to do in increasing information transfer is to use stimuli with as many dimensions as possible.” Even earlier than this, William James (1890) wrote with great insight: “the fluctuation in a quality’s intensity is a less efficient aid to our abstracting of it than the diversity of the other qualities in whose company it may appear.”

Apparatus

Inspired by Tan et al. (2003) and Jones (2011), we have developed a wirelessly controlled array of vibrotactile motors for delivering arbitrary patterns (“tactons”) to the skin (Fig. 1). The tactons can vary in space, time, and vibrational intensity (frequency coupled with force). Our array is controlled by an open-source microcontroller testbed, the Arduino Uno (16 MHz Atmel ATMega328 RISC processor; outputs control the vibrational intensity of a motor through pulse-width modulation). The device is powered from a battery source and controlled wirelessly over the 802.15.4 protocol (with XBee modules from Digi). For the first experiment, our tacton array consisted of nine vibrational motors in a 3 × 3 grid. Specifically, we used cylindrical eccentric rotating mass (ERM) motors (model #307-100 from Precision Microdrives). Cylindrical brushed vibrational motors have the benefits of being easy to control in intensity, operating in standard voltage ranges (0–5 V), and have fine temporal haptic characteristics (6 ms from off to a perceivable intensity, 19.3 ms from fully on to off using active breaking with H-bridges). We avoided “coin” or “pancake” vibration motors as their design limits their achievable temporal precision (typically ~40 ms on/off times) and amplitude of vibration. The distance between motors was 2.5 cm in the horizontal and vertical directions by location of the motors’ eccentric rotating mass, so the entire array was roughly 5 cm × 5 cm (slightly larger as the separation was measured from the center of each mass, and the motors were elongated in the vertical direction). The motors were pressed firmly against the skin by using an elastic back brace.

Fig. 1
figure 1

Vibrotactile vest used in experiments 1 and 2. Motors are attached to a back brace to ensure that they are pressed firmly against a participant’s back. The controller and battery pack reside in pockets on the back of the vest. The motor layout shown is used in experiment 2

Participants

We tested 10 participants (two female and eight male). Seven of the participants had no prior experience with the device. Two of the participants had minimal experience with the device, having taken the second experiment first, but over 2 months earlier. One participant (one of the authors) had moderate prior experience with the device from developing the experiments. All participants were between the ages of 18–45.

Method

Participants wore the 3 × 3 tacton array on the mid-lower back connected to a computer over the wireless link. The experiment consisted of three blocks of a vibrotactile pattern identification task. At the start of each block, participants were presented with an instruction screen that explained the task and provided a visual representation of the stimulus set (similar to Fig. 2a). During this phase, participants could hover the computer’s mouse over each visual representation. Doing so caused the program to transmit the corresponding stimulus to the array, which allowed the participant to feel the pattern. Hovering the mouse continuously over a representation caused the pattern to be played repeatedly. This panel lasted for 2 min after which the block of trials began. A block consisted of 192 trials. On each trial, a random pattern was chosen and presented to participants as a vibrotactile stimulus. Participants were then asked to identify the pattern from the set of eight possible patterns (chance = 12.5 %). Unlike Summers et al. (2005), stimuli were presented in temporal isolation on each trial (not within a train of stimuli). Each stimulus was presented eight times per block using one of the three possible pattern durations: 45, 90, or 135 ms. Therefore, each pattern was presented a total of 24 times in a block. Each block used a set of patterns built from a distinct type of encoding as follows (Fig. 2a):

Fig. 2
figure 2

a The alphabet of possible patterns used for each type of vibrotactile encoding. b Pattern identification performances as a function of pattern duration and encoding type. Bars indicate mean with standard error of the mean

Block 1: A single vibratory motor pulse presented at one of the eight intensity levels. Intensity levels were determined as a function of frequency, ranging from ~70 to 340 Hz. Due to the type of motor used, other effects like force cannot be controlled for, but monotonically increases with the frequency of vibration (as referenced from the motors datasheet). The coupling of frequency and force has previously been shown to be effective (Pongrac 2006) for increasing discriminability. The step size of the frequency divisions was determined using a Weber fraction of 0.2–0.3 in line with previous literature (Cohen and Kirman 1986; Mahns et al. 2006; Pongrac 2006). The characteristics of the full stimulus set are listed in Table S1.

Block 2: Spatial tactons a combination of motors was on for the entire pattern duration. Spatial configurations were determined to be as orthogonal as possible under the constraints of using three motors at a fixed intensity (the maximum possible at ~340 Hz) and having a center of gravity in the middle of the array (as is the case for Blocks 1 and 3).

Block 3: Spatiotemporal tactons Neighboring motors were turned on and off in sequence to produce vibratory “sweeps” across the skin. We designed a pattern set in which adjacent motors were turned on and off in succession. The sweeps were contained entirely within the pattern duration (e.g., if the pattern duration was 45 ms, each motor was activated in succession for 15 ms). If turned on, a motor was set to the maximum intensity level, ~340 Hz.

As each of the three blocks had a total possible stimulus set-size that was orders of magnitude greater than the eight stimuli applied for study, we formulated the applied sets to be as equivalent (versus as optimal) as possible between each other. Specifically, each stimulus set maintained a center of gravity on the middle of the array, used equal presentation times, and used the same number of motors (three, with the exception of the first intensity-coded block as to avoid a bias in spatial layout).

Experiment 1 results

Figure 2b shows identification performance as a function of the condition and the pattern duration. A Friedman test indicates that the encoding scheme has a significant effect on performance (χ 2 (2,4) = 23.72, p ≪ 0.01). A two-way ANOVA was ruled out after failing Levene’s test for equality of variances. Even with untrained subjects, identification performance is well above chance at pattern durations as low as 45 ms. Second, as per our working hypothesis, spatiotemporal patterns yield higher identification performance than either spatial patterns or single motor amplitude modulation. Identification of the single motor intensity and spatial patterns remains fairly constant with pattern duration as a function of pattern length, and yet, as noted by Craig (2002), spatiotemporal performance improves with longer duration.

Observing participants’ confusions between stimuli averaged over all participants and durations, we find that spatiotemporal patterns appear to exhibit the least variance in confusion compared to other methods of encoding (Fig. 3, Table S2). For the single motor case, participants tend to confuse patterns of neighboring intensity. Spatial encodings have a more uniform confusion matrix. This coupled with poor overall (but still above chance) performance (Fig. 2) may be indicative of motor spacing being minimally over the vibrotactile two-point resolution threshold for the lower back (a topic that is subsequently explored in Experiment 2). It should also be reiterated that all three stimulus sets used were primarily derived to be as equivalent as possible between one another, as opposed to being optimal for within-set identification. For example, better spatial set identification performance might be achieved by manipulating the patterns’ centers of gravity. This might provide an “unfair” advantage when attempting to compare performance to the other sets, however. For spatiotemporal patterns, diagonal patterns presented to the participant tended to get confused with horizontal patterns containing a diagonal pattern’s horizontal component. This did not occur for the reverse case, however, when the presented stimulus was a horizontal sweep.

Fig. 3
figure 3

Confusion matrices for each type of encoding, averaged across all participants and duration

Last, we calculated the information transfer (IT) (Rabinowitz et al. 1987; Tan 1996) and hypothetical throughput (as stimuli were presented in isolation) for each case as a metric for comparison (Fig. 4a, b). Information transfer can be thought of as a measure of the number of possible bits that can be sent in a transmission taking the amount confusion between stimuli (encoded bits) into account. A Friedman test indicates that encoding scheme has a significant effect on both IT and throughput (χ 2(2,4) = 33.69, p ≪ 0.01). A two-way ANOVA was ruled out after failing Levene’s test for demonstrating equality of variances. The formula used to calculate IT can yield biased estimates, however, so a suggested corrective factor is given according to Miller (1953). This measurement according to Miller (1953) tends to overcorrect unless n > 5k 2, where n is the total number of trials devoted to an alphabet and k is the alphabet size in this case. As our experiment does not satisfy this condition (n = 64, 5 k 2 = 320), we (1) performed an analysis of IT pooled over all subjects (Fig. 4b) and (2) provide the IT and throughput with and without the corrective factor (Figure S1A, B). The true IT is expected to lay between these three values (pooled IT, IT without a correction, and IT with a correction). If we divide the information transfer by the length of the pattern duration, we can obtain a hypothetical asymptotic estimate of throughput in bits per second (Fig. 4b). To derive a concrete throughput estimate, identification testing of stimuli in sequences of stimuli needs to be performed while varying inter-stimulus intervals (ISI). The hypothetical estimates in Fig. 4b assume that stimulus identification is possible at ISI = 0. Further, if one wishes to find an achievable throughput, one should also maximize training and the stimulus set-size. We provide these estimates to point out that a trade-off exists between pattern duration and throughput. It should be noted that the IT calculation takes into account not only the proportion of stimuli calculated correctly, but also the error patterns for each of the stimuli. As such, the results from Fig. 2b do not directly translate to those in Fig. 4a, b but both indicate that spatiotemporal sweeps are the best encoding method.

Fig. 4
figure 4

a Pooled IT and b pooled hypothetical throughput (as stimuli were present in isolation) for each encoding method and presentation duration. Bars indicate mean with standard error of the mean

Spatiotemporal patterns yield the best IT, but at the slowest pattern duration (Fig. 4). The highest IT values are, in general, expected at the slowest presentation speed as (1) its calculation is independent of time and (2) slower presentations are generally easier to identify. At the fastest (45 ms) pattern duration, the single motor case has the highest IT, but by a nonsignificant amount. In all other results, however, spatiotemporal patterns appear to yield the best performance (Figs. 2b, 3, 4a). The discrepancy from the single motor case could be due to lack of training: The best-performing participant (who had some prior experience wearing the vest) had an IT between 1.10 and 1.65 bits for the 45 ms spatiotemporal condition, but only an IT between 0.54 and 1.09 bits in the 45 ms single motor condition. Paying attention to the actual IT values, the single motor case is slightly <1 bit, indicating that only the least and most intense patterns can be consistently discriminated. Therefore, this set might as well be collapsed to just these two patterns. Spatial patterns have an IT that is well below 1 bit, which means that identification errors would abound even if the set were reduced to just two patterns—a telling sign that the motors have been spaced well under the vibrotactile two-tacton resolution threshold for that region of skin. Single motor and spatial pattern IT are not affected by pattern duration (single motor: χ 2(2,4) = 33.69, p ≪ 0.01 spatial: χ2(2,4) = 33.69, p = 0.15), suggesting that duration is irrelevant, provided it is longer than some threshold (<45 ms), see Fig. 4a.

Counter to this, we observe that spatiotemporal patterns do exhibit a relationship with presentation duration. These patterns become less salient at shorter durations (to the point where coding information using intensity is more effective). The relationship is also somewhat proportional, which would imply that an achievable IT rate might be fixed as a function of duration.

Collapsing results across pattern duration yields IT estimates of 0.15, 0.60, and 0.69 bits for spatial, single motor, and spatiotemporal patterns, respectively. Spatiotemporal patterns yield better identification performance overall, with the singular exception of the 45 ms single motor case. Last, the alphabet size of spatiotemporal patterns is much more scalable than the single motor case, which is fundamentally limited by the Weber fraction and the range of intensities to which the skin’s receptors are sensitive.

While it immediately appears that spatiotemporal patterns have the greatest potential for encoding information to the skin, there is one subtle point of contention between the spatiotemporal and spatial encodings. Are the spatiotemporal patterns truly being integrated over to form a single perception? Or could our spatiotemporal patterns in effect be cheating—providing three distinct spatial encoding frames. Further, such spatiotemporal frames have different centers of gravity as opposed to the spatial encodings, which all have a fixed center of gravity. The spatiotemporal patterns do maintain average fixed center of gravity that is equal to the spatial encodings, however. A future study is required to disentangle this issue: An optimal spatial set—as opposed to the spatial set in this study that is designed to be as equivalent as possible to the single motor and spatiotemporal sets—should be constructed and tested against this study’s spatiotemporal set or an optimal spatiotemporal set. In addition, these sets should be presented at a frame rate for which the spatiotemporal frames fuse into a single percept, i.e., the individual frames are not apparent as distinct spatial patterns. This would require a different type of tactile interface (such as a piezoelectrics or voice-coil actuators) to be used instead of the eccentric rotating mass (ERM) motors used in this study, which have a ~10–20 ms resolution. Regardless, the total spatial pattern set space is a subset of the total spatiotemporal pattern set space. This at least implies that spatiotemporal encodings have a greater IT potential.

With these considerations taken together, we conclude that the most effective way to encode information for the skin between the methods tested is to use spatiotemporal patterns. This supports Tan’s aforementioned insight that it is best to use as many dimensions as possible (Tan 1996). It follows that (1) combining amplitude and frequency-modulated characteristics, (2) varying center of gravity, and (3) modulating tactile interface on/off timing to induce spatiotemporal sweeps that are not constant in speed should all contribute to an optimal class of vibrotactile codes.

Experiment 2: vibrotactile 2-tacton resolution

The effects of vibrotactile array placement have been studied for temporally static presentations and single arrays (Bikah et al. 2008; Cholewiak and Collins 1995; Geldard and Sherrick 1965; Mahns et al. 2006; Summers et al. 2005). The earliest study of pattern discriminability used a single vibrotactile array placed across the entire body (Geldard and Sherrick 1965). For the case of a temporally static presentation, and discrimination between two patterns that differed by only one array element, the researchers found that the number of errors increased exponentially as a function of the number of array elements. A more important factor dictating discrimination performance, they discovered, was the principle of communality. Communality is the property that the similarity of two patterns—not the number of elements involved—dictates how discriminable two patterns are. Geldard and Sherrick found that the number of elements did not play a meaningful role. For some cases, there was better discrimination performance with more elements (Geldard and Sherrick 1965).

But does the communality principle hold using smaller arrays at different body sites? Cholewiak and Collins (1995) tested whether the communality principle held at the finger, palm, and thigh using smaller arrays, and also temporally static pattern presentations. They found that the principle held as long as two-point discrimination thresholds were roughly obeyed.

Jones (2011) examined identification of temporally dynamic tactile patterns on the forearm, waist, and back. Jones found markedly poorer identification performance on the forearm. Upon further investigation, it appeared that the direction of the pattern played a role in identification: Transverse sweeps were easier to identify than longitudinal sweeps, which were easier to identify than combined (diagonal sweeps). This effect was not seen on the waist or back and may have to do with the asymmetry of receptive fields on the arm. But what of arrays operating simultaneously in parallel? Evans and Craig (1991) found that when two vibrotactile sweeps moved in the same direction, the target stimulus was easier to identify. These results appear to demonstrate that the classical communality principle does not hold for temporally dynamic patterns. We expand on this work by observing discrimination performance as a function of array separation distance and understanding these results in the context of results found in Experiment 1.

Apparatus

We used the same hardware as in Experiment 1, but with a different layout. Rather than a 3 × 3 array of our cylindrical brushed motors, we used a 5 × 2 motor layout subdivided into five 1 × 2 vertical arrays. The horizontal spacing between each array was 1.5 cm. Within each array, we used a 2.5-cm vertical spacing. Spacing was measured relative to the locations of the motors’ eccentric rotating masses (Fig. 1). The 2.5-cm vertical spacing was used to maintain a consistency with Experiment 1 for providing spatiotemporal stimuli (using vertical sweeps).

Participants

We tested 15 participants. Six of the participants had previous experience wearing the device from the first experiment. Of the 15 participants, three were female and 12 were male. All were between the ages of 18 and 45.

Method

Similar to Experiment 1, participants wore the motor array on the mid-lower back (Fig. 1) connected wirelessly to a computer. An instruction panel explained the task, but no example stimuli were provided. For 192 trials, participants were presented with a pair of simultaneous stimuli lasting 60 ms from two out of the five possible arrays chosen at random and judged if they felt one or two stimuli. On any given trial, there were three possible stimulus sets (Fig. 5A):

Fig. 5
figure 5

a The set of possible stimuli used in experiment 2. Each stimulus type (same direction sweeps, opposite direction sweeps, or a pair of on–off pulses) could occur at any of four horizontal distances (1.5, 3, 4.5, or 6 cm apart). b Two-tacton resolution as a function of horizontal distance between vertical pairs of motors. Resolution was comparable across all stimulus types and reached >80 % by 6 cm of separation. Bars indicate mean and standard error of the mean

Stimulus 1: Parallel vertical vibratory sweeps On a given trial, two of the five arrays gave a simultaneous vertical sweep in the same direction (up or down).

Stimulus 2: Opposing direction vertical vibratory sweeps On a given trial, two of the five arrays gave a simultaneous vertical sweep in opposing directions (one up and one down).

Stimulus 3: Single pulses On a given trial, two of the five arrays gave a pulse (simultaneously) using the top motor of their respective arrays.

When turned on, the motors were set to the maximum intensity level with a vibration frequency of approximately 340 Hz. Sixteen trials were given for each possible array separation distance. For a given distance, the array pairs were randomly chosen using a uniform distribution. Participants were told of the possible stimuli, but also told that some trials contained only one stimulus (rather than two). This was done so that participants would not be biased toward answering that they felt two stimuli. We did not actually present single stimulus trials as participants could possibly use additive intensities as a cue. Because this is not a classic two-point discrimination task—where either one or two stimuli might be presented on a given trial and stimuli can involve spatiotemporal patterns—this experiment is called a “vibrotactile two-tacton resolution” experiment.

Experiment 2 results

Resolving tactons as two individual patterns was comparable for all stimulus types according to Friedman’s test (χ 2(2,6) = 4.57, p > 0.1), but with spatiotemporal patterns (vertical sweeps in this case) trending toward better discrimination than single motor pulses (Fig. 5b). A two-way ANOVA was ruled out after failing Levene’s test for equality of variances. One possible reason for this trend might be due to travelling waves for spatiotemporal stimuli causing a small additive effective. This could provide an intensity cue. Another possible reason for this trend is the question previously raised in Experiment 1: The spatial codes maintain a fixed center of gravity, but the spatiotemporal sweeps, if they are perceptually equivalent to two shorter spatial frames and have the benefit of two spatial presentations with different centers of gravity.

At 6 cm apart, the mean performance for resolving two tactons was >80 % for all types of patterns (Fig. 5b). At this distance, four participants for the single motor pulse case obtained 100 % resolution, and five participants for both types of sweeps obtained 100 % resolution (different participants in each case). Participants judged the tactons as distinct more often than not at about 4 cm apart (the 50 % criterion). Our result sheds light on the poor performance obtained with spatial patterns in Experiment 1, as each motor was separated by <4 cm (a separation of 2.5 cm in Experiment 1). Further, as identification performance of spatiotemporal patterns in Experiment 1 was well above the performance of spatial patterns, this demonstrates that two-tacton limitations of spatial patterns can be overcome by applying time as another dimension to the stimulus (that is, sweeps of vibration within an array).

Discussion

We conclude that spatiotemporal patterns are a preferred method for encoding information to the skin in the context of the performed studies. More importantly, the results of these studies provide supporting evidence that encoding across more dimensions is better. This suggests that even higher ITs might be achieved by modulating the frequency and/or amplitude and the timing of frames for spatiotemporal patterns. While these studies are not able to disentangle whether the spatiotemporal stimuli are “cheating” by being multiple spatial stimuli or are truly being integrated as a single percept, spatial stimuli can still be considered a subset of the spatiotemporal set. We conclude that spatiotemporal patterns offer a much richer stimulus set (and therefore a potentially higher IT) when constrained to a fixed pattern duration.

We can also use the results of these studies to attempt an achievable throughput estimate for the torso under the assumptions that (1) an interstimulus interval of zero is reachable and (2) parallel arrays can be used and are scalable with regard to throughput. The achievable hypothetical throughput for a single 5 × 5 cm square 3 × 3 vibrotactile array is at least 10 bits per second (potentially much greater with a larger stimulus set). From experiment 2, we determined that multiple arrays must be placed approximately 6 cm apart to be able to resolve patterns between arrays. From these data, we can roughly estimate whether skin can support the transmission of speech: An average torso conservatively has a surface area of 3500 cm2 (Tikuisis et al. 2001), which means we could fit at least 25 arrays (each of which requires a devoted area of 121 cm2 for proper separation specific to the tactile interfaces used in this study) on the body. From this, one can only crudely speculate an achievable bit per second. Assuming that throughput scales linearly with the number of arrays, one estimate could be 250 bps (using 10 bits per second). The best-performing participant registered a throughput between 24 and 37 bps per site, which (under a linear assumption) would translate into 600 to 925 bps if using 25 sites across the torso. Such an estimate does suggest the necessary throughput for speech information might exist, given that phonemes have an estimated throughput of 60 bps (Reed and Durlach 1998) or speech audio can be compressed to as low as 600 bps (Chamberlain 2001). These extrapolations must be read with appropriate caution: We presented stimuli in temporal isolation, constrained the set to a fixed alphabet of eight patterns, performed identification experiments using a single array (i.e., would use of multiple arrays be integrated over as one with a larger alphabet?), and used untrained participants. These extrapolations could either be far lesser or greater than reality depending on the aforementioned factors and the accuracy of the assumptions. Nonetheless, these results provide the first steps toward determining the best methods for optimizing vibratory throughput for the skin. While it is possible to derive a more formal model, this estimate serves the purpose of demonstrating that usable high-throughput sensory substitution applications are possible (as underscored by the work of Bach-y-Rita et al. 1969), and as such, we hope to spur the development of more applications in this field.

Six questions remain before one can derive a more formal estimate of the limitations of sending information through skin using vibration. First, our study was limited to temporally isolated stimuli. In the context of stimuli in sequence, the more separated stimuli are in time (Summers et al. 1997) and the longer their duration within a sequence (Summers et al. 2005), the easier they are to identify. While our calculated IT and throughput may be optimistic in the context of being temporally isolated, it should still be noted that our participants (and participants in the aforementioned studies) were untrained. As such, these calculations could also be conservative. Further, the ability to discern complex time-varying patterns is a central issue in determining the viability of high-throughput sensory substitution devices. Supporting evidence comes from vision-to-touch substitution studies (Bach-y-Rita et al. 1969; Chebat et al. 2011). The ability to identify individual tactile stimuli may not be necessary for providing useful perceptions. In normal speech comprehension, for example, one does not consciously track the stream of individual spectrotemporal components (formants) resonating in the cochlea that forms the auditory perception of phonemes.

Our second question regards the degree to which long-term training improves information transfer. In Experiment 1, while we found the single motor case to have the best IT rate, it was by a nonsignificant amount compared to the spatiotemporal case. The most trained participant had an IT rate in the spatiotemporal case well above that for single motor case. Presumably, long-term practice and feedback can grow the tactile vocabulary, refine vibrotactile two-tacton resolution, and reduce the time window for pattern presentation. Geldard (1957) was able to train several participants to total proficiency using a 45-letter vibrotactile alphabet. This could yield a throughput of about 120 bps with a 45 ms pattern duration for a single array presenting spatiotemporal patterns.

Third, how will nonlinear, time-varying, and inter-channel-dependent properties of skin receptors play a role physically and perceptually? So far, we have limited our study to single pattern presentations separated by long periods of time (≫1 s). Challenges with vibrotactile pattern resolution and identification can arise in both temporal and spatial domains. For example, masking effects have been shown to exist both temporally and spatially (Cholewiak and Collins 1995; Craig 1982; Enriquez and Maclean 2008; Geldard and Sherrick 1965). Vibrotactile adaptation is another known effect, which has been a prolific topic of study since the 1930s (Bensmaïa et al. 2005; Gescheider et al. 2004; Leung et al. 2005; O’Mara et al. 1988; Watanabe et al. 2010; Wedell and Cumming 1938). Some of these effects may act as fundamental bottlenecks, some might be overcome through training, and some might even be beneficial for discrimination and identification (Goble and Hollins 1994).

Fourth, can training in conjunction with other sensory modalities improve information transfer? Held and Hein (1963) demonstrated that receiving correlated sensory input from other modalities is integral to sensory development. More recently, Lim and Holt (2011) have shown that training using implicit feedback (i.e., from a video game) with correlated stimuli from other modalities can greatly increase performance and reduce training times over traditional explicit feedback methods (as what is used in this report).

Fifth, how does IT change as a function of the number of vibrotactile arrays functioning in concert, how does that relate to their placement, and how does that effect arriving at an optimal stimulus set? Might multiple arrays be perceptually integrated over to act as one large display? Does the addition of each array effectively exponentially increase the size of our vocabulary in terms of performance, despite each array drawing upon the same vocabulary? Returning to James’ (1890) insight, we may find it better to utilize one large array that can encode a small amount over many different features, rather than multiple arrays working in concert. Further, multiple arrays might integrate as one. Tan (1996) has promoted this idea of using higher-dimensional stimuli, each with a small set of possible values rather than the converse. She argues that our ability to recognize and discriminate between faces is a good example of this notion. Faces have a rich set of features, but we do not pay much conscious attention to the minute details of each feature. Rather, it is the sum of the features taken together that aid in discrimination. The results of our first and second experiments underscore this notion of higher dimensionality yielding better identification performance. We found that single motor pulses and spatiotemporal sweeps were much easier to discern than simple spatial patterns (Experiment 1). In Experiment 2, we found a reliable vibrotactile two-tacton resolution of ~6 cm (note that this is not the threshold). In Experiment 1, our motors were spaced 2.5 cm apart. Thus, the results of Experiment 2 are sufficient for explaining the poor (but above chance) discrimination performance of purely spatial patterns. It is the addition of time as a dimension for encoding information that enabled us to overcome this spatial limitation.

Last, how might an intended application impact encoding scheme? For example, information can be represented quantitatively (e.g., most raw sensory information) or categorically (e.g., text). Given a sensory application like sound-to-touch substitution, coding information in vibrotactile intensity (vibration/amplitude), in order to preserve a mapping with quantized signal amplitudes or transform coefficient values, could be much more effective than encoding this information in spatiotemporal sweeps.

In summary, our findings demonstrate that—for a single site—spatiotemporal sweeps yield a better information transfer than spatial and intensity encoded patterns. Further, spatiotemporal patterns vastly increase the potential stimulus set-size and can reduce two-tacton resolution limitations for a multi-motor array. Our future work will first be to determine the efficacy of spatiotemporal and spatial encodings at the limits of tactile temporal acuity. This will serve to elucidate whether or not spatiotemporal encodings provide a true benefit over spatial encodings. Second, we will determine the IT of spatiotemporal identification performance of stimuli in sequence. In addition, we will determine whether identification of individual stimuli is necessary for deriving useful percepts. Last, we will explore how long-term practice, correlation with other senses, feedback from motor output, and higher-dimensional tactons might drive information transfer even higher. This work provides first steps toward determining the plausibility of a complete sound-to-touch sensory substitution device to encode speech through the skin for the deaf.