Chapter 3: Psychoacoustics

by Simon Carlile

Download Chapter

Download the chapter: TheSonificationHandbook-chapter3 (PDF, 5.7M)

Media Examples

In this example there are three talkers who start talking one after the other and end up all speaking concurrently. You need to listen to this example using headphones. In the first file (ex1_a) the talkers are mixed so that they are diotic (the sound at the left and right ears is the same) and they all appear to be located in the middle of the head. In the second file (ex1_b) the talkers have been spatialised using some basic auralisation methods so that the individual talkers appear to the right, to the left and in front of (or behind) the listener. The process of spatialisation makes it much easier to segregate the individual talkers and to attend to one message and ignore the other two. This illustrates the importance of spatial hearing in solving the “cocktail party” problem.

media file S3.1 (a)

download: SHB-S3.1a (mp3, 640k)

media file S3.1 (b)

download: SHB-S3.1b (mp3, 642k)

This sound is a logarithmic sine tone sweep from 20 Hz up to 10 kHz with equal amplitude across the whole frequency range. While your headphones or loudspeakers will modify the output levels to some extent, when you play this sound the first half of the stimulus should sound relatively quiet compared to the third quarter, and the last quarter should then trail off into silence as the frequency increases to the upper limit of your hearing. This change in loudness as a function of frequency is due mainly to the frequency-dependent sensitivity of the auditory system and the pattern of equal loudness curves illustrated in Figure 3 in Chapter 3.

media file S3.2

download: SHB-S3.2 (wav, 800k)
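
As a rough illustration of the sweep described above, the sketch below computes the instantaneous frequency of a logarithmic sweep and a single constant-amplitude sample of it. The function names and the 10-second duration are assumptions made for illustration, not properties of the media file; only the 20 Hz and 10 kHz limits come from the text.

```python
import math

def sweep_frequency(t, duration, f_start=20.0, f_end=10_000.0):
    """Instantaneous frequency (Hz) of a logarithmic sweep at time t (seconds)."""
    return f_start * (f_end / f_start) ** (t / duration)

def sweep_sample(t, duration, f_start=20.0, f_end=10_000.0):
    """One sample of a constant-amplitude log sweep.

    The phase is the time integral of the instantaneous frequency.
    """
    k = math.log(f_end / f_start)
    phase = 2.0 * math.pi * f_start * duration / k * (math.exp(k * t / duration) - 1.0)
    return math.sin(phase)

duration = 10.0  # seconds; an assumed value, not taken from the media file
for fraction in (0.0, 0.5, 0.75, 1.0):
    f = sweep_frequency(fraction * duration, duration)
    print(f"{fraction:.2f} of the sweep -> {f:.0f} Hz")
```

Because frequency grows exponentially with time, the halfway point of such a sweep falls at the geometric mean of the two limits (about 447 Hz), so the first half covers only the low-frequency region where hearing is relatively insensitive.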

This file contains six 1-second pulses of a 500 Hz tone that change in level as follows: 0 dB (reference level), +10 dB, +15 dB, +18 dB, +19 dB and +20 dB.

media file S3.3

download: SHB-S3.3 (mp3, 195k)
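
To make the level steps in this example concrete, the following sketch converts each listed level to an amplitude ratio relative to the reference and lists the size of each step. It assumes the usual amplitude (sound pressure) convention of 20·log10; the names are illustrative only.

```python
LEVELS_DB = [0, 10, 15, 18, 19, 20]  # levels listed in the text

def db_to_amplitude_ratio(level_db):
    """Amplitude ratio relative to the 0 dB reference (assumes the 20*log10 convention)."""
    return 10 ** (level_db / 20)

# Size of each successive level step in dB.
steps_db = [b - a for a, b in zip(LEVELS_DB, LEVELS_DB[1:])]

for level in LEVELS_DB:
    print(f"{level:+3d} dB -> amplitude x{db_to_amplitude_ratio(level):.2f}")
print("step sizes (dB):", steps_db)
```

The steps shrink from 10 dB down to 1 dB, so the later pulses differ by progressively smaller amounts.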

This sound contains four pairs of pulses with a 3 Hz difference in frequency between the two pulses of each pair. The frequencies of the first pulse of each pair are 100 Hz, 500 Hz, 1000 Hz and 2000 Hz. For the lower frequencies the frequency differences in the pulse pairs are easily discernible, but as the frequency increases this becomes increasingly difficult. This is an example of Weber’s law, which states that the size of a just noticeable difference (JND) is a constant proportion of the initial stimulus value. In this example the frequency difference has been kept constant between each pair but the value of the initial stimulus increases substantially with each subsequent pair. This means that for the first pair the frequency difference is much larger than the JND, while for the last pair the difference is much smaller than the JND.

media file S3.4

download: SHB-S3.4 (mp3, 258k)
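
The proportional reasoning behind Weber's law in this example can be sketched as follows: the 3 Hz difference is fixed, so its size relative to the first pulse shrinks as the base frequency rises. The names are illustrative, and no JND value is assumed here.

```python
FIRST_PULSE_HZ = [100, 500, 1000, 2000]  # first pulse of each pair, from the text
DELTA_HZ = 3                             # constant difference within each pair

def relative_difference(base_hz, delta_hz=DELTA_HZ):
    """Frequency difference expressed as a fraction of the base frequency."""
    return delta_hz / base_hz

for f in FIRST_PULSE_HZ:
    percent = 100 * relative_difference(f)
    print(f"{f} Hz vs {f + DELTA_HZ} Hz: {percent:.2f} % change")
```

The relative change is twenty times smaller for the 2000 Hz pair than for the 100 Hz pair, which is why the last pair is the hardest to tell apart.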

This example contains four pairs of pulses where the second of each pair is double the frequency of (an octave above) the first. The frequencies of the first pulses are 1 kHz, 2 kHz, 3 kHz and 4 kHz. As the upper frequency crosses the 5 kHz barrier (for the 3 kHz to 6 kHz and the 4 kHz to 8 kHz pairs) we lose our ability to discern an octave spacing within each pair. The reasons for this are not well understood but may relate to the fact that for frequencies above 4 kHz to 5 kHz the auditory nerve is unable to lock the neural signal (action potentials) to the phase of the stimulus. This “phase locking” provides an important temporal code of sound frequency for the nervous system.

media file S3.5

download: SHB-S3.5 (mp3, 258k)

This example contains three different sounds. The first is a pure tone at 200 Hz, the second is a combination tone with three components separated by 200 Hz (800 Hz, 1000 Hz and 1200 Hz) and the third is a 1000 Hz reference tone. Because of the 200 Hz spacing between the components of the combination tone in the second stimulus, the auditory system ascribes to it a fundamental pitch of 200 Hz, matching the first tone even though there is no energy at 200 Hz. The 1000 Hz tone is provided as a reference to highlight the fact that the dominant pitch heard in the second tone is at 200 Hz and certainly not at the centre frequency of the combination tone (1000 Hz)!

media file S3.6

download: SHB-S3.6 (mp3, 195k)
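
The arithmetic behind the 200 Hz pitch in this example can be sketched with a greatest common divisor: the three components are the 4th, 5th and 6th harmonics of a fundamental that is itself absent. This is only a numerical illustration with hypothetical names, not a model of how the auditory system derives pitch.

```python
from functools import reduce
from math import gcd

COMPONENTS_HZ = [800, 1000, 1200]  # components of the combination tone, from the text

def implied_fundamental(components_hz):
    """Largest frequency of which every (integer) component is a whole-number multiple."""
    return reduce(gcd, components_hz)

f0 = implied_fundamental(COMPONENTS_HZ)
harmonic_numbers = [f // f0 for f in COMPONENTS_HZ]

print("implied fundamental:", f0, "Hz")
print("harmonic numbers:", harmonic_numbers)
print("energy present at the fundamental:", f0 in COMPONENTS_HZ)
```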

This example demonstrates how the temporal envelope of a speech sound extracted from just a few frequency channels is sufficient to convey speech information. This file contains four examples of speech processing using the same sentence token. The first is without any processing. The next three are referred to as modulation noise band speech: the temporal envelopes have been extracted for a number of frequency bands and then used to modulate bandpass noise centred on the frequency channels from which the modulation information was extracted. In the second speech token, 10 log-spaced frequency bands from 1 Hz to 16 kHz are used; in the third, 5 bands; and in the last, 3 bands. The speech information is quite evident in all but the last sample, indicating that the temporal variation in the envelope information is sufficient to support a high level of speech intelligibility. This processing is very similar to the sound processing carried out by the cochlear prosthesis.

media file S3.7

download: SHB-S3.7 (mp3, 308k)
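
As a small aid to the band layout mentioned in this example, the sketch below computes log-spaced band edges for 10, 5 and 3 bands. The lower and upper limits are taken as stated in the text; the function name is hypothetical, and the actual filters and envelope extraction used to produce the media file are not described here and are not reproduced.

```python
def log_spaced_band_edges(n_bands, f_low=1.0, f_high=16_000.0):
    """Edges (Hz) of n_bands contiguous bands with a constant frequency ratio per band."""
    ratio = f_high / f_low
    return [f_low * ratio ** (i / n_bands) for i in range(n_bands + 1)]

for n_bands in (10, 5, 3):
    edges = log_spaced_band_edges(n_bands)
    print(n_bands, "bands:", [round(edge, 1) for edge in edges])
```

With fewer bands each band spans a wider frequency range, so the envelope of each channel summarises more of the spectrum and less spectral detail survives.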

This example contains two series of non-harmonically related tones that overlap in frequency and in time. The differences in the onset and offset of each series cause the otherwise unrelated tones to be heard as two auditory objects. You may have to play this example several times to get a good impression of the two auditory objects.

media file S3.8

download: SHB-S3.8 (mp3, 25k)

This sound example contains 3 complex tone pulses. The first contains 5 harmonically related pure tone components with a fundamental at 180 Hz. The second contains 5 different, but overlapping, harmonically related tones with a fundamental a perfect fifth above (270 Hz). The third sound is the combination of the two complex sounds, whose components group themselves into 2 auditory objects based on their harmonicity. That is, they are heard as two sounds separated by an interval of a perfect fifth, because the components have been grouped according to their frequency relationship with one of the two fundamental frequencies in the complex.

media file S3.9

download: SHB-S3.9 (mp3, 162k)
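
The relationship between the two fundamentals in this example can be checked numerically: 270 Hz is 1.5 times 180 Hz, a ratio of 3:2. The sketch below also lists a harmonic series for each fundamental. Which five harmonics the media file actually uses is not stated, so the choice of the first five here is purely an assumption, and the function names are hypothetical.

```python
import math

def harmonic_series(f0_hz, n_components=5):
    """First n_components harmonics of f0_hz (an assumed choice of components)."""
    return [f0_hz * k for k in range(1, n_components + 1)]

def interval_semitones(f_low_hz, f_high_hz):
    """Size of the interval between two frequencies in equal-tempered semitones."""
    return 12 * math.log2(f_high_hz / f_low_hz)

print("180 Hz series:", harmonic_series(180))
print("270 Hz series:", harmonic_series(270))
print("frequency ratio:", 270 / 180)
print(f"interval: {interval_semitones(180, 270):.2f} semitones")
```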

These two files present sequences of tone pips that have different separations in frequency. In the first file (ex10_a) the tones are separated by 20 Hz and the perception of this sequence has been likened to a horse galloping; that is, the tone pips are grouped together as a single auditory entity and are perceived as a single stream of sound. In the second sequence (ex10_b) the tone pips are separated by 400 Hz and after a few seconds of listening the sound breaks up into two different streams, likened to “Morse codes”, one at a high frequency and one at a low frequency. This demonstrates that frequency proximity and temporal proximity can be used to link different sound components together over time into different auditory streams.

media file S3.10 (a)

download: SHB-S3.10a (mp3, 258k)

media file S3.10 (b)

download: SHB-S3.10b (mp3, 258k)

When a number of talkers are speaking concurrently, the differences in the fundamental frequencies of the voices as well as the onset asynchronies of the different phonetic elements can be used to segregate the different talkers. Voice differences are also used to help set up streams associated with each of the talkers. When these differences are decreased it becomes increasingly difficult to ascribe the different phonetic elements to the different talkers. In these examples the same female talker is speaking against a background of male talkers (Ex11_a), two other female talkers (Ex11_b) and two other copies of her own voice speaking different sentences (Ex11_c). The form of the sentence is “Ready call-sign go to colour number now”, taken from the so-called CRM corpus. Try to hear out the colour and the number from the talker using the call-sign “baron”. In the first example the colour and number are blue and three, in the second blue and two, and in the third blue and one. As the differences between the talkers decrease it becomes increasingly difficult to make out what the talker saying “baron” is saying.

media file S3.11 (a)

download: SHB-S3.11a (mp3, 75k)

media file S3.11 (b)

download: SHB-S3.11b (mp3, 75k)

media file S3.11 (c)

download: SHB-S3.11c (mp3, 75k)

In this example there are four pulses of noise separated by 0.5 seconds that are presented in stereo. The second and fourth pulses have an interaural time difference of 300 µs, with the delay first in the right ear and then in the left ear. You need to wear headphones to listen to this demonstration properly. The interaural delays will make the 2nd and 4th noise bursts appear lateralised to the left and right ear respectively.

media file S3.12

download: SHB-S3.12 (mp3, 74k)
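
One way to picture the 300 µs interaural time difference described above is as a whole-sample delay applied to one channel of a stereo pair. The sketch below assumes a 44.1 kHz sample rate and rounds the delay to the nearest sample; both choices, and the function name, are assumptions for illustration rather than a description of how the media file was made.

```python
def apply_itd(mono, itd_seconds, sample_rate, delayed_ear):
    """Return (left, right) channels with one ear delayed by a whole number of samples."""
    delay = round(itd_seconds * sample_rate)  # delay in samples, rounded
    padding = [0.0] * delay
    leading = list(mono) + padding   # undelayed ear, padded to equal length
    lagging = padding + list(mono)   # delayed ear
    if delayed_ear == "right":
        return leading, lagging
    if delayed_ear == "left":
        return lagging, leading
    raise ValueError("delayed_ear must be 'left' or 'right'")

SAMPLE_RATE = 44_100  # Hz; assumed
left, right = apply_itd([1.0, 0.5, 0.25], 300e-6, SAMPLE_RATE, "right")
print("delay in samples:", right.index(1.0) - left.index(1.0))
```

Delaying the right channel means the sound reaches the left ear first, which corresponds to the burst being lateralised towards the left.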

In this example there are four pulses of noise separated by 0.5 seconds that are presented in stereo. The second and fourth pulses have an interaural level difference of -10 dB, with the lower level first in the right ear and then in the left ear. You need to wear headphones to listen to this demonstration properly. The interaural level difference will make the 2nd and 4th noise bursts appear lateralised to the left and right ear respectively.

media file S3.13

download: SHB-S3.13 (mp3, 74k)
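
The -10 dB interaural level difference described above corresponds to scaling one channel by a fixed gain. The sketch below assumes the amplitude convention of 20·log10 and uses hypothetical function names; it is an illustration, not the processing used for the media file.

```python
def ild_gain(ild_db):
    """Linear amplitude gain for a level difference in dB (assumes 20*log10 convention)."""
    return 10 ** (ild_db / 20)

def apply_ild(mono, ild_db, attenuated_ear):
    """Return (left, right) channels with one ear scaled by the ILD gain."""
    gain = ild_gain(ild_db)
    quiet = [gain * sample for sample in mono]
    loud = list(mono)
    if attenuated_ear == "right":
        return loud, quiet
    if attenuated_ear == "left":
        return quiet, loud
    raise ValueError("attenuated_ear must be 'left' or 'right'")

left, right = apply_ild([1.0, -0.5], -10, "right")
print(f"gain for -10 dB: {ild_gain(-10):.3f}")
print("left:", left, "right:", right)
```

A 10 dB reduction leaves roughly a third of the original amplitude in the attenuated ear, so the burst is heard towards the louder side.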

In this example the sound of the talker has been filtered so that they appear to be about 6 metres away and slowly approaching the listener. In this case there are no reverberation cues and the perception of far distance is based solely on the level attenuation and the absorption of high frequencies due to distance. As the talker gets closer, several other HRTF-related changes have a significant impact on the perception of sources in the near field (< 1 m).

media file S3.14

download: SHB-S3.14 (mp3, 1.4M)
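
To give a feel for the level cue in this example, the sketch below estimates the attenuation of a source at several distances relative to 1 m. It assumes an idealised point source in a free field, where level falls by about 6 dB per doubling of distance; the text does not state which attenuation law was used for the media file, and the high-frequency absorption and near-field HRTF effects are not modelled.

```python
import math

def distance_attenuation_db(distance_m, reference_m=1.0):
    """Level drop in dB relative to reference_m (assumes the free-field inverse distance law)."""
    return 20 * math.log10(distance_m / reference_m)

for distance in (6.0, 3.0, 1.5, 1.0):
    drop = distance_attenuation_db(distance)
    print(f"{distance:>4} m: {drop:.1f} dB below the level at 1 m")
```

Under these assumptions a talker approaching from 6 m to 1 m would become roughly 15.6 dB louder.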