◙ The science of speech sound

Phonetics is the science of speech sounds. In my time at GCHQ I spent 4 years working on speech synthesis, work which I was free to discuss with anyone. In fact, just before I moved within GCHQ, some Russian linguists asked to visit us. This caused such consternation among the security people that shortly afterwards the work was moved to Malvern.

Speech synthesis turned out to be an excellent way of learning about phonetics. If I was ever uncertain about how anything was actually said, I could experiment with small changes that no person could consistently make, and hear what sounded right and what sounded wrong. I always suspected that some of the things said about Scottish speech were wrong, and I was able to prove that. Unfortunately I never kept any recordings of synthesised Scottish speech.

As a science, phonetics is bedevilled by a form of parochialism. We may speak the same language, but the American way of describing it was almost incomprehensible to me. Even the English way causes me some problems. In this talk, I will concentrate on English as spoken in both England and Scotland, and will occasionally mention other languages, but I will ignore America.

The description of English is always based on so-called “Received Pronunciation”, also known as the Queen’s English or BBC English. I will usually call it RP. There may have been people who spoke this in our youth, but I rather doubt whether we ever hear it now.

The nature of sound and its analysis

Before I try to describe the sounds of speech, I will give a brief introduction to sound in general. I did this in my talk on sound recording, but some of you might find a brief recap worthwhile. In the context of speech, sound is simply vibrations in air. This picture shows a tuning fork vibrating and causing vibrations to propagate in the air. These vibrations are described in terms of frequency and amplitude. Frequency is a measure of the number of vibrations per second, 1 vibration (or cycle) per second being known as 1 hertz (Hz). The higher the frequency, the higher the pitch of the sound. For speech purposes we are interested in a range from about 70 Hz to several thousand hertz.

Amplitude is a measure of the strength of these vibrations. I will usually use the word loudness in this context. It is measured in decibels, but that is quite a difficult concept and I won’t refer to them again.
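For anyone who prefers to see frequency and amplitude as numbers, here is a minimal Python sketch of a pure tone like that of the tuning fork. The function names and the sample rate of 8000 samples per second are assumptions made for illustration; they do not come from the talk.

```python
import math

def tone(frequency_hz, amplitude, duration_s, sample_rate=8000):
    """Return the samples of a pure tone (a sine wave).

    frequency_hz sets how many cycles occur per second (the pitch);
    amplitude sets how strong the vibration is (the loudness).
    """
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
            for i in range(n)]

def count_cycles(samples):
    """Estimate the number of cycles by counting upward zero crossings."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

# One second of a low, loud tone and of a higher, quieter one.
low_loud = tone(100, 1.0, 1.0)
high_quiet = tone(200, 0.5, 1.0)

print(count_cycles(low_loud), max(low_loud))
print(count_cycles(high_quiet), max(high_quiet))
```

The cycle count in one second is the frequency in hertz (give or take one cycle at the edges of the recording), and the largest sample value is the amplitude.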

You also need to know a bit about resonance. Every physical object, including a contained body of air, has a number of frequencies at which it will naturally vibrate, and this is the principle of musical instruments. When sound with a wide range of frequencies hits a body, the natural frequencies of the body will be made louder, and other frequencies will be weakened, sometimes to the point of becoming inaudible. This is known as resonance.
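Resonance can be sketched numerically with the textbook response of a single damped resonator driven at a steady frequency. This is a simplified model, not a description of any real vocal tract. The natural frequency of 500 Hz and the sharpness value `q` are assumptions chosen only for illustration.

```python
import math

def resonator_gain(f_hz, natural_hz, q=10.0):
    """Relative loudness of a steady tone of f_hz after passing through
    a simple resonator whose natural frequency is natural_hz.

    q is an assumed sharpness: the larger it is, the narrower the peak.
    """
    r = f_hz / natural_hz
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)

# Sweep a few frequencies past a resonator tuned to 500 Hz.
for f in (250, 400, 500, 600, 1000):
    print(f, round(resonator_gain(f, 500), 2))
```

The gain peaks at the natural frequency and falls away on either side, which is the sense in which the body "picks out" its own frequencies from a wide-ranging sound.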

How speech is produced

The production of speech is so complicated that it’s a wonder that a child can master it. The first thing that is needed is a source of sound. This is provided by the vocal cords, which are positioned in the larynx. You might be thinking of something like guitar strings, but the name is misleading since they are really two flaps of muscle and connective tissue. They are firmly fixed at the forward end in the Adam’s apple (even in women!), but at the back they can be moved together or apart to produce a gap of varying width, called the glottis. This moving together or apart can be done at very fast rates, ranging from 70 Hz in some men to over 200 Hz in some women and about 300 Hz in children, and this is what gives the voice its pitch. When we are speaking we have fine control over pitch.

We also have fine control over the amplitude of the vibrations, which determines the loudness of the voice. In addition, we can let extra air through to produce a breathy voice, or we can restrict the amount of air to produce the creaky voice used by ventriloquists. Both of these effects can be finely controlled, and are used in some languages, but not English, to distinguish words. We also close the glottis completely for some speech sounds, and keep it open but without vibration for others.

Having produced the basic sound, known in phonetics as voice, we have to modify it to create speech as it passes through the vocal tract. We have basically three ways of doing this. We can move the tongue to alter the shape of the mouth cavity, we can move the lips and jaw to alter the sound which comes out, and we can raise or lower the soft palate to affect the amount of sound which passes through the nose. The organs which move are known as articulators, and the points of articulation are shown here.

All of these movements are taking place rapidly, and we are usually completely unaware of them. If you want to teach anyone to play a musical instrument, you have to tell them what to do with their hands and sometimes mouth, but a child learns to speak solely by listening and experimenting with its own voice until it is copying what it hears. What you hear is not what a listener hears, so this is not straightforward. Partly because of this, and partly because we are all physically different, no two people have identical pronunciation.

How speech is heard

Hearing speech is much simpler. In physiological terms it is identical to hearing any other sound. The vibrations of the air cause corresponding vibrations in the ear, which result in nerve signals being sent to the brain. It seems that early in the processing the brain distinguishes between speech sounds and others, and they are probably sent to different parts of the brain. It is actually very difficult, if not impossible, to hear speech simply as a sequence of sounds.

What we need to know about hearing is that we are attuned to certain speech sounds, and don’t find it easy to hear others. When we listen to a foreign language we may not be able to tell that some of the sounds do not occur in English. To me, some of the sounds which an English speaker distinguishes are indistinguishable, so “good” and “food” have the same vowel. We are also constantly predicting what words are coming next, and don’t need to hear them perfectly to recognise them. The effect of this is that speech can be far from perfect without becoming incomprehensible.

How speech is analysed

It is extremely difficult to take an objective view of speech, because you have to listen to the sounds without being distracted by the meaning. The first thing our ears do is to analyse the incoming sound into individual frequencies, and there are instruments known as spectrographs (nowadays more likely to be software) which can do the same, and print out the result as a spectrogram. This diagram shows time horizontally and frequency vertically, with amplitude shown as blackness, and you can see at the top that it is the utterance “I can see you”. The vertical lines are effectively puffs of air coming through the vocal cords, so if you measure the gaps between them you can find out the pitch of the voice. The dark bands which you can see in places are resonances. You should notice a brief silence starting at about 0.4 seconds, and a short burst of high frequency for the /s/ from 1.0 to 1.2 seconds. Modern spectrograms like this are in colour, but I find the black and white easier to read.
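Measuring the gaps between the vertical lines of a spectrogram to find the pitch is simple arithmetic: the pitch in hertz is one divided by the gap in seconds. Here is a small Python sketch of that calculation. The pulse times are made-up values for illustration, not readings from the spectrogram shown.

```python
def pitch_from_pulses(pulse_times_s):
    """Estimate pitch in Hz from the times (in seconds) of successive
    puffs of air through the vocal cords."""
    if len(pulse_times_s) < 2:
        raise ValueError("need at least two pulses to measure a gap")
    gaps = [later - earlier
            for earlier, later in zip(pulse_times_s, pulse_times_s[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return 1.0 / mean_gap

# Hypothetical pulses 8 milliseconds apart.
print(pitch_from_pulses([0.000, 0.008, 0.016, 0.024]))
```

Pulses 8 milliseconds apart correspond to about 125 Hz, which is within the range of a man's speaking voice given earlier.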


Now I will move on to the nature of speech sounds. We are all familiar with the idea that C-A-T spells “cat”. The individual speech sounds such as /k/ are known as “phonemes”, and they are specific to a language. For example, we have the phoneme “ch”, but it actually consists of two sounds (t and sh) in quick succession. A French or German native would have to learn it as two sounds. This works the other way too. In German there is the single phoneme /ts/, spelled “z” or “tz”, which we hear as /t/ followed by /s/.

When we speak about different languages, the concept of a phoneme can be quite slippery. For example, try saying the word “little”, and notice that your tongue is in different positions for the initial /l/ and the final one. The two sounds are known as light and dark /l/, and in some languages are seen as different. We regard them as the same phoneme, but a phonetician calls them separate allophones in English. Two sounds are different phonemes if they can be used to distinguish words.
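The test for a phoneme, that swapping one sound for another gives a different word, can be written as a small check for what phoneticians call a minimal pair. This is a minimal Python sketch; the function names and the transcriptions of “sing” and “zing” are assumptions made for illustration.

```python
def is_minimal_pair(word_a, word_b):
    """True if two transcriptions have the same length and differ in
    exactly one sound."""
    if len(word_a) != len(word_b):
        return False
    return sum(1 for a, b in zip(word_a, word_b) if a != b) == 1

def contrasting_sounds(word_a, word_b):
    """Return the pair of sounds which distinguishes a minimal pair,
    or None if the words are not a minimal pair."""
    if not is_minimal_pair(word_a, word_b):
        return None
    for a, b in zip(word_a, word_b):
        if a != b:
            return (a, b)

# Assumed transcriptions, one symbol per sound.
sing = ["s", "ɪ", "ŋ"]
zing = ["z", "ɪ", "ŋ"]

print(contrasting_sounds(sing, zing))
```

If two different words differ only in one sound, those two sounds must be separate phonemes in that language. Light and dark /l/ never produce such a pair in English, which is why they count as allophones.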

It is important at this point not to think about written language. A phonetician will talk about sounds or phonemes, not about letters. (Incidentally, a phoneticist is an advocate of phonetic spelling, and a phonetician is an expert on phonetics.)

Before we can talk clearly about phonemes, we have to have a way of writing them down. English spelling is notoriously capricious, so we use the International Phonetic Alphabet. The full version of this can represent any speech sound in any language, and is capable of making very subtle distinctions, but we normally use a version which is specific to English.


All speech sounds are divided into vowels and consonants, but the distinction is not clear-cut, as you will hear later. In phonetics, a vowel is a sound produced with an open vocal tract which does not constrict the airflow. But we normally think of a vowel as the principal sound of a syllable.

One way to think of vowels is that they are phonemes which you can sing. They are formed by producing a sound in the vocal cords, and shaping the mouth cavity to make it resonate at certain frequencies. Unlike most musical instruments, the resonant frequencies are unrelated. They are known as formants. The lowest resonant frequency is the first formant, and there are several above it, but the first two are sufficient to identify the vowel. No two people have exactly the same formants, but our ears don’t usually register the small (and sometimes not so small) differences. The diagram shows the frequencies of the first two formants for a selection of English vowels, and also shows that we settle on the vowel in different ways depending on what comes before.
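The idea that the first two formants are enough to identify a vowel can be sketched as a nearest-match lookup. In this Python sketch the reference frequencies are round, illustrative figures for one imagined speaker, not measurements from the diagram, and real speakers differ from them. The names are hypothetical.

```python
import math

# Assumed (first formant, second formant) in Hz for three vowels.
REFERENCE_FORMANTS = {
    "i": (300, 2300),   # as in "beat"
    "ɑ": (700, 1100),   # an open back vowel
    "u": (300, 900),    # as in "boot"
}

def nearest_vowel(f1_hz, f2_hz, reference=REFERENCE_FORMANTS):
    """Return the reference vowel whose first two formants lie closest
    to the measured pair."""
    return min(reference,
               key=lambda vowel: math.dist((f1_hz, f2_hz), reference[vowel]))

# A measurement slightly off the reference still lands on the same vowel.
print(nearest_vowel(320, 2200))
```

This mirrors the way our ears ignore small differences between speakers: a pair of formants only has to be nearer to one vowel than to the others.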

When you form a vowel, you make the main resonance somewhere in your mouth from front /i/ to back /u/, and you position your tongue in a range from closed to open. All vowels are shown on a 2-dimensional vowel diagram which shows these distinctions.

Certain points on this diagram are known as “cardinal vowels”. This is a phonetic term, and does not imply that the vowels are in any way better than others. Most European languages use cardinal vowels, and so do many Scots, but they are pretty rare in England.

It is said that we have 5 letters to represent vowels because Greek and Latin had only 5 vowel sounds, and this is true today of Spanish and Modern Greek. English has many more. Scots generally have 12 or 13, whereas RP has 20 or 21. The main reason for this difference is that in Scottish speech, an r after a vowel is just that, whereas in RP it modifies the vowel. This diagram ◙ shows an incomplete set of English vowels.

I also have problems with the normal English distinction between short and long vowels. An English phonetician will tell you that /ɛ/ as in “bet” is a short vowel and /i/ as in “beat” is long. But this is not true of Scottish speech, where vowel length is used to distinguish words. For example, “insider” and “in cider” sound quite different to me, as do “wood” and “wooed”.

One important distinction is between monophthongs and diphthongs. In the former, the tongue stays in the same place throughout the vowel, as in /i/, whereas in the latter it moves, as in /ai/. Some vowels, such as /o/, are pronounced as monophthongs in Scotland and the North of England, but as diphthongs in the South of England.

I’m not going to work through all 20 or 21 RP vowels, partly because I can’t say all of them, but mainly because you would all be asleep by the end. The commonest vowel in English is, in fact, one which does not appear in most languages. It is known as schwa /ə/ and is the normal sound of an unstressed vowel. So “banana” has two schwas.

I’m just going to take three vowel sounds to show you how much variation can occur even within Britain.

My first vowel is /i/ as in “beat”. As this is both the furthest forward vowel and the most closed, you might think we would all pronounce it in the same way, but in fact there are variations. Some Scots pronounce it as the corresponding cardinal vowel, and RP is close to that. However, you can see from the diagram that some English speakers are actually quite far from the “pure” sound, and may even pronounce it as a diphthong.

Next I look at /o/ as in “boat”. Here we have a clear north-south divide. Again the Scottish vowel is cardinal, and the north of England is not far away. In the north of the country it is normally pronounced as a monophthong, while in the south it is a diphthong which may cover quite a wide range.

Finally, the /ɑʊ/ sound as in “bout”. This is normally a diphthong, which is why we spell it with two letters, moving from a sound like /ɑ/ to a sound like /ʊ/. However, you’ll see from the diagram that it is subject to enormous variation, and all you can say is that it gets closer. In some parts of the country it is a monophthong, so we can sometimes have a forecast of “sharry” rain.


Unlike vowels, consonants cannot comfortably, if at all, be sung. They are divided into a number of classes depending on how they are articulated, and then into two subclasses of voiced and unvoiced. I’ll deal with the latter distinction first. When you produce a consonant you may have your vocal cords vibrating, in which case it is voiced (like /z/ in “zing”), or silent, in which case it is unvoiced (like /s/ in “sing”). RP has 24 consonants, and Scottish speech has 26. I will go quickly through all of them.

Stop Consonants

◙ Stop consonants, usually called just stops, are formed by briefly stopping the flow of air through the vocal cords, closing up the vocal tract at some place (the point of articulation), then opening it suddenly and possibly starting the vocal cords vibrating. Because of the sudden opening, they are also known as plosives. Since a consonant is likely to be followed by a vowel, the cords will start vibrating at some time, and the distinction between a voiced and an unvoiced stop is the “voice onset time”.

All English stops come in pairs of unvoiced and voiced. Thus /t/ and /d/ are both produced by stopping the airflow with the tongue against the upper gum, then suddenly releasing it. Similarly /p/ and /b/ are produced by closing the lips, and /k/ and /g/ by closing the throat at the soft palate. In fact, if you say “get” and “gut” you should be able to feel that you close the throat in different positions. To us they sound the same, but in some languages they are different phonemes.

Fricative consonants

◙ Fricative consonants, usually called just fricatives, are formed in a quite different way. The vocal tract is almost closed at some point, and air is forced through the gap where it makes a hissing or similar sound. The vocal cords may or may not be vibrating at the same time.

Like the stops, fricatives come in pairs of unvoiced and voiced. These are /s/ and /z/, /ʃ/ and /ʒ/, /f/ and /v/, /θ/ and /ð/.

The sound /h/ is usually considered a fricative, but it is a bit different in that it is unclear where in the mouth it is formed, and it has no voiced equivalent.

The Scottish /x/, as in “loch”, is also a fricative, formed well back in the throat and unvoiced. Some languages have a voiced form. This book, being written by an Englishman, says /x/ is not a phoneme, but I can’t see many Scots agreeing with him.

Affricate Consonants

◙ We also have two consonants known as affricates. They are composite sounds, formed by a stop followed immediately by a fricative. They are /tʃ/ and /dʒ/. A foreigner learning English will initially hear each of these as two consonants, whereas we hear them as one.

Nasal Consonants

◙ Nasals are produced by putting the tongue in much the same positions as the stops, vibrating the vocal cords, and allowing the sound to come out through the nose instead of the mouth.

So if you put your mouth in the position for /p/, but let the sound come out through your nose, you get the sound /m/. Similarly corresponding to /t/ is /n/, and to /k/ is /ŋ/. Because of the articulator positions, we usually get certain combinations of nasal and stop. Thus we get /mp/ and /mb/, /nt/ and /nd/, and /ŋk/ and /ŋg/. Although we can do other combinations, we don’t always say what we think we do. Thus “tenpence” will often be pronounced as “tempence”.
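The “tenpence” example can be written as a simple rule: an /n/ takes on the mouth position of the stop which follows it. The Python sketch below is a deliberate simplification which handles only /n/. The transcription of “tenpence” and all the names are assumptions made for illustration.

```python
# Nasal made at the same place in the mouth as each stop.
MATCHING_NASAL = {"p": "m", "b": "m", "t": "n", "d": "n", "k": "ŋ", "g": "ŋ"}

def assimilate_n(phonemes):
    """Replace /n/ by the nasal which matches a following stop."""
    result = list(phonemes)
    for i in range(len(result) - 1):
        if result[i] == "n" and result[i + 1] in MATCHING_NASAL:
            result[i] = MATCHING_NASAL[result[i + 1]]
    return result

# Assumed transcription of "tenpence", one symbol per sound.
tenpence = ["t", "ɛ", "n", "p", "ə", "n", "s"]
print("".join(assimilate_n(tenpence)))
```

Only the first /n/ changes, because it is the one which stands directly before a stop; the second is followed by /s/ and is left alone.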


◙ A small number of consonants are known as continuants, because they don’t come into any other class, and because their sound can be prolonged. In fact, they are sometimes called semi-vowels.

/l/ is not a very straightforward consonant. I’ve already mentioned the light and dark versions, but the story is much more complex. To form the sound, you raise the centre of your tongue, vibrate your vocal cords, and let the sound come out on both sides of the tongue. However, you have a lot of freedom to choose your tongue position, so there are a lot of different /l/ sounds, mainly according to local accent. We are not attuned to the differences, and usually hear all of them as /l/. The sound is normally voiced, but after /p/ or /k/ it may be unvoiced. The Welsh LL is an unvoiced version of /l/, and although it occurs in passing in English speech, it is very difficult for us to make the sound out of its normal context, which is why we may put a /θ/ in front of it.

/r/ can be formed in lots of different ways. The common form in Scotland and some regional English accents is trilled, in which the tip of the tongue vibrates, but this is rare in RP. It can be a fricative, sometimes with the tip of the tongue rolled back, or what is known as a tap, in which the tongue briefly touches the roof of the mouth.

/j/ is normally regarded as a consonant, although it is really a reduced form of the vowel /i/.

/w/ is similarly a reduced form of /u/, but Scottish speech, and some English, also has an unvoiced form /ʍ/, so that “Wales” and “whales” are quite distinct. Again some English phoneticians tend to deny that it is a phoneme, even though it clearly distinguishes words.


So far I’ve covered the individual phonemes which make up English speech, but I haven’t tried to describe how we make utterances out of them. The first thing to note is that it takes time to move the tongue and other articulators, so there is a gradual change from one phoneme to the next. Sometimes this actually changes the sound of a phoneme, but we don’t notice because we are listening to the content.

We are accustomed to thinking in terms of words, but normal speech does not make word boundaries clear. The answers to “How did you do in your exams?” and “What was the weather like on your holiday?” could both sound like “4 grade As”, since “four grey days” is pronounced in exactly the same way (at least in my speech).

◙ We are all accustomed to the concept of syllables, but it is not an easy concept to define. A syllable normally consists of one vowel, preceded by zero to three consonants and followed by zero to four consonants. Thus “string” has 3 initial consonants, and “twelfths” has 4 final ones. But this definition is not without difficulties. If “little” has two syllables, what is the vowel in the second? Try “twin-kle twin-kle li-ttle li-ttle star”. In my speech “world” has 2, but in most English accents it has 1. Some people pronounce 2 syllables in “film”, others make it 1.
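The rule of thumb for a syllable, up to three consonants, then one vowel, then up to four consonants, can be tested mechanically. In this Python sketch the transcriptions and the tiny vowel set are assumptions made for illustration, and the names are hypothetical.

```python
import re

# Up to three consonants, exactly one vowel, up to four consonants.
SYLLABLE_SHAPE = re.compile(r"C{0,3}VC{0,4}")

# Assumed vowel symbols; a real list would be much longer.
VOWELS = {"ɪ", "ɛ"}

def shape(phonemes, vowels=VOWELS):
    """Reduce a transcription to a string of C (consonant) and V (vowel)."""
    return "".join("V" if p in vowels else "C" for p in phonemes)

def fits_template(phonemes, vowels=VOWELS):
    """True if the whole transcription matches the syllable rule."""
    return SYLLABLE_SHAPE.fullmatch(shape(phonemes, vowels)) is not None

string = ["s", "t", "r", "ɪ", "ŋ"]
twelfths = ["t", "w", "ɛ", "l", "f", "θ", "s"]
second_half_of_little = ["t", "l"]

for word in (string, twelfths, second_half_of_little):
    print(shape(word), fits_template(word))
```

“String” and “twelfths” sit at the limits of the rule and still fit. The second half of “little”, transcribed without a vowel, does not, which is exactly the difficulty with the definition.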

In addition, when a word of more than one syllable has several consonants together in the middle, we can’t define the syllable boundary solely on phonetic considerations — we need to use our understanding of the structure of the language. Think of the word “bandstand”. We know it is a compound, so we know the syllable break comes after “band”. But is the word “structure” broken as “struc-ture” or “struct-ure”? Consideration of the Latin root would suggest the latter, but we would probably be more comfortable with the former.

All of this matters because, even if we don’t know it, we have different versions of some consonants, particularly /s/ and /l/, depending on position in the syllable. I find it almost impossible to hear the differences in normal speech, but when I synthesised speech I could clearly hear when I was using the wrong version.


◙ In almost every word of more than one syllable, one will be more stressed than the other(s). Anyone who has been taught how to analyse the rhythms of poetry will know that there are stressed and unstressed syllables. But it is actually a little more complicated than that. If you say the word “photograph”, then the 1st syllable is stressed. In “photography”, the 2nd syllable is stressed and the others unstressed. But in “photographic”, not only is the stress moved to the 3rd syllable, but the 1st is given a little more stress than in “photography”. Three levels of stress are enough for all cases.

Anyone who has learnt Italian will have been taught that there are pretty strict stress rules which apply to most words. A word such as “pietà” which doesn’t conform is written with a stress mark. English, on the other hand, has what is known as “free stress”. There are actually some rules which we’ve absorbed without realising. For example, in any word ending in “-ation”, the stress goes on the “a”. However, in most cases stress seems arbitrary. We do use stress sometimes to distinguish nouns or adjectives from verbs. Thus we compact the soil, but a lady has a powder compact. We really should refer to a compact disc, but we possibly think of it as one word, and move the stress.

So how do we indicate stress in speech?

When you hear speech, you normally hear stressed syllables as louder than the others, but that is not the full story. There are actually four ways of indicating stress in English. The most important is a vowel change. If a syllable is totally unstressed, the vowel will usually, but not always, be reduced to a schwa or to /ɪ/. In addition, we can use syllable length (usually vowel length) and pitch (either low or high), or even loudness, to give prominence to stressed syllables.


English is what is known as a rhythmic language. I’m sure we can all remember analysing poetry in terms of feet, where each foot had a single stressed syllable and one or more unstressed ones. You may remember words like iamb, spondee, dactyl and so on. Although in normal speech we don’t consciously bother about feet, we do have a strong tendency to prefer equally spaced stressed syllables. Try saying “nine famous men”, and you should find that you shorten the stressed vowel in “famous” to make the word fit the rhythm. If you change it to “ninety famous men”, “ninety” should take the same time as “nine” did. Here is an example I found — you should find that the more unstressed syllables there are, the faster you speak.

Intonation or Prosody

◙ Intonation, or more strictly prosody, is a huge and complex subject, because we put into our utterances a lot of information about our state of mind and other non-verbal information. The term “prosody” covers what might be called the musical aspects of speech: stress, timing and pitch, as measured over utterances rather than words.

Frank Muir gave a couple of examples of what we can do with prosody. “What is this thing called, love?” and “What’s that in the road, a head?”.

From the point of view of a phonetician all that matters is the sound, and I’ll give a very brief description of that.

All normal utterances can be split into “tone groups”. Roughly speaking, if we were writing an utterance down, we would punctuate between tone groups. Every tone group must have a nucleus, which is the last stressed syllable plus all that comes after it. It may also have a head, consisting of all syllables from the first stressed one to the predecessor of the nucleus, and a pre-head of unstressed syllables at the start. Some writers regard any unstressed syllables after the nucleus as the tail, others as part of the nucleus. Each of these parts is subject to different rules.
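The division of a tone group into pre-head, head and nucleus follows directly from where the stressed syllables fall, so it can be sketched in a few lines of Python. The example sentence and its stress marking are assumptions made for illustration, and the sketch treats any tail as part of the nucleus.

```python
def split_tone_group(syllables):
    """Split a tone group into (pre_head, head, nucleus).

    syllables is a list of (text, is_stressed) pairs in spoken order.
    """
    stressed = [i for i, (_text, is_stressed) in enumerate(syllables)
                if is_stressed]
    if not stressed:
        raise ValueError("a tone group needs at least one stressed syllable")
    first, last = stressed[0], stressed[-1]
    texts = [text for text, _is_stressed in syllables]
    # Pre-head: unstressed syllables before the first stress.
    # Head: from the first stress up to, but not including, the last stress.
    # Nucleus: the last stressed syllable and everything after it.
    return texts[:first], texts[first:last], texts[last:]

# Hypothetical utterance with assumed stresses shown in capitals.
utterance = [("the", False), ("CAT", True), ("sat", False),
             ("on", False), ("the", False), ("MAT", True)]
print(split_tone_group(utterance))
```

When only one syllable is stressed, the head comes out empty and the tone group consists of a pre-head and a nucleus alone.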

The nucleus is the most important, and it normally includes a change of pitch. It may fall, rise, fall and rise, or rise and fall, and both falls and rises can be over a small range or a large one.

The head may remain on a single pitch, which can be high, low or middle, or it may gradually rise or gradually fall. The pre-head is usually entirely on a low pitch, but can be entirely on a high one.

The nucleus, being the part which we have given most prominence, is the part which carries most information.

Intonation and stress are shown in a diagram like this, with low pitch at the bottom and high at the top. Blobs represent vowels, and the bigger a blob the more stressed the vowel. Here are two more ◙ ◙.


I’ve given a quick overview of phonetics from the bottom up, starting with speech sounds, and then making them into utterances. As I said at the start, phonetics deals only with the sounds of speech, and I hope you now have a better understanding.