Chapter 10 · The Voice as a Signature · Activity 10.1

The Voice as a Signature

The same sentence, spoken three times: cheerful, tired, angry. The words stay the same — but pitch, loudness and tempo reveal what the words leave unsaid. You make that measurable.

Duration: 90 min · Difficulty: medium · Group: alone, bonus as a class · Fully digital · S

In a nutshell

What: You record the same short sentence in three moods and let Python compute the sound features: mean pitch, how much it varies, the loudness and the speaking tempo. You see in black and white that how something is said carries a message of its own.

Extension: Whisper turns the recording into text — and so separates the what from the how. Bonus: as a class you build a door opener that recognises from the voice who is speaking.

You need: a microphone and Python with librosa (sound analysis) and numpy. For the extension, openai-whisper.

What it's about

Say the sentence "I'm fine" in three ways: cheerful, tired, angry. The words stay the same — and yet everyone immediately understands three different things. What carries the difference is not the words but the sound: the pitch, the tempo, the little breaks and pauses. In this sound lies a second message, often more honest than the first.

And the sound becomes readable the same way as everything in this book: you turn it into an image. A spectrogram shows the frequencies of the voice over time — and at once the same image recognition applies that otherwise tells cats from dogs. In this activity, though, we deliberately stay with the transparent features: pitch, loudness, tempo. You choose the features yourself — the light, explainable end of the toolbox.

A little background

What we measure. Three simple, telling quantities: the base pitch (how high the voice sits on average) and how much it varies (monotone vs. melodic); the loudness and its variation (energetic vs. flat); and the speaking tempo, roughly as the share of the recording in which the voice actually sounds. From these few numbers comes the emotional fingerprint of a sentence.
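To make the loudness and tempo measures concrete, here is a minimal sketch in plain numpy on a synthetic signal. The helper name `frame_rms`, the frame sizes and the silence threshold are assumptions chosen for illustration; the activity's own script relies on librosa's functions instead.

```python
import numpy as np

def frame_rms(y, frame=2048, hop=512):
    """Root-mean-square energy of each frame (illustrative helper)."""
    starts = range(0, len(y) - frame + 1, hop)
    return np.array([np.sqrt(np.mean(y[s:s + frame] ** 2)) for s in starts])

# Synthetic test signal (not a real voice): 1 s of a 200 Hz tone, then 1 s of silence.
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
y = np.concatenate([tone, np.zeros(sr)])

rms = frame_rms(y)
loudness = float(np.mean(rms))              # mean energy: energetic vs. flat
threshold = 0.1 * rms.max()                 # assumed cut-off between sound and pause
sounding_share = float(np.mean(rms > threshold))

print(f"loudness {loudness:.3f}   sounding share {sounding_share:.2f}")
```

Because half of the signal is silent, roughly half of the frames count as sounding. A real recording with many pauses would show a lower share than one spoken quickly and without breaks.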

As unique as a fingerprint. The voice carries two things at once: what you say and who you are. Its fine peculiarities allow speakers to be recognised, which is how a phone can learn to respond to its owner's voice. The bonus at the end uses exactly this uniqueness.

The honest boundary: faked voices

Here the book's guiding idea begins to wobble. Honest signals counted as honest because you cannot fake them. With the voice that is no longer quite true: from a few seconds of recording, today's models produce a deceptively real copy of your voice, saying sentences you never spoke. A signal is only as honest as it is hard to fake — and technology can shift that threshold. Whoever reads voices must from now on also ask whether a voice is even real. That is why no serious security should rest on the voice alone — not even your bonus door opener.

Recording

Install the packages. pip install librosa numpy soundfile. (For the extension also pip install openai-whisper.)
Speak the same sentence three times. Choose a short, neutral sentence — say "I'm fine" or "Today is Monday". Record it three times: once cheerful, once tired, once angry. Same text, same distance from the microphone.
Save as WAV. happy.wav, tired.wav, angry.wav. Quiet room, no echo, close enough to the microphone.
Analyse. Run the script below over all three and compare the numbers.
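Before analysing, it can help to check that the three recordings really are comparable. The following is a small sketch using only Python's standard `wave` module; the helper names `wav_info` and `report` are made up for this example, and it assumes the files are uncompressed PCM WAV.

```python
import wave

def wav_info(path):
    """Return (duration in seconds, sample rate, channels) of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return frames / rate, rate, w.getnchannels()

def report(names=("happy", "tired", "angry")):
    """Print one line per recording; unreadable files are reported, not fatal."""
    for name in names:
        try:
            duration, rate, channels = wav_info(f"{name}.wav")
            print(f"{name:6s} {duration:5.2f} s  {rate} Hz  {channels} channel(s)")
        except (FileNotFoundError, wave.Error) as err:
            print(f"{name:6s} could not be read: {err}")
```

Call `report()` in the folder that holds the recordings. If the durations differ wildly or one file cannot be read, it is easier to record again now than to puzzle over odd numbers later.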

Analysing with librosa

About twenty lines. The code pulls the four features from each recording and lines them up. Full code on GitHub.

import librosa
import numpy as np

def features(file):
    y, sr = librosa.load(file)                   # load audio
    duration = librosa.get_duration(y=y, sr=sr)

    # pitch (fundamental frequency) only where the voice really sounds
    f0, voiced, _ = librosa.pyin(
        y, fmin=70, fmax=400, sr=sr)
    pitch  = np.nanmean(f0)                       # mean pitch in Hz
    melody = np.nanstd(f0)                        # variation = melody

    # loudness (energy) over time
    energy = librosa.feature.rms(y=y)[0]
    loud   = float(np.mean(energy))

    # speaking tempo, roughly: share of voiced frames among all frames
    voiced_share = float(np.mean(voiced))

    return dict(duration=duration, pitch=pitch, melody=melody,
                loud=loud, voiced_share=voiced_share)

for name in ["happy", "tired", "angry"]:
    m = features(f"{name}.wav")
    print(f"{name:6s}  pitch {m['pitch']:5.0f} Hz   "
          f"melody {m['melody']:5.0f}   loud {m['loud']:.3f}   "
          f"voiced {m['voiced_share']:.2f}")

What you should see

Usually a clear pattern emerges: cheerful — higher pitch, lots of melody (large variation), energetic; tired — lower, monotone (little melody), quiet, slow with many pauses; angry — loud, often lower and pressed, a hard tempo. The words were identical — the numbers are not. That is exactly the second message of the voice.
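One possible way to compare two recordings is the relative gap per feature, since pitch in Hz and loudness as a small fraction cannot be compared directly. The sketch below assumes dictionaries with the same keys as the script's `features` function returns; the helper names and all numbers are invented placeholders, not measurements.

```python
def relative_gap(a, b):
    """Difference between two values as a fraction of their mean magnitude."""
    mean = (abs(a) + abs(b)) / 2
    return abs(a - b) / mean if mean else 0.0

def best_separator(m1, m2, keys=("pitch", "melody", "loud", "voiced_share")):
    """Name of the feature with the largest relative gap, plus all gaps."""
    gaps = {k: relative_gap(m1[k], m2[k]) for k in keys}
    return max(gaps, key=gaps.get), gaps

# Placeholder numbers, invented for illustration only; use your own results.
happy = dict(pitch=220.0, melody=40.0, loud=0.08, voiced_share=0.60)
tired = dict(pitch=180.0, melody=10.0, loud=0.03, voiced_share=0.40)

name, gaps = best_separator(happy, tired)
print("clearest difference:", name)
```

With your own measurements the winner may well be a different feature; that is part of what the worksheet asks you to find out.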

Seeing the spectrogram

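If you want to see the sound as an image, draw the spectrogram of a loaded signal y with librosa.display.specshow(librosa.amplitude_to_db(np.abs(librosa.stft(y)))). The plot needs matplotlib (pip install matplotlib), and older librosa versions may also need an explicit import librosa.display. It is the same depiction from which a machine later reads a cat's meow or the click of a thirsty tomato: voice, animal and plant, one format.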
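What happens inside a spectrogram can be sketched in a few lines of plain numpy: cut the signal into short windows, take the Fourier transform of each, and stack the results. This is a simplified illustration with an assumed helper name `spectrogram_db` and assumed window sizes; librosa's own `stft` differs in details such as padding and default sizes.

```python
import numpy as np

def spectrogram_db(y, n_fft=1024, hop=256):
    """Magnitude spectrogram in dB: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    starts = range(0, len(y) - n_fft + 1, hop)
    frames = np.stack([y[s:s + n_fft] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20 * np.log10(np.maximum(magnitude, 1e-10))

# Synthetic signal (not a voice): one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

S = spectrogram_db(y)
freqs = np.fft.rfftfreq(1024, d=1 / sr)     # frequency in Hz of each row
peak_hz = freqs[S.mean(axis=1).argmax()]
print(S.shape, f"strongest frequency near {peak_hz:.0f} Hz")
```

The result is an ordinary two-dimensional array, which is exactly why image methods can work on sound: each column is a moment in time, each row a frequency.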

Worksheet

What the sound reveals

Enter pitch, melody, loudness and voiced share for all three recordings. Which feature separates "cheerful" and "tired" most clearly?
The words were identical in all three recordings. Why does a person still immediately understand three different things? What carries the meaning?
Explain in one sentence how the sound becomes an image (spectrogram) — and why that is so practical.
The voice carries two things at once. Name both — and one example each of what you use one and the other for.
Why is the voice the most striking example of the "honest boundary"? What makes it newly fakeable, and what consequence does that carry?

Solutions

1. Individual; often the melody (variation of pitch) and the loudness separate most clearly: cheerful is melodic and energetic, tired monotone and quiet.

2. Because the meaning lies not only in the words but in the sound above them — pitch, tempo, pauses, breaks. This sound is a second message, often more honest than the words themselves.

3. You lay short time windows over the signal and compute for each which frequencies are in it (Fourier); plotted over time, that makes an image. It's practical because then the same image recognition applies that also reads photos — you need no separate "sound AI".

4. First what you say (the content — e.g. for transcription with Whisper) and second who you are (the individual sound signature — e.g. so the phone recognises its owner).

5. Because honest signals are only honest as long as they are hard to fake — and modern models produce a deceptively real voice copy from a few seconds. Consequence: whoever reads voices must in future also check whether a voice is real; security must never rest on the voice alone.

When you get stuck

Problem	Likely cause & fix
`librosa.load` can't find the file	Wrong path or format. Save as WAV; start the script in the same folder or give the full path.
Pitch is `nan`	Too quiet or too noisy, so `pyin` finds no voice. Move closer to the microphone, record in a quiet room, and adjust `fmin`/`fmax` to your vocal range.
All three recordings look the same	Spoken too tamely. Exaggerate the moods clearly — the contrast should be visible.
Very slow on long files	`pyin` is compute-heavy. Short sentences (2–4 s) are enough; optionally load audio at 16 kHz (`librosa.load(file, sr=16000)`).
Whisper complains about `ffmpeg`	Whisper needs `ffmpeg`. Install it via the system package manager (e.g. `apt install ffmpeg` / `brew install ffmpeg`).

Extension — separating the what from the how

So far you have measured the sound. Whisper, a large pre-trained speech model, reads the content — it turns the recording into text. So you hold both messages separately in your hands:

import whisper

model = whisper.load_model("base")           # small, runs without a graphics card
result = model.transcribe("angry.wav", language="en")
print("Said (the WHAT):", result["text"])
# The HOW you already measured above - pitch, melody, loudness.

The point: in all three recordings Whisper should deliver the same text (small differences in punctuation aside), while your sound features show three different moods. What was said is the same; how it was said is not.
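To check whether three transcripts really say the same thing, it helps to ignore capitalisation and punctuation first. The sketch below uses only the standard library; the helper names `normalise` and `same_words` are made up, and the example strings are hypothetical, not real Whisper output.

```python
import string

def normalise(text):
    """Lower-case, strip punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())

def same_words(transcripts):
    """True if all transcripts contain the same words after normalising."""
    return len({normalise(t) for t in transcripts}) == 1

# Hypothetical transcripts for illustration, not real Whisper output.
examples = ["Today is Monday.", "today is Monday", " Today is Monday!"]
print(same_words(examples))
```

In practice you would pass in the `result["text"]` values from the three recordings. If the check fails, look at the transcripts side by side: a strongly exaggerated mood can occasionally make a word harder to recognise.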

Bonus — "Open sesame": the voice key

As a class you build a door opener from voice. Everyone records the same sentence several times; a small model learns from the sound features who is speaking — not what is said. Speak the sentence, and the model guesses the name.

Collect. Each person records the same sentence 5 times (so e.g. 5 people × 5 = 25 recordings). Put the speaker's name in each filename.
Pull features. For each recording, compute the same sound features as above, plus optionally the MFCCs, a compact sound fingerprint that librosa provides as librosa.feature.mfcc.
Learn. A simple classifier (e.g. sklearn's KNeighborsClassifier) learns features → name. Hold back some recordings for testing.
Test. Speak a new recording — does the model guess the speaker? And the honest test: can someone with a similar voice (or a voice copy!) fool the key?
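The learning step can be sketched without any extra library. What follows is a hand-rolled nearest-neighbour vote in plain numpy, standing in for sklearn's KNeighborsClassifier rather than reproducing it. The speaker names, the feature values and the helper names are all made up for illustration.

```python
import numpy as np

def standardise(train, test):
    """Scale every feature to zero mean and unit spread, using training data only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero
    return (train - mean) / std, (test - mean) / std

def predict(train_x, train_names, test_x, k=3):
    """Majority vote among the k nearest training recordings."""
    guesses = []
    for row in test_x:
        dist = np.linalg.norm(train_x - row, axis=1)
        nearest = [train_names[i] for i in np.argsort(dist)[:k]]
        guesses.append(max(set(nearest), key=nearest.count))
    return guesses

# Made-up feature rows (pitch in Hz, melody, loudness), purely for illustration.
rng = np.random.default_rng(0)
centres = {"Ada": [210.0, 35.0, 0.08], "Ben": [120.0, 15.0, 0.05]}
train_x = np.array([c + rng.normal(0, [5, 2, 0.005])
                    for c in centres.values() for _ in range(4)])
train_names = [n for n in centres for _ in range(4)]
test_x = np.array([[205.0, 33.0, 0.078], [125.0, 16.0, 0.052]])   # held-back recordings

train_s, test_s = standardise(train_x, test_x)
print(predict(train_s, train_names, test_s))
```

The scaling step matters: without it, pitch in Hz would dominate the distance and loudness would barely count. Two invented speakers this far apart are easy to tell apart; real classmates with similar voices are the interesting case.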

The lesson of the bonus

The voice key works astonishingly well — and yet can be fooled. That is exactly the point: a good copy could open it. The bonus is a game, not a security system. Whoever protects their phone or their door with the voice alone has not understood the honest boundary.

Food for thought

The voice often reveals more than the words — and this second message is usually more honest, because it is harder to steer. Yet the same technology that makes it readable can also fake it. That is the heart of Chapter 10: honesty is not a fixed property of a signal but depends on how hard it is to fake — and that threshold is shifting.
A measured sound feature describes a state, not a cause: "pressed, loud voice" does not necessarily mean "angry at you". The same caution applies as with the face: expression is not feeling.
Because the voice is as unique as a fingerprint, it is useful and delicate at once. Who may record, store, reproduce your voice? This question grows more pressing with every better voice copy.