
DAT264x: Identifying Accents in Spectrograms of Speech
Hosted By Microsoft


Identifying Accents from Spectrograms of Speech Samples

Voice recognition software enables our devices to be responsive to our speech. We see it in our phones, cars, and home appliances.

According to a study commissioned by the Washington Post:

Amazon’s Alexa and Google’s Assistant are spearheading a voice-activated revolution, rapidly changing the way millions of people around the world learn new things and plan their lives. But for people with accents — even the regional lilts, dialects and drawls native to various parts of the United States — the artificially intelligent speakers can seem very different: inattentive, unresponsive, even isolating. For many across the country, the wave of the future has a bias problem, and it’s leaving them behind.

The researchers found that smart speakers made by Google and Amazon made 30 percent more errors when parsing the speech of non-native speakers than when parsing the speech of native speakers. Other research has shown that voice recognition software often works better for men than for women.

Algorithmic biases often stem from the datasets on which the algorithms are trained. One way to improve non-native speakers' experiences with voice recognition software is to train the algorithms on a diverse set of speech samples. Detecting the accents in existing speech samples can help with building these training datasets, which is an important step toward closing the "accent gap" and eliminating biases in voice recognition software.

In this challenge, you will use standard AI tools to identify three different accents in speech samples.


This dataset contains speech samples, with a sampling rate of 22,050 Hz, from three different types of accents. Each sample was then transformed into a spectrogram.

A spectrogram is a visual representation of the various frequencies of sound as they vary with time. The x-axis represents time (in seconds), and the y-axis represents frequency (measured in Hz). The colors indicate the amplitude of a particular frequency at a particular time (i.e., how loud it is). We're measuring amplitude in decibels. So in the example spectrogram below, lower frequencies are louder than higher frequencies.
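As a brief aside on the decibel scale: the sketch below is an illustration added here, not part of the dataset pipeline. It converts a few amplitudes to decibels relative to a reference amplitude of 1.0. The helper name `amplitude_to_db` and the reference value are assumptions chosen for the example.

```python
import numpy as np

def amplitude_to_db(amplitude, ref=1.0):
    # Decibels compare power (amplitude squared) with a reference on a log scale.
    # `ref` is a hypothetical reference amplitude chosen for this illustration.
    power = np.square(np.asarray(amplitude, dtype=float))
    return 10.0 * np.log10(power / ref ** 2)

# Full, half, and one-tenth of the reference amplitude.
levels = amplitude_to_db([1.0, 0.5, 0.1])
print(np.round(levels, 2))
```

Halving the amplitude lowers the level by about 6 dB, and reducing it tenfold lowers the level by 20 dB, which is why a decibel scale can show both loud and quiet frequencies in the same image.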


Spectrograms were created using librosa, a Python package for music and audio analysis. The code to generate a spectrogram looks like this:

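import librosa
import matplotlib.pyplot as plt

# obs: 1-D audio time series sampled at 22,050 Hz; file_path: where to save the image
S = librosa.feature.melspectrogram(y=obs, sr=22050)
spectrogram = librosa.power_to_db(S)
plt.imsave(file_path, arr=spectrogram, cmap='gray', origin='lower')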

Under the hood, this process:

  • Takes the Fourier transform of successive windowed excerpts of the raw signal, in order to decompose the signal into its constituent frequencies.
  • Maps the powers of the spectrum onto the mel scale. The mel scale is a perceptual scale on which pitches are spaced so that the human ear judges them to be equally distant from one another.
  • Takes the log of the power (amplitude squared) at each of the mel frequencies to convert to decibel units.
  • Plots and saves the resulting image.
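
The first three steps above can be sketched in plain NumPy. This is a simplified illustration, not librosa's implementation: it assumes the HTK-style mel formula (librosa defaults to a different variant), skips the frame padding and filter normalization that librosa applies, and uses a synthetic 440 Hz tone in place of a real speech sample. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; librosa's default differs).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def power_spectrogram(y, n_fft=2048, hop=512):
    # Step 1: Fourier transform of overlapping, Hann-windowed frames.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power = squared magnitude; result has shape (n_fft // 2 + 1, n_frames).
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

def mel_filterbank(sr, n_fft, n_mels=128):
    # Step 2: triangular filters with centers evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def to_db(power, floor=1e-10):
    # Step 3: log of the power, expressed in decibels (floored to avoid log(0)).
    return 10.0 * np.log10(np.maximum(power, floor))

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)          # one second of a 440 Hz tone

power = power_spectrogram(y)                # (frequency bins, frames)
mel_db = to_db(mel_filterbank(sr, 2048) @ power)   # (mel bands, frames)

# The loudest mel band should sit near the tone's frequency.
band = int(np.argmax(mel_db.mean(axis=1)))
print(mel_db.shape, band)
```

Each row of `mel_db` is one mel band and each column is one time frame, which matches the layout of the saved images: frequency on the vertical axis and time on the horizontal axis.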

There is a lot of useful information encoded in these spectrograms. Now it's time to use your deep learning skills to parse out which patterns correspond to which types of accents.