

04/15/01 22:44

#10829 - RE: speech with microcontrollers
Hi John,

Do you wish to achieve some sort of speech synthesis, or do you simply want to replay pre-recorded chunks of sound? Either way, with only 8K of program memory and 256 bytes of RAM, it's going to be difficult! Of course, you can add more memory to your application if chip-count and board space aren't too much of an issue. Because I don't know whether you want to create a speech synthesiser or just play back and record audio, I'll talk a bit about both:

Audio Playback/Record:

You may already know the basics of digital audio reproduction. If so, ignore this bit; I have written it for the benefit of any beginners out there who would like to experiment in this field.

Digital audio reproduction is actually quite easy to achieve. To record, you will need an A/D converter, which the microcontroller can read at a constant rate (the sample-rate). The audio signal is fed into the A/D converter, and the series of values produced are stored by the microcontroller into an area of RAM. To reproduce the sound, the microcontroller simply sends the stored values to a D/A converter in the same order and at the same rate as they were recorded. The D/A will reproduce the audio signal, which can then be fed to an amplifier, and then a speaker.
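The record/playback idea above can be sketched on a PC, assuming an 8-bit unsigned converter. The function names here are hypothetical; on real hardware they would be reads and writes to the actual A/D and D/A:

```c
#include <assert.h>
#include <math.h>

/* "Record": map an analogue voltage in -1.0..+1.0 to an 8-bit sample,
   as an 8-bit A/D converter would. Hypothetical host-side helper. */
unsigned char adc_sample(double v)
{
    if (v > 1.0) v = 1.0;     /* clip out-of-range input */
    if (v < -1.0) v = -1.0;
    return (unsigned char)((v + 1.0) * 127.5);
}

/* "Playback": map a stored sample back to a voltage, as a D/A would. */
double dac_output(unsigned char s)
{
    return s / 127.5 - 1.0;
}
```

A full record/playback pass through both functions loses at most one quantisation step (about 1/127.5 of full scale), which is the price of using an 8-bit converter.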

The human ear can detect sounds roughly between 20Hz and 22KHz in frequency. To build a system that can reproduce sounds across that range, you need a sample rate of at least twice the maximum frequency you want to capture (22KHz), so it must be 44KHz or higher (otherwise aliasing can occur, which produces fairly unpleasant noise in the recording; more on this in a bit!). A problem arises here, because it takes 44,000 samples (which, if you're using an 8-bit A/D converter, equates to just under 43K) to represent just one second of sound. Seeing as a standard-type 8052 can only address a maximum of 64K, and you may want to use some of that address space for other things, a sample rate as high as 44KHz is probably unacceptable.

The solution, of course, is to reduce the sample rate; that way it takes less memory to represent that one second of sound. So what's the catch? Reducing the sample rate reduces the maximum frequency you can record (remember, the sample rate must be at least twice the maximum frequency to avoid aliasing). Most of the time this isn't any great loss; all that happens is the recording doesn't sound as crisp, due to the lack of the high-frequency content present in the original sound. You can get away with sample rates as low as 6KHz (perhaps even lower) and still get a recording of reasonable clarity. It's worth remembering that the frequency content of human speech generally hovers around 1KHz, and with the exception of hissing sounds such as 'esss', we rarely exceed 3KHz.
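To put numbers on this, here is a quick host-side sketch of the memory arithmetic (one byte per sample with an 8-bit converter):

```c
#include <assert.h>

/* One byte per sample with an 8-bit converter, so bytes per second of
   audio equals the sample rate. This returns whole KiB, rounded down. */
unsigned long kib_per_second(unsigned long sample_rate_hz)
{
    return sample_rate_hz / 1024;
}

/* Whole seconds of 8-bit audio that fit in a given amount of memory. */
unsigned long seconds_that_fit(unsigned long memory_bytes,
                               unsigned long sample_rate_hz)
{
    return memory_bytes / sample_rate_hz;
}
```

At 44KHz, one second occupies 44,000 bytes (just under 43K), so barely one second fits in the 8052's entire 64K address space; at 6KHz, ten seconds fit.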

By now, you are perhaps asking yourself: what on earth is aliasing? I have found this to be a fairly good way of explaining it. We have probably all noticed that when you see the wheels of a car going round on television or in a film, they often appear to be going backwards. This is due to the cyclic behaviour of a wheel, i.e. it (or the pattern of the hubcap) repeatedly returns to its original position after a certain amount of time. Find out how many times this happens in a second, and you get a frequency. If this frequency is the same as the sample rate of the recording device (in this case, a camera), then the wheel or hub-cap pattern will be in the same position every time the camera takes a picture, producing a film of what appears to be a wheel at rest, even though it's turning! If the frequency of the wheel is slightly lower than the sample rate, each picture catches the wheel just short of a full turn, so it appears to be turning slowly backwards. This effect is called aliasing. It is generally accepted that if you use a sample rate at least twice the highest frequency you want to sample, aliasing can't occur.

When making a recording, measures need to be taken so that the audio signal doesn't contain any frequencies higher than half your sample rate. You can do this by passing the signal through a fairly steep low-pass filter with a cut-off frequency of sample-rate/2. This should be done even if you use a sample rate of 44KHz, because although you probably won't be able to hear anything above 22KHz, you will still hear the result of the aliasing it causes, just as the wheels on our car appear not merely to be going backwards, but to be doing so rather slowly.
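As a rough illustration of the idea only, here is a single-pole low-pass filter in C. A real anti-aliasing filter must be a much steeper analogue filter sitting in front of the A/D; this single pole just shows the pass-DC, attenuate-high-frequencies behaviour:

```c
#include <assert.h>
#include <math.h>

#define PI 3.14159265358979

/* One-pole low-pass filter state: output y and smoothing factor a. */
typedef struct { double y; double a; } lowpass_t;

/* a = dt / (RC + dt), with RC = 1 / (2*pi*fc) and dt = 1/fs. */
void lowpass_init(lowpass_t *lp, double fc_hz, double fs_hz)
{
    double rc = 1.0 / (2.0 * PI * fc_hz);
    double dt = 1.0 / fs_hz;
    lp->a = dt / (rc + dt);
    lp->y = 0.0;
}

/* Process one input sample, return one output sample. */
double lowpass_step(lowpass_t *lp, double x)
{
    lp->y += lp->a * (x - lp->y);
    return lp->y;
}
```

Fed a constant (DC) input, the output settles at that value; fed a tone alternating at the full sample rate's worth of speed, the output stays close to zero.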

It is often a good idea to place an identical low-pass filter on the output of the D/A converter, as this makes sure any frequency artefacts (noises that shouldn't be there) added by the sampling process (mainly time quantization) aren't heard in the final output.

You may only want your application to playback sound, in which case I am assuming you are acquiring the sound you are going to use with a PC. If this is so, you need to convert the sound files into an appropriate format, which can be included as constant data in your program before compiling it. This could be done in many different ways, so I won't bother talking about it here.

If you don't want too much processing time to be taken up when playing back a recording, you might consider using a chip such as one from ISD's ISD1400 series, which are standalone devices designed for recording and playing back audio signals. They can be controlled externally by a microcontroller, so they might be just what you're looking for!


Speech Synthesis:

If you wish to implement some form of speech synthesis into your code, then you are probably in for a bumpy ride! Speech-synthesis is an art in its own right, and is something people have been researching for a long time. As a result, many different methods have been suggested, some more effective than others. Of course there is a price to pay, in that the better the results, the more complicated the method!

Probably the easiest way to achieve speech synthesis is to store recordings of any words you're going to use in the microcontroller's memory (to be honest, I think you will need to store them in external memory, because 8K just isn't going to be enough). When you want your system to say something, string the appropriate recordings together to make a sentence. Of course, the more words you want in the system's 'vocabulary', the more memory you are going to have to use. Generally in embedded-applications, memory is something there isn't much of, so you are going to be fairly limited in what you can do with this method. Incidentally, this is the method used by most telephone companies when a voice at the other end tells you what number last rang etc.
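A minimal sketch of the word-concatenation idea, with tiny made-up sample buffers standing in for real recordings (all names and data here are hypothetical; in a real system each buffer would be thousands of samples in external memory, and each sample would go to the D/A at the sample rate instead of into a buffer):

```c
#include <assert.h>
#include <string.h>

/* A pre-recorded word: a buffer of 8-bit samples plus its length. */
typedef struct {
    const unsigned char *samples;
    unsigned int length;
} recording_t;

/* Tiny stand-in "recordings" for three words. */
static const unsigned char rec_you[]     = { 10, 20, 30 };
static const unsigned char rec_dialled[] = { 40, 50 };
static const unsigned char rec_five[]    = { 60, 70, 80, 90 };

static const recording_t vocabulary[] = {
    { rec_you, 3 }, { rec_dialled, 2 }, { rec_five, 4 },
};

/* Concatenate the recordings for a sequence of word indices into one
   output buffer; returns the number of samples written. */
unsigned int say(const unsigned char *words, unsigned int nwords,
                 unsigned char *out)
{
    unsigned int i, n = 0;
    for (i = 0; i < nwords; i++) {
        const recording_t *r = &vocabulary[words[i]];
        memcpy(out + n, r->samples, r->length);
        n += r->length;
    }
    return n;
}
```

The vocabulary table is the memory hog: every extra word costs a full recording, which is exactly why this method only suits small, fixed vocabularies.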

The second easiest method is actually surprisingly effective. Spoken English is made up of around 59 distinct sounds, much as written words are made up of 26 letters. These sounds are called allophones, and they can be strung together in various sequences to produce any word. All you need is a recording of all 59 allophones, which only amounts to about 7 seconds of sound stored in the system's memory, and then you have the ability to make it say anything you like!

There was once a chip made by GI in the 80s called the SPO256-AL2, which could produce all 59 allophones. You simply wrote a number to the device indicating which allophone you wanted, and it generated a PWM signal that, when filtered, amplified and output through a speaker, made the correct sound. Once the sound had finished, the device generated an interrupt to tell the controlling device it was ready for the next allophone. You could buy expansion cards for many of the computers around at that time which used the chip. I had one for my good old Amstrad CPC464, and remember having hours of fun playing with it... wonderful chip! Unfortunately, GI eventually discontinued it, so you can't get hold of it anymore. I wish someone like Holtek (who seem to like making novelty devices) would make something to a similar design; I'm sure there are many applications for such a chip these days... just imagine a talking toaster!

It is possible to produce the same sort of results as the SPO256 in software if you have recordings of the allophones it produced. I just so happen to have a full set of .WAV files covering all of them. I also have the SPO256 datasheet in Adobe Acrobat format, which has an excellent description of the way in which the allophones should be used. The only disadvantage of doing the whole thing in software (apart from the extra processing power used) is that the SPO256 provided a certain degree of smoothing between each allophone as it was played; simply playing each allophone in sequence in software can't provide this. Nevertheless, you still get good results. I have even had a lot of success just sticking the allophones together in various sequences using Sound Recorder in Windows!
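One simple way to approximate that smoothing in software (my own suggestion, not what the SPO256 actually did internally) is a short linear crossfade, overlapping the tail of one allophone with the head of the next:

```c
#include <assert.h>

/* Crossfade: blend a's last `overlap` samples with b's first `overlap`
   samples into out; returns samples written (na + nb - overlap).
   Assumes 8-bit unsigned samples and na, nb >= overlap. */
unsigned int crossfade(const unsigned char *a, unsigned int na,
                       const unsigned char *b, unsigned int nb,
                       unsigned int overlap, unsigned char *out)
{
    unsigned int i, n = 0;

    /* Untouched head of the first allophone. */
    for (i = 0; i < na - overlap; i++)
        out[n++] = a[i];

    /* Overlap region: fade a out while fading b in (fixed-point 0..256). */
    for (i = 0; i < overlap; i++) {
        unsigned int fade_in  = (i * 256) / overlap;
        unsigned int fade_out = 256 - fade_in;
        out[n++] = (unsigned char)
            ((a[na - overlap + i] * fade_out + b[i] * fade_in) / 256);
    }

    /* Untouched tail of the second allophone. */
    for (i = overlap; i < nb; i++)
        out[n++] = b[i];

    return n;
}
```

The overlap should only be a few milliseconds' worth of samples; any longer and the allophones start to blur into each other.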

If you (and anyone else for that matter) want the allophones and a copy of the datasheet, send me e-mail, and I will e-mail them back as soon as I can.

There are newer speech-synthesis techniques around these days, such as real-time Linear Predictive Coding (LPC), but this all seems to be rather complicated; I am still in the process of trying to understand how LPC works. To be honest, I don't think even the faster 8052s would be up to the task anyway!

I hope I have helped you in some way. Apologies for the length of the reply, people who have read my 'proper' replies in the past have probably noticed they tend to be a bit lengthy, but detailed I hope!

Any more questions, don't hesitate to ask,

Matt Bucknall.

