Synthesized Speech using Syllable Concatenation

The middle panel is a concatenator speech panel.

The right side of each panel has a Radio Shack speaker with an internal audio amp. On the back is an RCA phono plug for external amps. External amps are employed at large sites. The sound may go not only to different rooms, but also to different buildings and outside yards.

The speech is not transferred to remote facilities directly. Instead, event data is sent to the remote site, where it is applied to another speech panel that recreates the speech.

There are no internal parts that cannot be purchased at Radio Shack. The front knob puts the board into record mode, for automatic setup and EPROM recording.

Speech Concatenator card

There are many wiring corrections on the underside of these boards. I made a lot of mistakes in the manufacture of these boards, but I made each one work by cutting traces and adding wire. I got too excited with a wire-wrap version, and I needed to learn to slow down.

The speech board is an external board, fed by the parallel port of a computer. A special parallel cable is needed, as the speech board requires 11 parallel data channels instead of the normal 8 for printers. As one might guess, the speech board is fast. Four boards were produced, and by 1995 all were in use.

In 1993 one could not wait for the development of speech chips. I wanted speech, and I wanted it now. I needed a simple approach that was obtainable and fast. The result is the syllable concatenator board.

Speech Concatenation:
The speech board uses four EPROM chips (ISD series) to record phonemes, syllables or words. The syllables can then be concatenated (combined) to form probably all human words, and consequently sentences. The board does not use an internal processor for the complicated gymnastics, but rather a simple FIFO shift register on the input, which keeps costs down. Unlike other cards, this board does not sound like a computer; it sounds much more "human like".

Other sound cards use valuable computer resources in the form of memory or tasks. Unlike all other sound boards on the market, this board was specifically designed for auxiliary support in high speed machine control. Responsive control demands that no task hold or delay the program from its primary function: to control and monitor equipment. Talking to humans is secondary. Even worse, taking time out to talk is disgusting. The computer tells the speech board what to say, and from there on the computer gives no more concern to the progress of the speech. The speech board is truly independent.
The speech board buffers can only hold about 10 sentences at a time. If many speakable events happen at the same time, the buffers are overrun and the speech board forgets the early sentences. Of course, humans are more important than machines. But in the direct process of control they are not. My speech boards give a kind of divine sanctity to the machine control task at hand. I never imagined that speech would have such a high profile. It really breathes life into the project.
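The overrun behavior described above can be pictured with a small model. This is only a sketch in Python, not the board's actual FIFO hardware: the class name `SpeechQueue` and its methods are hypothetical, and the capacity of 10 is taken from the "about 10 sentences" figure.

```python
from collections import deque

# Assumption: capacity of roughly 10 sentences, per the text above.
SENTENCE_CAPACITY = 10


class SpeechQueue:
    """Hypothetical model of the speech board's sentence buffer."""

    def __init__(self, capacity=SENTENCE_CAPACITY):
        # A deque with maxlen silently discards the OLDEST entry on overflow,
        # which mirrors "the speech board forgets the early sentences".
        self._pending = deque(maxlen=capacity)

    def say(self, sentence):
        """Called by the control program; returns at once, never blocks."""
        self._pending.append(sentence)

    def next_sentence(self):
        """Called by the board side when it is ready to speak again."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)
```

The point of the design is that `say` never waits: the control program hands over the sentence and goes straight back to controlling equipment, at the price of losing old sentences during a burst of events.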
Pinout ISD Chip

For general applications, the speech can acquire any voice that can be recorded as syllables. The syllables are then combined to form any word, and the words are combined to form sentences. The voice can further be deepened and slowed down, or raised in pitch and sped up, by a pot adjustment.

The recording process in the past required additional equipment, such as the DIG88700 board for very short duration switching. But now all syllable recording can be done from the keyboard with either a microphone, or a second sound card for automated recording.
Here is an automated recorder using a PIC that had timing problems.

They cost about $14 each. The cost for an entire board is about $100.

As of 2006, there are only two boards still in operation.


Removed MOVIE...
Scenes concerning...

Removed MOVIE ...
"I see... Water well pump is stopped"

Example:
Syllable Concatenator talking on telephone lines,
after dialing.
As heard at the receiving end.

All of the above boards used a parallel port from a computer. They worked, but they only fulfilled an immediate need while the microprocessor version using a PIC was developed...


Speech Concatenation using ISD2560 and a microprocessor. This approach is truly independent of a computer.
SpeechMicro.gif, 10 kB
Microprocessor Control
The ISD2560 uses 600 internal addresses (100mS each segment) on address pins A0 through A8 (512) and also A9. A8 and A9 are Mode pins. Each address on the ISD2560 has a resolution of 0.1 seconds. I have determined from the earlier design that I need at least 300mS per syllable. By dropping the two least significant bits (A0 and A1), one has a longer resolution of 400mS. I treat each 400mS as one address, a whole chunk. Be aware that, internally, the chip still recognises its own divisions of 100mS each. So if I stop a recording early, before my unit of 400mS is up, the chip will terminate at its own division at the end of cell one, cell two, or cell three. Early is OK; late is not. One address byte (8 lines) can now accommodate the address bits A2, A3, A4, A5, A6, A7, A8 with one bit to spare. This arrangement gives an ISD2560 chip up to 128 syllables. The actual number is less, because some long phrases span many spaces.
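The address arithmetic above can be written out as a short sketch. It is in Python purely for illustration (the real work is done by the PIC), and the function names are hypothetical. It assumes, as described, that A0 and A1 are held at zero and that seven bits (A2 through A8) select the 400mS chunk.

```python
CELL_MS = 100                          # native ISD2560 cell length, from the text
CELLS_PER_CHUNK = 4                    # A0 and A1 dropped, so 4 cells per chunk
CHUNK_MS = CELL_MS * CELLS_PER_CHUNK   # 400 ms per chunk
MAX_CHUNKS = 128                       # 7 address bits, A2 through A8


def chunk_to_address(chunk_index):
    """Return the ISD cell address (in 100 ms cells) for a 400 ms chunk."""
    if not 0 <= chunk_index < MAX_CHUNKS:
        raise ValueError("chunk index must fit in 7 bits (0..127)")
    # Shifting left by two places leaves A0 and A1 at zero.
    return chunk_index << 2


def chunk_start_ms(chunk_index):
    """Return where the chunk begins in the chip's recording, in ms."""
    return chunk_to_address(chunk_index) * CELL_MS
```

For example, chunk 3 maps to cell address 12, which starts 1200mS into the chip.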

Each board has four chips. But there are provisions for many layers of "4chips", because I allocated two bytes for each address in anticipation of any future developments. For now, four ISD2560's have been plenty.
I should mention that the address lines are not compatible with microprocessor control. I believe this fact is mentioned in the ISD documentation. It does not apply to PushButton mode, but does apply to Direct Address Mode.

First of all, let me define the subtle problem, as I ran into it with the first design using shift registers. The ISD data lines, as well as the P/R line, are sensitive to noise, line reflections, and picosecond pulses. If you experience occasional wrong addresses, or possibly no Write capability, then this is the problem. My first design produced unexpected syllables that made no sense, for no apparent reason.

Microprocessors, including the PIC, use "read before write", which may open an output port pin to high impedance for a short time. For years I have been controlling equipment with slow relays, where this phenomenon is of no concern. A microsecond of "open" will not affect a relay or any other mechanical device in any way. However, the ISD device is different. One thing you can do is feed the address bus and the P/R pin from a different port than the one that feeds the CE pin. For example, a good scheme would be to feed the address bus (A1 to A9) from PORTC, and feed CE from PORTA. Confusion of addresses and the P/R state can exist during setup, but after a delay for settling down comes the critical CE latch pulse or level. During the CE latch NO further corruption will be tolerated.

CodeDelay.gif, 12 kB
You can use the same microprocessor port:
Place a capacitor on the CE line. You slow the rise-and-fall time, and you remove and delay the actual CE pulse away from the busy activity of the bus. However, using this method demands an added delay in software to guarantee the CE pulse reaches zero volts. The CE line requires a greater time to reach ground, but it is smooth and clean. You want to stand back and look at the aftermath, NOT be part of it! I love this approach. It is bulletproof.
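The ordering that matters here (address first, settle, then CE, then extra time for the slowed edge) can be sketched as follows. This is a Python model rather than PIC code: `FakePort` and `play_chunk` are hypothetical names, and the two delay values are placeholders, not measured figures.

```python
import time


class FakePort:
    """Stand-in for a microcontroller output port; records every write."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def write(self, value):
        self.log.append((self.name, value))


def play_chunk(address_port, ce_port, address,
               settle_s=0.001, ce_low_s=0.005, sleep=time.sleep):
    """Present the address, let the bus settle, then pulse CE low.

    settle_s and ce_low_s are assumed values; ce_low_s stands for the
    added software delay that lets a capacitor-slowed CE reach zero volts.
    """
    address_port.write(address)   # bus may be noisy during this write
    sleep(settle_s)               # wait until the address lines are quiet
    ce_port.write(0)              # CE goes low only after the bus is stable
    sleep(ce_low_s)               # give the slow edge time to reach ground
    ce_port.write(1)              # release CE; no bus activity in between
```

Whether CE comes from its own port or from the same port with a capacitor, the idea is the same: nothing touches the address lines between the settling delay and the end of the CE pulse.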
SoundForge1.gif, 7 kB
Allocate 380mS per syllable. Here you see a syllable starting at about 90mS and going to 300mS. This syllable, I am sure, came from the old parallel speaker, which used 300mS. This is NOT ideal for this speaker. Instead, make all syllables run out to almost 400mS.

Adjust amplitude to about 3/4 scale, or at least keep the peaks from flat-topping.

CodeList.gif, 5 kB
Make a list of all syllables with their durations.
By using a list of the Wav files, a computer automates all recording to the ISD2560 Chips.
No need to worry about timing or levels. Just push a button on the mouse...

Shown are several blank cells in the beginning of this ISD chip denoted by "Null----.wav". It seems that I never got around to assigning any sounds there.

The number of each cell is the same as the Index. This makes it easy. So a drop down box shows all cells even if the word or syllable spans several cells. Actually, each index number already spans four cells. I have some sounds that span 10 index numbers, which is a duration of 40 cells, or 40 times 100mS, which is 4 seconds. One of the ISD2560 chips has a buzzer type of sound that is used on a lot of communication as a preamble (or attention getting sound) just before a sentence is iterated. An operator knows that the Control Network is about to speak. And it costs no overhead, as the PIC in the Concatenator just "automatically" inserts it.
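The bookkeeping behind such a list can be sketched in a few lines. This is an assumed reconstruction in Python, not the actual recorder program: the function `allocate` and the file names other than "Null----.wav" are made up for illustration. Each index number is taken to be one 400mS chunk (four 100mS cells), and a longer sound simply takes as many index numbers as it needs.

```python
import math

CHUNK_MS = 400  # one index number = four 100 ms cells, per the text


def allocate(sounds):
    """Assign a starting index to each (filename, duration_ms) pair.

    Returns a list of (start_index, span, filename) tuples, where span
    is the number of index numbers the sound occupies.
    """
    table = []
    index = 0
    for name, duration_ms in sounds:
        # Round up so a sound never spills into its neighbor's chunk.
        span = max(1, math.ceil(duration_ms / CHUNK_MS))
        table.append((index, span, name))
        index += span
    return table


# Hypothetical list: a blank cell, a 4 second preamble, one syllable.
listing = allocate([
    ("Null----.wav", 380),
    ("buzzer.wav", 4000),
    ("pump.wav", 380),
])
```

With this list the 4 second sound takes index numbers 1 through 10, and the next syllable lands on index 11.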
AudTime.gif, 10 kB
...To create an audio test signal.
The first test signal that I want to show you is to measure and adjust timing of the ISD2560.

It starts out as "art" really in one of many sound programs that you can buy. You adjust and place your sounds, in this case tones, in the program just as if you were creating a picture. I specify that I am starting out with a 2kHz tone and "ramp" it down in 50mS. I then place a marker pulse and a low level 10% tone for 50mS. That uses up the first 100mS cell. I then place 50mS of silence and 100ms of tone, and another silence. That is two more cells. And finish up with the last cell using ramps that I can identify on a scope. It is absolutely critical that I easily see the beginning and ending sections of this audio.

Timing Wave Art to generate audio out of Computer Audio Card.
I think I was using Sound Forge for the art.
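The same kind of test signal can also be built in code instead of being drawn in a sound editor. The sketch below is an approximation in Python under several assumptions that are not in the original: a 44.1kHz sample rate, a one-sample marker pulse, a 50mS second silence, and an up-then-down ramp shape for the last cell. The helper names are hypothetical.

```python
import math

SAMPLE_RATE = 44100  # assumption: a common sound card rate


def tone(freq_hz, ms, start_amp=1.0, end_amp=None):
    """Sine tone whose amplitude ramps linearly from start_amp to end_amp."""
    if end_amp is None:
        end_amp = start_amp
    n = SAMPLE_RATE * ms // 1000
    samples = []
    for i in range(n):
        amp = start_amp + (end_amp - start_amp) * i / n
        samples.append(amp * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
    return samples


def silence(ms):
    return [0.0] * (SAMPLE_RATE * ms // 1000)


def timing_test_signal():
    """Build the 400 ms timing signal, one 100 ms cell at a time."""
    low = tone(2000, 50, 0.1)          # low level 10% tone
    low[0] = 1.0                       # assumed one-sample marker pulse
    cell1 = tone(2000, 50, 1.0, 0.0) + low
    cells2_3 = silence(50) + tone(2000, 100) + silence(50)
    # Assumed ramp shape for the last cell: up, then back down.
    cell4 = tone(2000, 50, 0.0, 1.0) + tone(2000, 50, 1.0, 0.0)
    return cell1 + cells2_3 + cell4
```

The resulting list of samples (values between -1 and 1) could then be written to a Wav file and played out of the sound card into the Concatenator.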

TestTime.gif, 103 kB
Timing Test Signal 400mS Coming out of the Speaker.
Adjustment in the PIC for the beginning, and adjustment in the Computer Recorder for the exit point.
Note the gap between the beginning of CE-Recording and the beginning of actual audio recording. It measures about 25mS. Seems to be a problem with ISD.
And there is a similar problem at the end. After CE is raised, you would expect the recording to be officially terminated at the end of the cell. One should be able to "coast" to the end. There is still available memory in the cell, but the CE toggle stops the actual audio recording as well. Seems to be a problem with ISD.

You can see on the scope that there is more of the ending ramp than the beginning. When I place my syllables, instead of this test signal, there will be more ending sound than beginning. The placement is not exactly in the center of the 400mS, but that is close enough.

Aud151258.gif, 9 kB
Here is another test signal. In my business, it is called a "sweep" signal because I am going to sweep the audio circuits of the ISD2560 with several frequencies.

Wave setup before leaving computer. An audio signal will be generated out the back of the computer to be applied to the Concatenator. This is a Sweep signal.
The two end Ramps are 2kHz.

The Polarity pulses are evidently too long at 1/2 300Hz half wave. They are not showing up...

Resp151258.gif, 87 kB
With the above signal going in, this is the signal coming out...

Output of ISD2560:
A Sweep Test Signal Output of Concatenator: 100Hz 500Hz 1kHz 2kHz 5kHz 8kHz.
100Hz is almost nonexistent due to using the mic input with 100kOhms and 27uF on AGC. Of course the 5kHz and the 8kHz are dead; I never expected them to be otherwise. If I had the time I would investigate the frequencies around 100Hz and 200Hz. The ISD2560 is a lousy recorder. I do not even want to look at linearity; it is going to be really bad too.
TestSweep.gif, 97 kB
Here is a better response, using the Ana input pin.

Sweep Output of Concatenator: 20Hz 60Hz 100Hz 500Hz 1kHz 2kHz 5kHz 8kHz.
Internal Digitization filter set at 2.7kHz according to ISD.
Audio applied directly into the analog input (similar to a "line input"), where there is no bass attenuation. And evidently the audio is upside down (inverted), judging by the one-way negative spikes.
A few years ago I tried doubling the speed into the XTAL input to give a resolution of 6kHz instead of a lousy 3kHz. I would have had only 30 seconds of record time, instead of 60, and that would have been an OK trade. But it did not work. And it is totally hopeless to get the resolution up to a decent 16kHz, so why try? Perhaps some day someone will develop better chips; that is the real problem, and there is no way around it. Perhaps I can invent my own higher speed recorder using a 20MHz PIC with external memory. 44kHz digitization would give 20kHz resolution, far better than the present 2.7kHz resolution.
Example of Network Concatenator:
Syllable Concatenator

CodePreSoundID.gif, 14 kB
Here is a way to add extra sounds before the sentence begins to iterate. Actually there are three extra sounds: a marine sonar attention getting sound, an ID disclosing sound which indicates the origin of the sentence, and a pause. The identity of the originator is known from the group ID, which is Buffer3, the individual ID, which is Buffer4, and the personal pronoun, which is Buffer6. First the sonar is started playing without using any buffer, and while that is playing the extra buffers are loaded. After the sonar finishes with an EOM, the entire sequence, beginning with Buffer7, is played. During the playing of the sequence, the PIC is free to do other things in parallel. The PIC occasionally checks for each EOM, and starts another sound.

Here, I am showing only the identity associated with the Group. But for some groups, I key on the individual. For example, there are several Broadcast Transmitters, and that one group has several members.
Syllable Concatenator

SpeekMultAnam.gif, 329 kB
Here is an animation:
All bytes start off as empty (white).
Next, a sequence is received that loads all bytes (red).

Buffers6, in the Code above, is the left most byte. And when it is played it is in green.
I should point out that the PIC manually loads Buffers6 and Buffers7 by evaluating the Identity bytes and deciding on an appropriate presound. The content of these bytes is indirectly inferred. This is done automatically for every sequence. In this case, it is simply a "sonar" attention getting sound (quite arbitrary when I chose it), and an identity sound. Another reason: a presound MUST be played to generate the first EOM, which generates the next CE pulse.

This action generates the first of a series of EOMs and consequently CE pulses. All other buffers are already loaded, from the serial communications of the NET, also indicated in Red.

The bytes to be played are shifted from the right to the left and into the "play-position", which is byte Buffers7.

Buffers in white are empty, and they are shifted in from the left as each byte is played and discarded.

As soon as the PIC detects Buffers7 as empty, all playing is halted.

If the PIC detects another string to be iterated while the first is playing, then the first sequence is abruptly halted. The new sequence is started in mid-stride. The Speech Concatenator appears to stutter at such times, and is difficult to understand. But this happens often, and it happens when dozens of events in the real world occur at the same time.
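The shift-and-play behavior shown in the animation can be modeled roughly as below. This is a simplified Python sketch, not the PIC firmware: the class `PlayBuffers`, the buffer count of 8, and the use of `None` for an empty (white) byte are all assumptions, and index 0 of the list stands in for the play position.

```python
EMPTY = None  # assumption: marker for an empty (white) buffer byte


class PlayBuffers:
    """Hypothetical model of the play sequence; slot 0 is the play position."""

    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def load(self, sequence):
        """Load a new sequence, abruptly replacing whatever was pending."""
        size = len(self.slots)
        if len(sequence) > size:
            raise ValueError("sequence does not fit in the buffers")
        self.slots = list(sequence) + [EMPTY] * (size - len(sequence))

    def on_eom(self):
        """Called at each EOM: return the next sound, or None to halt."""
        sound = self.slots[0]
        if sound is EMPTY:
            return None  # play position empty, so all playing stops
        # Discard the played byte, shift the rest toward the play
        # position, and let an empty byte enter at the far end.
        self.slots = self.slots[1:] + [EMPTY]
        return sound
```

In this model the presound plays outside the buffers and its EOM triggers the first call to `on_eom`; each later EOM pulls the next byte until the play position is empty. Calling `load` while a sequence is still playing reproduces the "stutter": the rest of the old sequence is simply lost.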

Example of Syllable Concatenator speech. Authored by Clock
Attention: Submarine Sonar sound
Device: Clock
Clear Throat
Message...
End of Message Period: Spit