Now that I've read through the document (and E, I don't want to hijack this thread, so let us know if you want to pull this stuff out into a new one), this looks promising, if a bit complicated. It would be nice (ideally) to have somewhat better integration into the M1, or a different way of extending the M1 vocabulary.
The pluses I see are:
Ability to add up to 248 minutes of voice/.wav files
Some negatives are:
Complexity of interfacing with the M1. Ideally, you would want the new words/phrases to be accessible by the M1 like the other words are, not just as ASCII strings triggered by rules.
Cost - at least 1 XSP in addition to the 480 AND the 485 and 129. What if you needed more than 1 480 - can multiple 480s tie into a single XSP? If not, you are going to run into a limit real fast with XSPs.
Some questions/ideas:
The M1 has the ability to record up to 60 seconds (10 x 6) of additional voice. The way I understand it, you have only the 10 slots of 6 seconds each (whether you use the full 6 seconds or not), and the only interface to get the voice in is the local phone. Since a lot of what I (and I assume several other people) would want to do is add individual words like family member names, maybe reminder words, etc., 60 seconds is really A LOT of space. An average word is <= 1 second, right? Sooo - would it be beneficial if the slots on the M1 could be broken down to 60 x 1 and made linkable? That way, we could add up to 60 individual words directly into the M1 (assuming of course 1 second per word), and also have them linkable like the time slots on the 480.
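To put rough numbers on the 10 x 6 versus 60 x 1 idea, here is a quick sketch in Python. It is only back-of-the-envelope arithmetic, not anything the M1 actually exposes: the function name, the word durations, and the notion that a longer word simply links as many slots as it needs are all my assumptions for illustration.

```python
import math

TOTAL_SECONDS = 60  # custom voice capacity mentioned above (10 slots x 6 seconds)


def words_that_fit(word_lengths, slot_seconds):
    """Return (words stored, seconds of slot space consumed).

    Hypothetical model: every word occupies a whole number of slots,
    and a word longer than one slot links consecutive slots.
    """
    slots_total = TOTAL_SECONDS // slot_seconds
    slots_used = 0
    stored = 0
    for length in word_lengths:
        needed = max(1, math.ceil(length / slot_seconds))
        if slots_used + needed > slots_total:
            break  # out of slots, remaining words do not fit
        slots_used += needed
        stored += 1
    return stored, slots_used * slot_seconds


# Made-up durations: 40 short words (0.8 s) and 10 longer ones (1.5 s).
words = [0.8] * 40 + [1.5] * 10

print(words_that_fit(words, 6))  # fixed 6-second slots -> (10, 60)
print(words_that_fit(words, 1))  # 1-second linkable slots -> (50, 60)
```

With these made-up durations, the fixed 6-second slots hold only 10 words, while 1-second linkable slots hold all 50 in the same 60 seconds.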
Then, how about a way to interface a PC output to the M1 to transfer the words, instead of a phone line interface? Ideally, I guess I would want the same digital voice as the built-in one saying the custom words. Maybe this would be sufficient, and it keeps everything right in the M1?
Does this make sense or am I all wet? Otherwise, it may just be simpler to skip the M1 voice altogether and ship all TTS out to a SW or other HW engine, but then you lose the built-in integration, which is nice.