It sounds fairly mechanical, but unless you have the processing oomph to generate WAV files on the fly, you may not be able to do better than that, and that would be asking a lot of a low-level device like the M1. Of course, if you are using a higher-level, software-based system in conjunction with the M1, you can do some of that work there, where you have the option to use high-quality voices (though you'll want a quick machine and plenty of memory to keep it from interfering with the regular automation work going on at the same time). High-quality speech engines can be pretty piggy in terms of CPU usage and memory, IMHO, because they have to do a fair amount of analysis and then generate a good-quality digital representation of the voice on the fly.
A system like the Elk just has small files that represent single words or phrases, which it puts together as required, but that means the output cannot be context- or syntax-sensitive.
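
To make the "small files put together as required" idea concrete, here is a rough sketch in Python. It is not how the Elk firmware actually works; the vocabulary, the file names, and the `build_playlist` function are all made up for illustration. It just looks up each word or phrase in a fixed table and strings the matching clips together, preferring the longest recorded phrase.

```python
# Hypothetical vocabulary: each word or phrase maps to one prerecorded clip.
# These names are illustrative only and are not taken from the M1.
CLIPS = {
    "the": "the.wav",
    "front door": "front_door.wav",
    "garage door": "garage_door.wav",
    "is": "is.wav",
    "open": "open.wav",
    "closed": "closed.wav",
}


def build_playlist(message, clips=CLIPS):
    """Return the clip files to play, in order, for the given message."""
    words = message.lower().split()
    # Longest phrase (in words) that exists in the vocabulary.
    max_len = max(len(key.split()) for key in clips)
    playlist = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = " ".join(words[i:i + n])
            if key in clips:
                playlist.append(clips[key])
                i += n
                break
        else:
            # Nothing recorded for this word, so it cannot be spoken.
            raise KeyError(f"no clip recorded for {words[i]!r}")
    return playlist


if __name__ == "__main__":
    print(build_playlist("The front door is open"))
```

Note that each clip is played exactly as it was recorded no matter what comes before or after it, which is why the result sounds mechanical, and anything outside the table simply can't be said.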