Microsoft has created a language model called “VALL-E” that can simulate a person’s voice saying whatever they choose, with the only input needed being a 3-second clip of the actual person saying something. I guess a fraction of a TikTok video would do the job.
It can even preserve emotion - so if someone has a 3-second clip of your friend angrily shouting about something, then they could in theory make a clip of your friend angrily shouting about something completely different.
Right now I don’t think you can play with the model yourself. Some people might feel that’s in some ways for the best, at least until society has figured out how we’re going to deal with the sheer volume of future “recordings” of things that never happened - recordings that everyone will be able to produce with minimal effort using tools like this and the various other generative AI tools that are already out there.
But I imagine someone else will release a more public tool all too soon. It didn’t take long for folks to figure out how to get AI tools to generate images that are really rather against what most of the systems’ designers wanted them to produce.
In the meantime, you can hear some VALL-E samples on this page. Scroll down a bit and compare the “Speaker Prompts” - the actual 3-second recordings of someone’s voice that were fed in - with the “VALL-E” output, which is what the model produced based on them.