Hello from Utah! We are in an early monsoon season so it is raining nearly every day. Great for refilling the Great Salt Lake. Today we are talking digital clones.
My friend Austin wants a digital clone of himself.
He wants to call it ‘GPT-ME’
The idea is to create a digital clone of himself using years of emails, social posts, and videos he has created.
This clone could create written content, as well as act as a digital avatar creating photo and video content.
My friend is a technologist and is often at the bleeding edge of technology. He thought the clone could be useful in a number of ways.
An email assistant who can answer on his behalf.
‘Chat Austin bot’ you could ask questions of
A brainstorming partner.
Or it could just be an art project.
Today we are going to dive into digital clones and figure out how we could create one.
The first step is to ask…
Do you want a digital clone of yourself?
We talked about this in the AI programming edition, but everyone wants a Jarvis. Talking to a computer, and having it give you valuable, timely responses is the science fiction dream.
Tools that help us do more give us leverage.
Levers are force multipliers. Tools and systems are how some people can accomplish 10x, 100x, or 1,000,000x what others can.
And a digital clone could be a cool source of leverage.
Imagine having a digital clone on your website that could answer questions intelligently about your business. This was the dream of the ‘chatbot’ craze a few years ago.
Instead of having to prompt ChatGPT each time about the project you are working on, you could walk to the desk, connect to your clone and dive right back into the project, Tony Stark style.
People are already training GPT on their own text, PDFs, help centers and content. I wrote about how to do that here.
A clone would only amplify that. Not only could it make text content, it could talk like you, sound like you, look like you.
There are two sides to every coin, and on the flip side of all that usefulness sits a pile of potential issues.
Potential Problems with AI clones
There are a few issues that could crop up when people have AI clones.
Consent / Ownership
Problematic responses
Scams
Consent and Ownership
Cloning a personality and using it in AI systems raises questions about consent and ownership. Can I take an individual’s publicly published works and create a ‘clone’ of them?
Do individuals have the right to control how their likeness and personality are used in AI systems? Who owns the resulting AI clone, and who has the authority to use it?
Can I clone Marques Brownlee and have his ‘clone’ review my products?
His YouTube videos are public, and I created the clone, so who owns it or can control it?
The music industry is not ready for voice cloning, much less full AI clones, as witnessed by the pandemonium around the fake Drake tracks that were made with voice cloning alone.
Problematic responses
AI clones, built with the technology we have today, will lack the ability to fully comprehend the consequences of their actions.
Imagine my friend Austin launches his AI clone, someone asks it for travel ideas, and the clone, speaking as ‘Austin’, hallucinates and makes derogatory remarks about certain countries, something the real Austin would never say and could be embarrassed his clone said. Is Austin at fault?
AI Clones can, and will, inadvertently make inappropriate or offensive remarks, breach confidentiality, or cause harm by disseminating incorrect information. This can have legal, reputational, and psychological consequences for the individual being cloned as well as those using them.
Scams
We are already seeing an increase of scams with voice cloning technology. Grandparents are getting calls from ‘family members’ in jail who need bail and losing thousands of dollars.
Add to the voice clone, the ability for the clone to intelligently talk like the person, knowing what they know, and the scams could get worse.
Imagine you answer the phone. A family member is in jail. You ask them the question ‘What was our family pet’s name?’ to make sure it is really them… and they answer correctly, because the clone was trained on their tweets, and they tweeted on the 10th anniversary of Fluffy the Cat’s death.
That being said, we will need to understand and navigate these issues, because AI clones are going to happen. Here is how you could build a pretty convincing AI clone with the tools already available today.
How to build an AI clone of yourself
There are no services (yet) that will create a full AI clone of you out of the box. There are lots of services that can do parts of it, and we talk about them below.
But we want to build a full AI clone, so we are going to need to build out all the pieces of it ourselves.
There are 4 pieces you need to get a full AI clone. Your clone needs to clone content, clone your voice, mimic your likeness, and be able to move, either with cloned movement or AI generated.
So the ingredients needed to make our AI clone are cloned:
Content
Voice
Likeness
Movement
Here is how we can clone or create each one.
Clone your Content
GPT is the leading large language model right now, though other LLMs are rapidly getting better. Hugging Face has an LLM leaderboard where they are ranked.
But since GPT is the best, and has a training API, we will use it. I wrote an article about how to fine-tune GPT.
But let’s recap here:
1. Structure your content in a prompt-response format
You are essentially telling it how it should try to generate text, based on your data and inputs.
You need to create a bunch of examples of what the prompt would be, and then what the answer would be. It would look something like this.
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
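If your source material is already in question-and-answer form, a short script can produce that file. This is a minimal sketch; the example pairs and the `training_data.jsonl` filename are made up for illustration:

```python
import json

def to_jsonl(pairs):
    """Convert (prompt, completion) pairs into JSONL lines for fine-tuning."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in pairs
    )

# Example: a couple of made-up email snippets as training pairs.
pairs = [
    ("What time works for the call?", "Anytime after 2pm works for me."),
    ("Can you review my draft?", "Sure, send it over and I'll look tonight."),
]

with open("training_data.jsonl", "w") as f:
    f.write(to_jsonl(pairs))
```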
2. Send it to the OpenAI training API
After you have all these [prompt - generated text] examples, you package them up in a file and submit it to the OpenAI API.
This requires some knowledge of programming (ChatGPT can help you figure this out as well) but this article walks you through it.
A neat trick is you can also give some of your text to ChatGPT and ask it to create the prompt-response pairs for you.
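Before submitting, it is worth sanity-checking the file. The validation helper below is my own; the upload calls (commented out) use the `openai` Python package’s file and fine-tuning endpoints, whose exact names and supported base models change over time, so verify against OpenAI’s current docs:

```python
import json

def validate_examples(path):
    """Count the examples in a JSONL file, failing loudly on any line that
    is not valid JSON or is missing the expected keys."""
    count = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            assert "prompt" in record and "completion" in record, line
            count += 1
    return count

# The actual upload and training job, commented out since it needs an API
# key and makes network calls:
#
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# uploaded = client.files.create(
#     file=open("training_data.jsonl", "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(
#     training_file=uploaded.id, model="gpt-3.5-turbo")
# print(job.id)  # poll this job until training finishes
```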
3. Use your trained model via API calls
Once the OpenAI training API finishes, you have a model trained on your content. You can use that model instead of GPT-3.5 or GPT-4 via the API or in the playground.
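A sketch of what calling the resulting model might look like. The model ID and prompt separator are hypothetical placeholders, and the actual API call is commented out since it needs a key:

```python
def build_request(model_id, user_prompt, separator="\n\n###\n\n"):
    """Assemble parameters for a completion call against the fine-tuned model.
    The separator should match whatever convention you used in training."""
    return {
        "model": model_id,  # e.g. a hypothetical "ft:...:gpt-me:..." model ID
        "prompt": user_prompt + separator,
        "max_tokens": 200,
        "temperature": 0.7,
    }

# The actual call, commented out since it needs an API key:
# from openai import OpenAI
# client = OpenAI()
# resp = client.completions.create(**build_request(my_model_id, "Favorite trail?"))
# print(resp.choices[0].text)
```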
Next we need to clone your voice.
Clone your Voice
There are a lot of voice cloning tools. The best or most popular one is Eleven Labs. You hand it a bunch of recordings and for 5 dollars a month, you can have your cloned voice say any text you want.
One important note: Eleven Labs is not instant. Responses can get below one second of latency at certain quality levels, and smaller chunks of text can render in as little as ~500 ms, but that still wouldn’t be good enough for a live conversation.
There are two ways around this: avoid live conversations, or wait until real-time voice cloning is available. There are already real-time voice cloning projects appearing on GitHub.
But it could be easier to just avoid real-time conversations. The clone could respond to text input after it has had time to render. Or you could hide the fact that it is not real-time by keeping previously generated ‘umm’, ‘like’, or ‘hey, one second’ clips on hand and using those to fill the gaps.
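As a sketch of what a single text-to-speech call could look like: the endpoint shape below is based on Eleven Labs’ public REST API, but the voice ID, key, and model name are placeholders, so check their current docs:

```python
def tts_request(voice_id, text, api_key):
    """Build the pieces of an Eleven Labs text-to-speech call."""
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text, "model_id": "eleven_monolingual_v1"},
    }

# Sending it (needs `pip install requests`, a real key, and your cloned
# voice's ID from the Eleven Labs dashboard):
# import requests
# req = tts_request("YOUR_VOICE_ID", "Hello from my clone!", "YOUR_API_KEY")
# resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
# with open("clone_says.mp3", "wb") as f:
#     f.write(resp.content)  # the response body is the rendered audio
```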
We now have your content, and we can say it in your voice. Now we need your image or likeness.
Clone your Likeness
There are lots of options for tools that can create digital likenesses of yourself. Let’s break this down into 2d and 3d.
2D image clone
The best tool for creating new images of yourself is Dreambooth. There are TONS of Dreambooth tools and services out there that allow you to clone yourself. You train a model on what you look like, then can generate images of yourself in any situation.
I wrote a whole article about what Dreambooth is and how you can use it.
How Dreambooth works is you upload a bunch of images of yourself in a variety of situations. The AI learns what your character looks like. You can then use that trained model, giving it text prompts to create new images of yourself in any situation.
We will talk about the tools and techniques we use to animate those 2d images to speak below.
But what if we want more than a 2D image? What if we want a clone that can be viewed from any angle? For this, we need a 3D clone.
3D avatar clones
We have reached the edge of the map. There is no super easy way to create a 3D clone of yourself.
One way to get one is to build it yourself.
This video (linked below) is a great tutorial showing you how to go from Dreambooth images to a 3D animated character that you can control.
Just a warning, the process is not easy, but who said making a digital clone would be easy? :)
Simplified, the process entails:
Train your Dreambooth model
Generate images of your avatar (or if you are directly cloning yourself, you can skip to step 3)
Take those images into Blender and use them to sculpt a 3D model of your avatar's head
Take that 3D model into a tool called MetaHuman to automatically rig it for animation
Put it all together in Unreal Engine to control the avatar.
While it is a long, difficult process, it gives you a 3D avatar that can be used anywhere you can use a 3D model.
Another option, instead of building the 3D model by hand, is to do a 3D scan.
3D scan to create AI clone
This quick tutorial shows how you can create a 3D model of yourself.
The basic process is:
Scan yourself using the In3D app https://in3d.io/
Import that model into Blender.
Use Mixamo to rig the model (make it animatable) https://www.mixamo.com
Take the model into Unity or Unreal engine to animate it.
With this 3D scanning process, you have a 3D model that is ready to be animated and used wherever you want.
There are other options for creating a 3D clone of yourself, but these other options have various limitations or are not public yet. These include:
Ready Player Me
https://readyplayer.me/ offers an option to create an avatar of yourself from a selfie. They say they have 900 worlds or games that can use that avatar, but you can’t access it yourself and use it anywhere you want.
It doesn’t really create your avatar; it just uses your selfie to get the basic skin tone, gender, and hair color, then lets you use a Memoji-like character creator to select hair style, eye color, accessories, etc.
Facebook Codec Avatars
Facebook is trying to create lifelike 3D avatars. While they haven’t released any tools or code publicly, they have published a few impressive papers.
https://research.facebook.com/publications/pixel-codec-avatars/
https://tech.facebook.com/reality-labs/2019/3/codec-avatars-facebook-reality-labs/
Facebook also showed a demo where you can generate a 3D avatar with your smartphone by making various faces into the camera. https://www.uploadvr.com/vr-killer-app-avatar-telepresence/ Again, nothing released publicly.
You have your 3D clone. But now it needs to move. You need to clone or generate movement.
Clone your movement (or generate movement)
We have the content. We have the voice. We have the image. Now we just need to animate it.
Similar to likeness, we can separate cloning movement into 2D and 3D.
2D movement
Creating speaking animation with a 2D image is straightforward. It mostly involves warping and some clever layers of generated teeth etc. There are multiple tools that do this pretty well.
D-ID is pretty cool. You can give it an image and what you want that image to say, and it will animate the image with realistic movement.
D-ID is how a lot of the AI movie memes are being created. Here is an example and an article explaining how they did it.
HeyGen (more about them later) is another service where you can upload an image and it will animate it talking.
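A sketch of what handing an image and audio track to a tool like D-ID might look like. The payload shape is based on D-ID’s public ‘talks’ API, but treat the field names as assumptions to verify, and the URLs are placeholders:

```python
def did_talk_payload(image_url, audio_url):
    """Build the JSON body for a D-ID 'talks' request: one hosted image of
    the clone plus one hosted audio track for it to speak."""
    return {
        "source_url": image_url,
        "script": {"type": "audio", "audio_url": audio_url},
    }

# Posting it (needs `pip install requests` and a D-ID API key):
# import requests
# resp = requests.post(
#     "https://api.d-id.com/talks",
#     headers={"Authorization": "Basic YOUR_API_KEY"},
#     json=did_talk_payload("https://example.com/austin.png",
#                           "https://example.com/answer.mp3"),
# )
# print(resp.json())  # returns an id you poll for the finished video
```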
A 3D model, animated to match human speech is harder.
3D movement (animation)
Once again we are off the edge of the map.
I can’t find an off-the-shelf solution that would generate face and body animations to match a given audio track.
However, there are people working on tools that get close.
One area to look at is VTubers.
VTubers are YouTubers who use a generated character on camera instead of their face. They have pretty sophisticated toolsets to take a video feed and create animation from it.
One tool is Rokoko. It takes a video as input, and outputs motion data to a 3D model.
Another is MediaPipe.
This tool even has a demo you can try with your webcam.
https://mediapipe-studio.webapps.google.com/demo/face_landmarker
The problem is most VTuber tools rely on a video feed of a human as input to create the motion.
Since our digital clone doesn’t generate a video feed, we need to figure out how to use the tools in a different way, or generate a 2D video of our clone moving to then put into the 3D tool. That seems convoluted, so let’s try to generate the motion with AI directly.
One idea could be to train a model to generate motion from a speech audio track or written text.
For example, MediaPipe takes a video and recognizes 478 facial landmark points plus a set of blendshape scores. We could generate those shapes directly from an audio track or text by training an AI to predict landmarks and blendshapes based on text or audio input.
We create a library of training data: speech tracks and/or text from our trained AIs, plus video recordings of ourselves reading that text. We put all of it into a training program that learns to predict which face shape a given token of speech should generate.
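As a toy illustration of that prediction step (not a real training program), here is the simplest possible baseline: average the blendshape vector observed for each speech token in training, then play those averages back at inference time:

```python
from collections import defaultdict

def fit_token_to_blendshapes(tokens, frames):
    """For each speech token, average the blendshape vectors observed while
    that token was spoken. A real system would use a sequence model; this
    lookup table just illustrates the mapping."""
    sums, counts = {}, defaultdict(int)
    for token, frame in zip(tokens, frames):
        if token not in sums:
            sums[token] = list(frame)
        else:
            sums[token] = [a + b for a, b in zip(sums[token], frame)]
        counts[token] += 1
    return {t: [v / counts[t] for v in s] for t, s in sums.items()}

def predict_frames(model, tokens, neutral):
    """Play back the learned face shape for each token, falling back to a
    neutral pose for tokens never seen in training."""
    return [model.get(t, neutral) for t in tokens]

# Toy data: two blendshape dimensions, three aligned (token, frame) samples.
model = fit_token_to_blendshapes(["ah", "ah", "oo"],
                                 [[0.0, 0.0], [0.4, 0.2], [0.8, 0.8]])
# model["ah"] is the average of the two "ah" frames: [0.2, 0.1]
```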
Frankly, this would be hard, but it could work very well if it works at all. :)
To generate movement for the body, there are other text-to-pose models that are being released we could use.
GestureDiffuCLIP takes text inputs and generates motion.
https://twitter.com/_akhaliq/status/1640543753709428736
If we can train our motion or animation generator to accept text or speech tracks as input, we have finished the last piece we need to create our clone.
Putting it all together aka Frankenstein your AI clone
Here is how we can take everything we have created and put it together to get our AI clone.
Let’s use my friend Austin as an example.
We are going to create for him an AI clone brainstorming partner.
We would create a text or voice dictation input website. Somewhere he could type in the prompt and get responses. We could even use Whisper to let it accept voice dictation to feel more natural.
Let’s say we wanted to brainstorm new food ideas. We could ask our AI clone ‘What are some ideas to combine strawberries and donuts?’
We take our question ‘What are some ideas to combine strawberries and donuts?’ and send it to our trained GPT model with a header prompt. Something like ‘You are an expert brainstormer named Austin. You are helpful and truthful. Let’s brainstorm about the following:
What are some ideas to combine strawberries and donuts?’
We take the output from GPT and pass it to ElevenLabs via the API.
Here is a sample of the ElevenLabs result.
If we are doing 2D, we then take our avatar image and pass it plus the audio to D-ID or a similar tool, and it will create the video of the clone image talking, and send it back to Austin.
If we are doing 3D, we take the speech track and pass it to our face shape predictor model. It creates the motion data of our clone saying the response.
We take that motion data and pass it to our Unity or Unreal engine setup to move the 3D clone model of our face and body.
We record that motion to video, add back in the speech track, and then show that finished video to Austin.
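The whole flow above can be sketched as one function that wires the stages together. The three callables are stand-ins for whichever text, voice, and animation services you chose:

```python
def clone_pipeline(question, generate_text, synthesize_voice, animate):
    """Wire the stages together: text model -> voice clone -> talking avatar.
    Each callable stands in for one of the services discussed above."""
    header = ("You are an expert brainstormer named Austin. You are helpful "
              "and truthful. Let's brainstorm about the following:\n")
    answer = generate_text(header + question)   # fine-tuned GPT step
    audio = synthesize_voice(answer)            # ElevenLabs step
    video = animate(audio)                      # D-ID / 3D animation step
    return {"answer": answer, "audio": audio, "video": video}

# Trivial stand-ins show the flow end to end (real versions call the APIs):
result = clone_pipeline(
    "What are some ideas to combine strawberries and donuts?",
    generate_text=lambda prompt: "Strawberry-glazed donut holes!",
    synthesize_voice=lambda text: b"mp3-bytes",
    animate=lambda audio: "clone_response.mp4",
)
```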
Making an AI clone of yourself currently is a long, difficult process, and you need to spend time wiring all the pieces together. Knowing how to code or use APIs (or use ChatGPT to help you with all that) is going to be hugely helpful. But what if you want to buy a tool, or better yet just buy an AI clone?
There are several companies offering tools that do nearly all of what we talked about above.
Tools to create AI Clones of yourself
There is no tool that will create a full AI clone of yourself currently. However, there are many that can do some of the 4 requirements for a clone. No one has added the content cloning piece, probably because that is where things can go really wrong with offensive or false information. Here are a few platforms that will do most of the AI cloning for you for a price.
Synthesia
Content ❌
Voice ✅
Likeness ✅ (2D)
Movement ✅ (2D)
Synthesia doesn’t create content for you, you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content. Synthesia has hundreds of generated avatars you can pick from to create content. But they can make custom avatars for $1,000 a year. Or a lower quality version from your webcam for $250.
Colossyan
Content ❌
Voice ✅
Likeness ✅ (2D)
Movement ✅ (2D)
Colossyan doesn’t create content for you, you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content.
Colossyan doesn’t list the price.
https://www.colossyan.com/create-your-own-avatar
elai.io
Content ❌
Voice ✅
Likeness ✅ (2D)
Movement ✅ (2D)
Once again, elai.io doesn’t create content for you, you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content.
$259 Annually for the avatar, and $659 annually for avatar and voice.
https://app.elai.io/buy-avatar#selfie
Rephrase
Content ❌
Voice ✅
Likeness ✅ (2D)
Movement ✅ (2D)
Seeing a pattern? Rephrase doesn’t create content for you; you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content. They don’t give pricing, just ‘talk to sales for Enterprise pricing’, which usually means it is not cheap.
https://www.rephrase.ai/pricing/studio
D-ID
Content ❌
Voice ✅ (enterprise customers only)
Likeness ❌ (you upload an image)
Movement ✅ (2D)
We already mentioned D-ID for animating an already-created image. At the risk of sounding like a broken record, D-ID doesn’t create content for you; you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content.
Pricing goes from a free trial to $6 a month up to $300 a month.
HeyGen
Content ❌
Voice ✅ (enterprise customers only)
Likeness ✅ (you upload an image or use their Stable Diffusion model to generate it)
Movement ✅ (2D)
Very similar to D-ID. Doesn’t create content for you, you have to give it the content you want produced. But once you do, it creates a video of your avatar speaking the content.
$199 / year for the basic, $1000 / year for the full avatar.
Aphid
Content ✅
Voice ❌
Likeness ❌
Movement ❌
Work ✅
Aphid’s main idea is to create AI clones of yourself that can do actual work that you get paid for. It doesn’t create the audio or visuals, but I mention it because it is an interesting idea for creating leverage, if it works.
One Hour ❌ (not working)
One Hour is like Synthesia in that you can pick a pre-made avatar. They made a splash when they announced they could also create an AI character of anyone. They partnered with a YouTuber to show a little behind the scenes.
But when you try to access the offer at their URL https://hourone.ai/become-a-character the link is broken and I couldn’t find any mention of that offer on their website.
Wrapping Up
There are already examples where people have done 2D clones of themselves. This reporter had a digital clone created and used it to interact with coworkers.
I think we will see 3D digital clones in the next couple of months, and then products and companies offering an ‘end-to-end digital cloning process’ in a year or more.
Once they start appearing, we will need to figure out the legal and cultural ramifications.
I guess the last question I have is, do you want a 3D clone of yourself?
Thanks for reading, see you next week
-Josh
Another tool to create an AI clone I just found - https://avaturn.me/