Text to Video AI: How to Create Videos for Free — A Complete Guide

Anil Chandra Naidu Matcha
7 min read · Jun 14, 2024


Text to Video generation has been trending lately, and plenty of popular tools are available for this use case, such as Vadoo, Invideo, and Pictory. But most of these tools are paid, and the budget can quickly go out of reach when you need multiple revisions of your output video.

So in this article we will discuss how to create videos from text for free using an open-source solution: https://github.com/SamurAIGPT/Text-To-Video-AI. If you prefer just watching a demo of the project without getting into the coding aspects, here is a tutorial video.

Now let’s get technical. Below is the entire workflow for generating a complete video from text.

Workflow

  1. Use OpenAI to generate a video script from a topic
  2. Use Edge TTS to pick a voice and create audio from the generated script
  3. Use Whisper to get timed captions for that audio
  4. Generate visual keywords for the video script using the OpenAI API
  5. Fetch videos based on those visual keywords using the Pexels API
  6. Stitch together the videos, audio, and captions using MoviePy

If you prefer a premium voice, you can use the ElevenLabs API instead of Edge TTS. If you are on a system with lower hardware specifications, you can skip running Whisper locally and use the Whisper API instead. With the entire workflow clear, let’s understand how each of the above steps is achieved.
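To make the data flow concrete, here is a minimal sketch of how the six steps could chain together. Apart from generate_script, getBestVideo, and get_output_media, which are covered in detail below, the helper names (generate_audio, generate_timed_captions, generate_search_terms, pick_video) are hypothetical placeholders for the corresponding steps, sketched later in this article.

import asyncio

def text_to_video(topic):
    # 1. Script from a topic (covered in the next section)
    script = generate_script(topic)

    # 2-3. Voice-over and timed captions (hypothetical wrappers, sketched below)
    audio_path = "audio_tts.wav"
    asyncio.run(generate_audio(script, audio_path))
    timed_captions = generate_timed_captions(audio_path)

    # 4. Three visual keywords per ~3-second segment
    search_terms = generate_search_terms(script, timed_captions)

    # 5. One Pexels video URL per segment, avoiding repeats
    used_links = []
    background_video_data = []
    for (t1, t2), keywords in search_terms:
        url = pick_video(keywords, used_links)
        background_video_data.append(((t1, t2), url))

    # 6. Stitch video, audio and captions (covered in the last section)
    return get_output_media(audio_path, timed_captions,
                            background_video_data, "pexel")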

Generating a script for the video

We use the prompt below to generate a video script for a topic. Since we are focused on generating a Shorts video, we instruct the LLM to produce a script of around 50 seconds, or roughly 140 words. In addition, we instruct the LLM to make the script interesting and original.

import json
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def generate_script(topic):
    prompt = (
        """You are a seasoned content writer for a YouTube Shorts channel, specializing in facts videos.
Your facts shorts are concise, each lasting less than 50 seconds (approximately 140 words).
They are incredibly engaging and original. When a user requests a specific type of facts short, you will create it.

For instance, if the user asks for:
Weird facts
You would produce content like this:

Weird facts you don't know:
- Bananas are berries, but strawberries aren't.
- A single cloud can weigh over a million pounds.
- There's a species of jellyfish that is biologically immortal.
- Honey never spoils; archaeologists have found pots of honey in ancient Egyptian tombs that are over 3,000 years old and still edible.
- The shortest war in history was between Britain and Zanzibar on August 27, 1896. Zanzibar surrendered after 38 minutes.
- Octopuses have three hearts and blue blood.

You are now tasked with creating the best short script based on the user's requested type of 'facts'.

Keep it brief, highly interesting, and unique.

Strictly output the script in a JSON format like below, and only provide a parsable JSON object with the key 'script'.

# Output
{"script": "Here is the script ..."}
"""
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": topic}
        ]
    )
    content = response.choices[0].message.content
    script = json.loads(content)["script"]
    return script

Once the script is ready, we pass it to Edge TTS to generate a voice-over. We then feed the generated audio to Whisper to produce timed captions. With the captions ready, we can move to the next task: identifying visual keywords to use for fetching relevant videos for the captions.
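These two steps aren't shown in the snippets in this article, so here is a hedged sketch of what they can look like using the edge-tts and openai-whisper Python packages; the voice name and the Whisper model size are arbitrary example choices, not the repo's exact settings.

import edge_tts
import whisper

async def generate_audio(text, output_path):
    # Any Edge TTS voice works here; en-AU-WilliamNeural is just an example
    communicate = edge_tts.Communicate(text, "en-AU-WilliamNeural")
    await communicate.save(output_path)

def generate_timed_captions(audio_path):
    # word_timestamps=True gives per-word timings we can group into captions
    model = whisper.load_model("base")
    result = model.transcribe(audio_path, word_timestamps=True)
    captions = []
    for segment in result["segments"]:
        for word in segment["words"]:
            captions.append(((word["start"], word["end"]), word["word"].strip()))
    return captions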

Finding visual keywords for the script

We use the prompt below to identify relevant keywords from the input captions, and the Pexels API to find videos for those keywords. The keywords should be generated so that they are ideal for video search: at most 1–2 words each, and each one should depict something visual, since that makes the Pexels search easier.

First, we divide the input captions into time segments of 3 seconds, so that a new video appears every 3 seconds to keep the output engaging. For every time segment we generate 3 keywords using the prompt below; that way, even if we don't find any videos for one keyword, we can fall back to the others.

prompt = """# Instructions

Given the following video script and timed captions, extract three visually concrete and specific keywords for each time segment that can be used to search for background videos. The keywords should be short and capture the main essence of the sentence. They can be synonyms or related terms. If a caption is vague or general, consider the next timed caption for more context. If a keyword is a single word, try to return a two-word keyword that is visually concrete. If a time frame contains two or more important pieces of information, divide it into shorter time frames with one keyword each. Ensure that the time periods are strictly consecutive and cover the entire length of the video. Each keyword should cover between 2-4 seconds. The output should be in JSON format, like this: [[[t1, t2], ["keyword1", "keyword2", "keyword3"]], [[t2, t3], ["keyword4", "keyword5", "keyword6"]], ...]. Please handle all edge cases, such as overlapping time segments, vague or general captions, and single-word keywords.

For example, if the caption is 'The cheetah is the fastest land animal, capable of running at speeds up to 75 mph', the keywords should include 'cheetah running', 'fastest animal', and '75 mph'. Similarly, for 'The Great Wall of China is one of the most iconic landmarks in the world', the keywords should be 'Great Wall of China', 'iconic landmark', and 'China landmark'.

Important Guidelines:

Use only English in your text queries.
Each search string must depict something visual.
The depictions have to be extremely visually concrete, like rainy street, or cat sleeping.
'emotional moment' <= BAD, because it doesn't depict something visually.
'crying child' <= GOOD, because it depicts something visual.
The list must always contain the most relevant and appropriate query searches.
['Car', 'Car driving', 'Car racing', 'Car parked'] <= BAD, because it's 4 strings.
['Fast car'] <= GOOD, because it's 1 string.
['Un chien', 'une voiture rapide', 'une maison rouge'] <= BAD, because the text query is NOT in English.
"""

Now we have visual keywords for each time segment. Next, we find relevant videos for these keywords using the Pexels API. Below is the code to fetch a relevant Pexels video for a keyword.

import os
import requests

PEXELS_API_KEY = os.environ.get("PEXELS_API_KEY")  # set this to your Pexels API key

def search_videos(query_string, orientation_landscape=True):
    url = "https://api.pexels.com/videos/search"
    headers = {
        "Authorization": PEXELS_API_KEY
    }
    params = {
        "query": query_string,
        "orientation": "landscape" if orientation_landscape else "portrait",
        "per_page": 15
    }

    response = requests.get(url, headers=headers, params=params)
    json_data = response.json()
    # log_response / LOG_TYPE_PEXEL are logging helpers from the repo
    log_response(LOG_TYPE_PEXEL, query_string, json_data)

    return json_data


def getBestVideo(query_string, orientation_landscape=True, used_vids=None):
    used_vids = used_vids or []  # avoid a mutable default argument
    vids = search_videos(query_string, orientation_landscape)
    videos = vids['videos']  # extract the videos list from the JSON response

    # Keep only videos at full HD or above with an exact 16:9 (or 9:16) aspect ratio
    if orientation_landscape:
        filtered_videos = [video for video in videos if video['width'] >= 1920 and video['height'] >= 1080 and video['width'] / video['height'] == 16 / 9]
    else:
        filtered_videos = [video for video in videos if video['width'] >= 1080 and video['height'] >= 1920 and video['height'] / video['width'] == 16 / 9]

    # Sort by how close each video's duration is to 15 seconds
    sorted_videos = sorted(filtered_videos, key=lambda x: abs(15 - int(x['duration'])))

    # Return the first file at exactly 1920x1080 (or 1080x1920) that hasn't been used yet
    for video in sorted_videos:
        for video_file in video['video_files']:
            if orientation_landscape:
                if video_file['width'] == 1920 and video_file['height'] == 1080:
                    if not (video_file['link'].split('.hd')[0] in used_vids):
                        return video_file['link']
            else:
                if video_file['width'] == 1080 and video_file['height'] == 1920:
                    if not (video_file['link'].split('.hd')[0] in used_vids):
                        return video_file['link']
    print("NO LINKS found for this round of search with query :", query_string)
    return None

For each time segment we try its 3 generated keywords one by one with the getBestVideo function. It filters the results down to videos in our expected resolution (1920x1080 for landscape, 1080x1920 for portrait), prefers the video whose duration is closest to 15 seconds, and returns None if it is unable to find a suitable video.
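That per-segment fallback can be a loop as simple as the sketch below. pick_video is a hypothetical helper, and the used_links list mirrors the used_vids parameter of getBestVideo so the same clip isn't reused across segments.

def pick_video(keywords, used_links, orientation_landscape=True):
    # Try each of the segment's three keywords until one yields a video
    for query in keywords:
        url = getBestVideo(query, orientation_landscape, used_vids=used_links)
        if url:
            # Remember the base link so the same clip isn't picked again
            used_links.append(url.split('.hd')[0])
            return url
    return None  # no suitable clip found for this segment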

Merging videos, audio, and captions with MoviePy

Now we have everything we need: audio, captions, and relevant videos for the captions. We stitch all of these elements together into a final video using the MoviePy library, as shown in the code below.

import os
import platform
import subprocess
import tempfile
import requests
from moviepy.editor import (AudioFileClip, CompositeAudioClip,
                            CompositeVideoClip, TextClip, VideoFileClip)

def download_file(url, filename):
    with open(filename, 'wb') as f:
        response = requests.get(url)
        f.write(response.content)

def search_program(program_name):
    try:
        search_cmd = "where" if platform.system() == "Windows" else "which"
        return subprocess.check_output([search_cmd, program_name]).decode().strip()
    except subprocess.CalledProcessError:
        return None

def get_program_path(program_name):
    return search_program(program_name)

def get_output_media(audio_file_path, timed_captions, background_video_data, video_server):
    OUTPUT_FILE_NAME = "rendered_video.mp4"

    # Point MoviePy at ImageMagick, falling back to the common Linux path
    magick_path = get_program_path("magick")
    print(magick_path)
    if magick_path:
        os.environ['IMAGEMAGICK_BINARY'] = magick_path
    else:
        os.environ['IMAGEMAGICK_BINARY'] = '/usr/bin/convert'

    visual_clips = []
    downloaded_files = []  # remember temp files so we can delete them later
    for (t1, t2), video_url in background_video_data:
        # Download the background video to a temporary file
        video_filename = tempfile.NamedTemporaryFile(delete=False).name
        download_file(video_url, video_filename)
        downloaded_files.append(video_filename)

        # Create a VideoFileClip placed at its segment's time range
        video_clip = VideoFileClip(video_filename)
        video_clip = video_clip.set_start(t1)
        video_clip = video_clip.set_end(t2)
        visual_clips.append(video_clip)

    audio_clips = []
    audio_file_clip = AudioFileClip(audio_file_path)
    audio_clips.append(audio_file_clip)

    # Overlay each caption at its time range, centered near the bottom
    for (t1, t2), text in timed_captions:
        text_clip = TextClip(txt=text, fontsize=100, color="white", stroke_width=3, stroke_color="black", method="label")
        text_clip = text_clip.set_start(t1)
        text_clip = text_clip.set_end(t2)
        text_clip = text_clip.set_position(["center", 800])
        visual_clips.append(text_clip)

    video = CompositeVideoClip(visual_clips)

    if audio_clips:
        audio = CompositeAudioClip(audio_clips)
        video.duration = audio.duration
        video.audio = audio

    video.write_videofile(OUTPUT_FILE_NAME, codec='libx264', audio_codec='aac', fps=25, preset='veryfast')

    # Clean up the downloaded temporary files
    for video_filename in downloaded_files:
        os.remove(video_filename)

    return OUTPUT_FILE_NAME

ImageMagick is needed to overlay captions in our preferred style on top of the video via MoviePy's TextClip. The final output is now ready and stored in a file named rendered_video.mp4.
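Putting it all together, a call shaped like this produces the final file; the caption timings and the Pexels URL here are placeholder values for illustration.

timed_captions = [((0.0, 1.4), "Weird facts"), ((1.4, 3.0), "you don't know:")]
background_video_data = [
    # Placeholder URL; in practice this comes from getBestVideo
    ((0.0, 3.0), "https://videos.pexels.com/video-files/placeholder_1920_1080_25fps.mp4"),
]
print(get_output_media("audio_tts.wav", timed_captions,
                       background_video_data, "pexel"))
# -> rendered_video.mp4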

Here is a simple-to-run Colab notebook along with a demo video if you prefer not to install everything manually on your system.

If you wish to dive deep into the code or prefer running it locally, here is the link to the GitHub repo.

If you wish to skip the coding and use a no-code tool, check out text to video ai.
