A must-have for streamers and those constantly on video calls.

Creating an automated pipeline for transcribing videos using OBS and AWS.

Leveraging AWS Lambda and Amazon Transcribe for your streams, meetings, and calls so you can focus on what matters!

Nicholaus Lawson
Towards AWS

Header image of a robot using the phone and writing
Image generated by Bing Chat

The problem

Recently, as my streaming and personal video calls increased, I needed a way to get transcriptions of my calls so I could refer back to the conversations. In the past, I would play stenographer during a stream or call. I wasn’t as present in the discussion, and I missed the chance to note the important things, like follow-up items or my own impressions and insights, because I was too busy trying to capture everything that was said (even in shorthand or sparse notes).

So, of course, I built a short pipeline: OBS records the calls (from any platform: Twitch, Zoom, Google Voice, etc.), the recordings go to Amazon Transcribe to produce a raw transcription, and a Lambda function then turns that into a pretty version that I can easily read back through.

I wanted to document the architecture and the resources I used to accomplish this.

Ultimately, I wrote very little code (it was just modifying some GitHub code I found) and only spent around 2 hours (including testing) to get this all set up and functional.

OBS (Open Broadcaster Software)

First, let’s talk about OBS. I use OBS for all my streaming to Twitch and most presentations because I like having extra control over what I share. For example, presentations look so much better when I can share the OBS presentation window and swap out what windows I want to bring into view without showing off my entire screen or desktop.

The OBS setup is pretty simple. First, I add a window capture source, remembering to set it to the correct app. Then I add my main speaker and my mic to the Audio Mixer, and that’s it. Now I can hit the ‘Start Recording’ button and wait for my call to wrap up.

Screenshot of OBS showing the callcapture window and Audio Mixer
Image 1 — screenshot of OBS showing the window capture source and the audio mixer.

AWS

Next is the fun part: configuring AWS to do its magic. The architecture for this is straightforward. Let’s look at a diagram, then walk through each item one at a time.

Architecture diagram of the automated transcription process. Flowing from left to right, it starts with an S3 bucket that houses the recorded calls, which flows to a Lambda function that kicks off an Amazon Transcribe job, which drops a file into S3. That triggers another Lambda function that makes the transcription human-friendly and drops the formatted doc back into S3.
Image 2 — architecture of the automated transcription process

S3 setup

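First up, we need an S3 bucket to keep our videos. Make sure to create a prefix for them to live under so we don’t run into recursion issues when we set up the Lambdas later: every file the pipeline writes back to the bucket is also a creation event, so each trigger has to be scoped to its own prefix. I’m using a /videos/ prefix.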

Lambda Function — triggerTranscribe

After we have a place to store our videos, we have to create a new Lambda function that will kick off our Amazon Transcribe job.

Image 3 — screenshot of the triggerTranscribe Lambda function showing the S3 trigger

We want to trigger this Lambda function on an S3 object creation event, so the transcription job starts automatically whenever we drop a new video into S3.

Screenshot of the trigger on the Lambda function
Image 4 — screenshot of the S3 trigger configuration for the triggerTranscribe Lambda function
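To make the trigger concrete, here is a minimal sketch of how a handler might read the S3 creation event. It is not the code from the linked repo; the function names and the VIDEO_PREFIX constant are my own assumptions for illustration. The prefix check is a second line of defense against recursion in case the trigger filter is ever misconfigured.

```python
from urllib.parse import unquote_plus

VIDEO_PREFIX = "videos/"  # assumed prefix, matching the one used in this article


def extract_new_videos(event):
    """Return (bucket, key) pairs for newly created objects under VIDEO_PREFIX."""
    uploads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        # Guard against recursion: ignore anything outside the video prefix.
        if key.startswith(VIDEO_PREFIX):
            uploads.append((bucket, key))
    return uploads


def lambda_handler(event, context):
    uploads = extract_new_videos(event)
    for bucket, key in uploads:
        print(f"New recording: s3://{bucket}/{key}")
    return {"videos": [key for _, key in uploads]}
```

In a real function, the loop body is where the transcription job would be started for each new recording.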

The code for this function is minimal, and I found it in a GitHub repo:
https://github.com/ACloudGuru-Resources/aws-lambda-s3-transcribe/blob/main/lab-1-lambda-code.py

I had to add settings to enable speaker identification and to specify that the transcription should be written to my own S3 bucket rather than the Transcribe-owned S3 bucket. The output goes to the same bucket the videos are stored in, just under a different prefix, /transcriptions/.
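As a rough illustration of those two changes, the sketch below builds the arguments for Transcribe's StartTranscriptionJob call with speaker labels turned on and the output pointed at our own bucket. This is my own sketch rather than the repo's code: the helper names, the en-US language code, the two-speaker default, and the job-naming scheme are all assumptions.

```python
import os
import re


def build_transcribe_request(bucket, key, max_speakers=2):
    """Build keyword arguments for Transcribe's StartTranscriptionJob call."""
    base_name = os.path.splitext(os.path.basename(key))[0]
    # Job names may only contain letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "_", base_name)
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "LanguageCode": "en-US",  # assumed language
        # Speaker identification: MaxSpeakerLabels is required when labels are on.
        "Settings": {"ShowSpeakerLabels": True, "MaxSpeakerLabels": max_speakers},
        # Write the result to our own bucket instead of a Transcribe-owned one.
        "OutputBucketName": bucket,
        "OutputKey": f"transcriptions/{job_name}.json",
    }


def start_job(bucket, key):
    import boto3  # imported lazily; bundled with the Lambda Python runtime

    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(**build_transcribe_request(bucket, key))
```

Job names must be unique within an account, so a real function would probably add a timestamp or request ID to the name. The function's execution role also needs permission to start Transcribe jobs and to read from and write to the bucket.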

Lambda function — transcriptionFormat

The last piece is a Lambda function to take this transcription and format it into something easier for a human to read.

screen shot of the transcription formatter lambda that will take the raw transcription and format it into something more human friendly
Image 5 — screenshot of the transcriptionFormatterLambda function showing the S3 trigger

The S3 trigger is on a different prefix since our Transcribe job drops everything into /transcriptions/.

screenshot of the s3 trigger on the transcription formatter lambda function
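I set both triggers up in the console, but the same idea can be sketched in code. The helper below builds a notification configuration in the shape the S3 API expects, with one prefix filter per Lambda function. The function name and the ARNs passed to it are hypothetical placeholders.

```python
def build_notification_config(transcribe_fn_arn, formatter_fn_arn):
    """Build an S3 notification configuration with one prefix filter per function."""

    def rule(fn_arn, prefix):
        return {
            "LambdaFunctionArn": fn_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}},
        }

    return {
        "LambdaFunctionConfigurations": [
            # New recordings start a transcription job.
            rule(transcribe_fn_arn, "videos/"),
            # Raw Transcribe output triggers the formatter.
            rule(formatter_fn_arn, "transcriptions/"),
        ]
    }
```

The resulting dictionary could be passed as NotificationConfiguration to the boto3 S3 client's put_bucket_notification_configuration call, provided S3 has already been granted permission to invoke both functions. Because the formatter is triggered by everything under its prefix, its own output should be written somewhere else, or it would trigger itself.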

The code for this was also found on GitHub and required minimal changes: https://github.com/trhr/aws-transcribe-transcript/blob/master/lambda_handler.py

I changed the output location and removed the timestamps from the output file. I found them excessive, and they didn’t add anything when reading through.

The raw transcription output is JSON. It contains the complete transcription as one block of text without any speaker information, plus a word-by-word breakdown, which is the level at which the speakers are split out.

{
  "jobName": "my-first-transcription-job",
  "accountId": "111122223333",
  "results": {
    "transcripts": [
      {
        "transcript": "Welcome to Amazon Transcribe."
      }
    ],
    "items": [
      {
        "start_time": "0.64",
        "end_time": "1.09",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Welcome"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.09",
        "end_time": "1.21",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "to"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.21",
        "end_time": "1.74",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Amazon"
          }
        ],
        "type": "pronunciation"
      },
      {
        "start_time": "1.74",
        "end_time": "2.56",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Transcribe"
          }
        ],
        "type": "pronunciation"
      },
      {
        "alternatives": [
          {
            "confidence": "0.0",
            "content": "."
          }
        ],
        "type": "punctuation"
      }
    ]
  },
  "status": "COMPLETED"
}

The Lambda function turns this into something much easier to read:

spk_1: Hey, good afternoon

spk_0: Hey there. How are you doing?

spk_1: Pretty good, just working on this cool AWS project.

spk_0: Oh nice, I can't wait to hear about it.
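The core of that transformation can be sketched in a few lines. This is a simplified illustration and not the code from the repo linked above. It assumes the job ran with speaker identification enabled, in which case the JSON also carries a results.speaker_labels section whose segments list each word's start time and speaker. The function name is my own.

```python
def format_transcript(raw):
    """Turn Transcribe JSON (with speaker labels) into 'speaker: text' lines."""
    results = raw["results"]

    # Map each word's start time to the speaker who said it.
    speaker_at = {}
    for segment in results.get("speaker_labels", {}).get("segments", []):
        for word in segment["items"]:
            speaker_at[word["start_time"]] = word["speaker_label"]

    turns = []  # [speaker, text] pairs, in spoken order
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation has no timestamps; attach it to the preceding word.
            if turns:
                turns[-1][1] += content
            continue
        speaker = speaker_at.get(item["start_time"], "unknown")
        if turns and turns[-1][0] == speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1][1] += " " + content
        else:
            turns.append([speaker, content])

    return "\n\n".join(f"{speaker}: {text}" for speaker, text in turns)
```

Timestamps are only used to look up the speaker and never make it into the output, which matches my preference for a transcript without them.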

The End?

That’s it. With very little time invested, I now have a way to get a log of all my recorded calls, no matter what platform the call was on. This helps me focus on making meaningful notes during the call without worrying about missing something that was said. If ever in doubt, I can pull up these logs and skim them for anything I missed.

What more could be done with this? What about piping this into Amazon Comprehend to identify important details like dates, to ensure I didn’t miss any follow-up meetings or deliverables promised? Would it be helpful to know what the sentiment of the conversation was?

What about storing these in Amazon OpenSearch Service or Amazon Kendra so I could easily search through these conversations for topics or people?

I am happy with my new system for now, but I will keep thinking of new and fun ways to expand it.

Currently @ AWS. Engineer/Solution Architect, tinkerer, woodworker, part-time gamer/streamer. Opinions are all mine, not shared by my employer (or anyone else).