Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
- 3.
WebRTC.ventures 2024
My Discovery of LAMs, a new AI term
What are LAMs?
• Combine symbolic reasoning with neural networks.
• Directly model application actions; they learn by observing human interactions.
• Understand language like LLMs, but also translate it into concrete actions (e.g., UI actions).
Gif source: Mind2Web (osu-nlp-group.github.io)
If you don’t like new marketing terms, you can just call them LLMs that perform actions.
- 5.
Image source: DALL-E 3
A “Get ready for my trip” workflow could include searching email and calendar for the flight information, checking into the flight, and booking a ride to the airport (cross-checking ride-sharing apps).
Use Cases: Automated Trip Preparation
Note: WebRTC is well suited as a tool to provide real-time feedback to humans in this type of automation.
- 6.
Image source: DALL-E 3
Use Cases: Customer Service Bots
FAQs in the past → FAQs today/soon → FAQs in the future
In customer service scenarios, a bot can help users or agents perform actions. It could handle a wide range of tasks, such as managing cloud services, updating account information, generating video documentation, or troubleshooting issues. This would reduce the workload on humans and provide faster results.
- 7.
Image source: DALL-E 3
“Automated Appointment Scheduling”: managing appointments can be time-consuming; a bot could schedule appointments and send reminders.
Could offer a “Quick Tax Filing” feature: retrieving financial data, filling in tax forms, and submitting the return, streamlining the tax filing process for the user.
Could assist traders by automating the process of “Preparing for Market Open”, aggregating news articles, social media, and pre-market trading data.
“Automated Form Testing” could involve the LAM filling out web forms with various inputs to test validation rules, error messages, and submission processes.
…
Other Use Cases: Scheduling, Filling Out Forms, Testing, Trading, and More…
- 9.
RTP Forwarding
• Unidirectional forwarding of WebRTC media (RTP/RTCP) to specific UDP ports
• Available in video room plugin or using RTP forward plugin independently
• UDP broadcast/multicast support
• Easiest to integrate with FFmpeg or GStreamer’s rtpbin
WHEP (WebRTC-HTTP Egress Protocol)
• A WHEP player communicates with a WHEP endpoint to receive unidirectional media from the media server
• Available in video room plugin
WebRTC clients
• Bidirectional option that can be used with any plugin.
• Some examples:
• Pion (Go)
• Aiortc (Python)
How Can We Extract Janus Real-Time Media Server-Side?
Repos: https://github.com/michaelfranzl/janus-rtpforward-plugin, https://github.com/meetecho/janus-gateway and https://github.com/meetecho/simple-whep-client
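As a rough sketch of the first option, the videoroom plugin’s `rtp_forward` request can be issued over Janus’s HTTP API. A minimal Python sketch follows; the room, publisher ID, host, and port values are illustrative, and a real client must first perform the create-session and attach-plugin handshake:

```python
import json
import urllib.request
import uuid

def rtp_forward_message(room, publisher_id, host, audio_port):
    """Build a videoroom 'rtp_forward' request body.

    Values here (room, publisher, port) are illustrative; in a real client
    this message is POSTed to /janus/{session_id}/{handle_id} after the
    usual create-session and attach-plugin steps.
    """
    return {
        "janus": "message",
        "transaction": uuid.uuid4().hex,
        "body": {
            "request": "rtp_forward",
            "room": room,
            "publisher_id": publisher_id,
            "host": host,            # where FFmpeg/GStreamer listens
            "audio_port": audio_port,
        },
    }

def post_message(endpoint, message):
    """POST a JSON message to a Janus HTTP endpoint and return the parsed reply."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(message).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

From there, a GStreamer or FFmpeg process listening on the forwarded UDP port can consume the RTP stream.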
- 10.
We Got The Media. Now, How Do We Want To Interact And Get Feedback From The LLM?
1. Typed with text feedback
2. Spoken with text feedback
3. Spoken with voice feedback
Even images or video instead of audio? This is when WebRTC makes more sense to be involved.
- 12.
And That’s How We Did It! Using a server-side LLM in Janus-based 1-to-1 audio calls
An Agent Assist / Real Time Copilot for a Call Center
This image is not from our original project, but is a basic representation of the use case through a demo we developed.
Note: In 2023 we developed our first production application combining LLMs with RAG and Janus
- 13.
When To Prompt When Building an Agent Assist-Like Solution?
1. Manual request done by the agent.
2. Using real-time topic or question detection. This is typically powered by TSLMs (Task-Specific Language Models), which generate topics based on the context of the language in the transcript.
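A TSLM does the real work in option 2; as a hypothetical minimal stand-in, a heuristic trigger on transcript segments might look like:

```python
import re

# Toy stand-in for a task-specific detector: flag transcript segments that
# look like customer questions, so the assist pipeline knows when to prompt.
QUESTION_PATTERNS = [
    re.compile(r"\?\s*$"),  # segment ends with a question mark
    re.compile(r"^(how|what|why|can|could|where|when|do|does|is|are)\b", re.I),
]

def should_prompt(segment: str) -> bool:
    """Return True if this transcript segment likely warrants an LLM prompt."""
    text = segment.strip()
    return any(p.search(text) for p in QUESTION_PATTERNS)
```

A real deployment would replace the regexes with a trained topic/question model, but the trigger-then-prompt control flow stays the same.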
- 14.
Other Considerations for RTC-STT-LLM Integrations
• Architecture: if more than one participant interacts with a bot/agent, we can’t handle everything client-side.
• Latency: run server-side STT and LLM operations near the media server for reduced delay. We are seeing latencies above 1 s to the first character of an LLM response in voice conversations.
• Audio quality: most STT models assume clear, high-quality audio capture, which is the opposite of what WebRTC optimizes for.
• Audio format: most ASRs require PCM audio (transcoding may be needed if using Opus).
• LLM use and data flow: ideally this should all run on your own servers, but running an optimal LLM API server is expensive and non-trivial today; for text it might be an acceptable compromise.
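On the audio-format point, here is a naive stdlib-only sketch of getting WebRTC-style 48 kHz stereo PCM down to the 16 kHz mono most ASRs expect. It assumes 16-bit interleaved samples and skips the low-pass filter; a real pipeline would decode Opus and use a proper resampler (e.g., FFmpeg):

```python
import array

def to_mono_16k(pcm: bytes, in_rate: int = 48000, out_rate: int = 16000) -> bytes:
    """Naive stereo-to-mono downmix plus decimation (no anti-alias filter).

    Assumes 16-bit interleaved stereo PCM and that in_rate is an integer
    multiple of out_rate (48 kHz -> 16 kHz here). Illustrative only.
    """
    step = in_rate // out_rate
    samples = array.array("h")
    samples.frombytes(pcm)
    # Average left/right channels into one mono stream...
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    # ...then keep every `step`-th sample to drop the rate.
    out = array.array("h", mono[::step])
    return out.tobytes()
```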
- 16.
Ingredients:
• Mind2Web: a dataset for developing and evaluating generalist agents for the web
• An LMM (Large Multimodal Model) that combines NLU with computer vision:
• LLaVA (open source)
• GPT-4 Vision
• A headless browser to perform the actions
• App logic to manage the operations: SeeAct, a generalist web agent that autonomously carries out tasks
How To Perform Browser Actions
Image source: Mind2Web Dataset: https://osu-nlp-group.github.io/Mind2Web/ Repo source: OSU-NLP-Group/SeeAct: SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision). (github.com)
- 17.
Steps:
1) Action definition, including website and task
2) Playwright headless browser (open site)
3) Get the list of interactive HTML elements
4) Find top candidates from the list using a Cross-Encoder to compare elements to the action (limiting the list of HTML elements)
5) Screenshot of the screen for element identification
6) LLM inference:
6.1) Using GPT vision to extract current site page information
6.2) Using GPT to obtain the action (e.g., CLICK button or TYPE “abc”) and programmatic grounding (connecting supported actions to HTML elements, e.g., CLICK <a>)
7) Browser action with Playwright
How To Perform Browser Actions
Example:
Initiated task: find the “ABC” blog post
Website: Google’s homepage
Vision analysis: Google search bar ready for query input
Next-step action: type “ABC” into the identified search bar HTML element
Human feedback: continue or stop the operation
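Step 6.2’s programmatic grounding can be sketched as a small parser that maps the LLM’s chosen action onto one of the candidate elements from step 4. The `CLICK [n]` / `TYPE [n] "text"` format and the `selector` field here are illustrative, not SeeAct’s actual schema:

```python
import re

# Hypothetical action format: 'CLICK [n]' or 'TYPE [n] "text"', where n
# indexes the ranked candidate-element list built in step 4.
ACTION_RE = re.compile(r'^(CLICK|TYPE)\s+\[(\d+)\](?:\s+"(.*)")?$')

def ground_action(llm_output: str, candidates: list) -> dict:
    """Translate an LLM action string into an executable browser step."""
    m = ACTION_RE.match(llm_output.strip())
    if not m:
        raise ValueError(f"unrecognized action: {llm_output!r}")
    op, idx, text = m.group(1), int(m.group(2)), m.group(3)
    element = candidates[idx]  # candidate element chosen by the LLM
    step = {"op": op, "selector": element["selector"]}
    if op == "TYPE":
        step["text"] = text or ""
    return step
```

The resulting `step` dict is what the Playwright layer (step 7) would execute, e.g., `page.click(step["selector"])`.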
- 20.
• Videochat Web App: agonza1/reunitus
• WebRTC Media Server: Janus
• WebRTC Client: Aiortc
• Speech to Text: RealTimeSTT, based on faster-whisper (the base model runs on CPU too)
• Multimodal LLM: GPT-4V
• Browser Action Core Logic: SeeAct
R-SeeAct - Tech Stack
Source code: agonza1/R-SeeAct and agonza1/reunitus at seeact-bot-integration (github.com)
- 24.
Main Bottleneck
Image source: DALL-E 3
For prompt 2 we need the completion of the initial image LLM inference.
Potential solutions:
1) Reduce the size of the response for each step (decreases quality)
2) Use agents with some of the initial required context
3) Another LLM with lower latencies
4) Caching
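Option 4 can be sketched as a response cache keyed on the task plus a fingerprint of the page state, so repeated identical steps skip the slow LLM round-trip. This is a hypothetical stand-in for something like GPTCache:

```python
import hashlib
import json

# In-memory LLM response cache: identical (task, page-state) inputs produce
# identical prompts, so their answers can be reused.
_cache = {}

def cache_key(task: str, page_elements: list) -> str:
    """Fingerprint the prompt inputs that determine the LLM's answer."""
    blob = json.dumps({"task": task, "elements": page_elements}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_llm_call(task, page_elements, llm_fn):
    """Invoke llm_fn only on a cache miss; otherwise return the stored answer."""
    key = cache_key(task, page_elements)
    if key not in _cache:
        _cache[key] = llm_fn(task, page_elements)
    return _cache[key]
```

Caching only helps for frequently repeated operations; any change to the page elements produces a new key and a fresh LLM call.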
- 26.
Cost
1. Transcription:
• Using a 3rd-party service: approximately ~$0.02/min, or
• Your own NVIDIA server: starts at ~$0.006/min
2. Multimodal GPT-4V requests: ~$0.01 per analyzed browser image
3. GPT-4 action/context prompts: ~900 input tokens, which is ~$0.01
4. GPT-4 action response: ~300 output tokens, which is ~$0.01
5. WebRTC media server and headless service action costs disregarded
Cost per full task/request: ~$0.30*
*Includes 1 min of transcription + 10 image analyses + 10 prompts
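The per-task figure follows from the unit prices above; as a quick sanity check, using the assumed mix of 1 minute of third-party transcription, 10 image analyses, and 10 prompts:

```python
# Rough cost model for one full task, using the per-unit prices listed above.
TRANSCRIPTION_PER_MIN = 0.02  # 3rd-party service
IMAGE_ANALYSIS = 0.01         # GPT-4V, per analyzed browser image
PROMPT_INPUT = 0.01           # ~900 input tokens
PROMPT_OUTPUT = 0.01          # ~300 output tokens

def task_cost(minutes=1, images=10, prompts=10):
    """Total dollar cost for one full task/request."""
    return (minutes * TRANSCRIPTION_PER_MIN
            + images * IMAGE_ANALYSIS
            + prompts * (PROMPT_INPUT + PROMPT_OUTPUT))
```

This yields about $0.32 per task, consistent with the ~$0.30 figure, with the GPT-4 prompts accounting for roughly two-thirds of it.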
- 28.
Next Project Steps
Short term:
- Speech-to-text on GPU with CUDA support!
- Display browser actions in real time
Long term:
- Improve by applying partial results to the query (send the prompt before the full response)
- Use future ChatGPT enhancements, like storing the context of previous queries (stateful prompts)
- Alternatives using self-hosted LLM servers (LLaVA) or leveraging other existing services with 10x faster inference
- Implement something like GPTCache for frequent operations
- 29.
Conclusion
WebRTC is well suited as a tool to provide real-time feedback to humans in these types of interactions.
New tech = opportunities that will let us have better experiences and make interactions more inclusive.
Too Early for LAMs in RTC
The multi-step process is what makes this unusable in RTC apps.
We are seeing RTC-LLM integrations implemented internally first; in call centers this is called call assist.
Editor's Notes
- Thrilled to be here today presenting at JanusCon.
Before we start, I wanted to share a picture of my wife and me dancing salsa. Just like in salsa dancing, action-response coordination is important in what we're talking about today. We'll see how WebRTC and LLMs working together could revolutionize RTC. Ready to dive in? Let’s go!
- The R1: a small, mobile-like device built to interact with LLMs in real time, with a camera and mic for input and a touch screen.
What caught my attention: the under-the-hood technology allowing it to play your Spotify account or book a flight… and this seems to be handled by something called a LAM (Large Action Model).
- Symbolic reasoning: a <button> element is a button.
Neural networks: HTML elements with `click`, `start` or `push` are typically buttons.
You can think of them as an “action-GPT” or “LLM agents”.
“Large Action Models” is not an agreed-upon, industry-wide term, but it is generating a great deal of interest.
Below are a few concrete examples:
- A user could create a “RSVP to a wedding” workflow which might include finding the email invite, RSVPing yes with a note, checking the registry for purchasable items and buying the gift.
THIS IS NOT NEW; it has been around for 1-2 years.
- Or Imagine a scenario where a user wants to book a hotel room. Instead of having to navigate through multiple websites or apps, the user can simply interact with the bot and provide travel preferences. The bot can then search for available options, compare prices, and make bookings on behalf of the user in real-time.
- In customer service scenarios, a bot that can perform actions can handle a wide range of tasks such as processing returns, issuing refunds, updating account information, or troubleshooting technical issues. This would reduce the workload on human agents and provide faster resolutions for customers.
Note: At this point, user feedback might be required.
- Personal Shopping Assistant: In the realm of online shopping, a bot that can perform actions can act as a personal shopping assistant. Users can specify their preferences and requirements for certain products, and the bot can automatically search for the best deals, apply discounts or coupons, and even complete the purchase process seamlessly.
Smart Home Automation (SMART HOME DEVICES): For homeowners with smart home devices, a bot that can control these devices based on voice commands or predefined preferences would be invaluable. Users can ask the bot to adjust the thermostat, turn on lights…saving time and enhancing convenience.
Workflow Automation: In a business context, a bot that can perform actions can automate various workflows and processes, such as data entry, report generation, or task assignment. This would streamline operations, increase efficiency, and free up human employees to focus on more strategic tasks.
- For bidirectional communication we could just use WebRTC: spin up a headless browser and connect to Janus as a standard WebRTC client.
- Speaking requires less cognitive effort than typing, especially for complex queries, and is more human-like.
The middle option (option 2) seemed a good compromise for the demo, although option 3 wouldn’t have been complicated to do and makes sense in video calls.
- WebSockets can be integrated with other services (e.g., speech-to-text APIs, natural language processing); most AI SaaS platforms have selected this as the preferred protocol.
Implementing WebSockets is relatively straightforward.
Using speech-to-text has the benefit of providing a more human-like experience.
graph LR
A[Audio Capture] -- Websockets --> B(Speech to Text)
B --> C{LLM Prompting}
C --> D(Text Response)
D -- Websockets --> F{Client}
F -- Audio --> A
- Using RAG (Retrieval-Augmented Generation) is key to making this agent-assist functionality valuable: a technique that lets you add a knowledge base to the LLM, providing more specific answers.
- As mentioned before, LAMs are just LLMs that perform actions beyond text or audio transcription. The ingredients for the hack I did were: 1) the Mind2Web dataset (trained to perform actions on hundreds of sites);
2) a Large Multimodal Model to capture images and understand them;
3) a headless browser to interact with the browser;
4) app logic to translate described actions into actual browser actions (SeeAct, open source).
- A Cross-Encoder is a neural network architecture used in natural language processing, in the context of text-pair classification. Its purpose is to evaluate and produce a single similarity score for a pair of input sentences.
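As a toy illustration of the role the Cross-Encoder plays in step 4, the sketch below scores (action, element) pairs and keeps the top candidates. A real pipeline would use a trained cross-encoder (e.g., from the sentence-transformers library) rather than this word-overlap stand-in:

```python
# Word-overlap stand-in for a cross-encoder similarity score: a real
# cross-encoder jointly encodes both texts and outputs a learned score.
def overlap_score(action: str, element: str) -> float:
    a, b = set(action.lower().split()), set(element.lower().split())
    return len(a & b) / max(len(a | b), 1)

def top_candidates(action: str, elements: list, k: int = 2) -> list:
    """Rank HTML element descriptions against the action; keep the best k."""
    return sorted(elements, key=lambda e: overlap_score(action, e), reverse=True)[:k]
```

Limiting the element list this way is what keeps the later LLM prompt small enough to be affordable and fast.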
- So, putting it all together, we have the WebRTC client…
This is what I will show in the demo now.
graph LR
subgraph Server
H <-- WebRTC --> M
M(WebRTC Client) -- Audio --> I{User Speaking?}
I -->|Yes / Audio| J(Speech to Text)
I -->|No| I
J -->|Text| P(SeeAct)
P --> Q(Headless Browser)
P --> K{LLM Prompting}
L -->|WS or SSE| P
Q -->|pass image| K
end
O --> L(Text Response)
K -->|API Call| O{3rd Party LLM API}
G[User] <-- WebRTC --> H(Janus)
Q -->|opens and performs actions| R(website)
graph LR
subgraph Server
H <-- WebRTC --> M
M(WebRTC Client) -- Audio --> I{User Speaking?}
I -->|Yes / Audio| J(Speech to Text)
J -->|Text| K{LLM Prompting}
L -->|WS or SSE| M
L -->|Wait| N[Wait State]
N --> I
end
O --> L(Text Response)
K -->|API Call| O{3rd Party API}
G[User] <-- WebRTC --> H(Janus)
- Particularly the specific frameworks and tools used are…
- Launch Janus and connect to video room
Initialize R-SeeAct service
Visualize Browser Actions
- Transcription:
Headless Service Actions:
WebRTC media server latencies are <300 ms, so they can be disregarded.
sequenceDiagram
participant MediaServer
participant WebRTC_Bot
participant Whisper
participant GPT4_Turbo
participant HeadlessBrowser
MediaServer->>WebRTC_Bot: Send audio (50ms)
WebRTC_Bot->>Whisper: Capture and send audio
Whisper-->>WebRTC_Bot: Transcribe audio (1000ms)
WebRTC_Bot->>HeadlessBrowser: Initialize and gather HTML elements
WebRTC_Bot->>GPT4_Turbo: Prompt with action goal, HTML elements and image
GPT4_Turbo-->>WebRTC_Bot: Text response with image information (5000+ms)
WebRTC_Bot->>GPT4_Turbo: Next prompt with multichoice and further context
GPT4_Turbo-->>WebRTC_Bot: Final action response (5000+ms)
WebRTC_Bot->>HeadlessBrowser: Execute action (e.g., click button)
HeadlessBrowser-->>WebRTC_Bot: Executed action (3000ms)
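Summing the rough per-step figures quoted in the diagram shows why the two serial GPT-4 round-trips dominate the end-to-end budget:

```python
# Per-step latency estimates (ms) taken from the sequence diagram above;
# the "5000+" entries are treated as 5000 for this back-of-the-envelope sum.
STEP_LATENCY_MS = {
    "audio_transport": 50,
    "transcription": 1000,
    "vision_prompt": 5000,   # GPT-4V page analysis
    "action_prompt": 5000,   # GPT-4 action selection
    "browser_action": 3000,
}

def total_latency_ms() -> int:
    return sum(STEP_LATENCY_MS.values())

def llm_share() -> float:
    """Fraction of the total spent waiting on the two LLM round-trips."""
    llm = STEP_LATENCY_MS["vision_prompt"] + STEP_LATENCY_MS["action_prompt"]
    return llm / total_latency_ms()
```

At roughly 14 s per action, with over 70% of it in LLM inference, this is the bottleneck the "Potential Solutions" slide is aimed at.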