UNVEILING THE TECH SALSA OF
LAMS WITH JANUS IN
REAL-TIME APPLICATIONS
@agonza1_io
Alberto Gonzalez Trastoy
WebRTC.ventures
2
WebRTC.ventures 2024
It all starts with
this CES 2024
gadget
presentation
3
WebRTC.ventures 2024
My Discovery of LAMs, a new AI term
What are LAMs?
• Combine symbolic reasoning
with neural networks.
• Directly model application
actions. They learn by observing
human interactions.
• Understand language like LLMs
but also translate it into
concrete actions (e.g.: UI
actions).
GIF source: Mind2Web (osu-nlp-group.github.io)
If you don’t like new marketing terms, you can just call them LLMs that perform
actions
VERTICALS: LAM
USE CASES
How They Will
Unlock Value
Across Industries
4
WebRTC.ventures 2024
5
WebRTC.ventures 2024
Image source: DALL·E 3
A “Get ready for my trip” action could include searching email and calendar for the
flight information, checking into the flight, and booking a ride to the airport
(cross-checking ride-sharing apps).
Use Cases: Automated Trip Preparation
Note: WebRTC is well suited to be a tool
to provide real-time feedback to
humans in this type of automation
6
WebRTC.ventures 2024
Image source: DALL·E 3
FAQs in the
past
Use Cases: Customer Service Bots
FAQs
today/soon
FAQs in the
future
In customer service scenarios, a bot can help users or agents perform actions. It could handle a wide
range of tasks such as managing cloud services, updating account information, generating
video documentation, or troubleshooting issues. This would reduce the workload on humans and provide
faster results.
7
WebRTC.ventures 2024
Image source: DALL·E 3
“Automated Appointment Scheduling”: managing appointments can be time-
consuming, so a bot could schedule appointments and send reminders.
Could offer a “Quick Tax Filing” feature, retrieving financial data, filling in tax
forms, and submitting the return, streamlining the tax filing process for the user.
Could assist traders by automating the process of “Preparing for Market Open.”
This could involve aggregating news articles, social media, and pre-market
trading data.
“Automated Form Testing” could involve the LAM filling out web forms with
various inputs to test validation rules, error messages, and submission processes.
…
Other Use Cases: Scheduling, Filling Out
Forms, Testing, Trading, and More…
JANUS + AI
How To Integrate
Janus with LLMs
8
WebRTC.ventures 2024
9
WebRTC.ventures 2024
RTP Forwarding
• Unidirectional forwarding of WebRTC media (RTP/RTCP) to specific UDP ports
• Available in video room plugin or using RTP forward plugin independently
• UDP broadcast/multicast support
• Easiest to integrate with FFmpeg or a GStreamer rtpbin
WHEP (WebRTC-HTTP Egress Protocol)
• WHEP player communicates with a WHEP endpoint to receive unidirectional media
from the media server
• Available in video room plugin
WebRTC clients
• Bidirectional option that can be used with any plugin.
• Some examples:
• Pion (Go)
• Aiortc (Python) (see the sketch below)
How Can We Extract Janus Real Time Media
Server Side
Repos: https://github.com/michaelfranzl/janus-rtpforward-plugin, https://github.com/meetecho/janus-gateway and
https://github.com/meetecho/simple-whep-client
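For the WebRTC-client option, here is a minimal aiortc sketch. It is hedged: the Janus VideoRoom signaling that produces the subscriber SDP offer is elided, and the recording filename is just an example sink.

```python
# Sketch only: assumes the Janus VideoRoom signaling (create session/handle,
# join as subscriber) has already happened and produced `offer_sdp`.
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaRecorder

async def answer_janus_offer(offer_sdp: str) -> str:
    pc = RTCPeerConnection()
    recorder = MediaRecorder("remote_audio.wav")  # or a custom sink feeding STT

    @pc.on("track")
    def on_track(track):
        if track.kind == "audio":
            recorder.addTrack(track)  # capture the participant's audio track

    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    await pc.setLocalDescription(await pc.createAnswer())
    await recorder.start()
    return pc.localDescription.sdp  # send this answer back to Janus
```

The same peer connection could also send audio back, which is what makes this the bidirectional option compared to RTP forwarding and WHEP.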
10
WebRTC.ventures 2024
We Got The Media, Now, How Do We Want To
Interact And Get Feedback From The LLM?
1. Typed with text feedback
2. Spoken with text feedback
3. Spoken with voice feedback
Even images or video instead of audio?
When WebRTC makes more sense to be involved
11
WebRTC.ventures 2024
An Architecture Alternative for capturing audio
and interacting with LLMs
The most common approach is capturing audio client side
(simplified)
12
WebRTC.ventures 2024
And That’s How We Did It! Using a server-side
LLM in Janus-based 1-to-1 audio calls
An Agent Assist / Real Time Copilot for a Call Center
This image is not our original project but is a basic representation of the use case through a demo we
developed.
Note: In 2023 we developed our first production application combining LLMs with RAG and Janus
13
WebRTC.ventures 2024
1. Manual request done by the agent.
When To Prompt When Building an Agent-Assist-Like
Solution?
2. Using real-time topic or question detection (sketched after this slide). This is typically powered by
TSLMs (Task-Specific Language Models), which can generate topics based on the context of the
language content in the transcript.
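As a stand-in for a real TSLM, the trigger can be sketched as a simple heuristic over incoming transcript segments; the cue list below is purely illustrative.

```python
import re

# Illustrative cues only; a TSLM would detect topics/questions from context instead.
QUESTION_CUES = re.compile(r"\b(how do i|what is|why|can i|could you|is it possible)\b", re.I)

def should_prompt(segment: str) -> bool:
    """Decide whether a transcript segment should trigger the agent-assist prompt."""
    text = segment.strip()
    return text.endswith("?") or bool(QUESTION_CUES.search(text))

# should_prompt("How do I reset my router")  -> True
# should_prompt("Thanks, that worked")       -> False
```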
14
WebRTC.ventures 2024
• Architecture Considerations: If more than one participant interacts with a
bot/agent, we can’t handle everything client-side.
• Latency: Run server-side STT and LLM operations near the media server for reduced delay.
We are seeing latencies above 1 s for the first character of an LLM response in voice
conversations.
• Audio Quality: Most STT models assume clear, high-quality audio capture,
which is the opposite of what WebRTC optimizes for.
• Audio Format: PCM audio is usually required by most ASRs (transcoding may be
needed if using Opus); see the resampling sketch after this list.
• LLM Use and Data Flow: Ideally, everything should run on your own servers, but it is
expensive and not trivial to run an optimal LLM API server today; for text, a hosted API might
be an acceptable compromise.
Other Considerations for RTC-STT-LLM
Integrations
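A minimal sketch of the audio-format point, assuming aiortc/PyAV frames on the receiving side (exact PyAV behavior varies by version; recent releases return a list of frames from resample()):

```python
import av

# Most ASR engines want 16 kHz, mono, 16-bit PCM; WebRTC delivers 48 kHz Opus,
# which aiortc already decodes into av.AudioFrame objects.
resampler = av.AudioResampler(format="s16", layout="mono", rate=16000)

def frame_to_pcm(frame: av.AudioFrame) -> bytes:
    """Convert one decoded WebRTC audio frame into 16 kHz mono PCM bytes."""
    out = resampler.resample(frame)
    frames = out if isinstance(out, list) else [out]  # older PyAV returns one frame
    return b"".join(f.to_ndarray().tobytes() for f in frames)
```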
TRYING LAMS
How To Integrate
Janus with LAMs
15
WebRTC.ventures 2024
16
WebRTC.ventures 2024
Ingredients:
• Mind2Web: A dataset for developing and
evaluating generalist agents for the web
• An LMM (Large Multimodal Model) that
combines NLU with computer vision:
• LLaVA (open source)
• GPT-4 Vision
• A headless browser to perform the actions
• App logic to manage the operations:
SeeAct, a generalist web agent that
autonomously carries out tasks.
How To Perform Browser Actions
Image source: Mind2Web Dataset: https://osu-nlp-group.github.io/Mind2Web/ Repo Source: OSU-NLP-Group/SeeAct: SeeAct is a system for generalist web
agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4V(ision). (github.com)
17
WebRTC.ventures 2024
Steps:
1) Action definition including website and task
2) Playwright headless browser (open site)
3) Get interactive HTML elements list
4) Find top candidates from the list using a Cross-Encoder to
compare elements to the action (limiting the list of HTML elements)
5) Screenshot of the page for element identification
6) LLM inference:
6.1) Using GPT Vision to extract current
page information
6.2) Using GPT to obtain the action (e.g., CLICK button or
TYPE “abc”) and programmatic grounding
(mapping supported actions to HTML elements, e.g.,
CLICK <a>)
7) Browser action with Playwright (see the sketch after this slide)
How To Perform Browser Actions
Initiated Task: Find
“ABC” blog post
Website: Google’s
homepage.
Vision Analysis: Google
search bar ready for query
input.
Next Step Action: Type
“ABC” into the identified
search bar HTML element
Human Feedback: Continue
or stop operation.
Example
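A hedged sketch of steps 2 through 7, not the actual SeeAct code: the model name, prompt wording, and the crude element filter are assumptions, and the real pipeline splits description and grounding into separate prompts.

```python
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def next_action(url: str, task: str) -> str:
    """Open a page, screenshot it, and ask a vision model for the next action."""
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(url)
        # Step 3/4: SeeAct narrows this candidate list with a Cross-Encoder.
        elements = [e.evaluate("el => el.outerHTML")[:120]
                    for e in page.query_selector_all("a, button, input, select")][:20]
        shot = base64.b64encode(page.screenshot()).decode()
    # Step 6: a single combined prompt here, for brevity.
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption; the deck used GPT-4V(ision)
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Task: {task}\nCandidate elements:\n" + "\n".join(elements)
                         + '\nReply with one action, e.g. CLICK <a> or TYPE "abc".'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{shot}"}},
            ],
        }],
    )
    return resp.choices[0].message.content  # step 7 would execute this via Playwright
```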
18
WebRTC.ventures 2024
A LAM/LMM Flow Diagram for the WebRTC Demo
19
WebRTC.ventures 2024
A LAM/LMM High Level Architecture for a
WebRTC Application
20
WebRTC.ventures 2024
• Videochat Web App: agonza1/reunitus
• WebRTC Media Server: Janus
• WebRTC Client: Aiortc
• Speech to Text: RealTimeSTT based on faster-whisper (the base model runs on
CPU too); see the sketch below
• Multimodal LLM: GPT-4V
• Browser Action Core Logic: SeeAct
R-SeeAct - Tech Stack
Source code: agonza1/R-SeeAct and agonza1/reunitus at seeact-bot-integration
(github.com)
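The STT piece can be sketched directly with faster-whisper (file-based here for brevity; RealTimeSTT wraps the same model for streaming). The file name and sample output are illustrative.

```python
from faster_whisper import WhisperModel

# "base" is small enough to run on CPU, as noted above; int8 keeps memory low.
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(wav_path: str) -> str:
    segments, _info = model.transcribe(wav_path, language="en")
    return " ".join(seg.text.strip() for seg in segments)

# e.g. transcribe("remote_audio.wav") -> "find the ABC blog post"
```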
21
WebRTC.ventures 2024
Demo
Image source: DALL·E 3
CHALLENGES AND
OPPORTUNITIES
Experiences incorporating
real-time LLMs
22
WebRTC.ventures 2024
23
WebRTC.ventures 2024
Latency
15+
Seconds!
24
WebRTC.ventures 2024
Main Bottleneck
Image source: DALL·E 3
Prompt 2 cannot be sent until the initial image
LLM inference completes.
Potential Solutions:
1) Reduce the size of the response for each step (decreases
quality)
2) Use agents preloaded with some of the initially required context
3) Another LLM with lower latency
4) Caching
25
WebRTC.ventures 2024
Resources
[Bar chart] CPU Server Usage for WebRTC TTS LLM Demo Across Different Server Sizes
Instances compared: t2.large (2 cores, 8 GB RAM), t2.xlarge (4 cores, 16 GB RAM),
c5.2xlarge (8 vCPU / 4 cores, 16 GB RAM), c5.4xlarge (16 vCPU / 8 cores, 32 GB RAM)
Series: Transcription CPU usage, Action CPU usage, Janus CPU usage for 3 participants
26
WebRTC.ventures 2024
Cost
1. Transcription:
• Third-party service: approximately $0.02/min, or
• Your own NVIDIA server: starts at ~$0.006/min
2. Multimodal GPT-4V requests: ~$0.01 per analyzed browser image
3. GPT-4 action/context prompts: ~900 input tokens, which is ~$0.01
4. GPT-4 action response: ~300 output tokens, which is ~$0.01
5. WebRTC media server and headless service action costs disregarded
Cost per full task/request: ~$0.30* (see the quick check below)
*Includes 1 min of transcription + 10 image analyses + 10 prompts
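A quick back-of-the-envelope check of that figure, using only the per-item rates above and the 1 minute / 10 images / 10 prompts assumption:

```python
transcription    = 1 * 0.02   # 1 min of third-party transcription
image_analysis   = 10 * 0.01  # 10 analyzed browser screenshots
action_prompts   = 10 * 0.01  # ~900 input tokens each
action_responses = 10 * 0.01  # ~300 output tokens each
total = transcription + image_analysis + action_prompts + action_responses
print(f"${total:.2f} per full task")  # ≈ $0.32, i.e. roughly the ~$0.30 quoted
```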
CONCLUSIONS
AND FUTURE
What’s next?
27
WebRTC.ventures 2024
28
WebRTC.ventures 2024
Next Project Steps
Short term:
- Speech to text on GPU with CUDA support!
- Display of browser actions in real time
Long term:
- Improve by applying partial results to the query (send the prompt before the full response arrives)
- Use future ChatGPT enhancements like storing context of previous queries (stateful prompts)
- Alternatives using self-hosted LLM servers (LLaVA) or leveraging other existing services with 10x
faster inference
- Implement something like GPTCache for frequent operations (a minimal caching sketch follows)
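A minimal sketch of that caching idea: a hand-rolled exact-match memo. GPTCache adds semantic (embedding-based) matching on top of the same pattern.

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm: Callable[[str], str]) -> str:
    """Return a cached LLM response for identical prompts; call the LLM on a miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]
```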
29
WebRTC.ventures 2024
Conclusion
WebRTC is well suited to be a
tool to provide real-time
feedback to humans in this
type of interaction.
New tech = opportunities
that will let us have better
experiences and make
interactions more inclusive.
Too Early for LAMs in RTC
The multi-step process is what
makes this unusable in RTC
apps.
We are seeing RTC–LLM
integrations implemented
internally first; in call centers
this is referred to as call assist.
THANK YOU
Alberto Gonzalez Trastoy @lbertogon
webrtc.ventures
Project: agonza1/R-SeeAct
Editor's Notes
  1. Thrilled to be here today presenting at JanusCon. Before we start, I wanted to share a picture of my wife and me dancing salsa. Just like in salsa dancing, action-response coordination is important in what we're talking about today. We'll see how WebRTC and LLMs working together could revolutionize RTC. Ready to dive in? Let’s go!
  2. The R1 is a small, mobile-like device built to interact with LLMs in real time: camera and mic for input, plus a touch screen. What caught my attention is the under-the-hood technology allowing it to play your Spotify account or book a flight… and this seems to be handled by something called a LAM (Large Action Model).
  3. Symbolic reasoning: a <button> element is a button. Neural networks: HTML elements with `click`, `start` or `push` are typically buttons. You can think of LAMs as an “action-GPT” or “LLM agents”. Large Action Models is not an agreed-upon, industry-wide term, but it is generating a great deal of interest. A concrete example: a user could create an “RSVP to a wedding” workflow, which might include finding the email invite, RSVPing yes with a note, checking the registry for purchasable items, and buying the gift. This is not new; it has been around for 1-2 years.
  4. Or Imagine a scenario where a user wants to book a hotel room. Instead of having to navigate through multiple websites or apps, the user can simply interact with the bot and provide travel preferences. The bot can then search for available options, compare prices, and make bookings on behalf of the user in real-time.
  5. In customer service scenarios, a bot that can perform actions can handle a wide range of tasks such as processing returns, issuing refunds, updating account information, or troubleshooting technical issues. This would reduce the workload on human agents and provide faster resolutions for customers. Note: At this point, user feedback might be required.
  6. Personal Shopping Assistant: In the realm of online shopping, a bot that can perform actions can act as a personal shopping assistant. Users can specify their preferences and requirements for certain products, and the bot can automatically search for the best deals, apply discounts or coupons, and even complete the purchase process seamlessly. Smart Home Automation (SMART HOME DEVICES): For homeowners with smart home devices, a bot that can control these devices based on voice commands or predefined preferences would be invaluable. Users can ask the bot to adjust the thermostat, turn on lights…saving time and enhancing convenience. Workflow Automation: In a business context, a bot that can perform actions can automate various workflows and processes, such as data entry, report generation, or task assignment. This would streamline operations, increase efficiency, and free up human employees to focus on more strategic tasks.
  7. For bidirectional communication we could just use WebRTC. Spin up a headless browser and connect to Janus as a standard WebRTC client
  8. Speaking requires less cognitive effort than typing, especially for complex queries, and is more human-like. The middle point (option 2) seemed a good compromise for the demo, although option 3 wouldn’t have been complicated to do and makes sense in video calls.
  9. Websockets can be integrated with other services (e.g., speech-to-text APIs, natural language processing). Most AI SaaS platforms have selected this as the preferred protocol, and implementing websockets is relatively straightforward. Usage of speech to text has the benefit of providing a more human-like experience.
     graph LR
         A[Audio Capture] -- Websockets --> B(Speech to Text)
         B --> C{LLM Prompting}
         C --> D(Text Response)
         D -- Websockets --> F{Client}
         F -- Audio --> A
  10. Usage of RAG (Retrieval-Augmented Generation) is key to making this agent assist functionality valuable. It is a technique that lets you add a knowledge base to the LLM, providing more specific answers.
  11. As mentioned before, LAMs are just LLMs that perform actions going beyond text or audio transcription. The ingredients for the hack I did were: 1) the Mind2Web dataset (trained to perform actions on hundreds of sites), 2) a Large Multimodal Model to capture images and understand them, 3) a headless browser to interact with the browser, and 4) app logic code to translate described actions into the actual browser action: SeeAct (open source).
  12. Cross-Encoder is a type of neural network architecture used in natural language processing tasks, in the context of text pair classification. Its purpose is to evaluate and provide a single similarity score for a pair of input sentences
  13. So all put together we have the WebRTC client… This is what I will demonstrate in the demo now.
      graph LR
          subgraph Server
              H <--WebRTC--> M
              M(WebRTC Client) -- Audio --> I{User Speaking?}
              I --|Yes / Audio|--> J(Speech to Text)
              I --|No|--> I
              J --|Text| --> P(SeeAct)
              P --> Q(Headless Browser)
              P --> K{LLM Prompting}
              L --|WS or SSE| --> P(SeeAct)
              Q --|pass image|--> K
          end
          O --> L(Text Response)
          K --|API Call| --> O{3rd Party LLM API}
          G[User] <-- WebRTC --> H(Janus)
          Q --|opens and performs actions|--> R(website)
      graph LR
          subgraph Server
              H <--WebRTC--> M
              M(WebRTC Client) -- Audio --> I{User Speaking?}
              I -- Yes / |Audio| --> J(Speech to Text)
              J --|Text| --> K{LLM Prompting}
              L --|WS or SSE| --> M
              L --|Wait| --> N[Wait State]
              N --> I
          end
          O --> L(Text Response)
          K --|API Call| --> O{3rd Party API}
          G[User] <-- WebRTC --> H(Janus)
  14. Particularly the specific frameworks and tools used are…
  15. Launch Janus and connect to video room Initialize R-SeeAct service Visualize Browser Actions
  16. Transcription: Headless Service Actions: WebRTC Media Server Latencies <300ms So Can Be Disregarded
      sequenceDiagram
          participant MediaServer
          participant WebRTC_Bot
          participant Whisper
          participant GPT4_Turbo
          participant HeadlessBrowser
          MediaServer->>WebRTC_Bot: Send audio (50ms)
          WebRTC_Bot->>Whisper: Capture and send audio
          Whisper-->>WebRTC_Bot: Transcribe audio (1000ms)
          WebRTC_Bot->>HeadlessBrowser: Initialize and gather HTML elements
          WebRTC_Bot->>GPT4_Turbo: Prompt with action goal, HTML elements and image
          GPT4_Turbo-->>WebRTC_Bot: Text response with image information (5000+ms)
          WebRTC_Bot->>GPT4_Turbo: Next prompt with multichoice and further context
          GPT4_Turbo-->>WebRTC_Bot: Final action response (5000+ms)
          WebRTC_Bot->>HeadlessBrowser: Execute action, e.g: click button
          HeadlessBrowser-->>WebRTC_Bot: Executed action (3000ms)