Closed Bug 1893921 Opened 3 months ago Closed 24 days ago

Creation of content processes is slow on macos arm64 runners on GitHub actions

Categories

(Core :: Security: Process Sandboxing, defect, P3)

defect

Tracking


RESOLVED FIXED
129 Branch
Tracking Status
firefox129 --- fixed

People

(Reporter: alexrudenko, Assigned: haik)

References

(Blocks 1 open bug)

Details

(Whiteboard: [webdriver:m11])

Attachments

(6 files, 1 obsolete file)

See https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24242835681?pr=12318 for a test run done on the new macOS arm64 runners on GitHub Actions (https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers).

Several tests time out and in general the test run is very slow, 21 minutes for one test shard (usually up to 5 min). When run with the trace option, the logs indicate that browsingContext.create always blocks for 5 seconds: https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010

I've triggered a new job with debug enabled. Let's see if this gives some more details:

https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24363405593?pr=12318

And here is already some output:

 1714376280846	RemoteAgent	TRACE	[12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714376280846	RemoteAgent	TRACE	[12] ProgressListener Setting unload timer (5000ms)
1714376285849	RemoteAgent	DEBUG	WebDriverBiDiConnection 7462f260-895e-4517-8a97-128c1d17c323 <- {"type":"event","method":"script.realmCreated","params":{"realm":"8f9e717e-c133-42cb-9051-61e1a4366967","origin":"null","context":"0c801b4e-d1d1-455f-b76b-495f31e4e308","type":"window"}}

It looks like the hang comes from waiting for about:blank to be loaded. Sadly I do not see any update of the web progress listener status and wonder if we somehow miss loading this page.

Alex, is it easy to run a job with the MOZ_LOG="BCWebProgress:5" environment variable set?
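For reference, setting the variable amounts to exporting it in the job step's environment before Firefox is launched; a minimal shell sketch (in a GitHub Actions workflow this would typically go into the step's `env:` map instead):

```shell
# Sketch: enable the BCWebProgress logging module for the Firefox
# process started later in this step.
export MOZ_LOG="BCWebProgress:5"
echo "$MOZ_LOG"
```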

Flags: needinfo?(alexrudenko)

I believe the same log was already in the linked gist https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010

To set extra env variables we can use the PR here https://github.com/puppeteer/puppeteer/pull/12328

Flags: needinfo?(alexrudenko)

Here is an example:
https://github.com/puppeteer/puppeteer/actions/runs/8875780005/job/24365989726?pr=12328#step:9:350

1714380471274	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 -> {"id":8,"method":"browsingContext.create","params":{"type":"tab","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274"}}
1714380471274	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
1714380471274	RemoteAgent	TRACE	Received command browsingContext.create for destination ROOT
1714380471274	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
1714380471275	RemoteAgent	TRACE	[10] NavigationListener onStateChange, stateFlags: 196612, status: 0, isStart: false, isStop: false, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471275	RemoteAgent	TRACE	Received command browsingContext._awaitVisibilityState for destination WINDOW_GLOBAL
1714380471275	RemoteAgent	TRACE	Module windowglobal/browsingContext.sys.mjs found for WINDOW_GLOBAL
1714380471277	RemoteAgent	TRACE	[10] NavigationListener onStateChange, stateFlags: 131088, status: 0, isStart: false, isStop: true, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471314	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}
2024-04-29T08:47:51.315Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}'
]
1714380471330	RemoteAgent	TRACE	Received DOM event TabSelect for [object XULElement]
[Parent 5860: Main Thread]: I/BCWebProgress OnLocationChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, about:blank, 0) on {top:1, id:a, url:about:blank}
1714380471336	RemoteAgent	TRACE	Module windowglobal-in-root/realm-created.sys.mjs not found for WINDOW_GLOBAL
2024-04-29T08:47:51.337Z puppeteer:webDriverBiDi:RECV ◀ [
1714380471336	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}
  '{"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}'
1714380471336	RemoteAgent	TRACE	Module windowglobal-in-root/window-global-handler-created.sys.mjs not found for WINDOW_GLOBAL
]
1714380471337	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
[Parent 5860: Main Thread]: I/BCWebProgress OnSecurityChange(<null>, <null>, 4) on {top:1, id:a, url:about:blank}
1714380471337	RemoteAgent	TRACE	[306dcc2d-3624-41f7-b432-d87b10d858bd] No navigation found to stop for url: about:blank
[Parent 5860: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:a, url:about:blank}
1714380471337	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
1714380471407	RemoteAgent	TRACE	[12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714380471407	RemoteAgent	TRACE	[12] ProgressListener Setting unload timer (5000ms)
2024-04-29T08:47:56.506Z puppeteer:webDriverBiDi:RECV ◀ [
1714380476506	RemoteAgent	TRACE	[12] ProgressListener No navigation detected: about:blank
  '{"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}'
1714380476506	RemoteAgent	TRACE	[12] ProgressListener Stop: has error=false
]
1714380476506	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}

We seem to start the progress listener after the page about:blank has already been loaded? Also I see the browsingContext.load event emitted twice.

Blocks: 1750547
Severity: -- → S3
Points: --- → 3
Priority: -- → P2
Whiteboard: [webdriver:m11]

We discussed this in triage yesterday.

First observation is that browsingContext.create waits for the visibilityState of the previous tab to change (here) before we start waiting for the navigation (here).

It would probably be safer to start waiting for the initial navigation right away. However our waitForInitialNavigationCompleted helper is supposed to detect situations where the initial page has already been loaded and resolve early (here).

To verify that, locally I artificially delayed the call to waitForInitialNavigationCompleted by 500ms, 1000ms, 2000ms, and in all cases the helper resumed early, without waiting for the 5s timeout of the ProgressListener. Based on this test, I think that calling the helper "late" works fine.

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Depends on: 1894305

(In reply to Julian Descottes [:jdescottes] from comment #7)

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI. Based on my comment did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout in waiting for the initial navigation?

Flags: needinfo?(jdescottes)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #8)

(In reply to Julian Descottes [:jdescottes] from comment #7)

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI. Based on my comment did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout in waiting for the initial navigation?

Yes, that's what I meant with the previous paragraph: I tried locally with 500ms, 1000ms, 2000ms and it didn't reproduce the issue. The helper successfully detects that the navigation is already done and resolves immediately.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI.

I think it should have an impact, and it's worrying that it doesn't.
Basically, by increasing this value we widen the gap between the call to addTab and the call to waitForInitialNavigationCompleted. For instance, if we set the value to 1000 ms, it should give the new tab enough time to do its initial navigation, and waitForInitialNavigationCompleted should then detect that the initial navigation is over and resolve. As I said, I verified locally that the helper was correctly returning early in those cases, so I don't really understand why it doesn't happen on GitHub CI.
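For reference, the experiment described above amounts to a one-line profile pref override; a sketch (the value is illustrative):

```javascript
// user.js sketch: artificially widen the gap between addTab and
// waitForInitialNavigationCompleted (the value is illustrative).
user_pref("browser.tabs.remote.unloadDelayMs", 1000);
```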

Flags: needinfo?(jdescottes)

I assume that with bug 1894305 now fixed, we should trigger some new debug jobs in GitHub CI for Puppeteer (with the Firefox cache removed first)?

I retriggered a run at https://github.com/puppeteer/puppeteer/actions/runs/9059188428/job/24886294787?pr=12336
My previous push had a typo in MOZ_LOG so it didn't include the Process logs.

Thanks! So it seems it takes a long time to create a new process when we try to create new tabs:

2024-05-13T08:38:30.1111650Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.1211520Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.1313910Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:30.2590990Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.2595100Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.2596060Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.2271950Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.2373640Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.2475650Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.4686340Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.4725860Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.4768310Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.2278670Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.2283020Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.2287190Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.4980610Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.4981200Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.4981830Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:34.3286360Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:34.3287130Z 1715589514319	RemoteAgent	TRACE	[12] ProgressListener No navigation detected: about:blank
2024-05-13T08:38:34.3307020Z 2024-05-13T08:38:34.320Z puppeteer:webDriverBiDi:RECV ◀ [
2024-05-13T08:38:34.3307680Z 1715589514319	RemoteAgent	TRACE	[12] ProgressListener Stop: has error=false
2024-05-13T08:38:34.3369870Z   '{"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}'
2024-05-13T08:38:34.3371320Z 1715589514320	RemoteAgent	DEBUG	WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 <- {"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}
2024-05-13T08:38:34.3475890Z ]
2024-05-13T08:38:34.3478510Z 1715589514322	RemoteAgent	DEBUG	WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 -> {"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}
2024-05-13T08:38:34.3570390Z 2024-05-13T08:38:34.321Z puppeteer:webDriverBiDi:SEND ► [
2024-05-13T08:38:34.3571080Z 1715589514322	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3616340Z   '{"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}'
2024-05-13T08:38:34.3620350Z 1715589514322	RemoteAgent	TRACE	Received command browsingContext.setViewport for destination ROOT
2024-05-13T08:38:34.3653510Z ]
2024-05-13T08:38:34.3654270Z 1715589514322	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3655360Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:34.3656560Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000

We try to start a process, but we keep getting "Process Starting delayed process start, delay=1000" for 5 seconds.

Olli, do you know what could delay the creation of new processes? This is happening on macos arm workers on Github

Flags: needinfo?(smaug)

If I remember correctly, Randell worked on some process-related changes in the past, so maybe he is a better person to ask. Randell, if that is not the case please let us know, and maybe you know a better person to ask. Thanks!

Flags: needinfo?(smaug) → needinfo?(rjesup)

(I don't know what "macos arm workers on Github" means).

I assume this is on opt/shippable builds?
Do we have performance profiles for this?

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #15)

(I don't know what "macos arm workers on Github" means).

The workers provided by Github to run Github actions, see https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers

I assume this is on opt/shippable builds?

Yes.

Do we have performance profiles for this?

Not yet. I'll try to get one.

I just saw that there are also macOS 13 ARM images:
https://github.com/actions/runner-images/blob/main/images/macos/macos-13-arm64-Readme.md

I wonder if those show the same problem or not. The current Puppeteer checks all run on macOS 13, so maybe it is a macOS 14 issue.

So this is the cause of the 1s delays:

# Process launch delay (in milliseconds).
- name: dom.ipc.processPrelaunch.delayMs
  type: uint32_t
  # This number is fairly arbitrary ... the intention is to put off
  # launching another app process until the last one has finished
  # loading its content, to reduce CPU/memory/IO contention.
  value: 1000
  mirror: always

You can modify this value if you need to spawn a lot of processes in a short time for tests etc. For browsing, this avoids slowing down page load in order to spawn processes we may need in the future.
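For test setups that intentionally spawn many processes in a short time, the pref can be overridden in the profile; a sketch (0 disables the delay entirely; whether that is appropriate for a given test suite is a judgment call):

```javascript
// user.js sketch: shorten or disable the prelaunch delay when a test
// suite needs to spawn many content processes quickly (value illustrative).
user_pref("dom.ipc.processPrelaunch.delayMs", 0);
```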

Flags: needinfo?(rjesup)

Thank you Randell. What's interesting is that no other test suite modifies this preference, and we also didn't notice a problem with this delay before ourselves. It's just that the GitHub macOS ARM workers are so slow. Do you have any idea, or might there already be another bug filed about visible delays? Most likely users on such machines would also face this problem, so maybe there is some inconsistency? Is there anything in the logs we provided that might explain why there is such a delay, or do we need the requested performance profile for it?

Further, I'm not sure if we should actually change this value for tests. It would clearly prevent us from finding such situations.

Flags: needinfo?(rjesup)

It's rare to need to spawn new processes quickly on the web. In normal operation, we keep 3 processes in a cache available; we refill the cache whenever we get Idle time during the load. Even with sites like CNN or fandom which have 12-20+ processes to render (lots of iframes) I found that even a slow old reference laptop never or rarely ran out of preallocated processes -- because sites don't need those 12+ processes all at the same moment.
We only hit the 1s delay if we're asked to create a new process while we have a process in the middle of starting that hasn't finished getting ready yet; starting another process while we're waiting for the previous one to finish starting is generally a good way to delay the already-started process.
That said, if this really is happening in non-test situations, we could either reduce the delay, or we could remember we wanted to spawn a process and start that when (or sooner after) the previous process finishes starting.
If you can figure out why we're spawning processes fast enough to hit this, or get more detailed logs or an RR trace, I could look at this more. If we have any evidence this is happening in the field in any quantity, we certainly could try to minimize it. Perhaps add telemetry for the delay getting hit? I'm not sure if we'd want a count; the assumption is that most machines would never hit it, but some (slow) machines might hit it more often, and a count would help there.

Flags: needinfo?(rjesup)

Thanks for all the feedback!

The issue we are having with this test suite is not only a 1s delay; it seems to take between 4 and 5 seconds to create the process for a new tab.
In the log I pasted in comment 13, we can see the following 6 lines of logs repeated 4 times (each time with a 1 second gap):

2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000

I can try to reduce the delay, but I suspect the issue is that we can't start the process and so we keep waiting for 1 second.
You mentioned that we could hit the 1s delay if we already are waiting for a process to finish starting, so I'm guessing we have a process on those machines which is really slow to start and which makes subsequent process creation slow?

get more detailed logs

Do you know what kind of logs would be helpful here?

Flags: needinfo?(rjesup)

Bug 1894695 reports a similar issue and that it started between 124.0a1 and 127.0a1. But I'm not sure if they were really using an ARM worker at the time when it started. So I've requested more information.

Given that this doesn't seem to be WebDriver related but a core issue with process creation I'm going to move this bug to the DOM content processes component.

Severity: S3 → --
Points: 3 → ---
Component: WebDriver BiDi → DOM: Content Processes
Priority: P2 → --
Product: Remote Protocol → Core
Summary: browsingContext.create is slow on macos arm64 runners on GitHub actions → Creation of content processes is slow on macos arm64 runners on GitHub actions
Duplicate of this bug: 1894695

This issue might actually hit more folks than we thought. As bug 1894695 comment 7 states, GitHub updated the macos-latest definition and it points to the ARM machines now. If users don't hard-code a specific version like macos-12, they will all notice this delay in process creation.

Julian, were you already able to create a Gecko profile of a single affected Puppeteer test? If you have troubles please let me know and I can try.

Flags: needinfo?(jdescottes)

Note also that there are apparently no Intel-based macos-14 machines:

So not only do you have to hardcode, you have to hardcode an older version :-(

Barry, if you have an easy way to reproduce, could you run a single test with the following environment variables set for Firefox? I would appreciate it.

  • MOZ_PROFILER_STARTUP_FILTERS="*"
  • MOZ_PROFILER_STARTUP="1"
  • MOZ_PROFILER_SHUTDOWN="%path_to_json_file%"

Please note that the Firefox process needs to be shut down normally, which means you cannot kill it. Otherwise the profile will not be written to the given path. Thanks!
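Put together as a shell sketch for a CI step (the output path is illustrative):

```shell
# Sketch: profiler environment for a CI job step; the JSON output
# path is an assumption and should point somewhere the job can upload.
export MOZ_PROFILER_STARTUP="1"
export MOZ_PROFILER_STARTUP_FILTERS="*"
export MOZ_PROFILER_SHUTDOWN="$PWD/firefox-profile.json"
echo "profiler enabled, writing to $MOZ_PROFILER_SHUTDOWN"
```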

Weirdly I can't repro this on my M1 Mac even though it's ARM64 too!

The only way I've got to reproduce it is with this workflow:
https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.

(In reply to Barry Pollard from comment #27)

The only way I've got to reproduce it is with this workflow:
https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.

Thanks a lot for creating such a workflow and the tooling to offer such a profile for download! I've picked the generated JSON file from this workflow and uploaded it to https://profiler.firefox.com/.

Here is the shared profile: https://share.firefox.dev/4bFO4vT

Randell or Olli, do you see something strange going on in there? Here is a filtered view for process markers where, at least at the beginning, two Blocked preallocation markers can be seen, each with 5s, but probably it's not the only reason given that further hangs happen afterwards.

Flags: needinfo?(jdescottes) → needinfo?(smaug)

What was done to create that profile? Why are there so many content processes? And why aren't those processes really doing anything?
They are sampled but don't seem to execute much code.
Is the testing framework possibly not sending the urls to load soon enough or what?

I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.

Flags: needinfo?(smaug)

I took a quick peek at the profile and noticed 4619ms of Recv Thread Latency for IPC message, PContent::Msg_FlushFOGData
https://share.firefox.dev/4bKHxjD
But it's not clear where the latency is coming from.

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #29)

What was done to create that profile?

As usual, the normal profiler environment variables were set.

Why are there so many content processes? And why aren't those processes really doing anything? They are sampled but don't seem to execute much code.

No idea. Maybe that's the reason why it all runs so slowly? Would some extra logs help? I assume setting MOZ_LOG="Process:5" to enable process logs may also give further details as markers?

Is the testing framework possibly not sending the urls to load soon enough or what?

As we said above, the issue is not reproducible locally but only in GitHub Actions for macOS ARM runners. The only delay the test has is after the navigation. But for testing the overall issue specifically, it could potentially be removed.

I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.

That could be done by setting it as a Firefox preference when starting the WebDriver session. Barry, would it be possible for you to make these small modifications and also add the other env variables that I pointed out above? Thanks!

Flags: needinfo?(barry)

(In reply to Barry Pollard from comment #32)

I presume this does what you need: https://github.com/tunetheweb/github-action-test/commit/b9a0ee0ae5e43870bfdefa66a985ee11a554925f

There is a typo in the if condition (browserName === 'fiefox') which caused the required preference not to be added. If you could do a quick follow-up push, I can check the results and share the profile. Thanks.

Oops fixed now. BTW you can fork this repo so you're not holding up waiting for me each time.

https://github.com/tunetheweb/github-action-test/actions/runs/9365810252

Sorry, I was kinda busy this morning, but if needed I could do that. Thanks for jumping in and helping! It's appreciated.

Here is one example of a navigation that takes more than 5s because there is a Blocked preallocation happening: https://share.firefox.dev/4bJpVop

With the profile overview and filtered by the Process marker, it looks like processes launched as "immediate" seem to be blocked, while normal process launches are fine?

Olli, is this additional information helpful for further investigating the issue?

Flags: needinfo?(smaug)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #35)

With the profile overview and filtered by the Process marker it looks like that processes to be launched as immediate seems to be blocked, but normal process launches are fine?

The content processes that take ~5s to launch are stuck inside nsAppShell::Init() (see https://share.firefox.dev/4bH2rAn) doing "mac-things" related to NSColorGetUserAccentColor, so the parent process gets the notification of the process being successfully launched only late. There is a chance that nsAppShell::Init() is doing too many things synchronously that could better be done later?

Component: DOM: Content Processes → Widget: Cocoa
Flags: needinfo?(smaug)
Severity: -- → S3
Priority: -- → P3

Thanks Jens! That opens up the question of whether there is any way to figure out what exactly is causing this delay in NSColorGetUserAccentColor. Stephen, is there any way with normal MOZ_LOG logging to get more details here? I assume that widget:5 might not help, or would it?

Also, regarding the prioritization, I want to note that this is an issue seen by everyone using the new macOS ARM runners for GitHub Actions. While for now there is a chance to fall back to older macOS releases (the most recent is macOS 13), that might no longer be possible once those Intel-based runners are no longer supported. Each and every test that uses Firefox will see a drastic increase in test duration. IMHO we should find the underlying issue sooner rather than later to continue allowing users to have a good testing experience. Otherwise we might lose test coverage or, as the FCP example shows, get completely wrong test results.

Flags: needinfo?(rjesup) → needinfo?(spohl.mozilla.bugs)

So we got a reply: https://github.com/actions/runner-images/issues/9974#issuecomment-2150347670

In summary this is:

Regarding configuration I may say that it's not a private info: runner is a virtual machine based on Veertu Anka with the following set of resources

If we know what might cause it, they offered to work around it in their environment.

I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and figure out how to get that out of the GitHub worker as an artifact or something.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #39)

I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and figure out how to get that out of the GitHub worker as an artifact or something.

I tried to run xctrace as a background task during the job, but it fails to record the trace when Firefox starts running while the trace is still in progress.
https://github.com/juliandescottes/github-action-test/blob/main/.github/workflows/tests.yml#L37

My background task is:

sleep 50 && xctrace record --template 'Time Profiler' --all-processes --time-limit 5s --output /tmp/xctrace.trace &

And I play with the sleep delay + time-limit to try to match Firefox running the tests. If the trace is recording before Firefox starts I manage to record a valid trace (at least judging by the size of it). But if it overlaps with Firefox running (which is the goal), then the trace fails and only contains a RunIssues.storedata file.

I can attach a failed trace here in case it helps, but maybe I'm just doing it wrong.

Flags: needinfo?(jmuizelaar)

I got an xctrace profile here: https://github.com/jrmuizel/github-action-test/actions/runs/9421839252/

It looks like it wasn't working because xctrace was still writing out the profile when we uploaded it.

I used while pgrep -x "xctrace" > /dev/null; do sleep 2; echo waiting for trace; done and killall -SIGINT xctrace to stop xctrace
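The stop-and-wait sequence can be sketched as a single step (this assumes xctrace was started earlier in the job; it is harmless if it was not):

```shell
# Ask xctrace to stop recording, then wait until it has finished
# writing the .trace bundle before the artifact upload step runs.
killall -SIGINT xctrace 2>/dev/null || true
while pgrep -x "xctrace" > /dev/null; do
  sleep 2
  echo "waiting for trace"
done
echo "trace finished"
```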

Flags: needinfo?(jmuizelaar)

Markus and I looked at the xctrace profile and nothing stood out as incriminating.

The content process is always waiting 5 seconds so I think it's more likely that this is timing out instead of waiting for some cpu intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does we can turn on sandbox logging to see if there's something that looks incriminating.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #43)

The content process is always waiting 5 seconds, so I think it's more likely that this is a timeout rather than waiting for some CPU-intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does, we can turn on sandbox logging to see if there's something that looks incriminating.

Julian, could you try this suggestion? It would be good to know if it is related to sandboxing and, if so, what exactly the log output might show us.

Flags: needinfo?(jdescottes)
Attached file logs.txt
Disabling the sandbox fixes the issue: https://share.firefox.dev/4etFZfy

I did another run with sandbox logging enabled, but I can't see anything relevant in the logs

Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)

Flags: needinfo?(jdescottes) → needinfo?(jmuizelaar)

I wonder if it would be helpful to check the different supported sandbox levels on macOS and see which one actually triggers this delay:
https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468

(In reply to Julian Descottes [:jdescottes] from comment #46)

Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)

Ah sorry, I meant the macOS logging using the log program. i.e with sandboxing turned on using log show.

Flags: needinfo?(jmuizelaar)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #47)

I wonder if it would be helpful to check different supported sandbox levels on MacOS and which one actually triggers this delay:
https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468

Good idea. I tried levels 1 and 2 and it still fails. Only fully disabling the sandbox helps.

Attached file log_show 2.txt

This is the output of log show --last 5m. The test starts around 09:01:56, so ~line 24000.
Anything interesting in there?

Flags: needinfo?(jmuizelaar)

Haik, does anything in the log look suspicious? Have you seen a 5 second timeout during startup before?

Flags: needinfo?(haftandilian)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.
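As a minimal sketch of working through the logs, the service name can be pulled out of each denial line with sed; the pipeline below is mine, but the sample line matches the denial format quoted later in this bug.

```shell
# A sandbox denial as it appears in `log show` output (sample taken
# from the format quoted in this bug).
sample='plugin-container(3912) deny(1) mach-lookup com.apple.lsd.modifydb'

# Extract the name of the denied mach service.
echo "$sample" | sed -n 's/.*deny(1) mach-lookup \(.*\)/\1/p'
# prints: com.apple.lsd.modifydb
```

Running the whole log through this (e.g. log show --last 5m | grep Sandbox | sed ... | sort -u) gives the set of services to consider allowing.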

Flags: needinfo?(jmuizelaar)

(here is the Firefox profile corresponding to those logs: https://share.firefox.dev/4be0Twu)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #52)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.

Thanks, I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?

Flags: needinfo?(jmuizelaar)

(In reply to Julian Descottes [:jdescottes] from comment #53)

(here is the Firefox profile corresponding to those logs: https://share.firefox.dev/4be0Twu)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #52)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.

Thanks, I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?

I think you'll need to build a custom Firefox, but maybe Haik knows better.

Flags: needinfo?(jmuizelaar)

Yes, you need to rebuild to make sandbox policy changes with the exception of a few things that are parameterized.

One thing to try that can be set via prefs is to try setting the pref security.sandbox.content.mac.disconnect-windowserver to false with the sandbox enabled which would tell us if it is related to not having a WindowServer connection. Edit: looking at the profile, the NSAppShell::Init() call is stuck in NSColorGetUserAccentColor which might be trying to connect to the WindowServer or something display-related.

Since setting security.sandbox.content.level to 1 did not avoid the problem, it's probably not caused by not having permission to read the filesystem. I suspect it's related to trying to connect to a mach service it doesn't normally connect to in a non-virtualized environment or has a timeout in this environment. These are logged with messages like

plugin-container(3912) deny(1) mach-lookup com.apple.lsd.modifydb

After the WindowServer test, what I would do for next steps is to look for the messages related to denying connection to a service (com.apple.lsd.modifydb in the above example) and allow that service in the content process policy with an allow rule by adding it to SandboxPolicyContent.h here, making sure to get the parentheses right. We might be able to find a single service connection attempt that is causing the delay.
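For reference, an allow rule for a single mach service in the SBPL policy would look roughly like this (a sketch following the style of the existing rules in SandboxPolicyContent.h, using the service name from the example above):

```lisp
(allow mach-lookup
  (global-name "com.apple.lsd.modifydb"))
```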

Do we have a method for running this VM type on a developer Mac?

Flags: needinfo?(haftandilian)

Same issue with security.sandbox.content.mac.disconnect-windowserver set to false: https://share.firefox.dev/3Vtc5zE

I will attach the logs.

Do we have a method for running this VM type on a developer Mac?

I didn't find the images for the VMs, but I asked on https://github.com/actions/runner-images/issues/9974 (maybe I should file a new issue since this one has been closed for some time)

I was able to reproduce locally on a macOS (12 and also 15 Beta) VM. I'm working on debugging the delay.

After doing some trial and error testing on the macOS 15 Beta VM, I found that adding allowances in the content process sandbox for the following iokit properties avoids the 5-second delay in nsAppShell::Init(). I'll get a patch posted to add these back after doing some more testing. I'll also try to find out more about the security impact of the properties.

plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
Assignee: nobody → haftandilian

Add access to additional iokit properties in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.

After some more debugging, I've found that all the properties listed in comment 59 (product-id, housing-color, IORegistryEntryPropertyKeys, ean-storage-present, and syscfg-v2-data) are read within the call to [NSApplication init] from nsAppShell::Init(). AppKit`__NSUserAccentColorGetHardwareAccentColorName_block_invoke() is the lowest AppKit function on the stack. I set breakpoints on IOKit`IORegistryEntryCreateCFProperty to catch this. In the VM, the only properties actually present are product-id and ean-storage-present, as checked by running ioreg. The Safari renderer sandbox also allows reads of all these properties.
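As a sketch, the allowances for the denied properties from comment 59 would take a form like the following in the SBPL policy (the actual patch may group or name these differently):

```lisp
(allow iokit-get-properties
  (iokit-property "product-id")
  (iokit-property "housing-color")
  (iokit-property "IORegistryEntryPropertyKeys")
  (iokit-property "ean-storage-present")
  (iokit-property "syscfg-v2-data"))
```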

Attachment #9408733 - Attachment description: WIP: Bug 1893921 - Creation of content processes is slow on macos arm64 runners on GitHub actions → Bug 1893921 - Creation of content processes is slow on macos arm64 runners on GitHub actions r?spohl
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/59eba805400c
Creation of content processes is slow on macos arm64 runners on GitHub actions r=spohl
Status: NEW → RESOLVED
Closed: 28 days ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch

Julian, would you mind testing in GitHub CI if the issue is fixed for Puppeteer and the other test repository? Thanks.

Component: Widget: Cocoa → Security: Process Sandboxing
Flags: needinfo?(jdescottes)

Unfortunately, it doesn't seem to help on the puppeteer issue https://github.com/puppeteer/puppeteer/actions/runs/9640423521/job/26584204278?pr=12328

For the github-action-test example, I'm trying to make sure that we use the latest Nightly, but it doesn't seem to help either.
We still seem blocked on the same call to __NSUserAccentColorGetHardwareAccentColorName_block_invoke https://share.firefox.dev/4eCgfOf

Flags: needinfo?(jdescottes)
Flags: needinfo?(haftandilian)

@Julian, would you be able to get me logs again like on comment 50? Could you also report the build ID so we can make sure the fix is there?

I'll retest on the VM I'm using. I was testing with a QEMU based VM using the macOS 15 beta.

Flags: needinfo?(haftandilian) → needinfo?(jdescottes)
Attached file logs_20240624.txt
Sure! 

- buildID is 20240624094028
- logs attached to this message 
- profile at https://share.firefox.dev/4ccD9u9
Flags: needinfo?(jdescottes) → needinfo?(haftandilian)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

Flags: needinfo?(haftandilian) → needinfo?(jdescottes)

After switching to a macOS 14.5 VM, I can reproduce the delay with the first fix. I'll request backout and re-land a complete fix.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 129 Branch → ---

(In reply to Haik Aftandilian [:haik] from comment #68)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce now, I guess we will reland the complete fix and test with Nightly?

Flags: needinfo?(jdescottes)

Add access to additional iokit properties and sysctls in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.

Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/349060b3997f
Creation of content processes is slow on macos arm64 runners on GitHub actions r=spohl
Status: REOPENED → RESOLVED
Closed: 28 days ago → 24 days ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch
Attachment #9408733 - Attachment is obsolete: true

(In reply to Julian Descottes [:jdescottes] from comment #71)

(In reply to Haik Aftandilian [:haik] from comment #68)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce now, I guess we will reland the complete fix and test with Nightly?

Ideally, we would verify the fix on the github runner first, but given how that seems difficult and the behavior of my 14.5 VM appears to match the logs from the github runner, I landed the updated fix.

And I can confirm the issue is fixed on my test repository.
Thanks a lot for the help here Haik!

(In reply to Julian Descottes [:jdescottes] from comment #76)

And I can confirm the issue is fixed on my test repository.
Thanks a lot for the help here Haik!

Awesome, thanks! Kudos to everyone who helped debug this down to the sandboxing level.

I have not been able to reproduce the problem with other process sandboxes (socket, RDD, GMP, and utility) in the 14.5 VM.
