Closed Bug 1893921 Opened 3 months ago Closed 24 days ago

Creation of content processes is slow on macos arm64 runners on GitHub actions

Categories

(Core :: Security: Process Sandboxing, defect, P3)

defect

Tracking


RESOLVED FIXED
129 Branch
Tracking Status
firefox129 --- fixed

People

(Reporter: alexrudenko, Assigned: haik)

References

(Blocks 1 open bug)

Details

(Whiteboard: [webdriver:m11])

Attachments

(6 files, 1 obsolete file)

See https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24242835681?pr=12318 for a test run done on the new macOS arm64 runners on GitHub Actions (https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers).

Several tests time out and in general the test run is very slow, 21 minutes for one test shard (usually up to 5 min). When run with the trace option, the logs indicate that browsingContext.create always blocks for 5 seconds: https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010

I've triggered a new job with debug enabled. Let's see if this gives some more details:

https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24363405593?pr=12318

And here is already some output:

 1714376280846	RemoteAgent	TRACE	[12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714376280846	RemoteAgent	TRACE	[12] ProgressListener Setting unload timer (5000ms)
1714376285849	RemoteAgent	DEBUG	WebDriverBiDiConnection 7462f260-895e-4517-8a97-128c1d17c323 <- {"type":"event","method":"script.realmCreated","params":{"realm":"8f9e717e-c133-42cb-9051-61e1a4366967","origin":"null","context":"0c801b4e-d1d1-455f-b76b-495f31e4e308","type":"window"}}

It looks like the hang comes from waiting for about:blank to be loaded. Sadly I do not see any update of the web progress listener status and wonder if we somehow miss loading this page.

Alex, is it easy to run a job with the MOZ_LOG="BCWebProgress:5" environment variable set?
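For reference, setting the variable amounts to exporting it in the job step's environment before Firefox is launched; a minimal shell sketch (in a GitHub Actions workflow this would typically go into the step's `env:` map instead):

```shell
# Sketch: enable the BCWebProgress logging module for the Firefox
# process started later in this step.
export MOZ_LOG="BCWebProgress:5"
echo "$MOZ_LOG"
```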

Flags: needinfo?(alexrudenko)

I believe the same log was already in the linked gist https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010

To set extra env variables we can use the PR here https://github.com/puppeteer/puppeteer/pull/12328

Flags: needinfo?(alexrudenko)

Here is an example:
https://github.com/puppeteer/puppeteer/actions/runs/8875780005/job/24365989726?pr=12328#step:9:350

1714380471274	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 -> {"id":8,"method":"browsingContext.create","params":{"type":"tab","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274"}}
1714380471274	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
1714380471274	RemoteAgent	TRACE	Received command browsingContext.create for destination ROOT
1714380471274	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
1714380471275	RemoteAgent	TRACE	[10] NavigationListener onStateChange, stateFlags: 196612, status: 0, isStart: false, isStop: false, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471275	RemoteAgent	TRACE	Received command browsingContext._awaitVisibilityState for destination WINDOW_GLOBAL
1714380471275	RemoteAgent	TRACE	Module windowglobal/browsingContext.sys.mjs found for WINDOW_GLOBAL
1714380471277	RemoteAgent	TRACE	[10] NavigationListener onStateChange, stateFlags: 131088, status: 0, isStart: false, isStop: true, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471314	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}
2024-04-29T08:47:51.315Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}'
]
1714380471330	RemoteAgent	TRACE	Received DOM event TabSelect for [object XULElement]
[Parent 5860: Main Thread]: I/BCWebProgress OnLocationChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, about:blank, 0) on {top:1, id:a, url:about:blank}
1714380471336	RemoteAgent	TRACE	Module windowglobal-in-root/realm-created.sys.mjs not found for WINDOW_GLOBAL
2024-04-29T08:47:51.337Z puppeteer:webDriverBiDi:RECV ◀ [
1714380471336	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}
  '{"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}'
1714380471336	RemoteAgent	TRACE	Module windowglobal-in-root/window-global-handler-created.sys.mjs not found for WINDOW_GLOBAL
]
1714380471337	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
[Parent 5860: Main Thread]: I/BCWebProgress OnSecurityChange(<null>, <null>, 4) on {top:1, id:a, url:about:blank}
1714380471337	RemoteAgent	TRACE	[306dcc2d-3624-41f7-b432-d87b10d858bd] No navigation found to stop for url: about:blank
[Parent 5860: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:a, url:about:blank}
1714380471337	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
  '{"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
1714380471407	RemoteAgent	TRACE	[12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714380471407	RemoteAgent	TRACE	[12] ProgressListener Setting unload timer (5000ms)
2024-04-29T08:47:56.506Z puppeteer:webDriverBiDi:RECV ◀ [
1714380476506	RemoteAgent	TRACE	[12] ProgressListener No navigation detected: about:blank
  '{"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}'
1714380476506	RemoteAgent	TRACE	[12] ProgressListener Stop: has error=false
]
1714380476506	RemoteAgent	DEBUG	WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}

We seem to start the progress listener after the page about:blank has already been loaded? Also I see the browsingContext.load event emitted twice.

Blocks: 1750547
Severity: -- → S3
Points: --- → 3
Priority: -- → P2
Whiteboard: [webdriver:m11]

We discussed this in triage yesterday.

First observation is that browsingContext.create waits for the visibilityState of the previous tab to change (here) before we start waiting for the navigation (here).

It would probably be safer to start waiting for the initial navigation right away. However our waitForInitialNavigationCompleted helper is supposed to detect situations where the initial page has already been loaded and resolve early (here).

To verify that, locally I artificially delayed the call to waitForInitialNavigationCompleted by 500ms, 1000ms, 2000ms, and in all cases the helper resumed early, without waiting for the 5s timeout of the ProgressListener. Based on this test, I think that calling the helper "late" works fine.

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Depends on: 1894305

(In reply to Julian Descottes [:jdescottes] from comment #7)

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI. Based on my comment did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout in waiting for the initial navigation?

Flags: needinfo?(jdescottes)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #8)

(In reply to Julian Descottes [:jdescottes] from comment #7)

Therefore I tried to delay the call to waitForInitialNavigationCompleted on Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to github CI. This should force waitForInitialNavigationCompleted to start late as well, and give enough time to the initial navigation to complete, but it still seems to fail with macos-latest.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI. Based on my comment did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout in waiting for the initial navigation?

Yes, that's what I meant with the previous paragraph: I tried locally with 500ms, 1000ms, 2000ms and it didn't reproduce the issue. The helper successfully detects that the navigation is already done and resolves immediately.

Note that changing this preference does delay the visibilitychange event even further, so I wouldn't expect that anything changes for Puppeteer in GitHub CI.

I think it should have an impact, and it's worrying that it doesn't.
Basically, by increasing this value we widen the gap between the call to addTab and the call to waitForInitialNavigationCompleted. For instance, if we set the value to 1000 ms, it should give the new tab enough time to do its initial navigation, and waitForInitialNavigationCompleted should then detect that the initial navigation is over and resolve. As I said, I verified locally that the helper was correctly returning early in those cases, so I don't really understand why it doesn't happen on GitHub CI.
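For reference, the experiment described above amounts to a one-line profile pref override; a sketch (the value is illustrative):

```javascript
// user.js sketch: artificially widen the gap between addTab and
// waitForInitialNavigationCompleted (the value is illustrative).
user_pref("browser.tabs.remote.unloadDelayMs", 1000);
```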

Flags: needinfo?(jdescottes)

I assume that with bug 1894305 now fixed, we should trigger some new debug jobs in GitHub CI for Puppeteer (with the Firefox cache removed first)?

I retriggered a run at https://github.com/puppeteer/puppeteer/actions/runs/9059188428/job/24886294787?pr=12336
My previous push had a typo in MOZ_LOG so it didn't include the Process logs.

Thanks! So it seems it takes a long time to create a new process when we try to create new tabs:

2024-05-13T08:38:30.1111650Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.1211520Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.1313910Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:30.2590990Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.2595100Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.2596060Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.2271950Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.2373640Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.2475650Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.4686340Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.4725860Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.4768310Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.2278670Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.2283020Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.2287190Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.4980610Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.4981200Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.4981830Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:34.3286360Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:34.3287130Z 1715589514319	RemoteAgent	TRACE	[12] ProgressListener No navigation detected: about:blank
2024-05-13T08:38:34.3307020Z 2024-05-13T08:38:34.320Z puppeteer:webDriverBiDi:RECV ◀ [
2024-05-13T08:38:34.3307680Z 1715589514319	RemoteAgent	TRACE	[12] ProgressListener Stop: has error=false
2024-05-13T08:38:34.3369870Z   '{"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}'
2024-05-13T08:38:34.3371320Z 1715589514320	RemoteAgent	DEBUG	WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 <- {"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}
2024-05-13T08:38:34.3475890Z ]
2024-05-13T08:38:34.3478510Z 1715589514322	RemoteAgent	DEBUG	WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 -> {"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}
2024-05-13T08:38:34.3570390Z 2024-05-13T08:38:34.321Z puppeteer:webDriverBiDi:SEND ► [
2024-05-13T08:38:34.3571080Z 1715589514322	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3616340Z   '{"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}'
2024-05-13T08:38:34.3620350Z 1715589514322	RemoteAgent	TRACE	Received command browsingContext.setViewport for destination ROOT
2024-05-13T08:38:34.3653510Z ]
2024-05-13T08:38:34.3654270Z 1715589514322	RemoteAgent	TRACE	Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3655360Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:34.3656560Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000

We try to start a process, but we keep getting "Process Starting delayed process start, delay=1000" for 5 seconds.

Olli, do you know what could delay the creation of new processes? This is happening on macos arm workers on Github

Flags: needinfo?(smaug)

If I remember correctly, Randell worked on some process-related changes in the past, so maybe he is a better person to ask. Randell, if that is not the case please let us know, and maybe you know a better person to ask. Thanks!

Flags: needinfo?(smaug) → needinfo?(rjesup)

(I don't know what "macos arm workers on Github" means).

I assume this is on opt/shippable builds?
Do we have performance profiles for this?

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #15)

(I don't know what "macos arm workers on Github" means).

The workers provided by Github to run Github actions, see https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers

I assume this is on opt/shippable builds?

Yes.

Do we have performance profiles for this?

Not yet. I'll try to get one.

I just saw that there are also macOS 13 ARM images:
https://github.com/actions/runner-images/blob/main/images/macos/macos-13-arm64-Readme.md

I wonder if those show the same problem or not. The current Puppeteer checks all run on macOS 13, so maybe it is a macOS 14 issue.

So this is the cause of the 1s delays:

# Process launch delay (in milliseconds).
- name: dom.ipc.processPrelaunch.delayMs
  type: uint32_t
  # This number is fairly arbitrary ... the intention is to put off
  # launching another app process until the last one has finished
  # loading its content, to reduce CPU/memory/IO contention.
  value: 1000
  mirror: always

You can modify this value if you need to spawn a lot of processes in a short time for tests etc. For browsing, this avoids slowing down page load in order to spawn processes we may need in the future.
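For test setups that intentionally spawn many processes in a short time, the pref can be overridden in the profile; a sketch (0 disables the delay entirely; whether that is appropriate for a given test suite is a judgment call):

```javascript
// user.js sketch: shorten or disable the prelaunch delay when a test
// suite needs to spawn many content processes quickly (value illustrative).
user_pref("dom.ipc.processPrelaunch.delayMs", 0);
```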

Flags: needinfo?(rjesup)

Thank you Randell. What's interesting is that no other test suite modifies this preference, and we also didn't notice a problem with this delay before ourselves. It's just that the GitHub macOS ARM workers are so slow. Do you have any idea, or might there already be another bug filed about visible delays? Most likely users on such machines would also face this problem, so maybe there is some inconsistency? Is there anything in the logs we provided that might explain why there is such a delay, or do we need the requested performance profile for it?

Further, I'm not sure if we should actually change this value for tests. It would clearly prevent us from finding such situations.

Flags: needinfo?(rjesup)

It's rare to need to spawn new processes quickly on the web. In normal operation, we keep 3 processes in a cache available; we refill the cache whenever we get Idle time during the load. Even with sites like CNN or fandom which have 12-20+ processes to render (lots of iframes) I found that even a slow old reference laptop never or rarely ran out of preallocated processes -- because sites don't need those 12+ processes all at the same moment.
We only hit the 1s delay if we're asked to create a new process while we have a process in the middle of starting that hasn't finished getting ready yet; starting another process while we're waiting for the previous one to finish starting is generally a good way to delay the already-started process.
That said, if this really is happening in non-test situations, we could either reduce the delay, or we could remember we wanted to spawn a process and start that when (or sooner after) the previous process finishes starting.
If you can figure out why we're spawning processes fast enough to hit this, or get more detailed logs or an RR trace, I could look at this more. If we have any evidence this is happening in the field in any quantity, we certainly could try to minimize it. Perhaps add telemetry for the delay getting hit? I'm not sure if we'd want a count; the assumption is that most machines would never hit it, but some (slow) machines might hit it more often, and a count would help there.

Flags: needinfo?(rjesup)

Thanks for all the feedback!

The issue we are having with this test suite is not only a 1s delay; it seems to take between 4 and 5 seconds to create the process for a new tab.
In the log I pasted in comment 13, we can see the following 6 lines of logs repeated 4 times (each time with a 1 second gap):

2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000

I can try to reduce the delay, but I suspect the issue is that we can't start the process and so we keep waiting for 1 second.
You mentioned that we could hit the 1s delay if we already are waiting for a process to finish starting, so I'm guessing we have a process on those machines which is really slow to start and which makes subsequent process creation slow?

get more detailed logs

Do you know what kind of logs would be helpful here?

Flags: needinfo?(rjesup)

Bug 1894695 reports a similar issue and that it started between 124.0a1 and 127.0a1. But I'm not sure if they were really using an ARM worker at the time when it started. So I've requested more information.

Given that this doesn't seem to be WebDriver related but a core issue with process creation I'm going to move this bug to the DOM content processes component.

Severity: S3 → --
Points: 3 → ---
Component: WebDriver BiDi → DOM: Content Processes
Priority: P2 → --
Product: Remote Protocol → Core
Summary: browsingContext.create is slow on macos arm64 runners on GitHub actions → Creation of content processes is slow on macos arm64 runners on GitHub actions
Duplicate of this bug: 1894695

This issue might actually hit more folks than we thought. As bug 1894695 comment 7 states, GitHub updated the macos-latest definition and it points to the ARM machines now. If users don't hard-code a specific version like macos-12, they will all notice this delay in process creation.

Julian, were you already able to create a Gecko profile of a single affected Puppeteer test? If you have troubles please let me know and I can try.

Flags: needinfo?(jdescottes)

Note also that there are apparently no Intel-based macos-14 machines:

So not only do you have to hardcode, you have to hardcode an older version :-(

Barry, if you have an easy way to reproduce, could you run a single test with the following environment variables set for Firefox? I would appreciate it.

  • MOZ_PROFILER_STARTUP_FILTERS="*"
  • MOZ_PROFILER_STARTUP="1"
  • MOZ_PROFILER_SHUTDOWN="%path_to_json_file%"

Please note that the Firefox process needs to be shut down normally, which means you cannot kill it. Otherwise the profile will not be written to the given path. Thanks!
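Put together as a shell sketch for a CI step (the output path is illustrative):

```shell
# Sketch: profiler environment for a CI job step; the JSON output
# path is an assumption and should point somewhere the job can upload.
export MOZ_PROFILER_STARTUP="1"
export MOZ_PROFILER_STARTUP_FILTERS="*"
export MOZ_PROFILER_SHUTDOWN="$PWD/firefox-profile.json"
echo "profiler enabled, writing to $MOZ_PROFILER_SHUTDOWN"
```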

Weirdly I can't repro this on my M1 Mac even though it's ARM64 too!

The only way I've got to reproduce it is with this workflow:
https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.

(In reply to Barry Pollard from comment #27)

The only way I've got to reproduce it is with this workflow:
https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.

Thanks a lot for creating such a workflow and the tooling to offer such a profile for download! I've picked the generated JSON file from this workflow and uploaded it to https://profiler.firefox.com/.

Here is the shared profile: https://share.firefox.dev/4bFO4vT

Randell or Olli, do you see something strange going on in there? Here is a filtered view for process markers where, at least at the beginning, two Blocked preallocation markers can be seen, each with 5s, but probably it's not the only reason given that further hangs happen afterwards.

Flags: needinfo?(jdescottes) → needinfo?(smaug)

What was done to create that profile? Why are there so many content processes? And why aren't those processes really doing anything?
They are sampled but don't seem to execute much code.
Is the testing framework possibly not sending the urls to load soon enough or what?

I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.

Flags: needinfo?(smaug)

I took a quick peek at the profile and noticed 4619ms of Recv Thread Latency for IPC message, PContent::Msg_FlushFOGData
https://share.firefox.dev/4bKHxjD
But it's not clear where the latency is coming from.

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #29)

What was done to create that profile?

As usual, the normal profiler environment variables were set.

Why are there so many content processes? And why aren't those processes really doing anything? They are sampled but don't seem to execute much code.

No idea. Maybe that's the reason why it all runs so slowly? Would some extra logs help? I assume setting MOZ_LOG="Process:5" to enable process logs may also give further details as markers?

Is the testing framework possibly not sending the urls to load soon enough or what?

As we said above, the issue is not reproducible locally but only in GitHub Actions for macOS ARM runners. The only delay the test has is after the navigation. But for testing the overall issue specifically, it could potentially be removed.

I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.

That could be done by setting it as a Firefox preference when starting the WebDriver session. Barry, would it be possible for you to make these small modifications and also add the other env variables that I pointed out above? Thanks!

Flags: needinfo?(barry)

(In reply to Barry Pollard from comment #32)

I presume this does what you need: https://github.com/tunetheweb/github-action-test/commit/b9a0ee0ae5e43870bfdefa66a985ee11a554925f

There is a typo in the if condition (browserName === 'fiefox') which caused the required preference not to be added. If you could do a quick follow-up push, I can check the results and share the profile. Thanks.

Oops fixed now. BTW you can fork this repo so you're not holding up waiting for me each time.

https://github.com/tunetheweb/github-action-test/actions/runs/9365810252

Sorry, I was kinda busy this morning, but if needed I could do that. Thanks for jumping in and helping! It's appreciated.

Here is one example of a navigation that takes more than 5s because there is a Blocked preallocation happening: https://share.firefox.dev/4bJpVop

With the profile overview and filtered by the Process marker, it looks like processes launched as "immediate" seem to be blocked, while normal process launches are fine?

Olli, is this additional information helpful for further investigating the issue?

Flags: needinfo?(smaug)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #35)

With the profile overview and filtered by the Process marker it looks like that processes to be launched as immediate seems to be blocked, but normal process launches are fine?

The content processes that take ~5s to launch are stuck inside nsAppShell::Init() (see https://share.firefox.dev/4bH2rAn) doing "mac-things" related to NSColorGetUserAccentColor, so the parent process gets the notification of the process being successfully launched only late. There is a chance that nsAppShell::Init() is doing too many things synchronously that could better be done later?

Component: DOM: Content Processes → Widget: Cocoa
Flags: needinfo?(smaug)
Severity: -- → S3
Priority: -- → P3

Thanks Jens! That opens up the question of whether there is any way to figure out what exactly is causing this delay in NSColorGetUserAccentColor. Stephen, is there any way with normal MOZ_LOG logging to get more details here? I assume that widget:5 might not help, or would it?

Also, regarding the prioritization, I want to note that this is an issue seen by everyone using the new macOS ARM runners for GitHub Actions. While for now there is a chance to fall back to older macOS releases (the most recent is macOS 13), that might no longer be possible once those Intel-based runners are no longer supported. Each and every test that uses Firefox will see a drastic increase in test duration. IMHO we should find the underlying issue sooner rather than later to continue allowing users to have a good testing experience. Otherwise we might lose test coverage or, as the FCP example shows, get completely wrong test results.

Flags: needinfo?(rjesup) → needinfo?(spohl.mozilla.bugs)

So we got a reply: https://github.com/actions/runner-images/issues/9974#issuecomment-2150347670

In summary this is:

Regarding configuration I may say that it's not a private info: runner is a virtual machine based on Veertu Anka with the following set of resources

If we know what might cause it, they offered to work around it in their environment.

I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and figure out how to get that out of the GitHub worker as an artifact or something.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #39)

I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and figure out how to get that out of the GitHub worker as an artifact or something.

I tried to run xctrace as a background task during the job, but it fails to record the trace when Firefox starts running while the trace is still in progress.
https://github.com/juliandescottes/github-action-test/blob/main/.github/workflows/tests.yml#L37

My background task is:

sleep 50 && xctrace record --template 'Time Profiler' --all-processes --time-limit 5s --output /tmp/xctrace.trace &

And I play with the sleep delay + time-limit to try to match Firefox running the tests. If the trace is recording before Firefox starts I manage to record a valid trace (at least judging by the size of it). But if it overlaps with Firefox running (which is the goal), then the trace fails and only contains a RunIssues.storedata file.

I can attach a failed trace here in case it helps, but maybe I'm just doing it wrong.

Flags: needinfo?(jmuizelaar)

I got an xctrace profile here: https://github.com/jrmuizel/github-action-test/actions/runs/9421839252/

It looks like it wasn't working because xctrace was still writing out the profile when we uploaded it.

I used while pgrep -x "xctrace" > /dev/null; do sleep 2; echo waiting for trace; done and killall -SIGINT xctrace to stop xctrace
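The stop-and-wait sequence can be sketched as a single step (this assumes xctrace was started earlier in the job; it is harmless if it was not):

```shell
# Ask xctrace to stop recording, then wait until it has finished
# writing the .trace bundle before the artifact upload step runs.
killall -SIGINT xctrace 2>/dev/null || true
while pgrep -x "xctrace" > /dev/null; do
  sleep 2
  echo "waiting for trace"
done
echo "trace finished"
```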

Flags: needinfo?(jmuizelaar)

Markus and I looked at the xctrace profile and nothing stood out as incriminating.

The content process is always waiting 5 seconds so I think it's more likely that this is timing out instead of waiting for some cpu intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does we can turn on sandbox logging to see if there's something that looks incriminating.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #43)

The content process is always waiting 5 seconds, so I think it's more likely that this is a timeout rather than waiting for some CPU-intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does, we can turn on sandbox logging to see if there's something that looks incriminating.

Julian, could you try this suggestion? It would be good to know if it is related to sandboxing and, if so, what exactly the log output might show us.

Flags: needinfo?(jdescottes)
Attached file logs.txt
Disabling the sandbox fixes the issue: https://share.firefox.dev/4etFZfy

I did another run with sandbox logging enabled, but I can't see anything relevant in the logs

Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)

Flags: needinfo?(jdescottes) → needinfo?(jmuizelaar)

I wonder if it would be helpful to check the different supported sandbox levels on macOS and see which one actually triggers this delay:
https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468

(In reply to Julian Descottes [:jdescottes] from comment #46)

Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)

Ah sorry, I meant the macOS logging using the log program. i.e with sandboxing turned on using log show.

Flags: needinfo?(jmuizelaar)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #47)

I wonder if it would be helpful to check different supported sandbox levels on MacOS and which one actually triggers this delay:
https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468

Good idea. I tried levels 1 and 2 and it still fails. Only fully disabling the sandbox helps.

Attached file log_show 2.txt

This is the output of log show --last 5m. The test starts around 09:01:56, so ~line 24000.
Anything interesting in there?

Flags: needinfo?(jmuizelaar)

Haik, does anything in the log look suspicious? Have you seen a 5 second timeout during startup before?

Flags: needinfo?(haftandilian)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.
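As a minimal sketch of working through the logs, the service name can be pulled out of each denial line with sed; the pipeline below is mine, but the sample line matches the denial format quoted later in this bug.

```shell
# A sandbox denial as it appears in `log show` output (sample taken
# from the format quoted in this bug).
sample='plugin-container(3912) deny(1) mach-lookup com.apple.lsd.modifydb'

# Extract the name of the denied mach service.
echo "$sample" | sed -n 's/.*deny(1) mach-lookup \(.*\)/\1/p'
# prints: com.apple.lsd.modifydb
```

Running the whole log through this (e.g. log show --last 5m | grep Sandbox | sed ... | sort -u) gives the set of services to consider allowing.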

Flags: needinfo?(jmuizelaar)

(here is the Firefox profile corresponding to those logs: https://share.firefox.dev/4be0Twu)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #52)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.

Thanks, I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?

Flags: needinfo?(jmuizelaar)

(In reply to Julian Descottes [:jdescottes] from comment #53)

(here is the Firefox profile corresponding to those logs: https://share.firefox.dev/4be0Twu)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #52)

grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.

I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much stuff from the logs as possible to see if that fixes it. If it does you should be able to narrow down what exactly is being blocked that's causing the problem.

Thanks, I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?

I think you'll need to build a custom Firefox, but maybe Haik knows better.

Flags: needinfo?(jmuizelaar)

Yes, you need to rebuild to make sandbox policy changes with the exception of a few things that are parameterized.

One thing to try that can be set via prefs is to try setting the pref security.sandbox.content.mac.disconnect-windowserver to false with the sandbox enabled which would tell us if it is related to not having a WindowServer connection. Edit: looking at the profile, the NSAppShell::Init() call is stuck in NSColorGetUserAccentColor which might be trying to connect to the WindowServer or something display-related.

Since setting security.sandbox.content.level to 1 did not avoid the problem, it's probably not caused by not having permission to read the filesystem. I suspect it's related to trying to connect to a mach service it doesn't normally connect to in a non-virtualized environment or has a timeout in this environment. These are logged with messages like

plugin-container(3912) deny(1) mach-lookup com.apple.lsd.modifydb

After the WindowServer test, what I would do for next steps is to look for the messages related to denying connection to a service (com.apple.lsd.modifydb in the above example) and allow that service in the content process policy with an allow rule by adding it to SandboxPolicyContent.h here, making sure to get the parentheses right. We might be able to find a single service connection attempt that is causing the delay.
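For reference, an allow rule for a single mach service in the SBPL policy would look roughly like this (a sketch following the style of the existing rules in SandboxPolicyContent.h, using the service name from the example above):

```lisp
(allow mach-lookup
  (global-name "com.apple.lsd.modifydb"))
```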

Do we have a method for running this VM type on a developer Mac?

Flags: needinfo?(haftandilian)

Same issue with security.sandbox.content.mac.disconnect-windowserver set to false: https://share.firefox.dev/3Vtc5zE

I will attach the logs.

Do we have a method for running this VM type on a developer Mac?

I didn't find the images for the VMs, but I asked on https://github.com/actions/runner-images/issues/9974 (maybe I should file a new issue since this one has been closed for some time)

I was able to reproduce locally on a macOS (12 and also 15 Beta) VM. I'm working on debugging the delay.

After doing some trial and error testing on the macOS 15 Beta VM, I found that adding allowances in the content process sandbox for the following iokit properties avoids the 5-second delay in nsAppShell::Init(). I'll get a patch posted to add these back after doing some more testing. I'll also try to find out more about the security impact of the properties.

plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
Assignee: nobody → haftandilian

Add access to additional iokit properties in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.

After some more debugging, I've found that all the properties listed in comment 59 (product-id, housing-color, IORegistryEntryPropertyKeys, ean-storage-present, and syscfg-v2-data) are read within the call to [NSApplication init] from nsAppShell::Init(). AppKit`__NSUserAccentColorGetHardwareAccentColorName_block_invoke() is the lowest AppKit function on the stack. I set breakpoints on IOKit`IORegistryEntryCreateCFProperty to catch this. In the VM, the only properties actually present are product-id and ean-storage-present, as checked by running ioreg. The Safari renderer sandbox also allows reads of all these properties.
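As a sketch, the allowances for the denied properties from comment 59 would take a form like the following in the SBPL policy (the actual patch may group or name these differently):

```lisp
(allow iokit-get-properties
  (iokit-property "product-id")
  (iokit-property "housing-color")
  (iokit-property "IORegistryEntryPropertyKeys")
  (iokit-property "ean-storage-present")
  (iokit-property "syscfg-v2-data"))
```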

Attachment #9408733 - Attachment description: WIP: Bug 1893921 - Creation of content processes is slow on macos arm64 runners on GitHub actions → Bug 1893921 - Creation of content processes is slow on macos arm64 runners on GitHub actions r?spohl
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/59eba805400c
Creation of content processes is slow on macos arm64 runners on GitHub actions r=spohl
Status: NEW → RESOLVED
Closed: 28 days ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch

Julian, would you mind testing in GitHub CI if the issue is fixed for Puppeteer and the other test repository? Thanks.

Component: Widget: Cocoa → Security: Process Sandboxing
Flags: needinfo?(jdescottes)

Unfortunately, it doesn't seem to help on the puppeteer issue https://github.com/puppeteer/puppeteer/actions/runs/9640423521/job/26584204278?pr=12328

For the github-action-test example, I'm trying to make sure that we use the latest Nightly, but it doesn't seem to help either.
We still seem blocked on the same call to __NSUserAccentColorGetHardwareAccentColorName_block_invoke https://share.firefox.dev/4eCgfOf

Flags: needinfo?(jdescottes)
Flags: needinfo?(haftandilian)

@Julian, would you be able to get me logs again like on comment 50? Could you also report the build ID so we can make sure the fix is there?

I'll retest on the VM I'm using. I was testing with a QEMU based VM using the macOS 15 beta.

Flags: needinfo?(haftandilian) → needinfo?(jdescottes)
Attached file logs_20240624.txt
Sure! 

- buildID is 20240624094028
- logs attached to this message 
- profile at https://share.firefox.dev/4ccD9u9
Flags: needinfo?(jdescottes) → needinfo?(haftandilian)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

Flags: needinfo?(haftandilian) → needinfo?(jdescottes)

After switching to a macOS 14.5 VM, I can reproduce the delay with the first fix. I'll request backout and re-land a complete fix.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 129 Branch → ---

(In reply to Haik Aftandilian [:haik] from comment #68)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce now, I guess we will reland the complete fix and test with Nightly?

Flags: needinfo?(jdescottes)

Add access to additional iokit properties and sysctls in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.

Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/349060b3997f
Creation of content processes is slow on macos arm64 runners on GitHub actions r=spohl
Status: REOPENED → RESOLVED
Closed: 28 days ago → 24 days ago
Resolution: --- → FIXED
Target Milestone: --- → 129 Branch
Attachment #9408733 - Attachment is obsolete: true

(In reply to Julian Descottes [:jdescottes] from comment #71)

(In reply to Haik Aftandilian [:haik] from comment #68)

@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce now, I guess we will reland the complete fix and test with Nightly?

Ideally, we would verify the fix on the github runner first, but given how that seems difficult and the behavior of my 14.5 VM appears to match the logs from the github runner, I landed the updated fix.

And I can confirm the issue is fixed on my test repository.
Thanks a lot for the help here Haik!

(In reply to Julian Descottes [:jdescottes] from comment #76)

And I can confirm the issue is fixed on my test repository.
Thanks a lot for the help here Haik!

Awesome, thanks! Kudos to everyone who helped debug this down to the sandboxing level.

I have not been able to reproduce the problem with other process sandboxes (socket, RDD, GMP, and utility) in the 14.5 VM.
