This is a long shot, but perhaps someone who knows the internal workings of Sysinternals' Process Monitor has an idea.

Recently we've had a very murky problem at work. We have a piece of software (call it SW1) that opens a socket connection on a particular port to another piece of software (call it SW2) and receives some data from it. SW1 then opens a second socket connection to another of its own processes and sends it some data, after which the cycle restarts and SW1 receives more data from SW2.

This is a very vague description, and I have nothing to do with either application; however, as the owner of the workstations I've been heavily involved in support. The whole system worked without a hitch on one particular workstation but refused to work on four other identical workstations. The symptom was a sudden halt of packets being sent between SW1's own two processes, naturally followed by a timeout in SW2.

Now for the wacky bit: after weeks of debugging with the relevant teams and running Wireshark, I decided to run Process Monitor in case something showed up. Weirdly enough, the socket connections remained established and the whole thing worked! Thinking it was a coincidence, we tried running Process Monitor on the other three, and they all started working. It also looks like the applications keep working even after rebooting everything.

Of course the question remains: what impact could Process Monitor possibly have on these applications? Due to the nature of the solution I can't really analyse a procmon capture since it seems to be solving the issue...

Thanks!

1 Answer

It sounds like a race condition or a deadlock.

For example, SW1 and SW2 presumably share a communication protocol with requests and acks. If this protocol is not well designed, there can be a race condition in which packets are not sent in the correct order. SW1 gets stuck waiting for a packet that SW2 has already sent (and SW1 missed); SW2 is not going to send it again, so SW1 ends up in a locked state.

If this is the case, the failure depends on the execution speed of SW1 and SW2, and furthermore on the load on the servers. If both processes are executing slowly, SW1 is less likely to miss the packet from SW2 that creates the locked state. Running Process Monitor slows the whole system down slightly, which might be enough to make this work.

As for the different servers: if the first server carries more load than the others, that would explain why it works there.
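
To make the timing argument concrete, here is a minimal Python sketch of the kind of flaw I mean. Everything in it is hypothetical: `LossyChannel` and `run` are names invented for illustration, not anything from SW1 or SW2, and the channel deliberately drops a message when nobody is waiting for it. A plain TCP socket would buffer unread data, so this only models a flaw in the application's own hand-off logic.

```python
import threading
import time


class LossyChannel:
    """Hypothetical channel: a message is dropped unless a receiver is already waiting."""

    def __init__(self):
        self._cond = threading.Condition()
        self._message = None
        self._receiver_waiting = False

    def send(self, message):
        with self._cond:
            if self._receiver_waiting:
                self._message = message
                self._cond.notify()
            # Otherwise the message is silently lost (the assumed design flaw).

    def receive(self, timeout):
        with self._cond:
            self._receiver_waiting = True
            # Returns as soon as a message arrives, or gives up after `timeout` seconds.
            self._cond.wait_for(lambda: self._message is not None, timeout)
            self._receiver_waiting = False
            message, self._message = self._message, None
            return message


def run(sender_delay, receiver_delay, timeout=0.5):
    """Start a sender and a receiver with the given start-up delays (in seconds).

    Returns the received message, or None if the receiver timed out.
    """
    channel = LossyChannel()
    result = {}

    def sender():
        time.sleep(sender_delay)
        channel.send("data")

    def receiver():
        time.sleep(receiver_delay)
        result["message"] = channel.receive(timeout)

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result["message"]


if __name__ == "__main__":
    # Fast sender, slow receiver: the message is sent before anyone listens.
    print("fast sender:", run(sender_delay=0.0, receiver_delay=0.2))
    # Slowed-down sender: the receiver is already waiting when the message arrives.
    print("slow sender:", run(sender_delay=0.2, receiver_delay=0.0))
```

In the first call the receiver times out with nothing, and in the second it gets the message. The only difference between the two runs is timing, which is why extra load on a machine could mask a bug like this.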

  • Interesting concept; however, from the Wireshark analysis the packets seem identical in sequence, up until the point where it stops working, of course. Also, the most important bit, which I'll edit into my question: I don't need to run Procmon each time for the application to work. If I reboot the workstations, everything still works! It seems like a one-time modification...
    – lcam
    Commented Aug 22, 2017 at 11:45
  • I acknowledge that a deadlock is a remote possibility, but it's still feasible. It is not the order of the packets that is critical for producing the deadlock, but the timing. If SW2 replies very quickly and SW1 is not well designed, SW1 may miss the packet and get stuck waiting for it. It's like arriving at the bus stop just on time and finding nobody waiting: you don't know whether the bus is still to come or has already gone. Check the timing of the packets from the different servers, or try running any software other than Procmon that creates a heavy load on the server to see if you get the same effect. Commented Aug 22, 2017 at 12:20
  • Thanks for the explanation, piedramania. I am now looking into replicating the problem on an offline workstation so I can experiment with the issue. I'll give it a go. Thanks!
    – lcam
    Commented Aug 22, 2017 at 12:23
  • This explanation does not hold if everything now works and keeps on working without Process Monitor.
    – harrymc
    Commented Aug 22, 2017 at 12:24
