A first-order, rough-order-of-magnitude analysis is the first step.
It takes (nominally) one instruction to access one word of local memory. Assume it takes 1000 instructions to format a TCP/IP packet and queue it for transmission, and another 1000 instructions at the far end to receive, dequeue, and interpret it. Since an "access" requires two packets, one in each direction, you're talking a minimum of 4000 instructions per access.
At that point, your 4 GHz Intel screamer is, on non-local accesses, performing about like a 4 MHz 8086.
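The arithmetic above can be sketched in a few lines of Python. The constants are the article's stated assumptions (1000 instructions per send, 1000 per receive, two packets per access), not measurements:

```python
# Back-of-envelope estimate of remote-access cost.
# All constants are assumptions from the text, not measured values.

SEND_OVERHEAD = 1000        # instructions to format a TCP/IP packet and queue it
RECEIVE_OVERHEAD = 1000     # instructions to receive, dequeue, and interpret it
PACKETS_PER_ACCESS = 2      # request out, reply back

instructions_per_remote_access = PACKETS_PER_ACCESS * (SEND_OVERHEAD + RECEIVE_OVERHEAD)
print(instructions_per_remote_access)   # 4000

# A 4 GHz machine spending 4000 instructions per remote "access" behaves,
# on those accesses, like a one-instruction-per-access machine clocked at:
cpu_hz = 4e9
effective_hz = cpu_hz / instructions_per_remote_access
print(effective_hz / 1e6, "MHz")        # 1.0 MHz -- 8086 territory
```

The 4 MHz 8086 comparison is deliberately loose: the effective rate works out to about 1 MHz of single-instruction accesses, and a real 8086 took several cycles per instruction, so the two land in the same order of magnitude.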
That's before you consider network overhead. Assume 100 physical machines ("instances") and 16-port routers. You need 8 routers: one at the center of a star, connected to the other seven, with each of those seven connected to 14 or 15 machines (one port on each leaf router is used for the uplink). Assuming that non-local accesses are uniformly distributed, each non-local access takes three router hops: one to the local router, one to the star center, one back down to a remote leaf router. (Because you assume uniform distribution, only about 1/7 of your accesses will be to a machine sharing your local router. For a first-order analysis, you can ignore that lucky break.) Because you have to go there and back, you're talking 6 router hops and 4000 instructions, NOT INCLUDING ROUTER OPERATIONS, to do one non-local access.
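The topology math follows the same pattern. A short sketch, using the assumed 100 machines and 16-port routers from the text:

```python
import math

MACHINES = 100
PORTS_PER_ROUTER = 16

# Star of routers: one center router, plus leaf routers that each
# spend one port on the uplink and the rest on machines.
machine_ports_per_leaf = PORTS_PER_ROUTER - 1                 # 15
leaf_routers = math.ceil(MACHINES / machine_ports_per_leaf)   # 7
total_routers = leaf_routers + 1                              # 8, as in the text

# A non-local access to a machine under a different leaf router:
hops_one_way = 3                       # local leaf -> center -> remote leaf
hops_round_trip = 2 * hops_one_way     # 6 router hops per access

# The "lucky break": roughly 1/7 of uniformly distributed remote
# accesses stay under your own leaf router and skip the center.
same_leaf_fraction = 1 / leaf_routers
```

Note the ceiling division: 100 machines over 15 usable ports per leaf gives 7 leaf routers, and the center router makes 8.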
First-order rough-order-of-magnitude analysis would seem to indicate that your application will die of old age waiting on packets in the router queues.