.NET Systems Programming Learned the Hard Way
- 2. About Me
• .NET developer since 2005 (college internship)
• Built large-scale SaaS on top of .NET
• Creator and maintainer of Akka.NET since 2013
  • Canonical actor model implementation in .NET
  • Highly concurrent, low-latency, and distributed
  • Used to build mission-critical, real-time applications
• Performance is a feature
- 5. GC Generations
The higher the generation, the more expensive the GC:
• Memory is more fragmented (access is random, not contiguous)
• Compaction takes longer (bigger gaps, more objects to move, longer GC pauses)
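A quick way to see promotion through the generations is GC.GetGeneration plus induced collections — a minimal sketch (exact promotion timing can vary with GC configuration):

```csharp
using System;

class GenerationDemo
{
    static void Main()
    {
        var obj = new object();
        Console.WriteLine(GC.GetGeneration(obj)); // fresh allocations start in Gen 0

        GC.Collect(); // survivors of a collection are promoted
        Console.WriteLine(GC.GetGeneration(obj)); // typically 1

        GC.Collect();
        Console.WriteLine(GC.GetGeneration(obj)); // typically 2 - now only reclaimed by expensive full GCs

        GC.KeepAlive(obj); // keep the object rooted through the collections above
    }
}
```

Once an object reaches Gen 2 it is only examined during full collections, which is exactly why the next slide's advice is to either die young (Gen 0/1) or live forever (Gen 2).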
- 6. .NET Memory Model

private readonly Random myRandom = Random.Shared;

private void DoThing()
{
    var i = myRandom.Next();
    var j = myRandom.Next(i);
    var b = i + j;
    var str = b.ToString();
    Console.WriteLine(str);
}

Stack:
0xAEDC DoThing_vtable
0xFFBD ref(Random.Shared)
0x11CD i = 10
0x11CE j = 5
0x11CF b = 15
0xADDE ref(string)

Managed Heap:
0xAEDC class Thing_DoThing mthd
0xFFBD Random.Shared [1024b]
…
0xADDE string "15"
- 7. GC Considerations
• If you can: keep allocations in Gen 0 / 1
  • Value types (no GC)
  • Less memory fragmentation, compaction
  • Less impact on latency, throughput
• If you can't: keep Gen 2 objects in Gen 2 forever
  • No GC if they're still rooted!
- 8. GC Practice: Object Pools
• Microsoft.Extensions.ObjectPool<T> – great option for long-lived Gen 2 objects
• Best candidates are "reusable" types
  • StringBuilder
  • byte[] (there are separate MemoryPool types for this)
• Use a pre-allocated object, return it to the pool upon completion
• Doesn't cause allocations so long as pool capacity isn't exceeded
- 9. GC Practice: Object Pools
StringBuilder sb = null;
try
{
sb = _sbPool.Get();
using (var tw = new StringWriter(sb, CultureInfo.InvariantCulture))
{
var ser = JsonSerializer.CreateDefault(Settings);
ser.Formatting = Formatting.None;
using (var jw = new JsonTextWriter(tw))
{
ser.Serialize(jw, obj);
}
return Encoding.UTF8.GetBytes(tw.ToString());
}
}
finally
{
if (sb != null)
{
_sbPool.Return(sb);
}
}
Rent an instance from the
ObjectPool<StringBuilder>
Do our work
Return to the pool
- 10. GC Practice: Object Pools
• Pooling StringBuilder inside Newtonsoft.Json:
  • ~30% memory savings, eliminated 100% of Gen 1 GC
  • ~28% throughput improvement in concurrent use cases
- 12. Workstation GC vs. Server GC

<Project Sdk="Microsoft.NET.Sdk">
  <Import Project="..\..\common.props" />
  <PropertyGroup>
    <Description>Akka.Remote x-plat performance benchmark</Description>
    <Copyright>Copyright (c) Akka.NET Team</Copyright>
    <AssemblyTitle>RemotePingPong</AssemblyTitle>
    <AssemblyName>RemotePingPong</AssemblyName>
    <Authors>Akka.NET Team</Authors>
    <TargetFrameworks>$(NetFrameworkTestVersion);$(NetTestVersion);$(NetCoreTestVersion)</TargetFrameworks>
    <OutputType>Exe</OutputType>
  </PropertyGroup>
  <PropertyGroup>
    <PlatformTarget>x64</PlatformTarget>
  </PropertyGroup>
  <PropertyGroup>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>
  <ItemGroup>
    <ProjectReference Include="..\..\core\Akka.Remote\Akka.Remote.csproj" />
  </ItemGroup>
</Project>

Not enabled by default!
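For deployed apps, the same switch can be flipped without rebuilding — Server GC is also exposed as a documented runtime configuration knob in [appname].runtimeconfig.json:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true
    }
  }
}
```

The `DOTNET_gcServer=1` environment variable is the equivalent toggle for environments where editing config files isn't practical.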
- 15. Allocations: Delegates and Closures

/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
    try
    {
        if (!IsClosed()) // Volatile read, needed here
        {
            Actor.UseThreadContext(() =>
            {
                ProcessAllSystemMessages(); // First, deal with any system messages
                ProcessMailbox(); // Then deal with messages
            });
        }
    }
    finally
    {
        SetAsIdle(); // Volatile write, needed here
        Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
    }
}

Critical path of actor msg processing. The lambda closes over 'this' and allocates a delegate each time.
- 16. Eliminate Delegate: Inlining

/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
    try
    {
        if (!IsClosed()) // Volatile read, needed here
        {
            var tmp = InternalCurrentActorCellKeeper.Current;
            InternalCurrentActorCellKeeper.Current = Actor;
            try
            {
                ProcessAllSystemMessages(); // First, deal with any system messages
                ProcessMailbox(); // Then deal with messages
            }
            finally
            {
                // ensure we set back the old context
                InternalCurrentActorCellKeeper.Current = tmp;
            }
        }
    }
    finally
    {
        SetAsIdle(); // Volatile write, needed here
        Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
    }
}

Eliminate the delegate by inlining the function.
From 21kb & 203kb to ~1kb.
Throughput improvement of ~10%.
- 17. Other Delegate Allocation Removal Methods
• C# 9: declare `static` lambdas (compiler forbids capturing `this` or locals)
• Cache delegates / use expression compiler
• Value delegates
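A minimal sketch combining the first two techniques (the type and method names here are illustrative, not from Akka.NET): a `static` lambda cannot capture state, and storing it in a `static readonly` field means the delegate is allocated once instead of on every call.

```csharp
using System;

public class Worker
{
    // Cached delegate: allocated once for the lifetime of the process.
    // 'static' on the lambda makes capturing 'this' or locals a compile error.
    private static readonly Func<int, int> Square = static x => x * x;

    public int Apply(int value)
    {
        // No closure or delegate allocation on this call path
        return Square(value);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(new Worker().Apply(5)); // prints 25
    }
}
```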
- 18. Value Delegates

private readonly struct RequestWorkerTask : IRunnable
{
    private readonly DedicatedThreadPoolTaskScheduler _scheduler;

    public RequestWorkerTask(DedicatedThreadPoolTaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public void Run()
    {
        // do work
    }
}

private void RequestWorker()
{
    _pool.QueueUserWorkItem(new RequestWorkerTask(this));
}

Implement our "delegate interface" using a value type. Runs just the same as a reference type. Execute the work (might cause a boxing allocation!)
- 19. Allocations: "Empty" Collections

public State(TS stateName, TD stateData, TimeSpan? timeout = null, Reason stopReason = null, IReadOnlyList<object> replies = null, bool notifies = true)
{
    Replies = replies ?? new List<object>();
    StopReason = stopReason;
    Timeout = timeout;
    StateData = stateData;
    StateName = stateName;
    Notifies = notifies;
}

Allocates a new, empty List<object> (32 bytes) every time replies is null. Suspicious…
- 20. Allocations: "Empty" Collections

public State(TS stateName, TD stateData, TimeSpan? timeout = null, Reason stopReason = null, IReadOnlyList<object> replies = null, bool notifies = true)
{
    Replies = replies ?? Array.Empty<object>();
    StopReason = stopReason;
    Timeout = timeout;
    StateData = stateData;
    StateName = stateName;
    Notifies = notifies;
}

Returns a cached, empty, non-null collection — no allocation.
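Array.Empty<T>() works because the runtime caches one empty array per element type — every call returns the same instance. A quick sketch to verify the difference:

```csharp
using System;

public static class EmptyDemo
{
    public static void Main()
    {
        // Array.Empty<T>() hands back the same cached instance every time
        var a = Array.Empty<object>();
        var b = Array.Empty<object>();
        Console.WriteLine(ReferenceEquals(a, b)); // prints True

        // ...whereas 'new' allocates a fresh object on every call
        Console.WriteLine(ReferenceEquals(new object[0], new object[0])); // prints False
    }
}
```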
- 22. Reference Type: FSM Events

public sealed class Event<TD> : INoSerializationVerificationNeeded
{
    public Event(object fsmEvent, TD stateData)
    {
        StateData = stateData;
        FsmEvent = fsmEvent;
    }

    public object FsmEvent { get; }
    public TD StateData { get; }

    public override string ToString()
    {
        return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
    }
}

We allocate millions of these per second in busy networks.
- 23. Value Type: FSM Events

public readonly struct Event<TD> : INoSerializationVerificationNeeded
{
    public Event(object fsmEvent, TD stateData)
    {
        StateData = stateData;
        FsmEvent = fsmEvent;
    }

    public object FsmEvent { get; }
    public TD StateData { get; }

    public override string ToString()
    {
        return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
    }
}

Change to a value type: reduction of ~30mb, minor throughput improvement.
- 24. Value Types: Boxing Allocations
• Boxing occurs implicitly – when a struct is cast to object
  • The struct is wrapped in an object and placed on the managed heap
• Unboxing happens explicitly – when the object is cast back to its associated value type
• Can create a lot of allocations!

StateName is usually an enum (value type) – is the object.Equals call boxing?
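A small illustration of the problem (the enum name here is hypothetical): calling Equals(object) on an enum boxes both operands, because the method inherited from System.Enum takes object; EqualityComparer<T>.Default compares the values without ever casting to object.

```csharp
using System;
using System.Collections.Generic;

public enum ConnectionState { Open, Closed }

public static class BoxingDemo
{
    public static void Main()
    {
        var a = ConnectionState.Open;
        var b = ConnectionState.Open;

        // Boxes both enum values onto the heap before comparing
        bool boxed = a.Equals(b);

        // No boxing: the comparer is specialized for the enum type
        bool unboxed = EqualityComparer<ConnectionState>.Default.Equals(a, b);

        Console.WriteLine(boxed && unboxed); // prints True
    }
}
```

This is exactly the shape of the fix shown on the next slide's callsite.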
- 25. Value Types: Boxing Allocations

// avoid boxing
if (!EqualityComparer<TState>.Default.Equals(_currentState.StateName, nextState.StateName) || nextState.Notifies)
{
    _nextState = nextState;
    HandleTransition(_currentState.StateName, nextState.StateName);
    Listeners.Gossip(new Transition<TState>(Self, _currentState.StateName, nextState.StateName));
    _nextState = default;
}

Used the generic comparer to avoid casting value types into object – removed 100% of boxing allocations at this callsite.
- 26. Value Type: Message Envelope

/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public readonly struct Envelope
{
    public Envelope(object message, IActorRef sender)
    {
        Message = message;
        Sender = sender;
    }

    public IActorRef Sender { get; }
    public object Message { get; }
}

Used millions of times per second in Akka.NET. readonly struct? Value type? Should be "zero allocations"
- 27. Reference Type: Message Envelope

/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public sealed class Envelope
{
    public Envelope(object message, IActorRef sender)
    {
        Message = message;
        Sender = sender;
    }

    public IActorRef Sender { get; }
    public object Message { get; }
}

What if we change to a reference type? Will this reduce allocations?

Benchmark results (struct → class):
394kb → 264kb
3.15mb → 2.1mb
215 us → 147 us
1860 us → 1332 us
- 28. Value Type Pitfalls
• Copy-by-value
  • Referencing value types from other scopes requires copying
  • ref parameters can work, but only in narrowly defined contexts
  • Excessive copying can be more expensive than allocating a reference type
• Use reference types when semantics are "referential"
• Value types are not magic – they work best in "tight" scopes
• Use the right tool for the job
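A sketch of the copy-by-value pitfall (the struct here is illustrative): passing a large struct by value copies every field on each call, while `in` passes a read-only reference — one of the "narrowly defined contexts" where ref-style parameters help.

```csharp
using System;

// Deliberately large struct: 4 longs = 32 bytes copied on every by-value call
public readonly struct BigValue
{
    public readonly long A, B, C, D;
    public BigValue(long a, long b, long c, long d) { A = a; B = b; C = c; D = d; }
}

public static class CopyDemo
{
    // Copies all 32 bytes of the struct onto the callee's frame
    public static long SumByValue(BigValue v) => v.A + v.B + v.C + v.D;

    // 'in' passes a read-only reference: no copy, but the struct must not be mutated
    public static long SumByIn(in BigValue v) => v.A + v.B + v.C + v.D;

    public static void Main()
    {
        var big = new BigValue(1, 2, 3, 4);
        Console.WriteLine(SumByValue(big)); // prints 10
        Console.WriteLine(SumByIn(in big)); // prints 10
    }
}
```

Making the struct `readonly` matters with `in`: for non-readonly structs the compiler emits defensive copies on member access, which can silently undo the savings.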
- 30. Reference Type: Message Envelope
• What happens when we benchmark with significantly increased cross-thread message traffic?
• Now if we convert Envelope back into a struct again…
• Thread access makes a difference!
- 32. ThreadStatic and ThreadLocal<T>
• Allocates objects directly into thread-local storage
• Objects stay there and are available each time the thread is used
• Ideal for caching and pooling
  • No synchronization
  • Data and work all performed adjacent to stack memory
• Downside: thread-local data structures aren't synchronized
• Variants!
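A minimal sketch of both variants (names are illustrative): `[ThreadStatic]` gives each thread its own copy of a static field but does not run field initializers per thread, so callers must initialize lazily; ThreadLocal<T> wraps that pattern with a per-thread factory.

```csharp
using System;
using System.Text;
using System.Threading;

public static class ThreadLocalDemo
{
    // Each thread sees its own copy. Field initializers run only once,
    // on the first thread - so initialize lazily in code instead.
    [ThreadStatic]
    private static StringBuilder _cachedBuilder;

    // ThreadLocal<T> runs the factory lazily, once per thread
    private static readonly ThreadLocal<StringBuilder> _builder =
        new ThreadLocal<StringBuilder>(() => new StringBuilder(1024));

    public static string Render(int value)
    {
        _cachedBuilder ??= new StringBuilder(1024); // first use on this thread
        var sb = _cachedBuilder;
        sb.Clear(); // reuse the per-thread instance: no allocation per call
        sb.Append("value=").Append(value);
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Render(42)); // prints value=42
        Console.WriteLine(_builder.Value.Capacity); // prints 1024
    }
}
```

No lock is needed anywhere above — each thread only ever touches its own instance, which is the whole appeal.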
- 33. Thread Local Storage & Context Switching
• Reference types passed between threads often age into older generations of GC
• Value types passed between threads are copied (no GC)
• Thread-local state is typically copied into the CPU's L1/L2 cache from memory during execution
• Context switching occurs when threads get scheduled onto different CPUs or work moves onto different threads
- 34. Thread Locality & Context Switching
Each thread gets ~30ms of execution time before yielding.
- 35. Thread Locality & Context Switching
Current quantum is over – time for other threads to have a turn.
- 36. Thread Locality & Context Switching
Context switch! Thread 0 now executing on CPU 1 – memory and state will have to be transferred.
- 37. Context Switching: High Latency Impact

/// <summary>
/// An asynchronous operation will be executed by a <see cref="MessageDispatcher"/>.
/// </summary>
#if NETSTANDARD
public interface IRunnable
#else
public interface IRunnable : IThreadPoolWorkItem
#endif
{
    /// <summary>
    /// Executes the task.
    /// </summary>
    void Run();
}

// use native .NET 6 APIs here to reduce allocations
// preferLocal to help reduce context switching
ThreadPool.UnsafeQueueUserWorkItem(run, true);

IThreadPoolWorkItem interface added in .NET Core 3.0 – avoids delegate allocations for executing on the ThreadPool.
Consume IThreadPoolWorkItem with preferLocal=true – tells the ThreadPool to attempt to reschedule work on the current thread / CPU.
- 39. Thread Locality w/o Context Switching
No context switch – the same thread will have a chance to execute on the same CPU. Might be able to benefit from L1/L2 cache, locality of memory access, etc.
- 40. Data Structures & Synchronization

/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
    private readonly ConcurrentQueue<Envelope> _queue = new ConcurrentQueue<Envelope>();

    /// <inheritdoc cref="IMessageQueue"/>
    public bool HasMessages
    {
        get { return !_queue.IsEmpty; }
    }

    /// <inheritdoc cref="IMessageQueue"/>
    public int Count
    {
        get { return _queue.Count; }
    }
    ….
}

Could, in theory, improve memory performance by replacing with a LinkedList (no array segment allocations from resizing).
- 41. Data Structures & Synchronization

/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
    private readonly object s_lock = new object();
    private readonly LinkedList<Envelope> _linkedList = new LinkedList<Envelope>();

    public bool HasMessages
    {
        get { return Count > 0; }
    }

    public int Count
    {
        get
        {
            lock (s_lock)
            {
                return _linkedList.Count;
            }
        }
    }
    ….

Not a thread-safe data structure, so it has to be synchronized with lock.
Should offer better memory performance than ConcurrentQueue<T>…
Wooooooof 🤮
- 42. Data Structures & Synchronization
• What went wrong there?
• ConcurrentQueue<T> is lock-free
  • Uses volatile reads and atomic compare-and-swap operations, i.e. Interlocked.CompareExchange
  • Significantly less expensive, even on a single thread, than lock
• LinkedList<T> may not be all that memory-efficient
  • Allocates an internal node per insert rather than array blocks
• Better off rolling your own, probably
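To make the lock-free point concrete, here is a minimal compare-and-swap retry loop — a generic sketch of the technique, not ConcurrentQueue<T>'s actual internals:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CasDemo
{
    private static int _counter;

    // Lock-free increment: retry until our compare-and-swap wins the race
    public static void Increment()
    {
        while (true)
        {
            int current = _counter; // snapshot the current value
            // Atomically write current+1 only if _counter still equals 'current';
            // returns the value that was there before the attempt
            if (Interlocked.CompareExchange(ref _counter, current + 1, current) == current)
                return; // our swap won
            // another thread changed _counter first: loop and retry
        }
    }

    public static int Count => Volatile.Read(ref _counter);

    public static void Main()
    {
        Parallel.For(0, 100_000, _ => Increment());
        Console.WriteLine(Count); // prints 100000
    }
}
```

No thread ever blocks: a losing thread just retries, which on an uncontended path costs a single interlocked instruction rather than the kernel transitions a contended lock can incur.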
Editor's Notes
- https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals
https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/large-object-heap LOH threshold is 85,000 bytes
- See Microsoft.Extensions.ObjectPool
- Note: Background GC is enabled for both.
- Examples:
https://particular.net/blog/pipeline-and-closure-allocations
- 32 bytes adds up when you allocate millions of these per-second
https://stackoverflow.com/questions/16131641/memory-usage-of-an-empty-list-or-dictionary
public class List<T> : IList<T>, ICollection<T>, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T>, IEnumerable<T>, IEnumerable
{
    private T[] _items; // 4 bytes for x86, 8 for x64
    private int _size; // 4 bytes
    private int _version; // 4 bytes
    [NonSerialized]
    private object _syncRoot; // 4 bytes for x86, 8 for x64
    private static readonly T[] _emptyArray; // one per type
    private const int _defaultCapacity = 4; // one per type
    ...
}