.NET Systems Programming
Learned the Hard Way
By Aaron Stannard (@Aaronontheweb)
CEO, Petabridge
Creator, Akka.NET
About Me
• .NET developer since 2005 (college
internship)
• Built large-scale SaaS on top of
.NET
• Creator and maintainer of
Akka.NET since 2013
• Canonical actor model
implementation in .NET
• Highly concurrent, low-latency, and
distributed
• Used to build mission-critical real
time applications
• Performance is a feature
Garbage Collection
Perf Loss, GC Modes, and GC Generations
.NET GC
Source: https://www.csharpstar.com/interview-questions-garbage-collection-csharp/
Compacted
memory
GC Generations
The higher the generation, the more expensive the GC:
• Memory is more fragmented (access is random, not contiguous)
• Compaction takes longer (bigger gaps, more stuff to move,
longer GC pauses)
.NET Memory Model
private readonly Random myRandom = Random.Shared;
private void DoThing()
{
var i = myRandom.Next();
var j = myRandom.Next(i);
var b = i + j;
var str = b.ToString();
Console.WriteLine(str);
}
Stack
0xAEDC DoThing_vtable
0xFFBD ref(Random.Shared)
0x11CD i = 10;
0x11CE j = 5;
0x11CF b = 15;
0xADDE ref(string)
Managed Heap
0xAEDC class Thing_DoThing mthd
0xFFBD Random.Shared [1024b]
…
…
…
0xADDE string “15”
GC Considerations
• If you can: keep allocations in Gen 0 / 1
• Value types (no GC)
• Less memory fragmentation, compaction
• Less impact on latency, throughput
• If you can’t: keep Gen 2 objects in Gen 2 forever
• No GC if they’re still rooted!
GC Practice: Object Pools
• Microsoft.Extensions.ObjectPool<T> - great option for
long-lived Gen2 objects
• Best candidates are “reusable” types
• StringBuilder
• byte[] (there are separate MemoryPool types for this)
• Use pre-allocated object, return to pool upon completion
• Doesn’t cause allocations so long as pool capacity isn’t exceeded
GC Practice: Object Pools
StringBuilder sb = null;
try
{
sb = _sbPool.Get();
using (var tw = new StringWriter(sb, CultureInfo.InvariantCulture))
{
var ser = JsonSerializer.CreateDefault(Settings);
ser.Formatting = Formatting.None;
using (var jw = new JsonTextWriter(tw))
{
ser.Serialize(jw, obj);
}
return Encoding.UTF8.GetBytes(tw.ToString());
}
}
finally
{
if (sb != null)
{
_sbPool.Return(sb);
}
}
Rent an instance from the
ObjectPool<StringBuilder>
Do our work
Return to the pool
GC Practice: Object Pools
• Pooling StringBuilder inside Newtonsoft.Json
~30% memory savings,
eliminated 100% of Gen 1
GC
~28% throughput
improvement in concurrent
use cases
.NET GC Modes
Workstation GC vs. Server GC
<Project Sdk="Microsoft.NET.Sdk">
<Import Project="..\..\common.props" />
<PropertyGroup>
<Description>Akka.Remote x-plat performance benchmark</Description>
<Copyright>Copyright (c) Akka.NET Team</Copyright>
<AssemblyTitle>RemotePingPong</AssemblyTitle>
<AssemblyName>RemotePingPong</AssemblyName>
<Authors>Akka.NET Team</Authors>
<TargetFrameworks>$(NetFrameworkTestVersion);$(NetTestVersion);$(NetCoreTestVersion)</TargetFrameworks>
<OutputType>Exe</OutputType>
</PropertyGroup>
<PropertyGroup>
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<PropertyGroup>
<ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
<ItemGroup>
<ProjectReference Include="..\..\core\Akka.Remote\Akka.Remote.csproj" />
</ItemGroup>
</Project>
Not enabled by
default!
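Beyond the MSBuild property shown above, the same switch can be flipped without rebuilding, via the app's runtimeconfig (or the `DOTNET_gcServer=1` environment variable). A minimal sketch of a `runtimeconfig.template.json`:

```json
{
  "configProperties": {
    "System.GC.Server": true,
    "System.GC.Concurrent": true
  }
}
```

This is handy for benchmarking the same binary under both GC modes.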
Workstation GC vs. Server GC
Workstation GC Server GC
Memory Allocations
Hidden Sources of Allocations and Eliminating Them
Allocations: Delegates and Closures
/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
try
{
if (!IsClosed()) // Volatile read, needed here
{
Actor.UseThreadContext(() =>
{
ProcessAllSystemMessages(); // First, deal with any system messages
ProcessMailbox(); // Then deal with messages
});
}
}
finally
{
SetAsIdle(); // Volatile write, needed here
Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
}
}
Critical path of actor msg
processing
Closes over ‘this’, allocates
delegate each time
Eliminate Delegate: Inlining
/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
try
{
if (!IsClosed()) // Volatile read, needed here
{
var tmp = InternalCurrentActorCellKeeper.Current;
InternalCurrentActorCellKeeper.Current = Actor;
try
{
ProcessAllSystemMessages(); // First, deal with any system messages
ProcessMailbox(); // Then deal with messages
}
finally
{
//ensure we set back the old context
InternalCurrentActorCellKeeper.Current = tmp;
}
}
}
finally
{
SetAsIdle(); // Volatile write, needed here
Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
}
}
Eliminate delegate by inlining function
From 21kb & 203kb to ~1kb
Throughput improvement of ~10%
Other Delegate Allocation Removal Methods
• C#9: declare `static` delegates
• Cache delegates / use expression compiler
• ValueDelegates
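A sketch of the first two bullets (illustrative names, not Akka.NET internals): a C# 9 `static` lambda cannot capture state, so the compiler can cache its delegate; storing a delegate in a `static readonly` field allocates it exactly once instead of per call.

```csharp
using System;

public static class DelegateCachingExample
{
    // Allocated once, at type initialization -- reused on every call.
    private static readonly Action<string> CachedLog = msg => Console.WriteLine(msg);

    public static void Process(string msg)
    {
        // C# 9 'static' lambda: the compiler rejects any capture of
        // 'this' or locals, so the delegate instance can be cached.
        Action<string> log = static m => Console.WriteLine(m);
        log(msg);

        CachedLog(msg);
    }
}
```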
Value Delegates
private readonly struct RequestWorkerTask : IRunnable
{
private readonly DedicatedThreadPoolTaskScheduler _scheduler;
public RequestWorkerTask(DedicatedThreadPoolTaskScheduler scheduler)
{
_scheduler = scheduler;
}
public void Run()
{
// do work
}
}
private void RequestWorker()
{
_pool.QueueUserWorkItem(new RequestWorkerTask(this));
}
Implement our “delegate interface”
using a value type
Runs just the same as a reference
type
Execute the work (might cause a
boxing allocation!)
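The boxing risk in that last caption can be sidestepped by invoking the work through a generic method constrained to the interface (a hedged sketch with hypothetical names):

```csharp
public interface IRunnable
{
    void Run();
}

public readonly struct PrintTask : IRunnable
{
    public void Run() { /* do work */ }
}

public static class WorkQueue
{
    // T stays a struct all the way through -- the JIT specializes this
    // method per value type, so no boxing occurs.
    public static void Execute<T>(T task) where T : IRunnable => task.Run();

    // Accepting the interface directly boxes any struct argument
    // onto the managed heap.
    public static void ExecuteBoxed(IRunnable task) => task.Run();
}
```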
Allocations: “Empty” Collections
public State(TS stateName, TD stateData, TimeSpan? timeout = null, Reason stopReason = null, IReadOnlyList<object> replies = null, bool
notifies = true)
{
Replies = replies ?? new List<object>();
StopReason = stopReason;
Timeout = timeout;
StateData = stateData;
StateName = stateName;
Notifies = notifies;
}
Allocates a new List<object>
(32 bytes) even though it
stays empty
Suspicious…
Allocations: “Empty” Collections
public State(TS stateName, TD stateData, TimeSpan? timeout = null, Reason stopReason = null, IReadOnlyList<object> replies = null, bool
notifies = true)
{
Replies = replies ?? Array.Empty<object>();
StopReason = stopReason;
Timeout = timeout;
StateData = stateData;
StateName = stateName;
Notifies = notifies;
}
Creates an empty, non-null collection without allocating
(Array.Empty<object>() returns a cached singleton)
Value Types
Not Always Preferable Over Reference Types
Reference Type: FSM Events
public sealed class Event<TD> : INoSerializationVerificationNeeded
{
public Event(object fsmEvent, TD stateData)
{
StateData = stateData;
FsmEvent = fsmEvent;
}
public object FsmEvent { get; }
public TD StateData { get; }
public override string ToString()
{
return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
}
}
We allocate millions of these per
second in busy networks
public readonly struct Event<TD> : INoSerializationVerificationNeeded
{
public Event(object fsmEvent, TD stateData)
{
StateData = stateData;
FsmEvent = fsmEvent;
}
public object FsmEvent { get; }
public TD StateData { get; }
public override string ToString()
{
return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
}
}
Value Type: FSM Events
Change to value type
Reduction of ~30mb
Minor throughput
improvement
Value Types: Boxing Allocations
• Boxing occurs implicitly – when a
struct is cast into an object
• The struct will be wrapped
into an object and placed on
the managed heap.
• Unboxing happens explicitly –
when the object is cast back
into its associated value type.
• Can create a lot of allocations!
StateName is usually an enum (value
type) – is the object.Equals call
boxing?
Value Types: Boxing Allocations
// avoid boxing
if (!EqualityComparer<TState>.Default.Equals(_currentState.StateName, nextState.StateName) || nextState.Notifies)
{
_nextState = nextState;
HandleTransition(_currentState.StateName, nextState.StateName);
Listeners.Gossip(new Transition<TState>(Self, _currentState.StateName, nextState.StateName));
_nextState = default;
}
Used generic comparer to avoid casting
value types into object – removed 100%
of boxing allocations at this callsite.
Value Type: Message Envelope
/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public readonly struct Envelope
{
public Envelope(object message, IActorRef sender)
{
Message = message;
Sender = sender;
}
public IActorRef Sender { get; }
public object Message { get; }
}
Used millions of times per
second in Akka.NET
readonly struct? Value
type? Should be “zero
allocations”
Reference Type: Message Envelope
/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public sealed class Envelope
{
public Envelope(object message, IActorRef sender)
{
Message = message;
Sender = sender;
}
public IActorRef Sender { get; }
public object Message { get; }
}
What if we change to a
reference type? Will this reduce
allocations?
394kb  264kb
3.15mb  2.1mb
215 us  147 us
1860 us  1332 us
Value Type Pitfalls
• Copy-by-Value
• References to value types in other scopes requires copying
• ref parameters can work, but in narrowly defined contexts
• Excessive copying can be more expensive than allocating a reference
• Use reference types when semantics are “referential”
• Value types are not magic – work best in “tight” scopes
• Use the right tool for the job
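A minimal illustration of the copy-by-value bullets (hypothetical types): a 32-byte struct is copied wholesale on every ordinary call, while `in` passes a read-only reference; marking the struct `readonly` lets the compiler skip defensive copies.

```csharp
public readonly struct Vector4L
{
    public readonly long X, Y, Z, W; // 32 bytes -- copied on every pass-by-value

    public Vector4L(long x, long y, long z, long w) =>
        (X, Y, Z, W) = (x, y, z, w);
}

public static class VectorMath
{
    // Copies all 32 bytes into the callee on every invocation.
    public static long SumByValue(Vector4L v) => v.X + v.Y + v.Z + v.W;

    // 'in' passes a read-only reference instead; safe here because the
    // struct is 'readonly', so no defensive copies are introduced.
    public static long SumByRef(in Vector4L v) => v.X + v.Y + v.Z + v.W;
}
```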
Reference Type: Message Envelope
• What happens when we benchmark with significantly increased
cross-thread message traffic?
• Now if we convert Envelope back into a struct again…
• Thread access makes a difference!
Threads, Memory,
Synchronization, and Pain
Threads Hate You and Your Code
ThreadStatic and ThreadLocal<T>
• Allocates objects directly into thread local storage
• Objects stay there and are available each time thread is used
• Ideal for caching and pooling
• No synchronization
• Data and work all performed adjacent to stack memory
• Downside: thread-local data structures aren’t synchronized
• Variants!
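A hedged sketch of the caching pattern (hypothetical names): `ThreadLocal<T>` lazily creates one instance per thread, so reuse needs no synchronization at all.

```csharp
using System.Text;
using System.Threading;

public static class PerThreadBuffers
{
    // One StringBuilder per thread, created lazily on first access.
    // No locking: each thread only ever sees its own instance.
    private static readonly ThreadLocal<StringBuilder> Buffer =
        new ThreadLocal<StringBuilder>(() => new StringBuilder(1024));

    public static string Render(int value)
    {
        var sb = Buffer.Value;
        sb.Clear(); // state persists across calls on this thread -- reset it
        return sb.Append("value=").Append(value).ToString();
    }
}
```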
Thread Local Storage & Context Switching
• Reference types passed between
threads often age into older
generations of GC
• Value types passed between
threads are copied (no GC)
• Thread-local state is typically copied from memory into the
CPU’s L1/L2 cache during execution
• Context switching occurs when threads get scheduled onto
different CPUs or work moves onto different threads
Thread Locality & Context Switching
Each thread gets
~30ms of execution
time before yielding
Thread Locality & Context Switching
Current quantum is
over – time for
other threads to
have a turn
Thread Locality & Context Switching
Context switch! Thread
0 now executing on CPU
1 – memory and state
will have to be
transferred.
Context Switching: High Latency Impact
/// <summary>
/// An asynchronous operation will be executed by a <see cref="MessageDispatcher"/>.
/// </summary>
#if NETSTANDARD
public interface IRunnable
#else
public interface IRunnable : IThreadPoolWorkItem
#endif
{
/// <summary>
/// Executes the task.
/// </summary>
void Run();
}
// use native .NET 6 APIs here to reduce allocations
// preferLocal to help reduce context switching
ThreadPool.UnsafeQueueUserWorkItem(run, true);
IThreadPoolWorkItem interface
added in .NET Core 3.0 – avoids delegate
allocations for executing on ThreadPool
Consume IThreadPoolWorkItem
with preferLocal=true – tells the
ThreadPool to attempt to reschedule
work on current thread / CPU.
Performance Comparison
Before After
~3x improvement
~50% improvement
Thread Locality w/o Context Switching
No context switch –
same thread will have a
chance to execute on
same CPU. Might be
able to benefit from
L1/L2 cache, locality of
memory access, etc.
Data Structures & Synchronization
/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
private readonly ConcurrentQueue<Envelope> _queue = new ConcurrentQueue<Envelope>();
/// <inheritdoc cref="IMessageQueue"/>
public bool HasMessages
{
get { return !_queue.IsEmpty; }
}
/// <inheritdoc cref="IMessageQueue"/>
public int Count
{
get { return _queue.Count; }
}
….
}
Could, in theory, improve
memory performance by
replacing with a LinkedList (no
array segment allocations from
resizing)
Data Structures & Synchronization
/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
private readonly object s_lock = new object();
private readonly LinkedList<Envelope> _linkedList = new LinkedList<Envelope>();
public bool HasMessages
{
get
{
return Count > 0;
}
}
public int Count
{
get
{
lock (s_lock)
{
return _linkedList.Count;
}
}
}
….
Not a thread-safe data
structure, has to be
synchronized with a lock
Should offer better memory
performance than
ConcurrentQueue<T>
Wooooooof 🤮
Data Structures & Synchronization
• What went wrong there?
• ConcurrentQueue<T> is lock-free
• Uses volatile and atomic compare-and-swap operations
• i.e. Interlocked.CompareExchange
• Significantly less expensive, even on a single thread, than lock
• LinkedList<T> may not be all that memory efficient
• Internal data structure allocations per-insert rather than array block
allocations
• Better off rolling your own, probably
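The compare-and-swap idea behind ConcurrentQueue<T> can be sketched with a lock-free counter (illustrative only, not the queue's actual implementation):

```csharp
using System.Threading;

public sealed class LockFreeCounter
{
    private int _value;

    public int Increment()
    {
        while (true)
        {
            int observed = Volatile.Read(ref _value);
            // Atomically swap in observed+1 only if no other thread won
            // the race; otherwise re-read and retry. No lock is taken.
            if (Interlocked.CompareExchange(ref _value, observed + 1, observed) == observed)
                return observed + 1;
        }
    }
}
```

Even uncontended, this path is significantly cheaper than acquiring a monitor via `lock`.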
Learn More
• https://getakka.net/ - Akka.NET website, Discord, and source
• https://aaronstannard.com/ - my blog
Editor's Notes
  1. https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals and https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/large-object-heap (LOH threshold is 85,000 bytes)
  2. See Microsoft.Extensions.ObjectPool
  3. Note: Background GC is enabled for both.
  4. Examples: https://particular.net/blog/pipeline-and-closure-allocations
  5. 32 bytes adds up when you allocate millions of these per-second https://stackoverflow.com/questions/16131641/memory-usage-of-an-empty-list-or-dictionary public class List<T> : IList<T>, ICollection<T>, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T>, IEnumerable<T>, IEnumerable { private T[] _items; //4 bytes for x86, 8 for x64 private int _size; //4 bytes private int _version; //4 bytes [NonSerialized] private object _syncRoot; //4 bytes for x86, 8 for x64 private static readonly T[] _emptyArray; //one per type private const int _defaultCapacity = 4; //one per type ... }