
Do you think it is possible to have a language that uses garbage collection (GC) by default, but allows you to take more control with manual memory management, as in C++ or Rust, in the areas of the software where you care more about performance?

Are there any examples of languages that do this, in academia or the "real world"? Are there particular key problems that need to be overcome?

Maybe the allocation strategy could be a generic type parameter.
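For illustration, here is a hypothetical sketch (in C#) of what that last idea might look like. None of these types exist in any real library; they only show the rough shape of an allocator-as-type-parameter API:

    using System;
    using System.Runtime.InteropServices;

    // Invented for this sketch: a pluggable allocation strategy.
    interface IAllocator
    {
        IntPtr Allocate(int bytes);
        void Free(IntPtr p);
    }

    // A manual strategy backed by the unmanaged process heap.
    struct ManualAllocator : IAllocator
    {
        public IntPtr Allocate(int bytes) => Marshal.AllocHGlobal(bytes);
        public void Free(IntPtr p) => Marshal.FreeHGlobal(p);
    }

    // A buffer whose memory strategy is chosen at the type level.
    sealed class RawBuffer<TAlloc> : IDisposable
        where TAlloc : struct, IAllocator
    {
        private TAlloc _alloc;
        private IntPtr _memory;

        public RawBuffer(int bytes) => _memory = _alloc.Allocate(bytes);

        public void Dispose()
        {
            _alloc.Free(_memory);
            _memory = IntPtr.Zero;
        }
    }

Constraining TAlloc to a struct means each instantiation is specialised with no virtual dispatch, so the strategy choice costs nothing at run time. Zig's explicit allocator arguments and Rust's unstable allocator_api are real designs in this general direction.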

  • “Are there any examples of languages that do this”: Nim. – xigoi (Sep 24, 2023 at 19:32)
  • Another example: early Rust. – Martin Berger (Sep 24, 2023 at 19:43)
  • C# also does that. (Sep 24, 2023 at 20:33)
  • Expanding on @MartinBerger's comment: early Rust used to have a garbage collector for the things it couldn't verify statically. – Seggan (Sep 25, 2023 at 0:08)
  • Conservative GC has been bolted on to languages with manual memory management. – Barmar (Sep 25, 2023 at 14:52)

6 Answers

Embedded developer here. Yes, it is possible and, yes, there are real-world examples. C# is a particularly common one.

The vast majority of memory allocations (and other resource allocations) in C# are managed (i.e. garbage-collected), but you can also use manually allocated memory, as well as unmanaged resources obtained from other operating system APIs, when you need to. Most C# applications never need to do this, but when you have performance-critical code, or just need to interface with some API that uses unmanaged resources, you can manually allocate memory and use pointers to it just as you would in C or C++.
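A minimal sketch of what that looks like (unsafe code must be explicitly enabled at compile time; Marshal.AllocHGlobal hands out memory the GC never sees):

    using System;
    using System.Runtime.InteropServices;

    class UnmanagedDemo
    {
        static unsafe void Main()
        {
            // Allocate 1024 bytes outside the GC heap.
            IntPtr buffer = Marshal.AllocHGlobal(1024);
            try
            {
                byte* p = (byte*)buffer.ToPointer();
                for (int i = 0; i < 1024; i++)
                    p[i] = (byte)i;        // raw pointer writes, C-style
            }
            finally
            {
                // Manual deallocation, exactly as with free() in C.
                Marshal.FreeHGlobal(buffer);
            }
        }
    }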

One of the most useful aspects of this in C# in particular is that it allows you to access an array without the runtime automatically performing bounds checking. While bounds checking is great in most cases, when you're processing an extremely large amount of data, omitting it can make execution vastly faster. This is especially useful in the embedded world, where you typically don't have the CPU power of a desktop computer and are frequently running on battery power, but it can be useful in any application that needs to process very large data sets. Of course, with no automatic bounds checking, it is up to the developer to make sure the code never reads or writes outside of the buffer's bounds; attempting to do so results in undefined behavior, just as it would in C or C++.

Of course, for managed memory, the system can garbage-collect or even relocate a buffer at completely unpredictable times, so just grabbing a pointer to a managed buffer and then using it later results in undefined behavior. To resolve this problem, C# provides the fixed statement. It creates a scope within which the specified managed array is not eligible for garbage collection and cannot be relocated by the garbage collector, and it provides a pointer for direct access to that array's memory within that scope. Using the pointer outside of that scope is invalid and results in undefined behavior, but this is no different from using a buffer after deallocating it in C or C++; you just have to ensure that your code never uses the pointer outside of the fixed statement.
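For example, a short sketch of pinning an array and walking it through a raw pointer, with no per-element bounds checks:

    class BigData
    {
        // Pin a managed array and sum it through a raw pointer.
        static unsafe long Sum(int[] data)
        {
            long total = 0;
            fixed (int* p = data)      // GC cannot relocate `data` in this scope
            {
                for (int i = 0; i < data.Length; i++)
                    total += p[i];     // raw read, no bounds check
            }
            // Keeping a copy of `p` past this point would be undefined behavior.
            return total;
        }
    }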

As kaya3's answer discusses, another problem with mixing managed and unmanaged resources is dealing with the case where a managed object references an unmanaged resource and the managed object is garbage-collected. The way C# (and .NET in general) handles this is the Dispose pattern. In this pattern, a managed object which might have unmanaged resources to clean up (or which might need to dispose other managed objects) has a public Dispose() method, which calls a protected Dispose(bool) with the Boolean parameter set to true. If it has unmanaged resources to clean up, it also implements a finalizer, which calls Dispose(false). The Dispose(bool) method always cleans up unmanaged resources, but only attempts to dispose other managed resources when called with its Boolean parameter set to true, since those resources may already have been garbage-collected if Dispose(bool) was called from the finalizer. As long as this pattern is followed correctly, unmanaged resources owned by managed objects will be released when the program is done with the managed object, even if Dispose() is never called explicitly.
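In condensed form (a sketch; NativeBuffer is an invented stand-in for any managed wrapper around an unmanaged resource):

    using System;
    using System.Runtime.InteropServices;

    class NativeBuffer : IDisposable
    {
        private IntPtr _buffer = Marshal.AllocHGlobal(4096); // unmanaged resource
        private bool _disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this); // cleaned up; the finalizer need not run
        }

        protected virtual void Dispose(bool disposing)
        {
            if (_disposed) return;
            if (disposing)
            {
                // Only on this path is it safe to touch other managed objects.
            }
            Marshal.FreeHGlobal(_buffer); // always release unmanaged memory
            _buffer = IntPtr.Zero;
            _disposed = true;
        }

        ~NativeBuffer() => Dispose(false); // last-chance cleanup by the GC
    }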

As an embedded developer who writes everything from WPF touchscreen user interfaces down to PCIe device drivers, I personally love having the flexibility to write the vast majority of our user-mode applications in a managed language while still being able to use pointers when I need to crunch through a very large dataset efficiently or interoperate with libraries written in other languages, such as C. 95% or so of our code can benefit from the ease and safety of managed memory without having to compromise performance on the other 5%, or else write that 5% in a different language. It also means that we can create managed objects to 'wrap' unmanaged resources in exactly the same way that the runtime itself does for operating system resources such as file handles, mutexes, etc.

  • As an embedded developer, I find it interesting that many people associate GC systems with bigger platforms, but the first really popular language implementations to use GC were designed for machines with as little as 4K or 5K of RAM. The GC algorithm used was terribly slow, but it did manage a tiny memory footprint. – supercat (Sep 26, 2023 at 17:29)
  • Slight technicality, but most C# applications do need to do this somewhere; it's most C# code that doesn't. Most C# developers will only ever use unmanaged capabilities through libraries (including the standard library), but it still comes up on profilers etc. – user1937198 (Oct 5, 2023 at 19:27)
  • @user1937198 Yes, that's definitely true. I really meant that the code written by the application developer in the vast majority of C# applications would never need to do that, not that the libraries weren't doing it under the hood. The runtime and standard libraries are, of course, filled with pointers or handles to unmanaged resources of all sorts, but, at least in the case of the ones in the libraries, they use that same Dispose pattern to wrap those in managed objects so that application developers won't need to deal with the unmanaged resources. – reirab (Oct 6, 2023 at 1:20)

Certainly it is possible, but do people actually want it?

Well, of course we all want "the best of both worlds". But by and large, each world is big enough by itself to satisfy almost everyone who lives there: languages with manual memory management tend to have standard library or third-party library features for some levels of automatic memory management (e.g. reference counting and arena allocation cover most use-cases); and the performance cost of having a GC for every object is clearly acceptable to many programmers, and often negligible in practice.

Are there particular key problems that need to be overcome?

The problems will occur when you try to mix the two. If a program uses exclusively managed objects, or exclusively unmanaged memory allocations, then there is no problem; but in that case there would also be no reason to use a language which supports both, since you might as well just have two different languages.

It's worth noting that many of these problems also apply to FFI, for example when writing C extensions for Python. This isn't exactly mixing managed and unmanaged memory in the same language, but the C programmer who writes the extension does have to account for Python's garbage collection. The upside is that Python programmers can't accidentally cause problems with unmanaged memory.

So if you want the benefits of a GC'd language but you also want to offer careful programmers the performance benefits of having some manually managed allocations, consider FFI as an alternative language design feature.
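For a flavour of what that looks like from the managed side, here is a sketch using C#'s P/Invoke FFI. The library name and the make_buffer/free_buffer functions are hypothetical; the point is only that the manually managed code lives behind a C ABI rather than inside the GC'd language itself:

    using System;
    using System.Runtime.InteropServices;

    static class Native
    {
        // Manually managed allocations, implemented in C on the far side
        // of the FFI boundary (hypothetical library and function names).
        [DllImport("native")]
        public static extern IntPtr make_buffer(int size);

        [DllImport("native")]
        public static extern void free_buffer(IntPtr buffer);
    }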

Managed objects holding references to unmanaged memory

A garbage-collected object which holds a reference to an unmanaged allocation will have to manually deallocate the unmanaged memory when the managed object is collected. You can do this with finalisers, but there are many problems associated with finalisers:

  • They're called non-deterministically in an unreliable order, so you can't guarantee anything about the program state (e.g. object invariants) when they're invoked;
  • They are invoked by the GC but can change the live-status of other objects, complicating the GC implementation;
  • Programmers frequently misuse them for non-memory resources;
  • There's no natural place in the program to catch or handle exceptions or other runtime errors raised in finalisers;
  • If the GC is multithreaded then finalisers need to be written to take care of this, even if the program itself is single-threaded; deadlocks in finalisers are also possible.

These are all reasons that newer languages tend to avoid having finalisers, and older languages are deprecating them (e.g. Java). But without finalisers, or something like them, GC-managed objects can't own unmanaged memory without leaking it when they're collected.

An alternative to finalisers which solves some problems is to have the GC enqueue notifications about collected objects (i.e. phantom references) into a user-controlled queue, so that the user's program can choose when to traverse that queue and in which thread; this should be safer overall, but it shifts more work onto the user, and that comes with new potential sources of bugs.

Unmanaged memory holding references to managed objects

A manually-managed allocation which holds a reference to a GC-managed object will have to manually inform the GC about it. The GC needs to know about all roots (definitely-live references to managed objects); it can know automatically about roots held by the call stack, but not those held in unmanaged memory. Then, when the unmanaged allocation is deallocated, the programmer needs to again manually inform the GC that the reference has been dropped.

This basically means the programmer must manually write code which would otherwise be generated automatically by a tracing GC.
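As a concrete instance of this, .NET exposes manual root registration through GCHandle (a sketch; other runtimes have analogous mechanisms, such as JNI's global references):

    using System;
    using System.Runtime.InteropServices;

    static class Roots
    {
        // Called before storing a reference to `managed` in unmanaged memory:
        // the handle registers a root, so the GC keeps the object alive.
        public static IntPtr Register(object managed)
        {
            GCHandle handle = GCHandle.Alloc(managed);
            return GCHandle.ToIntPtr(handle);  // safe to stash anywhere
        }

        // Called before freeing the unmanaged allocation: drop the root,
        // or the managed object leaks for the life of the process.
        public static void Unregister(IntPtr stored)
        {
            GCHandle.FromIntPtr(stored).Free();
        }
    }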

Raw memory access of managed objects

If the language allows reading/writing to memory using raw pointers, then the language must ensure that either

  • Nobody can acquire a pointer to a managed object;
  • Managed objects have reliable, specified memory layouts (in particular, the object header used by the GC must be a fixed or reliably computed size); or
  • Otherwise, that programmers who can directly read/write to managed allocations know not to do this (although realistically, they will do it anyway).

The issue is that writing to an object header will generally lead to undefined behaviour in the sense that anything could happen. The GC is a black box, so a mutated object header could have arbitrary meaning to the GC; a mutated header might violate some invariants, or cause a finaliser to be invoked at the wrong time.

  • Very useful answer, thanks. Though I'm not sure I agree that there is "no reason" to use such a language. I think I would like it. ... Good point that a one-language solution would be a lot like an FFI. I've been tinkering with creating my own language. Python almost works for me, because I want interactive/notebook development and I can work with their two-language/FFI solution to performance. My problem with Python is the lack of sophisticated static types and analysis. – Rob N (Sep 24, 2023 at 18:08)
  • @RobN: "My problem with Python is the lack of sophisticated static types and analysis." Have you looked at mypy? (Sep 25, 2023 at 6:54)
  • Instead of referring to "unmanaged memory", I think it is more useful to think of resources in terms of responsibilities that are acquired, and may only be released by fulfilling them or handing them off. Code that opens a file acquires a responsibility to close it; code that acquires a lock acquires a responsibility to release it, etc. An advantage of a hybrid approach where code manually carries out responsibilities is that the GC can ensure that a reference to an object associated with e.g. a file will always be a valid reference to something, even if that something is a permanently useless object that can no longer do anything. Also, GC frameworks may allow objects to be pinned, with a proviso that code must ensure that no outside copies of the address will ever be used after an object is unpinned. – supercat (Sep 25, 2023 at 21:05)
  • I strongly agree with supercat's "responsibilities" framework. When you allocate an unmanaged object, you assume responsibility for its disposal. Some of the objections discussed in your answer represent cases where the programmer is not living up to this responsibility. (Most clearly "unmanaged object owned by a managed object.") You shouldn't (I claim) lay blame for this on the mixture of modes; it's already a consequence of using unmanaged memory at all. (Sep 26, 2023 at 0:11)

Yes. Here is a very loose summary of a system I worked with (and helped implement) pre-2007:

The main part was an assembly language designed with a view to implementing Java. Or maybe refined for Java, I don't remember: that part happened before I joined the company. Think of an assembly language with a lot of macro facilities built in, including macros that make virtual function calls and some other relatively high-level concepts. All this targeted a virtual instruction set that would later be translated to actual machine code (not a million miles from the idea of LLVM instructions, but invented before LLVM itself existed).

So, all essential JVM opcodes were available as assembler macros or function calls. At the same time (by which I mean: from the same source code, written in the same language, inside the same function as these Java operations), the programmer also had direct access to the kernel calls of its non-protected-mode OS, and also to a stdlib-style set of functions. The stdlib was mostly used by the system's C compiler when it generated backend code, but there was nothing to stop anyone writing in assembly from calling those functions too.

If you stored a pointer to a Java object into a marked region of memory (the stack or a Java object, basically) then it would be GCed normally (normally for Java, I mean: mark-sweep). You could also allocate unmanaged memory just as easily, with malloc and free themselves or with similar-style kernel functions. I'm pretty sure there was also some facility to help with reference counting, although I don't remember exactly how that worked. Everything you can do in Java translated reasonably obviously to assembly code; it just took longer to write.

So, when writing code you genuinely made a choice for each piece of memory whether to allocate it as a Java object (in which case it would be GCed provided you didn't mess up and fail to store the pointer anywhere), from the kernel (in which case you were responsible for freeing it, via the assisted refcounting if you wanted it), or from libc (in which case you were broadly responsible for freeing it, but as a concession and to support easier porting of existing C programs, there was a concept of a process, and processes did keep records of all their malloc allocations and other C-style resources, to clean up on process exit).

It therefore had more mixed modes of resource management than any one piece of code really wanted to use: you'd mix at most two of the three in practice.

This system was never publicly available as a programming environment, although as a Java implementation it did appear in some embedded devices, including early HTC Java phones. It also made it onto some other devices without the Java parts: it was quite modular in that sense and you could build a stripped-down version of the OS that just didn't include anything you didn't call.

And fundamentally, it did what you ask for, but probably not what you want. It didn't have the "ease of use" of Java, because you literally had to think, "OK, is my next virtual call made through an interface or through a class?", because invoke_virtual and invoke_interface were different macros. It had a language restriction that each instruction had at most one side-effect (counting "call out" as one side-effect): you could write a + b + c if those were integers, but you literally couldn't write a() + b() + c() if those were function pointers. You couldn't even write total += a(); you had to write the equivalent of result = a(); total += result. You couldn't write a + b + c if those were strings, either: it would be probably four lines of code to create a StringBuilder from a, then append b, then append c, then create a string from the result. When you did that, all that memory would be GCed. The assembler absolutely did not type-check your code, though: but then, you didn't ask for type checking, you asked for mixed memory management ;-)

In one sense it probably didn't do what you're asking: you could mix the modes, but not for the same types. If you want a GCed array of integers, fine, use a Java array (and of course you could get the base address to do direct memory access to it as a buffer). But if you want to manually manage a Java object, you can't: there was no legitimate way to call "free" on a java.lang.String on the grounds that you happen to know yours is the last remaining reference to it. If you want to do that, you need to allocate a buffer for a nul-terminated byte array from the kernel instead, and then it wouldn't have all the convenience methods of String. Hello, strncpy.

So there's a reason it was never intended as the new general-purpose programming language for the world. But for relatively high-end (for the time) embedded systems programming, and for accelerating Java code before Sun ever released a JIT for mobile devices: it wasn't bad. We beat Sun on pretty much all the benchmarks at the time. You could take Java code, figure out which bits "really needed" GC and which could be manually managed, and rewrite the Java code to run faster and occupy less RAM. We literally did that for some critical parts of the Java standard libraries, and left other parts as the off-the-shelf Java implementations.

  • Aside from all this complicated stuff, there are also things like the Boehm GC for C code. It uses GC "by default" because you call GC_MALLOC "by default" instead of malloc. But you can still manually manage memory by calling malloc and free instead. That's what you asked for, right? ;-) (Sep 26, 2023 at 21:33)
  • Welcome, this was really interesting to read! Given this answer and another one, perhaps embedded/mobile development is a niche where this kind of "mixing" of memory management strategies makes sense and is worth it? – kaya3 (Sep 26, 2023 at 22:23)
  • @kaya3: certainly I'd say that embedded is a realm where developers are more likely to want to write code that looks like systems programming, even though what you're actually writing is definitely an application. Of course, "mobile" today means a device as powerful as or more powerful than the desktop I was using in 2000 to write this code for what we then called smartphones, and would now call barely fit to use as a paperweight. (Sep 26, 2023 at 22:54)
  • I was thinking also that perhaps FFI is less of an option in (some) embedded contexts; if you can't farm out the manual memory management to a separate lower-level language then it makes more sense to have it alongside GC-managed code in the same language. But this is just my speculation since I have no experience with embedded programming. – kaya3 (Sep 26, 2023 at 23:00)
  • Yeah, although I think probably that model does also work very well: you "just" need the high-level language to have a reasonably lightweight interpreter. I don't do embedded any more and I don't know the state of the art, but even back then there was always a tension between the code that you want to write quickly and the code that you want to hand-optimise. You didn't want to write everything in assembly if you could avoid it: but at the time we considered even C kind of heavyweight unless you were very careful which std libraries you called. (Sep 26, 2023 at 23:02)

An abstraction model which uses explicit object disposal will be essentially as good as or better than garbage collection in most usage cases, but at least three usage cases where garbage collection is better are biggies:

  1. It's useful to be able to pass around references to immutable data-holding objects, including objects of mutable type which are wrapped by immutable objects and will never change after wrapper construction is complete, as proxies for the data contained therein, without having to know or care how many references exist to any particular object.

  2. Sometimes it's useful for objects to perform actions on behalf of other objects, without having any particular interest in those objects. For example, an object might notify an "update counter" object every time it's updated, but if the last entity that's interested in the update counter goes away, the monitored object's reference to the update counter should go away at some point.(*)

  3. A tracing garbage collector can ensure that, after an object has been notified that its services will no longer be required (and has likewise notified the other entities upon which it relies, and has consequently become useless), any remaining references to it will continue to be valid references to a useless object. The object can then deterministically reject attempts to continue using it, rather than having such requests act upon arbitrary other objects with unpredictable effects.

The existence of a garbage collector in no way obviates the need for explicit deterministic resource cleanup, but even the best imaginable mechanisms for explicit resource cleanup in no way obviate the above advantages for a tracing garbage collector.

(*) I've yet to see a framework handle this particularly well; a design I'd like to see would be to have event handlers provide an "are you still interested?" function which would allow an event handler to indicate whether further notifications were required. If the event handler has been told its services are no longer required, it could let any object that sends it an event know (via return value) that it should be unsubscribed. To guard against memory leaks, objects with subscription lists could send "are you still interested?" messages to two subscribers (if that many exist) every time a new subscription is added. Unused subscription entries for dead objects might remain around indefinitely, but there would be no way for subscription lists to fill up with an unbounded number of such entries, since old ones would be getting cleaned out while new ones were added.
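A sketch of how that proposal might look (all names here are invented; this is only the footnote's design, not any existing framework):

    using System;
    using System.Collections.Generic;

    interface ISubscriber
    {
        bool StillInterested();  // cheap "are you still interested?" probe
        bool HandleEvent();      // returning false means "unsubscribe me"
    }

    class EventSource
    {
        private readonly List<ISubscriber> _subs = new List<ISubscriber>();
        private int _probe;

        public void Subscribe(ISubscriber s)
        {
            // Probe up to two existing subscribers per new subscription, so
            // stale entries are purged at least as fast as new ones arrive.
            for (int n = 0; n < 2 && _subs.Count > 0; n++)
            {
                _probe %= _subs.Count;
                if (!_subs[_probe].StillInterested()) _subs.RemoveAt(_probe);
                else _probe++;
            }
            _subs.Add(s);
        }

        public void Raise()
        {
            // Handlers signal via their return value whether to keep them.
            _subs.RemoveAll(sub => !sub.HandleEvent());
        }
    }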

  • (2) sounds a bit like the "entity that's interested in the update counter" should hold a hard reference, and the "monitored object" should hold a weak reference. Then, when the last entity that's interested in the update counter goes away, the monitored object's reference to the update counter can no longer be acquired; this failure to acquire could be interpreted the same way you'd interpret a negative response to "are you still interested?". Of course weak references have a non-trivial implementation overhead. Am I missing something else, or is the problem that cost? – Steve Jessop (Sep 27, 2023 at 2:46)
  • @SteveJessop: It's possible to manage views using ownership and "non-owning" pointers, even without GC, but that requires adding another layer of notifications to deal with changes to subscriptions, and all interactions among notifications need to be handled in a thread-safe and deadlock-free fashion. If one has an object which is e.g. supposed to count how many times some monitored event happens, but not do anything other than hold the count, allowing multiple objects which are interested in the thing being watched to share that object, without the observer having to care, can simplify things. – supercat (Sep 27, 2023 at 15:51)

Yes

Qt, when used with C++, does this. Qt's data structures are managed: its container classes (such as QVector) are implicitly shared, i.e. reference-counted, and QObject trees are cleaned up through parent-child ownership. However, nothing prevents you from using C++ or even plain old C pointers at the same time, and you can even mix Qt and STL data structures, for example QVector and std::vector, which can be useful when interfacing with libraries that don't use Qt.


Yes.

Object Pascal behaves like that. Class instances and ordinary blocks of memory are subject to manual allocation and deallocation, while strings and dynamic arrays are reference-counted.
