9
$\begingroup$

An object is partially initialized if program execution has not yet exited its constructor. Usually the programmer is responsible for not leaving partially initialized objects in unwanted places and not calling them in unwanted ways. But just as Rust has "ownership" and "lifetimes" to protect against invalid pointer accesses, are there ways to protect against misuse of partially initialized objects while still allowing every operation that does useful work in the constructor?

My knowledge of Rust is limited, but unless there are edge cases I don't know about, it doesn't seem to have accessible partially initialized objects at all. I'm thinking instead about languages with more traditional constructors, as in C++.

The background of this question is that I'm wondering how much damage could be done if a language had async constructors and objects were allocated in place rather than referenced from elsewhere, which could leave many partially initialized variables in random places.

Example C++ code that refers to an uninitialized object in this way:

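#include <iostream>

struct D {
    int x;
    D() { x = 12345; }
};

struct A {
    int y;
    A();
};

struct B : A {
    D d;
    int dx() { return d.x; }
};

B b;  // global; constructing it runs A::A() before b.d is constructed

// Reads b.d.x through the global while b is still under construction,
// i.e. before D's constructor has run.
A::A() { std::cout << b.dx() << std::endl; }

int main() {}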
$\endgroup$
4
  • $\begingroup$ Related, I think, but not quite the same $\endgroup$
    – Bbrk24
    Commented Jun 12 at 13:58
  • $\begingroup$ shipilev.net/blog/2014/safe-public-construction $\endgroup$
    – Moonchild
    Commented Jun 12 at 15:44
  • 1
    $\begingroup$ Two interesting places this crops up in C#: (1) The spec says that C# initializes structs in temporary storage and copies to the destination when the constructor is complete, but the compiler will elide the copy if doing so is "safe". The safety property it is looking for is "could code outside the object observe partially constructed state if the constructor throws"? $\endgroup$ Commented Jun 12 at 20:22
  • 1
    $\begingroup$ (2) Finalizable objects are eligible for finalization the moment they are created; it is possible in rare situations for the finalizer of a dead object to start running before the constructor finishes! Finalizers must be robust in the face of handling objects that were not successfully constructed. $\endgroup$ Commented Jun 12 at 20:23

4 Answers

9
$\begingroup$

There are two related issues here: one is memory safety, and the other is domain invariants. We want to guarantee fields are initialised, because an uninitialised field with whatever bit pattern was previously left in that field's location might not be valid for the field's type (e.g. it may be a number out of range, a pointer to an object of the wrong type, or a stale pointer to an object which no longer exists). Additionally, we may also want programmers to be able to ensure their own invariants (e.g. for a Range type, that this.start <= this.end).

Definite assignment

By static analysis, a compiler can determine that a field is or isn't "definitely assigned" at each point in the constructor. The goal is to check that all fields are definitely assigned before a reference to the object can escape (by any means).

This may be done via syntax-directed rules, as Java does for local variables and blank final fields (though not for other fields): for example, a field x is definitely assigned after a statement this.x = ...;, a field is definitely assigned after if(...) A else B if it is definitely assigned after A and definitely assigned after B, and so on. Alternatively, a control-flow graph can be derived; then a variable is "definitely assigned" if every path from the entry node to an exit node passes through an assignment to that variable.

Such rules are conservative, so it is possible to write programs which in reality always assign to a variable, but which nonetheless don't satisfy the rules. The compiler will reject such programs; this could upset programmers who "know better than the compiler". TypeScript offers an escape hatch via definite assignment assertions, allowing the programmer to override the compiler's judgement, although (like other escape hatches) this means opting out of having the compiler check the correctness of your code.
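As a rough illustration of the syntax-directed rules, here is a minimal C++ sketch of a definite assignment check over a toy statement language. The `Stmt` type and the helper names (`assign`, `if_else`, `block`, `assigned_after`) are invented for this example and do not correspond to any real compiler's API.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy statement language: field assignments, if/else, and blocks.
struct Stmt {
    enum class Kind { Assign, IfElse, Block };
    Kind kind;
    std::string field;                        // used by Assign only
    std::vector<std::shared_ptr<Stmt>> kids;  // IfElse: {then, else}; Block: statements in order
};
using StmtPtr = std::shared_ptr<Stmt>;
using FieldSet = std::set<std::string>;

StmtPtr assign(std::string field) {
    return std::make_shared<Stmt>(Stmt{Stmt::Kind::Assign, std::move(field), {}});
}
StmtPtr if_else(StmtPtr a, StmtPtr b) {
    return std::make_shared<Stmt>(Stmt{Stmt::Kind::IfElse, "", {std::move(a), std::move(b)}});
}
StmtPtr block(std::vector<StmtPtr> stmts) {
    return std::make_shared<Stmt>(Stmt{Stmt::Kind::Block, "", std::move(stmts)});
}

// Returns the fields definitely assigned after `s`, given those assigned before it.
FieldSet assigned_after(const Stmt& s, FieldSet before) {
    switch (s.kind) {
    case Stmt::Kind::Assign:
        before.insert(s.field);
        return before;
    case Stmt::Kind::IfElse: {
        // Assigned after the if/else only if assigned after BOTH branches.
        FieldSet a = assigned_after(*s.kids[0], before);
        FieldSet b = assigned_after(*s.kids[1], before);
        FieldSet both;
        for (const auto& f : a)
            if (b.count(f)) both.insert(f);
        return both;
    }
    case Stmt::Kind::Block:
        // Thread the assigned set through the statements in order.
        for (const auto& kid : s.kids)
            before = assigned_after(*kid, std::move(before));
        return before;
    }
    return before;  // not reached; keeps compilers quiet
}

void example() {
    // Models: if (...) { x = ...; y = ...; } else { x = ...; }
    StmtPtr ctor = if_else(block({assign("x"), assign("y")}), assign("x"));
    FieldSet after = assigned_after(*ctor, {});
    assert(after.count("x") == 1);  // assigned on both branches
    assert(after.count("y") == 0);  // assigned on one branch only
}
```

A compiler using this kind of rule would reject the modelled constructor if `y` were a required field, even if the condition happened to be always true at run time; that is the conservatism described above.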

Definite assignment checks are based on the principle of proving that the fields will be assigned after some statement or set of statements complete normally. This means that separate guarantees need to be made for when the constructor does not complete normally (e.g. if it throws an exception, yields, or falls into an infinite loop). In those cases we want to ensure that a reference to the uninitialised object cannot be acquired. The rules to ensure this can often be more onerous than the more straightforward rules requiring definite assignment before the constructor terminates.

Normally, a reference to a partially initialised object can't escape from the code which invokes the constructor, since it receives no such reference in case the constructor fails (or, if the constructor is merely paused, the invoking code doesn't receive a reference to the new object until the constructor resumes and completes). However, the constructor itself must also not provide other code with a reference to this before all fields are initialised; in particular, instance methods should not be called, nor should a superclass constructor. (Alternatively, if a superclass constructor may be called earlier, then instance methods must not be called from anywhere in the superclass constructor, even after all superclass fields are definitely assigned.) If you allow constructors to yield, then likewise a constructor must not yield this (or similar) before all fields are initialised.

Default field values

In Java, fields have default initial values based on their types, such that a field has a well-defined value even if it is not assigned in the constructor (or an initialiser). This is a semantic guarantee rather than a runtime requirement, so a compiler is permitted to optimise away the assignment of the default value if it is a dead store.

This means that objects in Java always have their fields initialised, guaranteeing memory safety and preventing implementation-specific behaviour, though it is not guaranteed that a default initial value for a field will satisfy any invariant intended by the programmer.

Note that this is only possible in Java because null is a valid value for all reference types. In general, non-nullable reference types do not have sensible default values, nor do function types, file handles, or so on.
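The difference between "well-defined" and "satisfies the invariant" can be sketched in C++, which has no Java-style implicit defaults for members of built-in type, so the example opts in with default member initializers. The `Range` type and the `observer` callback (standing in for any code that sees the object mid-construction) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical Range type; the intended invariant is start <= end.
struct Range {
    int start = 0;  // default member initializers play the role of Java's defaults
    int end = 0;

    Range(int s, int e, const std::function<void(const Range&)>& observer) {
        observer(*this);  // both fields still hold their defaults
        start = s;
        observer(*this);  // reads are well-defined, but start <= end may not hold yet
        end = e;
    }
};

void example() {
    std::vector<std::pair<int, int>> seen;
    Range r(5, 10, [&](const Range& partial) {
        seen.emplace_back(partial.start, partial.end);
    });
    assert(seen[0] == std::make_pair(0, 0));  // defaults
    assert(seen[1] == std::make_pair(5, 0));  // defined values, broken invariant
    assert(r.start == 5 && r.end == 10);      // invariant holds once construction finishes
}
```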

Allow fallible constructors

In languages with exceptions, constructors can be permitted to throw them. This allows programmers to throw exceptions rather than allow objects to be created which violate their invariants. However, it doesn't help with memory safety since it depends on the programmer writing code to verify the object's state themselves.
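A minimal C++ sketch of this approach, reusing the Range example from above (the class layout is an assumption for illustration):

```cpp
#include <cassert>
#include <stdexcept>

class Range {
public:
    Range(int start, int end) : start_(start), end_(end) {
        // The programmer, not the compiler, is responsible for this check.
        if (start_ > end_)
            throw std::invalid_argument("Range: start must not exceed end");
    }
    int start() const { return start_; }
    int end() const { return end_; }

private:
    int start_;
    int end_;
};

void example() {
    Range ok(1, 3);
    assert(ok.start() == 1 && ok.end() == 3);

    bool rejected = false;
    try {
        Range bad(3, 1);  // the caller never receives this object
    } catch (const std::invalid_argument&) {
        rejected = true;
    }
    assert(rejected);
}
```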

Ban fallible constructors

A different approach is taken by languages like Rust, in which there are no non-trivial constructors or initialisers at all. Objects (really, structs and tuples) are created by directly providing values for all of their fields, and so it is trivially not possible to "see" an object before those values are assigned.

This guarantees memory safety, but means invariants (and other considerations such as encapsulation) must be addressed in a different way. Rather than writing a constructor which checks whether the object is being constructed correctly and throws an exception otherwise, programmers are expected to write factory functions which perform those checks before creating the object.
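The factory-function style can be approximated in C++ with a private constructor that only assigns fields and a static function that does the checking first. This is a sketch under that assumption; the name `make` is arbitrary.

```cpp
#include <cassert>
#include <optional>

class Range {
public:
    // Factory: validate first, and create the object only if the invariant holds.
    static std::optional<Range> make(int start, int end) {
        if (start > end) return std::nullopt;
        return Range(start, end);
    }
    int start() const { return start_; }
    int end() const { return end_; }

private:
    // Trivial constructor: it only stores values that were already checked.
    Range(int start, int end) : start_(start), end_(end) {}
    int start_;
    int end_;
};

void example() {
    std::optional<Range> ok = Range::make(1, 3);
    assert(ok.has_value() && ok->start() == 1 && ok->end() == 3);
    assert(!Range::make(3, 1).has_value());  // no object was ever created
}
```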

$\endgroup$
5
  • $\begingroup$ Most of this answer is about how to make a constructor work correctly and initialize every field. But I'm asking about a paused constructor, by calling other code within the constructor or using async constructors, assuming constructors always work correctly if they could be finished. $\endgroup$
    – user23013
    Commented Jun 12 at 14:42
  • 3
    $\begingroup$ @user23013 All of these approaches apply to the case of constructors being interrupted via yield or await, if your language allows constructors to be coroutines or asynchronous. A constructor which has not finished executing because it threw an exception, vs. a constructor which has not finished executing but might be resumed later, doesn't make much difference as far as I can tell. In approach 1 that means you can't yield this (or yield anything from which this is accessible), in approach 2 the default values prevent unsafety including while the constructor is paused, and so on. $\endgroup$
    – kaya3
    Commented Jun 12 at 14:52
  • $\begingroup$ Note that in Rust you can deinitialize fields, leaving an object in a partially initialized state. The Rust compiler uses control-flow analysis to track each field's state and ensure all fields are properly initialized before a value is used, or alternatively to drop only the still initialized fields when the value is dropped. $\endgroup$ Commented Jun 13 at 7:14
  • $\begingroup$ And the easiest way to implement "Default field values" is to memset the object's memory to 0. $\endgroup$
    – dan04
    Commented Jun 13 at 22:27
  • 1
    $\begingroup$ @dan04 That works for Java because of null, but if e.g. the default value for a list in your language should be an empty list, then constructing a default value might require another allocation. And if users are allowed to define default values for their own types (e.g. as Rust's Default trait allows) then it's more complicated. $\endgroup$
    – kaya3
    Commented Jun 13 at 22:44
6
$\begingroup$

The general problem you describe is how to allow computations in constructors while preventing partially constructed objects from "escaping" into other contexts. This is something that linear type theory may help with.

Two-stage construction

Inside the constructor, the object has a special type that cannot be expressed in code. A dedicated syntax marks the completion of construction, after which the object is a normal object that can interact with the rest of the system, even inside the constructor.

class C {
    def constructor( self, args... )
    {
        // self starts as an "unborn" object.
        // It's already allocated, but no invariants hold.

        // Also, self.c cannot be null and also cannot
        // mutate after construction (init only).

        self.a = ...;
        self.b = ...;

        MyCache.Add(self); // Compiler error: Unborn object cannot escape.
        self = new;        // Compiler error: Unassigned property.

        self.c = ...;

        // This completes the constructor, and guarantees all invariants
        self = new;

        MyCache.Add(self); // This is ok.
        self.c = ...;      // Compiler error, immutable read only property.
    }
}

var c = new C( args ... );
MyCache.Add(c);            // This is ok.
c.c = ...;                 // Compiler error, immutable read only property.

Do not store references of partially constructed objects

This may sound silly, but it is a long-standing gotcha in Java and in other languages closer to the metal. See how much code the double-checked locking examples on Wikipedia need!

The misconception is that:

singletons[i].reference = new Singleton();

means:

  1. Allocate instance
  2. Run constructor
  3. Store object reference

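But in multi-threaded C/C++/Java, if the store is not properly synchronized, the compiler or CPU may reorder these steps, so other threads can effectively observe: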

  1. Allocate instance
  2. Store object reference
  3. Run constructor

So another thread can read singletons[i].reference and get a reference to an object whose constructor has not run yet. This invalidates all invariants!

For example, any non-null object property can now return null values. How bad is that?
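In C++, one way to get the intended order is to publish the pointer through a std::atomic with a release store and read it with an acquire load. The sketch below assumes a trivial `Singleton` type and a single global slot named `g_instance`; both are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Singleton {
    int value;
    Singleton() : value(42) {}
};

std::atomic<Singleton*> g_instance{nullptr};

void publish() {
    Singleton* p = new Singleton();  // allocate and run the constructor
    // Release store: the constructor's writes become visible to any thread
    // that later observes this pointer through an acquire load.
    g_instance.store(p, std::memory_order_release);
}

int read_when_ready() {
    Singleton* p = nullptr;
    while ((p = g_instance.load(std::memory_order_acquire)) == nullptr)
        std::this_thread::yield();
    return p->value;  // guaranteed to see the fully constructed object
}

void example() {
    int seen = 0;
    std::thread reader([&] { seen = read_when_ready(); });
    std::thread writer(publish);
    reader.join();
    writer.join();
    assert(seen == 42);
    delete g_instance.exchange(nullptr);
}
```

With a plain (non-atomic) pointer the same program would contain a data race, which is undefined behavior in C++, and the reader could see the pointer before the constructor's writes.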

Allow constructors to fail gracefully

Running computations in a constructor means that these computations can fail. But if the only way to communicate these failures is to throw exceptions, the code may end up cluttered with exception handling, even when these failures are expected.

A gentler way is to allow constructors to fail more gracefully:

def constructor ( self? , args... )
{
    // Note the ? on self parameter.
    // This means that this constructor may not create an object.

    var memory = malloc( 1000 );
    if ( memory == null )
        return null;
}

With this, calling code can be simpler:

var fileinfo = new FileInfo(path) ?? FileInfo.Empty;

// or

var fileinfo = new FileInfo(path); // fileinfo is FileInfo?
if ( fileinfo ) { ... }            // fileinfo is FileInfo inside if
else { ... }                       // Failed to obtain file info

Allow constructors to return errors

Same as above, but here we are also interested in why the FileInfo constructor failed.

def constructor ( args... ) self , error
{
    // Multiple returns on constructors are always disjoint unions
    // Alternative, explicit syntax:
    //
    //     def constructor () self |,| error

    if ( ... )
        return self;
    return (error) err;
}

var info, err = new FileInfo(path);

Console.Log( info.FileSize ); // Compiler error, disjoint type direct access.
throw err;                    // Compiler error, disjoint type direct access.

if ( info ) { ... } // Ok
else throw err;     // Ok
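Existing languages can approximate this with a sum type. Below is a C++17 sketch using std::variant, where the result is either the object or an error message and must be inspected before use. `FileInfo`, `FileInfoResult` and `open_file_info` are hypothetical names, and the file size is faked as the length of the path string so the example needs no file I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical FileInfo record.
struct FileInfo {
    std::string path;
    std::size_t size;
};

// Disjoint union: exactly one of FileInfo or an error message.
using FileInfoResult = std::variant<FileInfo, std::string>;

FileInfoResult open_file_info(const std::string& path) {
    if (path.empty())
        return std::string("path is empty");
    return FileInfo{path, path.size()};  // stand-in for a real size lookup
}

void example() {
    FileInfoResult ok = open_file_info("notes.txt");
    assert(std::holds_alternative<FileInfo>(ok));
    assert(std::get<FileInfo>(ok).size == 9);

    FileInfoResult bad = open_file_info("");
    // std::get_if returns nullptr for the inactive alternative, so callers
    // are pushed to check which case they have before using it.
    assert(std::get_if<FileInfo>(&bad) == nullptr);
    assert(std::get<std::string>(bad) == "path is empty");
}
```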
$\endgroup$
2
  • $\begingroup$ Swift and Objective-C have something similar to the last one; In Swift it's spelled init?(...), and in Objective-C it's done by marking the return type as instancetype _Nullable or instancetype __nullable rather than just instancetype. $\endgroup$
    – Bbrk24
    Commented Jun 12 at 16:36
  • 1
    $\begingroup$ "But in multi-threaded C/C++/Java this also means […]" – Even in Java? I find that incredibly surprising! My understanding of that statement would have been that the expression new Singleton() doesn't finish evaluating until the constructor has finished running, and the assignment doesn't occur until its right-hand side has finished evaluating. Which of these is not true for Java? $\endgroup$ Commented Jun 14 at 15:49
5
$\begingroup$

The simplest method is to initialize every object to some default value as soon as its lifetime begins.

In D this is the Foo.init value, which every type has and which gets copied (as if by memcpy) onto the variable when its lifetime begins. It can be customized in the definition of the struct. You can also just pick zero as the initial value for everything.

Then you can opt-out by declaring the initial value of your variable as void: Foo f = void;

Straightforward compiler optimizations will eliminate dead stores (where a field gets written twice without being read in between) in the common case.

Using this scheme means that the programmer will need to opt in to uninitialized values.

Designing a constructor around making an object immutable after construction can be done by making the constructor the one place that is allowed to write to the non-mutable fields (the others are still default-initialized where it matters). Alternatively, the actual writes to the fields can be delayed until a designated point in the constructor. Then you have a preamble that computes the field values, after which the fields get written and the lifetime officially begins, and an epilogue in which the object may insert itself into a data structure as needed.
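The preamble / write / epilogue split can be approximated in C++ with a delegating constructor: a static helper computes all field values, a private constructor writes them in one go, and the public constructor's body runs only afterwards. This is a sketch, not a language feature; `Widget`, `Fields`, `prepare` and the `registry` parameter are all invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Widget {
    // Every field value, computed up front as one record.
    struct Fields {
        std::string name;
        std::size_t length;
    };

    // Preamble: may fail, but no Widget exists yet, so nothing partial can leak.
    static Fields prepare(const std::string& raw) {
        if (raw.empty()) throw std::invalid_argument("Widget: empty name");
        return Fields{raw, raw.size()};
    }

    // Designated write point: all fields are initialized together.
    explicit Widget(Fields f) : name(std::move(f.name)), length(f.length) {}

public:
    const std::string name;
    const std::size_t length;

    Widget(const std::string& raw, std::vector<const Widget*>& registry)
        : Widget(prepare(raw)) {
        registry.push_back(this);  // epilogue: every field is already written
    }
};

void example() {
    std::vector<const Widget*> registry;
    Widget w("gear", registry);
    assert(w.name == "gear" && w.length == 4);
    assert(registry.size() == 1 && registry[0] == &w);
}
```

If `prepare` throws, the registry is never touched, so no other code ever holds a pointer to a half-built `Widget`.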

$\endgroup$
5
$\begingroup$

I'd like to add an interesting solution Bob Nystrom used in one of his languages, Magpie (explained in this article):

Basically, every class has a factory method, new, which by default simply creates an object using the given parameters and can be overridden to supply some more complex initialization logic. The trick is that within these methods, there's no way to actually access any instance variables of the new object, since it hasn't actually been created yet!

Instead, at the end of that method you call another, built-in method called construct, which takes in a record (basically a map) containing a key and value for every instance variable, and returns a new instance of the specific class initialized accordingly. That way, instance creation is atomic, meaning it is completely prohibited by design to perform operations on a partially initialized object.

$\endgroup$
4
  • 1
    $\begingroup$ Rust's approach, too, is to make the creation of structs an atomic operation. The rest of the details (that construct is a named function rather than using the name of the type, that new has a mandated name and a default implementation) are different, but the mechanism for preventing partial initialisation is the same. $\endgroup$
    – kaya3
    Commented Jun 12 at 20:26
  • $\begingroup$ @kaya3 True, it is functionally pretty much equivalent. Though the semantics is different (as you're already "inside the object" instead of creating it "from the outside") and it's a nice workaround for a dynamically typed language. $\endgroup$ Commented Jun 13 at 5:58
  • 1
    $\begingroup$ I'd argue that you aren't "inside" the object because there is no this to use; the object is created by construct, so there is no object for most of new despite its name. $\endgroup$
    – kaya3
    Commented Jun 13 at 14:09
  • $\begingroup$ As mentioned by kaya3, you are not "inside" the object in Rust, because factory functions ("constructors") do not take a self parameter. The only "constructors" baked into the language are Type (for types without fields), Type(...) (for types with unnamed fields), and Type { ... } (for types with named fields). And those are atomic. $\endgroup$ Commented Jun 15 at 10:42
