45

If I define a variable of a certain type (which, as far as I know, just allocates data for the content of the variable), how does it keep track of which type of variable it is?

5
  • 8
    Who/what are you referring to by "it" in "how does it keep track"? The compiler or the CPU or something/one else like the language or the program?
    – Erik Eidt
    Commented Oct 21, 2018 at 14:57
  • 8
    @ErikEidt IMO the OP obviously means "the variable itself" by "it." Of course the two-word answer to the question is "it doesn't".
    – alephzero
    Commented Oct 21, 2018 at 20:57
  • 2
    great question! especially relevant today given all the fancy languages that do store their type. Commented Oct 22, 2018 at 12:39
  • @alephzero That was obviously a leading question.
    – Luaan
    Commented Oct 23, 2018 at 11:28

5 Answers

109

Variables (or more generally: “objects” in the sense of C) do not normally store their type at runtime. As far as machine code is concerned, there is only untyped memory. Instead, the operations on this data interpret the data as a specific type (e.g. as a float or as a pointer). The types are only used by the compiler.

For example, we might have a struct or class struct Foo { int x; float y; }; and a variable Foo f {}. How can a field access auto result = f.y; be compiled? The compiler knows that f is an object of type Foo and knows the layout of Foo-objects. Depending on platform-specific details, this might be compiled as “Take the pointer to the start of f, add 4 bytes, then load 4 bytes and interpret this data as a float.” In many machine code instruction sets (incl. x86-64) there are different processor instructions for loading floats or ints.
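
The field access described above can be spelled out by hand. This is only a sketch of what the compiler does implicitly for f.y; the helper name load_y_by_hand is made up for illustration, and the actual offset of y is platform-specific, which is why the code asks offsetof instead of assuming 4.

```cpp
#include <cstddef>   // offsetof
#include <cstring>   // std::memcpy

struct Foo {
    int x;
    float y;
};

// Hypothetical helper: performs by hand what the compiler emits for `f.y`.
// Start at the address of f, skip to the offset of y, copy sizeof(float)
// bytes, and interpret those bytes as a float.
float load_y_by_hand(const Foo& f) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&f);
    float result;
    std::memcpy(&result, base + offsetof(Foo, y), sizeof result);
    return result;
}
```

Nothing in the bytes of f says "float". The knowledge that the bytes at that offset form a float lives only in the code that reads them.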

One example where the C++ type system cannot keep track of the type for us is a union like union Bar { int as_int; float as_float; }. A union holds at most one object at a time, of one of its member types. If we store an object in a union, this is the union's active type. We must only try to get that type back out of the union; anything else would be undefined behavior. Either we “know” while programming what the active type is, or we can create a tagged union where we store a type tag (usually an enum) separately. This is a common technique in C, but because we have to keep the union and the type tag in sync, it is fairly error-prone. A void* pointer is similar to a union but can only hold pointers to objects, not function pointers.

C++ offers two better mechanisms to deal with objects of unknown types: we can use object-oriented techniques to perform type erasure (only interact with the object through virtual methods so that we don't need to know the actual type), or we can use std::variant, a kind of type-safe union.
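
To make the contrast concrete, here is a rough sketch of a hand-rolled tagged union next to std::variant. It assumes a C++17 compiler; TaggedBar, make_int, make_float and the two conversion helpers are names invented for this example.

```cpp
#include <variant>

// Hand-rolled tagged union: the programmer must keep `tag` in sync
// with whichever union member was written last.
struct TaggedBar {
    enum class Tag { Int, Float } tag;
    union {
        int as_int;
        float as_float;
    };
};

// Hypothetical helpers that set the tag and the payload together.
TaggedBar make_int(int i) {
    TaggedBar b;
    b.tag = TaggedBar::Tag::Int;
    b.as_int = i;
    return b;
}

TaggedBar make_float(float f) {
    TaggedBar b;
    b.tag = TaggedBar::Tag::Float;
    b.as_float = f;
    return b;
}

// Reads only the member that the tag says is active.
double tagged_to_double(const TaggedBar& b) {
    switch (b.tag) {
        case TaggedBar::Tag::Int:   return b.as_int;
        case TaggedBar::Tag::Float: return b.as_float;
    }
    return 0.0;  // not reached for valid tags
}

// std::variant stores the tag itself and checks it on access.
using Bar = std::variant<int, float>;

double variant_to_double(const Bar& b) {
    if (const int* i = std::get_if<int>(&b)) return *i;
    return std::get<float>(b);  // would throw if the float were not active
}
```

With TaggedBar, writing as_float while leaving the tag at Int compiles without complaint; with std::variant that mismatch cannot be expressed.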

There is one case where C++ does store the type of an object: if the class of the object has any virtual methods (a “polymorphic type”, aka. interface). The target of a virtual method call is unknown at compile time and is resolved at run time based on the dynamic type of the object (“dynamic dispatch”). Most compilers implement this by storing a pointer to a virtual function table (“vtable”) at the start of the object. The vtable can also be used to get the type of the object at runtime. We can then draw a distinction between the compile-time known static type of an expression, and the dynamic type of an object at runtime.

C++ allows us to inspect the dynamic type of an object with the typeid() operator which gives us a std::type_info object. Either the compiler knows the type of the object at compile time, or the compiler has stored the necessary type information inside the object and can retrieve it at runtime.
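
A small sketch of the difference, assuming RTTI is enabled (the default for mainstream compilers). The class names Shape, Circle, Plain and PlainChild are invented for illustration.

```cpp
#include <typeinfo>

struct Shape {
    virtual ~Shape() = default;  // one virtual member makes Shape polymorphic
};
struct Circle : Shape {};

struct Plain {};                 // no virtual members: not polymorphic
struct PlainChild : Plain {};

// The static type of `s` is Shape, which is polymorphic, so typeid
// looks up the dynamic type stored with the object at runtime.
bool is_circle(const Shape& s) {
    return typeid(s) == typeid(Circle);
}

// The static type of `p` is Plain, which is not polymorphic, so typeid
// is answered at compile time and always reports Plain.
bool reports_plain(const Plain& p) {
    return typeid(p) == typeid(Plain);
}
```

Passing a PlainChild to reports_plain still yields true: without a virtual member, nothing in the object records that it is really a PlainChild.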

15
  • 3
    Very comprehensive. Commented Oct 21, 2018 at 13:47
  • 9
    Note that to access the type of a polymorphic object the compiler still must know that the object belongs to a particular inheritance family (i.e. have a typed reference/pointer to the object, not void*).
    – Ruslan
    Commented Oct 21, 2018 at 14:40
  • 5
    +0 because the first sentence is untrue, but the last two paragraphs correct it.
    – Marcin
    Commented Oct 21, 2018 at 19:59
  • 4
    Generally what is stored at the start of a polymorphic object is a pointer to the virtual method table, not the table itself. Commented Oct 22, 2018 at 0:32
  • 4
    @v.oddou In my paragraph I ignored some details. typeid(e) introspects the static type of the expression e. If the static type is a polymorphic type, the expression will be evaluated and that object's dynamic type is retrieved. You cannot point typeid at memory of unknown type and get useful information. E.g. typeid of a union describes the union, not the object in the union. The typeid of a void* is just a void pointer. And it's not possible to dereference a void* to get at its contents. In C++ there is no boxing unless explicitly programmed that way.
    – amon
    Commented Oct 22, 2018 at 9:26
54

The other answer explains the technical aspect well, but I'd like to add some general advice on how to think about machine code.

The machine code after the compilation is pretty dumb, and it really just assumes that everything works as intended. Say you have a simple function like

bool isEven(int i) { return i % 2 == 0; }

It takes an int, and spits out a bool.

After you compile it, you can think about it as something like this automatic orange juicer:

[Image: an automatic orange juicer]

It takes in oranges, and returns juice. Does it recognize the type of objects that it gets in? No, they are just supposed to be oranges. What happens if it gets an apple instead of an orange? Perhaps it will break. It doesn't matter, as a responsible owner won't try to use it this way.

The function above is similar: it is designed to take ints, and it may break or do something irrelevant when fed something else. It (usually) doesn't matter, because the compiler (generally) checks that it never happens, and it indeed never happens in well-formed code. If the compiler detects a possibility that a function would get a value of the wrong type, it refuses to compile the code and reports type errors instead.

The caveat is that there are some cases of ill-formed code that the compiler will accept. Examples are:

  • incorrect type-casting: explicit casts are assumed to be correct, and it is up to the programmer to ensure that they aren't casting void* to orange* when there is an apple on the other end of the pointer,
  • memory management issues such as null pointers, dangling pointers or use-after-scope; the compiler isn't able to find most of them,
  • I'm sure there is something else I'm missing.

As said, the compiled code is just like the juicer machine: it doesn't know what it processes, it just executes instructions. And if the instructions are wrong, it breaks. That is why the above problems in C++ can result in uncontrolled crashes.
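
The juicer can be fed an "apple" without invoking undefined behavior by copying bytes instead of casting pointers. The sketch below assumes a 32-bit IEEE 754 float, which holds on mainstream platforms but is not guaranteed by the standard; raw_bits is a name made up for this example.

```cpp
#include <cstdint>
#include <cstring>

bool isEven(int i) { return i % 2 == 0; }

// Hypothetical helper: copies the bytes of a float, unchanged, into a
// 32-bit integer. The bytes are the same; only the interpretation differs.
std::uint32_t raw_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under that assumption raw_bits(1.0f) is 0x3F800000, and isEven will happily report that this number is even. The result is meaningless as a statement about 1.0f, but the machine code has no way of noticing.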

8
  • 4
    The compiler attempts to check that the function is passed an object of the correct type, but both C and C++ are too complex for the compiler to prove it in every case. So, your apples-and-oranges comparison to the juicer is quite instructive.
    – Calchas
    Commented Oct 21, 2018 at 21:52
  • @Calchas Thanks for your comment! This sentence was indeed an oversimplification. I elaborated a bit on the possible problems, they are actually pretty related to the question.
    – Frax
    Commented Oct 21, 2018 at 22:21
  • 6
    wow great metaphor for machine code! your metaphor is made 10x better by the picture too! Commented Oct 22, 2018 at 12:38
  • 2
    "I'm sure there is something else I'm missing." - Of course! C's void* coerces to foo*, the usual arithmetic promotions, union type punning, NULL vs. nullptr, even just having a bad pointer is UB, etc. But I don't think listing all of those things out would materially improve your answer, so it's probably best to leave it as it is.
    – Kevin
    Commented Oct 23, 2018 at 1:09
  • @Kevin I don't think it's necessary to add C here, since the question is only tagged as C++. And in C++ void* doesn't implicitly convert to foo*, and union type punning isn't supported (has UB).
    – Ruslan
    Commented Oct 23, 2018 at 12:12
3

A variable has a number of fundamental properties in a language like C:

  1. A name
  2. A type
  3. A scope
  4. A lifetime
  5. A location
  6. A value

In your source code, the location, (5), is conceptual, and this location is referred to by its name, (1).  So, a variable declaration is used to create the location and space for the value, (6), and in other lines of source, we refer to that location and the value it holds by naming the variable in some expression.

Simplifying only somewhat, once your program is translated into machine code by the compiler, the location, (5), is some memory or CPU register location, and any source code expressions that reference the variable are translated into machine code sequences that reference that memory or CPU register location.

Thus, when the translation is completed and the program is running on the processor, the names of the variables are effectively forgotten within the machine code, and the instructions generated by the compiler refer only to the variables' assigned locations (rather than to their names).  If you request debugging information when compiling, the location of the variable associated with each name is added to metadata for the program, though the processor still sees machine code instructions using locations (not that metadata).  (This is an oversimplification, as some names are kept in the program's metadata for the purposes of linking, loading, and dynamic lookup. Still, the processor just executes the machine code instructions it is given for the program, and in this machine code the names have been converted to locations.)

The same is also true for the type, scope, and lifetime.  The compiler-generated machine code instructions know only the machine version of the location, which stores the value.  The other properties, like type, are compiled into the translation as specific instructions that access the variable's location.  For example, if the variable in question is a signed 8-bit byte vs. an unsigned 8-bit byte, then expressions in the source code that reference the variable will be translated into, say, signed byte loads vs. unsigned byte loads, as needed to satisfy the rules of the (C) language.  The type of the variable is thus encoded into the translation of the source code into machine instructions, which command the CPU how to interpret the memory or CPU register location each and every time it uses the location of the variable.
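
The signed vs. unsigned byte example can be sketched in C++ as follows. The function names are invented for illustration, and the signed result assumes a two's complement representation (required since C++20, and universal in practice before that).

```cpp
#include <cstring>

// Reads one byte from `location` and widens it as a signed value
// (a sign-extending load on typical hardware).
int load_as_signed(const void* location) {
    signed char v;
    std::memcpy(&v, location, 1);
    return v;
}

// Reads the very same byte and widens it as an unsigned value
// (a zero-extending load on typical hardware).
int load_as_unsigned(const void* location) {
    unsigned char v;
    std::memcpy(&v, location, 1);
    return v;
}
```

For a stored byte of 0xFF, the first function yields -1 and the second yields 255. The byte in memory is identical in both cases; the type only decided which load the compiler emitted.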

The essence is that we have to tell the CPU what to do via instructions (and more instructions) in the machine code instruction set of the processor.  The processor remembers very little about what it just did or was told — it only executes the instructions given, and it is the compiler or assembly language programmer's job to give it a complete set of instruction sequences to properly manipulate variables.

A processor directly supports some fundamental data types, like byte/word/int/long signed/unsigned, float, double, etc.  The processor generally will not complain or object if you alternately treat the same memory location as signed or unsigned, for example, even though that would usually be a logic error in the program.  It is the job of the program to instruct the processor at every interaction with a variable.

Beyond those fundamental primitive types, we have to encode things in data structures and use algorithms to manipulate them in terms of those primitives.

In C++, objects of classes in a polymorphic class hierarchy have a pointer, usually at the beginning of the object, that refers to a class-specific data structure, which helps with virtual dispatch, casting, etc.
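
One indirect way to observe that hidden pointer is to compare object sizes. This is an implementation detail, not something the language standard promises, so treat the sketch below as an observation about mainstream compilers; the struct names are made up.

```cpp
#include <cstddef>

struct PlainPoint {
    int x;
    int y;
};

struct VirtualPoint {
    int x;
    int y;
    virtual ~VirtualPoint() = default;  // makes the class polymorphic
};

// Extra bytes each VirtualPoint carries compared with the same data
// members in a non-polymorphic struct. On common ABIs this is the
// hidden table pointer, possibly plus padding.
constexpr std::size_t hidden_bytes = sizeof(VirtualPoint) - sizeof(PlainPoint);
```

On typical 64-bit platforms hidden_bytes comes out as the size of one pointer, but the exact value depends on the compiler and ABI.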

In summary, the processor otherwise doesn't know or remember the intended use of storage locations — it executes the machine code instructions of the program that tell it how to manipulate storage in CPU registers and main memory.  Programming, then, is the job of the software (and programmers) to use storage meaningfully, and to present a consistent set of machine code instructions to the processor that faithfully execute the program as a whole.

4
  • 1
    Careful with "when the translation is completed, the name is forgotten" ... linking is done through names ("undefined symbol xy") and may well happen at run time with dynamic linking. See blog.fesnel.com/blog/2009/08/19/…. No debug symbols, even stripped: You need the function (and, I assume, global variable) name for dynamic linking. So only names of internal objects can be forgotten. By the way, good list of variable properties. Commented Oct 22, 2018 at 16:41
  • @PeterA.Schneider, you're absolutely right, in the big picture of things, that linkers and loaders also participate and use names of (global) functions and variables that came from the source code.
    – Erik Eidt
    Commented Oct 22, 2018 at 17:09
  • An additional complication is that some compilers interpret rules which, per the Standard, are intended to let compilers assume certain things won't alias as allowing them to regard operations involving different types as unsequenced, even in cases which do not involve aliasing as written. Given something like useT1(&unionArray[i].member1); useT2(&unionArray[j].member2); useT1(&unionArray[i].member1);, clang and gcc are prone to assume that the pointer to unionArray[j].member2 can't access unionArray[i].member1 even though both are derived from the same unionArray[].
    – supercat
    Commented Oct 22, 2018 at 18:37
  • Whether the compiler interprets the language specification correctly or not, its job is to generate machine code instruction sequences that carry out the program. This means that (modulo optimization and many other factors) for each variable access in the source code it has to generate some machine code instructions that tell the processor what size and data interpretation to use for the storage location. The processor doesn't remember anything about the variable so each time it is supposed to access the variable, it has to be instructed exactly how to do it.
    – Erik Eidt
    Commented Oct 22, 2018 at 20:12
2

If I define a variable of a certain type, how does it keep track of which type of variable it is?

There are two relevant phases here:

  • Compile time

The C compiler compiles C code to machine language. The compiler has all information that it can get from your source file (and libraries, and whatever other stuff it needs to do its job). The C compiler keeps track of what means what. The C compiler knows that if you declare a variable to be char, it is char.

It does this by using a so-called "symbol table" which lists the names of the variables, their type, and other information. It is a rather complex data structure, but you can think of it as just keeping track of what the human-readable names mean. In the binary output from the compiler, no variable names like this appear anymore (if we ignore optional debug information which may be requested by the programmer).

  • Runtime

The output of the compiler - the compiled executable - is machine language, which is loaded into RAM by your OS, and executed directly by your CPU. In machine language, there is no notion of "type" at all - it only has commands which operate on some location in RAM. The commands do indeed have a fixed type they operate with (e.g., there may be a machine language command "add these two 16-bit integers stored at RAM locations 0x100 and 0x521"), but there is no information anywhere in the system that the bytes at those locations actually are representing integers. There is no protection from type errors at all here.
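
A rough model of such a command, with a plain byte array standing in for RAM. The function names and the idea of passing offsets are assumptions made for this sketch, not a description of any real instruction set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Writes a 16-bit integer into the byte array at `offset`.
void store_u16(unsigned char* ram, std::size_t offset, std::uint16_t value) {
    std::memcpy(ram + offset, &value, sizeof value);
}

// Reads two bytes at `offset` and treats them as a 16-bit integer,
// whatever those bytes were originally meant to be.
std::uint16_t load_u16(const unsigned char* ram, std::size_t offset) {
    std::uint16_t value;
    std::memcpy(&value, ram + offset, sizeof value);
    return value;
}

// "Add the two 16-bit integers stored at locations a and b."
std::uint16_t add_u16_at(const unsigned char* ram, std::size_t a, std::size_t b) {
    return static_cast<std::uint16_t>(load_u16(ram, a) + load_u16(ram, b));
}
```

If 300 and 500 were stored at the two offsets, the result is 800. If the bytes at those offsets were really two characters of a string, add_u16_at adds them just the same and returns a number with no meaning.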

4
  • If by any chance you are referring to C# or Java with "byte code oriented languages" then pointers have by no means been omitted from them; quite to the contrary: Pointers are much more common in C# and Java (and consequently, one of the most common errors in Java is the "NullPointerException"). That they are named "references" is just a matter of terminology. Commented Oct 22, 2018 at 13:23
  • @PeterA.Schneider, sure, there is the NullPOINTERException, but there is a very definite distinction between a reference and a pointer in the languages I mentioned (like Java, ruby, probably C#, even Perl to some extent) - references go together with their type system, the garbage collection, the automatic memory management etc.; it is usually not even possible to explicitly state a memory location (like char *ptr = 0x123 in C). I believe my usage of the word "pointer" should be pretty clear in this context. If not, feel free to give me a heads-up and I'll add a sentence to the answer.
    – AnoE
    Commented Oct 22, 2018 at 14:42
  • pointers "go together with the type system" in C++ as well ;-). (Actually, Java's classic generics are less strongly typed than C++'s.) Garbage collection is a feature which C++ decided to not mandate, but it's possible for an implementation to provide one, and it has nothing to do with what word we use for pointers. Commented Oct 22, 2018 at 15:28
  • OK, @PeterA.Schneider, I don't really think we're getting level here. I've removed the paragraph where I mentioned pointers, it didn't do anything for the answer anyways.
    – AnoE
    Commented Oct 22, 2018 at 15:39
1

There are a couple of important special cases where C++ does store a type at runtime.

The classic solution is a discriminated union: a data structure that contains one of several types of object, plus a field that says what type it currently contains. A templated version is in the C++ standard library as std::variant. Normally, the tag would be an enum, but if you don't need all the bits of storage for your data, it might be a bitfield.

The other common case of this is dynamic typing. When your class has a virtual function, the compiler will store a pointer to that function in a virtual function table for the class. Normally, that will mean one virtual function table shared by all instances of the class, with each instance holding a pointer to the appropriate table, which is set when the instance is constructed. (This saves time and memory because the table will be much larger than a single pointer.) When you call that virtual function through a pointer or reference, the program will look up the function pointer in the virtual table. (If it knows the exact type at compile time, it can skip this step.) This allows code to call a derived type's implementation instead of the base class's.

The thing that makes this relevant here is: each ofstream contains a pointer to the ofstream virtual table, each ifstream to the ifstream virtual table, and so on. For class hierarchies, the virtual table pointer can serve as the tag that tells the program what type a class object has!

Although the language standard does not tell the people who design compilers how they must implement the runtime under the hood, this is how you can expect dynamic_cast and typeid to work.
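
As a minimal sketch of that runtime check, here is dynamic_cast used on a small made-up hierarchy. Stream, InputStream and OutputStream are hypothetical stand-ins, not the standard library stream classes, and the sketch assumes RTTI is enabled.

```cpp
// Hypothetical hierarchy; the virtual destructor makes it polymorphic,
// so every object carries a pointer to its class's virtual table.
struct Stream {
    virtual ~Stream() = default;
};
struct InputStream : Stream {};
struct OutputStream : Stream {};

// The caller only knows it has some Stream. dynamic_cast consults the
// type information reachable from the object to decide at runtime
// whether it is really an OutputStream.
bool is_output(Stream& s) {
    return dynamic_cast<OutputStream*>(&s) != nullptr;
}
```

is_output returns true for an OutputStream and false for an InputStream, even though both arrive as a plain Stream&. Without the virtual member, this dynamic_cast would not compile, because there would be nothing stored in the object to consult.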

2
  • "the language standard does not tell coders" you should probably emphasize that the "coders" in question are the folks writing gcc, clang, msvc, etc, not people using those to compile their C++.
    – Caleth
    Commented Oct 22, 2018 at 10:41
  • @Caleth Good suggestion!
    – Davislor
    Commented Oct 22, 2018 at 17:43
