13

Suppose we have an Instance class in a C++ program, which has a GUID/UUID, name, parents, children, and other properties which can be saved to or loaded from an XML file.

The intuitive approach for representing the name of the Instance is to give it a property of type std::string, char*/char[], or whatever other generic string type is being used. Some programs, however, use a separate Name class as a wrapper around a string. I've looked at source code which does this, but from everything I've seen, the Name class has no purpose besides providing a mutex, various asserts and other runtime error checking, and convoluted-looking declaration methods (usually a combination of functions like declare() { <declaration code> }, doDeclare() { <some convoluted stuff>; declare() }, and callDoDeclare() { <even more convoluted nonsense>; doDeclare() }).

Is there actually a reason to use a special Name class, or can I just use a regular string property? Should I even be worrying about this so early on in the development of my project?

EDIT: I wasn't precise enough about the purpose of a Name type in most programs. A Name isn't as much of a label for an Instance as it is general metadata: for example, a Descriptor class could have a name member, and there could be a Property with a Descriptor where the name contains the string "Name", and this property named "Name" could be used to represent the name of an Instance. Not to mention that Instances aren't just encoded and decoded from XML data, but can also be accessed through a scripting API (like in ROBLOX; yes, this question is about ROBLOX, but I didn't want to bring it up at first).

3
  • 10
    You mention "convoluted-looking declaration methods", and then seem dismissive, but it looks to me like it might be important. Have you dug into what they do, and tried to figure out why they're implemented that way?
    – Craig
    Commented Apr 26 at 15:09
  • 1
    (Re: the question has been edited) It's an example of Second-system effect
    – rwong
    Commented Apr 27 at 0:32
  • 1
    To sharpen @Craig's comment: might the fancy-looking constructors be an instance of the Flyweight Pattern (also known as the Borg Pattern) or string interning? One particular reason to do this is constant-time comparison for names when lexicographic ordering is not important. These sorts of patterns would justify a class, and then the other small greebles would feep along later.
    – Corbin
    Commented Apr 27 at 1:07

6 Answers

42

from everything I've seen, the Name class has no purpose besides providing a mutex, various asserts and other runtime error checking

This is the reason why: it has behavior associated with it, which is all the more important when you see <some convoluted stuff> and <even more convoluted nonsense>. This convoluted logic has a place to live without becoming scattered and intermingled with other concerns in your application. It also lets you pass a Name around instead of another, more broadly scoped object in cases where you need the logic for the identifier or name, but not the logic bound to the object it identifies.

Passing a Name around instead of an std::string gets you some compile-time checks that are not possible when just passing around a string. This avoids Stringly-Typed code and Primitive Obsession. When passing a string, is it the name of the kind of object you think it is? The function receiving a Name as a string has no way of knowing this.
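
A minimal sketch of that compile-time check, assuming a bare-bones wrapper. The Name class, the isReserved function and the "Root" value below are hypothetical and only illustrate the idea; they are not taken from any real code base:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: a string that has been explicitly marked as a name.
class Name {
public:
    // explicit: a std::string never silently turns into a Name.
    explicit Name(std::string value) : value_(std::move(value)) {}

    const std::string& str() const { return value_; }

    friend bool operator==(const Name& a, const Name& b) { return a.value_ == b.value_; }

private:
    std::string value_;
};

// The compiler itself confirms there is no implicit conversion.
static_assert(!std::is_convertible<std::string, Name>::value,
              "std::string must not convert to Name implicitly");

// Hypothetical function that only accepts a Name, not an arbitrary string.
inline bool isReserved(const Name& name) {
    return name == Name("Root");
}

inline void usageExample() {
    std::string xmlAttribute = "Root";
    // isReserved(xmlAttribute);            // does not compile: a plain string is not a Name
    assert(isReserved(Name(xmlAttribute))); // the caller has to state "this string is a name"
}
```

The commented-out call is the point: the mistake is rejected by the compiler instead of surfacing at runtime.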


2
17

from everything I've seen, the Name class has no purpose besides [a list of various purposes]

Sorry if this comes across as facetious but you rolled from claiming there's no purpose into listing the actual purposes. It's possible you're dismissing these as valid or productive contributions to the type definition, which they most certainly are; but it's only going to make sense when you understand the goal that they're trying to achieve.

Have you heard of primitive obsession? It's the direct answer to your question. For your current example, the specific scenario "replace data value with object" is the one that applies most, though I recommend reading through the entire parent page to build an understanding of this guideline.

One of the main benefits here is that you get compile-time type safety. You can't just pass any string into this, it has to be a Name, which means that it forces you to explicitly confirm that this string value is indeed a name.
The second main benefit is that having a custom type here allows you to write custom logic to operate on this type. For example, you might want to generate a filepath-safe variant of this name, or you might want to write some equality check that's broader than just string == string. Having a type enables you to do so in a clear and reusable location.
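
As a rough sketch of that second benefit, here is a Name type carrying both kinds of custom logic. The specific rules (which characters count as filepath-safe, and comparing names while ignoring ASCII case) are assumptions made up for this example; your project would define its own:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

class Name {
public:
    explicit Name(std::string value) : value_(std::move(value)) {}

    const std::string& str() const { return value_; }

    // Assumed rule: every character other than a letter, digit, '-' or '_'
    // (as classified by std::isalnum in the current C locale) becomes '_'.
    std::string toFileSafe() const {
        std::string out = value_;
        for (char& c : out) {
            const unsigned char u = static_cast<unsigned char>(c);
            if (!std::isalnum(u) && c != '-' && c != '_') c = '_';
        }
        return out;
    }

    // Assumed rule: names are equal if they match when case is ignored.
    friend bool operator==(const Name& a, const Name& b) {
        if (a.value_.size() != b.value_.size()) return false;
        for (std::size_t i = 0; i < a.value_.size(); ++i) {
            const int x = std::tolower(static_cast<unsigned char>(a.value_[i]));
            const int y = std::tolower(static_cast<unsigned char>(b.value_[i]));
            if (x != y) return false;
        }
        return true;
    }
    friend bool operator!=(const Name& a, const Name& b) { return !(a == b); }

private:
    std::string value_;
};

inline void usageExample() {
    assert(Name("My Part/1").toFileSafe() == "My_Part_1");
    assert(Name("Door") == Name("DOOR"));
}
```

Every caller gets the same comparison and the same sanitising rule, rather than each call site reimplementing them on raw strings.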

Should I even be worrying about this so early on in the development of my project?

This is a difficult question to answer when you're still learning about these guidelines. On the one hand, it would be productive to already account for hard lessons that others have had to learn, without needing to make the mistake and have to recover from it yourself. On the other hand, YAGNI always looms over you, and it's really easy to get stuck in analysis paralysis if you try to include every guideline from the get go.

So it's up to you. Would you rather do some research and make a conscious decision to include this guideline? Would you rather blindly follow it on the supposition that it conveys a benefit down the line? Or would you rather not implement something you don't understand and accept that you might have to learn this the hard way?

There's no wrong answer, just pick the answer that works best for how you learn things.

Note also that this is why mentoring is so prevalent for people who are learning the ropes, because a mentor is able to judge on the fly which guidelines make the most sense for the current scenario, while balancing both the number of guidelines to follow and the severity of failing to follow them. If you have access to a person with more experience in this field who is willing to sanity check your work before you commit to it, that would definitely be helpful to rely on.

12

the Name class has no purpose besides providing (...)

These are good reasons to introduce a separate class, but even if Name didn't have any additional behaviour, it would still be beneficial to have a separate class. Two words: strong typing.

I'll give you an example. Some time ago I worked on a project that dealt with multiple classes, say A, B, C. Each of those classes had an id field, and all those fields had the same int type. So how is that bad? The project revolved around making various sophisticated aggregations and calculations. We would often deal with nested maps like A.id -> [B.id -> [C.id -> (something)]]. So we would group by A.id, then by B.id, then by C.id. This is a simplistic example; in a real scenario such nesting could be of depth 10 or even 20. And we would often have errors because the wrong id ended up at the wrong level. At the time we could only detect this at runtime, and due to the complexity of the process, we often weren't able to write proper tests. The other team, the analysts, was able to detect these problems, but then the entire testing process took a very long time.

This could've been very easily avoided with strong typing. All we needed to do is to define AId, BId, CId classes and set those ids on A,B,C. And voilà, problem solved. I've actually proposed this change, but the decision was that the refactoring of this huge codebase would be too costly.
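
To illustrate, here is a small C++ sketch of the idea (the original project was not necessarily C++, and AId, BId, CId, Totals and addTo are names invented for this example):

```cpp
#include <cassert>
#include <map>

// Hypothetical id types: each wraps the same int, but is a distinct type.
struct AId { int value; };
struct BId { int value; };
struct CId { int value; };

// std::map needs an ordering for its keys.
inline bool operator<(AId l, AId r) { return l.value < r.value; }
inline bool operator<(BId l, BId r) { return l.value < r.value; }
inline bool operator<(CId l, CId r) { return l.value < r.value; }

// A.id -> [B.id -> [C.id -> total]]
using Totals = std::map<AId, std::map<BId, std::map<CId, double>>>;

inline void addTo(Totals& totals, AId a, BId b, CId c, double amount) {
    totals[a][b][c] += amount;  // missing entries start at 0.0
}

inline void usageExample() {
    Totals totals;
    addTo(totals, AId{1}, BId{7}, CId{42}, 2.5);
    // addTo(totals, BId{7}, AId{1}, CId{42}, 2.5);  // does not compile: ids at the wrong level
    // totals[BId{7}];                               // does not compile either
    assert(totals[AId{1}][BId{7}][CId{42}] == 2.5);
}
```

With plain int keys both commented-out lines would compile and quietly produce wrong aggregates; with distinct types the compiler rejects them.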

The conclusion of this story is: we did not need the int type. We only needed an id per class. That is the real need. Similarly, your Name class describes its purpose; it is irrelevant how it is implemented under the hood. Purpose. It matters more than concrete representation.

There are other benefits of such a design. Say I decide one day I want to change the underlying type of my id field to long or uuid, or string, or whatever. If designed correctly this might mean changing a single line of code. Or maybe a couple of lines of code, e.g. for conversion between this type and the database type. Without this abstraction such refactoring would be costly.

2
  • Indeed, the tagged integer technique is used in many projects to solve this issue; and to apply this technique in C++, it is necessary to define those ID types as classes.
    – rwong
    Commented Apr 26 at 16:45
  • Yes, the product I work on has a lot of numeric IDs, and while we don't actually do it, it would be useful sometimes to have (e.g.) methods which accept CustomerNumber and SiteNumber types instead of a pair of integers which can be (and often are) easily transposed. Commented Apr 28 at 21:18
7

@Flater correctly identified the Primitive Obsession issue, but it may be warranted to explain a bit more why primitive obsession is an issue.

What's in a type?

Types are used for a variety of purposes, so sometimes it's easy to get lost.

At a minimum, a type is:

  • A set of values.
  • On which a set of sensible operations is provided.

For example, for a String:

  • Set of values: any sequence of any characters, from 0 to infinity.
  • Set of operations: many, many, different operations.

Is a String a good name, thus? Arguably no:

  • What does it mean for a name to be empty?
  • Is it problematic for a name, in this application, to contain punctuation? Non-printable characters? To be thousands of characters long?

That is, explicitly or implicitly, a good name probably has a set of values that is a subset of all possible string values.

A dedicated type (Name) can be used to establish and maintain invariants:

  • A Name is never empty.
  • A Name only contains characters in the [0-9A-Za-z_] set.
  • A Name is between 5 and 30 characters long.

Note that maintaining the invariants implies that not all String operations are available. In fact, in all likelihood, a Name is immutable once built, so invariants only have to be verified at construction.
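
A minimal C++ sketch of such a type, using the three example invariants listed above. Throwing std::invalid_argument on violation and the isValidName helper are choices made for this sketch, not the only reasonable ones:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

class Name {
public:
    // The invariants are checked once, here; a Name that exists is always valid.
    explicit Name(std::string value) : value_(std::move(value)) {
        if (value_.size() < 5 || value_.size() > 30)
            throw std::invalid_argument("Name must be 5 to 30 characters long");
        for (char c : value_) {
            const bool allowed = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                                 (c >= 'a' && c <= 'z') || c == '_';
            if (!allowed)
                throw std::invalid_argument("Name may only contain [0-9A-Za-z_]");
        }
    }

    // Read-only access: with no mutators, the invariants cannot be broken later.
    const std::string& str() const { return value_; }

private:
    std::string value_;
};

// Hypothetical helper for callers that prefer a yes/no answer to an exception.
inline bool isValidName(const std::string& candidate) {
    try {
        Name checked(candidate);
        (void)checked;
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}

inline void usageExample() {
    assert(isValidName("Part_01"));
    assert(!isValidName(""));           // empty, hence too short
    assert(!isValidName("has space"));  // ' ' is outside the allowed set
}
```

The length check already rules out the empty string, so the "never empty" invariant needs no separate test.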

Going further, I want to emphasize the sensible adjective: just because an operation can exist, does not mean it should exist:

  • What does it mean to catenate two Names? It's quite likely nonsensical.
  • What does it mean to look up a pattern in a Name? Looks like a hack that'll come back and bite us later; it should be a proper property instead.

When using a String, not only do you not have invariants, you also have an unrestricted set of operations many of which make no sense whatsoever -- or worse, encourage bad practices -- for a particular use of String.

Strong Typing

The practice of strong typing goes even further, by adding specific semantics to a type.

It is quite likely that in a given application, the Id used for a cat and a dog is similar: same invariants, same valid set of operations.

Yet, using the same type for both may lead to a cat-lover ending up with a dog in their lap instead, and they won't be happy.

Applying Strong Typing, two types should be created: CatId and DogId, possibly sharing some code, inheriting from the same base class, etc... but allowing us to differentiate between Cat & Dog when it matters, so that we cannot accidentally mix them up when we do not intend to.
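
One possible way to share the code in C++ is a class template parameterised on an empty tag type. This is a sketch under that assumption; Id, CatTag, DogTag, Cat and fetchCat are names made up for the example:

```cpp
#include <cassert>
#include <type_traits>

// Shared implementation. Tag is never used for storage; it only makes
// Id<CatTag> and Id<DogTag> two unrelated types.
template <typename Tag>
class Id {
public:
    explicit Id(int value) : value_(value) {}
    int value() const { return value_; }

    friend bool operator==(Id a, Id b) { return a.value_ == b.value_; }
    friend bool operator!=(Id a, Id b) { return !(a == b); }

private:
    int value_;
};

struct CatTag {};
struct DogTag {};
using CatId = Id<CatTag>;
using DogId = Id<DogTag>;

static_assert(!std::is_same<CatId, DogId>::value, "CatId and DogId are distinct types");
static_assert(!std::is_convertible<DogId, CatId>::value, "a DogId is never silently a CatId");

struct Cat { CatId id; };

// Hypothetical lookup: it can only ever be asked for a cat.
inline Cat fetchCat(CatId id) { return Cat{id}; }

inline void usageExample() {
    Cat cat = fetchCat(CatId(3));
    // fetchCat(DogId(3));     // does not compile: no dog in the cat-lover's lap
    // CatId(3) == DogId(3);   // does not compile either: the ids are not comparable
    assert(cat.id == CatId(3));
}
```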

It's typically more useful for statically typed languages, obviously, as there type mismatches are raised systematically.

3
  • @gnasher729: Cute example, completely off-topic since in this case we're talking about the name of an Instance class. Commented Apr 26 at 11:14
  • What do you mean with "catename"? Did you mean "concatenate"? Commented Apr 26 at 14:51
  • @MarkRotteveel: I meant catenate, yes. Commented Apr 26 at 16:07
2

If you have a “Name” class instead of string, you can use it to split into family name and given name, salutation, ordering (that’s why I didn’t say “first name” because for some people the family name comes first), how to call this person (not always the first of the given names, sometimes something totally different). You can add these features bit by bit.

Without any changes in existing code.

4
  • 2
    While I'm not sure if the OP is asking about people's names, I've done precisely this with Person entities before. Usernames and e-mails are another area I wished I had created value classes for years ago. Can't tell you how many bugs I've had to squash because I forgot to do a case-insensitive comparison of usernames or emails. It's simple, but easy to miss while you're cruising along writing code. Commented Apr 25 at 22:35
  • 2
    Having a class which simply records which kind of comparison and collation you should be using is extremely valuable. But for people's names, we should always remember kalzumeus.com/2010/06/17/…
    – pjc50
    Commented Apr 26 at 8:32
  • 1
    @pjc50 But you can reduce the number of problematic cases. After a company I worked for was bought out, we found that about 25% of employees were not known by the name in their passport. Including the CEO whose name had nothing whatsoever to do with his passport.
    – gnasher729
    Commented Apr 26 at 10:59
  • @gnasher729 Yes, that's extremely normal that the conversational name and "government name" diverge.
    – pjc50
    Commented Apr 26 at 15:47
2

Given the description, it appears the code base isn't just (merely) to enable loading/saving as XML; it appears to me that it's designed to faithfully recreate an in-memory representation of the contents of an XML document, down to the tiniest detail.

Hence the complexity: XML is known to be enormously complicated.

Moreover, XML is sometimes used to represent huge documents - an XML document may contain hundreds of millions of tags. I think it should be apparent, in the year 2024, that XML is not a good choice for that. But the designers of an XML library had to account for it, or else they'd need to make their users pay attention to the various system design limits.

Studying such a code base can be rewarding for seasoned programmers, but it could also be detrimental for junior programmers, because the code base tends not to contain any explanation of the "why's". Don't fall into the trap of worrying too much - it can impede the normal thinking process of a sane person. Only do it when you are well equipped with the knowledge and the concrete use cases (system requirements).

If we know the context of the code base, it is possible to reverse-engineer the design decisions like peeling an onion. If not, we can always ask the programmers who originally created or used the code base.

Working from first principles: all tags have names, and names can be organized into namespaces. Namespaces can be specified as a prefix; aliases can be created. Also, namespaces must be strong-named by specifying a URI.

If we do not consider any namespaces, we should be able to treat name as a value-like class. A value-like class is immutable; being immutable means that from the user's perspective it is indistinguishable if it's copied or reference-shared. This simplifies the design.

When namespaces are considered, it is now necessary for users to navigate from a Name to its Namespace, and to iterate through all Names within a given Namespace. Thus, Name and Namespace become relational.

The possibility of very large documents containing hundreds of millions of instances of "names" forces us to contend with the issue of memory consumption. Ideally, if the same name occurs in the document often enough, we would like to have just one C++ instance of this name, so that it can be shared. Replacing actual copies (e.g. std::string) with a pointer (64-bit) is good, but in applications that handle such large amounts of data, reducing to a 64-bit pointer is still not good (small) enough.
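
One common way to get there is string interning: store each distinct name once in a table and hand out a small integer handle instead of a pointer. The following is a deliberately simplified sketch; NameTable and NameId are hypothetical names, and a real library would likely use a more compact layout:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical handle: a 32-bit index into the table instead of a 64-bit pointer.
struct NameId {
    std::uint32_t index;
    friend bool operator==(NameId a, NameId b) { return a.index == b.index; }
};

class NameTable {
public:
    // Returns the existing handle for text, or stores it and returns a new handle.
    NameId intern(const std::string& text) {
        const auto found = ids_.find(text);
        if (found != ids_.end()) return NameId{found->second};
        const auto index = static_cast<std::uint32_t>(strings_.size());
        strings_.push_back(text);
        ids_.emplace(text, index);
        return NameId{index};
    }

    const std::string& text(NameId id) const { return strings_.at(id.index); }
    std::size_t size() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;                    // handle -> text
    std::unordered_map<std::string, std::uint32_t> ids_;  // text -> handle (a second copy
                                                          // of each string, for simplicity)
};

inline void usageExample() {
    NameTable table;
    const NameId a = table.intern("Part");
    const NameId b = table.intern("Model");
    const NameId c = table.intern("Part");  // same text, same handle, nothing new stored
    assert(a == c);
    assert(!(a == b));
    assert(table.size() == 2);
    assert(table.text(b) == "Model");
}
```

A side effect is that comparing two interned names becomes a single integer comparison, regardless of how long the names are.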

To add to this complexity, some XML frameworks allow in-memory mutability and editing. This means we cannot assume Name to be immutable.

And then the framework may allow multithreaded usage. Combined with mutability, it's now necessary to implement a threaded mutex.

Finally, we add event-driven programming features (callback listeners). I'm not sure why it's needed, but hey, we're approaching Michelin one-star, if we just add one more feature across the entire framework.

Such a framework will necessarily contain a lot of boilerplate. In C++, it is common to replace this boilerplate with templates and/or C-style macros. That allows senior programmers to reason about the code at a higher abstraction level; however, it makes the code base less accessible to juniors.

If I were to design this, I'd start by asking which of these aren't necessary. Imagine if you could earn a million dollars for each feature that can be omitted.

2
  • This is probably my favourite answer. It seems like a good idea to keep things as simple as possible, considering that I'm a relatively inexperienced programmer, working on a very large project with much room for growth. Perhaps I should not focus so heavily on replicating the functionality of various 20+ year old software products, but instead on making something simple that actually works, expanding on it later if I have to.
    – AcinonX
    Commented Apr 26 at 21:51
  • "making something simple that actually works, expanding on it later if I have to" This is probably the most important concept I've adopted over the years. I love programming & finding interesting and novel ways of doing things, which is great if I never want to actually finish what I'm working on (e.g. personal projects I'm "playing" around with). But for things/ tasks/ projects that I'd like to actually complete I try to always get it working first, even if it is ugly, inefficient, inelegant, whatever. Then re-factor if I have the time, inkling, or a necessity to do so. Especially for a job.
    – RIanGillis
    Commented Apr 28 at 16:18
