Given the description, it appears the code base isn't merely meant to enable loading/saving as XML; it appears to me that it's designed to faithfully recreate an in-memory representation of the contents of an XML document, down to the tiniest detail.
Hence the complexity: XML is known to be enormously complicated.
Moreover, XML is sometimes used to represent huge documents - an XML document may contain hundreds of millions of tags. I think it should be apparent, in 2024, that XML is not a good choice. But the designers of an XML library have to account for that, or else they'd need to make their users pay attention to the various system design limits.
Studying such a code base can be rewarding for seasoned programmers, but it could also be detrimental for junior programmers, because the code base tends not to contain any explanation of the "why's". Don't fall into the trap of worrying too much - it can impede the normal thinking process of a sane person. Only study such code when you are well equipped with the knowledge and the concrete use cases (system requirements).
If we know the context of the code base, it is possible to reverse-engineer the design decisions, like peeling an onion. If not, we can always ask the programmers who originally created or used the code base.
Working from first principles: all tags have names, and names can be organized into namespaces. A namespace can be referred to by a prefix, and the prefix acts as an alias for it. Also, namespaces must be strong-named by specifying a URI.
If we do not consider any namespaces, we should be able to treat a name as a value-like class. A value-like class is immutable; being immutable means that, from the user's perspective, it makes no observable difference whether an instance is copied or reference-shared. This simplifies the design.
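To make the "value-like" idea concrete, here is a minimal sketch. The class and its member names are my own invention for illustration, not taken from the code base in question. Copies share one immutable string, yet compare exactly like independently constructed values:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical value-like name: immutable after construction.
class Name {
public:
    explicit Name(std::string text)
        : text_(std::make_shared<const std::string>(std::move(text))) {}

    const std::string& str() const { return *text_; }

    // Equality looks only at the text, never at how it is stored.
    friend bool operator==(const Name& a, const Name& b) {
        return a.text_ == b.text_ || *a.text_ == *b.text_;
    }
    friend bool operator!=(const Name& a, const Name& b) { return !(a == b); }

    // Exposed only to demonstrate sharing; a real class would hide this.
    bool shares_storage_with(const Name& other) const {
        return text_ == other.text_;
    }

private:
    std::shared_ptr<const std::string> text_;  // shared, never modified
};
```

Because there is no mutating member, a user cannot tell whether `Name b = a;` copied the characters or just bumped a reference count.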
When namespaces are considered, it is now necessary for users to navigate from a Name to its Namespace, and to iterate through all Names within a given Namespace. Thus, Name and Namespace become relational.
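A rough sketch of that two-way relationship, again with made-up class names and a deliberately naive ownership model (the namespace owns its names, and each name keeps a back-pointer):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Namespace;

// Hypothetical name that can navigate back to its namespace.
class Name {
public:
    Name(const Namespace& ns, std::string local)
        : ns_(&ns), local_(std::move(local)) {}
    const Namespace& ns() const { return *ns_; }
    const std::string& local() const { return local_; }

private:
    const Namespace* ns_;  // back-pointer; the namespace must outlive the name
    std::string local_;
};

// Hypothetical namespace, identified by its URI, that owns its names.
class Namespace {
public:
    explicit Namespace(std::string uri) : uri_(std::move(uri)) {}
    // Non-copyable, so the back-pointers held by names stay valid.
    Namespace(const Namespace&) = delete;
    Namespace& operator=(const Namespace&) = delete;

    const std::string& uri() const { return uri_; }

    const Name& add(std::string local) {
        names_.push_back(std::make_unique<Name>(*this, std::move(local)));
        return *names_.back();
    }

    // Allows iterating over all names in this namespace.
    const std::vector<std::unique_ptr<Name>>& names() const { return names_; }

private:
    std::string uri_;
    std::vector<std::unique_ptr<Name>> names_;
};
```

Note how much the design has already grown compared to a plain value: there are now lifetime rules (the namespace must outlive its names) and identity matters, so the objects can no longer be freely copied.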
The possibility of very large documents containing hundreds of millions of instances of "names" forces us to contend with the issue of memory consumption. Ideally, if the same name occurs in the document often enough, we would like to have just one C++ instance of this name, so that it can be shared. Replacing actual copies (e.g. std::string) with a 64-bit pointer is an improvement, but in applications that handle such large amounts of data, even a 64-bit pointer per occurrence may not be small enough.
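One common way to get below pointer size is to intern the names in a table and hand out small integer ids. The following is only a sketch under my own assumptions (C++17, at most 2^32 distinct names, hypothetical names `NameTable` and `NameId`); I don't know whether the code base in question does it this way:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Assumption: 32 bits are enough to number all *distinct* names.
using NameId = std::uint32_t;

// Hypothetical interning table: each distinct name gets one id.
class NameTable {
public:
    // Returns the existing id for this text, or assigns the next free one.
    NameId intern(std::string_view text) {
        auto it = ids_.find(std::string(text));
        if (it != ids_.end()) return it->second;
        NameId id = static_cast<NameId>(names_.size());
        names_.emplace_back(text);
        ids_.emplace(names_.back(), id);
        return id;
    }

    // Throws std::out_of_range for an id that was never handed out.
    const std::string& lookup(NameId id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }

private:
    // For simplicity the text is stored twice (vector and map key);
    // a production version would avoid that.
    std::vector<std::string> names_;
    std::unordered_map<std::string, NameId> ids_;
};
```

Each occurrence of a name in the document then costs 4 bytes instead of 8, and comparing two names becomes an integer comparison. The price is that every access needs the table at hand.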
To add to this complexity, some XML frameworks allow in-memory mutability and editing. This means we cannot assume Name to be immutable.
And then the framework may allow multithreaded usage. Combined with mutability, it's now necessary to protect shared state with locking, for example a mutex.
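As an illustration of what that costs, here is a sketch of a shared name table guarded by a single mutex. The class name is hypothetical, and coarse-grained locking is just the simplest option; a real framework might well use something finer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical name table that may be used from several threads.
class SharedNameTable {
public:
    std::uint32_t intern(const std::string& text) {
        std::lock_guard<std::mutex> lock(mutex_);  // one writer at a time
        auto it = ids_.find(text);
        if (it != ids_.end()) return it->second;
        auto id = static_cast<std::uint32_t>(ids_.size());
        ids_.emplace(text, id);
        return id;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);  // readers lock too
        return ids_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so const members can lock
    std::unordered_map<std::string, std::uint32_t> ids_;
};
```

Every member function now starts with the same locking line, and every caller pays for the lock even in single-threaded use. This is exactly the kind of repetition that spreads across a whole framework.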
Finally, we add event-driven programming features (callback listeners). I'm not sure why it's needed, but hey, we're approaching Michelin one-star, if we just add one more feature across the entire framework.
Such a framework will necessarily contain a lot of boilerplate. In C++, it is common to replace this boilerplate with templates and/or C-style macros. That allows senior programmers to reason about the code at a higher abstraction level; however, it makes the code base less accessible to juniors.
If I were to design this, I'd start by asking which of these aren't necessary. Imagine you could earn a million dollars for each feature that can be omitted.