
I have a script that uses a rather extensive XML-based datastore, and currently I do no real validation on the XML, which is becoming a problem since the XML is also human edited. In the short term I would like to validate the XML as not just well-formed XML, but valid data for my purposes. Ultimately I also want to revise the XML, for example moving data from attributes to nodes. In between I want to allow for both: when I load the XML, I can look for a node that could also be an attribute and assign the attribute to a newly created node (in memory) if need be (a rough sketch of what I mean is below).

Conceptually, at least for the validation, a schema is obviously the right answer. However, it doesn't appear that a schema can support the other needs: mapping attributes to nodes temporarily, and ultimately actually changing the XML, i.e. creating nodes, assigning values from attributes, deleting the attributes, and saving back to the XML file.

My thinking is that I should create an XML file that maps all this out. It would start by defining what is "valid" XML, which I could use in code to validate my other XML now. Then I could extend it to also map attributes to new nodes and use that to create those nodes on ingest, so my work code could use the node-based XML while the file XML still uses attributes. Then later again I could add code to revise the XML files themselves. All of which is a lot of work, so I am asking here to make sure this process really does make sense long term. If either PowerShell or XML Schemas already offered a great way to do this without all the extra code, I would hate to roll my own.
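For reference, the in-memory attribute-to-node promotion I have in mind looks roughly like this; the <Task> element and the Path attribute/node names are just placeholders for my real data:

    # Load the datastore and, for any Task that still stores Path as an
    # attribute, mirror it into a <Path> child node in memory only.
    [xml]$doc = Get-Content -Raw 'C:\data\datastore.xml'
    foreach ($task in $doc.SelectNodes('//Task')) {
        if ($task.HasAttribute('Path') -and -not $task.SelectSingleNode('Path')) {
            $node = $doc.CreateElement('Path')
            $node.InnerText = $task.GetAttribute('Path')
            [void]$task.AppendChild($node)
            # The attribute stays for now; the file on disk is untouched.
        }
    }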

And, assuming roll my own is the answer, I am curious about one implementation detail. Currently I load the XML, then at the point of use, namely various "task" functions, I read that XML into variables that I then modify and use (expanding tokens to create final file paths, etc.). Alternatively I could revise the XML itself in memory, and I am curious whether there is a performance reason to prefer one approach over the other. The extra variables mean extra memory use, but they are all function-scoped, so they get garbage collected eventually. The total XML is a few KB at most, so my sense is that performance isn't the issue to focus on; ease of coding is. But since I have found no really good way to profile PowerShell performance, I am just guessing.
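The closest I have come to measuring anything is wrapping candidate code in Measure-Command for rough elapsed times; the expressions below are just placeholders for the real work inside a task function:

    # Rough comparison only: time the copy-into-variables approach against
    # working directly on the in-memory XML, then print both in milliseconds.
    $viaVariables = Measure-Command {
        $paths = $doc.Tasks.Task | ForEach-Object { $_.Path }
    }
    $viaXml = Measure-Command {
        $nodes = $doc.SelectNodes('//Task/Path')
    }
    '{0:N3} ms vs {1:N3} ms' -f $viaVariables.TotalMilliseconds, $viaXml.TotalMilliseconds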

1 Answer


The plan described in that huge wall of text contains a staggering amount of unnecessary work and wheel reinvention.

Experts and novices alike have solved such problems before you by using the right tool for the job:

  • Validation: Use a standard XML schema language such as XSD, RELAX NG, or Schematron to express the vocabulary and grammar of your XML. Use an off-the-shelf validating parser to check whether your XML adheres to the schema. Do not expect any transformation capability here, just an answer to the question of whether the XML adheres to the schema, plus diagnostic messages that indicate where it does not. (See the first sketch after this list.)
  • Transformation: Use XSLT to map XML from an old schema to a new or updated one. Second choice: use a procedural language with solid XML parsing and preferably XPath support; PowerShell qualifies. (See the second sketch after this list.)
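For example, validation needs no custom framework at all; PowerShell can lean directly on the .NET validating reader. A minimal sketch, with hypothetical file names (datastore.xml, datastore.xsd):

    # Validate datastore.xml against datastore.xsd with the .NET validating
    # reader; both file paths here are placeholders.
    $settings = New-Object System.Xml.XmlReaderSettings
    $settings.ValidationType = [System.Xml.ValidationType]::Schema
    [void]$settings.Schemas.Add($null, 'C:\data\datastore.xsd')
    $settings.add_ValidationEventHandler({
        param($sender, $e)
        # Report each violation with its location instead of stopping at the first
        Write-Warning ("{0} (line {1})" -f $e.Message, $e.Exception.LineNumber)
    })

    $reader = [System.Xml.XmlReader]::Create('C:\data\datastore.xml', $settings)
    try     { while ($reader.Read()) { } }   # reading the whole file drives validation
    finally { $reader.Close() }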
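And when the day comes to actually rewrite the files, applying a stylesheet from PowerShell is only a few lines once the stylesheet is written; the stylesheet itself would just be the identity transform plus one template that matches the parent element, re-emits its other attributes, and writes the attribute's value as a child element instead. Another sketch, again with hypothetical file names (promote-path.xslt and the datastore files):

    # Apply an XSLT that copies everything unchanged except that it rewrites
    # the chosen attribute as a child element. All paths are placeholders, and
    # output goes to a new file so the original is kept until verified.
    $xslt = New-Object System.Xml.Xsl.XslCompiledTransform
    $xslt.Load('C:\data\promote-path.xslt')
    Get-ChildItem 'C:\data\*.xml' | ForEach-Object {
        $out = Join-Path $_.DirectoryName ($_.BaseName + '.new.xml')
        $xslt.Transform($_.FullName, $out)
    }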

Finally, forget performance. You would have to try really hard with standard tools to have a performance problem with "a few KB" of XML data. Focus on expressiveness and clarity of code and on programmer productivity; using established tools and standards will help tremendously.

