2
$\begingroup$

Suppose that we are working with an object-oriented compiled language, such as something C-flavored or Java-flavored.


There is a class named string

  • string.ascii_letters is not a valid attribute access.

  • string.asciiletters IS a valid attribute access.

A feature of a new programming language might be to bind the correct attribute to the invocation despite the presence of a spurious underscore character.


How might you allow an invalid attribute name, such as string.ascii_letters, so that when the compiler deletes all underscores from the invalid name and searches the class attributes for the resulting underscore-free name, it finds a match?


If string.ascii_letters and string.asciiletters both exist, then the compiler can print an error message or raise an exception, or something like that.


We do not want an underscore insensitive language in the same way that there exist white-space insensitive languages.


Maybe the underscore matters sometimes.

For example, dog.len and dog._len might both be attributes of the class named dog.


The idea is to allow automatically generated aliases of class method names, and other class attributes, without generating all possible aliases.


The language would be semi-underscore-sensitive.

If there exist two labels in the current scope which differ only in the number of underscore characters present in the label, then the label is underscore-sensitive.

Suppose that there is a label L in the current scope such that, for any label L′ in the current scope, if L ≠ L′ then L ≠ L′ still holds after all underscores are deleted from both L and L′. Then label L is not underscore-sensitive in the current scope.


I am not talking about a spell-checker in the Integrated Development Environment.

The language itself, and compiler, would allow underscores to be misused provided that no two variables, objects, or class attributes in the current scope differ only by the number of underscore characters used inside of the label.


How would you approach implementing such a feature?

$\endgroup$
5
  • 2
    $\begingroup$ When you say "misspelled" class attributes in the title, are you referring specifically to these surplus/omitted underscores in their names only, or to misspellings more broadly? $\endgroup$
    – Michael Homer
    Commented Apr 5 at 1:24
  • 7
    $\begingroup$ The only sane reason to check for this is to give suggested completions which are robust to misspellings (e.g. if the user writes foo.ascii_ then .asciiletters is still offered as a suggestion), or suggested fixes in error messages (e.g. if the user writes foo.ascii_letters, the error message says "no attribute named ascii_letters, did you mean asciiletters?"). Actually compiling a misspelled attribute as if it had been spelled correctly is dangerous; it creates a hazard in case a new attribute is added later with the other name. Fixing typos is easy; fixing bugs like that is hard. $\endgroup$
    – kaya3
    Commented Apr 5 at 1:43
  • 1
    $\begingroup$ Please remove all those horizontal lines, they make the question really hard to read. $\endgroup$
    – Bergi
    Commented Apr 5 at 15:49
  • 3
    $\begingroup$ Do you really need to do this in the language? This is something that should be in the editor, not the language. $\endgroup$
    – Barmar
    Commented Apr 5 at 23:05
  • $\begingroup$ Do you want the correction to be specific to underscores? Or should it also be able to pick up other common-sense typos (single character errors, transpositions...)? Aside from that, what about users who don't want the compiler to apply such corrections (since it denies them the opportunity to verify)? $\endgroup$ Commented Apr 8 at 3:12

1 Answer

3
$\begingroup$

While you say you don't want an underscore-insensitive language, the straightforward implementation strategy is to build one of those, and disambiguate at the end, like a hash table with chaining:

  • Every defined attribute is squashed to its underscore-free name.
  • When there are multiple names squashing to the same string, a list/map of the original names to their identities is put in place where attributes are looked up.
  • When looking up a name, underscores are removed and the right bucket looked up.
  • If there is only one item there, that's the one you're looking for.
  • If there's more than one item there, look up the full name string, and produce the corresponding attribute, or an error.

This lookup could happen during compilation or in an interpreter, with the only difference being what data is being looked up. These lookups could be memoised if the cost is expected to be too high, but programs where that arises seem perverse.
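The bucketed lookup described above can be sketched roughly as follows. This is only a minimal illustration in Python, not a real compiler symbol table: the names `squash`, `AttributeTable`, `define`, and `lookup` are hypothetical, and the stored values stand in for whatever the compiler or interpreter actually binds to a name.

```python
from collections import defaultdict


def squash(name: str) -> str:
    """Return the underscore-free form of a name (hypothetical helper)."""
    return name.replace("_", "")


class AttributeTable:
    """Hypothetical attribute table keyed by squashed names, with chaining."""

    def __init__(self):
        # squashed name -> {original name: bound value}
        self._buckets = defaultdict(dict)

    def define(self, name, value):
        self._buckets[squash(name)][name] = value

    def lookup(self, name):
        bucket = self._buckets.get(squash(name))
        if not bucket:
            raise AttributeError(f"no attribute named {name!r}")
        if len(bucket) == 1:
            # Only one underscore-equivalent name exists: accept any spelling.
            return next(iter(bucket.values()))
        if name in bucket:
            # Several equivalent names exist: the exact spelling is required.
            return bucket[name]
        raise AttributeError(
            f"{name!r} is ambiguous between {sorted(bucket)}"
        )


table = AttributeTable()
table.define("asciiletters", "abc")
table.define("len", 1)
table.define("_len", 2)

print(table.lookup("ascii_letters"))  # resolves to the value of asciiletters
print(table.lookup("_len"))           # exact match inside an ambiguous bucket
```

With this shape, `len` and `_len` remain distinct because they share a bucket, while a spelling such as `le_n` is rejected as ambiguous rather than silently bound to one of them.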

If it really is the case that what you care about is "the number of underscore characters used inside of the label", the second-layer lookup can just be that count, rather than the full name. In that case, x__ and _x_ are the same, and distinct from x and _x; I find that hard to recommend.


As a language design, this raises some issues of forwards and backwards compatibility: working code that refers to an attribute via a non-canonical name can be broken by the addition of another underscore-equivalent identifier in the same scope, as the reference will now be an error. Depending on how scopes work, this could be a greater or lesser problem.

For example, the addition of an inherited identifier in a superclass, out of the control of the programmer of a subclass, could break their code making use of the original attribute. Within a single local scope an error will likely be caught quickly and resolved, but with any sort of nested or remote introduction of scope entries it could be much more of a problem. In a public-facing interface, like your dog.len example, the addition of any visible name with an underscore is an unavoidable breaking change.
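To make the compatibility hazard concrete, here is a small sketch, assuming a resolver that accepts any spelling when exactly one visible name is underscore-equivalent. The function `resolve` and the attribute name `ascii_letters_` are invented for illustration.

```python
def squash(name: str) -> str:
    """Return the underscore-free form of a name (hypothetical helper)."""
    return name.replace("_", "")


def resolve(visible_names, written):
    """Map the spelling the programmer wrote to a declared name, or fail."""
    if written in visible_names:
        return written
    candidates = [n for n in visible_names if squash(n) == squash(written)]
    if len(candidates) == 1:
        return candidates[0]
    raise LookupError(f"cannot resolve {written!r}: candidates {sorted(candidates)}")


# Version 1 of a superclass exposes a single name; the misspelling compiles.
v1 = {"asciiletters"}
print(resolve(v1, "ascii_letters"))

# Version 2 adds an underscore-equivalent name (hypothetical). The subclass
# code did not change, yet the same reference is now rejected.
v2 = v1 | {"ascii_letters_"}
try:
    resolve(v2, "ascii_letters")
except LookupError as err:
    print("now an error:", err)
```

The subclass author did nothing between the two versions; only the set of visible names changed.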

In a fully-controlled environment, canonicalising the source code to refer to the true name after the compiler first encountered the misspelling would avoid these compatibility issues. In a structured editor this may be acceptable, but in most cases it probably would not be for textual source code. However, there have been language implementations that forcibly applied source formatting upon compilation, so it's not out of the question.

If there is not a very compelling reason to support this sort of naming I wouldn't include it for the above reasons. It is viable, however, and I can imagine there could be cases where it is useful.


For a broader class of misspellings, a linear search that computes the Damerau-Levenshtein distance to each other identifier and selecting the unique shortest distance is plausible, but it seems ill-advised to extend this sort of behaviour to wholly-different names like this. For example, x and y have distance 1, but should likely not be treated identically, although the model above would treat x, _x, and x_ (also distance 1) as interchangeable. This is also much more fragile to minor changes, and almost certainly not a good idea.
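For illustration, the linear search could look something like the sketch below. It assumes the restricted "optimal string alignment" variant of Damerau-Levenshtein distance (each substring is edited at most once), which is a simplification chosen here for brevity; the function names `osa_distance` and `closest_unique` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_unique(written, candidates):
    """Return the single nearest candidate, or None if the minimum is tied."""
    scored = sorted((osa_distance(written, c), c) for c in candidates)
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]


print(osa_distance("x", "y"))   # 1
print(osa_distance("x", "_x"))  # 1
print(closest_unique("lenght", ["length", "width"]))  # length
```

Note that `x` is equally close to `y` and to `_x`, which is exactly the fragility described above: distance alone cannot tell a stray underscore from a different variable.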

$\endgroup$
1
  • 2
    $\begingroup$ I would note that compilers such as Clang use the Damerau-Levenshtein distance for suggestions in diagnostics about unknown identifiers. Clang however cuts off the distance calculation at a distance of 2 and will also validate the suggestions by attempting to compile with it -- it helps avoiding suggesting a method where a field is expected, or vice-versa, selecting a field/method whose type match, etc... -- in order to rank the possible suggestions. I do agree it should remain a fix suggestion, though, and the program should not compile/continue. $\endgroup$ Commented Apr 5 at 10:57
