2
$\begingroup$

I'm interested in how you might detect similarities in tree structures. For example, in these three JSON documents;

(a)   { name="foo", length=23 }
(b)   { name="boo", length=99, redundant=true }
(c)   { quux="baz", a=44, b=66, c=99 }

it's obvious to the human eye that (a) and (b) are more similar, in content and structure, than (a) and (c). However, I don't know of heuristics I might use to quantify that similarity.

I'm sure such algorithms exist -- they might exist, say, in code that detects similarities in parse trees or abstract syntax trees, to detect duplicate code structures -- but I've no idea how to search for more information because I don't know what to look for.

Is there a name for this kind of algorithm, where you're looking to be able to calculate 'near-ness' between different trees, graphs, or sets?

(Aside: I've just found out about the Levenshtein Distance so maybe there is a common technique like 'print out your data structures as strings and calculate distances between the printed strings' but I'm very new to the concept.)

$\endgroup$
1
  • $\begingroup$ The challenge here is not the algorithm so much as the specification or choice of a metric. There are many possible metrics for measuring some aspects of similarity between trees. $\endgroup$
    – D.W.
    Commented Jan 16, 2016 at 21:04

1 Answer 1

3
$\begingroup$

The search terms you're probably looking for are "similarity measure", "dissimilarity", "distance function", or "metric"... applied to trees.


One approach to your probem is to invent a similarity measure or distance function/metric that can be applied directly to tree. There are some references on that at https://stackoverflow.com/q/21897879/781723. You can also easily envision some recursive algorithms for doing that. For instance, if a tree $T$ has children $T_1,T_2$ (i.e., the left and right subtrees of $T$), you could define

$$d(T,T') = d(T_1,T'_1) + d(T_2,T'_2),$$

with some way to bottom out the recursion at leaves (e.g., if leaves are strings, then use edit distance at the leaves). You could also combine the distances of the child subtrees in some other way, e.g.,

$$d(T,T') = \sqrt{d(T_1,T'_1)^2 + d(T_2,T'_2)^2}$$

or using any other norm.

Another approach is to identify a set of features that capture aspects of the trees that are important to you. Then, each tree can be mapped to a feature vector, and you can measure similarity between two trees by computing the similarity of the two feature vectors. There are many distance metrics one can use on feature vectors (e.g., L1 norm, L2 norm, etc.)

In some cases, there are sophisticated techniques that can be used to speed up the feature-vector-based approach. See https://en.wikipedia.org/wiki/Tree_kernel.

Finally, you could also look at techniques like shingling or minhash and try modifying them to apply to trees instead of strings/sets.

$\endgroup$
1
  • $\begingroup$ That is perfect! Lots to think about. Really appreciate your in-depth answer. $\endgroup$ Commented Jan 16, 2016 at 22:28

Not the answer you're looking for? Browse other questions tagged or ask your own question.