Maziarz et al. (2021)

  • Overall algorithm is
  • identify identical subtrees in an AST, robust to alpha-renaming
    • specifically hashing, as applied to Common Subexpression Elimination (CSE)
  • 4 Types of equivalence for subexpressions
    • Syntactic equivalence
    • α-equivalence (alpha-equivalence)
      • Focus of this paper
      • \x.x+y equivalent to \p.p+y
    • Graph equivalence
      • (let x=e1 in let y=e2 in x+y) equivalent to (let y=e2 in let x=e1 in x+y) equivalent to (e1+e2)
      • let-expressions express lifetimes and evaluation order
    • Semantic equivalence
      • (3+x+4) equivalent to (x+7) equivalent to (7+x)
      • undecidable
  • To avoid false positives (name overloading) in syntactic equality, assume that expressions are preprocessed “so that every binding site binds a distinct variable name”
  • de Bruijn indexing is a form of nameless representation, to be insensitive to alpha-renaming
  • “Given an expression e, in which every binding site binds a distinct variable name, identify all equivalence classes of subexpressions of e, where two subexpressions are equivalent if and only if they are α-equivalent. We wish to achieve this goal in a way that is: [compositional and sub-quadratic]”
  • e-summary is “the result of processing an expression”
    • “two subexpressions are alpha-equivalent iff their e-summaries are equal”
  • Full algorithm
    • Goal: convert an Expression into an ESummary, a pair of Structure and VariableMap
      • Structure is the expression with anonymized variables
      • VariableMap is a map of free variables and their directions to locate their position in the Structure
        • These directions also form a tree (PosTree)
    • Time-complexity is reduced by tweaking the VariableMap merge in the App case to tend towards unbalanced trees.
    • The naive algorithm is and the full algorithm is
    • ESummarys are only useful because they can be hashed
    • Hashing Structure is trivial, but VariableMap requires leveraging a hash combiner that is associative, commutative, and invertible
      • They use XOR which despite being a cryptographically weak hash combiner, still achieves low hash collisions
      • Invertibility (involution?) is leveraged when removing an element from the map
  • This algorithm is generally faster compared to (Charguéraud, 2012) in synthetic and practical (ML expressions) benchmarks
Charguéraud, A. (2012). The Locally Nameless Representation. Journal of Automated Reasoning, 49(3), 363–408. https://doi.org/10.1007/s10817-011-9225-2
Maziarz, K., Ellis, T., Lawrence, A., Fitzgibbon, A., & Peyton Jones, S. L. (2021). Hashing modulo Alpha-Equivalence. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 960–973. https://doi.org/10.1145/3453483.3454088