The Earth Mover's Distance as a Metric for Inorganic Compositions
As computational chemists, we have an intuitive grasp of what makes two compositions “chemically similar”, but this concept is surprisingly difficult to capture using standard numerical techniques. Here we present the Earth Mover's Distance (EMD) as a well-defined measure of similarity between two compounds. It operates by optimally pairing elements in a source composition with their most similar elements in a target composition and scoring the resultant similarities. This is akin to the cognitive process by which humans judge chemical similarity and, as such, the resultant distances consistently align with human understanding. The chemical formula gives a mathematically abstract representation of an inorganic material, which is typically difficult to work with. By contrast, a well-formed distance can be used in a wide range of analytical techniques, allowing us to turn these abstractions into tangible information. We demonstrate this on the 12,567 binary structures of the ICSD, where we use this distance to plot detailed chemical maps that separate compositions into families of clear similarity on both a global and a local scale. These maps can be clustered using unsupervised machine learning techniques to automatically partition the compounds into digestible subgroups, which enables us to identify and distil critical chemical trends that would otherwise have been overlooked. We can additionally use these distances for the automated retrieval of structures from chemical databases, where an exact formulation may never have been reported but a closely related structure provides the reference needed. In practice, this metric is fast to compute between two compositions, making it a strong candidate for many other applications in data-driven materials discovery.
We will discuss how the EMD is calculated between two compositions and demonstrate its strengths against the standard metric, as described in our recently published paper (DOI: 10.1021/acs.chemmater.0c03381), with code available at https://www.github.com/lrcfmd/ElMD and results viewable at https://www.elmd.io.
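
To make the idea of pairing elements and scoring the result concrete, the following is a minimal sketch and not the implementation in the repository above. It assumes that elements are placed on a one-dimensional similarity scale, in which case the EMD reduces to the area between the two cumulative distributions of elemental fractions. The scale values in TOY_SCALE and the function names parse_formula and emd_1d are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical 1-D element scale: illustrative positions only,
# NOT the scale used in the paper. Nearby numbers mean "similar".
TOY_SCALE = {"Li": 1, "Na": 2, "K": 3, "Mg": 5, "Ca": 6,
             "O": 10, "S": 11, "Cl": 13, "Br": 14}


def parse_formula(formula):
    """Parse a simple formula such as 'Na2O' into normalised fractions."""
    counts = defaultdict(float)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def emd_1d(source, target, scale=TOY_SCALE):
    """Earth Mover's Distance between two compositions on a 1-D scale.

    Each composition is a dict of element -> fraction summing to 1.
    On a line, the optimal transport cost equals the integral of the
    absolute difference between the two cumulative distributions.
    """
    # Net mass at each scale position: source adds, target removes.
    net = defaultdict(float)
    for element, fraction in source.items():
        net[scale[element]] += fraction
    for element, fraction in target.items():
        net[scale[element]] -= fraction

    positions = sorted(net)
    distance = 0.0
    surplus = 0.0  # mass that still has to be carried to the right
    for left, right in zip(positions, positions[1:]):
        surplus += net[left]
        distance += abs(surplus) * (right - left)
    return distance


if __name__ == "__main__":
    nacl = parse_formula("NaCl")
    for other in ("NaCl", "KCl", "MgO"):
        print("NaCl ->", other, emd_1d(nacl, parse_formula(other)))
```

On this toy scale, NaCl is closer to KCl than to MgO, because moving the sodium fraction to potassium costs less than moving both fractions to magnesium and oxygen. For element representations that are not one-dimensional, the same quantity would have to be obtained by solving the general transport problem instead.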