by Budhathoki, Kailash; Vreeken, Jilles, MPI Saarland
Published in: Proceedings – IEEE International Conference on Data Mining, ICDM, 2018; http://eda.mmci.uni-saarland.de/pubs/2018/acid-budhathoki,vreeken.pdf
Given two discrete random variables X and Y with finite domains, respectively, and given samples from the joint distribution P(X, Y), and assuming that there is no common cause of X and Y (no confounder), we want to know whether X is a cause of Y (denoted X → Y), or vice versa. The paper assumes that the data are generated by an additive noise model (ANM), for example Y = f(X) + N with noise N independent of X.
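As a minimal sketch of the setting, the following generates samples from a toy discrete ANM. The domain sizes, the function f, and the noise distribution are made up for illustration; the paper does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete ANM Y = f(X) + N with N independent of X.
# All concrete choices below (domain {0,...,4}, the table f,
# the skewed noise) are illustrative assumptions.
n = 1000
X = rng.integers(0, 5, size=n)                          # X uniform on {0,...,4}
f = np.array([2, 0, 3, 1, 4])                           # arbitrary f: dom(X) -> Z
N = rng.choice([-1, 0, 1], size=n, p=[0.1, 0.8, 0.1])   # discrete noise
Y = f[X] + N                                            # the ANM direction X -> Y
```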
It is well known that ANMs in the discrete setting are identifiable in all but a few exceptional cases: P(X, Y) admits an ANM from X to Y, but not in the reverse direction, or vice versa.
The general idea for deciding whether X causes Y or vice versa is to fit an ANM in both directions and to choose as the causal direction the one with the "best" independence between noise and cause, i.e. between N and X or between N′ and Y. The paper uses Shannon entropy as the dependence measure: for a fitted regression f, the total entropy of the direction X → Y is H(X) + H(N) with residual N = Y − f(X). The central statement of the paper is that if X → Y, then H(X) + H(N) ≤ H(Y) + H(N′) whenever P(X, Y) is induced by an ANM from X to Y, where N′ = X − g(Y) is the residual of the reverse fit.
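The decision rule can be sketched as follows. This is not the paper's regression procedure: as a crude, assumed stand-in I fit f(x) as the most frequent y given x, then compare empirical total entropies in both directions; the function names are my own.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_mode_fit(X, Y):
    """Fit f(x) = most frequent y given x -- a crude stand-in for the
    paper's discrete regression, used only to illustrate the decision rule."""
    f = {}
    for x in set(X):
        ys = [y for xi, y in zip(X, Y) if xi == x]
        f[x] = Counter(ys).most_common(1)[0][0]
    return f

def total_entropy(X, Y):
    """H(X) + H(N) with residual N = Y - f(X) under the fitted regression."""
    f = conditional_mode_fit(X, Y)
    N = [y - f[x] for x, y in zip(X, Y)]
    return entropy(X) + entropy(N)

def infer_direction(X, Y):
    """Decide X -> Y if the total entropy is lower in that direction."""
    return "X->Y" if total_entropy(X, Y) < total_entropy(Y, X) else "Y->X"
```

On data actually generated by a discrete ANM from X to Y, the forward direction tends to come out with the lower total entropy, in line with the central statement above.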
An algorithm called ACID (https://github.molgen.mpg.de/EDA/cisc) is proposed that tries to minimize the entropy of the noise term. ACID performs a discrete regression of Y on X; it converges, but can get stuck in local minima. The algorithm is a heuristic, since the straightforward exact version would be intractable with respect to running time. The computational complexity grows with the sample size and the domain sizes; for small domains the runtime is in an acceptable range.
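To illustrate the heuristic idea — not the actual ACID implementation — the following sketch performs a greedy coordinate-descent discrete regression: it repeatedly updates one function value f(x) at a time to whichever candidate most reduces the residual entropy, and stops when no single update helps, which is exactly where a local minimum can trap it.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def greedy_regression(X, Y, max_iter=20):
    """Greedy discrete regression sketch (an illustrative stand-in for ACID):
    start from a trivial f and update one f(x) at a time to the candidate
    value that most reduces the entropy of the residual N = Y - f(X).
    Entropy never increases, so this converges -- possibly to a local minimum."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    xs = sorted(set(X.tolist()))
    candidates = sorted(set(Y.tolist()))
    f = {x: 0 for x in xs}
    for _ in range(max_iter):
        improved = False
        for x in xs:
            best_v, best_h = f[x], None
            for v in candidates:
                f_trial = dict(f)
                f_trial[x] = v
                res = [y - f_trial[xi] for xi, y in zip(X.tolist(), Y.tolist())]
                h = entropy(res)
                if best_h is None or h < best_h - 1e-12:
                    best_h, best_v = h, v
            if best_v != f[x]:
                f[x] = best_v
                improved = True
        if not improved:
            break
    return f
```

Each sweep costs on the order of |dom(X)| · |dom(Y)| residual-entropy evaluations, which is why this kind of heuristic stays practical only for small domains.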
Applications and Experimental Results
ACID can be applied to discrete data samples (with reasonable preprocessing it can also be applied to continuous data). It was tested on synthetic data generated from artificial ANMs using several different families of distributions, as well as on a few selected real-world datasets. Since ACID is designed for data coming from an ANM, it performs very well on the synthetic data, and it was also quite accurate on the (few) selected real-world datasets. However, on other test sets (multiplicative noise, more complex real-world data) the overall performance of ACID turns out not to be very good compared to other methods (around 50–60% accuracy). The authors' conclusion is therefore somewhat misleading, biased by the good results on the synthetic data.
The paper provides a heuristic algorithm for discrete regression between two discrete random variables, theoretically founded on the information-theoretic fact that data generated by an ANM with X → Y have smaller total entropy in that direction than in the other. Its assumptions on the model and the data are quite strong, so the impact for practical applications is not too high.
Relevance: relevant for discrete data samples (small domain sizes), strong assumptions
Impact: low to medium for practical applications.
Level of difficulty: easy (basic knowledge about statistics and probability theory)