by Budhathoki, Kailash; Vreeken, Jilles, MPI Saarland
Published in: Proceedings – IEEE International Conference on Data Mining, ICDM, 2018; http://eda.mmci.uni-saarland.de/pubs/2018/acid-budhathoki,vreeken.pdf
Problem setting
Let $X$ and $Y$ be two discrete random variables with finite domains $\mathcal{X}$ and $\mathcal{Y}$, respectively. Given samples of $(X, Y)$ from the joint distribution $P(X, Y)$, and assuming that there is no common cause of $X$ and $Y$ (no confounder), we want to know whether $X$ is a cause of $Y$ (denoted as $X \to Y$), or vice versa. The paper assumes that the data are generated by an additive noise model (ANM), for example $Y = f(X) + N$ with $N \perp X$. It is well known that ANMs in the discrete setting are identifiable up to a few special cases: $P(X, Y)$ admits an ANM from $X$ to $Y$, but not in the reverse direction, or vice versa.
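To make the model concrete, here is a minimal Python sketch of a sample drawn from such a discrete ANM; the domain, the function $f$, and the noise distribution below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative discrete ANM X -> Y (domain, f, and noise are made-up choices).
n = 1000
X = rng.integers(0, 5, size=n)   # cause with domain {0, ..., 4}
N = rng.integers(-1, 2, size=n)  # noise in {-1, 0, 1}, independent of X

def f(x):
    return (2 * x) % 5           # some fixed function on the domain

Y = f(X) + N                     # additive noise model: Y = f(X) + N
```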
Main Content
The general idea for deciding whether $X$ causes $Y$ or vice versa is to fit an ANM in both directions and to pick as causal the direction with the "best" independence between the noise term and the putative cause, i.e., between $N$ and $X$ or between $N'$ and $Y$. The paper uses the Shannon entropy $H(X) = -\sum_{x \in \mathcal{X}} P(x) \log P(x)$ as a dependence measure. With $Y = f(X) + N$, the total entropy for a sample with noise term $N$ is $H(X) + H(N)$. The central statement of the paper is that if $X \to Y$, then $H(X) + H(N) \leq H(Y) + H(N')$ whenever $P(X, Y)$ is induced by an ANM from $X$ to $Y$, where $N' = X - g(Y)$ is the residual of the best reverse model.
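In code, this decision rule could look like the following sketch; the entropies are estimated with the plug-in estimator, and `f_xy`/`f_yx` stand for regression functions fitted in each direction (placeholder names, not from the paper).

```python
import numpy as np
from collections import Counter

def entropy(values):
    # Plug-in Shannon entropy (in bits) of an empirical sample.
    n = len(values)
    p = np.array([c / n for c in Counter(values).values()])
    return float(-(p * np.log2(p)).sum())

def anm_score(cause, effect, f):
    # Total entropy H(cause) + H(noise) for the model effect = f(cause) + N.
    noise = [e - f(c) for c, e in zip(cause, effect)]
    return entropy(cause) + entropy(noise)

# Decision rule: infer X -> Y iff the total entropy in that direction is lower,
# e.g.  anm_score(X, Y, f_xy) < anm_score(Y, X, f_yx)  =>  X -> Y.
```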
Solution
An algorithm called ACID (https://github.molgen.mpg.de/EDA/cisc) is proposed that tries to minimize the entropy of the noise term. ACID performs a discrete regression of $Y$ on $X$ (and vice versa); the search converges, but can get stuck in local minima. The algorithm is a heuristic approach, since a straightforward exact search would be intractable with respect to running time. The computational complexity is polynomial in the sample size and the domain sizes, so for domains of modest size the runtime is in an acceptable range.
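The following sketch shows one plausible form of such a heuristic discrete regression: a coordinate-wise local search over candidate function values that greedily minimizes the residual entropy. It reuses the `entropy` helper from the sketch above, and the actual implementation linked above may differ in detail.

```python
def discrete_regression(X, Y, candidates, max_sweeps=10):
    # Greedy local search for f minimizing H(Y - f(X)).  A sketch of the kind
    # of heuristic regression ACID performs, not the authors' exact algorithm.
    f = {x: 0 for x in set(X)}  # arbitrary initial function
    for _ in range(max_sweeps):
        changed = False
        for x in f:
            # Try every candidate value for f(x); keep the entropy minimizer.
            best = min(candidates, key=lambda v: entropy(
                [y - (v if xi == x else f[xi]) for xi, y in zip(X, Y)]))
            if f[x] != best:
                f[x], changed = best, True
        if not changed:  # local minimum reached
            break
    return f
```

Each sweep does work on the order of $|\mathcal{X}| \cdot |\text{candidates}| \cdot n$, which illustrates why the approach is practical only for small domain sizes.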
Applications and Experimental Results
ACID can be applied to discrete data samples (with reasonable preprocessing it can also be applied to continuous data). ACID was tested on synthetic data generated from artificial ANMs with several different families of noise distributions, as well as on a few selected real-world datasets. Since ACID is designed for data coming from an ANM, it performs very well on these synthetic data, and it was also quite accurate on the (few) selected real-world datasets. However, on other test sets (with multiplicative noise, or more complex real-world data) the overall performance of ACID is not very good compared to other methods (around 50-60% accuracy). The conclusion of the authors is therefore somewhat misleading and biased by the good results on the synthetic datasets.
Conclusion
The paper provides a heuristic algorithm for discrete regression between two discrete random variables, theoretically founded on the information-theoretic fact that data generated by an ANM from $X$ to $Y$ have smaller total entropy in that direction than in the reverse one. Its assumptions on the model and the data are quite strong, and therefore the impact for practical applications is not too high.
Relevance: relevant for discrete data samples with small domain sizes; strong assumptions
Impact: low to medium for practical applications.
Level of difficulty: easy (basic knowledge about statistics and probability theory)
Tags: Additive Noise Models, Causal Discovery, Discrete Data, Entropy