## Problems with Shapley-value-based explanations as feature importance measures

by L. E. Kumar, S. Venkatasubramanian, C. Scheidegger, S. Friedler

Published in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020

## Problem setting

In many practical machine learning pipelines, feature attribution methods such as Shapley values are used to decide whether a feature contributes to model accuracy on a given prediction task in a "relevant" way. The paper argues that the explanatory power widely attributed to Shapley-value-based methods for assessing feature importance does not exist, and that these values are poorly suited to answering the questions data scientists usually have in mind.

Shapley-value setting in a nutshell:

There are $N$ players, and a value function $v: 2^{[N]} \rightarrow \mathbb{R}$ with $v(\emptyset) = 0$ quantifies how much collective payoff a set of players can gain by cooperating. The marginal contribution of player $i$ to a coalition $S \subseteq [N] \setminus \{i\}$ is $\Delta_v (i,S):=v(S \cup \{i\}) - v(S)$.

The Shapley-value of player $i$ is defined by

$\phi_v(i):=\frac{1}{N!} \sum_{S \subseteq [N] \setminus \{i\}} |S|! (N-|S|-1)! \Delta_v(i,S)$

$\phi_v$ has the following characterizing axioms:

• (Symmetry) if $\Delta_v(i,S) = \Delta_v(j,S)$ for all $S \subseteq [N] \setminus \{i,j\}$, then $\phi_v(i) = \phi_v(j)$.
• (Dummy) if $\Delta_v(i,S)=0$ for all $S \subseteq [N] \setminus \{i\}$, then $\phi_v(i)=0$.
• (Additivity) $\phi_v(i) + \phi_w(i) = \phi_{v+w}(i)$.
• (Efficiency) $\sum_{i \in [N]} \phi_v(i) = v([N])$.
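
The definition can be checked on a tiny game. The following is a minimal sketch in Python, not code from the paper: the function name `shapley_values` and the toy game `v` are assumptions made for illustration. It enumerates every coalition, so it is only usable for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, v):
    """Exact Shapley values for players 0..n-1.

    `v` maps a frozenset of players to a payoff, with v(frozenset()) == 0.
    """
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # weight |S|! (N - |S| - 1)! / N! from the definition
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # marginal contribution of player i to coalition S
                total += weight * (v(s | {i}) - v(s))
        phi.append(total)
    return phi


# Hypothetical toy game: players 0 and 1 only earn a payoff together,
# player 2 never changes the payoff (a dummy player).
def v(s):
    return 1.0 if {0, 1} <= s else 0.0


phi = shapley_values(3, v)
print(phi)
```

In this toy game, symmetry gives players 0 and 1 the same value of $1/2$, the dummy player 2 gets $0$, and the values sum to $v([N]) = 1$.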

In the machine learning setting we have a model $f(x_1, \ldots, x_d)$ whose features $1,\ldots, d$ act as the players, and $\phi_v(i)$ is interpreted as the influence of feature $i$ on the outcome.

(Table omitted: definitions and estimands of the local value functions $v_{f,x}(S)$, with estimand $\hat{v}_{f,x}$.)

Here $\mathcal{D}$ is the set of product distributions over the marginals of the features in $\overline{S}$; more precisely, $\mathcal{D} = \{P : P = \prod_{j \in \overline{S}} \pi_j P\}$.

## Problems with Shapley-values

The main problems with Shapley values are the following.

Problems with conditional distributions:

• Redundant features can lead to misleading Shapley values: a redundant feature can receive a higher value than its non-redundant counterpart. One therefore has to identify and remove redundant features before computing the Shapley values, which means that the feature selection Shapley values are meant to support has to be done manually or by some other method.
• In general, Shapley values can take a long time to compute.
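
A small worked example may make the redundancy issue concrete. It is an illustration under stated assumptions rather than an example taken from the paper: it assumes the conditional value function $v(S) = E[f(X) \mid X_S = x_S] - E[f(X)]$, the model $f(x_1, x_2) = x_1$, and a second feature $X_2$ that is an exact copy of $X_1$ with mean $\mu$. Knowing either feature then reveals $x_1$, so

$$v(\{1\}) = v(\{2\}) = v(\{1,2\}) = x_1 - \mu,$$

$$\phi_v(1) = \tfrac{1}{2}\,[v(\{1\}) - v(\emptyset)] + \tfrac{1}{2}\,[v(\{1,2\}) - v(\{2\})] = \tfrac{1}{2}(x_1 - \mu), \qquad \phi_v(2) = \tfrac{1}{2}(x_1 - \mu).$$

Although the model never reads $x_2$, the redundant copy receives half of the attribution.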

Problems with interventional distributions:

• The model has to be evaluated on out-of-distribution samples, that is, on parts of the feature space that are irrelevant to the task at hand. This, too, can lead to misleading Shapley values.
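
The out-of-distribution effect can be reproduced on a toy dataset. The sketch below is an assumption-laden illustration, not the paper's procedure: the helper names `interventional_queries` and `interventional_value`, the dataset, and the model `f` are all hypothetical. The features outside $S$ are drawn from the product of their empirical marginals, and the value is left uncentered (the baseline $E[f(X)]$ is not subtracted).

```python
from itertools import product


def interventional_queries(data, x, subset):
    """Points at which the model is evaluated for coalition `subset`.

    Features in `subset` are fixed to their value in x; every other feature
    ranges over its own empirical marginal, independently of the rest.
    """
    d = len(x)
    pools = [[x[j]] if j in subset else [row[j] for row in data] for j in range(d)]
    return list(product(*pools))


def interventional_value(f, data, x, subset):
    """Uncentered interventional value: average of f over the query points."""
    queries = interventional_queries(data, x, subset)
    return sum(f(q) for q in queries) / len(queries)


# Hypothetical data: feature 1 is an exact copy of feature 0.
data = [(0, 0), (1, 1)] * 5
observed = set(data)


# Hypothetical model: equals x0 on the observed data, arbitrary elsewhere.
def f(q):
    return q[0] if q[0] == q[1] else 100


x = (1, 1)
queries = interventional_queries(data, x, {0})
off_support = [q for q in queries if q not in observed]
share = len(off_support) / len(queries)
value = interventional_value(f, data, x, {0})
print(share, value)
```

Half of the query points are combinations such as $(1, 0)$ that never occur in the data, and the value of the coalition is dominated by what the model happens to output there.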

General problems:

• Contrary to what is often claimed, Shapley values are not model-agnostic: they yield a meaningful and interpretable output only for additive models, and can be meaningless for non-additive models.
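
The contrast between additive and non-additive models can be worked out by hand. The following is an illustration under assumptions not taken from the paper: two independent features, the interventional value function $v(S) = E[f(x_S, X_{\overline{S}})] - E[f(X)]$, and, for the second model, zero-mean features. For an additive model $f(x_1, x_2) = g_1(x_1) + g_2(x_2)$,

$$\phi_v(j) = g_j(x_j) - E[g_j(X_j)], \qquad j = 1, 2,$$

so each value is exactly the centered contribution of its own term. For the purely multiplicative model $f(x_1, x_2) = x_1 x_2$ we get $v(\{1\}) = v(\{2\}) = 0$ and $v(\{1,2\}) = x_1 x_2$, hence

$$\phi_v(1) = \phi_v(2) = \tfrac{1}{2}\, x_1 x_2.$$

Both features receive the same value whatever their individual magnitudes, and neither number corresponds to a term of the model.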

## Human-centric issues

The paper also addresses human-centric issues of Shapley values. The authors first point out a major finding from the social sciences: humans explain phenomena to each other through contrastive statements, that is, they explain the cause of an effect relative to some other event that did not occur (a counterfactual). They then argue that Shapley values may fail to deliver such contrastive explanations. For example, answering the question "Why $f(x)$ rather than $E(f(x))$?" presupposes that the expectation $E(f(x))$ is actually attained by some input, which need not be the case. Even when looking at the marginal contributions, it is not clear why averaging them is a good way to summarize the information.

Shapley-values also do not provide guidance for taking action in order to improve an outcome to a desired one.

Furthermore, there is no standard procedure for converting Shapley values into a statement about a model's behavior. Even if data scientists do have a clear mental model of what Shapley values deliver (according to the authors, most of them do not), they tend to misuse them for their own purposes, because of confirmation bias and because interpretability is not always helpful in task-specific settings.

## Conclusion

Shapley values might help to qualitatively inform investigations into the questions data scientists usually have in mind when applying them, but it is not clear that they provide direct answers to those questions. In conclusion, the Shapley-value framework is ill-suited as a general solution to the problem of quantifying feature importance. Instead, more focused and specific approaches (designed with human accessibility in mind) are recommended, since they can deliver more direct answers to the questions data scientists have in mind.

Relevance: relevant for ML community and for all pipelines that make use of feature importance methods

Impact: medium

Level of Difficulty: easy

## Accurate Causal Inference on Discrete Data

by Budhathoki, Kailash; Vreeken, Jilles, MPI Saarland

Published in: Proceedings – IEEE International Conference on Data Mining, ICDM, 2018; http://eda.mmci.uni-saarland.de/pubs/2018/acid-budhathoki,vreeken.pdf

## Problem setting

We are given two discrete random variables $X,Y$ with finite domains $\mathcal{X}, \mathcal{Y}$ respectively. Given samples of $X,Y$ from the joint distribution $P_{X,Y}$, and assuming that $X$ and $Y$ have no common cause (no confounder), we want to know whether $X$ is a cause of $Y$ (denoted $X \rightarrow Y$) or vice versa. The paper assumes that the data are generated by an additive noise model (ANM), for example

$X:=N_X, \quad Y:=f(X) + N_Y,$ with $N_Y \perp X$.

It is well known that ANMs in the discrete setting are generically identifiable: apart from a few degenerate cases, $P_{X, Y}$ admits an ANM from $X$ to $Y$ but not in the reverse direction, or vice versa.

The general idea for deciding whether $X$ causes $Y$ or vice versa is to fit an ANM in both directions and to choose as causal the direction with the "best" independence between $N_Y$ and $X$, or between $N_X$ and $Y$. The paper uses the Shannon entropy $H(X) := -\sum_{x \in \mathcal{X}} P(X=x) \log P(X=x)$ as the dependence measure. Since $P(Y|X) = P(N_Y |X)$, the total entropy of a sample with $X \rightarrow Y$ is $H(X) + H(N_Y|X)$, which equals $H(X) + H(N_Y)$ because $N_Y \perp X$. The central statement of the paper is that $H_{X \rightarrow Y} := H(X) + H(N_Y) < H(Y) + H(N_X)$ whenever $P_{X,Y}$ is induced by an ANM from $X$ to $Y$, where $N_X$ is the noise of the ANM fitted in the reverse direction.
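
One way to see why the inequality is plausible is the following chain of standard entropy identities. It is a sketch for orientation, not the paper's proof, and the function $g$ (a candidate regression of $X$ on $Y$ in the reverse direction) is notation introduced here. In the causal direction the score equals the joint entropy:

$$H(X,Y) = H(X) + H(Y \mid X) = H(X) + H(N_Y \mid X) = H(X) + H(N_Y).$$

In the reverse direction, for any $g$ with residual $X - g(Y)$,

$$H(Y) + H(X - g(Y)) \ge H(Y) + H(X - g(Y) \mid Y) = H(Y) + H(X \mid Y) = H(X,Y).$$

Equality in the second line would require the residual $X - g(Y)$ to be independent of $Y$, that is, an ANM in the reverse direction as well.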

## Solution

An algorithm called ACID (https://github.molgen.mpg.de/EDA/cisc) is proposed that tries to minimize the entropy of the noise term. ACID performs a discrete regression of $Y$ on $X$; it converges but can get stuck in local minima. The algorithm is a heuristic, since the straightforward exact version would be intractable in running time: its computational complexity is $O(|\mathcal{Y}|^{|\mathcal{X}|})$. For domains of size less than $40$ the runtime is in an acceptable range.
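
To make the exponential search space tangible, here is a brute-force sketch of the exact version for very small domains. It is not ACID and not the authors' code: the function names, the toy data, and the choice to restrict candidate functions to values seen in the effect are assumptions made for illustration. It tries every function from the cause domain to the effect domain, which is exactly the $|\mathcal{Y}|^{|\mathcal{X}|}$ blow-up mentioned above.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())


def best_anm_score(cause, effect):
    """Minimum of H(cause) + H(effect - f(cause)) over all candidate functions f."""
    cause_dom = sorted(set(cause))
    effect_dom = sorted(set(effect))
    best = float("inf")
    # one candidate f per assignment of an effect value to each cause value
    for images in product(effect_dom, repeat=len(cause_dom)):
        f = dict(zip(cause_dom, images))
        residuals = [e - f[c] for c, e in zip(cause, effect)]
        best = min(best, entropy(residuals))
    return entropy(cause) + best


# Hypothetical toy sample from Y = f(X) + N_Y, built from integer weights so
# that X and N_Y are exactly independent in the empirical distribution.
f_true = {0: 0, 1: 1, 2: 4}
x_weights = {0: 2, 1: 1, 2: 1}
n_weights = {0: 3, 1: 1}
xs, ys = [], []
for (x, wx), (n, wn) in product(x_weights.items(), n_weights.items()):
    xs += [x] * (wx * wn)
    ys += [f_true[x] + n] * (wx * wn)

score_xy = best_anm_score(xs, ys)  # searches 5**3 candidate functions
score_yx = best_anm_score(ys, xs)  # searches 3**5 candidate functions
direction = "X -> Y" if score_xy < score_yx else "Y -> X"
print(score_xy, score_yx, direction)
```

On this toy sample the forward score equals $H(X) + H(N_Y)$ and is strictly smaller than the reverse score, so the sketch picks $X \rightarrow Y$. With domains of a few dozen values the enumeration is already hopeless, which is why a heuristic search is needed.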

## Applications and Experimental Results

ACID can be applied to discrete data samples (with reasonable preprocessing, also to continuous data). It was tested on synthetic data generated from artificial ANMs using several families of distributions, and on a few selected real-world datasets. Since ACID is designed for data coming from an ANM, it performs very well on the synthetic data, and it was also quite accurate on the (few) selected real-world datasets. However, on other test sets (with multiplicative noise or more complex real-world data), the overall performance of ACID turns out to be not very good compared to other methods (around 50-60% accuracy). The authors' conclusion is therefore somewhat misleading and biased by the good results on the synthetic data.

## Conclusion

The paper provides a heuristic algorithm for discrete regression between two discrete random variables. It is theoretically founded on the information-theoretic fact that data generated by an ANM with $X \rightarrow Y$ have smaller entropy in that direction than in the other. Its assumptions on the model and the data are quite strong, so its impact on practical applications is limited.

Relevance: relevant for discrete data samples (small domain sizes), strong assumptions

Impact: low to medium for practical applications.

Level of difficulty: easy (basic knowledge about statistics and probability theory)