Machine Learning Meets Policy: Reflections on HUML 2016

Last Friday, Ca’ Foscari University of Venice organized an IEEE workshop on the Human Use of Machine Learning (HUML 2016). The workshop, held at the European Centre for Living Technology, hosted roughly 30 participants and broadly addressed the social impacts and ethical problems stemming from the widespread use of machine learning.

HUML joins a growing number of workshops for critical voices in the ML community. These include Fairness, Accountability and Transparency in Machine Learning (FAT-ML); #Data4Good at ICML 2016; Human Interpretability in Machine Learning (WHI), held this year at ICML; and Interpretable ML for Complex Systems, held this year at NIPS. Among this company, HUML was especially notable for its diversity of perspectives. While FAT-ML, #Data4Good and WHI featured presentations primarily by members of the machine learning community, HUML brought together scholars from philosophy of science, law, predictive policing, and machine learning.

The event consisted of one day of talks with occasional breaks for discussion over coffee, lunch, and dinner. Thanks to an invitation from the organizers facilitated by Professor Fabio Roli, I had the opportunity to attend and speak. In my talk, I presented The Mythos of Model Interpretability, a position piece tackling the epistemological problems that frustrate both research and public discourse on interpretable models.

Since this blog exists precisely to address the intersection of technical and social perspectives on machine learning, I was happy to learn of this summit of like-minded researchers. In light of the tight overlap between the workshop’s objectives and this blog’s mission, I’ve created this post to share my notes on the encounter.

I’ll step through each of the 40-minute invited talks, sharing the high-level points and my personal take on each. Several themes repeat throughout. For example, one recurring issue was the tension between complex ethical issues and the simple formalisms we propose to address them. Theoretical definitions of privacy, fairness and discrimination can poorly approximate the real-world meaning given to these words. These shortcomings can be hard (sometimes impossible?) to capture by looking at the mathematics alone but become obvious when considering real-world scenarios.

In the opposite direction, several discussions demonstrated that a lack of formal definitions can be equally problematic. This vagueness is most prominent on the matter of interpretability / explainability of machine learning algorithms.

Of course, my reflections are subjective and describe the presentations incompletely. Fortunately, the event was live-streamed on YouTube and (presumably) archived. When the archived versions and slides become available I’ll add them here.

First Talk: Judith Simon – Reflections on trust, trustworthiness & responsibility

The workshop began with a talk by Judith Simon, a Professor of Philosophy of Science at the University of Copenhagen. Her talk addressed issues of privacy, trust, and responsibility.

To begin, Simon led with a motivating anecdote: a father complains to an e-merchant after his daughter is bombarded with advertisements for pregnancy-related products. Subsequently the father calls back and apologizes, having discovered that his daughter actually is pregnant.

Throughout the anecdote, the daughter’s life appears to be adversely impacted by algorithmic decisions in various ways. At first, when we think she is not pregnant, she appears to be the victim of a false rumor. Subsequently, when we discover she is pregnant, it appears that the algorithm inferred and divulged a secret she might have preferred to keep in confidence.

With this context set, Simon asks precisely what constitutes an invasion of privacy? Is it:

The collection of personal data?
The inferences made upon the data?
The divulgence of these inferences, potentially to 3rd parties?

These points warrant serious consideration. Moreover, we might observe that answering each question calls upon a range of expertise spanning traditionally siloed disciplines. What data companies can harvest and who owns that data strikes me as foremost a legal question. What inferences we should draw based on that data constitutes a philosophical question, but one inextricably tied to the domain of machine learning. Finally, we ask how software should behave given the inferences it can access. This appears to call upon legal, machine learning and HCI perspectives.

Later in her talk, Simon raised the important distinction between functional and epistemic transparency in machine learning. Functional opacity refers to the opacity of systems owing to inaccessibility. For example, criminal recidivism models might be functionally opaque because the public lacks access to their data, their algorithms, and the learned parameters of their models.

Epistemic transparency, on the other hand, refers to the intrinsic ability (or lack thereof) to understand a model even given full functional transparency. Simon notes we might view functional transparency as a necessary but insufficient step to understanding machine learning.

In the machine learning community, we focus almost exclusively on epistemic opacity. This makes sense. It’s the problem machine learning academics are best equipped to tackle.

We might also note that it’s typically technical people who rise to positions of power within technology companies. Ultimately, the ethical responsibility to provide functional transparency falls on people with mostly technical training. We might ask: how can we count on these stakeholders to do the right thing if these issues are only tackled by legal scholars and philosophers?

Katherine Strandburg – Decision-making, machine learning and the value of explanation

A second talk came from Professor Katherine Strandburg of NYU. In it, she articulated the value of explanation from a lawyer’s perspective. While the talk lacked a technical discussion, I think any researcher interested in interpretable models should check it out.

To begin, she articulated why a right to explanation is a core aspect of due process under the law. Explanations are required because citizens are required only to comply with the letter of the law. In order to subject a citizen to a judgment, one must articulate precisely what that person did and why it is illegal. This explanation must accord with the letter of the law. The requirement that one produce an explanation (and not simply a judgment) is in part intended as a guard against unlawful or discriminatory judgments. [This assumes, of course, that the law is not itself discriminatory.]

Moreover, the necessity of providing an explanation is thought to guard against subconscious biases. For example, suppose an adjudicator were predisposed subconsciously to pass judgment against an individual on account of race. In order to actually pass judgment against the individual, the adjudicator would have to produce an explanation that didn’t include race. Ultimately, such an explanation might be difficult to produce absent a real violation of the law. Further, this process of introspection might conceivably help an unconsciously biased adjudicator to uncover the subconscious bias and account for it.

Of course, despite the right to explanation, the legal system is well-known to suffer from systemic biases. Black defendants are more likely to be convicted, and more likely to be sentenced to death. Nevertheless, it seems plausible that absent the right to explanation, the situation could be far worse.

We should ask: do the explanations generated by today’s efforts at interpretable ML confer these desired properties? Does the task of producing the explanation improve the models? When we ask for interpretable models, what are we asking for? Are we sometimes wrongly anthropomorphizing models in the hope that the task of producing an explanation will make them smarter or less biased?

We should also consider: when stakeholders ask for explanations, what precisely do they want? If what they want is assurance that the decision conforms with a valid chain of legal or causal reasoning, then hardly any of the existing work on model interpretability applies at all. Certainly post-hoc explanations like saliency maps or LIME offer nothing towards accountability.

Fabio Roli – Safety of Machine Learning

On the topic of anthropomorphic ML, Fabio Roli began his discussion with an engaging foray into science fiction. Recalling Fred Hoyle’s Black Cloud, Roli recounted a story of humans who encounter a sentient cloud. They soon realize that it is intelligent but are unable to communicate with it because all attempts at communication stem from anthropomorphic assumptions.

Turning back towards machine learning, Roli asked whether it is good or responsible to anthropomorphize developments in AI. Already many companies build humanoid robots. We already engage in dialogue with anthropomorphic chatbots (Alexa, Siri, Cortana). What are the benefits of this tendency to the anthropomorphic? What are its dangers? Do we present false or misleading expectations of capabilities?

Roli went on to describe problems with machine learning owing to adversarial examples. He addressed the susceptibility of convolutional neural networks to adversarially perturbed images and proposed approaches in which ML algorithms learn closed decision boundaries. In a model with closed decision boundaries, the algorithm might abstain from points sufficiently unlike those previously seen. How precisely to draw closed boundaries on the space of images remains a challenging open question and the method has not to my knowledge been reduced to practice in this domain.
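To make the idea of a closed decision boundary concrete, here is a minimal sketch of a classifier that abstains on unfamiliar inputs. It is not the method Roli presented: the nearest-centroid rule, the `radius` parameter, the function names and the toy 2-D points are all assumptions chosen for illustration. Each class is enclosed in a ball around its centroid, and any point outside every ball gets no prediction.

```python
import math

def fit_centroids(points, labels):
    """Return {label: centroid} for the training points."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}

def predict_or_abstain(centroids, x, radius):
    """Predict the nearest centroid's label, or None if x lies outside every ball."""
    label, dist = min(
        ((y, math.dist(c, x)) for y, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    # The ball of the given radius is the (hypothetical) closed boundary.
    return label if dist <= radius else None

# Toy training data, invented for this sketch.
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = ["cat", "cat", "dog", "dog"]
centroids = fit_centroids(train_x, train_y)

near = predict_or_abstain(centroids, (0.1, 0.0), radius=0.5)   # inside the "cat" ball
far = predict_or_abstain(centroids, (5.0, -3.0), radius=0.5)   # far from all training data
print(near, far)
```

In two dimensions a ball is an easy boundary to draw. The open question noted above is what the analogous region should look like in the space of images.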

Describing another line of research, Roli introduced work aimed at detecting malware on Android devices. This work presented an interesting spin on opacity in ML algorithms. In malware detection, even controlling for accuracy, opacity might be an asset and not a vice. A truly transparent algorithm might be easier for a malware coder to evade. Opaque algorithms might be harder to game.

Viola Schiaffonati – Preliminary steps for experimentally evaluating the impact of AI

Professor Schiaffonati’s talk focused on the safe deployment of machine learning algorithms, calling to memory the NIPS workshop on reliable ML in the wild. In the talk, she focused on the distinction between learning by experimentation (in vitro) and learning by doing (in vivo). Among the ideas presented was a proposal for special testing zones.

Already, many technologies are rolled out systematically with testing phases. Google for example conducts extensive betas internally with employees receiving advanced access to new technology. Nevertheless, for many players, especially smaller startups and research groups, deployment can be considerably more haphazard and these problems are by no means solved.

Mireille Hildebrandt – The Issue of Bias

Mireille Hildebrandt’s talk was both one of the most entertaining and the toughest to summarize. The talk alternately addressed the law, bias, and the fundamental task of pattern recognition. It also featured frequent leaps to the classics, complete with interjections from Hume and a reference to the no free lunch theorem. To do the talk justice, I’d suggest watching the video, but I’ll quickly summarize the most important takeaways.

To me the most profound moment in the talk came early when Hildebrandt addressed why we need law in the first place.

“We need law to create a playing field such that actors can act ethically”

“What we don’t want is an incentive structure such that companies who want to act ethically will be pushed out of the market.”

While it might seem absurd that we should need to justify the existence of law (!!!), in today’s Silicon Valley climate these points hit home.

Consider the many startups that now outmaneuver their highly regulated predecessors precisely by skirting regulation. Uber outcompeted taxis in part because it has a slick app, but also because it operated without commercial insurance, didn’t employ commercially licensed drivers and didn’t have to pay for medallions in cities where this is the prohibitive cost of entering the taxi market. Given recent events, the point rings more profound. How can a politician compete effectively without abusing the truth? For ethical behavior to prevail, it must be encouraged by society.

Later in her talk, Hildebrandt addressed contestability. This point cuts to the heart of ML interpretability research again. In many real-life situations, people need or want the ability to contest decisions. If a decision-maker denies you a loan due to insufficient income, you could contest this by showing evidence of income and demanding that this new information be taken into account. Unfortunately, few attempts at interpretable ML models possess the ability to handle a protest and revise predictions under new information.

Krishna Gummadi – Discrimination in human vs. machine decision making

In his talk, Krishna Gummadi took on discrimination in algorithmic decisions. He started with the idea of a “socially salient group”. In short, this would be something like race or gender: basically any group that we want to be careful not to discriminate against.

Gummadi then formalized several notions of discrimination. One, for example, would be disparate treatment. A predictive model is guilty of disparate treatment if P(y|x,z) != P(y|x), where z is a sensitive feature and x are the insensitive features.
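The definition above can be checked mechanically for a model we are able to query. The following is a minimal sketch, not code from the talk: `has_disparate_treatment`, the two toy models and the value grids are hypothetical names and data. It asks whether, for any fixed x, the model's output changes when only z changes.

```python
def has_disparate_treatment(model, xs, zs, tol=1e-9):
    """True if, for some x, the model's output changes when only z changes."""
    for x in xs:
        outputs = [model(x, z) for z in zs]
        if max(outputs) - min(outputs) > tol:
            return True
    return False

def blind_model(x, z):
    # Hypothetical model that ignores the sensitive feature z entirely.
    return min(1.0, 0.1 * x)

def aware_model(x, z):
    # Hypothetical model that adds a penalty for group "b".
    return min(1.0, 0.1 * x + (0.2 if z == "b" else 0.0))

xs = range(0, 6)   # toy values of the insensitive feature
zs = ["a", "b"]    # toy values of the sensitive feature

print(has_disparate_treatment(blind_model, xs, zs))
print(has_disparate_treatment(aware_model, xs, zs))
```

Note that this check only looks at how z is used at prediction time. It says nothing about whether x carries information about z.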

Gummadi explained why omitting sensitive features may not be sufficient to ensure that a model doesn’t discriminate. In particular, if (i) the available labels are biased, and (ii) the insensitive features are correlated with the sensitive features, then the model could learn to reproduce the discriminatory behavior of the labelers. I address this problem at length in a previous post, The Foundations of Algorithmic Bias.
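A small worked example shows how conditions (i) and (ii) combine. The counts below are invented for illustration, as are the names `cells`, `fit_rate_by_proxy` and `mean_score_by_group`. The sensitive feature z is never shown to the "model", which simply learns the labeled-positive rate for each value of a proxy feature q (think zip code) that matches z 90% of the time. The labelers marked group 1 positive far more often than group 0.

```python
from collections import defaultdict

# Each cell: (z, q, number of records, number labeled positive).
# Invented counts: q agrees with z in 90% of records, and labelers
# marked 20% of group 0 but 50% of group 1 as positive.
cells = [(0, 0, 90, 18), (0, 1, 10, 2), (1, 1, 90, 45), (1, 0, 10, 5)]

def fit_rate_by_proxy(cells):
    """'Train' on q alone: the empirical rate of positive labels for each q."""
    total, pos = defaultdict(int), defaultdict(int)
    for _z, q, n, n_pos in cells:
        total[q] += n
        pos[q] += n_pos
    return {q: pos[q] / total[q] for q in total}

def mean_score_by_group(cells, rate):
    """Average predicted score within each sensitive group z."""
    total, score = defaultdict(int), defaultdict(float)
    for z, q, n, _n_pos in cells:
        total[z] += n
        score[z] += n * rate[q]
    return {z: score[z] / total[z] for z in total}

rate = fit_rate_by_proxy(cells)            # about 0.23 for q=0 and 0.47 for q=1
group_scores = mean_score_by_group(cells, rate)
print(group_scores)                        # about 0.25 for z=0 and 0.45 for z=1
```

Even though z was withheld, the scores differ sharply by group, because q stands in for z and the labels themselves were skewed.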

As a solution, Gummadi proposed encoding fairness as a constraint. Take for example recidivism prediction. In this approach the model might be constrained to predict the same rate of recidivism among white and black arrestees. Then, subject to this fairness constraint, the model would maximize accuracy. In practice, requiring exact equality could yield trivial predictions (give the same prediction to everyone), so the authors introduce some slack. In this approach, while the model wouldn’t be guilty of disparate treatment, it would still peek at the sensitive feature during training.
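The flavor of this constrained optimization can be sketched in a few lines. This is an illustration under strong simplifying assumptions, not the formulation from the talk: the model is a single threshold on one feature x, the search is a brute-force grid, and the data, the `slack` value and all function names are hypothetical. The sensitive feature z is consulted only when choosing the threshold, never when predicting.

```python
def positive_rate(records, group, t):
    """Fraction of the group predicted positive when predicting 1 for x >= t."""
    xs = [x for x, z, _y in records if z == group]
    return sum(x >= t for x in xs) / len(xs)

def accuracy(records, t):
    """Fraction of records where the thresholded prediction matches the label."""
    return sum((x >= t) == bool(y) for x, _z, y in records) / len(records)

def best_threshold(records, thresholds, slack=None):
    """Most accurate threshold whose gap in positive rates is within the slack."""
    feasible = [
        t for t in thresholds
        if slack is None
        or abs(positive_rate(records, "a", t) - positive_rate(records, "b", t)) <= slack
    ]
    return max(feasible, key=lambda t: accuracy(records, t))

# Invented toy data: (x, z, y) with groups "a" and "b".
records = (
    [(x, "a", y) for x, y in zip([1, 2, 3, 4, 5], [0, 0, 0, 1, 1])]
    + [(x, "b", y) for x, y in zip([3, 4, 5, 6, 7], [0, 1, 1, 1, 1])]
)
thresholds = range(1, 9)

t_free = best_threshold(records, thresholds)              # no fairness constraint
t_fair = best_threshold(records, thresholds, slack=0.25)  # parity gap of at most 0.25
print(t_free, accuracy(records, t_free))
print(t_fair, accuracy(records, t_fair))
```

On this toy data the unconstrained threshold is perfectly accurate but leaves a 0.4 gap in positive rates between the groups. Imposing the slack moves the threshold and gives up accuracy, which is the trade-off the constraint is designed to expose.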

This research strikes me as interesting both for its attempts to formalize notions of discrimination and for its exploration of a technical solution. However, I’d share a couple of caveats about this particular technical solution that one ought to consider before actually using it in practice.

My two reservations are as follows. First, I think if a model peeks at a sensitive feature during training but not at inference time, this only removes disparate treatment in a narrow technical sense. It respects the mathematical definition but not the spirit of disparate treatment concerns. Say, for example, that the sensitive feature were race and that the insensitive features included correlated features like zip code. The model could then be expected to use zip code explicitly as a proxy for race. At this point, the model could not reasonably be said to be ignoring race. So if the goal is to learn an affirmative action strategy, why do it implicitly and not explicitly?

This brings us to my second reservation about the approach: the arbitrariness of the learned weights. Consider a dataset for predicting recidivism. Imagine that the dataset contains a throwaway feature Q that has no plausible connection to the prediction task but is correlated with the sensitive feature Z. If we train any reasonable ML classifier, this feature will get no weight: P(Y|Q) = P(Y). But under Gummadi’s proposed model, this nonsense feature now might shoulder considerable weight (to meet the fairness constraint). So for the sake of paying lip-service to disparate treatment, we now produce a seemingly nonsensical model.

To be clear, I think this work is valuable. The benefits and pitfalls of any algorithm can only be examined once the algorithm is proposed. Gummadi’s approach makes a bold attempt both at a problem formulation and a solution.

Arshak Navruzyan – Avoiding bad machine learning predictions in critical decision domains

The lone non-academic among the invited speakers, Arshak Navruzyan of Startup.ML offered an entrepreneur’s perspective on machine learning. Early on, Navruzyan pointed out the difficulty of doing machine learning owing to its interdisciplinary nature. Doing machine learning well requires mathematical skills, engineering talent, and some amount of problem domain expertise. It also requires the ability to identify an important problem, to conceive of its impact, and to communicate its importance effectively. Moreover, executing on all of the above while keeping an eye towards ethics requires yet another dimension of competence.

Navruzyan suggested that we should expect things to go wrong if the people doing this work are ill-equipped for it. He positioned his program, Startup.ML, which seeks to train new data scientists, as one solution. Over a several-month program, Navruzyan and his colleagues train a class of fellows for careers solving practical problems using machine learning. During this introduction, Navruzyan claimed that the Startup.ML fellowship boasted a 2% acceptance rate, making the eyebrow-raising claim that the bootcamp is therefore more selective than elite computer science institutions. The strongest aspect of the talk was his case that programs like Startup.ML could help to enable careers in data science beyond the handful of elite ML PhDs benefitting most now.

Navruzyan suggested that one way to address the multidisciplinary nature of machine learning is to build teams composed of individuals with complementary skill sets. He presented a chart which depicted the stereotypical skills of ML PhDs, general computer scientists, mathematicians, statisticians, natural scientists, and project managers. I’m not a fan of hard-coding stereotypes like this. But the point about interdisciplinary team-building is reasonable.

Later in the talk Navruzyan discussed machine learning predictions in critical domains, ostensibly the purpose of the talk. He cited adversarial examples as one problem, motivating Startup.ML’s growing focus on reinforcement learning. This struck me as odd. There is no good reason to suspect that reinforcement learning is safer than supervised learning in critical domains or less susceptible to adversarial examples. In fact, even absent adversarial intervention, reinforcement learning can be subject to instability during training. Deep reinforcement learning in particular can diverge or oscillate, even on toy examples.

Bettina Berendt – What does it mean to ask about “the human use of machine learning?”

The final invited talk of the day was delivered by Bettina Berendt. In it she posed six questions about the social impacts of machine learning and the responsibilities of individuals and organizations deploying machine learning in the wild. Since her talk was designed more to ask questions and foster conversation than to propose definitive answers, I present them here, excerpted from her abstract.

Which actors are involved in formulating the (e.g. privacy) problem?
How does the researcher conceptualise the problem (e.g. privacy) in terms of the major legal and ethical positions currently being discussed?
Is informing users of (e.g. privacy) dangers always a good thing?
Do we want to influence users’ attitudes and behaviours?
Who is the target audience?
What can we do in our various roles – as academics, teachers, intellectuals, etc.?

In a noteworthy moment, Berendt criticized the tendency of theorists who take on social issues to ignore prior work in the social sciences. This is a thorny issue and I can see multiple angles.

On one hand, there is a question of intellectual opportunity. We could ask, are today’s researchers missing out by ignoring previous research? On the other hand, there’s an issue of credit assignment. Are today’s machine learning researchers, armed with fame and funding, usurping the helm of long-researched fields? Having just come from NIPS 2016, I had credit-assignment battles fresh in my mind. Likely each side has some merit.

How best to engage prior literature on privacy, fairness, and discrimination and to what extent this work is overlooked and under-cited I leave as an exercise for a future post or an ambitious reader.

Final Thoughts

The summit in Venice was a bold effort to bring together various experts at the intersection of machine learning, ethics, and public policy. The talks were diverse and informative, leaving me to wonder: what venue will emerge as the permanent home for this work? At present, this community is scattered across several workshops, none of which publishes peer-reviewed proceedings.

As the field develops and as technological progress increases the importance of this work, will a full conference (or journal) emerge as a definitive publishing venue? Where can policy-makers and machine learning practitioners turn for authoritative research on the social impacts of machine learning?

Regarding the future of this community, I pose the following challenges and questions:

The necessary research requires legal, critical, and technical scholarship. It seems unlikely that we can count on building a community exclusively of individuals who excel in all disciplines. Likely many papers will emerge with important insights for policy but minimal technical contributions. Other papers might have technical contributions but contribute to policy less directly. What set of standards should we apply to evaluating research?
From an audience perspective, research in this area should be accessible to machine learning researchers and practitioners, legal scholars and practitioners, and policy researchers and politicians. Can this be accomplished in one venue?
Could a multi-track conference overcome some of these potential organizational problems?
Could we look to Machine Learning for Healthcare (MLHC) as an exemplar? This new conference brings together work at the intersection of core machine learning and clinical applications. As a published author and peer reviewer there, I can reflect that while the process isn’t yet perfect, it’s off to a promising start.
We need academic incentives for engaging in this research. What changes must we make to our disparate academic communities to encourage this work?
- Philosophers and public policy academics get no credit for research presented at conferences
- Machine learning researchers get little career advancement for writing philosophy papers
- Machine learning venues rarely publish position papers

While these questions may remain open for some time, I’m hopeful after HUML 2016 that a small but growing community of researchers is committed to making progress.

Zachary C. Lipton

Zachary Chase Lipton is an assistant professor at Carnegie Mellon University. He is interested in both core machine learning methodology and applications to healthcare and dialogue systems. He is also a visiting scientist at Amazon AI, and has worked with Amazon Core Machine Learning, Microsoft Research Redmond, & Microsoft Research Bangalore.