Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt*
*equal authorship

Originally presented at ICML 2018: Machine Learning Debates
[arXiv link]

1   Introduction

Collectively, machine learning (ML) researchers are engaged in the creation and dissemination of knowledge about data-driven algorithms. In a given paper, researchers might aspire to any subset of the following goals, among others: to theoretically characterize what is learnable, to obtain understanding through empirically rigorous experiments, or to build a working system that has high predictive accuracy. While determining which knowledge warrants inquiry may be subjective, once the topic is fixed, papers are most valuable to the community when they act in service of the reader, creating foundational knowledge and communicating as clearly as possible.

What sort of papers best serve their readers? We can enumerate desirable characteristics: these papers should (i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence; (ii) describe empirical investigations that consider and rule out alternative hypotheses [62]; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims [64]; and (iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts [56].

Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:

  1. Failure to distinguish between explanation and speculation.
  2. Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.
  3. Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.
  4. Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

While the causes behind these patterns are uncertain, possibilities include the rapid expansion of the community, the consequent thinness of the reviewer pool, and the often-misaligned incentives between scholarship and short-term measures of success (e.g. bibliometrics, attention, and entrepreneurial opportunity). While each pattern offers a corresponding remedy (don’t do it), we also discuss some speculative suggestions for how the community might combat these trends.

As the impact of machine learning widens, and the audience for research papers increasingly includes students, journalists, and policy-makers, these considerations apply to this wider audience as well. We hope that by communicating more precise information with greater clarity, we can accelerate the pace of research, reduce the on-boarding time for new researchers, and play a more constructive role in the public discourse.

Flawed scholarship threatens to mislead the public and stymie future research by compromising ML’s intellectual foundations. Indeed, many of these problems have recurred cyclically throughout the history of artificial intelligence and, more broadly, in scientific research. In 1976, Drew McDermott [53] chastised the AI community for abandoning self-discipline, warning prophetically that “if we can’t criticize ourselves, someone else will save us the trouble”. Similar discussions recurred throughout the 80s, 90s, and aughts [13, 38, 2]. In other fields such as psychology, poor experimental standards have eroded trust in the discipline’s authority [14]. The current strength of machine learning owes to a large body of rigorous research to date, both theoretical [22, 7, 19] and empirical [34, 25, 5]. By promoting clear scientific thinking and communication, we can sustain the trust and investment currently enjoyed by our community.

2   Disclaimers

This paper aims to instigate discussion, answering a call for papers from the ICML Machine Learning Debates workshop. While we stand by the points represented here, we do not purport to offer a full or balanced viewpoint or to discuss the overall quality of science in ML. In many respects, such as reproducibility, the community has advanced standards far beyond what sufficed a decade ago. We note that these arguments are made by us, against us, by insiders offering a critical introspective look, not as sniping outsiders. The ills that we identify are not specific to any individual or institution. We ourselves have fallen into these patterns, and likely will again in the future. Exhibiting one of these patterns doesn’t make a paper bad, nor does it indict the paper’s authors; however, we believe that all papers could be made stronger by avoiding these patterns. While we provide concrete examples, our guiding principles are to (i) implicate ourselves and (ii) preferentially select from the work of better-established researchers and institutions that we admire, to avoid singling out junior students for whom inclusion in this discussion might have consequences and who lack the opportunity to reply symmetrically. We are grateful to belong to a community that provides sufficient intellectual freedom to allow us to express critical perspectives.

3   Troubling Trends

In each subsection below, we (i) describe a trend; (ii) provide several examples (as well as positive examples that resist the trend); and (iii) explain the consequences. Pointing to weaknesses in individual papers can be a sensitive topic. To minimize this sensitivity, we keep examples short and specific.

3.1   Explanation vs. Speculation

Research into new areas often involves exploration predicated on intuitions that have yet to coalesce into crisp formal representations. We recognize the role of speculation as a means for authors to impart intuitions that may not yet withstand the full weight of scientific scrutiny. However, papers often offer speculation in the guise of explanations, which are then interpreted as authoritative due to the trappings of a scientific paper and the presumed expertise of the authors.

For instance, [33] forms an intuitive theory around a concept called internal covariate shift. The exposition on internal covariate shift, starting from the abstract, appears to state technical facts. However, key terms are not made crisp enough to conclusively assume a truth value. For example, the paper states that batch normalization offers improvements by reducing changes in the distribution of hidden activations over the course of training. By which divergence measure is this change quantified? The paper never clarifies, and some work suggests that this explanation of batch normalization may be off the mark [65]. Nevertheless, the speculative explanation given in [33] has been repeated as fact, e.g. in [60], which states, “It is well-known that a deep neural network is very hard to optimize due to the internal-covariate-shift problem.”

We ourselves have been equally guilty of speculation disguised as explanation. In [72], JS writes that “the high dimensionality and abundance of irrelevant features… give the attacker more room to construct attacks”, without conducting any experiments to measure the effect of dimensionality on attackability. And in [71], JS introduces the intuitive notion of coverage without defining it, and uses it as a form of explanation, e.g.: “Recall that one symptom of a lack of coverage is poor estimates of uncertainty and the inability to generate high precision predictions.” Looking back, we wanted to communicate insufficiently fleshed out intuitions that were material to the work described in the paper, and we were reluctant to label a core part of our argument as speculative.

In contrast to the above examples, [69] separates speculation from fact. While this paper, which introduced dropout regularization, speculates at length on connections between dropout and sexual reproduction, a designated “Motivation” section clearly quarantines this discussion. This practice avoids confusing readers while allowing authors to express informal ideas.

In another positive example, [3] presents practical guidelines for training neural networks. Here, the authors carefully convey uncertainty. Instead of presenting the guidelines as authoritative, the paper states: “Although such recommendations come… from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point… but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work”.

3.2   Failure to Identify the Sources of Empirical Gains

The machine learning peer review process places a premium on technical novelty. Perhaps to satisfy reviewers, many papers emphasize both complex models (addressed here) and fancy mathematics (see §3.3). While complex models are sometimes justified, empirical advances often come about in other ways: through clever problem formulations, scientific experiments, optimization heuristics, data preprocessing techniques, extensive hyper-parameter tuning, or by applying existing methods to interesting new tasks. Sometimes a number of proposed techniques together achieve a significant empirical result. In these cases, it serves the reader to elucidate which techniques are necessary to realize the reported gains.

Too frequently, authors propose many tweaks absent proper ablation studies, obscuring the source of empirical gains. Sometimes just one of the changes is actually responsible for the improved results. This can give the false impression that the authors did more work (by proposing several improvements), when in fact they did not do enough (by not performing proper ablations). Moreover, this practice misleads readers to believe that all of the proposed changes are necessary.
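As a minimal sketch of what such an ablation might look like, consider a system with three proposed changes. The component names and the `toy_evaluate` function below are hypothetical stand-ins (a real study would train and evaluate a model for each configuration); the toy scores are constructed so that only one change matters, mirroring the situation described above.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical component names: stand-ins for the tweaks a paper proposes.
COMPONENTS = ("new_architecture", "tuned_learning_rate", "data_augmentation")


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    """Toy stand-in for training and evaluating one configuration (assumption).

    Only `tuned_learning_rate` changes the score, mimicking the case where a
    single change accounts for the entire reported gain.
    """
    base = 0.70
    return base + (0.10 if "tuned_learning_rate" in enabled else 0.0)


def ablation_table(
    components: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[FrozenSet[str], float]:
    """Score every on/off combination of the components."""
    table = {}
    for mask in product((False, True), repeat=len(components)):
        enabled = frozenset(c for c, on in zip(components, mask) if on)
        table[enabled] = evaluate(enabled)
    return table


def leave_one_out_drops(
    components: Sequence[str],
    table: Dict[FrozenSet[str], float],
) -> Dict[str, float]:
    """Score lost when a single component is removed from the full system."""
    full = frozenset(components)
    return {c: table[full] - table[full - {c}] for c in components}


if __name__ == "__main__":
    table = ablation_table(COMPONENTS, toy_evaluate)
    for name, drop in leave_one_out_drops(COMPONENTS, table).items():
        print(f"removing {name}: score drops by {drop:.2f}")
```

The full grid grows exponentially in the number of components, so in practice the leave-one-out comparison alone may be all that is affordable.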

Recently, Melis et al. [54] demonstrated that a series of published improvements, originally attributed to complex innovations in network architectures, were actually due to better hyper-parameter tuning. On equal footing, vanilla LSTMs, hardly modified since 1997 [32], topped the leaderboard. The community may have benefited more by learning the details of the hyper-parameter tuning without the distractions. Similar evaluation issues have been observed for deep reinforcement learning [30] and generative adversarial networks [51]. See [68] for more discussion of lapses in empirical rigor and resulting consequences.

In contrast, many papers perform good ablation analyses [41, 45, 77, 82], and even retrospective attempts to isolate the source of gains can lead to new discoveries [10, 65]. That said, ablation is neither necessary nor sufficient for understanding a method, and can even be impractical given computational constraints. Understanding can also come from robustness checks (as in [15], which discovers that existing language models handle inflectional morphology poorly) as well as qualitative error analysis [40].

Empirical study aimed at understanding can be illuminating even absent a new algorithm. For instance, probing the behavior of neural networks led to identifying their susceptibility to adversarial perturbations [74]. Careful study also often reveals limitations of challenge datasets while yielding stronger baselines. [11] studies a task designed for reading comprehension of news passages and finds that 73% of the questions can be answered by looking at a single sentence, while only 2% require looking at multiple sentences (the remaining 25% of examples were either ambiguous or contained coreference errors). In addition, simpler neural networks and linear classifiers outperformed complicated neural architectures that had previously been evaluated on this task. In the same spirit, [80] analyzes and constructs a strong baseline for the Visual Genome Scene Graphs dataset.

3.3   Mathiness

When writing a paper early in a PhD, we (ZL) received feedback from an experienced post-doc that the paper needed more equations. The post-doc wasn’t endorsing the system, but rather communicating a sober view of how reviewing works. More equations, even when difficult to decipher, tend to convince reviewers of a paper’s technical depth.

Mathematics is an essential tool for scientific communication, imparting precision and clarity when used correctly. However, not all ideas and claims are amenable to precise mathematical description, and natural language is an equally indispensable tool for communicating, especially about intuitive or empirical claims.

When mathematical and natural language statements are mixed without a clear accounting of their relationship, both the prose and the theory can suffer: problems in the theory can be concealed by vague definitions, while weak arguments in the prose can be bolstered by the appearance of technical depth. We refer to this tangling of formal and informal claims as mathiness, following economist Paul Romer, who described the pattern thus: “Like mathematical theory, mathiness uses a mixture of words and symbols, but instead of making tight links, it leaves ample room for slippage between statements in natural language versus formal language” [64].

Mathiness manifests in several ways: First, some papers abuse mathematics to convey technical depth—to bulldoze rather than to clarify. Spurious theorems are common culprits, inserted into papers to lend authoritativeness to empirical results, even when the theorem’s conclusions do not actually support the main claims of the paper. We (JS) are guilty of this in [70], where a discussion of “staged strong Doeblin chains” has limited relevance to the proposed learning algorithm, but might confer a sense of theoretical depth to readers.

The ubiquity of this issue is evidenced by the paper introducing the Adam optimizer [35]. In the course of introducing an optimizer with strong empirical performance, it also offers a theorem regarding convergence in the convex case, which is perhaps unnecessary in an applied paper focusing on non-convex optimization. The proof was later shown to be incorrect in [63].

A second issue is claims that are neither clearly formal nor clearly informal. For example, [18] argues that the difficulty in optimizing neural networks stems not from local minima but from saddle points. As one piece of evidence, the work cites a statistical physics paper [9] on Gaussian random fields and states that in high dimensions “all local minima [of Gaussian random fields] are likely to have an error very close to that of the global minimum” (a similar statement appears in the related work of [12]). This appears to be a formal claim, but absent a specific theorem it is difficult to verify the claimed result or to determine its precise content. Our understanding is that it is partially a numerical claim that the gap is small for typical settings of the problem parameters, as opposed to a claim that the gap vanishes in high dimensions. A formal statement would help clarify this. We note that the broader interesting point in [18] that minima tend to have lower loss than saddle points is more clearly stated and empirically tested.

Finally, some papers invoke theory in overly broad ways, or make passing references to theorems with dubious pertinence. For instance, the no free lunch theorem is commonly invoked as a justification for using heuristic methods without guarantees, even though the theorem does not formally preclude guaranteed learning procedures.

While the best remedy for mathiness is to avoid it, some papers go further with exemplary exposition. A recent paper [8] on counterfactual reasoning covers a large amount of mathematical ground in a down-to-earth manner, with numerous clear connections to applied empirical problems. This tutorial, written in clear service to the reader, has helped to spur work in the burgeoning community studying counterfactual reasoning for ML.

3.4   Misuse of Language

We identify three common avenues of language misuse in machine learning: suggestive definitions, overloaded terminology, and suitcase words.

3.4.1   Suggestive Definitions

In the first avenue, a new technical term is coined that has a suggestive colloquial meaning, thus sneaking in connotations without the need to argue for them. This often manifests in anthropomorphic characterizations of tasks (reading comprehension [31] and music composition [59]) and techniques (curiosity [66] and fear [48]). A number of papers name components of proposed models in a manner suggestive of human cognition, e.g. “thought vectors” [36] and the “consciousness prior” [4]. Our goal is not to rid the academic literature of all such language; when properly qualified, these connections might communicate a fruitful source of inspiration. However, when a suggestive term is assigned technical meaning, each subsequent paper has no choice but to confuse its readers, either by embracing the term or by replacing it.

Describing empirical results with loose claims of “human-level” performance can also portray a false sense of current capabilities. Take, for example, the “dermatologist-level classification of skin cancer” reported in [21]. The comparison to dermatologists conceals the fact that classifiers and dermatologists perform fundamentally different tasks. Real dermatologists encounter a wide variety of circumstances and must perform their jobs despite unpredictable changes. The machine classifier, however, only achieves low error on i.i.d. test data. In contrast, claims of human-level performance in [29] are better-qualified to refer to the ImageNet classification task (rather than object recognition more broadly). Even in this case, one careful paper (among many less careful [21, 57, 75]) was insufficient to put the public discourse back on track. Popular articles continue to characterize modern image classifiers as “surpassing human abilities and effectively proving that bigger data leads to better decisions” [23], despite demonstrations that these networks rely on spurious correlations, e.g. misclassifying “Asians dressed in red” as ping-pong balls [73].

Deep learning papers are not the sole offenders; misuse of language plagues many subfields of ML. [49] discusses how the recent literature on fairness in ML often overloads terminology borrowed from complex legal doctrine, such as disparate impact, to name simple equations expressing particular notions of statistical parity. This has resulted in a literature where “fairness”, “opportunity”, and “discrimination” denote simple statistics of predictive models, confusing researchers who become oblivious to the difference, and policymakers who become misinformed about the ease of incorporating ethical desiderata into ML.

3.4.2   Overloading Technical Terminology

A second avenue of misuse consists of taking a term that holds precise technical meaning and using it in an imprecise or contradictory way. Consider the case of deconvolution, which formally describes the process of reversing a convolution, but is now used in the deep learning literature to refer to transposed convolutions (also called upconvolutions) as commonly found in auto-encoders and generative adversarial networks. This term first took root in deep learning in [79], which does address deconvolution, but was later over-generalized to refer to any neural architecture using upconvolutions [78, 50]. Such overloading of terminology can create lasting confusion. New machine learning papers referring to deconvolution might be (i) invoking its original meaning, (ii) describing upconvolution, or (iii) attempting to resolve the confusion, as in [28], which awkwardly refers to “upconvolution (deconvolution)”.

As another example, generative models are traditionally models of either the input distribution p(x) or the joint distribution p(x,y). In contrast, discriminative models address the conditional distribution p(y | x) of the label given the inputs. However, in recent works, “generative model” imprecisely refers to any model that produces realistic-looking structured data. On the surface, this may seem consistent with the p(x) definition, but it obscures several shortcomings—for instance, the inability of GANs or VAEs to perform conditional inference (e.g. sampling from p(x2 | x1) where x1 and x2 are two distinct input features). Bending the term further, some discriminative models are now referred to as generative models on account of producing structured outputs [76], a mistake that we (ZL) make in [47]. Seeking to resolve the confusion and provide historical context, [58] distinguishes between prescribed and implicit generative models.
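To make the traditional distinction concrete, the relationships among these distributions can be written out using only the product rule and marginalization. The notation x_1, x_2 follows the example above; the integration variable x_2' is introduced here for illustration.

```latex
% Generative: models the joint distribution (or the input marginal p(x)).
p(x, y) = p(y \mid x)\, p(x)

% Discriminative: models only the conditional p(y \mid x).

% Conditional inference over input features requires access to the
% joint distribution over those features:
p(x_2 \mid x_1) = \frac{p(x_1, x_2)}{p(x_1)}
                = \frac{p(x_1, x_2)}{\int p(x_1, x_2')\, dx_2'}
```

A model that only produces samples of x does not by itself expose the quantities on the right-hand side of the last line, which is one way to read the shortcoming noted above.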

Revisiting batch normalization, [33] describes covariate shift as a change in the distribution of model inputs. In fact, covariate shift refers to a specific type of shift where although the input distribution p(x) might change, the labeling function p(y | x) does not [27]. Moreover, due to the influence of [33], Google Scholar lists batch normalization as the first reference on searches for “covariate shift”.
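Stated symbolically, and assuming the subscripts "train" and "test" (our labels, not notation from [27]) denote the two distributions being compared, the established definition reads:

```latex
% Covariate shift: the input distribution may change ...
p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x)
% ... while the labeling function stays fixed.
\qquad
p_{\mathrm{train}}(y \mid x) = p_{\mathrm{test}}(y \mid x)
```

The second condition is the part that a description in terms of "a change in the distribution of model inputs" leaves out.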

Among the consequences of misusing language is that (as with generative models) we might conceal lack of progress by redefining an unsolved task to refer to something easier. This often combines with suggestive definitions via anthropomorphic naming. Language understanding and reading comprehension, once grand challenges of AI, now refer to making accurate predictions on specific datasets [31].

3.4.3   Suitcase Words

Finally, we discuss the overuse of suitcase words in ML papers. Coined by Minsky in the 2007 book The Emotion Machine [56], suitcase words pack together a variety of meanings. Minsky describes mental processes such as consciousness, thinking, attention, emotion, and feeling that may not share “a single cause or origin”. Many terms in ML fall into this category. For example, [46] notes that interpretability holds no universally agreed-upon meaning, and often references disjoint methods and desiderata. As a consequence, even papers that appear to be in dialogue with each other may have different concepts in mind.

As another example, generalization has both a specific technical meaning (generalizing from train to test) and a more colloquial meaning that is closer to the notion of transfer (generalizing from one population to another) or of external validity (generalizing from an experimental setting to the real world) [67]. Conflating these notions leads to overestimating the capabilities of current systems.

Suggestive definitions and overloaded terminology can contribute to the creation of new suitcase words. In the fairness literature, where legal, philosophical, and statistical language are often overloaded, terms like bias become suitcase words that must be subsequently unpacked [17].

In common speech and as aspirational terms, suitcase words can serve a useful purpose. Perhaps the suitcase word reflects an overarching concept that unites the various meanings. For example, artificial intelligence might be well-suited as an aspirational name to organize an academic department. On the other hand, using suitcase words in technical arguments can lead to confusion. For example, [6] writes an equation (Box 4) involving the terms intelligence and optimization power, implicitly assuming that these suitcase words can be quantified with a one-dimensional scalar.

4   Speculation on Causes Behind the Trends

Do the above patterns represent a trend, and if so, what are the underlying causes? We speculate that these patterns are on the rise and suspect several possible causal factors: complacency in the face of progress, the rapid expansion of the community, the consequent thinness of the reviewer pool, and misaligned incentives of scholarship vs. short-term measures of success.

4.1   Complacency in the Face of Progress

The apparent rapid progress in ML has at times engendered an attitude that strong results excuse weak arguments. Authors with strong results may feel licensed to insert arbitrary unsupported stories (see §3.1) regarding the factors driving the results, to omit experiments aimed at disentangling those factors (§3.2), to adopt exaggerated terminology (§3.4), or to take less care to avoid mathiness (§3.3).

At the same time, the single-round nature of the reviewing process may cause reviewers to feel they have no choice but to accept papers with strong quantitative findings. Indeed, even if the paper is rejected, there is no guarantee the flaws will be fixed or even noticed in the next cycle, so reviewers may conclude that accepting a flawed paper is the best option.

4.2   Growing Pains

Since around 2012, the ML community has expanded rapidly due to increased popularity stemming from the success of deep learning methods. While we view the rapid expansion of the community as a positive development, it can also have side effects.

To protect junior authors, we have preferentially referenced our own papers and those of established researchers. However, newer researchers may be more susceptible to these patterns. For instance, authors unaware of previous terminology are more likely to misuse or redefine language (§3.4). On the other hand, experienced researchers fall into these patterns as well.

Rapid growth can also thin the reviewer pool, in two ways—by increasing the ratio of submitted papers to reviewers, and by decreasing the fraction of experienced reviewers. Less experienced reviewers may be more likely to demand architectural novelty, be fooled by spurious theorems, and let pass serious but subtle issues like misuse of language, thus either incentivizing or enabling several of the trends described above. At the same time, experienced but over-burdened reviewers may revert to a “check-list” mentality, rewarding more formulaic papers at the expense of more creative or intellectually ambitious work that might not fit a preconceived template. Moreover, overworked reviewers may not have enough time to fix—or even to notice—all of the issues in a submitted paper.

4.3   Misaligned Incentives

Reviewers are not alone in providing poor incentives for authors. As ML research garners increased media attention and ML startups become commonplace, to some degree incentives are provided by the press (“What will they write about?”) and by investors (“What will they invest in?”). The media provides incentives for some of these trends. Anthropomorphic descriptions of ML algorithms provide fodder for popular coverage. Take for instance [55], which characterizes an autoencoder as a “simulated brain”. Hints of human-level performance tend to be sensationalized in newspaper headlines, e.g. [52], which describes a deep learning image captioning system as “mimicking human levels of understanding”. Investors too have shown a strong appetite for AI research, funding startups sometimes on the basis of a single paper. In our (ZL) experience working with investors, they are sometimes attracted to startups whose research has received media coverage, a dynamic which attaches financial incentives to media attention. We note that recent interest in chatbot startups co-occurred with anthropomorphic descriptions of dialogue systems and reinforcement learners both in papers and in the media, although it may be difficult to determine whether the lapses in scholarship caused the interest of investors or vice versa.

5   Suggestions

Supposing we are to intervene to counter these trends, then how? Besides merely suggesting that each author abstain from these patterns, what can we do as a community to raise the level of experimental practice, exposition, and theory? And how can we more readily distill the knowledge of the community and disabuse researchers and the wider public of misconceptions? Below we offer a number of preliminary suggestions based on our personal experiences and impressions.

5.1   Suggestions for Authors

We encourage authors to ask “what worked?” and “why?”, rather than just “how well?”. Except in extraordinary cases [39], raw headline numbers provide limited value for scientific progress absent insight into what drives them. Insight does not necessarily mean theory. Three practices that are common in the strongest empirical papers are error analysis, ablation studies, and robustness checks (e.g. to the choice of hyper-parameters and, ideally, to the choice of dataset). These practices can be adopted by everyone and we advocate their widespread use. For some exemplar papers, we refer the reader to the preceding discussion in §3.2. [43] also provides a more detailed survey of empirical best practices.
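As one illustration, a robustness check over hyper-parameters and random seeds can be as simple as the sketch below. The `toy_run` function is a hypothetical stand-in for a full training run, and the learning rates and seeds are arbitrary choices made for the example.

```python
import random
import statistics
from typing import Callable, Dict, Sequence, Tuple


def toy_run(learning_rate: float, seed: int) -> float:
    """Toy stand-in for one training run (assumption): returns a noisy score.

    The score peaks near learning_rate = 0.1; the noise depends only on the seed.
    """
    rng = random.Random(seed)
    return 0.8 - abs(learning_rate - 0.1) + rng.gauss(0.0, 0.01)


def robustness_summary(
    run: Callable[[float, int], float],
    learning_rates: Sequence[float],
    seeds: Sequence[int],
) -> Dict[float, Tuple[float, float]]:
    """Mean and standard deviation of the score across seeds, per learning rate."""
    summary = {}
    for lr in learning_rates:
        scores = [run(lr, seed) for seed in seeds]
        summary[lr] = (statistics.mean(scores), statistics.stdev(scores))
    return summary


if __name__ == "__main__":
    summary = robustness_summary(toy_run, [0.01, 0.1, 0.5], range(5))
    for lr, (mean, std) in summary.items():
        print(f"lr={lr}: mean={mean:.3f}, std={std:.3f}")
```

Reporting the spread alongside the mean lets a reader judge whether a claimed gain exceeds seed-to-seed variation and whether it survives a change of hyper-parameters.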

Sound empirical inquiry need not be confined to tracing the sources of a particular algorithm’s empirical gains; it can yield new insights even when no new algorithm is proposed. Notable examples of this include a demonstration that neural networks trained by stochastic gradient descent can fit randomly-assigned labels [81]. This paper questions the ability of learning-theoretic notions of model complexity to explain why neural networks can generalize to unseen data. In another example, [26] explored the loss surfaces of deep networks, revealing that straight-line paths in parameter space between initialized and learned parameters typically had monotonically decreasing loss.
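The straight-line probe described above can be sketched in a few lines. This is not the experimental code of [26]: the quadratic `toy_loss`, its `TARGET`, and the starting point are assumptions chosen so the example runs without any training, and for this convex toy loss the monotone decrease is guaranteed rather than discovered.

```python
from typing import Callable, List, Sequence


def interpolate(
    theta_init: Sequence[float], theta_final: Sequence[float], alpha: float
) -> List[float]:
    """Point on the straight line (1 - alpha) * theta_init + alpha * theta_final."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_init, theta_final)]


def loss_along_path(
    loss: Callable[[Sequence[float]], float],
    theta_init: Sequence[float],
    theta_final: Sequence[float],
    num_points: int = 11,
) -> List[float]:
    """Evaluate the loss at evenly spaced points between the two parameter vectors."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [loss(interpolate(theta_init, theta_final, a)) for a in alphas]


# Toy convex loss (assumption): squared distance to a fixed target vector.
TARGET = [1.0, -2.0, 0.5]


def toy_loss(theta: Sequence[float]) -> float:
    return sum((t - s) ** 2 for t, s in zip(theta, TARGET))


if __name__ == "__main__":
    losses = loss_along_path(toy_loss, [0.0, 0.0, 0.0], TARGET)
    monotone = all(a >= b for a, b in zip(losses, losses[1:]))
    print("monotonically decreasing:", monotone)
```

For a real network, `toy_loss` would be replaced by the training or validation loss evaluated at the interpolated weights, and the outcome of the monotonicity check would then be an empirical finding.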

When writing, we recommend asking the following question: Would I rely on this explanation for making predictions or for getting a system to work? This can be a good test of whether a theorem is being included to please reviewers or to convey actual insight. It also helps check whether concepts and explanations match our own internal mental model. On mathematical writing, we point the reader to Knuth, Larrabee, and Roberts’ excellent guidebook [37].

Finally, being clear about which problems are open and which are solved not only presents a clearer picture to readers, it encourages follow-up work and guards against researchers neglecting questions presumed (falsely) to be resolved.

5.2   Suggestions for Publishers and Reviewers

Reviewers can set better incentives by asking: “Might I have accepted this paper if the authors had done a worse job?” For instance, a paper describing a simple idea that leads to improved performance, together with two negative results, should be judged more favorably than a paper that combines three ideas together (without ablation studies) yielding the same improvement.

The current literature moves fast, at the cost of accepting flawed works for conference publication. One remedy could be to emphasize authoritative retrospective surveys that strip out exaggerated claims and extraneous material, change anthropomorphic names to sober alternatives, standardize notation, etc. While venues such as Foundations and Trends in Machine Learning already provide a track for such work, we feel that there are still not enough strong papers in this genre.

Additionally, we believe (noting our conflict of interest) that critical writing ought to have a voice at machine learning conferences. Typical ML conference papers choose an established problem (or propose a new one), demonstrate an algorithm and/or analysis, and report experimental results. While many questions can be addressed in this way, for addressing the validity of the problems or the methods of inquiry themselves, neither algorithms nor experiments are sufficient (or appropriate). We would not be alone in embracing greater critical discourse: in NLP, this year’s COLING conference included a call for position papers “to challenge conventional thinking” [1].

There are many lines of further discussion worth pursuing regarding peer review. Are the problems we described mitigated or exacerbated by open review? How do reviewer point systems align with the values that we advocate? These topics warrant their own papers and have indeed been discussed at length elsewhere [42, 44, 24].

6   Discussion

Folk wisdom might suggest not to intervene just as the field is heating up: You can’t argue with success! We counter these objections with the following arguments: First, many aspects of the current culture are consequences of ML’s recent success, not its causes. In fact, many of the papers leading to the current success of deep learning were careful empirical investigations characterizing principles for training deep networks. This includes the advantage of random over sequential hyper-parameter search [5], the behavior of different activation functions [34, 25], and an understanding of unsupervised pre-training [20].

Second, flawed scholarship already negatively impacts the research community and broader public discourse. We saw in §3 examples of unsupported claims being cited thousands of times, lineages of purported improvements being overturned by simple baselines, datasets that appear to test high-level semantic reasoning but actually test low-level syntactic fluency, and terminology confusion that muddles the academic dialogue. This final issue also affects the public discourse. For instance, the European Parliament passed a report considering regulations to apply if “robots become or are made self-aware” [16]. While ML researchers are not responsible for all misrepresentations of our work, it seems likely that anthropomorphic language in authoritative peer-reviewed papers is at least partly to blame.

We believe that greater rigor in exposition, science, and theory is essential both for scientific progress and for fostering a productive discourse with the broader public. Moreover, as practitioners apply ML in critical domains such as health, law, and autonomous driving, a calibrated awareness of the abilities and limits of ML systems will enable us to deploy ML responsibly. We conclude the paper by discussing several counterarguments and by providing historical context.

6.1   Countervailing Considerations

There are a number of countervailing considerations to the suggestions set forth above. Several readers of earlier drafts of this paper noted that stochastic gradient descent tends to converge faster than gradient descent—in other words, perhaps a faster noisier process that ignores our guidelines for producing “cleaner” papers results in a faster pace of research. For example, the breakthrough paper on ImageNet classification [39] proposes multiple techniques without ablation studies, several of which were subsequently determined to be unnecessary. However, at the time the results were so significant and the experiments so computationally expensive to run that waiting for ablations to complete was perhaps not worth the cost to the community.

A related concern is that high standards might impede the publication of original ideas, which are more likely to be unusual and speculative. In other fields, such as economics, high standards result in a publishing process that can take years for a single paper, with lengthy revision cycles consuming resources that could be deployed towards new work.

Finally, perhaps there is value in specialization: the researchers generating new conceptual ideas or building new systems need not be the same ones who carefully collate and distill knowledge.

We recognize the validity of these considerations, and also recognize that these standards are at times exacting. However, in many cases they are straightforward to implement, requiring only a few extra days of experiments and more careful writing. Moreover, we present these as strong heuristics rather than unbreakable rules—if an idea cannot be shared without violating these heuristics, we prefer the idea be shared and the heuristics set aside. Additionally, we have almost always found attempts to adhere to these standards to be well worth the effort. In short, we do not believe that the research community has achieved a Pareto optimal state on the growth-quality frontier.

6.2   Historical Antecedents

The issues discussed here are unique neither to machine learning nor to this moment in time; they instead reflect issues that recur cyclically throughout academia. As far back as 1964, the physicist John R. Platt discussed related concerns in his paper on strong inference [62], where he identified adherence to specific empirical standards as responsible for the rapid progress of molecular biology and high-energy physics relative to other areas of science.

There have also been similar discussions in AI. As noted in §1, Drew McDermott [53] criticized a (mostly pre-ML) AI community in 1976 on a number of issues, including suggestive definitions and a failure to separate out speculation from technical claims. In 1988, Paul Cohen and Adele Howe [13] addressed an AI community that at that point “rarely publish[ed] performance evaluations” of their proposed algorithms and instead only described the systems. They suggested establishing sensible metrics for quantifying progress, and also analyzing “why does it work?”, “under what circumstances won’t it work?” and “have the design decisions been justified?”, questions that continue to resonate today. Finally, in 2009 Armstrong and co-authors [2] discussed the empirical rigor of information retrieval research, noting a tendency of papers to compare against the same weak baselines, producing a long series of improvements that did not accumulate to meaningful gains.

In other fields, an unchecked decline in scholarship has led to crisis. A landmark study in 2015 suggested that a significant portion of findings in the psychology literature may not be reproducible [14]. In a few historical cases, enthusiasm paired with undisciplined scholarship led entire communities down blind alleys. For example, following the discovery of X-rays, a related discipline on N-rays emerged [61] before it was eventually debunked.

6.3   Concluding Remarks

The reader might rightly suggest that these problems are self-correcting. We agree. However, the community self-corrects precisely through recurring debate about what constitutes reasonable standards for scholarship. We hope that this paper contributes constructively to the discussion.

Acknowledgments

We thank the many researchers, colleagues, and friends who generously shared feedback on this draft, including Asya Bergal, Kyunghyun Cho, Moustapha Cisse, Daniel Dewey, Danny Hernandez, Charles Elkan, Ian Goodfellow, Moritz Hardt, Tatsunori Hashimoto, Sergey Ioffe, Sham Kakade, David Kale, Holden Karnofsky, Pang Wei Koh, Lisha Li, Percy Liang, Julian McAuley, Robert Nishihara, Noah Smith, Balakrishnan “Murali” Narayanaswamy, Ali Rahimi, Christopher Ré, and Byron Wallace. We also thank the ICML Debates organizers for the opportunity to work on this draft and for their patience throughout our revision process.

References

[1] Coling first call for papers, Accessed on July 4th, 2018. URL http://coling2018.org/first-call-for-papers/.

[2] Timothy G Armstrong, Alistair Moffat, William Webber, and Justin Zobel. Improvements that don’t add up: ad-hoc retrieval results since 1998. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009.

[3]  Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.

[4]  Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.

[5]  James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 13(Feb), 2012.

[6]  Nick Bostrom. Superintelligence. Dunod, 2017.

[7]  Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems (NIPS), 2008.

[8]  Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

[9]  Alan J Bray and David S Dean. Statistics of critical points of gaussian fields on large-dimensional spaces. Physical review letters, 98(15):150201, 2007.

[10]  Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.

[11]  Danqi Chen, Jason Bolton, and Christopher D Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL), 2016.

[12]  Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics (AISTATS), 2015.

[13]  Paul R Cohen and Adele E Howe. How evaluation guides ai research: The message still counts more than the medium. AI magazine, 9(4):35, 1988.

[14]  Open Science Collaboration et al. Estimating the reproducibility of psychological science. Science, 349(6251):aac4716, 2015.

[15]  Ryan Cotterell, Sebastian J Mielke, Jason Eisner, and Brian Roark. Are all languages equally hard to language-model? In North American Chapter of the Association for Computational Linguistics (NAACL), 2018.

[16]  Council of European Union. Motion for a European parliament resolution with recommendations to the commission on civil law rules on robotics, 2017. http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//NONSGML%2BCOMPARL%2BPE-582.443%2B01%2BDOC%2BPDF%2BV0//EN.

[17]  David Danks and Alex John London. Algorithmic bias in autonomous systems. In International Joint Conference on Artificial Intelligence (IJCAI). AAAI Press, 2017.

[18]  Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems (NIPS), 2014.

[19]  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR), 12(Jul), 2011.

[20]  Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR), 11(Feb):625–660, 2010.

[21]  Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017.

[22]  Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.

[23]  David Gershgorn. The data that transformed AI research - and possibly the world, 2017. Accessed on July 4th, 2018. URL https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/.

[24]  Zoubin Ghahramani. A modest proposal, Accessed on July 4th, 2018. URL http://hunch.net/?page_id=1115.

[25]  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics (AISTATS), 2010.

[26]  Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations (ICLR), 2015.

[27]  Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.

[28]  Caner Hazirbas, Laura Leal-Taixé, and Daniel Cremers. Deep depth from focus. arXiv preprint arXiv:1704.01085, 2017.

[29]  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International conference on computer vision (ICCV), 2015.

[30]  Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters, 2017.

[31]  Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS), 2015.

[32]  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.

[33]  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML), 2015.

[34]  Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision (ICCV). IEEE, 2009.

[35]  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

[36]  Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems (NIPS), 2015.

[37]  Donald E Knuth, Tracy Larrabee, and Paul M Roberts. Mathematical writing, 1987. URL http://www.jmlr.org/reviewing-papers/knuth_mathematical_writing.pdf.

[38]  RE Korf. Does deep blue use artificial intelligence? ICGA Journal, 20(4):243–245, 1997.

[39]  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

[40]  Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP), 2013.

[41]  Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.

[42]  John Langford. Future publication models at NIPS, Accessed on July 4th, 2018. URL http://hunch.net/?p=1086.

[43]  Pat Langley and Dennis Kibler. The experimental study of machine learning. 1991.

[44]  Yann LeCun. Proposal for a new publishing model in computer science, Accessed on July 4th, 2018. URL http://yann.lecun.com/ex/pamphlets/publishing-models.html.

[45]  Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In Association for Computational Linguistics (ACL), 2017.

[46]  Zachary C Lipton. The mythos of model interpretability. ICML Workshop on Human Interpretability, 2016.

[47]  Zachary C Lipton, Sharad Vikram, and Julian McAuley. Generative concatenative nets jointly learn to write and classify reviews. arXiv preprint arXiv:1511.03683, 2015.

[48]  Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning’s Sisyphean curse with intrinsic fear. NIPS Workshop on Reliable ML in the Wild, 2016.

[49]  Zachary C Lipton, Alexandra Chouldechova, and Julian McAuley. Does mitigating ML’s impact disparity require treatment disparity? arXiv preprint arXiv:1711.07076, 2017.

[50]  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2015.

[51]  Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. arXiv preprint arXiv:1711.10337, 2017.

[52]  John Markoff. Researchers announce advance in image-recognition software, 2014. Accessed on July 4th, 2018. URL https://www.nytimes.com/2014/11/18/science/researchers-announce-breakthrough-in-content-recognition-software.html. [Online; posted 26-September-2014].

[53]  Drew McDermott. Artificial intelligence meets natural stupidity. ACM SIGART Bulletin, (57): 4–9, 1976.

[54]  Gábor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR), 2018.

[55]  Cade Metz. You don’t have to be Google to build an artificial brain, 2014 — Accessed on July 4th, 2018. URL https://www.wired.com/2014/09/google-artificial-brain/.

[56]  Marvin Minsky. The emotion machine: Commonsense thinking, artificial intelligence, and the future of the human mind. Simon and Schuster, 2007.

[57]  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[58]  Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[59]  Michael C Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 6(2-3):247–280, 1994.

[60]  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In International Conference on Computer Vision (ICCV), 2015.

[61]  Mary Jo Nye. N-rays: An episode in the history and psychology of science. Historical studies in the physical sciences, 11(1):125–156, 1980.

[62]  John R Platt. Strong inference. Science, 1964.

[63]  Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. 2018.

[64]  Paul M Romer. Mathiness in the theory of economic growth. American Economic Review, 105 (5):89–93, 2015.

[65]  Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.

[66]  Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, 1991.

[67]  Arthur Schram. Artificiality: The tension between internal and external validity in economic experiments. Journal of Economic Methodology, 12(2):225–237, 2005.

[68]  D Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner’s curse? on pace, progress, and empirical rigor. In ICLR Workshop, 2018.

[69]  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.

[70]  Jacob Steinhardt and Percy Liang. Learning fast-mixing models for structured prediction. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1063–1072, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/steinhardtb15.html.

[71]  Jacob Steinhardt and Percy Liang. Reified context models. In International Conference on Machine Learning (ICML), 2015.

[72]  Jacob Steinhardt, Pang Wei Koh, and Percy S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems (NIPS), 2017.

[73]  Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv preprint arXiv:1711.11443, 2017.

[74]  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[75]  Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Computer vision and pattern recognition (CVPR), 2014.

[76]  Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. Neural generative question answering. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[77]  Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro O Pinheiro, Sam Gross, Soumith Chintala, and Piotr Dollár. A multipath network for object detection. Computer Vision and Pattern Recognition (CVPR), 2016.

[78]  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.

[79]  Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010.

[80]  Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In Computer Vision and Pattern Recognition (CVPR), 2018.

[81]  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

[82]  Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP), 2017.

Author: Zachary C. Lipton

Zachary Chase Lipton is an assistant professor at Carnegie Mellon University. He is interested in both core machine learning methodology and applications to healthcare and dialogue systems. He is also a visiting scientist at Amazon AI, and has worked with Amazon Core Machine Learning, Microsoft Research Redmond, & Microsoft Research Bangalore.

8 thoughts on “Troubling Trends in Machine Learning Scholarship”

  1. The whole paper can be summarized as follows IMHO:

    “What sort of papers best serve their readers? We can enumerate desirable characteristics: these papers should (i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence; (ii) describe empirical investigations that consider and rule out alternative hypotheses [62]; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims [64]; and (iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts [56]. Recent progress in machine learning comes despite frequent departures from these ideals. In this paper, we focus on the following four patterns that appear to us to be trending in ML scholarship:

    1. Failure to distinguish between explanation and speculation.
    2. Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning.
    3. Mathiness: the use of mathematics that obfuscates or impresses rather than clarifies, e.g. by confusing technical and non-technical concepts.
    4. Misuse of language, e.g. by choosing terms of art with colloquial connotations or by overloading established technical terms.

    While the causes behind these patterns are uncertain, possibilities include the rapid expansion of the community, the consequent thinness of the reviewer pool, and the often-misaligned incentives between scholarship and short-term measures of success (e.g. bibliometrics, attention, and entrepreneurial opportunity). While each pattern offers a corresponding remedy (don’t do it), we also discuss some speculative suggestions for how the community might combat these trends.”

    e.g. gutting everything before / after.

    This may be sin #5: unnecessary verbosity for basic ideas.

    I agree with everything you’ve said, but i think people are fairly aware that problems #1-4 exist (although the extent they are new/increasing in severity is probably subject to debate).

    I think the question(s) that this paper sets the stage for an interesting debate about is: “What are the most appropriate solutions/changes to each of these problems (other than perhaps ‘vigilance’)?”

    1. I don’t agree. That merely articulates a position. It’s important to walk through examples, to think critically about why this might be happening and to discuss possible correctives. I also think there’s value in discussing the consequences, i.e., where have these patterns gone wrong historically.

  2. thx for the (meta)analysis. the AI field has gone thru massive growth in short time and billions of dollars are chasing many monetizable applications, others just on the edge of practical/ viable. it is a little like drug testing / sales in the medical field, capitalism + science is not a comfortable mix typically. this fits in with recent criticism of ML as underperforming against the “big challenges” nearer to AGI recently covered here. recall the “gartner hype curve”. its a gold rush, and during one, there is real gold and fools gold.

    https://vzn1.wordpress.com/2018/06/17/top-agi-leads-2018/#c

    https://en.wikipedia.org/wiki/Hype_cycle

  3. This is an excellent read. I agree with most everything written. I do think, however, that it was good for the authors of the Adam paper to present a proof for the convex case. Even though it was wrong in the end, I think it’s useful to have some theoretical guarantees in those cases.

    1. I agree that this paragraph is one of the weaker in our paper and that we should be clearer about what in this paper constitutes mathiness. As written now it can come off as suggesting that we don’t in general believe in analyzing simple cases, which is not actually the position we’re trying to express here.
