Do I really have to cite an arXiv paper?

With peak submission season for machine learning conferences just behind us, many in our community have peer-review on the mind. One especially hot topic is the arXiv preprint service. Computer scientists often post papers to arXiv in advance of formal publication to share their ideas and hasten their impact.

Despite the arXiv’s popularity, many authors are peeved, pricked, piqued, and provoked by requests from reviewers that they cite papers that have appeared only on the arXiv preprint server.

“Do I really have to cite arXiv papers?” they whine.

“Come on, they’re not even published!” they exclaim.

The conversation is especially testy owing to the increased use (read: misuse) of the arXiv by naifs. The preprint server, like the conferences proper, is awash in low-quality papers submitted by bandwagoners. Now that the tooling for deep learning has become so strong, it’s especially easy to clone a repo, run it on a new dataset, tweak a few hyperparameters, and start writing up a draft.

Of particular worry is the practice of flag-planting. That’s when researchers anticipate that an area will get hot. To avoid getting scooped / to be the first scoopers, authors might hastily throw an unfinished work on the arXiv to stake their territory: we were the first to work on X. All that follow must cite us. In a sublimely cantankerous rant on Medium, NLP/ML researcher Yoav Goldberg blasted the rising use of the (mal)practice.

In particular, he excoriated a paper from the prominent MILA research group which purported to have adapted the methods of generative adversarial networks to language. His gripe was that the language generation was laughable and actually far worse than any current technique. The authors, he surmised (and many I’ve spoken with agree), were staking their territory so that regardless of who first succeeds, they would need to be cited as the originators of the idea.

Amid this tumult, some have questioned the very enterprise of citing preprinted articles. So, if the arXiv may be subject to abuse, do I have to cite papers that have only appeared on the arXiv?

Yes, of course. Any time that our work follows, copies, or borrows ideas from other people, and when we can reasonably be expected to be aware of this, we ought to cite the related work.

A large number of seminal works have never been published. The greatest mathematics paper of our lifetimes remains unpublished. Not every paper on the arXiv warrants a bibliographic entry, but many do. The idea that unpublished status would categorically exclude the responsibility of citation is a bit preposterous. It puts far too much faith in the deeply flawed fraternity of conference organizers and the overworked cohort of peer reviewers, roughly 30% of whom typically fail to even comprehend the basic outline of the paper.

If similar work comes to our attention during a proper literature review, we ought to cite it. If we knowingly build on someone else’s work, we should cite it. If someone shares a non-obvious idea with us that develops into a paper, we should find some way to credit them. If someone writes a theory down on a napkin shortly before dying and it turns out to open a new subfield of machine learning to scientific inquiry, we should convert the napkin to a PDF, upload it to arXiv, and then cite it.

We should not have to cite nonsense. Many reviewers are abusing the system and asking for ridiculous comparisons to recently posted preprints. Bald-faced flag-planting should not be rewarded. And we should not be faulted by reviewers for failing to compare against two-week-old algorithms that may or may not work. But the very idea that arXiv papers would in general not need to be cited puts far too much faith in the fraught process of scientific publication and far too little importance on ideas themselves.

Author: Zachary C. Lipton

Zachary Chase Lipton is an assistant professor at Carnegie Mellon University. He is interested in both core machine learning methodology and applications to healthcare and dialogue systems. He is also a visiting scientist at Amazon AI, and has worked with Amazon Core Machine Learning, Microsoft Research Redmond, & Microsoft Research Bangalore.

11 thoughts on “Do I really have to cite an arXiv paper?”

  1. This is fine assuming that everyone is playing remotely fair; there are very disreputable people (and deluded / stupid ones who think that this is the way that research is actually done) who create papers with a kind of buzzword generator. This then gives them a (ridiculous) platform to argue for priority. In the past this was gatekept by refereeing (to some extent), but this will put the boot on the other foot. You will see conference committees colonized (over time) by these folks, and they will ruthlessly use them to block work from other groups.

    In some parts of the world academia consists of a patronage/favour network. This maintains and develops people’s careers and large institutions, but it is sclerotic and bad for science. Using the arXiv as you suggest will promote this kind of structure.

    It will do for science what blogging and social media have done for politics.

    1. These phenomena are real and we should fight against them. And I explicitly call out this nonsense for what it is in the article. But to suggest that we should not (ever) have to cite a paper that hasn’t passed through the gatekeepers is insane. Science doesn’t come from the journals. It’s fundamental. If you solve a fundamental math problem and write it on your bathroom mirror in your mother’s lipstick, snap a selfie and post it to Instagram, and then download the result and post it to arXiv, then if the work is really tight and you are the true discoverer, people have to cite you.

  2. Also: there is a massive disjunct between citation in the humanities and citation in science. People seem to have forgotten this. Ideas are two a penny; they literally do not matter *at all* and should not be cited. The person who introduces a concept to science deserves no credit whatsoever. What deserves credit is the provision of evidence or proof. Perhaps there should be an inversion: papers on arXiv that provide substantial evidence or proof, and that other papers then depend on, should be co-opted into proceedings or journals?

    1. Copied from reply above (or below? I do not know): Grisha Perelman’s proof of the Poincaré conjecture has never been (by him, to my knowledge) submitted to or published in any journal. He decided, as is his right, that he could not care less about the professional community of publishing mathematicians or their protocols. Does not invalidate his achievement.
      I should note, since I am not and do not expect to be the level of mathematician that Perelman is, I have not actually read his proof. So I defer to other superior mathematicians for this assessment and come by it as hearsay. 🙂

      1. Perelman wrote three papers that proved the Geometrisation Conjecture (arXiv:math/0211159, arXiv:math/0303109, arXiv:math/0307245), of which the Poincaré Conjecture was a special case. They are on the arXiv and nowhere else. Other (teams of) people have literally written whole books to explain what he did (e.g., https://www-fourier.ujf-grenoble.fr/~besson/book.pdf).

        In fact these papers are really post-famous: you almost don’t even need to cite them because everybody knows about them, much as nobody would cite Turing’s paper introducing his eponymous machines just because they are talking about a computer. But the principle stands that, if needed, they shouldn’t *not* be cited merely because they are only on the arXiv.

  3. As I read the article, I was overwhelmed by the fact that I had witnessed all of the problematic behavior described; it really is that widespread.

    Reviewer fraternities, blocking other groups’ papers, reviewers expected to review 20 papers in 2 weeks, grad students reviewing top venue papers, putting half-assed papers online to flag the field, asking people to cite papers that are 1-2 weeks old, doing 1/10th of the work and claiming the other 9 ideas in the conclusion… and the list goes on.

    I even came across an asshole of a rival-reviewer who forced us to cite his sub-par rival paper that was NOT EVEN ONLINE yet. He subsequently managed to change the other reviewers’ opinions and got the paper rejected at a top venue.

    So needless to say, I put no trust in anonymous peer-reviews. I think all reviews should be open and public. Sure, it would create some unnecessary discussions and conflicts at first but if you are confident enough and have solid reasons to accept or reject a paper in a top venue, you should have no problem writing your name as the reviewer. And eventually, discussions would become more sensible.

    I am not saying arXiv or traditional peer review is good or bad. I believe that as long as there are ultra-competitive, dishonest academics and this type of behavior is not publicly shamed, science and academia will continue to suffer. That in turn leads to science being privatized by companies, as it already is today in many fields.

    1. Grisha Perelman’s proof of the Poincaré conjecture has never been (by him, to my knowledge) submitted to or published in any journal. He decided, as is his right, that he could not care less about the professional community of publishing mathematicians or their protocols. Does not invalidate his achievement.
      I should note, since I am not and do not expect to be the level of mathematician that Perelman is, I have not actually read his proof. So I defer to other superior mathematicians for this assessment and come by it as hearsay. 🙂
