Evaluating preprints

I am hugely enthusiastic about communicating research by preprints. So naturally, I am happy to see the president and strategic advisers of one of the most elite funding institutes embrace preprints:

For centuries, publishing a scientific article was just about sharing the results. More recently, publishing research articles in a journal has served two distinct functions: (i) Public disclosure and (ii) Partial validation by peer-review (Vale & Hyman, 2016). The partial validation is sometimes followed up by strong validation: (iii) Independent reproduction and building upon the published work.

Preprints clearly can serve the first function, public disclosure. It has been less clear to me how to validate and curate the highly heterogeneous research that is published as preprints. I think this question remains open, though I have seen signs that some preprints are strongly validated (independently reproduced & built upon) even before the more conventional partial validation by peer-review.

For example, the methods and ideas underlying Single Cell ProtEomics by Mass Spectrometry (SCoPE-MS) were independently validated by multiple laboratories. Some presented their results at conferences before our preprint was peer-reviewed:

Several groups published their results after our preprint was published in a peer-reviewed journal, crediting the preprint for the ideas:

More (that I know of) are underway. All inspired by a preprint. I see this as a data point that preprints can receive strong validation even outside the boundaries of the peer-review system that has dominated our field for the last few decades. It’s not a complete solution for evaluating all preprints, but I think it is very encouraging evidence that preprints can be strongly validated even before the partial validation of peer review!

Single-cell analysis


Imaging is the most widely used method for single-cell analysis

The success of imaging technologies

The molecular and functional differences among the cells making up our bodies have been appreciated for many decades. Yet, the tools to study them were very limited. In the last couple of decades, we have begun developing increasingly powerful technologies for molecular single-cell measurements. Currently, the most widely used high-throughput methods for molecular single-cell analysis have two things in common: (1) they quantify nucleic acids and (2) they are based on imaging. The imaging can be done in situ (e.g., fluorescent in situ hybridization, FISH) or in vitro (e.g., single-cell RNA-seq based on next-generation DNA sequencing). Imaging has been applied to single-cell protein analysis as well, though most applications have been hampered by their dependence on antibodies. A recent break away from this antibody dependence is the single-molecule Edman degradation developed by the group of Edward Marcotte. If developed further, imaging could become a workhorse for single-cell protein analysis as well.

Emerging mass-spec methods

Efforts to apply mass-spectrometry to single-cell analysis started in the 1990s. As comprehensively reviewed by Rubakhin et al., these efforts focused on ionizing biological molecules via Secondary Ion MS (SIMS) or via Matrix-Assisted Laser Desorption/Ionization (MALDI). These methods can ionize biological molecules with minimal processing and losses but remain rather limited in their quantification accuracy and in identifying the chemical composition of the analyzed ions. In contrast, the methods that afford robust high-throughput identification (based on analyte separation and tandem MS analysis, e.g., LC-MS/MS or CE-MS/MS) have been very challenging to apply to small samples. Still, the typical mammalian cell contains thousands of metabolites and proteins whose abundance is well above the sensitivity of mass-spec instruments. Based on this realization, we outlined directions for multiplexed analysis of single cells by LC-MS/MS that can enable quantifying thousands of proteins across many thousands of single cells. We recently published a proof of principle that has been superseded by a higher-throughput single-cell proteomics method. These initial steps need much further development, both experimental and computational, before they reach the transformative potential that single-cell mass-spec could have.

 

Understanding biology

Single-cell analysis is not merely about measurements. It’s about understanding them. Our progress in understanding single-cell data has been limited, even for the data coming from the more mature technologies. Conceptual progress has been much slower than technological progress. So, how do we make sense of the data?

I will reserve my musings on this question for a forthcoming post. For now, I’ll just say that I like an idea articulated by Munsky et al., 2012 and Padovan-Merhar and Raj, 2013: using the variability between single cells as a natural perturbation for studying gene regulation. I think that this approach can be very powerful. More thoughts on that coming soon.

 

 

 

Missed citations

Have you seen a paper fail to cite a very relevant source? Chances are you have, and that you have felt, more than once, that your own work was not referenced when it should have been.

Authors may choose not to cite a reference for many reasons, some legitimate (e.g., they find the evidence unconvincing) and others less so (e.g., they believe that a reference will undermine the novelty of their work). If the latter sounds incredible to you, here is a quote:

 

I am sorry for not referencing your paper, but it would have undermined the novelty of our work. You know how Nature editors think.

 

Missed citations are hot potatoes. If you complain that your papers are not cited when you believe they should have been, most of your colleagues are unlikely to take you seriously. Indeed, authors are likely to be biased toward their own work and to seek more references in these citation-obsessed times. So why care about missed citations?

 

I think most scientists will agree that we should give credit where credit is due. So, how can we fight the “impact factor scoop”? Here is an idea:

We can start a PubPeer-style database in which everybody who has been a corresponding author on a PubMed paper can list missed citations to papers on which they are not an author. The latter restriction is essential to avoid the biases of authors who believe that a global conspiracy against them is the only reason why everybody is not citing them. Furthermore, the database should collect missed citations in a machine-readable form so that they can be analyzed more easily; a sketch of what such a record might look like follows below.
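
To make the machine-readable requirement concrete, here is a minimal sketch of what one record in such a database might look like. The field names, the use of DOIs and ORCID iDs, and the validation rule are illustrative assumptions on my part, not an established schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record for one missed citation; every field name here
# is an illustrative assumption, not an established standard.
@dataclass
class MissedCitation:
    citing_doi: str      # the paper that omitted the reference
    omitted_doi: str     # the relevant source that was not cited
    reporter_orcid: str  # the corresponding author filing the report
    rationale: str       # why the omitted source is relevant
    reporter_is_author_of_omitted: bool = False

def accept(record: MissedCitation) -> bool:
    """Enforce the rule proposed above: reports about papers the
    reporter authored are rejected to limit self-serving bias."""
    return not record.reporter_is_author_of_omitted

record = MissedCitation(
    citing_doi="10.1000/hypothetical.citing",
    omitted_doi="10.1000/hypothetical.omitted",
    reporter_orcid="0000-0000-0000-0000",
    rationale="Reports the same phenomenon with an independent method.",
)
if accept(record):
    print(json.dumps(asdict(record), indent=2))  # machine-readable output
```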

What do you think about the above idea? Do you have suggestions for other approaches that can improve citation practices?

 

Why I love preprints

I believe that preprints are a great medium for communicating and disclosing new research in the life sciences. Recently, Science magazine asked me why I am so enthusiastic about preprints and published a feature including some of my responses and some reasons why I love preprints. Below are more reasons why I believe that preprints hold much promise for improving the communication of biomedical research:

What do you get out of it? Have you gotten useful feedback (if so, via comments, Twitter, email, etc.)?
Preprints are great for sharing my latest research findings broadly. My first bioRxiv preprint has been viewed over 11,000 times and my preprint on single-cell proteomics (SCoPE-MS) of stem cells well over 19,000 times in just a few months. These numbers, alongside comments from leading scientists and prominent news coverage of my preprints, suggest that preprints can be very effective in communicating my results to the community and establishing priority. I believe preprints and journal articles can be equally effective (or ineffective) at establishing priority, depending on their quality, visibility and percolation through the community.

I have received feedback from senior professors, multiple suggestions for collaborations (some of which materialized), news coverage of my preprints, editorial invitations to submit my preprints to journals, and an invitation to apply for a faculty position at an elite institution.

Have you modified a manuscript as a result? Or is the benefit more getting your work out there/sharing earlier?
The benefits are many. One is modifying and improving a manuscript based on the feedback, which I have done. Another is getting a timestamp on my work, which makes me more willing to broadly share and present my results without concerns about being scooped.

Do you worry that posting a preprint could jeopardize publication in a journal?
No. Most prominent journals accept preprints; journals that do not, lose. In my opinion, the benefits of using preprints far outweigh the limited opportunity to publish in some journals. I believe that journals that do not embrace preprints will decline in prominence over time.

Do you have papers you have not posted as preprints? If not, why not?
I am committed to posting all papers from my laboratory as preprints. I am a coauthor on papers that were not posted as preprints, but I was not the lead author on them and did not make the final decision.

Do you think preprints will make journals obsolete, or do we still need peer reviewed journals?
Preprints are not aiming to side-step peer review. We need good peer review, more than ever, and preprints provide more opportunities for peer review, not fewer. We still need a formal system that can ensure peer review with minimal bias that is as transparent as possible, and successful journals will adapt to fulfill this need.

How do colleagues react when you try to get them to share preprints? Are some more receptive than others? Are there differences by age or field? Why do you think some are still reluctant?
Many of my colleagues are receptive, others less so. Importantly, I have not heard a cogent argument for why preprints are bad for science. The most frequent argument revolves around fear of losing priority of discovery. My usual question is: Have you ever felt that one of your peer-reviewed papers was not cited and given credit when it should have been? Publishing a paper in a peer-reviewed journal does not make it immune to scooping. I think that the quality and the visibility of a timestamped research article are more important for establishing priority than the exact time when it is peer reviewed. Of course, peer review and independent replication are essential for establishing the validity of any research article, but these can be separated in time from the first disclosure of the results.

I have noticed that in many fields, a lot of the papers are quantitative, modeling, etc. and not wet biology. Will it take longer for those doing lab experiments to embrace preprints? Why would they be more reluctant?
I have observed these differences across disciplines. I think they stem from differences in culture and technical skills. Preprints will percolate more slowly in some communities, but I am confident that they will continue to spread fast and eventually will be adopted by all.

Single-cell proteomics

Ever since my lab posted the SCoPE-MS preprint, I have been repeatedly asked about the future potential and the cost of quantifying proteins by high-throughput mass-spectrometry in single cells. I will summarize a few thoughts that hopefully will be helpful and will reduce email traffic.

Why quantify proteins and PTMs in single cells?

Single-cell RNA-seq has made great strides and has become a widely available and preferred method for high-throughput single-cell measurements. That is great! These measurements are very useful and their usefulness will continue to grow as we invent new ways to think about these data and reduce their noise. Yet, measuring transcript levels alone is insufficient for studying and understanding many physiological and pathological processes, not least because the changes of protein levels across human tissues and cell differentiation are poorly predicted by the corresponding changes in mRNA levels:

 

The usefulness of mRNA levels as surrogates for signaling activity mediated by post-translational modifications (PTMs, e.g., phosphorylation, ubiquitylation) is even more limited.
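
As a toy illustration of what “poorly predicted” means quantitatively, the sketch below computes how much of the across-gene variance in protein fold-changes is explained by mRNA fold-changes. The data are synthetic and the coupling and noise parameters are arbitrary choices of mine, not values from any study cited above:

```python
import random
import statistics

random.seed(0)

# Synthetic log2 fold-changes for 1,000 genes between two tissues.
# Protein changes track mRNA changes only weakly here; the 0.4 coupling
# and the unit noise are arbitrary, illustrative parameters.
mrna = [random.gauss(0, 1) for _ in range(1000)]
protein = [0.4 * m + random.gauss(0, 1) for m in mrna]

r = statistics.correlation(mrna, protein)  # Pearson correlation (Python 3.10+)
print(f"r = {r:.2f}; mRNA explains ~{100 * r**2:.0f}% "
      f"of the protein fold-change variance")
```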

 

What is the history of single-cell proteomics by mass-spectrometry?

Quantifying proteins in single cells directly, without relying on antibodies, has been a long-standing aim and dream for many scientists. There are over a dozen reports of doing so over the last decade, but they all used cells with over 1,000-fold larger volumes than the typical mammalian cell (e.g., muscle cells and oocytes) and quantified only a few proteins in a few cells. To my knowledge, SCoPE-MS is the first method to quantify over a thousand proteins across hundreds of mammalian cells with typical cell sizes, i.e., diameters of 10–15 μm.

 

How expensive is it to do SCoPE-MS?

This question comes up frequently. The answer depends a lot on what we factor into the price. If you own a suitable high-resolution MS instrument/system, the current cost is about a dollar per cell, but very soon that will drop significantly; stay tuned for our next preprint. If you do not own a suitable high-resolution MS instrument, the price depends on the service charges of your preferred MS facility; a sketch of the arithmetic is below. The cost of a suitable instrument ranges from ~$100k (a low-end refurbished instrument) to ~$700k (the high-end benchtop instruments on the market, new).
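
For readers estimating the facility-based cost themselves, the arithmetic is a simple division. The sketch below is a back-of-the-envelope helper; the hourly rate, run time, and cells-per-run values are hypothetical placeholders, not actual quotes from any facility:

```python
def cost_per_cell(hourly_rate_usd: float, run_hours: float,
                  cells_per_run: int) -> float:
    """Per-cell cost of one multiplexed LC-MS/MS run.
    All inputs are hypothetical; substitute your facility's numbers."""
    return hourly_rate_usd * run_hours / cells_per_run

# E.g., a $100/hour service charge, a 2-hour run, and 8 single cells
# quantified per multiplexed run (all assumed numbers):
print(f"~${cost_per_cell(100, 2, 8):.2f} per cell")
```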

 

How easy is it to do SCoPE-MS?

For our lab, quite easy. I am proud of the fact that SCoPE-MS is enabled by a simple idea and not by access to the newest corporate technology with limited accessibility. We used an old, low-end instrument for developing SCoPE-MS. We are writing up more detailed protocols and hope to release a robust data-processing pipeline soon. Anyway, there is nothing particularly tricky in the method, and I expect that any good lab should be able to quantify single-cell proteomes by SCoPE-MS.

 

How noisy are the data?

Like all methods using tandem mass tags, SCoPE-MS measurements are affected by co-isolation interference, which means that about 5–10% of the reporter-ion signal for a typical peptide comes from other peptides. This undesirable contribution can be reduced by using newer instruments with better mass filters that allow for smaller ion-isolation windows. It can also be reduced by simply filtering out peptides with more co-isolation and focusing on those with very limited co-isolation, or by computationally compensating for it.
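
One common way to implement the filtering strategy mentioned above is to threshold on the fraction of the isolation-window signal that comes from the target precursor (reported, for example, as the parent ion fraction, PIF, by MaxQuant). A minimal sketch, assuming peptide-level results already parsed into dictionaries; the 0.9 threshold is a judgment call, not a prescribed value:

```python
# Keep peptides whose parent ion fraction (PIF) indicates limited
# co-isolation: PIF >= 0.9 means <= 10% of the isolation-window
# signal comes from peptides other than the target.
def filter_coisolation(peptides, min_pif=0.9):
    return [p for p in peptides if p.get("pif", 0.0) >= min_pif]

peptides = [
    {"sequence": "LVNELTEFAK", "pif": 0.96},
    {"sequence": "YLYEIAR", "pif": 0.72},  # too much co-isolation
]
print(filter_coisolation(peptides))  # keeps only the first peptide
```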

There is, of course, also nonsystematic (random) noise. In our current data (Supplemental Figure 2c), the reliability of the measurements for the proteins with the smallest fold changes is over 50% and for those with the largest fold changes, about 80%. The reliability is higher for data acquired on the newer instruments that use high-quality quadrupole mass filters, i.e., Q Exactive Orbitraps.

 

Can you measure post-translational modifications (PTMs)?

Yes, we can. Stay tuned for the preprint.

 

What is the future potential for building on SCoPE-MS?

That is my favorite question! We have outlined ideas and technologies that can advance single cell proteomics methods by several orders of magnitude. In short:

  • Throughput: The throughput will grow as we increase the number of mass-tags. These should go up to 16 in the fall. As the demand for single-cell proteomics increases, Thermo or the community will come up with much higher plex. Since MS makes its measurements on groups of identical ions (not individual molecules, as in the case of next-generation DNA sequencing), higher multiplexing will increase the number of quantified samples without affecting the depth of coverage. Higher multiplexing will also reduce the need for the carrier channel: first by reducing the number of carrier cells and ultimately by eliminating the need for them altogether. A toy calculation of this scaling is sketched after this list.
  • Accuracy: SCoPE-MS does minimal processing of the samples, and the measurement is based on hundreds, even thousands, of ions for the quantification of each peptide in each cell. There are no fundamental limits to achieving very high accuracy. Since proteins are much more abundant than mRNAs (on average, over 1,000 protein molecules per mRNA), counting low-copy-number molecules or ions is much less problematic than in single-cell RNA sequencing. As we improve our ability to deliver and capture all ions, we should be able to measure even the least abundant proteins and expand the depth of coverage tremendously. This is not just a distant promise. I think it is an imminent possibility.
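
Here is the toy throughput calculation referenced above. It assumes one multiplexed set per LC-MS/MS run, a single carrier channel, and round-the-clock operation; the 2-hour run time and the 32-plex value are illustrative assumptions, not announced products:

```python
def cells_per_day(n_tags: int, carrier_channels: int = 1,
                  run_hours: float = 2.0) -> float:
    """Single cells quantified per day on one instrument, assuming one
    multiplexed set per run; all parameters are illustrative."""
    runs_per_day = 24 / run_hours
    return (n_tags - carrier_channels) * runs_per_day

# 10-plex (roughly today's tags), 16-plex (expected), 32-plex (hypothetical):
for n in (10, 16, 32):
    print(f"{n}-plex: ~{cells_per_day(n):.0f} cells/day")
```

The point of the sketch is that the depth of coverage per run stays the same, so the number of cells quantified per day scales roughly linearly with the number of tags.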

CSHL Meeting: Single Cell Analyses 2017

 

Inadvertent Support

Every day, thousands of colleagues use social media to express surprise, dislike, or even outrage at the impact factor, at articles in luxury journals, at closed access, at Trump, and so on and so forth. This voluminous response carries a powerful message about the influence and visibility of what is criticized; this response and its hyperlinks tell internet search engines just how influential, and thus how highly ranked, the criticized pages should be. It is a self-defeating response, a response providing strong and vital support for the nemeses. This support is unintended and inadvertent but powerful.

There is another option. Focus on spreading and sharing what you like and admire, i.e., what is worth sharing. Whether that is a great paper in a luxury journal or a great paper in a less visible journal, share it for its own merits. Emphasize the good, since the bad is not worth your time or my time, or the high rank that search engines will give it. And what about the transgressions that you find outrageous? Ignoring them is a far more powerful and effective message than honoring them with your attention. They do not deserve attention. Consider this:

Ellsworth Toohey: There’s the building that should have been yours. There are buildings going up all over the city which are great chances refused and given to incompetent fools. You’re walking the streets while they’re doing the work that you love but cannot obtain. This city is closed to you. It is I who have done it! Don’t you want to know my motive?
Howard Roark: No!
Ellsworth Toohey: I’m fighting you and shall fight you in every way I can.
Howard Roark: You’re free to do what you please!
Ellsworth Toohey: Mr. Roark, we’re alone here. Why don’t you tell me what you think of me in any words you wish.
Howard Roark: But I don’t think of you!
[Roark walks away and Toohey’s head slumps down]

– The Fountainhead

Magnanimity pays off

Earlier this year, I read an inspiring recollection (by Sydney Brenner) of a grand scientific milestone: the elucidation of the genetic code. How do DNA nucleotides code for the amino-acid sequence of proteins? This fundamental question had captivated numerous scientists, including Francis Crick and Sydney Brenner. The punchline of this wonderful interview/recollection is a magnanimous act by Francis Crick:

In August 1961, more than 5,000 scientists came to Moscow for five days of research talks at the International Congress of Biochemistry. A couple of days in, Matt Meselson, a friend of Crick’s, told him the news: The first word of the genetic code had been solved, by somebody else. In a small Friday afternoon talk at the Congress, in a mostly empty room, Marshall Nirenberg—an American biochemist and a complete unknown to Crick and Brenner—reported that he had fed a single repeated letter into a system for making proteins, and had produced a protein made of repeating units of just one of the amino acids. The first word of the code was solved. And it was clear that Nirenberg’s approach would soon solve the entire code.

Here’s where I like to imagine what I would have done if I were Crick. For someone driven solely by curiosity, Nirenberg’s result was terrific news: The long-sought answer was arriving. The genetic code would be cracked. But for someone with the human urge to attach one’s name to discoveries, the news could not have been worse. Much of nearly a decade’s worth of Crick and Brenner’s work on the coding problem was about to be made redundant.

I’d like to believe I would have reacted honorably. I wouldn’t have explained away Nirenberg’s finding to myself, concocting reasons why it wasn’t convincing. I wouldn’t have returned to my lab and worked a little faster to publish my own work sooner. I’ve seen scientists react like this to competition. I’d like to believe that I would have conceded defeat and congratulated Nirenberg. Of course, I’ll never know what I would have done.

Crick’s response was, to me, remarkable and exemplary. He implored Nirenberg to give his talk again, this time to announce the result to more than 1,000 people in a large symposium that Crick was chairing. Crick’s Moscow meeting booklet survives as an artifact of his decision, with a hand-written “Nirenberg” in blue ink, and a long arrow inserting into an already-packed schedule the scientist who had just scooped him. And when Nirenberg reached the stage, he reported that his lab had just solved a second word of the code.

by Bob Goldstein 

I admire Crick’s reaction. It is very honorable. In the long run, it helped both science and Crick’s reputation. Nirenberg had a correct result, and sooner or later he was going to receive credit for it. Crick facilitated this process, and in doing so he only added to his own credit. Our current admiration for Crick’s reaction at the Moscow conference is the only proof I need.

Any interpretation that sees Crick’s magnanimous act as being good only for science but bad for Crick’s personal reputation is myopic; it misses the long run. It misses my (and hopefully your) opinion of Crick’s magnanimous act.

Premature human engineering

The news buzz is alive with excitement about human genome editing, even human germline engineering. Successful germline engineering requires (1) a technology for editing DNA safely and (2) knowledge of what to edit and how to edit it, based on understanding the underlying biology. We are approaching (1), which is the easier part; we do not have (2), and we are far from achieving it for most desired “edits”.

A huge hurdle to germline engineering is that, beyond a few simple cases, our understanding does not allow achieving desired effects while avoiding unintended consequences. Unlike DNA sequencing, silicon chips and DNA editing, our understanding of complex combinatorial multi-gene interactions has made very little progress over the last few decades. Until we make more progress and understand gene interactions and the respective health outcomes better, germline engineering is akin to medieval quack therapies: based on the technology to bleed patients and feed them various concoctions, but with very limited understanding of the medical consequences, and with plenty of unintended consequences. We can fix the unintended consequences later, and then fix the unintended consequences from the fixing, and we will keep trying!

Deceptive Numbers

You want to estimate an important quantity. You compute an exact number purporting to estimate it. You compute another exact number purporting to estimate it. The two numbers differ significantly. The only logical conclusion is that these estimates are less exact than they seem.

This clearly seems to be the case with the notion of “impact” as quantified by different metrics that purport to estimate the same quantity from the same data:

[Figure: journal impact number vs. Google rank influence]

Perhaps a good antidote to innumeracy is playing with data interactively. So, you can search these data interactively and find for yourself how different metrics of impact may differ by over 300%!
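
One reason metrics computed from the same citation data can diverge so much: citation distributions are heavily skewed, so a mean-based metric (like the journal impact factor) and a rank- or median-based metric can disagree severalfold. A toy sketch with synthetic citation counts; the values are invented to show the skew, not taken from any real journal:

```python
import statistics

# Synthetic per-article citation counts for one journal over two years;
# the long tail (one highly cited article) is typical, the values are not real.
citations = [0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 8, 120]

mean_like = statistics.mean(citations)      # impact-factor-like (mean) metric
median_like = statistics.median(citations)  # rank/median-based alternative
print(f"mean: {mean_like:.1f}, median: {median_like:.1f}, "
      f"ratio: {mean_like / median_like:.1f}x")  # ~6x disagreement here
```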

 

Increasingly direct evidence

The results in our Cell report are particularly satisfying to me since they bring clarity to a puzzle that I have pursued for almost a decade. The puzzle started with an observation that I made while a graduate student in the Botstein laboratory at Princeton University.

As growth rate increases, RPs are transcriptionally induced to varying degrees; some are even repressed.

I studied the transcriptional responses of yeast cells growing across a wide range of growth rates. These data allowed us to evaluate a suggestion that Ole Maaløe had proposed for bacteria over 30 years earlier: cells growing faster should induce the transcription of ribosomal proteins, since they need to make more ribosomes to meet the increased demands for protein synthesis. While most mRNAs coding for ribosomal proteins (RPs) exhibited this logical trend (their levels increased with the growth rate), others did not. The RP transcript levels that deviated from the expectation were reproducible across biological replicates and even across the different nutrient limitations used to control the cell growth rate. Furthermore, the number of RP transcripts defying the expectations was even larger when I grew the yeast cells on an ethanol carbon source. I also observed uncorrelated variability in RP transcripts across human cancers, but this observation was based on public data without biological replicates and with many confounding factors.

My observations of differential RP transcriptional induction puzzled me deeply. According to the decades-old model of the ribosome, each ribosome has exactly one copy of each core RP. Thus, the simplest mechanism for making more ribosomes is to induce the transcription of each RP by the same amount, not to induce some RPs and repress others. Still, biology often defies simplistic expectations; one can easily imagine that RP levels are controlled mostly post-transcriptionally. Transcript levels for RPs were enough to pique my curiosity but ultimately too indirect to serve as evidence for the protein composition of the ribosomes. Thus, I neglected the large differences in RP transcriptional responses and interpreted our data within the satisfyingly simple framework suggested by Ole Maaløe. Many other research groups have also reported differential transcription of RP genes, but these observations have the same limitations as my transcriptional data.

The puzzle remained latent in my mind until, years later, I quantified the yeast proteome by mass-spectrometry as part of investigating trade-offs of aerobic glycolysis. This time, the clue for altered protein composition of the ribosomes was at the level of the ribosomal proteins, not their transcripts. While still indirect and inconclusive, I found this observation compelling. It motivated me to design experiments specifically aiming to find out whether the protein composition of the ribosome can vary within a cell and across growth conditions.

The data from these experiments showed that unperturbed cells build ribosomes with different protein compositions that depend both on the number of ribosomes bound per mRNA and on the growth conditions. I find this an exciting result because it opens the door to conceptual questions such as: What is the extent, scope and specificity of ribosome-mediated translational regulation? What are the advantages of regulating gene expression by modulating the ribosomal composition, as compared to the other layers of gene regulation, from histone modifications through RNA processing to protein degradation? Do altered ribosomal compositions impose trade-offs, such as higher translational accuracy at the expense of lower translation-elongation rates via more kinetic proofreading? Some of these questions may (hopefully will) reveal general principles. These questions are fascinating to speculate about, but they can also be answered by direct measurements. Designing experiments that can rigorously explore and discriminate among different conceptual models should be a lot of fun!

 

CSHL Translational Control Meeting 2016