Topological acceleration

Mon, 01 Jul 2019

Science reproducibility: the software evolution problem

What is the point of publishing a scientific paper if an expert reader has to do so much extra work to independently reproduce the results that s/he is effectively discouraged from doing so?

Reproducibility: brief description

In the present practice of cosmology research, a paper tends to be accepted as "scientific" if the method is described clearly and in sufficient detail, and, in the case of an observational paper, if the observational data are publicly available. However, the modern concepts of free-licensed software and efficient management of software evolution via git repositories over the Internet, together with Internet communication in general, should in principle make it possible for an expert reader to reproduce the figures and tables of a research paper with just a small handful of terminal commands that download, compile and run scripts and programs provided by the authors of the research article. In practice, this makes it easier for other scientists to verify the method and results, and to improve on them, rather than forcing them to rewrite everything from scratch.
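As an illustration of what such a small handful of commands might look like in practice (the repository URL and script name here are hypothetical, invented for this sketch rather than taken from any actual paper):

    git clone https://example.org/reproducible-paper.git   # hypothetical repository
    cd reproducible-paper
    ./reproduce.sh   # downloads data, compiles the code, regenerates every figure and table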

This idea has been floating around for several years. A very nice summary and discussion by Mohammad Akhlaghi includes Akhlaghi's own aim of making a complete research paper reproducible with just a few lines of shell commands, and links to several reproducible astronomy papers from 2012 to 2018, most of them using complementary methods.

I tend to agree that using Makefiles is most likely to be the optimal overall strategy for reproducible papers. For the moment, I've used a single shell script in 1902.09064.
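For concreteness, here is a minimal sketch of what such a Makefile might look like. This is not the actual build setup of 1902.09064; all target, script and file names are invented, and recipe lines must begin with a tab character (shown here as wider indentation):

    # Minimal reproducible-paper Makefile sketch (hypothetical names throughout).
    all: paper.pdf

    software:
            ./download_software.sh      # fetch and compile pinned research-level code

    figures: software
            ./run_analysis.sh           # run the analysis scripts that produce fig1.pdf, fig2.pdf

    paper.pdf: paper.tex figures
            pdflatex paper && bibtex paper && pdflatex paper && pdflatex paper

    clean:
            rm -f fig1.pdf fig2.pdf paper.pdf *.aux *.bbl *.blg *.log

    .PHONY: all software figures clean

With this layout, a single "make" rebuilds everything from the downloaded software up to the final PDF, and "make clean" removes the generated products so that the whole cycle can be repeated.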

The software evolution problem

I suspect, unfortunately, that there is a fundamental dilemma in making fully reproducible papers that remain reproducible in the long term, because of software evolution. Akhlaghi's approach is to download and compile all the libraries needed by the authors' software, pinned to the specific versions that were used at the time of preparing the research paper. This would appear to solve the software evolution problem.

My approach, at least so far in 1902.09064, is to use the versions of all libraries and other software recommended by the native operating system (Debian GNU/Linux, in my case), to the extent that these are available, and to download and compile specific versions only of "research-level" software: software that is either not yet available in a standard GNU/Linux family operating system, or evolving too fast to be packaged in those systems.
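A sketch of what this hybrid logic can look like in a small shell script follows; the package-status check and the repository/commit names are illustrative assumptions (libgsl-dev is the package name in recent Debian releases), not the contents of the actual 1902.09064 script:

    #!/bin/sh
    # Hybrid approach sketch: prefer the distribution's library, pin research-level code.

    # (1) Native library: rely on the Debian-packaged GNU Scientific Library if installed.
    if dpkg -s libgsl-dev >/dev/null 2>&1; then
        echo "Using the Debian-packaged GNU Scientific Library."
    else
        echo "Please install libgsl-dev (e.g. apt-get install libgsl-dev)." >&2
        exit 1
    fi

    # (2) Research-level code: download and compile a frozen version pinned by commit hash.
    REPO=https://example.org/research-code.git           # hypothetical repository
    COMMIT=0123456789abcdef0123456789abcdef01234567      # hypothetical frozen commit
    git clone "$REPO" research-code
    cd research-code || exit 1
    git checkout "$COMMIT"
    ./configure && make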

Download everything: pro

  • exact reproducibility: Downloading as much of the software as possible, including the various libraries, at specific commit hashes (frozen versions) should, in principle, enable users at any later time to reproduce exactly what the authors claim they did to obtain their results. Tracing the origin of differing results is then easier than if these libraries are present in different versions.

Download everything: con

  • heaviness: Downloading the source code of libraries (such as the GNU Scientific Library) that are already integrated into a well-tested, stable operating system such as Debian GNU/Linux (stable), and recompiling them from scratch, can consume a lot of download bandwidth and CPU time. If the user wishes to repeat the cycle from scratch many times, this becomes prohibitive in terms of user patience and, in the context of the climate crisis, risks becoming unethical if the benefits are too modest compared to the costs in "carbon load".
  • security risks: Old versions of standard libraries contain science errors, software errors, and security bugs. Reproducing the science errors and software errors of the authors is acceptable, since the aim is to check the authors' claims. But running software with unfixed security flaws is unwise. Running user-space software that is out of date in terms of security is less risky than doing so for root-level software, but it is still a risk: once a cracker has obtained user-level remote access to a computer, escalating to root access is a much smaller challenge than getting initial access to the system.
  • dependency hell: "Science research" software depends on lower-level libraries (integrals, derivatives, interpolation, Monte Carlo methods) that themselves rely on lower-level numerical algorithm libraries, that need to be integrated with parallelisation libraries, and that in turn depend on the kernel and other system-level libraries. The FLOSS community is constantly improving and fixing these different software packages, as well as the package management systems. Some packages, or some of their functions, may become unmaintained and obsolete. There is a complex logical graph of dependencies between software packages, the complexity is unlikely to decrease, and the ecosystem of software is not going to stop evolving. A package that can successfully download and compile a few hundred MB of associated software source code and run correctly in 2019 might be extremely difficult to run in a standard software environment in 2029 or 2039. The user could be forced to deal with dependency hell in order to check the package from 2019.
  • inexact reproducibility?: Ten or twenty years after a research paper is published, how easy will it really be to provide an operating system and software environment identical to that used by the authors? There is such a diversity of GNU/Linux-like operating systems that few scientists will be interested in trying to emulate "antique" operating systems and software environments.

Prefer native libraries: pro

  • efficiency: In binary-by-default distributions such as Debian, the bandwidth and CPU loads for installing binary versions of libraries are much lighter than in the "download everything" approach.
  • security: Well-established software distribution communities such as Debian, with numerous quality-assurance pipelines and bug-management methods, will tend to provide high (though never perfect) standards of software security, as well as to correct science and software errors.
  • convenience: There is a complex logical graph of dependencies between software packages, and the complexity is unlikely to decrease. Using native libraries largely avoids dependency hell.

Prefer native libraries: con

  • faith in modularity: The "prefer native" approach effectively assumes that any bugs or science errors in the research-level software lie in the "science-level" code itself and are not the fault of libraries stable enough to be "native" in the operating system. But this might not always be the case: the fault might lie in the native library, and might either have been fixed, or have been introduced, in versions more recent than those used by the research article's authors.

Choosing an approach

While the "download everything" approach is, in principle, preferable in terms of hypothetical reproducibility, it risks being heavy, could have security risks, could be difficult due to dependency hell, and might in the long term not lead to exact reproducibility anyway, for practical reasons (leaving aside theoretical Turing machines). The "prefer native libraries" approach provides, in principle, less reproducibility, but it should be more efficient, secure and convenient, and, in practice, may be sufficient to trace bugs and science errors in scientific software.


Sat, 06 Apr 2019

Why non-use of ArXiv refs in a bibliography is unethical

It has become quasi-obligatory since the late 1990s for cosmology research articles to be posted at the ArXiv preprint server, making them publicly available under green open access. Many of the other astronomy, physics and mathematics articles needed for cosmology research are also available at ArXiv. In practice, this means that almost all post-mid-to-late-1990s literature cited in cosmology research articles is available on ArXiv.

Many of these articles are posted before external peer review by research journals, so they are literally "preprints"; others are posted after acceptance by a journal, but usually before they appear in the paper version of the journal (for those journals still printed on paper) or as online "officially published" articles. However, most of these "preprints" are cited before they are formally published, because they are hot-off-the-press, state-of-the-art results, or, to put it in plain English rather than advertising jargon, useful new results that need to be taken into account. Several journals, including MNRAS and A&A, insist on hiding the fact that references are easily obtainable without paywall blocks: they require that any reference with peer-reviewed bibliographic data have its ArXiv identifier removed from the list of references (bibliography) of a research paper!

The reason cited by colleagues (there doesn't seem to be a formal public justification by MNRAS/A&A) for excluding ArXiv identifiers from the bibliography for articles that are already formally published is to restrict citations as much as possible to the peer-reviewed literature. But this is nonsense: including both the peer-reviewed identifying information (year, journal name, volume, first page) and the ArXiv identifier informs the reader that the article is peer-reviewed, while also guaranteeing that the article is available to the reader (at least) under green open access. So that reason is unconvincing.

Another reason cited by colleagues is that the journal versions are more valid than the preprints, since the journal versions have usually been updated following peer-review and following language editor and proof-reader requests for corrections. This reason has some validity, but in practice is weak. Article authors quite frequently update their preprint on ArXiv to match the final accepted version of their article (in content, not in the particular details of layout, to reduce the chance of copyright complaints by the journals), because they know that many people will access the green open access version, and they want to reduce the risk that readers will refer to an out-of-date preprint version. Other authors only post their article on ArXiv once it is already accepted, in which case no significant revision is needed to match the content of the accepted version.

If the reasons for hiding ArXiv references are weak, what are the reasons for including ArXiv references?

  • For articles in journals that do not provide open access either immediately or after an embargo period (many cosmology journals, including JCAP, PRD and CQG, seem to keep all of their articles behind paywalls unless open access charges are paid by the authors at an appropriate step of submitting the article for publication), removing/omitting ArXiv references from a reference list blocks access to the research articles for:
    1. scientists (physicists, mathematicians) in institutes who do not pay for subscriptions to astronomy/cosmology journals;
    2. astronomers in institutes who do not pay for subscriptions to maths/physics journals containing articles with justifications of mathematical techniques or physics that is not published in astronomy journals;
    3. scientists (astronomers, physicists, mathematicians) in institutes/universities who do not pay for global subscriptions to the publishers of the journals referred to;
    4. scientists in poor countries who do not pay for any journal subscriptions at all;
    5. the general public — including former astronomy/cosmology students who retain an interest in cosmology research and have the competence to understand research articles — who do not have access to any research institute or university journal subscriptions.

    Arguments 1, 2, and 3 are practical problems; these researchers will generally know that they can search ArXiv and the ADS and after 30–120 seconds will find out if the article is available on ArXiv, or possibly by open access on the journal website.

    Argument 4 can be considered as pointing to a form of racism. There are several Nobel prizes explicitly related to Bose's contributions to physics, and Chandrasekhar actually got a Nobel prize rather than merely having his name cited in the topics of Nobel prize awards; but the reality of today's economic/political/sociological setup is that the budgets of many Indian astronomy research institutes are far lower than those of rich-country institutes, so excellent scientists of high international reputation, and their undergraduate and postgraduate students, have to do research without access to any paid journal subscriptions.

    Argument 5 could be considered as arrogance, elitism, and/or bad public relations in the Internet epoch.

  • A&A now has a short embargo (12 months?) for paywall blocks on articles, after which all articles become gold open access (with no extra charges to authors); MNRAS has a longer embargo, and other journals are under pressure to shift to open access. So what are arguments for including ArXiv identifiers for peer-reviewed articles that are available under open access by the publishers?

    1. It would require a lot of extra administrative effort by authors to update their .bib files depending on the dates on which articles become open access after an embargo;
    2. It would require a lot of extra administrative effort by authors to modify their .bib files to separate out journals whose articles are never open access from those with an embargo period;
    3. Authors at institutions with some or many journal subscriptions generally don't notice whether or not a cited article is behind a paywall, because the publishers' servers usually have IP filters that automatically recognise authors' computers as having authorisation to access the articles.
    4. Although big journal publishers can probably be relied on, to some degree, to maintain their article archives in the long term, we know that the group of people running ArXiv have solid experience in long-term archiving and backup (data storage redundancy) practices, and they have no conflict between commercial motivations and scientific aims.
    5. A typical article has anywhere from 30–100 or so references. Each of those also has 30–100 or so "second-level" references, and so on. Even if the n-th level references are to a large degree redundant, a complete survey of the third or fourth level of references could easily cover 1000–10,000 articles. Nobody is going to read that many background articles, or even their abstracts. Obviously, in practice, a reader can only trace back a modest number of references, and a modest number of references in those references. So for those articles that can, after a little effort, be found by the reader despite the ArXiv identifier being omitted, or as publisher-provided online articles, the hiding of the ArXiv identifier (and the lack of a clickable ArXiv link) increases the time it takes the reader to find the abstract and decide whether or not to read further. Even though the slowdown might only be an extra minute, multiplying that extra minute by the number of references to be potentially checked leads to a big number of minutes. Adding unnecessary "administrative" work for the reader is obstructive.

So that's why you should include ArXiv references in the bibliographies of your research articles. You can set up a LaTeX command so that, if the journal asks you to remove them, you do so at the final stage for your "official" version only, because you don't want to waste time trying to convince the journal about the ethical arguments above. But in your ArXiv versions and other versions that you might distribute to colleagues, you should favour the more ethical versions, which include the ArXiv references.
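As a sketch of such a LaTeX toggle (the macro names below are invented for illustration, and in practice the details depend on the bibliography style you use with your .bib file):

    % Hypothetical switch controlling whether ArXiv identifiers appear in the references.
    \newif\ifarxivrefs
    \arxivrefstrue   % change to \arxivrefsfalse for the journal's "official" version

    % Small wrapper that appends an ArXiv identifier to a reference when the switch is on.
    \newcommand{\arxivref}[1]{\ifarxivrefs\ [arXiv:#1]\fi}

    % Example use in a manually written bibliography entry (placeholder reference data):
    % \bibitem{Author2019} Author A., 2019, MNRAS, 480, 1234\arxivref{1902.09064}

A single \arxivrefsfalse at the final stage then produces the ArXiv-free "official" version without touching the rest of the bibliography.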


Tue, 30 Aug 2016

An extra Gigayear for the Universe?

Popular science descriptions of our present understanding of observational cosmology tend to say that we know the age of the Universe to be 13.80 gigayears, with an uncertainty of just 0.02 gigayears (20 megayears). But some of the oldest microlensed stars in the Galactic Bulge, within the central kiloparsec or so of our Galaxy, have best estimated ages of about 14.7 gigayears! In the figure at left, we show our analysis of the probability distribution of the age of the oldest of these stars. The thin curves show probability densities for the ages of individual stars; several of these peak between about 14.5 and 15 gigayears. The thick curve shows the distribution of the age of the oldest of these stars, obtained by choosing the individual star ages randomly according to their probability distributions. (This includes possible ages much lower than those shown in the figure; we take the full asymmetric distributions into account.) So could the Universe be a gigayear older than is generally thought? The uncertainties are still big, but this is certainly an exciting prospect for shifting towards a more physically motivated cosmological model.
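The original figure is not reproduced here, but the construction of the thick curve can be stated compactly. Assuming (my assumption in this summary) that the individual age estimates are treated as statistically independent, the cumulative distribution of the age of the oldest star is the product of the individual cumulative distributions:

    F_{\max}(t) = P\!\left( \max_i t_i \le t \right) = \prod_{i=1}^{N} F_i(t),
    \qquad
    p_{\max}(t) = \frac{\mathrm{d}F_{\max}}{\mathrm{d}t} = \sum_{i=1}^{N} p_i(t) \prod_{j \ne i} F_j(t),

where F_i and p_i are the cumulative distribution and the (asymmetric) probability density of the i-th star's age estimate; drawing each t_i at random from p_i and recording the maximum samples p_max by Monte Carlo.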

The more careful descriptions of the age of the Universe give a caveat, a warning of how or why the standard estimate might be wrong: the age estimate depends on fitting the observations with the standard ΛCDM model. Despite being the standard model of cosmology, ΛCDM makes a non-standard assumption about gravity. Instead of allowing space to curve differently in regions where matter collapses into galaxies versus places where the Universe becomes more empty, which is what Einstein's general relativity says should happen, the standard model is rigid (apart from uniform expansion). It doesn't allow general relativity to apply properly.

Several of us have been working on theoretical tools and observational analysis to see if we can apply general relativity better than the standard model does. At least so far, doing our homework generally tells us that the would-be mysterious "dark energy" is really, until or unless proven otherwise, just a misinterpretation of space recently becoming negatively curved (on average) as voids and galaxies have formed during the most recent several gigayears.

This is where the age of the Universe comes in. In our new paper, arXiv:1608.06004, my colleagues and I summarise some key numbers that we argue are needed by any of the "backreaction" models similar to ours, which allow space to curve as galaxies and voids form, as required by the Einstein equation of general relativity. These simple constraints show that by fitting a no-dark-energy flat model (the Einstein–de Sitter model) at early times, the age of the Universe should be somewhat less than 17.3 gigayears, and quite likely somewhat more than the ΛCDM estimate of 13.8 gigayears. So we looked at published observations of stellar ages, which individually still have big uncertainties, but together favour the oldest stars having ages of around 14.7 gigayears. As expected, this is somewhere in between the two limits of 13.8 and 17.3 gigayears.
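For orientation only, the 17.3 gigayear figure can be related to the standard age formula for an Einstein–de Sitter background; this is back-of-the-envelope arithmetic on my part, and the precise background Hubble constant argued for in arXiv:1608.06004 should be taken from the paper itself:

    t_0^{\mathrm{EdS}} = \frac{2}{3 H_0^{\mathrm{bg}}}
                 \approx \frac{2}{3} \times
                 \frac{977.8~\mathrm{Gyr}}{H_0^{\mathrm{bg}} / (\mathrm{km~s^{-1}~Mpc^{-1}})},

so an upper limit of about 17.3 gigayears corresponds to a background Hubble constant of roughly 37.7 km/s/Mpc, far below the roughly 70 km/s/Mpc inferred within ΛCDM.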

So will there be a race between detailed "backreaction" modellers and stellar observers: tight cosmological predictions of the age of the Universe versus accurate spectroscopic measurements of the ages of the oldest Galactic stars (which have to be younger than the Universe, of course)?

Barely had our paper become public on ArXiv when we were reminded by colleagues studying cosmic microwave background (CMB) observations with the Einstein–de Sitter, no-dark-energy, flat cosmological model at early times that they had also found an age of the Universe of something like 14.5 gigayears! The bottom-right panel of Figure 4 of arXiv:1012.3460 (PRD) shows our colleagues' estimates of the age of the Universe using CMB and type Ia supernova observations. Their most likely age is about 14.5 gigayears, give or take about half a gigayear. This is not so very different from the Galactic Bulge star best estimate! So we have very different, independent methods tending to give similar results. The uncertainties are still big, and this story is not closed. But an extra Gigayear for the age of the Universe may be a clue that helps shift from the precise ΛCDM cosmology to the upcoming generation of accurate cosmology...




content licence: CC-BY | blog tools: GNU/Linux, emacs, perl, blosxom