Topological acceleration

Mon, 01 Jul 2019

Science reproducibility: the software evolution problem

What is the point of publishing a scientific paper if an expert reader has to do so much extra work to independently reproduce the results that s/he is effectively discouraged from doing so?

Reproducibility: brief description

In the present practice of cosmology research, a paper tends to be accepted as "scientific" if the method is described clearly and in sufficient detail, and, in the case of an observational paper, if the observational data are publicly available. However, the modern concepts of free-licensed software and efficient management of software evolution via git repositories over the Internet, together with Internet communication in general, should in principle make it possible for an expert reader to reproduce the figures and tables of a research paper with just a small handful of terminal commands that download, compile and run scripts and programs provided by the authors of the research article. In practice, this makes it easier for other scientists to verify the method and results, and to improve on them, rather than forcing them to rewrite everything from scratch.
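As an illustration of what such a small handful of commands might look like in practice (the repository URL and script name here are hypothetical, invented for this sketch rather than taken from any actual paper):

    git clone https://example.org/reproducible-paper.git   # hypothetical repository
    cd reproducible-paper
    ./reproduce.sh   # downloads data, compiles the code, regenerates every figure and table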

This idea has been floating around for several years. A very nice summary and discussion by Mohammad Akhlaghi includes Akhlaghi's own aim of making a complete research paper reproducible with just a few lines of shell commands, and links to several reproducible astronomy papers from 2012 to 2018, most of them using complementary methods.

I tend to agree that using Makefiles is most likely to be the optimal overall strategy for reproducible papers. For the moment, I've used a single shell script in 1902.09064.
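For concreteness, here is a minimal sketch of what such a Makefile might look like. This is not the actual build setup of 1902.09064; all target, script and file names are invented, and recipe lines must begin with a tab character (shown here as wider indentation):

    # Minimal reproducible-paper Makefile sketch (hypothetical names throughout).
    all: paper.pdf

    software:
            ./download_software.sh      # fetch and compile pinned research-level code

    figures: software
            ./run_analysis.sh           # run the analysis scripts that produce fig1.pdf, fig2.pdf

    paper.pdf: paper.tex figures
            pdflatex paper && bibtex paper && pdflatex paper && pdflatex paper

    clean:
            rm -f fig1.pdf fig2.pdf paper.pdf *.aux *.bbl *.blg *.log

    .PHONY: all software figures clean

With this layout, a single "make" rebuilds everything from the downloaded software up to the final PDF, and "make clean" removes the generated products so that the whole cycle can be repeated.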

The software evolution problem

I suspect, unfortunately, that there is a fundamental dilemma in making fully reproducible papers that remain reproducible in the long term, because of software evolution. Akhlaghi's approach is to download and compile all the libraries needed by the authors' software, pinned to the specific versions that were used at the time of preparing the research paper. This would appear to solve the software evolution problem.

My approach, at least so far in 1902.09064, is to use the versions of all libraries and other software recommended by the native operating system (Debian GNU/Linux, in my case), to the extent that these are available, and to download and compile specific versions only of "research-level" software: software that is either not yet available in a standard GNU/Linux family operating system, or evolving too fast to be packaged in those systems.
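A sketch of what this hybrid logic can look like in a small shell script follows; the package-status check and the repository/commit names are illustrative assumptions (libgsl-dev is the package name in recent Debian releases), not the contents of the actual 1902.09064 script:

    #!/bin/sh
    # Hybrid approach sketch: prefer the distribution's library, pin research-level code.

    # (1) Native library: rely on the Debian-packaged GNU Scientific Library if installed.
    if dpkg -s libgsl-dev >/dev/null 2>&1; then
        echo "Using the Debian-packaged GNU Scientific Library."
    else
        echo "Please install libgsl-dev (e.g. apt-get install libgsl-dev)." >&2
        exit 1
    fi

    # (2) Research-level code: download and compile a frozen version pinned by commit hash.
    REPO=https://example.org/research-code.git           # hypothetical repository
    COMMIT=0123456789abcdef0123456789abcdef01234567      # hypothetical frozen commit
    git clone "$REPO" research-code
    cd research-code || exit 1
    git checkout "$COMMIT"
    ./configure && make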

Download everything: pro

  • exact reproducibility: Downloading as much of the software as possible, including the various libraries, at specific commit hashes (frozen versions) should, in principle, enable users at any later time to reproduce exactly what the authors claim they did to obtain their results. Tracing the origin of differing results is then easier than if these libraries are present in different versions.

Download everything: con

  • heaviness: Downloading the source code of libraries (such as the GNU Scientific Library) that are already integrated into a well-tested, stable operating system such as Debian GNU/Linux (stable), and recompiling them from scratch, can consume a lot of download bandwidth and CPU time. If the user wishes to repeat the cycle from scratch many times, this becomes prohibitive in terms of user patience and, in the context of the climate crisis, risks becoming unethical if the benefits are too modest compared to the costs in "carbon load".
  • security risks: Old versions of standard libraries contain science errors, software errors, and security bugs. Reproducing the science errors and software errors of the authors is acceptable, since the aim is to check the authors' claims. But running software with unfixed security flaws is unwise. Running user-space software that is out of date in terms of security is less risky than doing so for root-level software, but it is still a risk: once a cracker has obtained user-level remote access to a computer, escalating to root access is a much smaller challenge than getting initial access to the system.
  • dependency hell: "Science research" software depends on lower-level libraries (integrals, derivatives, interpolation, Monte Carlo methods) that themselves rely on lower-level numerical algorithm libraries, that need to be integrated with parallelisation libraries, and that in turn depend on the kernel and other system-level libraries. The FLOSS community is constantly improving and fixing these different software packages, as well as the package management systems. Some packages, or some of their functions, may become unmaintained and obsolete. There is a complex logical graph of dependencies between software packages, the complexity is unlikely to decrease, and the ecosystem of software is not going to stop evolving. A package that can successfully download and compile a few hundred MB of associated software source code and run correctly in 2019 might be extremely difficult to run in a standard software environment in 2029 or 2039. The user could be forced to deal with dependency hell in order to check the package from 2019.
  • inexact reproducibility?: Ten or twenty years after a research paper is published, how easy will it really be to provide an operating system and software environment identical to that used by the authors? There is such a diversity of GNU/Linux-like operating systems that few scientists will be interested in trying to emulate "antique" operating systems and software environments.

Prefer native libraries: pro

  • efficiency: In binary-by-default distributions such as Debian, the bandwidth and CPU loads for installing binary versions of libraries are much lighter than in the "download everything" approach.
  • security: Well-established software distribution communities such as Debian, with numerous quality-assurance pipelines and bug-management methods, will tend to provide high (though never perfect) standards of software security, as well as to correct science and software errors.
  • convenience: There is a complex logical graph of dependencies between software packages, and the complexity is unlikely to decrease. Using native libraries largely avoids dependency hell.

Prefer native libraries: con

  • faith in modularity: The "prefer native" approach effectively assumes that any bugs or science errors in the research-level software lie in the "science-level" code itself and are not the fault of libraries stable enough to be "native" in the operating system. But this might not always be the case: the fault might lie in the native library, and might either have been fixed, or have been introduced, in versions more recent than those used by the research article's authors.

Choosing an approach

While the "download everything" approach is, in principle, preferable in terms of hypothetical reproducibility, it risks being heavy, could have security risks, could be difficult due to dependency hell, and might in the long term not lead to exact reproducibility anyway, for practical reasons (leaving aside theoretical Turing machines). The "prefer native libraries" approach provides, in principle, less reproducibility, but it should be more efficient, secure and convenient, and, in practice, may be sufficient to trace bugs and science errors in scientific software.


Sat, 06 Apr 2019

Why non-use of ArXiv refs in a bibliography is unethical

It has become quasi-obligatory since the late 1990s for cosmology research articles to be posted at the ArXiv preprint server, making them publicly available under green open access. Many of the other astronomy, physics and mathematics articles needed for cosmology research are also available at ArXiv. In practice, this means that almost all post-mid-to-late-1990s literature cited in cosmology research articles is available on ArXiv.

Many of these articles are posted before external peer review by research journals, so they are literally "preprints"; others are posted after acceptance by a journal, but usually before they appear in the paper version of the journal (for those journals still printed on paper) or as online "officially published" articles. However, most of these "preprints" are cited before they are formally published, because they are hot-off-the-press, state-of-the-art results, or, to put it in plain English rather than advertising jargon, useful new results that need to be taken into account. Several journals, including MNRAS and A&A, insist on hiding the fact that references are easily obtainable without paywall blocks: they require that any reference with peer-reviewed bibliographic data have its ArXiv identifier removed from the list of references (bibliography) of a research paper!

The reason cited by colleagues (there doesn't seem to be a formal public justification by MNRAS/A&A) for excluding ArXiv identifiers from the bibliography for articles that are already formally published is to restrict citations as much as possible to the peer-reviewed literature. But this is nonsense: including both the peer-reviewed identifying information (year, journal name, volume, first page) and the ArXiv identifier informs the reader that the article is peer-reviewed, while also guaranteeing that the article is available to the reader (at least) under green open access. So that reason is unconvincing.

Another reason cited by colleagues is that the journal versions are more valid than the preprints, since the journal versions have usually been updated following peer-review and following language editor and proof-reader requests for corrections. This reason has some validity, but in practice is weak. Article authors quite frequently update their preprint on ArXiv to match the final accepted version of their article (in content, not in the particular details of layout, to reduce the chance of copyright complaints by the journals), because they know that many people will access the green open access version, and they want to reduce the risk that readers will refer to an out-of-date preprint version. Other authors only post their article on ArXiv once it is already accepted, in which case no significant revision is needed to match the content of the accepted version.

If the reasons for hiding ArXiv references are weak, what are the reasons for including ArXiv references?

  • For articles in journals that do not provide open access either immediately or after an embargo period (many cosmology journals, including JCAP, PRD and CQG, seem to keep all of their articles behind paywalls unless open access charges are paid by the authors at an appropriate step of submitting the article for publication), removing/omitting ArXiv references from a reference list blocks access to the research articles for:
    1. scientists (physicists, mathematicians) in institutes who do not pay for subscriptions to astronomy/cosmology journals;
    2. astronomers in institutes who do not pay for subscriptions to maths/physics journals containing articles with justifications of mathematical techniques or physics that is not published in astronomy journals;
    3. scientists (astronomers, physicists, mathematicians) in institutes/universities who do not pay for global subscriptions to the publishers of the journals referred to;
    4. scientists in poor countries who do not pay for any journal subscriptions at all;
    5. the general public — including former astronomy/cosmology students who retain an interest in cosmology research and have the competence to understand research articles — who do not have access to any research institute or university journal subscriptions.

    Arguments 1, 2, and 3 are practical problems; these researchers will generally know that they can search ArXiv and the ADS and after 30–120 seconds will find out if the article is available on ArXiv, or possibly by open access on the journal website.

    Argument 4 can be considered as pointing to a form of racism. There are several Nobel prizes explicitly related to Bose's contributions to physics, and Chandrasekhar actually got a Nobel prize rather than merely having his name cited in the topics of Nobel prize awards; but the reality of today's economic/political/sociological setup is that the budgets of many Indian astronomy research institutes are far lower than those of rich-country institutes, so excellent scientists of high international reputation, and their undergraduate and postgraduate students, have to do research without access to any paid journal subscriptions.

    Argument 5 could be considered as arrogance, elitism, and/or bad public relations in the Internet epoch.

  • A&A now has a short embargo (12 months?) for paywall blocks on articles, after which all articles become gold open access (with no extra charges to authors); MNRAS has a longer embargo, and other journals are under pressure to shift to open access. So what are arguments for including ArXiv identifiers for peer-reviewed articles that are available under open access by the publishers?

    1. It would require a lot of extra administrative effort by authors to update their .bib files depending on the dates on which articles become open access after an embargo;
    2. It would require a lot of extra administrative effort by authors to modify their .bib files to separate out journals whose articles are never open access from those with an embargo period;
    3. Authors at institutions with some or many journal subscriptions generally don't notice whether or not a cited article is behind a paywall, because the publishers' servers usually have IP filters that automatically recognise authors' computers as having authorisation to access the articles.
    4. Although big journal publishers can probably be relied on, to some degree, to maintain their article archives in the long term, we know that the group of people running ArXiv have solid experience in long-term archiving and backup (data storage redundancy) practices, and they have no conflict between commercial motivations and scientific aims.
    5. A typical article has anywhere from 30–100 or so references. Each of those also has 30–100 or so "second-level" references, and so on. Even if the n-th level references are to a large degree redundant, a complete survey of the third or fourth level of references could easily cover 1000–10,000 articles. Nobody is going to read that many background articles, or even their abstracts. Obviously, in practice, a reader can only trace back a modest number of references, and a modest number of references in those references. So for those articles that can, after a little effort, be found by the reader despite the ArXiv identifier being omitted, or as publisher-provided online articles, the hiding of the ArXiv identifier (and the lack of a clickable ArXiv link) increases the time it takes the reader to find the abstract and decide whether or not to read further. Even though the slowdown might only be an extra minute, multiplying that extra minute by the number of references to be potentially checked leads to a big number of minutes. Adding unnecessary "administrative" work for the reader is obstructive.

So that's why you should include ArXiv references in the bibliographies of your research articles. You can set up a LaTeX command so that, if the journal asks you to remove them, you do so at the final stage for your "official" version only, because you don't want to waste time trying to convince the journal about the ethical arguments above. But in your ArXiv versions and other versions that you might distribute to colleagues, you should favour the more ethical versions, which include the ArXiv references.
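As a sketch of such a LaTeX toggle (the macro names below are invented for illustration, and in practice the details depend on the bibliography style you use with your .bib file):

    % Hypothetical switch controlling whether ArXiv identifiers appear in the references.
    \newif\ifarxivrefs
    \arxivrefstrue   % change to \arxivrefsfalse for the journal's "official" version

    % Small wrapper that appends an ArXiv identifier to a reference when the switch is on.
    \newcommand{\arxivref}[1]{\ifarxivrefs\ [arXiv:#1]\fi}

    % Example use in a manually written bibliography entry (placeholder reference data):
    % \bibitem{Author2019} Author A., 2019, MNRAS, 480, 1234\arxivref{1902.09064}

A single \arxivrefsfalse at the final stage then produces the ArXiv-free "official" version without touching the rest of the bibliography.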


Tue, 30 Aug 2016

An extra Gigayear for the Universe?

Popular science descriptions of our present understanding of observational cosmology tend to say that we know the age of the Universe to be 13.80 gigayears, with an uncertainty of just 0.02 gigayears (20 megayears). But some of the oldest microlensed stars in the Galactic Bulge, within the central kiloparsec or so of our Galaxy, have best estimated ages of about 14.7 gigayears! In the figure at left, we show our analysis of the probability distribution of the age of the oldest of these stars. The thin curves show probability densities for the ages of individual stars; several of these peak between about 14.5 and 15 gigayears. The thick curve shows the distribution of the age of the oldest of these stars, obtained by choosing the individual star ages randomly according to their probability distributions. (This includes possible ages much lower than those shown in the figure; we take the full asymmetric distributions into account.) So could the Universe be a gigayear older than is generally thought? The uncertainties are still big, but this is certainly an exciting prospect for shifting towards a more physically motivated cosmological model.
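The original figure is not reproduced here, but the construction of the thick curve can be stated compactly. Assuming (my assumption in this summary) that the individual age estimates are treated as statistically independent, the cumulative distribution of the age of the oldest star is the product of the individual cumulative distributions:

    F_{\max}(t) = P\!\left( \max_i t_i \le t \right) = \prod_{i=1}^{N} F_i(t),
    \qquad
    p_{\max}(t) = \frac{\mathrm{d}F_{\max}}{\mathrm{d}t} = \sum_{i=1}^{N} p_i(t) \prod_{j \ne i} F_j(t),

where F_i and p_i are the cumulative distribution and the (asymmetric) probability density of the i-th star's age estimate; drawing each t_i at random from p_i and recording the maximum samples p_max by Monte Carlo.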

The more careful descriptions of the age of the Universe give a caveat, a warning of how or why the standard estimate might be wrong: the age estimate depends on fitting the observations with the standard ΛCDM model. Despite being the standard model of cosmology, ΛCDM makes a non-standard assumption about gravity. Instead of allowing space to curve differently in regions where matter collapses into galaxies versus places where the Universe becomes more empty, which is what Einstein's general relativity says should happen, the standard model is rigid (apart from uniform expansion). It doesn't allow general relativity to apply properly.

Several of us have been working on theoretical tools and observational analysis to see if we can apply general relativity better than the standard model does. At least so far, doing our homework generally tells us that the would-be mysterious "dark energy" is really, until or unless proven otherwise, just a misinterpretation of space recently becoming negatively curved (on average) as voids and galaxies have formed during the most recent several gigayears.

This is where the age of the Universe comes in. In our new paper, arXiv:1608.06004, my colleagues and I summarise some key numbers that we argue are needed by any of the "backreaction" models similar to ours, which allow space to curve as galaxies and voids form, as required by the Einstein equation of general relativity. These simple constraints show that by fitting a no-dark-energy flat model (the Einstein–de Sitter model) at early times, the age of the Universe should be somewhat less than 17.3 gigayears, and quite likely somewhat more than the ΛCDM estimate of 13.8 gigayears. So we looked at published observations of stellar ages, which individually still have big uncertainties, but together favour the oldest stars having ages of around 14.7 gigayears. As expected, this is somewhere in between the two limits of 13.8 and 17.3 gigayears.
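For orientation only, the 17.3 gigayear figure can be related to the standard age formula for an Einstein–de Sitter background; this is back-of-the-envelope arithmetic on my part, and the precise background Hubble constant argued for in arXiv:1608.06004 should be taken from the paper itself:

    t_0^{\mathrm{EdS}} = \frac{2}{3 H_0^{\mathrm{bg}}}
                 \approx \frac{2}{3} \times
                 \frac{977.8~\mathrm{Gyr}}{H_0^{\mathrm{bg}} / (\mathrm{km~s^{-1}~Mpc^{-1}})},

so an upper limit of about 17.3 gigayears corresponds to a background Hubble constant of roughly 37.7 km/s/Mpc, far below the roughly 70 km/s/Mpc inferred within ΛCDM.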

So will there be a race between detailed "backreaction" modellers and stellar observers: tight cosmological predictions of the age of the Universe versus accurate spectroscopic measurements of the ages of the oldest Galactic stars (which have to be younger than the Universe, of course)?

Barely had our paper become public on ArXiv when we were reminded by colleagues studying cosmic microwave background (CMB) observations with the Einstein–de Sitter, no-dark-energy, flat cosmological model at early times that they had also found an age of the Universe of something like 14.5 gigayears! The bottom-right panel of Figure 4 of arXiv:1012.3460 (PRD) shows our colleagues' estimates of the age of the Universe using CMB and type Ia supernova observations. Their most likely age is about 14.5 gigayears, give or take about half a gigayear. This is not so very different from the Galactic Bulge star best estimate! So we have very different, independent methods tending to give similar results. The uncertainties are still big, and this story is not closed. But an extra Gigayear for the age of the Universe may be a clue that helps shift from the precise ΛCDM cosmology to the upcoming generation of accurate cosmology...




content licence: CC-BY | blog tools: GNU/Linux, emacs, perl, blosxom