Abstract
Here, I document the overall longevity and continued maintenance of all R packages published in the journal Methods in Ecology and Evolution, a mainstay for methodological work and R packages in ecology. I conclude that published packages are well maintained, and remain viable long after the average PhD or postdoc appointment. However, (survival) bias and privilege of the authors should be further explored with an in-depth survey on career paths.
cite as: Hufkens K. The life cycle of an R package. https://doi.org/10.5281/zenodo.7689570
Introduction
For over a decade, Methods in Ecology and Evolution (MEE) has provided an outlet to describe significant software contributions to the ecological research community. These descriptions of free software, and in particular R packages, provide many ecologists with the tools they need to do their research. At the same time, it is well known that open source software development is limited by a number of factors, including the available maintenance time, the number of developers on a project and, ultimately, the funding to sustain software and documentation in the long term (Merow et al. 2023). Here, I explore all R packages published in Methods in Ecology and Evolution, a mainstay for analysis in ecology, for their overall longevity and maintenance.
Results & Discussion
Over the past decade, MEE has published one to two R package descriptions per month, for a total of 244 R packages (since 2012, see Methods below). Despite the significant burden of additional requirements to publish software on the public Comprehensive R Archive Network (CRAN), ~70% of the packages are currently listed on CRAN (with only ~1% listed on the alternative repository Bioconductor). Although publication on CRAN is not a requirement for publication in MEE, the data suggest that the formal structure and requirements of CRAN will impact reviews positively. General attrition of CRAN packages is low, with ~4% being delisted from CRAN over the total time frame of data considered. Removal occurred on average 3.9 ± 2.5 years after the creation of the package. Both numbers suggest that most packages remain viable long after the average lifespan of PhD or postdoc appointments (Woolston 2020). It must be noted that publishing packages in a journal or on CRAN has a high barrier to entry, and post-hoc sampling of these packages is invariably affected by survival bias.
The majority of package development (~79%) happens on GitHub (https://github.com), a free version control hosting and development service. The availability of well-developed R community infrastructure around writing and publishing packages and code on GitHub, such as {usethis} (Wickham, Bryan, and Barrett 2022) and other r-lib infrastructure (https://github.com/r-lib), might be reflected in these numbers. Publicly available version control data allowed me to assess the engagement with the community and the speed and frequency of code maintenance. The total number of open issues on GitHub, as an indication of development response, is generally low, with 8.4 ± 2.5 open issues across all projects. Most projects have seen their last commit within the last year (345 ± 513 days), further confirming relatively active development, consistent with the few unresolved issues raised by users. Despite active maintenance and low issue counts, nearly a third (31%) of all packages on GitHub are maintained by a single author. Across all packages I note an average of 3.2 ± 2.9 developers. As such, research-focused software development confirms the open source development trend that most small projects have a limited number of active core developers.
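The figures above are reported as "mean ± standard deviation", alongside a share of single-developer packages. As a minimal sketch of how such summaries might be computed, assuming a hypothetical table of per-repository records (the package names, dates and counts below are invented for illustration and are not the study's data):

```r
# Hypothetical per-repository records; values are made up for illustration
# and are NOT taken from the study's data.
repos <- data.frame(
  package      = c("pkgA", "pkgB", "pkgC", "pkgD"),
  last_commit  = as.Date(c("2022-11-30", "2021-06-15", "2022-12-01", "2019-03-20")),
  open_issues  = c(2, 11, 0, 5),
  contributors = c(1, 4, 2, 1)
)

# Reference date: the day the repositories were queried (see Methods).
query_date <- as.Date("2022-12-19")

# Days elapsed between the latest commit and the query date.
repos$days_since_commit <- as.numeric(query_date - repos$last_commit)

# Format a numeric vector as "mean ± standard deviation".
mean_sd <- function(x, digits = 1) {
  paste(
    formatC(mean(x), format = "f", digits = digits),
    "±",
    formatC(sd(x), format = "f", digits = digits)
  )
}

# Share of packages with a single contributor, in percent.
single_dev_share <- 100 * mean(repos$contributors == 1)

cat("Days since last commit:", mean_sd(repos$days_since_commit, 0), "\n")
cat("Open issues:", mean_sd(repos$open_issues), "\n")
cat("Developers per package:", mean_sd(repos$contributors), "\n")
cat("Single-developer packages (%):", single_dev_share, "\n")
```

Only base R is used here; the column names are assumptions chosen for readability rather than the layout of the original data set.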
The impact of the software varies from package to package, with downloads during 2022 ranging from as few as 88 to as many as 160 000, with a mean of 8000 downloads per package. Although it could be argued that downloads do not necessarily reflect true user uptake, more popular packages seem to be cited more frequently.
Conclusion
Overall, published packages in MEE are well maintained, and remain viable long after the average lifespan of PhD or postdoc appointments (Woolston 2020). Although no survey was executed, development seems sustained, as assessed via GitHub commit activity and outstanding issues. It is more difficult to estimate whether packages developed internally, or not formally published in journals, progress through a similar life cycle. Conversely, it could be argued that active maintenance of a published R package might have a positive influence on networking, community building and the continued academic employment of developers. The (survival) bias and privilege of the authors involved should therefore be further explored with an in-depth survey on career paths.
Methods
For my meta-analysis I searched the archives of Methods in Ecology and Evolution and the sub-categories of “applications” and “tools” using the search word “package” (consulting the MEE archive on 19 December 2022). The search returned a total of 368 results. All publications were subsequently screened manually for whether they described a novel R package, excluding all other software contributions and mere collections of scripts not adhering to the standard package layout.
For each R package publication I gathered key metadata. I screened all publications for the date of publication, the number of citations and the number of authors. I further verified whether the main MEE text mentioned availability on CRAN or Bioconductor, and whether an install routine was provided (e.g. install.packages). I checked whether packages were currently listed on CRAN, or had once been listed on CRAN but are currently archived. Delisting occurs when a package no longer adheres to current R standards and was not updated to address the resulting issues (i.e. the author is not responsive to maintenance requests).
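The CRAN status check described above can be sketched as a small classification step. This is an illustration under stated assumptions, not the procedure used in the study: the function name `cran_status` and the package names are hypothetical, and the two lookup vectors are passed in as arguments so the logic runs offline.

```r
# Classify packages as currently listed on CRAN, archived, or never listed.
# `listed` and `archived` are character vectors of package names. They are
# arguments (rather than fetched inside the function) so the logic can be
# run without network access.
cran_status <- function(pkgs, listed, archived) {
  status <- rep("never listed", length(pkgs))
  status[pkgs %in% archived] <- "archived"
  # Listed takes precedence: the CRAN archive also holds old versions of
  # packages that are still listed.
  status[pkgs %in% listed] <- "listed"
  setNames(status, pkgs)
}

# Hypothetical package names, for illustration only.
pkgs     <- c("pkgA", "pkgB", "pkgC")
listed   <- c("pkgA")
archived <- c("pkgA", "pkgB")

status <- cran_status(pkgs, listed, archived)
print(status)

# Percentage of packages in each category.
print(round(100 * prop.table(table(status)), 1))
```

With network access, the set of currently listed packages could be obtained with `rownames(utils::available.packages(repos = "https://cloud.r-project.org"))`. How the archived set is gathered is left open here.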
To establish the collaborative nature of software development I further gathered information on centralised public repositories linked to the software described in the MEE papers. I first checked for any references to a GitHub repository (as cursory screening suggested this to be the main location of software development). In the absence of a GitHub repository, other version control websites were manually backfilled, sourcing from the original manuscript. If the main manuscript lacked any information on a development website, a simple web search (using DuckDuckGo) was executed to provide a link to such a location. For all GitHub-based projects (79% of packages) I determined the current (as of 19 December 2022) number of collaborators, the number of open issues and the date of the latest commit. Finally, I used the {cranlogs} package (Csárdi 2022) to determine the number of downloads of all packages during the past year. I provide summary statistics on all these metrics using R 4.2.2 (R Core Team 2023).
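The download tally via {cranlogs} might look roughly like the sketch below. The helper names `fetch_downloads` and `total_downloads`, the exact date range and the example counts are assumptions made for illustration. Only `cranlogs::cran_downloads()` and its `date`, `count` and `package` output columns come from the package itself.

```r
# Fetch daily download counts for a set of packages.
# Requires the {cranlogs} package and network access; defined here but not
# called. The default date range is an assumption (calendar year 2022).
fetch_downloads <- function(pkgs, from = "2022-01-01", to = "2022-12-31") {
  cranlogs::cran_downloads(packages = pkgs, from = from, to = to)
}

# Sum daily counts per package. `daily` is expected to have the columns
# returned by cranlogs::cran_downloads(): `date`, `count` and `package`.
total_downloads <- function(daily) {
  totals <- vapply(split(daily$count, daily$package), sum, numeric(1))
  sort(totals, decreasing = TRUE)
}

# Offline illustration with made-up counts (NOT study data).
daily <- data.frame(
  date    = as.Date(c("2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02")),
  count   = c(10, 15, 3, 4),
  package = c("pkgA", "pkgA", "pkgB", "pkgB")
)

totals <- total_downloads(daily)
print(totals)         # total downloads per package
print(range(totals))  # fewest and most downloads
print(mean(totals))   # mean downloads per package
```

With network access, a call such as `total_downloads(fetch_downloads(c("usethis")))` would replace the made-up data frame.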
Acknowledgements
The R logo is © 2016 The R Foundation, used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license (CC-BY-SA 4.0).