4 GCRG Lab Data Management Plan
The Global Change Research Group strives for an open science policy that makes our science accessible and reproducible and that allows us to collaborate with our future selves and with our colleagues. To assist in achieving that aim, we use the following practices.
4.1 Key resources
- Lab notebook
- Shared Google Drive (Global Change Research Group)
- GitHub repository
- Working location (laptop, scientific workstation, cluster, etc.)
- Backup location (Bishop, a PbSci storage server that is backed up offsite. This may change to Rufus, a different storage server, in the future.)
4.2 Project organization
- Lab notebooks are maintained each day, tracking progress, daily learnings, what was accomplished, and/or what goal was worked towards. These can be physical notebooks, text files, Evernote, Jupyter notebooks, etc.
- Lab notebooks are digitally backed up, regardless of initial format
- Each new project gets a GitHub repository in the pinskylab organization
- This practice facilitates collaboration, sharing, data traceability, and maintenance of institutional knowledge.
- Repos can be public or private
- A CITATION.cff clarifies how to cite the repo. We apply the same authorship criteria as for papers (e.g., the first author is the lead).
- Data processing is accomplished through the use of scripts, not manual manipulation
- This helps to verify the reproducibility of our methods
- A README.md in each directory (including the top directory)
- explains the purpose of each file and sub-directory
- includes links to relevant papers and preprints
- has contact information for authors and data creators, as necessary
- defines data columns, including units, of any data files and basic methods used for data collection
- has version numbers of any software or packages needed for running codes or scripts
- If the repo is included in a publication, we also archive it with a DOI on, for example, Zenodo (see here for instructions).
- We add a license to each git repo, e.g., MIT or CC BY-NC 4.0, to clarify how the material can be reused. See guidance from Creative Commons, GitHub, and the R Packages book.
- We write collaborative manuscripts in the GCRG Shared Google Drive
- Presentations are typically made in Google Slides in the GCRG Google Drive (Presentations/) allowing lab members easy access to useful graphics
- Name the file in the format YYYY-MM-DD_presentername_occasion, e.g., 2023-01-07_Pinsky_AmNat
- Before leaving the lab (e.g., at graduation), we ensure all projects, code, data, papers, etc. are available and fully documented
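The presentation naming format above (YYYY-MM-DD_presentername_occasion) can be generated rather than typed by hand. The following is a minimal sketch; the function name `presentation_filename` is hypothetical and not part of any lab tooling, and stripping whitespace from the presenter and occasion is an assumption made so that each field stays a single token.

```python
from datetime import date


def presentation_filename(day: date, presenter: str, occasion: str) -> str:
    """Build a name of the form YYYY-MM-DD_presentername_occasion."""
    # Assumption: drop whitespace so underscores are the only field separators.
    fields = ["".join(presenter.split()), "".join(occasion.split())]
    # date.isoformat() yields YYYY-MM-DD, which also sorts chronologically.
    return "_".join([day.isoformat(), *fields])


print(presentation_filename(date(2023, 1, 7), "Pinsky", "AmNat"))
# prints 2023-01-07_Pinsky_AmNat
```

Leading with the ISO date means the files in Presentations/ sort by date when listed alphabetically.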
4.3 Data
Data used in support of a project is:
- Saved in an appropriate, non-proprietary format with accompanying metadata (e.g., csv rather than Excel)
- Raw data is stored in or linked from the GitHub repository associated with the project
- Raw data files under the GitHub file size limit (<100 MB) are stored in a data/ directory
- Larger files are stored in a data_largefiles/ directory (excluded from Git tracking via the .gitignore file) in at least two places
- where the analysis is occurring (e.g., laptop or scientific workstation), and
- one of the following (we document this arrangement clearly in the relevant data_largefiles/README.md file of the git repo):
- on FigShare, linked from the relevant GitHub readme.
- on NCBI for raw sequencing reads. Upload them soon after receiving them, possibly with an embargo, and document the accession numbers in the relevant README.md.
- another public data repository, with a link pointing to the data from the git readme.
- if FigShare, NCBI, or other public data repositories are not appropriate, we can use the UCSC PbSci Bishop data server. Data is stored in bio-globalchange/data_largefiles/ in a subdirectory whose name matches the Git repo name.
- If data from an external or public source is being used, it is stored in a data_dl/ directory (not tracked by Git). We clearly and unambiguously describe the data source in the data_dl/README.md file by providing links, version numbers, descriptions for access/download, or other details to ensure reproducibility. If you’re worried the data won’t be available in the future, follow one of the two previous bullet point approaches.
- Metadata is stored in the same directory as the raw data, typically in a README.md file that describes the data in each column, units, coordinate reference system (CRS, for GIS data), and other details needed to understand the file
- Processed or cleaned data is stored in a separate directory, e.g., output/ or similar, to differentiate from raw data
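The directory layout described in this section can be set up with a short script. The sketch below is one possible approach, not an official lab tool: the function names `scaffold` and `oversized_files` are hypothetical, the README stub text is a placeholder, and the 100 MB limit is read as 100 * 1024 * 1024 bytes, which is an assumption. The .gitignore patterns ignore the contents of data_largefiles/ and data_dl/ while keeping each README.md trackable, since the README files are meant to live in the git repo.

```python
import tempfile
from pathlib import Path

# Assumption: the "<100 MB" limit is interpreted as 100 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 100 * 1024 * 1024
TRACKED_DIRS = ["data", "output"]
UNTRACKED_DIRS = ["data_largefiles", "data_dl"]


def scaffold(repo: Path) -> None:
    """Create the data directories with README.md stubs and .gitignore rules."""
    for name in TRACKED_DIRS + UNTRACKED_DIRS:
        directory = repo / name
        directory.mkdir(parents=True, exist_ok=True)
        readme = directory / "README.md"
        if not readme.exists():
            # Placeholder text; replace with real column definitions and units.
            readme.write_text(f"# {name}\n\nDescribe each file, its columns, and units.\n")

    # Ignore the contents of the untracked directories but keep their READMEs.
    wanted = []
    for name in UNTRACKED_DIRS:
        wanted += [f"{name}/*", f"!{name}/README.md"]
    gitignore = repo / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    missing = [line for line in wanted if line not in existing]
    if missing:
        gitignore.write_text("\n".join(existing + missing) + "\n")


def oversized_files(data_dir: Path, limit: int = SIZE_LIMIT_BYTES) -> list:
    """Return files under data_dir whose size is at or above the limit."""
    return sorted(p for p in data_dir.rglob("*") if p.is_file() and p.stat().st_size >= limit)


if __name__ == "__main__":
    demo_repo = Path(tempfile.mkdtemp())
    scaffold(demo_repo)
    # Any file reported here would belong in data_largefiles/ instead of data/.
    print(oversized_files(demo_repo / "data"))
```

Running `scaffold` twice is safe: existing README files are left alone and .gitignore lines are not duplicated.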
4.4 Code
All code used or developed in support of the project is:
- Well commented and complete
- Versioned in the project’s git repository
- Described in the README.md file to explain what each script does, what language was used, what software and package versions were used, etc.
- Tested! Can at least one other person (more is better) complete your analysis on a different computer?
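Recording software and package versions in the README.md is easier when the list is generated from the environment that ran the analysis. The sketch below assumes a Python-based project; the function name `version_lines` and the package names passed to it are illustrative only. Projects in R or another language would need the equivalent tool for that language.

```python
import platform
from importlib import metadata


def version_lines(packages):
    """Return Markdown bullets for the Python version and each package version."""
    lines = [f"- Python {platform.python_version()}"]
    for name in packages:
        try:
            lines.append(f"- {name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Flag the gap instead of failing, so the README shows what is missing.
            lines.append(f"- {name} (not installed)")
    return lines


if __name__ == "__main__":
    # Hypothetical package list; use whatever the project's scripts import.
    print("\n".join(version_lines(["numpy", "pandas"])))
```

The printed bullets can be pasted into the README.md section that lists software requirements.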
4.5 Backups
We store our raw data and scripts in at least two locations.
- Most of our work is on GitHub or Google Drive. Both are backed up monthly to the UCSC PbSci Bishop file server.
- Raw data files too large for GitHub and Google Drive are stored in two locations (see Data section above)
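When large raw data files are kept in two locations, it helps to confirm that the copies are identical. The following sketch compares SHA-256 checksums between a working directory and a backup directory. It assumes both copies are reachable as local or mounted paths, which will not hold for copies on a public repository; the function names `sha256sum` and `mismatched_files` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(working: Path, backup: Path) -> list:
    """List relative paths under working that are missing from or differ in backup."""
    problems = []
    for src in sorted(p for p in working.rglob("*") if p.is_file()):
        rel = src.relative_to(working)
        dst = backup / rel
        if not dst.is_file() or sha256sum(src) != sha256sum(dst):
            problems.append(rel.as_posix())
    return problems


if __name__ == "__main__":
    # Demo with temporary directories standing in for the two storage locations.
    work, back = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
    (work / "reads.fastq").write_text("@read1\nACGT\n+\nIIII\n")
    print(mismatched_files(work, back))  # the file has not been copied yet
```

An empty list means every file in the working copy has a byte-identical counterpart in the backup. Files that exist only in the backup are not reported by this sketch.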
4.6 Hummingbird
4.6.1 Accessing Hummingbird
- Your Hummingbird login credentials are your CruzID and your Gold password
ssh CRUZID@hb.ucsc.edu
- other stuff