Project and Data Organization
Objectivesπ
- folder structure
- file naming
- variable naming
- backups
Project Organization
To organize your data, you should use a clear folder structure to ensure that you can find your files. For this there are already multiple existing templates.
If you don't find any template that suits your needs (which I doubt...), make sure you follow these general suggestions on organization of folders:
- Make sure you have enough (sub)folders so that files can be stored in the right folder and are not scattered in folders where they do not belong, or stored in large quantities in a single folder.
- Use a clear folder structure. You can structure folders based on the person that has generated the data/folder, chronologically (month, year, sessions), per project (as done in the example below), or based on analysis method/equipment or data type.
- Avoid overlapping or vague folder names, and do not use personal data in folder/file names.
- All good structures contain at least the following elements:
- A unique main folder for the project
- Some form of code
- Some form of data
- A readme document with any important information about the project for yourself or collaborators
Project Organization: One Example 1
.
βββ LICENSE.md
βββ README.md
βββ code
β βββ experiment
β βββ analysis
βββ data
β βββ source <- The original, immutable data dump.
β βββ raw <- The final, canonical data sets for modeling.
β βββ derivatives <- Data after being processed by analysis pipelines.
β βββ temp <- Intermediate data that has been transformed.
βββ docs <- Documentation notebook for users.
β βββ manuscript <- Manuscript source, e.g., LaTeX, Markdown, etc.
β βββ study-protocol <- or `preregistration`
βΒ Β βββ reports <- Other project reports and notebooks (e.g. Jupyter, .Rmd)
βββ results
βββ figures <- Figures for the manuscript or reports
Β Β βββ output <- Other output for the manuscript or reports
It doesn't matter too much if you name the folder with processed data processed
and have it under the upper folder data
or if you have it one level up and name it instead edited data
or what have you. The important thing is that you choose a structure and stick to it and best explain it to others by putting an explanation of the project folder structure in a README. What you should do though is checking out if there are community standards for folder structure in your research field. For example, in psychology and neuroscience we have a widely accepted and comprehensively documented standard called BIDS. Some data repositories now expect you to have the data in this format, otherwise they won't accept your data. So, just make sure that you follow recommendations and principles in your field of research (if there are any).
How you organize your data
folder depends on your kind of research data. For some it is better to have separate folders for each subject, e.g., if you collect multiple modality data from subjects. Then you can organize your data
folder as such (inspired by BIDS for psychology and neurosciences):
βββ data
β βββ source
β βββ raw
β β βββ sub-01
β β β βββ beh
β β β β βββ sub-01_beh.csv
β β β βββ eeg
β β β β βββ sub-01_eeg.edf
β β βββ sub-02
β β β βββ beh
β β β β βββ sub-02_beh.csv
β β β βββ eeg
β β β β βββ sub-02_eeg.edf
β βββ derivatives
β βββ temp
If you only have one data type, you can skip the sub-0X
and modality
folders, like so:
βββ data
β βββ source
β βββ raw
β β βββ sub-01_beh.csv
β β βββ sub-02_beh.csv
β βββ derivatives
β βββ temp
Task
Go through your project folder and re-organize it using the above mentioned recommendations.
File Naming Conventions
Structure your file names and set up a template for this. For example, it may be advantageous to start naming your files with the date each file was generated. This will sort your files chronologically and create a unique identifier for each file. The utility of this process is apparent when you generate multiple files on the same day that may need to be versioned to avoid overwriting.
Some other tips for file naming include:
- Use the date or date range of the experiment:
YYYY-MM-DD
- Use the file type
- Use the researcher's name/initials
- Use the version number of file (v001, v002) or language used in the document (ENG)
- Do not make file names too long (this can complicate file transfers)
- Avoid special characters (?!@*%{[<>) and spaces
- Avoid personal data in file names
You can explain the file naming convention in a README file so that it will also become apparent to others what the file names mean.
Jenny Bryanβs βnaming thingsβ presentation (also available as a 5 minute summary video on youtube) gives very concrete and intuitive recommendations and examples. Here's the main content of her talk for you:
Don't do this | Do this instead |
---|---|
myabstract.docx | 2022-09-24_abstract-for-normconf.docx |
Janeβs Filenames Use βSpacesβ & Punctuation ;).xlsx | janes-filenames-are-getting-better.xlsx |
figure 1.png | fig01_scatterplot-talk-length-vs-interest.png |
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt | 1986-01-28_raw-data-from-challenger-o-rings.txt |
Good file names are:
- machine readable
- human readable
- sorted in a useful way
Machine readable
Wikipedia: globbing = "glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/
moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard and *.txt is a glob pattern. The wildcard * stands for "any string of any length including empty, but excluding the path separator characters (/ in unix and in windows)". "
This means: use _
underscore to delimit fields, i.e. when you have multiple .csv
files that contain data of one type of observations and need to be parsed to one dataframe
in the end, use the observation name in the file name and delimit this name from the rest of the filename using an underscore. This way you can easily find it with the ls
command and easily code the read-in for data analysis. At the same time, use -
hyphen to delimit words within fields.
Example: If you have a list of files named like this 2022-09-24_Plasmid-Cellline-100-1MutantFraction_A01.csv
, 2022-06-26_Plasmid-Cellline-100-1MutantFraction_H02.csv
, 2022-06-26_Plasmid-Cellline-100-1MutantFraction_H03.csv
you can do multiple machine operations with it:
- you can find those files in your folder under all the other files by simply typing
ls *Plasmid*
in the terminal - you can read in the different parts of the filename as headers for your dataframe by coding a delimiter rule, e.g., in
R
a code like:
separate_wider_delim(
filenames,
delim = regex("[_\\.]"),
names = c("date", "assay", "well", NA)
)
leads to an output of:
date | assay | well | |
---|---|---|---|
1 | 2022-09-24 | Plasmid-Cellline-100-1MutantFraction | A01 |
2 | 2022-06-26 | Plasmid-Cellline-100-1MutantFraction | H02 |
3 | 2022-06-26 | Plasmid-Cellline-100-1MutantFraction | H03 |
Human readable
Make sure that at least you yourself are able to decode from the filename what is in it. Try also to make it easy for others to guess what something is.
Don't do this | Do this instead |
---|---|
01.md | 01_marshal-data.md |
01.R | 01_marshal-data.R |
02.md | 02_pre-dea-filtering.md |
02.R | 02_pre-dea-filtering.R |
03.md | 03_dea-with-limma-voom.md |
03.R | 03_dea-with-limma-voom.R |
04.md | 04_explore-dea-results.md |
04.R | 04_explore-dea-results.R |
9.md | 90_limma-model-term-name-fiasco.md |
90.R | 90_limma-model-term-name-fiasco.R |
Dates
To be able to sort file in a chronological order it is always a good idea to include a date in the filename. For this you should respect ISO 8601
which states that dates should be written in the YYYY-MM-DD
format. Don't let the US convince you to use MM-DD-YYYY...they're really the only ones using this format.
Sorted in a useful way
- plan for alphanumeric sorting
- put something numeric-ish first-ish
- use the ISO 8601 standard for dates
- left pad numbers with zeros. Otherwise the file starting with a
10
will be shown above the file starting with a1
in the folder, which is confusing.
Variable naming
Basically, all of the above about file naming conventions applies equally to variable naming. Here are some conventions you should know about:
- CamelCase
- lowerCamelCase
- Underscore_Methods
- Mixed_Case_with_Underscores
- lowercase
Task
Go through your project files and rename files and variables using the above mentioned recommendations.
Task
Go to your OSF project and upload your newly organized folder. You can either upload the whole folder in your OSF project main page or use the components feature in OSF.
Again, checking out if there are community standards in your research field for file and variable naming is a good idea. The above mentioned BIDS standard, for example, gives instructions on that, too.
Open file formats
This one is short: use open file formats! Open file formats are the ones that can be opened by anyone on any machine without having to install additional software than the one a computer already comes with. Formats like .txt/.md for text or .csv/.tsv for tables can be opened by simple text editors. Formats from proprietary software (like Microsoft Office) cannot.
optional/reading/further materials
Project Organisation: Other Examples
- This folder structure by Nikola Vukovic
- You can pull/download folder structures using GitHub: This template by Barbara Vreede, based on cookiecutter, follows recommended practices for scientific computing by Wilson et al. (2017).
- See this template by Chris Hartgerink for file organisation on the Open Science Framework.
- How to Organize Your Digital Files by Melanie Pinola.
- Project structure videos by Danielle Navarro (with slides).
More Information on Project Organisation
- How to organise your data and code by Rene Bekkers.
More Information on File naming
- MIT's recommendations on File naming and folder hierarchy
- 8 step guide on how to set up your file naming convention
File renaming tools
If you want to change your file names you have the option to use bulk renaming tools. Be careful with these tools, because changes made with bulk renaming tools may be too rigorous if not carefully checked!
Some bulk file renaming tools include: - Bulk Rename Utility, WildRename, and Ant Renamer (for Windows) - Renamer (for MacOS) - PSRenamer (for MacOS, Windows, Unix, Linux)
Info
Most of the content was copied from The Turing Way Handbook under a CC-BY 4.0 licence.
-
adapted from this template by Barbara Vreede. ↩