Microbiome data analysis tools for R

The goal of micro4R was to create an R package for microbiome data processing with a low barrier to entry. I started my career in microbiome research at the bench and had to ELI5 to myself how to process and analyze “big data”. I’ve spent a ton of time poring over and experimenting with others’ code, and want to pass it on.

Likely, the ideal person to benefit from micro4R would be a bench scientist without much formal statistics or bioinformatics training. Fair warning, if you already have a strong stats/informatics background, this may not be of much use for you!

This package does not create any brand new functionality and is essentially a wrapper of and/or inspired by existing tools others have already created. Much of what it does can be accomplished with other packages, such as phyloseq, QIIME 2, and MicrobiomeAnalyst. One of these may be better for your purposes, and I’d encourage anyone new to the field to explore multiple tools!

Installation

All you really need is R, but I’d recommend also downloading and working in RStudio. If you’re a true newbie to R, there’s tons of free content to help you learn the basics.

Once you’re set in R/RStudio, you can install the development version of micro4R like so:

# install.packages("pak")
pak::pak("mshilts1/micro4R")

My Usual Workflow

I’ve listed the steps that I usually take with 16S data:

Do the lab work to extract DNA, make libraries, submit for sequencing, etc. Go ↓
Get the data from the sequencing core. Go ↓
Process the data through dada2 to generate an amplicon sequence variant (ASV) table and its associated taxonomy table. Go ↓
Create a “metadata” file with pertinent information on the samples and controls in your run. Go ↓
Use decontam to bioinformatically remove suspected contaminants. Go ↓
Do some basic sanity checking on the metadata, ASV, and taxonomy objects. Go ↓
Check the quality of the sequencing data by examining the positive and negative controls. Go ↓
Alpha diversity (vegan). Go ↓
Beta diversity (vegan). Go ↓
Additional statistical analysis, including differential abundance testing (maaslin3). Go ↓

Example

I’ll run through a VERY small and simple possible use case below. For more detailed help and documentation, please explore the vignettes (TBA).

Step 1: the lab work

I’m not going to get into much detail here, but everything you or your colleagues do in the lab matters and can influence your results. You’ll need to be thoughtful about everything from how exactly samples will be collected to storage conditions, DNA extraction method, library prep method, and more. It’s not my place to tell you what to do here, but carefully consider and document every step along the way.

However, I will strongly tell you that including both negative and positive controls is very important! Why, and what are those? Read more here: ^1,^2,^3,^4,^5

Step 2: Sequencing

You’ll need to follow your own desired protocol to perform the sequencing, and get the data back. Most likely, you’ll use the services of a sequencing core, so follow their instructions.

Included with the package is an extremely and unnaturally tiny toy example to demonstrate its major functionality, using subsampled publicly available nasal swab 16S microbiome sequencing data that I generated along with many colleagues.

The example files were generated by amplifying the V4 hypervariable region of the 16S gene and were sequenced on an Illumina MiSeq machine with 2x250 bp reads.

Step 3: Data processing

The files received from the sequencer will most likely be FASTQ files. These are files with DNA sequences generated by the sequencer along with quality score information (putting the “Q” in FASTQ). In the age of next-generation sequencing, you could easily get millions of these sequences per sequencing run. Our poor human brains can’t really comprehend or do anything with this data as it is. So what we’re going to do is process this data to eventually turn it into two tables that we humans can make sense of and analyze.
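To make the FASTQ format a little more concrete, here is a minimal base R sketch that pulls apart a single made-up record. It assumes the Phred+33 quality encoding (the usual encoding on current Illumina instruments); the record itself is invented for illustration and is not from the example data.

```r
# One made-up FASTQ record: every read takes up four lines.
record <- c(
  "@read1",      # line 1: read identifier
  "TACGTAGGTG",  # line 2: the DNA sequence
  "+",           # line 3: separator
  "IIIIHHGF#!"   # line 4: one quality character per base
)

sequence <- record[2]
quality  <- record[4]

# Assumption: Phred+33 encoding, so the quality score is the
# ASCII code of each character minus 33.
phred <- utf8ToInt(quality) - 33L

# Estimated probability that each base call is wrong: 10^(-Q / 10).
error_prob <- 10^(-phred / 10)

data.frame(
  base       = strsplit(sequence, "")[[1]],
  phred      = phred,
  error_prob = signif(error_prob, 2)
)
```

High scores (like the leading bases here) mean the sequencer was confident in the call; the last two bases have scores near zero and would normally be trimmed or filtered during processing.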

We’re going to use the R package dada2 to turn the FASTQ files into matching amplicon sequence variant (ASV) count and taxonomy tables. If you don’t know what an ASV is, please go here first.

The first thing we’ll do on these files is run dada2_asvtable(), which is essentially a wrapper to generate an ASV count table by following a workflow similar to the dada2 tutorial.

This function can take a number of arguments, but the most important one is ‘where’, which is the path to the folder where your FASTQ files are located.

For demonstration purposes, it’s been set to the relative path of the example FASTQ files that are included with the package:

library(micro4R)
#> This is version 0.0.0.9000 of micro4R. CAUTION: This is package is under active development and its functions may change at any time, without warning! Please visit https://github.com/mshilts1/micro4R to see recent changes.

asvtable <- dada2_asvtable(where = "inst/extdata/f", chatty = FALSE, logfile = FALSE)
#> Creating output directory: /var/folders/pp/15rq6p297j18gk2xt39kdmm40000gp/T//RtmpllHMAy/dada2_out/filtered
#> 59520 total bases in 248 reads from 7 samples will be used for learning the error rates.
#> 49600 total bases in 248 reads from 7 samples will be used for learning the error rates.

If you’re running this with your own data, set ‘where’ to the path of the folder where your FASTQ files are stored. If you leave it empty (i.e., run dada2_asvtable()), it will default to searching in your current working directory. (‘chatty’ was set to FALSE because tons of information gets printed to the console otherwise; I’d recommend setting it to TRUE (the default) when you’re processing data for real, as the information is useful, but just too much here.)
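Before pointing ‘where’ at your own folder, it can help to confirm that R actually sees FASTQ files there. The sketch below uses only base R; the folder and file names are hypothetical stand-ins created in a temporary directory, and the file-name pattern is an assumption you may need to adjust.

```r
# Hypothetical folder standing in for your own FASTQ directory.
fastq_dir <- file.path(tempdir(), "my_fastqs")
dir.create(fastq_dir, showWarnings = FALSE)
invisible(file.create(file.path(fastq_dir, c(
  "sampleA_R1_001.fastq.gz", "sampleA_R2_001.fastq.gz", "notes.txt"
))))

# Keep only files that look like FASTQ files (optionally gzipped).
fastq_files <- list.files(
  fastq_dir,
  pattern = "\\.(fastq|fq)(\\.gz)?$",
  full.names = TRUE
)

if (length(fastq_files) == 0) {
  stop("No FASTQ files found in: ", fastq_dir)
}
basename(fastq_files)

# With real data, this is the folder you would then hand over, e.g.:
# asvtable <- dada2_asvtable(where = fastq_dir)
```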

Let’s take a quick look at what this asvtable looks like (using the tibble::as_tibble() function so it prints more nicely):

tibble::as_tibble(asvtable, rownames = "SampleID")
#> # A tibble: 7 × 7
#>   SampleID  TACGTAGGTGGCAAGCGTTA…¹ TACGGAGGGTGCAAGCGTTA…² TACGTAGGGTGCGAGCGTTG…³
#>   <chr>                      <int>                  <int>                  <int>
#> 1 SAMPLED_…                      0                      0                      0
#> 2 SAMPLED_…                      0                      0                      0
#> 3 SAMPLED_…                     44                      0                      0
#> 4 SAMPLED_…                     24                      0                      0
#> 5 SAMPLED_…                      0                      0                     12
#> 6 SAMPLED_…                      0                     35                      6
#> 7 SAMPLED_…                      0                      0                      0
#> # ℹ abbreviated names:
#> #   ¹TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG,
#> #   ²TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG,
#> #   ³TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCTGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGG
#> # ℹ 3 more variables:
#> #   TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG <int>,
#> #   TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG <int>, …

This is basically just a count table: each row is a sample, each column is an ASV, and each cell is the number of reads assigned to that ASV in that sample.
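As a rough illustration of what you can do with a table shaped like this, the base R sketch below computes the total reads per sample and converts counts to relative abundances. The matrix is a made-up toy with hypothetical sample and ASV names, not the real asvtable.

```r
# Toy count table: rows are samples, columns are ASVs (names are made up).
toy_counts <- matrix(
  c(44,  0,  0,
    24,  0, 12,
     0, 35,  6),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("sample1", "sample2", "sample3"),
    c("ASV_a", "ASV_b", "ASV_c")
  )
)

# Sequencing depth: total number of reads in each sample.
reads_per_sample <- rowSums(toy_counts)

# Relative abundance: each row is rescaled so it sums to 1.
rel_abund <- prop.table(toy_counts, margin = 1)

reads_per_sample
round(rel_abund, 2)
```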

Let’s look at what the column names (AKA the names of the ASVs) look like:

colnames(asvtable)
#> [1] "TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG" 
#> [2] "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG" 
#> [3] "TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCTGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGG"
#> [4] "TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG" 
#> [5] "TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG" 
#> [6] "TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGCTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG"

Our ASVs are by default just named after their literal DNA sequences. Since you’re (probably?) not a computer, a string of hundreds of nucleotides is likely not something you can make much sense of by yourself. The next step will take those nucleotide sequences and compare them against a database (or two) of sequences with known taxonomy:

train <- "inst/extdata/db/EXAMPLE_silva_nr99_v138.2_toGenus_trainset.fa.gz" # set training database
species <- "inst/extdata/db/EXAMPLE_silva_v138.2_assignSpecies.fa.gz" # set species database

taxa <- dada2_taxa(asvtable = asvtable, train = train, species = species, chatty = FALSE)

There are two databases that we’re using for taxonomic assignment here:
1. ‘train’ needs to be the path to whatever database you’d like to use as the “training set of reference sequences with known taxonomy”.
2. ‘species’ is OPTIONAL. If you’d like to use this option, provide the path to a specifically formatted species assignment database. (Read more here.)

CAUTION: The two databases used in the example here are comically small and artificial subsamples of the real SILVA databases, and should only ever be used for testing and demonstration purposes! You’ll definitely need to download and use the real databases for your actual data! If you try to use these with real data, you’ll get very weird results that will make no sense.

There are many options for taxonomic databases you can use; the major players are SILVA, RDP, GreenGenes, and UNITE. Please go here for details and download links. I usually prefer the SILVA databases, but you don’t have to!

Let’s take a look at the taxonomy assignment table:

#> # A tibble: 6 × 7
#>   ASV                                    Kingdom Phylum Class Order Family Genus
#>   <chr>                                  <chr>   <chr>  <chr> <chr> <chr>  <chr>
#> 1 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCG… Bacter… Bacil… Baci… Stap… Staph… Stap…
#> 2 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCG… Bacter… Pseud… Gamm… Ente… Enter… Kleb…
#> 3 TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCG… Bacter… Actin… Acti… Myco… Coryn… Cory…
#> 4 TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCG… Bacter… Actin… Acti… Myco… Coryn… Cory…
#> 5 TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCG… Bacter… Bacil… Baci… Lact… Carno… Dolo…
#> 6 TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCG… Bacter… Bacil… Baci… Lact… Strep… Stre…

It’s a bit squished, but you can see this information is more human-friendly here. Each ASV has been given a taxonomic assignment down to the lowest taxonomic level the taxonomy assigner was confident of.
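Because some ASVs stop at a higher rank than others, it is often handy to build one label per ASV from the lowest rank that was actually assigned. This is a minimal base R sketch of that idea, assuming unassigned ranks are stored as NA; the toy table and the name lowest_rank are hypothetical, and this is not necessarily how micro4R does it internally.

```r
# Toy taxonomy table with made-up rows; NA means "not confidently assigned".
toy_taxa <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
  Phylum  = c("Bacillota", "Pseudomonadota", "Actinomycetota"),
  Family  = c("Staphylococcaceae", "Enterobacteriaceae", NA),
  Genus   = c("Staphylococcus", NA, NA),
  stringsAsFactors = FALSE
)

# For each row, find the right-most (lowest) rank that is not NA
# and paste the rank name and the taxon name together.
lowest_rank <- apply(toy_taxa, 1, function(row) {
  last_assigned <- max(which(!is.na(row)))
  paste0(names(row)[last_assigned], ": ", row[last_assigned])
})

unname(lowest_rank)
```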

Step 4: Metadata

Next, we need to load in some metadata about our samples.

metadata <- example_metadata()

metadata
#> # A tibble: 7 × 7
#>   SampleID                 LabID SampleType host_age host_sex Host_disease neg  
#>   <chr>                    <chr> <chr>         <int> <chr>    <chr>        <lgl>
#> 1 SAMPLED_5080-MS-1_328-G… part… Homo sapi…       33 female   healthy      FALSE
#> 2 SAMPLED_5080-MS-1_339-A… part… Homo sapi…       25 male     healthy      FALSE
#> 3 SAMPLED_5348-MS-1_162-A… part… Homo sapi…       27 male     COVID-19     FALSE
#> 4 SAMPLED_5348-MS-1_297-G… part… Homo sapi…       26 female   COVID-19     FALSE
#> 5 SAMPLED_5080-MS-1_307-A… CTRL… negative …       NA <NA>     <NA>         TRUE 
#> 6 SAMPLED_5080-MS-1_313-G… CTRL… negative …       NA <NA>     <NA>         TRUE 
#> 7 SAMPLED_5348-MS-1_381-T… CTRL… positive …       NA <NA>     <NA>         FALSE

Let’s look at the ‘SampleID’ field, which is what it sounds like, and uniquely identifies each sample:

metadata$SampleID
#> [1] "SAMPLED_5080-MS-1_328-GATCTACG-TCGACGAG_S328_L001"
#> [2] "SAMPLED_5080-MS-1_339-ACTCACTG-GATCGTGT_S339_L001"
#> [3] "SAMPLED_5348-MS-1_162-ACGTGCGC-GGATATCT_S162_L001"
#> [4] "SAMPLED_5348-MS-1_297-GTCTGCTA-ACGTCTCG_S297_L001"
#> [5] "SAMPLED_5080-MS-1_307-ATAGTACC-ACGTCTCG_S307_L001"
#> [6] "SAMPLED_5080-MS-1_313-GACATAGT-TCGACGAG_S313_L001"
#> [7] "SAMPLED_5348-MS-1_381-TGCTCGTA-GTCAGATA_S381_L001"

The first thing you may notice is the ‘SampleIDs’ here are the kinds of IDs that only a computer could love. For my standard workflow, I like to keep the SampleIDs as the FASTQ file names because they will by default automatically match the SampleIDs generated with dada2_asvtable().
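If you want to build the SampleID column of your metadata from your FASTQ file names, something like the following base R sketch may help. It assumes the common Illumina naming pattern ending in _R1_001.fastq.gz or _R2_001.fastq.gz; that suffix is an assumption, so compare the result against the row names of your own ASV table before relying on it.

```r
# Hypothetical forward/reverse file names for one sample.
fastq_names <- c(
  "SAMPLED_5080-MS-1_328-GATCTACG-TCGACGAG_S328_L001_R1_001.fastq.gz",
  "SAMPLED_5080-MS-1_328-GATCTACG-TCGACGAG_S328_L001_R2_001.fastq.gz"
)

# Strip the assumed read-direction suffix and file extension,
# then keep one ID per forward/reverse pair.
sample_ids <- unique(
  sub("_R[12]_001\\.fastq(\\.gz)?$", "", basename(fastq_names))
)

sample_ids
```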

In this example, there is also a ‘LabID’ field, which is an ID that could have been used all through specimen processing, as it is much more human-friendly:

metadata$LabID
#> [1] "participant01" "participant02" "participant03" "participant04"
#> [5] "CTRL_neg_ext"  "CTRL_neg_pcr"  "CTRL_pos_pcr"

To seamlessly use this package, you MUST have a column called ‘SampleID’, and those IDs must exactly match between your metadata and ASV count table objects. But otherwise, you’re free to name your samples whatever you want.
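A quick way to confirm the IDs line up is to compare the two sets directly. The helper below, check_sample_ids(), is a hypothetical base R function written for this illustration (it is not part of micro4R), shown here on made-up IDs.

```r
# Hypothetical helper: report IDs that appear in only one object,
# plus any IDs that are duplicated in the metadata.
check_sample_ids <- function(metadata_ids, asvtable_ids) {
  list(
    only_in_metadata = setdiff(metadata_ids, asvtable_ids),
    only_in_asvtable = setdiff(asvtable_ids, metadata_ids),
    duplicated_ids   = unique(metadata_ids[duplicated(metadata_ids)])
  )
}

# Made-up IDs with one mismatch in each direction and one duplicate.
result <- check_sample_ids(
  metadata_ids = c("s1", "s2", "s3", "s3"),
  asvtable_ids = c("s1", "s3", "s4")
)
result

# With the objects from this walkthrough, the call would look like:
# check_sample_ids(metadata$SampleID, rownames(asvtable))
```

All three elements of the result should be empty when the metadata and ASV table agree.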

Step 5: Bioinformatic Decontamination

What kind of information you’ll need to have in your metadata object is HIGHLY dependent on your study, but there’s some information that we must have for the optional (but highly recommended!) processing of your ASV table through decontam.

What does decontam need to know?
1. To use the “prevalence” method: which samples are the negative controls.
2. To use the “frequency” method: the DNA concentration in each sample prior to sequencing.
3. To use both of the “prevalence” and “frequency” methods, you need all the above.
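To build some intuition for the “prevalence” idea, the base R sketch below compares how often each ASV shows up in negative controls versus real samples. This is a deliberately simplified illustration on made-up counts: decontam computes a proper statistical score, and the simple comparison used here is only an assumption for demonstration.

```r
# Toy counts: three real samples and two negative controls (all made up).
toy_counts <- matrix(
  c(120,  0, 3,
     95,  0, 5,
     80,  2, 0,
      0, 40, 0,
      0, 38, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("samp1", "samp2", "samp3", "neg1", "neg2"),
    c("ASV_a", "ASV_b", "ASV_c")
  )
)
is_neg <- c(FALSE, FALSE, FALSE, TRUE, TRUE)

# Presence/absence of each ASV in each sample.
present <- toy_counts > 0

# Fraction of real samples and of negative controls containing each ASV.
prev_in_samples <- colMeans(present[!is_neg, , drop = FALSE])
prev_in_negs    <- colMeans(present[is_neg, , drop = FALSE])

# Simplified rule (not decontam's statistic): flag ASVs that are more
# prevalent in the negative controls than in the real samples.
suspects <- names(which(prev_in_negs > prev_in_samples))

rbind(prev_in_samples, prev_in_negs)
suspects
```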

Don’t have any negative controls? You won’t be able to run decontam, and I strongly recommend you include some next time! Both negative and positive controls are very important! Read more here: ^1,^2,^3,^4,^5

In our example data, we are only able to use the “prevalence” method, because we know which samples were negative controls, but we don’t have DNA concentration data. That information is in our column called “neg”, where TRUE means it was a negative control:

metadata$neg
#> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
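If your own metadata records the sample type as text rather than as a TRUE/FALSE column, a logical column like this can be derived in one line. The sketch below uses a made-up metadata table; the SampleType labels are hypothetical, so adapt the pattern to however your controls are actually labelled.

```r
# Toy metadata with hypothetical sample type labels.
toy_metadata <- data.frame(
  SampleID   = c("s1", "s2", "ctrl_a", "ctrl_b"),
  SampleType = c("nasal swab", "nasal swab", "negative control", "positive control"),
  stringsAsFactors = FALSE
)

# TRUE only for rows whose SampleType starts with "negative".
toy_metadata$neg <- grepl("^negative", toy_metadata$SampleType)

toy_metadata
```

Note that the positive control is FALSE here, matching the example metadata above, where only the negative controls are flagged.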

Next, let’s run the decontam_wrapper() on the example data we’ve generated so far:

decontam_wrapper(asvtable = asvtable, taxa = taxa, metadata = metadata, logfile = FALSE)
#> [1] "CLASS OF METADATA IS data.frame"
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Removed 3 samples with zero total counts (or frequency).
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Some batches have very few (<=4) samples.
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Removed 3 samples with zero total counts (or frequency).
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Some batches have very few (<=4) samples.
#> [1] "No contaminants were detected. Exiting function and returning your original ASV and taxa tables."

You’ll see several messages, including one at the bottom that tells us “No contaminants were detected. Exiting function and returning your original ASV and taxa tables.” With real data with proper controls, you should not expect to ever see this message. Our example data set was just too stupidly small for decontam to work properly!

So that I can demonstrate decontam actually doing something, we’ll deliberately “contaminate” the ASV table with the contaminate() command. All we’re doing is artificially adding counts to one of the ASVs, and throwing in a few more negative controls for good measure. This data is now even more fake and made up than it was previously, so don’t use it for anything other than learning how this R package works! The command contaminate() will also simultaneously create a matched metadata object containing information on the additional made up negative controls.

contaminated_asvtable <- converter(contaminate()$asvtable)
contaminated_taxa <- dada2_taxa(asvtable = contaminated_asvtable, train = train, species = species) # we don't actually have to re-run this command with this specific example, but it's good practice to always ensure your asvtable and taxa tables match.
#> [1] "CAUTION: You're using the provided micro4R EXAMPLE reference databases. These are extremely tiny and unrealistic and meant only for testing and demonstration purposes. DO NOT use them with your real data."
#> Finished processing reference fasta.
contaminated_metadata <- contaminate()$metadata
decontaminated <- decontam_wrapper(asvtable = contaminated_asvtable, taxa = contaminated_taxa, metadata = contaminated_metadata, logfile = FALSE)
#> [1] "CLASS OF METADATA IS data.frame"
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Removed 1 samples with zero total counts (or frequency).
#> Warning in .is_contaminant(seqtab, conc = conc, neg = neg, method = method, :
#> Removed 1 samples with zero total counts (or frequency).

The object “decontaminated” contains a list of the… decontaminated… ASV table, taxa table, and for our convenience also includes the metadata table.

nrow(contaminated_taxa) # number of rows corresponds to number of ASVs
#> [1] 6
nrow(decontaminated$taxa)
#> [1] 5

ncol(contaminated_asvtable) # number of columns corresponds to number of ASVs
#> [1] 6
ncol(decontaminated$asvtable)
#> [1] 5

The decontaminated versions of both the ASV and taxa tables have one fewer ASV than the originals, because I artificially added a bunch of counts to one of the ASVs in the contaminated version of the ASV table. decontam rightfully thinks that ASV is likely a background contaminant due to its prevalence in the negative controls, and has now removed it, to make our data cleaner and more reliable. Why is this so important? Negative controls are especially important with 16S data due to the nature of the process: 1) bacteria and their DNA are ubiquitous and can live even in environments hostile to most other life, 2) the PCR protocol deliberately enriches for all bacteria in a semi-universal way. This means the data can be extremely susceptible to contamination. The details of this are out of the scope of this document, but your FIRST step should be improving your lab methods to reduce contamination potential as much as possible.

However, no matter how amazing you (or your colleagues) are in the lab, you’ll probably still have at least some contamination. That’s where the negative controls come in. I would recommend having at least one negative extraction control (e.g., “extract” DNA from some ultraclean water or sample buffer) per extraction batch, and a PCR negative control for every PCR master mix batch.
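To see exactly which ASV was dropped during decontamination, one option is to compare the column names of the table before and after. This base R sketch uses two made-up matrices with hypothetical ASV names in place of the real objects.

```r
# Toy "before" table with three ASVs (values do not matter here).
before <- matrix(
  0, nrow = 2, ncol = 3,
  dimnames = list(c("s1", "s2"), c("ASV_a", "ASV_b", "ASV_c"))
)

# Toy "after" table in which ASV_b has been removed.
after <- before[, c("ASV_a", "ASV_c"), drop = FALSE]

# ASVs present before decontamination but absent afterwards.
removed <- setdiff(colnames(before), colnames(after))
removed

# With the objects from this walkthrough, the comparison would be:
# setdiff(colnames(contaminated_asvtable), colnames(decontaminated$asvtable))
```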

Step 6: Sanity Check

checkAll(asvtable = asvtable, taxa = taxa, metadata = metadata)
#> [1] "As least 1 NA or empty cell was detected in 3 sample(s) in your metadata object. This is not necessarily bad or wrong, but if you were not expecting this, check your metadata object again. Sample(s) SAMPLED_5080-MS-1_307-ATAGTACC-ACGTCTCG_S307_L001, SAMPLED_5080-MS-1_313-GACATAGT-TCGACGAG_S313_L001, SAMPLED_5348-MS-1_381-TGCTCGTA-GTCAGATA_S381_L001 were detected to have NAs or empty cells."
#> [1] "No errors or warnings identified."
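If you would like to track down NA or empty cells yourself, the following base R sketch shows one way to find the affected samples. The metadata table is a made-up toy, and this is only an approximation of the kind of check reported above, not the actual code inside checkAll().

```r
# Toy metadata with one empty string and some NAs (all values made up).
toy_metadata <- data.frame(
  SampleID = c("s1", "s2", "ctrl_neg"),
  host_age = c(33L, 25L, NA),
  host_sex = c("female", "", NA),
  neg      = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell is NA or an empty string.
blank <- is.na(toy_metadata) | toy_metadata == ""

# Samples with at least one NA or empty cell.
flagged <- toy_metadata$SampleID[rowSums(blank) > 0]
flagged

# Number of NA or empty cells in each column.
colSums(blank)
```

As in the message above, NAs in control rows are often expected (a negative control has no host age), so a flagged sample is a prompt to look, not necessarily an error.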

Step 7: Quality Check

Now that we’re done with the basic sanity checks, we can do a more interesting quality check of our data.

As mentioned numerous times, lab positive and negative controls are VERY important and should be included at minimum in every single sequencing run. If you haven’t included either of these, you won’t be able to do the full quality checks here.
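One simple check along these lines is to compare sequencing depth across sample types, since you would typically hope that negative controls yield far fewer reads than real samples. The base R sketch below works on made-up counts with hypothetical sample type labels; it is one possible check, not the package’s own quality-control routine.

```r
# Toy counts: three real samples, one negative and one positive control.
toy_counts <- matrix(
  c(500,  20,   0,
    450,  60,   5,
    300, 100,  10,
      0,   3,   1,
      0,   0, 800),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("samp1", "samp2", "samp3", "neg1", "pos1"),
    c("ASV_a", "ASV_b", "ASV_c")
  )
)
sample_type <- c("sample", "sample", "sample",
                 "negative control", "positive control")

# Total reads per sample, then the median depth within each sample type.
depth <- rowSums(toy_counts)
depth_by_type <- tapply(depth, sample_type, median)

depth_by_type
```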

Step 8: Alpha diversity

Step 9: Beta diversity

Step 10: Additional analysis

Move information to the bottom for anyone who wants more details

The example data are subsampled FASTQ files from a manuscript I co-authored with my colleagues, for which the raw data is publicly available under BioProject ID PRJNA726992. For seven samples from this study, I used seqtk to randomly sample only 50 reads from each FASTQ file so that the files would take up minimal space and the example would run quickly.
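The example files themselves were subsampled with seqtk, as described above. Purely to illustrate what random subsampling of a FASTQ file means, here is a base R sketch on a tiny made-up file; it is not the command that was used, and the seed and read count are arbitrary choices for the demonstration.

```r
# Write a tiny made-up FASTQ file with 10 identical-looking reads.
fq_path <- tempfile(fileext = ".fastq")
n_reads <- 10
toy_fastq <- as.vector(rbind(
  paste0("@read", seq_len(n_reads)),  # identifier line
  rep("TACGTAGG", n_reads),           # sequence line
  rep("+", n_reads),                  # separator line
  rep("IIIIIIII", n_reads)            # quality line
))
writeLines(toy_fastq, fq_path)

# Read it back and group the lines into 4-line records.
fq_lines <- readLines(fq_path)
stopifnot(length(fq_lines) %% 4 == 0)
records <- split(fq_lines, rep(seq_len(length(fq_lines) / 4), each = 4))

# Randomly keep 3 records; fixing the seed makes the choice reproducible.
set.seed(100)
keep <- sort(sample(length(records), size = 3))
subsampled <- unlist(records[keep], use.names = FALSE)

length(subsampled) / 4  # number of reads kept
```

For paired-end data, the same reads must be kept in the forward and reverse files, which is why subsampling tools are usually run with the same seed on both files.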

If you’d like to run through this with the full FASTQ files, they can be downloaded from SRA or as a zipped bolus here.

logo made by me using artwork from Canva (©iconbunny11) followed by hexSticker to get it into the typical hex logo format.

Not going to keep this on the readme, but want to hold onto the logo code until I put it somewhere else.
library(hexSticker)
# image_path: path to the logo artwork file (not defined in this snippet)
s <- sticker(image_path, package = "micro4R", p_size = 15, p_family = "Comfortaa", p_fontface = "bold", p_y = 1.5, s_x = 1, s_y = .75, s_width = .5, s_height = .5, p_color = "black", h_fill = "#6ed5f5", h_color = "#16bc93", h_size = 2, filename = "inst/figures/imgfile.png")

built R 4.5.1
RStudio Version 2025.05.1+513 (2025.05.1+513) macOS Sequoia Version 15.6.1

Acknowledgements

As mentioned above, I could not have done any of this without the benefit of the work of others. The tools below were especially important, as they were extremely influential in helping me learn how to do these analyses and/or are directly used here:

mothur
qiime
mgsat
maaslin2/3
vegan
tidyverse, especially dplyr and ggplot2. In addition to these packages, Hadley Wickham and team have written several extremely helpful guides and tutorials on data science.
dada2 and decontam
Suite from Dr. Frank Harrell, especially rms and Hmisc.
Colleague and physician-scientist Dr. Christian Rosas-Salazar is talented at many things, but has an especial knack for creating figures that are both beautiful and informative.

More acknowledgements and more details to be added later