Skip to contents

Microbiome data analysis tools for R

The goal of micro4R was to create an R package with a low barrier to entry. I started my career in microbiome research at the bench and had to ELI5 to myself how to process and analyze “big data”. I’ve spent a ton of time poring over and experimenting with [others’ code]UPDATE URL, and want to pass it on. # micro4R micro4R website

Likely, the ideal candidate to benefit from micro4R would be a bench scientist without much formal statistics or bioinformatics training. Fair warning, if you already have a strong stats/informatics background, this may not be of much use for you!

This package does not create any brand new functionality and was built on the great work of [others]UPDATE URL. Much of what it does can be accomplished with other packages, such as phyloseq, QIIME 2, and MicrobiomeAnalyst. One of these may be better for your purposes, and I’d encourage anyone new to the field to explore multiple tools.

Installation

All you really need is R, but I’d recommend also downloading and working in RStudio. If you’re a true newbie to R, there’s tons of free content to help you learn the basics.

Once you’re set in R/RStudio, you can install the development version of micro4R like so:

# install.packages("pak")
pak::pak("mshilts1/micro4R")

Example

I’ll run through the smallest and simplest possible use case below. For more detailed help and documentation, please explore the vignettes (TBA).

Included with the package is an extremely tiny toy example to demonstrate its major functionality, using subsampled publicly available data.

Making the ASV count and taxonomy tables

The first thing we’ll do on these files is run dada2_asvtable(), which is essentially a wrapper to generate an ASV count table by following a workflow similar to the dada2 tutorial.

This function can take a number of arguments, but the most important one is ‘where’, which is the path to the folder where your FASTQ files are located.

For demonstration purposes, it’s been set to the relative path of the the example FASTQ files that are included with the package:

library(micro4R)
#> This is version 0.0.0.9000 of micro4R. CAUTION: This is package is under active development and its functions may change at any time, without warning! Please visit https://github.com/mshilts1/micro4R to see recent changes.

asvtable <- dada2_asvtable(where = "inst/extdata/f", chatty = FALSE, logfile = FALSE)
#> Creating output directory: /var/folders/pp/15rq6p297j18gk2xt39kdmm40000gp/T//RtmpZSgkSJ/dada2_out/filtered
#> 59520 total bases in 248 reads from 7 samples will be used for learning the error rates.
#> 49600 total bases in 248 reads from 7 samples will be used for learning the error rates.

If you’re running this with your own data, set ‘where’ to the path of the folder where your FASTQ files are stored. If you leave it empty (e.g., run dada2_asvtable()), it will default to your current working directory. (‘chatty’ was set to FALSE because tons of information gets printed to the console otherwise; I’d recommend setting it to TRUE (the default) when you’re processing data for real, as the information is useful, but just too much here.)

Let’s take a quick look at what this asvtable looks like (using the tibble::as_tibble() function so it prints more nicely):

tibble::as_tibble(asvtable, rownames = "SampleID")
#> # A tibble: 7 × 7
#>   SampleID  TACGTAGGTGGCAAGCGTTA…¹ TACGGAGGGTGCAAGCGTTA…² TACGTAGGGTGCGAGCGTTG…³
#>   <chr>                      <int>                  <int>                  <int>
#> 1 SAMPLED_…                      0                      0                      0
#> 2 SAMPLED_…                      0                      0                      0
#> 3 SAMPLED_…                     44                      0                      0
#> 4 SAMPLED_…                     24                      0                      0
#> 5 SAMPLED_…                      0                      0                     12
#> 6 SAMPLED_…                      0                     35                      6
#> 7 SAMPLED_…                      0                      0                      0
#> # ℹ abbreviated names:
#> #   ¹​TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG,
#> #   ²​TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG,
#> #   ³​TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCTGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGG
#> # ℹ 3 more variables:
#> #   TACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCACGTCGTCTGTGAAATTCCACAGCTTAACTGTGGGCGTGCAGGCGATACGGGCTGACTTGAGTACTGTAGGGGTAACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTTACTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCAAACAGG <int>,
#> #   TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG <int>, …

We now have an ASV table, which is a bunch of literal DNA sequences and the count of each that was detected for each sample. Since you’re (probably) not a computer, a string of hundreds of nucletotides is likely not something you can make much sense of by yourself. The next step will take those nucleotide sequences and compare them against a database (or two) of sequences with known taxonomy:

train <- "inst/extdata/db/EXAMPLE_silva_nr99_v138.2_toGenus_trainset.fa.gz" # set training database
species <- "inst/extdata/db/EXAMPLE_silva_v138.2_assignSpecies.fa.gz" # set species database

taxa <- dada2_taxa(asvtable = asvtable, train = train, species = species, chatty = FALSE)

There are two databases that we’re using for taxonomic assignment here:
1. ‘train’ needs to be the path to whatever database you’d like to use as the “training set of reference sequences with known taxonomy”.
2. ‘species’ is OPTIONAL. If you’d like to use this option, provide the path to a specifically formatted species assignment database. (Read more here.)

The two databases used in the example here are comically small and artificial subsamples of the real SILVA databases, and should only ever be used for testing and demonstration purposes! You’ll definitely need to download and use the real databases for your actual data!

There are many options for taxonomic databases you can use; the major players are SILVA, RDP, GreenGenes, and UNITE. Please go here for details and links. I tend to usually use the SILVA databases, but you don’t have to.

Let’s take a look at the taxonomy assignment table:

#> # A tibble: 6 × 8
#>   ASV                            Kingdom Phylum Class Order Family Genus Species
#>   <chr>                          <chr>   <chr>  <chr> <chr> <chr>  <chr> <chr>  
#> 1 TACGTAGGTGGCAAGCGTTATCCGGAATT… Bacter… Bacil… Baci… Stap… Staph… Stap… <NA>   
#> 2 TACGGAGGGTGCAAGCGTTAATCGGAATT… Bacter… Pseud… Gamm… Ente… Enter… Kleb… <NA>   
#> 3 TACGTAGGGTGCGAGCGTTGTCCGGAATT… Bacter… Actin… Acti… Myco… Coryn… Cory… <NA>   
#> 4 TACGTAGGGTGCAAGCGTTGTCCGGAATT… Bacter… Actin… Acti… Myco… Coryn… Cory… <NA>   
#> 5 TACGTAGGTGACAAGCGTTGTCCGGATTT… Bacter… Bacil… Baci… Lact… Carno… Dolo… pigrum 
#> 6 TACGTAGGTCCCGAGCGTTGTCCGGATTT… Bacter… Bacil… Baci… Lact… Strep… Stre… <NA>

Sample metadata

Next, we need to load in some metadata about our samples.

metadata <- example_metadata()

metadata
#> # A tibble: 7 × 7
#>   SampleID                 LabID SampleType host_age host_sex Host_disease neg  
#>   <chr>                    <chr> <chr>         <int> <chr>    <chr>        <lgl>
#> 1 SAMPLED_5080-MS-1_328-G… part… Homo sapi…       33 female   healthy      FALSE
#> 2 SAMPLED_5080-MS-1_339-A… part… Homo sapi…       25 male     healthy      FALSE
#> 3 SAMPLED_5348-MS-1_162-A… part… Homo sapi…       27 male     COVID-19     FALSE
#> 4 SAMPLED_5348-MS-1_297-G… part… Homo sapi…       26 female   COVID-19     FALSE
#> 5 SAMPLED_5080-MS-1_307-A… CTRL… negative …       NA <NA>     <NA>         TRUE 
#> 6 SAMPLED_5080-MS-1_313-G… CTRL… negative …       NA <NA>     <NA>         TRUE 
#> 7 SAMPLED_5348-MS-1_381-T… CTRL… positive …       NA <NA>     <NA>         FALSE

The first thing you may notice is the ‘SampleIDs’ are the kinds of IDs that only a computer could love. For my standard workflow, I like to keep the SampleIDs as the FASTQ file names for full transparency and so I can always easy go back and track down the original FASTQ ID.

What kind of information you’ll need here is highly dependent on your study, but there’s some data that we need to have for the optional (but highly recommended!) processing of your ASV table through decontam. Most importantly, decontam needs to know which samples are your negative controls. Don’t have any negative controls? You won’t be able to run decontam, and I highly recommend you include some next time!


Move information to the bottom for anyone who wants more details

subsampled FASTQ files from a manuscript I co-authored with my colleagues, for which the raw data is publicly available under bioproject ID PRJNA726992. From seven samples from this study, using seqtk, I randomly sampled only 50 reads from each FASTQ file so that the files would take up minimal space and the example would run quickly.

If you’d like to run through this the full fastq files can be downloaded from SRA or as a zipped bolus here

Here is some body text that needs a footnote.1

Not going to keep this on the readme, but want to hold onto the logo code until I put it somewhere else.
s <- sticker(image_path, package=“micro4R”, p_size=15, p_family = “Comfortaa”, p_fontface = “bold”, p_y = 1.5, s_x=1, s_y=.75, s_width=.5, s_height = .5, p_color = “black”, h_fill = “#6ed5f5”, h_color= “#16bc93”, h_size = 2, filename=“inst/figures/imgfile.png”).

built R 4.5.1
RStudio Version 2025.05.1+513 (2025.05.1+513) macOS Sequoia Version 15.6.1