Introduction to the tmzML file type

Warning:

This vignette deals with a feature still undergoing active development. While I can’t promise to update it immediately with changes, I will do my best to keep you in the loop as this functionality evolves. Please also see the speed & size comparison vignette for more stable alternatives.

What is tmzML?

A tmzML document is a new type of mass-spectrometry file that’s optimized for extracting and visualizing extracted ion chromatograms (EICs). These were introduced in version 1.2.0 of RaMS and aren’t used anywhere else. Internally, they’re just XML documents containing the same MS data as the mzML file they’re created from, but reorganized so that accessing data by m/z value is much faster and requires far less memory.

The name stands for “transposed mzML” and is inspired by the fact that mzML documents are organized by scan number. This makes sense given that the original MS data was produced one scan at a time, but it also makes EIC extraction from the original files slower because there’s no indexing by m/z. What RaMS does here is “transpose” the original document so that data are grouped by m/z value instead of by scan. This means that to extract a chromatogram, only a small portion of the file needs to be read instead of the whole thing.
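To make the idea concrete, here is a minimal base-R sketch of the regrouping, using made-up peaks. It is only an illustration of the concept and not the RaMS implementation; the data, the `bin_width` value and the object names are all assumptions chosen for the example.

```r
# Toy scan-organized data: one row per peak, in the order scans were acquired
scans <- data.frame(
  scan = c(1, 1, 1, 2, 2, 2),
  rt   = c(0.5, 0.5, 0.5, 1.0, 1.0, 1.0),
  mz   = c(90.05, 152.057, 301.2, 90.06, 152.058, 301.1),
  int  = c(1000, 5000, 200, 1100, 5200, 210)
)

# "Transpose": group peaks by an m/z bin instead of by scan
bin_width <- 3  # illustrative bin width in Da (an assumption for this sketch)
scans$bin <- floor(scans$mz / bin_width) * bin_width
by_mz <- split(scans[c("rt", "mz", "int")], scans$bin)

# Extracting a chromatogram near m/z 152.057 now touches a single bin
target_bin <- as.character(floor(152.057 / bin_width) * bin_width)
eic <- by_mz[[target_bin]]
eic
```

In the scan-organized table, every row has to be inspected to find the peaks near m/z 152.057; after regrouping, only the one bin that contains them is read.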

Why tmzML?

RaMS is great because it supplies rapid, intuitive access to mzML documents. However, this access comes at a cost - the entire file needs to be loaded into R’s memory at once. Given that MS files can easily be tens or hundreds of megabytes in size, the number of files that can be open simultaneously is limited by the computer’s available memory.

With tmzMLs, much of the “hard” work is done in advance by reorganizing the structure of the file on disk. This allows us to read only the essential data into memory, vastly improving initial read time and enormously reducing the amount of memory required. This makes it possible to open hundreds or even thousands of tmzML documents simultaneously and extract a chromatogram from each one of them, all on a basic laptop and within a minute. Below are some benchmarking tests comparing the new tmzML file type to the traditional RaMS code and MSnbase’s MSnExp and OnDiskMSnExp.

Alt text: Speed comparison between various load methods, showing the new tmzML type to be faster than other methods for most functions and requiring several orders of magnitude less memory

The tmzML documents take essentially no time to load and require barely kilobytes of memory. This is because we aren’t doing any data handling yet - we’re just creating an R object that points to our files and manages the provided parameters elegantly. When we actually request the data, the tmzML file is opened, scanned, and read into memory. This is a little slower than requesting the data the traditional way because data from mzML files read into RaMS is already stored in memory, but it still takes fractions of a second and is an order of magnitude faster than the MSnbase functions.

To the user, the end result is almost identical: a list of data.tables named by what the user requested (MS1, MS2, BPC, EIC, etc.) in tidy format. This means all your favorite tidyverse tricks and the code you’ve written to handle the old RaMS output will work identically. However, the full data file is never read into memory so there are some quirks that I discuss below.

Getting started with tmzML documents

First, mzML/mzXML documents must be converted to tmzML. Again, these files might feel similar but they’re very different internally. We’ll use three of the files packaged with RaMS and convert them into a new folder.

files_to_convert <- list.files(
  system.file("extdata", package = "RaMS"), full.names = TRUE, pattern = "mzML"
)[2:4]

The function tmzmlMaker is the workhorse here. It’s not vectorized, so you’ll need to loop over multiple files if you’ve got them. And while it has some basic renaming logic internally, it’s safest to supply each output filename explicitly (just remember to end them with the file extension .tmzML!). My usual project setup has a folder for the mzML files of the project, so creating a similar tmzML folder has worked well for me.

library(RaMS)
# Create a folder to hold the new files
dir.create("tmzMLs")

# Convert a single file
file_to_create <- "tmzMLs/LB12HL_AB.tmzML"
tmzmlMaker(input_filename = files_to_convert[1], output_filename = file_to_create)
#> [1] "tmzMLs/LB12HL_AB.tmzML"

# Convert multiple files
files_to_create <- paste0("tmzMLs/", basename(files_to_convert))
# Make sure they end in .tmzML!
files_to_create <- gsub(x = files_to_create, "\\.mzML.*", ".tmzML")
# Loop over each file input/output pair
created_files <- mapply(tmzmlMaker, files_to_convert, files_to_create)

These files should now exist in the folder and can be opened with Notepad or a similar text editor if you’re curious, but their contents are mostly encoded so they may not be very informative.
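If you convert files often, the renaming step can be wrapped in a small helper. The sketch below is a hypothetical convenience function (`tmzml_names()` is not part of RaMS) that uses only base R; the example input paths are made up.

```r
# Hypothetical helper (not part of RaMS): build .tmzML output paths from inputs
tmzml_names <- function(input_files, out_dir = "tmzMLs") {
  # Strip .mzML / .mzXML and anything after it (such as .gz)
  stems <- sub("\\.mz(X)?ML.*$", "", basename(input_files))
  # Put the new names in the output folder with the .tmzML extension
  file.path(out_dir, paste0(stems, ".tmzML"))
}

# Made-up input paths, just to show the mapping
example_inputs <- c("raw/LB12HL_AB.mzML.gz", "raw/sample_02.mzXML")
outputs <- tmzml_names(example_inputs)
outputs
```

After running the conversion, `file.exists(outputs)` is a quick way to confirm that every expected file was written.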

Extracting a chromatogram from a tmzML document should feel very familiar to anyone already used to RaMS. grabMSdata can tell from the filename that it’s dealing with .tmzML files instead of .mzML files, so you can reuse the exact same function as before. However, unlike when loading mzMLs, this step should be instantaneous.

msdata <- grabMSdata(created_files, verbosity=0)

If you’re planning to extract a chromatogram, the syntax is identical to traditional RaMS:

ms_data_table <- msdata$MS1[mz%between%pmppm(152.05723, 5)]

Which is a perfectly normal data.table object ready to be passed to more familiar functions:

library(ggplot2)
ggplot(ms_data_table) + geom_line(aes(x=rt, y=int, color=filename)) + xlim(8, 9.5)

library(dplyr)
ms_data_table %>%
  filter(rt%between%c(8.4, 8.9)) %>%
  group_by(filename) %>%
  summarise(area=sum(int))
#> # A tibble: 3 × 2
#>   filename             area
#>   <chr>               <dbl>
#> 1 LB12HL_AB.tmzML 13057831.
#> 2 LB12HL_CD.tmzML 24925878.
#> 3 LB12HL_EF.tmzML 21913457.
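For a sense of how narrow the m/z window in the extraction above is, the arithmetic can be reproduced in base R. This sketch assumes that pmppm(mass, ppm) returns a symmetric window of mass ± ppm parts per million; `ppm_window()` is a hypothetical stand-in written for illustration, not the RaMS function itself.

```r
# Hypothetical stand-in for illustration; assumes a symmetric ppm window
ppm_window <- function(mass, ppm = 5) {
  c(mass * (1 - ppm / 1e6), mass * (1 + ppm / 1e6))
}

win <- ppm_window(152.05723, 5)
win        # lower and upper m/z bounds
diff(win)  # total window width in Da
```

Under that assumption, a 5 ppm tolerance at m/z 152.05723 gives a window roughly 0.0015 Da wide, which is why only a tiny slice of each file needs to be read.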

Why not tmzML?

There are some quirks to the tmzML object itself. If you’re used to inspecting the initial RaMS output by printing it to the console, you’ll get a new message instead of the expected list of data tables:

print(msdata)
#> Hey, I'm not actually an object, sorry!
#> But you can pretend I'm a list containing data.tables:
#> MS1; MS2
#> from the following files:
#> tmzMLs/LB12HL_AB.tmzML
#> tmzMLs/LB12HL_CD.tmzML
#> tmzMLs/LB12HL_EF.tmzML
#> and access the data inside with $ and [ subsetting

This is because the msdata object in this case isn’t a list of data tables anymore - it’s a custom object that really only contains the names of the files and the settings you initially provided to grabMSdata. You can see this with str():

str(msdata)
#> List of 3
#>  $ MS1       : NULL
#>  $ MS2       : NULL
#>  $ connection:List of 3
#>   ..$ files    : Named chr [1:3] "tmzMLs/LB12HL_AB.tmzML" "tmzMLs/LB12HL_CD.tmzML" "tmzMLs/LB12HL_EF.tmzML"
#>   .. ..- attr(*, "names")= chr [1:3] "/tmp/RtmpswqVrA/Rinst19212da235c1/RaMS/extdata/LB12HL_AB.mzML.gz" "/tmp/RtmpswqVrA/Rinst19212da235c1/RaMS/extdata/LB12HL_CD.mzML.gz" "/tmp/RtmpswqVrA/Rinst19212da235c1/RaMS/extdata/LB12HL_EF.mzML.gz"
#>   ..$ grab_what: chr [1:2] "MS1" "MS2"
#>   ..$ verbosity: num 0
#>  - attr(*, "class")= chr "msdata_connection"

Eagle-eyed browsers may have already noticed that the msdata object is a little different than before because there’s this new connection option that shows up when tab-completing the extraction code above. The MS1 and MS2 objects here will actually always be NULL - I just added them so that tab-completion will remind you what things you can ask msdata for via $ notation. If you see the message above, it means that you’ve asked a tmzML for something that you should ask with traditional RaMS instead because it’s outside the scope of the tmzML optimization.

The $ and [ syntax provided above is really the only thing that will work with tmzMLs. Any request that’s meant to operate on the data as a whole will fail and subsetting with head() or tail() won’t work either. Also, subsetting based on retention time needs to be done separately because these poor files have been indexed within an inch of their lives for a single purpose.

# Cannot order data by mass
msdata$MS1[-mz]
#> Error in isub[[3]]: subscript out of bounds
# Cannot request the top few rows
msdata[1:10,]
#> Error in `[.msdata_connection`(msdata, 1:10, ): unused argument (alist())
# Cannot subset by retention time
msdata$MS1[rt %between% c(8, 9.5)]
#> Error in `[.msdata_connection`(msdata$MS1, rt %between% c(8, 9.5)): tmzML documents currently only support subsetting by mz!

In general, if you get an error mentioning the msdata_connection object you should probably be using traditional RaMS on mzMLs instead of the new tmzMLs.
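One way around these limits is to work in two steps: extract by m/z first, then treat the result as an ordinary table in memory. The sketch below uses a small made-up data.frame as a stand-in for the table returned by the m/z extraction, so it runs without any MS files; the values are invented for illustration.

```r
# Step 1 (stand-in): in real use this table would come from something like
# msdata$MS1[mz %between% pmppm(152.05723, 5)]; the values here are made up
eic <- data.frame(
  rt       = c(7.9, 8.2, 8.6, 9.1, 9.8),
  mz       = c(152.0571, 152.0572, 152.0573, 152.0572, 152.0574),
  int      = c(100, 5000, 90000, 7000, 150),
  filename = "LB12HL_AB.tmzML"
)

# Step 2: the data are now in memory, so ordinary subsetting works
eic_window <- eic[eic$rt >= 8 & eic$rt <= 9.5, ]  # subset by retention time
eic_sorted <- eic[order(-eic$mz), ]               # order by mass, descending
head(eic_window, 2)                               # top few rows
```

The extracted table is small, so these follow-up operations are cheap even when many files are involved.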

tmzML internals

If you haven’t played with tmzMLs yet, please consider doing so now. The remainder of this vignette is geared towards useRs who have already seen what tmzMLs do and are interested in some of the magic behind them.

RaMS works on tmzMLs by defining a new S3 object class, the msdata_connection. This allowed me to define new behavior for $ and [ so that the syntax is identical to the traditional RaMS arrangement, at the cost of some complicated internal mapping. The $ remapping can be viewed by printing the S3 method (for example with getS3method("$", "msdata_connection")), but it is really just a basic way to pass a single MS level at a time to the [ function. The [ function, on the other hand, is where the magic happens. You can view its source the same way (getS3method("[", "msdata_connection")), but what basically happens is that we manually parse the subset expression using substitute and eval.parent, then open the file and access only those XML nodes that have m/z values of interest. This returns a data table containing all the m/z values in a large bin (defaulting to 3 Da) that’s then subset further down to just the requested values.
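The general technique can be sketched with a toy S3 class in base R. This is not the RaMS source: the class name `toy_connection`, the constructor and the in-memory `toy_store` (standing in for files on disk) are all assumptions made for illustration, and the sketch uses plain comparisons rather than %between% so that it needs no packages.

```r
# Toy stand-in for data on disk: one data.frame per MS level
toy_store <- list(
  MS1 = data.frame(rt  = c(1, 1, 2, 2),
                   mz  = c(100.1, 152.057, 100.1, 152.058),
                   int = c(10, 200, 12, 250))
)

# Hypothetical constructor: the object holds a pointer to the data, not the data
new_toy_connection <- function(store) {
  structure(list(MS1 = NULL, store = store), class = "toy_connection")
}

# `$` only records which MS level was requested and stays lazy
`$.toy_connection` <- function(x, name) {
  obj <- unclass(x)  # drop the class to avoid recursive dispatch
  obj$level <- name
  structure(obj, class = "toy_connection")
}

# `[` captures the subset expression unevaluated, then "reads" and filters
`[.toy_connection` <- function(x, i) {
  isub <- substitute(i)              # e.g. mz > 152.05 & mz < 152.06
  obj <- unclass(x)
  tbl <- obj$store[[obj$level]]      # data are only touched at this point
  keep <- eval(isub, tbl, parent.frame())
  tbl[keep, , drop = FALSE]
}

conn <- new_toy_connection(toy_store)
res <- conn$MS1[mz > 152.05 & mz < 152.06]
res
```

Because `[` is dispatched on its first argument only, the subset expression reaches the method unevaluated, which is what lets the method inspect the requested m/z range before deciding which part of the data to read.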

I’m still deciding whether XML documents are the way to go. Using XML means that I can expand the format effortlessly, which essentially future-proofs it for additional functionality that I’d like to add later, but it does require that the whole file is read in at a high level. Something simpler like a pure-text document could have a header that’s read in separately from the rest of the document and contains pointers to the data. Preliminary tests suggest this plain-text method is 5-10 times faster than the XML style, but tmzMLs are already plenty fast so we may be approaching diminishing returns.

unlink("tmzMLs", recursive = TRUE)

Vignette last built on 2024-11-08