The talusR package provides a workflow for Talus mass-spectrum data—import, QC, differential analysis, and visualization. But it assumes that your data have already been processed by DIA-NN (e.g. *.tsv or *.csv outputs containing normalized, protein-level intensity values). It further expects one sample per mass-spectrometer run (with DIA-NN normalization applied across all runs), and that your sample metadata table includes a run column whose entries exactly match the sample names in the intensity signal file.
Key features include:
Data import & structure
Import DIA-NN–processed protein intensities into S4 objects:
TalusDataSet for a whole datasetTalusDataSetList (a SimpleList of TalusDataSet) for per-fraction splitsEach object carries sample metadata and feature annotations.
Quality control: Run PCA or other multivariate checks to confirm that technical replicates and fractions cluster as expected.
Differential analysis: use a simple wrapper around matrixTests to t-test and limma to fit linear models, define contrasts, apply independent filtering by adjusted p-value or log-fold change thresholds, and return results in TalusResult or TalusResultList objects.
Visualization: Create PCA scatterplots, volcano plots, per-protein intensity plots.
Bioconductor integration: Built on SummarizedExperiment and S4Vectors, talusR classes inherit the generics and methods of the Bioconductor ecosystem, for convenient plug into annotation, pathway analysis, or reporting tools.
# if you've installed the Bioconductor dependencies, then
devtools::install_github('chaochaowong/talusR',
build_vignettes = TRUE)
# if haven't installed the Bioconductor dependencies, then
BiocManager::install("chaochaowong/talusR",
dependencies = TRUE
build_vignettes = TRUE
)
library(talusR)
library(ggplot2)
library(dplyr)
The core function read_talus() imports protein-level intensity signals for whole dataset or, if split_by_fraction = TRUE, splits by fractions.
When split_by_fraction = FALSE, read_talus() returns a single TalusDataSet instance. This mode is ideal when:
# get paths
protein_file <- system.file('extdata', 'test_data.tsv',
package = 'talusR')
meta_file <- system.file('extdata', 'test_meta.csv',
package = 'talusR')
# without split by fraction -> return TalusDataSet
tds <- talusR::read_talus(file = protein_file,
meta_file = meta_file,
which_proteinid = "Protein.Ids",
which_run = "Run",
remove_few_measurements = TRUE,
split_by_fraction = FALSE)
tds
#> class: TalusDataSet
#> dim: 1500 36
#> metadata(3): intensity_group metric log_transform
#> assays(1): abundance
#> rownames(1500): P82675;P82675-2 Q6ZSR9 ... Q5JTZ9
#> Q9H078;Q9H078-2;Q9H078-3
#> rowData names(5): Protein.Group Protein.Ids Protein.Names Genes
#> First.Protein.Description
#> colnames(36): 240805_Sarthy_DMSO_chrom_03.d 240805_Sarthy_DMSO_nuc_05.d
#> ... 240805_Sarthy_DMSO_cyto_06.d 240805_Sarthy_Etopo_cyto_02.d
#> colData names(3): Run Tx Frx
#> intensity_group: protein
#> metric: DIA-NN
#> log_transform: log2
When split_by_fraction = TRUE, read_talus() returns a TalusDataSetList instance—a SimpleList of TalusDataSet instances, one per fraction, each holding its protein-level signals.
tdsl <- read_talus(file = protein_file,
meta_file = meta_file,
which_proteinid = "Protein.Ids",
which_fraction = "Frx",
which_sequence = NA,
which_run = "Run",
remove_few_measurements = TRUE,
split_by_fraction = TRUE,
log_transform = 'log2',
intensity_group = "protein",
metric = "DIA-NN")
tdsl
#> TalusDataSetList of length 3
#> names(3): chrom cyto nuc
#>
#> [[1]] — name: chrom
#> class: TalusDataSet
#> dim: 1269 12
#> metadata(3): intensity_group metric log_transform
#> assays(1): abundance
#> rownames(1269): P82675;P82675-2 Q6ZSR9 ... Q5JTZ9
#> Q9H078;Q9H078-2;Q9H078-3
#> rowData names(5): Protein.Group Protein.Ids Protein.Names Genes
#> First.Protein.Description
#> colnames(12): 240805_Sarthy_DMSO_chrom_03.d
#> 240805_Sarthy_DMSO_chrom_01.d ... 240805_Sarthy_Etopo_chrom_02.d
#> 240805_Sarthy_DMSO_chrom_04.d
#> colData names(3): Run Tx Frx
#> intensity_group: protein
#> metric: DIA-NN
#> log_transform: log2
#>
#> [[2]] — name: cyto
#> class: TalusDataSet
#> dim: 1435 12
#> metadata(3): intensity_group metric log_transform
#> assays(1): abundance
#> rownames(1435): P82675;P82675-2 Q6ZSR9 ... Q5JTZ9
#> Q9H078;Q9H078-2;Q9H078-3
#> rowData names(5): Protein.Group Protein.Ids Protein.Names Genes
#> First.Protein.Description
#> colnames(12): 240805_Sarthy_DMSO_cyto_01.d 240805_Sarthy_DMSO_cyto_02.d
#> ... 240805_Sarthy_DMSO_cyto_06.d 240805_Sarthy_Etopo_cyto_02.d
#> colData names(3): Run Tx Frx
#> intensity_group: protein
#> metric: DIA-NN
#> log_transform: log2
#>
#> [[3]] — name: nuc
#> class: TalusDataSet
#> dim: 1226 12
#> metadata(3): intensity_group metric log_transform
#> assays(1): abundance
#> rownames(1226): P82675;P82675-2 Q6ZSR9 ... Q5JTZ9
#> Q9H078;Q9H078-2;Q9H078-3
#> rowData names(5): Protein.Group Protein.Ids Protein.Names Genes
#> First.Protein.Description
#> colnames(12): 240805_Sarthy_DMSO_nuc_05.d 240805_Sarthy_DMSO_nuc_01.d
#> ... 240805_Sarthy_Etopo_nuc_01.d 240805_Sarthy_Etopo_nuc_03.d
#> colData names(3): Run Tx Frx
#> intensity_group: protein
#> metric: DIA-NN
#> log_transform: log2
Use plot_pca() to assess sample clustering and variance structure.
TalusDataSetRun PCA on the full dataset to verify that protein-level signals from each fraction cluster as expected.
talusR::plot_pca(tds, color_by='Frx')
Figure 1: PCA on TalusDataSet
TalusDataSetListWhen using per-fraction splits, perform PCA on each fraction independently:
plot_pca(tdsl, color_by='Tx')
Figure 2: PCA performed separately on each fraction in the TalusDataSetList instance
Core statistical methods:
talus_limma: a wrapper function for limma::lmFit() to fit linear models and test contrasts.talus_row_t_welch: a wrapper function for matrixTests::row_t_welch() to perform per-protein Welch’s t-tests.Summary & visualization:
summary_talus: summarize differential results, flag significant proteins, and return a data.frame of significant proteins.plot_volcano: create a volcano plot of log-fold changes vs adjusted p-values.plot_per_protein: display protein intensity profiles across fractions or conditions.limmaUse talus_limma() to fit linear models with empirical Bayesian and moderate t-statistics across conditions (e.g. drug treatment) to perform differential analysis.
Here the condition group is Tx: DMSO, etopo, and doxo. Set the factor levels so that DMSO is the reference:
tx_levels <- c('DMSO', 'Etopo', 'Doxo')
tdsl <- endoapply(tdsl, function(x) {
colData(x)$Tx <- factor(colData(x)$Tx, levels = tx_levels)
x
})
Next, run talus_limma() with a ~ 0 + Tx design formula to compare each treatment against the DMSO control:
res_limma <- talus_limma(tdsl, design = ~ 0 + Tx)
Finally, summarize and filter for significant proteins with with adjusted p-value < 0.05 and |logâ‚‚ fold change| > 0.5.
limma_filt <- summary_talus(res_limma[['chrom']],
alpha = 0.05,
lfc_threshold = 0.5)
#>
#> TxEtopo_vs_TxDMSO :
#>
#> Threshold: alpha < 0.05; |lfc_threshod| > 0.5
#> Total features: 1269
#> Up-regulated: 1
#> Down-regulated: 1
#>
#>
#> TxDoxo_vs_TxDMSO :
#>
#> Threshold: alpha < 0.05; |lfc_threshod| > 0.5
#> Total features: 1269
#> Up-regulated: 2
#> Down-regulated: 0
limma_filt
#> $TxEtopo_vs_TxDMSO
#> id logFC AveExpr t P.Value
#> 1 P11388;P11388-2;P11388-3;P11388-4 -0.8769478 18.53434 -13.717557 5.507541e-08
#> 2 P27694 0.8630436 16.38714 8.934821 3.400613e-06
#> adj.P.Val B Protein.Group
#> 1 6.614556e-05 8.195588 P11388;P11388-2;P11388-3;P11388-4
#> 2 2.042068e-03 4.853289 P27694
#> Protein.Ids Protein.Names Genes
#> 1 P11388;P11388-2;P11388-3;P11388-4 TOP2A_HUMAN TOP2A
#> 2 P27694 RFA1_HUMAN RPA1
#> First.Protein.Description
#> 1 DNA topoisomerase 2-alpha
#> 2 Replication protein A 70 kDa DNA-binding subunit
#>
#> $TxDoxo_vs_TxDMSO
#> id logFC AveExpr t P.Value adj.P.Val B
#> 1 Q9H8H0 1.298674 14.40518 12.532423 1.339523e-07 0.0001615464 7.586870
#> 2 Q9UMY1 1.006518 14.53278 8.657043 4.554389e-06 0.0027462964 4.613002
#> Protein.Group Protein.Ids Protein.Names Genes First.Protein.Description
#> 1 Q9H8H0 Q9H8H0 NOL11_HUMAN NOL11 Nucleolar protein 11
#> 2 Q9UMY1 Q9UMY1 NOL7_HUMAN NOL7 Nucleolar protein 7
Visualize the statistiics by plot_volcano():
plot_volcano(res_limma[['chrom']],
alpha = 0.05,
lfc_threshold = 0.5,
use_adjP = TRUE,
label_top_n=10)
Apply talus_row_t_welch() to perform per-protein Welch’s t-tests on the TalusDataSetList:
res_t <- talus_row_t_welch(tdsl, design = ~ 0 + Tx)
names(res_t)
#> [1] "chrom" "cyto" "nuc"
Filter significant proteins using summary_talus() (adjusted p-value < 0.05, |logâ‚‚FC| > 0.5):
# show contrast level
cat('\nContrast levels:\n')
#>
#> Contrast levels:
names(res_t[['chrom']])
#> [1] "TxEtopo-vs-TxDMSO" "TxDoxo-vs-TxDMSO"
# summarize the result per fraction
ttest_filt <- summary_talus(res_t[['chrom']],
alpha = 0.05,
lfc_threshold = 0.5)
#>
#> TxEtopo-vs-TxDMSO :
#>
#> Threshold: alpha < 0.05; |lfc_threshod| > 0.5
#> Total features: 1269
#> Up-regulated: 1
#> Down-regulated: 1
#>
#>
#> TxDoxo-vs-TxDMSO :
#>
#> Threshold: alpha < 0.05; |lfc_threshod| > 0.5
#> Total features: 1269
#> Up-regulated: 1
#> Down-regulated: 0
Visualize the statistics by plot_volcano().
# try to re-arrange the contrast_levels
plot_volcano(res_t[['chrom']],
alpha = 0.05,
lfc_threshold = 0.5,
use_adjP = TRUE,
contrast_levels = names(res_t[['chrom']])[c(2, 1)],
label_top_n=10)
Extract a protein ID (e.g. TOP2A) and plot the intensities across conditions (e.g., fractions and treatments):
protein_id <- as.data.frame(rowData(tdsl[[1]])) %>%
dplyr::filter(Genes == 'TOP2A') %>% pull(Protein.Group)
plot_per_protein(tdsl,
protein_id = 'P11388;P11388-2;P11388-3;P11388-4',
group_by = 'Tx') +
labs(title = 'TOP2A')
Or just for chrom fraction only:
plot_per_protein(tdsl[['chrom']],
protein_id = 'P11388;P11388-2;P11388-3;P11388-4',
group_by = 'Tx') +
labs(title = 'TOP2A')
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Ventura 13.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Los_Angeles
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] readr_2.1.5 dplyr_1.1.4
#> [3] ggplot2_3.5.2 talusR_1.0.3
#> [5] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [7] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
#> [9] IRanges_2.38.1 S4Vectors_0.42.1
#> [11] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
#> [13] matrixStats_1.5.0 knitr_1.50
#> [15] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0
#> [4] ggrepel_0.9.6.9999 lattice_0.22-6 tzdb_0.5.0
#> [7] vctrs_0.6.5 tools_4.4.0 generics_0.1.4
#> [10] parallel_4.4.0 tibble_3.2.1 pkgconfig_2.0.3
#> [13] Matrix_1.7-0 RColorBrewer_1.1-3 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.12 compiler_4.4.0 farver_2.1.2
#> [19] stringr_1.5.1 tinytex_0.57 statmod_1.5.0
#> [22] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
#> [25] tidyr_1.3.1 pillar_1.10.2 crayon_1.5.3
#> [28] jquerylib_0.1.4 DelayedArray_0.30.1 cachem_1.1.0
#> [31] limma_3.60.6 abind_1.4-8 tidyselect_1.2.1
#> [34] digest_0.6.37 stringi_1.8.7 purrr_1.0.4
#> [37] bookdown_0.43 labeling_0.4.3 fastmap_1.2.0
#> [40] grid_4.4.0 cli_3.6.5 SparseArray_1.4.8
#> [43] magrittr_2.0.3 S4Arrays_1.4.1 withr_3.0.2
#> [46] UCSC.utils_1.0.0 scales_1.4.0 bit64_4.6.0-1
#> [49] rmarkdown_2.29 XVector_0.44.0 httr_1.4.7
#> [52] bit_4.6.0 matrixTests_0.2.3 hms_1.1.3
#> [55] evaluate_1.0.3 rlang_1.1.6 Rcpp_1.0.14
#> [58] glue_1.8.0 BiocManager_1.30.25 vroom_1.6.5
#> [61] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1
#> [64] zlibbioc_1.50.0