This notebook provides an initial exploration of the synthetic TCR repertoire dataset. The data simulates a cell therapy manufacturing workflow where peripheral blood mononuclear cells (PBMCs) from the apheresis collection are processed into a final drug product. We examine 6 patients with paired Apheresis/Product samples.
# Load dependencies
library(tidyverse)
library(immunarch)
# Read synthetic data
tcr_data <- read_csv("data/raw/synthetic_tcr_repertoires.csv")
patient_meta <- read_csv("data/raw/patient_metadata.csv")
glimpse(tcr_data)
The cohort includes 6 AML patients with varying clinical responses to cell therapy: 2 complete responders (CR), 2 partial responders (PR), and 2 with progressive disease (PD).
| Patient | Response | Age | Sex | Disease | Prior Lines |
|---|---|---|---|---|---|
| PT-001 | CR | 60 | F | AML | 2 |
| PT-002 | CR | 46 | M | AML | 2 |
| PT-003 | PR | 43 | M | AML | 3 |
| PT-004 | PR | 59 | M | AML | 4 |
| PT-005 | PD | 47 | F | AML | 2 |
| PT-006 | PD | 58 | F | AML | 2 |
For each patient, we have paired Apheresis (starting material) and Product (manufactured drug product) TCR repertoire samples. Key diversity metrics are computed below.
# Compute diversity metrics per sample
diversity_summary <- tcr_data %>%
  group_by(patient_id, sample_type, clinical_response) %>%
  summarise(
    n_clonotypes = n_distinct(clonotype_id),
    total_reads = sum(clone_count),
    # Shannon entropy on raw counts (vegan uses the natural log by default)
    shannon_entropy = vegan::diversity(clone_count, index = "shannon"),
    # Clonality = 1 - Pielou's evenness; undefined when n_clonotypes is 1
    clonality = 1 - (shannon_entropy / log(n_clonotypes)),
    .groups = "drop"
  )
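To make the two formulas concrete, here is a minimal base-R sketch of Shannon entropy and clonality on toy count vectors. The helper names `shannon_entropy_manual()` and `clonality_manual()` and the toy vectors are illustrative assumptions, not part of the pipeline. The last line re-derives one clonality value from the summary table in this notebook (PT-001 Apheresis).

```r
# Hypothetical helpers for illustration; the notebook itself uses vegan::diversity().

# Shannon entropy with the natural log, computed from raw clone counts
shannon_entropy_manual <- function(counts) {
  counts <- counts[counts > 0]   # zero counts contribute nothing to the sum
  p <- counts / sum(counts)      # convert counts to clone frequencies
  -sum(p * log(p))
}

# Clonality as 1 - normalized entropy; returns NA for fewer than 2 clonotypes
clonality_manual <- function(counts) {
  n <- sum(counts > 0)
  if (n < 2) return(NA_real_)    # log(1) = 0 would cause division by zero
  1 - shannon_entropy_manual(counts) / log(n)
}

even   <- rep(100, 4)            # toy repertoire: four equally sized clones
skewed <- c(970, 10, 10, 10)     # toy repertoire: one dominant clone

shannon_entropy_manual(even)     # equals log(4), the maximum for 4 clones
clonality_manual(even)           # 0 (up to floating point): perfectly even
clonality_manual(skewed)         # roughly 0.88: strongly oligoclonal

# Consistency check against the reported PT-001 Apheresis values
round(1 - 7.9996 / log(13238), 4)
```

A clonality near 0 means reads are spread evenly across clonotypes, and values approaching 1 mean a few clones dominate.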
| Patient | Sample | Response | Clonotypes | Reads | Shannon H | Clonality | D50 |
|---|---|---|---|---|---|---|---|
| PT-001 | Apheresis | CR | 13,238 | 94,857 | 7.9996 | 0.1571 | 1816 |
| PT-001 | Product | CR | 2,029 | 72,517 | 4.4388 | 0.4171 | 3 |
| PT-002 | Apheresis | CR | 14,132 | 106,246 | 8.1066 | 0.1517 | 2046 |
| PT-002 | Product | CR | 940 | 79,693 | 3.0637 | 0.5525 | 1 |
| PT-003 | Apheresis | PR | 11,658 | 120,934 | 7.7771 | 0.1694 | 1379 |
| PT-003 | Product | PR | 1,123 | 99,121 | 3.4408 | 0.5101 | 2 |
| PT-004 | Apheresis | PR | 12,786 | 126,170 | 7.9188 | 0.1626 | 1639 |
| PT-004 | Product | PR | 2,474 | 51,822 | 4.8667 | 0.3771 | 5 |
| PT-005 | Apheresis | PD | 9,771 | 144,336 | 7.4659 | 0.1874 | 858 |
| PT-005 | Product | PD | 1,913 | 60,767 | 4.6019 | 0.3910 | 5 |
| PT-006 | Apheresis | PD | 8,676 | 113,231 | 7.2834 | 0.1968 | 614 |
| PT-006 | Product | PD | 2,853 | 55,484 | 5.2527 | 0.3398 | 11 |
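The D50 column is not produced by the summary code shown earlier. A minimal sketch of one common definition follows, assuming D50 is the smallest number of top-ranked clonotypes whose combined reads reach 50% of the sample total. The notebook does not state how its D50 values were computed, so both the definition and the helper name `d50_index()` are assumptions.

```r
# Hypothetical helper: number of largest clones needed to cover `fraction` of reads
d50_index <- function(counts, fraction = 0.5) {
  counts <- sort(counts[counts > 0], decreasing = TRUE)  # rank clones by size
  cum_frac <- cumsum(counts) / sum(counts)               # cumulative read share
  which(cum_frac >= fraction)[1]                         # first rank reaching the threshold
}

d50_index(c(600, 200, 100, 100))  # 1: the top clone alone holds 60% of reads
d50_index(rep(10, 10))            # 5: five of ten equal clones reach exactly 50%
```

Under this assumption the helper could be called inside `summarise()` as `d50 = d50_index(clone_count)`.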
Apheresis samples are highly diverse (Shannon H > 7.0, clonality < 0.20, 8,676–14,132 unique clonotypes), while Product samples are markedly more oligoclonal (Shannon H 3.0–5.3, clonality 0.34–0.55, 940–2,853 clonotypes). D50 falls from 614–2,046 in Apheresis to 1–11 in Product. This is consistent with the expected biology: manufacturing selects and expands a subset of T cell clones from the starting material.
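Because the samples are paired, the contrast can also be expressed per patient. This sketch uses the Shannon H values transcribed from the table above. The data frame `paired` and its column names are illustrative assumptions.

```r
# Shannon H per patient, transcribed from the summary table
paired <- data.frame(
  patient  = sprintf("PT-%03d", 1:6),
  response = c("CR", "CR", "PR", "PR", "PD", "PD"),
  h_aph    = c(7.9996, 8.1066, 7.7771, 7.9188, 7.4659, 7.2834),
  h_prod   = c(4.4388, 3.0637, 3.4408, 4.8667, 4.6019, 5.2527)
)

# Within-patient loss of diversity from Apheresis to Product
paired$h_drop <- paired$h_aph - paired$h_prod

# Largest drops first
paired[order(-paired$h_drop), c("patient", "response", "h_drop")]
```

Every patient shows a positive drop. With six synthetic patients, any apparent ordering by clinical response should be read as illustrative only.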
We examine TRBV gene segment usage, pooled across patients within each sample type, to identify any preferential gene usage patterns between Apheresis and Product.
# V gene usage summary
vgene_summary <- tcr_data %>%
  group_by(sample_type, v_gene) %>%
  summarise(total_count = sum(clone_count), .groups = "drop") %>%
  group_by(sample_type) %>%
  mutate(fraction = total_count / sum(total_count)) %>%
  arrange(sample_type, desc(fraction))
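One way to compare usage between the two sample types is to put the fractions side by side and compute a log2 ratio per V gene. The sketch below runs on a toy table shaped like `vgene_summary`. The fractions are invented for illustration and the pseudocount is an arbitrary assumption.

```r
# Toy stand-in for vgene_summary (fractions invented for illustration only)
vgene_toy <- data.frame(
  sample_type = rep(c("Apheresis", "Product"), each = 3),
  v_gene      = rep(c("TRBV5-1", "TRBV7-9", "TRBV20-1"), times = 2),
  fraction    = c(0.50, 0.30, 0.20, 0.25, 0.60, 0.15)
)

# Split by sample type and join on V gene; all = TRUE keeps genes seen in only one type
aph  <- subset(vgene_toy, sample_type == "Apheresis", select = c(v_gene, fraction))
prod <- subset(vgene_toy, sample_type == "Product",   select = c(v_gene, fraction))
wide <- merge(aph, prod, by = "v_gene", all = TRUE, suffixes = c("_aph", "_prod"))

# A gene absent from one sample type has usage 0 there
wide$fraction_aph[is.na(wide$fraction_aph)]   <- 0
wide$fraction_prod[is.na(wide$fraction_prod)] <- 0

# Small pseudocount (assumed value) avoids log2(0) for absent genes
pseudo <- 1e-6
wide$log2_fc <- log2((wide$fraction_prod + pseudo) / (wide$fraction_aph + pseudo))

# Genes with the largest shift in either direction first
wide[order(-abs(wide$log2_fc)), ]
```

Positive values indicate enrichment in Product and negative values indicate depletion. In this toy table, TRBV7-9 doubles (log2 ratio near 1) and TRBV5-1 halves (near -1).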
This dataset is entirely synthetic, generated to demonstrate analytical methods and pipeline architecture. CDR3 sequences are randomly generated and do not correspond to real TCR rearrangements. Clone size distributions follow power-law models calibrated to approximate real repertoire characteristics.
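The generator itself is not shown in this notebook. As a rough illustration of how power-law clone sizes can be simulated, the sketch below draws read counts from a Zipf-like rank-frequency distribution. The exponent, clone number, and read depth are hypothetical values, not the calibration used for this dataset.

```r
set.seed(42)        # fixed seed so the toy draw is reproducible

n_clones    <- 1000   # hypothetical number of candidate clonotypes
alpha       <- 1.2    # hypothetical power-law exponent
total_reads <- 50000  # hypothetical sequencing depth

# Zipf-like frequencies: the clone at rank r has weight r^(-alpha)
weights <- (1:n_clones)^(-alpha)
probs   <- weights / sum(weights)

# Distribute reads across clones with a single multinomial draw
clone_count <- as.vector(rmultinom(1, size = total_reads, prob = probs))

# Clones that received no reads would not be observed in sequencing
clone_count <- clone_count[clone_count > 0]

length(clone_count)   # observed clonotypes
sum(clone_count)      # equals total_reads
```

Larger exponents concentrate reads in fewer clones, which is one way a simulation could produce the more oligoclonal Product-like profiles.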
Data quality checks performed: