
Data Overview & Exploration

TCR Repertoire Analysis — Notebook 01: Dataset characterization, sample summary, and quality checks
Joshua Luthy | R + immunarch + edgeR | Synthetic Data | 2026
Contents
  1. Dataset Overview
  2. Patient Cohort
  3. Per-Sample Summary
  4. V Gene Usage
  5. Data Quality Notes

01 Dataset Overview

This notebook provides an initial exploration of the synthetic TCR repertoire dataset. The data simulates a cell therapy manufacturing workflow where peripheral blood mononuclear cells (PBMCs) from the apheresis collection are processed into a final drug product. We examine 6 patients with paired Apheresis/Product samples.

Patients: 6
Samples: 12
Total Clonotypes: 81,593
Total Reads: 1,125,178

# Load dependencies
library(tidyverse)
library(immunarch)

# Read synthetic data
tcr_data <- read_csv("data/raw/synthetic_tcr_repertoires.csv")
patient_meta <- read_csv("data/raw/patient_metadata.csv")

glimpse(tcr_data)

Rows: 81,593
Columns: 9
$ patient_id        <chr> "PT-001", "PT-001", "PT-001", "PT-001", …
$ clinical_response <chr> "CR", "CR", "CR", "CR", …
$ sample_type       <chr> "Apheresis", "Apheresis", "Apheresis", …
$ clonotype_id      <chr> "PT-001_CLN_00001", "PT-001_CLN_00002", …
$ cdr3_aa           <chr> "CDWVSYQFTKRRF", "CMGDVHRMPPGLMF", …
$ v_gene            <chr> "TRBV7-9", "TRBV29-1", "TRBV29-1", …
$ j_gene            <chr> "TRBJ1-1", "TRBJ2-1", "TRBJ2-1", …
$ clone_count       <dbl> 8242, 3590, 2207, 1568, …
$ clone_fraction    <dbl> 0.08689, 0.03785, 0.02327, 0.01653, …
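
The glimpse() output suggests that clone_fraction is clone_count divided by the total reads of the same sample: for the first PT-001 Apheresis row, 8242 / 94,857 ≈ 0.08689, using the read total reported in the per-sample table. The sketch below shows one way such a consistency check could be written in base R. The toy rows, the helper name max_fraction_gap, and the 1e-4 tolerance are all assumptions made for illustration (the tolerance is chosen because the fractions appear rounded to five decimals); this is not necessarily a check the notebook itself runs.

```r
# Hypothetical toy rows standing in for tcr_data; values are illustrative only.
toy <- data.frame(
  patient_id     = c("PT-X", "PT-X", "PT-X", "PT-X", "PT-X"),
  sample_type    = c("Apheresis", "Apheresis", "Apheresis", "Product", "Product"),
  clone_count    = c(50, 30, 20, 90, 10),
  clone_fraction = c(0.50, 0.30, 0.20, 0.90, 0.10)
)

# Largest absolute gap between the stored fraction and
# clone_count / (total reads of the same patient and sample type).
max_fraction_gap <- function(df) {
  sample_key <- interaction(df$patient_id, df$sample_type, drop = TRUE)
  expected <- df$clone_count / ave(df$clone_count, sample_key, FUN = sum)
  max(abs(df$clone_fraction - expected))
}

gap <- max_fraction_gap(toy)
fractions_ok <- gap < 1e-4  # assumed tolerance for five-decimal rounding
```

Applied to the real table, the same helper would be called as max_fraction_gap(tcr_data), assuming the column names shown in the glimpse() output.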

02 Patient Cohort

The cohort includes 6 AML patients with varying clinical responses to cell therapy: 2 complete responders (CR), 2 partial responders (PR), and 2 with progressive disease (PD).

Patient  Response  Age  Sex  Disease  Prior Lines
PT-001   CR        60   F    AML      2
PT-002   CR        46   M    AML      2
PT-003   PR        43   M    AML      3
PT-004   PR        59   M    AML      4
PT-005   PD        47   F    AML      2
PT-006   PD        58   F    AML      2

03 Per-Sample Summary

For each patient, we have paired Apheresis (starting material) and Product (manufactured drug product) TCR repertoire samples. Key diversity metrics are computed below.

# Compute diversity metrics per sample
diversity_summary <- tcr_data %>%
  group_by(patient_id, sample_type, clinical_response) %>%
  summarise(
    n_clonotypes    = n_distinct(clonotype_id),
    total_reads     = sum(clone_count),
    shannon_entropy = vegan::diversity(clone_count, index = "shannon"),
    clonality       = 1 - (shannon_entropy / log(n_clonotypes)),
    .groups = "drop"
  )

Patient  Sample     Response  Clonotypes  Reads    Shannon H  Clonality  D50
PT-001   Apheresis  CR        13,238      94,857   7.9996     0.1571     1816
PT-001   Product    CR        2,029       72,517   4.4388     0.4171     3
PT-002   Apheresis  CR        14,132      106,246  8.1066     0.1517     2046
PT-002   Product    CR        940         79,693   3.0637     0.5525     1
PT-003   Apheresis  PR        11,658      120,934  7.7771     0.1694     1379
PT-003   Product    PR        1,123       99,121   3.4408     0.5101     2
PT-004   Apheresis  PR        12,786      126,170  7.9188     0.1626     1639
PT-004   Product    PR        2,474       51,822   4.8667     0.3771     5
PT-005   Apheresis  PD        9,771       144,336  7.4659     0.1874     858
PT-005   Product    PD        1,913       60,767   4.6019     0.3910     5
PT-006   Apheresis  PD        8,676       113,231  7.2834     0.1968     614
PT-006   Product    PD        2,853       55,484   5.2527     0.3398     11

Key Observation

Apheresis samples are highly diverse: each contains 8,676 to 14,132 clonotypes, with Shannon H above 7.2 and clonality between 0.15 and 0.20. Product samples are far more oligoclonal (940 to 2,853 clonotypes, Shannon H 3.1–5.3, clonality 0.34–0.55), and D50 drops from 614–2,046 in Apheresis to 1–11 in Product. This is consistent with the expected biology: manufacturing selects and expands a subset of T cell clones from the starting material.
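
The per-sample table reports D50, but the code cell above does not show how it is computed. The base R sketch below illustrates the three metrics on two toy count vectors. It assumes D50 means the smallest number of top-ranked clonotypes whose reads cover at least 50% of the sample, which is one common definition and matches the count-like values in the table; the notebook's actual calculation may differ. The helper names and toy vectors are hypothetical. Shannon entropy uses the natural log, matching the default of vegan::diversity.

```r
# Shannon entropy (natural log) of a vector of clone counts.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Clonality = 1 - normalized entropy; undefined for fewer than two clonotypes.
clonality <- function(counts) {
  n <- sum(counts > 0)
  if (n < 2) return(NA_real_)
  1 - shannon_entropy(counts) / log(n)
}

# Assumed D50: smallest number of top clonotypes covering >= 50% of reads.
d50 <- function(counts) {
  sorted <- sort(counts, decreasing = TRUE)
  which(cumsum(sorted) >= 0.5 * sum(sorted))[1]
}

# Hypothetical toy repertoires, not taken from the dataset.
even_counts   <- rep(10, 100)        # 100 equally sized clones
skewed_counts <- c(900, rep(1, 100)) # one dominant clone plus 100 singletons

d50(even_counts)          # 50: half of the clones are needed
d50(skewed_counts)        # 1: the dominant clone alone exceeds 50%
clonality(even_counts)    # approximately 0 (maximally even)
clonality(skewed_counts)  # much closer to 1 (oligoclonal)
```

The two toy vectors mirror the contrast in the table: an even repertoire behaves like Apheresis (low clonality, large D50), and a repertoire dominated by one clone behaves like Product (high clonality, D50 near 1).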

04 V Gene Usage

We examine TRBV gene segment usage across samples to identify any preferential gene usage patterns between Apheresis and Product.

# V gene usage summary
vgene_summary <- tcr_data %>%
  group_by(sample_type, v_gene) %>%
  summarise(total_count = sum(clone_count), .groups = "drop") %>%
  group_by(sample_type) %>%
  mutate(fraction = total_count / sum(total_count)) %>%
  arrange(sample_type, desc(fraction))

Figure 1. Top 10 TRBV gene segments by read fraction, shown for Apheresis (cyan) and Product (purple) samples aggregated across all patients.
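
The caption does not say how the top 10 segments were chosen when the two sample types rank genes differently. As a minimal sketch, the base R code below ranks genes by their mean read fraction across sample types and keeps the top n. That ranking rule, the helper name top_genes, and the toy counts are assumptions for illustration; the figure may have used a different rule. Note also that because reads are pooled across patients before fractions are computed, samples with more reads contribute more to each fraction.

```r
# Hypothetical toy usage table shaped like vgene_summary; counts are illustrative.
toy_usage <- data.frame(
  sample_type = rep(c("Apheresis", "Product"), each = 4),
  v_gene      = rep(c("TRBV7-9", "TRBV29-1", "TRBV5-1", "TRBV20-1"), times = 2),
  total_count = c(400, 300, 200, 100, 100, 600, 250, 50)
)

# Read fraction of each gene within its sample type.
toy_usage$fraction <- toy_usage$total_count /
  ave(toy_usage$total_count, toy_usage$sample_type, FUN = sum)

# Assumed ranking rule: mean fraction across sample types, highest first.
top_genes <- function(usage, n) {
  mean_fraction <- tapply(usage$fraction, usage$v_gene, mean)
  ranked <- names(sort(mean_fraction, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

top2 <- top_genes(toy_usage, 2)  # "TRBV29-1" "TRBV7-9"

# Rows to plot: only the selected genes, for both sample types.
plot_rows <- toy_usage[toy_usage$v_gene %in% top2, ]
```

With the real summary, top_genes(vgene_summary, 10) would return the gene names to keep before plotting, assuming the column names produced by the code cell above.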

05 Data Quality Notes

Synthetic Data Disclaimer

This dataset is entirely synthetic, generated to demonstrate analytical methods and pipeline architecture. CDR3 sequences are randomly generated and do not correspond to real TCR rearrangements. Clone size distributions follow power-law models calibrated to approximate real repertoire characteristics.

Data quality checks performed: