diff --git a/recognition/Flan_T5_s45893623/.gitignore b/recognition/Flan_T5_s45893623/.gitignore
new file mode 100644
index 000000000..1d14e5b25
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/.gitignore
@@ -0,0 +1,10 @@
+# Byte-compiled / cached files
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+*.py.class
+# Generated folders, checkpoints and logs
+runs/
+eval/
+archive/
\ No newline at end of file
diff --git a/recognition/Flan_T5_s45893623/README.md b/recognition/Flan_T5_s45893623/README.md
new file mode 100644
index 000000000..63e8f980a
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/README.md
@@ -0,0 +1,288 @@
+# FLAN-T5 + LoRA for Layperson Radiology Summarisation
+
+This project fine-tunes **FLAN-T5-base** with **LoRA** adapters on the **BioLaySumm 2025 LaymanRRG** dataset.
+Given a radiology report, the model generates a 1–3 sentence summary in plain English that replaces medical jargon with basic concepts.
+
+
+## 1. Problem Description
+
+Radiology reports are written for clinicians and are difficult for patients to understand.
+The goal of this project is to:
+
+- take an **expert radiology report** as input,
+- produce a **short layperson-friendly summary**,
+- evaluate performance using **ROUGE**,
+- and analyse where the model succeeds/fails.
+
+We use a pretrained FLAN-T5-base model with LoRA adapters to fine-tune efficiently on an RTX 5070 Ti (16GB).
+
+## 2. Background
+
+
+
+
+
+
+
+FLAN-T5 is a variant of Google's T5 that has been instruction-tuned on a large number of diverse tasks. Whilst the original T5 was already a strong encoder-decoder transformer, FLAN-T5 is trained specifically to follow natural language instructions, making it better suited to tasks phrased as "Rewrite this", "Summarise", or "Explain this".
+
+### Encoder/Decoder Architecture
+
+FLAN-T5 uses the classic seq2seq structure:
+- The encoder reads the input radiology report (plus our instruction prompt) and converts it into hidden representations.
+- The decoder takes those representations and generates the summary token by token, attending to both:
+  - the encoder's information (what the report says)
+  - previously generated tokens (what has already been written by the model)
+
+FLAN-T5 is also relatively accessible compared to more complex LLMs, which require more VRAM to fine-tune.
+
+### LoRA (Low-Rank Adaptation)
+
+FLAN-T5-base has ~**250M** parameters. Fully fine-tuning all of them on a single consumer GPU is both slow and unnecessary.
+
+**LoRA** adds a small number of trainable low-rank matrices inside the model's attention layers. Only these matrices are updated during training: roughly **1.7M** parameters in our configuration.
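+
+To make the idea concrete, here is a toy, self-contained sketch of a single LoRA-adapted projection (an illustration of the mechanism only; the actual adapters are attached via `peft` in `modules.py`):
+
+```python
+import torch
+
+d_model, r, alpha = 768, 8, 16               # FLAN-T5-base hidden size, our LoRA rank and scale
+W = torch.randn(d_model, d_model)            # frozen pretrained projection (e.g. a q/k/v/o matrix)
+A = torch.randn(r, d_model) * 0.01           # trainable low-rank "down" matrix
+B = torch.zeros(d_model, r)                  # trainable low-rank "up" matrix, zero-initialised so the adapter starts as a no-op
+
+x = torch.randn(1, d_model)
+y = x @ W.T + (alpha / r) * (x @ A.T) @ B.T  # base output + scaled low-rank update
+```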
+
+Effectively, **FLAN-T5** provides general reasoning ability and **LoRA** teaches it the domain-specific phrasing of radiology summaries.
+
+### ROUGE Scores
+
+To evaluate the quality of generated summaries, we use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric suite. ROUGE measures the degree of overlap between our model's output and the gold layman summary (a minimal scoring example is shown after the list below). The four relevant variants are:
+
+- **ROUGE-1** — unigram (word-level) overlap
+ Reflects whether the model captures the key medical terms and concepts.
+
+- **ROUGE-2** — bigram (two-word sequence) overlap
+ Measures phrasing quality and short-range coherence.
+
+- **ROUGE-L** — longest common subsequence
+ Rewards structurally similar summaries (e.g. similar sentence ordering).
+
+- **ROUGE-Lsum** — a variant of ROUGE-L suited to multi-sentence summaries.
+ This is typically the most representative single metric for summarisation and is used as our primary score for hyperparameter tuning.
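+
+For reference, scoring uses the Hugging Face `evaluate` package (the same call as in `train.py`); a minimal example with made-up strings:
+
+```python
+import evaluate
+
+rouge = evaluate.load("rouge")
+scores = rouge.compute(
+    predictions=["There is no sign of pneumonia in the lungs."],
+    references=["No signs of pneumonia are seen in the lungs."],
+    use_stemmer=True,
+)
+print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum
+```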
+
+
+
+## 3. Code Structure
+
+- `dataset.py` – Dataset handler with an optional 90:10 train/test split
+- `eval.py` – Produces CSVs & plots using metrics logged from train.py
+- `modules.py` – Model & tokenizer definition
+- `predict.py` – Uses the best checkpoint to summarise arbitrary reports
+- `train.py` – Full training algorithm with logged metrics
+- `runs/` – Stores model checkpoints + loss & validation history
+- `eval/` – Stores CSVs & plots generated from the history in `runs/`
+
+### Dependencies
+- **torch** – `2.9.0`
+- **transformers** – `4.57.1`
+- **datasets** – `4.4.1`
+- **evaluate** – `0.4.6`
+- **peft** – `0.17.1`
+- **sentencepiece** – `0.2.1`
+- **numpy** – `2.3.3`
+- **matplotlib** – `3.10.7`
+- **tqdm** – `4.67.1`
+
+
+
+## 4. Usage
+Installation and training are straightforward using the provided requirements.txt file.
+
+**NOTE:** This was only tested in WSL, which requires a very specific setup for CUDA & torch.
+```
+# Create a conda environment & install
+conda create -n COMP3710 python=3.11
+conda activate COMP3710
+pip install -r requirements.txt
+
+# Run training using default parameters
+python train.py --out_dir runs/flan_t5_lora
+
+# Generate plots & CSVs
+python eval.py runs/flan_t5_lora
+
+# Test the model, optionally specifying the report index in the dataset
+python predict.py --ckpt runs/flan_t5_lora --idx [report_index]
+```
+## 5. Dataset
+### 5.1 Source
+**Dataset:** `BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track` (Hugging Face)
+
+Each row contains:
+
+- x: `radiology_report` – the original expert report (input).
+- y: `layman_report` – the corresponding layperson summary (target).
+
+### 5.2 Splits
+The dataset provides the following splits by default:
+- `train`,
+- `validation`
+- `test`*
+
+However, the `test` split is not useful for evaluation as it omits layman summaries. Hence, we optionally compute a `90:10` `train:test` split for hyperparameter tuning. A 10% hold-out is sufficient given the large dataset size, and avoids unnecessary computation during evaluation.
+
+The final evaluation uses the `validation` set by default, which remains untouched by our splits. This fixed validation set avoids the noise from random partitioning and ensures that final results are comparable and reproducible.
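+
+The optional split is implemented in `dataset.py` with the Hugging Face `datasets` API; the relevant call is essentially:
+
+```python
+from datasets import load_dataset
+
+ds = load_dataset("BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track")
+# 90:10 split of the original training data; the seed is fixed at 42 so both halves stay consistent
+split_ds = ds["train"].train_test_split(test_size=0.1, seed=42)
+train_rows, test_rows = split_ds["train"], split_ds["test"]
+```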
+
+### 5.3 Prompts & Preprocessing
+
+Minimal preprocessing is applied. Directly modifying the text could reduce the model's real-world performance, since we want it to be robust to real clinical noise, so we let FLAN-T5 handle and model any inconsistencies in language.
+
+Each input is embedded into a fixed instruction prompt:
+```
+You are a helpful medical assistant. Rewrite the radiology report for a layperson
+in 1–3 sentences, avoid jargon, use plain language.
+
+Report:
+{radiology_report}
+
+Layperson summary:
+```
+- Inputs are tokenised and truncated to `max_input_len` (default 1024)
+- Targets are tokenised and truncated to `max_target_len` (default 256)
+
+No additional cleaning or filtering is performed.
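+
+A condensed, self-contained version of the collation step in `train.py` (toy strings, single-example batch):
+
+```python
+from transformers import AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained("google/flan-t5-base", use_fast=True)
+PROMPT = (
+    "You are a helpful medical assistant. Rewrite the radiology report for a layperson "
+    "in 1–3 sentences, avoid jargon, use plain language.\n\nReport:\n{rad_report}\n\nLayperson summary:"
+)
+
+reports = ["No parenchymal consolidation is observed."]     # toy batch
+summaries = ["No lung tissue thickening is seen."]
+
+enc = tok([PROMPT.format(rad_report=r) for r in reports],
+          padding=True, truncation=True, max_length=1024, return_tensors="pt")
+dec = tok(text_target=summaries, padding=True, truncation=True, max_length=256, return_tensors="pt")
+labels = dec["input_ids"]
+labels[labels == tok.pad_token_id] = -100                   # pad positions are ignored by the loss
+```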
+
+## 6. Model & Training
+
+At a high level, training follows a standard seq2seq fine-tuning loop implemented in PyTorch. Each radiology report is first wrapped in our instruction prompt and tokenised, along with its lay summary as the target. We then run a forward pass through FLAN-T5 with LoRA adapters, compute the cross-entropy loss over the summary tokens, and use gradient accumulation so that several small batches simulate a larger effective batch size. Every `grad_accum` steps we update only the LoRA parameters with AdamW.
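+
+Stripped of logging and the periodic sanity-check generation, the core update loop in `train.py` looks roughly like this:
+
+```python
+for i, batch in enumerate(train_loader, 1):
+    batch = {k: v.to(device) for k, v in batch.items()}
+    with autocast(enabled=use_amp):
+        loss = model(**batch).loss / grad_accum          # scale the loss for accumulation
+    scaler.scale(loss).backward()
+
+    if i % grad_accum == 0:                              # one optimiser step per accumulated group
+        scaler.unscale_(optim)
+        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+        scaler.step(optim)
+        scaler.update()
+        optim.zero_grad(set_to_none=True)
+        sched.step()
+```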
+
+
+### 6.1 Base Model & LoRA Config
+
+- Base model: **google/flan-t5-base** (~249M parameters)
+- LoRA applied to: `["q", "k", "v", "o"]`
+- LoRA configuration:
+ - `r = 8`
+ - `alpha = 16`
+ - `dropout = 0.05`
+
+
+
+Total trainable parameters under LoRA: **~1.7M**.
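+
+As a sanity check on that figure, assuming FLAN-T5-base's hidden size of 768, 12 encoder blocks (self-attention only) and 12 decoder blocks (self-attention plus cross-attention):
+
+```python
+d_model, r = 768, 8
+params_per_module = 2 * d_model * r        # A (r x d_model) + B (d_model x r) = 12,288
+encoder_modules = 12 * 4                   # q, k, v, o in each encoder block
+decoder_modules = 12 * 4 * 2               # q, k, v, o in both self- and cross-attention
+print((encoder_modules + decoder_modules) * params_per_module)   # 1,769,472
+```
+
+which matches the trainable-parameter count reported in Section 7.3.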
+
+
+### 6.2 Default Arguments
+The training script contains many command line arguments to control model setup, optimisation, and LoRA configuration.
+Defaults were chosen to keep VRAM usage within 16GB.
+
+| Argument | Default | Description |
+|---------|---------|-------------|
+| `--model_name` | `"google/flan-t5-base"` | Base Hugging Face model onto which LoRA adapters are loaded. |
+| `--out_dir` | `"runs/flan_t5_lora"` | Directory where checkpoints, logs, and metrics are saved. |
+| `--epochs` | `5` | Total number of epochs. |
+| `--lr` | `2e-4` | Learning rate for AdamW |
+| `--wd` | `0.01` | Weight decay regularisation|
+| `--warmup_steps` | `500` | Gradual warmup for the learning rate scheduler |
+| `--batch_size` | `2` | Batch size. Small due to VRAM constraints. |
+| `--grad_accum` | `8` | Number of gradient accumulation steps. Effective batch size = `batch_size × grad_accum`. |
+| `--max_input_len` | `1024` | Maximum token length for the radiology report + prompt. Inputs are truncated beyond this length. |
+| `--max_target_len` | `256` | Maximum token length for model output. |
+| `--val_beams` | `4` | Beam search width during validation generation. Improves ROUGE at the cost of speed. |
+| `--val_max_new_tokens` | `128` | Maximum generation length for validation summaries. |
+| `--lora_r` | `8` | LoRA rank. Controls the size of the low-rank adaptation matrices. |
+| `--lora_alpha` | `16` | LoRA scaling factor, effectively adjusting update magnitude. |
+| `--lora_dropout` | `0.05` | Dropout applied to LoRA layers to improve generalisation. |
+| `--seed` | `1337` | Global seed for reproducibility. |
+| `--fp16` | *off by default* | Enables mixed precision training. |
+
+### 6.3 Hardware
+
+- `GPU`: RTX 5070 Ti
+- `VRAM`: 16GB
+
+VRAM usage generally hovered around 15.4GB during training.
+
+## 7. Results
+
+We run training over all ~150k rows across 5 epochs, using the default parameters defined above. As noted, evaluation is executed against the default validation set.
+
+### 7.1 Training Loss
+
+
+
+The training loss curve shows a steep drop during the first ~2,000 steps, indicating that the model rapidly learns the relationship between reports and their summaries. This is not a surprise, considering the richness of each epoch. After this initial phase, however, we begin to see diminishing returns, with the loss decreasing only slowly. This reflects smaller refinements to phrasing and style.
+
+Note that the periodic spikes at the start of each epoch are an artefact of gradient accumulation rather than training instability. Because the model accumulates gradients for several batches before performing the first optimiser step, it computes loss using weights that have not yet been updated. As a result, the logged loss appears higher during these early batches and then drops quickly once the first few updates occur.
+
+### 7.2 Validation
+
+#### Full-Val: 5 Epochs
+
+
+
+| Epoch | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
+|-------|---------|---------|---------|------------|
+| 1 | 0.722 | 0.538 | 0.670 | 0.670 |
+| 2 | 0.728 | 0.546 | 0.677 | 0.677 |
+| 3 | 0.734 | 0.556 | 0.684 | 0.684 |
+| 4 | 0.735 | 0.558 | 0.686 | 0.686 |
+| 5 | **0.740** | **0.566** | **0.691** | **0.692** |
+
+ROUGE scores increase steadily across the five epochs, but the improvements are modest and nearly flat. As suggested by our loss curve, the majority of learning actually happens within the first epoch, and by the time full-epoch validation begins, the model is already close to its optimal performance. The remaining epochs provide incremental refinements rather than substantive gains, indicating that the model has largely converged by the end of epoch 1.
+
+Although all ROUGE metrics follow the same trajectory, they improve at slightly different rates. ROUGE-1 and ROUGE-L/Lsum rise quickly and stabilise early, indicating that the model rapidly learns to identify key clinical concepts and sentence structure. ROUGE-2, however, improves more slowly, reflecting that the model continues refining phrasing and short n-gram coherence even after the semantic mapping is learned.
+
+The best checkpoint achieves:
+- `ROUGE-Lsum` ≈ 0.6915
+
+
+#### Multi-Val: 2 Epochs
+
+
+
+To make this behaviour clearer, we also trained the model for two epochs while performing four validation checks per epoch. This finer-grained view reveals a sharp jump in ROUGE by the first validation check (around 25% of the way through epoch 1), confirming that most of the performance gain occurs very early in training. After this point, the curves flatten noticeably, showing that later updates provide only small refinements.
+
+### 7.3 Training Report
+
+
+This section summarises the key training settings, LoRA configuration, model size, and hardware used for fine-tuning FLAN-T5-base. Training took an average of 2-2.5 hours per epoch.
+
+### Training Configuration
+
+| Setting | Value |
+|--------|-------|
+| **Model** | google/flan-t5-base |
+| **Total Params** | 251,116,800 |
+| **Trainable (LoRA)** | 1,769,472 |
+| **LoRA** | r = 8, α = 16, dropout = 0.05 |
+| **GPU** | RTX 5070 Ti (15.9 GB VRAM) |
+| **Epochs** | 5 |
+| **Batch Size** | 2 |
+| **Grad Accum** | 8 |
+| **Warmup Steps** | 1000 |
+| **Learning Rate** | 2e-4 |
+| **Weight Decay** | 0.01 |
+| **Total Training Time** | ~39,982 s (~11.1 h) |
+
+
+
+## 8. Error Analysis
+
+
+### Error Analysis Samples
+
+| Analysis | Radiology Report | Gold Summary | Model Prediction |
+|---------|------------------|--------------|------------------|
+| **Success: Strong match** | The chest shows significant air trapping. Bilateral apical chronic changes are present. Dorsal kyphosis is noted. No evidence of pneumothorax. | The chest shows a large amount of trapped air. There are long-term changes at the top of both lungs. The upper back is curved outward. There is no sign of air in the space around the lungs. | The chest x-ray shows a lot of trapped air in the lungs. There are long-term changes at the top of both lungs. The upper back is curved more than it should be. There's no sign of air in the chest cavity. |
+| **Success: Fully correct** | No parenchymal consolidation or peribronchial thickening is observed, with no significant changes in the lower lobe. | There is no sign of lung tissue thickening or thickening around the airways in the lower part of the lung, and there are no major changes in this area. | No signs of lung tissue consolidation or thickening around the airways are seen, and there are no significant changes in the lower part of the lung. |
+| **Failure: subtle semantic error (misinterprets “pneumothorax”)** | Comparison with the previous examination dated 20 03 2010 shows slightly greater expansion of the right lung compared to the prior control. Persistent right apical pneumothorax. A pleural drainage tube is present in the right mid-lung field. Subcutaneous emphysema is noted. | Looking at the images from the chest compared to the ones taken on March 20, 2010, we can see that the right lung has expanded a bit more than before. There's still a small amount of air leaking around the lung on the right side. There's a tube in the middle of the right lung to help drain fluid. There's also a bit of air trapped under the skin. | Compared to the previous exam from March 20, 2010, the right lung is a bit larger than it was before. There's still a collapsed lung at the top of the right lung. A tube is in place to drain fluid from the lungs in the middle of the right lung. There's also air under the skin. |
+| **Failure: technically correct but too clinical** | Midline sternotomy with cerclage. Hiatal hernia. Chronic parenchymal lung changes. Bilateral chronic pleural thickening. No pulmonary infiltrate suggestive of pneumonia identified. | There is a surgical cut down the middle of the chest with a wire loop used to close it. There is a hernia at the diaphragm opening. The lungs show long-term damage. Both sides of the lungs have chronic thickening of the lining. No signs of pneumonia are seen in the lungs. | A midline sternotomy with cerclage is present. Hiatal hernia is present. There are chronic changes in the lung tissue. There is ongoing thickening of the pleura on both sides. No signs of pneumonia are found. |
+
+Across the selected examples, two broad failure modes emerge:
+1. `Overly Clinical Language`
+
+In the 4th sample, the model produces a summary that is technically accurate but fails to fully simplify the language. Instead of paraphrasing jargon such as “sternotomy”, “cerclage”, and “pleura”, it repeats these clinical terms almost unchanged. This behaviour appears when the input consists of a dense cluster of specialised terms with no surrounding context. Each sentence in the report is essentially a list of named medical findings rather than descriptions of processes or effects. This is a concept-mapping failure caused by the structure of the input. In general, the model performs well when it can see *why* a clinical entity is mentioned (e.g. "pneumothorax in the upper right lobe"). Sequence models rely heavily on context windows, relational cues and patterns, but when the report is simply a stack of nouns, the model struggles, lacking the signal and context needed to rewrite the medical terms.
+
+2. `Semantic Drift on Rare Clinical Patterns`
+
+We can observe a hallucination issue on the 3rd sample. The model incorrectly describes a 'persistent apical pneumothorax' as a 'collapsed lung', which is somewhat related but not medically equivalent. This suggests that the model occasionally substitutes a more familiar medical concept when faced with highly specific terms. These errors tend to be small, but they directly affect factual correctness.
+
+
+Despite these issues, the model shows strong reliability across most cases, especially when the input phrasing is common in the training corpus. The model seldom invents findings, avoids major hallucinations, and generally preserves clinical meaning. However, failures do occur when the input consists largely of isolated, highly specific medical terms with no explanation. For instance, phrases such as "persistent right apical pneumothorax" or "hiatal hernia" provide little contextual structure for the model to interpret. In these cases, the model either repeats the terms verbatim rather than simplifying, or substitutes a more basic concept that may not be medically equivalent.
+
+
+
+
+
+
+
+
+
+
diff --git a/recognition/Flan_T5_s45893623/dataset.py b/recognition/Flan_T5_s45893623/dataset.py
new file mode 100644
index 000000000..940f9b5f8
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/dataset.py
@@ -0,0 +1,35 @@
+# Simple dataset handler
+from datasets import load_dataset
+from torch.utils.data import Dataset
+
+class BioSummDataset(Dataset):
+ def __init__(self, split="train", do_train_split=False):
+ ds = load_dataset("BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track")
+ # We optionally split the training data to get a held-out test set.
+        if do_train_split and split in ["train", "test"]: # NOTE: You must construct both train and test using do_train_split=True, otherwise the splits won't be executed for both
+ full_train = ds["train"]
+ split_ds = full_train.train_test_split(test_size=0.1, seed=42) # keep seed set at 42 to keep splits consistent.
+
+ self.ds = split_ds["train"] if split == "train" else split_ds["test"]
+ # Otherwise use the default train,validation,test split in BioSumm (NOTE: default test does not contain layman summary)
+ else:
+ self.ds = ds[split]
+
+ def __len__(self):
+ return len(self.ds)
+
+ def __getitem__(self, i):
+ # we do not care about the image or source, only text
+ x = self.ds[i]["radiology_report"]
+ y = self.ds[i]["layman_report"]
+ return x, y
+
+# Test main: prints out the first 10 in the dataset
+if __name__ == "__main__":
+ train_ds = BioSummDataset(split="train")
+ val_ds = BioSummDataset(split="validation")
+ test_ds = BioSummDataset(split="test")
+ for i in range(0, 10):
+ print(f"Train [{i}]: {train_ds[i][0]}")
+ print(f"Val [{i}]: {val_ds[i][0]}")
+ print(f"Test [{i}]: {test_ds[i][0]}")
\ No newline at end of file
diff --git a/recognition/Flan_T5_s45893623/eval.py b/recognition/Flan_T5_s45893623/eval.py
new file mode 100644
index 000000000..72fba9dcb
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/eval.py
@@ -0,0 +1,142 @@
+# eval.py
+# Computes plots and csvs using logged data from the training run.
+import os, sys, shutil
+
+def save_curves_and_plots_from_run(run_dir: str):
+ """
+ Rebuild CSVs and plots using the jsonl logs in a run-like directory (e.g., eval/).
+ Handles repeated 'step' values by constructing a monotonic 'gstep' and an inferred 'epoch'.
+ """
+ import csv, json
+ import matplotlib.pyplot as plt
+ import os
+
+ tl_jsonl = os.path.join(run_dir, "train_loss.jsonl")
+ vr_jsonl = os.path.join(run_dir, "val_rouge.jsonl")
+
+ # Load train loss
+ raw_loss = []
+ if os.path.isfile(tl_jsonl):
+ with open(tl_jsonl, "r", encoding="utf-8") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ try:
+ obj = json.loads(line)
+ raw_loss.append({"step": int(obj.get("step", 0)),
+ "loss": float(obj.get("loss", 0.0))})
+ except Exception:
+ pass
+
+ # Rebuild epoch + global step
+ loss_hist = []
+ if raw_loss:
+ epoch = 1
+ prev_step = -1
+ carry = 0
+ last_epoch_max = 0
+
+ for r in raw_loss:
+ s = r["step"]
+ # detect wrap (new epoch) when step doesn't increase
+ if s <= prev_step:
+ carry += max(last_epoch_max, prev_step)
+ last_epoch_max = 0
+ epoch += 1
+ last_epoch_max = max(last_epoch_max, s)
+ gstep = carry + s
+ loss_hist.append({"epoch": epoch, "step": s, "gstep": gstep, "loss": r["loss"]})
+ prev_step = s
+
+ # Load validation rouge
+ val_hist = []
+ if os.path.isfile(vr_jsonl):
+ with open(vr_jsonl, "r", encoding="utf-8") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ try:
+ obj = json.loads(line)
+ row = {"epoch": int(obj.get("epoch", 0))}
+ for k in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
+ if k in obj:
+ row[k] = float(obj[k])
+ val_hist.append(row)
+ except Exception:
+ pass
+
+ # Write CSVs
+ if loss_hist:
+ loss_csv = os.path.join(run_dir, "train_loss.csv")
+ with open(loss_csv, "w", newline="", encoding="utf-8") as f:
+ w = csv.DictWriter(f, fieldnames=["epoch", "step", "gstep", "loss"])
+ w.writeheader()
+ # already chronological, ensure by gstep
+ w.writerows(sorted(loss_hist, key=lambda d: d["gstep"]))
+
+ if val_hist:
+ fields = sorted({k for d in val_hist for k in d.keys()})
+ val_csv = os.path.join(run_dir, "val_rouge.csv")
+ with open(val_csv, "w", newline="", encoding="utf-8") as f:
+ w = csv.DictWriter(f, fieldnames=fields)
+ w.writeheader()
+ w.writerows(sorted(val_hist, key=lambda d: d.get("epoch", 0)))
+
+ # Create Plots
+ if loss_hist:
+ xs = [d["gstep"] for d in loss_hist]
+ ys = [d["loss"] for d in loss_hist]
+ plt.figure()
+ plt.plot(xs, ys)
+ plt.xlabel("global step"); plt.ylabel("loss"); plt.title("train loss")
+ plt.tight_layout()
+ plt.savefig(os.path.join(run_dir, "train_loss.png"))
+ plt.close()
+
+ if val_hist:
+ # single plot, multiple lines (rouge1, rouge2, rougeL, rougeLsum) vs epoch
+ epochs = [d.get("epoch", 0) for d in val_hist]
+ metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
+
+ plt.figure()
+ for metric_key in metrics:
+ ys = [d.get(metric_key, 0.0) for d in val_hist]
+ plt.plot(epochs, ys, marker="o", label=metric_key)
+ plt.xlabel("epoch")
+ plt.ylabel("ROUGE score")
+ plt.title("validation ROUGE over epochs")
+ plt.legend()
+ plt.tight_layout()
+ plt.savefig(os.path.join(run_dir, "val_rouge.png"))
+ plt.close()
+
+def main():
+ if len(sys.argv) != 2:
+        print("Usage: python eval.py <run_dir>")
+ sys.exit(1)
+
+ run_dir = sys.argv[1]
+ if not os.path.isdir(run_dir):
+ print(f"Error: '{run_dir}' is not a valid directory")
+ sys.exit(1)
+
+ run_name = os.path.basename(os.path.normpath(run_dir))
+ eval_dir = os.path.join("eval", run_name)
+
+ # create eval/run_name folder, copy jsonls so function can read them
+ os.makedirs(eval_dir, exist_ok=True)
+ for fname in ("train_loss.jsonl", "val_rouge.jsonl"):
+ src = os.path.join(run_dir, fname)
+ dst = os.path.join(eval_dir, fname)
+ if os.path.isfile(src):
+ shutil.copy2(src, dst)
+
+ print(f"Rebuilding plots into {eval_dir} ...")
+ save_curves_and_plots_from_run(eval_dir)
+ print(f"Done. Outputs saved under {eval_dir}")
+
+# Usage: python eval.py runs/flan_t5_lora
+if __name__ == "__main__":
+ main()
diff --git a/recognition/Flan_T5_s45893623/images/flan_t5_architecture.png b/recognition/Flan_T5_s45893623/images/flan_t5_architecture.png
new file mode 100644
index 000000000..2b0498de1
Binary files /dev/null and b/recognition/Flan_T5_s45893623/images/flan_t5_architecture.png differ
diff --git a/recognition/Flan_T5_s45893623/images/multi_val.png b/recognition/Flan_T5_s45893623/images/multi_val.png
new file mode 100644
index 000000000..7ef35ad5d
Binary files /dev/null and b/recognition/Flan_T5_s45893623/images/multi_val.png differ
diff --git a/recognition/Flan_T5_s45893623/images/train_loss.png b/recognition/Flan_T5_s45893623/images/train_loss.png
new file mode 100644
index 000000000..d02d0b9b4
Binary files /dev/null and b/recognition/Flan_T5_s45893623/images/train_loss.png differ
diff --git a/recognition/Flan_T5_s45893623/images/val_rouge.png b/recognition/Flan_T5_s45893623/images/val_rouge.png
new file mode 100644
index 000000000..be9d5a061
Binary files /dev/null and b/recognition/Flan_T5_s45893623/images/val_rouge.png differ
diff --git a/recognition/Flan_T5_s45893623/modules.py b/recognition/Flan_T5_s45893623/modules.py
new file mode 100644
index 000000000..6b7cc9d37
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/modules.py
@@ -0,0 +1,22 @@
+# modules.py
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+from peft import LoraConfig, get_peft_model, TaskType
+
+# returns fast tokenizer for seq2seq models
+def load_tokenizer(model_name: str):
+ return AutoTokenizer.from_pretrained(model_name, use_fast=True)
+
+# builds and returns the base-flan-t5 with attached LoRA adapters.
+def build_flan_t5_with_lora(model_name="google/flan-t5-base", r=8, alpha=16, dropout=0.05):
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ cfg = LoraConfig(
+ task_type=TaskType.SEQ_2_SEQ_LM,
+ r=r,
+ lora_alpha=alpha,
+ lora_dropout=dropout,
+ # NOTE: these match t5's projection layers
+ target_modules=["q", "k", "v", "o"],
+ bias="none",
+ )
+
+ return get_peft_model(model, cfg) # we convert the model to cuda outside this function, to prevent device mismatch issues
\ No newline at end of file
diff --git a/recognition/Flan_T5_s45893623/predict.py b/recognition/Flan_T5_s45893623/predict.py
new file mode 100644
index 000000000..8615e72bb
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/predict.py
@@ -0,0 +1,65 @@
+# predict.py
+import argparse
+import torch
+import sys
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+from datasets import load_dataset
+
+DEFAULT_PROMPT = (
+ "You are a helpful medical assistant. Rewrite the radiology report for a layperson "
+ "in 1–3 sentences, avoid jargon, use plain language.\n\n"
+ "Report:\n{rad_report}\n\nLayperson summary:"
+)
+
+# Loads the model checkpoint and does prediction
+@torch.no_grad()
+def predict(report_text, ckpt_dir="runs/flan_t5_lora", prompt=None, beams=4, max_new=128):
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
+ tok = AutoTokenizer.from_pretrained(ckpt_dir, use_fast=True)
+ model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir).to(dev).eval()
+
+ p = (prompt or DEFAULT_PROMPT).format(rad_report=report_text)
+ enc = tok([p], return_tensors="pt", truncation=True, max_length=1024).to(dev)
+ out = model.generate(
+ **enc,
+ max_new_tokens=max_new,
+ num_beams=beams,
+ early_stopping=True
+ )
+ return tok.batch_decode(out, skip_special_tokens=True)[0]
+
+# Loads up an interactive check. Uses same default run dir as train.py. If idx is set, it computes the summary for that val_report[idx] and exits.
+def main():
+ p = argparse.ArgumentParser()
+ p.add_argument("--ckpt", type=str, default="runs/flan_t5_lora",
+ help="Checkpoint directory")
+ p.add_argument("--idx", type=int, default=None,
+ help="Index of val-set report to evaluate")
+ args = p.parse_args()
+
+ # If idx is provided: run prediction on that test report
+ if args.idx is not None:
+ ds = load_dataset(
+ "BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track"
+ )["validation"]
+
+ report = ds[args.idx]["radiology_report"]
+ gold = ds[args.idx]["layman_report"]
+ print(f"\n--- Test Sample {args.idx} ---")
+ print("Radiology Report:\n", report, "\n")
+ print("Gold Summary:\n", gold, "\n")
+ pred = predict(report, ckpt_dir=args.ckpt)
+ print("Model Prediction:\n", pred, "\n")
+ return
+
+ # Otherwise: interactive chat
+ print(f"Chat with your FLAN-T5 model ({args.ckpt}). Please enter only the report you want summarised. Type 'exit' to quit.\n")
+ while True:
+ msg = input("You: ").strip()
+ if msg.lower() in {"exit", "quit"}:
+ break
+ reply = predict(msg, ckpt_dir=args.ckpt)
+ print("Model:", reply, "\n")
+
+if __name__ == "__main__":
+ main()
diff --git a/recognition/Flan_T5_s45893623/requirements.txt b/recognition/Flan_T5_s45893623/requirements.txt
new file mode 100644
index 000000000..e4d3c29a1
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/requirements.txt
@@ -0,0 +1,12 @@
+torch==2.9.0
+transformers==4.57.1
+datasets==4.4.1
+evaluate==0.4.6
+peft==0.17.1
+sentencepiece==0.2.1
+numpy==2.3.3
+matplotlib==3.10.7
+tqdm==4.67.1
+absl-py==2.3.1
+nltk==3.9.2
+rouge_score==0.1.2
diff --git a/recognition/Flan_T5_s45893623/train.py b/recognition/Flan_T5_s45893623/train.py
new file mode 100644
index 000000000..cb889e749
--- /dev/null
+++ b/recognition/Flan_T5_s45893623/train.py
@@ -0,0 +1,307 @@
+# train.py
+# flan t5 base + LORA training.
+import os, math, time, argparse, random, json
+import numpy as np
+import torch
+from torch.utils.data import DataLoader
+from torch.optim import AdamW
+from torch.cuda.amp import GradScaler, autocast
+from transformers import get_linear_schedule_with_warmup
+import evaluate # for rouge scoring
+
+# Our codebase
+from dataset import BioSummDataset
+from modules import load_tokenizer, build_flan_t5_with_lora
+
+PROMPT = (
+ "You are a helpful medical assistant. Rewrite the radiology report for a layperson "
+ "in 1–3 sentences, avoid jargon, use plain language.\n\n"
+ "Report:\n{rad_report}\n\nLayperson summary:"
+)
+
+# All params have default values. To re-produce our results just run 'python train.py'
+def args_parse():
+ p = argparse.ArgumentParser()
+ p.add_argument("--model_name", default="google/flan-t5-base")
+ p.add_argument("--out_dir", default="runs/flan_t5_lora")
+ p.add_argument("--epochs", type=int, default=5)
+ p.add_argument("--lr", type=float, default=2e-4)
+ p.add_argument("--wd", type=float, default=0.01)
+ p.add_argument("--warmup_steps", type=int, default=500)
+ p.add_argument("--batch_size", type=int, default=2)
+ p.add_argument("--grad_accum", type=int, default=8)
+ p.add_argument("--max_input_len", type=int, default=1024)
+ p.add_argument("--max_target_len", type=int, default=256)
+ p.add_argument("--val_beams", type=int, default=4)
+ p.add_argument("--val_max_new_tokens", type=int, default=128)
+ p.add_argument("--lora_r", type=int, default=8)
+ p.add_argument("--lora_alpha", type=int, default=16)
+ p.add_argument("--lora_dropout", type=float, default=0.05)
+ p.add_argument("--seed", type=int, default=1337)
+ p.add_argument("--fp16", action="store_true")
+    p.add_argument("--split", action="store_true", help="use the optional 90:10 train/test split from dataset.py")
+ return p.parse_args()
+
+def set_seed(s):
+ random.seed(s); np.random.seed(s)
+ torch.manual_seed(s); torch.cuda.manual_seed_all(s)
+
+# Collates report,summary pairs into tokenised encoder/decoder tensors.
+class Batchify:
+
+ def __init__(self, tok, max_in, max_out):
+ self.tok = tok
+ self.max_in = max_in
+ self.max_out = max_out
+ self.pad_id = tok.pad_token_id
+
+ def __call__(self, batch):
+ src = [PROMPT.format(rad_report=x) for x, _ in batch] # include radiology report in prompt
+ tgt = [y for _, y in batch]
+ # encode src and tgt seperately so we can mask
+ enc = self.tok(src, padding=True, truncation=True, max_length=self.max_in, return_tensors="pt")
+ dec = self.tok(text_target=tgt, padding=True, truncation=True, max_length=self.max_out, return_tensors="pt")
+ labels = dec["input_ids"]
+ # pad tokens get -100 so loss ignores them.
+ labels[labels == self.pad_id] = -100
+ return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"], "labels": labels}
+
+# Computes model rouge
+@torch.no_grad()
+def score_rouge(model, tok, loader, dev, max_new_tokens, beams):
+
+ if loader is None: return None
+ model.eval()
+ preds, refs = [], []
+
+ for b in loader:
+ b = {k: v.to(dev) for k, v in b.items()}
+ gen = model.generate(
+ input_ids=b["input_ids"],
+ attention_mask=b["attention_mask"],
+ max_new_tokens=max_new_tokens,
+ num_beams=beams,
+ early_stopping=True # only 1-3 max sentences
+ )
+
+ # put the pads back so decode handles -100s
+ tgt = b["labels"].clone()
+ tgt[tgt == -100] = tok.pad_token_id
+ preds.extend(tok.batch_decode(gen, skip_special_tokens=True))
+ refs.extend(tok.batch_decode(tgt, skip_special_tokens=True))
+
+ r = evaluate.load("rouge")
+ out = r.compute(predictions=preds, references=refs, use_stemmer=True)
+
+ return {k: float(out[k]) for k in ("rouge1","rouge2","rougeL","rougeLsum") if k in out}
+
+# small helper for logging model params
+def param_counts(m):
+ total = sum(p.numel() for p in m.parameters())
+ trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
+ return total, trainable
+
+# Epoch logic: 150k rows
+def run_one_epoch(model, loader, optim, sched, scaler, dev, accum, use_amp, log_every=50, loss_hist=None, loss_json_path=None, step_hook=None):
+
+ model.train()
+ total, shown, t0 = 0.0, 0.0, time.time()
+ steps = 0
+
+ for i, batch in enumerate(loader, 1):
+
+ batch = {k: v.to(dev) for k, v in batch.items()}
+
+ with autocast(enabled=use_amp):
+ out = model(**batch)
+ # gradient accumulation
+ loss = out.loss / accum
+
+ scaler.scale(loss).backward()
+ total += loss.item()
+ shown += loss.item()
+
+ if i % accum == 0:
+ scaler.unscale_(optim)
+ torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # prevents spikes
+ scaler.step(optim); scaler.update()
+ optim.zero_grad(set_to_none=True)
+
+ if sched is not None: sched.step()
+ steps += 1
+
+ # we call model peak here
+ if step_hook is not None and (steps % 500 == 0):
+ step_hook(steps)
+
+ # logging loss, default is to log every 50 steps
+ if steps % log_every == 0:
+ dt = time.time() - t0
+ mean_loss = shown / log_every
+ print({"step": steps, "loss": round(mean_loss, 6), "sec": round(dt, 2)})
+ loss_hist.append({"step": steps, "loss": float(mean_loss)})
+
+ # append loss hist to json, so we can recreate plots
+ if loss_json_path is not None:
+ with open(loss_json_path, "a", encoding="utf-8") as jf:
+ jf.write(json.dumps({"step": int(steps), "loss": float(mean_loss)}) + "\n")
+
+
+ shown = 0.0; t0 = time.time()
+
+
+ return total / max(1, steps)
+
+# sanity check - we peek at the model every 500 steps & ask it to generate a summary of the first report in the dataset.
+@torch.no_grad()
+def model_peak(model, tok, dev, dataset, beams=4, max_new=128):
+
+ model.eval()
+ x, y = dataset[0] # we pick the same report each time so we can see how it improves
+ enc = tok([PROMPT.format(rad_report=x)], return_tensors="pt", truncation=True, max_length=1024).to(dev)
+
+ out = model.generate(
+ input_ids=enc["input_ids"],
+ attention_mask=enc["attention_mask"],
+ max_new_tokens=max_new,
+ num_beams=beams,
+ early_stopping=True
+ )
+ pred = tok.batch_decode(out, skip_special_tokens=True)[0]
+
+ print("---------MODEL_PEAK----------")
+ print(f"Radiology Report: \n{x}")
+ print(f"True: \n {y}")
+ print(f"LLM: \n {pred}")
+ print("-----------------------------")
+
+# Main training loop
+def main():
+ # Setup
+ a = args_parse()
+ os.makedirs(a.out_dir, exist_ok=True)
+ set_seed(a.seed)
+ dev = "cuda" if torch.cuda.is_available() else "cpu"
+
+ t_start = time.time() # for logging train time
+
+ tok = load_tokenizer(a.model_name)
+ model = build_flan_t5_with_lora(
+ model_name=a.model_name, r=a.lora_r, alpha=a.lora_alpha, dropout=a.lora_dropout
+ )
+ model.config.use_cache = False
+ model.to(dev)
+
+ # Datasets
+ train_ds = BioSummDataset(split="train", do_train_split=a.split)
+ val_ds = BioSummDataset(split="validation")
+ test_ds = BioSummDataset(split="test", do_train_split=a.split)
+ collate = Batchify(tok, a.max_input_len, a.max_target_len)
+ train_loader = DataLoader(train_ds, batch_size=a.batch_size, shuffle=True, collate_fn=collate)
+ val_loader = DataLoader(val_ds, batch_size=8, shuffle=False, collate_fn=collate)
+ test_loader = DataLoader(test_ds, batch_size=8, shuffle=False, collate_fn=collate)
+
+    optim = AdamW(model.parameters(), lr=a.lr, weight_decay=a.wd) # AdamW optimiser
+ total_updates = math.ceil(len(train_loader) / max(1, a.grad_accum)) * a.epochs # updates = round_up(batches per epoch /accum) * epochs
+ warm = min(a.warmup_steps, max(1, total_updates // 20))
+ sched = get_linear_schedule_with_warmup(optim, num_warmup_steps=warm, num_training_steps=total_updates)
+ scaler = GradScaler(enabled=(a.fp16 and dev == "cuda"))
+
+ # log loss and val
+ loss_hist = []
+ val_hist = []
+ # we also save the histories so we can recreate the plots if needed.
+ loss_json_path = os.path.join(a.out_dir, "train_loss.jsonl")
+ val_json_path = os.path.join(a.out_dir, "val_rouge.jsonl")
+
+ # model peak prior to training
+ model_peak(model, tok, dev, train_ds, beams=a.val_beams, max_new=a.val_max_new_tokens)
+
+ # we pass this _probe as a function reference to our epoch trainer, runs every 500 steps.
+ def _probe(_step):
+ model_peak(model, tok, dev, train_ds, beams=a.val_beams, max_new=a.val_max_new_tokens)
+
+ total_params, trainable_params = param_counts(model)
+
+ # For final report
+ gpu_name, vram_gb = None, None
+ if dev == "cuda":
+ try:
+ gpu_name = torch.cuda.get_device_name(0)
+ except Exception:
+ gpu_name = "unknown"
+ try:
+ _, total = torch.cuda.mem_get_info()
+ vram_gb = round(total / (1024**3), 2)
+ except Exception:
+ vram_gb = None
+
+ # TRAINING LOOP
+ best = -1.0
+ for ep in range(1, a.epochs + 1):
+ print(f"\nepoch {ep}/{a.epochs}")
+ tr_loss = run_one_epoch(
+ model, train_loader, optim, sched, scaler, dev, a.grad_accum, a.fp16, # training params
+ log_every=50, loss_hist=loss_hist, loss_json_path=loss_json_path, # logging params
+ step_hook=_probe # < - model peak here
+ )
+ print({"train_loss": round(float(tr_loss), 6)})
+
+ # one epoch done, time for validation
+ scores = score_rouge(model, tok, val_loader, dev, a.val_max_new_tokens, a.val_beams)
+ if scores:
+ msg = {k: round(v, 4) for k, v in scores.items()}
+ print({"val": msg})
+ row = {"epoch": ep, **{k: float(v) for k, v in scores.items()}}
+ val_hist.append(row)
+
+ # write json
+ with open(val_json_path, "a", encoding="utf-8") as jf:
+ jf.write(json.dumps(row) + "\n")
+
+ cur = scores.get("rougeLsum", scores.get("rougeL", -1.0))
+ if cur > best:
+ best = cur
+ model.save_pretrained(a.out_dir) # model epoch checkpoint
+ tok.save_pretrained(a.out_dir)
+ # logging + print to show output dir
+ with open(os.path.join(a.out_dir, "best.json"), "w", encoding="utf-8") as f:
+ f.write(str({"epoch": ep, "metric": cur}))
+ print({"save": a.out_dir, "metric": round(cur, 4)})
+ # ALL EPOCHS DONE!
+
+ # Eval
+ final_val = score_rouge(model, tok, val_loader, dev, a.val_max_new_tokens, a.val_beams)
+ print({"final_val": {k: round(v, 4) for k, v in final_val.items()}})
+ final_test = score_rouge(model, tok, test_loader, dev, a.val_max_new_tokens, a.val_beams)
+ print({"final_test": {k: round(v, 4) for k, v in final_test.items()}}) # Will output 0 if there's no split.
+
+ # NOTE: We make plots/csvs after training using eval.py and the saved jsonl's
+
+ t_total = round(time.time() - t_start, 2)
+
+ # dump a report after all epochs. Only static model + system details here.
+ report = {
+ "model_name": a.model_name,
+ "total_params": int(total_params),
+ "trainable_params": int(trainable_params),
+ "lora_r": a.lora_r,
+ "lora_alpha": a.lora_alpha,
+ "lora_dropout": a.lora_dropout,
+ "gpu_name": gpu_name,
+ "gpu_vram_gb": vram_gb,
+ "epochs": a.epochs,
+ "batch_size": a.batch_size,
+ "grad_accum": a.grad_accum,
+ "warmup_steps": a.warmup_steps,
+ "lr": a.lr,
+ "weight_decay": a.wd,
+ "total_training_seconds": t_total,
+ }
+ with open(os.path.join(a.out_dir, "train_report.json"), "w", encoding="utf-8") as f:
+ json.dump(report, f, indent=2)
+ print({"train_report": report})
+
+if __name__ == "__main__":
+ main()
+