A modular Coffea-based analysis framework for CMS Run 3 bbMET analysis.
DarkBottomLine processes NanoAOD datasets with Coffea, producing flat output (ROOT or Parquet) containing analysis-level variables. The framework is generic and is configured for each Run 3 year (2022–2024) via metaconditions.
- Modular Design: Separate modules for objects, selections, corrections, weights, and histograms
- Config-Driven: Year-specific parameters in YAML configuration files
- Coffea Integration: Uses Coffea NanoEvents for efficient event processing
- Correction Support: Integration with correctionlib for scale factors
- Multiple Executors: Support for iterative, futures, and Dask execution backends
- Flexible Output: Support for ROOT, Parquet, and pickle output formats
- Validation Tools: Jupyter notebook for framework validation and plotting
- Python 3.9+
- Conda or pip package manager
- Clone the repository:

```bash
git clone <repository-url>
cd DarkBottomLine
```

- Create a conda environment:

```bash
conda create -n darkbottomline python=3.9
conda activate darkbottomline
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Install the package in development mode:

```bash
pip install -e .
```

- Source the LCG environment:

```bash
source /cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/setup.sh
```

If the above method does not work, install and load a CMSSW release instead:

- Set up the CMSSW release:

```bash
cmsrel CMSSW_15_0_17
cd CMSSW_15_0_17/src
cmsenv
```

- Clone the repository:

```bash
git clone https://github.com/tiwariPC/DarkBottomLine.git
cd DarkBottomLine
```

- Check the packages pre-installed with the CMSSW release:

```bash
python3 check_requirements.py
# To install the missing packages
python3 check_requirements.py --install --local-dir ./.local
# Add the PYTHONPATH suggested in the output of the above installation to your Python path
```

- Run the final installation script:

```bash
chmod +x install_lxplus.sh
./install_lxplus.sh
```

After installation, you need to set up your environment before using DarkBottomLine. The install_lxplus.sh script sets up the environment for the current session automatically, but for future logins you'll need the start.sh script.
When you run install_lxplus.sh, it automatically:
- Sets up `PYTHONPATH` to include the installed packages
- Adds the `darkbottomline` command to your `PATH`
- Exports these paths for the current session

After running install_lxplus.sh, you can immediately use DarkBottomLine in that session (after sourcing the LCG environment).
Every time you start a new shell session on lxplus, you need to:

- Source the LCG environment first (critical):

```bash
source-lcg
# Or if you don't have the function:
source /cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc11-opt/setup.sh
```

- Source the start.sh script to set up the DarkBottomLine environment:

```bash
cd /path/to/DarkBottomLine
source start.sh
# Or:
. start.sh
```

- Verify the setup:

```bash
darkbottomline --help
# Or:
python3 -c "from darkbottomline import DarkBottomLineProcessor; print('✓ Import successful')"
```

To submit jobs with HTCondor:

```bash
cd condorJobs
# Edit submit.sub: change the user letter and username in line 3
# Change the <full_path> in runanalysis.sh and relevant commands as needed
voms-proxy-init --voms cms --valid 192:00 && cp /tmp/x509up_u$(id -u) /afs/cern.ch/user/u/username/private/
condor_submit submit.sub
```

The DarkBottomLine framework supports a complete analysis workflow from NanoAOD processing to plot generation. Here's how to run the entire analysis:
Run the analysis on your input files (data, MC backgrounds, signal). You can provide input files one by one, as multiple arguments, or listed in a .txt file.
```bash
# Activate virtual environment
source venv/bin/activate

# Run analysis on a single data file
darkbottomline analyze \
    --config configs/2024.yaml \
    --regions-config configs/regions.yaml \
    --input /path/to/data/nano_data.root \
    --output outputs/hists/regions_data.pkl \
    --max-events 10000

# Run analysis on MC backgrounds from a list of files in a .txt file
darkbottomline analyze \
    --config configs/2024.yaml \
    --regions-config configs/regions.yaml \
    --input my_background_files.txt \
    --output outputs/hists/regions_dy.pkl

# Run analysis on multiple signal files directly
darkbottomline analyze \
    --config configs/2024.yaml \
    --regions-config configs/regions.yaml \
    --input /path/to/signal/nano_signal_1.root /path/to/signal/nano_signal_2.root \
    --output outputs/hists/regions_signal.pkl
```

When using a .txt file for input, list one file path per line. Empty lines and lines starting with # will be ignored.
Analysis Options:
- `--config`: Base configuration file (e.g., `configs/2024.yaml`)
- `--regions-config`: Regions configuration file (e.g., `configs/regions.yaml`)
- `--input`: Input NanoAOD ROOT file(s). Can be a single file, multiple files, or a `.txt` file containing a list of file paths.
- `--output`: Output pickle file path
- `--executor`: Execution backend (iterative, futures, dask) - default: iterative
- `--workers`: Number of parallel workers (for futures/dask) - default: 4
- `--chunk-size`: Number of events per chunk for futures/dask executors (default: 50000 for futures, 200000 for dask). Useful for managing memory with large files.
- `--max-events`: Maximum number of events to process (optional, for testing). For futures/dask executors, this is converted to maxchunks based on chunk-size (see the sketch after this list).
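The `--max-events` to maxchunks conversion mentioned above amounts to a ceiling division; a sketch, assuming this is how the CLI derives it:

```python
import math

def to_maxchunks(max_events: int, chunk_size: int) -> int:
    """Smallest number of chunks that covers max_events events."""
    return math.ceil(max_events / chunk_size)

# Example: to_maxchunks(10_000, 50_000) == 1 (one 50k-event chunk is enough)
```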
Generate data/MC plots from the analysis results:
```bash
# Generate plots from analysis results
darkbottomline make-plots \
    --input outputs/hists/regions_data.pkl \
    --save-dir outputs \
    --show-data

# With custom plotting configuration
darkbottomline make-plots \
    --input outputs/hists/regions_data.pkl \
    --save-dir outputs \
    --show-data \
    --plot-config configs/plotting.yaml \
    --version 20251105_1100
```

Plotting Options:
- `--input`: Input results pickle file
- `--save-dir`: Base output directory (default: `outputs`)
- `--show-data`: Include data points on plots
- `--plot-config`: Plotting configuration file (default: `configs/plotting.yaml`)
- `--version`: Version string for output directory (default: auto-generated timestamp)
- `--regions`: Specific regions to plot (optional, default: all regions)
```bash
# 1. Setup
source venv/bin/activate
cd /path/to/DarkBottomLine

# 2. Run analysis on all samples (using a .txt file for inputs)
# Create a file, e.g. dy_inputs.txt, with your list of ROOT files.
darkbottomline analyze \
    --config configs/2024.yaml \
    --regions-config configs/regions.yaml \
    --input dy_inputs.txt \
    --output outputs/hists/regions_dy.pkl

# 3. Generate plots
darkbottomline make-plots \
    --input outputs/hists/regions_data.pkl \
    --save-dir outputs \
    --show-data

# 4. Plots are saved in: outputs/plots/{version}/
#    - PNG:  outputs/plots/{version}/png/{category}/{region}/
#    - PDF:  outputs/plots/{version}/pdf/{category}/{region}/
#    - ROOT: outputs/plots/{version}/root/
#    - Text: outputs/plots/{version}/text/{category}/{region}/
#    - Summary: outputs/plots/{version}/region_summary.{png,pdf}
```

For a simple single-region analysis without the multi-region framework:
```bash
darkbottomline run \
    --config configs/2024.yaml \
    --input /path/to/nanoaod_or_file_list.txt \
    --output results.pkl \
    --executor iterative \
    --event-selection-output output/event_selected.pkl  # optional: save events passing event-level selection
```

Analysis Commands:
- `analyze`: Multi-region analysis with full region definitions
- `run`: Simple single-region analysis
- `--config`: Path to YAML configuration file
- `--regions-config`: Path to regions configuration file (for the `analyze` command)
- `--input`: Path to input NanoAOD file(s). Can be a single file, multiple files, or a `.txt` file containing a list of file paths.
- `--output`: Path to output file (supports .parquet, .root, .pkl)
- `--executor`: Execution backend (iterative, futures, dask)
- `--workers`: Number of parallel workers (for futures/dask)
- `--chunk-size`: Number of events per chunk for futures/dask executors (default: 50000 for futures, 200000 for dask). Helps manage memory with large files.
- `--max-events`: Maximum number of events to process. For futures/dask executors, converted to maxchunks based on chunk-size.
- `--event-selection-output`: Optional path to save events that pass the event-level selection (supports `.pkl` and `.root`); see the loading sketch after this list.
  - If you provide a `.pkl` path, a plain-Python-serializable pickle will be saved, and a raw awkward backup `*.awk_raw.pkl` will also be created.
  - If you provide a `.root` path, a small ROOT TTree `Events` will be written containing scalar branches (event identifiers, MET scalars, and object multiplicities).
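Reading these outputs back is straightforward; a minimal sketch using the standard pickle and uproot APIs (the exact pickle layout depends on the framework version):

```python
import pickle
import uproot

# Load the plain-Python pickle of selected events
with open("output/event_selected.pkl", "rb") as f:
    selected = pickle.load(f)

# Or read the scalar branches from the ROOT TTree
with uproot.open("output/event_selected.root") as f:
    events = f["Events"].arrays()  # event identifiers, MET scalars, multiplicities
```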
Plotting Commands:
- `make-plots`: Generate individual variable plots and grouped plots
- `make-stacked-plots`: Generate stacked Data/MC plots with ratio
- `--show-data`: Show data points on plots
- `--plot-config`: Plotting configuration file
- `--version`: Version string for output directory
The input flexibility works with all executors. For example:
```bash
# Iterative execution (single-threaded, good for debugging)
python run_analysis.py --config configs/2023.yaml --input my_files.txt --output results.pkl --executor iterative

# Futures execution (multi-threaded, good for local parallelization)
python run_analysis.py --config configs/2023.yaml --input file1.root file2.root --output results.parquet --executor futures --workers 4

# Futures execution with custom chunk size (for large files)
python run_analysis.py --config configs/2023.yaml --input large_file.root --output results.pkl --executor futures --workers 8 --chunk-size 100000

# Dask execution (distributed, good for production)
python run_analysis.py --config configs/2023.yaml --input nanoaod.root --output results.root --executor dask --workers 8

# Dask execution with custom chunk size (default is 200000 for dask)
python run_analysis.py --config configs/2023.yaml --input large_file.root --output results.root --executor dask --workers 8 --chunk-size 500000
```

Chunk Size Notes:
- Chunk size controls how many events are processed per chunk, helping manage memory usage
- Smaller chunks (e.g., 50000) use less memory but incur more overhead
- Larger chunks (e.g., 200000+) are more efficient but require more memory
- Default: 50000 for the futures executor, 200000 for the dask executor
- Only applies to the `futures` and `dask` executors (which use `run_uproot_job` internally)
- The `iterative` executor loads all events at once and doesn't use chunking
The framework uses YAML configuration files for year-specific parameters. Configuration files are located in the configs/ directory:
- `configs/2022.yaml`: 2022 data-taking parameters
- `configs/2023.yaml`: 2023 data-taking parameters
- `configs/2024.yaml`: 2024 data-taking parameters
- `configs/regions.yaml`: Region definitions with categories and channels
- `configs/plotting.yaml`: Plotting configuration and exclusions
Regions are defined in configs/regions.yaml with the format: `{category}:{region_type}_{channel}`

Categories:
- `1b`: 1 b-tag category (≤2 jets, 1 b-jet)
- `2b`: 2 b-tag category (3 jets, 2 b-jets)

Region Types:
- `SR`: Signal region
- `CR_Wlnu`: W+jets control region
- `CR_Top`: Top control region
- `CR_Zll`: Z+jets control region

Channels:
- `mu`: Muon channel
- `el`: Electron channel

Example Regions:
- `1b:SR` - Signal region, 1 b-tag
- `2b:SR` - Signal region, 2 b-tags
- `1b:CR_Wlnu_mu` - W+jets CR, 1b, muon channel
- `2b:CR_Top_el` - Top CR, 2b, electron channel
- `1b:CR_Zll_mu` - Z+jets CR, 1b, muon channel
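The naming format is mechanical, so region names can be decomposed programmatically; a small illustrative helper (hypothetical, not part of the framework):

```python
def parse_region(name: str):
    """Split '{category}:{region_type}_{channel}' into its parts."""
    category, rest = name.split(":", 1)
    if rest == "SR":
        return category, "SR", None  # signal regions carry no channel suffix
    region_type, channel = rest.rsplit("_", 1)
    return category, region_type, channel

# parse_region("1b:CR_Wlnu_mu") -> ("1b", "CR_Wlnu", "mu")
# parse_region("2b:SR")         -> ("2b", "SR", None)
```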
A year configuration file looks like this:

```yaml
year: 2023
lumi: 35.9  # fb^-1

# Correction file paths
corrections:
  pileup: data/corrections/pileup_2023.json.gz
  btagSF: data/corrections/btagging_2023.json.gz
  muonSF: data/corrections/muonSF_2023.json.gz
  electronSF: data/corrections/electronSF_2023.json.gz

# Trigger paths
triggers:
  MET: ["HLT_PFMET120_PFMHT120_IDTight"]
  SingleMuon: ["HLT_IsoMu24", "HLT_IsoMu27"]

# Object selection cuts
objects:
  muons:
    pt_min: 20.0
    eta_max: 2.4
    id: "tight"
    iso: "tight"
  # ... more object configurations

# Event selection
event_selection:
  min_muons: 0
  max_muons: 2
  min_jets: 2
  min_bjets: 1
  met_min: 50.0
```
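Assuming the framework reads these files with PyYAML (a reasonable but unverified assumption), loading one is a one-liner:

```python
import yaml

# Parse the year configuration into a plain dict
with open("configs/2023.yaml") as f:
    config = yaml.safe_load(f)

print(config["event_selection"]["met_min"])  # 50.0
```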
The analysis uses a category-based region structure with channel separation:

- 1b Category: 1 b-tag, ≤2 jets
  - SR: `1b:SR`
  - W CR: `1b:CR_Wlnu_mu`, `1b:CR_Wlnu_el`
  - Z CR: `1b:CR_Zll_mu`, `1b:CR_Zll_el`
  - No Top CR (removed as per requirements)
- 2b Category: 2 b-tags, 3 jets (Top CR may have >3 jets)
  - SR: `2b:SR`
  - Top CR: `2b:CR_Top_mu`, `2b:CR_Top_el`
  - Z CR: `2b:CR_Zll_mu`, `2b:CR_Zll_el`
  - No W CR (removed as per requirements)
Z CR Separation:
- Z_1b: `(njet <= 2) and (jet1Pt > 100.)`
- Z_2b: `(njet <= 3 and njet > 1) and (jet1Pt > 100.)`
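In columnar form these two cuts are simple boolean masks; a sketch with awkward arrays, assuming per-event `njet` and `jet1_pt` columns (names illustrative):

```python
import awkward as ak

# Illustrative per-event columns
njet = ak.Array([1, 2, 3, 4])
jet1_pt = ak.Array([120.0, 90.0, 150.0, 200.0])

z_1b_mask = (njet <= 2) & (jet1_pt > 100.0)
z_2b_mask = (njet > 1) & (njet <= 3) & (jet1_pt > 100.0)
# z_1b_mask -> [True, False, False, False]
# z_2b_mask -> [False, False, True, False]
```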
Channel Separation:
- All CRs (Top, W, Z) have separate muon and electron channels
- Taus are vetoed for the full analysis
Physics object selection and cleaning functions:
- `select_muons()`: Muon selection with ID and isolation cuts
- `select_electrons()`: Electron selection with ID and isolation cuts
- `select_taus()`: Tau selection with ID and decay-mode cuts
- `select_jets()`: AK4 jet selection with jet ID cuts
- `select_fatjets()`: AK8 fat jet selection
- `clean_jets_from_leptons()`: Delta-R based overlap removal
- `get_bjet_mask()`: B-tagging working-point selection
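Delta-R overlap removal follows a standard columnar pattern; a minimal sketch assuming NanoEvents collections with coffea vector behaviors (not necessarily the exact body of `clean_jets_from_leptons()`):

```python
import awkward as ak

def clean_jets_from_leptons(jets, leptons, dr_min=0.4):
    """Keep jets separated from every selected lepton by at least dr_min."""
    # All jet-lepton pairs per event, nested so axis=-1 runs over leptons
    pairs = ak.cartesian({"jet": jets, "lep": leptons}, nested=True)
    dr = pairs.jet.delta_r(pairs.lep)
    # A jet survives if it is far from all leptons (vacuously true if none)
    keep = ak.all(dr > dr_min, axis=-1)
    return jets[keep]
```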
Multi-region analysis with category and channel separation:
- `RegionManager`: Manages multiple analysis regions
- `Region`: Single region with cuts and properties
- `apply_regions()`: Apply region cuts to events
- Supports category-based regions (1b, 2b) with channel separation
Multi-region analysis processor:
- `DarkBottomLineAnalyzer`: Extends the base processor for multi-region analysis
- `process()`: Process events through all defined regions
- `_fill_region_histograms()`: Fill histograms for each region
- `_calculate_region_cutflow()`: Calculate the cutflow per region
- `save_results()`: Save results with full region names preserved
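For orientation, a coffea processor of this shape follows the standard `ProcessorABC` pattern; a stripped-down sketch (method names from the list above, bodies illustrative only):

```python
import awkward as ak
from coffea import processor

class DarkBottomLineAnalyzer(processor.ProcessorABC):
    def __init__(self, config, regions):
        self.config = config
        self.regions = regions

    def process(self, events):
        out = {}
        for region in self.regions:
            mask = region.apply(events)  # illustrative: boolean region mask
            out[region.name] = {"nevents": int(ak.sum(mask))}
        return out

    def postprocess(self, accumulator):
        return accumulator
```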
Data/MC plotting with CMS styling:
- `PlotManager`: Manages plot creation and styling
- `create_all_plots()`: Generate all plots for all regions
- `_get_excluded_variables_for_region()`: Region-specific plot exclusions
- Supports multiple formats: PNG, PDF, ROOT, TXT
- CMS plotting style with `mplhep`
- Configurable exclusions via `configs/plotting.yaml`
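The CMS styling via `mplhep` typically amounts to a style call plus a label; a minimal sketch using the public mplhep API (the label text is an assumption, not the framework's exact choice):

```python
import matplotlib.pyplot as plt
import mplhep as hep

hep.style.use("CMS")  # apply the CMS plotting style

fig, ax = plt.subplots()
ax.hist([10, 20, 20, 30], bins=5)
hep.cms.label("Preliminary", ax=ax)  # illustrative label text
fig.savefig("example.png")
```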
Histogram definitions and filling:
- `HistogramManager`: Manages histogram creation and filling
- Histogram types: MET, jet kinematics, lepton kinematics, b-tagging, derived variables
- Support for both the hist library and a fallback implementation
- 40+ histogram definitions matching StackPlotter variables
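With the hist library, a definition-plus-fill cycle looks like this; a generic sketch, not the framework's actual axis choices:

```python
import hist
import numpy as np

# Define a MET histogram: 50 regular bins from 0 to 500 GeV
h_met = hist.Hist(hist.axis.Regular(50, 0, 500, name="met", label="MET [GeV]"))

# Fill with an array of per-event MET values
h_met.fill(met=np.array([75.0, 120.0, 310.0]))

print(h_met.sum())  # 3 entries
```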
Weight calculation and combination:
- `WeightCalculator`: Combines all weights using Coffea's Weights class
- `add_generator_weight()`: Generator weight handling
- `add_corrections()`: Correction weight application
- `get_weight()`: Final weight calculation with systematic variations
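Coffea's `Weights` class, which `WeightCalculator` builds on, accumulates named weights with optional up/down variations; a short sketch of its public API:

```python
import numpy as np
from coffea.analysis_tools import Weights

n_events = 4
weights = Weights(n_events)

# Nominal weight with up/down systematic variations
nominal = np.ones(n_events)
weights.add("pileup", nominal, weightUp=nominal * 1.05, weightDown=nominal * 0.95)

w_nom = weights.weight()           # product of all nominal weights
w_up = weights.weight("pileupUp")  # with the pileup up-variation applied
```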
```python
# Skimmed events with selected objects
skimmed_events = {
    "event": events.event,
    "run": events.run,
    "luminosityBlock": events.luminosityBlock,
    "MET": {"pt": events.MET.pt, "phi": events.MET.phi},
    "weights": event_weights,
    "muons": selected_muons,
    "electrons": selected_electrons,
    "jets": selected_jets,
    "bjets": selected_bjets,
}
```

Histograms are saved as ROOT histograms with metadata.
Complete analysis results including histograms, cutflow, and metadata.
Analysis results are saved as pickle files with the following structure:

```
outputs/hists/
├── regions_data.pkl
├── regions_dy.pkl
├── regions_signal.pkl
└── ...
```
Each pickle file contains:
- `region_histograms`: Dictionary of histograms per region
- `regions`: Region processing results
- `region_cutflow`: Cutflow statistics per region
- `region_validation`: Region validation results
- `metadata`: Analysis metadata
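A quick way to inspect one of these files from Python; a sketch assuming the keys listed above:

```python
import pickle

with open("outputs/hists/regions_data.pkl", "rb") as f:
    results = pickle.load(f)

print(sorted(results.keys()))
# e.g. ['metadata', 'region_cutflow', 'region_histograms', 'region_validation', 'regions']
for region, hists in results["region_histograms"].items():
    print(region, len(hists), "histograms")
```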
Plots are organized in a versioned directory structure:

```
outputs/plots/{version}/
├── png/
│   ├── 1b/
│   │   ├── SR/
│   │   │   ├── 1b_SR_met.png
│   │   │   ├── 1b_SR_met_log.png
│   │   │   └── ...
│   │   ├── Wlnu_mu/
│   │   │   ├── 1b_Wlnu_mu_lep1_pt.png
│   │   │   └── ...
│   │   ├── Wlnu_el/
│   │   ├── Zll_mu/
│   │   └── Zll_el/
│   └── 2b/
│       ├── SR/
│       ├── Top_mu/
│       ├── Top_el/
│       ├── Zll_mu/
│       └── Zll_el/
├── pdf/ (same structure as png/)
├── text/ (same structure as png/)
├── root/
│   ├── met.root (one file per variable)
│   └── ...
└── region_summary.{png,pdf}
```
File Naming Convention:
- Format: `{category}_{region_dir}_{variable_name}.{format}`
- Examples: `1b_SR_met.png`, `2b_Top_mu_lep1_pt.png`, `1b_Zll_mu_z_mass.png`
Plot Exclusions:
- 1b SR: Excludes jet3 plots and all lepton plots
- 2b SR: Excludes lepton plots (includes jet3)
- Top/W CRs: Exclude `z_mass` and `z_pt` plots
- Z CRs: Include `z_mass` and `z_pt` plots

See configs/plotting.yaml for configurable exclusions.
Use the validation notebooks to test and verify the framework:

```bash
jupyter notebook notebooks/
```

Available Validation Notebooks:
- `01_plot_exclusions_validation.ipynb` - Test plot exclusions
- `02_region_definitions_validation.ipynb` - Validate region definitions
- `03_histogram_structure_validation.ipynb` - Check histogram structure
- `04_plot_output_structure_validation.ipynb` - Verify plot directory structure
- `05_configuration_validation.ipynb` - Validate configuration files
- `06_data_mc_comparison_validation.ipynb` - Compare data/MC yields
See notebooks/README.md for detailed documentation.
To add a new year:
- Create a new YAML configuration file in `configs/`
- Update luminosity values and trigger paths
- Adjust object selection cuts if needed
To add a new correction:
- Add the correction file path to the configuration
- Implement the correction method in `CorrectionManager`
- Add the correction to the weight calculation
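Since corrections go through correctionlib, the evaluation step looks like this; a sketch using the public correctionlib API (the correction name and argument list are illustrative and depend on the file's schema):

```python
import correctionlib

# Load a gzipped JSON correction set, as referenced in the year config
cset = correctionlib.CorrectionSet.from_file("data/corrections/pileup_2023.json.gz")

# Evaluate one correction by name
corr = cset["Collisions2023_goldenJSON"]  # illustrative key
sf = corr.evaluate(35.0, "nominal")       # e.g. (nTrueInt, systematic)
```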
To add a new histogram:
- Define the histogram in `HistogramManager.define_histograms()`
- Add filling logic in `fill_histograms()`
- Update the validation notebooks if needed
- Core: coffea, awkward, uproot, correctionlib
- Execution: dask, distributed
- Output: pyarrow, pandas
- Visualization: matplotlib, jupyter, mplhep
- Histogramming: hist
- ROOT: pyroot (optional, for ROOT file output)
Install all dependencies:

```bash
pip install -r requirements.txt
```

For ROOT support (optional):

```bash
# Install ROOT via conda or the system package manager
conda install -c conda-forge root
```
- Missing correction files: Ensure correction files are at the paths specified in the configuration. Warnings are acceptable if corrections are not needed for testing.
- Import errors: Check that all dependencies are installed correctly:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  ```

- Memory issues:
  - Use `--max-events` to limit events for testing: `darkbottomline analyze ... --max-events 10000`
  - For futures/dask executors, reduce `--chunk-size` to process smaller chunks: `darkbottomline analyze ... --executor futures --chunk-size 25000`
- Executor issues: Try a different executor (iterative, futures, dask)
- ROOT not available: ROOT files won't be generated if the ROOT library is not installed. Other formats (PNG, PDF, TXT) will still be created.
- Old region format: If you see regions like `CR_Zll` instead of `1b:CR_Zll_mu`, the data file was created with an old version. Re-run the analysis with the updated `regions.yaml`.
- Plot exclusions not working: Check that `configs/plotting.yaml` is loaded correctly and that the exclusion patterns match the variable names.
Run with debug logging to see detailed information:

```bash
darkbottomline analyze ... --log-level DEBUG
```

Run the validation notebooks to check the framework setup:

```bash
jupyter notebook notebooks/01_plot_exclusions_validation.ipynb
```

Here's a complete example running the full analysis workflow:
```bash
#!/bin/bash
# Complete analysis workflow

# Setup
source venv/bin/activate
cd /path/to/DarkBottomLine

# Configuration
CONFIG="configs/2024.yaml"
REGIONS_CONFIG="configs/regions.yaml"
INPUT_DIR="/path/to/nanoaod"
OUTPUT_DIR="outputs/hists"

# 1. Run analysis on all samples
echo "Running analysis on data..."
darkbottomline analyze \
    --config $CONFIG \
    --regions-config $REGIONS_CONFIG \
    --input ${INPUT_DIR}/nano_data.root \
    --output ${OUTPUT_DIR}/regions_data.pkl

echo "Running analysis on DY..."
darkbottomline analyze \
    --config $CONFIG \
    --regions-config $REGIONS_CONFIG \
    --input ${INPUT_DIR}/nano_dy.root \
    --output ${OUTPUT_DIR}/regions_dy.pkl

echo "Running analysis on signal..."
darkbottomline analyze \
    --config $CONFIG \
    --regions-config $REGIONS_CONFIG \
    --input ${INPUT_DIR}/nano_signal.root \
    --output ${OUTPUT_DIR}/regions_signal.pkl

# 2. Generate plots
echo "Generating plots..."
darkbottomline make-plots \
    --input ${OUTPUT_DIR}/regions_data.pkl \
    --save-dir outputs \
    --show-data \
    --plot-config configs/plotting.yaml

echo "Analysis complete! Plots saved to outputs/plots/{version}/"
```
- Analysis Structure: See `docs/analysis_structure.md` for region naming conventions and structure flow
- Plotting Configuration: See `docs/plotting_configuration.md` for plot exclusion configuration
- Validation Notebooks: See `notebooks/README.md` for validation notebook documentation
- Developer Guide: See `DEVELOPER_GUIDE.md` for a comprehensive guide on where to make changes (plotting, variables, histograms, regions, etc.)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on top of the Coffea framework
- Uses correctionlib for scale factor corrections
- Inspired by CMS analysis workflows
- Plotting style follows CMS figure guidelines using `mplhep`