Issue of evaluating T5 model #17

Description

@zzhendong

When I tried to evaluate T5 with t5_4gpus.sh on the osdi24ae branch, this error occurred:

Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 51, in solve
    sched = self.sched_tessel(graph, nmicros, sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 114, in sched_tessel
    tsched = TSched.load(load_sched_file)
  File "/local/home/zzhendong2/projects/nnscaler/Tessel/tessel/schedule/schedplan.py", line 574, in load
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'mllm.4stages.sched.json'
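
One way to check whether the file exists anywhere in the checkout (so that the failure is not just a working-directory problem) is to search for it by name. A minimal sketch, assuming it is run from the repository root; `find_schedule_files` is a hypothetical helper, not part of nnscaler, cupilot, or Tessel:

```python
from pathlib import Path


def find_schedule_files(root, name):
    """Return every file under `root` whose file name equals `name`, sorted."""
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


if __name__ == "__main__":
    # Assumption: the current directory is the root of the checkout.
    matches = find_schedule_files(".", "mllm.4stages.sched.json")
    if matches:
        for path in matches:
            print(path)
    else:
        print("not found under the current directory")
```

If this prints a path, passing that path to `--order-plan` (or running the script from that directory) may be enough; if it prints nothing, the file is not in the checkout.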

It seems the file "mllm.4stages.sched.json" is missing. If I comment out the line "--order-plan mllm.4stages.sched.json" in t5_4gpus.sh, the run hits a different error:

Traceback (most recent call last):
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 221, in <module>
    train()
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/examples/t5/train.py", line 174, in train
    def train_iter(model, dataloader):
  File "/local/home/zzhendong2/projects/nnscaler/cube/compiler.py", line 204, in decorator
    graph = PAS(graph, resource)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/policy.py", line 303, in policy
    sched = OrderSolver().solve(graph, nmicros, config.order_plan)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 49, in solve
    sched = self.sched_1f1b(graph, nmicros)
  File "/local/home/zzhendong2/projects/nnscaler/cupilot/cupilot/solver/order.py", line 80, in sched_1f1b
    sched.add_segment(stage, mb_idx, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 177, in add_segment
    self.add_block(block, step)
  File "/local/home/zzhendong2/projects/nnscaler/cube/graph/schedule/schedplan.py", line 151, in add_block
    raise RuntimeError(
RuntimeError: inserting confict at device 1 of time step 2: cannot execute multiple blocks at a same time step
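
The message suggests that the schedule plan allows only one block per device per time step and rejects a second block placed in an occupied slot. A toy sketch of that rule follows. It is not the cube implementation: `ToySchedule`, its `add_block` signature, and the block names are all hypothetical, and the multi-device stage at the end is only one assumed way such a collision could arise, not a diagnosis.

```python
class ToySchedule:
    """Toy model: at most one block per (device, time step) slot."""

    def __init__(self):
        self.slots = {}  # maps (device, step) -> name of the block in that slot

    def add_block(self, name, devices, step):
        # Check every device first so a rejected block leaves no partial state.
        for device in devices:
            if (device, step) in self.slots:
                occupant = self.slots[(device, step)]
                raise RuntimeError(
                    f"conflict at device {device} of time step {step}: "
                    f"{occupant!r} already occupies this slot"
                )
        for device in devices:
            self.slots[(device, step)] = name


if __name__ == "__main__":
    sched = ToySchedule()
    sched.add_block("stage0 forward, mb 0", devices=[0], step=0)
    sched.add_block("stage1 forward, mb 0", devices=[1], step=1)
    try:
        # A block spanning devices 0 and 1 collides with the block on device 1.
        sched.add_block("stage0 forward, mb 1", devices=[0, 1], step=1)
    except RuntimeError as err:
        print(err)
```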

How can I solve this issue?
Thank you!
