TinyLLM: loss becomes NaN after 50,000+ training steps #31

@qingyanbaby

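I'm using the code exactly as cloned, with no modifications. The environment is Python 3.9 on a 2080 Ti. I've tried twice, and both times the log started reporting a loss of NaN once training passed the halfway point. How can I fix this?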

step 100000: train loss nan, val loss nan
100000 | loss nan | lr 0.000000e+00 | 2006.12ms | mfu 1.68%
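
To narrow down where a run like this first diverges, it may help to scan the saved log for the first step with a non-finite loss. The following is only a minimal sketch: it assumes every iteration line has the same shape as the one quoted above, and `first_nan_step` is a hypothetical helper name, not part of TinyLLM.

```python
import math
import re
from typing import Iterable, Optional

# Matches iteration lines shaped like the one quoted in this issue:
#   "100000 | loss nan | lr 0.000000e+00 | 2006.12ms | mfu 1.68%"
# Evaluation lines ("step 100000: train loss nan, ...") do not match.
ITER_LINE = re.compile(r"^\s*(\d+)\s*\|\s*loss\s+(\S+)\s*\|")


def first_nan_step(lines: Iterable[str]) -> Optional[int]:
    """Return the first step whose logged loss is NaN or infinite, else None."""
    for line in lines:
        match = ITER_LINE.match(line)
        if match is None:
            continue  # not an iteration line; skip it
        step = int(match.group(1))
        try:
            loss = float(match.group(2))  # float("nan") parses to NaN
        except ValueError:
            continue  # loss field is not a number; ignore this line
        if not math.isfinite(loss):
            return step
    return None
```

Assuming the training output was redirected to a file (the name `train.log` here is an assumption), `first_nan_step(open("train.log"))` would return the step at which the loss first went bad, which can then be compared against the learning-rate schedule and the checkpoints saved before that step.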
