Retriever does not train properly on a custom dataset with empty negative_ctxs and hard_negative_ctxs #243

Hi! I've been trying to train DPR with the in-batch negatives scheme and the default configs on a custom dataset in which `negative_ctxs` and `hard_negative_ctxs` are empty for every example. The network does not appear to train properly on such a dataset. In particular, the loss on every training step is 0:
![image](https://user-images.githubusercontent.com/24842658/217569496-cdc7d0da-45b1-49ad-85ac-e99b763d9801.png)

The issue does seem to be caused by the absence of negative examples in the dataset: when I add random positive paragraphs from other questions as negatives, the retriever seems to train properly:
![image](https://user-images.githubusercontent.com/24842658/217570896-7e285cbb-3b54-4daa-a9da-898f4799d3cf.png)
 
However, I don't want any fixed random paragraphs as negatives in my dataset. It seems that either the in-batch negatives scheme is not applied when `negative_ctxs` is empty, or it is not applied with the default settings at all. I was not able to find the cause in `_calc_loss` (`train_dense_encoder.py`).
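To make the question concrete, here is a minimal sketch of what I understand the in-batch negatives loss to be. It is plain Python, not the repo's `_calc_loss`, and `in_batch_nll` is a hypothetical helper written only for illustration. It shows that if the pool of candidate contexts for a question ever contains nothing but its own positive (for example, one question per batch and no negatives), the softmax runs over a single score and the loss is exactly 0. Whether that is actually what happens with the default config is an assumption I have not confirmed.

```python
import math


def in_batch_nll(q_vecs, ctx_vecs, positive_idx):
    """Mean negative log-likelihood of each question's positive context.

    Every context in the batch is a candidate for every question, so the
    positives of the other questions act as negatives.
    """
    total = 0.0
    for q, pos in zip(q_vecs, positive_idx):
        # Dot-product similarity between this question and all contexts.
        scores = [sum(a * b for a, b in zip(q, c)) for c in ctx_vecs]
        # Numerically stable log-sum-exp over the candidate pool.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[pos]
    return total / len(q_vecs)


# One question, one positive, no negatives: a single candidate.
loss_single = in_batch_nll([[0.3, -1.2]], [[0.5, 0.8]], [0])

# Two questions: each question's positive is the other's negative.
loss_pair = in_batch_nll(
    [[0.3, -1.2], [1.0, 0.4]],
    [[0.5, 0.8], [-0.7, 0.2]],
    [0, 1],
)

print(loss_single, loss_pair)
```

In this sketch `loss_single` is 0 regardless of the embeddings, while `loss_pair` is strictly positive, which matches the two behaviours in the plots above.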

Is it possible to train the retriever on such datasets? Or do I need at least one `negative_ctxs` for each data point? Thank you!

P.S.
The dataset I am using looks like this (two examples):
```
[{'question': 'x y : ℝ,\nh : x ≤ y\n⊢ real.sqrt x ≤ real.sqrt y',
  'positive_ctxs': [{'title': 'real.sqrt',
    'text': 'def sqrt (x : ℝ) : ℝ :=\tnnreal.sqrt (real.to_nnreal x)'},
   {'title': 'nnreal.sqrt_le_sqrt_iff',
    'text': 'lemma sqrt_le_sqrt_iff : sqrt x ≤ sqrt y ↔ x ≤ y'}],
  'negative_ctxs': [],
  'hard_negative_ctxs': []},
 {'question': 'X : Compactum,\nA B : set ↥X\n⊢ basic (A ∩ B) = basic A ∩ basic B',
  'positive_ctxs': [{'title': '<None>', 'text': '<None>'}],
  'negative_ctxs': [],
  'hard_negative_ctxs': []}]
```
It is designed for retrieving relevant lemmas for automated theorem proving.

The only thing I changed in the repo is the `encoder_train_default.yaml` config where I added my custom dataset:
![image](https://user-images.githubusercontent.com/24842658/217574229-e83999e2-6c11-49c6-9818-e6ab2e6b5044.png)


