File Upload API: problem with mime type detection #8344

@landreev

Description

[this issue is still work in progress; I may need to investigate some more/add more info; but going to create an issue so that I don't forget, again]

Short version:

When files are uploaded via /api/datasets/{id}/add, it appears that the mime type identification step is skipped if the file stream is passed to the API in a certain way, and the file then always ends up classified as text/plain.
This is not a fatal problem when using the API on the command line via curl (it works properly when used exactly as specified in our guide). But it becomes a problem when trying to use the API from some software clients. Specifically, it appears to be impossible to upload a file via pyDataverse as anything but text/plain.

Excruciating details:

1. Uploading an image file following the example in the API guide:

curl -H X-Dataverse-key:XXX -X POST -F "file=@test.jpg" "http://localhost:8080/api/datasets/NNN/add"

this works, the file is uploaded and identified as image/jpeg.

2. But try to pipe the same input to the API instead:

cat test.jpg | curl -H X-Dataverse-key:XXX -X POST -F "file=@-" -F 'jsonData={"label":"test_stream.jpg"}' http://localhost:8080/api/datasets/NNN/add

the file still uploads, saved as "test_stream.jpg", but identified as "text/plain".

Note that in the first example the mime type is not necessarily derived from the filename extension. You can rename a jpeg as test.xxx, and it will still be typed properly. Meaning, our detection code reads the file and identifies it as a jpeg; but for whatever reason this isn't done when the same file is piped in. I couldn't immediately tell why from looking at the API code.

It appears that when the API is called from pyDataverse (via api.upload_datafile()), the POST request is also formatted (by the Python requests library) in a way that makes our code skip the type detection.
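The request shape that requests produces here can be inspected without actually sending it (the URL and API key below are placeholders). When the file is passed as a plain (filename, stream) pair, the resulting file part carries no per-part Content-Type header at all, matching the stdin curl trace shown under "More info" below:

```python
# Sketch: reproduce the multipart body that requests builds for a file
# passed WITHOUT an explicit content type (placeholder URL and token).
import io
import requests

req = requests.Request(
    "POST",
    "http://localhost:8080/api/datasets/NNN/add",   # placeholder
    headers={"X-Dataverse-key": "XXX"},             # placeholder
    files={"file": ("test_stream.jpg", io.BytesIO(b"\xff\xd8\xff\xe0"))},
    data={"jsonData": '{"label":"test_stream.jpg"}'},
)
prepared = req.prepare()

# The multipart envelope itself is typed...
assert prepared.headers["Content-Type"].startswith("multipart/form-data")
# ...but no individual part, including the file part, has a Content-Type line.
assert b"Content-Type" not in prepared.body
```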

More info/potential explanation:

OK, looking at the POST requests formatted by curl (via curl ... --trace-ascii /dev/stdout), it looks like the difference is straightforward enough:

case 1.:

0000: --------------------------fe5bea7b618b9c79
002c: Content-Disposition: form-data; name="file"; filename="test.xxx"
006e: Content-Type: application/octet-stream
0096: 
0098: ...

vs. case 2.:

0000: --------------------------c56c65bb0215ed20
002c: Content-Disposition: form-data; name="file"; filename="-"
0067: 
0069: ...

i.e., when standard input is used, curl encodes the multipart form without any Content-Type: header on the file part, which somehow causes the mime type to default to text/plain; we then accept that as a good enough type (?) and either skip the type check or disregard its result. With a real filename supplied, the Content-Type: is set, at least to application/octet-stream, which we recognize on the application side as a polite way of saying "type unknown", so we replace it if the file can be typed as something more specific. (Note that curl does let you set both explicitly on a stdin upload, e.g. -F "file=@-;filename=test.jpg;type=application/octet-stream", which works around the problem on the client side.)

The same thing must be happening in pyDataverse: no Content-Type: in the multipart file entry. While it's not possible to explicitly specify the mime type in pyDataverse/upload_datafile(), it does appear to be possible to do so with the standard requests library that pyDataverse uses. So it should be possible to make a PR into https://github.com/gdcc/pyDataverse that would fix this on their end (?).
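A sketch of what such a fix could look like on the requests side (again inspected offline; endpoint and key are placeholders): passing the file as a 3-tuple lets the caller set the per-part Content-Type explicitly. Sending application/octet-stream should be enough, since, per the above, the application treats that as "type unknown" and runs its own detection:

```python
# Sketch: the 3-tuple form (filename, fileobj, content_type) adds an
# explicit Content-Type header to the file part of the multipart body.
import io
import requests

jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # JPEG-like payload for illustration

req = requests.Request(
    "POST",
    "http://localhost:8080/api/datasets/NNN/add",   # placeholder
    headers={"X-Dataverse-key": "XXX"},             # placeholder
    files={"file": ("test.jpg", io.BytesIO(jpeg_bytes), "application/octet-stream")},
)
prepared = req.prepare()

# The file part now carries an explicit Content-Type line.
assert b"Content-Type: application/octet-stream" in prepared.body
```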

We may still want to change something in our (Dataverse) code and see if we can easily prevent it from defaulting to text/plain when the type is not supplied explicitly in the multipart POST. (The defaulting may be happening outside of our code, but we can still make our code smarter about picking the best/most specific type possible.)
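As a sketch of what "smarter" could mean (all names here are hypothetical, and a real implementation would use a full detection library such as Apache Tika rather than this toy magic-number table): treat a missing or generic supplied type as untrusted, and prefer whatever content-based detection finds:

```python
# Hypothetical fallback logic: a supplied type that is missing or generic
# (including the text/plain default) should not short-circuit detection.
from typing import Optional

GENERIC_TYPES = {None, "", "text/plain", "application/octet-stream"}

# A few well-known magic numbers, for illustration only.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]

def sniff_type(head: bytes) -> Optional[str]:
    """Return a mime type based on the file's leading bytes, or None."""
    for magic, mime in MAGIC_NUMBERS:
        if head.startswith(magic):
            return mime
    return None

def effective_type(supplied: Optional[str], head: bytes) -> str:
    """Trust a specific supplied type; otherwise fall back to content
    sniffing, and only then to text/plain."""
    if supplied not in GENERIC_TYPES:
        return supplied
    return sniff_type(head) or "text/plain"

assert effective_type(None, b"\xff\xd8\xff\xe0...") == "image/jpeg"
assert effective_type("text/plain", b"\x89PNG\r\n\x1a\n...") == "image/png"
assert effective_type("image/tiff", b"anything") == "image/tiff"
```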
