File Upload API: problem with mime type detection #8344

@landreev

Description

[this issue is still work in progress; I may need to investigate some more/add more info; but going to create an issue so that I don't forget, again]

Short version:

When files are uploaded via /api/datasets/{id}/add, it appears that the mime type identification step is skipped if the file stream is passed to the API in a certain way, and the file then always ends up classified as text/plain.
This is not a fatal problem when using the API on the command line via curl (it works properly when used exactly as specified in our guide). But it becomes a problem when trying to use the API from some software clients. Specifically, it appears to be impossible to upload a file via pyDataverse as anything but text/plain.

Excruciating details:

1. Uploading an image file following the example in the API guide:

curl -H X-Dataverse-key:XXX -X POST -F "file=@test.jpg" "http://localhost:8080/api/datasets/NNN/add"

this works, the file is uploaded and identified as image/jpeg.

2. But try to pipe the same input to the API instead:

cat test.jpg | curl -H X-Dataverse-key:XXX -X POST -F "file=@-" -F 'jsonData={"label":"test_stream.jpg"}' http://localhost:8080/api/datasets/NNN/add

the file still uploads, saved as "test_stream.jpg", but identified as "text/plain".

Note that in the first example the mime type is not necessarily derived from the filename extension. You can rename a jpeg as test.xxx, and it will still be typed properly. Meaning, our detection code reads the file and identifies it as a jpeg; but for whatever reason this isn't done when the same file is piped in. I couldn't immediately tell why from looking at the API code.

It appears that when the API is called from pyDataverse (via api.upload_datafile()), the POST request is also formatted (by the Python requests library) in a way that makes our code skip the type detection.
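The request shape that requests produces here can be inspected without actually sending it (the URL and API key below are placeholders). When the file is passed as a plain (filename, stream) pair, the resulting file part carries no per-part Content-Type header at all, matching the stdin curl trace shown under "More info" below:

```python
# Sketch: reproduce the multipart body that requests builds for a file
# passed WITHOUT an explicit content type (placeholder URL and token).
import io
import requests

req = requests.Request(
    "POST",
    "http://localhost:8080/api/datasets/NNN/add",   # placeholder
    headers={"X-Dataverse-key": "XXX"},             # placeholder
    files={"file": ("test_stream.jpg", io.BytesIO(b"\xff\xd8\xff\xe0"))},
    data={"jsonData": '{"label":"test_stream.jpg"}'},
)
prepared = req.prepare()

# The multipart envelope itself is typed...
assert prepared.headers["Content-Type"].startswith("multipart/form-data")
# ...but no individual part, including the file part, has a Content-Type line.
assert b"Content-Type" not in prepared.body
```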

More info/potential explanation:

OK, looking at the POST requests formatted by curl (via curl ... --trace-ascii /dev/stdout), it looks like the difference is straightforward enough:

case 1.:

0000: --------------------------fe5bea7b618b9c79
002c: Content-Disposition: form-data; name="file"; filename="test.xxx"
006e: Content-Type: application/octet-stream
0096: 
0098: ...

vs. case 2.:

0000: --------------------------c56c65bb0215ed20
002c: Content-Disposition: form-data; name="file"; filename="-"
0067: 
0069: ...

i.e., when standard input is used, curl encodes the multipart form without any Content-Type: header on the file part, which somehow causes the mime type to default to text/plain; we then accept that as a good enough type (?) and either skip the type check or disregard its result. With a real filename supplied, the Content-Type: is set, at least to application/octet-stream, which we recognize on the application side as a polite way of saying "type unknown", so we replace it if the file can be typed as something more specific. (Note that curl does let you set both explicitly on a stdin upload, e.g. -F "file=@-;filename=test.jpg;type=application/octet-stream", which works around the problem on the client side.)

The same thing must be happening in pyDataverse: no Content-Type: in the multipart file entry. While it's not possible to explicitly specify the mime type in pyDataverse/upload_datafile(), it does appear to be possible to do so with the standard requests library that pyDataverse uses. So it should be possible to make a PR into https://github.com/gdcc/pyDataverse that would fix this on their end (?).
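A sketch of what such a fix could look like on the requests side (again inspected offline; endpoint and key are placeholders): passing the file as a 3-tuple lets the caller set the per-part Content-Type explicitly. Sending application/octet-stream should be enough, since, per the above, the application treats that as "type unknown" and runs its own detection:

```python
# Sketch: the 3-tuple form (filename, fileobj, content_type) adds an
# explicit Content-Type header to the file part of the multipart body.
import io
import requests

jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # JPEG-like payload for illustration

req = requests.Request(
    "POST",
    "http://localhost:8080/api/datasets/NNN/add",   # placeholder
    headers={"X-Dataverse-key": "XXX"},             # placeholder
    files={"file": ("test.jpg", io.BytesIO(jpeg_bytes), "application/octet-stream")},
)
prepared = req.prepare()

# The file part now carries an explicit Content-Type line.
assert b"Content-Type: application/octet-stream" in prepared.body
```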

We may still want to change something in our (Dataverse) code and see if we can easily prevent it from defaulting to text/plain when the type is not supplied explicitly in the multipart POST. (The defaulting may be happening outside of our code, but we can still make our code smarter about picking the best/most specific type possible.)
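As a sketch of what "smarter" could mean (all names here are hypothetical, and a real implementation would use a full detection library such as Apache Tika rather than this toy magic-number table): treat a missing or generic supplied type as untrusted, and prefer whatever content-based detection finds:

```python
# Hypothetical fallback logic: a supplied type that is missing or generic
# (including the text/plain default) should not short-circuit detection.
from typing import Optional

GENERIC_TYPES = {None, "", "text/plain", "application/octet-stream"}

# A few well-known magic numbers, for illustration only.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]

def sniff_type(head: bytes) -> Optional[str]:
    """Return a mime type based on the file's leading bytes, or None."""
    for magic, mime in MAGIC_NUMBERS:
        if head.startswith(magic):
            return mime
    return None

def effective_type(supplied: Optional[str], head: bytes) -> str:
    """Trust a specific supplied type; otherwise fall back to content
    sniffing, and only then to text/plain."""
    if supplied not in GENERIC_TYPES:
        return supplied
    return sniff_type(head) or "text/plain"

assert effective_type(None, b"\xff\xd8\xff\xe0...") == "image/jpeg"
assert effective_type("text/plain", b"\x89PNG\r\n\x1a\n...") == "image/png"
assert effective_type("image/tiff", b"anything") == "image/tiff"
```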
