@chaoran-chen
Member

@chaoran-chen chaoran-chen commented Dec 30, 2025

resolves #3999

This PR adds backend validation that rejects filenames which contain characters not allowed on some of the major operating systems, contain ASCII control characters, or are longer than 255 characters. For the reason mentioned in #3999 (comment), it does not take the restrictions of Content-Disposition's filename= into account.

Screenshot

image

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • The upload of a file with a * in the name.

🚀 Preview: https://filename-restriction.loculus.org

@chaoran-chen chaoran-chen added the preview Triggers a deployment to argocd label Dec 30, 2025
@chaoran-chen chaoran-chen marked this pull request as ready for review December 31, 2025 00:02
@claude
Contributor

claude bot commented Dec 31, 2025

Code Review: feat(backend): restrict allowed filenames

Summary

This PR adds robust filename validation to prevent issues with cross-platform compatibility. The implementation is well-structured with comprehensive test coverage. Overall, this is a solid implementation with just a few minor suggestions for improvement.

✅ Strengths

  1. Excellent Test Coverage: The test suite is very comprehensive with 17 test cases covering:

    • Valid filenames (ASCII, Unicode, hidden files)
    • All forbidden characters individually
    • Edge cases (empty, too long, null/empty mappings)
    • Multiple files with mixed validity
  2. Clear Documentation: The validateFilename() function has excellent documentation explaining the rationale and providing references to authoritative sources.

  3. Proper Error Messages: Error messages are clear, specific, and include the problematic filename and category.

  4. Consistent Integration: The validation is properly integrated at all three entry points (submit, revise, edit) following the existing fluent validation pattern.

  5. Good User Documentation: The documentation in submit-extra-files.md clearly explains the restrictions to users.

🔍 Issues & Suggestions

1. Potential Character Encoding Issue (Minor)

Location: FileMappingPreconditionValidator.kt:143

The validation checks filename.length > 255, but the documentation mentions "ext4 and NTFS only allow 255 bytes". Kotlin's String.length returns the number of UTF-16 code units, not bytes. In UTF-8 a single code point takes 1-4 bytes, so a 200-character filename made of multibyte Unicode characters can exceed 255 bytes while still passing the length check.

Suggestion: Consider whether you want to validate byte length instead:

if (filename.toByteArray(Charsets.UTF_8).size > 255) {
    throw UnprocessableEntityException(
        "Invalid filename '$filename' in category '$category': Filenames may not exceed 255 bytes when encoded as UTF-8",
    )
}

However, this might be overly restrictive for the use case. If the files are primarily stored in S3 and downloaded via HTTP (which your PR description suggests), character length may be sufficient. This deserves consideration/discussion.
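To make the difference concrete, here is a minimal sketch that reports the three possible "lengths" of a filename. The names FilenameLengths and measureFilename are hypothetical and not part of the PR; only standard Kotlin/JVM calls are used.

```kotlin
// Hypothetical helper, not part of the PR: the three ways to measure a filename.
data class FilenameLengths(val utf16Units: Int, val codePoints: Int, val utf8Bytes: Int)

fun measureFilename(filename: String): FilenameLengths = FilenameLengths(
    // What String.length returns: UTF-16 code units.
    utf16Units = filename.length,
    // Unicode code points (a surrogate pair counts once).
    codePoints = filename.codePointCount(0, filename.length),
    // Size of the name when encoded as UTF-8.
    utf8Bytes = filename.toByteArray(Charsets.UTF_8).size,
)
```

For example, measureFilename("\u00E4".repeat(200)) reports 200 UTF-16 units but 400 UTF-8 bytes, so such a name would pass a length > 255 check while being well over 255 bytes.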

2. Special Windows Reserved Names Not Checked (Low Priority)

Location: FileMappingPreconditionValidator.kt:137

Windows has reserved filenames (CON, PRN, AUX, NUL, COM1-9, LPT1-9) that cannot be used regardless of extension. These would cause issues on Windows systems.

Example problematic filenames: CON.txt, NUL.fasta, AUX, COM1.csv

Reference: Your own documentation link mentions this: https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file

Suggestion: Add validation like:

private val WINDOWS_RESERVED_NAMES = setOf(
    "CON", "PRN", "AUX", "NUL",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9"
)

// In validateFilename():
// Cut at the first dot: Windows treats NUL.tar.gz the same as NUL, not just NUL.txt.
val nameWithoutExtension = filename.substringBefore('.').uppercase()
if (nameWithoutExtension in WINDOWS_RESERVED_NAMES) {
    throw UnprocessableEntityException(
        "Invalid filename '$filename' in category '$category': This is a reserved name on Windows systems.",
    )
}

That said, this may be lower priority if files are typically accessed via web download rather than Windows filesystems.

3. Trailing Dots and Spaces (Low Priority)

Location: FileMappingPreconditionValidator.kt:137

Windows automatically strips trailing dots and spaces from filenames, which can cause confusion. For example, file.txt. becomes file.txt on Windows.

Suggestion:

if (filename.endsWith('.') || filename.endsWith(' ')) {
    throw UnprocessableEntityException(
        "Invalid filename '$filename' in category '$category': Filenames may not end with dots or spaces.",
    )
}

4. Test Assertions Could Be More Specific

Location: ValidateFileNameTest.kt (multiple tests)

The tests use assertThrows<UnprocessableEntityException> but don't verify the error message content. This makes tests less valuable for catching regressions in error messages.

Suggestion:

@Test
fun `filename with asterisk should fail validation`() {
    val fileMapping = createFileMapping("sequences", listOf("file*.txt"))
    val exception = assertThrows<UnprocessableEntityException> {
        validator.validateFilenameCharacters(fileMapping)
    }
    assertTrue(exception.message!!.contains("forbidden characters"))
}

5. Edge Case: Filename with Only Dots (Minor)

Location: FileMappingPreconditionValidator.kt:137

Filenames like ., .., or ... might pass validation but could cause issues. .. is particularly dangerous (directory traversal), but a name consisting of a single dot or of several dots is problematic as well.

Suggestion: Add validation:

if (filename.all { it == '.' }) {
    throw UnprocessableEntityException(
        "Invalid filename '$filename' in category '$category': Filenames may not consist only of dots.",
    )
}

📊 Performance Considerations

No concerns. The validation is O(n) where n is the filename length, which is acceptable given the 255-character limit.
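As a rough illustration of why the cost is linear, a single-pass check could look like the sketch below. This is not the PR's implementation: the function name findFilenameProblem, the constants, and the exact forbidden set (here the characters Windows rejects, per the Microsoft naming page linked above) are assumptions.

```kotlin
// Assumption: the characters Windows rejects in filenames; the PR's actual set may differ.
const val FORBIDDEN_CHARS = "<>:\"/\\|?*"
const val MAX_LENGTH = 255

/** Returns a description of the first problem found, or null if the filename passes. */
fun findFilenameProblem(filename: String): String? {
    if (filename.isEmpty()) return "empty filename"
    if (filename.length > MAX_LENGTH) return "longer than $MAX_LENGTH characters"
    // One pass over the name: each character is inspected exactly once.
    for (ch in filename) {
        if (ch in FORBIDDEN_CHARS) return "forbidden character '$ch'"
        if (ch.code < 0x20 || ch.code == 0x7F) return "ASCII control character"
    }
    return null
}
```

Because the length check runs before the loop, the loop never inspects more than 255 characters, whatever the input size.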

🔒 Security Considerations

Good: The PR properly prevents path traversal characters (/, \), which is critical for security.

Consider: If these filenames are ever used in shell commands, URL paths, or Content-Disposition headers, additional validation might be needed, but that should be handled at those specific points, not here.
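For the Content-Disposition case, handling it at the point of use could mean percent-encoding the name into the filename*= form defined by RFC 5987 when the header is built. The sketch below assumes that approach; percentEncodeUtf8 and contentDispositionAttachment are hypothetical names, and nothing here reflects how the backend actually builds its headers.

```kotlin
// Characters left unescaped. This is the RFC 3986 "unreserved" set,
// a conservative subset of what RFC 5987 permits in an ext-value.
const val UNRESERVED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"

// Percent-encodes every UTF-8 byte that is not in UNRESERVED.
fun percentEncodeUtf8(value: String): String = buildString {
    for (byte in value.toByteArray(Charsets.UTF_8)) {
        val unsigned = byte.toInt() and 0xFF
        val ch = unsigned.toChar()
        if (unsigned < 0x80 && ch in UNRESERVED) {
            append(ch)
        } else {
            append('%').append(unsigned.toString(16).uppercase().padStart(2, '0'))
        }
    }
}

// Hypothetical helper: builds the header value using the extended filename* parameter.
fun contentDispositionAttachment(filename: String): String =
    "attachment; filename*=UTF-8''${percentEncodeUtf8(filename)}"
```

With this, the stored filename can stay permissive (Unicode, spaces) while the header value contains only ASCII characters that are safe in that context.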

📝 Documentation & Conventions

✅ Follows conventional commit format: feat(backend):
✅ Detailed PR description with screenshot
✅ Documentation updated for users
✅ References the issue #3999

🧪 Testing Recommendations

Consider adding integration tests that actually attempt to:

  1. Submit a file with a forbidden character via the API
  2. Verify the error response format and HTTP status code
  3. Test that valid filenames successfully pass through the entire submission flow

The unit tests are excellent, but integration tests would provide additional confidence.

🎯 Verdict

Recommendation: Approve with optional follow-ups

The core functionality is solid and ready to merge. The suggestions above are mostly edge cases that could be addressed in follow-up PRs if deemed necessary. The most important suggestion to consider is #1 (byte length vs character length) as it directly relates to the documented restriction.

Great work on the comprehensive test coverage! 🎉


Development

Successfully merging this pull request may close these issues.

Restrict allowed characters in file name (backend)

2 participants