Purview/RMS Document Protection & Sensitivity Label Integration #797

@paullizer

TL;DR

Supporting Purview/RMS-encrypted and sensitivity-labeled documents in SimpleChat involves three capabilities with different technical approaches and complexity levels:

  1. Reading sensitivity labels from documents sourced from OneDrive/SharePoint (via Microsoft Graph API): most feasible, moderate effort
  2. Decrypting RMS/MIP-protected documents server-side so they can be processed through the existing pipeline: feasible but complex, significant prerequisites
  3. Storing and displaying protection metadata as document tags/metadata in the existing Cosmos DB + AI Search schema: straightforward once capability 1 or 2 provides the data

Key Findings from Research

Current state:

  • No handling of encrypted/protected documents exists. If a password-protected or RMS-protected file is uploaded, Azure Document Intelligence will fail and the processing errors out with no special handling.
  • No OneDrive/SharePoint file browsing or import integration exists.
  • All MS Graph API calls use delegated (user session) auth only — no app-only client_credentials flow.
  • The app already requests User.Read, User.ReadBasic.All, People.Read.All, Group.Read.All scopes via MSAL.
  • Documents have an existing tag system (tags array in Cosmos DB, propagated to search index chunks) and metadata fields (document_classification, title, authors, etc.).

SDK/API landscape:

| Approach | SDK/API | Python Support | Notes |
|---|---|---|---|
| MIP SDK (File SDK) | C++, .NET, Java (no Python) | No native Python SDK | Can decrypt, read labels, remove protection. Requires C++ wrapper or .NET subprocess. |
| MS Graph API: driveItem/extractSensitivityLabels | REST API | Yes (via msgraph-sdk-python or raw HTTP) | Reads sensitivity labels from files already in OneDrive/SharePoint. Cannot decrypt files. |
| MS Graph API: driveItem/assignSensitivityLabel | REST API | Yes | Assigns labels to files in OneDrive/SharePoint. Metered API (charges apply). |
| MS Graph API: driveItem/content (download) | REST API | Yes | Downloads file content. If the file is RMS-encrypted, the downloaded bytes are still encrypted. |
| Azure RMS Super User | PowerShell cmdlets | PowerShell only (subprocess) | Can bulk-decrypt files. Requires Enable-AipServiceSuperUserFeature + admin config. |
| Set-FileLabel PowerShell cmdlet | PowerShell | PowerShell only | Part of PurviewInformationProtection module; can apply/remove labels and encryption. |

Critical constraint:

The MIP SDK has no Python bindings. The three available language bindings are C++, .NET, and Java. For a Flask/Python backend, this means decryption requires either:

  • A .NET/Java microservice sidecar that the Python app calls
  • A subprocess shelling out to PowerShell or a .NET CLI tool
  • Using the MS Graph API approach (which only works for files already in OneDrive/SharePoint, not arbitrary uploads)

Steps

Phase 1: Detect Protected Documents on Upload (Low effort, high value)

  1. Add a protection detection step in process_document_upload_background() in functions_documents.py — before dispatching to the format-specific handler, inspect the uploaded file for RMS/MIP encryption signatures:

    • For Office files (DOCX/XLSX/PPTX): check for the EncryptedPackage stream in the OLE compound file, or check for LabelInfo/MSOEncryptionInfo XML parts in the OOXML package
    • For PDF files: check for /Encrypt dictionary entries and Microsoft IRM markers
    • Use the olefile Python library to detect compound files with encrypted streams
    • Use python-docx / zipfile to check unencrypted OOXML packages for MIP label metadata, such as MSIP_Label_* custom properties in docProps/custom.xml or a docMetadata/LabelInfo.xml part
  2. Add new fields to the Cosmos DB document schema: protection_status (enum: none, rms_encrypted, sensitivity_labeled, password_protected), sensitivity_labels (array of label objects with id, name, assignment_method), protection_source (e.g., purview, rms, onedrive)

  3. When a protected document is detected but cannot be decrypted, set status to a new value like "Protected - requires decryption" instead of erroring out, and store the protection metadata. Return this status to the UI so the user sees a clear explanation.

Phase 2: Read Sensitivity Labels via Graph API (Moderate effort)

  1. Add a new OneDrive/SharePoint file import route — rather than uploading a file from the user's machine, allow importing a file directly from OneDrive/SharePoint via Graph API:

    • GET /me/drive/root/children to browse OneDrive files
    • POST /drives/{drive-id}/items/{item-id}/extractSensitivityLabels to read labels
    • GET /drives/{drive-id}/items/{item-id}/content to download the file content
    • This requires adding Files.Read.All scope to the MSAL configuration in config.py and the app registration
  2. After extracting sensitivity labels via Graph, resolve label IDs to human-readable names using the Graph Information Protection API or a configured label-name mapping in admin settings.

  3. Store the extracted label metadata on the document record in Cosmos DB (using the new sensitivity_labels field from Phase 1, step 2) and propagate it to search index chunks (similar to how document_tags are propagated today).
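The label-reading steps above could be sketched as follows. This assumes the extractSensitivityLabels response carries a labels array whose entries have sensitivityLabelId and assignmentMethod fields; the helper names, the label_names mapping (standing in for the admin-configured label-name mapping), and the chunk field layout are all hypothetical. Sending the request and acquiring the token through MSAL are left out.

```python
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_extract_labels_request(drive_id: str, item_id: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Graph for a file's labels."""
    url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/extractSensitivityLabels"
    return urllib.request.Request(
        url,
        data=b"",  # the action takes no request body
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def parse_label_response(payload: dict, label_names: dict) -> list:
    """Map the Graph payload onto the proposed sensitivity_labels schema."""
    labels = []
    for entry in payload.get("labels", []):
        label_id = entry.get("sensitivityLabelId")
        labels.append({
            "id": label_id,
            # Fall back to the raw ID when no mapping is configured.
            "name": label_names.get(label_id, label_id),
            "assignment_method": entry.get("assignmentMethod"),
        })
    return labels


def apply_labels(document: dict, chunks: list, labels: list) -> None:
    """Store labels on the document record and copy the names to each chunk."""
    document["sensitivity_labels"] = labels
    if labels:
        document["protection_status"] = "sensitivity_labeled"
        document["protection_source"] = "onedrive"
    for chunk in chunks:
        chunk["sensitivity_labels"] = [label["name"] for label in labels]
```

Keeping request construction separate from response parsing lets the parsing and propagation logic be unit-tested without a live tenant.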

Phase 3: Decrypt RMS-Protected Documents (High effort, significant prerequisites)

  1. Option A — .NET MIP SDK Sidecar Microservice: Build a small ASP.NET Web API or Azure Function that:

    • Accepts an encrypted file via HTTP POST
    • Uses the MIP File SDK (Microsoft.InformationProtection.File NuGet package) to decrypt it
    • Returns the decrypted file content and extracted label metadata
    • Requires: Entra ID app registration with Azure Rights Management Services > user_impersonation + Microsoft Information Protection Sync Service > UnifiedPolicy.User.Read API permissions
    • Requires: The service account/app to be added as an RMS Super User via Add-AipServiceSuperUser PowerShell cmdlet + Enable-AipServiceSuperUserFeature
    • Requires: An Information Protection Integration Agreement (IPIA) with Microsoft if the app is released publicly
  2. Option B — PowerShell Subprocess: Use subprocess from Python to call the Set-FileLabel PowerShell cmdlet (from the PurviewInformationProtection module) on the server to strip protection before processing. Simpler for internal-only deployments but less scalable and requires PowerShell to be installed on the app server.

  3. Option C — Graph API "Download via SharePoint" (limited): For files sourced from OneDrive/SharePoint, SharePoint can sometimes serve unprotected content when the requesting user/app has appropriate permissions. This only works for files that use labels backed by Azure RMS (not DKE — double key encryption), and the app must have Sites.ReadWrite.All or equivalent permissions.

  4. Integrate the chosen decryption mechanism into the existing process_document_upload_background() flow: detect protection → call decryption service → receive plaintext file → continue with normal processing pipeline → store protection metadata.
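One way to picture the integration in step 4 is a small orchestration function that takes the detection, decryption, and processing stages as callables, so the same flow works whether decryption is backed by the sidecar (Option A), a subprocess (Option B), or nothing at all. Everything here is a sketch: run_pipeline and its parameters are hypothetical names, not existing SimpleChat functions, and the "Processing complete" status string is a placeholder.

```python
from typing import Callable, Optional

PROTECTED_STATUS = "Protected - requires decryption"


def run_pipeline(
    file_bytes: bytes,
    detect: Callable[[bytes], dict],
    process: Callable[[bytes], None],
    decrypt: Optional[Callable[[bytes], bytes]] = None,
) -> dict:
    """Detect protection, decrypt if possible, then hand off to normal processing."""
    info = detect(file_bytes)
    record = dict(info)  # protection metadata is stored regardless of outcome

    if info.get("protection_status") == "rms_encrypted":
        if decrypt is None:
            # No decryption mechanism configured: report clearly instead of failing.
            record["status"] = PROTECTED_STATUS
            return record
        try:
            file_bytes = decrypt(file_bytes)
        except Exception as exc:  # the decryption backend may fail for many reasons
            record["status"] = PROTECTED_STATUS
            record["error"] = str(exc)
            return record

    process(file_bytes)
    record["status"] = "Processing complete"  # placeholder status value
    return record
```

Because the decryptor is optional, Phase 1 behaviour (detect and report) falls out of the same code path when no decryption backend is configured.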

Phase 4: UI & Metadata Display (Low-moderate effort)

  1. Add UI indicators in the document list views (personal, group, public workspaces) showing protection status: a badge/icon for sensitivity_labeled, rms_encrypted, etc. Modify the templates and JavaScript in workspace-documents.js and similar files.

  2. Add sensitivity label names to the existing tag system or as a separate metadata section in the document detail view. Consider auto-tagging documents with their sensitivity label name (e.g., auto-applying a tag purview:confidential).

  3. Add admin settings for configuring the MIP integration: enable/disable, sidecar service URL (for Option A), label-to-tag mapping rules, and accepted protection levels.
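The auto-tagging idea in step 2 could look something like the sketch below. The slug rules (lowercase, non-alphanumeric runs collapsed to a hyphen) and the function names are assumptions for illustration; the actual tag format would need to match whatever validation the existing tag system applies.

```python
import re


def label_to_tag(label_name: str, prefix: str = "purview") -> str:
    """Turn a sensitivity label name into a namespaced tag, e.g. purview:confidential."""
    slug = re.sub(r"[^a-z0-9]+", "-", label_name.strip().lower()).strip("-")
    return f"{prefix}:{slug}"


def merge_label_tags(existing_tags: list, labels: list) -> list:
    """Append one tag per label without duplicating tags already on the document."""
    tags = list(existing_tags)
    for label in labels:
        tag = label_to_tag(label["name"])
        if tag not in tags:
            tags.append(tag)
    return tags
```

Prefixing the tags keeps label-derived tags distinguishable from user-created ones, which matters if they should be read-only in the UI.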

Phase 5: Configuration & Admin (Moderate effort)

  1. Add new admin configuration settings in functions_settings.py / settings UI:

    • enable_mip_integration (boolean)
    • mip_sidecar_url (string, URL of the .NET decryption microservice)
    • mip_client_id, mip_tenant_id (for the separate MIP app registration)
    • allowed_sensitivity_levels (list — which label classifications are allowed for upload)
    • auto_tag_sensitivity_labels (boolean — auto-create tags from labels)
  2. Register a new Entra ID app (or extend the existing one) with the additional API permissions: Files.Read.All, InformationProtection.Read.All (for label resolution), Azure Rights Management Services > Content.SuperUser (for the decryption service)
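As a rough sketch of how the settings in step 1 might be represented and enforced: the keys below come from the list above, while the default values, the HTTPS requirement on the sidecar URL, and the rule that an empty allowed_sensitivity_levels list means "no restriction" are assumptions. The helper names are hypothetical and do not reflect existing code in functions_settings.py.

```python
from urllib.parse import urlparse

MIP_DEFAULT_SETTINGS = {
    "enable_mip_integration": False,
    "mip_sidecar_url": "",
    "mip_client_id": "",
    "mip_tenant_id": "",
    "allowed_sensitivity_levels": [],
    "auto_tag_sensitivity_labels": False,
}


def load_mip_settings(stored: dict) -> dict:
    """Overlay stored values on the defaults, ignoring unknown keys."""
    known = {k: v for k, v in stored.items() if k in MIP_DEFAULT_SETTINGS}
    merged = {**MIP_DEFAULT_SETTINGS, **known}
    if merged["enable_mip_integration"] and merged["mip_sidecar_url"]:
        # Assumption: files in transit to the sidecar should only travel over HTTPS.
        if urlparse(merged["mip_sidecar_url"]).scheme != "https":
            raise ValueError("mip_sidecar_url must use https")
    return merged


def is_upload_allowed(settings: dict, labels: list) -> bool:
    """Check a document's labels against the configured allow-list."""
    if not settings["enable_mip_integration"]:
        return True
    allowed = settings["allowed_sensitivity_levels"]
    if not allowed:
        return True  # assumption: an empty list means no restriction
    return all(label["name"] in allowed for label in labels)
```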

Verification

  • Phase 1: Upload a known RMS-protected DOCX and verify the app detects it and shows "Protected - requires decryption" status instead of an error
  • Phase 2: Browse OneDrive from the app, select a labeled file, verify label metadata appears in the document record
  • Phase 3: Upload an RMS-encrypted file, verify it gets decrypted and processed with full text extraction, and the original labels are stored
  • Phase 4: Verify sensitivity badges appear in document lists and search results include label-based filtering

Decisions

  • Decision: Decryption approach — Option A (.NET MIP SDK sidecar) is recommended over Option B (PowerShell subprocess) for production reliability and scalability, but Option B is faster to prototype internally
  • Decision: No native Python MIP SDK exists — this is the core constraint; any decryption capability requires a non-Python component
  • Decision: Phase 1 (detection) and Phase 4 (UI) are independent of the decryption capability and can ship first to give users visibility into why some documents fail processing
  • Decision: Graph API approach (Phase 2) works only for OneDrive/SharePoint-sourced files, not arbitrary file uploads from disk — both paths should be supported
  • Decision: RMS Super User feature is a tenant-level admin decision with security implications (audit logging, restricted access) — this must be documented and configured by the customer's IT admin, not automated
