Description
TL;DR
Supporting Purview/RMS-encrypted and sensitivity-labeled documents in SimpleChat involves three capabilities with different technical approaches and complexity levels:
1. Reading sensitivity labels from documents sourced from OneDrive/SharePoint (via Microsoft Graph API): most feasible, moderate effort
2. Decrypting RMS/MIP-protected documents server-side so they can be processed through the existing pipeline: feasible but complex, significant prerequisites
3. Storing and displaying protection metadata as document tags/metadata in the existing Cosmos DB + AI Search schema: straightforward once capability 1 or 2 provides the data
Key Findings from Research
Current state:
- No handling of encrypted/protected documents exists. If a password-protected or RMS-protected file is uploaded, Azure Document Intelligence will fail and the processing errors out with no special handling.
- No OneDrive/SharePoint file browsing or import integration exists.
- All MS Graph API calls use delegated (user session) auth only; there is no app-only `client_credentials` flow.
- The app already requests `User.Read`, `User.ReadBasic.All`, `People.Read.All`, `Group.Read.All` scopes via MSAL.
- Documents have an existing tag system (`tags` array in Cosmos DB, propagated to search index chunks) and metadata fields (`document_classification`, `title`, `authors`, etc.).
SDK/API landscape:
| Approach | SDK/API | Python Support | Notes |
|---|---|---|---|
| MIP SDK (File SDK) | C++, .NET, Java (no Python) | No native Python SDK | Can decrypt, read labels, remove protection. Requires C++ wrapper or .NET subprocess. |
| MS Graph API: `driveItem/extractSensitivityLabels` | REST API | Yes (via `msgraph-sdk-python` or raw HTTP) | Reads sensitivity labels from files already in OneDrive/SharePoint. Cannot decrypt files. |
| MS Graph API: `driveItem/assignSensitivityLabel` | REST API | Yes | Assigns labels to files in OneDrive/SharePoint. Metered API (charges apply). |
| MS Graph API: `driveItem/content` (download) | REST API | Yes | Downloads file content. If the file is RMS-encrypted, the downloaded bytes are still encrypted. |
| Azure RMS Super User | PowerShell cmdlets | PowerShell only (subprocess) | Can bulk-decrypt files. Requires `Enable-AipServiceSuperUserFeature` + admin config. |
| `Set-FileLabel` PowerShell cmdlet | PowerShell | PowerShell only | Part of the `PurviewInformationProtection` module; can apply/remove labels and encryption. |
Critical constraint:
The MIP SDK has no Python bindings. The three available language bindings are C++, .NET, and Java. For a Flask/Python backend, this means decryption requires either:
- A .NET/Java microservice sidecar that the Python app calls
- A subprocess shelling out to PowerShell or a .NET CLI tool
- Using the MS Graph API approach (which only works for files already in OneDrive/SharePoint, not arbitrary uploads)
Steps
Phase 1: Detect Protected Documents on Upload (Low effort, high value)
1. Add a protection detection step in `process_document_upload_background()` in `functions_documents.py`. Before dispatching to the format-specific handler, inspect the uploaded file for RMS/MIP encryption signatures:
   - For Office files (DOCX/XLSX/PPTX): check for the `EncryptedPackage` stream in the OLE compound file, or check for `LabelInfo`/`MSOEncryptionInfo` XML parts in the OOXML package
   - For PDF files: check for `/Encrypt` dictionary entries and Microsoft IRM markers
   - Use the `olefile` Python library to detect compound files with encrypted streams
   - Use `python-docx`/`zipfile` to check for `[Content_Types].xml` containing `customXml` with MIP label metadata
2. Add new fields to the Cosmos DB document schema: `protection_status` (enum: `none`, `rms_encrypted`, `sensitivity_labeled`, `password_protected`), `sensitivity_labels` (array of label objects with `id`, `name`, `assignment_method`), `protection_source` (e.g., `purview`, `rms`, `onedrive`)
3. When a protected document is detected but cannot be decrypted, set `status` to a new value like `"Protected - requires decryption"` instead of erroring out, and store the protection metadata. Return this status to the UI so the user sees a clear explanation.
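
The detection step could be prototyped with the standard library alone before bringing in `olefile`. The sketch below is a heuristic, not a definitive implementation. The function names `detect_protection` and `needs_decryption` are hypothetical. The byte-level search for stream names, the `DRMEncryptedTransform` marker used to separate RMS from password encryption, and the `docMetadata/LabelInfo.xml` / `MSIP_Label_` checks are assumptions about how protected Office files are laid out and should be validated against real samples. Encrypted PDFs are reported as `password_protected` because Microsoft IRM markers are not inspected here.

```python
import io
import zipfile

# Signature that opens every OLE compound file (encrypted Office files use this container).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# OLE directory entries store stream names as UTF-16LE, so search for that encoding.
ENCRYPTED_PACKAGE = "EncryptedPackage".encode("utf-16-le")
# Assumption: RMS/IRM-protected files reference a DRM transform in their data spaces.
DRM_TRANSFORM = "DRMEncryptedTransform".encode("utf-16-le")


def detect_protection(data: bytes) -> str:
    """Return a protection_status value for the raw bytes of an uploaded file."""
    if data.startswith(OLE_MAGIC):
        if ENCRYPTED_PACKAGE in data:
            return "rms_encrypted" if DRM_TRANSFORM in data else "password_protected"
        return "none"  # e.g. a legacy, unencrypted .doc/.xls
    if data.startswith(b"%PDF"):
        # IRM markers are not checked; any /Encrypt entry is treated as password protection.
        return "password_protected" if b"/Encrypt" in data else "none"
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as pkg:
            names = pkg.namelist()
            # Assumption: label metadata lives in a LabelInfo part or in custom properties.
            if any(name.lower().endswith("labelinfo.xml") for name in names):
                return "sensitivity_labeled"
            if "docProps/custom.xml" in names and b"MSIP_Label_" in pkg.read("docProps/custom.xml"):
                return "sensitivity_labeled"
    return "none"


def needs_decryption(protection_status: str) -> bool:
    """True when the pipeline cannot read the file without a decryption step."""
    return protection_status in ("rms_encrypted", "password_protected")
```

A caller would run `detect_protection()` on the uploaded bytes and, when `needs_decryption()` is true and no decryption path is configured, store `"Protected - requires decryption"` as the document status.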
Phase 2: Read Sensitivity Labels via Graph API (Moderate effort)
4. Add a new OneDrive/SharePoint file import route. Rather than uploading a file from the user's machine, allow importing a file directly from OneDrive/SharePoint via Graph API:
   - `GET /me/drive/root/children` to browse OneDrive files
   - `POST /drives/{drive-id}/items/{item-id}/extractSensitivityLabels` to read labels
   - `GET /drives/{drive-id}/items/{item-id}/content` to download the file content
   - This requires adding the `Files.Read.All` scope to the MSAL configuration in `config.py` and the app registration
5. After extracting sensitivity labels via Graph, resolve label IDs to human-readable names using the Graph Information Protection API or a configured label-name mapping in admin settings.
6. Store the extracted label metadata on the document record in Cosmos DB (using the new `sensitivity_labels` field from Step 2) and propagate to search index chunks (similar to how `document_tags` are propagated today).
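
As a rough illustration of the label-reading step, the sketch below builds the `extractSensitivityLabels` request and maps a response onto the proposed `sensitivity_labels` objects. The helper names are hypothetical, and the response shape (a `labels` array whose entries carry `sensitivityLabelId` and `assignmentMethod`) is an assumption to confirm against the Graph documentation. The label-name mapping stands in for either the Information Protection API lookup or the admin-configured mapping.

```python
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def extract_labels_request(drive_id: str, item_id: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks Graph for a file's sensitivity labels."""
    url = f"{GRAPH_BASE}/drives/{drive_id}/items/{item_id}/extractSensitivityLabels"
    return urllib.request.Request(
        url,
        data=b"",  # the action takes no request body
        method="POST",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def to_label_objects(payload: dict, label_names: dict) -> list:
    """Convert an assumed Graph response into the proposed sensitivity_labels schema."""
    labels = []
    for entry in payload.get("labels", []):
        label_id = entry.get("sensitivityLabelId")
        labels.append({
            "id": label_id,
            # Fall back to the raw ID when no human-readable name is configured.
            "name": label_names.get(label_id, label_id),
            "assignment_method": entry.get("assignmentMethod"),
        })
    return labels
```

Sending the request with `urllib.request.urlopen()` and parsing the JSON body would yield the `payload` passed to `to_label_objects()`. In the app, the delegated token would come from the existing MSAL session once `Files.Read.All` is granted.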
Phase 3: Decrypt RMS-Protected Documents (High effort, significant prerequisites)
7. Option A (.NET MIP SDK Sidecar Microservice): Build a small ASP.NET Web API or Azure Function that:
   - Accepts an encrypted file via HTTP POST
   - Uses the MIP File SDK (`Microsoft.InformationProtection.File` NuGet package) to decrypt it
   - Returns the decrypted file content and extracted label metadata
   - Requires: Entra ID app registration with `Azure Rights Management Services` > `user_impersonation` + `Microsoft Information Protection Sync Service` > `UnifiedPolicy.User.Read` API permissions
   - Requires: The service account/app to be added as an RMS Super User via the `Add-AipServiceSuperUser` PowerShell cmdlet + `Enable-AipServiceSuperUserFeature`
   - Requires: An Information Protection Integration Agreement (IPIA) with Microsoft if the app is released publicly
8. Option B (PowerShell Subprocess): Use `subprocess` from Python to call the `Set-FileLabel` PowerShell cmdlet (from the `PurviewInformationProtection` module) on the server to strip protection before processing. Simpler for internal-only deployments but less scalable and requires PowerShell to be installed on the app server.
9. Option C (Graph API "Download via SharePoint", limited): For files sourced from OneDrive/SharePoint, SharePoint can sometimes serve unprotected content when the requesting user/app has appropriate permissions. This only works for files that use labels backed by Azure RMS (not DKE, double key encryption), and the app must have `Sites.ReadWrite.All` or equivalent permissions.
10. Integrate the chosen decryption mechanism into the existing `process_document_upload_background()` flow: detect protection → call decryption service → receive plaintext file → continue with normal processing pipeline → store protection metadata.
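
The integration flow can be sketched independently of which decryption option is chosen by passing the decryptor in as a callable. Everything here is illustrative: `process_upload`, its parameters, and the `"Processing"` / `"Complete"` status strings are hypothetical and do not reflect the current `process_document_upload_background()` signature. Only `"Protected - requires decryption"` comes from the proposal above.

```python
from typing import Callable, Optional

PROTECTED_STATUSES = ("rms_encrypted", "password_protected")


def process_upload(
    data: bytes,
    detect: Callable[[bytes], str],
    decrypt: Optional[Callable[[bytes], Optional[bytes]]],
    extract_text: Callable[[bytes], str],
) -> dict:
    """Detect protection, decrypt if possible, then hand off to normal processing."""
    protection_status = detect(data)
    record = {"protection_status": protection_status, "status": "Processing"}

    if protection_status in PROTECTED_STATUSES:
        # decrypt may be None (feature disabled) or return None (decryption refused).
        plaintext = decrypt(data) if decrypt is not None else None
        if plaintext is None:
            record["status"] = "Protected - requires decryption"
            return record
        data = plaintext

    # Normal pipeline continues on plaintext; protection metadata stays on the record.
    record["text"] = extract_text(data)
    record["status"] = "Complete"
    return record
```

With this shape, Option A would supply a `decrypt` callable that POSTs the bytes to the sidecar, Option B one that shells out to PowerShell, and a deployment without decryption would pass `None` and still get a clear status.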
Phase 4: UI & Metadata Display (Low-moderate effort)
11. Add UI indicators in the document list views (personal, group, public workspaces) showing protection status: a badge/icon for `sensitivity_labeled`, `rms_encrypted`, etc. Modify the templates and JavaScript in `workspace-documents.js` and similar files.
12. Add sensitivity label names to the existing tag system or as a separate metadata section in the document detail view. Consider auto-tagging documents with their sensitivity label name (e.g., auto-applying a tag `purview:confidential`).
13. Add admin settings for configuring the MIP integration: enable/disable, sidecar service URL (for Option A), label-to-tag mapping rules, and accepted protection levels.
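
One possible auto-tagging rule, shown only as a sketch: lowercase the label name, collapse anything that is not a letter or digit into a hyphen, and prefix the result. The `label_to_tag` name and the normalization rule are assumptions. The existing tag system may impose its own character or length limits that this does not account for.

```python
import re


def label_to_tag(label_name: str, prefix: str = "purview") -> str:
    """Turn a sensitivity label name into a tag such as 'purview:confidential'."""
    slug = re.sub(r"[^a-z0-9]+", "-", label_name.strip().lower()).strip("-")
    return f"{prefix}:{slug}"
```

For example, `label_to_tag("Confidential")` gives `purview:confidential` and `label_to_tag("Highly Confidential")` gives `purview:highly-confidential`.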
Phase 5: Configuration & Admin (Moderate effort)
14. Add new admin configuration settings in `functions_settings.py` / settings UI:
    - `enable_mip_integration` (boolean)
    - `mip_sidecar_url` (string, URL of the .NET decryption microservice)
    - `mip_client_id`, `mip_tenant_id` (for the separate MIP app registration)
    - `allowed_sensitivity_levels` (list: which label classifications are allowed for upload)
    - `auto_tag_sensitivity_labels` (boolean: auto-create tags from labels)
15. Register a new Entra ID app (or extend the existing one) with the additional API permissions: `Files.Read.All`, `InformationProtection.Read.All` (for label resolution), `Azure Rights Management Services` > `Content.SuperUser` (for the decryption service)
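
To make the settings concrete, here is a sketch of defaults plus an upload check driven by `allowed_sensitivity_levels`. The default values, the `is_upload_allowed` helper, and the policy it encodes (unlabeled files always pass; with the integration enabled, every label on a file must be in the allowed list, compared case-insensitively) are assumptions, not decisions made in this issue.

```python
# Hypothetical defaults for the proposed settings; the integration ships disabled.
DEFAULT_MIP_SETTINGS = {
    "enable_mip_integration": False,
    "mip_sidecar_url": "",
    "mip_client_id": "",
    "mip_tenant_id": "",
    "allowed_sensitivity_levels": [],
    "auto_tag_sensitivity_labels": False,
}


def is_upload_allowed(label_names: list, settings: dict) -> bool:
    """Check a file's label names against the configured allow-list."""
    merged = {**DEFAULT_MIP_SETTINGS, **settings}
    if not merged["enable_mip_integration"]:
        return True  # integration off: labels are not enforced
    allowed = {name.lower() for name in merged["allowed_sensitivity_levels"]}
    # all() over an empty list is True, so unlabeled files are accepted.
    return all(name.lower() in allowed for name in label_names)
```

Whether an empty allow-list should block every labeled upload, as it does here, or mean "no restriction" is a product decision worth settling before implementation.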
Verification
- Phase 1: Upload a known RMS-protected DOCX and verify the app detects it and shows `"Protected - requires decryption"` status instead of an error
- Phase 2: Browse OneDrive from the app, select a labeled file, verify label metadata appears in the document record
- Phase 3: Upload an RMS-encrypted file, verify it gets decrypted and processed with full text extraction, and the original labels are stored
- Phase 4: Verify sensitivity badges appear in document lists and search results include label-based filtering
Decisions
- Decision: Decryption approach — Option A (.NET MIP SDK sidecar) is recommended over Option B (PowerShell subprocess) for production reliability and scalability, but Option B is faster to prototype internally
- Decision: No native Python MIP SDK exists — this is the core constraint; any decryption capability requires a non-Python component
- Decision: Phase 1 (detection) and Phase 4 (UI) are independent of the decryption capability and can ship first to give users visibility into why some documents fail processing
- Decision: Graph API approach (Phase 2) works only for OneDrive/SharePoint-sourced files, not arbitrary file uploads from disk — both paths should be supported
- Decision: RMS Super User feature is a tenant-level admin decision with security implications (audit logging, restricted access) — this must be documented and configured by the customer's IT admin, not automated