LuukvE/file-analysis-pipeline

File Analysis Pipeline

Currently under development. A secure, scalable pipeline for your existing data analysis engine.

The system imports files from desktop computers, delivers them to a scalable network of processors for analysis, and sends the results back.

Features

  • ⚡️ Optimized Transfers - Files are compressed and chunked in memory for maximum throughput
  • 🚀 Distributed Scaling - Jobs are handled by a network of processors
  • 🔒 Private Results - Results are encrypted so that only the uploader can decrypt them
  • 🛡️ Attack Mitigation - Processors are shielded from attacks by having no open inbound ports
  • 🔄 Controlled Updates - Users choose when to update, while devs can still introduce breaking changes

Architecture

  • 💻 Client: Watch directory, Stream files, Sign up / in, Show system state, Manage organisations and users
  • 🌐 Server: Create Presigned Upload URLs, Database management, Role-based authorization, SSO authentication
  • ⚙️ Processor: Receive file stream, Run as node in scalable network, Communicate with sandboxed Engine
  • 🤖 Engine: Perform file analysis; the included example just counts file size - replace it with your own AI

Installation

# Ubuntu 24.04 LTS with Docker Desktop

# Bun 1.3
curl -fsSL https://bun.com/install | bash

# A Chromium-compatible browser is required for Electron
# An external browser is required for SSO
wget -P /tmp https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && \
sudo apt install -y /tmp/google-chrome-stable_current_amd64.deb && \
rm /tmp/google-chrome-stable_current_amd64.deb

# Node is required to run electron-builder
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
nvm install 22

# Within repo root
bun install
cd client
bun install
bun run setup # Configure protocol for Electron

# Within repo root
docker compose up
bun run client

Reasoning

  • When autoscaling the server, use Node.js event loop lag as a metric, in combination with request rate and latency. (TODO)
  • Secrets are handled by AWS Secrets Manager; for local development I inject my secrets into that manager from JSON files in the repo root.
  • Compress large files using multiple CPUs. The time saved by sending less data over the network outweighs the time spent on heavy compression.
  • Chunk up large files while compressing them, to avoid intermediate disk I/O and to let the processor download chunks that have already been uploaded.
  • Upload directly to S3. Reducing file transfer time means cutting systems out of the pipeline; direct client-to-S3 uploads are ideal.
  • Use pre-signed upload URLs. Presigning gives the server the ability to authorize clients on a per-upload basis. Clients can be restricted when the limit of their payment plan is reached.
  • Use SSO for authentication. Businesses already have on- and off-boarding procedures in place. By using Microsoft and Google SSO, this system minimises operational requirements.
  • Use S3 for binary data and a custom NestJS server for JSON data. This lets the processor and its attached engine pull data, instead of requiring open inbound ports.
  • Leverage the increased throughput of file transfers within AWS networks, which could outperform a direct client to processor upload stream depending on the network connections and usage.
  • Use Docker Compose with a bridge network between the engine and processor. This ensures the engine can still be open to requests from the processor, without being accessible from anywhere else.
  • Emulate AWS services using LocalStack, enabling improved testing and development with easily configurable IAM policies, DynamoDB tables, S3 buckets and secrets. This approach also allows for developer-specific mock data.
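The bridge-network isolation between engine and processor might look like the following Compose excerpt; the service names and layout are assumptions for illustration, not this repository's actual docker-compose.yml:

```yaml
# Hypothetical excerpt - names are illustrative, not from this repo.
services:
  processor:
    build: ./processor
    networks: [default, engine-net] # default keeps the processor's outbound access
  engine:
    build: ./engine
    networks: [engine-net]          # engine is reachable only over the bridge

networks:
  engine-net:
    driver: bridge
    internal: true # no gateway: traffic stays between members of this network
```

With `internal: true`, Docker attaches no external gateway to `engine-net`, so the engine can answer the processor but cannot be reached from (or reach) anything else, while the processor keeps internet access through the default network.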
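The event-loop-lag metric mentioned in the first bullet can be sampled with Node's built-in `perf_hooks`. The sketch below is an assumed approach with illustrative thresholds, not code from this repository:

```typescript
// Minimal sketch: sample Node.js event loop delay as an autoscaling signal.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Report p99 event loop lag in milliseconds; the histogram records nanoseconds.
export function eventLoopLagMs(): number {
  return histogram.percentile(99) / 1e6;
}

// Example scaling rule combining lag with request rate.
// The thresholds are illustrative assumptions, not values from this repo.
export function shouldScaleOut(lagMs: number, reqPerSec: number): boolean {
  return lagMs > 100 || (lagMs > 30 && reqPerSec > 500);
}
```

In a real deployment this value would be exported alongside request rate and latency to whatever drives the autoscaler, rather than acted on in-process.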
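The compress-and-chunk bullets can be illustrated with Node's `zlib`. This is a single-threaded sketch of the idea; in the real pipeline the chunks would be spread across CPU workers and each compressed chunk would be PUT to a presigned URL instead of collected in memory:

```typescript
// Sketch: read a source in fixed-size chunks and gzip each chunk in memory,
// so no intermediate file touches disk and finished chunks can be uploaded
// before the rest of the file is compressed.
import { gzipSync, gunzipSync } from "node:zlib";

const CHUNK_SIZE = 8 * 1024 * 1024; // 8 MiB per chunk (assumed value)

export function* compressInChunks(
  data: Buffer,
  chunkSize = CHUNK_SIZE,
): Generator<Buffer> {
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    // Each chunk is an independent gzip member, so the processor can
    // decompress chunks as soon as they arrive, in any order.
    yield gzipSync(data.subarray(offset, offset + chunkSize));
  }
}

// Usage: round-trip a buffer through the chunker.
const original = Buffer.from("hello world ".repeat(1000)); // 12,000 bytes
const chunks = [...compressInChunks(original, 4096)];      // 3 chunks
const restored = Buffer.concat(chunks.map((c) => gunzipSync(c)));
console.log(restored.equals(original)); // true
```

Because every chunk is a self-contained gzip stream, losing or re-ordering an upload never corrupts neighbouring chunks.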
