GitHub - daniyall/pyPiper: Pipelining framework for python. Designed for data analysis but can be useful to other applications

A parallel pipelining framework for Python. Developers can create nodes and chain them together to create pipelines.

Classes that extend Node must implement run method that will be called whenever new data is available.

Installation

pip install pypiper

Example Usage

from pyPiper import Node, Pipeline

class Generate(Node):
    def setup(self, size):
        self.size = size
        self.pos = 0

    def run(self, data):
        if self.pos < self.size:
            self.emit(self.pos)
            self.pos = self.pos + 1
        else:
            self.close()

class Square(Node):
    def run(self, data):
        self.emit(data**2)


pipeline = Pipeline(Generate("gen", size=10) | Square("square"))
print(pipeline)
pipeline.run()

Nodes can also specify a batch size that dictates how much data should be pushed to the node. For example, building on the previous example. In this case batch_size is specified in the nodes setup method. Alternatively, it can be set when creating the node (ex. Printer("print", batch_size=5))

class Printer(Node):
    def setup(self):
        self.batch_size = Node.BATCH_SIZE_ALL

    def run(self, data):
        print(data)

pipeline = Pipeline(Generate("gen", size=10) | Square("square") | Printer("print"))
print(pipeline)
pipeline.run()

Parallel Execution

To process pipelines in parallel, pass n_threads > 1 when creating the pipeline. Parallel execution is done using multiprocessing and is well suited to CPU intensive tasks such as audio processing and feature extraction. For example:

class Generate(Node):
    def setup(self, size):
        self.pos = 0

    def run(self, data):
        if self.pos < self.size:
            self.emit(self.pos)
            self.pos = self.pos + 1
        else:
            self.close()

pipeline = Pipeline(Generate("gen", size=10) | Square("square") | Printer("print"), n_threads=2)
print(pipeline)
pipeline.run()

Stream Names

You can also name input and output streams. For example:

gen = EvenOddGenerate("gen", size=20, out_streams=["even", "odd"])
double = Double("double", out_streams="num", in_streams="even")
square = Square("square", out_streams="num", in_streams="odd")

printer1 = Printer("p1", in_streams="num", batch_size=Node.BATCH_SIZE_ALL)
printer2 = Printer("p2", in_streams="num", batch_size=Node.BATCH_SIZE_ALL)

p = Pipeline(gen | [double | printer1, square | printer2], quiet=False)

p.run()

EvenOddGenerate generates a pair of numbers. using the out_streams parameter, we name the first number even and second number odd. When initializing the double and square nodes, we tell double to take the even number and square to take the odd number.

If multiple output streams are passed into a node, by default, they will be come into the node as a list. For example,

gen = EvenOddGenerate("gen", size=10, out_streams=["even", "odd"])
printer = Printer("p1", batch_size=1)

p = Pipeline(gen | printer, quiet=False)
p.run()

Will output

[0,1],
[2,3],
...

However, you can split the streams by specifying their names in the input streams in_streams parameter. So,

gen = EvenOddGenerate("gen", size=20, out_streams=["even", "odd"])

printer = Printer("p1", in_streams=["even", "odd"], batch_size=1)

p = Pipeline(gen | printer, quiet=False)
p.run()

Will generate:

0,
1,
2,
3,
...

Progress Updates

When calling pipeline.run(), you can provide a callback function for progress updates. Whenever the pipelines makes progress, it calls this function with the number of items that have been processed so far and the total number of items that need to be processed. For example, if you were using a tqdm progress bar, you could use the following code:

from tqdm import tqdm

class TqdmUpdate(tqdm):
    def update(self, done, total_size=None):
        if total_size is not None:
            self.total = total_size
        self.n = done
        super().refresh()

if __name__ == '__main__':
    gen = Generate("gen", size=10)
    double = Double("double")
    sleeper = Sleep("sleep")

    p = Pipeline(gen | [double, sleeper], n_threads=4, quiet=True)
    with TqdmUpdate(desc="Progress") as pbar:
        p.run(update_callback=pbar.update)

Projects Using PyPiper

COVFEFE: A feature extraction tool focusing on lexical, syntactic and pragmatic features from text and audio features from audio.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
pyPiper		pyPiper
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Table of Contents

Installation

Example Usage

Parallel Execution

Stream Names

Progress Updates

Projects Using PyPiper

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

daniyall/pyPiper

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Installation

Example Usage

Parallel Execution

Stream Names

Progress Updates

Projects Using PyPiper

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages