ProvMonitor Wiki

Vanessa Braganholo edited this page Dec 1, 2016 · 3 revisions

Scientific experiments modeled as scientific workflows may create, change, or access data products not explicitly referenced in the workflow specification, leading to implicit data flows. The lack of knowledge about implicit data flows makes the experiments hard to understand and reproduce. In this article, we present ProvMonitor, an approach that identifies the creation, change, or access of data products even within implicit data flows. ProvMonitor links this information with the workflow activity that generated it, allowing scientists to compare data products within and across trials of the same workflow and to identify side effects on data evolution caused by implicit data flows. We evaluated ProvMonitor and observed that it could answer queries for scenarios that demand specific knowledge related to implicit provenance.

Instrumentation and Provenance Gathering

Using ProvMonitor comprises two complementary phases: instrumentation and execution. The instrumentation phase, which can be done manually or automatically (through ProvManager [Marinho et al. 2012]), occurs once, before the workflow is executed. During this phase, the prospective provenance is collected and each original workflow activity is replaced by a new activity that wraps the original content together with Provenance Gathering Activities (PGAs), which are responsible for collecting retrospective provenance. Specifically, the instrumentation injects two PGAs into each original activity, one before and one after its content. Two additional PGAs are injected at the beginning and at the end of the workflow. These PGAs manage the files required by the activity, creating the workspace and committing/pushing the generated or modified files according to the selected isolation strategy (see Section 3.2). Because PGAs are wrapped together with the original activity (using composed activities), the appearance of the original workflow is preserved. After instrumentation, the number of activities, the dependencies among them, and their parameters remain intact, but the workflow is enhanced with provenance-gathering capabilities.
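The wrapping idea can be illustrated with a minimal Python sketch. It is not ProvMonitor's implementation: the function names `instrument` and `instrument_workflow`, and the log strings, are hypothetical and only show how a composed activity can run a PGA before and after the untouched original content, with two extra PGAs around the whole workflow.

```python
from typing import Callable, List, Tuple


def instrument(activity: Callable[[], None], activity_id: str,
               log: List[str]) -> Callable[[], None]:
    """Return a composed activity: PGA before + original content + PGA after.

    The log entries stand in for real PGA work (hypothetical placeholders).
    """
    def composed() -> None:
        log.append(f"pga-before:{activity_id}")  # e.g. prepare the workspace
        activity()                               # original content, unchanged
        log.append(f"pga-after:{activity_id}")   # e.g. commit changed files
    return composed


def instrument_workflow(activities: List[Tuple[str, Callable[[], None]]],
                        log: List[str]):
    """Wrap every activity and add one PGA at workflow start and one at end."""
    wrapped = [instrument(fn, name, log) for name, fn in activities]

    def run() -> None:
        log.append("pga-workflow-start")
        for step in wrapped:
            step()
        log.append("pga-workflow-end")

    return wrapped, run
```

Note that the number of activities and their order are the same before and after wrapping, which mirrors the claim that instrumentation leaves the workflow structure intact.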

During the workflow execution, the PGAs call specific methods of the ProvMonitor API to gather the retrospective provenance and associate it with the previously gathered prospective provenance. Before the first activity executes, a PGA calls the initializeExperimentExecution() API method to notify ProvMonitor that the experiment is about to start. This method sets up the Git repository according to the isolation strategy chosen by the user (see Section 3.2). Then, before each activity executes (including the first one), a PGA calls the notifyActivityExecutionStartup() API method. Likewise, after each activity executes, another PGA calls the notifyActivityExecutionEnding() method. This method performs commits in Git (and pushes, depending on the isolation strategy in use; see Section 3.2). Each commit carries an activity identification in its message field, which links the prospective provenance with the retrospective provenance. Finally, after the last workflow activity, another PGA calls the finalizeExperimentExecution() API method, which pushes changes to the provenance repository. All these methods are located in the RetrospectiveProvenanceBusinessServices class of our implementation.
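The call order described above can be sketched with an in-memory stand-in. This is an illustration under stated assumptions, not the real API: the class `FakeProvMonitor`, its snake_case method names, the `activity=<id>` message format, and the dictionary used as a workspace are all hypothetical, and no Git commands are executed.

```python
import hashlib
from typing import Dict, List


class FakeProvMonitor:
    """In-memory stand-in (hypothetical); the real methods operate on Git."""

    def __init__(self) -> None:
        self.workspace: Dict[str, str] = {}   # path -> file content
        self.commits: List[dict] = []
        self.pushed = 0
        self._snapshot: Dict[str, str] = {}

    def initialize_experiment_execution(self) -> None:
        # Stands in for setting up the repository for a new trial.
        self.commits.clear()
        self.pushed = 0

    def notify_activity_execution_startup(self, activity_id: str) -> None:
        # Remember the workspace state so changes can be detected afterwards.
        self._snapshot = dict(self.workspace)

    def notify_activity_execution_ending(self, activity_id: str) -> None:
        paths = set(self.workspace) | set(self._snapshot)
        changed = sorted(p for p in paths
                         if self.workspace.get(p) != self._snapshot.get(p))
        payload = repr((activity_id, changed, len(self.commits))).encode()
        self.commits.append({
            "hash": hashlib.sha1(payload).hexdigest(),
            "message": f"activity={activity_id}",  # links commit to activity
            "files": changed,
        })

    def finalize_experiment_execution(self) -> None:
        # Stands in for pushing all commits to the provenance repository.
        self.pushed = len(self.commits)


monitor = FakeProvMonitor()
monitor.initialize_experiment_execution()
trial = [
    ("align", {"out/aligned.txt": "v1"}),
    # "tmp/cache.bin" plays the role of an implicit, undeclared output.
    ("filter", {"out/aligned.txt": "v2", "tmp/cache.bin": "x"}),
]
for activity_id, writes in trial:
    monitor.notify_activity_execution_startup(activity_id)
    monitor.workspace.update(writes)  # the activity's file writes
    monitor.notify_activity_execution_ending(activity_id)
monitor.finalize_experiment_execution()
```

Because each commit is recorded right after its activity ends, a file written outside the workflow specification still ends up attributed to the activity that produced it.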

Provenance Model

To support provenance queries, and to connect retrospective provenance (the commits, in this case) with prospective provenance, ProvMonitor also uses a relational provenance database (Provenance DB). To avoid the proliferation of provenance databases, when the SWfMS already stores provenance in a relational database, ProvMonitor can create its Provenance DB in that same database. Otherwise, ProvMonitor requires a relational database of its own, which can be as simple as SQLite. Provenance queries can thus be executed directly over the relational provenance database; scientists only need to access the Git repository when they need to analyze file contents.

ProvMonitor extends the provenance model adopted by ProvManager since it uses ProvManager to capture provenance at the activity level. Activities are instrumented using ProvManager to include version control commands that capture provenance related to files that are created, deleted or modified during the workflow execution.

ProvMonitor's Provenance Model

Our extension adds two new entities, File Access and Commit, which are shown in light gray in the model above. Additionally, ProvMonitor uses the Workflow Element Execution entity from the ProvManager provenance model. The Workflow Element Execution entity stores information about the execution of a workflow element (i.e., an activity) in a given trial of the experiment. The Commit entity stores the hashes of all commits triggered by each activity execution. Finally, the File Access entity stores the access type (create, change, delete, or read) and the files accessed in each commit, which in turn is linked to an activity execution in the workflow.
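To make the three entities concrete, the following sketch builds a small SQLite database and runs one provenance query. The table names, column names, and sample rows are assumptions chosen for illustration; the actual ProvMonitor schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow_element_execution (
    id       INTEGER PRIMARY KEY,
    trial    INTEGER NOT NULL,
    activity TEXT    NOT NULL
);
CREATE TABLE commit_record (
    hash         TEXT PRIMARY KEY,
    execution_id INTEGER NOT NULL REFERENCES workflow_element_execution(id)
);
CREATE TABLE file_access (
    commit_hash TEXT NOT NULL REFERENCES commit_record(hash),
    path        TEXT NOT NULL,
    access_type TEXT NOT NULL
        CHECK (access_type IN ('create', 'change', 'delete', 'read'))
);
""")

# Made-up sample data: two trials of a workflow with hypothetical activities.
conn.executemany("INSERT INTO workflow_element_execution VALUES (?, ?, ?)",
                 [(1, 1, "align"), (2, 1, "filter"), (3, 2, "align")])
conn.executemany("INSERT INTO commit_record VALUES (?, ?)",
                 [("c1", 1), ("c2", 2), ("c3", 3)])
conn.executemany("INSERT INTO file_access VALUES (?, ?, ?)",
                 [("c1", "out/aligned.txt", "create"),
                  ("c2", "out/aligned.txt", "change"),
                  ("c2", "tmp/cache.bin", "create"),
                  ("c3", "out/aligned.txt", "create")])


def activities_touching(path: str):
    """List (trial, activity, access_type, commit hash) for one file path."""
    return conn.execute("""
        SELECT e.trial, e.activity, f.access_type, c.hash
        FROM file_access f
        JOIN commit_record c ON c.hash = f.commit_hash
        JOIN workflow_element_execution e ON e.id = c.execution_id
        WHERE f.path = ?
        ORDER BY e.trial, e.id
    """, (path,)).fetchall()
```

A query such as `activities_touching("out/aligned.txt")` stays entirely in the relational database; the commit hashes it returns are what a scientist would then use to inspect file contents in the Git repository.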

ProvManager

Here is a list of links related to ProvManager:
