User Guide
Users can currently interact with Orpheus in three ways: through the command line, through a user interface, or through gRPC. Orpheus commands can be executed either in a terminal or in the user interface. The user interface also includes an interactive version graph explorer that visualizes the derivation relationships between versions; users can select a set of versions via point-and-click and apply the operations listed below the version graph to them. Finally, users can call Orpheus methods directly via a gRPC interface. Here we introduce the basic Orpheus commands.
To start working on versioned datasets, users need to run orpheus config to set up OrpheusDB for the given user. To start off, use the same username that was used during the PostgreSQL configuration; this initializes an OrpheusDB user with the same username. After that, new OrpheusDB users can be created via the create_user command. When the command finishes, the new username is pushed to the underlying data storage with the SUPERUSER privilege. The config command can also be used to log in with an existing username, and whoami shows the username that is currently logged in.
Please note that OrpheusDB provides only a basic implementation of user management. For now, the password is simply hashed with SHA-1 and the hashed password is stored in OrpheusDB. This feature is subject to change in future versions.
orpheus config
orpheus create_user
orpheus whoami
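As a rough illustration of the SHA-1 password hashing mentioned above, the following Python sketch computes a hex-encoded SHA-1 digest. The helper name sha1_hex, the UTF-8 encoding, the hex output, and the absence of a salt are all assumptions made for this example; the exact scheme OrpheusDB uses is not described here.

```python
import hashlib


def sha1_hex(password):
    """Return the hex-encoded SHA-1 digest of a password (hypothetical helper)."""
    # Assumption: the password is encoded as UTF-8 and hashed without a salt.
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


digest = sha1_hex("password")
print(digest)  # a 40-character hexadecimal string
```

Note that SHA-1 is a one-way hash, not encryption: the stored value cannot be decrypted, only compared against the hash of a candidate password.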
To connect the user interface to PostgreSQL, users currently need to specify the PostgreSQL authentication information manually in setting.py. We aim to fix this issue in the near future.
The init command provides a mechanism to load a csv file into OrpheusDB as a CVD, with all the records as its first version (i.e., vid = 1). To tell OrpheusDB the schema of the dataset, users can provide a schema file through the -s option. Each line in the schema file has the format <attribute name>, <type of the attribute>. In the following example, the protein.csv file contains attributes such as protein1, protein2 and neighborhood. The command below loads the protein.csv file into OrpheusDB as a CVD named protein_interaction, whose schema is given in the file protein_schema.csv.
orpheus init test/protein.csv protein_interaction -s test/protein_schema.csv
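To make the schema file format concrete, here is a minimal Python sketch that parses lines of the form <attribute name>, <type of the attribute> into pairs. The function parse_schema is a hypothetical helper, not part of OrpheusDB, and the type names in the sample (text, int) are assumed for illustration; the actual contents of protein_schema.csv may differ.

```python
import csv
import io


def parse_schema(text):
    """Parse '<attribute name>, <type>' lines into a list of (name, type) pairs."""
    pairs = []
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:  # skip empty lines
            continue
        if len(row) != 2:
            raise ValueError("expected 2 fields per line, got %r" % (row,))
        name, attr_type = (field.strip() for field in row)
        pairs.append((name, attr_type))
    return pairs


# Hypothetical schema contents; the attribute types are assumptions.
sample = "protein1, text\nprotein2, text\nneighborhood, int\n"
schema = parse_schema(sample)
print(schema)
```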
Users can check out one or more desired versions through the checkout command, to either a csv file or a structured table in the RDBMS. In the following example, version 1 of CVD protein_interaction is checked out as a csv file named protein.csv.
orpheus checkout protein_interaction -v 1 -f protein.csv
After changes are made to previously checked-out versions, OrpheusDB can commit these changes to the corresponding CVD, assuming that the schema is unchanged.
In the following example, we commit the modified protein.csv back to CVD protein_interaction. Note that since OrpheusDB internally logs the CVD that the protein.csv file was checked out from, there is no need to specify the CVD name in the commit command.
Any changed or new records in the committed file are appended to the corresponding CVD and labeled with a new version id. A special case is committing a subset of a previously checked-out version. In that case, OrpheusDB performs the commit as expected: the new version is added with only that subset of the records.
orpheus commit -f protein.csv -m 'first commit'
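The checkout and commit behavior described above can be pictured with a toy in-memory model. This is only a sketch: the ToyCVD class, its methods, and the sample records are hypothetical, and it says nothing about how OrpheusDB actually stores versions. It shows that every commit yields a new version id, that earlier versions stay unchanged, and that committing a subset simply produces a version holding that subset.

```python
class ToyCVD:
    """Toy stand-in for a CVD: maps each version id to an immutable set of records."""

    def __init__(self, records):
        # Like init: all loaded records form the first version (vid = 1).
        self.versions = {1: frozenset(records)}

    def checkout(self, vid):
        # Hand out a mutable copy so edits do not touch the stored version.
        return set(self.versions[vid])

    def commit(self, records):
        # Every commit is stored under a fresh version id.
        vid = max(self.versions) + 1
        self.versions[vid] = frozenset(records)
        return vid


# Hypothetical records: (protein1, protein2, score).
cvd = ToyCVD([("p1", "p2", 120), ("p1", "p3", 250), ("p2", "p3", 310)])

work = cvd.checkout(1)
work.add(("p3", "p4", 90))           # add a new record
v2 = cvd.commit(work)                # committed as version 2

subset = {r for r in cvd.checkout(1) if r[2] > 200}
v3 = cvd.commit(subset)              # subset of version 1, committed as version 3
```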
OrpheusDB also supports direct execution of queries on CVDs without materialization. This is done via the run command. The run command will prompt the user to provide the SQL command to be executed directly. If -f is specified, it will execute the SQL file specified.
orpheus run
OrpheusDB supports a rich syntax of SQL statements on versions and CVDs. During the execution of these statements, OrpheusDB detects keywords such as CVD so that it knows the query is against one or more CVDs. Two main types of queries are supported:
- Query against known version(s) of a particular dataset
- Query against unknown version(s) of a particular dataset
To query against known version(s), the version number needs to be specified. In the following example, OrpheusDB will select the neighborhood and fusion columns from CVD protein_interaction whose version id is equal to either 1 or 2.
SELECT neighborhood, fusion FROM VERSION 1,2 OF CVD protein_interaction;
If the version numbers are unknown, OrpheusDB supports queries that identify the desired versions as part of the query. In the following example, OrpheusDB selects all the version ids that have one or more records whose cooccurrence score is larger than 200. Note that the GROUP BY clause is required to aggregate on version numbers.
SELECT vid FROM CVD protein_interaction WHERE cooccurrence > 200 GROUP BY vid;
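To see why grouping on vid matters, the query above can be approximated in plain SQL. The sketch below uses Python's sqlite3 module and assumes, purely for illustration, that a CVD behaves like one flat table with a vid column on every record; the table name flat and the sample rows are made up, and this is not how OrpheusDB stores or rewrites queries internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flat (vid INTEGER, protein1 TEXT, cooccurrence INTEGER)")
conn.executemany(
    "INSERT INTO flat VALUES (?, ?, ?)",
    [(1, "a", 250), (1, "b", 300), (2, "a", 250), (2, "c", 100), (3, "c", 100)],
)

# Without GROUP BY, each matching record contributes its own row.
ungrouped = [row[0] for row in conn.execute(
    "SELECT vid FROM flat WHERE cooccurrence > 200 ORDER BY vid")]

# With GROUP BY vid, each qualifying version is reported once.
grouped = [row[0] for row in conn.execute(
    "SELECT vid FROM flat WHERE cooccurrence > 200 GROUP BY vid ORDER BY vid")]

print(ungrouped)  # [1, 1, 2]
print(grouped)    # [1, 2]
conn.close()
```

Version 3 is absent from both results because none of its records has a cooccurrence score above 200.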
Here are a couple of other examples of SQL on versions:
(1). Find all versions in CVD protein_interaction that have more than 5 records whose coexpression score is larger than 300.
SELECT vid FROM CVD protein_interaction WHERE coexpression > 300 GROUP BY vid HAVING COUNT(*) > 5;
(2). Find all versions in CVD protein_interaction whose commit time is later than December 1st, 2016.
SELECT vid FROM CVD protein_interaction WHERE commit_time > '2016-12-01' GROUP BY vid;