OrChem architecture

Back to main page

The picture below shows the main components of OrChem in yellow. Shown left in gray is the user's table with chemical compounds. This table (or multiple tables unified in a view) is presumed to have some unique identifier column (character or numeric) and then also a VARCHAR2 or CLOB column with molecular data (MolFiles and SMILES are accepted). The identifier and molecular data column are mandatory for OrChem to work. During OrChem installation the table ORCHEM_PARAMETERS is populated; this one-row table stores the name of user's compound table and its primary key column.

Before OrChem can be used to search the compound table, a Java stored procedure (package "orchem_fingerprinting") needs to run to populate tables that support similarity and substructure searching.
The fingerprinting procedure reads the entire content of user's compound table, and for each compound creates a CDK molecule and fingerprints it. The time it takes for this procedure to complete depends on the size of the compound table and the complexity of the compounds. The procedure can/should be parallelized, for instance by using DBMS_JOB.

The two shown tables ORCHEM_FINGPRINT_SIMSEARCH and ORCHEM_FINGPRINT_SUBSEARCH are populated by the fingerprinting procedure:

ORCHEM_FINGPRINT_SIMSEARCH is a cached table that facilitates similarity searching. It has a raw column storing the fingerprint for each compound. It also has a column that indicates how many of the fingerprint bits are set to one. OrChem uses a similarity search algorithm that makes relatively large row scans on this table. The similarity table is therefore kept "lean" and cached for better performance.
ORCHEM_FINGPRINT_SUBSEARCH is the central table for substructure searching. It has the same chemical fingerprint shredded into separate columns, each column representing one bit of the fingerprint (null=false, 1=true). There are also columns for storing atom and bond information for the compound the row is linked to. If the compound is big, the atom and bond data may occasionally overflow into Clobs, table ORCHEM_BIG_MOLECULES (not shown)
The substructure search requires a query's fingerprint to be a subset of any candidate's fingerprint. This is implemented by running a pre-filter query on the indexed columns bit1 .. bitxxx.
Atom and bond counts are used to skip compounds that can never be candidates. For example, a query structure with three oxygen atoms can never be a substructure of a compound with two oxygens or less. Filtered candidates are finally put through a CDK VF2 graph matching implementation to check if the query structure really is a substructure of the pre-filtered candidate compound.

OrChem substructure searching is generally more expensive and slower than similarity searching because it includes the costly graph matching step. To boost performance, Oracle's parallel piped function feature has been implemented for the substructure search. The user can actually choose between parallelized and non-parallelized substructure searches (see the two 'subsearch' packages in the picture). The parallel search requires an extra step making the query a two-step process. Although this is extra overhead, the benefit may be serious performance gain, depending on the query and the amount of cores/processors on your database server.

You can opt out of OrChem's search implementations; instead, you could only install the Java libraries in a database schema and so have the CDK at your disposal inside Oracle. A number of chemical format conversion functions have been set up for your convenience, and developers can build additional Java stored procedure wrappers around CDK class methods of your particular interest. This is a relatively straightforward process, and once in place the rich functionality of the CDK is available to all SQL and PL/SQL in your database applications.

Overview / architecture