Back to main page
The picture below shows the main components of OrChem in yellow.
Shown left in gray is the user's table with chemical compounds. This table (or
multiple tables unified in a view) is presumed to have some unique identifier column (character or numeric) and
then also a VARCHAR2 or CLOB column with molecular data (MolFiles and SMILES are accepted).
The identifier and molecular data column are mandatory for OrChem to work.
During OrChem
installation
the table ORCHEM_PARAMETERS is populated; this one-row table
stores the name of user's compound table and its primary key column.
Before OrChem can be used to search the compound table, a Java stored
procedure (package "orchem_fingerprinting") needs to run to populate tables that support
similarity and substructure searching.
The fingerprinting procedure reads the entire content of user's compound table, and for each
compound creates a CDK molecule and fingerprints it. The time it takes for this procedure to complete
depends on the size of the compound table and the complexity of the compounds. The procedure can/should be
parallelized, for instance by using DBMS_JOB.
The two shown tables ORCHEM_FINGPRINT_SIMSEARCH and ORCHEM_FINGPRINT_SUBSEARCH are populated by the
fingerprinting procedure:
-
ORCHEM_FINGPRINT_SIMSEARCH
is a cached table that facilitates similarity searching. It has a raw
column storing the fingerprint for each compound. It also has a column
that indicates how many of the fingerprint bits are set to one.
OrChem uses a
similarity search algorithm that makes relatively large
row scans on this table. The similarity table is therefore kept "lean"
and cached for better performance.
-
ORCHEM_FINGPRINT_SUBSEARCH
is the central table for substructure searching. It has the same chemical
fingerprint shredded into separate columns, each column representing
one bit of the fingerprint (null=false, 1=true).
There are also columns for storing atom and bond information for the compound the row is linked to. If the
compound is big, the atom and bond data may occasionally overflow into Clobs, table ORCHEM_BIG_MOLECULES (not shown)
The substructure search requires a query's fingerprint to be a subset of
any candidate's fingerprint. This is implemented by running a pre-filter query on
the indexed columns bit1 .. bitxxx.
Atom and bond counts are used to
skip compounds that can never be candidates. For example, a query structure with three oxygen atoms can
never be a substructure of a compound with two oxygens or less.
Filtered candidates are finally put through a CDK
VF2 graph matching implementation to check if the query structure really is a substructure of the pre-filtered candidate compound.
OrChem substructure searching is generally more expensive and slower than similarity searching because it includes the costly graph matching step.
To boost performance, Oracle's
parallel piped function feature has been implemented for the substructure search.
The user can actually choose between parallelized and non-parallelized substructure searches (see the two 'subsearch' packages in the picture).
The parallel search requires an extra step making the query a two-step process. Although this is extra overhead,
the benefit may be serious performance gain, depending on the query and the amount of cores/processors on your database server.
You can opt out of OrChem's search implementations; instead, you could only install the Java libraries in a database schema and so have
the CDK at your disposal inside Oracle. A number of chemical format conversion functions
have been set up for your convenience, and developers can build additional Java stored procedure wrappers around CDK class methods of your particular interest.
This is a relatively straightforward process, and once in place the rich functionality of the CDK is available to all SQL and PL/SQL in your database applications.