Creating fingerprints
This section assumes you have installed OrChem.
It explains the fingerprinting process. Once your compounds have been fingerprinted
you can use OrChem for similarity and substructure searching. If you're not interested
in OrChem's search features you need to fingerprint.
Simple approach
If you have a fairly small compound table, say in the order of tens of thousands of
compounds, you could opt to create the fingerprints in a single process.
If you have a larger compound set, please continue reading the parallel approach.
To create fingerprints you need to run procedure "orchem_fingerprinting.load_cdk_fingerprints".
This procedure takes two arguments, but in the simple approach they can be set to NULL.
So, after logging on to the ORCHEM schema with for example SQL*Plus, invoke the procedure like this:
begin
orchem_fingerprinting.LOAD_CDK_FINGERPRINTS(null,null);
end;
The procedure will populate ORCHEM_FINGPRINT_SIMSEARCH and ORCHEM_FINGPRINT_SUBSEARCH.
You can track progress of the procedure by counting the rows in these tables.
When the procedure completes, you can
see some logged information in a log table called ORCHEM_LOG.
After completion, create the indexes.
Parallel approach
The parallel approach splits the creation of fingerprints into several threads.
To do this, you should decide how many parallel threads you want to run. Mind you, too many parallel
threads may be counter productive, clogging your database.
Here is an example. Say we have a table COMPOUNDS that has 1 million compounds in it.
In our table, the compounds are neatly distributed with IDs between 1 and
1000,000.
We decide to run five parallel threads, each taking care of 200,000 compounds.
This is done through DBMS_JOB. Make sure DBMS_JOB is available to you and that
it allows the concurrency you are looking for (your DBA can tell you).
In the example, DBMS_JOB is used as follows:
declare
i binary_integer;
begin
dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load( 1 , 200000 );' ,INTERVAL=>null );
dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(200001 , 400000 );' ,INTERVAL=>null );
dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(400001 , 600000 );' ,INTERVAL=>null );
dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(600001 , 800000 );' ,INTERVAL=>null );
dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(800001 ,1000000 );' ,INTERVAL=>null );
commit;
end;
In the example above, procedure slice_load is invoked for each group of compounds we
want to fingerprint. So the first job will take care of fingerprinting compounds
with a primary key between 1 and 200,000. The next job will do primary keys 200,001 to 400,000. And so on.
For your particular data, find the right primary key ranges and issue jobs using more or
fewer jobs according to your preferences.
For a rough idea on timings, at the EBI we indexed a PubChem snapshot of 3.5 million compounds
in about 7 hours using 5 parallel threads.
You can check progress of DBMS_JOB by running the query below. See more on DBMS_JOB in the
Oracle documentation.
set lines 200
col what format a80
select job, what, failures, this_sec, next_sec from user_jobs
/
You should not spot anything in column "failures". If you do,
remove the DBMS_JOBs and check the error in ORCHEM_LOG. Notify the OrChem developers
about the error by sending an email.
When all jobs finish succesfully they will disappear from USER_JOBS. If you wish you can check
the log in table ORCHEM_LOG. Next, create the indexes.
|