Fingerprinting

Back to main page

Creating fingerprints

This section assumes you have installed OrChem. It explains the fingerprinting process. Once your compounds have been fingerprinted you can use OrChem for similarity and substructure searching. If you're not interested in OrChem's search features you need to fingerprint.

Simple approach

If you have a fairly small compound table, say in the order of tens of thousands of compounds, you could opt to create the fingerprints in a single process. If you have a larger compound set, please continue reading the parallel approach.

To create fingerprints you need to run procedure "orchem_fingerprinting.load_cdk_fingerprints". This procedure takes two arguments, but in the simple approach they can be set to NULL. So, after logging on to the ORCHEM schema with for example SQL*Plus, invoke the procedure like this:
begin
   orchem_fingerprinting.LOAD_CDK_FINGERPRINTS(null,null);
end;
The procedure will populate ORCHEM_FINGPRINT_SIMSEARCH and ORCHEM_FINGPRINT_SUBSEARCH. You can track progress of the procedure by counting the rows in these tables. When the procedure completes, you can see some logged information in a log table called ORCHEM_LOG.
After completion, create the indexes.


Parallel approach

The parallel approach splits the creation of fingerprints into several threads. To do this, you should decide how many parallel threads you want to run. Mind you, too many parallel threads may be counter productive, clogging your database.

Here is an example. Say we have a table COMPOUNDS that has 1 million compounds in it. In our table, the compounds are neatly distributed with IDs between 1 and 1000,000. We decide to run five parallel threads, each taking care of 200,000 compounds. This is done through DBMS_JOB. Make sure DBMS_JOB is available to you and that it allows the concurrency you are looking for (your DBA can tell you). In the example, DBMS_JOB is used as follows:
declare
    i binary_integer;
begin
    dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(     1 , 200000  );' ,INTERVAL=>null );
    dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(200001 , 400000  );' ,INTERVAL=>null );
    dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(400001 , 600000  );' ,INTERVAL=>null );
    dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(600001 , 800000  );' ,INTERVAL=>null );
    dbms_job.submit (JOB=>i,WHAT=>'ORCHEM_FINGERPRINTING.slice_load(800001 ,1000000  );' ,INTERVAL=>null );
    commit;
end;
In the example above, procedure slice_load is invoked for each group of compounds we want to fingerprint. So the first job will take care of fingerprinting compounds with a primary key between 1 and 200,000. The next job will do primary keys 200,001 to 400,000. And so on. For your particular data, find the right primary key ranges and issue jobs using more or fewer jobs according to your preferences.
For a rough idea on timings, at the EBI we indexed a PubChem snapshot of 3.5 million compounds in about 7 hours using 5 parallel threads.

You can check progress of DBMS_JOB by running the query below. See more on DBMS_JOB in the
Oracle documentation.
set lines 200
col what format a80
select job, what, failures, this_sec, next_sec from user_jobs
/
You should not spot anything in column "failures". If you do, remove the DBMS_JOBs and check the error in ORCHEM_LOG. Notify the OrChem developers about the error by sending an email.

When all jobs finish succesfully they will disappear from USER_JOBS. If you wish you can check the log in table ORCHEM_LOG. Next, create the indexes.

Creating indexes

  • Connect to the database using SQL*Plus with your ORCHEM account
  • In SQL*Plus, execute the script @indices.sql, located in orchem/sql. This script creates indices on the tables you populated in the previous steps, ORCHEM_FINGPRINT_SIMSEARCH and ORCHEM_FINGPRINT_SUBSEARCH.
                    
    > cd orchem/sql/
    
    > sqlplus orchem/orchem@marx
    
        SQL*Plus: Release 11.1.0.6.0 - Production on Tue May 19 17:48:27 2009
        
        SQL> @indices
    
        Creating bitmap index on similarity search table
        Index created.
        Creating b*tree indexes on substructure search table
        Index created.
        Index created.
        Index created.
        ..........
    

Gathering schema statistics

  • Again using SQL*Plus with your ORCHEM account, run a script - this time @statistics.sql, again located in orchem/sql. This script will gather statistics on the freshly populated tables and indices.
                    
    SQL> @statistics
    
    PL/SQL procedure successfully completed.
    PL/SQL procedure successfully completed.
    PL/SQL procedure successfully completed.
    
    
This concludes the initial creation of OrChem fingerprints; you're now ready to use OrChem for substructure and similarity searching. However, you may want to proceed to the
synchronization page in case you want OrChem to keep the fingerprint in sync with your compound data. This is relevant when your compound data is not a static dataset and affected by insertion, updates and deletions.

Back to main page

Author: mark_rynbeek@users.sourceforge.net

You're almost done..

This concludes the initial creation of OrChem fingerprints; you're now ready to use OrChem for substructure and similarity searching. However, you may want to proceed to the
synchronization page in case you want OrChem to keep the fingerprint in sync with your compound data. This is relevant if your compound data is not static but affected by insertion, updates and deletions.



Back to main page