Database Scanner

Introduction

The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements.

Note that the Database Scanner can extract content from a database or use the content files which are specified by the user during the transformation process via the mc_content_location system attribute.

Exporting objects from a database

The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements. Since many content management systems rely on some type of SQL compliant database to manage their data, the Database scanner can also be used as a generic interface for extracting information from unsupported/obsolete/custom built content management systems. The types of database management systems supported are not limited to any particular vendor or brand, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.

Some common content management system features supported by migration-center Database scanner are metadata, including system metadata such as permission information, owner, creator, content path, etc, as well as version information, as long as these types of information can be obtained from the respective system’s SQL database. The information extracted by the database scanner is stored in the migration-center database and can be processed, transformed, validated and imported just like any other type of scanned information.

Note that the Database Scanner can extract content from a database stored in BLOB/CLOB fields. Alternatively, the content files corresponding to the objects can be specified by the user during the transformation process via the mc_content_location system attribute.

Depending on the way the content is stored, it may be necessary to extract the content to the filesystem first by other means before migration-center can process it. For the mc_content_location system attribute any of the available transformation functions can be used, so it is easy to generate a value resembling a path pointing to the location of the object’s content file, which a migration-center importer can use to import the content. A good practice would be to export content files to the filesystem using the object’s unique identifier as the filename, and then build the path information based on the known path and the objects unique identifier. This location would need to be accessible to the Job Server running the import which will migrate the content to the new target system.

Database Scanner Configuration

Configuring the JDBC driver

The types of database management systems supported are not limited to any particular vendors or brands, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.

Before using the database scanner, the appropriate JDBC driver needs to be installed and configured for the Job Server. To configure a JDBC driver all required Java jar files and Java library paths need to be added to the "... \migration-center Server Components \jdbc.conf" file. The file can be edited with any text editor and contains all information needed for configuration as each setting is described in the file itself.

The JDBC driver for Oracle 11g is delivered with the Database scanner already preconfigured and ready to use.

Scanner configuration

To create a new database Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available connectors, “Database” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Database Scanner.

The Properties window of a scanner can be accessed by double-clicking a scanner in the list or selecting the Properties button or entry from the toolbar or context menu.

Common scanner parameters

Database scanner parameters

Configurations of queries in the “queryFile”

The “queryFile” is an xml file containing a series of SQL queries that will be run sequentially by the scanner for extracting information from the database.

There are two ways to scan objects with metadata:

  1. Using one query that contains both the unique identifier of the rows/objects which need to be exported and the names of the columns that represent the object's metadata.

  2. Using at least two of the queries: one is for selecting the unique identifier of the rows/objects which need to be exported as, and the second query to extract the metadata for each row/object returned by the previous query. Note: If the main query has metadata, those won't be taken into consideration.

If objects have multiple versions and the related information that need to be extracted (i.e. a row/object is a version of another row/object), some additional queries are required.

The number of queries for extracting metadata for objects or versions is not limited. Multiple queries can be specified to gather and extract various information related to objects.

Example 1.1: a simple file with one query

<?xml version="1.0" encoding="UTF-8"?>
<!--
This is a simple example the extract the some objects directly from a database that is behind a documentum 
repository using exactly one query.
-->
<queries>
    <!--
    The “type” attribute is mandatory and must have one of the following values: main, versions, main-metadata, version-metadata.
    The “name” attribute is optional and it used only for logging purposes.
    The “key” attribute contains the column name from query that will be stored in MC column: id_in_source_system.
    This query returns all objects ids and aditional metadata that are stored in the folder 'MC32/test'.
    Every row returned by this query will be saved as a new object in migration center-database.
    -->
    <query type="main" name="main query" key="r_object_id">
        select r_object_id, object_name, title, subject from rep1.dm_document_sp where folder('/MC32/test')
    </query>
</queries>

Example 1.2: a simple file with two queries

<?xml version="1.0" encoding="UTF-8"?>
<!--
This is a simple example the extract the some objects directly from a database that is behind a documentum 
repository using two separate queries.
-->
<queries>
    <!--
    The “type” attribute is mandatory and must have one of the following values: main, versions, main-metadata, version-metadata.
    The “name” attribute is optional and it used only for logging purposes.
    The “key” attribute contains the column name from query that will be passed
    to “main-metadata” query and that will be stored in MC column: id_in_source_system.
    This query returns all objects id that are stored in the folder 'MC32/test'.
    This query is run once at the beginning of scanning process.
    Every row returned by this query will be saved as a new object in migration center-database.
    -->
    <query type="main" name="main query" key="r_object_id">
        select r_object_id from rep1.dm_document_sp where folder('/MC32/test')
    </query>
    <!--
    This is the query that extracts metadata for every object returned by the "main" query.
    Any number of "main-metadata" query is allowed.
    ? is a placeholder that will be replaced by scanner at runtime with the key values returned by the main query.
    -->
    <query type="main-metadata" name="metadata query">
        select object_name, title, subject from rep1.dm_document_sp where r_object_id = ?
    </query>
</queries>

The xml file is composed for <query> elements that contain the queries that will be run by scanner for extracting objects and metadata. The <query> element may have the following attributes (some of them mandatory):

Example 2: a more complex query file

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
Contains the queries for selecting objects that will be added as "Database(document)" in migration center
-->
<queries>
    <!--
    “r_object_id” is the unique identifier of the current version and “i_chronicle_id” is the attribute the identify all versions of this document.
    -->
    <query type="main" name="fme documents" key= "r_object_id" versionid="i_chronicle_id">
        select r_object_id, i_chronicle_id from rep1.dm_document_sp where subject like ‘migration-center’
    </query>
    <!-- 
    This query will be run once for every row returns by “main” query. The placeholder “?” will be replaces with the value of column “i_chronicle_id” from “main” query. Every row returned by this query will correspond as a new object in MC database.
    -->
    <query type="versions" name = "fme document versions" key="r_object_id" parentid = "i_antecedent_id">
        select r_object_id, decode(i_antecedent_id, '0000000000000000', null,i_antecedent_id) as i_antecedent_id
        from rep1.dm_document_sp
        WHERE i_chronicle_id=?
    </query>
    <!-- 
    This query will be run once for every row returns by “versions” query. The placeholder “?” will be replaces with the value of column “r_object_id” (specified in “key” attribute) from “versions” query. All values returned by this query will be added as source attributes in MC database.
    -->
    <query type="version-metadata">
        select object_name, title, i_antecedent_id from rep1.dm_document_sp where r_object_id = ?
    </query>
    <query type="version-metadata">
        select keywords, r_version_label from rep1.dm_document_rp where r_object_id = ?
    </query>
    <query type="version-metadata" contentpath="set_file">
        select doc.R_OBJECT_ID, content.set_file
        from rep1.dm_document_sp doc, rep1.dmr_content_sp content
        where doc.I_CONTENTS_ID = content.r_object_id
        and doc.R_OBJECT_ID =?
    </query>
</queries>

The scanner will validate the queries configuration file against the following xml schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="queries">
        <xs:complexType>
            <xs:sequence>
                <xs:element maxOccurs="unbounded" name="query">
                    <xs:complexType>
                        <xs:simpleContent>
                            <xs:extension base="xs:string">
                                <xs:attribute name="type" type="QueryType" use="required" />
                                <xs:attribute name="name" type="xs:string" use="optional" />
                                <xs:attribute name="key" type="xs:string" use="optional" />
                                <xs:attribute name="versionid" type="xs:string" use="optional" />
                                <xs:attribute name="parentid" type="xs:string" use="optional" />
                                <xs:attribute name="contentpath" type="xs:string" use="optional" />
                            </xs:extension>
                        </xs:simpleContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:simpleType name="QueryType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="main"/>
            <xs:enumeration value="versions"/>
            <xs:enumeration value="main-metadata"/>
            <xs:enumeration value="version-metadata"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

Delta migration

If the parameter “scanUpdates” is not checked, the scanner will only scan new objects or new versions of existing objects and it will ignore all objects that were previously scanned, i.e. that already exist in the MC database (based on their ID in source system).

If the parameter “scanUpdates” is checked, the scanner will scan new objects or new versions of existing objects and it will detect if it should scan existing objects as updates. If an object will be scanned as an update depends on several factors:

  • If there are no attributes specified in the “deltaFields” parameter, the scanner will scan every object that already exists in the MC database as an update.

  • If one or multiple attributes are specified in the “deltaFields” parameter, the scanner will scan an object that was previously scanned as an update only if a value of a delta field in the source database is different than the corresponding value in the MC database. If all values (of all fields defined in “deltaFields”) in the source database match the values in the MC database, then the object will not be scanned as an update, it will just be ignored.

The field names in the “deltaFields” are case sensitive so you should define them as they are scanned by the scanner.

Scanning BLOB/CLOB content

Since version 3.2.8 Update 2 of migration-center the database connector supports extracting BLOB or CLOB content from databases. In order for this feature to work the main-content or version-content query type must be specified in the query file.

The structure of the queries is as follows:

<query type="main-content">
    SELECT [nameOfBlobColumn] AS BLOB_CONTENT, [nameOfFile] AS FILE_NAME, [nameOfExtenstion] AS FILE_EXTENSION FROM MYTABEL where ID = ?
</query>
<query type="main-content">
    SELECT [nameOfClobColumn] AS CLOB_CONTENT, [nameOfFile] AS FILE_NAME, [nameOfExtenstion] AS FILE_EXTENSION FROM MYTABEL where ID = ?
</query>

The column for BLOB/CLOB is mandatory and it must have the alias BLOB_CONTENT or CLOB_CONTENT so the scanner is able to process the value correctly and extract the content.

The columns [nameOfFile] and [nameOfExtenstion] are optional. In case they are set, the scanner will use the values to build the name of the file where the content will be exported. The scanner avoids overwriting exiting files by adapting the file name to be unique in the export folder. Also, the characters that area not allowed in the file name will be replaced. Filename and extensions are also saved as source attributes so they can be used in the transformation engine. For avoiding name conflicts with the other attributes, the user can set aliases for these columns. If the columns [nameOfFile] and [nameOfExtenstion] are missing the scanner will use the value of the id in source system as a filename.

Multiple contents for a single object are allowed. If the query returns multiple rows the content returned by every row is exported. All the paths will be stored in the source attribute “mc_content_location” and the path from the last returned row is set as primary content in the column “content_location”.

Example 3: scanning BLOB content

<?xml version="1.0" encoding="UTF-8"?>
<queries>
    <query type="main" name="Blob" key="ID" versionid="Version_ID">
        select distinct ID, Version_ID from BLOB_TABEL
    </query>
    <query type="versions" name="Blob " key="ID">
        select ID, version_id
        from BLOB_TABEL
        WHERE Version_id=?
    </query>
    <query type="version-metadata">
        select Filename, Extension, Version_label from BLOB_TABEL where ID = ?
    </query>
    <query type="version-content">
        select Blob_content as BLOB_CONTENT, filename as FILE_NAME, extension as FILE_EXTENSION FROM BLOB_TABEL where ID = ?
    </query>
</queries>

Scanning Excel files

Although it is still possible to use the database scanner to scan Excel files, we recommend to use our CSV & Excel scanner instead. The CSV & Excel scanner is faster and there are no tweaks necessary to get the JDBC-ODBC bridge working in Java later 1.7.

Prerequisites and Configuration

To scan excel files MS Office or at least Excel ODBC drivers need to be installed on the Job Server machine. To install the Excel ODBC driver one of the easiest ways is to download and install the 32bit version of the Microsoft Access Database Engine 2010 Redistributable that includes the necessary drivers and which can be downloaded from Microsoft’s website.

The connectionURL for should have the following format:

jdbc:odbc:DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=PATH-TO-EXCEL-FILE;ReadOnly=1

Note that all extensions (*.xls, *.xlsx, *.xlsm, *.xlsb) are required.

Example 4: XML Queries for scanning Excel files

<?xml version="1.0" encoding="UTF-8"?>
<queries>
    <query type="main" name="main query" key="ID">
        select "Column1" as ID from [Sheet1$]
    </query>
    <query type="main-metadata" name="metadata query">
        select * from [Sheet1$] where "Column1" = ?
    </query>
</queries>

Solving Null values for some fields

When a column in the sheet contains mixed data (numbers and strings) the ODBC driver uses the first few rows to detect the column type. If the first values are numbers, the ODBC driver tries to handle all the subsequent values as numbers as well and therefore the values that are strings (not numbers) will be ignored, resulting in a “null” value being scanned by migration-center.

To avoid this problem, the parameter “IMEX=1” must be set in the metadata query:

<query type="main-metadata" name="metadata query">
    select * from [Excel8.0;HDR=YES;IMEX=1;Database=PATH-TO-EXCEL-FILE].[sheet1$] where "id" = ?
</query>

Last updated