Database Scanner

Introduction

The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements.

Note that the Database Scanner can extract content from a database or use the content files which are specified by the user during the transformation process via the mc_content_location system attribute.

Exporting objects from a database

The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements. Since many content management systems rely on some type of SQL compliant database to manage their data, the Database scanner can also be used as a generic interface for extracting information from unsupported/obsolete/custom built content management systems. The types of database management systems supported are not limited to any particular vendor or brand, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.

Some common content management system features supported by migration-center Database scanner are metadata, including system metadata such as permission information, owner, creator, content path, etc, as well as version information, as long as these types of information can be obtained from the respective system’s SQL database. The information extracted by the database scanner is stored in the migration-center database and can be processed, transformed, validated and imported just like any other type of scanned information.

Note that the Database Scanner can extract content from a database stored in BLOB/CLOB fields. Alternatively, the content files corresponding to the objects can be specified by the user during the transformation process via the mc_content_location system attribute.

Depending on the way the content is stored, it may be necessary to extract the content to the filesystem first by other means before migration-center can process it. For the mc_content_location system attribute any of the available transformation functions can be used, so it is easy to generate a value resembling a path pointing to the location of the object’s content file, which a migration-center importer can use to import the content. A good practice would be to export content files to the filesystem using the object’s unique identifier as the filename, and then build the path information based on the known path and the objects unique identifier. This location would need to be accessible to the Job Server running the import which will migrate the content to the new target system.

Database Scanner Configuration

Configuring the JDBC driver

The types of database management systems supported are not limited to any particular vendors or brands, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.

Before using the database scanner, the appropriate JDBC driver needs to be installed and configured for the Job Server. To configure a JDBC driver all required Java jar files and Java library paths need to be added to the "... \migration-center Server Components \jdbc.conf" file. The file can be edited with any text editor and contains all information needed for configuration as each setting is described in the file itself.

The JDBC driver for Oracle 11g is delivered with the Database scanner already preconfigured and ready to use.

Scanner configuration

To create a new database Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available connectors, “Database” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Database Scanner.

The Properties window of a scanner can be accessed by double-clicking a scanner in the list or selecting the Properties button or entry from the toolbar or context menu.

Common scanner parameters

Configuration parameters

Values

Name

Enter a unique name for this scanner

Mandatory

Adapter type

Select the “Database” connector from the list of available connectors

Mandatory

Location

Select the Job Server location where this job should be run. Job Servers are defined in the Jobserver window. If no Job Server migration-center will prompt the user to define a Job Server Location when saving the Importer.

Mandatory

Description

Enter a description for this job (optional)

Database scanner parameters

Configuration parameters

Values

connectionURL*

The database connection URL that is a string that your DBMS JDBC driver uses to connect to a database. It can contain information such as where to search for the database, the name of the database to connect to, and configuration properties. The exact syntax of a database connection URL is specified by your DBMS.

Example connection strings for some common databases:

  • jdbc:oracle:thin:@[host][:port]:SID

  • jdbc:sqlserver://[serverName[\instanceName][:portNumber]][;property=value[;property=value]]

  • jdbc:mysql://host_name:port/dbname

Example (Excel):

  • jdbc:odbc:DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=PATH-TO-EXCEL-FILE;ReadOnly=1

Mandatory

driverClass*

The JDBC driver entry class that is the class the implement the interface java.sql.Driver.

Examples:

oracle.jdbc.OracleDriver com.microsoft.sqlserver.jdbc.SQLServerDriver

sun.jdbc.odbc.JdbcOdbcDriver

Mandatory

username*

Database username used for jdbc connection.

Mandatory

password*

Password used for jdbc connection.

Mandatory

queryFile*

The xml file path that contains the SQL queries that will be used by scanner to extract objects and metadata from database.

See the chapter 3.2.3 for more details about configuring queries.

Mandatory

scanUpdates*

Enables the scanner to update previously scanned objects in the mc database. If unticked, previous scanned objects will be skipped.

Mandatory

deltaFields

Contains the fields that will be used for detecting if an object needs to be scanned as an update. If a value of one attribute specified here was changed in the source database after the previous scan was executed, the scanner will scan the object as an update, otherwise it will just ignore the object. Delta fields is taken in consideration only when “scanUpdates” is checked. See 3.3 Delta migration section for more details

computeChecksum

When it's checked the checksum of scanned files will be computed. Useful for determining whether files with different names and from different locations have in fact the same content, as can frequently happen with common documents copied and stored by several users in a file share environment.

Do not enable this option unless necessary, since the performance impact is significant due to the scanner having to read the full content for each and compute the checksum for it.

hashAlgorithm

Specifies the algorithm that will be used to compute the checksum of the scanned objects.

Possible values are "MD2", "MD5", "SHA-1", "SHA-224", "SHA-256", "SHA-384" and "SHA-512". Default value is MD5.

hashEncoding

Specifies the encoding that will be used to compute the Checksum of the scanned objects.

Possible values are "HEX", "Base32" and "Base64". Default value is HEX.

exportLocation

The location where the exported object content should be saved. It can be a job server local folder or a shared folder and it should exist and it should be writable.

loggingLevel*

Sets the verbosity of the log file.

Values:

1 - logs only errors during scan

2 - is the default value reporting all warnings and errors

3 - logs all successfully performed operations in addition to any warnings or errors

4 - logs all events (for debugging only, use only if instructed by fme product support since it generates a very large amount of output. Do not use in production)

Mandatory

Configurations of queries in the “queryFile”

The “queryFile” is an xml file containing a series of SQL queries that will be run sequentially by the scanner for extracting information from the database.

There are two ways to scan objects with metadata:

  1. Using one query that contains both the unique identifier of the rows/objects which need to be exported and the names of the columns that represent the object's metadata.

  2. Using at least two of the queries: one is for selecting the unique identifier of the rows/objects which need to be exported as, and the second query to extract the metadata for each row/object returned by the previous query. Note: If the main query has metadata, those won't be taken into consideration.

If objects have multiple versions and the related information that need to be extracted (i.e. a row/object is a version of another row/object), some additional queries are required.

The number of queries for extracting metadata for objects or versions is not limited. Multiple queries can be specified to gather and extract various information related to objects.

Example 1.1: a simple file with one query

<?xml version="1.0" encoding="UTF-8"?>
<!--
This is a simple example the extract the some objects directly from a database that is behind a documentum 
repository using exactly one query.
-->
<queries>
    <!--
    The “type” attribute is mandatory and must have one of the following values: main, versions, main-metadata, version-metadata.
    The “name” attribute is optional and it used only for logging purposes.
    The “key” attribute contains the column name from query that will be stored in MC column: id_in_source_system.
    This query returns all objects ids and aditional metadata that are stored in the folder 'MC32/test'.
    Every row returned by this query will be saved as a new object in migration center-database.
    -->
    <query type="main" name="main query" key="r_object_id">
        select r_object_id, object_name, title, subject from rep1.dm_document_sp where folder('/MC32/test')
    </query>
</queries>

Example 1.2: a simple file with two queries

<?xml version="1.0" encoding="UTF-8"?>
<!--
This is a simple example the extract the some objects directly from a database that is behind a documentum 
repository using two separate queries.
-->
<queries>
    <!--
    The “type” attribute is mandatory and must have one of the following values: main, versions, main-metadata, version-metadata.
    The “name” attribute is optional and it used only for logging purposes.
    The “key” attribute contains the column name from query that will be passed
    to “main-metadata” query and that will be stored in MC column: id_in_source_system.
    This query returns all objects id that are stored in the folder 'MC32/test'.
    This query is run once at the beginning of scanning process.
    Every row returned by this query will be saved as a new object in migration center-database.
    -->
    <query type="main" name="main query" key="r_object_id">
        select r_object_id from rep1.dm_document_sp where folder('/MC32/test')
    </query>
    <!--
    This is the query that extracts metadata for every object returned by the "main" query.
    Any number of "main-metadata" query is allowed.
    ? is a placeholder that will be replaced by scanner at runtime with the key values returned by the main query.
    -->
    <query type="main-metadata" name="metadata query">
        select object_name, title, subject from rep1.dm_document_sp where r_object_id = ?
    </query>
</queries>

The xml file is composed for <query> elements that contain the queries that will be run by scanner for extracting objects and metadata. The <query> element may have the following attributes (some of them mandatory):

Attribute name

Description

type

Defines the type of the query. The following values are possible: main, versions, main-metadata, version-metadata.

main – is the query the returns the unique identifier of every object the will be scanned. In case there is no main-metadata query, the main query may contains any number of columns which values will be added as metadata. In this case the query element may contain the contentpath attribute that will indicate the column where the object content location is stored.

If a main-metadata query is specified, only the key column in the main query will be returned, all the other columns being ignored. A valid configuration must contain a single main query definition. If you need to deal with versions, this should also return the unique identifier of every version that will be passed to versions query.

versions – is the query the returns the unique identifier and optionally metadata of every version for the objects returned by the main query. It will take as parameter the value of column specified in attribute versionid defined in the main query. This query will be run once for every row returns by the main query. This also may contain the contentpath attribute that will indicate the column where the version content location is stored.

main-metadata - are the queries that extracts the metadata for main objects. They take as parameter the value return by the main query in the column specified by attribute key. These queries will be run once for every row returns by the main query. You can define an unlimited number of such queries.

version-metadata - are the queries that extracts the metadata for versions. They take as parameter the value return by the versions query in the column specified by attribute key. These queries will be run once for every row returns by the versions query. You can define an unlimited number of such queries.

main-content – is the query that extracts the content for main objects. It takes as parameter the value return by the main query in the column specified by attribute key. The query will be run once for every row returned by the main query.

version-content - is the query that extracts the content for version objects. It takes as parameter the value return by the versions query in the column specified by attribute key. The query will be run once for every row returned by the versions query.

Note: If versions query is present, only the objects returned by this query will be extracted. The main query will be used only for identifying the version trees that will be scanned. In this case the main-metadata queries will be ignored.

Mandatory

name

Defines the name of the query and it will be used for logging purpose.

Optional

key

Defines the column name which value will be stored in MC column id_in_source_system. The value of the column defined in this attribute will be passed as parameter in main-metadata or version-metadata

Mandatory for main and versions queries.

versionid

Defines the column name which value will be passed as parameter to the versions query. It can be defined only for main query.

Optional

parentid

Defines the column name that contains the id of the parent version. This might be used only in case of branches. When not used, the versions will be scanned in the order they are returned by the query.

Optional for versions query.

contentpath

Defines the column name that contains the path where object content is stored. It will be used to populate the MC column "content_location".

Note: The content will not be exported by the scanner in the location specified by this attribute. The value should point to an existing file.

Optional for main-metadata and version-metadata queries.

Example 2: a more complex query file

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
Contains the queries for selecting objects that will be added as "Database(document)" in migration center
-->
<queries>
    <!--
    “r_object_id” is the unique identifier of the current version and “i_chronicle_id” is the attribute the identify all versions of this document.
    -->
    <query type="main" name="fme documents" key= "r_object_id" versionid="i_chronicle_id">
        select r_object_id, i_chronicle_id from rep1.dm_document_sp where subject like ‘migration-center’
    </query>
    <!-- 
    This query will be run once for every row returns by “main” query. The placeholder “?” will be replaces with the value of column “i_chronicle_id” from “main” query. Every row returned by this query will correspond as a new object in MC database.
    -->
    <query type="versions" name = "fme document versions" key="r_object_id" parentid = "i_antecedent_id">
        select r_object_id, decode(i_antecedent_id, '0000000000000000', null,i_antecedent_id) as i_antecedent_id
        from rep1.dm_document_sp
        WHERE i_chronicle_id=?
    </query>
    <!-- 
    This query will be run once for every row returns by “versions” query. The placeholder “?” will be replaces with the value of column “r_object_id” (specified in “key” attribute) from “versions” query. All values returned by this query will be added as source attributes in MC database.
    -->
    <query type="version-metadata">
        select object_name, title, i_antecedent_id from rep1.dm_document_sp where r_object_id = ?
    </query>
    <query type="version-metadata">
        select keywords, r_version_label from rep1.dm_document_rp where r_object_id = ?
    </query>
    <query type="version-metadata" contentpath="set_file">
        select doc.R_OBJECT_ID, content.set_file
        from rep1.dm_document_sp doc, rep1.dmr_content_sp content
        where doc.I_CONTENTS_ID = content.r_object_id
        and doc.R_OBJECT_ID =?
    </query>
</queries>

The scanner will validate the queries configuration file against the following xml schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="queries">
        <xs:complexType>
            <xs:sequence>
                <xs:element maxOccurs="unbounded" name="query">
                    <xs:complexType>
                        <xs:simpleContent>
                            <xs:extension base="xs:string">
                                <xs:attribute name="type" type="QueryType" use="required" />
                                <xs:attribute name="name" type="xs:string" use="optional" />
                                <xs:attribute name="key" type="xs:string" use="optional" />
                                <xs:attribute name="versionid" type="xs:string" use="optional" />
                                <xs:attribute name="parentid" type="xs:string" use="optional" />
                                <xs:attribute name="contentpath" type="xs:string" use="optional" />
                            </xs:extension>
                        </xs:simpleContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:simpleType name="QueryType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="main"/>
            <xs:enumeration value="versions"/>
            <xs:enumeration value="main-metadata"/>
            <xs:enumeration value="version-metadata"/>
        </xs:restriction>
    </xs:simpleType>
</xs:schema>

Delta migration

If the parameter “scanUpdates” is not checked, the scanner will only scan new objects or new versions of existing objects and it will ignore all objects that were previously scanned, i.e. that already exist in the MC database (based on their ID in source system).

If the parameter “scanUpdates” is checked, the scanner will scan new objects or new versions of existing objects and it will detect if it should scan existing objects as updates. If an object will be scanned as an update depends on several factors:

  • If there are no attributes specified in the “deltaFields” parameter, the scanner will scan every object that already exists in the MC database as an update.

  • If one or multiple attributes are specified in the “deltaFields” parameter, the scanner will scan an object that was previously scanned as an update only if a value of a delta field in the source database is different than the corresponding value in the MC database. If all values (of all fields defined in “deltaFields”) in the source database match the values in the MC database, then the object will not be scanned as an update, it will just be ignored.

The field names in the “deltaFields” are case sensitive so you should define them as they are scanned by the scanner.

Scanning BLOB/CLOB content

Since version 3.2.8 Update 2 of migration-center the database connector supports extracting BLOB or CLOB content from databases. In order for this feature to work the main-content or version-content query type must be specified in the query file.

The structure of the queries is as follows:

<query type="main-content">
    SELECT [nameOfBlobColumn] AS BLOB_CONTENT, [nameOfFile] AS FILE_NAME, [nameOfExtenstion] AS FILE_EXTENSION FROM MYTABEL where ID = ?
</query>
<query type="main-content">
    SELECT [nameOfClobColumn] AS CLOB_CONTENT, [nameOfFile] AS FILE_NAME, [nameOfExtenstion] AS FILE_EXTENSION FROM MYTABEL where ID = ?
</query>

The column for BLOB/CLOB is mandatory and it must have the alias BLOB_CONTENT or CLOB_CONTENT so the scanner is able to process the value correctly and extract the content.

The columns [nameOfFile] and [nameOfExtenstion] are optional. In case they are set, the scanner will use the values to build the name of the file where the content will be exported. The scanner avoids overwriting exiting files by adapting the file name to be unique in the export folder. Also, the characters that area not allowed in the file name will be replaced. Filename and extensions are also saved as source attributes so they can be used in the transformation engine. For avoiding name conflicts with the other attributes, the user can set aliases for these columns. If the columns [nameOfFile] and [nameOfExtenstion] are missing the scanner will use the value of the id in source system as a filename.

Multiple contents for a single object are allowed. If the query returns multiple rows the content returned by every row is exported. All the paths will be stored in the source attribute “mc_content_location” and the path from the last returned row is set as primary content in the column “content_location”.

Example 3: scanning BLOB content

<?xml version="1.0" encoding="UTF-8"?>
<queries>
    <query type="main" name="Blob" key="ID" versionid="Version_ID">
        select distinct ID, Version_ID from BLOB_TABEL
    </query>
    <query type="versions" name="Blob " key="ID">
        select ID, version_id
        from BLOB_TABEL
        WHERE Version_id=?
    </query>
    <query type="version-metadata">
        select Filename, Extension, Version_label from BLOB_TABEL where ID = ?
    </query>
    <query type="version-content">
        select Blob_content as BLOB_CONTENT, filename as FILE_NAME, extension as FILE_EXTENSION FROM BLOB_TABEL where ID = ?
    </query>
</queries>

Scanning Excel files

Although it is still possible to use the database scanner to scan Excel files, we recommend to use our CSV & Excel scanner instead. The CSV & Excel scanner is faster and there are no tweaks necessary to get the JDBC-ODBC bridge working in Java later 1.7.

Prerequisites and Configuration

To scan excel files MS Office or at least Excel ODBC drivers need to be installed on the Job Server machine. To install the Excel ODBC driver one of the easiest ways is to download and install the 32bit version of the Microsoft Access Database Engine 2010 Redistributable that includes the necessary drivers and which can be downloaded from Microsoft’s website.

The connectionURL for should have the following format:

jdbc:odbc:DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=PATH-TO-EXCEL-FILE;ReadOnly=1

Note that all extensions (*.xls, *.xlsx, *.xlsm, *.xlsb) are required.

Example 4: XML Queries for scanning Excel files

<?xml version="1.0" encoding="UTF-8"?>
<queries>
    <query type="main" name="main query" key="ID">
        select "Column1" as ID from [Sheet1$]
    </query>
    <query type="main-metadata" name="metadata query">
        select * from [Sheet1$] where "Column1" = ?
    </query>
</queries>

Solving Null values for some fields

When a column in the sheet contains mixed data (numbers and strings) the ODBC driver uses the first few rows to detect the column type. If the first values are numbers, the ODBC driver tries to handle all the subsequent values as numbers as well and therefore the values that are strings (not numbers) will be ignored, resulting in a “null” value being scanned by migration-center.

To avoid this problem, the parameter “IMEX=1” must be set in the metadata query:

<query type="main-metadata" name="metadata query">
    select * from [Excel8.0;HDR=YES;IMEX=1;Database=PATH-TO-EXCEL-FILE].[sheet1$] where "id" = ?
</query>