Database Scanner
Introduction
The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements.
Note that the Database Scanner can extract content from a database or use the content files which are specified by the user during the transformation process via the mc_content_location system attribute.
Exporting objects from a database
The Database scanner can access SQL compliant databases and extract information via user specified SQL SELECT statements. Since many content management systems rely on some type of SQL compliant database to manage their data, the Database scanner can also be used as a generic interface for extracting information from unsupported/obsolete/custom built content management systems. The types of database management systems supported are not limited to any particular vendor or brand, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.
Some common content management system features supported by migration-center Database scanner are metadata, including system metadata such as permission information, owner, creator, content path, etc, as well as version information, as long as these types of information can be obtained from the respective system’s SQL database. The information extracted by the database scanner is stored in the migration-center database and can be processed, transformed, validated and imported just like any other type of scanned information.
Note that the Database Scanner can extract content from a database stored in BLOB/CLOB fields. Alternatively, the content files corresponding to the objects can be specified by the user during the transformation process via the mc_content_location system attribute.
Depending on the way the content is stored, it may be necessary to extract the content to the filesystem first by other means before migration-center can process it. For the mc_content_location system attribute any of the available transformation functions can be used, so it is easy to generate a value resembling a path pointing to the location of the object’s content file, which a migration-center importer can use to import the content. A good practice would be to export content files to the filesystem using the object’s unique identifier as the filename, and then build the path information based on the known path and the objects unique identifier. This location would need to be accessible to the Job Server running the import which will migrate the content to the new target system.
Database Scanner Configuration
Configuring the JDBC driver
The types of database management systems supported are not limited to any particular vendors or brands, as access happens via JDBC, hence any SQL compliant database having a (compatible) JDBC adapter available can be accessed and queried.
Before using the database scanner, the appropriate JDBC driver needs to be installed and configured for the Job Server. To configure a JDBC driver all required Java jar files and Java library paths need to be added to the "... \migration-center Server Components \jdbc.conf" file. The file can be edited with any text editor and contains all information needed for configuration as each setting is described in the file itself.
The JDBC driver for Oracle 11g is delivered with the Database scanner already preconfigured and ready to use.
Scanner configuration
To create a new database Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available connectors, “Database” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Database Scanner.
The Properties window of a scanner can be accessed by double-clicking a scanner in the list or selecting the Properties button or entry from the toolbar or context menu.
Common scanner parameters
Configuration parameters
Values
Name
Enter a unique name for this scanner
Mandatory
Adapter type
Select the “Database” connector from the list of available connectors
Mandatory
Location
Select the Job Server location where this job should be run. Job Servers are defined in the Jobserver window. If no Job Server migration-center will prompt the user to define a Job Server Location when saving the Importer.
Mandatory
Description
Enter a description for this job (optional)
Database scanner parameters
Configuration parameters
Values
connectionURL*
The database connection URL that is a string that your DBMS JDBC driver uses to connect to a database. It can contain information such as where to search for the database, the name of the database to connect to, and configuration properties. The exact syntax of a database connection URL is specified by your DBMS.
Example connection strings for some common databases:
jdbc:oracle:thin:@[host][:port]:SID
jdbc:sqlserver://[serverName[\instanceName][:portNumber]][;property=value[;property=value]]
jdbc:mysql://host_name:port/dbname
Example (Excel):
jdbc:odbc:DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=PATH-TO-EXCEL-FILE;ReadOnly=1
Mandatory
driverClass*
The JDBC driver entry class that is the class the implement the interface java.sql.Driver.
Examples:
oracle.jdbc.OracleDriver com.microsoft.sqlserver.jdbc.SQLServerDriver
sun.jdbc.odbc.JdbcOdbcDriver
Mandatory
username*
Database username used for jdbc connection.
Mandatory
password*
Password used for jdbc connection.
Mandatory
queryFile*
The xml file path that contains the SQL queries that will be used by scanner to extract objects and metadata from database.
See the chapter 3.2.3 for more details about configuring queries.
Mandatory
scanUpdates*
Enables the scanner to update previously scanned objects in the mc database. If unticked, previous scanned objects will be skipped.
Mandatory
deltaFields
Contains the fields that will be used for detecting if an object needs to be scanned as an update. If a value of one attribute specified here was changed in the source database after the previous scan was executed, the scanner will scan the object as an update, otherwise it will just ignore the object. Delta fields is taken in consideration only when “scanUpdates” is checked. See 3.3 Delta migration section for more details
computeChecksum
When it's checked the checksum of scanned files will be computed. Useful for determining whether files with different names and from different locations have in fact the same content, as can frequently happen with common documents copied and stored by several users in a file share environment.
Do not enable this option unless necessary, since the performance impact is significant due to the scanner having to read the full content for each and compute the checksum for it.
hashAlgorithm
Specifies the algorithm that will be used to compute the checksum of the scanned objects.
Possible values are "MD2", "MD5", "SHA-1", "SHA-224", "SHA-256", "SHA-384" and "SHA-512". Default value is MD5.
hashEncoding
Specifies the encoding that will be used to compute the Checksum of the scanned objects.
Possible values are "HEX", "Base32" and "Base64". Default value is HEX.
exportLocation
The location where the exported object content should be saved. It can be a job server local folder or a shared folder and it should exist and it should be writable.
loggingLevel*
Sets the verbosity of the log file.
Values:
1 - logs only errors during scan
2 - is the default value reporting all warnings and errors
3 - logs all successfully performed operations in addition to any warnings or errors
4 - logs all events (for debugging only, use only if instructed by fme product support since it generates a very large amount of output. Do not use in production)
Mandatory
Configurations of queries in the “queryFile”
The “queryFile” is an xml file containing a series of SQL queries that will be run sequentially by the scanner for extracting information from the database.
There are two ways to scan objects with metadata:
Using one query that contains both the unique identifier of the rows/objects which need to be exported and the names of the columns that represent the object's metadata.
Using at least two of the queries: one is for selecting the unique identifier of the rows/objects which need to be exported as, and the second query to extract the metadata for each row/object returned by the previous query. Note: If the main query has metadata, those won't be taken into consideration.
If objects have multiple versions and the related information that need to be extracted (i.e. a row/object is a version of another row/object), some additional queries are required.
The number of queries for extracting metadata for objects or versions is not limited. Multiple queries can be specified to gather and extract various information related to objects.
Example 1.1: a simple file with one query
Example 1.2: a simple file with two queries
The xml file is composed for <query> elements that contain the queries that will be run by scanner for extracting objects and metadata. The <query> element may have the following attributes (some of them mandatory):
Attribute name
Description
type
Defines the type of the query. The following values are possible: main, versions, main-metadata, version-metadata.
main – is the query the returns the unique identifier of every object the will be scanned. In case there is no main-metadata query, the main query may contains any number of columns which values will be added as metadata. In this case the query element may contain the contentpath attribute that will indicate the column where the object content location is stored.
If a main-metadata query is specified, only the key column in the main query will be returned, all the other columns being ignored. A valid configuration must contain a single main query definition. If you need to deal with versions, this should also return the unique identifier of every version that will be passed to versions query.
versions – is the query the returns the unique identifier and optionally metadata of every version for the objects returned by the main query. It will take as parameter the value of column specified in attribute versionid defined in the main query. This query will be run once for every row returns by the main query. This also may contain the contentpath attribute that will indicate the column where the version content location is stored.
main-metadata - are the queries that extracts the metadata for main objects. They take as parameter the value return by the main query in the column specified by attribute key. These queries will be run once for every row returns by the main query. You can define an unlimited number of such queries.
version-metadata - are the queries that extracts the metadata for versions. They take as parameter the value return by the versions query in the column specified by attribute key. These queries will be run once for every row returns by the versions query. You can define an unlimited number of such queries.
main-content – is the query that extracts the content for main objects. It takes as parameter the value return by the main query in the column specified by attribute key. The query will be run once for every row returned by the main query.
version-content - is the query that extracts the content for version objects. It takes as parameter the value return by the versions query in the column specified by attribute key. The query will be run once for every row returned by the versions query.
Note: If versions query is present, only the objects returned by this query will be extracted. The main query will be used only for identifying the version trees that will be scanned. In this case the main-metadata queries will be ignored.
Mandatory
name
Defines the name of the query and it will be used for logging purpose.
Optional
key
Defines the column name which value will be stored in MC column id_in_source_system. The value of the column defined in this attribute will be passed as parameter in main-metadata or version-metadata
Mandatory for main and versions queries.
versionid
Defines the column name which value will be passed as parameter to the versions query. It can be defined only for main query.
Optional
parentid
Defines the column name that contains the id of the parent version. This might be used only in case of branches. When not used, the versions will be scanned in the order they are returned by the query.
Optional for versions query.
contentpath
Defines the column name that contains the path where object content is stored. It will be used to populate the MC column "content_location".
Note: The content will not be exported by the scanner in the location specified by this attribute. The value should point to an existing file.
Optional for main-metadata and version-metadata queries.
Example 2: a more complex query file
The scanner will validate the queries configuration file against the following xml schema:
Delta migration
If the parameter “scanUpdates” is not checked, the scanner will only scan new objects or new versions of existing objects and it will ignore all objects that were previously scanned, i.e. that already exist in the MC database (based on their ID in source system).
If the parameter “scanUpdates” is checked, the scanner will scan new objects or new versions of existing objects and it will detect if it should scan existing objects as updates. If an object will be scanned as an update depends on several factors:
If there are no attributes specified in the “deltaFields” parameter, the scanner will scan every object that already exists in the MC database as an update.
If one or multiple attributes are specified in the “deltaFields” parameter, the scanner will scan an object that was previously scanned as an update only if a value of a delta field in the source database is different than the corresponding value in the MC database. If all values (of all fields defined in “deltaFields”) in the source database match the values in the MC database, then the object will not be scanned as an update, it will just be ignored.
The field names in the “deltaFields” are case sensitive so you should define them as they are scanned by the scanner.
Scanning BLOB/CLOB content
Since version 3.2.8 Update 2 of migration-center the database connector supports extracting BLOB or CLOB content from databases. In order for this feature to work the main-content or version-content query type must be specified in the query file.
The structure of the queries is as follows:
The column for BLOB/CLOB is mandatory and it must have the alias BLOB_CONTENT or CLOB_CONTENT so the scanner is able to process the value correctly and extract the content.
The columns [nameOfFile] and [nameOfExtenstion] are optional. In case they are set, the scanner will use the values to build the name of the file where the content will be exported. The scanner avoids overwriting exiting files by adapting the file name to be unique in the export folder. Also, the characters that area not allowed in the file name will be replaced. Filename and extensions are also saved as source attributes so they can be used in the transformation engine. For avoiding name conflicts with the other attributes, the user can set aliases for these columns. If the columns [nameOfFile] and [nameOfExtenstion] are missing the scanner will use the value of the id in source system as a filename.
Multiple contents for a single object are allowed. If the query returns multiple rows the content returned by every row is exported. All the paths will be stored in the source attribute “mc_content_location” and the path from the last returned row is set as primary content in the column “content_location”.
Example 3: scanning BLOB content
Scanning Excel files
Although it is still possible to use the database scanner to scan Excel files, we recommend to use our CSV & Excel scanner instead. The CSV & Excel scanner is faster and there are no tweaks necessary to get the JDBC-ODBC bridge working in Java later 1.7.
Prerequisites and Configuration
To scan excel files MS Office or at least Excel ODBC drivers need to be installed on the Job Server machine. To install the Excel ODBC driver one of the easiest ways is to download and install the 32bit version of the Microsoft Access Database Engine 2010 Redistributable that includes the necessary drivers and which can be downloaded from Microsoft’s website.
The connectionURL for should have the following format:
jdbc:odbc:DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=
PATH-TO-EXCEL-FILE
;ReadOnly=1
Note that all extensions (*.xls, *.xlsx, *.xlsm, *.xlsb) are required.
Example 4: XML Queries for scanning Excel files
Solving Null values for some fields
When a column in the sheet contains mixed data (numbers and strings) the ODBC driver uses the first few rows to detect the column type. If the first values are numbers, the ODBC driver tries to handle all the subsequent values as numbers as well and therefore the values that are strings (not numbers) will be ignored, resulting in a “null” value being scanned by migration-center.
To avoid this problem, the parameter “IMEX=1” must be set in the metadata query:
Last updated