Filesystem Scanner

Introduction

You can use our Filesystem Scanner in several use cases, e.g. to scan files from file repositories or to scan files exported into a filesystem from a DMS or other third-party system. Scanner is the term used in migration-center for an input adapter. Using a scanner module to read the data that needs processing into migration-center is the first step in a migration project, thus scan also refers to the process used to input data to migration-center.

The scanner module works as a job that can be run at any time and can even be executed repeatedly. For every run a detailed history and log file are created.

A scanner is defined by a unique name, a set of configuration parameters and an optional description.

Filesystem Scanners can be created, configured, started and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.

Quick guide to basic migration tasks

Basic configuration

A basic scanner configuration requires setting a list of folders to be scanned. Local paths and UNC paths are supported. The scanner will scan all files located inside the given folders and their subfolders. Common Windows attributes such as filename, file path, creation date, modification date and content size are extracted and saved in the migration-center database as metadata.

To scan folders as distinct objects in migration-center the flag scanFolders needs to be checked. In this case all subfolders of the given folder list will be saved in migration-center database together with their metadata.
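The basic scan described above can be pictured with the following minimal sketch. It is not the scanner's implementation: the function name scan_tree and the attribute keys are hypothetical, and the creation date is left out because the Python standard library does not expose it uniformly across platforms.

```python
import os
from datetime import datetime, timezone


def scan_tree(root, scan_folders=False):
    """Yield one dict of basic file system attributes per object under root."""
    for folder, subfolders, files in os.walk(root):
        names = list(files)
        if scan_folders:
            # Mirrors the scanFolders flag: subfolders become objects of their own.
            names += subfolders
        for name in names:
            path = os.path.join(folder, name)
            info = os.stat(path)
            yield {
                "name": name,
                "path": path,
                "is_folder": os.path.isdir(path),
                "size": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
            }
```

Calling list(scan_tree(some_folder, scan_folders=True)) would return files and subfolders alike, while the default returns files only.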

Enriching the content using metadata from external XML files

Additional metadata stored in external files can be used to enrich the files and folders originating from the file system. Each such file must use the XML format expected by migration-center and adhere to the naming convention expected by migration-center. The format for such a file is described below.

Although the files’ contents are XML, the extension of such metadata files can be arbitrary. Using XML as the extension is NOT recommended, in order to prevent potential conflicts with actual files using the XML extension. The file extension that migration-center should treat as identifying metadata files can be specified in the mcMetadataFileExtension parameter of the Filesystem Scanner. If the option has been set and the metadata file for a file or folder cannot be found, an appropriate warning will be logged.

If metadata files and/or folders are not required in the first place, clear the mcMetadataFileExtension parameter to disable additional metadata processing entirely. If some files require additional metadata and others don’t, configure the mcMetadataFileExtension parameter as if all files had metadata. In this case it is safe to ignore the warnings related to missing metadata files for the documents where metadata files are not required or available.

Metadata file naming

One metadata file should be available in the source system for each file and folder which is supposed to have additional metadata provided by such means. The naming for the metadata file has to follow a simple rule:

for files: filename.extension.metadataextension

for folders: .foldername.metadataextension

Filename and extension are self-explanatory and refer to the filename and extension of the actual document, while metadataextension is the custom extension chosen to identify metadata files and must be specified as the value for the mcMetadataFileExtension parameter, as described in the section above.

E.g.: If the document is named report.pdf, and the extension for the metadata files is determined to be fme, then the metadata file for this document needs to be called report.pdf.fme and fme has to be entered as the value for the mcMetadataFileExtension parameter.

If the folder is named Migration, the metadata file for it must be named .Migration.fme.
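The naming rule can be expressed as a small helper. This is an illustrative sketch only; the function name metadata_file_name is hypothetical, and it returns just the file name because the rule above does not say anything about the directory.

```python
def metadata_file_name(object_name, metadata_extension, is_folder=False):
    """Return the expected metadata file name for a file or folder name."""
    if is_folder:
        # Folders: .foldername.metadataextension
        return f".{object_name}.{metadata_extension}"
    # Files: filename.extension.metadataextension
    return f"{object_name}.{metadata_extension}"


print(metadata_file_name("report.pdf", "fme"))                 # report.pdf.fme
print(metadata_file_name("Migration", "fme", is_folder=True))  # .Migration.fme
```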

Metadata file contents

A sample metadata file’s XML structure is illustrated below. The sample content could belong to the report.pdf.fme file mentioned above. In this case the report.pdf file has four attributes, each defined as a name-value pair. There are five lines because one of the attributes is a multi-value attribute. Multi-value attributes are represented by repeating the attribute element with the same name but a different value attribute (e.g. the keywords attribute is listed twice, with different values).

<?xml version="1.0" encoding="UTF-8" ?>
<contentattributes>
<attribute name="keywords" value="Benchmark" />
<attribute name="keywords" value="Technical" />
<attribute name="reference_period" value="26.11.2001" />
<attribute name="reference_period_from" value="26.11.2001" />
<attribute name="reference_period_to" value="01.01.2100" />
</contentattributes>

The number, name and values of attributes defined in such a file are not subject to any restrictions and can be chosen freely. The value of the name attribute will appear accordingly as a source attribute in migration-center.

If the metadata file does not have the expected XML structure, the scanner will apply the XSLT file specified in the metadataXsltFile parameter to the metadata file before processing it.

Multi-value attributes can be defined by repeating the attribute element with the same name, but different value attribute.

Once the document and any additional metadata have been scanned, migration-center no longer treats attributes differently based on their source. Attributes resulting from metadata files will appear alongside the regular attributes extracted from the file system properties, prefixed with “xml_”. The full transformation functionality is available for these attributes.
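To make the mapping from metadata file to source attributes concrete, the sample file above could be read along the following lines. This is a sketch using Python's standard XML parser, not the scanner's own code; the function name read_metadata is hypothetical, and values are kept as lists so that multi-value attributes such as keywords are preserved.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<contentattributes>
<attribute name="keywords" value="Benchmark" />
<attribute name="keywords" value="Technical" />
<attribute name="reference_period" value="26.11.2001" />
<attribute name="reference_period_from" value="26.11.2001" />
<attribute name="reference_period_to" value="01.01.2100" />
</contentattributes>
"""


def read_metadata(xml_bytes, prefix="xml_"):
    """Return {prefixed attribute name: [values]} for one metadata file."""
    attributes = defaultdict(list)
    root = ET.fromstring(xml_bytes)
    for element in root.iter("attribute"):
        # Repeated names accumulate into one multi-value attribute.
        attributes[prefix + element.get("name")].append(element.get("value"))
    return dict(attributes)


metadata = read_metadata(SAMPLE)
print(metadata["xml_keywords"])  # ['Benchmark', 'Technical']
```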

In case date/time type values are included in the metadata file, the date/time formats used must comply with the date/time pattern defined for migration-center during installation. For more information see the Installation Guide.

Extracting extended metadata from the content of files

In addition to metadata obtained from the standard file system properties and metadata added via external metadata files, the Filesystem Scanner can also extract metadata from supported document formats. This type of metadata is called extended metadata. The corresponding functionality can be toggled in the Filesystem Scanner via the “scanExtendedMetadata” parameter.

The scanner can parse the following file formats for metadata:

  • HyperText Markup Language

  • XML and derived formats

  • Microsoft Office document formats

  • OpenDocument Format

  • Portable Document Format

  • Electronic Publication Format

  • Rich Text Format

  • Compression and packaging formats

  • Text formats

  • Audio formats

  • Image formats

  • Video formats

  • Java class files and archives

  • The mbox format

Metadata extracted from these files will be added to the respective document’s metadata. Extended metadata source attributes will be prefixed with “ext_” to indicate their source. Apart from their naming, these attributes are not handled differently by migration-center. Just as with attributes provided via metadata files, extended attributes will appear and work alongside the standard file system attributes. The full transformation functionality is available for these attributes.

Working with Versions

Implicit versioning

Although there aren’t any versioning features available in a standard file system, the Filesystem Scanner can detect when source objects in the file system have changed and can version these upon import to Documentum. This can be configured through the "scanChangedFilesBehaviour" parameter of the scanner. This parameter can take the following values:

  • 1 (default) - the changed file will be added as update object, meaning that the existing object in migration-center will be updated (i.e. overwritten) with the new attributes of the modified object.

  • 2 - the changed file will be added as a new version of the existing object. This means that a new version of the document will be created, its parent will be set to the previous version and the level in version tree will be incremented by 1.

  • 3 - the changed file will be added as a new object, and is not related in any way to the previous existing object in migration-center. If the user does not change the object’s name in migration-center, the document is imported in the target repository with the same name and linked under the same folder as the original object.

A file is detected as changed if either its content or its metadata file has been modified since the previous scan.

A folder is detected as changed if its metadata file has been modified since the previous scan. In this case it is saved as an update.

Explicit versioning

Filesystem Scanner can also generate versioning information from attribute values provided through additional metadata (e.g. via the fme metadata files).

The Filesystem Scanner offers two new parameters which can be set to the names of the source attributes containing the explicit versioning information the scanner should use.

The two parameters are versionIdentifierAttribute and versionLevelAttribute. They should be used together, and work as follows:

  • versionIdentifierAttribute specifies the name of the source attribute which identifies a version group/tree. Setting this parameter will activate the versioning based on metadata. Must be used together with versionLevelAttribute. The specified source attribute’s value must be the same for all objects that are part of a version group/tree. Any string value is permitted as long as it fulfills the previous requirement. Note: The attribute name must be prefixed with xml_, e.g. xml_vid if the attribute containing the value in the external metadata file is called vid.

  • versionLevelAttribute specifies the name of the source attribute which identifies the order of objects within a group of versions. Must be used together with versionIdentifierAttribute. The specified source attribute’s values must be distinct for all objects within the same version group/tree, i.e. with the same versionIdentifierAttribute value. The specified source attribute’s values must be positive numbers. A decimal point followed by one or more digits is also permitted, as long as the value makes sense as a number.

Setting these parameters to attributes containing valid information will allow the Filesystem Scanner to build internal information describing how the scanned objects should be linked together to form versions, similar to how the scanners targeting dedicated DMS can extract the native version information from there. This information can then be understood and processed by migration-center importers which support versioning for their respective target systems.
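The grouping and ordering implied by the two parameters can be sketched as follows. This is an illustration under assumptions, not the scanner's internal logic: the function name build_version_trees is hypothetical, xml_vid and xml_version stand in for whatever attribute names are configured, and the sketch simply raises an error on invalid data, whereas the scanner's actual handling is described under Limitations.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def build_version_trees(objects, id_attr="xml_vid", level_attr="xml_version"):
    """Group scanned objects into version trees ordered by version level."""
    trees = defaultdict(list)
    for obj in objects:
        try:
            version_id = obj[id_attr]
            level = Decimal(obj[level_attr])
        except (KeyError, TypeError, InvalidOperation) as exc:
            raise ValueError(f"missing or non-numeric version data: {obj!r}") from exc
        if not level.is_finite() or level <= 0:
            raise ValueError(f"version level must be a positive number: {obj!r}")
        if any(level == other for other, _ in trees[version_id]):
            raise ValueError(f"duplicate level {level} in version tree {version_id!r}")
        trees[version_id].append((level, obj))
    # Order every tree by its numeric level, lowest version first.
    return {
        version_id: [obj for _, obj in sorted(members, key=lambda m: m[0])]
        for version_id, members in trees.items()
    }
```

Levels are compared as numbers rather than strings, so "1.5" sorts between "1" and "2".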

Limitations

  • If invalid attribute names are provided for versionIdentifierAttribute and versionLevelAttribute, explicit versioning is not applied.

  • If the value of the attribute specified in versionLevelAttribute is not a number for one or more scanned documents, explicit versioning is not applied to any of the objects in the scanner run.

  • Before 3.16, explicit versioning was only applied to the objects in the current scanner run. Since 3.16, new versions can be added to version trees that were created by previous scanner runs. However, this only applies if all scanner runs belong to the same scanner.

Filesystem Scanner Properties

To create a new Filesystem Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available adapters “Filesystem” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Filesystem adapter’s.

The Properties window of a scanner can be accessed by double-clicking a scanner in the list, or selecting the Properties button/menu item from the toolbar/context menu.

A detailed description is always displayed at the bottom of the window for the currently selected parameter.

The maximum length of a path for a file system object is now 512 bytes, up from 255 bytes in previous versions of migration-center. All maximum supported string lengths are specified in bytes. This equals the number of characters as long as the characters are single-byte characters (i.e. basic Latin characters). For multi-byte characters (as used by most languages and scripts other than basic Latin), the limit may correspond to fewer characters, depending on the number and byte length of the multi-byte characters within the string (as encoded in UTF-8).
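The difference between character count and byte length can be checked with a few lines of code. This is a minimal sketch assuming UTF-8 encoding as stated above; the names MAX_PATH_BYTES, path_length_in_bytes and fits_path_limit are hypothetical.

```python
MAX_PATH_BYTES = 512  # limit quoted above


def path_length_in_bytes(path):
    """Length of the path once encoded as UTF-8."""
    return len(path.encode("utf-8"))


def fits_path_limit(path, limit=MAX_PATH_BYTES):
    """True if the UTF-8 encoded path does not exceed the byte limit."""
    return path_length_in_bytes(path) <= limit


# "\u00e9" (e with acute accent) is one character but two bytes in UTF-8.
name = "r\u00e9sum\u00e9.pdf"
print(len(name), path_length_in_bytes(name))  # 10 12
```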

Common scanner parameters

Configuration parameters

Values

Name

Enter a unique name for this scanner

Mandatory

Adapter type

Select the “Filesystem” adapter from the list of available adapters

Mandatory

Location

Select the Job Server location where this job should be run. Job Servers are defined in the Jobserver window. If no Job Server has been defined, migration-center will prompt the user to define a Job Server Location when saving the Scanner.

Mandatory

Description

Enter a description for this job (optional)

Filesystem scanner parameters

Configuration parameters

Values

scanFolderPaths*

The folder paths to be scanned.

Can be local paths or network file shares (SMB/Samba)

Multiple values can be entered by separating them with the “|” character. They also can be provided as a list of folder paths, one on each row, stored in a text file. The text file path must start with "@".

Those two methods of providing folder paths are mutually exclusive.

Examples:

scanning a network share and a local path:

\\share1\testfolder|c:\documents

scanning folders provided in a text file:

@C:\MC\folders-to-scan.txt

Note: To scan network file shares the Job Server running the respective Scanner must be configured to run using a domain account that has full read permission for the given network share.

For information about configuring the Job Server to run using a specific account, see Windows Help for configuring services to run using a different user account, since the Job Server runs as a regular Windows service.
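The two ways of supplying folder paths can be illustrated with a short parsing sketch. It is not the scanner's code: the function name parse_scan_folder_paths is hypothetical, and reading the text file as UTF-8 and skipping blank lines are assumptions.

```python
def parse_scan_folder_paths(value):
    """Return the list of folders described by a scanFolderPaths value."""
    value = value.strip()
    if value.startswith("@"):
        # "@" introduces a text file holding one folder path per row.
        with open(value[1:], encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    # Otherwise the value itself is a "|"-separated list of folder paths.
    return [part.strip() for part in value.split("|") if part.strip()]


print(parse_scan_folder_paths(r"\\share1\testfolder|c:\documents"))
```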

excludeFolderPaths

The folders that need to be excluded from the scan. Each folder path to be excluded must be a subpath of one of the folders given in the “scanFolderPaths” parameter and must be specified either as an absolute or as a relative path. A relative path must start with "*".

Examples:

absolute path:

c:\documents\invoices\do-not-migrate

relative path:

*\excludedFolder

The scanner will automatically append each relative path to each of the folders specified in "scanFolderPaths" and add the result to the list of folders to exclude, e.g. if "c:\users\Frank|c:\users\Michael" and "*\do-not-migrate" were specified in "scanFolderPaths" and "excludeFolderPaths" respectively, the scanner would skip the folders "c:\users\Frank\do-not-migrate" and "c:\users\Michael\do-not-migrate".

Multiple values can be entered by separating them with the “|” character.

Note: If the list of excluded folders contains folders that are subfolders of other folders in the same list, these are removed from the list since they are redundant.
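The expansion of relative paths and the removal of redundant subfolders can be sketched as below. This is an illustration only; the function name expand_excluded_folders is hypothetical, and PureWindowsPath is used so that the Windows-style paths from the examples are handled on any platform.

```python
from pathlib import PureWindowsPath


def expand_excluded_folders(scan_paths, exclude_paths):
    """Resolve "*"-relative exclusions and drop redundant subfolders."""
    expanded = []
    for entry in exclude_paths:
        if entry.startswith("*"):
            # A relative exclusion applies to every scanned folder.
            relative = entry[1:].lstrip("\\/")
            expanded.extend(PureWindowsPath(root) / relative for root in scan_paths)
        else:
            expanded.append(PureWindowsPath(entry))
    result = []
    for path in expanded:
        # A folder below another excluded folder is already covered by it.
        redundant = any(other in path.parents for other in expanded)
        if not redundant and path not in result:
            result.append(path)
    return [str(path) for path in result]


print(expand_excluded_folders(
    [r"c:\users\Frank", r"c:\users\Michael"], [r"*\do-not-migrate"]))
```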

excludeFiles

Filename pattern used to exclude certain types of files from scanning. This parameter uses regular expressions.

For example, to exclude all documents that have the extension “txt”, use this regular expression: (.)+\.txt

Use “|” as delimiter if you want to enter multiple exclusion patterns.

Note:

The regular expressions use syntax similar to Perl. For more technical details please read the specific javadocs page at:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html

For more information about regular expressions, please visit http://www.regular-expressions.info
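An exclusion pattern can be tried out before it is entered in the scanner. The sketch below uses Python's re module, whose syntax is close to but not identical with the Java syntax used by the scanner; the function name is_excluded is hypothetical, and matching the pattern against the whole file name is an assumption.

```python
import re


def is_excluded(file_name, exclude_files):
    """True if the whole file name matches the exclusion pattern."""
    if not exclude_files:
        return False
    return re.fullmatch(exclude_files, file_name) is not None


pattern = r"(.)+\.txt"
print(is_excluded("notes.txt", pattern))   # True
print(is_excluded("report.pdf", pattern))  # False
```

Because "|" is also the alternation operator in regular expressions, a value such as (.)+\.txt|(.)+\.tmp behaves in this sketch as "either pattern matches".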

ignoreHiddenFiles

Specifies whether files marked as “Hidden” in the file system should be scanned or not.

scanChangedFilesBehaviour

Specifies the behavior of the scanner when a file update is detected. Accepted values are:

1 – (default) – the changed file will be added as update object

2 – the changed file will be added as a new version

3 – the changed file will be added as a new object

Format: String

For more details please consult chapter Working with Versions

moveFilesToFolder

Set a valid local file system path or UNC path of the folder to which scanned files should be moved. All files which have been scanned successfully will be moved to a cloned folder structure under the configured path.

Example:

scanFolderPaths = c:\source\documents

moveFilesToFolder = c:\moved\documents

The source file c:\source\documents\folderA\document.doc will be moved to c:\moved\documents\<scanRunId>\folderA\document.doc

If this parameter is set and a file is moved by the scanner its contentPath attribute will reference the new location. The importer will use the moved location instead of the source while processing the files.
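The target location from the example above can be derived as in the following sketch. The function name moved_location and its parameters are hypothetical; the placement of the scan run ID directly below the configured folder follows the example, and the value 42 is only a placeholder.

```python
from pathlib import PureWindowsPath


def moved_location(source_file, scan_folder, move_to_folder, scan_run_id):
    """Return the path a scanned file would have after being moved."""
    # Keep the part of the path below the scanned folder ...
    relative = PureWindowsPath(source_file).relative_to(PureWindowsPath(scan_folder))
    # ... and re-create it below <moveFilesToFolder>\<scanRunId>.
    return str(PureWindowsPath(move_to_folder) / str(scan_run_id) / relative)


print(moved_location(
    r"c:\source\documents\folderA\document.doc",
    r"c:\source\documents",
    r"c:\moved\documents",
    42,
))
```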

scanFolders

Boolean. If the flag is checked, folders will be scanned as fully editable and transformable objects. This way custom attributes, object types, owner and permissions can be defined for the folders. Otherwise folders will be retained only as path references from the documents and will be created using default folder settings in Documentum.

mcMetadataFileExtension

The file extension of the XML files which contain extra metadata for the scanned files and folders.

Note: The file extension must be specified without a dot, i.e. just "fme" and not ".fme".

For more details please consult chapter Enriching the content using metadata from external XML files

metadataXsltFile

The path to the XSLT file that should be applied to metadata XML files before processing. Leave it empty if the metadata XML files are already in the expected format.

scanExtendedMetadata

Flag indicating whether extended metadata will be scanned for common document types such as MS Office documents, PDF, etc. Extended metadata is extracted using the Apache Tika library. For more information about all supported formats, please refer to the Apache Tika documentation: http://tika.apache.org/0.9/formats.html

extendedMetadataDateFormats

Can be used to set one or more Java date formats that the scanner will use to detect date attributes in the document content. If empty, the default list of patterns will be used.

ignoredAttributesList

Contains a comma-delimited list of attributes that will be ignored during scanning. Ignoring attributes that are not needed improves performance and saves database storage.

computeChecksum

When checked, the checksum of scanned files will be computed. Useful for determining whether files with different names and from different locations have in fact the same content, as can frequently happen with common documents copied and stored by several users in a file share environment.

Do not enable this option unless necessary, since the performance impact is significant due to the scanner having to read the full content of each file and compute the checksum for it.

hashAlgorithm

Specifies the algorithm that will be used to compute the Checksum of the scanned objects.

Possible values are "MD2", "MD5", "SHA-1", "SHA-224", "SHA-256", "SHA-384" and "SHA-512". Default value is MD5.

hashEncoding

Specifies the encoding that will be used to represent the computed checksum of the scanned objects.

Possible values are "HEX", "Base32" and "Base64". Default value is HEX.
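How the hashAlgorithm and hashEncoding values combine can be illustrated with Python's standard library. This is a sketch, not the scanner's implementation: the function name compute_checksum is hypothetical, the exact output formatting (for example lower-case hex) may differ from the scanner's, and "MD2" may not be available in a given Python build.

```python
import base64
import hashlib

# Maps the hashEncoding values to functions that render the raw digest bytes.
_ENCODERS = {
    "HEX": lambda digest: digest.hex(),
    "Base32": lambda digest: base64.b32encode(digest).decode("ascii"),
    "Base64": lambda digest: base64.b64encode(digest).decode("ascii"),
}


def compute_checksum(path, hash_algorithm="MD5", hash_encoding="HEX",
                     chunk_size=1024 * 1024):
    """Hash a file in chunks and return the digest in the chosen encoding."""
    # "SHA-256" becomes "sha256", the name hashlib expects.
    hasher = hashlib.new(hash_algorithm.replace("-", "").lower())
    with open(path, "rb") as handle:
        # The whole content has to be read, which is why checksums are costly.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return _ENCODERS[hash_encoding](hasher.digest())
```

Two files with different names but identical content yield the same checksum, which is what makes the value usable for detecting duplicates.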

ignoreWarnings

When checked, the following warnings are ignored so that the affected objects will still be scanned:

  • Warning when an xml-metadata file is missing or cannot be read;

  • Warning when "owner name" or "creation date" cannot be extracted;

  • Warning when checksum cannot be computed;

  • Warning when extended metadata cannot be extracted;

versionIdentifierAttribute

Name of the source attribute which identifies a version tree. Setting this parameter will activate the versioning based on metadata. Must be used together with versionLevelAttribute.

The specified source attribute’s value must be the same for all objects that are part of a version group/tree.

Note: The attribute name must be prefixed with xml_, e.g. xml_vid if the attribute containing the value in the external metadata file is called vid.

versionLevelAttribute

Name of the source attribute which identifies the order of objects within a group of versions. Must be used together with versionIdentifierAttribute.

The specified source attribute’s values must be distinct for all objects within the same version group/tree, i.e. with the same versionIdentifierAttribute value.

The specified source attribute’s values must be positive numbers. A decimal point followed by one or more digits is also permitted, as long as the value makes sense as a number.

Note: The attribute name must be prefixed with xml_, e.g. xml_version if the attribute containing the version in the external metadata file is called version.

loggingLevel*

Sets the verbosity of the log file.

Values:

1 - logs only errors during scan

2 - logs all warnings and errors (default)

3 - logs all successfully performed operations in addition to any warnings or errors

4 - logs all events (for debugging only; use only if instructed by fme product support, since it generates a very large amount of output; do not use in production)

Log files

A complete history is available for any Filesystem Scanner job from the respective item’s History window. It is accessible through the History button/menu entry on the toolbar/context menu. The History window displays a list of all runs for the selected job together with additional information, such as the number of processed objects, the start and end time and the status.

Double clicking an entry or clicking the Open button on the toolbar opens the log file created by that run. The log file contains more information about the run of the selected job:

  • Version information of the migration-center Server Components the job was run with

  • The parameters the job was run with

  • Execution Summary that contains the total number of objects processed, the number of documents and folders scanned or imported, the count of warnings and errors that occurred during runtime.

Log files generated by the Filesystem Adapter can be found in the Server Components installation folder of the machine where the job was run, e.g. …\fme AG\migration-center Server Components <Version>\logs

The amount of information written to the log files depends on the setting specified in the ‘loggingLevel’ start parameter for the respective job.