Filesystem Scanner

Introduction

You can use our Filesystem Scanner in several use cases, e.g. to scan files from file repositories or to scan files exported into a filesystem from a DMS or other third-party system.

Quick guide to basic migration tasks

Basic configuration

A basic scanner configuration consists of a list of folders to be scanned. Local paths and UNC paths are supported. The scanner scans all files located inside the given folders and their subfolders. Common Windows attributes such as filename, file path, creation date, modification date and content size are extracted and saved in the migration-center database as metadata.
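To make the description above more concrete, the following Python sketch walks a list of folders and collects comparable file system attributes. It only illustrates the idea and is not the scanner's implementation; the function name scan_folders and the dictionary keys are hypothetical.

```python
import os
from datetime import datetime


def scan_folders(folders):
    """Yield basic file system attributes for every file below the given folders."""
    for folder in folders:
        # os.walk descends into all subfolders of each configured folder.
        for dirpath, _dirnames, filenames in os.walk(folder):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                info = os.stat(full_path)
                yield {
                    "filename": filename,
                    "file_path": full_path,
                    "content_size": info.st_size,
                    "modify_date": datetime.fromtimestamp(info.st_mtime),
                    # On Windows st_ctime has traditionally held the creation
                    # time; on Unix it is the time of the last metadata change.
                    "creation_date": datetime.fromtimestamp(info.st_ctime),
                }
```

Local and UNC paths can both be passed in the folders list, since os.walk accepts either form on Windows.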

To scan folders as distinct objects in migration-center, the scanFolders flag needs to be checked. In this case all subfolders of the given folders will be saved in the migration-center database together with their metadata.

Enriching the content using metadata from external XML files

Additional metadata stored in external files can be used to enrich the files and folders originating from the file system. Such a metadata file needs to follow the XML structure used by migration-center and adhere to the naming convention expected by migration-center. The format for such a file is described below.

Although the files’ contents are XML, the extension of such metadata files can be arbitrary. Using xml as the extension is NOT recommended, in order to prevent potential conflicts with actual files using the XML extension. The extension that migration-center should treat as identifying metadata files can be specified in the mcMetadataFileExtension parameter of the Filesystem Scanner. If the option has been set and the metadata file for a file or folder cannot be found, an appropriate warning will be logged.

If metadata for files and/or folders is not required in the first place, clear the mcMetadataFileExtension parameter to disable additional metadata processing entirely. If some files require additional metadata and others don’t, configure the mcMetadataFileExtension parameter as if all files had metadata. In this case it is safe to ignore the warnings related to missing metadata files for the documents where metadata files are not required or available.

Metadata file naming

One metadata file should be available in the source system for each file and folder which is supposed to have additional metadata provided by such means. The naming for the metadata file has to follow a simple rule:

for files: filename.extension.metadataextension

for folders: .foldername.metadataextension

E.g.: If the document is named report.pdf, and the extension for the metadata files is determined to be fme, then the metadata file for this document needs to be called report.pdf.fme and fme has to be entered as the value for the mcMetadataFileExtension parameter.

If the folder is named Migration, the metadata file for it must be named .Migration.fme.
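The naming rule can be expressed as a small helper. This is a minimal Python sketch; the function name metadata_file_name is hypothetical and only derives the expected file name, not its location.

```python
def metadata_file_name(object_name: str, metadata_extension: str,
                       is_folder: bool = False) -> str:
    """Return the metadata file name expected for a file or folder name."""
    # Accept the extension with or without a leading dot.
    extension = metadata_extension.lstrip(".")
    if is_folder:
        # Folders: .foldername.metadataextension
        return f".{object_name}.{extension}"
    # Files: filename.extension.metadataextension
    return f"{object_name}.{extension}"
```

For the examples above, metadata_file_name("report.pdf", "fme") gives report.pdf.fme and metadata_file_name("Migration", "fme", is_folder=True) gives .Migration.fme.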

Metadata file contents

A sample metadata file’s XML structure is illustrated below. The sample content could belong to the report.pdf.fme file mentioned above. In this case the report.pdf file has four attributes, each attribute being defined as a name-value pair. There are five lines because one of the attributes is a multi-value attribute. Multi-value attributes are represented by repeating the attribute element with the same name, but a different value attribute (i.e. the keywords attribute is listed twice, but with different values).

<?xml version="1.0" encoding="UTF-8" ?>
<contentattributes>
    <attribute name="keywords" value="Benchmark" />
    <attribute name="keywords" value="Technical" />
    <attribute name="reference_period" value="26.11.2001" />
    <attribute name="reference_period_from" value="26.11.2001" />
    <attribute name="reference_period_to" value="01.01.2100" />
</contentattributes>

The number, name and values of attributes defined in such a file are not subject to any restrictions and can be chosen freely. The value of the name attribute will appear accordingly as a source attribute in migration-center.
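Metadata files with this structure can be generated by a script when exporting from a third-party system. The following Python sketch uses only the standard library; the function name build_metadata_xml is hypothetical, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(attributes):
    """Build metadata file content from (name, value) pairs.

    Repeat a name with different values to produce a multi-value attribute.
    """
    root = ET.Element("contentattributes")
    for name, value in attributes:
        ET.SubElement(root, "attribute", name=name, value=value)
    ET.indent(root, space="    ")  # pretty-print; Python 3.9+
    # Returns bytes that start with an XML declaration naming UTF-8.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


content = build_metadata_xml([
    ("keywords", "Benchmark"),
    ("keywords", "Technical"),
    ("reference_period", "26.11.2001"),
])
```

The returned bytes can then be written to a file such as report.pdf.fme next to the document.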

If the metadata file does not have the expected XML structure, the scanner will use the XSLT file specified in metadataXsltFile, which must be provided before processing, to extract the additional metadata from the file.

Multi-value attributes can be defined by repeating the attribute element with the same name, but different value attribute.

Once the document and any additional metadata have been scanned, migration-center no longer differentiates between attributes originating from different sources. Attributes resulting from metadata files will appear alongside the regular attributes extracted from the file system properties but they are prefixed with “xml_“. The full transformation functionality is available for these attributes.
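As an illustration of how the sample metadata file maps to source attributes, the following Python sketch reads the XML and collects the values under prefixed names. It is not the scanner's code; the function name read_source_attributes is hypothetical, and representing every attribute as a list of values is an assumption made for simplicity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<contentattributes>
    <attribute name="keywords" value="Benchmark" />
    <attribute name="keywords" value="Technical" />
    <attribute name="reference_period" value="26.11.2001" />
    <attribute name="reference_period_from" value="26.11.2001" />
    <attribute name="reference_period_to" value="01.01.2100" />
</contentattributes>"""


def read_source_attributes(xml_bytes, prefix="xml_"):
    """Map each attribute name (with prefix) to the list of its values."""
    values = defaultdict(list)
    for attribute in ET.fromstring(xml_bytes).iter("attribute"):
        # Repeated names accumulate into one multi-value attribute.
        values[prefix + attribute.get("name")].append(attribute.get("value"))
    return dict(values)


source_attributes = read_source_attributes(SAMPLE)
```

Here the keywords attribute ends up as xml_keywords with two values, while the other three attributes each hold a single value.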

In case date/time type values are included in the metadata file, the date/time formats used must comply with the date/time pattern defined for migration-center during installation. For more information see the Installation Guide.

Extracting extended metadata from the content of files

In addition to metadata obtained from the standard file system properties and metadata added via external metadata files, the Filesystem Scanner can also extract metadata from supported document formats. This type of metadata is called extended metadata. The corresponding functionality can be toggled in the Filesystem Scanner via the “scanExtendedMetadata” parameter.

The scanner can parse the following file formats for metadata:

  • HyperText Markup Language

  • XML and derived formats

  • Microsoft Office document formats

  • OpenDocument Format

  • Portable Document Format

  • Electronic Publication Format

  • Rich Text Format

  • Compression and packaging formats

  • Text formats

  • Audio formats

  • Image formats

  • Video formats

  • Java class files and archives

  • The mbox format

Metadata extracted from these files will be added to the respective document’s metadata. Extended metadata source attributes will be prefixed with “ext_“ to indicate their source. Apart from their naming, these attributes are not handled differently by migration-center. Just as with attributes provided via metadata files, extended attributes will appear and work alongside the standard file system attributes. The full transformation functionality is available for these attributes.

Working with Versions

Implicit versioning

Although there aren’t any versioning features available in a standard file system, the Filesystem Scanner can detect when source objects in the file system have changed and can version these upon import to Documentum. This behaviour is controlled by the "scanChangedFilesBehaviour" parameter of the scanner, which can take the following values:

  • 1 (default) - the changed file will be added as an update object, meaning that the existing object in migration-center will be updated (i.e. overwritten) with the new attributes of the modified object.

  • 2 - the changed file will be added as a new version of the existing object. This means that a new version of the document will be created, its parent will be set to the previous version and the level in the version tree will be incremented by 1.

  • 3 - the changed file will be added as a new object that is not related in any way to the previously existing object in migration-center. If the user does not change the object’s name in migration-center, the document is imported in the target repository with the same name and linked under the same folder as the original object.

A file is detected as changed if either its content or its metadata file has been modified since the previous scan.

A folder is detected as changed if its metadata file has been modified since the previous scan. In this case it is saved as an update.
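The change detection rules and the three behaviours can be modelled as follows. This Python sketch is a simplified model, not the scanner's implementation: the ScannedObject class and both function names are hypothetical, and comparing modification timestamps against the time of the previous scan is an assumption about how a change might be detected.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class ScannedObject:
    # Hypothetical stand-in for an object stored in migration-center.
    name: str
    version_level: int = 1
    parent: Optional["ScannedObject"] = None


def is_changed(last_scan: datetime,
               metadata_modified: Optional[datetime],
               content_modified: Optional[datetime] = None) -> bool:
    """Files pass both timestamps; folders pass only the metadata file's."""
    stamps = [t for t in (metadata_modified, content_modified) if t is not None]
    return any(t > last_scan for t in stamps)


def handle_changed_file(existing: ScannedObject,
                        behaviour: int = 1) -> Tuple[str, ScannedObject]:
    """Model the three values of scanChangedFilesBehaviour."""
    if behaviour == 1:
        # Update: the existing object is overwritten with the new attributes.
        return "update", existing
    if behaviour == 2:
        # New version: parent is the previous version, level goes up by 1.
        return "version", ScannedObject(existing.name,
                                        existing.version_level + 1,
                                        parent=existing)
    if behaviour == 3:
        # New object: same name, no relation to the existing object.
        return "new", ScannedObject(existing.name)
    raise ValueError(f"unsupported scanChangedFilesBehaviour: {behaviour}")
```

A changed folder would always follow the "update" path, since folders are saved as updates regardless of the parameter.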

Explicit versioning

Filesystem Scanner can also generate versioning information from attribute values provided through additional metadata (e.g. via the fme metadata files).

The Filesystem Scanner offers two new parameters which can be set to the names of the source attributes containing the explicit versioning information the scanner should use.

The two parameters are versionIdentifierAttribute and versionLevelAttribute. They should be used together, and work as follows:

  • versionIdentifierAttribute specifies the name of the source attribute which identifies a version group/tree. Setting this parameter will activate the versioning based on metadata. Must be used together with versionLevelAttribute. The specified source attribute’s value must be the same for all objects that are part of a version group/tree. Any string value is permitted as long as it fulfills the previous requirement. Note: The attribute name must be prefixed with xml_, e.g. xml_vid if the attribute containing the value in the external metadata file is called vid.

  • versionLevelAttribute specifies the name of the source attribute which identifies the order of objects within a group of versions. Must be used together with versionIdentifierAttribute. The specified source attribute’s values must be distinct for all objects within the same version group/tree, i.e. with the same versionIdentifierAttribute value. The specified source attribute’s values must be positive numbers. A decimal point followed by one or more digits is also permitted, as long as the value makes sense as a number.

Setting these parameters to attributes containing valid information will allow the Filesystem Scanner to build internal information describing how the scanned objects should be linked together to form versions, similar to how the scanners targeting dedicated DMS can extract the native version information from there. This information can then be understood and processed by migration-center importers which support versioning for their respective target systems.
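The grouping and ordering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scanner's algorithm: the function name build_version_trees and the attribute name xml_vlevel are hypothetical, and returning None for missing, non-positive or duplicate values is an assumption (the documentation only states that such values are not valid).

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def build_version_trees(objects, id_attr="xml_vid", level_attr="xml_vlevel"):
    """Group objects by version identifier and order them by version level.

    objects is a list of dicts of source attributes. Returns a dict mapping
    each identifier to its ordered objects, or None if explicit versioning
    cannot be applied to this set of objects.
    """
    groups = defaultdict(list)
    for obj in objects:
        if id_attr not in obj or level_attr not in obj:
            return None
        try:
            level = Decimal(obj[level_attr])
        except InvalidOperation:
            return None  # the level is not a number
        if not level.is_finite() or level <= 0:
            return None  # levels must be positive numbers
        groups[obj[id_attr]].append((level, obj))

    trees = {}
    for identifier, members in groups.items():
        levels = [level for level, _ in members]
        if len(set(levels)) != len(levels):
            return None  # levels must be distinct within one version tree
        ordered = sorted(members, key=lambda member: member[0])
        trees[identifier] = [obj for _, obj in ordered]
    return trees
```

Decimal is used instead of float so that levels such as 1.10 and 1.1 are compared exactly as numbers.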

Limitations

  • If invalid attribute names are provided for versionIdentifierAttribute and versionLevelAttribute, the explicit versioning is not applied.

  • If the value of the attribute specified in versionLevelAttribute is not a number for one or more scanned documents, the explicit versioning is not applied to any of the objects in the scanner run.

  • Before 3.16, the explicit versioning was only applied to the objects in the current scanner run. Since 3.16, new versions can be added to version trees that were created by previous scanner runs. However, this only applies if all scanner runs belong to the same scanner.

Filesystem Scanner Properties

To create a new Filesystem Scanner job, specify the respective adapter type in the Scanner Properties window – from the list of available adapters “Filesystem” must be selected. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type, in this case the Filesystem adapter’s.

The Properties window of a scanner can be accessed by double-clicking a scanner in the list, or selecting the Properties button/menu item from the toolbar/context menu.

A detailed description is always displayed at the bottom of the window for the currently selected parameter.

The maximum length of a path for a file system object is now 512 bytes, up from 255 bytes used in previous versions of migration-center. All maximum supported string lengths are specified in bytes. This equals the number of characters as long as the characters are single-byte characters (i.e. basic Latin characters). For multi-byte characters (as used by most languages and scripts having other than the basic Latin characters) the limit might correspond to fewer characters, depending on the number and byte length of multi-byte characters within the string (as used in UTF-8 encoding).
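The difference between characters and bytes can be checked before scanning. The following Python sketch measures a path in UTF-8 bytes; the function names and the constant MAX_PATH_BYTES are hypothetical, with the limit of 512 taken from the paragraph above.

```python
MAX_PATH_BYTES = 512  # limit stated above


def path_length_in_bytes(path: str) -> int:
    """Return the length of the path when encoded as UTF-8."""
    return len(path.encode("utf-8"))


def fits_path_limit(path: str, limit: int = MAX_PATH_BYTES) -> bool:
    """Check whether the UTF-8 encoded path stays within the byte limit."""
    return path_length_in_bytes(path) <= limit
```

For example, report.pdf is 10 characters and 10 bytes, while Übersicht.pdf is 13 characters but 14 bytes, because Ü takes two bytes in UTF-8.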

Common scanner parameters

Filesystem scanner parameters
