InfoArchive Importer

Introduction

InfoArchive is an archiving system from OpenText that conforms to the international OAIS standard (http://de.wikipedia.org/wiki/OAIS).

The InfoArchive Importer provides the necessary functionality for creating Submission Information Packages (SIP) compressed into ZIP format that will be ready to be ingested into an InfoArchive repository. A SIP is a data container used to transport data to be archived from the producer (source application) to InfoArchive. It consists of a SIP descriptor containing packaging and archival information about the package and the data to be archived. Based on the metadata configured in migration-center the InfoArchive Importer will create a valid SIP descriptor (eas_sip.xml) and a valid PDI file (eas_pdi.xml) for every generated SIP.

The supported InfoArchive versions are 3.2 – 20.4. Synchronous ingestion is only supported for version 3.2.

An importer is migration-center’s term for an output connector, used in the last step of the migration process. In the context of the InfoArchive Importer the filesystem is the target location for the migrated data, hence the designation “importer”. The InfoArchive Importer takes data sourced from other systems and processed with migration-center and writes it to the filesystem as ZIP files (SIPs).

This module works as a job that can be run at any time and can even be executed repeatedly. For every run, a detailed history and log file are created. An importer is defined by a unique name, a set of configuration parameters and an optional description.

InfoArchive Importers can be created, configured, started and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.

Known issues & limitations

Use Java 11 for InfoArchive migrations due to the issue below

  • Generating large PDI files while the Jobserver is running with Java 8 may result in incorrect values in the PDI file. This may surface as an XSD validation failure during the import, or it may pass silently (#59214)

  • The includeChildrenOnce feature does not work with versioned child documents when includeChildrenVersions is set to False (#57766)

Staging area free space

The InfoArchive Importer may use a lot of disk space if you import a large number of objects with many attributes each. For each object the importer creates a temporary PDI XML file, which is ultimately compiled into the main PDI XML file. These files are deleted when the import finishes, but while it runs they can consume a lot of disk space.

The default staging area is the %temp% folder, but it can be changed by adding the following line in the wrapper.conf file of the jobserver:

wrapper.java.additional.6=-Djava.io.tmpdir=./myCustomTEMP

The Jobserver service must be reinstalled after adding this line, for the changes to take effect.

Working with the migration-center InfoArchive Object Type

Objects meant to be migrated to InfoArchive using the InfoArchive Importer have their own type in migration-center. This allows migration-center and the user to target aspects and properties specific to InfoArchive.

Migration Sets

Documents targeted at InfoArchive will have to be added to a migration set first. This migration set must be configured to accept objects of type <source object type>ToInfoArchive(document).

Create a new migration set and set the <source object type>ToInfoArchive(document) object type in the Type drop-down. The type of object can no longer be changed after a migration set has been created.

Transformation Rules

Migration sets of type “<source object type>ToInfoArchive(document)” have a number of predefined rules listed under Rules for system attributes in the Transformation Rules window.

The values of system rules prefixed with DSS are used by the InfoArchive Importer to create the SIP descriptor (eas_sip.xml) as shown in the following example:

<sip xmlns="urn:x-emc:ia:schema:sip:1.0">
    <dss>
        <holding>Shoebox</holding>
        <id>IATEST</id>
        <pdi_schema>urn:cosmin:en:xsd:Shoebox.1.0</pdi_schema>
        <production_date>2008-02-23T11:30:38.000</production_date>
        <base_retention_date>2001-02-23T11:30:38.000</base_retention_date>
        <producer>MultiContentFS</producer>
        <entity>Entity</entity>
        <priority>0</priority>
        <application>FS</application>
    </dss>
    <production_date>2017-07-14T16:36:33.616</production_date>
    <seqno>1</seqno>
    <is_last>true</is_last>
    <aiu_count>1</aiu_count>
    <page_count>0</page_count>
    <pdi_hash algorithm="SHA-256" encoding="base64">SxssWk1AZVxI2vfKJ3vOtCrKGdZyDA56mGCcQCIjpXk=</pdi_hash>
    <custom>
        <attributes>
            <attribute name="customAttr1">customValue1</attribute>
            <attribute name="customAttr2">customValue2</attribute>
        </attributes>
    </custom>
</sip>

Every unique combination of the values of the “DSS_” rules together with “target_type” corresponds to one “Data Submission Session (DSS)”. More information about DSS can be found in the InfoArchive configuration guide.
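The grouping into Data Submission Sessions can be sketched as follows; the documents, rule values and the `dss_key` helper are illustrative only and do not reflect the importer’s internal implementation.

```python
from itertools import groupby

# Hypothetical document metadata; in migration-center these values come from
# the "DSS_" system rules and the "target_type" rule of each object.
docs = [
    {"DSS_holding": "Shoebox", "DSS_producer": "FS", "target_type": "Office", "name": "a"},
    {"DSS_holding": "Shoebox", "DSS_producer": "FS", "target_type": "Office", "name": "b"},
    {"DSS_holding": "Shoebox", "DSS_producer": "FS", "target_type": "Tweets", "name": "c"},
]

def dss_key(doc):
    # Every unique combination of DSS_ values plus target_type defines one DSS.
    dss_values = tuple(v for k, v in sorted(doc.items()) if k.startswith("DSS_"))
    return dss_values + (doc["target_type"],)

sessions = {k: [d["name"] for d in g]
            for k, g in groupby(sorted(docs, key=dss_key), key=dss_key)}
print(len(sessions))  # 2 distinct DSS groups
```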

Working with rules and associations is core product functionality and is described in detail in the Client User Guide.

Object Type Definitions

The target types should be defined in migration-center according to the InfoArchive PDI schema definition. The object types are used by the validation engine to validate the attributes before the import phase. They are also used by the importer to generate the PDI file for the SIPs.

Working with object type definitions and defining attributes is core product functionality and is described in detail in the Client User Guide.

Working With Features Specific to InfoArchive

Generation of PDI file

The importer generates the PDI file (eas_pdi.xml) by transforming the structured data from a standard structure based on an XSL file and validating it against an XSD file. An example of the standard structure of the PDI file can be found in Default format of PDI File.

The “pdiSchemasPath” parameter in the importer configuration is used to locate the XSL and XSD files needed for transforming and validating the PDI file (eas_pdi.xml). If this parameter is empty, the eas_pdi.xml file will be created using the standard structure.

If the parameter does contain a value, the user must make sure the XSL and XSD files are present in that path. The names of the XSL and XSD files must match the first value of the system rule “target_type”, otherwise the importer will return an error. For example, if “pdiSchemasPath” is set to “D:\IA\config” and “target_type” has the multiple values “Office,PhoneCalls,Tweets”, then the file names must be “Office.xsl” and “Office.xsd”.
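The file-name resolution can be illustrated with a short sketch; the function name is made up for this example and is not part of the importer.

```python
from pathlib import Path

def resolve_pdi_schemas(pdi_schemas_path, target_type):
    """Derive the XSL/XSD file names from the first value of the
    multi-value "target_type" rule, as described above."""
    first = target_type.split(",")[0].strip()
    base = Path(pdi_schemas_path)
    return base / f"{first}.xsl", base / f"{first}.xsd"

xsl, xsd = resolve_pdi_schemas(r"D:\IA\config", "Office,PhoneCalls,Tweets")
print(xsl.name, xsd.name)  # Office.xsl Office.xsd
```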

The XSD file needed for the PDI validation should be the same one used in InfoArchive when configuring a holding and specifying a PDI schema. The XSL file however needs to be created manually by the user. The XSL is responsible for extracting the correct values from the standard output generated by the importer in memory and transforming them into the needed structure for your configuration. An example of such files can be found at Sample PDI schema and Sample PDI transformation style sheet.

You should allocate Java heap space to the Jobserver that is at least 2.5x the size of the largest generated PDI XML file.

Example: PDI file (~1 GB) <-> Java heap space (~2.5-3 GB)

Support for Multiple Object Types per Object

Starting with version 3.2.9 of migration-center, the InfoArchive Importer supports generating PDI files that contain metadata from multiple object types. By specifying multiple object type definitions in the “target_type” system attribute, metadata can be associated with multiple object types in the Associations tab. Note that only the first value of this rule is used to find the XSD and XSL files for transforming and validating the eas_pdi.xml file. Those files need to support the standard output produced by the importer, as shown in Default format of PDI File.

Support for Multiple Contents per AIU

Starting with version 3.2.9 of migration-center, the InfoArchive Importer supports multiple contents per AIU. The content locations must be specified as repeating values of the mc_content_location system attribute, and the corresponding content names as repeating values of the content_name system attribute.

The number of repeating values must be the same for both attributes.
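A minimal sketch of this constraint, with made-up attribute values; the helper function is illustrative, not the importer’s actual validation code.

```python
def validate_aiu_contents(mc_content_location, content_name):
    # Both repeating attributes must have the same number of values.
    if len(mc_content_location) != len(content_name):
        raise ValueError(
            f"mc_content_location has {len(mc_content_location)} values but "
            f"content_name has {len(content_name)}"
        )
    # Pair each content name with its location.
    return list(zip(content_name, mc_content_location))

pairs = validate_aiu_contents(
    [r"\\share\doc1.pdf", r"\\share\doc1.tif"],
    ["rendition.pdf", "scan.tif"],
)
print(pairs[0][0])  # rendition.pdf
```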

Support for Custom Attributes in the eas_sip.xml

Starting with version 3.2.9 of migration-center, the InfoArchive Importer supports setting custom attributes in the eas_sip.xml file. This is done by setting the “custom_attribute_key” and “custom_attribute_value” system attributes. The number of repeating values in these attributes must match.

  • custom_attribute_key: the name parameter of the custom attributes in eas_sip.xml.

  • custom_attribute_value: the values of the custom attributes in eas_sip.xml.

Please see Default format of PDI File for more details on how the output will look.

Generating Multiple SIPs That Belong to the Same DSS

The InfoArchive Importer can automatically distribute the given content into multiple sequential SIPs grouped together as a batch pertaining to a single Data Submission Session (DSS). To activate this feature, the “batchMode” check box must be enabled in the importer configuration. Additionally, one of the parameters “maxObjectsPerSIP” or “maxContentSizePerSIP” must be set to a value greater than 0.

The importer will sequentially process documents having the same values for the “DSS_” system rules, and a new SIP file will be generated whenever one of the following conditions is met:

  1. The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"

  2. The total size of the objects' content exceeds the "maxContentSizePerSIP"

The importer will set the value of the <seqno> element of the SIP descriptor to the sequence number of the SIP within the DSS. The <is_last> element will be set to "false" for all SIPs belonging to the same DSS except the last one, where it will be set to "true".
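The splitting and numbering rules above can be sketched as follows. This is a simplified model, not the importer’s actual code; thresholds of 0 are treated as unlimited, and objects are modeled as (name, content size) tuples.

```python
def split_into_sips(objects, max_objects=0, max_size=0):
    """Start a new SIP whenever either threshold would be exceeded;
    a threshold of 0 means unlimited."""
    sips, current, size = [], [], 0
    for name, obj_size in objects:
        over_count = max_objects and len(current) >= max_objects
        over_size = max_size and current and size + obj_size > max_size
        if over_count or over_size:
            sips.append(current)
            current, size = [], 0
        current.append(name)
        size += obj_size
    if current:
        sips.append(current)
    # seqno is the 1-based position in the DSS; is_last marks the final SIP.
    return [
        {"seqno": i, "objects": sip, "is_last": i == len(sips)}
        for i, sip in enumerate(sips, start=1)
    ]

batch = split_into_sips([("a", 10), ("b", 10), ("c", 10)], max_objects=2)
print([s["seqno"] for s in batch], batch[-1]["is_last"])  # [1, 2] True
```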

Generating Multiple SIPs That Belong to Different DSS

For cases where the generated SIP would contain too many objects or grow too big, the importer can distribute the given content into multiple independent SIPs (belonging to different DSS). To activate this feature, the “batchMode” check box must be disabled, but one of the parameters “maxObjectsPerSIP” or “maxContentSizePerSIP” must be set to a value greater than 0. The importer will create a new independent SIP whenever one of the following conditions is met:

  1. The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"

  2. The total size of the objects' content exceeds the "maxContentSizePerSIP"

In this scenario the value of the <seqno> element in the SIP descriptor will be set to “1” and <is_last> will be set to “true” for all generated SIPs.

Additionally, the importer will change the value provided by “DSS_id” by appending a counter to the end of the value. This is necessary to ensure a unique identifier for each DSS, which is derived from information contained in the SIP descriptor:

external DSS ID = <holding>+<producer>+<id>

The external DSS IDs of multiple independent SIPs must be unique in order for InfoArchive to process them.

In this scenario, the length of the DSS_id value should be less than the maximum allowed length (32 characters) so the importer can append the counter (“_1”, “_2” and so on) to the end of the value.
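A small sketch of the counter suffixing and length constraint; the function and constant names are made up for this illustration.

```python
MAX_DSS_ID_LENGTH = 32  # maximum DSS_id length allowed by InfoArchive

def suffixed_dss_id(dss_id, counter):
    """Append "_<n>" to DSS_id, checking that the result still fits
    within the 32-character limit (illustrative only)."""
    candidate = f"{dss_id}_{counter}"
    if len(candidate) > MAX_DSS_ID_LENGTH:
        raise ValueError(
            f"DSS_id '{candidate}' exceeds {MAX_DSS_ID_LENGTH} characters")
    return candidate

print(suffixed_dss_id("IATEST", 2))  # IATEST_2
```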

Post-processing After the Import

Since the InfoArchive Importer does not import data directly into InfoArchive, it offers a post-processing feature that allows the user to automate the content ingestion into InfoArchive. This is done by providing a batch file that the importer executes after all objects have been processed. The path to the batch file can be configured in the importer parameter “targetScript”. Such a script may, for example, start the ingestion process on the InfoArchive server.
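A minimal sketch of such a script, assuming a POSIX shell on the Jobserver machine (a Windows Jobserver would use a .bat equivalent). The folder path and the echo placeholder are made up; replace the echo with whatever ingestion command your InfoArchive installation provides.

```shell
#!/bin/sh
# Hypothetical post-processing script for the "targetScript" parameter.
# It only lists the generated SIPs; it does not perform a real ingestion.
list_sips() {
    for sip in "${1:-/data/mc-export}"/*.zip; do
        [ -e "$sip" ] || continue   # skip when no SIPs are present
        echo "Ready for ingestion: $sip"
    done
}
list_sips "$@"
```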

Importing Documentum Specific Objects to InfoArchive

Importing Audittrail Objects

When the importer parameter “includeAuditTrails” is checked, the importer will add a list of all audit trail records of the currently processed object to the output XML file. The importer takes the data for the audit trail records from the audit trail migration set that must be assigned to the importer. Therefore, the user has to assign at least two migration sets to the importer: one for the documents and one for the corresponding audit trail records. Each audit trail node in the output XML file will contain all attributes defined in the audit trail migration set. The default XSLT transformation mechanism can be used to create the needed output structure for the audit trail records.

The default PDI output looks like below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<type>
    <subtype id="1">
        <dm_document>
            <object_name index="0">Test_acrobat</object_name>
            <owner_permit index="0">7</owner_permit>

        </dm_document>
        <audittrails>
            <audittrail>
                <time_stamp_utc>10.01.2017 09:45:23</time_stamp_utc>
                <object_name>Test_acrobat</object_name>

            </audittrail>
            <audittrail>
                <time_stamp_utc>10.01.2017 09:45:23</time_stamp_utc>
                <object_name>Test_acrobat</object_name>

            </audittrail>
        </audittrails>
    </subtype>
</type>

Including the Children of Virtual Documents

When the parameter “includeVirtualDocuments” is checked, the importer will include all descendants of each virtual document it processes, adding them as a list of child nodes to its output record. Each node will contain the name of the child and the content hash of its primary content (calculated by the scanner). The default XSLT transformation mechanism can be used to create the needed output structure for the VD objects.

The PDI file looks like below:

<documents xmlns="urn:eas-samples:en:xsd:office.1.0">
    <document id="1">
        <file_name sip_file_name="1.txt">1.txt</file_name>
        <title>Virtual document 1</title>
        <date_created>2014-12-12T15:32.393</date_created>

    <vdchildren>
        <vdchild>
            <file_name>Virtual document 11</file_name>
            <content_hash>EF56A612895633F3</content_hash>
        </vdchild>
        <vdchild>
            <file_name>Virtual document 12</file_name>
            <content_hash>EF56A612895633F3</content_hash>
        </vdchild>

   </vdchildren>
</document>

</documents>

The parameter “includeChildrenVersions” allows specifying if all versions of the children will be included or only the latest version.

There are several limitations that should be taken into consideration when using this feature:

  • All related objects, i.e. all descendants of a virtual document, must be associated with the same import job in migration-center. This limitation is necessary to ensure that all descendant objects of a virtual document are in the transformed and validated state before they are processed by the importer. If a descendant object is not contained in any of the migration sets associated with the import job, migration-center will throw an error for the parent object during the import.

  • For children, the <file_name> value is taken from the first value of the system rule “content_name”. The “content_name” is the system attribute that defines the content names in the zip.

  • For children, only the content specified in the first value of "mc_content_location" will be added. If "mc_content_location" is null, the content will be taken from the column "content_location" that stores the path where the document was scanned.

  • If the same document is part of multiple VDs within the same SIP, its content will be added only once.

  • If the size limit for one SIP is exceeded, the importer will throw an error.

  • Delta migration does not work with this feature.

If the parameter “includeChildrenOnce” is checked, the VD children are only added to the first imported parent. If it is unchecked, the children are added to every parent they belong to, and they are also added as distinct nodes in the PDI file.

Synchronous Ingestion (via Webservice)

migration-center can ingest the generated ZIP files synchronously over the webservice into InfoArchive 3.2. For this, the InfoArchive holding must be configured as described in the InfoArchive documentation.

In order to let the importer transfer the files, the parameter “webserviceURL” must be set. In that case, the importer will try to reach the webservice at the start of the import to ensure a connection can be established. Once a SIP file is created in the filesystem, it is transferred via webservice to InfoArchive in a separate thread. The number of threads running in parallel can be set with the parameter “numberOfThreads”.

If the transfer is successful, the SIP file can be moved to a directory specified by the parameter “moveFilesToFolder”. A SIP file that fails to transfer will be deleted by default, unless the parameter “keepUntransferredSIPs” is checked.
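The threaded transfer described above can be sketched with a standard thread pool; transfer_sip is a placeholder for the actual webservice upload, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_sip(path):
    # Placeholder for the webservice upload of one SIP file.
    return f"transferred {path}"

# "numberOfThreads" corresponds to max_workers here.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(transfer_sip,
                            ["sip_1.zip", "sip_2.zip", "sip_3.zip"]))
print(results)  # map preserves the submission order
```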

InfoArchive Importer Properties

To create a new InfoArchive Importer job, select “InfoArchive” from the list of available adapter types in the Importer Properties window. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type.

The Properties window of an importer can be accessed by double-clicking an importer in the list, or selecting the Properties button/menu item from the toolbar/context menu.

A detailed description is always displayed at the bottom of the window for a selected parameter.

Common Importer Parameters

InfoArchive Importer Parameters

History, Reports, Logs

A complete history is available for any InfoArchive Importer job from the respective item’s History window. It is accessible through the History button/menu entry on the toolbar/context menu. The History window displays a list of all runs of the selected job together with additional information, such as the number of processed objects, the start and end time, and the status.

Double clicking an entry or clicking the Open button on the toolbar opens the log file created by that run. The log file contains more information about the run of the selected job:

  • Version information of the migration-center Server Components the job was run with

  • The parameters the job was run with

  • An Execution Summary containing the total number of objects processed, the number of documents and folders scanned or imported, and the count of warnings and errors that occurred during runtime

Log files generated by the InfoArchive Importer can be found in the Server Components installation folder of the machine where the job was run, e.g. …\fme AG\migration-center Server Components <Version>\logs

The amount of information written to the log files depends on the setting specified in the “loggingLevel” start parameter for the respective job.

Appendix

Default Format of PDI File (eas_pdi.xml)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<type>
    <subtype id="1">
        <Shoebox>
            <author index="0">BUILTIN\Administrators</author>
            <dateModified index="0">2001-02-23T11:30:38.000</dateModified>
            <title index="0">false</title>
            <sourceLocation index="0">\\vms2\TestData\Filesystem\1 doc</sourceLocation>
            <keywords index="0">doc1</keywords>
            <contentSize index="0">123</contentSize>
            <department index="0">Windows 7</department>
            <fileName index="0">doc1.docx</fileName>
            <dateCreated index="0">2001-02-23T11:30:38.000</dateCreated>
            <format index="0">docx</format>
        </Shoebox>
        <Shoebox2>
            <author index="0">BUILTIN\Administrators</author>
            <title index="0">\\vms2\TestData\Filesystem\1 doc</title>
            <dateModified index="0">2001-02-23T11:30:38.000</dateModified>
            <sourceLocation index="0">false</sourceLocation>
            <keywords index="0">doc1</keywords>
            <contentSize index="0">123</contentSize>
            <department index="0">Windows 7</department>
            <fileName index="0">Second Doc.docx</fileName>
            <dateCreated index="0">2001-02-23T11:30:38.000</dateCreated>
            <format index="0">docx</format>
        </Shoebox2>
    </subtype>
    <subtype id="2">
        <Shoebox>
            <author index="0">BUILTIN\Administrators</author>
            <dateModified index="0">2001-02-23T11:30:38.000</dateModified>
            <title index="0">false</title>
            <sourceLocation index="0">\\vms2\TestData\Filesystem\1 doc</sourceLocation>
            <keywords index="0">doc1</keywords>
            <contentSize index="0">123</contentSize>
            <department index="0">Windows 7</department>
            <fileName index="0">doc1.docx</fileName>
            <dateCreated index="0">2001-02-23T11:30:38.000</dateCreated>
            <format index="0">docx</format>
        </Shoebox>
        <Shoebox2>
            <author index="0">BUILTIN\Administrators</author>
            <title index="0">\\vms2\TestData\Filesystem\1 doc</title>
            <dateModified index="0">2001-02-23T11:30:38.000</dateModified>
            <sourceLocation index="0">false</sourceLocation>
            <keywords index="0">doc1</keywords>
            <contentSize index="0">123</contentSize>
            <department index="0">Windows 7</department>
            <fileName index="0">Second Doc.docx</fileName>
            <dateCreated index="0">2001-02-23T11:30:38.000</dateCreated>
            <format index="0">docx</format>
        </Shoebox2>
    </subtype>
</type>

Sample PDI schema (XSD)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="urn:cosmin:en:xsd:Shoebox.1.0" 
    xmlns:ns1="urn:cosmin:en:xsd:Shoebox.1.0" 
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="documents">
        <xs:complexType>
            <xs:sequence>
                <xs:element maxOccurs="unbounded" minOccurs="0" name="document">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="Shoebox" maxOccurs="1">
                                <xs:complexType>
                                    <xs:sequence>
                                        <xs:element maxOccurs="unbounded" minOccurs="0" name="fileName" type="xs:string"/>
                                        <xs:element name="title" type="xs:string"/>
                                        <xs:element name="dateCreated" type="xs:dateTime"/>
                                        <xs:element name="dateModified" type="xs:dateTime"/>
                                        <xs:element name="contentSize" type="xs:byte"/>
                                        <xs:element name="sourceLocation" type="xs:string"/>
                                        <xs:element name="department" type="xs:string"/>
                                        <xs:element name="keywords">
                                            <xs:complexType>
                                                <xs:sequence>
                                                    <xs:element maxOccurs="unbounded" minOccurs="0" name="keyword" type="xs:string"/>
                                                </xs:sequence>
                                            </xs:complexType>
                                        </xs:element>
                                        <xs:element name="format" type="xs:string"/>
                                        <xs:element name="author" type="xs:string"/>
                                    </xs:sequence>
                                </xs:complexType>
                            </xs:element>
                        </xs:sequence>
                        <xs:attribute name="id" type="xs:byte" use="optional"/>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

Sample PDI Transformation Style Sheet (XSL)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" 
    xmlns="urn:cosmin:en:xsd:Shoebox.1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="type">
        <xsl:element name="documents">
            <xsl:apply-templates select="subtype"/>
        </xsl:element>
    </xsl:template>
    <xsl:template match="subtype">
        <xsl:element name="document">
            <xsl:attribute name="id">
                <xsl:value-of select="@id"/>
            </xsl:attribute>
            <xsl:apply-templates select="Shoebox"/>
            <xsl:apply-templates select="Shoebox2"/>
        </xsl:element>
    </xsl:template>
    <xsl:template match="Shoebox">
        <xsl:element name="Shoebox">
            <xsl:for-each select="fileName">
                <xsl:element name="fileName">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:for-each>
            <xsl:element name="title">
                <xsl:value-of select="title"/>
            </xsl:element>
            <xsl:element name="dateCreated">
                <xsl:value-of select="dateCreated"/>
            </xsl:element>
            <xsl:element name="dateModified">
                <xsl:value-of select="dateModified"/>
            </xsl:element>
            <xsl:element name="contentSize">
                <xsl:value-of select="contentSize"/>
            </xsl:element>
            <xsl:element name="sourceLocation">
                <xsl:value-of select="sourceLocation"/>
            </xsl:element>
            <xsl:element name="department">
                <xsl:value-of select="department"/>
            </xsl:element>
            <xsl:element name="keywords">
                <xsl:for-each select="keywords">
                    <xsl:element name="keyword">
                        <xsl:value-of select="."/>
                    </xsl:element>
                </xsl:for-each>
            </xsl:element>
            <xsl:element name="format">
                <xsl:value-of select="format"/>
            </xsl:element>
            <xsl:element name="author">
                <xsl:value-of select="author"/>
            </xsl:element>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
