InfoArchive Importer
Introduction
InfoArchive is an archive system from OpenText which fulfils the international standard OAIS (http://de.wikipedia.org/wiki/OAIS).
The InfoArchive Importer provides the necessary functionality for creating Submission Information Packages (SIP) compressed into ZIP format that will be ready to be ingested into an InfoArchive repository. A SIP is a data container used to transport data to be archived from the producer (source application) to InfoArchive. It consists of a SIP descriptor containing packaging and archival information about the package and the data to be archived. Based on the metadata configured in migration-center the InfoArchive Importer will create a valid SIP descriptor (eas_sip.xml) and a valid PDI file (eas_pdi.xml) for every generated SIP.
The supported InfoArchive versions are 3.2 – 20.4. Synchronous ingestion is only supported for version 3.2.
An importer is the term used for an output adapter and is used at the last step of the migration process. In the context of the InfoArchive Importer the filesystem itself is considered to be the target location for migrated data, hence the designation “importer”. The InfoArchive Importer imports data sourced from other systems and processed with migration-center to the filesystem into Zip-files (SIPs).
This module works as a job that can be run at any time and can even be executed repeatedly. For every run, a detailed history and log file are created. An importer is defined by a unique name, a set of configuration parameters and an optional description.
InfoArchive Importers can be created, configured, started and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.
Known issues & limitations
Use Java 11 for InfoArchive migrations due to the issue below
Generating large PDI files when the Jobserver is running with Java 8 may result in incorrect values in the PDI file. This may be seen by the XSD validation failing during import, or it might pass silently (#59214)
includeChildrenOnce feature does not work with versioned Child Documents and with includeChildrenVersions set to False (#57766)
Staging area free space
The InfoArchive importer might use a lot of disk space you import a large number of objects with a lot of attributes each. For each object the importer creates a temporary PDI xml file, which will ultimately be compiled into the main PDI xml file. The files are deleted when the import finishes but they can use a lot of disk space.
The default staging area is the %temp% folder, but it can be changed by adding the following line in the wrapper.conf file of the jobserver:
wrapper.java.additional.6=-Djava.io.tmpdir=./myCustomTEMP
The Jobserver service must be reinstalled after adding this line, for the changes to take effect.
Working with the migration-center InfoArchive Object Type
Objects meant to be migrated to InfoArchive using the InfoArchive Importer have their own type in migration-center. This allows migration-center and the user to target aspects and properties specific to the filesystem.
Migration Sets
Documents targeted at InfoArchive will have to be added to a migration set first. This migration set must be configured to accept objects of type <source object type>ToInfoArchive(document).
Create a new migration set and set the <source object type>ToInfoArchive(document) object type in the Type drop-down. The type of object can no longer be changed after a migration set has been created.
Transformation Rules
The migration sets of type “<source object type>ToInfoArchive(document)” have a number of predefined rules listed under Rules for system attributes in the –Transformation Rules - window.
The values of system rules prefixed with DSS are used by the InfoArchive Importer to create the SIP descriptor (eas_sip.xml) as shown in the following example:
Every unique combination of the values of the “DSS_” rules together with the “target_type” will correspond to a “Data Submission Session (DSS)”. See more information about DSS in the InfoArchive configuration guide.
Configuration parameters
Values
Mandatory
content_name
Must be set with the names of the content files in the SIP associated with the current document. If the current document does not have content this attribute will be ignored by the importer.
If this attribute contains multiple values, the number values must match and be in the same order as the corresponding ones in the mc_content_location attribute
Important: This rule must have the same value as the rule associated with the PDI attribute configured in the holding as the content reference.
Yes
custom_attribute_key
Represents the name parameter for custom attributes from eas_sip.xml. The number of repeating attributes must match with custom_attribute_value
custom_attribute_value
Represents the values of custom attributes from eas_sip.xml. The number of repeating attributes must match with custom_attribute_key
DSS_application
Sets the <application> Value
Yes
DSS_base_retention_date
Sets the <base_retention_date> Value
Yes
DSS_entity
Sets the <entity> Value
Yes
DSS_holding
Sets the <holding> Value
This value will be used for naming the generated SIP file(s)
Yes
DSS_id
Sets the <id> Value
Yes
DSS_pdi_schema
Sets the <pdi_schema> Value
Yes
DSS_pdi_schema_version
Sets the <pdi_schema_version> Value
No
DSS_priority
Sets the <priority> Value
Yes
DSS_producer
Sets the <producer> Value
Yes
DSS_production_date
Sets the <production_date> Value
Yes
DSS_rentention_class
Sets the <retention_class> Value if needed
No
mc_content_location
By default, the document content will be picked up by the importer from its original location (the location where the scanner exported it to)
If mc_content_location is set with a local path or network share pointing to an existing file then the original location will be ignored and the content will be picked up from the location specified in this attribute.
If the value “nocontent” is set to this system rule the document will be handled by the importer as a content less object.
No
target_type
Must be set with the MC internal types that will be used for association.
Important: The first value of this attribute also determines which XSL and XSD file will be used for generating and validating PDI File (eas_pdi.xml)!
E.g. Value=”Office, PhoneCalls” XML filenames must be: Office.xsl, Office.xsd.
See Generation of PDI file for more details.
Yes
Working with rules and associations is core product functionality and is described in detail in the Client User Guide.
Object Type Definitions
The target types should be defined in MC according to the InfoArchive PDI schema definition. The object types are used by the validation engine to validate the attributes before the import phase. They are also used by the importer to generate the PDI file for the SIPs.
Working with object type definitions and defining attributes is core product functionality and is described in detail in the Client User Guide.
Working With Features Specific to InfoArchive
Generation of PDI file
The importer generates the PDI file (eas_pdi.xml) by transforming the structured data from a standard structure based on an XSL file and validating it against an XSD file. An example of the standard structure of the PDI file can be found in Default format of PDI File.
The “pdiSchemasPath” parameter in the importer configuration is used to locate the XSL and XSD files needed for the PDI file (eas_pdi.xml) transformation and validation. If this parameter does not have any value then the eas_pdi.xml file will be created using the standard structure.
If the parameter does contain a value, then the user must make sure that the XSL and XSD files are present in the path. The name of the XSL and XSD files must match the first value of system rule “target_type” otherwise the importer will return an error. If the “pdiSchemasPath” is set to “D:\IA\config” and “target_type” has the following multiple values “Office,PhoneCalls,Tweets” then the XSL and XSD file names must be: “Office.xsl” and “Office.xsd”.
The XSD file needed for the PDI validation should be the same one used in InfoArchive when configuring a holding and specifying a PDI schema. The XSL file however needs to be created manually by the user. The XSL is responsible for extracting the correct values from the standard output generated by the importer in memory and transforming them into the needed structure for your configuration. An example of such files can be found at Sample PDI schema and Sample PDI transformation style sheet.
You should allocate to the jobserver 2.5x more memory for the Java Heap space than the size of the biggest generated PDI XML file.
Example: PBI file (~1 gb) <-> Java Heap space (~2.5-3 gb)
Support for Multiple Object Types per Object
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports generation of PDI files that contain metadata from multiple object types. By specifying multiple object types definitions in the “target_type” system attribute, one can associate metadata to multiple object types in the associations' tab. Note that only the first value from this rule will be used to find the XSD and XSL files for transforming and validating the eas_pdi.xml file. Those files need to support the standard output provided by the importer as seen in Default format of PDI File.
Support for Multiple Contents per AIU
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports multiple contents per AIU. Each content location must be specified in the mc_content_location system attribute as repeating values, and the names of each content must be specified in the content_name system attribute as repeating values as well.
The number of repeating values must be the same for both attributes.
Support for Custom Attributes in the eas_sip.xml
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports setting custom attributes in the eas_sip.xml file. This can be done by setting the “custom_attribute_key” and custom_attribute_value” system attributes. The number of repeating values in these attributes must match.
custom_attribute_key: which represents the name parameter for custom attributes from eas_sip.xml.
custom_attribute_value: which represents the values of custom attributes from eas_sip.xml.
Please see Default format of PDI File for more details on how the output will look like.
Generating Multiple SIPs That Belong to the Same DSS
The InfoArchive Importer offers the possibility to automatically distribute the given content to multiple sequential SIPs grouped together as a batch that pertains to a single Data Submission Session (DSS). For activating this feature the check box “batchMode” must be enabled in the importer configuration. Additionally one of the parameters “maxObjectsPerSIP" or "maxContentSizePerSIP" must be set with a value greater than 0.
The importer will process sequentially the documents having the same values for "DSS_" system rules and a new SIP file will be generated anytime when one of the following conditions is met:
The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"
The total size of the objects' content exceeds the "maxContentSizePerSIP"
The importer will set the value of <seqno> element of the SIP descriptor with the sequence number of the SIP inside the DSS. The value of the element <is_last> will be set to "false" for all SIPs that belong to the same DSS except for the last one where it will be set to "true".
Generating Multiple SIPs That Belong to Different DSS
For the cases when the generated SIP will contain too many objects or the size of the SIP will be too big, the importer offers the possibility to distribute the given content to multiple independent SIPs (that belong to different DSS). For activating this feature the check box “batchMode” must be disabled but one of the parameters “maxObjectsPerSIP" or "maxContentSizePerSIP" must be set to a value greater than 0. The importer will create a new independent SIP any time when one of the following conditions is met:
The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"
The total size of the objects' content exceeds the "maxContentSizePerSIP"
In this scenario the value of <seqno> element in SIP descriptor will be set to “1” and <is_last> will be set to “true” for all generated SIPs.
Additionally, the importer will change the value provided by “DSS_id” by adding a counter to the end of the value. This is necessary in order to assure a unique identifier for each DSS that is derived from information contained in the SIP descriptor:
external DSS ID = <holding>+<producer><id>.
The external DSS id of multiple independent SIPs must be unique for InfoArchive in order to process them.
IMPORTANT: In this scenario, the length of the DSS_id value should be less than the maximum allowed length (32 char) so the importer can add the counter as “_1”, “_2” and so on at the end of the value.
Post-processing After the Import
Since the InfoArchive Importer does not import the data directly to InfoArchive it does offer a post processing functionality to allow the user to automate the content ingestion to InfoArchive. This can be done by providing a batch file that will be executed by the importer after all objects will have been processed. The path to the batch file can be configured in the importer parameter “targetScript”. Such a script may, for example, start the ingestion process on the InfoArchive server.
Importing Documentum Specific Objects to InfoArchive
Importing Audittrail Objects
When the importer parameter “includeAuditTrails” checked”, the importer will add a list with all audit trail records of the currently processed object to the output XML file. The importer will take the data for the audit trail records from the audit trail migration set that must be assigned to the importer. Therefore, the user has to assign at least two migration sets to the importer: one for the documents and one for the corresponding audit trail records. Each audit trail node in the output XML file will contain all the attributes defined in the audit trail migration set. The default XSLT transformation mechanism can be used to create the needed output structure for the audit trail records.
The default PDI output looks like below:
Including the Children of Virtual Documents
When the parameter “includeVirtualDocuments” is checked the importer will include for each virtual document it processes all its descendants and add them as a list of child nodes to its output record. Each node will contain the name of the child and the content hash of the primary content (that was calculated by the scanner). The default XSLT transformation mechanism can be used to create the needed output structure for the VD objects.
The PDI file looks like below:
The parameter “includeChildrenVersions” allows specifying if all versions of the children will be included or only the latest version.
There are several limitations that should be taken into consideration when using this feature:
All related objects, i.e. all descendants of a virtual document, must be associated with the same import job in migration-center. This limitation is necessary in order to ensure that all descendant objects of a virtual document are in the transformed and validated state before they are processed by the importer. If a descendant object is not contained in any of the migration sets that are associated with the import job, the migration-center will throw an error for the parent object during the import.
For children, the <file_name> value is taken from the first value of the system rule “content_name”. The “content_name” is the system attribute that defines the content names in the zip.
For children, only the content specified in the first value of "mc_content_location" will be added. If "mc_content_location" is null, the content will be taken from the column "content_location" that stores the path where the document was scanned.
If the same document is part of multiple VDs within the same SIP then its content will be added only one time.
If the size limit for one SIP is exceeded, the importer will throw an error
Delta migration does not work with this feature.
If the parameter “includeChildrenOnce” is checked the VD children are only added to the first imported parent. If is unchecked the children are added to every parent they belong to and they are also added as distinct nodes in the PDI file.
Synchronous Ingestion (via Webservice)
Migration center can ingest the generated ZIP files synchronously over the Webservice into InforArchive 3.2. Therefore the InfoArchive (Holding) must be configured as described in the InfoArchive documentation.
In order to let the importer transfer the files, the parameter “webserviceURL” must be filled. If that is the case the importer will try to reach the Webservice at the start of the import to ensure a connection to the webservices can be established. Once a SIP file is created in the filesystem it will be transferred via Webservice to InfoArchive in a separate Thread. The number of threads that run in parallel can be set with the parameter “numberOfThreads”.
If the transfer is successful the SIP file can be moved to a directory specified by the parameter “moveFilesToFolder”. A SIP file that fails to transfer will be deleted by default unless the parameter “keepUntransferredSIPs” is checked.
InfoArchive Importer Properties
To create a new InfoArchive Importer job, specify the respective adapter type in the Importer Properties window from the list of available adapters “InfoArchive”. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type.
The Properties window of an importer can be accessed by double-clicking an importer in the list, or selecting the Properties button/menu item from the toolbar/context menu.
A detailed description is always displayed at the bottom of the window for a selected parameter.
Common Importer Parameters
Configuration parameters
Values
Name
Enter a unique name for this importer
Mandatory
Adapter type
Select the “InfoArchive” adapter from the list of available adapters
Mandatory
Location
Select the Job Server location where this job should be run. Job Servers are defined in the Jobserver window. If no Job Server migration-center will prompt the user to define a Job Server Location when saving the importer.
Mandatory
Description
Enter a description for this job (optional)
InfoArchive Importer Parameters
Configuration parameters
Values
pdiSchemasPath
Should be set with the folder path where XSL and XSD files needed for generating and validating PDI files are located.
If no value is set to this parameter the PDI file will be generated in the default format. In most cases, this parameter needs to be set.
For more details about PDI generation see Generation of PDI file.
targetDirectory
The folder where SIP files will be created. Can be a local drive or a network share.
Mandatory
includeAuditTrails
Enable audittrail entries to be added the generated SIPs. The audittrail migsets need to be associated with the importer. This works only with the audittrail objects exported from Documentum.
includeVirtualDocuments
Enable the children of the virtual documents (scanned from Documentum) to be included in the SIP together with the parent document.
includeChildrenVersions
Indicates whether all children of the virtual documents will be included in the SIP. If not checked, only the most recent version of the children will be added to the SIP. This parameter is used only when “includeVirtualDocuments” is checked.
includeChildrenOnce
If enabled, the VD children will be only added under the parent node in the PDI. If disabled, they will be added also as distinct nodes.
batchMode
Enable batch ingestion mode. Enabling this parameter has effect only when “maxObjectsPerSIP” or “maxContentSizePerSIP” is set with a value greater than 0.
maxObjectsPerSIP
The maximum number of objects in a SIP (ZIP). If it’s 0 or less it will be ignored.
maxContentSizePerSIP
Maximum overall content size of a SIP (ZIP) in MB. If it’s 0 or less it will be ignored.
computeChecksum
Flag indicating if the checksum of the generated eas_pdi.xml file should be computed. The importer will use the SHA-256 algorithm and base64 encoding.
triggerScript
Path to a custom script or batch file to be executed at the end of the import.
webServiceURL
Set a valid Webservice URL here if the SIP files should be transferred via InfoArchive webserviecs. With this is empty, no Webservice transfer will be done.
moveFilesToFolder
If set, successfully transferred files will be moved to another folder. (only webservice transfer related)
keepUntransferredSIPs
Enable this to keep SIPs that have produced an error while being transferred. Normally SIPs get deleted in case of an error. Transfer errors can either be technically (e.g. connection lost) or e.g. attribute validation failed, the schema is missing and other misconfigurations.
numberOfThreads
Set the maximum number of threads to use for each Webservice transfer. Default is 10.
loggingLevel
Logging level, 4-Debug, 3-Info, 2-Warning, 1-Error
Mandatory
History, Reports, Logs
A complete history is available for any InfoArchiveImporter job from the respective items’ History window. It is accessible through the History button/menu entry on the toolbar/context menu. The History window displays a list of all runs for the selected job together with additional information, such as the number of processed objects, the start and ending time and the status.
Double clicking an entry or clicking the Open button on the toolbar opens the log file created by that run. The log file contains more information about the run of the selected job:
Version information of the migration-center Server Components the job was run with
The parameters the job was run with
Execution Summary that contains the total number of objects processed, the number of documents and folders scanned or imported, the count of warnings and errors that occurred during runtime.
Log files generated by the InfoArchive Importer can be found in the Server Components installation folder of the machine where the job was run, e.g. …\fme AG\migration-center Server Components <Version>\logs
The amount of information written to the log files depends on the setting specified in the “loggingLevel” start parameter for the respective job.
Appendix
Default Format of PDI File (eas_pdi.xml)
Sample PDI schema (XSD)
Sample PDI Transformation Style Sheet (XSL)
Last updated