InfoArchive is an archive system from OpenText which fulfils the international standard OAIS (http://de.wikipedia.org/wiki/OAIS).
The InfoArchive Importer provides the necessary functionality for creating Submission Information Packages (SIP) compressed into ZIP format that will be ready to be ingested into an InfoArchive repository. A SIP is a data container used to transport data to be archived from the producer (source application) to InfoArchive. It consists of a SIP descriptor containing packaging and archival information about the package and the data to be archived. Based on the metadata configured in migration-center the InfoArchive Importer will create a valid SIP descriptor (eas_sip.xml) and a valid PDI file (eas_pdi.xml) for every generated SIP.
The supported InfoArchive versions are 3.2 – 20.4. Synchronous ingestion is only supported for version 3.2.
An importer is the term used for an output connector and is used at the last step of the migration process. In the context of the InfoArchive Importer the filesystem itself is considered to be the target location for migrated data, hence the designation “importer”. The InfoArchive Importer imports data sourced from other systems and processed with migration-center to the filesystem into Zip-files (SIPs).
This module works as a job that can be run at any time and can even be executed repeatedly. For every run, a detailed history and log file are created. An importer is defined by a unique name, a set of configuration parameters and an optional description.
InfoArchive Importers can be created, configured, started and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.
Use Java 11 for InfoArchive migrations due to the issue below
Generating large PDI files when the Jobserver is running with Java 8 may result in incorrect values in the PDI file. This may be seen by the XSD validation failing during import, or it might pass silently (#59214)
includeChildrenOnce feature does not work with versioned Child Documents and with includeChildrenVersions set to False (#57766)
InfoArchive Importer leaves xml file in Temp when running out of disk space (#66241)
The InfoArchive Importer will create the SIP Zips in the default temporary folder for the user running the Jobserver. By default it is dictated by the TMP environment variable.
These ZIP files are normally deleted from the temporary location when the Job finishes, unless loggingLevel parameter in the importer is set to 4 (Debug).
To change the location where the staging files are created, add a line with your custom path in the wrapper.conf file of the jobserver after the existing wrapper.java.additional entries:
Restart the jobserver afterwards.
You need to use forward slashes even on a Windows machine. This is a requirement of the wrapper.
To create a new InfoArchive Importer job, specify the respective adapter type in the Importer Properties window from the list of available connectors “InfoArchive”. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type.
The Properties window of an importer can be accessed by double-clicking an importer in the list, or selecting the Properties button/menu item from the toolbar/context menu.
A detailed description is always displayed at the bottom of the window for a selected parameter.
The common adaptor parameters are described in Common Parameters.
The configuration parameters available for the InfoArchive importer are described below:
pdiSchemasPath Should be set with the folder path where XSL and XSD files needed for generating and validating PDI files are located.
If no value is set to this parameter the PDI file will be generated in the default format. In most cases, this parameter needs to be set.
For more details about PDI generation see Generation of PDI file.
targetDirectory The folder where SIP files will be created. Can be a local drive or a network share.
includeAuditTrails Enable audittrail entries to be added the generated SIPs. The audittrail migsets need to be associated with the importer. This works only with the audittrail objects exported from Documentum.
includeVirtualDocuments Enable the children of the virtual documents (scanned from Documentum) to be included in the SIP together with the parent document.
includeChildrenVersions Indicates whether all children of the virtual documents will be included in the SIP. If not checked, only the most recent version of the children will be added to the SIP. This parameter is used only when “includeVirtualDocuments” is checked.
includeChildrenOnce If enabled, the VD children will be only added under the parent node in the PDI. If disabled, they will be added also as distinct nodes.
batchMode Enable batch ingestion mode. Enabling this parameter has effect only when “maxObjectsPerSIP” or “maxContentSizePerSIP” is set with a value greater than 0.
maxObjectsPerSIP The maximum number of objects in a SIP (ZIP). If it’s 0 or less it will be ignored.
maxContentSizePerSIP Maximum overall content size of a SIP (ZIP) in MB. If it’s 0 or less it will be ignored.
computeChecksum Flag indicating if the checksum of the generated eas_pdi.xml file should be computed. The importer will use the SHA-256 algorithm and base64 encoding.
triggerScript Path to a custom script or batch file to be executed at the end of the import.
webServiceURL Set a valid Webservice URL here if the SIP files should be transferred via InfoArchive webserviecs. With this is empty, no Webservice transfer will be done.
moveFilesToFolder If set, successfully transferred files will be moved to another folder. (only webservice transfer related)
keepUntransferredSIPs Enable this to keep SIPs that have produced an error while being transferred. Normally SIPs get deleted in case of an error. Transfer errors can either be technically (e.g. connection lost) or e.g. attribute validation failed, the schema is missing and other misconfigurations.
numberOfThreads Set the maximum number of threads to use for each Webservice transfer. Default is 10.
loggingLevel*
See Common Parameters.
Parameters marked with an asterisk (*) are mandatory.
Objects meant to be migrated to InfoArchive using the InfoArchive Importer have their own type in migration-center. This allows migration-center and the user to target aspects and properties specific to the filesystem.
Documents targeted at InfoArchive will have to be added to a migration set first. This migration set must be configured to accept objects of type <source object type>ToInfoArchive(document).
Create a new migration set and set the <source object type>ToInfoArchive(document) object type in the Type drop-down. The type of object can no longer be changed after a migration set has been created.
The migration sets of type “<source object type>ToInfoArchive(document)” have a number of predefined rules listed under Rules for system attributes in the –Transformation Rules - window.
The values of system rules prefixed with DSS are used by the InfoArchive Importer to create the SIP descriptor (eas_sip.xml) as shown in the following example:
Every unique combination of the values of the “DSS_” rules together with the “target_type” will correspond to a “Data Submission Session (DSS)”. See more information about DSS in the InfoArchive configuration guide.
content_name* Must be set with the names of the content files in the SIP associated with the current document. If the current document does not have content this attribute will be ignored by the importer. If this attribute contains multiple values, the number values must match and be in the same order as the corresponding ones in the mc_content_location attribute Important: This rule must have the same value as the rule associated with the PDI attribute configured in the holding as the content reference.
custom_attribute_key Represents the name parameter for custom attributes from eas_sip.xml. The number of repeating attributes must match with custom_attribute_value
custom_attribute_value Represents the values of custom attributes from eas_sip.xml. The number of repeating attributes must match with custom_attribute_key
DSS_application* Sets the <application> Value
DSS_base_retention_date* Sets the <base_retention_date> Value
DSS_entity* Sets the <entity> Value
DSS_holding* Sets the <holding> Value
This value will be used for naming the generated SIP file(s)
DSS_id* Sets the <id> Value
DSS_pdi_schema* Sets the <pdi_schema> Value
DSS_pdi_schema_version Sets the <pdi_schema_version> Value
DSS_priority* Sets the <priority> Value
DSS_producer* Sets the <producer> Value
DSS_production_date* Sets the <production_date> Value
DSS_rentention_class Sets the <retention_class> Value if needed
mc_content_location By default, the document content will be picked up by the importer from its original location (the location where the scanner exported it to) If mc_content_location is set with a local path or network share pointing to an existing file then the original location will be ignored and the content will be picked up from the location specified in this attribute. If the value “nocontent” is set to this system rule the document will be handled by the importer as a content less object.
target_type* Must be set with the MC internal types that will be used for association. Important: The first value of this attribute also determines which XSL and XSD file will be used for generating and validating PDI File (eas_pdi.xml)! E.g. Value=”Office, PhoneCalls” XML filenames must be: Office.xsl, Office.xsd. See Generation of PDI file for more details.
Parameters marked with an asterisk (*) are mandatory.
The target types should be defined in MC according to the InfoArchive PDI schema definition. The object types are used by the validation engine to validate the attributes before the import phase. They are also used by the importer to generate the PDI file for the SIPs.
Working with object type definitions and defining attributes is core product functionality and is described in detail in the Client User Guide.
The importer generates the PDI file (eas_pdi.xml) by transforming the structured data from a standard structure based on an XSL file and validating it against an XSD file. An example of the standard structure of the PDI file can be found in Default format of PDI File.
The “pdiSchemasPath” parameter in the importer configuration is used to locate the XSL and XSD files needed for the PDI file (eas_pdi.xml) transformation and validation. If this parameter does not have any value then the eas_pdi.xml file will be created using the standard structure.
If the parameter does contain a value, then the user must make sure that the XSL and XSD files are present in the path. The name of the XSL and XSD files must match the first value of system rule “target_type” otherwise the importer will return an error. If the “pdiSchemasPath” is set to “D:\IA\config” and “target_type” has the following multiple values “Office,PhoneCalls,Tweets” then the XSL and XSD file names must be: “Office.xsl” and “Office.xsd”.
The XSD file needed for the PDI validation should be the same one used in InfoArchive when configuring a holding and specifying a PDI schema. The XSL file however needs to be created manually by the user. The XSL is responsible for extracting the correct values from the standard output generated by the importer in memory and transforming them into the needed structure for your configuration. An example of such files can be found at Sample PDI schema and Sample PDI transformation style sheet.
You should allocate to the jobserver 2.5x more memory for the Java Heap space than the size of the biggest generated PDI XML file.
Example: PBI file (~1 gb) <-> Java Heap space (~2.5-3 gb)
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports generation of PDI files that contain metadata from multiple object types. By specifying multiple object types definitions in the “target_type” system attribute, one can associate metadata to multiple object types in the associations' tab. Note that only the first value from this rule will be used to find the XSD and XSL files for transforming and validating the eas_pdi.xml file. Those files need to support the standard output provided by the importer as seen in Default format of PDI File.
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports multiple contents per AIU. Each content location must be specified in the mc_content_location system attribute as repeating values, and the names of each content must be specified in the content_name system attribute as repeating values as well.
The number of repeating values must be the same for both attributes.
Starting from version 3.2.9 of migration-center the InfoArchive Importer supports setting custom attributes in the eas_sip.xml file. This can be done by setting the “custom_attribute_key” and custom_attribute_value” system attributes. The number of repeating values in these attributes must match.
custom_attribute_key: which represents the name parameter for custom attributes from eas_sip.xml.
custom_attribute_value: which represents the values of custom attributes from eas_sip.xml.
Please see Default format of PDI File for more details on how the output will look like.
The InfoArchive Importer offers the possibility to automatically distribute the given content to multiple sequential SIPs grouped together as a batch that pertains to a single Data Submission Session (DSS). For activating this feature the check box “batchMode” must be enabled in the importer configuration. Additionally one of the parameters “maxObjectsPerSIP" or "maxContentSizePerSIP" must be set with a value greater than 0.
The importer will process sequentially the documents having the same values for "DSS_" system rules and a new SIP file will be generated anytime when one of the following conditions is met:
The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"
The total size of the objects' content exceeds the "maxContentSizePerSIP"
The importer will set the value of <seqno> element of the SIP descriptor with the sequence number of the SIP inside the DSS. The value of the element <is_last> will be set to "false" for all SIPs that belong to the same DSS except for the last one where it will be set to "true".
For the cases when the generated SIP will contain too many objects or the size of the SIP will be too big, the importer offers the possibility to distribute the given content to multiple independent SIPs (that belong to different DSS). For activating this feature the check box “batchMode” must be disabled but one of the parameters “maxObjectsPerSIP" or "maxContentSizePerSIP" must be set to a value greater than 0. The importer will create a new independent SIP any time when one of the following conditions is met:
The number of objects in the current SIP exceeds the value of "maxObjectsPerSIP"
The total size of the objects' content exceeds the "maxContentSizePerSIP"
In this scenario the value of <seqno> element in SIP descriptor will be set to “1” and <is_last> will be set to “true” for all generated SIPs.
Additionally, the importer will change the value provided by “DSS_id” by adding a counter to the end of the value. This is necessary in order to assure a unique identifier for each DSS that is derived from information contained in the SIP descriptor:
external DSS ID = <holding>+<producer><id>.
The external DSS id of multiple independent SIPs must be unique for InfoArchive in order to process them.
In this scenario, the length of the DSS_id value should be less than the maximum allowed length (32 char) so the importer can add the counter as “_1”, “_2” and so on at the end of the value.
Since the InfoArchive Importer does not import the data directly to InfoArchive it does offer a post processing functionality to allow the user to automate the content ingestion to InfoArchive. This can be done by providing a batch file that will be executed by the importer after all objects will have been processed. The path to the batch file can be configured in the importer parameter “targetScript”. Such a script may, for example, start the ingestion process on the InfoArchive server.
When the importer parameter “includeAuditTrails” checked”, the importer will add a list with all audit trail records of the currently processed object to the output XML file. The importer will take the data for the audit trail records from the audit trail migration set that must be assigned to the importer. Therefore, the user has to assign at least two migration sets to the importer: one for the documents and one for the corresponding audit trail records. Each audit trail node in the output XML file will contain all the attributes defined in the audit trail migration set. The default XSLT transformation mechanism can be used to create the needed output structure for the audit trail records.
The default PDI output looks like below:
When the parameter “includeVirtualDocuments” is checked the importer will include for each virtual document it processes all its descendants and add them as a list of child nodes to its output record. Each node will contain the name of the child and the content hash of the primary content (that was calculated by the scanner). The default XSLT transformation mechanism can be used to create the needed output structure for the VD objects.
The PDI file looks like below:
The parameter “includeChildrenVersions” allows specifying if all versions of the children will be included or only the latest version.
There are several limitations that should be taken into consideration when using this feature:
All related objects, i.e. all descendants of a virtual document, must be associated with the same import job in migration-center. This limitation is necessary in order to ensure that all descendant objects of a virtual document are in the transformed and validated state before they are processed by the importer. If a descendant object is not contained in any of the migration sets that are associated with the import job, the migration-center will throw an error for the parent object during the import.
For children, the <file_name> value is taken from the first value of the system rule “content_name”. The “content_name” is the system attribute that defines the content names in the zip.
For children, only the content specified in the first value of "mc_content_location" will be added. If "mc_content_location" is null, the content will be taken from the column "content_location" that stores the path where the document was scanned.
If the same document is part of multiple VDs within the same SIP then its content will be added only one time.
If the size limit for one SIP is exceeded, the importer will throw an error
Delta migration does not work with this feature.
If the parameter “includeChildrenOnce” is checked the VD children are only added to the first imported parent. If is unchecked the children are added to every parent they belong to and they are also added as distinct nodes in the PDI file.
Migration center can ingest the generated ZIP files synchronously over the Webservice into InforArchive 3.2. Therefore the InfoArchive (Holding) must be configured as described in the InfoArchive documentation.
In order to let the importer transfer the files, the parameter “webserviceURL” must be filled. If that is the case the importer will try to reach the Webservice at the start of the import to ensure a connection to the webservices can be established. Once a SIP file is created in the filesystem it will be transferred via Webservice to InfoArchive in a separate Thread. The number of threads that run in parallel can be set with the parameter “numberOfThreads”.
If the transfer is successful the SIP file can be moved to a directory specified by the parameter “moveFilesToFolder”. A SIP file that fails to transfer will be deleted by default unless the parameter “keepUntransferredSIPs” is checked.