
# InfoArchive Importer

## Introduction

InfoArchive is an archive system from OpenText that implements the international OAIS (Open Archival Information System) standard. The InfoArchive Importer creates Submission Information Packages (SIPs), compressed in ZIP format, that are ready to be ingested into an InfoArchive repository.

A SIP is a data container used to transport data to be archived from the producer (source application) to InfoArchive. It consists of a SIP descriptor containing packaging and archival information about the package and the data to be archived. Based on the metadata configured in migration-center, the InfoArchive Importer creates a valid SIP descriptor (eas\_sip.xml) and a valid PDI file (eas\_pdi.xml) for every generated SIP.

The supported InfoArchive versions are 3.2 – 20.4. Synchronous ingestion is only supported for version 3.2.

An importer is the term used for an output adapter and is used at the last step of the migration process. In the context of the InfoArchive Importer the filesystem itself is considered to be the target location for migrated data, hence the designation “importer”. The InfoArchive Importer imports data sourced from other systems and processed with migration-center to the filesystem as ZIP files (SIPs).

This module works as a job that can be run at any time and can even be executed repeatedly. For every run, a detailed history and log file are created.

An importer is defined by a unique name, a set of configuration parameters and an optional description.
InfoArchive Importers can be created, configured, started and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.

## Known issues & limitations

{% hint style="warning" %}
Use Java 11 for InfoArchive migrations due to the issue below
{% endhint %}

* Generating large PDI files when the Jobserver is running with **Java 8** may result in incorrect values in the PDI file. This may surface as an XSD validation failure during import, or it may pass silently (#59214)
* The includeChildrenOnce feature does not work with versioned child documents and with includeChildrenVersions set to False (#57766)

## Staging area free space

The InfoArchive importer might use a lot of disk space if you import a large number of objects with a lot of attributes each. For each object the importer creates a temporary PDI XML file, which is ultimately compiled into the main PDI XML file. The files are deleted when the import finishes, but until then they can use a lot of disk space.

The default staging area is the %temp% folder, but it can be changed by adding the following line in the wrapper.conf file of the jobserver:

`wrapper.java.additional.6=-Djava.io.tmpdir=./myCustomTEMP`

{% hint style="info" %}
The Jobserver service must be reinstalled after adding this line, for the changes to take effect.
{% endhint %}

## Working with the migration-center InfoArchive Object Type

Objects meant to be migrated to InfoArchive using the InfoArchive Importer have their own type in migration-center. This allows migration-center and the user to target aspects and properties specific to the filesystem.

### Migration Sets

Documents targeted at InfoArchive will have to be added to a migration set first. This migration set must be configured to accept objects of type *\ToInfoArchive(document).* Create a new migration set and set the *\ToInfoArchive(document)* object type in the *Type* drop-down.
The type of object can no longer be changed after a migration set has been created.

![](/files/-M7JP_jX91viW_eLKnbX)

### Transformation Rules

Migration sets of type *“\ToInfoArchive(document)”* have a number of predefined rules listed under *Rules for system attributes* in the *Transformation Rules* window.

![](/files/-M7JP_jYlrSuLhlTRt6c)

The values of system rules prefixed with DSS are used by the InfoArchive Importer to create the SIP descriptor (eas\_sip.xml) as shown in the following example:

```markup
Shoebox IATEST urn:cosmin:en:xsd:Shoebox.1.0 2008-02-23T11:30:38.000 2001-02-23T11:30:38.000 MultiContentFS Entity 0 FS 2017-07-14T16:36:33.616 1 true 1 0 SxssWk1AZVxI2vfKJ3vOtCrKGdZyDA56mGCcQCIjpXk= customValue1 customValue2
```

Every unique combination of the values of the “DSS\_” rules together with the “target\_type” corresponds to a “Data Submission Session (DSS)”. See more information about DSS in the InfoArchive configuration guide.

| **Configuration parameters** | **Values** | **Mandatory** |
| --- | --- | --- |
| ***content\_name*** |

Must be set with the names of the content files in the SIP associated with the current document. If the current document does not have content this attribute will be ignored by the importer.

If this attribute contains multiple values, the number of values must match the number of values in the mc\_content\_location attribute, and they must be in the same order.

Important: This rule must have the same value as the rule associated with the PDI attribute configured in the holding as the content reference.

Sets the \ Value

This value will be used for naming the generated SIP file(s)

By default, the document content will be picked up by the importer from its original location (the location where the scanner exported it to)

If mc\_content\_location is set with a local path or network share pointing to an existing file then the original location will be ignored and the content will be picked up from the location specified in this attribute.

If this system rule is set to the value “nocontent”, the importer handles the document as a contentless object.

| No | | ***target\_type*** |

Must be set with the MC internal types that will be used for association.

Important: The first value of this attribute also determines which XSL and XSD files will be used for generating and validating the PDI file (eas\_pdi.xml)!

E.g. for the value “Office, PhoneCalls” the file names must be Office.xsl and Office.xsd.

See Generation of PDI file for more details.

| Yes |

{% hint style="info" %}
Working with rules and associations is core product functionality and is described in detail in the [Client User Guide](/3.17-update-2/client-user-guide.md).
{% endhint %}

### Object Type Definitions

The target types should be defined in MC according to the InfoArchive PDI schema definition. The object types are used by the validation engine to validate the attributes before the import phase. They are also used by the importer to generate the PDI file for the SIPs.

{% hint style="info" %}
Working with object type definitions and defining attributes is core product functionality and is described in detail in the [Client User Guide](/3.17-update-2/client-user-guide.md).
{% endhint %}

## Working With Features Specific to InfoArchive

### Generation of PDI file

The importer generates the PDI file *(eas\_pdi.xml)* by transforming the structured data from a standard structure based on an XSL file and validating it against an XSD file. An example of the standard structure of the PDI file can be found in [Default format of PDI File](/3.17-update-2/importers/infoarchive-importer.md#default-format-of-pdi-file-eas_pdi-xml).

The *“pdiSchemasPath”* parameter in the importer configuration is used to locate the XSL and XSD files needed for the PDI file (*eas\_pdi.xml*) transformation and validation. If this parameter does not have any value then the eas\_pdi.xml file will be created using the standard structure. If the parameter does contain a value, then the user must make sure that the XSL and XSD files are present in the path. The name of the XSL and XSD files must match the first value of system rule *“target\_type”*, otherwise the importer will return an error.
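The naming rule above can be illustrated with a minimal Python sketch. It is not the importer's actual code: the function name `pdi_schema_files` and its arguments are hypothetical, and it only mirrors the documented rule that the first *“target\_type”* value determines the XSL and XSD file names inside *“pdiSchemasPath”*.

```python
from pathlib import PureWindowsPath


def pdi_schema_files(pdi_schemas_path, target_type_values):
    """Return the (XSL, XSD) paths implied by the first target_type value.

    Hypothetical helper for illustration only. Returns None when
    pdi_schemas_path is empty, which corresponds to the case where the
    PDI file is created using the standard structure.
    """
    if not pdi_schemas_path:
        return None
    # Only the first target_type value is used for the file names.
    first = target_type_values[0].strip()
    base = PureWindowsPath(pdi_schemas_path)
    return base / f"{first}.xsl", base / f"{first}.xsd"


xsl, xsd = pdi_schema_files(r"D:\IA\config", ["Office", "PhoneCalls", "Tweets"])
print(xsl)
print(xsd)
```

Checking that both files exist before starting an import is a cheap way to catch the naming error early, since a mismatch makes the importer return an error.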
If *“pdiSchemasPath”* is set to *“D:\IA\config”* and *“target\_type”* has the multiple values *“Office,PhoneCalls,Tweets”*, then the XSL and XSD file names must be *“Office.xsl”* and *“Office.xsd”*.

The XSD file needed for the PDI validation should be the same one used in InfoArchive when configuring a holding and specifying a PDI schema. The XSL file, however, needs to be created manually by the user. The XSL is responsible for extracting the correct values from the standard output generated by the importer in memory and transforming them into the structure needed for your configuration. Examples of such files can be found at [Sample PDI schema](/3.17-update-2/importers/infoarchive-importer.md#sample-pdi-schema-xsd) and [Sample PDI transformation style sheet](/3.17-update-2/importers/infoarchive-importer.md#sample-pdi-transformation-style-sheet-xsl).

{% hint style="warning" %}
You should allocate to the jobserver about 2.5 times as much Java heap space as the size of the biggest generated PDI XML file. Example: PDI file (\~1 GB) <-> Java heap space (\~2.5-3 GB)
{% endhint %}

### Support for Multiple Object Types per Object

Starting from version 3.2.9 of migration-center the InfoArchive Importer supports generation of PDI files that contain metadata from multiple object types. By specifying multiple object type definitions in the “*target\_type*” system attribute, one can associate metadata with multiple object types in the Associations tab. Note that only the first value of this rule is used to find the XSD and XSL files for transforming and validating the eas\_pdi.xml file. Those files need to support the standard output provided by the importer as seen in [Default format of PDI File](/3.17-update-2/importers/infoarchive-importer.md#default-format-of-pdi-file-eas_pdi-xml).

### Support for Multiple Contents per AIU

Starting from version 3.2.9 of migration-center the InfoArchive Importer supports multiple contents per AIU.
Each content location must be specified in the ***mc\_content\_location*** system attribute as repeating values, and the name of each content must be specified in the ***content\_name*** system attribute as repeating values as well. The number of repeating values must be the same for both attributes.

### Support for Custom Attributes in the eas\_sip.xml

Starting from version 3.2.9 of migration-center the InfoArchive Importer supports setting custom attributes in the eas\_sip.xml file. This can be done by setting the “custom\_attribute\_key” and “custom\_attribute\_value” system attributes. The number of repeating values in these attributes must match.

* **custom\_attribute\_key:** represents the name parameter for custom attributes from eas\_sip.xml.
* **custom\_attribute\_value:** represents the values of custom attributes from eas\_sip.xml.

Please see [Default format of PDI File](/3.17-update-2/importers/infoarchive-importer.md#default-format-of-pdi-file-eas_pdi-xml) for more details on what the output will look like.

### Generating Multiple SIPs That Belong to the Same DSS

The InfoArchive Importer can automatically distribute the given content to multiple sequential SIPs grouped together as a batch that pertains to a single Data Submission Session (DSS). To activate this feature, the check box *“batchMode”* must be enabled in the importer configuration. Additionally, one of the parameters *"maxObjectsPerSIP"* or *"maxContentSizePerSIP"* must be set to a value greater than 0.

The importer will sequentially process the documents having the same values for the *"DSS\_"* system rules, and a new SIP file will be generated whenever one of the following conditions is met:

1. The number of objects in the current SIP exceeds the value of *"maxObjectsPerSIP"*
2. The total size of the objects' content exceeds the *"maxContentSizePerSIP"*

The importer will set the value of *\* element of the SIP descriptor to the sequence number of the SIP inside the DSS. The value of the element *\* will be set to *"false"* for all SIPs that belong to the same DSS except for the last one, where it will be set to *"true"*.

### Generating Multiple SIPs That Belong to Different DSS

For cases where the generated SIP would contain too many objects or would be too big, the importer can distribute the given content to multiple independent SIPs (that belong to different DSS). To activate this feature, the check box *“batchMode”* must be disabled, but one of the parameters *"maxObjectsPerSIP"* or *"maxContentSizePerSIP"* must be set to a value greater than 0.

The importer will create a new independent SIP whenever one of the following conditions is met:

1. The number of objects in the current SIP exceeds the value of *"maxObjectsPerSIP"*
2. The total size of the objects' content exceeds the *"maxContentSizePerSIP"*

In this scenario the value of *\* element in SIP descriptor will be set to “1” and *\* will be set to *“true”* for all generated SIPs. Additionally, the importer will change the value provided by *“DSS\_id”* by adding a counter to the end of the value. This is necessary to ensure a unique identifier for each DSS, which is derived from information contained in the SIP descriptor: external DSS ID = *\+\\.* The external DSS ID of multiple independent SIPs must be unique for InfoArchive in order to process them.

{% hint style="warning" %}
IMPORTANT: In this scenario, the length of the *DSS\_id* value should be less than the maximum allowed length (32 char) so the importer can add the counter as “\_1”, “\_2” and so on at the end of the value.
{% endhint %}

### Post-processing After the Import

Since the InfoArchive Importer does not import the data directly into InfoArchive, it offers post-processing functionality that allows the user to automate the content ingestion into InfoArchive. This can be done by providing a batch file that will be executed by the importer after all objects have been processed. The path to the batch file can be configured in the importer parameter *“targetScript”.* Such a script may, for example, start the ingestion process on the InfoArchive server.

## Importing Documentum Specific Objects to InfoArchive

### Importing Audittrail Objects

When the importer parameter “includeAuditTrails” is checked, the importer will add a list of all audit trail records of the currently processed object to the output XML file. The importer will take the data for the audit trail records from the audit trail migration set that must be assigned to the importer. Therefore, the user has to assign at least two migration sets to the importer: one for the documents and one for the corresponding audit trail records.

Each audit trail node in the output XML file will contain all the attributes defined in the audit trail migration set. The default XSLT transformation mechanism can be used to create the needed output structure for the audit trail records. The default PDI output looks like below:

```markup
Test_acrobat 7 … 10.01.2017 09:45:23 Test_acrobat … 10.01.2017 09:45:23 Test_acrobat …
```

### Including the Children of Virtual Documents

When the parameter “includeVirtualDocuments” is checked, the importer will include, for each virtual document it processes, all its descendants and add them as a list of child nodes to its output record. Each node will contain the name of the child and the content hash of the primary content (as calculated by the scanner). The default XSLT transformation mechanism can be used to create the needed output structure for the VD objects.
The PDI file looks like below:

```markup
1.txt Virtual document 1 2014-12-12T15:32.393 … Virtual document 11 EF56A612895633F3 Virtual document 12 EF56A612895633F3 … …
```

The parameter “includeChildrenVersions” specifies whether all versions of the children are included or only the latest version.

There are several limitations that should be taken into consideration when using this feature:

* All related objects, i.e. all descendants of a virtual document, must be associated with the same import job in migration-center. This limitation is necessary to ensure that all descendant objects of a virtual document are in the transformed and validated state before they are processed by the importer. If a descendant object is not contained in any of the migration sets that are associated with the import job, migration-center will throw an error for the parent object during the import.
* For children, the \ value is taken from the first value of the system rule “content\_name”. The “content\_name” is the system attribute that defines the content names in the zip.
* For children, only the content specified in the first value of "mc\_content\_location" will be added. If "mc\_content\_location" is null, the content will be taken from the column "content\_location" that stores the path where the document was scanned.
* If the same document is part of multiple VDs within the same SIP then its content will be added only one time.
* If the size limit for one SIP is exceeded, the importer will throw an error.
* Delta migration does not work with this feature.

If the parameter “includeChildrenOnce” is checked, the VD children are only added to the first imported parent. If it is unchecked, the children are added to every parent they belong to and they are also added as distinct nodes in the PDI file.

## Synchronous Ingestion (via Webservice)

migration-center can ingest the generated ZIP files synchronously over the Webservice into InfoArchive 3.2.
For this, the InfoArchive holding must be configured as described in the InfoArchive documentation. To let the importer transfer the files, the parameter “webserviceURL” must be filled. If that is the case, the importer will try to reach the Webservice at the start of the import to ensure that a connection can be established. Once a SIP file is created in the filesystem it will be transferred via Webservice to InfoArchive in a separate thread. The number of threads that run in parallel can be set with the parameter “numberOfThreads”. If the transfer is successful the SIP file can be moved to a directory specified by the parameter “moveFilesToFolder”. A SIP file that fails to transfer will be deleted by default unless the parameter “keepUntransferredSIPs” is checked.

## InfoArchive Importer Properties

To create a new InfoArchive Importer job, select the *“InfoArchive”* adapter type from the list of available adapters in the Importer Properties window. Once the adapter type has been selected, the Parameters list will be populated with the parameters specific to the selected adapter type.

The Properties window of an importer can be accessed by double-clicking an importer in the list, or selecting the Properties button/menu item from the toolbar/context menu. A detailed description is always displayed at the bottom of the window for a selected parameter.

![](/files/-M7JP_jZkNHKdrFdjdsP)

### Common Importer Parameters

| **Configuration parameters** | **Values** |
| --- | --- |
| ***Name*** |

Enter a unique name for this importer

Mandatory

| | ***Adapter type*** |

Select the “InfoArchive” adapter from the list of available adapters

Mandatory

| | ***Location*** |

Select the Job Server location where this job should be run. Job Servers are defined in the Jobserver window. If no Job Server has been defined, migration-center will prompt the user to define a Job Server Location when saving the importer.

Mandatory

|
| ***Description*** | Enter a description for this job (optional) |

### InfoArchive Importer Parameters

| **Configuration parameters** | **Values** |
| --- | --- |
| ***pdiSchemasPath*** |

Should be set with the folder path where XSL and XSD files needed for generating and validating PDI files are located.

If no value is set for this parameter, the PDI file will be generated in the default format. In most cases, this parameter needs to be set.

For more details about PDI generation see Generation of PDI file.

| | ***targetDirectory*** |

The folder where SIP files will be created. Can be a local drive or a network share.

Mandatory

|
| ***includeAuditTrails*** | Enable audit trail entries to be added to the generated SIPs. The audit trail migration sets need to be associated with the importer. This works only with audit trail objects exported from Documentum. |
| ***includeVirtualDocuments*** | Enable the children of the virtual documents (scanned from Documentum) to be included in the SIP together with the parent document. |
| ***includeChildrenVersions*** | Indicates whether all versions of the children of the virtual documents will be included in the SIP. If not checked, only the most recent version of the children will be added to the SIP. This parameter is used only when “includeVirtualDocuments” is checked. |
| ***includeChildrenOnce*** | If enabled, the VD children will only be added under the parent node in the PDI. If disabled, they will also be added as distinct nodes. |
| ***batchMode*** | Enable batch ingestion mode. Enabling this parameter has an effect only when “maxObjectsPerSIP” or “maxContentSizePerSIP” is set to a value greater than 0. |
| ***maxObjectsPerSIP*** | The maximum number of objects in a SIP (ZIP). If it’s 0 or less it will be ignored. |
| ***maxContentSizePerSIP*** | Maximum overall content size of a SIP (ZIP) in MB. If it’s 0 or less it will be ignored. |
| ***computeChecksum*** | Flag indicating if the checksum of the generated eas\_pdi.xml file should be computed. The importer will use the **SHA-256** algorithm and **base64** encoding. |
| ***triggerScript*** | Path to a custom script or batch file to be executed at the end of the import. |
| ***webServiceURL*** | Set a valid Webservice URL here if the SIP files should be transferred via InfoArchive webservices. If this is empty, no Webservice transfer will be done. |
| ***moveFilesToFolder*** | If set, successfully transferred files will be moved to another folder. (only webservice transfer related) |
| ***keepUntransferredSIPs*** | Enable this to keep SIPs that have produced an error while being transferred. Normally SIPs get deleted in case of an error. Transfer errors can be technical (e.g. connection lost) or caused by misconfiguration (e.g. attribute validation failed, the schema is missing). |
| ***numberOfThreads*** | Set the maximum number of threads to use for each Webservice transfer. Default is 10. |
| ***loggingLevel*** |

Logging level, 4-Debug, 3-Info, 2-Warning, 1-Error

Mandatory

|

## History, Reports, Logs

A complete history is available for any InfoArchive Importer job from the respective item’s History window. It is accessible through the History button/menu entry on the toolbar/context menu. The History window displays a list of all runs for the selected job together with additional information, such as the number of processed objects, the start and ending time and the status.

![](/files/-M7JP_j_O5QgLPFpf9py)

Double-clicking an entry or clicking the Open button on the toolbar opens the log file created by that run. The log file contains more information about the run of the selected job:

* Version information of the migration-center Server Components the job was run with
* The parameters the job was run with
* Execution Summary that contains the total number of objects processed, the number of documents and folders scanned or imported, and the count of warnings and errors that occurred during runtime.

Log files generated by the InfoArchive Importer can be found in the Server Components installation folder of the machine where the job was run, e.g. *…\fme AG\migration-center Server Components \\logs*

{% hint style="info" %}
The amount of information written to the log files depends on the setting specified in the *“loggingLevel”* start parameter for the respective job.
{% endhint %}

## Appendix

### Default Format of PDI File (eas\_pdi.xml)

```markup
BUILTIN\Administrators 2001-02-23T11:30:38.000 false \\vms2\TestData\Filesystem\1 doc doc1 123 Windows 7 doc1.docx 2001-02-23T11:30:38.000 docx BUILTIN\Administrators \\vms2\TestData\Filesystem\1 doc 2001-02-23T11:30:38.000 false doc1 123 Windows 7 Second Doc.docx 2001-02-23T11:30:38.000 docx BUILTIN\Administrators 2001-02-23T11:30:38.000 false \\vms2\TestData\Filesystem\1 doc doc1 123 Windows 7 doc1.docx 2001-02-23T11:30:38.000 docx BUILTIN\Administrators \\vms2\TestData\Filesystem\1 doc 2001-02-23T11:30:38.000 false doc1 123 Windows 7 Second Doc.docx 2001-02-23T11:30:38.000 docx
```

### Sample PDI schema (XSD)

```markup
```

### Sample PDI Transformation Style Sheet (XSL)

```markup

```