Amazon Comprehend Enrich Scanner

Introduction

The Amazon Comprehend Enrich Scanner is one of the source adapters available in migration-center starting with version 3.17. It is a special adapter that enriches objects scanned by other source adapters with data computed by Amazon Comprehend. The supported Comprehend classifiers are the Dominant Language Classifier, the Entities Classifier, and the Custom Classifier.

The scanner module works as a job that can be run at any time and can even be executed repeatedly. For every run, a detailed history and log file are created.

A scanner is defined by a unique name, a set of configuration parameters and an optional description.

Amazon Comprehend Enrich Scanners can be created, configured, started, and monitored through migration-center Client, but the corresponding processes are executed by migration-center Job Server.

Known Issues and Limitations

  • The scanner extracts the text again in enrich mode, even if it was previously run in simulation mode.

  • The entities and language classifiers can be run in the same scan run, but in that case the entities classifier does not take into account the language detected by the dominant language classifier. The attribute generated by the language classifier can be used by the entities classifier only if the language classifier was run in an earlier scan run. How to use a source attribute as the language for the entities classifier is described in Classifiers Configuration.

Scanner Properties

To create a new Amazon Comprehend Enrich Scanner job, click the New Scanner button and select "AmazonComprehendEnrich" from the adapter type dropdown list. Once the adapter type has been selected, the parameters list is populated with the Amazon Comprehend Enrich Scanner parameters.

The Properties window of a scanner can be accessed by double-clicking the scanner in the list or selecting the Properties button or entry from the toolbar or context menu.

Common Scanner Parameters

Configuration parameters and their descriptions:

  • Name - Enter a unique name for this scanner. Mandatory.

  • Adapter type - Select the "AmazonComprehendEnrich" adapter from the list of available adapters. Mandatory.

  • Location - Select the Job Server location where this job should run. Job Servers are defined in the Jobserver window. If no Job Server is selected, migration-center will prompt the user to define a Job Server location when saving the scanner. Mandatory.

  • Description - Enter a description for this scanner (optional).

Amazon Comprehend Enrich Scanner Parameters

Configuration parameters and their descriptions:

  • publicKey - The Amazon public key used to create a connection to AWS.

  • privateKey - The Amazon private key. This must be the counterpart of the public key.

  • region - The Amazon region used to create the connection to AWS.

  • executeClassifiers - Flag indicating whether the classifiers are executed. If this parameter is not checked, the scanner runs in simulation mode; otherwise, the classifier jobs are started in Comprehend.

  • inputS3Uri - The S3 location where the text files will be uploaded.

  • outputS3Uri - The S3 location where the output of the classifiers will be stored.

  • kmsKeyId - The ARN of the customer managed key used to encrypt the data in S3. Example: arn:aws:kms:eu-central-1:0908887578777:key/d484ee92-ffff1-444e-bcbb0-7cccceffcffc

  • dataAccessRoleArn - The ARN of the role that has Comprehend as a trusted entity. Example: MCDEMO_Comprehend

  • deleteFiles - Flag indicating whether the files in S3 will be deleted. If this parameter is checked, the files in inputS3Uri and outputS3Uri are deleted.

  • jobRunId* - The ID of the job that scanned the objects to be enriched. The jobRunId must exist. Mandatory.

  • configurationFile - The location of the file in which the classifiers are configured. This parameter is mandatory when the executeClassifiers parameter is checked. The way to configure the classifiers is detailed in Classifiers Configuration.

  • loggingLevel* - Sets the verbosity of the log file. Mandatory. Values:

    1 - logs only errors during the scan

    2 - is the default value, reporting all warnings and errors

    3 - logs all successfully performed operations in addition to any warnings or errors

    4 - logs all events (for debugging only; use only if instructed by fme product support, since it generates a very large amount of output. Do not use in production)

Classifiers Configuration

The classifiers are configured using an XML file with a predefined structure.

The supported classifiers are divided into two types: standard classifiers and custom classifiers. There is a predefined XML structure for each classifier type. An example of this configuration file can be found in \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\classifiers-config.xml.

For every classifier, you can specify whether the score should be displayed by using the XML attribute "displayScore". Specify this attribute only if you want the score to be stored as an attribute in migration-center; otherwise it can be omitted, because the default value is false.

Standard Classifier

Two standard classifiers are supported. They are distinguished by the XML attribute "type".

  • Dominant Language Classifier

The structure of this classifier is presented in the following block. The "threshold" XML element is mandatory and is used to filter the values. If the score for a specific language is lower than the threshold value, the language is not saved in the database.

<standard_classifier type="language" displayScore="true">
<threshold>0.8</threshold>
</standard_classifier>
  • Entities Classifier

The structure for the entities classifier is presented in the following block.

<standard_classifier type="entities">
<threshold>0.6</threshold>
<language>de</language>
<entityRecognizerArn>arn:aws:comprehend:eu-central-1:0000000000:entities-classifiers/docClassifier-copy</entityRecognizerArn>
<entities>DATE,PERSON</entities>
</standard_classifier>

The XML sub-elements are:

  1. threshold - is used to filter the entities. If the entity score is lower than the threshold value, the entity is not saved in the database.

  2. language - is a mandatory parameter that specifies the language of the documents. If the documents are in different languages, a source attribute can be used to specify the language. The attribute name must be prefixed with the $ character, e.g. $aws_language.

  3. entityRecognizerArn - specifies a custom entity classifier to be used instead of the standard one.

  4. entities - specifies the entities that will be saved in the database. An entity that is not present in this list is ignored by the scanner.
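
The effect of the "threshold", "language" and "entities" sub-elements can be illustrated with a short Python sketch. It is only an illustration of the rules described above, not the scanner's implementation: the function names, the dictionary layout of a detected entity, and the sample values are hypothetical.

```python
def resolve_language(configured, source_attributes):
    """Return a literal language code, or look up a $-prefixed source attribute."""
    if configured.startswith("$"):
        # "$aws_language" refers to the source attribute named "aws_language".
        return source_attributes[configured[1:]]
    return configured


def filter_entities(detected, threshold, allowed_types):
    """Keep entities whose type is listed and whose score is not below the threshold."""
    return [
        entity for entity in detected
        if entity["type"] in allowed_types and entity["score"] >= threshold
    ]


# Values as they would appear in <entities> and <threshold>.
allowed = {name.strip() for name in "DATE,PERSON".split(",")}
detected = [
    {"type": "PERSON", "text": "Anna Schmidt", "score": 0.97},
    {"type": "DATE", "text": "2021", "score": 0.41},       # below the threshold
    {"type": "LOCATION", "text": "Berlin", "score": 0.99},  # not in the entities list
]

language = resolve_language("$aws_language", {"aws_language": "de"})
kept = filter_entities(detected, 0.6, allowed)
```

In this sketch only the PERSON entity is kept: the DATE entity falls below the threshold and LOCATION is not part of the entities list.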

Custom Classifier

The Custom Classifier is used to classify documents using user-defined categories. The scanner allows multiple custom classifiers to be used in the same scan run.

The XML sub-element "classifierEndpointArn" is mandatory and specifies the Amazon Resource Name (ARN) of the custom classifier.

The "threshold" sub-element is used to filter the classes. If the class score is lower than the threshold value, the attribute is not saved in the database. The attribute name in the database will be "aws_className_awsJobId".

<custom_classifier displayScore="true">
<classifierEndpointArn>arn:aws:comprehend:eu-central-1:000000000:document-classifier/docClassifier-copy</classifierEndpointArn>
<threshold>0.7</threshold>
</custom_classifier>
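
As a rough illustration of how such an element could be read and applied, the following Python sketch parses a single custom_classifier element and filters a set of class scores. It is not the scanner's implementation. The function names and the sample class names are hypothetical, and treating a missing threshold as "keep everything" is an assumption. Only one element is parsed because the root element of the configuration file is not shown here.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<custom_classifier displayScore="true">
<classifierEndpointArn>arn:aws:comprehend:eu-central-1:000000000:document-classifier/docClassifier-copy</classifierEndpointArn>
<threshold>0.7</threshold>
</custom_classifier>
"""


def read_custom_classifier(xml_text):
    """Read one custom_classifier element into a plain dict."""
    element = ET.fromstring(xml_text)
    arn = element.findtext("classifierEndpointArn")
    if not arn or not arn.strip():
        raise ValueError("classifierEndpointArn is mandatory")
    threshold_text = element.findtext("threshold")
    return {
        "arn": arn.strip(),
        # Assumption: a missing threshold keeps every class.
        "threshold": float(threshold_text) if threshold_text else 0.0,
        # displayScore defaults to false when the attribute is omitted.
        "display_score": element.get("displayScore", "false").lower() == "true",
    }


def keep_classes(classes, threshold):
    """Drop classes whose score is lower than the threshold."""
    return {name: score for name, score in classes.items() if score >= threshold}


config = read_custom_classifier(FRAGMENT)
kept = keep_classes({"INVOICE": 0.91, "CONTRACT": 0.42}, config["threshold"])
```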

Using the Amazon Comprehend Enrich Scanner

The Amazon Comprehend Enrich Scanner can be run in two modes: simulation mode and enrich mode.

In both modes, the scanner extracts the text using Tika, with Tesseract for OCR. OCR is disabled by default, but it can be enabled by the user. More information can be found in the chapters Tika Configuration and Tesseract OCR Configuration.

We recommend running the scanner in simulation mode first to estimate the cost before running it in enrich mode to extract the Comprehend attributes.

Simulation Mode

To run the scanner in simulation mode, leave the parameter "executeClassifiers" unchecked. To see the information generated by the scanner, set the parameter "loggingLevel" to 3 or 4.

The scanner extracts the text from documents locally and computes the number of characters and units to help the user to estimate the cost of classifiers execution.
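
The kind of estimate produced in simulation mode can be sketched in Python as follows. This is not the scanner's code. The unit size of 100 characters and the minimum of 3 units per document are assumptions based on how AWS has described Comprehend pricing units; check the current AWS pricing before relying on them. The function name and the sample documents are hypothetical.

```python
import math


def estimate_units(text, chars_per_unit=100, min_units=3):
    """Return (characters, units) for one extracted text.

    chars_per_unit and min_units are assumptions, not values taken from the scanner.
    """
    characters = len(text)
    units = max(min_units, math.ceil(characters / chars_per_unit))
    return characters, units


# Hypothetical extracted texts, keyed by file name.
documents = {"contract.txt": "x" * 250, "report.txt": "x" * 1001}
totals = {name: estimate_units(text) for name, text in documents.items()}
total_units = sum(units for _, units in totals.values())
```

Multiplying the total number of units by the per-unit price of each configured classifier gives a rough cost estimate for the enrich run.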

The information generated during execution is present in the report log. An example of a report log is present in the following image.

Enrich Mode

To run the scanner in enrich mode you need to check the parameter "executeClassifiers".

The scanner first extracts the text from the documents locally. The text files are then uploaded to S3 at inputS3Uri. Next, the classifier jobs are started, and their results are saved in S3 at outputS3Uri. Finally, the scanner downloads the result files and saves the results in the database.

The following image presents the attributes in the database after one scan run in enrich mode with the standard entities classifier and the dominant language classifier.
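
To show how the scanner parameters relate to a Comprehend job, the following Python sketch builds the request for an asynchronous entities detection job as accepted by the boto3 call start_entities_detection_job. How the scanner itself calls AWS is not documented here, so the mapping is an assumption, as is the input format ONE_DOC_PER_FILE. The function name, bucket names, role ARN, and job name are hypothetical.

```python
def build_entities_job_request(params, language_code, job_name, recognizer_arn=None):
    """Map scanner-style parameters to the keyword arguments of an entities job."""
    request = {
        "JobName": job_name,
        "LanguageCode": language_code,
        "DataAccessRoleArn": params["dataAccessRoleArn"],
        # Assumption: each uploaded text file holds exactly one document.
        "InputDataConfig": {
            "S3Uri": params["inputS3Uri"],
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": params["outputS3Uri"]},
    }
    if params.get("kmsKeyId"):
        request["OutputDataConfig"]["KmsKeyId"] = params["kmsKeyId"]
    if recognizer_arn:
        # Only set when a custom entity recognizer replaces the standard one.
        request["EntityRecognizerArn"] = recognizer_arn
    return request


# Hypothetical values for illustration only.
params = {
    "inputS3Uri": "s3://example-bucket/input/",
    "outputS3Uri": "s3://example-bucket/output/",
    "kmsKeyId": "arn:aws:kms:eu-central-1:000000000000:key/example",
    "dataAccessRoleArn": "arn:aws:iam::000000000000:role/MCDEMO_Comprehend",
}
request = build_entities_job_request(params, "de", "mc-entities-example")

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   client = boto3.client("comprehend", region_name="eu-central-1")
#   job_id = client.start_entities_detection_job(**request)["JobId"]
```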

Tika Configuration

The Tika library is used by the scanner to extract the text from documents.

The scanner provides a Tika configuration file that contains all the parsers necessary to extract the text from all office documents. The user can modify the configuration file if further tuning is required. The file is located at \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\tika-config.xml.

The "OOXMLParser" is used for office documents such as docx, and the "PDFParser" is used for PDF documents. The default configuration provided by the Tika library is used for other document types.

More information about the configuration can be found at https://tika.apache.org/1.26/configuring.html.

Tesseract OCR Configuration

Tesseract OCR is used to extract the text from embedded images and from image files. The documentation of this library is available at https://tesseract-ocr.github.io/tessdoc/Home.html.

To install this library, you can follow the article https://medium.com/quantrium-tech/installing-and-using-tesseract-4-on-windows-10-4f7930313f82. The executable file can be downloaded from https://sourceforge.net/projects/tesseract-ocr-alt/files/ or https://digi.bib.uni-mannheim.de/tesseract/.

After installing Tesseract, set tesseractPath and tessdataPath in the TesseractOCRConfig.properties file. Example:

tesseractPath=C:\\Users\\user\\AppData\\Local\\Tesseract-OCR
tessdataPath=C:\\Users\\user\\AppData\\Local\\Tesseract-OCR\\tessdata

Tesseract is disabled by default. To enable it, follow these steps:

  • Open tika-config.xml and remove the line <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/> from the DefaultParser element, so that it looks as follows:

<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.executable.ExecutableParser" />
<parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
<parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
</parser>
  • Set the value of the ocrStrategy parameter of the PDFParser to ocr_and_text:

<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<param name="extractInlineImages" type="bool">true</param>
<param name="sortByPosition" type="bool">true</param>
<param name="extractUniqueInlineImagesOnly" type="bool">false</param>
<param name="ocrStrategy" type="string">ocr_and_text</param>
<param name="ocrImageType" type="string">rgb</param>
<param name="ocrDPI" type="int">100</param>
</params>
</parser>

Additional Configuration

To configure additional parameters that apply to all scanner runs, a configuration file (internal-configuration.properties) is provided in the folder …\lib\mc-aws-comprehend-scanner. The following setting is available:

  • waiting_time_between_requests - The time in seconds that the scanner waits between two requests to Amazon for the status of a Comprehend classifier job. Example: waiting_time_between_requests=10 means that the scanner makes a request and, if the status is "in progress", waits 10 seconds before making another request to check the status.
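
The polling behaviour described above can be sketched in Python as follows. This is an illustration, not the scanner's code: the function name and the injected get_status and sleep callables are hypothetical, and the status names are those used by Amazon Comprehend for asynchronous jobs.

```python
import time


def wait_for_job(get_status, waiting_time_between_requests=10, sleep=time.sleep):
    """Poll get_status() until the job leaves the submitted/in-progress states."""
    while True:
        status = get_status()
        if status not in ("SUBMITTED", "IN_PROGRESS"):
            return status
        # Wait the configured number of seconds before asking again.
        sleep(waiting_time_between_requests)


# Simulated status sequence; a real caller would query Comprehend instead.
statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
waits = []
final_status = wait_for_job(lambda: next(statuses), 10, sleep=waits.append)
```

A larger value reduces the number of status requests sent to AWS, at the price of noticing a finished job later.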

Log Files

A complete history of any Amazon Comprehend Enrich Scanner job is available from the respective item's History window. It is accessible through the History button/menu entry on the toolbar/context menu. The History window displays a list of all runs for the selected job together with additional information, such as the number of processed objects, the start and end time, and the status.

Double-clicking an entry or clicking the Open button on the toolbar opens the log file created by that run. The log file contains more information about the run of the selected job:

  • Version information of the migration-center Server Components the job was run with

  • The parameters the job was run with

  • Execution Summary that contains the total number of objects processed for every classifier and errors that occurred during runtime.

Log files generated by the Amazon Comprehend Enrich Scanner can be found in the Server Components installation folder of the machine where the job was run, e.g. …\fme AG\migration-center Server Components <Version>\logs

The amount of information written to the log files depends on the setting specified in the ‘loggingLevel’ start parameter for the respective job.