Amazon Comprehend Enrich Scanner
The Amazon Comprehend Enrich Scanner is one of the source connectors available in migration-center starting with version 3.17. It is a special connector that enriches objects scanned by other source connectors with data computed by Amazon Comprehend.
The supported Comprehend classifiers are the Dominant Language Classifier, the Entities Classifier and the Custom Classifier.
- The scanner extracts the text again in enrich mode, even if it was previously run in simulation mode.
- The entities and language classifiers can be run in the same scan run, but in that case the entities classifier does not take into account the language detected by the dominant language classifier. The attribute generated by the language classifier can be used by the entities classifier if the language classifier has already been run before the entities classifier. How to use a source attribute as the entities language attribute is described in Classifiers Configuration.
To create a new Amazon Comprehend Enrich Scanner job, click the New Scanner button and select "AmazonComprehendEnrich" from the adapter type dropdown list. Once the adapter type has been selected, the parameter list is populated with the Amazon Comprehend Enrich Scanner parameters.
The Properties window of a scanner can be accessed by double-clicking the scanner in the list or selecting the Properties button or entry from the toolbar or context menu.
Enter a unique name for this scanner
Select the “AmazonComprehendEnrich” connector from the list of available connectors
Select the Job Server location where this job should run. Job Servers are defined in the Jobserver window. If no Job Server was selected, migration-center will prompt the user to define a Job Server Location when saving the scanner.
Enter a description for this scanner (optional)
The Amazon public key used to create a connection to AWS.
The Amazon private key. It must be the counterpart of the public key.
The Amazon region used to create the connection to AWS.
Flag indicating whether the classifiers will be executed. If this parameter is not checked, the scanner runs in simulation mode; otherwise the classifier jobs are started in Comprehend.
The S3 location where the text files will be uploaded.
The S3 location where the output of the classifier will be located.
The ARN of the customer managed key used to encrypt the data in S3.
The ARN of the role that has Comprehend as a trusted entity.
Flag indicating whether the files in S3 will be deleted. If the parameter is checked, the files from inputS3Uri and outputS3Uri are deleted.
The id of the job that scanned the objects to be enriched. The jobRunId must exist.
The location of the file in which the classifiers are configured. This parameter is mandatory when the executeClassifiers parameter is checked. How to configure the classifiers is described in Classifiers Configuration.
Sets the verbosity of the log file.
1 - logs only errors during scan
2 - logs all warnings and errors (default value)
3 - logs all successfully performed operations in addition to any warnings or errors
4 - logs all events (for debugging only; use only if instructed by fme product support, since it generates a very large amount of output. Do not use in production)
The classifiers are configured using an XML file. The file has a predefined structure that exposes the available settings of each classifier.
The supported classifiers are divided into two types: standard classifiers and custom classifiers. There is a predefined XML structure for each classifier type. An example of this configuration file can be found in \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\classifiers-config.xml.
For every classifier, you can specify whether the score should be displayed by using the XML attribute "displayScore". Set this attribute only if you want the score as an attribute in migration-center; otherwise it can be omitted, because the default value is false.
There are two standard classifiers. They are distinguished by the XML attribute "type".
- Dominant Language Classifier
The structure of this classifier is presented in the following block. The "threshold" XML element is mandatory and is used to filter the values. If the score for a specific language is lower than the threshold value, the language is not saved in the database.
<standard_classifier type="language" displayScore="true">
- Entities Classifier
The structure for the entities classifier is presented in the following block.
The XML sub-elements are:
- threshold - is used to filter the entities. If the entity score is lower than the threshold value, the entity is not saved in the database.
- language - is a mandatory parameter that specifies the language of the documents. If the documents are in different languages, a source attribute can be used to specify the language. The attribute name must be prefixed with the $ character, e.g. $aws_language.
- entityRecognizerArn - specifies a custom entity recognizer to use instead of the standard one.
- entities - specifies the entities that will be saved in the database. If an entity is not present in the entities list, it is ignored by the scanner.
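The interaction of these sub-elements can be illustrated with a minimal Python sketch. It is not the scanner's implementation: the function names, the dictionary layout of the entities and the way the source attribute is looked up are assumptions made for illustration only. It shows a "language" value being taken either literally or from a source attribute when it starts with $, and entities being kept only if they reach the threshold and appear in the configured entities list.

```python
def resolve_language(language_setting, source_attributes):
    """Return the language code for one document.

    A value starting with '$' is treated as the name of a source attribute
    (hypothetical lookup; the scanner's internal logic is not documented here).
    """
    if language_setting.startswith("$"):
        return source_attributes[language_setting[1:]]
    return language_setting


def filter_entities(entities, threshold, allowed_types):
    """Keep entities whose score is not lower than the threshold
    and whose type is present in the configured entities list."""
    return [
        e for e in entities
        if e["score"] >= threshold and e["type"] in allowed_types
    ]


# Illustrative data only, not real Comprehend output.
entities = [
    {"text": "fme AG", "type": "ORGANIZATION", "score": 0.99},
    {"text": "Monday", "type": "DATE", "score": 0.95},
    {"text": "Bob", "type": "PERSON", "score": 0.40},
]

language = resolve_language("$aws_language", {"aws_language": "de"})
kept = filter_entities(entities, threshold=0.8,
                       allowed_types={"ORGANIZATION", "PERSON"})
print(language, [e["text"] for e in kept])
```

In this sketch "Monday" is dropped because DATE is not in the entities list, and "Bob" is dropped because its score is below the threshold.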
The Custom Classifier is used to classify documents into custom-created categories. The scanner allows multiple custom classifiers to be used in the same scan run.
The XML sub-element "classifierEndpointArn" is mandatory and specifies the Amazon Resource Name (ARN) of the custom classifier.
The "threshold" sub-element is used to filter the classes. If the class score is lower than the threshold value, the attribute is not saved in the database. The attribute name in the database will be "aws_className_awsJobId".
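As a rough illustration of the threshold rule and the attribute naming pattern, the following Python sketch builds attribute names for the classes that pass the threshold. It assumes a literal reading of the pattern "aws_className_awsJobId" (the prefix, then the class name, then the Comprehend job id). The function name, the data layout and the sample job id are hypothetical.

```python
def custom_class_attribute_names(classes, threshold, aws_job_id):
    """Return the attribute names for classes whose score is not lower
    than the threshold (assumed reading of 'aws_className_awsJobId')."""
    return [
        f"aws_{c['name']}_{aws_job_id}"
        for c in classes
        if c["score"] >= threshold
    ]


# Illustrative data only.
classes = [
    {"name": "INVOICE", "score": 0.91},
    {"name": "CONTRACT", "score": 0.05},
]
names = custom_class_attribute_names(classes, threshold=0.5,
                                     aws_job_id="1234abcd")
print(names)
```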
The Amazon Comprehend Enrich Scanner can be run in two modes: simulation mode and enrich mode.
In both modes the scanner uses Tika to extract the text and Tesseract for OCR. OCR is disabled by default, but it can be activated by the user. More information can be found in the chapters Tika Configuration and Tesseract OCR Configuration.
We recommend running the scanner in simulation mode first to analyze the cost before running it to extract the Comprehend attributes.
To run the scanner in simulation mode, the parameter "executeClassifiers" must be unchecked. To see the information generated by the scanner, set the parameter "loggingLevel" to 3 or 4.
The scanner extracts the text from documents locally and computes the number of characters and units to help the user to estimate the cost of classifiers execution.
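The following Python sketch shows how such an estimate could be computed. The exact rule used by the scanner is not documented here. The sketch assumes the unit definition published on the Amazon Comprehend pricing page at the time of writing (one unit per 100 characters, with a minimum of 3 units per document), so check the current AWS pricing before relying on the numbers. All names are illustrative.

```python
import math

UNIT_SIZE = 100   # characters per unit (assumption, see text above)
MIN_UNITS = 3     # minimum units charged per document (assumption)


def units_for(text):
    """Number of units for one document, rounded up, with a minimum."""
    return max(MIN_UNITS, math.ceil(len(text) / UNIT_SIZE))


def estimate(texts):
    """Return (total characters, total units) for a list of extracted texts."""
    total_chars = sum(len(t) for t in texts)
    total_units = sum(units_for(t) for t in texts)
    return total_chars, total_units


# Two dummy documents with 250 and 1001 characters.
chars, units = estimate(["a" * 250, "b" * 1001])
print(chars, units)
```

Multiplying the unit count by the current per-unit price of each configured classifier gives a rough cost figure for one scan run.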
The information generated during execution is written to the report log. An example of a report log is shown in the following image.
To run the scanner in enrich mode you need to check the parameter "executeClassifiers".
First, the Amazon Comprehend Enrich Scanner extracts the text from the documents locally. The text files are then uploaded to S3 at inputS3Uri. The classifier jobs are started and their results are saved in S3 at outputS3Uri. Finally, the scanner downloads the result files and saves the results in the database.
The following image shows the attributes in the database after one scan run in enrich mode with the standard entities classifier and the dominant language classifier.
The Tika library is used by the scanner to extract the text from documents.
The scanner provides a Tika configuration file that contains all parsers necessary to extract the text from all office documents. The user can modify the configuration file if further tuning is needed. The file is located at \fme AG\migration-center Server Components <Version>\lib\mc-aws-comprehend-scanner\tika-config.xml.
The "OOXMLParser" is used for office documents such as docx and the "PDFParser" is used for PDF documents. The default configuration provided by the Tika library is used for other document types.
More information about the configuration can be found at https://tika.apache.org/1.26/configuring.html.
Tesseract OCR is used to extract the text from embedded images and from image files. The documentation of this library is available at https://tesseract-ocr.github.io/tessdoc/Home.html.
To install this library you can follow the article: https://medium.com/quantrium-tech/installing-and-using-tesseract-4-on-windows-10-4f7930313f82. The executable file can be downloaded from https://sourceforge.net/projects/tesseract-ocr-alt/files/ or https://digi.bib.uni-mannheim.de/tesseract/.
After installing Tesseract, set tesseractPath and tessdataPath in the TesseractOCRConfig.properties file. Example:
Tesseract is disabled by default. To enable it, follow these steps:
- Open tika-config.xml and remove from DefaultParser the line <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>, so that only the following excludes remain:
<parser-exclude class="org.apache.tika.parser.executable.ExecutableParser" />
<parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
<parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
- Change the value of the ocrStrategy XML element of PDFParser to ocr_and_text.
<param name="extractInlineImages" type="bool">true</param>
<param name="sortByPosition" type="bool">true</param>
<param name="extractUniqueInlineImagesOnly" type="bool">false</param>
<param name="ocrStrategy" type="string">ocr_and_text</param>
<param name="ocrImageType" type="string">rgb</param>
<param name="ocrDPI" type="int">100</param>
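The two steps above can be checked with a small Python script such as the sketch below. It is not part of the product. It only assumes the general Tika configuration layout (parser-exclude elements and param elements with a name attribute), and the embedded sample is a shortened, hypothetical stand-in for the shipped tika-config.xml, which may contain more entries.

```python
import xml.etree.ElementTree as ET

TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser"


def ocr_status(config_xml):
    """Report whether Tesseract is still excluded and which ocrStrategy is set."""
    root = ET.fromstring(config_xml)
    excluded = any(
        e.get("class") == TESSERACT for e in root.iter("parser-exclude")
    )
    strategy = None
    for param in root.iter("param"):
        if param.get("name") == "ocrStrategy":
            strategy = (param.text or "").strip()
    return {"tesseract_excluded": excluded, "ocr_strategy": strategy}


# Shortened sample with OCR enabled (illustrative only).
sample = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>"""

status = ocr_status(sample)
print(status)
```

OCR is enabled for PDF documents when tesseract_excluded is False and ocr_strategy is ocr_and_text. To check a real file, read its content and pass it to ocr_status.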
Additional parameters that apply to all scanner runs can be set in the configuration file internal-configuration.properties, which is provided in the folder …\lib\mc-aws-comprehend-scanner. The following settings are available:
The time, in seconds, that the scanner waits between two requests to Amazon for the status of a Comprehend classifier job.
Example: waiting_time_between_requests=10 means that the scanner makes a request and, if the status is "in progress", waits 10 seconds before making another request to check the status.
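The polling behaviour described above can be sketched in Python as follows. This is a simplified illustration, not the scanner's code: the function wait_for_job and the get_status callable are hypothetical, and the status names follow the job status values used by Amazon Comprehend (SUBMITTED, IN_PROGRESS, COMPLETED, FAILED).

```python
import time

PENDING = {"SUBMITTED", "IN_PROGRESS"}


def wait_for_job(get_status, waiting_time_between_requests, sleep=time.sleep):
    """Poll get_status() until the job leaves the pending states.

    Returns the final status and the number of status requests made.
    """
    requests_made = 0
    while True:
        status = get_status()
        requests_made += 1
        if status not in PENDING:
            return status, requests_made
        # Job still running: wait before asking again.
        sleep(waiting_time_between_requests)


# Simulated job that completes on the third request; the waits are
# recorded instead of actually sleeping.
statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])
waits = []
final, requests_made = wait_for_job(lambda: next(statuses), 10,
                                    sleep=waits.append)
print(final, requests_made, waits)
```

A larger value reduces the number of status requests sent to AWS but delays the moment at which the scanner notices that a job has finished.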