Auto Classification Module
This document is a guide on the auto classification module for migration-center 3.
Creating transformation rules for vast amounts of documents can be a laborious task. This module can support the user by automatically classifying documents into predefined categories. Configuring this module might require basic knowledge of machine learning principles.
The installation process can be performed in a few minutes, but two mandatory prerequisites must be installed before installing the module.
The module requires a Java runtime environment of version 8 or higher and Python v3.7. Make sure to configure the environment variables PATH and PYTHONPATH for Python during the installation process in the interactive installation dialog. To do this, you need to check the box “Add Python 3.7 to PATH” at the first installation window. If you are unsure, read the official installation instructions.
You can configure the environment variables later, but it is mandatory for the module. More information can be found in the official Python documentation. If the variables are not configured, the module might not start or work as intended.
Along with Python, the installer installs the package manager “pip”. “pip” is a command-line tool and needs to be executed in the Windows CMD shell.
If the use of OCR (optical character recognition) is intended, Tesseract version 4 or 3.05 must be installed (see Installing TesseractOCR). Two other use cases require the installation of the graphical library Graphviz (see Installing Graphviz) or the Oracle Instant Client for database access.
In addition, it is possible to use a dedicated Apache Tika instance. If you would like to do so, please refer to the section Installing Apache Tika.
The operation of the Auto Classification Module requires configuration files. The samples subdirectory contains a set of example files.
The installation of the auto classification module can be summarized in seven steps.
Download and extract the software code from the zip archive onto your hard drive. Copy the files to a directory of your choice.
Open a CMD window and navigate to the path where the software was copied to or open a command window from the Windows Explorer.
Install all Python package requirements by executing the following statement in the CMD:
pip install -r requirements.txt
(optional) Install Oracle Instant Client (see here)
(optional) Install TesseractOCR (see Installing TesseractOCR)
Create the “config.ini” file, as described in section Deploying the auto classification module.
Provide a classifier configuration file. The configuration schema of the file is specified in Classifier configuration.
Start the module by executing the init file in the CMD:
python __init__.py {path to config.ini}
The script accepts the file path to the “config.ini” file as the only parameter. If the file path is not provided, the script assumes that the “config.ini” file is in the same directory.
Open the GUI in a web browser:
http://{server ip}:{WSGI port}/app
After executing the script, the service boots. Please wait until the service has started successfully before executing any further operations. The CMD window needs to stay open.
If additional resources are available or the classification is time critical, it is advised to install the AC-Module and Apache Tika with TesseractOCR on two different servers, although all programs can be executed on the same machine.
TesseractOCR is a software for Optical Character Recognition. It extracts text from image files and images embedded in PDF and Office documents. Tesseract should be used if any scanned PDF files are to be classified or documents contain images or diagrams with valuable information.
If scanned PDF files are to be classified but OCR is not installed, the Auto Classification Module will not be able to extract any text from those files. On the other hand, if it is known that documents contain images with valuable text for the classification process, OCR should be used as well. Valuable text or valuable words in a document are words that might be unique to a small set of documents and make the prediction of the correct category easier.
First, download the Windows installer from the official GitHub repository. When the download is completed, execute the installer. When asked what language packs should be downloaded, select the languages that are going to be classified.
You can also add additional languages later on.
Add the installation path into your environment Path variable.
First, download the Linux installer from the official GitHub repository or install it via apt. Installation instructions can be found in the project’s wiki.
By default, the module is shipped with the capability to download Apache Tika from the official download page on first use. Tika will be started on the same machine as the module and no changes to the installation procedure are generally required.
If no connection to the internet can be established, the same procedure as starting Apache Tika from a dedicated server can be applied.
To use Apache Tika on a dedicated server, please download the tika-server.jar file from the official repository first.
Afterwards, the tika-server.jar file must be executed and the port must be open for outside communication.
The connection details must be provided to the module via “config.ini”. The configuration file “config.ini” holds three variables that need to be edited. The variables can be found in the Tika section.
“tika_client” needs to be set to true.
“tika_client_port” needs to be set to the correct port number.
“host” needs to be set to the correct name / IP address.
If Tika runs on port 9000 in a dedicated process on the same machine, the configuration looks as follows:
[TIKA]
tika_client: true
tika_client_port: 9000
host: 127.0.0.1
Graphviz is an additional library for creating charts and plots. It is not mandatory and is only required if the attribute structure of a document repository is investigated and a chart should be created. If Graphviz is properly set up, the script cli_show_distribution.py will automatically create a chart of the attribute structure.
First, Graphviz needs to be downloaded and installed.
Second, the Python wrapper for Graphviz needs to be installed in the Python environment.
On Linux, pygraphviz can easily be installed with pip:
pip install pygraphviz
For Windows OS, an installation file is provided. Usually, the installation file is already used when installing all library requirements from the “requirements.txt” file. The installation file is a whl file and can be found in the directory “install_packages/”. The whl file can be installed with the CMD command:
pip install pygraphviz-1.5-cp37-cp37m-win_amd64.whl
The supplied whl file is compiled for 64-bit Windows and Python 3.7. Other compiled whl files for Windows can be downloaded from the GitHub repository.
The auto classification module offers powerful functionality to enhance migration projects. However, using the module can be tricky. Therefore, this document provides in-depth information for using and configuring the module in a production environment.
The module needs to be deployed as a service. Auto classification is a memory-consuming task and it is advised to run the module on a dedicated server with access to as much memory as possible. If OCR is performed, it is advised to use a GPU, because character recognition on a CPU is very time-consuming.
To deploy the auto classification module, a configuration file must be configured. The file can be placed in any accessible UNC file path. If no configuration file is provided during the initial startup, the module will look for a configuration file in the working directory. In this case, the file must be named config.ini.
The following two chapters provide information on how to set up SSL encryption and a valid path to an algorithm configuration file.
Template files are provided in the sample directory of the module.
The configuration file contains two port properties. One property is defined in the FLASK_SERVER section and the other in the WSGI_SERVER section.
The Flask server is only used for development.
Therefore, if you like to change the port number of the AC module, set the port property under the WSGI_SERVER section. The definition can look like this:
port: 443
By default, the port is set to 3000. This means that every interaction with the module goes through port 3000, including API requests and interaction with the web app.
The module can provide encryption of HTTP traffic with SSL. It is highly advised to use this feature to prevent data interception. Please keep in mind that either HTTPS or HTTP can be used; running both modes at the same time is not supported.
SSL encryption can be activated by defining the following line in the SSL chapter of the configuration file:
ssl.enabled: true
To use SSL encryption, it is mandatory to provide a valid SSL certificate. The required properties are:
ssl.certificate_file_path
ssl.key_file_path
If you do not have access to a valid certificate, you can create one on your own. You can use the program keytool to create a self-signed certificate. The program is shipped with the Java runtime and can be found in the ${JAVA_HOME}/bin folder.
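Assuming the SSL section follows the same key/value style as the Tika example above, the configuration might look as follows (the section name and the file paths are placeholders; please consult the provided template files for the exact layout):
[SSL]
ssl.enabled: true
ssl.certificate_file_path: C:\certs\ac-module.crt
ssl.key_file_path: C:\certs\ac-module.key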
To be able to use the auto classification module, a classifier configuration file must be provided. The file can be specified with the property file_path in the config.ini file under the section ALGORITHM_CONFIG.
The file’s schema and configuration options are described thoroughly in the section Classifier configuration below.
The module is a Python program. Therefore, starting the module is very easy. Just enter the following command:
python __init__.py {file path to config.ini}
A classifier is a combination of an algorithm with a transformation list. Before a classifier can be used, it must be configured and trained.
As a basic principle, one must know that no algorithm is optimal for every situation, however sophisticated and powerful it might be. Each algorithm has advantages and disadvantages, and one algorithm can solve a given problem better than another. Therefore, the most significant boost and most suitable adjustment can be achieved by providing only the most meaningful data to an algorithm. This is what a transformation list can be employed for.
A transformation list is used to select the most meaningful words (feature selection) and to create additional data from files (feature extraction).
A transformation list can be interpreted as a pipeline for the words of a document before they are submitted to an algorithm.
To explain the value of transformation lists and their relationship to algorithms, a simple example is depicted. The figure below displays the relationship between algorithms and transformation lists. Each entity has a unique id within their type. The list with id 1 enables three transformations:
lower-case
tf-idf
stopwords
The second list (id: 2) uses the same transformations and adds the usage of OCR. It is not important in which order the transformations are listed, as transformation lists are only used to enable transformations and not to define an ordered sequence.
Next to the transformation lists are three algorithms. Each algorithm points to exactly one transformation list, but one transformation list can be referenced by many algorithms. Based on these relationships, one can reason that three classifiers are defined. If each classifier is tested with the same documents, each of them will compute different results.
A second example displays how transformation lists are applied to documents. During the stages of training, testing and prediction, raw documents are supplied. These unprocessed documents need to be transformed by the pipeline of the transformation list.
First, the stop words are removed from the documents. Then, all upper-case letters are replaced by their lower-case equivalents. Third, the tf-idf values are computed. In the end, the documents are processed and can be delivered to the SVM. The order of the pipeline transformations depends on which transformations are selected.
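To make the pipeline idea concrete, the following sketch shows a comparable pipeline built with scikit-learn (a conceptual illustration only, not the module’s internal implementation; the sample documents and labels are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stop word removal, lower-casing and tf-idf are covered by the vectorizer;
# the processed features are then passed to the SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("svm", SVC(kernel="linear", C=1.0)),
])

documents = ["Invoice for March", "Meeting protocol for project Alpha"]
labels = ["invoice", "protocol"]
pipeline.fit(documents, labels)
print(pipeline.predict(["Invoice for April"]))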
All classifiers of the auto classification module are configured in a single XML file. The main XML elements are
algorithms
transformation-lists
common-config
The following chapters aim to explain the configuration possibilities in detail.
Also, a template and sample XML file are provided.
Inside the “algorithms” element the individual algorithms are specified. The supported algorithms are:
SVM (element name: “svm”)
Naïve Bayes Multinomial (element name: “naive-bayes-multinomial”)
Logistic regression (element-name: “logistic-regression”)
All algorithms must have an id attribute (attribute name: “id”) with a unique value and an attribute referencing the corresponding transformation list (attribute name: “transformationListId”). Some algorithms support the prediction of multiple attributes (called multi-output). If multi-output behaviour is desired, the attribute “multi-output” must be set to true.
Additionally, every algorithm offers attributes to configure its behavior.
The SVM supports the following attributes:
kernel (value of “linear”, “poly” or “rbf”)
degree (positive integer, affects only poly-kernel)
gamma (positive double, affects only rbf-kernel)
c (positive float)
Naïve Bayes Multinomial offers the following attributes:
alpha (positive float, affects only Naïve Bayes)
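As an illustration only (a hedged sketch based on the element and attribute names above; the provided template and sample XML files remain the authoritative reference for the exact structure), a configured algorithms element could look like this:
<algorithms>
    <svm id="1" transformationListId="1" multi-output="true" kernel="linear" c="1.0"></svm>
    <naive-bayes-multinomial id="2" transformationListId="1" alpha="1.0"></naive-bayes-multinomial>
    <logistic-regression id="3" transformationListId="2" multi-output="true"></logistic-regression>
</algorithms>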
A transformation list is a group of transformation functions. It must have a unique id value. Each transformation function operates on the content of documents and transforms the content to a specific output. The functions can be used to optimize the process of feature selection. The order in which they are specified in the XML file does not matter. The functions are applied before an algorithm is trained. Not every function is supported by every algorithm (see the support overview below).
On the following pages every transformation function is described.
With the “common-config” element, global properties can be configured. These properties are accessible to all classifiers. Currently it does not offer any configuration parameters.
The module offers the functionality to analyze a dataset and perform classification tasks with migration-center. The functionality can be consumed through a web service with an HTTP/HTTPS interface, with a range of provided Python scripts, or within selected adapters.
The API always responds with a body encoded in JSON format.
In addition, a Web-UI is provided with the Auto Classification module. The UI offers a clean and simple interface to interact with the module and it offers the same functionality as the scripts. The UI can be reached with the following URL:
https://{server-ip}:{WSGI port}/app
The following chapters explain the usage of the functionality to analyze a dataset and to train, validate, test and reinforce a classifier. Furthermore, they describe how to make predictions for unclassified documents and how to query the status of a training or reinforcement process.
This document only describes how to use the API and not the UI. However, all parameters described in this chapter can be used in the UI as well.
The following chapters explain the module usage by using scan run ids from the migration-center database. It is not always a desired practice to access the database. Therefore, the alternative method is to use a unified metadata file. A unified metadata file can be created when using the Filesystem Importer of the migration-center. The metadata file contains all necessary information for the Auto-Classification module.
To use a unified metadata file, the user needs to make the following changes to the HTTP request:
Replace the “scan-run” part of the URL with “unified-metadata-file”.
Send the request with the mimetype “form-data”
Send the metadata file with the form key “file”.
(optional) Remove the properties of database connection and scan-run-id.
Before validating and training a classifier, the training dataset needs to be investigated. The module offers two different reports.
The distribution report creates a bar plot for every attribute, with the frequencies of the values on the y-axis.
A distribution report can be created with an HTTP POST request to the URL:
https://{server-ip}:{port}/api/distribution-report/scan-run
It is required to supply three parameters with the request:
id
dbCon
classAttrs
The id is the scan run id, the database connection must be a JDBC connection string and the class attributes must be a string of attribute names separated by commas.
The request returns a PDF report.
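Such a request can, for instance, be sent with the Python requests library (a hedged sketch; host, port, scan run id and the connection string are placeholders, and verify=False is only needed with a self-signed certificate):
import requests

response = requests.post(
    "https://localhost:3000/api/distribution-report/scan-run",
    data={
        "id": 432,
        "dbCon": "{JDBC connection string}",
        "classAttrs": "document_type,department",
    },
    verify=False,
)
# The response body is the generated PDF report.
with open("distribution_report.pdf", "wb") as report:
    report.write(response.content)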
The hierarchy report is much more detailed. It allows the user to specify an attribute hierarchy and creates a graph which gives an insight into the frequency of every attribute value per hierarchy layer.
A hierarchy report can be created with an HTTP POST request to the URL:
https://{server-ip}:{port}/api/hierarchy-report/scan-run
It is required to supply three parameters with the request:
id
dbCon
classAttrs
The id and database connection parameters are the same as with the distribution report. The class attributes parameter needs to be a json object dumped as a string. It can look like the following example:
{
  "type": {
    "subtype": {}
  }
}
In the example, type is the most general attribute and subtype is dependent on the individual values of the type attribute.
The hierarchy report is returned as a PNG file.
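Because the classAttrs parameter must be a JSON object dumped as a string, it can, for example, be prepared like this before being added to the request payload (illustration only):
import json

# The hierarchy dict from the example above, serialized to a string for the
# classAttrs payload property.
class_attrs = json.dumps({"type": {"subtype": {}}})
print(class_attrs)  # {"type": {"subtype": {}}}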
After analyzing a dataset, one might want to remove certain documents or attribute combinations from the dataset. The dataset splitter functionality supports this process.
The dataset splitter can split a single unified metadata file into two separate files. Furthermore, it can filter documents by attribute values.
It can either be used through the web service API or the web app.
The API function is available through an HTTP POST request to the following URL:
https://{server-ip}:{port}/api/dataset-splitter/unified-metadata-file
With the request, a metadata file (key: file) must be supplied.
Additionally, the user must define a splitting percentage by which the file is split into two files. The key for the payload property is “trainingPercentage” and it must be between 0 and 100. If the value is 0, all data will be stored in the test file. If the value is 50, the data is split in half and stored in both files. If the value is 100, all data is stored in the training file.
An optional third parameter is “exclusions”. This parameter allows the user to define attribute combinations that are prohibited in the newly created metadata files. The dataset splitter excludes documents based on this parameter. The parameter must be a json object with the following structure:
[ { "attribute": "document_type", "values": ["concept", "protocol"] } ]
“document_type” is an example of an attribute name, as are “concept” and “protocol” examples of values.
The asterisk (*) is a wildcard value for the values property. If the asterisk is defined, all documents that hold the attribute are ignored, no matter what attribute value they actually have.
The response object is a zip archive which contains two XML files. The two XML files are valid unified metadata files. One file starts with the keyword “TRAINING”, the other with “TEST”.
A short example explains when the exclusion parameter can be used:
A user analyzed a dataset and found out that for the attribute “document_type”, the values “concept” and “protocol” have a very low frequency and are not classifiable. The user wants to remove them from the dataset, as the aim is to build a classifier on “document_type”, and the concept and protocol values only increase the noise.
The user therefore uses the dataset splitter and supplies the raw metadata file. As the “trainingPercentage”, a value of 100 is set. This will save all documents’ metadata into the training file and leave the test file empty. The “exclusions” parameter is defined just as seen above. The configuration leads to the result that every document that has either “concept” or “protocol” as its “document_type” value is ignored. These documents will not be saved in the new training file.
Therefore, the returned zip archive contains a full training file and an empty test file. The expected outcome is that the new metadata training file has less noise and the classification of the “document_type” performs better.
As the exclusion parameter is a list, the user can specify more than one exclusion entry. The entries are not connected, meaning that not all entries must match; if one entry matches a document, the document is ignored.
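The scenario above could be scripted roughly as follows (a hedged sketch using the requests library; host, port and file names are placeholders):
import json
import requests

# Keep all documents in the training file (trainingPercentage = 100) and drop
# the two noisy document_type values described above.
exclusions = [{"attribute": "document_type", "values": ["concept", "protocol"]}]
with open("raw_metadata.xml", "rb") as metadata:
    response = requests.post(
        "https://localhost:3000/api/dataset-splitter/unified-metadata-file",
        files={"file": metadata},
        data={"trainingPercentage": 100, "exclusions": json.dumps(exclusions)},
        verify=False,
    )
# The response is a zip archive containing the TRAINING and TEST files.
with open("split_result.zip", "wb") as archive:
    archive.write(response.content)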
The following chapters only mention specific parameters of the processes.
Two parameters are supported for the training, validation, testing and reinforcement process without being mentioned explicitly:
use-serialized-documents
description
This parameter takes a Boolean value.
If it is true, the module will serialize and store documents after their content has been extracted. This happens before any transformation functions are applied. If a document has already been stored on the filesystem, the stored file is loaded instead of re-extracting the content. This is helpful if OCR needs to be performed and the classifier parameters are being tuned.
One must define the “document_storage_path” setting in the config.ini file under the TIKA section.
Every process can have a description. This is helpful to identify a process later.
The training of a classifier is an essential part of auto classification. The module can be trained by performing an HTTP POST request to the URL
https://{server-ip}:{port}/api/train
It is required to supply five parameters with the request:
scan-run-id
db-con
class-attr
algorithm-id
model-target-path
The scan run id is the id of a completed scan run inside of migration center. The documents within the scan run are used to train the classifier. If the scan run does not contain all documents for the training process, the classifier can later be reinforced (see chapter 4.9).
The parameter “db-con” is a connection string to the Oracle database of migration center. The string consists of the following structure:
{account name}/{password}@{host}:{port}/{instance}
The third parameter specifies the class attribute of the documents. If the classifier should predict multiple attributes, the attribute names must be separated by a semicolon.
The algorithm id references the configured algorithm inside the classification configuration file from section Classifier configuration.
The last attribute is a file path to the location of the trained classifier. This location can be any UNC-accessible path. This file path is used for testing, reinforcement and prediction.
Depending on the number of documents, the length of the documents and the embedded images, the training requires a significant amount of time. Because of that, the initial request returns a process id after the parameters have been validated. The actual training process starts after the process id has been returned. The id can be used to query the module for the status of the training process. Please refer to section Status on how to query the process status.
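As an alternative to the provided batch script (described below), the training request can also be sent directly from Python (a hedged sketch; the parameter keys follow the list above, the values are taken from the train.cmd example, and the exact shape of the JSON response is an assumption):
import requests

response = requests.post(
    "https://localhost:3000/api/train",
    data={
        "scan-run-id": 432,
        "db-con": "fmemc/fmemc@localhost:1521/xe",
        "class-attr": "ac_class",
        "algorithm-id": 1,
        "model-target-path": r"C:\Temp\ac-module-dev.model",
    },
    verify=False,
)
print(response.json())  # contains the process id used for status queries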
It is advised to use the provided batch script train.cmd to train a classifier. The script requires six parameters:
Scan run id
Connection string to migration center database
Class attribute name (of the documents in the scan run id)
Algorithm id (as defined in the XML file)
Model path (the trained model will be saved in the location)
IP of the auto classification module service (optional)
If the IP is not defined, the script assumes that the service runs on localhost.
An example usage of the script can look like this:
train.cmd 432 fmemc/fmemc@localhost:1521/xe ac_class 1 C:\Temp\ac-module-dev.model
The module will use the documents within scan run “432” to train the classifier with algorithm id “1” to be able to predict the categories from the attribute name “ac_class”.
A second example shows how to train a classifier to predict two attributes:
train.cmd 432 fmemc/fmemc@localhost:1521/xe ac_class;department 1 C:\Temp\ac-module-dev.model
This classifier will predict the attribute "ac_class" and the "department".
Testing a classifier requires an already trained classifier.
A test process can be started by performing a POST request to the URL
https://{server-ip}:{port}/api/test
It is mandatory to supply four parameters:
scan-run-id
db-con
class-attr
model-path
The scan run id is the identifier of a completed scan run inside of migration center. The documents within the scan run are used to test the classifier.
The parameter “db-con” is a connection string to the Oracle database of migration center.
The third parameter specifies the class attributes of the documents, separated by a semicolon.
The last parameter specifies a UNC file path to a trained model file.
The module responds with an overall precision value and a list of tested documents. For every document, the actual and predicted classes and confidence values are provided.
A validation process combines a training and a test process. The process splits the provided dataset into k packages and performs k training batches. The user defines the value for the k parameter (common values are 10 or 5).
In every training batch, one package is used as a validation package and the remaining packages are bundled into a training dataset. The validation package works like a test dataset. Because the number of packages and batches is equal, each package is used as a testing package exactly once.
This process allows tuning algorithm parameters without using the actual test dataset.
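The idea of k-fold cross-validation can be illustrated with a generic scikit-learn example (conceptual only, not the module’s implementation):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# The dataset is split into k = 5 packages; each package is used once for
# validation while the remaining four are used for training.
features, classes = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel="linear"), features, classes, cv=5)
print(scores.mean(), scores.std())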
You can start a validation process with a POST request to
https://{server-ip}:{port}/api/validate
It is mandatory to provide six parameters
scan-run-id
db-con
class-attr
model-path
algorithm-id
k
After the parameters are validated, the module responds with a process id before starting the validation. The process id can be used to query the current state of the process. Please refer to section Status on how to query a process state.
The grid search validation process automates the tuning of algorithm parameters. It works similarly to the general validation process, but uses a fixed, non-changeable k value of 5 and will not create a process report.
In exchange, the user can define multiple values for each algorithm parameter. The module will run a validation process for every element of the cross product of the values and return the precision and standard deviation of the predictions.
A grid search validation process can be started with a POST request to
https://{server-ip}:{port}/api/validate-grid-search-cv
It is mandatory to provide six parameters
scan-run-id
db-con
class-attr
model-path
algorithm-id
grid-search-config
The “grid-search-config” parameter is a string and must have the following schema:
Parameters must be separated by an ampersand (&).
Parameter values must be separated by a semicolon (;).
Therefore a sample parameter looks like the following:
C=0.5;1;2;5&degree=1;2;3;4
Because the process will apply every element of the cross product of the parameter values, a total of 16 validation processes are started.
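The following sketch shows how such a configuration string can be assembled and why it results in 16 combinations (illustration only):
from itertools import product

grid = {"C": [0.5, 1, 2, 5], "degree": [1, 2, 3, 4]}
# Parameters are joined with "&", their values with ";".
config = "&".join(
    name + "=" + ";".join(str(value) for value in values)
    for name, values in grid.items()
)
print(config)                              # C=0.5;1;2;5&degree=1;2;3;4
print(len(list(product(*grid.values()))))  # 16 validation processes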
The process of reinforcement is similar to the train process. The only difference is that it uses an already trained model file and retrains the algorithm with the newly provided documents.
You can start reinforcement with a POST request to
https://{server-ip}:{port}/api/reinforce
It is mandatory to provide four parameters
scan-run-id
db-con
class-attr
model-path
The scan run id is the id of a completed scan run inside of migration center. The documents within the scan run are used to reinforce the classifier.
The parameter “db-con” is a connection string to the Oracle database of migration center.
The third parameter specifies the class attribute of the documents.
The last parameter specifies a UNC file path to a trained model file.
After the parameters are validated, the module responds with a process id before starting the reinforcement. The process id can be used to query the current state of the process. Please refer to section Status on how to query a process state.
The prediction of a document’s class can easily be done in two ways: with an HTTP API request or via the File System Scanner in migration-center.
The class of an unclassified document can be predicted with a GET request to
https://{server-ip}:{port}/api/predict
It is mandatory to provide two parameters
document-path
model-path
The module responds with a classification and a confidence value.
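A prediction request might look like this with the requests library (hedged sketch; host, port, the document path and the exact shape of the JSON response are placeholders or assumptions):
import requests

response = requests.get(
    "https://localhost:3000/api/predict",
    params={
        "document-path": r"\\fileserver\share\unclassified_document.pdf",
        "model-path": r"C:\Temp\ac-module-dev.model",
    },
    verify=False,
)
print(response.json())  # classification and confidence value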
To be able to use the file system scanner for document prediction, it is necessary to train a classifier beforehand. Also, three file system scanner parameters must be configured.
The parameters are:
acModelFilePath
acServiceLocation
acUseClassification
The model file path points to a UNC file path of the trained classifier model file. The service location is the URI address of the deployed auto classification module. Finally, the usage of classification must be enabled by ticking the parameter “acUseClassification”.
Now, a file system can be scanned by the scanner and predictions are performed automatically. The results can be accessed by viewing the attributes of the scanned objects (see figure below). In this case, the attribute “ai_class” holds the value of the predicted class. The classifier expresses its confidence in its own prediction with a value between 0 and 100; the higher the value, the greater the confidence.
If one is unsure which attribute has been predicted and where its confidence value column is, the naming scheme is simple: the predicted values are displayed in columns starting with “ai_”, followed by the original attribute name. The confidence value column is named “confidence_”, again followed by the original attribute name.
When a training or reinforcement process is initialized, a process id is returned. The id can be used to query the status of the process at any time until the module is shut down.
The current state of a process can be retrieved by performing a GET request to the URL
https://{server-ip}:{port}/api/processes/{process_id}
It requires one parameter to be supplied
process-id
The response contains a status, message, number of processed documents, number of provided documents and list of document results.
The status can have any of the following options:
STARTED
READING_DOCUMENTS
TRAINING
REINFORCING
FINISHED
BAD_REQUEST_ERROR
INTERNAL_SERVER_ERROR
ABORTED
If the status is an error, the message property always offers an explanation.
The “number of processed documents” indicates how many documents have already been transformed by the functions in the transformation list. As soon as all documents have been transformed, the module changes the status to TRAINING.
The “number of provided documents” shows how many documents are in the scan run.
If the process is of type reinforcement, the “number of provided documents” is the sum of documents from the already trained model and the provided scan run.
The document results indicate whether an error occurred while processing a particular document.
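A simple polling loop could look like this (hedged sketch; the JSON field names “status” and “message” are assumptions based on the description above, and the process id placeholder must be replaced with a real id):
import time
import requests

process_id = "{process id returned by the train or reinforce request}"
url = f"https://localhost:3000/api/processes/{process_id}"
final_states = {"FINISHED", "ABORTED", "BAD_REQUEST_ERROR", "INTERNAL_SERVER_ERROR"}

while True:
    state = requests.get(url, verify=False).json()
    print(state.get("status"), state.get("message"))
    if state.get("status") in final_states:
        break
    time.sleep(30)  # poll every 30 seconds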
Please keep in mind that a process status is not stored permanently. As soon as the module is shut down or rebooted, the status of every process is lost.
This chapter explains common and known issues that one might run into.
If the module does not extract embedded images from PDF files, there can be several reasons:
Check if the PdfParser property “extract-inline-images” is set to true.
Check if the PdfParser property “ocr-strategy” is set to “ocr_only” or “ocr_and_text”. If the module has been configured through the WebApp, the strategy is automatically set to “ocr_only” if “extract-inline-images” is true.
Check if the OCR property “enable-image-processing” is set to 1.
If only a subset of the embedded images is extracted, the issue can be missing file type support for JPG files. Apache Tika uses the software “PDFBox” to process PDF files. The JPG file type is not supported by PDFBox out of the box, because the license of the required code is not compatible with the Apache license and prohibits its usage in commercial software projects.
In some circumstances Apache Tika does not start although it is properly configured in the config.ini file. Start an investigation as follows:
Check if the used ports are not occupied.
Check if the tika-server.jar file can be downloaded by the script over the internet (internet connection required). If not, download the newest tika-server.jar file and save it in a directory of your choice (e.g. C:\Temp). If necessary, rename the file to “tika-server.jar” explicitly. Now, you must define the Apache Tika environment variable TIKA_PATH with the path to the directory of the tika-server.jar file:
SET TIKA_PATH=C:\Temp\
python __init__.py samples\config
Start a dedicated Tika server from the Windows CMD terminal. Download the newest tika-server.jar file from the official website and execute this command:
java -jar tika-server.jar --config={path to tika config xml file} --port=9000
If Java cannot be started at all, try to use the absolute path to the Java executable. If this fixes the issue, you must set the Apache Tika environment variable TIKA_JAVA with the absolute path to the java.exe file before starting the module, like so:
SET TIKA_JAVA=C:\Program Files\Java\{jre version}\bin
python __init__.py samples\config
If you are confronted with an exception that is not listed here, please get in contact with our technical product support at support@migration-center.com.
Multi-output support per algorithm:
SVM: Yes
Naïve Bayes Multinomial: No
Logistic regression: Yes
Transformation function support per algorithm (SVM / Naïve Bayes Multinomial):
lower-case: Yes / Yes
tf-idf: Yes / Yes
stopwords: Yes / Yes
length-filter: Yes / Yes
n-gram: Yes / Yes
ocr: Yes / Yes
pdf-parser: Yes / Yes
token-pattern-replace-filter: Yes / Yes
document-frequency: Yes / Yes
The transformation functions are described below with their XML element name, description, parameters and example code.

lower-case
Description: Transforms every upper-case letter to its lower-case equivalent.
Parameters: /
Example: <lower-case></lower-case>

tf-idf
Description: Uses the TF-IDF algorithm on the document content.
Parameters: /
Example: <tf-idf></tf-idf>

document-frequency
Description: Filters the words by their frequency in a document. If a word has a frequency lower than the min-value or greater than the max-value, it is not selected for further processing. At least one value must be provided when using this function.
Parameters: min-value (positive integer), max-value (positive integer)
Example: <document-frequency min-value="2" max-value="50"></document-frequency>

n-gram
Description: Splits the document content into n-grams of n = length. The default are uni-grams with a length of 1.
Parameters: length (positive integer)
Example: <n-gram length="1"></n-gram>

stopwords
Description: Analyses the document content for stop words and removes them from the content.
Parameters: child element “languages” (please consult the example code)
Example: <stopwords> <languages> <language name="german"></language> <language name="english"></language> </languages> </stopwords>

length-filter
Description: Ignores all words that do not have the specified minimum length.
Parameters: min-length (positive integer)
Example: <length-filter min-length="4"></length-filter>

ocr
Description: Performs OCR (optical character recognition) analysis on images in a document. The library TesseractOCR in version 3.04 is used for this purpose. The analysis uses a lot of CPU power and can require several minutes for a single document (depending on the number and size of images in the document).
Parameters: language (three-letter language code as defined by TesseractOCR), enable-image-processing (boolean), density (integer)
Example: <ocr language="deu" enable-image-processing="true" density="300"></ocr>

pdf-parser
Description: Extracts the text content from PDF files. If used in combination with OCR, extracted images are automatically analyzed.
Parameters: extract-inline-images (boolean), extract-unique-inline-images-only (boolean), sort-by-position (boolean), ocr-dpi (integer), ocr-strategy (string)
Example: <pdf-parser extract-inline-images="true" extract-unique-inline-images-only="false" sort-by-position="true" ocr-dpi="300" ocr-strategy="ocr_and_text"></pdf-parser>

token-pattern-replace-filter
Description: Replaces words that match a regex pattern with a defined string.
Parameters: pattern (regex), replacement (characters)
Example: <token-pattern-replace-filter pattern="(\w)\1+" replacement=" "></token-pattern-replace-filter>