This document is a guide to the auto classification module for migration-center 3.
Creating transformation rules for vast amounts of documents can be a laborious task. This module supports the user by automatically classifying documents into predefined categories. Configuring the module might require basic knowledge of machine learning principles.
The installation process can be completed in a few minutes. Before installing the module, however, two mandatory prerequisites must be installed.
The module requires a Java runtime environment of version 8 or higher and Python v3.7. Make sure to configure the environment variables PATH and PYTHONPATH for Python in the interactive installation dialog. To do this, check the box “Add Python 3.7 to PATH” in the first installation window. If you are unsure, read the official installation instructions.
You can configure the environment variables later, but they are mandatory for the module. More information can be found in the official Python documentation. If the variables are not configured, the module might not start or work as intended.
Along with the Python language, the installer installs the package manager “pip”. “pip” is a command-line tool and needs to be executed in the Windows CMD shell.
If the use of OCR (optical character recognition) is intended, Tesseract version 4 or 3.05 must be installed (see Installing TesseractOCR). Two other use cases require the installation of the graphical library Graphviz (see Installing Graphviz) or the Oracle Instant Client for database access.
In addition, it is possible to use a dedicated Apache Tika instance. If you would like to do that, please refer to the section Installing Apache Tika.
The operation of the Auto Classification Module requires configuration files. The samples subdirectory contains a set of example files.
The installation of the auto classification module can be summarized in seven steps.
1. Download and extract the software code from the zip archive onto your hard drive and copy the files to a directory of your choice.
2. Open a CMD window and navigate to the path where the software was copied to, or open a command window from the Windows Explorer.
3. Install all Python package requirements by executing the following statement in the CMD:
pip install -r requirements.txt
(optional) Install the Oracle Instant Client (see here)
(optional) Install TesseractOCR (see Installing TesseractOCR)
4. Create the “config.ini” file, as described in section Deploying the auto classification module.
5. Provide a classifier configuration file. The configuration schema of the file is specified in Classifier configuration.
6. Start the module by executing the init file in the CMD:
python __init__.py {path to config.ini}
The script accepts the file path to the “config.ini” file as the only parameter. If no file path is provided, the script assumes that the “config.ini” file is located in the same directory.
7. Open the GUI in a web browser:
http://{server ip}:{WSGI port}/app
After executing the script, the service boots. Please wait until the service has started successfully before executing any further operations. The CMD window needs to stay open.
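If the startup should be verified programmatically, a small Python script can poll the web app URL until the module answers. This is only a minimal sketch: it assumes that the GUI responds with HTTP 200 once the boot has finished, and the host and port are placeholders for {server ip} and {WSGI port}.
import time
import requests

APP_URL = "http://127.0.0.1:3000/app"  # placeholder for http://{server ip}:{WSGI port}/app

for _ in range(30):  # poll for up to ~5 minutes
    try:
        if requests.get(APP_URL, timeout=5).status_code == 200:
            print("Auto Classification Module is up.")
            break
    except requests.ConnectionError:
        pass  # service is still booting
    time.sleep(10)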
If additional resources are available or the classification is time critical, it is advised to install the AC-Module and Apache Tika with TesseractOCR on two different servers, although all programs can be executed on the same machine.
TesseractOCR is a software for Optical Character Recognition. It extracts text from image files and images embedded in PDF and Office documents. Tesseract should be used if any scanned PDF files are to be classified or documents contain images or diagrams with valuable information.
If scanned PDF files are to be classified but OCR is not installed, the Auto Classification Module will not be able to extract any text from those files. Likewise, if it is known that documents contain images with text that is valuable for the classification process, OCR should be used as well. Valuable text or valuable words in a document are those words that might be unique to a small set of documents and make the prediction of the correct category easier.
First, download the Windows installer from the official GitHub repository. When the download is completed, execute the installer. When asked which language packs should be downloaded, select the languages of the documents that are going to be classified.
Additional languages can also be added later on.
Add the installation path to your PATH environment variable.
First, download the Linux installer from the official GitHub repository or install the package via apt. Installation instructions can be found in the project’s wiki.
By default, the module is shipped with the capability to download Apache Tika from the official download page on first use. Tika will be started on the same machine as the module, and generally no changes to the installation procedure are required.
If no connection to the internet can be established, the same procedure as for running Apache Tika on a dedicated server can be applied.
To use Apache Tika on a dedicated server, please download the tika-server.jar file from the official repository first.
Afterwards, the tika-server.jar file must be executed and its port must be open for outside communication.
The connection details must be provided to the module via “config.ini”. The configuration file “config.ini” holds three variables that need to be edited. The variables can be found in the Tika section.
“tika_client” needs to be set to true.
“tika_client_port” needs to be set to the correct port number.
“host” needs to be set to the correct name / IP address.
If Tika runs on port 9000 in a dedicated process on the same machine, the configuration looks as follows:
[TIKA]
tika_client: true
tika_client_port: 9000
host: 127.0.0.1
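To verify that the dedicated Tika instance configured above is reachable before starting the module, a short Python check can be used. This is a minimal sketch; it assumes Apache Tika server’s standard /tika greeting endpoint, and host and port correspond to the example configuration.
import requests

TIKA_HOST = "127.0.0.1"  # value of "host" in config.ini
TIKA_PORT = 9000         # value of "tika_client_port" in config.ini

# A running Tika server answers GET /tika with HTTP 200 and a short greeting.
response = requests.get(f"http://{TIKA_HOST}:{TIKA_PORT}/tika", timeout=5)
print(response.status_code, response.text)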
Graphviz is an additional library for creating charts and plots. It is not mandatory and is only required if the attribute structure of a document repository is to be investigated and a chart should be created. If Graphviz is properly set up, the script cli_show_distribution.py will automatically create a chart of the attribute structure.
First, Graphviz needs to be downloaded and installed.
Second, the Python wrapper for Graphviz needs to be installed in the Python environment.
On Linux, pygraphviz can easily be installed with pip in a terminal:
pip install pygraphviz
For Windows OS, an installation file is provided. Usually, the installation file is already used when installing all library requirements from the “requirements.txt” file. The installation file is a whl file and can be found in the directory “install_packages/”. The whl file can be installed with the CMD command:
pip install pygraphviz-1.5-cp37-cp37m-win_amd64.whl
The supplied whl file is compiled for 64-bit Windows with Python 3.7. Other compiled whl files for Windows can be downloaded from the GitHub repository.
The auto classification module offers powerful functionality to enhance migration projects. However, using the module can be tricky. Therefore, this document provides in-depth information for using and configuring the module in a production environment.
The module needs to be deployed as a service. Auto classification is a memory-consuming task, and it is advised to run the module on a dedicated server with access to as much memory as possible. If OCR is performed, it is advised to use a GPU, because character recognition on a CPU is very time-consuming.
To deploy the auto classification module, a configuration file must be provided. The file can be placed in any accessible UNC file path. If no configuration file is provided during the initial startup, the module will look for a configuration file in the working directory. In this case, the file must be named config.ini.
The following two chapters provide information on how to set up SSL encryption and a valid path to an algorithm configuration file.
Template files are provided in the sample directory of the module.
The configuration file contains two port properties. One property is defined in the FLASK_SERVER section and the other in the WSGI_SERVER section.
The Flask server is only used for development.
Therefore, if you would like to change the port number of the AC module, set the port property in the WSGI_SERVER section. The definition can look like this:
port: 443
By default, the port is set to 3000. This means that every interaction with the module goes through port 3000, including API requests and interactions with the web app.
The module can encrypt HTTP traffic with SSL. It is highly advised to use this feature to prevent any data interception. Please keep in mind that either HTTPS or HTTP can be used; using both modes simultaneously is not supported.
SSL encryption can be activated by defining the following line in the SSL section of the configuration file:
ssl.enabled: true
To use SSL encryption, it is mandatory to provide a valid SSL certificate. The required properties are:
ssl.certificate_file_path
ssl.key_file_path
If you do not have access to a valid certificate, you can create one on your own. You can use the program keytool to create a self-signed certificate. The program is shipped with the Java runtime and can be found in the ${JAVA_HOME}/bin folder.
To be able to use the auto classification module, a classifier configuration file must be provided. The file can be specified with the property file_path in the config.ini file under the section ALGORITHM_CONFIG.
The file’s schema and configuration options are described thoroughly in the section Classifier configuration below.
The module is a Python program. Therefore, starting the module is very easy. Just enter the following command:
python __init__.py {file path to config.ini}
A classifier is a combination of an algorithm with a transformation list. Before a classifier can be used, it must be configured and trained.
As a basic principle, no algorithm is optimal for every situation, no matter how sophisticated and powerful it is. Each algorithm has advantages and disadvantages, and one algorithm can solve one problem better than another. Therefore, the most significant boost and the most suitable adjustment can be achieved by providing only the most meaningful data to an algorithm. This is what a transformation list can be employed for.
A transformation list is used to select the most meaningful words (feature selection) and to create additional data from files (feature extraction).
A transformation list can be interpreted as a pipeline for the words of a document before they are submitted to an algorithm.
To explain the value of transformation lists and their relationship to algorithms, a simple example is depicted. The figure below displays the relationship between algorithms and transformation lists. Each entity has a unique id within their type. The list with id 1 enables three transformations:
lower-case
tf-idf
stopwords
The second list (id: 2) uses the same transformations and adds the usage of OCR. It is not important in which order the transformations are listed, as transformation lists are only used to enable transformations and not to define an ordered sequence.
Next to the transformation lists are three algorithms. Each algorithm points to exactly one transformation list, but one transformation list can be referenced by many algorithms. Based on these relationships, one can reason that three classifiers are defined. If each classifier is tested with the same documents, each of them will compute different results.
A second example displays how transformation lists are applied to documents. During the stages of training, test and prediction, raw documents are supplied. These unprocessed documents need to be transformed by the pipeline of the transformation list.
First, the stop words of the documents are removed. Second, all upper-case letters are replaced by their lower-case equivalents. Third, the tf-idf values are computed. In the end, the documents are processed and can be delivered to the SVM. The order of the pipeline transformations may change depending on which transformations are selected.
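To make the pipeline idea more concrete, the following sketch shows what such a combination of transformations and an algorithm can look like in Python. It uses scikit-learn purely for illustration and is not the module’s internal implementation; the documents, labels and the choice of LinearSVC are made up for the example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# The vectorizer mirrors the transformation list: it lower-cases the text,
# removes (English) stop words and computes tf-idf values in one step.
classifier = Pipeline([
    ("transformations", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("svm", LinearSVC()),
])

# Raw training documents and their class attribute values (illustrative only).
documents = ["Invoice for March", "Meeting protocol from Monday", "Invoice for April"]
labels = ["invoice", "protocol", "invoice"]

classifier.fit(documents, labels)
print(classifier.predict(["Invoice for May"]))  # -> ['invoice']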
All classifiers of the auto classification module are configured in a single XML file. The main XML elements are
algorithms
transformation-lists
common-config
The following chapters aim to explain the configuration possibilities in detail.
Also, a template and sample XML file are provided.
Inside the “algorithms” element the individual algorithms are specified. The supported algorithms are:
SVM (element name: “svm”)
Naïve Bayes Multinomial (element name: “naive-bayes-multinomial”)
Logistic regression (element-name: “logistic-regression”)
All algorithms must have an id attribute (attribute name: “id”) with a unique value and an attribute referencing the corresponding transformation list (attribute name: “transformationListId”). Some algorithms support the prediction of multiple attributes (called multi-output). If multi-output behaviour is desired, the attribute “multi-output” must be set to true.
Algorithm                    multi-output support
SVM                          Yes
Naïve Bayes Multinomial      No
Logistic regression          Yes
Additionally, every algorithm offers attributes to configure its behavior.
The SVM supports the following attributes:
kernel (value of “linear”, “poly” or “rbf”)
degree (positive integer, affects only poly-kernel)
gamma (positive double, affects only rbf-kernel)
c (positive float)
Naïve Bayes Multinomial offers the following attributes:
alpha (positive float, affects only Naïve Bayes)
A transformation list is a group of transformation functions. It must have a unique id value. Each transformation function operates on the content of documents and transforms the content to a specific output. They can be used to optimize the process of feature selection. The order in which they are specified in the XML file does not matter. The functions are applied before an algorithm is trained. Not every function is supported by every algorithm:
Transformation function           SVM    Naïve Bayes Multinomial
lower-case                        Yes    Yes
tf-idf                            Yes    Yes
stopwords                         Yes    Yes
length-filter                     Yes    Yes
n-gram                            Yes    Yes
ocr                               Yes    Yes
pdf-parser                        Yes    Yes
token-pattern-replace-filter      Yes    Yes
document-frequency                Yes    Yes
On the following pages every transformation function is described.
lower-case
Description: Transforms every upper-case letter to its lower-case equivalent.
Parameters: none
Example code:
<lower-case></lower-case>

tf-idf
Description: Uses the TF-IDF algorithm on the document content.
Parameters: none
Example code:
<tf-idf></tf-idf>

document-frequency
Description: Filters the words by their frequency in a document. If a word has a frequency lower than the min-value or greater than the max-value, it is not selected for further processing. At least one value must be provided when using this function.
Parameters: min-value (positive integer), max-value (positive integer)
Example code:
<document-frequency min-value="2" max-value="50"></document-frequency>

n-gram
Description: Splits the document content into n-grams of n = length. The default is uni-grams with a length of 1.
Parameters: length (positive integer)
Example code:
<n-gram length="1"></n-gram>

stopwords
Description: Analyses the document content for stop words and removes them from the content.
Parameters: child element “languages” (please consult the example code)
Example code:
<stopwords>
  <languages>
    <language name="german"></language>
    <language name="english"></language>
  </languages>
</stopwords>

length-filter
Description: Ignores all words that do not have the specified minimum length.
Parameters: min-length (positive integer)
Example code:
<length-filter min-length="4"></length-filter>

ocr
Description: Performs OCR (optical character recognition) analysis on images in a document. The library TesseractOCR in version 3.04 is used for this purpose. The analysis uses a lot of CPU power and can require several minutes for a single document (depending on the number and size of images in the document).
Parameters: language (three-letter language code as defined by TesseractOCR), enable-image-processing (boolean), density (integer)
Example code:
<ocr language="deu" enable-image-processing="true" density="300"></ocr>

pdf-parser
Description: Extracts the text content from PDF files. If used in combination with OCR, extracted images are automatically analyzed.
Parameters: extract-inline-images (boolean), extract-unique-inline-images-only (boolean), sort-by-position (boolean), ocr-dpi (integer), ocr-strategy (string)
Example code:
<pdf-parser extract-inline-images="true" extract-unique-inline-images-only="false" sort-by-position="true" ocr-dpi="300" ocr-strategy="ocr_and_text"></pdf-parser>

token-pattern-replace-filter
Description: Replaces words that match a regex pattern with a defined string.
Parameters: pattern (regex), replacement (characters)
Example code:
<token-pattern-replace-filter pattern="(\w)\1+" replacement=" "></token-pattern-replace-filter>
With the “common-config” element, global properties can be configured. These properties are accessed by all classifiers. Currently, it does not offer any configuration parameters.
The module offers the functionality to analyze a dataset and perform classification tasks with migration-center. The functionality can be consumed through a web service via an HTTP/HTTPS interface, with a range of provided Python scripts, or within selected adaptors.
The API will always respond with a body encoded in JSON format.
In addition, a Web-UI is provided with the Auto Classification module. The UI offers a clean and simple interface to interact with the module and it offers the same functionality as the scripts. The UI can be reached with the following URL:
https://{server-ip}:{WSGI port}/app
The following chapters explain the usage of the functionality to analyze a dataset and to train, validate, test and reinforce a classifier. Furthermore, they describe how to make predictions for unclassified documents and how to query the status of a training or reinforcement process.
This document only describes how to use the API and not the UI. However, all parameters described in this chapter can be used in the UI.
The following chapters explain the module usage by using scan run ids from the migration-center database. Accessing the database directly is not always desired. Therefore, the alternative method is to use a unified metadata file. A unified metadata file can be created when using the Filesystem Importer of migration-center. The metadata file contains all necessary information for the Auto-Classification module.
To use a unified metadata file, the user needs to change the following things in an HTTP request (see the sketch after the list):
Replace the “scan-run” part of the URL with “unified-metadata-file”.
Send the request with the mimetype “form-data”
Send the metadata file with the form key “file”.
(optional) Remove the properties of database connection and scan-run-id.
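As an illustration of these changes, the following Python sketch sends a distribution report request (described in the next chapter) with a unified metadata file instead of a scan run id. The server address, file name and attribute name are placeholders, and sending the parameters as multipart form fields is an assumption.
import requests

BASE_URL = "https://192.168.0.10:3000"  # placeholder for {server-ip}:{port}

with open("unified_metadata.xml", "rb") as metadata:
    response = requests.post(
        # "scan-run" is replaced with "unified-metadata-file"
        f"{BASE_URL}/api/distribution-report/unified-metadata-file",
        files={"file": metadata},              # metadata file with the form key "file"
        data={"classAttrs": "document_type"},  # database connection and scan-run-id are omitted
        verify=False,                          # only if a self-signed certificate is used
    )
response.raise_for_status()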
Before validating and training a classifier, the training dataset needs to be investigated. The module offers two different reports.
The distribution report creates a barplot for every attribute with the frequencies of the values on the y axis.
A distribution report can be created with an HTTP POST request to the URL:
https://{server-ip}:{port}/api/distribution-report/scan-run
It is required to supply three parameters with the request:
id
dbCon
classAttrs
The id is the scan run id, the database connection must be a JDBC connection string, and the class attributes must be a string of attribute names separated by commas.
The request returns a PDF report.
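A minimal Python sketch of such a request is shown below. The host, port, scan run id and connection string are placeholders, and sending the parameters as form fields is an assumption.
import requests

BASE_URL = "https://192.168.0.10:3000"  # placeholder for {server-ip}:{port}

payload = {
    "id": "1234",                                # scan run id
    "dbCon": "mc_user/secret@dbhost:1521/orcl",  # placeholder connection string
    "classAttrs": "document_type,department",    # attributes separated with a comma
}

response = requests.post(
    f"{BASE_URL}/api/distribution-report/scan-run",
    data=payload,
    verify=False,  # only if a self-signed certificate is used
)
response.raise_for_status()

# The response body is the PDF report.
with open("distribution_report.pdf", "wb") as report:
    report.write(response.content)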
The hierarchy report is much more detailed. It allows specifying an attribute hierarchy and creates a graph that gives an insight into the frequency of every attribute value per hierarchy layer.
A hierarchy report can be created with an HTTP POST request to the URL:
https://{server-ip}:{port}/api/hierarchy-report/scan-run
It is required to supply three parameters with the request:
id
dbCon
classAttrs
The id and database connection parameters are the same as for the distribution report. The class attributes parameter needs to be a JSON object dumped as a string. It can look like the following example:
{
  "type": {
    "subtype": {}
  }
}
In the example, type is the most general attribute and subtype is dependent on the individual values of the type attribute.
The hierarchy report is returned as a png file.
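The following sketch requests a hierarchy report for the type/subtype hierarchy shown above. All values are placeholders and the form-field encoding is an assumption; the nested class attributes object is dumped to a string with json.dumps.
import json
import requests

BASE_URL = "https://192.168.0.10:3000"  # placeholder for {server-ip}:{port}

class_attrs = {"type": {"subtype": {}}}  # most general attribute on the outside

response = requests.post(
    f"{BASE_URL}/api/hierarchy-report/scan-run",
    data={
        "id": "1234",
        "dbCon": "mc_user/secret@dbhost:1521/orcl",
        "classAttrs": json.dumps(class_attrs),  # JSON object dumped as a string
    },
    verify=False,
)
response.raise_for_status()

with open("hierarchy_report.png", "wb") as report:
    report.write(response.content)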
After analyzing a dataset, one might want to remove certain documents or attribute combinations from the dataset. The dataset splitter functionality supports this process.
The dataset splitter can split a single unified metadata file into two separate files. Furthermore, it can filter documents by attribute values.
It can either be used through the web service API or the web app.
The API function is available through an HTTP POST request to the following URL:
https://{server-ip}:{port}/api/dataset-splitter/unified-metadata-file
With the request, a metadata file (key: file) must be supplied.
Additionally, the user must define a splitting percentage by which the file is split into two files. The key for the payload property is “trainingPercentage” and it must be between 0 and 100. If the value is 0, all data will be stored in the test file. If the value is 50, the data is split in half and stored in both files. If the value is 100, all data is stored in the training file.
An optional third parameter is “exclusions”. This parameter allows defining attribute combinations that are prohibited in the newly created metadata files. The dataset splitter excludes documents based on this parameter. The parameter must be a JSON object with the following structure:
[ { "attribute": "document_type", "values": ["concept", "protocol"] } ]
“document_type” is an example of an attribute name, as are “concept” and “protocol” examples of values.
The asterisk (*) is a wildcard value for the values property. If the asterisk is defined, all documents that hold the attribute are ignored, no matter which attribute value they actually have.
The response object is a zip archive which contains two xml files. The two xml files are valid unified metadata files. One file starts with the keyword “TRAINING”, the other with “TEST”.
A short example explains when the exclusion parameter can be used:
A user analyzed a dataset and found out that for the attribute “document_type”, the values “concept” and “protocol” have a very low frequency and they are not classifiable. He wants to remove them from the dataset as he aims to build a classifier on the “document_type”, but the concept and protocol values increase the noise.
He then uses the dataset splitter and supplies the raw metadata file. As the “trainingPercentage”, he sets a value of 100. This will save all documents’ metadata into the training file and leave the test file empty. The “exclusions” parameter is defined just as seen above. This configuration leads to the result that every document that has either “concept” or “protocol” as its “document_type” value is ignored. These documents will not be saved in the new training file.
Therefore, the returned zip archive contains a full training file and an empty test file. The theoretical outcome is that the new metadata training file has less noise and the classification of the “document_type” performs better.
As the exclusion parameter is a list, the user can specify more than one exclusion entry. The entries are not connected, which means that not all entries must match. If one entry matches a document, the document is automatically ignored.
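The scenario above can be reproduced with a short Python sketch. File names and the server address are placeholders, and the form-field encoding is an assumption.
import json
import requests

BASE_URL = "https://192.168.0.10:3000"  # placeholder for {server-ip}:{port}

# Exclude documents whose document_type is "concept" or "protocol".
exclusions = [
    {"attribute": "document_type", "values": ["concept", "protocol"]}
]

with open("unified_metadata.xml", "rb") as metadata:
    response = requests.post(
        f"{BASE_URL}/api/dataset-splitter/unified-metadata-file",
        files={"file": metadata},
        data={
            "trainingPercentage": 100,            # keep everything in the training file
            "exclusions": json.dumps(exclusions),
        },
        verify=False,
    )
response.raise_for_status()

# The response is a zip archive with a TRAINING and a TEST metadata file.
with open("split_result.zip", "wb") as archive:
    archive.write(response.content)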
The following chapters only mention specific parameters of the processes.
Two parameters are supported for the training, validation, testing and reinforcement process without being mentioned explicitly:
use-serialized-documents
description
use-serialized-documents
This parameter takes a Boolean value. If it is true, documents are serialized and stored after their content has been extracted. This happens before any transformation functions are applied. If a document has already been stored on the filesystem, the stored file is loaded instead of re-extracting the content. This is helpful if OCR needs to be performed and the classifier parameters are being tuned. To use this feature, the “document_storage_path” setting must be defined in the config.ini file under the TIKA section.
description
Every process can have a description. This is helpful to identify a process later.
The training of a classifier is an essential part of auto classification. The module can be trained by performing an HTTP POST request to the URL
https://{server-ip}:{port}/api/train
It is required to supply five parameters with the request:
scan-run-id
db-con
class-attr
algorithm-id
model-target-path
The scan run id is the id of a completed scan run inside of migration center. The documents within the scan run are used to train the classifier. If the scan run does not contain all documents for the training process, the classifier can later be reinforced (see chapter 4.9).
The parameter “db-con” is a connection string to the Oracle database of migration center. The string consists of the following structure:
{account name}/{password}@{host}:{port}/{instance}
The third parameter specifies the class attribute of the documents. If the classifier should predict multiple attributes, the attribute names must be separated by a semicolon.
The algorithm id references the configured algorithm inside the classification configuration file from section Classifier configuration.
The last parameter is a file path to the location where the trained classifier is stored. This location can be any UNC-accessible path. The file path is consumed for testing, reinforcement and prediction.
Depending on the number of documents, the length of the documents and the embedded images, the training requires a significant amount of time. Because of that, the initial request returns a process id after validating the parameters. The actual training process starts after the process id has been returned. The id can be used to query the module for the status of the training process. Please refer to section Status on how to query the process status.
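A minimal sketch of a training request is shown below. All values are placeholders, and sending the parameters as form fields as well as the exact shape of the returned JSON are assumptions.
import requests

BASE_URL = "https://192.168.0.10:3000"  # placeholder for {server-ip}:{port}

payload = {
    "scan-run-id": "1234",
    "db-con": "mc_user/secret@dbhost:1521/orcl",  # {account name}/{password}@{host}:{port}/{instance}
    "class-attr": "document_type;department",     # multiple attributes separated by a semicolon
    "algorithm-id": "1",                          # id from the classifier configuration XML
    "model-target-path": r"\\fileserver\models\doc_type_classifier",
    "description": "Initial training on scan run 1234",  # optional
}

response = requests.post(f"{BASE_URL}/api/train", data=payload, verify=False)
response.raise_for_status()

# The JSON response contains the process id used for status queries.
print(response.json())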
It is advised to use the provided batch script train.cmd to train a classifier. The script requires six parameters:
Scan run id
Connection string to migration center database
Class attribute name (of the documents in the scan run id)
Algorithm id (as defined in the XML file)