The visual route to classifying documents: Experimenting with neural networks

Krishna Vishal Vemula

Document classification plays a vital part in any document processing pipeline. As a first step in document processing, it is important to classify documents into different groups so that they become searchable, retrievable and actionable. Generally, documents are classified based on their textual content: Optical character recognition (OCR) extracts the text of the entire document, and the document is then assigned to one or more predefined classes based on that content.

This blog is an account of how we aimed to classify documents through their visual layout and structural properties. This alternative method can eliminate the OCR process of analyzing the entire text in a document. Though currently in a nascent stage, this proposed method also has the potential to bring in more efficiency than its text-based counterpart.

As part of our data visualization initiative, we used convolutional neural networks (CNNs) to study document image classification. In this blog, you will learn about our preliminary results from experimenting with different neural network architectures on a predefined document image dataset.

Curated dataset for document image classification

To begin, studying the possibilities of visual document classification requires a large dataset labeled with a predefined list of document classes. With 400,000 grayscale images in 16 classes (25,000 images per class), the RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset is a suitable dataset for training models to high accuracy. Here are the examples from each of the 16 classes in the RVL-CDIP dataset.

Harley et al. introduced this dataset and presented a baseline performance using CNNs. Then, Tensmeyer et al. studied how different data transformations (augmentations) affect the network's test set performance. They also tried appending different image features (SIFT, SURF and so on) and found that SIFT features performed better.

Later, Das et al. proposed a new architecture in which an image is split into multiple regions. Each region passes through a different network, and the features from all the networks pass through a meta classifier for the final prediction.

Learning how to deal with document images

We used VGG-16, ResNet-18 and ResNet-34 networks throughout the experiments. Training from scratch gave us only 71.9% accuracy. Since transfer learning has worked wonders on several computer vision tasks, we initialized our models with ImageNet-pretrained weights before training them on this dataset. For the sake of brevity, in this blog, we present only the results of the ResNet-34 network.

Initially, we froze all layers except the last linear layer and trained only that layer, without any data augmentation. This gave us an accuracy of 73%. Later, in an ‘unfreeze’ mode, we trained all layers of the same model with the following data augmentations: horizontal and vertical flips, random rotation (+/- 10 degrees), random scaling and affine warp. The results are shown below.

The model was then trained for ten more epochs. Neither the training loss nor the validation loss decreased during these epochs. Hence, we concluded that the model had converged.
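A convergence check of this kind can be written as a simple plateau test. The sketch below is only an illustration: the function name, the `patience` of ten epochs and the `min_delta` threshold are assumptions, not values taken from the experiments.

```python
def has_converged(losses, patience=10, min_delta=1e-3):
    """Return True if the last `patience` epochs brought no real improvement.

    `losses` is a per-epoch loss history. The model counts as converged when
    the best loss in the last `patience` epochs is not lower than the best
    earlier loss by at least `min_delta`.
    """
    if len(losses) <= patience:
        return False  # not enough history to judge
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent > best_before - min_delta


# Hypothetical loss histories, for illustration only.
plateaued = [1.0, 0.5, 0.4] + [0.4] * 10
still_improving = [1.0 - 0.05 * epoch for epoch in range(15)]
```

Applying the same check to both the training and the validation loss mirrors the reasoning above: the model is treated as converged only when neither curve improves.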

As the training progressed, we kept track of the pairs of document classes that the model confused most often. To our surprise, we found that the number of samples in the top 10 most confused class pairs didn’t change throughout the training process. Here is a table that consists of the top 10 most confused class pairs at 2 epochs and 17 epochs. Each tuple has the structure (‘original_class’, ‘predicted_class’, number of samples).
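Tuples of this shape can be computed directly from the true and predicted labels of a validation set. The following is a minimal sketch; the function name and the toy labels are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter


def most_confused(true_labels, predicted_labels, top_k=10):
    """Return (original_class, predicted_class, count) for the most frequent errors."""
    pairs = Counter(
        (true, pred)
        for true, pred in zip(true_labels, predicted_labels)
        if true != pred  # correct predictions are not confusions
    )
    return [(true, pred, count) for (true, pred), count in pairs.most_common(top_k)]


# Toy labels, for illustration only.
true_labels = ["form", "form", "letter", "memo", "form"]
predicted_labels = ["letter", "letter", "letter", "letter", "form"]
confused = most_confused(true_labels, predicted_labels)
# confused == [("form", "letter", 2), ("memo", "letter", 1)]
```

Running this after several epochs and comparing the lists shows whether the same class pairs stay at the top as training progresses.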

After the analysis, we figured out three reasons for this confusion:

  1. Mislabeled samples: Some of the sample labels were incorrect.
  2. Ambiguous data samples: Some of the samples contained content that is consistent with multiple classes.
  3. Model has reached its limit: The model could no longer explain the variance in the data in its current state.

To better understand the predictions, we need visual explanations that lend insight into the success and failure modes of these samples. Class activation maps provide such insight by showing why a sample was assigned to a given class.

Class activation map visualization

Grad-CAM (Gradient-Weighted Class Activation Mapping) by Selvaraju et al. highlights the regions of the image that the model attended to when making its classification. The examples shown below highlight these regions in correct predictions of letter samples.
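The core Grad-CAM computation is small once the activations of a convolutional layer and the gradients of the class score with respect to them are available. The sketch below assumes NumPy and takes both arrays as given; how they are extracted from the network (for example with framework hooks) is left out, and the toy arrays are made up for illustration.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one convolutional layer.

    activations: array of shape (K, H, W), the K feature maps of the layer.
    gradients:   array of shape (K, H, W), the gradient of the class score
                 with respect to each activation.
    """
    # Channel importance: the gradient averaged over all spatial positions.
    weights = gradients.mean(axis=(1, 2))                  # shape (K,)
    # Weighted sum of the feature maps.
    heatmap = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # ReLU keeps only the regions that increase the class score.
    return np.maximum(heatmap, 0.0)


# Toy example with K=2 feature maps of size 2x2.
activations = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[1.0, 1.0], [1.0, 1.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-2.0, -2.0], [-2.0, -2.0]]])
heatmap = grad_cam(activations, gradients)
# heatmap == [[0.0, 0.0], [1.0, 2.0]]
```

The resulting low-resolution map is typically upsampled to the input size and overlaid on the document image to produce heatmaps like the ones discussed here.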

The examples below highlight the regions behind incorrect predictions in document image classification.

The first image actually belonged to the class ‘form’, but the highlighted section could belong to more than one class: ‘form’, ‘letter’ or ‘handwritten’. The second was predicted as ‘advertisement’ because a lot of examples in the ‘advertisement’ class had pictures with text surrounding them.

Training the model with an in-house dataset

After training the model on the RVL-CDIP dataset, we fine-tuned it on our in-house dataset. The training procedure was the same as for the RVL-CDIP dataset. Initially, we trained without data augmentations and reached an accuracy of 91.5%. Then, we trained with data augmentations; the model converged in 3 epochs and the accuracy was 97.1%.

To improve the model further, we explored a new 2D representation of documents proposed by Katti et al. In this 2D representation, each Unicode character is assigned a color in RGB space, and the pixels occupied by that character in the document are filled with the corresponding color. This representation encodes both textual and spatial information. However, Katti et al. didn’t mention any experiments on how different choices of color space affected their results.

So, we experimented with this representation on our in-house dataset. We tried two variants, one where RGB values were assigned at random and one where they were assigned in grayscale. Surprisingly, this change in representation didn’t affect the accuracy at all. All representations resulted in the same accuracy of 97.1%. The Grad-CAM output confirmed that the model was learning similar patterns across the respective classes. Further investigation on larger datasets is needed for this inference on color space choices to be conclusive.
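To make the 2D representation and the two color assignments concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the character bounding boxes are taken as given (in practice they would come from an OCR step), and `random_color`, `gray_color` and `render_chargrid` are hypothetical helpers, not the exact mappings used in the experiments.

```python
import random

BACKGROUND = (0, 0, 0)  # color of pixels not covered by any character


def random_color(char, seed=0):
    """Assign each character a fixed pseudo-random RGB color."""
    rng = random.Random(ord(char) + seed)  # same character, same color
    return (rng.randrange(256), rng.randrange(256), rng.randrange(256))


def gray_color(char):
    """Assign each character a gray level derived from its code point."""
    level = ord(char) % 255 + 1  # 1..255, so it never equals the background
    return (level, level, level)


def render_chargrid(char_boxes, width, height, color_fn):
    """Fill the pixels of every character box with that character's color.

    char_boxes: iterable of (char, x0, y0, x1, y1) with exclusive x1 and y1.
    Returns a height x width grid (list of rows) of RGB tuples.
    """
    grid = [[BACKGROUND] * width for _ in range(height)]
    for char, x0, y0, x1, y1 in char_boxes:
        color = color_fn(char)
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = color
    return grid


# Two characters side by side on the first row of a 4x2 page.
boxes = [("A", 0, 0, 2, 1), ("B", 2, 0, 4, 1)]
gray_grid = render_chargrid(boxes, width=4, height=2, color_fn=gray_color)
random_grid = render_chargrid(boxes, width=4, height=2, color_fn=random_color)
```

Swapping `color_fn` changes only the color mapping while the spatial layout stays identical, which is the kind of comparison described above.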

The examples below show the Grad-CAM heatmaps for images belonging to the same class.

Preparing to attain new potential

The observations above show that mislabelled and ambiguous data samples remain in the dataset. The dataset is originally multi-labelled, and the class labels are assigned based on the probability of previous predictions. Hence, the present framework cannot deal with such samples effectively. In the future, we plan to analyze this dataset using a Bayesian framework, which can identify the uncertainty of such samples.
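One common way to express this kind of uncertainty, and not necessarily the approach we will end up using, is the entropy of the class probabilities averaged over several stochastic predictions for the same sample (for instance, forward passes with dropout left on). The sketch below assumes such probability vectors are already available; the function name and the numbers are hypothetical.

```python
import math


def predictive_entropy(prob_samples):
    """Entropy (in nats) of the mean of several class-probability vectors.

    prob_samples: list of probability vectors for one sample, one vector per
    stochastic prediction. Higher entropy means a more uncertain prediction.
    """
    count = len(prob_samples)
    mean_probs = [sum(column) / count for column in zip(*prob_samples)]
    return -sum(p * math.log(p) for p in mean_probs if p > 0)


# Made-up predictions for a two-class problem.
confident = [[1.0, 0.0], [1.0, 0.0]]   # every pass agrees
ambiguous = [[0.9, 0.1], [0.1, 0.9]]   # passes disagree
low = predictive_entropy(confident)
high = predictive_entropy(ambiguous)
```

Samples with high entropy would be natural candidates for review as mislabelled or ambiguous.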

These studies are just a taste of what is becoming possible in document classification. When networks are trained on large datasets, critical visual cues improve the accuracy and effectiveness of document image classification at scale. With this ability, document classification can be automated not only by analyzing the entire document text, but also by taking in the document’s visual layout and structural properties at a glance. It’s the tip of the iceberg in document processing: we no longer have to rely only on text to classify documents, because we also have document image classification.

