Install Tesseract OCR on AlmaLinux 9: Best OCR Library

In this guide, we show you how to install Tesseract OCR on AlmaLinux 9. Tesseract is an open-source optical character recognition (OCR) engine and one of the most widely used OCR libraries. OCR software locates text in images and recognizes it, turning it into machine-readable characters.

Tesseract works by identifying the pixel patterns that represent letters, words, and sentences. It uses a two-stage adaptive recognition process: the first stage recognizes individual characters, and the second stage refines the results by using the context of words and sentences to improve accuracy.

Follow the steps below on the Orcacore website to Install Tesseract OCR on AlmaLinux 9.

Before proceeding, ensure you are logged in to your AlmaLinux 9 server as a non-root user with sudo privileges. If you haven’t already, you can follow our guide on Initial Server Setup with AlmaLinux 9 to configure this.

1. Tesseract OCR Setup on AlmaLinux 9

This section will guide you through installing Tesseract on AlmaLinux 9 from the source code.

First, update the installed packages to their latest versions:

sudo dnf update -y

Install required packages and Dependencies

Install the necessary packages for building the Tesseract OCR Library on AlmaLinux 9:

sudo dnf install git automake make autoconf libtool clang gcc-c++ wget -y

Install the leptonica dependencies:

sudo dnf install zlib zlib-devel libjpeg libjpeg-devel libwebp libwebp-devel libtiff libtiff-devel libpng libpng-devel -y

Copy the shared libraries into /usr/local/lib:

cd /usr/local/lib
sudo cp /usr/lib64/libjpeg.so.62 .
sudo cp /usr/lib64/libwebp.so.7 .
sudo cp /usr/lib64/libtiff.so.5 .
sudo cp /usr/lib64/libpng16.so.16 .
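Before moving on, it can help to confirm that the four libraries actually landed in place. The following is a minimal sketch, not part of the original steps; the function name check_libs is made up for illustration, and the library file names are the ones used in the copy commands above.

```shell
# Report which of the copied shared libraries are present in a directory.
# Usage: check_libs [DIR]   (DIR defaults to /usr/local/lib)
# Returns 0 if all four files exist, 1 if any is missing.
check_libs() {
    dir="${1:-/usr/local/lib}"
    missing=0
    for lib in libjpeg.so.62 libwebp.so.7 libtiff.so.5 libpng16.so.16; do
        if [ -e "$dir/$lib" ]; then
            echo "found:   $dir/$lib"
        else
            echo "missing: $dir/$lib"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_libs with no arguments after the copy step. If a file is reported missing, the version suffix on your system may differ from the one listed here.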

Clone Leptonica From GitHub

Clone the Leptonica repository from GitHub:

cd ~
git clone https://github.com/DanBloomberg/leptonica.git --depth 1

Switch to your Leptonica directory:

cd leptonica

Compile and Build Leptonica

Configure, build, and install Leptonica. Only the install step needs root privileges:

./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-zlib --with-jpeg --with-libwebp --with-libtiff --with-libpng --disable-dependency-tracking
make
sudo make install
sudo ldconfig

Download Tesseract OCR on AlmaLinux 9

After completing the Leptonica installation, download the latest version of Tesseract OCR on AlmaLinux 9 from GitHub.

cd ~
VER=$(curl -s https://api.github.com/repos/tesseract-ocr/tesseract/releases/latest | grep tag_name | cut -d '"' -f 4)
wget "https://github.com/tesseract-ocr/tesseract/archive/refs/tags/$VER.tar.gz" -O tesseract-5.tar.gz
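To see how the version lookup works, the grep and cut pipeline can be tried on a small piece of JSON. This is only a sketch: the function name extract_tag and the sample data (including the version number) are made up for illustration, and a real API response contains many more fields.

```shell
# Extract the "tag_name" value from GitHub release JSON read on stdin.
# grep keeps the line containing tag_name; cut splits that line on double
# quotes and prints the fourth field, which is the tag itself.
extract_tag() {
    grep tag_name | cut -d '"' -f 4
}

# Illustrative input only, shaped like the pretty-printed API response.
sample='{
  "tag_name": "5.3.4",
  "name": "5.3.4"
}'

printf '%s\n' "$sample" | extract_tag   # prints 5.3.4
```

If the API request fails (for example because of rate limiting), VER ends up empty and the wget URL is invalid, so it is worth checking with echo "$VER" before downloading.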

Extract the downloaded file:

tar zxvf tesseract-5.tar.gz

Switch to your Tesseract directory on AlmaLinux 9:

cd tesseract-*/

Configure Tesseract OCR

Generate the build scripts and configure the Tesseract build:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure --prefix=/usr/local --disable-shared --enable-static --with-extra-libraries=/usr/local/lib/ --with-extra-includes=/usr/local/include/

Build and Install Tesseract OCR

Build and install Tesseract on AlmaLinux 9:

make
sudo make install
sudo ldconfig
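A quick way to confirm the install is to check where the tesseract command resolves. The snippet below is a small sketch under the assumption that /usr/local/bin is on your PATH; the helper name which_cmd is made up for illustration.

```shell
# Print where a command resolves on PATH, or fail if it is not found.
which_cmd() {
    if path=$(command -v "$1"); then
        echo "$1 -> $path"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# With --prefix=/usr/local the binary is expected at /usr/local/bin/tesseract.
which_cmd tesseract || echo "check that /usr/local/bin is on your PATH"
```

Once the command is found, tesseract --version prints the version and the image libraries it was built against.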

After the installation, load Tesseract languages.

Load Tesseract Languages

Create a directory for the language data in your home directory:

mkdir -p ~/tess/traineddata

Export the Tesseract data path by adding the following line to ~/.bashrc:

export TESSDATA_PREFIX="$HOME/tess/traineddata"

Note: The shell expands $HOME to your home directory, so the line can be used as written.

Source the profile:

source ~/.bashrc

Download the trained data files you need from the tessdata repository on GitHub into that directory. For example, for English and French:

cd "$TESSDATA_PREFIX"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata
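If you need several languages, the download can be wrapped in a small loop. This is a sketch rather than part of the original steps: the function names tessdata_url and fetch_langs are made up, and the URL pattern is simply the one used by the wget commands above.

```shell
# Build the download URL for one language's trained-data file.
tessdata_url() {
    echo "https://github.com/tesseract-ocr/tessdata/raw/main/$1.traineddata"
}

# Fetch each requested language into TESSDATA_PREFIX, skipping files
# that are already there. Usage: fetch_langs eng fra ces
fetch_langs() {
    dest="${TESSDATA_PREFIX:?set TESSDATA_PREFIX first}"
    for lang in "$@"; do
        if [ -f "$dest/$lang.traineddata" ]; then
            echo "skip $lang (already present)"
        else
            # -P saves the file under $dest using the name from the URL.
            wget -q -P "$dest" "$(tessdata_url "$lang")"
        fi
    done
}
```

For example, fetch_langs eng fra ces downloads any of the three files that are not yet in the data directory.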

Now, let’s see how to use Tesseract OCR after you Install Tesseract OCR on AlmaLinux 9.

2. How To Use Tesseract OCR on AlmaLinux 9?

Now that Tesseract OCR has been installed on AlmaLinux 9, you can extract text from scanned documents or images.

To convert an image to a text file, use the following syntax:

tesseract <image_name> <output_file_name>

For example:

tesseract image.png new

This will create a text file named new.txt containing the text extracted from image.png. Tesseract appends the .txt extension to the output name automatically.

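Specify the language with the -l flag. The matching trained data file (here ces.traineddata) must be present in the directory that TESSDATA_PREFIX points to. For example, to use Czech: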

tesseract image.png new -l ces

Multiple languages can also be specified:

tesseract image.png new -l ces+eng
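The same command can be applied to a whole directory of images. The following is a minimal sketch, assuming all images are .png files; the function names join_langs and ocr_dir are made up for illustration.

```shell
# Join language codes with "+" as expected by the -l flag.
# Example: join_langs ces eng  prints  ces+eng
join_langs() {
    ( IFS=+; echo "$*" )
}

# OCR every .png in a directory, writing NAME.txt next to each NAME.png.
# Usage: ocr_dir DIR LANG [LANG...]
ocr_dir() {
    dir="$1"; shift
    langs=$(join_langs "$@")
    for img in "$dir"/*.png; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$img" ] || continue
        # Output base is the image path without .png; tesseract adds .txt.
        tesseract "$img" "${img%.png}" -l "$langs"
    done
}
```

For example, ocr_dir ~/scans ces eng processes every PNG in ~/scans (a hypothetical directory) with Czech and English enabled.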

Conclusion

This guide has walked you through the process to Install Tesseract OCR on AlmaLinux 9. Tesseract OCR allows you to extract text from images and documents lacking a text layer, converting them into searchable text files, PDFs, or other popular formats.

Alternative Installation Methods for Tesseract OCR on AlmaLinux 9

While the previous method detailed compiling Tesseract OCR from source, which provides greater control over the build process, there are alternative, often simpler, methods for installing Tesseract OCR on AlmaLinux 9. These methods involve using package managers like dnf, or containerization using Docker.

1. Using the DNF Package Manager

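AlmaLinux 9, being a derivative of RHEL, benefits from a robust package management system via dnf. If Tesseract and its dependencies are available in the enabled repositories, installation becomes significantly easier. On AlmaLinux 9 the tesseract packages are typically shipped in EPEL rather than in the base repositories, so you may need to enable EPEL first by installing the epel-release package.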

Explanation:

This method leverages pre-built packages, eliminating the need for manual compilation. The package manager handles dependency resolution, ensuring all required libraries are installed correctly. However, the version of Tesseract available through dnf might not always be the latest.

Code Example:

First, search for Tesseract packages to confirm availability:

sudo dnf search tesseract

If Tesseract is found, install it using:

sudo dnf install tesseract tesseract-langpack-eng

The tesseract-langpack-eng package provides the English language data. Install other language packs as needed.
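To install several language packs at once, the package names can be generated from the language codes. This sketch assumes that other packs follow the same tesseract-langpack-<code> naming as the English one; the function name langpack_pkgs is made up, and the command is only printed, not run.

```shell
# Map Tesseract language codes to package names, assuming the
# tesseract-langpack-<code> naming seen in the install command above.
langpack_pkgs() {
    for lang in "$@"; do
        printf 'tesseract-langpack-%s\n' "$lang"
    done
}

# Print the install command instead of running it, so it can be reviewed.
echo "sudo dnf install tesseract $(langpack_pkgs eng fra ces | tr '\n' ' ')"
```

You can confirm which packs your repositories actually provide with the dnf search tesseract command shown earlier.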

After installation, verify the installation:

tesseract --version

This method provides a quick and easy way to Install Tesseract OCR on AlmaLinux 9.

2. Using Docker Containerization

Docker provides a way to encapsulate Tesseract OCR and its dependencies within a container, ensuring consistent operation across different environments. This approach avoids modifying the host system and simplifies deployment.

Explanation:

Docker images contain everything needed to run an application: code, runtime, system tools, system libraries, and settings. Using a pre-built Tesseract OCR Docker image or creating your own offers portability and reproducibility. This is particularly useful if you need a specific version of Tesseract or have complex dependency requirements.

Code Example:

First, ensure Docker is installed and running on your AlmaLinux 9 system. Then, pull a pre-built Tesseract OCR image from Docker Hub. This example uses the jbarratt/tesseract image; confirm that it is still available and maintained before relying on it.

docker pull jbarratt/tesseract

To run Tesseract OCR on an image (e.g., image.png) within the container, mount the directory containing the image to the container and specify the output directory:

docker run --rm -v /path/to/your/images:/data jbarratt/tesseract /data/image.png /data/output

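Replace /path/to/your/images with the actual path to the directory containing your image. The output text file (output.txt) will be created in the same directory. If SELinux is in enforcing mode, which is the default on AlmaLinux 9, append :Z to the volume mount (/path/to/your/images:/data:Z) so the container is allowed to read and write the files.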

To specify a language, you can use the -l flag:

docker run --rm -v /path/to/your/images:/data jbarratt/tesseract /data/image.png /data/output -l eng
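Typing the mount path and file names by hand gets tedious, so the command can be assembled from an image path. This is a sketch that only prints the command; the function name docker_ocr_cmd and the path /srv/scans/invoice.png are made up for illustration, and paths containing spaces are not handled.

```shell
# Print the docker command that would OCR one image, without running it.
# Usage: docker_ocr_cmd /abs/path/to/image.png [LANG]   (LANG defaults to eng)
docker_ocr_cmd() {
    img="$1"
    lang="${2:-eng}"
    dir=$(dirname "$img")      # host directory to mount at /data
    name=$(basename "$img")    # file name as seen inside the container
    base="${name%.*}"          # output base name; tesseract adds .txt
    echo "docker run --rm -v $dir:/data jbarratt/tesseract /data/$name /data/$base -l $lang"
}

docker_ocr_cmd /srv/scans/invoice.png ces
# prints: docker run --rm -v /srv/scans:/data jbarratt/tesseract /data/invoice.png /data/invoice -l ces
```

Review the printed command, then paste it into the shell to run it.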

These two methods offer alternative approaches to Install Tesseract OCR on AlmaLinux 9 that can be more convenient than compiling from source, depending on your specific needs and environment.
