Optical Character Recognition (OCR) is a technology that extracts machine-readable text from images, scanned documents, and even handwritten notes. Python's OCR tools have evolved significantly over the years, and current releases of these libraries offer more powerful and efficient solutions than earlier ones.
This article will cover the top seven OCR libraries in Python, highlighting their strengths, unique features, and code examples to help you get started.
1. Tesseract OCR (pytesseract)
Tesseract is the most popular and widely used OCR engine in the Python ecosystem, where it is usually accessed through the pytesseract wrapper. Originally developed by HP and later sponsored by Google, it is now maintained as an open-source project and provides high-quality OCR for over 100 languages.
Key Features:
- Open-source and free to use.
- Supports multiple languages, including non-Latin alphabets.
- Recognizes text in images and scanned documents (PDF pages must first be converted to images).
- Can be customized with custom training data for specialized use cases.
- Works well with pre-processing tools like OpenCV to improve accuracy.
To install Tesseract OCR on Linux, follow these steps depending on your distribution:
    sudo apt install tesseract-ocr         [On Debian, Ubuntu and Mint]
    sudo yum install tesseract             [On RHEL/CentOS/Fedora and Rocky/AlmaLinux]
    sudo emerge -a app-text/tesseract      [On Gentoo Linux]
    sudo apk add tesseract-ocr             [On Alpine Linux]
    sudo pacman -S tesseract               [On Arch Linux]
    sudo zypper install tesseract-ocr      [On OpenSUSE]
    sudo pkg install tesseract             [On FreeBSD]
Once Tesseract is installed, if you want to use it with Python, you need to install the pytesseract package using the pip package manager.
    pip3 install pytesseract
    OR
    pip install pytesseract
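Because pytesseract only wraps the Tesseract executable, it can help to confirm that the binary is visible on your PATH before running any OCR code. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is an illustrative assumption, not part of pytesseract.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("Tesseract binary not found on PATH; install it before using pytesseract.")
else:
    print(f"Tesseract found at: {path}")
```

If the binary lives somewhere unusual, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.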
Here’s an example of using Tesseract OCR through the pytesseract library to extract text from an image:

    import pytesseract
    from PIL import Image

    # Load an image
    img = Image.open("image_sample.png")

    # Use Tesseract to extract text
    text = pytesseract.image_to_string(img)

    # Print the extracted text
    print(text)
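The feature list above mentions pre-processing as a way to improve accuracy. One common step is binarization, which turns a grayscale scan into pure black and white. The sketch below implements Otsu's thresholding method in plain Python so that it runs without extra dependencies; in practice you would more likely use an image library such as OpenCV for this. The function names and the tiny sample image are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels at or below t (background class)
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # pixels above t (foreground class)
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance; Otsu picks the t that maximizes it
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map every pixel to 255 (above threshold) or 0 (at or below it)."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


# Hypothetical 3x4 grayscale image: dark "ink" on a light background
image = [
    [210, 220, 30, 215],
    [205, 25, 35, 225],
    [200, 230, 40, 20],
]
threshold = otsu_threshold([p for row in image for p in row])
clean = binarize(image, threshold)
print(threshold)
print(clean)
```

The binarized result could then be converted back to an image and passed to the OCR engine in place of the raw scan.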
2. EasyOCR
EasyOCR is another excellent Python OCR library that supports more than 80 languages and is easy to use for beginners. It is built on deep learning techniques, making it an excellent choice for those who want to leverage modern OCR technology.
Key Features:
- High accuracy with deep learning models.
- Supports a wide range of languages.
- Can detect text in vertical and multi-lingual images.
- Simple and easy-to-understand API.
To install EasyOCR on Linux, use pip; the command is the same on every distribution.

    pip3 install easyocr
    OR
    pip install easyocr
Once the installation is complete, you can use EasyOCR to extract text from an image.
    import easyocr

    # Initialize the OCR reader
    reader = easyocr.Reader(['en'])

    # Extract text from an image
    result = reader.readtext('image_sample.png')

    # Print the extracted text
    for detection in result:
        print(detection[1])
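Each item returned by `readtext` is a tuple of bounding box, text, and confidence score, which is why the loop above prints `detection[1]`. The sketch below shows one way to drop low-confidence detections; the helper name `confident_text`, the 0.5 cut-off, and the sample data are assumptions chosen for illustration, so the code runs without EasyOCR installed.

```python
def confident_text(detections, min_confidence=0.5):
    """Keep only the text of detections at or above min_confidence."""
    return [text for _box, text, confidence in detections
            if confidence >= min_confidence]


# Hypothetical sample shaped like reader.readtext() output:
# (bounding box corners, recognized text, confidence)
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Invoice", 0.98),
    ([[0, 30], [60, 30], [60, 50], [0, 50]], "T0tal", 0.31),
    ([[70, 30], [120, 30], [120, 50], [70, 50]], "42.00", 0.87),
]

print(confident_text(sample))
```

A suitable threshold depends on the image quality, so it is worth inspecting the scores on your own data before settling on a value.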
3. OCRopus
OCRopus is an open-source OCR system whose development was led by Thomas Breuel and sponsored by Google. While it is primarily used for historical documents and books, OCRopus can also be applied to a wide variety of text extraction tasks.
Key Features:
- Specializes in document layout analysis and text extraction.
- Built with modularity in mind, enabling easy customization.
- Can work with multi-page documents and large datasets.
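OCRopus is driven through a set of command-line scripts rather than a single command, so from Python it is usually invoked with subprocess. The example below assumes the ocropus-* scripts from the ocropy distribution are installed and on your PATH:

    import subprocess

    # Binarize the page, segment it into lines, then recognize each line.
    # The quoted glob patterns are expanded by the OCRopus scripts themselves.
    subprocess.run(['ocropus-nlbin', 'image_sample.png', '-o', 'book'], check=True)
    subprocess.run(['ocropus-gpageseg', 'book/????.bin.png'], check=True)
    subprocess.run(['ocropus-rpred', 'book/????/??????.bin.png'], check=True)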
4. PyOCR
PyOCR is a Python wrapper around several OCR engines, including Tesseract and CuneiForm. It provides a simple interface for integrating OCR functionality into Python applications.
Key Features:
- Can interface with multiple OCR engines.
- Provides a simple API for text extraction.
- Can be combined with image preprocessing libraries for improved results.
PyOCR requires Tesseract (OCR engine) and Pillow (image processing library). You can install them using the following commands:
    sudo apt install tesseract-ocr         [On Debian, Ubuntu and Mint]
    sudo yum install tesseract             [On RHEL/CentOS/Fedora and Rocky/AlmaLinux]
    sudo emerge -a app-text/tesseract      [On Gentoo Linux]
    sudo apk add tesseract-ocr             [On Alpine Linux]
    sudo pacman -S tesseract               [On Arch Linux]
    sudo zypper install tesseract-ocr      [On OpenSUSE]
    sudo pkg install tesseract             [On FreeBSD]
Now, you can install the pyocr and pillow libraries using pip:

    pip3 install pyocr pillow
    OR
    pip install pyocr pillow
Here’s a Python example that extracts text from an image using PyOCR and Tesseract:
    import pyocr
    from PIL import Image

    # Choose the OCR tool (Tesseract or CuneiForm)
    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR engine found; install Tesseract first.")
    tool = tools[0]

    # Load the image
    img = Image.open('image_sample.png')

    # Extract text from the image
    text = tool.image_to_string(img)

    # Print the extracted text
    print(text)
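Taking the first entry of `get_available_tools()` works when only one engine is installed, but with several engines you may want to choose explicitly. The sketch below selects a tool by matching against the string returned by its `get_name()` method. The `pick_tool` helper and the `FakeTool` stand-in class are hypothetical and exist only so the example runs without PyOCR installed; the engine names used are assumed examples of what PyOCR reports.

```python
def pick_tool(tools, preferred="Tesseract"):
    """Return the first tool whose name contains `preferred`, else None."""
    for tool in tools:
        if preferred.lower() in tool.get_name().lower():
            return tool
    return None


class FakeTool:
    """Stand-in for a PyOCR tool module; only get_name() is mimicked."""

    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name


# Assumed example names; real values come from pyocr.get_available_tools()
available = [FakeTool("Cuneiform (sh)"), FakeTool("Tesseract (sh)")]

chosen = pick_tool(available, preferred="Tesseract")
print(chosen.get_name() if chosen else "No matching OCR engine found")
```

With PyOCR installed, the same helper could be called as `pick_tool(pyocr.get_available_tools())`.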
5. PaddleOCR
PaddleOCR is an OCR library developed by Baidu's PaddlePaddle team and built on the PaddlePaddle deep learning framework. It supports more than 80 languages and offers strong accuracy thanks to its deep learning models.
Key Features:
- High performance, especially for images with complex backgrounds.
- Supports text detection, recognition, and layout analysis.
- Includes pre-trained models for a variety of languages.
To install PaddleOCR on Linux, use:

    pip3 install paddlepaddle paddleocr
    OR
    pip install paddlepaddle paddleocr
Here’s a Python example that extracts text from an image using the paddleocr library. The calls follow the PaddleOCR 2.x API; newer major releases changed some of these arguments.

    from paddleocr import PaddleOCR

    # Initialize the OCR
    ocr = PaddleOCR(use_angle_cls=True, lang='en')

    # Perform OCR on an image
    result = ocr.ocr('image_sample.png', cls=True)

    # Each entry is [bounding_box, (text, confidence)]; print only the text
    for line in result[0] or []:
        print(line[1][0])
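In the PaddleOCR 2.x result format, each page is a list of `[bounding_box, (text, confidence)]` entries, and the order of entries is not guaranteed to match reading order. The following sketch sorts detections top to bottom, then left to right, before extracting the text. It is a deliberately naive ordering that ignores multi-column layouts; the helper name and sample data are assumptions for illustration.

```python
def lines_in_reading_order(page_result):
    """Return detected text sorted top-to-bottom, then left-to-right."""
    if not page_result:          # PaddleOCR 2.x may return None for an empty page
        return []

    def top_left(entry):
        box = entry[0]           # four [x, y] corner points
        return (min(pt[1] for pt in box), min(pt[0] for pt in box))

    return [entry[1][0] for entry in sorted(page_result, key=top_left)]


# Hypothetical sample shaped like result[0] from ocr.ocr(...)
page = [
    [[[10, 50], [100, 50], [100, 70], [10, 70]], ("Second line", 0.95)],
    [[[10, 10], [100, 10], [100, 30], [10, 30]], ("First line", 0.99)],
]

print(lines_in_reading_order(page))
```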
6. Kraken
Kraken is a high-performance OCR library designed specifically for historical and multilingual text. It began as a fork of OCRopus and adds features for handling complex layouts and text extraction.
Key Features:
- Best suited for old books and multilingual OCR.
- Handles complex text layouts and historical fonts.
- Uses machine learning for better recognition accuracy.
To install Kraken on Linux, use:

    pip3 install kraken
    OR
    pip install kraken
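Here’s a Python example that extracts text from an image using the kraken library. Binarization alone does not produce text, so the example also segments the page and runs a recognition model, which you need to supply:

    from PIL import Image
    from kraken import binarization, pageseg, rpred
    from kraken.lib import models
    # Binarize the page, then segment it into text lines
    im = binarization.nlbin(Image.open("image_sample.png"))
    seg = pageseg.segment(im)
    # "model.mlmodel" is a placeholder: Kraken needs a trained recognition model
    model = models.load_any("model.mlmodel")
    for record in rpred.rpred(model, im, seg):
        print(record.prediction)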
7. Textract (AWS)
AWS Textract is Amazon’s cloud-based OCR service that can analyze documents and forms and extract text with high accuracy. It integrates seamlessly with other AWS services.
Key Features:
- Cloud-based OCR with scalable solutions.
- Supports document structure analysis, including tables and forms.
- Integration with AWS services for further data processing.
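Textract runs in the cloud, so the only thing to install locally is boto3, the AWS SDK for Python. You also need AWS credentials configured on your machine.

    pip3 install boto3
    OR
    pip install boto3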
Here is an example Python script that uses AWS Textract to extract text from a document (for example, a scanned PDF or image file).
    import boto3

    # Initialize a Textract client
    client = boto3.client('textract')

    # Path to the image or PDF file you want to analyze
    file_path = 'path_to_your_file.png'  # Replace with your file path

    # Open the file in binary mode
    with open(file_path, 'rb') as document:
        # Call Textract to analyze the document
        response = client.detect_document_text(Document={'Bytes': document.read()})

    # Print the extracted text
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            print(item['Text'])
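The response is a flat list of blocks of different types (pages, lines, words), so some post-processing is usually needed. The sketch below groups LINE blocks by page number. It works on a plain dictionary, which means the parsing logic can be tried without an AWS account. The helper name `lines_by_page` and the trimmed sample response are assumptions for illustration, and blocks without a `Page` key are treated as page 1.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by their page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Assume page 1 when the block carries no page number
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


# Hypothetical, heavily trimmed response shaped like detect_document_text output
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Invoice #1001"},
        {"BlockType": "WORD", "Page": 1, "Text": "Invoice"},
        {"BlockType": "LINE", "Page": 1, "Text": "Total: 42.00"},
        {"BlockType": "LINE", "Page": 2, "Text": "Terms and conditions"},
    ]
}

for page, lines in lines_by_page(sample_response).items():
    print(page, lines)
```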
Conclusion
Choosing the right OCR library in Python depends on the specific use case, the language requirements, and the complexity of the documents you’re processing. Whether you’re working on historical documents, multilingual texts, or simple scanned PDFs, these libraries provide powerful tools for text extraction.
For beginners, Tesseract and EasyOCR are excellent starting points due to their ease of use and wide adoption. However, for more advanced or specialized tasks, libraries like PaddleOCR, OCRopus, and Kraken offer greater flexibility and accuracy.