Extract Text Efficiently with Datalab Marker and OCR

In this guide, you'll learn how to use two of Datalab's new models: Datalab Marker, which converts documents and images into editable text formats such as Markdown, and Datalab OCR, which extracts text together with line-level polygons. The steps apply to scanned PDFs, photos of documents, and any other image containing text.

TL;DR

Use Datalab Marker to convert entire documents into Markdown.
Use Datalab OCR to extract text line by line, with a polygon locating each line.
Start with an existing document or image and output well-structured, editable text.

Prerequisites

Before you begin, ensure you have the following:

A Datalab account with access to the Marker and OCR models.
A set of documents or images from which you wish to extract text.
A basic understanding of Markdown, if you're converting documents.
The Datalab CLI (command-line interface), installed as described in step 1.

Step-by-step Instructions

1. Set Up Your Datalab Environment

Make sure the Datalab CLI is installed. If it is not, install it using pip:
```
pip install datalab-cli
```
Authenticate the Datalab CLI with your account:
```
datalab auth login
```
Verify your setup by listing the models available to your account:
```
datalab models list
```
Confirm the Datalab Marker and Datalab OCR models are accessible.
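
If later steps are going to be scripted, it can help to check first that the CLI is reachable at all. The following is a minimal Python sketch, assuming the executable is named `datalab` as in the commands above; `find_cli` is a hypothetical helper name, not part of any Datalab package.

```python
import shutil
from typing import Optional


def find_cli(name: str = "datalab") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


# Assumption: the CLI installs an executable called "datalab".
location = find_cli()
if location is None:
    print("datalab was not found on PATH; revisit the pip install step")
else:
    print(f"datalab found at {location}")
```

This only confirms that the executable exists. It does not check authentication or model access, which the commands above cover.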

2. Convert a Document to Markdown Using Datalab Marker

Choose a document (e.g., .pdf or .jpg) you wish to convert.

Run the following command to convert your file into Markdown format:

datalab marker convert --input [path_to_your_file] --output document.md

Open document.md with any text editor to review the extracted content.

3. Extract Text with Line-level Polygons using Datalab OCR

Select an image or document to process, then run:

datalab ocr annotate --input [path_to_your_image] --output results.json

Open results.json to see each line of text paired with its polygon coordinates.
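
To make the polygon data easier to scan, each polygon can be reduced to an axis-aligned bounding box. The sketch below is illustrative only: this guide does not document the layout of results.json, so the schema used here (a top-level `lines` list whose entries have `text` and `polygon` keys, with polygons as lists of `[x, y]` points) is an assumption, and the sample data is made up. Adjust the key names to match the real file.

```python
import json
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]


def bounding_box(polygon: List[List[float]]) -> Box:
    """Return (x_min, y_min, x_max, y_max) for a list of [x, y] points."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def summarize_lines(data: Dict[str, Any]) -> List[Tuple[str, Box]]:
    """Pair each line's text with its bounding box.

    Assumed (hypothetical) schema:
    {"lines": [{"text": "...", "polygon": [[x, y], ...]}, ...]}
    """
    return [
        (line["text"], bounding_box(line["polygon"]))
        for line in data.get("lines", [])
    ]


# Inline sample standing in for the contents of results.json (assumed layout).
sample = json.loads("""
{"lines": [
  {"text": "Invoice 1001", "polygon": [[10, 12], [210, 12], [210, 40], [10, 40]]},
  {"text": "Total: 42.00", "polygon": [[10, 50], [180, 52], [180, 78], [10, 76]]}
]}
""")

for text, box in summarize_lines(sample):
    print(text, box)
```

With a real file, replace the inline sample with `json.load(open("results.json"))` once the key names have been confirmed.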

4. Reviewing and Refining Output

Use a Markdown editor to organize and format the extracted text if necessary.
Examine the JSON output from Datalab OCR and check whether any polygons need adjusting or validating.
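
One way to approach the polygon validation is with a few geometric sanity checks. The following is a sketch rather than a Datalab feature: it assumes each polygon is a list of `[x, y]` points in pixel coordinates and that you supply the image width and height yourself. The function names are hypothetical.

```python
from typing import List, Sequence


def polygon_area(polygon: Sequence[Sequence[float]]) -> float:
    """Absolute polygon area computed with the shoelace formula."""
    total = 0.0
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_problems(
    polygon: Sequence[Sequence[float]], width: float, height: float
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    if len(polygon) < 3:
        return ["fewer than 3 points"]
    problems = []
    if any(not (0 <= x <= width and 0 <= y <= height) for x, y in polygon):
        problems.append("point outside image bounds")
    if polygon_area(polygon) == 0:
        problems.append("zero area")
    return problems


# A well-formed box, a box that leaves a 100x100 image, and a degenerate line.
print(polygon_problems([[0, 0], [40, 0], [40, 30], [0, 30]], 100, 100))
print(polygon_problems([[0, 0], [140, 0], [140, 30], [0, 30]], 100, 100))
print(polygon_problems([[0, 0], [10, 10], [20, 20]], 100, 100))
```

These checks flag only structurally suspicious polygons. Whether a polygon actually matches the text on the page still needs a visual check.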

Tips and Best Practices

Batch Processing: For larger datasets, automate the process with a script that loops through directories of files.
Document Quality: Use clear, legible scans and images to improve text recognition accuracy.
Text Validation: After extraction, use simple regex checks to catch and fix common formatting issues in the text.
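
As an illustration of the text validation tip, here is a minimal Python sketch of a regex cleanup pass. The specific rules are assumptions about what "common formatting issues" look like in OCR output, not a list provided by Datalab, and `tidy_extracted_text` is a hypothetical helper name.

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Apply a few conservative regex fixes to extracted text."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    # Caution: this also joins genuine compounds split at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Remove trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of spaces or tabs between words; leading indentation is kept.
    text = re.sub(r"(?<=\S)[ \t]{2,}(?=\S)", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim blank lines at the start, whitespace at the end, and end with one newline.
    return text.lstrip("\n").rstrip() + "\n"


raw = "Text extrac-\ntion  works   well.  \n\n\n\nNext paragraph."
print(tidy_extracted_text(raw))
```

The space-collapsing rule would also flatten column alignment in Markdown tables, so review the result before overwriting the original file.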

Common Issues

Incomplete Text Extraction: Ensure that images are not heavily compressed and that the text in them is sharp and legible.
Authentication Errors: Reauthenticate using datalab auth login if you encounter login issues.
Model Access Denied: Confirm that your subscription level includes the models, or contact Datalab support if they remain unavailable.

Next Steps

Advanced Formatting: Explore Datalab's Markdown formatting options.
Data Analysis: Feed the data extracted by Datalab OCR into data analysis tools for further processing.
Integration with Workflow Automations: Integrate the Datalab CLI into your CI/CD pipelines or document processing workflows.
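
For batch processing and workflow integration, a small driver script can build one conversion command per file and run them in sequence. The Python sketch below reuses the `datalab marker convert --input ... --output ...` command exactly as it appears in step 2. The helper names and the extension list (the two file types this guide mentions) are assumptions to adapt.

```python
import subprocess
from pathlib import Path
from typing import List, Union

# File types mentioned in this guide; extend the set if your plan supports more.
SUPPORTED_SUFFIXES = {".pdf", ".jpg"}


def build_marker_commands(
    input_dir: Union[str, Path], output_dir: Union[str, Path]
) -> List[List[str]]:
    """Build one `datalab marker convert` command per supported file."""
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            # Note: a.pdf and a.jpg would both map to a.md and overwrite each other.
            target = Path(output_dir) / (path.stem + ".md")
            commands.append(
                ["datalab", "marker", "convert",
                 "--input", str(path), "--output", str(target)]
            )
    return commands


def run_all(commands: List[List[str]]) -> List[List[str]]:
    """Run each command in order and return the ones that exited with an error."""
    failed = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(command)
    return failed
```

A typical call would be `run_all(build_marker_commands("scans", "markdown"))`, where both directory names are placeholders and the output directory already exists. Building the commands separately from running them makes it easy to print and inspect them first, which is useful in a CI/CD job.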

With these steps, you can start using Datalab Marker and Datalab OCR to extract text from documents and images and to build the results into your data processing workflows.
