Document Parsing

Learn about supported file formats, processing configuration, and how to extract structured data from documents.

Supported Formats

ParseSphere accepts a wide range of document formats, each with specific extraction capabilities:

PDF Documents

Native PDFs: Extracts text, tables, and metadata with high accuracy. Text-based PDFs process faster than scanned documents.

Scanned PDFs: Applies OCR to extract text from images. Includes automatic table detection within scanned pages. Processing time depends on page count and image quality.

Microsoft Office

Word (.docx): Extracts full text content, embedded tables, and document structure. Preserves formatting information and paragraph boundaries.

PowerPoint (.pptx): Captures slide text, speaker notes, and applies OCR to embedded images. Tables detected within images are returned as structured data.

Excel (.xlsx): Extracts all sheets with full preservation of cell values and data types. For querying Excel files with natural language, use the Tabula service instead.

Tabular Data

CSV: Parsed with automatic delimiter detection and column type inference. For natural language queries over CSV data, use the Tabula service instead.

Plain Text

Text (.txt): Read directly with support for RTF format detection. Fastest processing option.

File Size Limits

The maximum file size is 50 MB for every supported format. Split or compress larger files before upload.

Processing Configuration

Configure how ParseSphere processes your documents using these parameters:

file

REQUIRED File

Document to parse (max 50 MB). Supported formats: PDF, DOCX, PPTX, XLSX, CSV, TXT

process_images

OPTIONAL Boolean

Default: true

Enable OCR for images and scanned pages. Disable for text-native documents to reduce processing time

extract_tables

OPTIONAL Boolean

Default: true

Extract tables as structured JSON objects with headers, rows, and metadata

chunk

OPTIONAL Boolean

Default: false

Split document into semantic chunks for vector databases or RAG pipelines

session_ttl

OPTIONAL Integer

Default: 86400

Results cache duration in seconds (min: 60, max: 86400). Default is 24 hours

webhook_url

OPTIONAL String

URL to receive HTTP POST notification when processing completes

webhook_secret

OPTIONAL String

Secret for HMAC-SHA256 signature verification of webhook payloads
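The parameters above can be assembled and checked on the client before a request is sent. The following is a minimal sketch, not an official client: `build_parse_fields` is a hypothetical helper, and sending booleans as lowercase strings is an assumption about how the form body is encoded. Only the parameter names, defaults, and the `session_ttl` range come from this page.

```python
from typing import Dict, Optional

MIN_TTL = 60      # seconds, documented minimum
MAX_TTL = 86400   # seconds, documented maximum and default


def build_parse_fields(
    process_images: bool = True,
    extract_tables: bool = True,
    chunk: bool = False,
    session_ttl: int = MAX_TTL,
    webhook_url: Optional[str] = None,
    webhook_secret: Optional[str] = None,
) -> Dict[str, str]:
    """Return form fields for a parse request (hypothetical helper)."""
    if not MIN_TTL <= session_ttl <= MAX_TTL:
        raise ValueError(f"session_ttl must be between {MIN_TTL} and {MAX_TTL}")

    fields = {
        # Assumption: booleans are sent as lowercase strings in the form body.
        "process_images": str(process_images).lower(),
        "extract_tables": str(extract_tables).lower(),
        "chunk": str(chunk).lower(),
        "session_ttl": str(session_ttl),
    }
    # Optional fields are left out entirely when they are not set.
    if webhook_url is not None:
        fields["webhook_url"] = webhook_url
    if webhook_secret is not None:
        fields["webhook_secret"] = webhook_secret
    return fields


fields = build_parse_fields(process_images=False, chunk=True, session_ttl=3600)
print(fields)
```

Rejecting an out-of-range `session_ttl` locally avoids a round trip for a request that would likely be refused anyway.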

Understanding `process_images`

Optimize Processing Time

Disable process_images when processing text-native documents to reduce processing time by 50-70%.

When enabled, ParseSphere automatically:

Detects scanned PDFs and applies OCR
Extracts text from images embedded in documents
Recognizes tables within images
Processes handwritten content (with reduced accuracy)

When to disable: Processing Word documents, text-based PDFs, or any document where images are decorative rather than content-bearing.

Understanding `extract_tables`

Controls whether tables are returned as structured JSON objects. Table extraction support varies by format:

| Format | Native Tables | OCR Tables | Best Use Case |
|--------|---------------|------------|---------------|
| PDF | ✓ | ✓ | All table extraction |
| Word (.docx) | ✓ | ✗ | Native tables only |
| PowerPoint (.pptx) | ✗ | ✓ | Image-based tables |
| Excel/CSV | N/A | N/A | Use Tabula service |

Each extracted table includes:

Headers: Column names (if detected)
Rows: Data as key-value dictionaries
Metadata: Row/column counts, page number (when applicable)
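Because rows arrive as key-value dictionaries keyed by header, an extracted table maps directly onto Python's `csv.DictWriter`. The sketch below uses a sample table with the fields listed above. `table_to_csv` is a hypothetical helper, and falling back to the row keys when no headers were detected is an assumption about how you might handle that case.

```python
import csv
import io

# Sample table using the fields listed above.
table = {
    "page": 5,
    "headers": ["Name", "Amount", "Date"],
    "rows": [
        {"Name": "Item 1", "Amount": "100", "Date": "2024-01-01"},
        {"Name": "Item 2", "Amount": "200", "Date": "2024-01-02"},
    ],
    "row_count": 2,
    "column_count": 3,
}


def table_to_csv(table: dict) -> str:
    """Serialize one extracted table to CSV text."""
    rows = table.get("rows") or []
    headers = table.get("headers")
    if not headers:
        # Headers are only present if detected; fall back to the row keys.
        headers = list(rows[0].keys()) if rows else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = table_to_csv(table)
print(csv_text)
```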

Information

For Excel and CSV files, use the Tabula service to query data with natural language instead of extracting as table objects.

Understanding `chunk`

Perfect for RAG

Enable chunking when building retrieval-augmented generation (RAG) systems or storing content in vector databases.

Set chunk=true to split text at semantic boundaries:

Sentence boundaries: Intelligent splitting that preserves sentence integrity
Token limits: Configurable maximum chunk size for optimal embeddings
Overlap: Chunks include overlapping context for better retrieval

Chunk structure:

{
  "text": "Chunk content...",
  "token_count": 125,
  "chunk_index": 12
}
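When sending chunks to an embedding model, it is often useful to group them into batches that stay under a token budget. The sketch below relies only on the `token_count` and `chunk_index` fields shown above. `batch_chunks` and the 250-token budget are illustrative choices, not part of the API.

```python
from typing import Dict, Iterable, List

chunks = [
    {"text": "First chunk...", "token_count": 120, "chunk_index": 0},
    {"text": "Third chunk...", "token_count": 90, "chunk_index": 2},
    {"text": "Second chunk...", "token_count": 125, "chunk_index": 1},
]


def batch_chunks(chunks: Iterable[Dict], max_tokens: int) -> List[List[Dict]]:
    """Group chunks, in document order, into batches under a token budget."""
    batches: List[List[Dict]] = []
    current: List[Dict] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        # Start a new batch when adding this chunk would exceed the budget.
        if current and used + chunk["token_count"] > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += chunk["token_count"]
    if current:
        batches.append(current)
    return batches


batches = batch_chunks(chunks, max_tokens=250)
print([[c["chunk_index"] for c in batch] for batch in batches])
```

A single chunk larger than the budget still gets its own batch rather than being dropped, so the caller can decide how to handle it.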

Understanding `session_ttl`

Controls result cache duration:

Default: 86400 seconds (24 hours)
Minimum: 60 seconds (1 minute)
Maximum: 86400 seconds (24 hours)

Use longer TTL when:

Sharing results across multiple systems
Processing reference documents repeatedly
Building user-facing applications with multiple views

Use shorter TTL for:

Sensitive documents
High-volume processing
Cost optimization
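If your application stores a reference to cached results, it can track when they stop being available. This sketch assumes the TTL is counted from a reference time that you record yourself (for example, when the completion notification arrives). The page does not state the exact starting point, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def results_expire_at(cached_at: datetime, session_ttl: int) -> datetime:
    """Return the moment the cached results are expected to expire."""
    return cached_at + timedelta(seconds=session_ttl)


def is_expired(cached_at: datetime, session_ttl: int, now: Optional[datetime] = None) -> bool:
    """Check whether the cache window has passed (assumed reference time)."""
    now = now or datetime.now(timezone.utc)
    return now >= results_expire_at(cached_at, session_ttl)


cached_at = datetime.fromisoformat("2025-12-06T12:00:45+00:00")
expires = results_expire_at(cached_at, session_ttl=3600)
print(expires.isoformat())
```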

Submit a Document

Create a parse job by uploading a document:

POST /v1/parses

Submit a document for text extraction and processing

bash

curl -X POST https://api.parsesphere.com/v1/parses \
  -H "Authorization: Bearer sk_your_api_key" \
  -F "file=@contract.pdf"
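The same request can be built in Python with the standard library alone. This is a sketch rather than an official client: `encode_multipart` and `build_parse_request` are hypothetical helpers, and passing options such as `chunk` as the string `"true"` is an assumption about the form encoding. The request is constructed but not sent, so the example runs without network access.

```python
import urllib.request
import uuid
from typing import Dict, Tuple

API_URL = "https://api.parsesphere.com/v1/parses"


def encode_multipart(fields: Dict[str, str], filename: str, content: bytes) -> Tuple[bytes, str]:
    """Encode form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    # The file part uses the "file" field name, as in the curl example.
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_parse_request(api_key: str, filename: str, content: bytes, **fields: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the parse endpoint."""
    body, content_type = encode_multipart(fields, filename, content)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )


# Placeholder bytes stand in for the real file contents.
request = build_parse_request("sk_your_api_key", "contract.pdf", b"%PDF-1.7 ...", chunk="true")
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and read the response body.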

Receiving Results via Webhook

Production Best Practice

Use webhooks instead of polling for production applications. Webhooks are more reliable, reduce API calls, and provide instant notifications.

Provide a webhook_url parameter to receive an HTTP POST when processing completes:

POST https://your-app.com/webhook

Webhook notification sent by ParseSphere when processing completes

json

{
  "parse_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "text": "Full extracted text content...",
    "metadata": {
      "filename": "contract.pdf",
      "file_type": "pdf",
      "file_size": 2048576,
      "page_count": 50,
      "processing_time": 45.2,
      "characters": 125000,
      "tokens": 31250
    },
    "tables": [
      {
        "page": 5,
        "headers": ["Name", "Amount", "Date"],
        "rows": [
          {"Name": "Item 1", "Amount": "100", "Date": "2024-01-01"},
          {"Name": "Item 2", "Amount": "200", "Date": "2024-01-02"}
        ],
        "row_count": 2,
        "column_count": 3
      }
    ],
    "chunks": null
  },
  "timestamp": "2025-12-06T12:00:45Z",
  "processing_time": 45.2
}

Result Structure

text: The complete extracted text content from the document.

metadata: Document information including:

filename, file_type, file_size - Basic file information
page_count - Number of pages (for paginated formats)
processing_time - Time taken to process in seconds
characters - Character count of extracted text
tokens - Token count for AI processing cost estimation

tables: Array of extracted tables, or null if none found or extract_tables=false. Each table includes page number, headers, rows as dictionaries, and counts.

chunks: Array of text chunks if chunk=true, otherwise null. Each chunk includes text content, token count, and index.
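Since `tables` and `chunks` can both be null, code that consumes a result should normalize them before iterating. The sketch below parses a trimmed copy of the payload shown earlier and collects a few headline values. `summarize_result` is a hypothetical helper, not part of the API.

```python
import json

payload = json.loads("""
{
  "parse_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "text": "Full extracted text content...",
    "metadata": {"filename": "contract.pdf", "page_count": 50, "tokens": 31250},
    "tables": [
      {"page": 5, "headers": ["Name", "Amount", "Date"], "rows": [], "row_count": 0, "column_count": 3}
    ],
    "chunks": null
  }
}
""")


def summarize_result(result: dict) -> dict:
    """Collect a few headline values, treating null arrays as empty."""
    tables = result.get("tables") or []   # null when none found or extract_tables=false
    chunks = result.get("chunks") or []   # null unless chunk=true
    metadata = result.get("metadata", {})
    return {
        "filename": metadata.get("filename"),
        "pages": metadata.get("page_count"),  # only meaningful for paginated formats
        "tokens": metadata.get("tokens"),
        "table_count": len(tables),
        "chunk_count": len(chunks),
    }


summary = summarize_result(payload["result"])
print(summary)
```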

Webhook Security

Always Verify Signatures

Never trust webhook payloads without verifying the HMAC signature. This prevents spoofed or malicious requests.

Verify webhook authenticity using the X-ParseSphere-Signature header:

python

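import hashlib
import hmac
import json

def verify_webhook(payload, signature, secret):
  """Verify the HMAC-SHA256 signature of a raw webhook payload."""
  # Sign the body exactly as received; most frameworks expose it as bytes
  if isinstance(payload, str):
      payload = payload.encode('utf-8')
  expected_signature = hmac.new(
      secret.encode('utf-8'),
      payload,
      hashlib.sha256
  ).hexdigest()
  
  # Constant-time comparison; a missing header never matches
  return hmac.compare_digest(signature or "", expected_signature)

# In your webhook handler (app, request, and WEBHOOK_SECRET come from
# your web framework and configuration)
@app.post("/webhook")
def handle_webhook(request):
  signature = request.headers.get("X-ParseSphere-Signature")
  payload = request.body  # raw body, before any JSON parsing
  
  if not verify_webhook(payload, signature, WEBHOOK_SECRET):
      return {"error": "Invalid signature"}, 401
  
  # Process webhook only after the signature check passes
  data = json.loads(payload)
  return {"success": True}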
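To exercise a handler locally, you can generate a signature yourself with the same HMAC-SHA256 hex digest scheme that `verify_webhook` uses. In this sketch, `sign_payload` and the secret value are hypothetical and for testing only. It assumes the header carries the bare hex digest with no prefix, as the handler above implies.

```python
import hashlib
import hmac
import json


def sign_payload(payload: bytes, secret: str) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw payload bytes."""
    return hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()


secret = "test_secret"  # hypothetical value, for local testing only
body = json.dumps(
    {"parse_id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed"}
).encode("utf-8")
signature = sign_payload(body, secret)

# The untouched body verifies against its own signature.
assert hmac.compare_digest(signature, sign_payload(body, secret))

# Any change to the body produces a different signature.
tampered = body.replace(b"completed", b"failed")
assert not hmac.compare_digest(signature, sign_payload(tampered, secret))
```

Always sign and verify the raw bytes. Re-serializing parsed JSON can change whitespace or key order and break the comparison.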

Processing Duration

Processing time varies based on file size, content complexity, and document type. Key factors that affect processing speed:

Text-native PDFs: Fastest processing. Affected by page count and file size.

Scanned PDFs: Slower processing due to OCR requirements. Affected by page count, image quality, and content complexity.

Word/PowerPoint: Medium processing speed. Affected by embedded images and table count.

Excel/CSV: Fast processing. Affected by file size and row count.

Large files (over 10 MB): Processing time varies, and these files may require additional processing steps.

Information

The API provides an estimated processing time when you submit a document. Actual processing time depends on current system load, file complexity, and selected options.

Warning

Design your integration with appropriate timeout handling and use webhooks for reliable notifications instead of polling.
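Where a webhook cannot be used, a bounded wait loop is one way to implement the timeout handling described above. The sketch takes the status lookup as a callable, because this page does not document a status endpoint. The `processing` and `failed` status values are assumptions; only `completed` appears in the payload example.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_status: Callable[[], str],
    timeout: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status with exponential backoff until a final state or timeout."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):  # "failed" is an assumed status
            return status
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"parse job still '{status}' after {timeout} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Usage with a stand-in status source; a real one would query the API.
statuses = iter(["processing", "processing", "completed"])
result = wait_for_completion(lambda: next(statuses), sleep=lambda seconds: None)
print(result)
```

Doubling the delay up to a cap keeps the number of API calls low for long-running jobs while still returning quickly for short ones.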

Optimization Tips

Reduce Processing Time:

Disable process_images for text-native documents
Disable extract_tables if you don't need table data
Compress large PDFs before upload
Split documents over 50 MB

Improve Accuracy:

Use high-quality scans (300+ DPI)
Ensure good contrast and minimal noise
Provide straight-aligned documents
Use text-based PDFs when possible

What's Next?

Explore related topics:

Quick Start - Make your first parse request
Core Concepts - Understand parse job lifecycle
Tabula - Query tabular data with natural language
Error Handling - Handle parsing errors
