ParseSphere

Core Concepts

ParseSphere provides two primary capabilities: document parsing for text extraction and tabular data querying using natural language. Understanding these core concepts will help you structure integrations effectively.


Parse Jobs

Document parsing operates asynchronously because processing time varies significantly based on file size, content complexity, and document type. When you submit a document, ParseSphere returns immediately with a parse_id that you use to track progress.

Job Lifecycle

Queued → Processing → Completed (or Failed)

Queued: Jobs enter the queue while waiting for an available worker. During high load, this state may last a few seconds.

Processing: Active text extraction and analysis with real-time progress updates (0-100%) and status messages like "Extracting text", "Analyzing tables", or "Running OCR".

Completed: Results are available for retrieval through the /v1/parses/{parse_id} endpoint. The response includes full text, extracted tables, metadata, and optional chunks.

Failed: Processing encountered an error. Common causes include unsupported file formats, corrupted content, password-protected documents, or processing timeouts.

Tracking Progress

Use polling or webhooks to monitor job status:

GET /v1/parses/{parse_id}

Check current processing status and progress

bash
curl https://api.parsesphere.com/v1/parses/550e8400-e29b-41d4-a716-446655440000 \
-H "Authorization: Bearer sk_your_api_key"

Use Webhooks in Production

For production applications, provide a webhook_url when creating a parse to receive automatic notifications when processing completes. This eliminates the need for polling and reduces API calls.
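A webhook receiver only needs to inspect the notification and fetch full results when the job succeeded. The payload fields below (`parse_id`, `status`) are assumptions for illustration, not a documented schema; check the webhooks reference for the actual shape:

```python
import json

def handle_parse_webhook(raw_body: bytes):
    """Decide what to do with a parse notification.

    Assumes the payload carries `parse_id` and `status` fields;
    the real schema may differ.
    """
    event = json.loads(raw_body)
    status = event.get("status")
    if status == "completed":
        # Fetch full results from GET /v1/parses/{parse_id} here.
        return ("fetch_results", event["parse_id"])
    if status == "failed":
        return ("log_failure", event["parse_id"])
    return ("ignore", event.get("parse_id"))

action, pid = handle_parse_webhook(
    json.dumps({"parse_id": "550e8400", "status": "completed"}).encode()
)
```

Keeping the handler a pure function of the request body makes it easy to unit-test and to reuse behind whatever web framework you deploy.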

Result Caching

Information

Completed parse results are cached for the duration specified by the session_ttl parameter (default: 24 hours, minimum: 60 seconds).

After the TTL expires, you'll need to re-submit the document. For documents you'll reference repeatedly or share across multiple systems, increase the session_ttl when creating the parse:

bash
curl -X POST https://api.parsesphere.com/v1/parses \
-H "Authorization: Bearer sk_your_api_key" \
-F "file=@contract.pdf" \
-F "session_ttl=7200"  # Cache for 2 hours

Processing Times

Processing duration varies significantly based on file size, content complexity, document type, and whether OCR is required. The API provides an estimated processing time when you submit a document.

Warning

Design your integration with appropriate timeout handling. We recommend monitoring the status and progress fields rather than relying on time estimates.


Workspaces

Workspaces act as containers for organizing tabular datasets that you want to query together. Think of them as analytical contexts where related data lives.

Why Use Workspaces?

Multi-Dataset Queries: A single natural language query can reference multiple datasets within the same workspace. ParseSphere automatically joins and correlates data across files.

Organizational Clarity: Structure workspaces around analytical contexts:

  • "Q4 Sales Analysis" → regional sales, product catalogs, customer segments
  • "Financial Reporting" → revenue data, expense reports, budget forecasts
  • "Customer Analytics" → user behavior, support tickets, feedback surveys

Permission Boundaries: Workspace access controls determine which users can query which datasets, providing a natural security boundary.

Creating Workspaces

POST /v1/workspaces

Create a new workspace for organizing datasets

bash
curl -X POST https://api.parsesphere.com/v1/workspaces \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
  "name": "Q4 Sales Analysis",
  "description": "Sales data and metrics for Q4 2025"
}'

Datasets

Datasets are CSV or Excel files uploaded to a workspace. ParseSphere automatically converts uploaded files to an optimized format that enables fast analytical queries even on large files.

Dataset Lifecycle

Similar to parse jobs, dataset processing happens asynchronously:

Queued: Waiting for processing.

Processing: Analyzing structure and optimizing the file.

Completed: Ready for queries.

Failed: A processing error occurred.

What Happens During Processing?

  1. Structure Analysis: Identifies columns, headers, and data types
  2. Type Inference: Determines if columns contain numbers, dates, text, or categories
  3. Sample Extraction: Extracts representative values to help the AI understand your data
  4. Optimization: Converts to an efficient query format (typically 10-100x faster than CSV)
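The type-inference step can be pictured with a toy version: given sample values from a column, decide whether it holds numbers, dates, or text. This is only an illustration of the idea, not ParseSphere's actual algorithm:

```python
from datetime import datetime

def infer_column_type(samples):
    """Guess a column type from sample string values (toy illustration)."""
    def all_match(parser):
        try:
            for s in samples:
                parser(s)
            return True
        except ValueError:
            return False

    if all_match(float):
        return "number"
    if all_match(lambda s: datetime.strptime(s, "%Y-%m-%d")):
        return "date"
    return "text"

print(infer_column_type(["19.99", "4.50"]))             # number
print(infer_column_type(["2025-10-01", "2025-10-02"]))  # date
print(infer_column_type(["North", "South"]))            # text
```

A production system would try many more formats and tolerate a small fraction of non-conforming values, but the ordered try-the-strictest-parser-first structure is the same.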

Uploading Datasets

POST /v1/workspaces/{workspace_id}/datasets

Upload a CSV or Excel file to your workspace

bash
curl -X POST https://api.parsesphere.com/v1/workspaces/ws_abc123/datasets \
-H "Authorization: Bearer sk_your_api_key" \
-F "file=@sales_q4.csv"

Information

Processing time varies by file size and complexity. Small files (less than 1MB) typically process in 5-15 seconds. Large files (over 50MB) may take several minutes.
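Because processing time scales with file size, a fixed polling interval either wastes calls on small files or hammers the API for large ones; exponential backoff handles both. A sketch, with `get_status` standing in for a hypothetical dataset-status request:

```python
import time

def wait_for_dataset(get_status, dataset_id, max_wait=600,
                     initial=1.0, factor=2.0, cap=30.0):
    """Poll with exponential backoff until the dataset is ready.

    `get_status` is a placeholder for the HTTP call that returns the
    dataset's `status` string.
    """
    waited, delay = 0.0, initial
    while waited <= max_wait:
        status = get_status(dataset_id)
        if status == "completed":
            return True
        if status == "failed":
            raise RuntimeError(f"dataset {dataset_id} failed to process")
        time.sleep(delay)
        waited += delay
        delay = min(delay * factor, cap)  # 1s, 2s, 4s, ... capped at 30s
    raise TimeoutError(f"dataset {dataset_id} not ready after {max_wait}s")

# Simulated status sequence: ready on the third check.
_seq = iter(["queued", "processing", "completed"])
ok = wait_for_dataset(lambda d: next(_seq), "ds_1", initial=0, cap=0)
```

With the defaults, a 10-second job costs about four requests and a multi-minute job settles into one request every 30 seconds.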


Natural Language Queries

Once your datasets are ready, query them using plain English. ParseSphere interprets your questions, generates appropriate queries, and returns both raw data and natural language explanations.

How It Works

The AI understands:

  • Column names and relationships: "revenue", "sales", "customer_id"
  • Temporal concepts: "last quarter", "this year", "month over month"
  • Aggregations: "sum", "average", "top 5", "count"
  • Multi-dataset joins: Automatically correlates data across files

Query Examples

POST /v1/workspaces/{workspace_id}/query

Ask questions about your data in natural language

bash
curl -X POST https://api.parsesphere.com/v1/workspaces/ws_abc123/query \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
  "query": "What are the top 5 products by revenue?"
}'
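Client code usually wants both the natural-language answer and the underlying rows, and (per the best practices below) the generated SQL is worth surfacing for review. The field names used here (`answer`, `data`, `sql`) are assumptions for illustration; consult the query endpoint reference for the exact response schema:

```python
def summarize_query_response(response: dict):
    """Pull out the pieces an application typically needs.

    Field names are illustrative assumptions, not a documented schema.
    """
    return {
        "answer": response.get("answer", ""),
        "rows": response.get("data", []),
        # Log this to see how the question was interpreted.
        "sql": response.get("sql"),
    }

# Fabricated response body for demonstration only.
sample = {
    "answer": "Widget A leads with $120k in revenue.",
    "data": [{"product": "Widget A", "revenue": 120000}],
    "sql": "SELECT product, SUM(revenue) AS revenue ...",
}
summary = summarize_query_response(sample)
```

Using `.get()` with defaults keeps the client tolerant of optional fields that may be absent for some query types.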

Multi-Dataset Intelligence

If you have multiple related datasets in a workspace (e.g., "sales.csv" and "products.csv"), ParseSphere can automatically join them to answer questions like "What's the profit margin on our top-selling products?"

Best Practices

Be Specific: "Show revenue by product category for Q4" is better than "show sales"

Use Column Names: Reference actual column names when possible for more accurate results

Start Simple: Test with straightforward queries before complex multi-dataset questions

Review SQL: Check the generated SQL to understand how your question was interpreted


What's Next?

Now that you understand core concepts, dive deeper into specific features: