Core Concepts
ParseSphere provides two primary capabilities: document parsing for text extraction and tabular data querying using natural language. Understanding these core concepts will help you structure integrations effectively.
Parse Jobs
Document parsing operates asynchronously because processing time varies significantly based on file size, content complexity, and document type. When you submit a document, ParseSphere returns immediately with a parse_id that you use to track progress.
Job Lifecycle
Jobs move through four states: queued → processing → completed (or failed).
Queued: Jobs enter the queue while waiting for an available worker. During high load, this state may last a few seconds.
Processing: Active text extraction and analysis with real-time progress updates (0-100%) and status messages like "Extracting text", "Analyzing tables", or "Running OCR".
Completed: Results are available for retrieval through the /v1/parses/{parse_id} endpoint. The response includes full text, extracted tables, metadata, and optional chunks.
Failed: Processing encountered an error. Common causes include unsupported file formats, corrupted content, password-protected documents, or processing timeouts.
Tracking Progress
Use polling or webhooks to monitor job status:
GET /v1/parses/{parse_id}
Check current processing status and progress

curl https://api.parsesphere.com/v1/parses/550e8400-e29b-41d4-a716-446655440000 \
-H "Authorization: Bearer sk_your_api_key"

Use Webhooks in Production
For production applications, provide a webhook_url when creating a parse to receive automatic notifications when processing completes. This eliminates the need for polling and reduces API calls.
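For scripts and local testing where webhooks are impractical, the polling approach can be sketched as a simple loop over the status field. This is a minimal sketch: `fetch_status` stands in for an authenticated HTTP GET to /v1/parses/{parse_id}, and the response field names (`status`, `progress`, `error`) are assumptions based on the lifecycle described above.

```python
import time

def wait_for_parse(parse_id, fetch_status, poll_interval=2.0, timeout=300.0):
    """Poll until a parse job completes or fails.

    fetch_status(parse_id) stands in for an HTTP GET to
    /v1/parses/{parse_id} and must return the parsed JSON body.
    Field names ("status", "progress", "error") are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(parse_id)
        status = job["status"]
        if status == "completed":
            return job  # full text, tables, metadata now available
        if status == "failed":
            raise RuntimeError(job.get("error", "parse failed"))
        # Still queued or processing: report progress and wait.
        print(f"{status}: {job.get('progress', 0)}%")
        time.sleep(poll_interval)
    raise TimeoutError(f"parse {parse_id} did not finish in {timeout}s")
```

Note the explicit timeout: the loop is driven by the reported status, not by the estimated processing time.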
Result Caching
Information
Completed parse results are cached for the duration specified by the session_ttl parameter (default: 24 hours, minimum: 60 seconds).
After the TTL expires, you'll need to re-submit the document. For documents you'll reference repeatedly or share across multiple systems, increase the session_ttl when creating the parse:
curl -X POST https://api.parsesphere.com/v1/parses \
-H "Authorization: Bearer sk_your_api_key" \
-F "file=@contract.pdf" \
-F "session_ttl=7200" # Cache for 2 hoursProcessing Times
Processing duration varies significantly based on file size, content complexity, document type, and whether OCR is required. The API provides an estimated processing time when you submit a document.
Warning
Design your integration with appropriate timeout handling. We recommend monitoring the status and progress fields rather than relying on time estimates.
Workspaces
Workspaces act as containers for organizing tabular datasets that you want to query together. Think of them as analytical contexts where related data lives.
Why Use Workspaces?
Multi-Dataset Queries: A single natural language query can reference multiple datasets within the same workspace. ParseSphere automatically joins and correlates data across files.
Organizational Clarity: Structure workspaces around analytical contexts:
- "Q4 Sales Analysis" → regional sales, product catalogs, customer segments
- "Financial Reporting" → revenue data, expense reports, budget forecasts
- "Customer Analytics" → user behavior, support tickets, feedback surveys
Permission Boundaries: Workspace access controls determine which users can query which datasets, providing a natural security boundary.
Creating Workspaces
POST /v1/workspaces
Create a new workspace for organizing datasets
curl -X POST https://api.parsesphere.com/v1/workspaces \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"name": "Q4 Sales Analysis",
"description": "Sales data and metrics for Q4 2025"
}'

Datasets
Datasets are CSV or Excel files uploaded to a workspace. ParseSphere automatically converts uploaded files to an optimized format that enables fast analytical queries even on large files.
Dataset Lifecycle
Similar to parse jobs, dataset processing happens asynchronously:
Datasets move through the same states: queued (waiting for processing), processing (analyzing structure and optimizing), ready (available for queries), or failed (processing error).
What Happens During Processing?
- Structure Analysis: Identifies columns, headers, and data types
- Type Inference: Determines if columns contain numbers, dates, text, or categories
- Sample Extraction: Extracts representative values to help the AI understand your data
- Optimization: Converts to an efficient query format (typically 10-100x faster than CSV)
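For intuition, the type-inference step can be imagined as classifying each column from its values. The toy function below is purely illustrative of that idea (it is not ParseSphere's actual algorithm, and the thresholds and date format are arbitrary choices):

```python
import re

def infer_column_type(values, category_threshold=0.5):
    """Toy type inference: "number", "date", "category", or "text".

    Illustrative of the kind of analysis described above, not
    ParseSphere's implementation. Only ISO dates are recognized,
    for brevity.
    """
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    if all(is_number(v) for v in non_empty):
        return "number"
    if all(date_re.match(v) for v in non_empty):
        return "date"
    # Few distinct values relative to row count suggests a categorical column.
    if len(set(non_empty)) / len(non_empty) <= category_threshold:
        return "category"
    return "text"
```

The inferred types are what let the query engine know, for example, that "revenue" can be summed and "order_date" can be bucketed by quarter.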
Uploading Datasets
POST /v1/workspaces/{workspace_id}/datasets
Upload a CSV or Excel file to your workspace
curl -X POST https://api.parsesphere.com/v1/workspaces/ws_abc123/datasets \
-H "Authorization: Bearer sk_your_api_key" \
-F "file=@sales_q4.csv"Information
Processing time varies by file size and complexity. Small files (less than 1MB) typically process in 5-15 seconds. Large files (over 50MB) may take several minutes.
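Because processing is asynchronous, an integration typically uploads a file and then waits for the dataset to become ready before querying it. The sketch below assumes a GET endpoint for individual dataset status and response fields `id` and `status`; these are assumptions for illustration, and `api` stands in for an authenticated HTTP client.

```python
import time

def upload_and_wait(api, workspace_id, path, poll_interval=2.0, timeout=600.0):
    """Upload a CSV/Excel file, then block until it is ready to query.

    api(method, url, **kwargs) stands in for an authenticated HTTP
    client returning parsed JSON. The per-dataset GET endpoint and the
    "id"/"status" field names are assumptions, not confirmed API shape.
    """
    dataset = api("POST", f"/v1/workspaces/{workspace_id}/datasets", file=path)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = api("GET", f"/v1/workspaces/{workspace_id}/datasets/{dataset['id']}")
        if current["status"] == "ready":
            return current
        if current["status"] == "failed":
            raise RuntimeError(current.get("error", "dataset processing failed"))
        time.sleep(poll_interval)
    raise TimeoutError(f"dataset {dataset['id']} not ready within {timeout}s")
```

The generous default timeout reflects that large files may take several minutes to process.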
Natural Language Queries
Once your datasets are ready, query them using plain English. ParseSphere interprets your questions, generates appropriate queries, and returns both raw data and natural language explanations.
How It Works
The AI understands:
- Column names and relationships: "revenue", "sales", "customer_id"
- Temporal concepts: "last quarter", "this year", "month over month"
- Aggregations: "sum", "average", "top 5", "count"
- Multi-dataset joins: Automatically correlates data across files
Query Examples
POST /v1/workspaces/{workspace_id}/query
Ask questions about your data in natural language
curl -X POST https://api.parsesphere.com/v1/workspaces/ws_abc123/query \
-H "Authorization: Bearer sk_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"query": "What are the top 5 products by revenue?"
}'

Multi-Dataset Intelligence
If you have multiple related datasets in a workspace (e.g., "sales.csv" and "products.csv"), ParseSphere can automatically join them to answer questions like "What's the profit margin on our top-selling products?"
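In client code, a query is a single POST. The helper below is a minimal sketch: `post` stands in for an authenticated HTTP client, and the response field names (`answer`, `data`, `sql`) are assumptions; inspect a real response to confirm the shape.

```python
def ask(post, workspace_id, question):
    """Run a natural language query and return (answer, rows, sql).

    post(url, json) stands in for an authenticated HTTP POST returning
    parsed JSON. The "answer"/"data"/"sql" field names are assumptions.
    """
    resp = post(f"/v1/workspaces/{workspace_id}/query",
                json={"query": question})
    # Keeping the generated SQL around is useful: it shows exactly how
    # the question was interpreted.
    return resp.get("answer"), resp.get("data", []), resp.get("sql")
```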
Best Practices
Be Specific: "Show revenue by product category for Q4" is better than "show sales"
Use Column Names: Reference actual column names when possible for more accurate results
Start Simple: Test with straightforward queries before complex multi-dataset questions
Review SQL: Check the generated SQL to understand how your question was interpreted
What's Next?
Now that you understand core concepts, dive deeper into specific features:
- Document Parsing - Learn about extraction options and formats
- Tabula - Master natural language queries for tabular data
- Rate Limits - Understand API quotas and limits
- Error Handling - Handle common error scenarios
