PDF Parsing - Sphinx Docs

Overview

Sphinx integrates with Reducto to parse PDF documents, enabling you to analyze reports, research papers, financial documents, and more directly in your notebooks. When you mention a PDF file using @, Sphinx automatically:

Validates the PDF format and size
Sends it to Reducto for intelligent parsing
Converts the content to structured Markdown
Caches the result for future use
Makes the content available to the AI for analysis

How to Use

Reference a PDF with @ Mentions

Type @ in the chat input and select a PDF file from your workspace:

Open Sphinx (Cmd+T / Ctrl+T)
Type @ to open the file picker
Search for and select your PDF file
Add your question or analysis request

Example prompts:

@quarterly-report.pdf Summarize the key financial metrics
@research-paper.pdf What methodology did the authors use?
@invoice.pdf Extract the line items into a DataFrame

PDFs cannot be read directly using the file read tool. Always use the @ mention flow to reference PDFs so they can be properly parsed.

What Gets Extracted

Reducto extracts and structures:

Content Type	How It’s Handled
Text	Preserved with formatting, organized by page
Tables	Converted to Markdown tables
Headers	Maintained as Markdown headings
Figures	Summarized with AI-generated descriptions
Page Numbers	Tracked for reference

Caching

Parsed PDFs are cached on your local filesystem to avoid re-parsing the same document:

Location: ~/.sphinx/parsed/
Filename format: {original-name}-{content-hash}.pdf.md

When you reference a PDF that’s already been parsed:

Sphinx checks the kernel filesystem for a cached version
If found, uses the cached Markdown directly
If not found, sends to Reducto for parsing

The cache is content-based, not filename-based. If you update a PDF, it will be re-parsed automatically because the content hash changes.
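
The lookup described above can be approximated in a few lines of Python. This is a minimal sketch, not Sphinx's actual implementation: only the cache directory and the {original-name}-{content-hash}.pdf.md pattern come from this page. The hash algorithm (SHA-256), the 16-character truncation, the reading of {original-name} as the filename without its .pdf suffix, and the helper names are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Cache directory documented above.
CACHE_DIR = Path.home() / ".sphinx" / "parsed"


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 truncated to 16 hex characters. The real
    # algorithm and digest length are not documented.
    return hashlib.sha256(data).hexdigest()[:16]


def cached_markdown_path(pdf_name: str, data: bytes, cache_dir: Path = CACHE_DIR) -> Path:
    # Follows the documented pattern {original-name}-{content-hash}.pdf.md,
    # assuming {original-name} is the filename without its .pdf suffix.
    stem = Path(pdf_name).stem
    return cache_dir / f"{stem}-{content_hash(data)}.pdf.md"


def is_cached(pdf_name: str, data: bytes, cache_dir: Path = CACHE_DIR) -> bool:
    # A cache hit means a Markdown file already exists for these exact bytes.
    return cached_markdown_path(pdf_name, data, cache_dir).is_file()
```

Because the path depends on the file's bytes rather than its name alone, editing a PDF yields a different path, which is why an updated document is treated as uncached.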

Limits

Limit	Value	Notes
Maximum file size	100 MB	Larger PDFs are rejected before upload
Preview length	~20,000 chars	Full content available in cached file

PDFs larger than 100 MB cannot be processed. Consider splitting large documents or extracting the relevant pages.
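
To check a file yourself before referencing it, something like the following may help. It is a rough sketch of the size and header checks, not the validation Sphinx actually runs: the function name `validate_pdf`, the error messages, and the interpretation of "100 MB" as 100 * 1024 * 1024 bytes are assumptions.

```python
from pathlib import Path

# Assumption: "100 MB" means 100 * 1024 * 1024 bytes; this page does not
# say whether the limit is binary or decimal.
MAX_PDF_BYTES = 100 * 1024 * 1024
# PDF files begin with the marker %PDF- followed by a version number.
PDF_MAGIC = b"%PDF-"


def validate_pdf(path: Path, max_bytes: int = MAX_PDF_BYTES) -> None:
    """Raise ValueError if the file fails the size or header check."""
    size = path.stat().st_size
    if size > max_bytes:
        raise ValueError(f"File too large: {size} bytes exceeds {max_bytes}")
    with path.open("rb") as f:
        header = f.read(len(PDF_MAGIC))
    if header != PDF_MAGIC:
        raise ValueError("Invalid PDF: missing %PDF- header")
```

The size check runs first so that an oversized file is rejected without reading any of its content.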

Supported PDF Types

Reducto handles a wide variety of PDF documents:

Text-based PDFs

Standard documents, reports, articles, and papers with selectable text.

Scanned Documents

PDFs created from scanned images, with text recovered through OCR.

Mixed Content

Documents combining text, tables, charts, and images.

Multi-page Documents

Long reports, books, and documentation with page tracking.

Error Handling

Error	Cause	Resolution
File too large	PDF exceeds 100 MB	Split the PDF or extract relevant pages
Invalid PDF	File doesn’t have a valid PDF header	Ensure the file is a valid PDF
Parse failed	Reducto couldn’t process the content	Try a different PDF or contact support
Rate limited	Too many requests	Wait a moment and try again

If parsing fails for one PDF in a batch, other PDFs are still processed. Sphinx reports which files succeeded and which failed.
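
The per-file isolation described above follows a common pattern, sketched here under stated assumptions. `parse_batch` and the `parse_one` callable are hypothetical names, with `parse_one` standing in for whatever actually parses a single PDF; this is not Sphinx's internal code.

```python
from typing import Callable, Dict, Iterable, Tuple


def parse_batch(
    paths: Iterable[str],
    parse_one: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Parse each PDF independently and report successes and failures.

    `parse_one` is a hypothetical stand-in for the real parsing call.
    """
    succeeded: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            succeeded[path] = parse_one(path)
        except Exception as exc:
            # One bad file must not stop the rest of the batch.
            failed[path] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed
```

Returning two dictionaries keeps the successful Markdown and the failure reasons separate, so a caller can continue with the files that parsed and report the rest.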

Privacy & Data Handling

This section explains how your PDF data flows when you use this feature.

Data Flow

Upload — When you reference an uncached PDF, the file is uploaded to Reducto’s API servers for parsing
Processing — Reducto processes the PDF and extracts structured content
Return — The parsed Markdown is returned to Sphinx
Local Storage — Parsed content is cached locally at ~/.sphinx/parsed/ on your machine

What Data Leaves Your Machine

Data	Destination	Purpose
PDF file content	Reducto API	Document parsing and text extraction
File metadata	Sphinx servers	Request logging and error tracking

What Stays Local

Parsed Markdown files — Cached in ~/.sphinx/parsed/
Original PDF files — Remain in your workspace
Cache index — Content hashes stored locally

Security & Compliance

Reducto is SOC 2 compliant and maintains additional industry-standard security certifications. We also have a Zero Data Retention (ZDR) agreement in place, so no PDF content or parsed data is stored on Reducto’s servers after processing. Only the minimum data necessary for parsing is sent to Reducto, and all cached and parsed data is stored locally on your machine unless you manually share it.

For sensitive documents, consider extracting and anonymizing relevant sections before using PDF parsing, or use alternative methods to manually include the data in your analysis.
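
If you include text manually, a simple redaction pass can run on it first. The sketch below is illustrative only: the `redact` helper and its two patterns (email addresses and US SSN-shaped numbers) are assumptions, not a Sphinx feature, and real documents usually need a reviewed, domain-specific list of identifiers.

```python
import re

# Assumption: these two patterns are examples only and will miss many
# kinds of sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace email addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)
```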

Best Practices

Keep PDFs focused

Rather than uploading a 500-page manual, extract the specific chapters or sections relevant to your analysis. This improves parsing speed and AI comprehension.

Use descriptive prompts

When referencing a PDF, be specific about what you want to analyze:

Good: @annual-report.pdf Extract the revenue by region table from page 15
Less helpful: @annual-report.pdf Tell me about this

Combine with DataFrames

Ask Sphinx to convert extracted tables into pandas DataFrames for further analysis:

@sales-data.pdf Convert the quarterly sales table to a DataFrame and plot the trend
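
Because tables are converted to Markdown, they can also be loaded by hand from the cached file. The following is a minimal standard-library sketch: `parse_markdown_table` is a hypothetical helper, the sample table is made-up illustration data rather than real parser output, and the code assumes a simple pipe table with no escaped | characters inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Convert a simple Markdown pipe table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    # Drop the outer pipes, then split each line into trimmed cells.
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    # cells[1] is the |---|---| separator row, so data starts at index 2.
    header, body = cells[0], cells[2:]
    return [dict(zip(header, row)) for row in body]


# Made-up sample data for illustration.
sample = """
| Quarter | Sales |
|---------|-------|
| Q1      | 120   |
| Q2      | 135   |
"""
rows = parse_markdown_table(sample)
# If pandas is installed, pandas.DataFrame(rows) turns this into a DataFrame.
```

All values come back as strings, so numeric columns need converting before plotting or aggregation.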

Reference multiple PDFs

You can mention multiple PDFs in a single prompt:

@report-2024.pdf @report-2025.pdf Compare the year-over-year changes in operating expenses
