Overview

Sphinx integrates with Reducto to parse PDF documents, enabling you to analyze reports, research papers, financial documents, and more directly in your notebooks. When you mention a PDF file using @, Sphinx automatically:
  1. Validates the PDF format and size
  2. Sends it to Reducto for intelligent parsing
  3. Converts the content to structured Markdown
  4. Caches the result for future use
  5. Makes the content available to the AI for analysis

How to Use

Reference a PDF with @ Mentions

Type @ in the chat input and select a PDF file from your workspace:
  1. Open Sphinx (Cmd+T / Ctrl+T)
  2. Type @ to open the file picker
  3. Search for and select your PDF file
  4. Add your question or analysis request
Example prompts:
  • @quarterly-report.pdf Summarize the key financial metrics
  • @research-paper.pdf What methodology did the authors use?
  • @invoice.pdf Extract the line items into a DataFrame
PDFs cannot be read directly using the file read tool. Always use the @ mention flow to reference PDFs so they can be properly parsed.

What Gets Extracted

Reducto extracts and structures:
| Content Type | How It’s Handled |
| --- | --- |
| Text | Preserved with formatting, organized by page |
| Tables | Converted to Markdown tables |
| Headers | Maintained as Markdown headings |
| Figures | Summarized with AI-generated descriptions |
| Page Numbers | Tracked for reference |

Caching

Parsed PDFs are cached on your local filesystem to avoid re-parsing the same document:
  • Location: ~/.sphinx/parsed/
  • Filename format: {original-name}-{content-hash}.pdf.md
When you reference a PDF that’s already been parsed:
  1. Sphinx checks the kernel filesystem for a cached version
  2. If found, uses the cached Markdown directly
  3. If not found, sends to Reducto for parsing
The cache is content-based, not filename-based. If you update a PDF, it will be re-parsed automatically because the content hash changes.
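
A content-based cache lookup of this kind can be sketched as follows. This is not Sphinx's implementation: the hash algorithm (SHA-256), the 12-character truncation, and the use of the file stem as `{original-name}` are all assumptions made for illustration, and `cache_path_for` and `load_cached_markdown` are hypothetical names. Only the cache directory and the filename pattern come from this page.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Cache location as documented above.
CACHE_DIR = Path.home() / ".sphinx" / "parsed"


def cache_path_for(pdf_path: Path, cache_dir: Path = CACHE_DIR) -> Path:
    """Build a {original-name}-{content-hash}.pdf.md path for a PDF.

    Assumptions: SHA-256 over the raw bytes, truncated to 12 hex
    characters; the real algorithm and hash length are not documented.
    """
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()[:12]
    return cache_dir / f"{pdf_path.stem}-{digest}.pdf.md"


def load_cached_markdown(
    pdf_path: Path, cache_dir: Path = CACHE_DIR
) -> Optional[str]:
    """Return cached Markdown if present, else None (a cache miss)."""
    candidate = cache_path_for(pdf_path, cache_dir)
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8")
    return None
```

Because the hash is derived from the file's bytes, editing the PDF yields a different cache path, so the old entry is simply never matched again. Renaming a file without changing its content would also change the path under this sketch, since the name is part of the pattern.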

Limits

| Limit | Value | Notes |
| --- | --- | --- |
| Maximum file size | 100 MB | Larger PDFs are rejected before upload |
| Preview length | ~20,000 chars | Full content available in cached file |
PDFs larger than 100 MB cannot be processed. Consider splitting large documents or extracting only the relevant pages.
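
If you want to check a file yourself before mentioning it, a local pre-check might look like the sketch below. It is an approximation under stated assumptions: `check_pdf` is a hypothetical helper, "100 MB" is interpreted as 100 × 1024 × 1024 bytes, and a valid header is taken to mean the file starts with the bytes `%PDF-`. Sphinx's own validation may differ in both respects.

```python
from pathlib import Path
from typing import Optional

# Assumption: the 100 MB limit is read as 100 MiB.
MAX_PDF_BYTES = 100 * 1024 * 1024


def check_pdf(path: Path, max_bytes: int = MAX_PDF_BYTES) -> Optional[str]:
    """Return None if the file passes, else a short error label."""
    if path.stat().st_size > max_bytes:
        return "File too large"
    with path.open("rb") as fh:
        # PDF files begin with the marker "%PDF-" followed by a version.
        if fh.read(5) != b"%PDF-":
            return "Invalid PDF"
    return None
```

The size check runs first so that an oversized file is rejected without being opened.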

Supported PDF Types

Reducto handles a wide variety of PDF documents:

Text-based PDFs

Standard documents, reports, articles, and papers with selectable text.

Scanned Documents

PDFs created from scanned images, read using OCR.

Mixed Content

Documents combining text, tables, charts, and images.

Multi-page Documents

Long reports, books, and documentation with page tracking.

Error Handling

| Error | Cause | Resolution |
| --- | --- | --- |
| File too large | PDF exceeds 100 MB | Split the PDF or extract relevant pages |
| Invalid PDF | File doesn’t have valid PDF header | Ensure the file is a valid PDF |
| Parse failed | Reducto couldn’t process the content | Try a different PDF or contact support |
| Rate limited | Too many requests | Wait a moment and try again |
If parsing fails for one PDF in a batch, other PDFs are still processed. Sphinx reports which files succeeded and which failed.
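
The batch behavior described above, where one failure does not stop the rest, follows a common pattern that can be sketched generically. Nothing here is Sphinx's actual code: `parse_batch` is a hypothetical function, and `parse_one` stands in for whatever performs the parsing of a single file.

```python
from typing import Callable, Dict, Iterable, Tuple


def parse_batch(
    paths: Iterable[str],
    parse_one: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Parse each file independently, collecting results and errors.

    Returns (succeeded, failed): parsed Markdown keyed by path, and
    error messages keyed by path.
    """
    succeeded: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            succeeded[path] = parse_one(path)
        except Exception as exc:  # isolate the failure to this file
            failed[path] = str(exc)
    return succeeded, failed
```

Returning both dictionaries lets the caller report which files succeeded and which failed, rather than surfacing only the first error.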

Privacy & Data Handling

How your PDF data flows when you use this feature:

Data Flow

  1. Upload — When you reference an uncached PDF, the file is uploaded to Reducto’s API servers for parsing
  2. Processing — Reducto processes the PDF and extracts structured content
  3. Return — The parsed Markdown is returned to Sphinx
  4. Local Storage — Parsed content is cached locally at ~/.sphinx/parsed/ on your machine

What Data Leaves Your Machine

| Data | Destination | Purpose |
| --- | --- | --- |
| PDF file content | Reducto API | Document parsing and text extraction |
| File metadata | Sphinx servers | Request logging and error tracking |

What Stays Local

  • Parsed Markdown files — Cached in ~/.sphinx/parsed/
  • Original PDF files — Remain in your workspace
  • Cache index — Content hashes stored locally

Security & Compliance

Reducto is SOC 2 compliant and maintains additional industry-standard security certifications to ensure data protection. We also have a Zero Data Retention (ZDR) agreement in place: no PDF content or parsed data is stored on Reducto’s servers after processing. Only the minimum data necessary for parsing is sent to Reducto, and all cached and parsed data is stored locally on your machine unless you manually share it.
For sensitive documents, consider extracting and anonymizing relevant sections before using PDF parsing, or use alternative methods to manually include the data in your analysis.

Best Practices

Rather than uploading a 500-page manual, extract the specific chapters or sections relevant to your analysis. This improves parsing speed and AI comprehension.
When referencing a PDF, be specific about what you want to analyze:
  • Good: @annual-report.pdf Extract the revenue by region table from page 15
  • Less helpful: @annual-report.pdf Tell me about this
Ask Sphinx to convert extracted tables into pandas DataFrames for further analysis:
  • @sales-data.pdf Convert the quarterly sales table to a DataFrame and plot the trend
You can mention multiple PDFs in a single prompt:
  • @report-2024.pdf @report-2025.pdf Compare the year-over-year changes in operating expenses