Overview
Sphinx integrates with Reducto to parse PDF documents, enabling you to analyze reports, research papers, financial documents, and more directly in your notebooks. When you mention a PDF file using @, Sphinx automatically:
- Validates the PDF format and size
- Sends it to Reducto for intelligent parsing
- Converts the content to structured Markdown
- Caches the result for future use
- Makes the content available to the AI for analysis
How to Use
Reference a PDF with @ Mentions
Type @ in the chat input and select a PDF file from your workspace:
- Open Sphinx (Cmd+T/Ctrl+T)
- Type @ to open the file picker
- Search for and select your PDF file
- Add your question or analysis request
Example prompts:
@quarterly-report.pdf Summarize the key financial metrics
@research-paper.pdf What methodology did the authors use?
@invoice.pdf Extract the line items into a DataFrame
PDFs cannot be read directly using the file read tool. Always use the @ mention flow to reference PDFs so they can be properly parsed.
What Gets Extracted
Reducto extracts and structures:
| Content Type | How It’s Handled |
|---|---|
| Text | Preserved with formatting, organized by page |
| Tables | Converted to Markdown tables |
| Headers | Maintained as Markdown headings |
| Figures | Summarized with AI-generated descriptions |
| Page Numbers | Tracked for reference |
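Because tables arrive as Markdown pipe tables, they are straightforward to post-process in a notebook. The snippet below is a minimal sketch of that idea, not part of Sphinx or Reducto: `parse_markdown_table` is a hypothetical helper, the sample data is made up, and it assumes the text contains a single simple table with no escaped `|` characters inside cells.

```python
def parse_markdown_table(markdown):
    """Parse one simple Markdown pipe table into a list of row dicts.

    Hypothetical helper for illustration; assumes a single table and
    no escaped pipe characters inside cells.
    """
    def split_row(line):
        line = line.strip()
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        return [cell.strip() for cell in line.split("|")]

    # Keep only the lines that look like table rows.
    table_lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(table_lines) < 2:
        return []

    header = split_row(table_lines[0])
    # table_lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, split_row(ln))) for ln in table_lines[2:]]


# Made-up sample in the same shape as a parsed table.
sample = """
| Region | Revenue |
|---|---|
| EMEA | 120 |
| APAC | 95 |
"""
rows = parse_markdown_table(sample)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)`. All values are strings at this point, so numeric columns would still need converting, for example with `pandas.to_numeric`.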
Caching
Parsed PDFs are cached on your local filesystem to avoid re-parsing the same document:
- Location: ~/.sphinx/parsed/
- Filename format: {original-name}-{content-hash}.pdf.md
When you mention a PDF:
- Sphinx checks the kernel filesystem for a cached version
- If found, it uses the cached Markdown directly
- If not found, it sends the PDF to Reducto for parsing
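The lookup described above can be sketched as follows. This is an illustration of content-hash caching in general, not Sphinx's actual implementation: the hash algorithm (SHA-256, truncated to 16 hex characters), the use of the file stem as `{original-name}`, and the helper names `cached_markdown_path` and `load_or_parse` are all assumptions. The `parse` argument is a stand-in for the Reducto call.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".sphinx" / "parsed"  # cache location given in this page


def cached_markdown_path(pdf_path, cache_dir=CACHE_DIR, hash_len=16):
    """Build a {original-name}-{content-hash}.pdf.md path for a PDF.

    Assumption: the content hash is a truncated SHA-256 of the file bytes.
    The real algorithm and length are not documented here.
    """
    pdf_path = Path(pdf_path)
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()[:hash_len]
    return Path(cache_dir) / f"{pdf_path.stem}-{digest}.pdf.md"


def load_or_parse(pdf_path, parse, cache_dir=CACHE_DIR):
    """Return cached Markdown if present, otherwise parse and cache it.

    `parse` is a hypothetical callable taking a Path and returning Markdown.
    """
    target = cached_markdown_path(pdf_path, cache_dir)
    if target.exists():
        # Cache hit: reuse the stored Markdown without re-parsing.
        return target.read_text(encoding="utf-8")
    # Cache miss: parse once, then store the result for next time.
    markdown = parse(Path(pdf_path))
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return markdown
```

Keying the filename on a hash of the content means an edited PDF with the same name gets a new cache entry, while an unchanged file is never parsed twice.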
Limits
| Limit | Value | Notes |
|---|---|---|
| Maximum file size | 100 MB | Larger PDFs are rejected before upload |
| Preview length | ~20,000 chars | Full content available in cached file |
Supported PDF Types
Reducto handles a wide variety of PDF documents:
Text-based PDFs
Standard documents, reports, articles, and papers with selectable text.
Scanned Documents
PDFs created from scanned images with OCR support.
Mixed Content
Documents combining text, tables, charts, and images.
Multi-page Documents
Long reports, books, and documentation with page tracking.
Error Handling
| Error | Cause | Resolution |
|---|---|---|
| File too large | PDF exceeds 100 MB | Split the PDF or extract relevant pages |
| Invalid PDF | File doesn’t have valid PDF header | Ensure the file is a valid PDF |
| Parse failed | Reducto couldn’t process the content | Try a different PDF or contact support |
| Rate limited | Too many requests | Wait a moment and try again |
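The first two errors in the table can be anticipated before mentioning a file. The following is a rough pre-flight sketch, not the check Sphinx itself runs: `preflight_pdf` is a hypothetical helper, "100 MB" is assumed to mean 100 × 1024 × 1024 bytes, and the header test assumes the file starts with the standard `%PDF-` marker at byte 0.

```python
from pathlib import Path

MAX_PDF_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" interpreted as 100 MiB


def preflight_pdf(path, max_bytes=MAX_PDF_BYTES):
    """Return None if the file looks acceptable, else an error label.

    Hypothetical helper; the labels mirror the first two rows of the
    error table above.
    """
    path = Path(path)
    if path.stat().st_size > max_bytes:
        return "File too large"
    with path.open("rb") as handle:
        header = handle.read(5)
    # PDF files begin with the marker "%PDF-" followed by a version number.
    if header != b"%PDF-":
        return "Invalid PDF"
    return None
```

A `None` result only means the size and header look right. Parsing can still fail later for other reasons, such as a corrupted body or rate limiting.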
Privacy & Data Handling
This section describes how your PDF data flows when you use this feature.
Data Flow
- Upload: When you reference an uncached PDF, the file is uploaded to Reducto’s API servers for parsing
- Processing: Reducto processes the PDF and extracts structured content
- Return: The parsed Markdown is returned to Sphinx
- Local Storage: Parsed content is cached locally at ~/.sphinx/parsed/ on your machine
What Data Leaves Your Machine
| Data | Destination | Purpose |
|---|---|---|
| PDF file content | Reducto API | Document parsing and text extraction |
| File metadata | Sphinx servers | Request logging and error tracking |
What Stays Local
- Parsed Markdown files: Cached in ~/.sphinx/parsed/
- Original PDF files: Remain in your workspace
- Cache index: Content hashes stored locally
Security & Compliance
Reducto is certified SOC 2 compliant and maintains additional industry-standard security certifications to ensure data protection. We also have a Zero Data Retention (ZDR) agreement in place: no PDF content or parsed data is stored on Reducto’s servers after processing. Only the minimum data necessary for parsing is sent to Reducto, and all cached and parsed data is stored locally on your machine unless you manually share it.
Best Practices
Keep PDFs focused
Rather than uploading a 500-page manual, extract the specific chapters or sections relevant to your analysis. This improves parsing speed and AI comprehension.
Use descriptive prompts
When referencing a PDF, be specific about what you want to analyze:
Good: @annual-report.pdf Extract the revenue by region table from page 15
Less helpful: @annual-report.pdf Tell me about this
Combine with DataFrames
Ask Sphinx to convert extracted tables into pandas DataFrames for further analysis:
@sales-data.pdf Convert the quarterly sales table to a DataFrame and plot the trend
Reference multiple PDFs
You can mention multiple PDFs in a single prompt:
@report-2024.pdf @report-2025.pdf Compare the year-over-year changes in operating expenses