# PDF Parsing

> Extract and analyze data from PDF documents using the Reducto integration

## Overview

Sphinx integrates with [Reducto](https://reducto.ai) to parse PDF documents, enabling you to analyze reports, research papers, financial documents, and more directly in your notebooks.

When you mention a PDF file using `@`, Sphinx automatically:

1. Validates the PDF format and size
2. Sends it to Reducto for intelligent parsing
3. Converts the content to structured Markdown
4. Caches the result for future use
5. Makes the content available to the AI for analysis

***

## How to Use

### Reference a PDF with @ Mentions

Type `@` in the chat input and select a PDF file from your workspace:

1. Open Sphinx (`Cmd+T` / `Ctrl+T`)
2. Type `@` to open the file picker
3. Search for and select your PDF file
4. Add your question or analysis request

**Example prompts:**

* `@quarterly-report.pdf Summarize the key financial metrics`
* `@research-paper.pdf What methodology did the authors use?`
* `@invoice.pdf Extract the line items into a DataFrame`

<Note>
  PDFs cannot be read directly using the file read tool. Always use the `@` mention flow to reference PDFs so they can be properly parsed.
</Note>

***

## What Gets Extracted

Reducto extracts and structures:

| Content Type     | How It's Handled                             |
| ---------------- | -------------------------------------------- |
| **Text**         | Preserved with formatting, organized by page |
| **Tables**       | Converted to Markdown tables                 |
| **Headers**      | Maintained as Markdown headings              |
| **Figures**      | Summarized with AI-generated descriptions    |
| **Page Numbers** | Tracked for reference                        |
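
Because tables arrive as Markdown, they can be turned into structured rows with a few lines of Python. The following is a minimal sketch, not Sphinx's own implementation: the function names and the sample table are hypothetical, and it assumes simple pipe-delimited tables without escaped pipes inside cells.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line: str) -> bool:
    """True for the header separator row, e.g. | --- | :-: |."""
    cells = split_row(line)
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the first Markdown table in the text as a list of row dicts.

    Simplification: escaped pipes inside cells are not handled.
    """
    table_lines = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            table_lines.append(line)
        elif table_lines:
            break  # the first table has ended
    if len(table_lines) < 2:
        return []
    header = split_row(table_lines[0])
    data_lines = table_lines[1:]
    if is_separator(data_lines[0]):
        data_lines = data_lines[1:]  # drop the | --- | --- | row
    return [dict(zip(header, split_row(ln))) for ln in data_lines]


# Hypothetical parsed output, made up for illustration only.
sample = """
## Revenue by region

| Region | Q1 | Q2 |
| ------ | -- | -- |
| EMEA   | 10 | 12 |
| APAC   | 7  | 9  |
"""

rows = parse_markdown_table(sample)
```

Passing `rows` to `pandas.DataFrame(rows)` yields a DataFrame whose columns are all strings; `pandas.to_numeric` can convert the numeric ones afterwards.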

***

## Caching

Parsed PDFs are cached on your local filesystem to avoid re-parsing the same document:

* **Location:** `~/.sphinx/parsed/`
* **Filename format:** `{original-name}-{content-hash}.pdf.md`

When you reference a PDF that's already been parsed:

1. Sphinx checks the kernel filesystem for a cached version
2. If found, uses the cached Markdown directly
3. If not found, sends the PDF to Reducto for parsing

<Tip>
  The cache is content-based, not filename-based. If you update a PDF, it will be re-parsed automatically because the content hash changes.
</Tip>
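
To illustrate why content-based caching behaves this way, here is a rough sketch of how a cache filename could be derived. It is not Sphinx's actual code: the hash algorithm (SHA-256), the 12-character truncation, and dropping the `.pdf` extension from the original name are all assumptions, and the function names are hypothetical. Only the `~/.sphinx/parsed/` location and the `{original-name}-{content-hash}.pdf.md` pattern come from this page.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".sphinx" / "parsed"


def content_hash(data: bytes, length: int = 12) -> str:
    """Hash the file bytes. SHA-256 and the length are assumptions."""
    return hashlib.sha256(data).hexdigest()[:length]


def cache_path_for(pdf_name: str, data: bytes, cache_dir: Path = CACHE_DIR) -> Path:
    """Build {original-name}-{content-hash}.pdf.md for the given bytes."""
    stem = Path(pdf_name).stem  # assumption: "report.pdf" -> "report"
    return cache_dir / f"{stem}-{content_hash(data)}.pdf.md"


# Placeholder bytes standing in for real PDF content.
original = cache_path_for("report.pdf", b"%PDF-1.7 first version")
edited = cache_path_for("report.pdf", b"%PDF-1.7 edited version")

# Same filename, different bytes: the hash differs, so the edited
# file maps to a new cache entry and would be parsed again.
needs_reparse = original.name != edited.name
```

Under this scheme an edited PDF never collides with its earlier cache entry, which matches the re-parse behavior described in the tip.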

***

## Limits

| Limit                 | Value          | Notes                                  |
| --------------------- | -------------- | -------------------------------------- |
| **Maximum file size** | 100 MB         | Larger PDFs are rejected before upload |
| **Preview length**    | \~20,000 chars | Full content available in cached file  |
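
When the preview is not enough, the full parsed Markdown can be read from the cache in a notebook cell. The sketch below assumes the cache filename pattern `{original-name}-{content-hash}.pdf.md` described on this page; the helper name `find_cached`, the sample filename, and the exact cutoff of 20,000 characters are illustrative assumptions. A temporary directory stands in for `~/.sphinx/parsed/` so the example is self-contained.

```python
import glob
import tempfile
from pathlib import Path

PREVIEW_CHARS = 20_000  # approximate preview length; the exact cutoff is an assumption


def find_cached(original_name: str, cache_dir: Path) -> list[Path]:
    """List cached Markdown files whose name starts with the PDF's stem.

    Note: "report-*" also matches caches for PDFs such as "report-2024.pdf".
    """
    stem = Path(original_name).stem
    pattern = f"{glob.escape(stem)}-*.pdf.md"
    return sorted(cache_dir.glob(pattern))


# A temporary directory stands in for ~/.sphinx/parsed/ here.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    fake_cache = cache_dir / "quarterly-report-abc123.pdf.md"  # hypothetical entry
    fake_cache.write_text("x" * 25_000, encoding="utf-8")

    matches = find_cached("quarterly-report.pdf", cache_dir)
    full_text = matches[0].read_text(encoding="utf-8")
    preview = full_text[:PREVIEW_CHARS]  # what a truncated preview would hold
```

On a real machine, `cache_dir` would be `Path.home() / ".sphinx" / "parsed"`.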

<Warning>
  PDFs larger than 100 MB cannot be processed. Consider splitting large documents or extracting the relevant pages.
</Warning>
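
A quick local check can flag oversized files before you reference them. This is a sketch under assumptions: it treats 100 MB as 100 × 1024 × 1024 bytes, which may not match how Sphinx measures the limit, and the function names are hypothetical.

```python
import os
import tempfile

# Assumption: 1 MB = 1024 * 1024 bytes. Sphinx may use a different definition.
MAX_PDF_BYTES = 100 * 1024 * 1024


def exceeds_limit(size_in_bytes: int, limit: int = MAX_PDF_BYTES) -> bool:
    """True when a file of this size would be rejected."""
    return size_in_bytes > limit


def check_pdf_size(path: str) -> tuple[int, bool]:
    """Return (size in bytes, whether the file is over the limit)."""
    size = os.path.getsize(path)
    return size, exceeds_limit(size)


# Demo with a tiny placeholder file instead of a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
size, too_large = check_pdf_size(tmp.name)
os.remove(tmp.name)
```

If `too_large` is true, split the document or extract the relevant pages before referencing it.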

***

## Supported PDF Types

Reducto handles a wide variety of PDF documents:

<CardGroup cols={2}>
  <Card title="Text-based PDFs" icon="file-lines">
    Standard documents, reports, articles, and papers with selectable text.
  </Card>

  <Card title="Scanned Documents" icon="scanner">
    PDFs created from scanned images with OCR support.
  </Card>

  <Card title="Mixed Content" icon="table">
    Documents combining text, tables, charts, and images.
  </Card>

  <Card title="Multi-page Documents" icon="copy">
    Long reports, books, and documentation with page tracking.
  </Card>
</CardGroup>

***

## Error Handling

| Error              | Cause                                | Resolution                              |
| ------------------ | ------------------------------------ | --------------------------------------- |
| **File too large** | PDF exceeds 100 MB                   | Split the PDF or extract relevant pages |
| **Invalid PDF**    | File doesn't have a valid PDF header | Ensure the file is a valid PDF          |
| **Parse failed**   | Reducto couldn't process the content | Try a different PDF or contact support  |
| **Rate limited**   | Too many requests                    | Wait a moment and try again             |

If parsing fails for one PDF in a batch, other PDFs are still processed. Sphinx reports which files succeeded and which failed.
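
The same per-file pattern can be applied locally before referencing a batch. The sketch below checks each file for the `%PDF-` signature that PDF files begin with and records a result per file instead of stopping at the first failure. It is illustrative only: the function names are hypothetical, and Sphinx's own validation may be more lenient or stricter than a check of the first five bytes.

```python
import tempfile
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def has_pdf_header(path: Path) -> bool:
    """True when the file starts with the PDF signature."""
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def precheck(paths: list[Path]) -> tuple[list[str], dict[str, str]]:
    """Check every file and keep going when one of them fails."""
    ok: list[str] = []
    failed: dict[str, str] = {}
    for path in paths:
        try:
            if has_pdf_header(path):
                ok.append(str(path))
            else:
                failed[str(path)] = "Invalid PDF"
        except OSError as exc:  # missing or unreadable file
            failed[str(path)] = f"Unreadable: {exc}"
    return ok, failed


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    bad = Path(tmp) / "bad.pdf"
    missing = Path(tmp) / "missing.pdf"  # never created
    good.write_bytes(b"%PDF-1.7\n")
    bad.write_bytes(b"not a pdf")
    ok, failed = precheck([good, bad, missing])
```

After the run, `ok` lists the files that passed and `failed` maps each rejected file to a reason, mirroring the succeeded and failed report described above.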

***

## Privacy & Data Handling

Here is how your PDF data flows when you use this feature:

### Data Flow

1. **Upload** — When you reference an uncached PDF, the file is uploaded to Reducto's API servers for parsing
2. **Processing** — Reducto processes the PDF and extracts structured content
3. **Return** — The parsed Markdown is returned to Sphinx
4. **Local Storage** — Parsed content is cached locally at `~/.sphinx/parsed/` on your machine

### What Data Leaves Your Machine

| Data             | Destination    | Purpose                              |
| ---------------- | -------------- | ------------------------------------ |
| PDF file content | Reducto API    | Document parsing and text extraction |
| File metadata    | Sphinx servers | Request logging and error tracking   |

### What Stays Local

* **Parsed Markdown files** — Cached in `~/.sphinx/parsed/`
* **Original PDF files** — Remain in your workspace
* **Cache index** — Content hashes stored locally

### Security & Compliance

Reducto is SOC 2 compliant and maintains additional industry-standard security certifications. We also have a Zero Data Retention (ZDR) agreement in place: no PDF content or parsed data is stored on Reducto's servers after processing. Only the minimum data needed for parsing is sent to Reducto, and all cached and parsed data is stored locally on your machine unless you manually share it.

<Tip>
  For sensitive documents, consider extracting and anonymizing relevant sections before using PDF parsing, or use alternative methods to manually include the data in your analysis.
</Tip>

***

## Best Practices

<AccordionGroup>
  <Accordion title="Keep PDFs focused">
    Rather than uploading a 500-page manual, extract the specific chapters or sections relevant to your analysis. This improves parsing speed and AI comprehension.
  </Accordion>

  <Accordion title="Use descriptive prompts">
    When referencing a PDF, be specific about what you want to analyze:

    **Good:** `@annual-report.pdf Extract the revenue by region table from page 15`

    **Less helpful:** `@annual-report.pdf Tell me about this`
  </Accordion>

  <Accordion title="Combine with DataFrames">
    Ask Sphinx to convert extracted tables into pandas DataFrames for further analysis:

    `@sales-data.pdf Convert the quarterly sales table to a DataFrame and plot the trend`
  </Accordion>

  <Accordion title="Reference multiple PDFs">
    You can mention multiple PDFs in a single prompt:

    `@report-2024.pdf @report-2025.pdf Compare the year-over-year changes in operating expenses`
  </Accordion>
</AccordionGroup>
