[EPIC] PDF Text Extraction Engine #1

@copyleftdev

Description

PDFVEC-002

Goals

  • 10-14x faster than pdf-extract baseline
  • Accurate text extraction from academic PDFs
  • Graceful handling of malformed PDFs
  • Memory-efficient processing of large files

Implement the core text extraction engine using the hybrid approach validated in research: pdf-rs for parsing combined with custom text extraction logic. Target throughput: 26-34 MiB/s (10-14x faster than pdf-extract).

Acceptance Criteria

AC-1

  • Given A valid PDF file
  • When Text is extracted
  • Then Output matches reference extraction with >95% accuracy
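The issue does not say how "accuracy" is measured. One plausible reading, sketched here as an assumption, is character-level similarity: one minus the Levenshtein edit distance divided by the length of the longer string. The function names `levenshtein`, `accuracy`, and `meets_ac1` are hypothetical and do not come from the codebase.

```rust
/// Levenshtein edit distance over Unicode scalar values,
/// using a two-row dynamic-programming table.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// ASSUMPTION: accuracy = 1 - distance / max(len). The issue does not
/// define the metric; this is only one reasonable choice.
pub fn accuracy(reference: &str, extracted: &str) -> f64 {
    let longest = reference.chars().count().max(extracted.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are trivially identical
    }
    1.0 - levenshtein(reference, extracted) as f64 / longest as f64
}

/// AC-1 threshold from the issue: strictly greater than 95%.
pub fn meets_ac1(reference: &str, extracted: &str) -> bool {
    accuracy(reference, extracted) > 0.95
}
```

This comparison is quadratic in text length, so for full papers it may be more practical to score page by page or to switch to a token-level metric.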

AC-2

  • Given The test corpus (50 arXiv PDFs)
  • When Benchmarked against pdf-extract
  • Then Throughput is at least 10x higher
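As a rough sketch of how this criterion could be checked, the snippet below converts a byte count and elapsed time into MiB/s and compares two throughputs. The helper names are hypothetical, and it assumes both extractors are timed over the same corpus. If the endpoints of the stated ranges are paired up, 26-34 MiB/s at 10-14x implies a pdf-extract baseline of roughly 2.4-2.6 MiB/s (26 / 10 = 2.6, 34 / 14 ≈ 2.43).

```rust
use std::time::Duration;

/// Bytes per mebibyte.
const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` processed in `elapsed` wall-clock time.
pub fn throughput_mib_s(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / MIB / elapsed.as_secs_f64()
}

/// Ratio of candidate throughput to baseline throughput.
pub fn speedup(candidate_mib_s: f64, baseline_mib_s: f64) -> f64 {
    candidate_mib_s / baseline_mib_s
}

/// AC-2 threshold from the issue: at least 10x the baseline.
pub fn meets_ac2(candidate_mib_s: f64, baseline_mib_s: f64) -> bool {
    speedup(candidate_mib_s, baseline_mib_s) >= 10.0
}
```

In a real benchmark the elapsed time would come from something like `std::time::Instant` around the extraction call, ideally averaged over several runs of the 50-PDF corpus.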

AC-3

  • Given A malformed or encrypted PDF
  • When Extraction is attempted
  • Then Returns a typed error without panicking
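A minimal sketch of what "a typed error without panicking" might look like, using only the standard library. The `ExtractError` enum, its variants, and the `preflight` function are all hypothetical illustrations; the issue does not name the real error type. The checks are deliberately naive: a strict `%PDF-` prefix test, and byte scans for `/Encrypt` and `%%EOF`, which can produce false positives or negatives on real files.

```rust
use std::fmt;

/// Hypothetical error type; variant names are illustrative only.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    /// Input does not start with a `%PDF-` header.
    MissingHeader,
    /// An `/Encrypt` key was found by a naive byte scan.
    Encrypted,
    /// No `%%EOF` marker was found anywhere in the input.
    Truncated,
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::MissingHeader => write!(f, "missing %PDF- header"),
            ExtractError::Encrypted => write!(f, "document is encrypted"),
            ExtractError::Truncated => write!(f, "missing %%EOF marker"),
        }
    }
}

impl std::error::Error for ExtractError {}

/// Returns true if `needle` occurs anywhere in `haystack`.
/// `needle` must be non-empty (`windows(0)` would panic).
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Cheap pre-flight check that reports problems as `Err` values
/// instead of panicking. Simplified: not a real PDF validator.
pub fn preflight(bytes: &[u8]) -> Result<(), ExtractError> {
    if !bytes.starts_with(b"%PDF-") {
        return Err(ExtractError::MissingHeader);
    }
    if contains(bytes, b"/Encrypt") {
        return Err(ExtractError::Encrypted);
    }
    if !contains(bytes, b"%%EOF") {
        return Err(ExtractError::Truncated);
    }
    Ok(())
}
```

Callers can then `match` on the variant to decide whether to skip the file, log it, or surface the problem to the user, and the `?` operator propagates the error without unwinding.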

Technical Context

Crates: pdf, rayon, memmap2

Files:

  • src/extractor.rs
  • src/extractor/page.rs
  • src/extractor/text.rs
  • src/extractor/normalize.rs

Performance Constraints

  • Throughput Target: 26-34 MiB/s
  • Memory Budget: 100 MB

Out of Scope

  • OCR for scanned PDFs
  • Image extraction
  • Table structure recognition

Source: epics/01-extraction.json
Content Hash: 2522b4af7bee1744

Child Issues: PDFVEC-020, PDFVEC-021, PDFVEC-022, PDFVEC-023

Labels

  • component:extraction (PDF parsing and text extraction)
  • epic (Large feature containing multiple stories)
  • mvp (Required for MVP release)
  • performance (Performance-related work)
  • priority:critical (Must be done immediately)
