Open
Labels
- component:extraction (PDF parsing and text extraction)
- epic (Large feature containing multiple stories)
- mvp (Required for MVP release)
- performance (Performance-related work)
- priority:critical (Must be done immediately)
Description
PDFVEC-002
Goals
- 10-14x faster than pdf-extract baseline
- Accurate text extraction from academic PDFs
- Graceful handling of malformed PDFs
- Memory-efficient processing of large files
Implement the core text extraction engine using the hybrid approach validated in research: pdf-rs for parsing combined with custom text extraction logic. Target throughput: 26-34 MiB/s (10-14x faster than pdf-extract).
Acceptance Criteria
AC-1
- Given A valid PDF file
- When Text is extracted
- Then Output matches reference extraction with >95% accuracy
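AC-1 does not say how accuracy is measured. One possible reading, sketched below, is character-level similarity: 1 minus the Levenshtein edit distance divided by the length of the longer string. The metric and the names `levenshtein`, `accuracy`, and `meets_ac1` are all assumptions made for illustration, not part of this issue.

```rust
/// Levenshtein edit distance over Unicode scalar values,
/// computed with a two-row dynamic-programming table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)      // deletion
                .min(curr[j] + 1);         // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Assumed accuracy metric: 1.0 means identical, 0.0 means nothing in common.
fn accuracy(reference: &str, extracted: &str) -> f64 {
    let longest = reference.chars().count().max(extracted.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are treated as a perfect match
    }
    1.0 - levenshtein(reference, extracted) as f64 / longest as f64
}

/// AC-1 threshold as written in the issue: strictly greater than 95%.
fn meets_ac1(reference: &str, extracted: &str) -> bool {
    accuracy(reference, extracted) > 0.95
}
```

This runs in O(n·m) time, so for full-length papers it would likely need to be applied per page or replaced with a token-level comparison.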
AC-2
- Given The test corpus (50 arXiv PDFs)
- When Benchmarked against pdf-extract
- Then Throughput is at least 10x higher
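A minimal sketch of how the AC-2 comparison could be computed, using only the standard library. The helper names (`throughput_mib_s`, `time_extraction`, `meets_ac2`) are hypothetical, and the extractor is passed in as a closure because this issue does not define the engine's API.

```rust
use std::time::{Duration, Instant};

const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` processed in `elapsed`.
fn throughput_mib_s(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / MIB / elapsed.as_secs_f64()
}

/// Runs `extract` once over every document in `corpus` and returns
/// the aggregate throughput in MiB/s (total bytes / total wall time).
fn time_extraction<F: FnMut(&[u8])>(corpus: &[Vec<u8>], mut extract: F) -> f64 {
    let total: u64 = corpus.iter().map(|doc| doc.len() as u64).sum();
    let start = Instant::now();
    for doc in corpus {
        extract(doc);
    }
    throughput_mib_s(total, start.elapsed())
}

/// AC-2: candidate throughput is at least 10x the pdf-extract baseline.
fn meets_ac2(candidate_mib_s: f64, baseline_mib_s: f64) -> bool {
    candidate_mib_s >= 10.0 * baseline_mib_s
}
```

Both extractors would be timed over the same 50-PDF corpus, with `meets_ac2` applied to the two results. A single pass is noisy, so a real benchmark would likely repeat runs and discard warm-up iterations.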
AC-3
- Given A malformed or encrypted PDF
- When Extraction is attempted
- Then Returns a typed error without panicking
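One way AC-3 could be approached is sketched below: a typed error enum plus a guard that turns a panic inside the parser into an error value. `ExtractError`, `extract_guarded`, and the byte-scan checks are hypothetical stand-ins; in particular, searching for `/Encrypt` is a deliberately naive placeholder for inspecting the trailer dictionary, and the real parser is represented by a closure.

```rust
use std::fmt;
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type; the variants are assumptions, not the project's API.
#[derive(Debug, PartialEq)]
enum ExtractError {
    Malformed(String),
    Encrypted,
    ParserPanicked,
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::Malformed(why) => write!(f, "malformed PDF: {why}"),
            ExtractError::Encrypted => write!(f, "PDF is encrypted"),
            ExtractError::ParserPanicked => write!(f, "parser panicked"),
        }
    }
}

impl std::error::Error for ExtractError {}

/// Naive substring search over raw bytes (`needle` must be non-empty).
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|window| window == needle)
}

/// Rejects obviously bad input up front, then runs `parse` inside
/// `catch_unwind` so a panic in the parser surfaces as a typed error.
fn extract_guarded<F>(bytes: &[u8], parse: F) -> Result<String, ExtractError>
where
    F: FnOnce(&[u8]) -> Result<String, ExtractError>,
{
    if !bytes.starts_with(b"%PDF-") {
        return Err(ExtractError::Malformed("missing %PDF- header".to_string()));
    }
    // Placeholder check: a real implementation would read the trailer dictionary.
    if contains(bytes, b"/Encrypt") {
        return Err(ExtractError::Encrypted);
    }
    panic::catch_unwind(AssertUnwindSafe(|| parse(bytes)))
        .unwrap_or(Err(ExtractError::ParserPanicked))
}
```

`catch_unwind` only helps when the crate is built with the default `panic = "unwind"` strategy; under `panic = "abort"` the process would still terminate, so the parser itself would have to be panic-free.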
Technical Context
Crates: pdf, rayon, memmap2
Files:
- src/extractor.rs
- src/extractor/page.rs
- src/extractor/text.rs
- src/extractor/normalize.rs
Performance Constraints
- Throughput Target: 26-34 MiB/s
- Memory Budget: 100 MB
Out of Scope
- OCR for scanned PDFs
- Image extraction
- Table structure recognition
Source: epics/01-extraction.json
Content Hash: 2522b4af7bee1744
Child Issues: PDFVEC-020, PDFVEC-021, PDFVEC-022, PDFVEC-023