Description
pdfvec::extract() returns ~6,380 characters of garbled/unreadable text when processing PDFs generated by Typst (which uses pdf-writer internally). The same PDF is correctly extracted by pdf-extract and pdftotext (poppler-utils).
Reproduction
- Install Typst: cargo install typst-cli
- Create a minimal Typst document:
#set page(margin: 1in)
#set text(font: "New Computer Modern", size: 12pt)
= Test Document
The defendant filed a blockchain forensic analysis report and a cryptographic evidence motion.
- Compile: typst compile test.typ test.pdf
- Extract text:
fn main() {
    // Read the PDF produced by `typst compile` in the previous step.
    let bytes = std::fs::read("test.pdf").unwrap();
    let text = pdfvec::extract(&bytes).unwrap();
    println!("{}", text); // garbled output, no readable words
}
Root Cause
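Typst (via pdf-writer) generates PDFs that use Type0 composite fonts with CIDFont descendants and the Identity-H encoding CMap, so the text in the content stream consists of 2-byte codes that are effectively glyph IDs rather than Unicode. The font's ToUnicode CMap maps those codes back to Unicode codepoints. It appears that pdfvec (via the pdf crate v0.9) either does not read the ToUnicode CMap or does not apply it when decoding text shown with these fonts.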
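To illustrate the decoding step that seems to be missing, here is a minimal, standard-library-only sketch of applying a ToUnicode CMap to an Identity-H string. It is not code from pdfvec or the pdf crate: the function names parse_bfchar and decode_identity_h are hypothetical, and the parser assumes one `<src> <dst>` pair per line inside beginbfchar/endbfchar sections. It ignores bfrange sections, which real PDFs may also use.

```rust
use std::collections::HashMap;

/// Parse the `bfchar` sections of a ToUnicode CMap into a code -> text map.
/// Simplified sketch: one `<src> <dst>` pair per line, `bfrange` is ignored.
fn parse_bfchar(cmap: &str) -> HashMap<u16, String> {
    let mut map = HashMap::new();
    let mut in_section = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_section = true;
            continue;
        }
        if line == "endbfchar" {
            in_section = false;
            continue;
        }
        if !in_section {
            continue;
        }
        // An entry looks like: <0024> <0054>
        let hex: Vec<&str> = line
            .split(|c: char| c == '<' || c == '>')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .collect();
        if hex.len() != 2 {
            continue;
        }
        if let Ok(code) = u16::from_str_radix(hex[0], 16) {
            // The destination is UTF-16BE and may span several code units
            // (for example a ligature glyph mapped to "fi").
            let units: Vec<u16> = hex[1]
                .as_bytes()
                .chunks(4)
                .filter_map(|chunk| std::str::from_utf8(chunk).ok())
                .filter_map(|s| u16::from_str_radix(s, 16).ok())
                .collect();
            map.insert(code, String::from_utf16_lossy(&units));
        }
    }
    map
}

/// Decode a string shown with an Identity-H font: every character code is
/// two bytes, big-endian. Codes missing from the map become U+FFFD.
fn decode_identity_h(bytes: &[u8], to_unicode: &HashMap<u16, String>) -> String {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| to_unicode.get(&code).map(String::as_str).unwrap_or("\u{FFFD}"))
        .collect()
}
```

If the extractor skips this lookup and treats the 2-byte codes as text directly, the output would consist of glyph IDs reinterpreted as characters, which matches the garbling described in this issue.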
Expected Behavior
pdfvec::extract() should return readable text, e.g.: "The defendant filed a blockchain forensic analysis report...".
Actual Behavior
Returns roughly 6,380 characters of text in which alphabetic content is replaced with glyph IDs or incorrect codepoints. The contains_readable_text heuristic fails on this output.
Environment
- pdfvec: 0.1.1
- pdf (underlying crate): 0.9.1
- Typst: 0.14
- pdf-writer: (used by Typst internally)
- Platform: macOS Darwin 24.6.0, Rust 1.x
Workaround
Using pdf-extract 0.10.0 (extract_text_from_mem) works correctly for these PDFs.
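One way to wire in the workaround is to try the primary extractor first and switch to the second one when the result does not look like text. The sketch below is only an illustration under stated assumptions: looks_readable is a hypothetical stand-in for the contains_readable_text heuristic (whose real implementation is not shown here), extract_with_fallback is a hypothetical helper, and both extractors are passed in as closures so the example stays free of external crates.

```rust
/// Hypothetical stand-in for a readability check: at least 70% of the
/// characters must be alphabetic or whitespace. The threshold is arbitrary.
fn looks_readable(text: &str) -> bool {
    let total = text.chars().count();
    if total == 0 {
        return false;
    }
    let plausible = text
        .chars()
        .filter(|c| c.is_alphabetic() || c.is_whitespace())
        .count();
    plausible * 10 >= total * 7
}

/// Run `primary`; if it fails or returns unreadable text, run `fallback`.
fn extract_with_fallback<E>(
    bytes: &[u8],
    primary: impl Fn(&[u8]) -> Result<String, E>,
    fallback: impl Fn(&[u8]) -> Result<String, E>,
) -> Result<String, E> {
    match primary(bytes) {
        Ok(text) if looks_readable(&text) => Ok(text),
        _ => fallback(bytes),
    }
}
```

In a real project, primary would wrap pdfvec::extract and fallback would wrap pdf_extract::extract_text_from_mem, with both error types mapped to a common type; that mapping is left out because it depends on the caller's error handling.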