pdfvec::extract returns garbled text for Typst-generated PDFs (CIDFont/Type0 with Identity-H encoding)

pdfvec::extract() returns about 6,380 characters of garbled, unreadable text when processing PDFs generated by Typst (which uses pdf-writer internally). Both pdf-extract and pdftotext (poppler-utils) extract the same PDF correctly.

  Reproduction

  1. Install Typst: cargo install typst-cli
  2. Create a minimal Typst document:

  #set page(margin: 1in)
  #set text(font: "New Computer Modern", size: 12pt)

  = Test Document

  The defendant filed a *blockchain forensic analysis* report and a *cryptographic evidence* motion.

  3. Compile: typst compile test.typ test.pdf
  4. Extract text:

  fn main() {
      let bytes = std::fs::read("test.pdf").unwrap();
      let text = pdfvec::extract(&bytes).unwrap();
      println!("{}", text); // garbled output: no readable words
  }

  Root Cause

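  Typst/pdf-writer generates PDFs whose fonts are Type0 composite fonts (CIDFont descendants) with the Identity-H CMap encoding, so the codes in a text stream are two-byte glyph identifiers rather than Unicode codepoints. The font's ToUnicode CMap maps those glyph IDs back to Unicode codepoints. It appears that pdfvec (via the pdf crate v0.9) either does not read the ToUnicode CMap or does not apply it when decoding CIDFont text streams.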
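
To illustrate the suspected failure mode, here is a minimal sketch of the decoding step in plain Rust (standard library only). It is not pdfvec's or the pdf crate's actual code. The function names, the HashMap representation of an already-parsed ToUnicode CMap, and the glyph IDs used below are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Decode the bytes of a string shown with an Identity-H Type0 font.
/// `to_unicode` is a hypothetical, already-parsed ToUnicode CMap
/// (two-byte code -> Unicode text); real parsers may store it differently.
fn decode_identity_h(raw: &[u8], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    // Identity-H uses fixed-width, two-byte, big-endian character codes.
    // A trailing odd byte, if any, is ignored by chunks_exact.
    for pair in raw.chunks_exact(2) {
        let code = u16::from_be_bytes([pair[0], pair[1]]);
        match to_unicode.get(&code) {
            Some(text) => out.push_str(text),
            // No mapping: emit U+FFFD instead of treating the code as a codepoint.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}

/// What a decoder that skips ToUnicode might effectively do:
/// interpret every raw byte as a character of its own.
fn decode_naive(raw: &[u8]) -> String {
    raw.iter().map(|&b| b as char).collect()
}
```

With a made-up mapping of 0x0025 to "T", 0x0049 to "h", and 0x0046 to "e", the bytes 00 25 00 49 00 46 decode to "The" through decode_identity_h, while decode_naive produces NUL bytes interleaved with "%", "I", and "F". That is one plausible way to end up with the kind of garbled output described in this report.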

  Expected Behavior

  pdfvec::extract() should return readable text, e.g.: "The defendant filed a blockchain forensic analysis report...".

  Actual Behavior

  Returns 6,380 characters of text where alphabetic content is replaced with glyph IDs or incorrect codepoints. The contains_readable_text heuristic fails.
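
This report does not show how contains_readable_text is implemented. As a rough illustration of the kind of check that rejects this output, here is a hypothetical heuristic. The name looks_readable, the token rule, and the threshold parameter are assumptions, not pdfvec's code.

```rust
/// Hypothetical readability check: returns true when at least `min_ratio`
/// of the whitespace-separated tokens look like words, meaning they have
/// two or more characters and at least 80% of those characters are alphabetic.
fn looks_readable(text: &str, min_ratio: f64) -> bool {
    let mut total = 0usize;
    let mut wordlike = 0usize;
    for token in text.split_whitespace() {
        total += 1;
        let len = token.chars().count();
        let letters = token.chars().filter(|c| c.is_alphabetic()).count();
        // letters / len >= 0.8, written with integers to avoid rounding issues.
        if len >= 2 && letters * 5 >= len * 4 {
            wordlike += 1;
        }
    }
    // Empty input counts as unreadable.
    total > 0 && (wordlike as f64) / (total as f64) >= min_ratio
}
```

Under these assumptions, the expected sentence from this report passes with a threshold of 0.6, while a string of control characters and stray symbols does not.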

  Environment

  - pdfvec: 0.1.1
  - pdf (underlying crate): 0.9.1
  - Typst: 0.14
  - pdf-writer: (used by Typst internally)
  - Platform: macOS Darwin 24.6.0, Rust 1.x

  Workaround

  Using pdf-extract 0.10.0 (extract_text_from_mem) works correctly for these PDFs.
