
pdfvec::extract returns garbled text for Typst-generated PDFs (CIDFont/Type0 with Identity-H encoding) #11

@tyler-harpool

Description


pdfvec::extract() returns ~6,380 characters of garbled/unreadable text when processing PDFs generated by Typst (which uses pdf-writer internally). The same PDF is correctly extracted by pdf-extract and pdftotext (poppler-utils).

Reproduction

  1. Install Typst: cargo install typst-cli
  2. Create a minimal Typst document:

#set page(margin: 1in)
#set text(font: "New Computer Modern", size: 12pt)

= Test Document

The defendant filed a blockchain forensic analysis report and a cryptographic evidence motion.

  3. Compile: typst compile test.typ test.pdf
  4. Extract text:

fn main() {
    // Read the Typst-generated PDF into memory.
    let bytes = std::fs::read("test.pdf").unwrap();
    let text = pdfvec::extract(&bytes).unwrap();
    println!("{}", text); // garbled output, no readable words
}

Root Cause

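Typst (via pdf-writer) embeds fonts as Type0 composite fonts with a CIDFont descendant and the Identity-H CMap encoding, so the strings in the content stream are two-byte codes that identify glyphs rather than characters in a standard text encoding. The font's ToUnicode CMap maps those glyph IDs back to Unicode codepoints. It appears that pdfvec (via the pdf crate v0.9) either doesn't read the ToUnicode CMap or doesn't apply it when decoding text shown with these fonts.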
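
The decoding step described above can be sketched in a few lines. This is only an illustration, not code from pdfvec or the pdf crate: `decode_identity_h`, `decode_bytewise`, and the `HashMap` standing in for a parsed ToUnicode CMap are all hypothetical, and the glyph IDs in the usage note are made up rather than taken from a real font.

```rust
use std::collections::HashMap;

/// Decode a string shown with an Identity-H font. Every code is two bytes,
/// big-endian. `to_unicode` is a hypothetical stand-in for a parsed
/// ToUnicode CMap (code -> Unicode text).
fn decode_identity_h(bytes: &[u8], to_unicode: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    // chunks_exact(2) silently ignores a trailing odd byte.
    for pair in bytes.chunks_exact(2) {
        let code = u16::from_be_bytes([pair[0], pair[1]]);
        match to_unicode.get(&code) {
            Some(text) => out.push_str(text),
            // Unmapped code: emit U+FFFD instead of guessing.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}

/// What a decoder that skips ToUnicode might do (an assumption about the
/// failure mode): treat every byte as its own character.
fn decode_bytewise(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

With a map containing 0x0037 -> "T", 0x004B -> "h", and 0x0048 -> "e", the bytes 00 37 00 4B 00 48 decode to "The" through `decode_identity_h`, while `decode_bytewise` yields NUL bytes interleaved with unrelated characters.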

Expected Behavior

pdfvec::extract() should return readable text, e.g.: "The defendant filed a blockchain forensic analysis report...".

Actual Behavior

pdfvec::extract() returns ~6,380 characters of text in which alphabetic content is replaced with glyph IDs or incorrect codepoints. The contains_readable_text heuristic fails on this output.
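
For anyone who wants to detect this failure programmatically, a readability check could look roughly like the sketch below. It is not the contains_readable_text implementation mentioned above, which is not shown in this issue. The names `readable_ratio` and `looks_readable` and the threshold are hypothetical. The check catches control characters and replacement characters, but not wrong letters that still look plausible.

```rust
/// Fraction of characters that are letters, digits, whitespace, or ASCII
/// punctuation. Returns 0.0 for empty input.
fn readable_ratio(text: &str) -> f64 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let readable = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace() || c.is_ascii_punctuation())
        .count();
    readable as f64 / total as f64
}

/// Hypothetical check: treat text as readable when the ratio meets the
/// caller-chosen threshold (for example 0.8).
fn looks_readable(text: &str, threshold: f64) -> bool {
    readable_ratio(text) >= threshold
}
```

A caller could run `looks_readable(&text, 0.8)` on the extracted text and fall back to another extractor when it returns false.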

Environment

  • pdfvec: 0.1.1
  • pdf (underlying crate): 0.9.1
  • Typst: 0.14
  • pdf-writer: (used by Typst internally)
  • Platform: macOS Darwin 24.6.0, Rust 1.x

Workaround

Using pdf-extract 0.10.0 (extract_text_from_mem) works correctly for these PDFs.
