Skip to content

fix(paste): detect heading levels from Google Docs styled paragraphs#2178

Open
ErickPetru wants to merge 4 commits intosuperdoc-dev:mainfrom
ErickPetru:fix/google-docs-heading-detection
Open

fix(paste): detect heading levels from Google Docs styled paragraphs#2178
ErickPetru wants to merge 4 commits intosuperdoc-dev:mainfrom
ErickPetru:fix/google-docs-heading-detection

Conversation

@ErickPetru
Copy link

@ErickPetru ErickPetru commented Feb 25, 2026

Problem

When pasting content with H1, H3-H6 headings from Google Docs, they lose their heading level and paste as plain paragraphs. Only H2 is reliably preserved as a heading tag.

Closes #2152

SuperDoc with pasted Google Docs headings after the fix is applied

Cause

Google Docs converts most heading levels to <p> tags with inline font-size/font-weight styling instead of semantic <h1><h6> tags. ProseMirror's paragraph.parseDOM has rules for h1h6 but they never fire for these styled <p> elements, so the heading level is silently dropped.

Fix

  • Added convertStyledHeadings to the Google Docs paste pipeline, running before ProseMirror parses the pasted DOM
  • Detects <p> elements carrying bold + a recognised font-size (checked on both the <p> itself and its first child <span>, covering both Google Docs style placements) and replaces them with the corresponding <h1><h5> element
  • Font-size thresholds accommodate the default Google Docs theme and alternate themes (e.g. 20 pt and 24 pt both resolve to h1)

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 167f5aa820

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work on this, thanks for picking it up! the code is clean, tests are thorough, and the prior bot feedback (bold false positives + list-item corruption) is both addressed in the current version.

two things worth discussing:

multi-span headings won't be detected

getHeadingStyleProps only reads styles from a single child <span> (children.length === 1). a heading that contains a link, bookmark, or line break will produce multiple child elements — and since Google Docs usually puts the styles on the spans (not the <p>), those headings will be silently skipped.

e.g. this wouldn't be detected:

<p>
  <span style="font-size:20pt;font-weight:700">Heading with </span>
  <a href="..."><span style="font-size:20pt;font-weight:700">a link</span></a>
</p>

one option: check that all child elements share the same font-size and bold status instead of requiring exactly one span. up to you whether that's worth the added complexity for this PR or a follow-up.

minor: closest('li') instead of parentElement

the list-item filter only checks the immediate parent. the rest of the pipeline makes the same assumption (<li> > <p> direct child), so it's consistent. but !p.closest('li') would be a zero-cost safety net in case Google Docs ever wraps list content differently.

@ErickPetru
Copy link
Author

Thanks for the thorough review, @caio-pizzol!

Both points addressed:

  1. Multi-span headings: replaced the singleSpan guard with an all-spans approach which uses querySelectorAll('span') then requires them all to share the same font-size and bold status. This detects headings with links/bookmarks while still rejecting body paragraphs with a single bold opening word (since inconsistent bold across spans is not a heading).

  2. closest('li'): swapped the parentElement check for p.closest('li') as suggested.

Copy link
Contributor

@caio-pizzol caio-pizzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ErickPetru the full-paragraph bold check and the closest('li') fix both look good — nice work on those.

one thing still worth a look: left an inline comment on the paragraph-level bold case in fromSpans. it's not a blocker since the pattern is uncommon in real Google Docs output, but an easy one-liner if you want to close the gap.

const [firstSpanSize] = sizes;
if (sizes.some((size) => size !== firstSpanSize)) return notHeading;

return { fontSize: firstSpanSize, isBold: allSpansBold };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fromSpans checks bold on the child spans, but if bold is set on the <p> itself (<p style="font-weight:700">) with font-size only on spans, allSpansBold will be false and the heading gets skipped. worth patching for consistency?

Suggested change
return { fontSize: firstSpanSize, isBold: allSpansBold };
return { fontSize: firstSpanSize, isBold: allSpansBold || boldWeightRegex.test(el.style.fontWeight || '') };

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, applied your suggestion and also extended it so elIsBold (the root element's bold) is checked consistently in both fromElement and fromSpans paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detect heading levels from Google Docs styled paragraphs

2 participants