
How to Convert HTML to Markdown (Clean, Lossless)

Convert HTML to Markdown cleanly — whether you're migrating a CMS, archiving web pages, or moving to a static site. Tools, gotchas, and how to handle messy input.

By mdkit Team · 7 min read

HTML to Markdown conversion comes up more often than people expect. You're migrating away from WordPress. You're archiving blog posts into a git repo. You're pulling content out of a CMS that's shutting down. You're moving to a static site generator. You just copy-pasted something from a website and want it as Markdown.

The good news: for most content, conversion is painless. The less-good news: HTML is far more expressive than Markdown, so some things get lost, and you need to know which ones.

What converts cleanly

These HTML patterns map 1:1 to Markdown without loss:

| HTML | Markdown |
| --- | --- |
| <h1> through <h6> | # through ###### |
| <p>Text</p> | Text + blank line |
| <strong> or <b> | **bold** |
| <em> or <i> | *italic* |
| <del> or <s> | ~~strikethrough~~ (GFM) |
| <code> | `code` |
| <pre><code class="language-js"> | ```js (fenced code block) |
| <a href="..."> | [text](url) |
| <img src="..." alt="..."> | ![alt](url) |
| <ul><li> | - item |
| <ol><li> | 1. item |
| <blockquote> | > quote |
| <hr> | --- |
| <table><thead><tbody> | GFM table |

If your HTML is mostly these elements, you're golden.

What doesn't convert

| HTML | What happens in Markdown |
| --- | --- |
| <form>, <input>, <button> | Cannot be represented. Stripped or left as HTML. |
| <iframe>, <video>, <audio> | Left as raw HTML (valid in most Markdown processors). |
| <div class="...">, <span style="..."> | Classes and styles stripped. Content kept. |
| <details>, <summary> | Usually left as raw HTML. |
| <kbd>, <abbr>, <sub>, <sup> | Left as raw HTML. |
| Custom layout (grids, flexbox) | Collapsed into flat content. Layout lost. |
| Inline style="..." | Stripped. |
| <script>, <style> | Stripped (and should be, for security). |

You'll end up with Markdown that has occasional HTML sprinkled through it. That's normal and expected.

Tools

1. Browser tool (easiest)

Our HTML to Markdown converter uses Turndown with the GitHub Flavored Markdown plugin.

  • Paste HTML → get Markdown instantly
  • Runs in your browser (nothing uploaded)
  • Copy output with one click
  • Free, no signup

Good for: single articles, pasted snippets, quick exploratory conversions.

2. Turndown (Node.js, scriptable)

    npm install turndown turndown-plugin-gfm

    import TurndownService from "turndown";
    import { gfm } from "turndown-plugin-gfm";

    const turndown = new TurndownService({
      headingStyle: "atx",       // # instead of ===
      codeBlockStyle: "fenced",  // ```js instead of indented
      bulletListMarker: "-",     // - instead of *
    });
    turndown.use(gfm);

    const md = turndown.turndown("<h1>Hello</h1><p>World</p>");
    console.log(md);
    // # Hello
    //
    // World

Good for: bulk conversion, CI/CD pipelines, integrating into a larger script.

3. Pandoc (battle-tested)

    pandoc input.html -o output.md

    # Or from stdin:
    pandoc -f html -t gfm < input.html > output.md

Flags worth knowing:

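  • -t gfm: output GitHub Flavored Markdown (vs. Pandoc's own Markdown dialect).
  • -t commonmark: strict CommonMark output.
  • --wrap=none: don't hard-wrap long lines (nicer for editing).
  • --markdown-headings=atx: use # style headings. Recent Pandoc releases emit these by default; this flag replaces the older, deprecated --atx-headers.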

Good for: academic documents, conversions that need to be round-trippable, long documents.

4. html-to-md (lightweight Node alternative)

    npm install html-to-md

    import html2md from "html-to-md";

    const md = html2md("<h1>Title</h1>");

Simpler API than Turndown; fewer config options.

5. Python: html2text / markdownify

    pip install markdownify

    from markdownify import markdownify as md

    # markdownify underlines headings by default; ask for ATX to get "#" style
    print(md("<h1>Hello</h1>", heading_style="ATX"))
    # # Hello

Good for: Python pipelines, Jupyter notebooks, integrating with scraping tools.

A realistic migration workflow

If you're moving content out of a CMS (say, WordPress → a static site), here's the sequence that works:

Step 1: Export raw HTML

Most CMSes have an export feature. WordPress has WXR (XML). Ghost, Medium, Substack all export to HTML or JSON. For systems without export, use wget -r or a scraping script.
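If your export is a WordPress WXR file, each post sits inside an <item> element. Here is a rough sketch of pulling the pieces out with regular expressions. The helper names field and parseWxrItems are made up for illustration, and the sketch assumes the usual WXR element names (title, wp:post_name, wp:post_date, content:encoded) with optional CDATA wrapping. A real XML parser is the safer choice for anything beyond a quick look.

```javascript
// Hypothetical helper: returns the text of the first <tag>...</tag> in a
// chunk of XML, unwrapping a CDATA section if one is present.
function field(itemXml, tag) {
  const pattern = new RegExp(
    `<${tag}>(?:<!\\[CDATA\\[)?([\\s\\S]*?)(?:\\]\\]>)?</${tag}>`
  );
  const match = itemXml.match(pattern);
  return match ? match[1] : "";
}

// Hypothetical helper: splits a WXR export into one record per <item>.
function parseWxrItems(xml) {
  const items = xml.match(/<item>[\s\S]*?<\/item>/g) || [];
  return items.map((item) => ({
    title: field(item, "title"),
    slug: field(item, "wp:post_name"),
    date: field(item, "wp:post_date"),
    html: field(item, "content:encoded"),
  }));
}

// Made-up sample data, shaped like a single exported post.
const sample = `<item>
<title><![CDATA[Hello world]]></title>
<wp:post_name><![CDATA[hello-world]]></wp:post_name>
<wp:post_date><![CDATA[2024-01-15 09:00:00]]></wp:post_date>
<content:encoded><![CDATA[<p>First post.</p>]]></content:encoded>
</item>`;

console.log(parseWxrItems(sample));
```

Each record's html field is what you would feed to the converter in Step 3; title, slug, and date are what Step 4 needs.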

Step 2: Extract only the article body

HTML exports usually include headers, sidebars, footers, and navigation. You don't want any of that in Markdown. Use a query selector:

    // Node.js with cheerio
    import * as cheerio from "cheerio";
    import fs from "fs";

    const html = fs.readFileSync("post.html", "utf8");
    const $ = cheerio.load(html);
    const articleHTML = $("article").html(); // or ".post-content", etc.

Step 3: Convert to Markdown

    import TurndownService from "turndown";
    import { gfm } from "turndown-plugin-gfm";

    const turndown = new TurndownService({ headingStyle: "atx" });
    turndown.use(gfm);

    const markdown = turndown.turndown(articleHTML);

Step 4: Add front matter

Static site generators need YAML front matter for metadata:

    const frontMatter = `---
    title: "${title}"
    date: "${date}"
    tags: [${tags.map((t) => `"${t}"`).join(", ")}]
    ---

    `;
    fs.writeFileSync(`posts/${slug}.md`, frontMatter + markdown);
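One thing to watch with a plain template string: a title that contains a double quote produces invalid YAML. A minimal sketch of a safer builder follows, assuming a hypothetical helper named buildFrontMatter and relying on YAML 1.2 parsers accepting JSON-style double-quoted strings. Check that your site generator's parser agrees before using it in bulk.

```javascript
// Hypothetical helper: builds YAML front matter with safely quoted values.
// JSON.stringify escapes embedded quotes and backslashes; the result is
// assumed to be readable as a YAML double-quoted scalar.
function buildFrontMatter({ title, date, tags = [] }) {
  const lines = [
    "---",
    `title: ${JSON.stringify(title)}`,
    `date: ${JSON.stringify(date)}`,
    `tags: [${tags.map((t) => JSON.stringify(t)).join(", ")}]`,
    "---",
    "", // blank line between front matter and body
    "",
  ];
  return lines.join("\n");
}

// Made-up example values.
console.log(
  buildFrontMatter({
    title: 'Why "lossless" is a stretch',
    date: "2024-01-15",
    tags: ["html", "markdown"],
  })
);
```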

Step 5: Fix images

Image URLs may point to the old CMS (/wp-content/uploads/...). Rewrite them to your new image hosting or download and place them in a static assets folder. A regex pass on the Markdown works:

    markdown = markdown.replace(
      /!\[([^\]]*)\]\(\/wp-content\/uploads\/([^)]+)\)/g,
      "![$1](/images/$2)"
    );
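Exports sometimes contain absolute image URLs as well as root-relative ones, and you usually want a list of files to download. The sketch below is one way to cover both. The function name rewriteImages is hypothetical, and it assumes everything under /wp-content/uploads/ moves to /images/ with the same sub-path.

```javascript
// Hypothetical helper: rewrites WordPress upload URLs inside Markdown image
// syntax and collects the original sub-paths so they can be downloaded later.
function rewriteImages(markdown) {
  const found = [];
  // Optional scheme + host, then the uploads path. The path stops at ")" or
  // whitespace, so an optional image title after the URL is left untouched.
  const pattern = /(!\[[^\]]*\]\()(?:https?:\/\/[^\/\s)]+)?\/wp-content\/uploads\/([^)\s]+)/g;
  const rewritten = markdown.replace(pattern, (match, prefix, path) => {
    found.push(path);
    return `${prefix}/images/${path}`;
  });
  return { rewritten, found };
}

// old.example.com is a placeholder host.
const result = rewriteImages(
  "![chart](https://old.example.com/wp-content/uploads/2020/01/chart.png)"
);
console.log(result.rewritten);
console.log(result.found);
```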

Step 6: Review by hand

Always, always review. Pick 5–10 random posts and skim the rendered output. Look for:

  • Broken image paths
  • Stripped formatting that mattered
  • HTML that should have been converted but wasn't
  • Encoding issues (smart quotes, em dashes)
  • Empty paragraphs from nested divs
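Some of these checks can be roughed out in code so that the manual review starts from a shortlist. This is only a sketch: lintMarkdown is a hypothetical name, the patterns are heuristics that will produce false positives (for example inside code blocks), and the old-path check assumes a WordPress source.

```javascript
// Hypothetical checker: returns a list of "line: problem" strings for one
// converted Markdown document.
function lintMarkdown(markdown) {
  const issues = [];
  markdown.split("\n").forEach((line, index) => {
    const n = index + 1;
    if (/!\[[^\]]*\]\(\s*\)/.test(line)) {
      issues.push(`${n}: image with empty path`);
    }
    if (/\/wp-content\/uploads\//.test(line)) {
      issues.push(`${n}: old CMS image path`);
    }
    if (/<\/?(div|span|font)\b[^>]*>/i.test(line)) {
      issues.push(`${n}: leftover HTML tag`);
    }
    // U+FFFD or the "a-circumflex + euro sign" pair often signals mis-decoded UTF-8.
    if (/\uFFFD|\u00e2\u20ac/.test(line)) {
      issues.push(`${n}: possible encoding damage`);
    }
  });
  // Three or more empty lines in a row often come from empty wrappers.
  if (/\n{4,}/.test(markdown)) {
    issues.push("document: run of empty lines");
  }
  return issues;
}

console.log(lintMarkdown("Intro\n<div>stray wrapper</div>\n![logo]()"));
```

An empty result does not mean the post is fine; it only means none of these particular patterns fired.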

Common gotchas

Nested inline tags

<strong><em>bold italic</em></strong>

Some converters produce ***bold italic*** (correct). Some produce **_bold italic_** (ugly but valid). Check your tool's config.
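If your tool has no setting for this, a post-processing pass can normalize the output. A small sketch, with normalizeBoldItalic as a hypothetical name; it only handles single-line spans and does not know about code blocks, so treat it as a starting point.

```javascript
// Hypothetical helper: rewrites **_text_** and _**text**_ as ***text***.
// Caveat: it also touches matching text inside code spans and code blocks.
function normalizeBoldItalic(markdown) {
  return markdown
    .replace(/\*\*_([^_*\n]+)_\*\*/g, "***$1***")
    .replace(/_\*\*([^_*\n]+)\*\*_/g, "***$1***");
}

console.log(normalizeBoldItalic("This is **_bold italic_** text."));
```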

Line breaks

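<br> inside a paragraph converts to a hard line break in Markdown: two trailing spaces before the newline. Those spaces are invisible and easy to delete accidentally, and many editors trim trailing whitespace on save. If line breaks matter, you may want to keep them as <br> in the Markdown. Turndown's br option sets the string emitted before the newline; its default is two spaces.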
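If the converter has already produced trailing-space breaks, one option is to make them visible afterwards. A minimal sketch, assuming a hypothetical helper named makeBreaksVisible; it does not skip code blocks, so lines of code that end in two spaces would be changed too.

```javascript
// Hypothetical helper: replaces "two or more trailing spaces + newline"
// (a Markdown hard break) with a visible marker, by default <br>.
function makeBreaksVisible(markdown, marker = "<br>") {
  // Only lines with content, followed directly by another line of content,
  // count as hard breaks; trailing spaces before a blank line are left alone.
  return markdown.replace(/(\S) {2,}\n(?=\S)/g, (match, lastChar) => {
    return `${lastChar}${marker}\n`;
  });
}

console.log(makeBreaksVisible("Roses are red  \nViolets are blue\n"));
```

Passing "\\" as the marker gives the backslash form of a hard break instead, which CommonMark also accepts.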

Tables with spans

<td colspan="2">Merged</td>

Markdown tables don't support colspan or rowspan. Your converter will either drop the attribute (breaking the merge) or leave the raw HTML. Decide which is worse for your content.
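To make that decision per table rather than per site, you can detect merged cells before converting. A sketch with a hypothetical helper, tableHasSpans, that works on the raw HTML string; how you then tell your converter to keep the table as HTML depends on the tool.

```javascript
// Hypothetical helper: true if any <td>/<th> in the HTML has a colspan or
// rowspan greater than 1 (a span of 1 is the same as no span at all).
function tableHasSpans(tableHtml) {
  return /<t[dh]\b[^>]*\b(?:col|row)span\s*=\s*["']?(?:[2-9]|[1-9]\d+)/i.test(
    tableHtml
  );
}

console.log(tableHasSpans('<table><tr><td colspan="2">Merged</td></tr></table>'));
console.log(tableHasSpans("<table><tr><td>Plain</td></tr></table>"));
```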

Code block language detection

    <pre><code class="language-javascript">
    const x = 1;
    </code></pre>

Modern converters read class="language-*" and emit ```javascript. But some old CMSes use class="lang-js" or class="prettyprint" — you may need custom logic.
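One way to add that custom logic is to normalize class names in the HTML before it reaches the converter. The sketch below assumes hypothetical names (LANGUAGE_ALIASES, normalizeCodeClasses) and a small alias table you would extend for your own CMS. Classes that carry no language, such as prettyprint on its own, are left alone.

```javascript
// Hypothetical alias table: maps short names to the ones you want in fences.
const LANGUAGE_ALIASES = { js: "javascript", py: "python", sh: "bash" };

// Hypothetical helper: rewrites class="lang-xx" or class="language-xx" on
// <code> tags to a single class="language-<name>".
// Note: any other classes on the same tag are dropped.
function normalizeCodeClasses(html) {
  return html.replace(
    /<code\b([^>]*)\bclass=(["'])([^"']*)\2/gi,
    (match, before, quote, classes) => {
      const hit = classes.match(/\b(?:lang|language)-([\w+#-]+)/i);
      if (!hit) return match; // no language information to work with
      const name = hit[1].toLowerCase();
      const language = LANGUAGE_ALIASES[name] || name;
      return `<code${before}class=${quote}language-${language}${quote}`;
    }
  );
}

console.log(normalizeCodeClasses('<pre><code class="lang-js">const x = 1;</code></pre>'));
```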

Smart quotes and typography

HTML often contains smart quotes (‘ ’ and “ ”), em dashes (—), and non-breaking spaces (U+00A0, written &nbsp; in HTML source). These survive conversion, which is good, but some editors and git diffs render them oddly. If you want ASCII-only Markdown, run a sed or regex pass after conversion.
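Such a regex pass might look like the sketch below. The function name toAsciiTypography and the chosen replacements (for example two hyphens for an em dash) are assumptions; pick whatever substitutions suit your style guide.

```javascript
// Hypothetical helper: replaces common typographic characters with ASCII.
function toAsciiTypography(markdown) {
  return markdown
    .replace(/[\u2018\u2019]/g, "'") // single smart quotes
    .replace(/[\u201C\u201D]/g, '"') // double smart quotes
    .replace(/\u2014/g, "--") // em dash
    .replace(/\u2013/g, "-") // en dash
    .replace(/\u2026/g, "...") // ellipsis
    .replace(/\u00A0/g, " "); // non-breaking space
}

console.log(toAsciiTypography("\u201CQuoted\u201D text, it\u2019s here\u2026"));
```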

When to skip automatic conversion

Sometimes the HTML is so polluted (deeply nested divs, inline styles everywhere, WYSIWYG editor cruft) that conversion produces garbage. In that case:

  1. View the rendered page in a browser.
  2. Copy the visible text (Cmd/Ctrl+A, then Cmd/Ctrl+C).
  3. Paste as plain text into your Markdown editor.
  4. Re-apply formatting (headings, links, code) manually.

This is faster than fighting a converter for a single stubborn page, and you end up with cleaner Markdown.

After conversion

Once you have Markdown, run it through the other direction as a sanity check:

  • Use our Markdown to HTML tool to render the result.
  • Compare side-by-side with the original.
  • Fix what's broken, re-run.

For a large migration, automate this sanity check across 100% of your posts before going live.
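A full render-and-compare is tool-specific, but a cheap first pass is to check that the words in the original HTML still appear in the Markdown. The sketch below uses hypothetical helper names and a deliberately crude, ASCII-only notion of a "word", so it suits English content best. Any threshold you pick for flagging a post (say, coverage below 0.9) is an assumption to tune on your own data.

```javascript
// Crude text extraction: drops script/style bodies, tags, and entities,
// then returns lowercase ASCII words.
function htmlToWords(html) {
  const text = html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/&[a-z#0-9]+;/gi, " ")
    .toLowerCase();
  return text.match(/[a-z0-9]+/g) || [];
}

// Keeps link and image text, drops the URLs, returns lowercase ASCII words.
function markdownToWords(markdown) {
  const text = markdown
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1")
    .toLowerCase();
  return text.match(/[a-z0-9]+/g) || [];
}

// Fraction of words from the original HTML that also appear in the Markdown.
function wordCoverage(html, markdown) {
  const original = htmlToWords(html);
  if (original.length === 0) return 1;
  const converted = new Set(markdownToWords(markdown));
  const kept = original.filter((word) => converted.has(word)).length;
  return kept / original.length;
}

console.log(wordCoverage("<h1>Hello</h1><p>World</p>", "# Hello\n\nWorld"));
```

Posts with low coverage are the ones to open first during the manual review.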

Summary

  • Most HTML converts cleanly to Markdown. Forms, layout, inline styles do not.
  • Use our browser tool for one-offs, Turndown or Pandoc for scripting.
  • Always review converted output — the last 5% needs a human.
  • For CMS migrations, the real work is in the pipeline around the converter (extraction, front matter, image paths), not the converter itself.

Master the workflow once and you'll never feel trapped by a CMS again.

Frequently Asked Questions

Is HTML to Markdown conversion lossless?
Conversion is lossy whenever the HTML contains elements Markdown can't express: forms, inline styles, custom classes, scripts, iframes, or advanced layout. Pure content HTML (articles, blog posts, documentation) converts cleanly. Expect to review and clean up after any conversion.

Which tool produces the cleanest output?
For browser use, our HTML to Markdown tool uses Turndown with GFM extensions, the same library behind many popular content migration pipelines. For command-line bulk conversion, Pandoc is slightly more conservative (preserves more HTML) while Turndown is more aggressive (produces cleaner Markdown).

How do I handle tables, code blocks, and images?
All three are handled automatically by modern converters. Tables become GFM table syntax. <pre><code> blocks with class='language-*' become fenced code blocks with the language hint. <img> tags become ![alt](url). You may want to review alt text and image paths after conversion.

Can I convert a whole website?
Yes, with a scripted pipeline. Use wget or a site-crawling library to download HTML, then run each file through a converter (Pandoc or Turndown). For WordPress specifically, the WXR export plus a WXR-to-Markdown tool is faster than scraping.

What about inline styles like color or font size?
Markdown cannot represent inline styles. They'll be stripped during conversion. If you need to preserve visual formatting, convert to Markdown + inline HTML (not pure Markdown), or rethink whether those styles are really necessary in the new system.
