How to Convert HTML to Markdown (Clean, Lossless)
Convert HTML to Markdown cleanly — whether you're migrating a CMS, archiving web pages, or moving to a static site. Tools, gotchas, and how to handle messy input.
HTML to Markdown conversion comes up more often than people expect. You're migrating away from WordPress. You're archiving blog posts into a git repo. You're pulling content out of a CMS that's shutting down. You're moving to a static site generator. You just copy-pasted something from a website and want it as Markdown.
The good news: for most content, conversion is painless. The less-good news: HTML is far more expressive than Markdown, so some things get lost, and you need to know which ones.
What converts cleanly
These HTML patterns map 1:1 to Markdown without loss:
| HTML | Markdown |
|---|---|
| <h1> through <h6> | # through ###### |
| <p>Text</p> | Text + blank line |
| <strong> or <b> | **bold** |
| <em> or <i> | *italic* |
| <del> or <s> | ~~strikethrough~~ (GFM) |
| <code> | `code` |
| <pre><code class="language-javascript"> | ```javascript (GFM) |
| <a href="..."> | [text](url) |
| <img src="..." alt="..."> | ![alt](src) |
| <ul><li> | - item |
| <ol><li> | 1. item |
| <blockquote> | > quote |
| <hr> | --- |
| <table><thead><tbody> | GFM table |
If your HTML is mostly these elements, you're golden.
What doesn't convert
| HTML | What happens in Markdown |
|---|---|
| <form>, <input>, <button> | Cannot be represented. Stripped or left as HTML. |
| <iframe>, <video>, <audio> | Left as raw HTML (valid in most Markdown processors). |
| <div class="...">, <span style="..."> | Classes and styles stripped. Content kept. |
| <details><summary> | Usually left as raw HTML. |
| <kbd>, <abbr>, <sub>, <sup> | Left as raw HTML. |
| Custom layout (grids, flexbox) | Collapsed into flat content. Layout lost. |
| Inline style="..." | Stripped. |
| <script>, <style> | Stripped (and should be, for security). |
You'll end up with Markdown that has occasional HTML sprinkled through it. That's normal and expected.
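Before converting a batch, it can help to estimate how much of the input falls into the second table. The sketch below is one rough way to do that with a regex scan. The function name `auditHtml` and the two tag groupings are assumptions made for this example, not part of any converter.

```javascript
// Tags taken from the "What doesn't convert" table; the grouping is this sketch's own.
const KEPT_AS_HTML = ["iframe", "video", "audio", "details", "summary", "kbd", "abbr", "sub", "sup"];
const NOT_REPRESENTABLE = ["form", "input", "button", "script", "style"];

// Count opening tags by name. This is a rough regex scan, not an HTML parser:
// tags inside comments or attribute values can produce false positives.
function auditHtml(html) {
  const counts = {};
  for (const match of html.matchAll(/<([a-zA-Z][a-zA-Z0-9-]*)\b/g)) {
    const tag = match[1].toLowerCase();
    counts[tag] = (counts[tag] || 0) + 1;
  }
  // Keep only the tags from a group that actually occur in the input.
  const pick = (tags) =>
    Object.fromEntries(tags.filter((t) => counts[t]).map((t) => [t, counts[t]]));
  return {
    keptAsHtml: pick(KEPT_AS_HTML),
    notRepresentable: pick(NOT_REPRESENTABLE),
  };
}

const report = auditHtml(
  '<p>Press <kbd>Ctrl</kbd></p><form><input name="q"><input name="r"></form>'
);
console.log(report);
```

A page whose report is empty should convert with little cleanup. A page full of form controls probably needs a manual decision first.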
Tools
1. Browser tool (easiest)
Our HTML to Markdown converter uses Turndown with the GitHub Flavored Markdown plugin.
- Paste HTML → get Markdown instantly
- Runs in your browser (nothing uploaded)
- Copy output with one click
- Free, no signup
Good for: single articles, pasted snippets, quick exploratory conversions.
2. Turndown (Node.js, scriptable)
npm install turndown turndown-plugin-gfm
    import TurndownService from "turndown";
    import { gfm } from "turndown-plugin-gfm";

    const turndown = new TurndownService({
      headingStyle: "atx", // # instead of ===
      codeBlockStyle: "fenced", // ```js instead of indented
      bulletListMarker: "-", // - instead of *
    });
    turndown.use(gfm);

    const md = turndown.turndown("<h1>Hello</h1><p>World</p>");
    console.log(md);
    // # Hello
    //
    // World
Good for: bulk conversion, CI/CD pipelines, integrating into a larger script.
3. Pandoc (battle-tested)
    pandoc input.html -o output.md

    # Or from stdin:
    pandoc -f html -t gfm < input.html > output.md
Flags worth knowing:
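- -t gfm: output GitHub Flavored Markdown (vs. plain markdown).
- -t commonmark: strict CommonMark output.
- --wrap=none: don't hard-wrap long lines (nicer for editing).
- --markdown-headings=atx: use # style headings. Current pandoc releases already default to ATX; this flag replaces the older --atx-headers, which is deprecated.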
Good for: academic documents, conversions that need to be round-trippable, long documents.
4. html-to-md (lightweight Node alternative)
npm install html-to-md
    import html2md from "html-to-md";

    const md = html2md("<h1>Title</h1>");
Simpler API than Turndown; fewer config options.
5. Python: html2text / markdownify
pip install markdownify
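    from markdownify import markdownify as md

    # heading_style="atx" produces # headings; the default underlines h1/h2 instead
    print(md("<h1>Hello</h1>", heading_style="atx"))
    # # Hello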
Good for: Python pipelines, Jupyter notebooks, integrating with scraping tools.
A realistic migration workflow
If you're moving content out of a CMS (say, WordPress → a static site), here's the sequence that works:
Step 1: Export raw HTML
Most CMSes have an export feature. WordPress has WXR (XML). Ghost, Medium, Substack all export to HTML or JSON. For systems without export, use wget -r or a scraping script.
Step 2: Extract only the article body
HTML exports usually include headers, sidebars, footers, and navigation. You don't want any of that in Markdown. Use a query selector:
    // Node.js with cheerio
    import * as cheerio from "cheerio";
    import fs from "fs";

    const html = fs.readFileSync("post.html", "utf8");
    const $ = cheerio.load(html);
    const articleHTML = $("article").html(); // or ".post-content", etc.
Step 3: Convert to Markdown
    import TurndownService from "turndown";
    import { gfm } from "turndown-plugin-gfm";

    const turndown = new TurndownService({ headingStyle: "atx" });
    turndown.use(gfm);

    const markdown = turndown.turndown(articleHTML);
Step 4: Add front matter
Static site generators need YAML front matter for metadata:
    const frontMatter = `---
    title: "${title}"
    date: "${date}"
    tags: [${tags.map((t) => `"${t}"`).join(", ")}]
    ---

    `;
    fs.writeFileSync(`posts/${slug}.md`, frontMatter + markdown);
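One thing the template above does not handle is a title that itself contains a double quote or a backslash, which would produce invalid YAML. A minimal sketch of one way around that follows. It assumes JSON-style quoting is acceptable to your YAML parser, which holds for typical titles. The helper names `yamlString` and `buildFrontMatter` are made up for this example.

```javascript
// Quote a value as a YAML double-quoted scalar. JSON.stringify escapes quotes,
// backslashes, and newlines, and YAML double-quoted scalars accept those escapes.
const yamlString = (value) => JSON.stringify(String(value));

// Build a front matter block with the same fields as the template above.
function buildFrontMatter({ title, date, tags = [] }) {
  return [
    "---",
    `title: ${yamlString(title)}`,
    `date: ${yamlString(date)}`,
    `tags: [${tags.map((tag) => yamlString(tag)).join(", ")}]`,
    "---",
    "", // blank line between front matter and body
    "",
  ].join("\n");
}

// Example values are placeholders.
console.log(buildFrontMatter({ title: 'He said "hi"', date: "2024-01-15", tags: ["a", "b"] }));
```

The returned string ends with a blank line, so it can be prepended to the converted Markdown directly.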
Step 5: Fix images
Image URLs may point to the old CMS (/wp-content/uploads/...). Rewrite them to your new image hosting or download and place them in a static assets folder. A regex pass on the Markdown works:
    markdown = markdown.replace(
      /!\[([^\]]*)\]\(\/wp-content\/uploads\/([^)]+)\)/g,
      "![$1](/images/$2)" // /images/ is a placeholder for your new asset path
    );
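If you also plan to download the images, it helps to have the rewrite step report which files it touched. The following is a sketch under assumptions: `rewriteImagePaths` is a hypothetical helper, and both prefixes are placeholders for your own paths. It does not handle image links that carry a title after the URL.

```javascript
// Rewrite Markdown image links under oldPrefix to newPrefix and collect the
// relative paths that were rewritten, so they can be downloaded or copied.
function rewriteImagePaths(markdown, oldPrefix = "/wp-content/uploads/", newPrefix = "/images/") {
  // Escape regex metacharacters in the prefix before embedding it in a pattern.
  const escaped = oldPrefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`!\\[([^\\]]*)\\]\\(${escaped}([^)\\s]+)\\)`, "g");
  const files = [];
  const output = markdown.replace(pattern, (_match, alt, path) => {
    files.push(path);
    return `![${alt}](${newPrefix}${path})`;
  });
  return { markdown: output, files };
}

const result = rewriteImagePaths(
  "![Logo](/wp-content/uploads/2020/01/logo.png) and ![](/other/x.png)"
);
console.log(result.markdown);
console.log(result.files);
```

The `files` array can then feed whatever download or copy step you use. Links outside the old prefix are left untouched.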
Step 6: Review by hand
Always, always review. Pick 5–10 random posts and skim the rendered output. Look for:
- Broken image paths
- Stripped formatting that mattered
- HTML that should have been converted but wasn't
- Encoding issues (smart quotes, em dashes)
- Empty paragraphs from nested divs
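Parts of this checklist can be pre-screened mechanically so the manual review focuses on posts that are likely to have problems. The sketch below is a set of heuristic regex checks. The check names and patterns are assumptions chosen for this example, and they will produce some false positives.

```javascript
// Heuristic checks loosely matching the review checklist. None use the "g"
// flag, so calling .test() repeatedly is stateless.
const CHECKS = [
  // Image links still pointing at the old CMS upload folder.
  { name: "old-cms-image-path", pattern: /!\[[^\]]*\]\([^)]*\/wp-content\/uploads\// },
  // Tags that have a Markdown equivalent and should have been converted.
  { name: "unconverted-html", pattern: /<\/?(p|h[1-6]|strong|em|ul|ol|li|a|img|blockquote)\b/i },
  // Typical UTF-8-read-as-Latin-1 debris, plus the Unicode replacement character.
  { name: "mojibake", pattern: /\u00E2\u20AC|\u00C3[\u0080-\u00BF]|\uFFFD/ },
  // Entities that were not decoded.
  { name: "html-entity", pattern: /&(nbsp|#\d+);/ },
  // Three or more consecutive empty lines, often left behind by nested divs.
  { name: "excess-blank-lines", pattern: /\n{4,}/ },
];

// Return the names of all checks that fire for a Markdown string.
function lintMarkdown(markdown) {
  return CHECKS.filter(({ pattern }) => pattern.test(markdown)).map(({ name }) => name);
}

console.log(lintMarkdown("![x](/wp-content/uploads/a.png)\n\n<strong>hi</strong>"));
```

A clean result does not replace the manual skim. It only tells you which posts to open first.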
Common gotchas
Nested inline tags
<strong><em>bold italic</em></strong>
Some converters produce ***bold italic*** (correct). Some produce **_bold italic_** (ugly but valid). Check your tool's config.
Line breaks
<br> inside a paragraph converts to two trailing spaces at the end of the line in Markdown, which are invisible and easy to delete accidentally. If line breaks matter, you may want to keep them as <br> in the Markdown. Turndown controls the replacement through its br option, which defaults to two spaces.
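If your converter has already emitted trailing-space breaks, a post-processing pass can make them visible again. This is a minimal sketch: `makeBreaksVisible` is a hypothetical helper, and it only recognizes simple fenced code blocks, which it leaves untouched.

```javascript
// Replace hard line breaks written as two or more trailing spaces with an
// explicit <br>, skipping lines inside fenced code blocks.
function makeBreaksVisible(markdown) {
  const lines = markdown.split("\n");
  let inFence = false;
  return lines
    .map((line, i) => {
      if (/^\s*(`{3}|~{3})/.test(line)) {
        inFence = !inFence; // toggle on opening and closing fence lines
        return line;
      }
      const next = lines[i + 1];
      // Trailing spaces only act as a break when the paragraph continues.
      const continues = next !== undefined && next.trim() !== "";
      const hasHardBreak = / {2,}$/.test(line) && line.trim() !== "";
      return !inFence && continues && hasHardBreak ? line.replace(/ {2,}$/, "<br>") : line;
    })
    .join("\n");
}

console.log(makeBreaksVisible("Line one  \nLine two\n\nNext paragraph"));
```

Most Markdown processors pass a raw <br> through unchanged, so the rendered output should stay the same while the source becomes harder to break by accident.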
Tables with spans
<td colspan="2">Merged</td>
Markdown tables don't support colspan or rowspan. Your converter will either drop the attribute (breaking the merge) or leave the raw HTML. Decide which is worse for your content.
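One way to make that decision per table rather than per site is to detect spans up front and route only the affected tables to raw HTML. The sketch below uses regexes and assumes tables are not nested. `hasSpannedCells` and `partitionTables` are names invented for this example.

```javascript
// True when any <td> or <th> spans more than one column or row.
// colspan="1" and rowspan="1" are ignored because they do not merge anything.
function hasSpannedCells(tableHtml) {
  return /<t[dh]\b[^>]*\b(?:col|row)span\s*=\s*["']?(?:[2-9]|[1-9]\d+)/i.test(tableHtml);
}

// Split the tables in an HTML string into those to keep as raw HTML and those
// that are safe to convert to GFM tables. Assumes no nested tables.
function partitionTables(html) {
  const tables = html.match(/<table\b[\s\S]*?<\/table>/gi) || [];
  return {
    keepAsHtml: tables.filter((t) => hasSpannedCells(t)),
    convert: tables.filter((t) => !hasSpannedCells(t)),
  };
}

const sample =
  '<table><tr><td colspan="2">Merged</td></tr></table>' +
  '<table><tr><td colspan="1">A</td><td>B</td></tr></table>';
const parts = partitionTables(sample);
console.log(parts.keepAsHtml.length, parts.convert.length);
```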
Code block language detection
<pre><code class="language-javascript"> const x = 1; </code></pre>
Modern converters read class="language-*" and emit ```javascript. But some old CMSes use class="lang-js" or class="prettyprint" — you may need custom logic.
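The custom logic can be as small as a function that maps a class attribute to a fence language. Here is a sketch: `detectLanguage` and its alias table are assumptions for this example. With Turndown, a function like this could be called from a custom rule registered through `addRule`, though that wiring is not shown here.

```javascript
// Illustrative alias table; extend it for the languages your content uses.
const ALIASES = new Map([
  ["js", "javascript"],
  ["ts", "typescript"],
  ["py", "python"],
  ["sh", "bash"],
]);

// Map a class attribute such as "prettyprint lang-js" to a fence language.
// Returns "" when no language token is present.
function detectLanguage(className = "") {
  for (const token of className.split(/\s+/)) {
    const match = /^(?:language|lang)-(.+)$/.exec(token);
    if (match) {
      const lang = match[1].toLowerCase();
      return ALIASES.get(lang) || lang;
    }
  }
  return ""; // e.g. a bare "prettyprint" class carries no language
}

console.log(detectLanguage("language-javascript"));
console.log(detectLanguage("prettyprint lang-js"));
```

When the function returns an empty string, emitting a fence with no language tag is usually the safest fallback.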
Smart quotes and typography
HTML often contains smart quotes (‘ ’ and “ ”), em dashes (—), and non-breaking spaces (&nbsp;). These survive conversion, which is good, but some editors and git diffs render them oddly. If you want ASCII-only Markdown, run a sed or regex pass after conversion.
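A regex pass of that kind might look like the sketch below. The replacement choices (for example, two hyphens for an em dash) are assumptions made here. Adjust them to your house style, and skip this step entirely if you prefer to keep the typography.

```javascript
// Pairs of [pattern, ASCII replacement]. Characters are written as escapes so
// the script itself stays ASCII-only.
const REPLACEMENTS = [
  [/[\u2018\u2019]/g, "'"], // single smart quotes
  [/[\u201C\u201D]/g, '"'], // double smart quotes
  [/\u2014/g, "--"], // em dash
  [/\u2013/g, "-"], // en dash
  [/\u2026/g, "..."], // ellipsis
  [/\u00A0/g, " "], // non-breaking space
];

// Apply every replacement in order and return the ASCII-normalized text.
function toAscii(markdown) {
  return REPLACEMENTS.reduce((text, [pattern, ascii]) => text.replace(pattern, ascii), markdown);
}

console.log(toAscii("\u201CIt\u2019s fine\u201D \u2014 really\u00A0now\u2026"));
```

Note that this pass also rewrites characters inside code blocks, so run it only on content where that is acceptable.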
When to skip automatic conversion
Sometimes the HTML is so polluted (deeply nested divs, inline styles everywhere, WYSIWYG editor cruft) that conversion produces garbage. In that case:
- View the rendered page in a browser.
- Copy the visible text (Cmd/Ctrl+A, then Cmd/Ctrl+C).
- Paste as plain text into your Markdown editor.
- Re-apply formatting (headings, links, code) manually.
This is faster than fighting a converter for a single stubborn page, and you end up with cleaner Markdown.
After conversion
Once you have Markdown, run it back in the other direction as a sanity check:
- Use our Markdown to HTML tool to render the result.
- Compare side-by-side with the original.
- Fix what's broken, re-run.
For a large migration, automate this sanity check across 100% of your posts before going live.
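One cheap way to automate the comparison is to reduce both HTML documents to their visible text and check that the two strings match. The sketch below assumes you already have the re-rendered HTML from whatever Markdown renderer you use. `visibleText` and `roundTripMatches` are hypothetical helpers, and only a handful of common entities are decoded.

```javascript
// Reduce HTML to its visible text: drop script/style bodies, strip tags,
// decode a few common entities, and collapse whitespace.
function visibleText(html) {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&") // decoded last so "&amp;lt;" stays "&lt;"
    .replace(/\s+/g, " ")
    .trim();
}

// True when the original and the re-rendered HTML carry the same visible text.
function roundTripMatches(originalHtml, rerenderedHtml) {
  return visibleText(originalHtml) === visibleText(rerenderedHtml);
}

console.log(
  roundTripMatches(
    '<div class="post"><h1>Hello</h1><p>World &amp; co</p></div>',
    "<h1>Hello</h1>\n<p>World &amp; co</p>"
  )
);
```

This catches dropped or duplicated text but not lost formatting, so it complements the visual side-by-side check rather than replacing it.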
Summary
- Most HTML converts cleanly to Markdown. Forms, layout, inline styles do not.
- Use our browser tool for one-offs, Turndown or Pandoc for scripting.
- Always review converted output — the last 5% needs a human.
- For CMS migrations, the real work is in the pipeline around the converter (extraction, front matter, image paths), not the converter itself.
Master the workflow once and you'll never feel trapped by a CMS again.