blog.atwork.at

news and know-how about microsoft, technology, cloud and more.

Convert Word documents to Markdown, HTML or any other format

Sometimes it is useful to convert a document to another format. There are various tools for this; my favorite is Pandoc, which supports a few dozen file formats, from Word to Markdown and many more. Here is how to convert documents with Pandoc.

Download the Pandoc tool

Pandoc is a free, universal document converter, available for all major platforms. You can download the command line tool from https://pandoc.org/installing.html.

Once installed, you can use it from the command line or from PowerShell.

Convert between various formats and use options

Check out the basic usage at getting-started for examples. The most common switches are:

  • -f stands for --from : the source file format, e.g. html, markdown, etc.
  • -t stands for --to : the destination file format, e.g. html, markdown, etc.
  • -o stands for --output : the name of the generated file
  • -s stands for --standalone : produce output with an appropriate header and footer (e.g. a standalone HTML, LaTeX, TEI, or RTF file, not a fragment) or with metadata included
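
As a small sketch of how the short and long spellings relate, the two shell functions below run the same Markdown-to-HTML conversion. The function names convert_short and convert_long are made up for this illustration; the flags are the ones listed above.

```shell
#!/bin/sh
# Both functions take an input file and an output file and run the same
# conversion. The function names are hypothetical, chosen for this example.

# Short switches: -f, -t, -s, -o
convert_short() {
  pandoc -f markdown -t html -s -o "$2" "$1"
}

# Long switches: --from, --to, --standalone, --output
convert_long() {
  pandoc --from=markdown --to=html --standalone --output="$2" "$1"
}
```

For example, convert_long test1.md test1.html should produce the same file as convert_short test1.md test1.html.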

See all supported switches at options; there are many of them. For example, -d options.yaml reads a set of default option settings from a file, and --log=output.json writes the messages from the conversion to a file in machine-readable JSON format.
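
A defaults file could look like the following sketch. The keys from, to, standalone and output-file mirror the switches above; the values and the helper names write_defaults and convert_with_defaults are only assumptions for illustration. The log file is requested here with the long option --log.

```shell
#!/bin/sh
# Write a small defaults file. The file name options.yaml follows the text;
# the values are illustrative and should be adapted to your own conversion.
write_defaults() {
  cat > "${1:-options.yaml}" <<'EOF'
from: markdown
to: html
standalone: true
output-file: test1.html
EOF
}

# Convert the input file ($1) using the defaults file ($2) and write the
# conversion messages as JSON to the log file ($3).
convert_with_defaults() {
  pandoc -d "${2:-options.yaml}" --log="${3:-output.json}" "$1"
}
```

After write_defaults, a call such as convert_with_defaults test1.md reads the settings from options.yaml, so the command line itself stays short.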

So, here are some useful commands I use for converting documents.

Convert Markdown to HTML

To convert test1.md to test1.html, use this command.

pandoc test1.md -t html -s -o test1.html

With -s, the document includes an HTML header and default styles. Without -s, the output is only an HTML fragment containing the converted text.

To include a stylesheet styles.css:

pandoc test1.md -t html -s -c styles.css -o test3.html

Convert HTML to Markdown

To convert test1.html to mdtest1.md, use this command.

pandoc test1.html -t markdown -o mdtest1.md

Convert Word to Markdown

To save the images contained in a binary container (docx, epub, or odt), here a Microsoft Word document, to a directory, use the following command. It creates the folder images/media, extracts the media from the container, and keeps the original filenames.

pandoc --extract-media=images -s mydoc.docx -t markdown -o mddoc.md

In Word, image files actually live in a folder called "media" inside the docx, so the "media" folder is always created. To get a single directory level, with only the "media" directory, extract into the current directory with this command.

pandoc --extract-media=. -s mydoc.docx -t markdown -o mddoc.md

To produce GitHub-Flavored Markdown (gfm), use:

pandoc --extract-media=. -s mydoc.docx -t gfm -o mddoc.md
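
If you have a whole folder of Word documents, the same command can be wrapped in a loop. The following is a minimal sketch, not a finished tool: the function names md_name and convert_all_docx are hypothetical, and extracting the media into one folder per document is my own choice here, assuming that images in different documents may share the same filename and would otherwise overwrite each other.

```shell
#!/bin/sh
# Derive the Markdown file name from a Word file name,
# e.g. report.docx -> report.md
md_name() {
  printf '%s.md\n' "${1%.docx}"
}

# Convert every .docx file in the current directory to GitHub-Flavored
# Markdown. Media from mydoc.docx is extracted to ./mydoc/media.
convert_all_docx() {
  for doc in ./*.docx; do
    [ -e "$doc" ] || continue          # skip if no .docx file exists
    base="${doc%.docx}"                # file name without the extension
    pandoc --extract-media="$base" -s "$doc" -t gfm -o "$(md_name "$doc")"
  done
}
```

Run convert_all_docx in the folder that contains the documents; each mydoc.docx gets a mydoc.md next to it.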

Convert Word to HTML

To convert a Microsoft Word document to an HTML page, run this command.

pandoc --extract-media=. -s mydoc.docx -t html -c styles.css -o htmldoc.html

To get the desired result, define your styles.css, e.g. as here:

html {
   line-height: 1.7;
   font-family: sans-serif;
   font-size: 20px;
   color: #1a1a1a;
   background-color: #fdfdfd;
}

All images are stored in the media directory, as above. A table of contents is generated as anchor links. Headers and footers are skipped. Page breaks and page numbering are not preserved: the result is one long document. You can experiment with the many switches to adjust the output.

Convert Markdown to PDF

By default, Pandoc uses LaTeX to generate PDF documents, so you need to install a LaTeX processor first. See more at creating-a-pdf and check out www.tug.org/texlive/acquire-netinstall. You can download the package from mirror.ctan.org/.../install-tl-windows.exe (18MB). Without a LaTeX processor, Pandoc stops with the message "Please select a different --pdf-engine or install pdflatex". Once the LaTeX processor is installed, you can run the following command.

pandoc test1.md --pdf-engine=xelatex -o pdftest.pdf
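
To avoid running into that error message, a script can first check whether a LaTeX engine is installed. This is only a sketch under assumptions: the function names find_pdf_engine and md_to_pdf are made up, and the order xelatex, pdflatex, lualatex is simply my preference.

```shell
#!/bin/sh
# Print the name of the first LaTeX engine found on the PATH.
# Returns a non-zero status if none of them is installed.
find_pdf_engine() {
  for engine in xelatex pdflatex lualatex; do
    if command -v "$engine" >/dev/null 2>&1; then
      printf '%s\n' "$engine"
      return 0
    fi
  done
  return 1
}

# Convert a Markdown file ($1) to a PDF file ($2) with the engine found.
md_to_pdf() {
  engine=$(find_pdf_engine) || {
    echo "No LaTeX engine found; install a LaTeX processor first." >&2
    return 1
  }
  pandoc "$1" --pdf-engine="$engine" -o "$2"
}
```

With this in place, md_to_pdf test1.md pdftest.pdf either runs the conversion or tells you that the LaTeX processor is still missing.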

For more details on converting to PDF, see Converting Markdown to PDF or DOCX with Pandoc.

Convert to plain text

This option can be helpful to clean up long, formatted text. Here, we convert a Markdown document to a plain text file.

pandoc test1.md -f markdown -s -t plain -o plaintext.txt

Try it out online

...at https://pandoc.org/try.

Convert other document types

Again, many more formats are supported. Pandoc is a really cool and helpful tool. See more at demos.

Save yourself work with Pandoc!
