Sometimes it is useful to convert documents to another format. There are various tools for this, and my favorite is Pandoc. Pandoc supports a few dozen file formats, from Word to Markdown and many more. Here is how to convert documents with Pandoc.
Download the Pandoc tool
Pandoc is a free, universal document converter, available for Windows, macOS, and Linux. You can download the command line tool from https://pandoc.org/installing.html.
Once installed, you can use it from the command line or from PowerShell.
Convert between various formats and use options
Check out the basic usage at getting-started for examples. The most common switches are:
- -f stands for --from : the source file format, e.g. html or markdown
- -t stands for --to : the destination file format, e.g. html or markdown
- -o stands for --output : the name of the generated file
- -s stands for --standalone : produce output with an appropriate header and footer (e.g. a standalone HTML, LaTeX, TEI, or RTF file, not a fragment) or with metadata included
See the full list of supported switches at options; there are many of them. For example, -d options.yaml (short for --defaults) reads a set of default option settings from a file, and --log=output.json writes the messages from the conversion to a file in machine-readable JSON format.
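If you call Pandoc from a script, the common switches above can be assembled programmatically. The following is a minimal Python sketch, not part of Pandoc itself: the helper name build_pandoc_args and its parameters are made up for illustration, and only the -f, -t, -s, and -o switches described above are used.

```python
import os
import shutil
import subprocess


def build_pandoc_args(source, output, from_format=None, to_format=None, standalone=False):
    """Return the argument list for one pandoc call (hypothetical helper)."""
    args = ["pandoc", source]
    if from_format:
        args += ["-f", from_format]  # --from: source file format
    if to_format:
        args += ["-t", to_format]    # --to: destination file format
    if standalone:
        args.append("-s")            # --standalone: full document, not a fragment
    args += ["-o", output]           # --output: name of the generated file
    return args


if __name__ == "__main__":
    cmd = build_pandoc_args("test1.md", "test1.html", to_format="html", standalone=True)
    print(" ".join(cmd))
    # Run the conversion only when pandoc is installed and the input file exists.
    if shutil.which("pandoc") and os.path.exists("test1.md"):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.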
So, here are some useful commands I use for converting documents.
Convert Markdown to HTML
To convert test1.md to test1.html, use this command.
pandoc test1.md -t html -s -o test1.html
With -s, the output is a complete HTML document with a header and default styles. Without -s, Pandoc writes only an HTML fragment, that is, the converted body content without the surrounding header.
To include a stylesheet styles.css:
pandoc test1.md -t html -s -c styles.css -o test3.html
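To convert a whole folder of Markdown files the same way, the command can be generated once per file. This is a rough Python sketch under my own assumptions: the helper name html_commands is hypothetical, and it only builds the commands (using the same -t, -s, -c, and -o switches as above) so you can inspect them before running anything.

```python
import shutil
import subprocess
from pathlib import Path


def html_commands(folder, stylesheet=None):
    """Build one pandoc command per .md file in folder (hypothetical helper)."""
    commands = []
    for md_file in sorted(Path(folder).glob("*.md")):
        cmd = ["pandoc", str(md_file), "-t", "html", "-s"]
        if stylesheet:
            cmd += ["-c", stylesheet]  # link the given stylesheet
        cmd += ["-o", str(md_file.with_suffix(".html"))]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in html_commands(".", stylesheet="styles.css"):
        print(" ".join(cmd))
        # Run the command only when pandoc is installed.
        if shutil.which("pandoc"):
            subprocess.run(cmd, check=True)
```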
Convert HTML to Markdown
To convert test1.html to mdtest1.md, use this command.
pandoc test1.html -t markdown -o mdtest1.md
Convert Word to Markdown
Binary containers (docx, epub, or odt) can include images. To save the images from such a container, here a Microsoft Word document, to a directory, use the following command. It creates a folder images/media, extracts the media from the container, and keeps the original filenames.
pandoc --extract-media=images -s mydoc.docx -t markdown -o mddoc.md
In Word, image files actually live in a folder called "media" inside the docx, so the "media" folder will always be created. To get just the "media" folder without an extra parent directory, extract into the current directory with this command.
pandoc --extract-media=. -s mydoc.docx -t markdown -o mddoc.md
To produce GitHub-Flavored Markdown (gfm), we can use:
pandoc --extract-media=. -s mydoc.docx -t gfm -o mddoc.md
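To see why the "media" folder always shows up, it helps to look inside the container: a docx is a ZIP archive, and the images sit under word/media/. Here is a small Python sketch using only the standard library. The helper names are my own, and the sample archive only mimics the folder layout; it is not a valid Word file.

```python
import io
import zipfile


def list_docx_media(docx_file):
    """Return the names of the files stored under word/media/ in a .docx."""
    with zipfile.ZipFile(docx_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


def make_sample_archive():
    """Build a tiny in-memory ZIP that mimics the docx layout (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.png", b"placeholder")
        archive.writestr("word/media/image2.jpeg", b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_docx_media(make_sample_archive()))
```

With a real document, you would pass the path instead, for example list_docx_media("mydoc.docx").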
Convert Word to HTML
To convert a Microsoft Word document to an HTML page, run this command.
pandoc --extract-media=. -s mydoc.docx -t html -c styles.css -o htmldoc.html
To get the desired result, define your styles.css, e.g. as here:
html {
  line-height: 1.7;
  font-family: sans-serif;
  font-size: 20px;
  color: #1a1a1a;
  background-color: #fdfdfd;
}
All images will be stored in the media directory, as above. A table of contents is rendered as links to anchors in the page. Headers and footers are skipped. Page numbering and page separation are not preserved; the result is one long document. You can fine-tune the output with the many switches.
Convert Markdown to PDF
By default, Pandoc uses LaTeX to generate PDF documents, so you need to install a LaTeX distribution first. See more at creating-a-pdf and check out www.tug.org/texlive/acquire-netinstall. You can download the installer from mirror.ctan.org/.../install-tl-windows.exe (18MB). Without a LaTeX engine, Pandoc stops with a message such as "Please select a different --pdf-engine or install pdflatex". Once LaTeX is installed, you can run the following command.
pandoc test1.md --pdf-engine=xelatex -o pdftest.pdf
For more details on converting to PDF, see Converting Markdown to PDF or DOCX with Pandoc.
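To avoid the missing-engine error in a script, you can check which LaTeX engine is on the PATH before calling Pandoc. The sketch below is only an assumption about how one might do this: the helper names are made up, and the candidate list (xelatex, pdflatex, lualatex) is my own choice of engines to try.

```python
import shutil


def pick_pdf_engine(candidates=("xelatex", "pdflatex", "lualatex"), which=shutil.which):
    """Return the first engine found on the PATH, or None (hypothetical helper)."""
    for engine in candidates:
        if which(engine):
            return engine
    return None


def pdf_command(source, output, engine):
    """Build the pandoc call for a PDF conversion with an explicit engine."""
    return ["pandoc", source, f"--pdf-engine={engine}", "-o", output]


if __name__ == "__main__":
    engine = pick_pdf_engine()
    if engine is None:
        print("No LaTeX engine found; install one before converting to PDF.")
    else:
        print(" ".join(pdf_command("test1.md", "pdftest.pdf", engine)))
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.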
Convert to plain text
This option can be helpful to clean up long, heavily formatted text. Here, we convert a Markdown document to a plain text file.
pandoc test1.md -f markdown -s -t plain -o plaintext.txt
Try it out online
You can try Pandoc in the browser at https://pandoc.org/try.
Convert other document types
Again, many more formats are supported. Pandoc is a really cool and helpful tool. See more at demos.
Save yourself work with Pandoc!