Rmodepdf – convert web pages to PDF using LaTeX

Michal Hoftich <[email protected]>

Version devel, November 14, 2024
Homepage: https://www.kodymirus.cz/texblend/
Issue tracker: https://github.com/michal-h21/texblend

Contents

1 Introduction
2 Usage
3 Command Line Options
3.1 Image Handling
3.2 MathJax Support
3.3 Page format
4 Configuration
4.1 The document table
4.2 The pages table
5 LaTeX Templates
5.1 Syntax
5.1.1 Variable expansion
5.2 Required packages
6 Scripting
7 License
8 Changelog

1 Introduction

This utility converts the text content of web pages to PDF using LaTeX. The text content is extracted using rdrview¹, a utility that provides a port of Firefox’s reader view functionality. This means that it strips away clutter like buttons, ads, background images, and videos, leaving only the article text.

It doesn’t support any CSS or JavaScript, only plain HTML. The main purpose is to create versions of longer articles suitable for reading on e-readers, tablets and phones. Another possible use is printing web pages.

2 Usage

The basic usage is as follows:

$ rmodepdf <url>

If the compilation goes well, Rmodepdf should print a message like:

[STATUS]  rmodepdf: File saved as: Page_Title.pdf

The name of the PDF file is based on the web page title. You can choose a different filename using the -o option:

$ rmodepdf -o sample <url>

You can also compile several web pages at once. Rmodepdf converts all URLs passed as arguments into one PDF, with the filename based on the first page’s title:

$ rmodepdf <url1> <url2> <url3>

Instead of URLs, you can also pass the names of local files, or pass the HTML code on the standard input with the - option:

$ rmodepdf - < localfile.html

3 Command Line Options

-b,--baseurl       (default "")      Base URL used when the HTML content is read
                                     from the standard input
-c,--configfile    (default "")      Filename of Lua configuration file
-h,--help                            Print help message
-H,--nohyperlinks                    Don't create special elements for internal hyperlinks
-i,--imgdir        (default "")      Download images and save them to the
                                     specified directory
-l,--loglevel      (default status)  Set log level
                                     possible values: debug, info, status,
                                     warning, error, fatal
-n,--noimages                        Don't download images
-N,--nomathjax                       Don't process LaTeX commands in the HTML
                                     document
-t,--template      (default "")      LaTeX template
-o,--output        (default "")      Output file name
-p,--pageformat    (default ebook)   Page format
-R,--nordrview                       Don't use rdrview to get the clean contents
                                     from the web pages
-s,--pagestyle     (default empty)   \pagestyle for the document
-p,--print                           Print the converted LaTeX source
-v,--version                         Print version
<url>              (string)

3.1 Image Handling

By default, Rmodepdf downloads all images and saves them as temporary files which are removed after each run. If you want to reuse these images, use the --imgdir option. It expects an existing directory where images should be saved.

$ rmodepdf -i img <url>

If you read HTML content from the standard input, you can use the --baseurl option to point to the address where images should be looked up.

The --noimages option, on the other hand, disables the downloading of images.

3.2 MathJax Support

Rmodepdf expects web pages to use the MathJax or KaTeX libraries, which enable LaTeX syntax for math in the HTML content. In some cases this can lead to errors, for example if LaTeX commands are displayed in the HTML code outside of <code> or <pre> elements. The --nomathjax option disables the passing of LaTeX commands to the resulting document.

3.3 Page format

4 Configuration

add_to_config {
  img_convert = {
    -- modify the command used for conversion of svg images to
    -- a format suitable for LuaLaTeX
    svg = "cairosvg -o ${dest} -",
  },
  html_latex = { -- support for LaTeX math in webpages that use MathJax or KaTeX
    ignored = {"pre", "code"}, -- html elements which shouldn't be processed for LaTeX commands
    allowed_commands = {"ref", "pageref", "cleveref", "nameref"}
  },
}

function post_process()
  -- set French as a main document language
  table.insert(config.document.languages, "french")
end

4.1 The document table

preamble_extras

– additional code to be inserted at the end of the document preamble. For example font settings, extra packages, etc.

geometry

– string with page dimensions in a format suitable for the geometry package.

pagestyle

– document page style.

languages

– list of languages used by the processed pages. This is populated during page processing.
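
As an illustration of the fields above, a configuration file might set them through add_to_config. This is only a sketch: it assumes that nested fields of the document table are merged in the same way as the img_convert and html_latex tables in the earlier example, and the values themselves (A5 paper, the plain page style, the microtype package) are arbitrary choices. How these settings interact with the --pageformat and --pagestyle command line options is not covered here.

```lua
-- Configuration file sketch, passed with: rmodepdf -c config.lua <url>
add_to_config {
  document = {
    -- ASSUMPTION: nested document fields are merged like img_convert above
    geometry = "a5paper,margin=1.5cm",          -- options for the geometry package
    pagestyle = "plain",                        -- argument of \pagestyle
    preamble_extras = "\\usepackage{microtype}", -- inserted at the end of the preamble
  },
}
```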

4.2 The pages table

The config.pages table contains a list of all processed HTML documents and their metadata. Each item in the list contains the following properties:

language

– language of the document.

content

– result of HTML to TeX conversion.

author

– document author.

title

– document title.

url

– document URL.
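
The following sketch shows how a list shaped like config.pages could be walked in Lua. The helper name fill_missing_authors and the mock data are hypothetical; whether a missing author is stored as nil or as an empty string is an assumption, so the helper checks for both.

```lua
-- Hypothetical helper: give every page without an author a fallback value.
local function fill_missing_authors(pages, fallback)
  for _, page in ipairs(pages) do
    -- ASSUMPTION: a missing author is either nil or an empty string
    if page.author == nil or page.author == "" then
      page.author = fallback
    end
  end
  return pages
end

-- Mock data using the documented properties of each item in config.pages.
local pages = {
  { title = "First article", author = "Jane Doe", language = "english",
    url = "https://example.com/a", content = "..." },
  { title = "Second article", language = "french",
    url = "https://example.com/b", content = "..." },
}

fill_missing_authors(pages, "Unknown author")
print(pages[1].author, pages[2].author)
```

In a real configuration file, such a helper could be applied to config.pages from the late post-processing callback described in the Scripting section.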

5 LaTeX Templates

5.1 Syntax

Variable Printing

@{variablename}: Variables are contained in the config table. Using a dot, properties of sub-tables can also be printed. For example, @{document.preamble_extras} prints the config.document.preamble_extras variable.

Loops

_{variablename}loop code/{separator}: Variables used must be arrays. Examples are document.languages, which contains the languages of all converted documents in a format suitable for the Babel package, and pages, which contains all converted documents. In the loop code, variables of the currently processed array are available. If the array contains only strings, the placeholder %s can be used, as with document.languages. If the current object is a table, its fields can be accessed directly using @{variablename}.

Conditions

?{variablename}{true}{false}: Used to insert elements like the title and author, which may not be present on all pages.

\documentclass{article}
\usepackage{linebreaker,responsive}
\usepackage[_{document.languages}%s/{,}]{babel}
\usepackage[@{document.geometry}]{geometry}
\pagestyle{@{document.pagestyle}}
@{document.preamble_extras}
\begin{document}
_{pages}
\selectlanguage{@{language}}
?{title}{Title: @{title}\par}{}
?{author}{Author: @{author}\par}{}
\href{@{url}}{@{url}}\par
@{content}
/{\clearpage}
\end{document}

Note that when processing an array, we must distinguish whether it contains strings or tables. Strings are displayed using %s. If it is a table, its elements become active variables and can be displayed using @{variablename}. You can see the difference in processing the array document.languages, which contains languages as strings, and pages, which contains tables with metadata from processed pages.
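
To make the three constructs more concrete, here is a minimal Lua sketch of an expander for this syntax. It is not Rmodepdf's actual implementation: the function names lookup, expand_loop and expand are hypothetical, and edge cases such as nested loops or page content that itself contains template markers are ignored. It only demonstrates the difference between arrays of strings (filled in through %s) and arrays of tables (whose fields become the active variables).

```lua
-- Resolve a dotted path such as "document.geometry" inside a table.
local function lookup(vars, path)
  local value = vars
  for key in path:gmatch("[^.]+") do
    if type(value) ~= "table" then return nil end
    value = value[key]
  end
  return value
end

local expand -- forward declaration: loop bodies are expanded recursively

-- Expand one loop: strings fill %s, tables become the active variables.
local function expand_loop(vars, path, body, separator)
  local items = {}
  for _, item in ipairs(lookup(vars, path) or {}) do
    if type(item) == "table" then
      items[#items + 1] = expand(body, item)
    else
      items[#items + 1] = (body:gsub("%%s", function() return tostring(item) end))
    end
  end
  return table.concat(items, separator)
end

function expand(template, vars)
  -- loops: _{name}loop code/{separator}
  template = template:gsub("_{([%w_.]+)}(.-)/{(.-)}", function(path, body, sep)
    return expand_loop(vars, path, body, sep)
  end)
  -- conditions: ?{name}{true}{false}; %b{} matches a balanced brace group
  template = template:gsub("%?{([%w_.]+)}(%b{})(%b{})", function(path, yes, no)
    local value = lookup(vars, path)
    local branch = (value ~= nil and value ~= "") and yes or no
    return branch:sub(2, -2) -- strip the outer braces
  end)
  -- variables: @{name}
  return (template:gsub("@{([%w_.]+)}", function(path)
    local value = lookup(vars, path)
    return value ~= nil and tostring(value) or ""
  end))
end

-- Mock configuration: languages are strings, pages are tables.
local config = {
  document = { languages = { "english", "french" } },
  pages = {
    { title = "First", content = "Text A" },
    { content = "Text B" }, -- no title, so the false branch is used
  },
}

local result = expand(
  "[_{document.languages}%s/{,}] _{pages}?{title}{@{title}: }{}@{content}/{ | }",
  config)
print(result) -- prints: [english,french] First: Text A | Text B
```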

5.1.1 Variable expansion

5.2 Required packages

The default templates used for conversion from HTML to LaTeX utilize some commands that are not available in pure LaTeX. If you are creating your own template, it is necessary to use the following packages in it to avoid compilation errors.

cals

– table support

csquotes

– multilingual support for in-text quotes

adjustbox

– automatic resizing of images, to fit into page dimensions

responsive

– set font sizes to fit into page dimensions

6 Scripting

The configuration script is executed before the actual conversion, so it cannot directly influence the conversion process. However, we can define several callback functions that allow us to affect the conversion. These functions are as follows:

preprocess_content

– modify string with the raw HTML before readability and DOM parsing.

preprocess_dom

– modify DOM object before fetching of images or handling of MathJax.

postprocess_dom

– modify DOM after all processing by Rmodepdf.

postprocess

– late post-processing of the config table.
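
As a sketch of how the first of these callbacks might be used, the following configuration code removes <aside> elements from the raw HTML. The calling convention is an assumption: the hook is taken to be a global function that receives the HTML string and returns the modified string, which this section does not spell out. The helper strip_elements is hypothetical, and its pattern-based matching is deliberately simple (no nested elements, case-sensitive tag names).

```lua
-- Hypothetical helper: remove <tag ...>...</tag> pairs from an HTML string.
-- Works for plain alphabetic tag names without nesting.
local function strip_elements(html, tag)
  return (html:gsub("<" .. tag .. "[^>]*>.-</" .. tag .. ">", ""))
end

-- ASSUMPTION: the hook receives the raw HTML string and returns the new one.
function preprocess_content(content)
  return strip_elements(content, "aside")
end

-- Quick check on a mock document.
local sample = "<p>a</p><aside class='promo'>ad</aside><p>b</p>"
print(preprocess_content(sample))
```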

7 License

Permission is granted to copy, distribute and/or modify this software under the terms of the LaTeX Project Public License, version 1.3.

8 Changelog

2024-07-25

Use special elements for internal links in the document. This can be disabled with the --nohyperlinks option.

2024-07-22

Changed the MathJax handling code. It now adds a special element only for the math itself, not for the surrounding text.

2024-06-13

Added new hook, preprocess_content(), for modifying the raw HTML string

Cleanup of some unused code

Added add_to_config() function, for modifying the configuration table.

2024-06-12

Provided new templating mechanism that doesn’t depend on LuaXML templates

2024-04-09

Added --nordrview option

Basic metadata parsing if rdrview is not available or is disabled

2024-04-08

ChangeLog start

¹ https://github.com/eafer/rdrview