From various sources (HTML, ePub) create PDF suitable for various outputs:
Two packages that simplify automatic typesetting
A script that converts web pages to PDF.
Homepage
https://github.com/michal-h21/rmodepdf/
Page with Control Elements and Ads
Page in Reader Mode in Firefox
Readability.js | https://github.com/mozilla/readability |
Python-readability | https://github.com/buriy/python-readability |
Rdrview | https://github.com/eafer/rdrview |
For our purpose, Rdrview is the most suitable of these projects because it is a simple C program that is fast and does not require installing additional dependencies.
How Do We Load and Transform HTML Files?
LuaXML contains two libraries for processing and transforming HTML:
luaxml-transform: a library for converting XML to other formats, such as TeX
luaxml-domobject: a library that can now load HTML files
Rmodepdf accepts multiple URLs or filenames as arguments:
# process url1 and url2
$ rmodepdf <url1> <url2>
It can also read from the standard input:
# process local foo.html passed from the standard input
# "-" will tell rmodepdf to read from stdin
$ cat foo.html | rmodepdf --baseurl foo -
The --baseurl option is necessary for downloading images. If the document doesn't contain any external images, use a bogus value for the base URL.
Rmodepdf merges downloaded pages into a single output TeX document, which is then immediately compiled.
For each page, it displays a header with basic document information and a table of contents. This is followed by the text of the document.
# pipe the generated TeX code to foo.tex
$ rmodepdf -p <url> > foo.tex
If we print the page using the -p option, the generated TeX code is written to the standard output, and no compilation occurs.
# save as foo.pdf
$ rmodepdf -o foo.pdf <url>
The output file name is based on the title of the first page. If no title is detected on the page, the output file name is generated from the template rmodepdf-%Y%m%d-%H-%M. You can choose another name using the -o or --output option.
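The fallback template uses strftime-style placeholders. The following is a minimal sketch of how such a template expands, assuming that Rmodepdf treats it the same way Lua's os.date does (the slides do not say which function is used). The timestamp and the variable names are made up for illustration.

```lua
-- Fallback name template quoted in the text above.
local template = "rmodepdf-%Y%m%d-%H-%M"

-- A fixed, made-up timestamp so that the result is reproducible;
-- a real run would use the current time instead.
local when = os.time{year = 2024, month = 5, day = 17, hour = 9, min = 30, sec = 0}

-- os.date expands strftime-style placeholders such as %Y or %H.
local basename = os.date(template, when)
print(basename .. ".pdf")
```

With this timestamp the sketch prints rmodepdf-20240517-09-30.pdf, so files created in different minutes get different names.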
# use A4 format for the paper size
# use plain page style
$ rmodepdf -P a4paper -s plain <url>
You can choose a different page size using the -P option. By default, the page size and margins are set for e-book readers, but you can also select other sizes, such as A4 paper size. The page style is currently set to empty (blank), but you can change it using the -s option.
# save the document as foo.pdf and
# save images in the temp dir
$ rmodepdf -o foo.pdf -i /tmp/img <url>
To enhance speed, images are stored in a local directory. By default, this is the img/ subdirectory within the current directory, but you can specify a different directory using the -i option.
-n: don't download images
-N: don't process LaTeX math in pages
-R: don't run Rdrview
debug messages log level
You can disable image downloading entirely with the -n option. Rmodepdf also detects and displays LaTeX mathematical commands embedded in web pages that use MathJax or KaTeX for rendering. This default behavior can be disabled using the -N option. Additionally, the removal of page elements using Rdrview can be disabled with the -R option.
Loading of the Configuration File
# load script.lua as the configuration file
$ rmodepdf -c script.lua <url>
Using a configuration file, we can declare custom rules for transforming HTML to LaTeX, change the document template, load extra packages, or modify the processed page before transformation.
add_to_config {
  document = {
    preamble_extras = [[
      \setmainfont{Linux Libertine O}
    ]],
  },
  img_convert = {
    -- modify the command used for
    -- conversion of SVG images to PDF
    svg = "cairosvg -o ${dest} -",
  },
}
This example uses the command add_to_config, which safely copies new configuration values into the original configuration. If you only want to set a single configuration value, you can also directly write to the config table:
-- change settings for the Geometry package
config.document.geometry = "a6paper"
Note that settings for Geometry are automatically generated by selecting the -P or --pageformat option. You can overwrite these settings using this variable.
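To illustrate why add_to_config is described as copying values "safely", here is a minimal sketch of a recursive merge. The function merge_config and the default values are hypothetical and written only for this illustration; they are not Rmodepdf's actual implementation.

```lua
-- Hypothetical helper: recursively copy values from `new` into `target`.
local function merge_config(target, new)
  for key, value in pairs(new) do
    if type(value) == "table" and type(target[key]) == "table" then
      -- descend into sub-tables so that sibling keys are preserved
      merge_config(target[key], value)
    else
      target[key] = value
    end
  end
  return target
end

-- Made-up default configuration used only for this demonstration.
local config = {
  document = { geometry = "a5paper", preamble_extras = "" },
}

merge_config(config, {
  document = { preamble_extras = [[\setmainfont{Linux Libertine O}]] },
})

print(config.document.geometry)        -- still "a5paper"
print(config.document.preamble_extras) -- the new font setup
```

A plain assignment such as config.document = { ... } would instead replace the whole sub-table and lose the geometry value, which is the kind of accident a merging helper avoids.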
The configuration script is executed before the actual conversion, so it cannot directly influence the conversion process. However, we can define several callback functions that allow us to affect the conversion. These functions are as follows:
modify string with the raw HTML before readability and DOM parsing.
modify DOM object before fetching of images or handling of MathJax.
modify DOM after all processing by Rmodepdf.
late post-processing of the config table.
In the following text, we will introduce some new features provided by LuaXML, namely DOM processing and transforming to other formats.
function postprocess_dom(dom)
  print(dom:serialize())
  return dom
end
This example is useful because it lets you view the DOM serialized back into HTML. You can see all the elements that remain from the original HTML document after processing by Rdrview and by the functions of Rmodepdf. It is important to return the DOM at the end of the function: this ensures that any modifications made to the DOM are preserved and applied to the final document.
Here’s a slightly more complex example. Let’s assume that Rdrview did not remove a menu that might look like this:
<div class="menu">
  ... menu contents ...
</div>
We can use the postprocess_dom function to remove this menu:
function postprocess_dom(dom)
  -- Find the menu using a CSS selector
  local menu = dom:query_selector(".menu")
  -- Iterate over the menu elements
  -- and remove each one
  for _, el in ipairs(menu) do
    el:remove_node()
  end
  -- Return the modified DOM
  return dom
end
In this example:
We use the query_selector method to find all elements with the class menu.
Iterate over each element retrieved in the previous step using a for loop.
Remove each menu element using the remove_node method.
Return the modified DOM at the end of the function.
This ensures that any remaining menus are removed from the final document.
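The same pattern extends to several selectors at once. The sketch below uses only the query_selector and remove_node methods shown above. The ".sidebar" selector is a made-up addition, and the small fake DOM at the end is a stand-in so that the code can be tried in plain Lua without LuaXML.

```lua
-- Selectors to strip; ".menu" comes from the example above,
-- ".sidebar" is a made-up addition.
local unwanted = { ".menu", ".sidebar" }

function postprocess_dom(dom)
  for _, selector in ipairs(unwanted) do
    for _, el in ipairs(dom:query_selector(selector)) do
      el:remove_node()
    end
  end
  return dom
end

-- Tiny stand-in for a LuaXML DOM, only so that the sketch runs in plain Lua.
-- It returns one fake element per selector and records each removal.
local removed = {}
local fake_dom = {
  query_selector = function(self, selector)
    local el = { remove_node = function() removed[#removed + 1] = selector end }
    return { el }
  end,
}
assert(postprocess_dom(fake_dom) == fake_dom)
print(table.concat(removed, " "))
```

In a real configuration file only the unwanted table and the postprocess_dom function would be needed; Rmodepdf supplies the DOM.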
Other Useful LuaXML DOM Functions
get element attribute
set element text
get text content of the element
get element name
There are many more functions:
See the LuaXML documentation for the API docs and examples of use.
LuaXML allows us to create rules for transforming the DOM into various formats. Rmodepdf includes rules for transforming basic HTML elements into LaTeX.
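htmlprocess.add_action: add a new rule
htmlprocess.add_custom_action: process element using Lua
htmlprocess.reset_actions: remove rules for the given selector
%s: insert transformed contents of the element
@{attribute}: insert value of an attribute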
In the configuration file, the variable htmlprocess contains an object with rules for converting HTML elements. It provides two main functions: htmlprocess.reset_actions, which clears all rules for a given selector, and htmlprocess.add_action, which adds new rules.
A more powerful tool is the htmlprocess.add_custom_action function, which enables processing of elements in Lua. For an example of its usage, consult the LuaXML documentation.
The following code displays some basic usage of the transformation library:
htmlprocess.reset_actions("figure")
htmlprocess.reset_actions("img")
htmlprocess.add_action("img", [[\includegraphics[max width=\textwidth]{@{src}}]])
htmlprocess.add_action("figure", "\n\n \\noindent %s")
htmlprocess.add_action(".sample .foo", "hello: %s")
In this example, we change the default formatting for the <figure> element and include the text that is contained inside using the %s instruction. For the <img> element, we use the src attribute to get the image file name. As this element cannot contain any child elements, we don't need to use %s in this action. .sample .foo is an example of using HTML class attributes in actions.
Using Lua's string syntax [[ ... ]] allows for easy insertion of LaTeX commands without the need for backslash doubling. When using regular quotes, as you can see in the rule declaration for figure, backslashes must be doubled.
<figure>
<img src="hello.png" />
</figure>
<p class="sample"><span class="foo">Matched
<p><span class="foo">Not matched
This is a small HTML snippet that shows our transformation rules in use. Note the unclosed <p> elements, which would cause errors in XML. Thanks to CSS selectors, only the text in the first paragraph is selected. The second one is not, because its span element with the foo class is not a descendant of an element with the sample class.
\noindent \includegraphics[max width=\textwidth]{hello.png}
hello: Matched
Not matched
# require template
$ rmodepdf -t mytemplate.tex <url>
@{variablename}: Prints a variable from the config table or its sub-tables.
_{variablename}loop code/{separator}: Iterates over array variables, using %s placeholders or accessing fields directly.
?{variablename}{true}{false}: Evaluates a condition to insert content based on the presence of variables.
% loop over languages
\usepackage[_{document.languages}%s/{,}]{babel}
% use geometry settings
\usepackage[@{document.geometry}]{geometry}
@{document.preamble_extras}
\begin{document}
% loop over documents
_{pages}
\selectlanguage{@{language}}
% conditionally print title
?{title}{Title: @{title}\par}{}
% document contents
@{content}
/{\clearpage}
\end{document}
Although this example is not complete, it demonstrates the available syntax in templates. Note that when processing an array, we must distinguish whether it contains strings or tables. Strings are displayed using %s. If it is a table, its elements become active variables and can be displayed using @{variablename}. You can see the difference in processing the array document.languages, which contains languages as strings, and pages, which contains tables with metadata from processed pages.
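To make the @{variablename} lookup in sub-tables more concrete, here is a minimal sketch of how such placeholders could be expanded in plain Lua. The functions lookup and expand are hypothetical and cover only the @{...} form; this is not the template engine that Rmodepdf uses.

```lua
-- Look up a dotted path such as "document.geometry" in a table.
local function lookup(tbl, path)
  local value = tbl
  for key in path:gmatch("[^.]+") do
    if type(value) ~= "table" then return nil end
    value = value[key]
  end
  return value
end

-- Replace every @{path} with the matching value;
-- unknown names are left in the text unchanged.
local function expand(template, variables)
  return (template:gsub("@{([%w_.]+)}", function(path)
    local value = lookup(variables, path)
    if value ~= nil then return tostring(value) end
  end))
end

-- Made-up configuration table for the demonstration.
local config = { document = { geometry = "a6paper" } }

-- [=[ ... ]=] is a long string, so the backslash needs no doubling.
print(expand([=[\usepackage[@{document.geometry}]{geometry}]=], config))
```

The sketch prints \usepackage[a6paper]{geometry}. Loops and conditions would need additional handling that is omitted here.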
So far, we have explored the features of Rmodepdf and LuaXML. Now, we will focus on additional packages that can be used independently to facilitate automated typesetting of documents.
Thanks to responsive design techniques, the same page code can be displayed well both on a large monitor and on mobile devices.
Page Example on a Large Monitor
Page Example on a Small Screen
A package inspired by responsive design methods for web pages
Homepage
https://ctan.org/pkg/responsive
Various sizes of spaces and other elements depend on the font size, so the Responsive package adjusts them with each font size change to match the new size.
Setting Font Size Based on Display Size
Font size can be set using the command \setsizes{number of characters per line}.
\begin{minipage}{5cm}
\setsizes{25}
\lipsum[1]
\end{minipage}
Difference in Font Size Based on Number of Characters
\setsizes{55}
\setsizes{25}
Options can be set when calling the package or later using the command \ResponsiveSetup.
Important options:
do not set font size automatically at the beginning of the document
number of characters when automatically setting the font size
typographic scale used for font sizes
ratio used when calculating line height
When the Responsive package changes the base font size, it automatically adjusts the sizes used for \large, \small, and other commands, as well as line height and other fundamental dimensions.
Line height can be influenced by the lineratio option. The higher its value, the smaller the distance between lines.
\ResponsiveSetup{lineratio=38}
\ResponsiveSetup{lineratio=34}
Inspired by this article:
https://www.smashingmagazine.com/2020/07/css-techniques-legibility/
Example 3. Change text color depending on the page width
body {
  color: green;
}
@media screen and (max-width: 600px) {
  body {
    color: blue;
  }
}
This example sets a different text color for documents on screens with a maximum width of 600 pixels.
Using the \mediaquery command, we can test various properties:
Additional tests can be easily added.
This example displays fewer characters per line if the text width is less than or equal to 4 cm.
\mediaquery{max-textwidth=4cm}
  {\setsizes{45}}{\setsizes{60}}
Do Media Queries Make Sense in LaTeX?
Prevents the occurrence of overfull lines by adjusting \tolerance and \emergencystretch.
Homepage
https://ctan.org/pkg/linebreaker
Figure: the same sample paragraph typeset without Linebreaker and with Linebreaker
Linebreaker can be configured using the \linebreakersetup command:
maxcycles: number of attempts to retypeset a paragraph
maxemergencystretch: maximum value of \emergencystretch
maxtolerance: maximum value of \tolerance
Example configuration:
\linebreakersetup{
  maxtolerance = 90, % default 9999
  maxemergencystretch = 1em, % default 3em
  maxcycles = 4 % default 30
}
When Linebreaker detects paragraph overflow, it attempts to typeset it again with increasing \tolerance and \emergencystretch values. These values are incremented by a specified number of steps until they reach the maximum values configured in Linebreaker. If a value is found where the paragraph no longer overflows, processing stops, and those values are used.
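The retry loop described above can be sketched in plain Lua. This is only an illustration under stated assumptions: the function find_parameters, the overflows callback, and the linear stepping from zero to the maximum are all hypothetical, and the package itself may choose its starting values and increments differently.

```lua
-- `overflows(tolerance, stretch)` is a hypothetical callback that re-typesets
-- the paragraph and reports whether any line is still overfull.
local function find_parameters(overflows, opts)
  for cycle = 1, opts.maxcycles do
    -- grow both parameters linearly up to their configured maximum
    local fraction = cycle / opts.maxcycles
    local tolerance = opts.maxtolerance * fraction
    local stretch = opts.maxemergencystretch * fraction
    if not overflows(tolerance, stretch) then
      -- first setting without overflow wins; stop trying
      return tolerance, stretch, cycle
    end
  end
  return nil -- no setting within the limits removed the overflow
end

-- Made-up paragraph that fits once the stretch reaches 1.5 (in em).
local function overflows(tolerance, stretch)
  return stretch < 1.5
end

local tolerance, stretch, cycle = find_parameters(overflows, {
  maxcycles = 4, maxtolerance = 9999, maxemergencystretch = 3,
})
print(tolerance, stretch, cycle)
```

In this made-up case the search stops in the second of four cycles, so the paragraph is set with half of the maximum values rather than the full ones. Stopping at the first success keeps the spacing as tight as the limits allow.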
Other useful packages for automatic typesetting
prevents widows and orphans.
prevents single chars at end of lines for Czech and Slovak, prevents line breaks in SI units and academic titles.
Thank you for your attention!