Web Page to PDF Conversion with Rmodepdf: Leveraging LuaLaTeX for E-book Reader-friendly Documents

Michal Hoftich

July 19, 2024

1 Introduction

What Do We Want to Achieve?

From various sources (HTML, ePub) create PDF suitable for various outputs:

Why?  

What Will I Show?  

2 How Do We Convert HTML to PDF for an E-reader?

Rmodepdf  

A script that converts web pages to PDF.

Homepage

https://github.com/michal-h21/rmodepdf/

Page with Control Elements and Ads  


Page in Reader Mode in Firefox  


Reader Mode for Scripts  

Readability.js: https://github.com/mozilla/readability
Python-readability: https://github.com/buriy/python-readability
Rdrview: https://github.com/eafer/rdrview

For our purpose, Rdrview is the most suitable of these projects because it is a simple C program that is fast and does not require installing additional dependencies.

How Do We Load and Transform HTML Files?  

LuaXML contains two libraries for HTML processing and transforming

3 Rmodepdf usage

Basic Usage  

Rmodepdf accepts multiple URLs or filenames as arguments:

# process url1 and url2
$ rmodepdf <url1> <url2>
    

It can also read from the standard input:

# process local foo.html passed from the standard input
# "-" will tell rmodepdf to read from stdin
$ cat foo.html | rmodepdf --baseurl foo -

The --baseurl option is needed to download images. If the document doesn’t contain any external images, use a bogus value for the base URL.
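To see why a base URL matters, consider how a relative image reference is resolved. The sketch below uses Python's standard urllib.parse.urljoin purely as an illustration. The base URL and the image paths are made-up values, and Rmodepdf's own resolution logic may differ in detail.

```python
from urllib.parse import urljoin

# Hypothetical base URL, standing in for the value passed via --baseurl
base_url = "https://example.com/articles/foo.html"

# Image references as they might appear in <img src="..."> attributes
sources = [
    "img/figure.png",
    "/static/logo.svg",
    "https://cdn.example.org/photo.jpg",
]

# Relative references are resolved against the base; absolute URLs pass through
resolved = [urljoin(base_url, src) for src in sources]

for src, url in zip(sources, resolved):
    print(f"{src} -> {url}")
```

Without a base URL, the first two references could not be turned into downloadable addresses, which is why input read from stdin needs the option.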

Rmodepdf merges downloaded pages into a single output TeX document, which is then immediately compiled.

Example Output  


For each page, it displays a header with basic document information and a table of contents. This is followed by the text of the document.

Print Transformed LaTeX Code  

# pipe the generated TeX code to foo.tex
$ rmodepdf -p <url> > foo.tex

With the -p option, the generated TeX code is written to the standard output and no compilation occurs.

Output File Name  

# save as foo.pdf
$ rmodepdf -o foo.pdf <url>
  

The output file name is based on the title of the first page. If no title is detected on the page, the name follows the template rmodepdf-%Y%m%d-%H-%M. You can choose another name using the -o or --output option.
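As an illustration of the fallback template, the following sketch expands it for a fixed point in time. It assumes the placeholders follow the usual strftime conventions (as they do in Lua's os.date); the function name fallback_name is hypothetical and not part of Rmodepdf.

```python
from datetime import datetime

# Fallback template quoted in the text above
FALLBACK_TEMPLATE = "rmodepdf-%Y%m%d-%H-%M"


def fallback_name(now: datetime) -> str:
    """Expand the strftime-style placeholders for the given time."""
    return now.strftime(FALLBACK_TEMPLATE)


# Fixed timestamp so that the result is reproducible
example = fallback_name(datetime(2024, 7, 19, 9, 5))
print(example)  # rmodepdf-20240719-09-05
```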

Choose Page Format and Style  

# use A4 format for the paper size
# use plain page style
$ rmodepdf -P a4paper -s plain <url>

You can choose a different page size using the -P option. By default, the page size and margins are set for e-book readers, but you can also select other sizes, such as A4. The default page style is empty (blank); you can change it using the -s option.

Change Image Directory  

# save the document as foo.pdf and
# save images in the temp dir
$ rmodepdf -o foo.pdf -i /tmp/img <url>

To enhance speed, images are stored in a local directory. By default, this is the img/ subdirectory within the current directory, but you can specify a different directory using the -i option.

Other Options  

-n

don’t download images

-N

don’t process LaTeX math in pages

-R

don’t run Rdrview

-l

debug messages log level

You can disable image downloading entirely with the -n option. Rmodepdf also detects and displays LaTeX mathematical commands embedded in web pages that use MathJax or KaTeX for rendering. This default behavior can be disabled using the -N option. Additionally, the removal of page elements using Rdrview can be disabled with the -R option.

4 Configuration

Loading of the Configuration File  

# load script.lua as the configuration file
$ rmodepdf -c script.lua <url>

Using a configuration file, we can declare custom rules for transforming HTML to LaTeX, change the document template, load extra packages, or modify the processed page before transformation.

Change Settings  

add_to_config {
  document = {
    preamble_extras = [[
    \setmainfont{Linux Libertine O}
    ]],
  },
  img_convert = {
    -- modify the command used for
    -- conversion of SVG images to PDF
    svg = "cairosvg -o ${dest} -",
  },
}

This example uses the command add_to_config, which safely copies new configuration values into the original configuration. If you only want to set a single configuration value, you can also directly write to the config table:

Direct Settings  

-- change settings for the Geometry package
config.document.geometry = "a6paper"

Note that settings for Geometry are generated automatically when you select a page format with the -P or --pageformat option. You can overwrite these settings using this variable.
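The difference between merging values and assigning them directly can be sketched in a few lines. The Python code below is only an illustration, under the assumption that "safely copies" means a recursive merge that leaves sibling settings untouched. The function merge_config and the default values are hypothetical and are not taken from Rmodepdf.

```python
def merge_config(config: dict, updates: dict) -> dict:
    """Recursively copy values from updates into config, in place."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            merge_config(config[key], value)  # descend into sub-tables
        else:
            config[key] = value  # add or overwrite a single value
    return config


# Hypothetical defaults; the real config table has more fields
config = {
    "document": {"geometry": "a6paper", "preamble_extras": ""},
    "img_convert": {"svg": "default-svg-command"},
}

merge_config(config, {
    "document": {"preamble_extras": r"\setmainfont{Linux Libertine O}"},
})

# geometry survives because only preamble_extras was supplied
print(config["document"])
```

Replacing the whole document table instead would silently drop geometry, which is the kind of mistake a merging helper avoids.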

5 Callbacks

The configuration script is executed before the actual conversion, so it cannot directly influence the conversion process. However, we can define several callback functions that allow us to affect the conversion. These functions are as follows:

Available Callbacks  

preprocess_content

modify string with the raw HTML before readability and DOM parsing.

preprocess_dom

modify DOM object before fetching of images or handling of MathJax.

postprocess_dom

modify DOM after all processing by Rmodepdf.

postprocess

late post-processing of the config table.

In the following text, we will introduce some new features provided by LuaXML, namely DOM processing and transforming to other formats.

Example: Print the HTML Code  

function postprocess_dom(dom)
  print(dom:serialize())
  return dom
end

This example lets you view the DOM serialized back into HTML, so you can see all the elements that survived processing by Rdrview and by Rmodepdf’s own functions. It is important to return the DOM at the end of the function; this ensures that any modifications made to the DOM are preserved and applied to the final document.

Here’s a slightly more complex example. Let’s assume that Rdrview did not remove a menu that might look like this:

Example: Remove HTML Elements  

<div class="menu">
... menu contents ...
</div>

We can use the postprocess_dom function to remove this menu:

function postprocess_dom(dom)
  -- Find the menu using a CSS selector
  local menu = dom:query_selector(".menu")

  -- Iterate over the menu elements
  -- and remove each one
  for _, el in ipairs(menu) do
    el:remove_node()
  end

  -- Return the modified DOM
  return dom
end

In this example:

1. Use the query_selector method to find all elements with the class menu.
2. Iterate over each element retrieved in the previous step using a for loop.
3. Remove each menu element using the remove_node method.
4. Return the modified DOM at the end of the function.

This ensures that any remaining menus are removed from the final document.

Other Useful LuaXML DOM Functions  

el:get_attribute

get element attribute

el:set_attribute

set element attribute

el:get_text

get text content of the element

el:get_element_name

get element name

There are many more functions; see the LuaXML documentation for the API docs and examples of use.

6 Transformation rules

LuaXML allows us to create rules for transforming the DOM into various formats. Rmodepdf includes rules for transforming basic HTML elements into LaTeX.

LuaXML DOM Transformations

htmlprocess.add_action

add a new rule

htmlprocess.add_custom_action

process element using Lua

htmlprocess.reset_actions

remove rules for the given selector

%s

insert transformed contents of the element

@{<attribute name>}

insert value of an attribute

In the configuration file, the variable htmlprocess contains an object with rules for converting HTML elements. It provides two main functions: htmlprocess.reset_actions, which clears all rules for a given selector, and htmlprocess.add_action, which adds new rules.

A more powerful tool is the htmlprocess.add_custom_action function, which enables processing of elements in Lua. For an example of its usage, consult the LuaXML documentation.

The following code displays some basic usage of the transformation library:

Rules Example  

htmlprocess.reset_actions("figure")
htmlprocess.reset_actions("img")
htmlprocess.add_action("img",
  [[\includegraphics[max width=\textwidth]{@{src}}]])
htmlprocess.add_action("figure", "\n\n \\noindent %s")
htmlprocess.add_action(".sample .foo", "hello: %s")

In this example, we change the default formatting for the <figure> element and include the text that is contained inside using the %s instruction. For the <img> element, we use the src attribute to get the image file name. As this element cannot contain any child elements, we don’t need to use %s in this action. .sample .foo is an example of using HTML class attributes in actions.

Using Lua’s string syntax [[ ... ]] allows for easy insertion of LaTeX commands without the need for backslash doubling. When using regular quotes, as you can see in the rule declaration for figure, backslashes must be doubled.

Transformation example  

Example 1. HTML document

<figure>
<img src="hello.png" />
</figure>
<p class="sample"><span class="foo">Matched
<p><span class="foo">Not matched
     

This is a small HTML snippet that shows the usage of our transformation rules. Note the use of unclosed <p> elements, which would cause errors in XML. Thanks to CSS selectors, only the text in the first paragraph is selected. The second one is not, because its span element with the foo class is not a descendant of an element with the sample class.

Example 2. Transformed result

 \noindent
\includegraphics[max width=\textwidth]{hello.png}

hello: Matched

Not matched
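The matching behaviour of the .sample .foo selector in this example can be modelled in a few lines. The sketch below is not LuaXML's selector engine. It works on a hand-built description of the two <span> elements and their ancestors, and all names in it are hypothetical.

```python
def matches_descendant(ancestors, element, ancestor_class, target_class):
    """True if element has target_class and some ancestor has ancestor_class."""
    return (
        target_class in element["classes"]
        and any(ancestor_class in a["classes"] for a in ancestors)
    )


# <p class="sample"><span class="foo">Matched
first = {
    "ancestors": [{"name": "p", "classes": ["sample"]}],
    "element": {"name": "span", "classes": ["foo"]},
}

# <p><span class="foo">Not matched
second = {
    "ancestors": [{"name": "p", "classes": []}],
    "element": {"name": "span", "classes": ["foo"]},
}

for node in (first, second):
    hit = matches_descendant(node["ancestors"], node["element"], "sample", "foo")
    print("hello: rule applies" if hit else "default handling")
```

Both spans carry the foo class, so the class alone does not decide the outcome; the ancestor check does.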

7 Templates

Template Basics  

# use mytemplate.tex as the template
$ rmodepdf -t mytemplate.tex <url>

Template Syntax  

Variable Printing

@{variablename}: Prints a variable from the config table or its sub-tables.

Loops

_{variablename}loop code/{separator}: Iterates over array variables, using %s placeholders or accessing fields directly.

Conditions

?{variablename}{true}{false}: Evaluates a condition to insert content based on the presence of variables.

Sample Template Snippet  

% loop over languages
\usepackage[_{document.languages}%s/{,}]{babel}
% use geometry settings
\usepackage[@{document.geometry}]{geometry}
@{document.preamble_extras}
\begin{document}
% loop over documents
_{pages}
\selectlanguage{@{language}}
% conditionally print title
?{title}{Title: @{title}\par}{}
% document contents
@{content}
/{\clearpage}
\end{document}

Although this example is not complete, it demonstrates the available syntax in templates. Note that when processing an array, we must distinguish whether it contains strings or tables. Strings are displayed using %s. If it is a table, its elements become active variables and can be displayed using @{variablename}. You can see the difference in processing the array document.languages, which contains languages as strings, and pages, which contains tables with metadata from processed pages.
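The distinction between arrays of strings and arrays of tables can be made concrete with a toy expander. The Python sketch below handles only @{...} variables and _{...}.../{...} loops, ignores conditions and nesting, and is an assumption about the semantics described above rather than Rmodepdf's template engine. The sample data is made up.

```python
import re


def lookup(scope: dict, dotted: str):
    """Resolve a dotted name such as document.geometry in nested dicts."""
    value = scope
    for part in dotted.split("."):
        value = value[part]
    return value


def expand(template: str, scope: dict) -> str:
    """Expand _{name}body/{sep} loops first, then @{name} variables."""
    def loop(match):
        name, body, sep = match.groups()
        parts = []
        for item in lookup(scope, name):
            if isinstance(item, dict):
                # table item: its fields become variables inside the loop body
                parts.append(expand(body, {**scope, **item}))
            else:
                # string item: substituted for the %s placeholder
                parts.append(body.replace("%s", str(item)))
        return sep.join(parts)

    template = re.sub(r"_\{([\w.]+)\}(.*?)/\{(.*?)\}", loop, template, flags=re.S)
    return re.sub(r"@\{([\w.]+)\}", lambda m: str(lookup(scope, m.group(1))), template)


# Made-up data shaped like the config table described in the text
config = {
    "document": {"languages": ["czech", "english"], "geometry": "a6paper"},
    "pages": [
        {"title": "First", "content": "Hello"},
        {"title": "Second", "content": "World"},
    ],
}

preamble = expand(
    r"\usepackage[_{document.languages}%s/{,}]{babel}" "\n"
    r"\usepackage[@{document.geometry}]{geometry}",
    config,
)
body = expand("_{pages}@{title}: @{content}/{; }", config)
print(preamble)
print(body)  # First: Hello; Second: World
```

The languages loop uses %s because its items are plain strings, while the pages loop reads title and content directly because each item is a table.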

8 Responsive Design in LaTeX

So far, we have explored the features of Rmodepdf and LuaXML. Now, we will focus on additional packages that can be used independently to facilitate automated typesetting of documents.

What is Responsive Design  

Thanks to these features, the same page code can be well displayed both on a large monitor and on mobile devices.

Page Example on a Large Monitor  


Page Example on a Small Screen  


The responsive Package  

A package inspired by responsive design methods for web pages

Homepage

https://ctan.org/pkg/responsive

Various sizes of spaces and other elements depend on the font size, so the Responsive package adjusts them with each font size change to match the new size.

Setting Font Size Based on Display Size  

Font size can be set using the command \setsizes{number of characters per line}.

\begin{minipage}{5cm}
\setsizes{25}

\lipsum[1]

\end{minipage}

   Lorem   ipsum   dolor  sit
amet,  consectetuer  adipisc-
ing  elit.    Ut  purus  elit,
vestibulum  ut, placerat ac,
adipiscing vitae, felis.

Difference in Font Size Based on Number of Characters  

\setsizes{55}

[Typeset sample: \lipsum[1] set at 55 characters per line]

\setsizes{25}

   Lorem   ipsum   dolor  sit
amet,  consectetuer  adipisc-
ing  elit.    Ut  purus  elit,
vestibulum  ut, placerat ac,
adipiscing vitae, felis.

Configuration  

Options can be set when calling the package or later using the command \ResponsiveSetup.

Important options:

noautomatic

do not set font size automatically at the beginning of the document

characters

number of characters when automatically setting the font size

scale

typographic scale used for font sizes

lineratio

ratio used when calculating line height

When the Responsive package changes the base font size, it automatically adjusts the sizes used for \large, \small, and other commands, as well as line height and other fundamental dimensions.

Line Height  

Line height can be influenced by the lineratio option. The higher its value, the smaller the distance between lines.

\ResponsiveSetup{lineratio=38}

[Typeset sample: \lipsum[1] set with lineratio=38]

\ResponsiveSetup{lineratio=34}

[Typeset sample: \lipsum[1] set with lineratio=34]

Inspired by this article:

https://www.smashingmagazine.com/2020/07/css-techniques-legibility/

CSS Media Query Example  

Example 3. Change text color depending on the page width

body {
  color: green;
}
@media screen and (max-width: 600px) {
    body {
      color: blue;
    }
}

In this example, the text is green by default and blue on screens that are at most 600 pixels wide.

Media Queries in LaTeX  

Using the \mediaquery command, we can test various properties:

Additional tests can be easily added.

Media Query Example  

This example displays fewer characters per line if the text width is less than or equal to 4 cm.

\mediaquery{max-textwidth=4cm}
{\setsizes{45}}{\setsizes{60}}

[Typeset samples: \lipsum[1] set in a wide text block, then in a narrow text block]

Do Media Queries Make Sense in LaTeX?  

9 The linebreaker Package

The linebreaker Package  

Prevents the occurrence of overfull lines

Homepage

https://ctan.org/pkg/linebreaker

Example  

The example document given below creates two pages by using Lua code alone. You will learn how to access TeX’s boxes and counters from the Lua side, shipout a page into the PDF file, create horizontal and vertical boxes (hbox and vbox), create new nodes and manipulate the nodes links structure.

(Without Linebreaker)

The example document given below creates two pages by using Lua code alone. You will learn how to access TeX’s boxes and counters from the Lua side, shipout a page into the PDF file, create horizontal and vertical boxes (hbox and vbox), create new nodes and manipulate the nodes links structure.

(With Linebreaker)

Configuration  

Linebreaker can be configured using the \linebreakersetup command:

maxcycles

number of attempts to retypeset a paragraph

maxemergencystretch

maximum value of \emergencystretch

maxtolerance

maximum value of \tolerance

Example configuration:

\linebreakersetup{
maxtolerance = 90,         % default 9999
maxemergencystretch = 1em, % default 3em
maxcycles = 4              % default 30
}

When Linebreaker detects paragraph overflow, it attempts to typeset it again with increasing \tolerance and \emergencystretch values. These values are incremented by a specified number of steps until they reach the maximum values configured in Linebreaker. If a value is found where the paragraph no longer overflows, processing stops, and those values are used.
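The retry strategy described above can be sketched as a small loop. The Python code below is a simplified model, not the package's implementation. It assumes that both values grow linearly towards their maxima, and the fits predicate is a hypothetical stand-in for TeX's line breaker.

```python
def retry_schedule(start_tolerance, max_tolerance, max_stretch, max_cycles):
    """Yield (tolerance, emergencystretch) pairs that grow linearly to the maxima."""
    for step in range(1, max_cycles + 1):
        fraction = step / max_cycles
        tolerance = start_tolerance + (max_tolerance - start_tolerance) * fraction
        yield tolerance, max_stretch * fraction


def first_fitting_setting(fits, start_tolerance, max_tolerance, max_stretch, max_cycles):
    """Return the first pair accepted by fits, or None if every attempt overflows."""
    for tolerance, stretch in retry_schedule(start_tolerance, max_tolerance,
                                             max_stretch, max_cycles):
        if fits(tolerance, stretch):
            return tolerance, stretch  # stop at the first success
    return None


# Hypothetical stand-in for TeX's line breaker: this paragraph stops
# overflowing once the tolerance reaches 600.
def fits(tolerance, stretch):
    return tolerance >= 600


result = first_fitting_setting(fits, start_tolerance=200, max_tolerance=1000,
                               max_stretch=2.0, max_cycles=4)
print(result)  # (600.0, 1.0)
```

Stopping at the first success keeps the values as low as possible, so a paragraph is loosened only as much as it takes to remove the overflow.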

10 Conclusion

Rmodepdf status  

Other useful packages for automatic typesetting  

lua-widow-control

prevents widows and orphans.

luavlna

prevents single-letter words at the ends of lines for Czech and Slovak, and prevents line breaks in SI units and academic titles.

Thank you for your attention!

www.kodymirus.cz