| Homepage: | https://www.kodymirus.cz/texblend/ |
| Issue tracker: | https://github.com/michal-h21/texblend |
This utility converts the text content of web pages to PDF using LaTeX. The text content is extracted using rdrview, a utility that provides a port of Firefox’s reader view functionality. This means that it strips away clutter like buttons, ads, background images, and videos, leaving only the article text.
It doesn’t support any CSS or JavaScript, only plain HTML. The main purpose is to create versions of longer articles suitable for reading on e-readers, tablets, and phones. It can also be used to print web pages.
The basic usage is as follows:
$ rmodepdf <url>
If the compilation goes well, Rmodepdf should print a message like:
[STATUS] rmodepdf: File saved as: Page_Title.pdf
The name of the PDF file is based on the web page title. You can choose a
different filename using the -o option:
$ rmodepdf -o sample <url>
You can also compile several web pages at once. Rmodepdf will convert all URLs passed as arguments into one PDF, with the filename based on the first page’s title:
$ rmodepdf <url1> <url2> <url3>
Instead of URLs, you can also pass filenames of local files or pass the HTML code
from the standard input with the - option:
$ rmodepdf - < localfile.html
-b,--baseurl (default "")       Base URL used when the HTML content is read from the standard input
-c,--configfile (default "")    Filename of Lua configuration file
-h,--help                       Print help message
-H,--nohyperlinks               Don't create special elements for internal hyperlinks
-i,--imgdir (default "")        Download images and save them to the specified directory
-l,--loglevel (default status)  Set log level
                                possible values: debug, info, status, warning, error, fatal
-n,--noimages                   Don't download images
-N,--nomathjax                  Don't process LaTeX commands in the HTML document
-t,--template (default "")      LaTeX template
-o,--output (default "")        Output file name
-p,--pageformat (default ebook) Page format
-R,--nordrview                  Don't use rdrview to get the clean contents from the web pages
-s,--pagestyle (default empty)  \pagestyle for the document
-p,--print                      Print the converted LaTeX source
-v,--version                    Print version
<url> (string)
By default, Rmodepdf downloads all images and saves them as temporary files
which are removed after each run. If you want to reuse these images, use the
--imgdir option. It expects an existing directory where images should be
saved.
$ rmodepdf -i img <url>
If you read HTML content from the standard input, you can use the
--baseurl option to point to the address where images should be looked
up.
The --noimages option, on the other hand, disables downloading of
images.
Rmodepdf assumes that web pages use the MathJax or KaTeX libraries, which enable
LaTeX syntax for math in the HTML content. In some cases, this can lead to errors,
for example if LaTeX commands are displayed in the HTML code outside of
<code> or <pre> elements. The --nomathjax option disables passing of
LaTeX commands to the resulting document.
add_to_config {
  img_convert = {
    -- modify the command used for conversion of svg images to
    -- a format suitable for LuaLaTeX
    svg = "cairosvg -o ${dest} -",
  },
  html_latex = { -- support for LaTeX math in webpages that use MathJax or KaTeX
    ignored = {"pre", "code"}, -- html elements which shouldn't be processed for LaTeX commands
    allowed_commands = {"ref", "pageref", "cleveref", "nameref"}
  },
}
function post_process()
  -- set French as a main document language
  table.insert(config.document.languages, "french")
end
document.preamble_extras – additional code to be inserted at the end of the document preamble, for example font settings or extra packages.
document.geometry – string with page dimensions in a format suitable for the Geometry package.
document.pagestyle – document page style.
document.languages – list of languages used by the processed pages. This is populated during page processing.
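As a minimal sketch, the document settings could be changed from the configuration script like this. It assumes that add_to_config merges a nested document table into config.document, in the same way as the img_convert and html_latex tables; the font package and the page dimensions are illustrative values only, not defaults of Rmodepdf.

```lua
-- Sketch only: assumes add_to_config merges nested tables into config.document.
add_to_config {
  document = {
    -- inserted at the end of the preamble (illustrative font package)
    preamble_extras = [[\usepackage{libertinus}]],
    -- passed as options to the Geometry package (hypothetical dimensions)
    geometry = "paperwidth=9cm,paperheight=12cm,margin=5mm",
    -- argument of \pagestyle
    pagestyle = "plain",
  },
}
```

The long bracket string [[...]] avoids having to escape the backslash in the LaTeX command.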
The config.pages table contains a list of all processed HTML documents and their metadata. Each item in the list contains the following properties:
language – language of the document.
content – result of HTML to TeX conversion.
author – document author.
title – document title.
url – document URL.
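The following sketch shows how this metadata might be adjusted from a configuration script. It assumes that config.pages is already populated when the post_process() function runs, and that fields changed there are picked up by the template; the fallback title is a hypothetical value.

```lua
-- Sketch only: assumes config.pages is filled in before post_process() is called.
function post_process()
  for _, page in ipairs(config.pages) do
    -- give pages without a detected title a placeholder (hypothetical text)
    if not page.title or page.title == "" then
      page.title = "Untitled page"
    end
  end
end
```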
@{variablename}: Variables are
contained in the config table. Using a dot, properties of sub-tables can
also be printed. For example, @{document.preamble_extras} prints the
config.document.preamble_extras variable.
_{variablename}loop code/{separator}: The variable used must be an
array. For example, document.languages contains the languages of all
converted documents in a format suitable for the Babel package, and pages
contains all converted documents. In the loop code, the fields of
the currently processed array item are available. If the array contains only
strings, the placeholder %s can be used, as with document.languages.
If the current item is a table, its fields can be accessed directly using
@{variablename}.
?{variablename}{true}{false}: Inserts the {true} code if the variable is
set and the {false} code otherwise. This is used for elements like the title
and author, which may not be present on all pages.
\documentclass{article}
\usepackage{linebreaker,responsive}
\usepackage[_{document.languages}%s/{,}]{babel}
\usepackage[@{document.geometry}]{geometry}
\pagestyle{@{document.pagestyle}}
@{document.preamble_extras}
\begin{document}
_{pages}
\selectlanguage{@{language}}
?{title}{Title: @{title}\par}{}
?{author}{Author: @{author}\par}{}
\href{@{url}}{@{url}}\par
@{content}
/{\clearpage}
\end{document}
Note that when processing an array, we must distinguish whether it contains
strings or tables. Strings are displayed using %s. If it is a table, its elements become
active variables and can be displayed using @{variablename}. You can see the
difference in processing the array document.languages, which contains languages as
strings, and pages, which contains tables with metadata from processed
pages.
The default templates used for conversion from HTML to LaTeX utilize some commands that are not available in pure LaTeX. If you are creating your own template, it is necessary to use the following packages in it to avoid compilation errors.
– table support
– multilingual support for in-text quotes
– automatic resizing of images, to fit into page dimensions
– set font sizes to fit into page dimensions
The configuration script is executed before the actual conversion, so it cannot directly influence the conversion process. However, we can define several callback functions that allow us to affect the conversion. These functions are as follows:
– modify string with the raw HTML before readability and DOM parsing.
– modify DOM object before fetching of images or handling of MathJax.
– modify DOM after all processing by Rmodepdf.
– late post-processing of the config table.
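As an illustration, a callback defined in the configuration script might look like the sketch below. It uses the preprocess_content() hook named in the changelog; the signature is an assumption (the raw HTML passed in as a string, the modified string returned), so check it against the actual hook before relying on it.

```lua
-- Sketch only: the argument and return value of this hook are assumptions.
function preprocess_content(content)
  -- remove HTML comments before the page is parsed;
  -- the parentheses drop the match count that gsub also returns
  return (content:gsub("<!%-%-.-%-%->", ""))
end
```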
Permission is granted to copy, distribute and/or modify this software under the terms of the LaTeX Project Public License, version 1.3.
| 2024-07-25 | Use special elements for internal links in the document. This can be disabled with the --nohyperlinks option. |
| 2024-07-22 | Changed the MathJax handling code. It now adds a special element only for the math itself, not for the surrounding text. |
| 2024-06-13 | Added new hook, preprocess_content(), for modifying the raw HTML string. |
| | Cleanup of some unused code. |
| | Added add_to_config() function for modifying the configuration table. |
| 2024-06-12 | Provided new templating mechanism that doesn’t depend on LuaXML templates. |
| 2024-04-09 | Added --nordrview option. |
| | Basic metadata parsing if rdrview is not available or is disabled. |
| 2024-04-08 | ChangeLog start. |