Reference manual

Soupault (soup-oh) is an HTML manipulation tool.

It can work either as a static site generator that builds a complete site from a page template and a directory with page body files, or an HTML processor for an existing website.

Soupault is based on HTML element tree rewriting. Pretty much like client-side DOM manipulation, just without the browser or interactivity. You can tell it to do things like “insert contents of footer.html file into <div id="footer"> no matter where that div is in the page”, or “use the first <h1> element in the page for the page title”.

Full HTML awareness
Soupault is aware of the full page structure. It can find and manipulate any element using CSS3 selectors like p, div#content, or a.nav, no matter where it is in the page. There is no need for “front matter”: all metadata can be extracted from the HTML itself.
No themes
Any page can be a soupault theme, you just tell it where to insert the content. By default it inserts page content into the <body> element, but it can be anything identifiable with a CSS selector. You decide how much is the “theme” and how much is the content.
Built for text-heavy, nested websites
Tables of contents, footnotes, and breadcrumbs are supported out of the box. Existing element ids are respected, so you can make table of contents and footnote links permanent and immune to link rot, even if you change the heading text or page structure.
Preserves your content structure
Soupault can mirror your directory structure exactly, down to file extensions. You can use it to enhance an existing website without breaking any links, or gradually migrate from an HTML processor to the website generator workflow.
Any input format
HTML is the primary format, but you can use anything that can be converted to HTML. Just specify preprocessors for files with certain extensions.
Easy to install
Soupault is a single executable file with no dependencies. Just unpack the archive and it's ready to use.
Fast
Soupault is written in OCaml and compiled to native code on all platforms. This website takes less than half a second to build.
Extensible
You can add your own HTML rewriting logic with Lua scripts.

For a very quick start:

  1. Create a directory for your site.
  2. Run soupault --init. It will create a page template in templates/main.html, a default soupault.conf config file, and a directory for pages (site).
  3. Drop some HTML files into the site directory.
  4. Run soupault in that directory.
  5. Look in build/ for the result.

For details, read the documentation below.

Overview

Soupault is named after the French dadaist and surrealist writer Philippe Soupault because it's based on the Lambda Soup library, which is a reference to the Beautiful Soup library, which is a reference to the Mock Turtle chapter from Alice in Wonderland (the best book on programming for the layman according to Alan Perlis), which is a reference to an actual turtle soup imitation recipe, and also a reference to tag soup, a derogatory term for malformed HTML, which is a reference to... in any case, soupault is not the French for a stick horse, for better or worse, and this paragraph is the only nod to dadaism in this document.

Soupault is quite different from other website generators. Its design goals are:

  1. No special syntax inside pages, only plain old semantic HTML.
  2. No templates or themes. Any HTML page can be a soupault theme.
  3. No front matter.
  4. Usable as a postprocessor for an existing website and friendly to sites with unique pages.

People have written lots of static website generators, but most of them are variations on a single theme: take a file with some metadata (front matter) followed by content in a limited markup format such as Markdown and feed it to a template processor.

This approach works well in many cases, but has its limitations. A template processor is only aware of its own tags, but not of the full page structure. The front matter metadata is the only part of the page source it can use, and it can only insert it into fixed locations in the template. The part below the front matter is effectively opaque, and formats like Markdown simply don't allow you to add machine-readable annotations anyway.

Well-formed HTML, however, is a machine-readable format, and libraries that can handle it have existed for a long time. As shown by microformats, you can embed a lot of information in it. More importantly, unlike front matter metadata, HTML metadata is multi-purpose: for example, you can use the id attribute for CSS styling and as a page anchor, as well as a unique element identifier for data extraction or HTML manipulation programs.

With soupault, it's possible to take advantage of the full HTML markup and even make every page on your website look different rather than built from the same template, and still have an automated workflow.

You can also use Markdown/reStructuredText/whatever if you specify preprocessors. Soupault will automatically run a preprocessor before parsing your page, though you'll miss some finer points like footnotes if you go that way.

How soupault works

In the website generator mode (the default), soupault takes a page “template” (an HTML file devoid of content), parses it into an element tree, and locates the content container element inside it.

By default the content container is <body>, but you can use any selector: div#content (a <div id="content"> element), article (an HTML5 article element), #post (any element with id="post") or any other valid CSS selector.
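
For example, a minimal sketch of the corresponding option (the div#content value is just an illustration, not a default):

[settings]
  # Insert page content into <div id="content"> instead of the default <body>
  content_selector = "div#content"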

Then it traverses your site directory where page source files are stored, takes a page file, and parses it into an HTML element tree too. If the file is not a complete HTML document (doesn't have an <html> element in it), soupault inserts it into the content container element of the template. If it is a complete page, then it goes straight to the next step.

The new HTML tree is then passed to widgets—HTML rewriting modules that manipulate it in different ways: they can include other files or the output of external programs into specific elements, create breadcrumbs for your page, or delete unwanted elements.

Processed pages are then written to disk, into a directory structure that mirrors your source directory structure.

Here is a simplified flowchart:

soupault flowchart

Performance

Despite its heavy-handed approach, soupault is reasonably fast. With a config that includes ToC, footnotes, file inclusion, breadcrumbs, and built-in section index generator, it can process 1000 copies of its own documentation page in about 14 seconds.1 Small websites take less than a second to build.

For comparison, in a simplified “read-parse-prettify-write” test with 1000 copies of this document, CPython/BeautifulSoup takes about 20 seconds to complete.

Why use soupault?

If you are starting a website

All website generators provide a default theme, but making new themes can be tricky. With soupault, there are no intermediate steps between writing your page layout and building your website from it.

In the simplest case you can just create a page skeleton in templates/main.html with an empty <body>, add a bunch of pages to site/ (site/about.html, site/cv.html, ...) and run the soupault command in that directory.

If you already have a website

If you have a website made of handwritten pages, you can use soupault as a drop-in automation tool. Just copy your pages to the site directory, and you can already take advantage of features like tables of contents, without modifying any of your pages.

Then you can gradually migrate to using a shared template, if you want to. Take a page, remove the content from it, save that empty page to templates/main.html, and start stripping pages in site down to their content. Pages that don't have an <html> element will be treated as page bodies and inserted in the template, but other pages will only be processed by widgets.

If you have an existing directory structure that you don't want to change because it will break links, soupault can mirror it exactly, with the clean_urls = false config option. It also preserves original file names and extensions in that mode.

What can it do?

Soupault comes with a bunch of built-in widgets that can:

  • Include external files, HTML snippets, or output of external programs into an element.
  • Set the page title from text in some element.
  • Generate breadcrumbs.
  • Move footnotes out of the text.
  • Insert a table of contents.
  • Delete unwanted elements.

For example, here's a config for the title widget that sets the page title.

[widgets.page-title]
  widget = "title"
  selector = "#title"
  default = "My Website"
  append = " &mdash; My Website"

It takes the text from the element with id="title" and copies it to the <title> tag of the generated page. It can be any element, and it can be a different element in every page. If you use <h1 id="title"> in site/foo.html and <strong id="title"> in another page, soupault will still find it.

It's just as simple to prevent something from appearing on a particular page. Just don't use an element that a widget uses for its target, and the widget will not run on that page.

Another example: to automatically include the content of templates/nav-menu.html into the <nav> element, you can put this into your soupault.conf file:

[widgets.nav-menu]
  widget = "include"
  selector = "nav"
  file = "templates/nav-menu.html"

What it cannot do

By design:

Development web server
There are plenty of those, even python3 -m http.server is perfectly good for previews.
Deployment automation
Same reason: there are lots of tools for it.

Because I don't need it and I'm not sure if anyone wants it or how it will fit:

  • Asset management
  • Incremental builds
  • Multilingual sites

Installation

Binary release packages

Soupault is distributed as a single, self-contained executable, so installing it from a binary release package is trivial.

You can download it from files.baturin.org/software/soupault. Prebuilt executables are available for Linux (x86-64, statically linked), macOS (x86-64), and Microsoft Windows (32-bit, Windows 7 and newer).

Just unpack the archive and copy the executable wherever you want.

Prebuilt executables are compiled with debug symbols. This makes them a couple of megabytes larger than they could be, but you get better error messages if something goes wrong. If you encounter an internal error, you can get an exception trace by running soupault with the OCAMLRUNPARAM=b environment variable set.
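
For example, on a UNIX-like system you can set the variable for a single run like this:

$ OCAMLRUNPARAM=b soupault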

Building from source

If you are familiar with the OCaml programming language, you may want to install from source.

Since version 1.6, soupault is available from the opam repository. If you already have opam installed, you can install it with opam install soupault.

If you want the latest development version, the git repository is at github.com/dmbaturin/soupault.

To build a statically linked executable for Linux, identical to the official one, first install a +musl+static+flambda compiler flavor, then uncomment the (flags (-ccopt -static)) line in src/dune.

Using soupault on Windows

Windows is a supported platform and soupault includes some fixups to account for the differences. This document makes a UNIX cultural assumption throughout, but most of the time the same configs will work on both systems. Some differences, however, require user intervention to resolve.

If a file path is only used by soupault itself, then the UNIX convention will work, i.e. file = 'templates/header.html' and file = 'templates\header.html' are both valid options for the include widget. However, if it's passed to something else in the system, then you must use the Windows convention with backslashes. This applies to preprocessors, the command option of the exec widget, and the index_processor option.

So, if you are on Windows, remember to adjust the paths if needed, e.g.:

[widgets.some-script]
  widget = 'exec'
  command = 'scripts\myscript.bat'
  selector = 'body'

Note that inside double quotes, the backslash is an escape character, so you should either use single quotes for such paths ('scripts\myscript.bat') or use a double backslash ("scripts\\myscript.bat").

Getting started

Create your first website

Soupault has only one file of its own: the config file soupault.conf. It does not impose any particular directory layout on you. However, it has default settings that allow you to run it unconfigured.

You can initialize a simple project with default configuration using this command:

$ soupault --init

It will create the following directory structure:

.
├── site
│   └── index.html
├── templates
│   └── main.html
└── soupault.conf

  • The site/ directory is where page files are stored.
  • The templates/ directory is just a convention; soupault uses templates/main.html as the default page template.
  • soupault.conf is the config file.

Now you can build your website. Just run soupault in your website directory, and it will put the generated pages in build/. Your index page will become build/index.html.

By default, soupault inserts page content into the <body> element of the page template. Therefore, from the default template:

<html>
  <head></head>
  <body>
  <!-- your page content here -->
  </body>
</html>

and the default index page source, which is <p>Site powered by soupault</p>, it will produce this page:

<!DOCTYPE html>
<html>
 <head></head>
 <body>
  <p>
   Site powered by soupault
  </p>
 </body>
</html>

You can use any CSS selector to determine where your content goes. For example, you can tell soupault to insert it into <div id="content"> by changing the content_selector option in soupault.conf to content_selector = "div#content".

Page files

Soupault assumes that files with extensions .html, .htm, .md, .rst, and .adoc are pages and processes them. All other files are simply copied to the build directory.

If you want to use other extensions, you can change it in soupault.conf. For example, to add .txt to the page extension list, use this option:

[settings]
  page_file_extensions = ["htm", "html", "md", "rst", "txt"]

Remember that for files in formats other than HTML, you also need to specify a converter; simply adding an extension to the list is not enough.
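
For example, here is a sketch that pairs the extended extension list with a converter for .txt files; the txt2html command is a placeholder for whatever program you actually use:

[settings]
  page_file_extensions = ["htm", "html", "md", "rst", "txt"]

[preprocessors]
  # Hypothetical converter that reads a .txt file and prints HTML to stdout
  txt = "txt2html"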

Clean URLs

Soupault uses clean URLs by default. If you add another page to site/, for example, site/about.html, it will turn into build/about/index.html so that it can be accessed as https://mysite.example.com/about.

Index files (by default, files whose name is index) are simply copied to the target directory.

site/index.html → build/index.html
site/about.html → build/about/index.html
site/papers/theorems-for-free.html → build/papers/theorems-for-free/index.html

Note: having both a page named foo.html and a section directory named foo/ is undefined behaviour when clean URLs are used. Avoid it to prevent unpredictable results.

This is what soupault will make from a source directory:

$ tree site/
site/
├── about.html
├── cv.html
└── index.html

$ tree build/
build/
├── about
│   └── index.html
├── cv
│   └── index.html
└── index.html

Disabling clean URLs

If you've had a website for a long time and there are links to your page that will break if you change the URLs, you can make soupault mirror your site directory structure exactly and preserve original file names.

Just add clean_urls = false to the [settings] table of your soupault.conf file.

[settings]
  clean_urls = false

Soupault as an HTML processor

If you want to use soupault with an existing website and don't want the template functionality, you can switch it from the website generator mode to an HTML processor mode where it doesn't use a template and doesn't require the default_template file to exist.

Recommended settings for the HTML processor mode:

[settings]
  generator_mode = false
  clean_urls = false

Page templates

Some people want all their pages to have a consistent look, and then having just a default_template option works perfectly. Some people prefer to give every page a unique look, and then disabling generator mode is the right thing to do. A third group wants just some pages or sections to look different.

For this use case, soupault supports multiple page templates. You can define them in the [templates] section.

[templates.serious-template]
  file = "templates/serious.html"
  section = "serious-business"

[templates.fun-template]
  file = "templates/fun.html"
  path_regex = "(.*)/funny-(.*)"

As of soupault 1.12, there is no way to specify a content_selector option per template; they will all use the content_selector option from [settings].

Page/section targeting options are discussed in detail in the widgets section.

Note that proper template targeting is your responsibility. Omitting the page, section, or path_regex options will not cause errors, but soupault may use a wrong template for some pages. If that happens, make your targeting options more specific.

Nested structures

A flat layout is not always desirable. If you want to create site sections, just add some directories to site/. Subdirectories are subsections, their subdirectories are subsubsections and so on—they can go as deep as you want. Soupault will process them all recursively and recreate the directories in build/.

site/
├── articles
│   ├── goto-considered-harmful.html
│   ├── index.html
│   └── theorems-for-free.html
├── about.html
├── cv.html
└── index.html

build/
├── about
│   └── index.html
├── articles
│   ├── goto-considered-harmful
│   │   └── index.html
│   ├── theorems-for-free
│   │   └── index.html
│   └── index.html
├── cv
│   └── index.html
└── index.html

Note that if your section does not have an index page, soupault will not create it automatically. If you want a page to exist, you need to make it.

Configuration

The default directory and file paths soupault --init creates are not fixed, you can change any of them. If you prefer different names, or you have an existing directory structure you want soupault to use, just edit the soupault.conf file. This is the [settings] section of the default config:

[settings]
  # Stop on page processing errors?
  strict = true

  # Display progress?
  verbose = false

  # Display detailed debug output?
  debug = false

  # Where input files (pages and assets) are stored.
  site_dir = "site"

  # Where the output goes
  build_dir = "build"

  # Files inside the site/ directory can be treated as pages or static assets,
  # depending on the extension.
  #
  # Files with extensions from this list are considered pages and processed.
  # All other files are copied to build/ unchanged.
  #
  # Note that for formats other than HTML, you need to specify an external program
  # for converting them to HTML (see below).
  page_file_extensions = ["htm", "html", "md", "rst", "adoc"]

  # Files with these extensions are ignored.
  ignore_extensions = ["draft"]

  # Soupault can work as a website generator or an HTML processor.
  #
  # In the "website generator" mode, it considers files in site/ page bodies
  # and inserts them into the empty page template stored in templates/main.html
  #
  # Setting this option to false switches it to the "HTML processor" mode
  # where it considers every file in site/ a complete page and only runs it through widgets/plugins.
  generator_mode = true

  # Files that contain an <html> element are considered complete pages rather than page bodies,
  # even in the "website generator" mode.
  # This allows you to use a unique layout for some pages and still have them processed by widgets.
  complete_page_selector = "html"

  # Website generator mode requires a page template (an empty page to insert a page body into).
  # If you use "generator_mode = false", this file is not required.
  #default_template = "templates/main.html"

  # Page content is inserted into a certain element of the page template. This option is a CSS selector
  # used for locating that element.
  # By default the content is inserted into the <body> element.
  content_selector = "body"

  # Soupault currently doesn't preserve the original doctype declaration
  # and uses the HTML5 doctype by default. You can change it using this option.
  doctype = ""

  # Enables or disables clean URLs.
  # When false: site/about.html -> build/about.html
  # When true: site/about.html -> build/about/index.html
  clean_urls = true

  # Since 1.10, soupault has plugin auto-discovery
  # A file like plugins/my-plugin.lua will be registered as a widget named my-plugin
  plugin_discovery = true
  plugin_dirs = ["plugins"]

Note that if you create soupault.conf file before running soupault --init, it will use settings from that file instead of default settings.

In this document, whenever a specific site or build dir has to be mentioned, we'll use default values.

If you misspell an option or make another mistake, soupault will warn you about an invalid option and try to suggest a correction.

Note that the config is typed, and a wrong value type has the same effect as a missing option. All boolean values must be true or false (without quotes), all integer values must be written without quotes around the numbers, and all strings must be in single or double quotes.
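
For example, a minimal illustration using options from the default config:

[settings]
  clean_urls = false     # a boolean: bare true or false
  # clean_urls = "false" # wrong: a quoted string, treated like a missing option
  site_dir = "site"      # a string: must be quoted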

Custom directory layouts

If you are using soupault as an HTML processor, or using it as a part of a CI pipeline, the typical website generator approach with a single “project directory” may not be optimal.

You can override the location of the config using an environment variable SOUPAULT_CONFIG. You can also override the locations of the source and destination directories with --site-dir and --build-dir options. Thus it's possible to run soupault without a dedicated project directory at all:

SOUPAULT_CONFIG="something.conf" soupault --site-dir some-input-dir --build-dir some-other-dir

Page preprocessors

Soupault has no built-in support for formats other than HTML, but you can use any format with it if you specify an appropriate page preprocessor.

Any preprocessor that takes a page file as its argument and outputs the result to stdout can be used.

For example, this configuration will make soupault call a program called cmark on the page file if its extension is .md.

[preprocessors]
  md = "cmark"

The table key can be any extension (without the dot), and the value is a command. You can specify as many extensions as you want.

Preprocessor commands are executed in shell, so it's fine to use relative paths and add arguments. The page file name will be appended to the command string.
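
For example, arguments and relative paths work as expected; the pandoc command below is just one possible converter choice:

[preprocessors]
  # The page file name is appended after the last argument
  md = "pandoc -f markdown -t html"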

By default soupault assumes that files with extensions .md, .rst, and .adoc are pages. If you are using another extension, you need to add it to the page_file_extensions list in the [settings] section, otherwise soupault will assume it's an asset and copy it unmodified, even if you configure a preprocessor for that extension.

Automatic section and site index

Having to add links to all pages by hand can be a tedious task. Nothing beats a carefully written and annotated section index, but it's not always practical.

Soupault can automatically generate a section index for you. While it's not a blog generator and doesn't have built-in features for generating indices of pages by date, category etc., it can save you the time of writing section index pages by hand.

Metadata is extracted directly from pages using selectors you specify in the config. It's more than possible to use a different element for the excerpt in every page, not just the first paragraph, without having to duplicate it in the “front matter”. It doesn't even have to be text. The same goes for other fields.

To use automatic indexing, you still need an index page in your section. It can be empty, but it must be there. Default index page name is index, so you should make a page like site/articles/index.html first.

Then enable indexing in the config. All indexing options are in the [index] table.

[index]
  index = true

By default, soupault will append the index to the <body> element. You can tell it to insert it anywhere you want with the index_selector option, e.g. index_selector = "div#index".

There are a few configurable options. You can specify element selectors for page title, excerpt, date, and author.

These are all available options:

[index]
  # Whether to generate indices or not
  # Default is false, set to true to enable
  index = false

  # Where to insert the index
  index_selector = "body"

  # Page title selector
  index_title_selector = "h1"

  # Page excerpt selector
  index_excerpt_selector = "p"

  # Page date selector
  index_date_selector = "time"

  # By default entries are sorted by date in ascending order (oldest to newest)
  # If you are making a blog, you can set this to true to sort from newest to oldest
  newest_entries_first = false

  # Date format for sorting
  # Default %F means YYYY-MM-DD
  # For other formats, see http://calendar.forge.ocamlcore.org/doc/Printer.html
  index_date_format = "%F"

  # Page author selector
  index_author_selector = "#author"

  # Mustache template for index entries
  index_item_template = "<div> <a href=\"{{url}}\">{{{title}}}</a> </div>"

  # External index generator command (a program that receives the JSON index on stdin)
  # There is no default; leave it unset unless you need it
  #index_processor =

  # When false, causes soupault to ignore index_item_template and index_processor
  # options (and their default values). Useful if you want to use named views exclusively.
  use_default_view = true

Built-in index generator

Since version 1.6, soupault includes a built-in index generator that uses Mustache templates.

The following built-in fields are supported: url, title, excerpt, date, author. For a simple blog feed, you can use something like this:

index_item_template = """
    <h2><a href="{{url}}">{{title}}</a></h2>
    <p><strong>Last update:</strong> {{date}}.</p>
    <p>{{{excerpt}}}</p>
    <a href="{{url}}">Read more</a>
  """

If you define custom fields, they also become available to the Mustache renderer. An easy way to see what fields are available to the built-in and external generators is to run soupault --debug, which prints a JSON dump of the index data among other things.

External index generators

The built-in index generator simply copies elements from the page to the index. You can easily end up with a rather odd-looking index, especially if you are using different elements on every page and identify them by id rather than element name.

Generating indices and blog feeds is where template processors really shine. Everyone has different preferences though, so instead of having a built-in template processor, soupault supports exporting the index to JSON and feeding it to an external program.

The JSON-encoded index is written to the program's standard input as a single line.2 It's a list of objects with the following fields (an illustrative example follows the list):

url
Absolute page URL path, like /papers/simple-imperative-polymorphism
nav_path
A list of strings that represents the logical section path, e.g. for /pictures/cats/grumpy it will be ["pictures", "cats"].
title, date, excerpt, author
Metadata extracted from the page. Any of them can be null.
page_file
Original page file path.
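
As an illustration (not literal output), a single-entry index for one of the example pages used earlier might look like this once pretty-printed:

[
  {
    "url": "/articles/goto-considered-harmful",
    "nav_path": ["articles"],
    "title": "Goto considered harmful",
    "date": null,
    "excerpt": null,
    "author": null,
    "page_file": "site/articles/goto-considered-harmful.html"
  }
]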

Here's an example of a very simple indexing setup that takes the first h1 of every page in a section and makes an unordered list of links to them. The external processor uses Python and Mustache templates.

First, create an index.html page in every section and include a <div id="index"> element in it.

Then write this to your config file:

[index]
  index = true
  index_selector = "#index"
  index_processor = "scripts/index.py"
  index_title_selector = "h1"

Then install the pystache library and save this script to scripts/index.py:

#!/usr/bin/env python3

import sys
import json

import pystache

template = """
<li><a href="{{url}}">{{title}}</a></li>
"""

renderer = pystache.Renderer()

# Soupault writes the JSON index to stdin as a single line
index_json = sys.stdin.readline()
index_entries = json.loads(index_json)

print("<ul class=\"nav\">")
for entry in index_entries:
    print(renderer.render(template, entry))
print("</ul>")

Index processors are not required to output anything. You can just as well use them to save the index data somewhere and create taxonomies and custom indices from it with another script, then re-run soupault to have them included in the pages.3

Multiple index views

Since version 1.7, soupault supports multiple index “views” that allow you to present the same index data in different ways. For example, you can have a blog feed in /blog, but a simple list of pages in every other section.

Views are defined in the [index.views] table. You can use either normal or inline table syntax, from TOML's point of view it's the same. Example:

[index.views.title-and-date]
  index_selector = "div#index-title-date"
  index_item_template = "<li> <a href=\"{{url}}\">{{{title}}}</a> ({{date}})</li>"

[index.views.custom]
  index_selector = "div#custom-index"
  index_processor = 'scripts/my-index-generator.pl'

Which view is used is determined by the index_selector option. In this example, if your section index page has a <div id="index-title-date">, soupault will insert an index generated by the title-and-date view, and if it has a <div id="custom-index">, it will insert the output of scripts/my-index-generator.pl.

If your page has both elements, both index views will be inserted in their containers. This may be useful if you want to include different views in the same page, e.g. show articles grouped by date and by author.

Note that defining custom views doesn't automatically make soupault ignore the index_item_template and index_processor options from the [index] table. Since the index_item_template option has a default value, and index_selector defaults to "body", you may end up with an unwanted index at the top of your page if you leave the global index_selector option undefined. If you want to use named views exclusively, add use_default_view = false to the [index] table.

Custom fields

Built-in fields should be enough for a typical blog taxonomy, but it's possible to add custom fields to your JSON index data.

Custom field queries are defined in the [index.custom_fields] table. Table keys are field names as they will appear in the exported JSON. Their values are subtables with a required selector field and an optional select_all parameter.

You can also specify a default value for a field that will be used if a matching element is not found, using the default option. For fields with the select_all flag, the default value is ignored.

[index.custom_fields]
  category = { selector = "span#category", default = "misc" }

  tags = { selector = ".tag", select_all = true }

In this example, the category field will contain the inner HTML of the first <span id="category"> element even if there's more than one in the page (if there are none, it will be set to "misc"). The tags field will contain a list of contents of all elements with class="tag".
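
If you use the built-in index generator, such custom fields can be referenced in the item template just like the built-in ones. A sketch, assuming the category field defined above:

index_item_template = """
    <div>
      <a href="{{url}}">{{{title}}}</a>
      <span>(category: {{category}})</span>
    </div>
  """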

Exporting site index to JSON

The index processor invoked with the index_processor option receives the index of the current section. It doesn't include subsections. Since the site directory is processed top to bottom, the site/index.html page would not get the global site index either.

If you want to create your own taxonomies from the metadata imported from pages, create a global site index, or an index of a section and all its subsections, you can export the aggregated index data to a file for further processing. Add this option to your index config:

[index]
  dump_json = "path/to/file.json"

This way you can use a TeX-like workflow:

  1. Run soupault so that index file is created.
  2. Run your custom index generator and save generated taxonomy pages to site/.
  3. Run soupault one more time to have them included in the build.
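
A minimal shell sketch of that workflow; generate-taxonomies is a stand-in name for your own script:

$ soupault                                        # first pass: writes path/to/file.json
$ ./generate-taxonomies path/to/file.json site/   # generate taxonomy pages from the dump
$ soupault                                        # second pass: picks up the new pages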

Behaviour

As of 1.6, the generated index is always inserted before any widgets have run, so that the HTML produced by index generators can be processed by widgets.

Metadata extraction happens as early as possible. By default, it happens before any widgets have run, to avoid adverse interaction with widgets. However, if you want data inserted by widgets to be available in the index, you can make soupault do the extraction after certain widgets with an extract_after_widgets = ["foo", "bar"] option in the [index] table.
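
For example, to delay extraction until the last-modified widget shown later in this manual has inserted its timestamp (a sketch, assuming you want that timestamp indexed):

[index]
  extract_after_widgets = ["last-modified"]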

Note that it doesn't mean no other widgets will run before metadata is extracted. It only means metadata will not be extracted until the widgets specified in the extract_after_widgets option have run. So, if you want a widget to run only after metadata is extracted, you should list all those widgets in its dependencies. There is no easier way to do that right now.

Limiting index extraction to pages or sections

Since soupault 1.9, you can extract index data only from some pages/sections, or exclude specific pages/sections from indexing. The options are the same as for widgets.
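
For example, a sketch that indexes only the blog section and skips its own index page, assuming the widget-style targeting options apply to the [index] table as described:

[index]
  section = "blog"
  exclude_page = "blog/index.html"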

Extracting the index without generating pages

Since 1.9, soupault supports the --index-only option. In the index-only mode, it will extract the index data from pages and save it as JSON, but will not generate any pages. It will only run widgets that must run before index extraction, if you have the extract_after_widgets option configured.

Widgets

Widgets provide additional functionality. When a page is processed, its content is inserted into the template, and the resulting HTML element tree is passed through a widget pipeline.

Widget behaviour

Widgets that require a selector option first check if there's an element matching that selector in the page, and do nothing if it's not found, since they wouldn't have a place to insert their output.

This way you can avoid having a widget run on a page simply by omitting the element it's looking for.

If more than one element matches the selector, the first element is used.

Widget configuration

Widget configuration is stored in the [widgets] table. The TOML syntax for nested tables is [table.subtable], therefore, you will have entries like [widgets.foo], [widgets.bar] and so on.

Widget subtable names are purely informational and have no effect, the widget type is determined by the widget option. Therefore, if you want to use a hypothetical frobnicator widget, your entry will look like:

[widgets.frobnicate]
  widget = "frobnicator"
  selector = "div#frob"

It may seem confusing and redundant, but if the subtable name defined the widget type, you could only have one widget of each type and would have to choose, for example, whether to include the header or the footer with the include widget.

Choosing where to insert the output

By default, widget output is inserted after the last child element in its container.

If you have a designated place in your page for the widget output, e.g. leave an empty <div id="header"> for the header, then this detail is unimportant.

However, if you are modifying existing pages or just want more control and flexibility, you can specify the insert position using the action option.

For example, you can insert a header file before the first element in the page <body>:

[widgets.insert-header]
  widget = "include"
  file = "templates/header.html"
  selector = "body"
  action = "prepend_child"

Or insert a table of contents before the first <h1> element (if a page has one):

[widgets.table-of-contents]
  widget = "toc"
  selector = "h1"
  action = "insert_before"

Possible values of that option are: prepend_child, append_child, insert_before, insert_after, replace_element, replace_content.

Limiting widgets to pages or sections

If the widget target comes from the page content rather than the template, you can simply not include any elements matching its selector option.

Otherwise, you can explicitly set a widget to run or not run on specific pages or sections.

All options from this section can take either a single string, or a list of strings.

Limiting to pages or sections

There are page and section options that allow you to specify exact paths to specific pages or sections. Paths are relative to your site directory.

The page option limits a widget to an exact page file, while the section option applies a widget to all files in a subdirectory.

[widgets.site-news]
  # only on site/index.html and site/news.html
  page = ["index.html", "news.html"]

  widget = "include"
  file = "includes/site-news.html"
  selector = "div#news"

[widgets.cat-picture]
  # only on site/cats/*
  section = "cats"

  widget = "insert_html"
  html = "<img src=\"/images/lolcat_cookie.gif\" />"
  selector = "#catpic"

Excluding sections or pages

It's also possible to explicitly exclude pages or sections.

[widgets.toc]
  # Don't add a TOC to the main page
  exclude_page = "index.html"
  ...

[widgets.evil-analytics]
  exclude_section = "privacy"
  ...

Using regular expressions

When nothing else helps, path_regex and exclude_path_regex options may solve your problem. They take a Perl-compatible regular expression (not a glob).

[widgets.toc]
  # Don't add a TOC to any section index page
  exclude_path_regex = '^(.*)/index\.html$'
  ...

[widgets.cat-picture]
  path_regex = 'cats/'

Widget processing order

The order of widgets in your config file doesn't determine their processing order. By default, soupault assumes that widgets are independent and can be processed in arbitrary order. In future versions they may even be processed in parallel, who knows.

This can be an issue if one widget relies on output from another. In that case, you can order widgets explicitly with the after parameter. It can be a single widget (after = "mywidget") or a list of widgets (after = ["some-widget", "another-widget"]).

Here is an example. Suppose in the template there's a <div id="breadcrumbs"> where breadcrumbs are inserted by the add-breadcrumbs widget. Since there may not be breadcrumbs if the page is not deep enough, the div may be left empty, and that's not neat, so the cleanup-breadcrumbs widget removes it.

## Breadcrumbs
[widgets.add-breadcrumbs]
  widget = "breadcrumbs"
  selector = "#breadcrumbs"
  # <omitted>

## Remove div#breadcrumbs if the breadcrumbs widget left it empty
[widgets.cleanup-breadcrumbs]
  widget = "delete_element"
  selector = "#breadcrumbs"
  only_if_empty = true

  # Important!
  after = "add-breadcrumbs"

Limiting widgets to “build profiles”

Sometimes you may want to enable certain widgets only for some builds. For example, include analytics scripts only in production builds. It can be done with “build profiles”.

For example, this way you can only include includes/analytics.html file in your pages for a build profile named “live”:

[widgets.analytics]
  profile = "live"
  widget = "include"
  file = "includes/analytics.html"
  selector = "body"

Soupault will only process that widget if you run soupault --profile live. If you run soupault --profile dev, or run it without the --profile option, it will ignore that widget.

Built-in widgets

File and output inclusion

These widgets include something into your page: a file, a snippet, or output of an external program.

include

The include widget simply reads a file and inserts its content into some element.

The following configuration will insert the content of templates/header.html into an element with id="header" and the content of templates/footer.html into an element with id="footer".

[widgets.header]
  widget = "include"
  file = "templates/header.html"
  selector = "#header"

[widgets.footer]
  widget = "include"
  file = "templates/footer.html"
  selector = "#footer"

This widget provides a parse option that controls whether the file is parsed or included as a text node. Use parse = false if you want to include a file verbatim, with HTML special characters escaped.
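
For example, a sketch that includes a plain text file verbatim; the file path and selector here are illustrative:

[widgets.license-text]
  widget = "include"
  file = "templates/license.txt"
  selector = "#license"
  parse = false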

insert_html

For a small HTML snippet, a separate file may be too much. The insert_html widget inserts an HTML string from the config right into the page:

[widgets.tracking-script]
  widget = "insert_html"
  html = '<script src="/scripts/evil-analytics.js"> </script>'
  selector = "head"

exec

The exec widget executes an external program and includes its output into an element. The program is executed in shell, so you can write a complete command with arguments in the command option. Like the include widget, it has a parse option that includes the output verbatim if set to false.

Simple example: page generation timestamp.

[widgets.generated-on]
  widget = "exec"
  selector = "#generated-on"
  command = "date -R"

preprocess_element

This widget processes element content with an external program and includes its output back in the page.

Element content is sent to program's stdin, so it can be used with any program designed to work as a pipe. HTML entities are expanded, so if you have &gt; or &amp; in your page, the program gets a > and &.

By default it assumes that the program output is HTML and runs it through an HTML parser. If you want to include its output as text (with HTML special characters escaped), specify parse = false.

For example, this is how you can run content of <pre> elements through cat -n to automatically add line numbers:

[widgets.line-numbers]
  widget = "preprocess_element"
  selector = "pre"
  command = "cat -n"
  parse = false

You can pass element metadata to the program for better control. The tag name is passed in the TAG_NAME environment variable, and all attributes are passed in environment variables prefixed with ATTR_: ATTR_ID, ATTR_CLASS, ATTR_SRC, and so on.

For example, highlight, a popular syntax highlighting tool, has a language syntax option, e.g. --syntax=python. If your elements that contain source code samples have language specified in a class (like <pre class="language-python">), you can extract the language from the ATTR_CLASS variable like this:

# Runs the content of <* class="language-*"> elements through a syntax highlighter
[widgets.highlight]
  widget = "preprocess_element"
  selector = '*[class^="language-"]'
  command = 'highlight -O html -f --syntax=$(echo $ATTR_CLASS | sed -e "s/language-//")'

Like all widgets, this widget supports the action option. The default is action = "replace_content", but using different actions you can insert a rendered version of the content alongside the original. For example, insert an inline SVG version of every Graphviz graph next to the source, and then highlight the source:

[widgets.graphviz-svg]
  widget = 'preprocess_element'
  selector = 'pre.language-graphviz'
  command = 'dot -Tsvg'
  action = 'insert_after'

[widgets.highlight]
  after = "graphviz-svg"
  widget = "preprocess_element"
  selector = '*[class^="language-"]'
  command = 'highlight -O html -f --syntax=$(echo $ATTR_CLASS | sed -e "s/language-//")'

Note: this widget supports multiple selectors, e.g. selector = ["pre", "code"].

Environment variables

External programs executed by exec and preprocess_element widgets get a few useful environment variables:

PAGE_FILE
Path to the page source file, relative to the current working directory (e.g. site/index.html).
TARGET_DIR
The directory where the rendered page will be saved.

This is how you can include a page's own source in the page, on a UNIX-like system:

[widgets.page-source]
  widget = "exec"
  selector = "#page-source"
  parse = false
  command = "cat $PAGE_FILE"

If you store your pages in git, you can get a page timestamp from the git log with a similar method (note that it's not a very fast operation for long commit histories):

[widgets.last-modified]
  widget = "exec"
  selector = "#git-timestamp"
  command = "git log -n 1 --pretty=format:%ad --date=format:%Y-%m-%d -- $PAGE_FILE"

The PAGE_FILE variable can be used in many different ways, for example, you can use it to fetch the page author and modification date from a revision control system like git or mercurial.

The TARGET_DIR variable is useful for scripts that modify or create page assets. For example, this snippet will create PNG images from Graphviz graphs inside <pre class="graphviz-png"> elements and replace those <pre> elements with relative links to the images.

[widgets.graphviz-png]
  widget = 'preprocess_element'
  selector = '.graphviz-png'
  command = 'dot -Tpng > $TARGET_DIR/graph_$ATTR_ID.png && echo \<img src="graph_$ATTR_ID.png"\>'
  action = 'replace_element'

Content manipulation

title

The title widget sets the page title (the content of the <title> element) from an element matched by a certain selector. If there is no <title> element in the page, this widget assumes you don't want one and does nothing.

Example:

[widgets.page-title]
  widget = "title"
  selector = "h1"
  default = "My Website"
  append = " on My Website"
  prepend = "Page named "

If selector is not specified, it uses the first <h1> as the title source element by default.

The selector option can be a list. For example, selector = ["h1", "h2", "#title"] means “use the first <h1> if the page has it, else use <h2>, else use anything with id="title", else use default”.

Optional prepend and append parameters allow you to insert some text before and after the title.

If there is no element matching the selector in the page, it will use the title specified in default option. In that case the prepend and append options are ignored.

If the title source element is absent and default title is not set, the title is left empty.

footnotes

The footnotes widget finds all elements matching a selector, moves them to a designated footnotes container, and replaces them with numbered links.4 As usual, the container element can be anywhere in the page—you can have footnotes at the top if you feel like it.

[widgets.footnotes]
  widget = "footnotes"

  # Required: Where to move the footnotes
  selector = "#footnotes"

  # Required: What elements to consider footnotes
  footnote_selector = ".footnote"

  # Optional: Element to wrap footnotes in, default is <p>
  footnote_template = "<p> </p>"

  # Optional: Element to wrap the footnote number in, default is <sup>
  ref_template = "<sup> </sup>"

  # Optional: Class for footnote links, default is none
  footnote_link_class = "footnote"

  # Optional: do not create links back to original locations
  back_links = true

  # Prepends some text to the footnote id
  link_id_prepend = ""

  # Appends some text to the back link id
  back_link_id_append = ""

The footnote_selector option can be a list, in that case all elements matching any of those selectors will be considered footnotes.

By default, the number in front of a footnote is a hyperlink back to the original location. You can disable it and make footnotes one-way links with back_links = false.

You can create a custom “namespace” for footnotes and reference links using the link_id_prepend and back_link_id_append options. This makes it easier to use custom styling for those elements.

link_id_prepend = "footnote-"
back_link_id_append = "-ref"

toc

The toc widget generates a table of contents for your page.

The table of contents is generated from heading tags, <h1> through <h6>.

Here is the ToC configuration from this website:

[widgets.table-of-contents]
  widget = "toc"

  # Required: where to insert the ToC
  selector = "#generated-toc"

  # Optional: minimum and maximum levels, defaults are 1 and 6 respectively
  min_level = 2
  max_level = 6

  # Optional: use <ol> instead of <ul> for ToC lists
  # Default is false
  numbered_list = false

  # Optional: Class for the ToC list element, default is none
  toc_list_class = "toc"

  # Optional: append the heading level to the ToC list class
  # In this example list for level 2 would be "toc-2"
  toc_class_levels = false

  # Optional: Insert "link to this section" links next to headings
  heading_links = true

  # Optional: text for the section links
  # Default is "#"
  heading_link_text = "→ "

  # Optional: class for the section links
  # Default is none
  heading_link_class = "here"

  # Optional: insert the section link after the header text rather than before
  # Default is false
  heading_links_append = false

  # Optional: use header text slugs for anchors
  # Default is false
  use_heading_slug = true

  # Optional: use unchanged header text for anchors
  # Default is false
  use_heading_text = false

  # Place nested lists inside a <li> rather than next to it
  valid_html = false

Choosing the heading anchor options

For the table of contents to work, every heading needs a unique id attribute that can be used as an anchor.

If a heading has an id attribute, it will be used for the anchor. If it doesn't, soupault has to generate one.

By default, if a heading has no id, soupault will generate a unique numeric identifier for it.5 This is safe, but not very good for readers (links are non-indicative) and for people who want to share direct links to sections (they will change if you add more sections).

If you want to find a balance between readability, permanence, and ease of maintenance, there are a few ways you can do it and the choice is yours.

The use_heading_slug = true option converts the heading text to a valid HTML identifier. Right now, however, it's very aggressive and replaces everything other than ASCII letters and digits with hyphens. This is obviously a no go for non-ASCII languages, that is, pretty much all languages in the world. It may be implemented more sensibly in the future.

The use_heading_text = true option uses unmodified heading text for the id, with whitespace and all. This is against the rules of HTML, but seems to work well in practice.

Note that use_heading_slug and use_heading_text do not enforce uniqueness.

All in all, for best link permanence you should give every heading a unique id by hand, and for best readability you may want to go with use_heading_text = true.
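
For example, a hand-written id keeps the anchor stable even if you later reword the heading:

<h2 id="installation">How to install soupault</h2>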

breadcrumbs

The breadcrumbs widget generates breadcrumbs for the page.

The only required parameter is selector, the rest is optional.

[widgets.breadcrumbs]
  widget = "breadcrumbs"

  selector = "#breadcrumbs"
  prepend = ".. / "
  append = " /"
  between = " / "
  breadcrumb_template = "<a class=\"nav\"></a>"
  min_depth = 1

The breadcrumb_template is an HTML snippet that can be used for styling your breadcrumbs. It must have an <a> element in it. By default, a simple unstyled link is used.

The min_depth sets the minimum nesting depth where breadcrumbs appear. That's the length of the logical navigation path rather than directory path.

There is a fixup that decrements the path for section index pages, that is, pages named index by default, or whatever is specified in the index_page option. Their navigation path is considered one level shorter than that of any other page in the section, when clean URLs are used. This is to prevent section index pages from having links to themselves.

  • site/index.html → 0
  • site/foo/index.html → 0 (sic!)
  • site/foo/bar.html → 1

HTML manipulation

delete_element

The opposite of insert_html. Deletes elements that match a selector. It can be useful in two situations:

  • Another widget may leave an element empty and you want to clean it up.
  • Your pages are generated with another tool and it inserts something you don't want.

# Who reads footers anyway?
[widgets.delete_footer]
  widget = "delete_element"
  selector = "#footer"

You can limit it to deleting only empty elements with the only_if_empty = true option. An element is considered empty if there's nothing but whitespace inside it.

It's possible to delete only the first element matching a selector by adding delete_all = false to its config.
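
For example, a sketch that deletes only the first matching element; the .draft-note class is illustrative:

[widgets.remove-first-note]
  widget = "delete_element"
  selector = ".draft-note"
  delete_all = false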

Plugins

Since version 1.2, soupault can be extended with Lua plugins. Currently there are the following limitations:

  • The supported language is Lua 2.5, not modern Lua 5.x. That means no closures and no for loops in particular. Here's a copy of the Lua 2.5 reference manual.
  • Only string and integer options can be passed to plugins via widget options from soupault.conf

Plugins are treated like widgets and are configured the same way.

You can find ready to use plugins in the Plugins section on this site.

Installing plugins

Plugin discovery

By default, soupault looks for plugins in the plugins/ directory. Suppose you want to use the Site URL plugin. To use that plugin, save it to plugins/site-url.lua. Then a widget named site-url will automatically become available. The site_url option from the widget config will be accessible to the plugin as config["site_url"].

[widgets.absolute-urls]
  widget = "site-url"
  site_url = "https://www.example.com"

You can specify multiple plugin directories using the plugin_dirs option under [settings]:

[settings]
  plugin_dirs = ["plugins", "/usr/share/soupault/plugins"]

If a file with the same name is found in multiple directories, soupault will use the file from the first directory in the list.

You can also disable plugin discovery and load all plugins explicitly.

[settings]
  plugin_discovery = false

Explicit plugin loading

You can also load plugins explicitly. This can be useful if you want to:

  • load a plugin from an unusual directory
  • use a widget name not based on the plugin file name
  • override a built-in widget with a plugin

Suppose you want to associate the site-url.lua plugin with absolute-links widget name. Add this snippet to soupault.conf:

[plugins.absolute-links]
  file = "plugins/site-url.lua"

It will register the plugin as a widget named absolute-links.

Then you can use it like any other widget. Plugin subtable name becomes the name of the widget, in our case absolute-links.

[widgets.make-urls-absolute]
  widget = "absolute-links"
  site_url = "https://www.example.com"

If you want to write your own plugins, read on.

Plugin example

Here's the source of that Site URL plugin that converts relative links to absolute URLs by prepending a site URL to them:

-- Converts relative links to absolute URLs
-- e.g. "/about" -> "https://www.example.com/about"

-- Get the URL from the widget config
site_url = config["site_url"]

if not Regex.match(site_url, "(.*)/$") then
  site_url = site_url .. "/"
end

links = HTML.select(page, "a")

-- Lua array indices start from 1
local index = 1
while links[index] do
  link = links[index]
  href = HTML.get_attribute(link, "href")
  if href then
    -- Check if URL schema is present
    if not Regex.match(href, "^([a-zA-Z0-9]+):") then
      -- Remove leading slashes
      href = Regex.replace(href, "^/*", "")
      href = site_url .. href
      HTML.set_attribute(link, "href", href)
    end
  end
  index = index + 1
end

In short:

  • Widget options can be retrieved from the config hash table.
  • The element tree of the page is in the page variable. You can think of it as an equivalent of document in JavaScript.
  • HTML.select() function is like document.querySelectorAll in JS.
  • The HTML module provides an API somewhat similar to DOM API in browsers, though it's procedural rather than object-oriented.

Plugin environment

Plugins have access to the following global variables:

page
The page element tree that can be manipulated with functions from the HTML module.
page_file
Page file path, e.g. site/index.html
target_dir
The directory where the page file will be saved, e.g. build/about/.
nav_path
List of strings representing the logical navigation path. For example, for site/foo/bar/quux.html it's ["foo", "bar"].
page_url
Relative page URL, e.g. /articles or /articles/index.html, depending on the clean_urls setting.
config
A table with widget config options.

Note: only string and integer options can be passed to plugins through the config table. You cannot pass TOML lists or inline tables to plugins in the current soupault version. You sort of can pass booleans, but they will be converted to strings "true" and "false", so you need to compare explicitly.
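
For example, a boolean widget option has to be compared against its string form; some_flag here is a hypothetical option name:

-- In soupault.conf, some_flag = true reaches the plugin as the string "true"
flag = config["some_flag"]
if flag == "true" then
  enabled = 1
else
  enabled = nil
end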

Plugin API

Apart from the standard Lua 2.5 functions, soupault provides additional modules.

The HTML module

HTML.parse(string)

Example: h = HTML.parse("<p>hello world<p>")

Parses a string into an HTML element tree

HTML.create_element(tag, text)

Example: h = HTML.create_element("p", "hello world")

Creates an HTML element node.

HTML.create_text(string)

Example: h = HTML.create_text("hello world")

Creates a text node that can be inserted into the page just like element nodes. This function automatically escapes all HTML special characters inside the string.

HTML.inner_html(html)

Example: h = HTML.inner_html(HTML.select(page, "body"))

Returns element content as a string.

HTML.strip_tags(html)

Example: h = HTML.strip_tags(HTML.select(page, "body"))

Returns element content as a string, with all HTML tags removed.

HTML.select(html, selector)

Example: links = HTML.select(page, "a")

Returns a list of elements matching specified selector.

HTML.select_one(html, selector)

Example: content_div = HTML.select_one(page, "div#content")

Returns the first element matching specified selector, or nil if none are found.

HTML.select_any_of(html, selectors)

Example: link_or_pic = HTML.select_any_of(page, {"a", "img"})

Returns the first element matching any of specified selectors.

HTML.select_all_of(html, selectors)

Example: links_and_pics = HTML.select_all_of(page, {"a", "img"})

Returns all elements matching any of specified selectors.

HTML.parent(html_element)

Returns the element's parent.

Example: if there's an element that has a <blink> in it, insert a warning just before that element.

blink_elem = HTML.select_one(page, "blink")
if blink_elem then
  parent = HTML.parent(blink_elem)
  warning = HTML.create_element("p", "Warning: blink element ahead!")
  HTML.insert_before(parent, warning)
end

Access to surrounding elements

This family of functions provides access to an element's neighboring elements in the tree. They all return a (possibly empty) list.

  • HTML.children
  • HTML.ancestors
  • HTML.descendants
  • HTML.siblings

Example: add class="silly-class" to every element inside the page <body>.

body = HTML.select_one(page, "body")
children = HTML.children(body)

local i = 1
while children[i] do
  if HTML.is_element(children[i]) then
    HTML.add_class(children[i], "silly-class")
  end
  i = i + 1
end

HTML.is_element

Web browsers provide a narrower API than general purpose HTML parsers. In the JavaScript DOM API, element.children provides access to all child elements of an element.

However, in the HTML parse tree, the picture is more complex. Text nodes are also child nodes—browsers just filter those out because JavaScript code rarely has a need to do anything with text nodes.

Consider this HTML: <p>This is a <em>great</em> paragraph</p>. How many children does the <p> element have? In fact, three: text("This is a "), element("em", "great"), text(" paragraph").

The goal of soupault is to allow modifying HTML pages in any imaginable way, so it cannot ignore this complexity. Many operations like HTML.add_class still make no sense for text nodes, so there has to be a way to check if something is an element or not.

That's where HTML.is_element comes in handy.

HTML.get_attribute(html_element, attribute)

Example: href = HTML.get_attribute(link, "href")

Returns the value of an element attribute. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element.

If the attribute is missing, it returns nil. If the attribute is present but its value is empty (like in <elem attr=""> or <elem attr>), it returns an empty string. In Lua, both empty string and nil are false for the purpose of if value then ... end, so if you want to check for presence of an attribute regardless of its value, you should explicitly check for nil.
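
For example, here is a minimal sketch that distinguishes a missing href attribute from an empty one for the first link in the page:

link = HTML.select_one(page, "a")
if link then
  href = HTML.get_attribute(link, "href")
  if href == nil then
    -- The attribute is missing entirely
    Log.warning("The first link has no href attribute")
  end
  if href == "" then
    -- The attribute is present, but its value is empty
    Log.warning("The first link has an empty href attribute")
  end
end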

HTML.set_attribute(html_element, attribute, value)

Example: HTML.set_attribute(content_div, "id", "content")

Sets an attribute value. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element.

HTML.add_class(html_element, class_name)

Example: HTML.add_class(p, "centered")

Adds a class="class_name" attribute. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element.

HTML.remove_class(html_element, class_name)

Example: HTML.remove_class(p, "centered")

Removes class_name from the element's class attribute. The first argument must be an element reference produced by HTML.select/HTML.select_one/HTML.select_element.

HTML.get_tag_name(html_element)

Returns the tag name of an element.

Example: link_or_pic = HTML.select_any_of(page, {"a", "img"}); tag_name = HTML.get_tag_name(link_or_pic)

HTML.set_tag_name(html_element, tag_name)

Changes the tag name of an element.

Example: "modernize" <blink> elements by converting them to <span class="blink">.

blinks = HTML.select(page, "blink")

local i = 1
while blinks[i] do
  elem = blinks[i]
  HTML.set_tag_name(elem, "span")
  HTML.add_class(elem, "blink")

  i = i + 1
end

HTML.append_child(parent, child)

Example: HTML.append_child(page, HTML.create_element("br"))

Appends a child element to the parent.

Apart from HTML.append_child, there's a family of functions that insert an element at a different position in the tree: HTML.prepend_child, HTML.insert_before, HTML.insert_after, HTML.replace_element, and HTML.replace_content.
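
For example, here is a minimal sketch that uses HTML.prepend_child to insert a notice as the first child of the page <body> (the notice text is just for illustration):

body = HTML.select_one(page, "body")
if body then
  notice = HTML.create_element("p", "This page is a draft")
  HTML.prepend_child(body, notice)
end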

HTML.delete(html_element)

Example: HTML.delete(HTML.select_one(page, "h1"))

Deletes an element from the page. The argument must be an element reference returned by a select function.

HTML.delete_content(html_element)

Deletes all children of an element (but leaves the element itself in place).

HTML.clone_content(html_element)

Creates a new element tree from the content of an element.

Useful for duplicating an element elsewhere in the page.
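
For example, here is a minimal sketch that copies the content of a <div id="header"> into a <div id="footer"> (the element ids are assumptions for illustration):

header = HTML.select_one(page, "div#header")
footer = HTML.select_one(page, "div#footer")
if header and footer then
  header_copy = HTML.clone_content(header)
  HTML.append_child(footer, header_copy)
end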

HTML.get_heading_level(html_element)

For elements whose tag name matches the <h[1-9]> pattern, returns the heading level.

Returns zero for elements whose tag name doesn't look like a heading and for values that aren't HTML elements.

HTML.get_headings_tree(html_element)

Returns a table that represents the tree of HTML document headings, in a format like this:

[
  {
    "heading": ...,
    "children": [
      {"heading": ..., "children": []}
    ]
  },
  {"heading": ..., "children": []}
]

Values of the heading fields are HTML element references.
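
For example, here is a minimal sketch that counts all headings in the page by recursively walking that tree (it assumes the list layout shown above, with indices starting from 1):

function count_headings(nodes)
  local count = 0
  local i = 1
  while nodes[i] do
    -- Count this heading, then descend into its subheadings
    count = count + 1
    count = count + count_headings(nodes[i]["children"])
    i = i + 1
  end
  return count
end

headings_tree = HTML.get_headings_tree(page)
heading_count = count_headings(headings_tree)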

Behaviour

If an element tree access function cannot find any elements (e.g. there are no elements that match a selector), it returns nil.

If a function that expects an HTML element receives nil, it immediately returns nil, so you don't need to check for nil at every step and can safely chain calls to those functions.
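
For example, the following sketch is safe even if the page has no <blink> elements at all: HTML.select_one returns nil, and the subsequent calls simply return nil without raising errors.

blink = HTML.select_one(page, "blink")
HTML.add_class(blink, "obsolete")
HTML.set_attribute(blink, "title", "This element is obsolete")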

The Regex module

Regular expressions used by this module are mostly Perl-compatible. Capturing groups and back references are not supported.

Regex.match(string, regex)

Example: Regex.match("/foo/bar", "^/")

Checks if a string matches a regex.

Regex.find_all(string, regex)

Example: matches = Regex.find_all("/foo/bar", "([a-z]+)")

Returns a list of substrings matching a regex.

Regex.replace(string, regex, string)

Example: s = Regex.replace("/foo/bar", "^/", "")

Replaces the first occurrence of a matching substring. It returns a new string and doesn't modify the argument.

Regex.replace_all(string, regex, string)

Example: Regex.replace_all("/foo/bar", "/", "")

Replaces every matching substring. It returns a new string and doesn't modify the argument.

Regex.split(string, regex)

Example: substrings = Regex.split("foo/bar", "/")

Splits a string at a separator.

The Plugin module

Provides functions for communicating with the plugin runner code.

Plugin.fail

Example: Plugin.fail("Error occurred")

Stops plugin execution immediately and signals an error. Errors raised this way are treated as widget processing errors by soupault, for the purpose of the strict option.

Plugin.exit

Example: Plugin.exit("Nothing to do"), Plugin.exit()

Stops plugin execution immediately. The message is optional. This kind of termination is not considered an error by soupault.

Plugin.require_version

Example: Plugin.require_version("1.8.0")

Stops plugin execution if the soupault version is older than the required version. You can give a full version like 1.9.0 or a short version like 1.9. This function was introduced in 1.8, so plugins that use it will fail to work in 1.7 and below.

The Sys module

Sys.read_file

Example: Sys.read_file("site/index.html")

Reads a file into a string. The path is relative to the working directory.

Sys.run_program(command)

Executes given command in the system shell (/bin/sh on UNIX, cmd.exe on Windows).

The output of the command is ignored. If the command fails, its stderr is logged.

Example: creating a silly file in the directory where the generated page will be stored.

res = Sys.run_program(format("echo \"Kilroy was here\" > %s/graffiti", target_dir))
if not res then
  Log.warning("Damn, busted")
end

The intended use case for it is creating and processing assets, e.g. converting images to different formats.
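
For example, here is a sketch that converts a cover image in the output directory to WebP with ImageMagick (the convert command and the file names are assumptions, not something soupault provides):

-- Assumes ImageMagick's convert is installed and cover.png exists in target_dir
res = Sys.run_program(format("convert %s/cover.png %s/cover.webp", target_dir, target_dir))
if not res then
  Log.warning("Failed to convert the cover image to WebP")
end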

Sys.get_program_output(command)

Executes a command in the system shell and returns its output.

If the command fails, it returns nil. Its stderr is shown in the execution log, but a plugin has no way to access the stderr output or the exit code.

Example: getting the last modification date of a page from git.

git_command = "git log -n 1 --pretty=format:%ad --date=format:%Y-%m-%d -- " .. page_file
timestamp = Sys.get_program_output(git_command)

if not timestamp then
  timestamp = "1970-01-01"
end

Sys.random(max)

Example: Sys.random(1000)

Generates a random number from 0 to max.

Sys.is_unix()

Returns true on UNIX-like systems (Linux, Mac OS, BSDs), false otherwise.

Sys.is_windows()

Returns true on Microsoft Windows, false otherwise.

The String module

String.trim

Example: String.trim(" my string ") -- produces "my string"

Removes leading and trailing whitespace from a string.

String.slugify_ascii

Example: String.slugify_ascii("My Heading") -- produces "my-heading"

Replaces all characters other than English letters and digits with hyphens, exactly like the ToC widget.

String.truncate(string, length)

Truncates a string to the given length.

Example: String.truncate("foobar", 3) -- "foo".

String.to_number

Example: String.to_number("2.7") -- produces 2.7 (float)

Converts strings to numbers. Returns nil if a string isn't a valid representation of a number.


1 On my desktop with an i5-7260U CPU and a magnetic drive.

2 Newline as end of message is a horrible protocol, but since there's no universally agreed upon alternative for sending structured data to stdin, that's what we've got.

3 TeX users are familiar with this approach.

4 As if anyone doesn't know what a footnote is.

5 Much like the footnotes widget.