Getting started: HTML processor

One feature that sets soupault apart from other website generators is that the “generator” part is optional. You can use it as an HTML processor for existing websites without modifying a single page.

In this guide we’ll set up soupault to add some meta tags to every page, set the <title> to the first heading, and insert tables of contents.

This guide assumes that you already have a static website, either handwritten or generated with another tool.

If you don’t have soupault on your machine yet, you should install it. The soupault (soupault.exe on Windows) executable is standalone and has no dependencies, so you only need to copy it somewhere to ‘install’ it.

Create a basic config

First you should create a directory for your project. For example:

$ mkdir mysite

Soupault’s workflow is defined in a single configuration file, soupault.conf. It’s a file in the TOML format.

To start, we’ll write a basic config for running soupault as an HTML processor:

[settings]
  strict = false
  verbose = true

  generator_mode = false

  clean_urls = false

  build_dir = "build"
  site_dir = "site"

  doctype = "<!DOCTYPE html>"

  page_file_extensions = ["htm", "html"]

Save that file to mysite/soupault.conf.

The generator_mode = false option tells soupault not to look for or use a page template.

With clean_urls = false we tell soupault to preserve file paths exactly, e.g. site/about.html will become build/about.html.

If you want to automatically convert your site to use clean URLs along the way, you can use clean_urls = true instead. Then site/about.html will become build/about/index.html and so on.
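As a minimal sketch, the clean URLs variant differs from the config above in a single line. Only the changed option is shown here; the rest of the [settings] table is assumed to stay as it was:

```toml
[settings]
  # Only this option changes compared to the basic config.
  # site/about.html is then written to build/about/index.html
  clean_urls = true
```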

The page_file_extensions = ["htm", "html"] option from our config tells soupault to treat files with extensions .htm and .html as pages (parse, process, and output). All other files will be simply copied unchanged.

Configuring the source directory

The site_dir option tells soupault where to look for page source files. In our config, we have site_dir = "site", which means you should copy your pages to mysite/site to have them processed.

However, if you already have a directory with your site somewhere, you can simply point soupault to it. For example:

[settings]
  site_dir = '/home/jrandomhacker/homepage'

Or, on Windows:

[settings]
  site_dir = 'C:\Users\jrandomhacker\homepage'

Note that soupault never modifies anything in the site_dir, so it’s a safe thing to do.

Run soupault

Now you can run soupault:

$ cd mysite
$ soupault

We’ve set build_dir = "build" in the config, so it will create a mysite/build directory and output processed pages to it. Just like with site_dir, you can set build_dir to an arbitrary directory, even outside the project directory.
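For illustration, here is a sketch of a config that reads pages from one directory and writes the output to another one outside the project. Both paths are hypothetical; substitute your own:

```toml
[settings]
  # Hypothetical paths, for illustration only
  site_dir = '/home/jrandomhacker/homepage'
  build_dir = '/home/jrandomhacker/homepage-build'
```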

There’s no built-in web server in soupault, but you can use any web server you like for preview, for example the http.server module that comes with Python:

$ python3 -m http.server --directory build

The output will be a more or less exact copy of your source directory. Soupault will set the doctype of the pages to <!DOCTYPE html> as per the doctype option.

Configure widgets

General purpose text preprocessors usually work by looking for special directives in files, like #include "myfile.html" or <a href="{{site_url}}">Home</a>, and replacing them with something else. The downsides of that approach are that (a) you have to modify the page to have it processed, and (b) generated content ends up in a fixed place.

That is not how soupault works. Parsing HTML into an element tree allows it to see the page structure and modify pages regardless of their exact layout. However, it also requires a different approach to telling it what to do with the pages.

Instead of template variables and filters, soupault provides a set of HTML transformation modules. Some are low level and simple, like insert_html and include. Other modules have more logic in them, like toc and footnotes. To identify the source and target elements for transformation, they use CSS selectors.

If you are familiar with DOM manipulation in JavaScript, it’s the same concept as document.querySelector(".myclass"). You can use any CSS selectors, like h1 (first <h1> element), div#content (<div id="content">), .footnote (any element with class="footnote"), or div#content p (first paragraph inside <div id="content">).

HTML transformation modules are called “widgets”, for lack of a better word. They are configured in the [widgets] table of the config file. TOML uses a dot as a table name separator, so options for a widget named set-title will be in [widgets.set-title].

Here’s an example of a config with two widgets:

[settings]
  strict = false
  verbose = true

  generator_mode = false

  clean_urls = false

  build_dir = "build"
  site_dir = "site"

  doctype = "<!DOCTYPE html>"

  page_file_extensions = ["htm", "html"]

[widgets.set-title]
  widget = "title"
  selector = "h1"
  default = "My website"

[widgets.generator-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="generator" content="soupault">'

Now let’s see how to automatically enhance a website with some widgets.

Setting the page title

On many websites, the page title is the same as the first heading. You can easily automate that using the title widget. It takes the content from the element you specify in its selector option and inserts it into the page <title>.

[widgets.set-title]
  widget = "title"
  selector = "h1"
  default = "My website"

Some widgets allow you to specify more than one selector, and title is one of those:

[widgets.set-title]
  widget = "title"
  selector = ["h1", "h2"]
  default = "My website"

With this config it will check if the page has an <h1> element, and use its content for the title. If there is no such element, it will try <h2> instead. If all else fails, it will set the title to My website.
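The title widget has a few more options than this guide covers. The sketch below assumes the append option works as described in the reference manual, adding a fixed suffix to whatever text was extracted. The suffix itself is only an illustration:

```toml
[widgets.set-title]
  widget = "title"
  selector = ["h1", "h2"]
  default = "My website"
  # Assumption: "append" adds this string after the extracted heading text.
  # Check the reference manual for your soupault version.
  append = " | My website"
```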

Adding a meta tag

The reason widget name and type are separate things is that you may want to have multiple widgets of the same type.

For example, suppose you want to add two <meta> tags to the <head> of each page, one to tell mobile browsers to behave like every sensible browser should, the other to tell everyone you are using soupault.

You can do it with the insert_html widget. It has a selector option that defines where the content is inserted, and an html option for the HTML snippet to insert.

You can combine both meta tags in one snippet and it will work, but it may be better to use two independent widgets for that:

[widgets.viewport-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="viewport" content="width=device-width, initial-scale=1">'

[widgets.generator-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="generator" content="soupault">'

Meta tags always go to the page <head>, so naturally we use selector = "head".
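For comparison, the single-widget variant mentioned above could look roughly like this. The widget name all-meta is made up for this sketch, and the snippet is kept on one line so that it fits in a plain TOML literal string:

```toml
[widgets.all-meta]
  widget = 'insert_html'
  selector = 'head'
  # Both tags in one snippet: they are always inserted together, in this order
  html = '<meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="generator" content="soupault">'
```

The trade-off is that you can no longer enable, disable, or reorder the two tags independently.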

By default, soupault inserts new content after the last child of the target element.

So if your source page looked like:

<head>
  <style>h1 { color: red; }</style>
</head>

after processing it will look like:

<head>
  <style>h1 { color: red; }</style>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta name="generator" content="soupault">
</head>

The order of the meta tags can be different, but they will come after the <style> tag that originally was there.

What if you want the generator tag to always come after the viewport tag? That is possible. In soupault, widgets form a pipeline. Output of one widget can be used as input for another. Or you can just schedule their execution order for aesthetic reasons, nothing wrong with it:

[widgets.viewport-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="viewport" content="width=device-width, initial-scale=1">'

[widgets.generator-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="generator" content="soupault">'
  # Run after viewport-meta
  after = "viewport-meta"
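A widget can also wait for more than one other widget. The sketch below assumes that the after option accepts a list of widget names as well as a single name, as described in the reference manual. The description-meta widget and its content are hypothetical:

```toml
[widgets.description-meta]
  widget = 'insert_html'
  selector = 'head'
  html = '<meta name="description" content="My personal homepage">'
  # Assumption: a list means "run after all of these widgets"
  after = ["viewport-meta", "generator-meta"]
```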

Including a file and choosing where to insert it

Now, suppose you want to add a header to every page, and you want to keep it in a separate file. For example:

$ echo "Please read: a personal appeal from the webmaster" > header.html

Suppose you want the header to come before the first element of the page <body>. You can do it with an include widget like this:

[widgets.alleged-header]
  widget = 'include'
  selector = 'body'
  file = 'header.html'

However, since soupault inserts new content after the last element, this config will create a footer rather than a header.

You can specify where to insert it using the action option. Its default value is append_child, but you can choose any of prepend_child, append_child, insert_before, insert_after, replace_element, replace_content.

For inserting before the first element, you will need prepend_child:

[widgets.header]
  widget = 'include'
  selector = 'body'
  file = 'header.html'
  action = 'prepend_child'
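The other actions work the same way. As a sketch, assume your pages have a <main> element (a hypothetical layout, not something this guide requires) and you want the header to become its preceding sibling rather than a child of <body>:

```toml
[widgets.header-before-main]
  widget = 'include'
  # Hypothetical target: the first <main> element of the page
  selector = 'main'
  file = 'header.html'
  # Insert the file content as a sibling right before the target,
  # not inside it
  action = 'insert_before'
```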

Adding a ToC

Soupault can generate tables of contents from your page headings, as you can see from this website. That widget has a large number of configurable options with (hopefully) sensible defaults.

It’s a good idea to add a container with a unique id to every page where you want a ToC, and point the widget to it. For example, add a <div id="generated-toc"> to those pages, and set up the widget like:

[widgets.toc]
  widget = "toc"
  selector = "div#generated-toc"

However, if you know you have an <h1> element in every page where you want a ToC, you can take advantage of the insert_before action and tell soupault to insert it right before the first <h1>:

[widgets.toc]
  widget = "toc"
  selector = "h1"
  action = "insert_before"
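To give an idea of what the configurable options look like, here is a hedged sketch with a few of them. The option names are taken from the reference manual as I understand it, and the values are arbitrary; verify both against the manual for your soupault version:

```toml
[widgets.toc]
  widget = "toc"
  selector = "h1"
  action = "insert_before"
  # Assumed options; see the reference manual for the full list and defaults
  min_level = 2          # skip the <h1> itself, start from <h2>
  max_level = 3          # ignore headings deeper than <h3>
  numbered_list = true   # use <ol> rather than <ul> for the ToC
  heading_links = true   # add a link to each heading
```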

Where to go from here

There are many other things you can do, for example, create lists of pages in a section or a blog feed, add footnotes, breadcrumbs, and more. You can also extend soupault with Lua plugins. Read the reference manual for details.