HTML to Structured Content

Case

Assume we want to move an HTML webpage into a structured content type. One of the fields is a reference to an author content item. In this example, the HTML is taken from the following article: https://www.espn.com/college-football/story/_/id/38491515/what-relegation-college-football-look-like.

Section of Source HTML

Destination Content Type

So we want to take the source HTML and move it into a destination content type.

We need to be able to:

  1. Parse HTML into multiple fields

  2. Create a reference to the article author

Solution

ImpulseSync can solve this with two jobs, using the liquid-field and relationship manipulators. The first job aligns the article with the destination author content, creating an ID map in Impulse. The second job syncs the HTML and parses it into multiple fields.

The first job is the "Align Author" job.

Because we want to use this job to align, we need to set which fields to align in the mapping.

The aligner is set to align the source field author_name to the destination field title. When these two fields match exactly, ImpulseSync aligns the contents and creates an ID map between them for later transactions to use.
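The exact-match alignment described above can be pictured with a small Python sketch. This is only an illustration of the idea, not ImpulseSync's implementation: the function name align_by_field, the "id" key, and the sample records are all hypothetical.

```python
def align_by_field(source_items, destination_items, source_field, destination_field):
    """Return {source id: destination id} for items whose fields match exactly."""
    # Index destination items by the value of the field being matched.
    by_value = {item[destination_field]: item["id"] for item in destination_items}
    id_map = {}
    for item in source_items:
        value = item.get(source_field)
        # Only an exact (case-sensitive) match creates an ID map entry.
        if value in by_value:
            id_map[item["id"]] = by_value[value]
    return id_map


# Hypothetical sample data: two source articles and one destination author.
articles = [
    {"id": "article-1", "author_name": "Jane Doe"},
    {"id": "article-2", "author_name": "jane doe"},
]
authors = [{"id": "author-9", "title": "Jane Doe"}]

id_map = align_by_field(articles, authors, "author_name", "title")
print(id_map)
```

In this sketch only article-1 is aligned, because the author_name of article-2 differs in case and so is not an exact match.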

When creating a job only for aligning, we can use the Nodeliver job option on the destination endpoint to not deliver any content. However, the job will still align content and create ID maps accordingly.

This job will also use a liquid field manipulator.

The config for this manipulator will create a new field author_name and set the value based on a liquid template.

The template is set to query the source HTML value for a tag with the classes .author and .has-bio. It removes any tag with the class .timestamp inside that tag and returns the remaining text. That text is then split by , and the first element of the resulting array is set as the value from the template.
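To make the parsing steps concrete, here is a rough Python equivalent using only the standard library. It is not the liquid template itself. The class ClassTextExtractor, the function author_name, and the sample markup (including the author "Jane Doe") are hypothetical, and the sketch assumes well-formed HTML with explicit closing tags. It also trims surrounding whitespace, which the description above does not mention.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect depth counting.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect text inside the first tag carrying all `wanted` classes,
    skipping any nested tag that carries the `skipped` class."""

    def __init__(self, wanted, skipped):
        super().__init__()
        self.wanted = set(wanted)
        self.skipped = skipped
        self.depth = 0       # nesting depth inside the matched tag
        self.skip_depth = 0  # nesting depth inside a skipped tag
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.depth == 0:
            if self.wanted <= classes:
                self.depth = 1
            return
        self.depth += 1
        if self.skip_depth or self.skipped in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID or self.depth == 0:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and not self.skip_depth and not self.done:
            self.parts.append(data)


def author_name(html):
    parser = ClassTextExtractor(wanted=["author", "has-bio"], skipped="timestamp")
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Split on "," and keep the first element, as the template does.
    return text.split(",")[0].strip()


# Hypothetical markup shaped like the byline described above.
sample = (
    '<div class="author has-bio">Jane Doe, Staff Writer'
    '<span class="timestamp">Oct 3, 2023</span></div>'
)
print(author_name(sample))
```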

The second job is the "HTML Sync" job. When this job runs it will take the source HTML value, parse it, and set it into the structured content at the destination.

This job has 3 liquid manipulators configured and 1 relationship manipulator.

The first liquid manipulator creates the field html_title.

The config for this manipulator uses a liquid template to parse the source HTML for a tag with the class .article-header. This value is then set for the html_title field.

The second liquid manipulator creates the field date.

The config for this manipulator uses a liquid template to parse the source HTML for a tag with the class .timestamp inside a tag with the classes .author and .has-bio. This value is then set for the date field.

The third liquid manipulator creates the field body.

The config for this manipulator uses a liquid template to parse the source HTML for a tag with the class .article-body and returns every p tag inside it. It returns the OuterHTML value of the parsed data, which leaves the HTML it finds intact rather than removing any of the tags. This value is then set for the body field.
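The difference between returning text and returning outer HTML can be illustrated with another standard-library Python sketch. As before, this is not the liquid template. The class ParagraphOuterHTML and the sample markup are hypothetical, and the sketch assumes well-formed HTML in which every p tag is explicitly closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect depth counting.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class ParagraphOuterHTML(HTMLParser):
    """Collect the outer HTML of every <p> inside a tag with class `container`."""

    def __init__(self, container):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.container = container
        self.body_depth = 0  # > 0 while inside the container tag
        self.p_depth = 0     # > 0 while inside a <p> being captured
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if self.p_depth:
                self.current.append(self.get_starttag_text())
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.body_depth:
            self.body_depth += 1
        elif self.container in classes:
            self.body_depth = 1
            return
        else:
            return
        if self.p_depth:
            self.p_depth += 1
        elif tag == "p":
            self.p_depth = 1
        if self.p_depth:
            # Keep the tag exactly as written, attributes included.
            self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID or not self.body_depth:
            return
        self.body_depth -= 1
        if self.p_depth:
            self.current.append(f"</{tag}>")
            self.p_depth -= 1
            if self.p_depth == 0:
                self.paragraphs.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.p_depth:
            self.current.append(data)

    def handle_entityref(self, name):
        if self.p_depth:
            self.current.append(f"&{name};")

    def handle_charref(self, name):
        if self.p_depth:
            self.current.append(f"&#{name};")


# Hypothetical markup: two paragraphs inside the body, one outside it.
sample = (
    '<div class="article-body"><p>First <a href="/x">link</a>.</p>'
    "<aside><p>Second &amp; last.</p></aside><h2>Heading</h2></div>"
    "<p>Outside</p>"
)
parser = ParagraphOuterHTML("article-body")
parser.feed(sample)
parser.close()
body = "".join(parser.paragraphs)
print(body)
```

The captured paragraphs keep their own tags and any nested markup such as links, while the heading and the paragraph outside the container are left out.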

The relationship manipulator is used to create the field author.

The manipulator is configured to create a relationship between the content and the author it was previously aligned with, that is, the author the content has an ID map to.
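Conceptually, the relationship step is a lookup in the ID map created by the align job. The following Python sketch shows that idea under stated assumptions: the function add_author_relationship, the "id" key, and the sample IDs are hypothetical and do not reflect how ImpulseSync stores ID maps internally.

```python
def add_author_relationship(article, id_map, field="author"):
    """Return a copy of `article` with `field` set to the aligned author's ID."""
    author_id = id_map.get(article["id"])
    if author_id is None:
        # No ID map exists for this article, so no relationship is created.
        return dict(article)
    return {**article, field: author_id}


# Hypothetical ID map, as an earlier align job might have produced it.
id_map = {"article-1": "author-9"}

linked = add_author_relationship({"id": "article-1", "html_title": "Example"}, id_map)
unlinked = add_author_relationship({"id": "article-2", "html_title": "Other"}, id_map)
print(linked)
print(unlinked)
```

This is why the align job has to run first: without an ID map entry for the article, there is nothing for the relationship to point to.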

Once both jobs have run, the end result is structured content with fields populated from data parsed from the source HTML.
