PDF importer discovery

This project will explore easier ways of migrating PDFs to HTML, including manual options with better editing tools for content designers. We’ll also explore whether AI can help.

Councils everywhere have hundreds of PDFs in various styles, from large constitution documents to colourful, graphic-heavy marketing pamphlets.

We need to move away from them.

GOV.UK advocates HTML rather than PDF for several reasons, including accessibility. Content stored as HTML also generally has a smaller file size than its PDF equivalent, so it is better for the planet.


User need

As a content editor
I need to quickly and easily convert many PDFs to HTML for publishing on our website
So that vital information is presented in an accessible and climate-friendly way

Our new HTML publications format is an ideal candidate for receiving PDF content.

However, the only way to add content to this format at present is to copy and paste text manually, which is a slow process.

Ultimately, we need to:

  • explore ways to automatically import PDF content 
  • provide better editing tools so content designers can quickly scan imported content and make changes
  • investigate AI tools and council attitudes to them, making sure we provide solutions both for councils that favour AI and for those that do not

 

Work

We believe we can implement a process for transforming a PDF to HTML as follows:

  1. Upload file
  2. Extract content from file
  3. Process content
  4. Manual review
  5. Save reviewed content as nodes in Drupal

There’s a prototype already, but it cannot yet recognise headings, lists, tables or images.

Steps 2 and 3 above (extracting the content and processing it) are the trickiest to solve.

Our first import attempts produced words out of order and stray spaces in the middle of words, among other problems.
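One common cause of out-of-order words is that an extraction library returns text fragments in the order they are stored in the PDF rather than the order they appear on the page. We have not confirmed that this is what happened in the prototype. The sketch below is only an illustration of one possible repair: it assumes the extractor gives each fragment a position, groups fragments into lines and sorts each line left to right. The `Fragment` type, the `reading_order` function and the `line_tolerance` value are hypothetical and do not come from the prototype or from Parsr.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A piece of extracted text with its position (hypothetical structure)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def reading_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by vertical position, then sort each line by x.

    Assumes a single-column layout. Multi-column pages need more work.
    """
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # A fragment joins the current line if it sits close to that line's first fragment.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    ]


# Fragments deliberately supplied out of order.
sample = [
    Fragment("tax", 60, 10.5),
    Fragment("online", 40, 30.4),
    Fragment("Council", 10, 10),
    Fragment("Pay", 10, 30),
]
ordered = reading_order(sample)
```

Stray spaces inside words are a separate problem and would need a different fix, for example comparing the gap between fragments with the font's normal letter spacing.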

We expect to need either a better library to extract the content (for example Parsr) or a different approach entirely.

This work could also be implemented in a modular way, so steps of the process that may be unworkable for some councils can be replaced with alternatives.
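As a rough illustration of what modular could mean here, the sketch below treats each step as a function that takes a document and returns it, so that any one step can be swapped without touching the others. It is not the prototype's design. The `Document` structure, the step functions and `run_pipeline` are all hypothetical names, the extraction steps return canned text rather than calling a real library, and the upload and Drupal save steps are left out.

```python
import html
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Content handed from one step to the next (hypothetical structure)."""
    source_name: str
    raw_text: str = ""
    html_body: str = ""
    reviewed: bool = False
    log: List[str] = field(default_factory=list)


# Every step has the same shape, which is what makes steps interchangeable.
Step = Callable[[Document], Document]


def extract_stub(doc: Document) -> Document:
    # Placeholder: a real step would call an extraction library here.
    doc.raw_text = "Council   constitution\n\nPart 1"
    doc.log.append("extract_stub")
    return doc


def extract_alternative(doc: Document) -> Document:
    # A stand-in for a different extractor, for example one that avoids AI.
    doc.raw_text = "Council constitution\n\nPart 1"
    doc.log.append("extract_alternative")
    return doc


def process(doc: Document) -> Document:
    # Collapse runs of whitespace and wrap each paragraph in <p> tags.
    paragraphs = [" ".join(p.split()) for p in doc.raw_text.split("\n\n")]
    doc.html_body = "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs if p)
    doc.log.append("process")
    return doc


def manual_review(doc: Document) -> Document:
    # Stand-in for a content designer checking and approving the content.
    doc.reviewed = True
    doc.log.append("manual_review")
    return doc


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    for step in steps:
        doc = step(doc)
    return doc


default_result = run_pipeline(
    Document("constitution.pdf"), [extract_stub, process, manual_review]
)
swapped_result = run_pipeline(
    Document("constitution.pdf"), [extract_alternative, process, manual_review]
)
```

In this sketch a council that cannot use a particular extractor would pass a different function in its place, and the later steps would run unchanged.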

In future phases of work, this approach could also allow us to enable additional functionality, such as importing to other LGD content types.

 

Estimated Time/Costs

Discovery into Parsr and other options is estimated at around 10 days. This will unlock more information about how to proceed.

Total costs

The Publications module was originally built by Hammersmith & Fulham Council and Chicken, so they are best placed to deliver this new work.

Chicken’s day rate for this project is £600 + VAT.

10 days at £600 per day comes to £6,000 + VAT in total.

 

Fund this work

If you’re interested, please contact Will. Payment is likely to be made to Hammersmith & Fulham Council, who will in turn pay Chicken as needed.