Pattern: Convert Table in PDF Document to Data Table

Thomas_937381
edited July 2020 in Show and Tell

You may have a table in a .pdf file that you need to extract and manipulate (see the workflow-level PDF field in the attached .catalytic file). This pattern also works for .docx files if you add a step at the top that saves the document in .pdf format.

Say, for example, you have a table like the one below:

Color Number True?
Red 1111111 True
Blue 2222222 False
Green 3333333 True
Yellow 4444444 True
Orange 5555555 False

The table may sit between two anchor points (labels) in the document. After you've used PDF: Extract text to a field or Images: Optical character recognition, you may need to use Text: Find text next to other text to isolate the table from the rest of the document.
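For readers who want to see the anchor-point idea outside the workflow builder, here is a minimal Python sketch of pulling out the text between two labels. It is not how the Text: Find text next to other text action is implemented; the function name text_between and the labels in the usage note are hypothetical, chosen only for illustration.

```python
def text_between(document: str, start_label: str, end_label: str) -> str:
    """Return the text between the first start_label and the next end_label.

    Hypothetical helper: the labels are whatever anchor text surrounds
    the table in your extracted document.
    """
    start = document.find(start_label)
    if start == -1:
        raise ValueError(f"start label not found: {start_label!r}")
    start += len(start_label)  # begin just after the start label

    end = document.find(end_label, start)
    if end == -1:
        raise ValueError(f"end label not found: {end_label!r}")

    # Trim the newlines and spaces left around the table itself.
    return document[start:end].strip()
```

For example, text_between(extracted_text, "Table 1", "End of table") would return only the rows that appear between those two (assumed) labels.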

After that, we replace whitespace ( ) with commas (,) to make the text comma-delimited. Note that this pattern assumes your cell values do not contain spaces; if any cells in your table do contain whitespace, additional creativity may be required.
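The whitespace-to-comma step can be sketched in Python as follows. This is an illustration under the same assumption stated above (no cell value contains a space), not the workflow's own implementation; the function names whitespace_to_csv and parse_rows are hypothetical.

```python
import csv
import io
import re


def whitespace_to_csv(table_text: str) -> str:
    """Turn space-separated rows into comma-delimited rows.

    Assumes no cell value contains a space, as in the pattern above.
    """
    # Drop blank lines and trim each row before converting it.
    lines = [line.strip() for line in table_text.splitlines() if line.strip()]
    # Collapse each run of spaces or tabs into a single comma.
    return "\n".join(re.sub(r"[ \t]+", ",", line) for line in lines)


def parse_rows(csv_text: str) -> list[list[str]]:
    """Parse comma-delimited text into a list of rows (header row first)."""
    return list(csv.reader(io.StringIO(csv_text)))
```

Collapsing runs of spaces, rather than replacing each space one by one, avoids creating empty columns when the extracted text pads cells with more than one space.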

After the workflow has run, view the pdf-to-data-table field in the Create data table from comma-delimited text step.

Once the data is in table format, you can summarize it as needed using the Tables:, Excel:, and/or CSV: suites of actions.