Pattern: Convert Table in PDF Document to Data Table

Thomas_937381 · July 2020

You may have a table in a .pdf file that you need to extract and manipulate (see the workflow-level PDF field in the attached .catalytic file). This pattern also works for .docx files if you add a step at the top to save the document in .pdf format.

Say, for example, you have a table like the one below:

Color	Number	True?
Red	1111111	True
Blue	2222222	False
Green	3333333	True
Yellow	4444444	True
Orange	5555555	False

The table may sit between two anchor points (labels) in the document. After you've used PDF: Extract text to a field or Images: Optical character recognition, you may need to use Text: Find text next to other text to separate the table from the rest of the document.
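For readers who want to see the anchor idea outside the platform, here is a minimal Python sketch of pulling out the text between two labels. It is only an illustration, not how the Text: Find text next to other text action is implemented. The function name text_between and the anchors "Table 1" and "End of table" are made up for this example.

```python
def text_between(text: str, start_anchor: str, end_anchor: str) -> str:
    """Return the text found between two anchor labels, with outer whitespace removed."""
    # Position just past the first anchor (raises ValueError if the anchor is missing).
    start = text.index(start_anchor) + len(start_anchor)
    # Look for the closing anchor only after the opening one.
    end = text.index(end_anchor, start)
    return text[start:end].strip()


# Hypothetical text as it might come out of a PDF text-extraction step.
extracted = (
    "Quarterly report\n"
    "Table 1\n"
    "Color Number True?\n"
    "Red 1111111 True\n"
    "Blue 2222222 False\n"
    "End of table\n"
    "Footer text\n"
)

table_text = text_between(extracted, "Table 1", "End of table")
print(table_text)
```

If either anchor can be absent in your documents, you would want to catch the ValueError and handle that case explicitly.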

After that, we replace each run of whitespace ( ) with a comma (,) to make the text comma-delimited. Note that if any cells in your table contain whitespace, additional creativity may be required. This pattern assumes your cell values do not contain spaces.
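The replacement step can be sketched in Python as follows. This is an illustration under the same assumption stated above (no spaces inside cell values), not the platform's own implementation. The helper name to_comma_delimited is hypothetical.

```python
import csv
import io
import re


def to_comma_delimited(table_text: str) -> str:
    """Turn whitespace-separated table text into comma-delimited text."""
    # Drop blank lines and trim each remaining line.
    lines = [line.strip() for line in table_text.splitlines() if line.strip()]
    # Collapse each run of spaces or tabs into a single comma.
    return "\n".join(re.sub(r"[ \t]+", ",", line) for line in lines)


table_text = (
    "Color Number True?\n"
    "Red 1111111 True\n"
    "Blue 2222222 False\n"
    "Green 3333333 True\n"
    "Yellow 4444444 True\n"
    "Orange 5555555 False\n"
)

csv_text = to_comma_delimited(table_text)

# Parse the comma-delimited text into rows keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

A value such as "Light Blue" would be split into two columns by this approach, which is why cells containing spaces need extra handling.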

After the workflow has run, view the pdf-to-data-table field in the Create data table from comma-delimited text step.

Once the data is in table format, you can summarize it as needed using the Tables:, Excel:, and/or CSV: suites of actions.
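As a rough idea of what summarizing could look like, here is a small Python sketch over the example rows. The two summaries chosen (a count per True? value and a sum of the Number column) are just examples, and this does not represent what the Tables:, Excel:, or CSV: actions do internally.

```python
from collections import Counter

# The example table as a list of rows keyed by column name.
rows = [
    {"Color": "Red", "Number": "1111111", "True?": "True"},
    {"Color": "Blue", "Number": "2222222", "True?": "False"},
    {"Color": "Green", "Number": "3333333", "True?": "True"},
    {"Color": "Yellow", "Number": "4444444", "True?": "True"},
    {"Color": "Orange", "Number": "5555555", "True?": "False"},
]

# Count how many rows have each value in the "True?" column.
counts = Counter(row["True?"] for row in rows)

# Values arrive as text, so convert before doing arithmetic.
total = sum(int(row["Number"]) for row in rows)

print(counts, total)
```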
