Pattern: Extract Fixed Number of Values Adjacent to Labels in .PDF Document

Thomas_937381
edited July 2020 in Show and Tell

Let's assume that you have a .pdf document containing a variable number of unique records, each with the same fixed set of attributes. Each of the attributes has a static, unchanging label; there is no variability in the document format:

Part Number: 111-111-111
Quantity: 100
Part Price: $9.99
Total Price: $999.00

Part Number: 222-222-222
Quantity: 50
Part Price: $5.50
Total Price: $275.00

Part Number: 333-333-333
Quantity: 75
Part Price: $2.25
Total Price: $168.75

One of the outputs of the Text: Find text next to other text step is a matches table. You can run this step once for each of the four labels. For Part Number:, the result looks like this:

Matching Text | Position
111-111-111 | #,#
222-222-222 | #,#
333-333-333 | #,#
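
For readers who want to see the idea outside the platform, here is a minimal Python sketch. It is only an illustration and not how the Text: Find text next to other text action is implemented: the helper name find_next_to_label is hypothetical, and it assumes each value sits on the same line as its label. It emits a zero-based rowIndex in place of Position.

```python
import re

SAMPLE_TEXT = """\
Part Number: 111-111-111
Quantity: 100
Part Price: $9.99
Total Price: $999.00

Part Number: 222-222-222
Quantity: 50
Part Price: $5.50
Total Price: $275.00

Part Number: 333-333-333
Quantity: 75
Part Price: $2.25
Total Price: $168.75
"""


def find_next_to_label(text, label):
    """Return one row per occurrence of `label`, in document order.

    Hypothetical helper: each row holds the text that follows the label on
    the same line, plus a rowIndex counting matches from 0.
    """
    # Anchor at the start of a line so "Part Price:" never matches inside
    # another label; capture everything up to the end of that line.
    pattern = re.compile(r"^" + re.escape(label) + r"[ \t]*(.*)$", re.MULTILINE)
    return [
        {"Matching Text": m.group(1).strip(), "rowIndex": i}
        for i, m in enumerate(pattern.finditer(text))
    ]


part_numbers = find_next_to_label(SAMPLE_TEXT, "Part Number:")
for row in part_numbers:
    print(row["Matching Text"], "|", row["rowIndex"])
```

Calling the same helper with "Quantity:", "Part Price:", and "Total Price:" gives the other three lists.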

If we follow this same pattern for Quantity:, Part Price:, and Total Price:, we can repurpose the Position column in each matches table to house the rowIndex value. Because every record contains each label exactly once, the nth match for each label belongs to the nth record, so the four lists can be joined on rowIndex using CSV: Update file with another file. The result looks like this:

Part Number | Quantity | Part Price | Total Price
111-111-111 | 100 | $9.99 | $999.00
222-222-222 | 50 | $5.50 | $275.00
333-333-333 | 75 | $2.25 | $168.75
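
The join on rowIndex can be sketched in Python as follows. This is a rough equivalent under stated assumptions, not the behavior of CSV: Update file with another file itself: the function join_on_row_index is a hypothetical name, the per-label lists are typed in by hand from the sample document, and columns are written in the order the labels appear in that document.

```python
import csv
import io

# Per-label match lists keyed by rowIndex (values copied from the sample document).
matches = {
    "Part Number": {0: "111-111-111", 1: "222-222-222", 2: "333-333-333"},
    "Quantity": {0: "100", 1: "50", 2: "75"},
    "Part Price": {0: "$9.99", 1: "$5.50", 2: "$2.25"},
    "Total Price": {0: "$999.00", 1: "$275.00", 2: "$168.75"},
}


def join_on_row_index(columns):
    """Merge {column: {rowIndex: value}} into one dict per rowIndex.

    Hypothetical helper: a rowIndex missing from a column yields an empty cell.
    """
    row_indexes = sorted(set().union(*(c.keys() for c in columns.values())))
    return [
        {name: values.get(i, "") for name, values in columns.items()}
        for i in row_indexes
    ]


rows = join_on_row_index(matches)

# Write the joined rows as CSV text; rowIndex is used only for the join
# and is not written as a column.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(matches), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
final_file = buffer.getvalue()
print(final_file)
```

Keying each list by rowIndex, rather than relying on list order, keeps the rows aligned even if one label were to have fewer matches than the others.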

We've thus transformed the .pdf into a reportable format that can be easily summarized or manipulated using the Excel:, CSV:, or Tables: suites of actions.
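
As an example of the kind of summary this makes possible, the sketch below totals the joined table with Python's csv module standing in for the platform's actions. The CSV text is typed in by hand from the sample values, and parse_money is a hypothetical helper, so treat it as an illustration only.

```python
import csv
import io
from decimal import Decimal

FINAL_FILE = """\
Part Number,Quantity,Part Price,Total Price
111-111-111,100,$9.99,$999.00
222-222-222,50,$5.50,$275.00
333-333-333,75,$2.25,$168.75
"""


def parse_money(value):
    """Hypothetical helper: turn a string such as '$1,234.50' into a Decimal."""
    return Decimal(value.replace("$", "").replace(",", ""))


rows = list(csv.DictReader(io.StringIO(FINAL_FILE)))

total_quantity = sum(int(row["Quantity"]) for row in rows)
grand_total = sum(parse_money(row["Total Price"]) for row in rows)

# Sanity check: each Total Price should equal Quantity x Part Price.
mismatches = [
    row["Part Number"]
    for row in rows
    if int(row["Quantity"]) * parse_money(row["Part Price"])
    != parse_money(row["Total Price"])
]

print(total_quantity, grand_total, mismatches)
```

For the sample data this gives a total quantity of 225 and a grand total of 1442.75, with no mismatched rows. Decimal is used instead of float so that currency sums stay exact.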

The attached workflow has a top-level field called PDF; download its default value, Sample PDF.pdf, and look at it. The workflow converts this file into text (if your document is an image or a hand scan, you could use Images: Optical character recognition instead), and then parses information from it. Try running the workflow; you can see the final output, final-file, in the Remove Position (rowIndex) column step.