Regular expression to extract HTML text

LI_213600 · November 2019

Hi there, I'm trying to extract certain text from an HTML text block. I tried the {Text: Find words that match a pattern} action, but it appears to support only a limited set of patterns: 1. {#: Represents any digit (0-9)}, 2. {@: Represents any letter (a-z)}, and 3. {?: Represents any letter, digit, or dash}.
Is there a way to extract the target text from the HTML block below? I want to extract "Dbjd", "Dnjdsk", "外部供应商", "其他问询", "Hdjdkkd", and "Djjd". This is the regular expression I wrote:

<p class="MsoNormal">(?:<span.?>)?(.?)(?:<\/span>)?<\/o:p>

It works in Python, but I couldn't find a way to make it work in Catalytic. I understand the best way to pass these values is via API/JSON, but this raw text is all I have right now and I can't change the source data.

HTML text block:

<body lang="EN-US" link="#0563C1" vlink="#954F72">
    <div class="WordSection1">
        <p><span lang="ZH-CN" style="font-family:DengXian">供应商代码</span>
            <o:p></o:p>
        </p>
        <p>Vendor Code<o:p></o:p>
        </p>
        <p class="MsoNormal">Dbjd<o:p></o:p>
        </p>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">事业部代码</span>
                <o:p></o:p>
            </p>
            <p>Company code<o:p></o:p>
            </p>
            <p class="MsoNormal">Dnjdsk<o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">外部供应商或博世关联公司</span>
                <o:p></o:p>
            </p>
            <p>External supplier or Bosch affiliated company<o:p></o:p>
            </p>
            <p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">外部供应商</span>
                <o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">发票相关问询或其他问询</span>
                <o:p></o:p>
            </p>
            <p>Invoice related question or general question<o:p></o:p>
            </p>
            <p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">其他问询</span>
                <o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">问题描述</span>
                <o:p></o:p>
            </p>
            <p>Problem Description<o:p></o:p>
            </p>
            <p class="MsoNormal">Hdjdkkd<o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">采购订单号码</span>
                <o:p></o:p>
            </p>
            <p>Purchasing orde<o:p></o:p>
            </p>
            <p class="MsoNormal">Djjd<o:p></o:p>
            </p>
        </div>
    </div>
</body>
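
For comparison outside Catalytic, here is a minimal Python sketch of the regex approach run against a shortened copy of the block above. The pattern, `SAMPLE_HTML`, and `extract_values` are illustrative assumptions, not the exact query from the post: the pattern captures the text that follows the `MsoNormal` paragraph tag and skips one optional opening `<span>`.

```python
import re

# Shortened copy of the HTML block above (one plain value, one label, one span-wrapped value).
SAMPLE_HTML = """
<p class="MsoNormal">Dbjd<o:p></o:p>
</p>
<p>Company code<o:p></o:p>
</p>
<p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">外部供应商</span>
    <o:p></o:p>
</p>
"""

# Assumed pattern: the MsoNormal paragraph tag, an optional opening <span ...>,
# then everything up to the next tag as the captured value.
PATTERN = re.compile(r'<p class="MsoNormal">\s*(?:<span[^>]*>)?([^<]*)')


def extract_values(html):
    """Return the text of every MsoNormal paragraph, trimmed of whitespace."""
    return [value.strip() for value in PATTERN.findall(html)]


print(extract_values(SAMPLE_HTML))
```

Capturing `[^<]*` rather than a lazy `.` avoids depending on where `<o:p>` sits or on line breaks inside the paragraph.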

Jozef_783863 · November 2019

Hi @LI_133741, thank you for the question!

I uploaded an image of your HTML text block below to highlight the text we want.

Using Text: Find text next to other text with <p class="MsoNormal"> as the start value and <o:p> as the end value works. However, a few lines in your HTML text block have additional HTML between the start and end values (e.g., <span>).

Using a second Text: Find text next to other text action with a refined start value, you can retrieve the text you need. The two images below show the two configurations you can use with Text: Find text next to other text.

You can combine these two tables using Tables: Copy a table to another table. You can then filter the new table to exclude the two text matches that include the HTML tags. You can also turn these four steps into one step using a Custom Action.
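
The four steps above can be sketched in plain Python to show the logic. This is only an illustration, not how the Catalytic actions are implemented: `find_between` is a hypothetical helper, and the refined start and end values for the second pass are assumptions, since the actual configurations are only shown in the images.

```python
def find_between(text, start, end):
    """Return every substring that sits between a start marker and the next end marker."""
    matches = []
    position = 0
    while True:
        begin = text.find(start, position)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break
        matches.append(text[begin:finish].strip())
        position = finish + len(end)
    return matches


# Shortened copy of the HTML block from the question.
SAMPLE_HTML = (
    '<p class="MsoNormal">Dbjd<o:p></o:p></p>'
    '<p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">'
    '外部供应商</span><o:p></o:p></p>'
)

# Pass 1: start and end values from the answer above.
first_pass = find_between(SAMPLE_HTML, '<p class="MsoNormal">', '<o:p>')

# Pass 2: an assumed refined start value that also consumes the <span ...> tag.
refined_start = '<p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">'
second_pass = find_between(SAMPLE_HTML, refined_start, '</span>')

# Combine both result sets, then drop the matches that still contain HTML tags.
combined = first_pass + second_pass
clean_values = [value for value in combined if '<' not in value]

print(clean_values)
```

Note that combining the two result sets this way does not preserve the order in which the values appear in the document.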

Jozef_783863 · November 2019

@LI_133741, are your documents XML documents? In other words, is there anything in the document above the <body> tag?

If that is the case, we can provide a second building approach using XML: Parse string field actions.

LI_213600 · November 2019

Hello Jozef,
Thanks for your answer. To clarify, this is an HTML document. I tried to search by the class name "MsoNormal", but it doesn't appear to work. XPath expression: //*[contains(@class, 'MsoNormal')]
The first solution works for me. Cheers.

Jozef_783863 · November 2019

Tony, I am glad we can help get this working for you!

For XML: Parse string field, I received the XPath Expression printed below.

//*[@class="MsoNormal"]//text()

XML: Parse string field saves a single value, so it is best used for simple strings where you only need one item. To save more than one value, XML: Convert XML to JSON and then Field: Create fields from JSON should work. This is the recommendation in the XML: Parse string field article.
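
To illustrate why a class-based XPath can yield more than one value, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`. It is not the Catalytic action. ElementTree supports only a subset of XPath, so the sketch selects the elements with `.//*[@class='MsoNormal']` and reads their text with `itertext()` instead of `//text()`. The sample also declares an `xmlns:o` namespace on the root, which is an assumption: the fragment in the question shows no such declaration, and a strict XML parser rejects an undeclared `o:` prefix.

```python
import xml.etree.ElementTree as ET

# Shortened sample. The xmlns:o declaration is an assumption added so the
# o:p elements parse as well-formed XML.
SAMPLE_XML = """<body xmlns:o="urn:schemas-microsoft-com:office:office">
  <p>Vendor Code<o:p></o:p></p>
  <p class="MsoNormal">Dbjd<o:p></o:p></p>
  <p class="MsoNormal"><span lang="ZH-CN">外部供应商</span><o:p></o:p></p>
</body>"""


def msonormal_texts(xml_string):
    """Return the combined text of every element whose class is exactly MsoNormal."""
    root = ET.fromstring(xml_string)
    nodes = root.findall(".//*[@class='MsoNormal']")
    # itertext() walks nested elements, so span-wrapped values are included.
    return ["".join(node.itertext()).strip() for node in nodes]


print(msonormal_texts(SAMPLE_XML))
```

Because the expression matches several elements, a step that stores a single value keeps only one of them, which is why a multi-value route is suggested above.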

Thank you!
