Regular expression to extract HTML text

LI_213600 Posts: 4
edited November 2019 in Questions

Hi there, I ran into a problem extracting certain text from an HTML text block. I tried a command called {Text: Find words that match a pattern}, but it appears to support only a limited set of patterns: 1. {#: Represents any digit (0-9)}, 2. {@: Represents any letter (a-z)}, and 3. {?: Represents any letter, digit, or dash}.
Is there a way to extract the target text from the HTML block below? I want to extract "Dbjd", "Dnjdsk", "外部供应商", "其他问询", "Hdjdkkd", and "Djjd". This is the regular expression I wrote:

<p class="MsoNormal">(?:<span.*?>)?(.*?)(?:<\/span>)?<\/o:p>

It works in Python, but I couldn't find a way to make it work in Catalytic. I understand that the best way to transfer these values is through an API/JSON, but this raw text is all I have right now, and I can't change the source data.

HTML text block:

<body lang="EN-US" link="#0563C1" vlink="#954F72">
    <div class="WordSection1">
        <p><span lang="ZH-CN" style="font-family:DengXian">供应商代码</span>
            <o:p></o:p>
        </p>
        <p>Vendor Code<o:p></o:p>
        </p>
        <p class="MsoNormal">Dbjd<o:p></o:p>
        </p>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">事业部代码</span>
                <o:p></o:p>
            </p>
            <p>Company code<o:p></o:p>
            </p>
            <p class="MsoNormal">Dnjdsk<o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">外部供应商或博世关联公司</span>
                <o:p></o:p>
            </p>
            <p>External supplier or Bosch affiliated company<o:p></o:p>
            </p>
            <p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">外部供应商</span>
                <o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">发票相关问询或其他问询</span>
                <o:p></o:p>
            </p>
            <p>Invoice related question or general question<o:p></o:p>
            </p>
            <p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">其他问询</span>
                <o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">问题描述</span>
                <o:p></o:p>
            </p>
            <p>Problem Description<o:p></o:p>
            </p>
            <p class="MsoNormal">Hdjdkkd<o:p></o:p>
            </p>
        </div>
        <div style="border:none;border-bottom:solid #EEEEEE 1.0pt;padding:0cm 0cm 4.0pt 0cm">
            <p><span lang="ZH-CN" style="font-family:DengXian">采购订单号码</span>
                <o:p></o:p>
            </p>
            <p>Purchasing orde<o:p></o:p>
            </p>
            <p class="MsoNormal">Djjd<o:p></o:p>
            </p>
        </div>
    </div>
</body>
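
For comparison, here is a minimal Python sketch of the regex approach. The pattern is an assumption, not the exact query quoted above: it skips an optional opening <span ...> tag and captures text up to the next tag, so it does not depend on how </span> and <o:p> are laid out. The sample string is a shortened excerpt of the HTML block.

```python
import re

# Shortened excerpt of the HTML block from the question.
html = """
<p>Vendor Code<o:p></o:p>
</p>
<p class="MsoNormal">Dbjd<o:p></o:p>
</p>
<p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">外部供应商</span>
    <o:p></o:p>
</p>
"""

# Assumed pattern (not necessarily the one from the post):
# an optional opening <span ...>, then capture everything up to the next "<".
pattern = re.compile(r'<p class="MsoNormal">(?:<span[^>]*>)?([^<]*)')

# findall returns the captured group for each match.
values = [text.strip() for text in pattern.findall(html)]
print(values)  # → ['Dbjd', '外部供应商']
```

The plain <p> elements (the labels) are skipped because they carry no class attribute.
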
Answers

  • Jozef_783863 Posts: 331 admin
    edited November 2019

    @LI_133741, are your documents XML documents? In other words, is there anything in the document above the <body> tag?

    If so, we can suggest a second approach that uses the XML: Parse string field action.

  • LI_213600

    Hello Jozef,
    Thanks for your answer. To clarify, this is an HTML document. I tried to search by the class name "MsoNormal", but it doesn't appear to work. XPath expression: //*[contains(@class, 'MsoNormal')]
    The first solution works for me. Cheers
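
    Since the input is HTML rather than XML, a class-based search can also be sketched outside Catalytic with Python's standard html.parser module. This is only an illustration under assumptions: the class name MsoNormalText and the shortened sample are made up for the example, and this is not how any Catalytic action works internally.

```python
from html.parser import HTMLParser


class MsoNormalText(HTMLParser):
    """Collect the text inside every <p> whose class list contains MsoNormal."""

    def __init__(self):
        super().__init__()
        self.values = []      # one entry per matching <p>
        self._inside = False  # True while inside a matching <p>

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or have no value.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "MsoNormal" in classes:
            self._inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False

    def handle_data(self, data):
        # Text inside nested tags such as <span> is appended as well.
        if self._inside:
            self.values[-1] += data


# Shortened excerpt of the HTML block from the question.
html = """
<p>Vendor Code<o:p></o:p>
</p>
<p class="MsoNormal">Dbjd<o:p></o:p>
</p>
<p class="MsoNormal"><span lang="ZH-CN" style="font-family:DengXian">外部供应商</span>
    <o:p></o:p>
</p>
"""

parser = MsoNormalText()
parser.feed(html)
parser.close()
values = [v.strip() for v in parser.values]
print(values)  # → ['Dbjd', '外部供应商']
```

    An HTML parser tolerates the <o:p> tags without a namespace declaration, which a strict XML parser would reject.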

  • Jozef_783863 Posts: 331 admin

    Tony, I am glad we can help get this working for you!

    For XML: Parse string field, I received the XPath Expression printed below.

    //*[@class="MsoNormal"]//text()

    XML: Parse string field saves a single value, so it is best used for simple strings where you only need one item. To save more than one value, XML: Convert XML to JSON and then Field: Create fields from JSON should work. This is the recommendation in the XML: Parse string field article.
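
    To illustrate the idea outside Catalytic, here is a rough Python sketch using the standard xml.etree.ElementTree and json modules. It is not how the Catalytic actions are implemented. The wrapper element with the xmlns:o declaration and the JSON field names (vendor_code, supplier_type) are assumptions made for the example; the block in the question starts at <body>, and an XML parser rejects <o:p> when the prefix is undeclared.

```python
import json
import xml.etree.ElementTree as ET

# Shortened excerpt, wrapped in a root element that declares the "o" prefix
# (assumed; the original block does not show this declaration).
xml_text = """
<html xmlns:o="urn:schemas-microsoft-com:office:office">
<body>
<p>Vendor Code<o:p></o:p></p>
<p class="MsoNormal">Dbjd<o:p></o:p></p>
<p class="MsoNormal"><span lang="ZH-CN">外部供应商</span><o:p></o:p></p>
</body>
</html>
"""

root = ET.fromstring(xml_text.strip())

# ElementTree supports only a subset of XPath, so text() is not available.
# This mirrors //*[@class="MsoNormal"]//text() by joining each match's text nodes.
values = [
    "".join(element.itertext()).strip()
    for element in root.findall(".//*[@class='MsoNormal']")
]

# Hypothetical field names, to show several values carried in one JSON object.
payload = json.dumps(
    {"vendor_code": values[0], "supplier_type": values[1]},
    ensure_ascii=False,
)
print(payload)  # → {"vendor_code": "Dbjd", "supplier_type": "外部供应商"}
```

    The XPath selects every matching element, so all values are available at once; packing them into JSON is one way to pass more than a single value along.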

    Thank you!