PHP regex: split content

You have super powers, maybe you can help me with this regular expression.

I want to insert some content between some blocks, so I’d like to split the content into an array.

This is an example of content generated by WordPress:

<!-- wp:heading {"level":4} -->
<h4><em>More info: <a href="https://example.com">Some text for the link</a></em></h4>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>SLorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph --><!-- wp:heading -->
<h2>Neque porro quisquam est qui </h2>
<!-- /wp:heading -->



<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>Cras ullamcorper luctus felis vitae lobortis. Fusce ut aliquam elit. Proin malesuada arcu sit amet ullamcorper auctor.</p>
<!-- /wp:paragraph -->

Output should be something like:

Array
(
    [0] => 
<!-- wp:heading {"level":4} -->
<h4><em>More info: <a href="https://example.com">Some text for the link</a></em></h4>
<!-- /wp:heading -->

    [1] => 
<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

    [2] => 
<!-- wp:paragraph -->
<p>SLorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

    [3] =>
<!-- wp:heading -->
<h2>Neque porro quisquam est qui </h2>
<!-- /wp:heading -->

    [4] => 
<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

    [5] => 
<!-- wp:paragraph -->
<p>Cras ullamcorper luctus felis vitae lobortis. Fusce ut aliquam elit. Proin malesuada arcu sit amet ullamcorper auctor.</p>
<!-- /wp:paragraph -->

The following is a regex used in WordPress core to extract data from these blocks, but I’m not sure how to adapt it:

(?P<closer>\/)?wp:(?P<namespace>[a-z][a-z0-9_-]*\/)?(?P<name>[a-z][a-z0-9_-]*)\s+(?P<attrs>{(?:(?:[^}]+|}+(?=})|(?!}\s+\/?-->).)*+)?}\s+)?(?P<void>\/)?

After converting to an array, I will be free to insert more content.

Thank you.

What language are you using? In Python, I know you can just make an array from all matches. Does this exist in your language? Maybe just match something like

<!-- wp:.*? -->.*?<!-- \/wp:.*? -->

and let . match newlines.


Actually, this looks like js. So, I guess:

var regex = /<!-- wp:.*? -->.*?<!-- \/wp:.*? -->/sg;
var outputArray = inputString.match(regex);
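
Since the thread title mentions PHP, here is a rough PHP equivalent of the same idea, using `preg_match_all()` with the `s` modifier so that `.` also matches newlines. It is only a sketch: the sample content is shortened, and `$the_content` and `$blocks` are placeholder names, not anything taken from WordPress itself.

```php
<?php
// Shortened, hypothetical sample; $the_content stands in for the real post content.
$the_content = <<<'HTML'
<!-- wp:heading {"level":4} -->
<h4>More info</h4>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Lorem ipsum.</p>
<!-- /wp:paragraph --><!-- wp:heading -->
<h2>Neque porro</h2>
<!-- /wp:heading -->
HTML;

// The "s" modifier lets "." match newlines, so one match can span a whole block.
$pattern = '/<!-- wp:.*? -->.*?<!-- \/wp:.*? -->/s';

// preg_match_all() stores every full match, in document order, in $matches[0].
preg_match_all($pattern, $the_content, $matches);
$blocks = $matches[0];

print_r($blocks);
```

Because the pattern is non-greedy, each match stops at the first closing comment it finds, so a block nested inside another block would probably be cut short. Anything that sits between two blocks is not part of any match and is simply skipped.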

Awesome @mwt! Thank you very much.

I’ve seen another case that can exist:

<!-- wp:heading {"level":4} -->
<h4><em>More info: <a href="https://example.com">Some text for the link</a></em></h4>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

There is some additional content here too.

Can be multiple paragraphs. This and the previous one should be an item too.

<!-- wp:paragraph -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam est diam, ultrices in tempor a, dignissim et neque.</p>
<!-- /wp:paragraph -->

What if there is some unwrapped content in between? I’d like to match it too; otherwise it would be lost when I join all the content back together.

One hack would be to use regex replace to add some delimiter that you don’t expect in your HTML like &🙃; and then split by this delimiter. That way you would know that your text was complete.

There’s probably a better way…
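
For what it is worth, the delimiter idea could be sketched in PHP roughly like this. The delimiter string, the shortened sample content, and the variable names are all assumptions made for illustration.

```php
<?php
// Hypothetical delimiter; it must be a string that never occurs in the content.
$delimiter = '&🙃;';

// Shortened, hypothetical sample with loose text between two blocks.
$the_content = <<<'HTML'
<!-- wp:paragraph -->
<p>First block.</p>
<!-- /wp:paragraph -->

Loose text outside any block.

<!-- wp:paragraph -->
<p>Second block.</p>
<!-- /wp:paragraph -->
HTML;

$pattern = '/<!-- wp:.*? -->.*?<!-- \/wp:.*? -->/s';

// Wrap every block in the delimiter; "$0" in the replacement is the whole match.
$marked = preg_replace($pattern, $delimiter . '$0' . $delimiter, $the_content);

// Split on the delimiter, then drop pieces that are empty or whitespace-only.
$pieces = array_values(array_filter(
    explode($delimiter, $marked),
    static function ($piece) {
        return trim($piece) !== '';
    }
));

print_r($pieces);
```

With this sample, `$pieces` holds the first block, the loose text (still carrying its surrounding newlines), and the second block. Note that the filter also throws away whitespace-only gaps between blocks, so joining the pieces back together may not reproduce the original spacing exactly.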


I got it working :smiley:

$content_by_blocks = preg_split( '/(<!-- wp:.*? -->.*?<!-- \/wp:.*? -->)/s', $the_content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE );

Thank you!
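
For anyone landing here later, this is a minimal sketch of how the split, insert, and re-join steps might fit together. The sample content, the `$extra` string, and the insert position are made up for illustration; `-1` is passed as the limit, which `preg_split()` documents as "no limit".

```php
<?php
// Shortened, hypothetical sample with loose text between two blocks.
$the_content = <<<'HTML'
<!-- wp:paragraph -->
<p>First block.</p>
<!-- /wp:paragraph -->

Loose text outside any block.

<!-- wp:paragraph -->
<p>Second block.</p>
<!-- /wp:paragraph -->
HTML;

// The capture group makes the blocks themselves part of the result.
$pattern = '/(<!-- wp:.*? -->.*?<!-- \/wp:.*? -->)/s';

// DELIM_CAPTURE keeps the matched blocks; NO_EMPTY drops zero-length pieces only.
$parts = preg_split(
    $pattern,
    $the_content,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

// Hypothetical extra content, inserted right after the first piece.
$extra = "\n<p>Inserted content.</p>\n";
array_splice($parts, 1, 0, [$extra]);

// Joining with an empty string keeps every original character in place.
$result = implode('', $parts);

echo $result;
```

One thing to keep in mind: `PREG_SPLIT_NO_EMPTY` only removes empty strings, so the blank lines between two blocks also show up as their own array items. That is what makes the round trip lossless, but it means array indexes do not map one-to-one to blocks.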


That’s much better than my emoji delimiter idea :smile:.
