phpBB3 / PHPBB3-16269

Sphinx backend indexes HTML markup as keywords

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.3, 3.2.0
    • Fix Version/s: 3.2.9
    • Component/s: Search
    • Labels: None
    • Environment: Sphinx 2.2.11

    Description

      Using Sphinx backend with the default configuration, the indexer creates a search index based on POST_TABLE using utf8 encoding. The indexer treats punctuation marks and special characters as delimiters and ignores/replaces them with whitespace.

      However, text in the post_text column is in HTML format and uses HTML entity encoding for certain (but not all) special characters such as '"' (&quot;), '&' (&amp;), '<' (&lt;) etc.

      This means that HTML markup such as <br> and <img src=> is actually indexed in the search dictionary as BR, IMG, SRC etc. It also means that special characters that have HTML entity encoding are indexed as the entity name without punctuation. For example, the character & is stored in post_text as &amp; and is indexed by Sphinx as simply AMP. Bold text is indexed as if it contained the word STRONG.
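
      The effect can be reproduced outside Sphinx with a minimal Python sketch. The tokenizer below is a simplified stand-in that splits on every non-alphanumeric character, not Sphinx's actual tokenizer, and the sample post text and the name naive_keywords are made up for illustration.

```python
import re


def naive_keywords(stored_text):
    """Split on anything that is not an ASCII letter or digit.

    Simplified stand-in for an indexer with no HTML awareness;
    this is NOT Sphinx's real tokenizer.
    """
    return [w.lower() for w in re.split(r"[^0-9A-Za-z]+", stored_text) if w]


# Hypothetical stored post text: an entity-encoded ampersand,
# a line break, and bold markup.
stored = "Tom &amp; Jerry<br><strong>rocks</strong>"

print(naive_keywords(stored))
# ['tom', 'amp', 'jerry', 'br', 'strong', 'rocks', 'strong']
```

      Only "tom", "jerry" and "rocks" are words the author actually wrote; the rest come from markup and entity names.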

      This can be reproduced on the phpBB community board by searching for the keyword "br", which will return 2.5 million hits. If you go to the posts themselves, there is of course no word "br" - it is simply returning line breaks as a hit.

      This should not be default behaviour.

      Resolving the issue is quite simple: add the following line to the index definition in the Sphinx configuration file (html_strip is an index-level option):

      html_strip = 1

      According to the Sphinx documentation, this does the following:

      HTML tags are removed, their contents (i.e., everything between <P> and </P>) are left intact by default. You can choose to keep and index attributes of the tags (e.g., HREF attribute in an A tag, or ALT in an IMG one). Several well-known inline tags are completely removed, all other tags are treated as block level and replaced with whitespace. For example, 'te<B>st</B>' text will be indexed as a single keyword 'test', however, 'te<P>st</P>' will be indexed as two keywords 'te' and 'st'. Known inline tags are as follows: A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.
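
      The inline versus block distinction in the quoted documentation can be sketched in a few lines of Python. This only illustrates the documented behaviour and is not Sphinx's stripper: strip_tags and its regular expression are hypothetical, and the pattern does not handle a '>' inside an attribute value, comments, or scripts.

```python
import re

# Inline tags as listed in the Sphinx documentation quoted above.
INLINE_TAGS = {
    "a", "b", "i", "s", "u", "basefont", "big", "em", "font", "img",
    "label", "small", "span", "strike", "strong", "sub", "sup", "tt",
}

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_tags(text):
    """Remove inline tags outright; replace every other tag with a space."""
    def replace(match):
        return "" if match.group(1).lower() in INLINE_TAGS else " "
    return TAG_RE.sub(replace, text)


print(strip_tags("te<B>st</B>").split())  # ['test']
print(strip_tags("te<P>st</P>").split())  # ['te', 'st']
```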

      HTML entities get decoded and replaced with corresponding UTF-8 characters. Stripper supports both numeric forms (such as &#239;) and text forms (such as &oacute; or &nbsp;). All entities as specified by HTML4 standard are supported.
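
      Entity decoding of this kind can be approximated with html.unescape from Python's standard library. One assumption to note: html.unescape follows the HTML5 entity rules, whereas the Sphinx documentation refers to the HTML4 set, so this only illustrates the idea for entities common to both.

```python
import html

# A numeric entity, two named entities, and the ampersand entity
# discussed in this report.
samples = ["&#239;", "&oacute;", "&nbsp;", "&amp;"]

# Map each entity to the character it decodes to.
decoded = {entity: html.unescape(entity) for entity in samples}

for entity, char in decoded.items():
    # repr() makes the non-breaking space visible as '\xa0'.
    print(entity, "->", repr(char))
```

      Once decoded, '&' is an ordinary character rather than the word AMP, so whether it is indexed depends on the index's character set settings rather than on the entity name.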

      This setting should be enabled by default in both the sample Sphinx configuration file and the configuration file generated by the ACP module.

      In my testing, this resulted in a reduction in the size of the search index by approximately 20% and the time taken to recreate the main index by approximately 45%, which is a massive improvement.

      Converting HTML entities to UTF-8 characters also enables those characters to be indexed and searched literally, which is currently missing from the search functionality when using the Sphinx backend. This can be implemented through special configuration settings and modifications to the phpBB code. I will address this functionality in the main pull request to fix broken search operators in Sphinx (PHPBB3-16324 and PHPBB3-16233).

      In the meantime, I will submit a pull request with the fix for this issue.

          People

            Marc
            KYPREO [X] (Inactive)