Search Shortcut cmd + k | ctrl + k
html_query

Query HTML using CSS selectors, extract JSON from LD+JSON and JavaScript variables

Maintainer(s): onnimonni

Installing and Loading

INSTALL html_query FROM community;
LOAD html_query;

Example

-- Extract text from first matching element
SELECT html_query('<html><title>Hello World</title></html>', 'title', '@text');
┌─────────────┐
 html_query  
   varchar   
├─────────────┤
 Hello World 
└─────────────┘

-- Extract all matching elements as VARCHAR[]
SELECT html_query_all('<div><p>First</p><p>Second</p></div>', 'p', '@text');
┌────────────────┐
 html_query_all 
   varchar[]    
├────────────────┤
 [First,Second] 
└────────────────┘

-- Extract JSON from LD+JSON scripts (returns array)
SELECT html_extract_json(
  '<script type="application/ld+json">{"name":"Product"}</script>',
  'script[type="application/ld+json"]'
);
┌──────────────────────────┐
    html_extract_json     
         varchar          
├──────────────────────────┤
 [{"name":"Product"}]     
└──────────────────────────┘

About html_query

html_query is a DuckDB extension for querying HTML using CSS selectors.

Functions

Function Returns Description
html_query(html, selector?, extract?) VARCHAR First matching element
html_query_all(html, selector?, extract?) VARCHAR[] All matching elements as list
html_extract_json(html, selector, var_pattern?) JSON array JSON from script tags

All functions accept both VARCHAR and BLOB input types.

Extract Parameter

The extract parameter specifies what to extract from matched elements:

Value Description
(omitted) Full HTML of element
@text or text Inner text content
@href, href href attribute
@src, src src attribute
data-test-id Any attribute name
['@href', '@text'] Multiple attributes as JSON object

html_query(html, selector, extract)

Extract first HTML element matching CSS selector.

Examples:

SELECT html_query(html, 'title', '@text') FROM pages;
SELECT html_query(html, 'a.nav-link', '@href') FROM pages;
SELECT html_query(html, 'div.content') FROM pages;  -- returns HTML

html_query_all(html, selector, extract)

Extract all HTML elements matching CSS selector as VARCHAR[].

Examples:

SELECT html_query_all(html, 'a', '@href') FROM pages;
-- Returns: [/home, /about, /contact]

SELECT list_extract(html_query_all(html, 'a', '@href'), 2) FROM pages;
-- Access second element

SELECT html_query_all(html, 'a', ['@href', '@text']) FROM pages;
-- Returns: [{"href":"/home","text":"Home"}, ...]

html_extract_json(html, selector, var_pattern)

Extract JSON from HTML scripts. Supports LD+JSON and JavaScript variables. Always returns a JSON array.

Examples:

-- Extract LD+JSON (decodes HTML entities)
SELECT html_extract_json(html, 'script[type="application/ld+json"]')->0 FROM pages;

-- Extract JS variable (decodes hex escapes like \x22)
SELECT html_extract_json(html, 'script', 'var config')->0 FROM pages;

CSS Selectors

  • Tag: div, p, a
  • Class: .classname
  • ID: #idname
  • Attribute: [href], [type="application/ld+json"]
  • Pseudo: :first-child, :last-child, :nth-child(n)
  • Combinators: div > p, div p

Added Functions

function_name function_type description comment examples
html_extract_json scalar NULL NULL  
html_query scalar NULL NULL  
html_query_all scalar NULL NULL