Query HTML using CSS selectors, extract JSON from LD+JSON and JavaScript variables
Maintainer(s):
onnimonni
Installing and Loading
INSTALL html_query FROM community;
LOAD html_query;
Example
-- Extract text from first matching element
SELECT html_query('<html><title>Hello World</title></html>', 'title', '@text');
┌─────────────┐
│ html_query │
│ varchar │
├─────────────┤
│ Hello World │
└─────────────┘
-- Extract all matching elements as VARCHAR[]
SELECT html_query_all('<div><p>First</p><p>Second</p></div>', 'p', '@text');
┌────────────────┐
│ html_query_all │
│ varchar[] │
├────────────────┤
│ [First,Second] │
└────────────────┘
-- Extract JSON from LD+JSON scripts (returns array)
SELECT html_extract_json(
'<script type="application/ld+json">{"name":"Product"}</script>',
'script[type="application/ld+json"]'
);
┌──────────────────────────┐
│ html_extract_json │
│ varchar │
├──────────────────────────┤
│ [{"name":"Product"}] │
└──────────────────────────┘
About html_query
html_query is a DuckDB extension for querying HTML using CSS selectors.
Functions
| Function | Returns | Description |
|---|---|---|
html_query(html, selector?, extract?) |
VARCHAR | First matching element |
html_query_all(html, selector?, extract?) |
VARCHAR[] | All matching elements as list |
html_extract_json(html, selector, var_pattern?) |
JSON array | JSON from script tags |
All functions accept both VARCHAR and BLOB input types.
Extract Parameter
The extract parameter specifies what to extract from matched elements:
| Value | Description |
|---|---|
| (omitted) | Full HTML of element |
@text or text |
Inner text content |
@href, href |
href attribute |
@src, src |
src attribute |
data-test-id |
Any attribute name |
['@href', '@text'] |
Multiple attributes as JSON object |
html_query(html, selector, extract)
Extract first HTML element matching CSS selector.
Examples:
SELECT html_query(html, 'title', '@text') FROM pages;
SELECT html_query(html, 'a.nav-link', '@href') FROM pages;
SELECT html_query(html, 'div.content') FROM pages; -- returns HTML
html_query_all(html, selector, extract)
Extract all HTML elements matching CSS selector as VARCHAR[].
Examples:
SELECT html_query_all(html, 'a', '@href') FROM pages;
-- Returns: [/home, /about, /contact]
SELECT list_extract(html_query_all(html, 'a', '@href'), 2) FROM pages;
-- Access second element
SELECT html_query_all(html, 'a', ['@href', '@text']) FROM pages;
-- Returns: [{"href":"/home","text":"Home"}, ...]
html_extract_json(html, selector, var_pattern)
Extract JSON from HTML scripts. Supports LD+JSON and JavaScript variables. Always returns a JSON array.
Examples:
-- Extract LD+JSON (decodes HTML entities)
SELECT html_extract_json(html, 'script[type="application/ld+json"]')->0 FROM pages;
-- Extract JS variable (decodes hex escapes like \x22)
SELECT html_extract_json(html, 'script', 'var config')->0 FROM pages;
CSS Selectors
- Tag:
div,p,a - Class:
.classname - ID:
#idname - Attribute:
[href],[type="application/ld+json"] - Pseudo:
:first-child,:last-child,:nth-child(n) - Combinators:
div > p,div p
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| html_extract_json | scalar | NULL | NULL | |
| html_query | scalar | NULL | NULL | |
| html_query_all | scalar | NULL | NULL |