Search Shortcut cmd + k | ctrl + k
plinking_duck

Read PLINK 2 genomics file formats and run common genetic analyses directly in SQL

Maintainer(s): teaguesterling

Installing and Loading

INSTALL plinking_duck FROM community;
LOAD plinking_duck;

Example

-- Read variant metadata from a .pvar file
SELECT CHROM, POS, ID, REF, ALT
FROM read_pvar('cohort.pvar')
WHERE CHROM = '22';

-- Read genotypes in tidy format (one row per variant x sample)
SELECT chrom, pos, iid, genotype
FROM read_pfile('cohort', orient := 'genotype')
LIMIT 10;

-- Compute allele frequencies
SELECT * FROM plink_freq('cohort.pgen')
WHERE ALT_FREQ > 0.01;

-- Run a GWAS association test
SELECT * FROM plink_glm('cohort')
WHERE p_value < 5e-8;

About plinking_duck

PlinkingDuck brings PLINK 2 genotype, variant, and sample data into DuckDB, letting you query genomics datasets with standard SQL instead of format-specific command-line tools.

File readers:

  • read_pvar(path) — variant metadata (.pvar/.bim)
  • read_psam(path) — sample metadata (.psam/.fam)
  • read_pgen(path) — binary genotypes (.pgen)
  • read_pfile(prefix) — unified fileset reader with orient modes (variant/genotype/sample), sample subsetting, region and variant filtering

Analysis functions:

  • plink_freq — per-variant allele frequencies via fast genotype counting
  • plink_hardy — Hardy-Weinberg equilibrium exact test
  • plink_missing — per-variant or per-sample missingness rates
  • plink_ld — pairwise linkage disequilibrium (r², D, D')
  • plink_score — polygenic risk scoring with mean imputation
  • plink_glm — per-variant GWAS regression (linear, logistic, Firth)
  • plink_pca — principal component analysis via randomized SVD

Genotype output modes:

  • genotypes='struct' — STRUCT with named fields per sample
  • genotypes='counts' — fast genotype counting (no decompression)
  • genotypes='stats' — counts + AF, MAF, missingness, heterozygosity

Flexible inputs:

  • Unified variants parameter: indices, rsids, CPRA strings/structs, ranges
  • Parquet/CSV/table companions for variant and sample metadata
  • af_range / ac_range / genotype_range filter pushdown

All functions support projection pushdown (skip genotype decompression for metadata-only queries), parallel scanning, sample subsetting, and region filtering. Legacy PLINK 1 formats (.bim/.fam) are auto-detected.

Built on pgenlib for efficient access to the compressed .pgen binary format without full decompression.

For full documentation, see plinking-duck.readthedocs.io.

Added Functions

function_name function_type description comment examples
plink_freq table NULL NULL  
plink_glm table NULL NULL  
plink_hardy table NULL NULL  
plink_ld table NULL NULL  
plink_missing table NULL NULL  
plink_pca table NULL NULL  
plink_score table NULL NULL  
read_pfile table NULL NULL  
read_pgen table NULL NULL  
read_psam table NULL NULL  
read_pvar table NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

name description input_type scope aliases
plinking_max_matrix_elements Maximum genotype matrix elements for orient := 'sample' pre-read (variants x samples). Default 16 billion (~16 GB of int8). BIGINT GLOBAL []
plinking_max_threads Maximum threads for parallel scan operations. 0 = default (hardcoded cap of 16), >0 = cap at this value. BIGINT GLOBAL []
plinking_use_parquet_companions Auto-discover .pvar.parquet and .psam.parquet companion files. When true, parquet companions are preferred over text formats. BOOLEAN GLOBAL []