Relict

Reproducible Environmental DNA Analysis Platform with Conservation-Grade Provenance

Role: Full-Stack Bioinformatics Engineering, Pipeline Architecture, Scientific Systems Design & Research Publication

Built by: Shaurya Punj

Location: India

Year: 2025 - 2026

Overview

Relict is an open-source environmental DNA analysis platform that turns raw FASTQ sequencing reads into publication-ready biodiversity data - ASVs, seven-rank taxonomy, alpha-diversity indices, and IUCN Red List conservation status - in a single automated run. Built for ecologists, citizen scientists, and conservation researchers who refuse to stitch together ten bash scripts and a spreadsheet. Every number is computed by real, peer-reviewed bioinformatics tools. No mock data. No fabricated metrics. Ever.

Access on Request

Not publicly deployed.

Book a 20-minute private walkthrough and I will run the whole system end-to-end: architecture, code path, live UI, and the edge cases that shaped it.


The Problem

The eDNA field is mature in theory and broken in practice. Researchers burn weeks gluing QIIME 2, R, and Excel together, and conservation status is looked up species-by-species by hand. Provenance - the exact tool versions, reference DB hashes, and parameters that produced a result - is almost never captured, which leaves two-thirds of published eDNA findings irreproducible. Existing platforms either hide the pipeline behind a paywall or skip the conservation layer entirely. Small research groups end up publishing without GBIF-compliant metadata, and their samples never make it into the global biodiversity record.

The Solution

Relict collapses the eight-stage eDNA workflow into one async FastAPI pipeline: upload a FASTQ, watch eight stages stream progress over a live WebSocket, download a Darwin Core Archive that GBIF accepts without a single manual edit. Under the hood it wires fastp, vsearch UNOISE3, and scikit-bio to a PostgreSQL metadata spine and a signed SHA-256 provenance manifest that pins every tool version, every reference database, every parameter. The 2.4-million-sequence reference stack (SILVA 138.1 + MIDORI2 GB269 + MitoFish) lives on a mounted disk so taxonomy assignment runs in seconds, not hours. Reproducibility is not a doc page - it is cryptographically enforced.
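The provenance idea can be illustrated with a minimal sketch: serialise the manifest to canonical JSON (sorted keys, fixed separators) so the same content always yields the same bytes, then seal it with a SHA-256 digest. The field names and the `seal` / `verify` helpers below are hypothetical, not Relict's actual schema. Note also that a bare SHA-256 digest detects any change to the manifest but does not by itself prove who produced it; whether Relict adds a keyed signature on top is not stated here.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise deterministically: sorted keys, no optional whitespace."""
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")


def seal(manifest: dict) -> dict:
    """Wrap the manifest together with the SHA-256 digest of its canonical form."""
    digest = hashlib.sha256(canonical_bytes(manifest)).hexdigest()
    return {"manifest": manifest, "sha256": digest}


def verify(sealed: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    expected = hashlib.sha256(canonical_bytes(sealed["manifest"])).hexdigest()
    return expected == sealed["sha256"]


# Hypothetical manifest content; tool versions are the ones listed in the stack.
example = seal({
    "tools": {"fastp": "0.24.0", "vsearch": "2.28.1"},
    "parameters": {"unoise_alpha": 2.0, "min_size": 2},
})
```

Because the serialisation is canonical, two runs that record the same versions and parameters in a different key order still produce the same digest, and editing any value afterwards makes `verify` return `False`.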

Project Walkthrough

Agent Architecture

Sample Ingestion & Quality Control Engine

Input: Raw FASTQ reads (Illumina / Nanopore / PacBio, up to 500 MB per sample)

Engine: MinIO S3 upload -> SHA-256 fingerprint -> fastp 0.24.0 adapter trim + Q20 sliding window

Output: QC-filtered FASTQ + machine-readable fastp JSON report + human HTML dashboard

Speed: < 30s for 50K reads
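As a rough sketch of the fingerprint and QC steps above, the snippet below hashes an upload in chunks and assembles a single-end fastp command. The function names and output file names are assumptions for illustration; the fastp flags used (`--in1`, `--out1`, `--cut_right`, `--cut_mean_quality`, `--json`, `--html`) are standard fastp options, but the exact parameters Relict passes are not documented here. The command is only built, not executed.

```python
import hashlib
from pathlib import Path


def sha256_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a large upload never sits fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fastp_command(raw: Path, out_dir: Path) -> list[str]:
    """Build (but do not run) a single-end fastp call with a Q20 sliding-window trim."""
    return [
        "fastp",
        "--in1", str(raw),
        "--out1", str(out_dir / "qc.fastq.gz"),    # hypothetical output name
        "--cut_right",                             # slide a window along the read
        "--cut_mean_quality", "20",                # cut once the window mean drops below Q20
        "--json", str(out_dir / "fastp.json"),     # machine-readable report
        "--html", str(out_dir / "fastp.html"),     # human-readable report
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed; adapter trimming is on by default in fastp, so no extra flag is needed for it.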

ASV Discovery Engine

Input: QC-passed sequences

Engine: vsearch 2.28.1 dereplication -> UNOISE3 denoising (alpha=2.0, min_size=2)

Output: Amplicon Sequence Variants with per-ASV abundances, singleton-free, ready for taxonomy

Speed: ~60s for 45K reads
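The two vsearch steps above might be assembled roughly as follows. This is a sketch under assumptions: the QC output is taken to be available as FASTA, the intermediate file names are hypothetical, and the helper names are not from Relict's codebase. The vsearch options used (`--derep_fulllength`, `--cluster_unoise`, `--unoise_alpha`, `--minsize`, `--sizeout`, `--centroids`, `--output`) are standard; `;size=N` is the abundance annotation vsearch writes into sequence headers.

```python
import re
from pathlib import Path

SIZE_TAG = re.compile(r";size=(\d+)")


def unoise3_commands(
    qc_fasta: Path, work_dir: Path, alpha: float = 2.0, min_size: int = 2
) -> list[list[str]]:
    """Build (but do not run) dereplication followed by UNOISE3 denoising."""
    derep = work_dir / "derep.fasta"   # hypothetical intermediate file
    asvs = work_dir / "asvs.fasta"     # hypothetical ASV centroid file
    return [
        ["vsearch", "--derep_fulllength", str(qc_fasta),
         "--sizeout", "--output", str(derep)],
        ["vsearch", "--cluster_unoise", str(derep),
         "--unoise_alpha", str(alpha), "--minsize", str(min_size),
         "--sizeout", "--centroids", str(asvs)],
    ]


def abundance(header: str) -> int:
    """Read the ';size=N' annotation from a FASTA header, defaulting to 1."""
    match = SIZE_TAG.search(header)
    return int(match.group(1)) if match else 1
```

With `min_size=2`, unique sequences seen only once are discarded before denoising, which is what makes the resulting ASV set singleton-free.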

Taxonomic Classification Engine

Input: ASV centroids + amplicon marker (16S V4 / 12S MiFish / COI Leray / 18S V9 / rbcL / ITS2)

Engine: vsearch --usearch_global (>=80% identity) vs SILVA 138.1 + MIDORI2 GB269 + MitoFish (2.4M reference sequences)

Output: Seven-rank Linnaean lineage (Kingdom -> Species) + confidence score + reference accession

Speed: < 90s per 50 ASVs on cached UDB index
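One way to turn alignment hits into the seven-rank output is sketched below. It assumes the search was run with tab-separated `--blast6out` output (query, target, percent identity in the first three columns) and that lineages arrive as semicolon-delimited strings; both the helper names and the lineage format are assumptions, since reference databases differ in how they label sequences.

```python
from typing import Iterable, Optional

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def best_hits(
    blast6_lines: Iterable[str], min_identity: float = 80.0
) -> dict[str, tuple[str, float]]:
    """Keep, per query ASV, the highest-identity hit at or above the threshold."""
    best: dict[str, tuple[str, float]] = {}
    for line in blast6_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        query, target, identity = fields[0], fields[1], float(fields[2])
        if identity < min_identity:
            continue
        if query not in best or identity > best[query][1]:
            best[query] = (target, identity)
    return best


def seven_ranks(lineage: str) -> dict[str, Optional[str]]:
    """Split a semicolon-delimited lineage and pad missing ranks with None."""
    parts: list[Optional[str]] = [p.strip() or None for p in lineage.split(";")]
    parts = (parts + [None] * len(RANKS))[: len(RANKS)]
    return dict(zip(RANKS, parts))
```

Padding with `None` keeps every record seven ranks wide, so an ASV resolved only to genus still fits the same table as one resolved to species.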

Conservation Intelligence Engine

Input: Assigned species names

Engine: GBIF Backbone species match -> GBIF occurrence aggregator -> IUCN Red List v3 status lookup -> invasive-species cross-check

Output: Per-species conservation record - IUCN category (LC/NT/VU/EN/CR/EW/EX), population trend, occurrence count, invasive flag; threatened-species alert surfaced to user

Speed: ~200ms per species (30-day cache hit), ~1s on cache miss
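The gap between cache hit and cache miss comes from skipping the remote lookups when a recent record exists. A minimal in-memory sketch of a 30-day cache is shown below; `TTLCache`, `get_or_fetch`, and the `fetch` callback are hypothetical names, and a deployment built on the stack listed here would more plausibly keep these entries in Redis with a key expiry rather than in a Python dict.

```python
import time
from typing import Callable

THIRTY_DAYS = 30 * 24 * 60 * 60  # cache lifetime in seconds


class TTLCache:
    """Keep per-species records for a fixed lifetime, then refetch."""

    def __init__(
        self,
        ttl_seconds: float = THIRTY_DAYS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: dict[str, tuple[float, dict]] = {}

    def get_or_fetch(self, species: str, fetch: Callable[[str], dict]) -> dict:
        """Return a fresh cached record, or call `fetch` and store its result."""
        key = species.strip().lower()  # normalise so spelling variants share an entry
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        record = fetch(species)
        self._entries[key] = (now, record)
        return record
```

Here `fetch` stands in for whatever function performs the GBIF and IUCN requests; it is only called on a miss or after the entry has aged past 30 days.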

Provenance, Diversity & Export Engine

Input: Full pipeline execution trace + ASV/taxonomy tables

Engine: scikit-bio alpha-diversity (Shannon, Simpson, Chao1, evenness) -> UMAP + HDBSCAN ordination -> canonical JSON manifest -> SHA-256 signature

Output: Darwin Core Archive (GBIF-ready) + BIOM 2.1.0 + CSV + HTML report + cryptographically signed provenance manifest

Speed: < 15s full export bundle
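For readers unfamiliar with the four alpha-diversity indices, the textbook definitions can be written out in a few lines of plain Python. This is an illustrative sketch, not the scikit-bio code the pipeline calls: it assumes the natural logarithm for Shannon, the 1 - sum(p^2) form of Simpson, the bias-corrected form of Chao1, and Pielou's J for evenness, and library defaults may differ on those choices.

```python
import math
from typing import Sequence


def shannon(counts: Sequence[int]) -> float:
    """Shannon entropy H = -sum(p * ln p) over non-zero ASV counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts: Sequence[int]) -> float:
    """Simpson index as 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def pielou_evenness(counts: Sequence[int]) -> float:
    """Pielou's J = H / ln(S_obs); defined as 0 when fewer than two ASVs are present."""
    observed = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(observed) if observed > 1 else 0.0
```

For a perfectly even sample such as four ASVs with ten reads each, Shannon equals ln 4, Simpson equals 0.75, and evenness equals 1, which is a quick sanity check on any implementation.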

System Architecture

Technology Stack

client

React 18.3
TypeScript 5.8
Vite 5.4
Tailwind + shadcn/ui (66 components)
Three.js / @react-three/fiber
React Query, React Router v6, Framer Motion
WebSocket live telemetry

server

Python 3.11 + FastAPI + Uvicorn
24 REST endpoints + 1 WebSocket channel
SQLAlchemy 2.0 async + Alembic
PostgreSQL 16 (9 tables, 110+ columns)
Redis 7 (RQ queue + pub/sub)
Argon2id + JWT access/refresh rotation
structlog with request-ID propagation

bioinformatics

fastp 0.24.0 (QC + adapter trim)
vsearch 2.28.1 (dereplication, UNOISE3, global alignment)
cutadapt 4.9 (primer trimming)
scikit-bio 0.6.2 (alpha-diversity)
umap-learn 0.5.7 + hdbscan 0.8.40
biopython 1.84, biom-format 2.1.16

data

SILVA 138.1 SSU NR99 (436,680 sequences)
MIDORI2 GB269 COI (1.8M sequences)
MIDORI2 12S (193,724 sequences)
MitoFish mitogenomes
GBIF Backbone + Occurrence API
IUCN Red List v3 API
S3-compatible blob storage (MinIO / R2 / S3)
Docker Compose + Render Blueprint

Key Features

Eight-stage end-to-end pipeline - FASTQ in, GBIF-ready Darwin Core Archive out, zero manual steps.

Cryptographically signed provenance - every run emits a SHA-256-sealed manifest pinning tool versions, reference DB hashes, and parameters, so any result is byte-reproducible.

Integrated conservation layer - automatic IUCN Red List and GBIF cross-reference flags endangered and invasive species the moment they're detected.

Six amplicon markers, 2.4 million reference sequences - 16S V4, 12S MiFish, COI Leray, 18S V9, rbcL, ITS2, all version-pinned and SHA-verified.

Real-time WebSocket telemetry - the browser streams per-stage progress, read counts, and timing events as the worker executes. No polling, no "still working" spinner.

Publication-grade validation - 10/10 on a synthetic mock community; 51 ASVs at 100% taxonomy assignment from a 45,204-read real SRA dataset in 3.6 minutes; research paper drafted for Methods in Ecology and Evolution.

Screenshots

Relict - eDNA Analysis Pipeline Dashboard

Relict - Taxonomic Classification & Conservation Intelligence

Relict - Provenance Manifest & Darwin Core Export


Project Presentation

15 slides

Design rationale, six-service architecture, the eight-stage eDNA pipeline, conservation cross-referencing, benchmarks, and the research publication track for Relict.

