GBIF Name Parser

The core GBIF scientific name parser library.
https://github.com/gbif/name-parser

Category: Biosphere
Sub Category: Biodiversity Data Cleaning and Standardization

Keywords from Contributors

biodiversity-informatics darwin-core taxonomy gbif tdwg biodiversity species snapshot interest-group

Last synced: about 18 hours ago

Repository metadata

The core GBIF scientific name parser library

Host: GitHub
URL: https://github.com/gbif/name-parser
Owner: gbif
License: apache-2.0
Created: 2014-01-24T10:44:23.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2026-07-07T11:32:04.000Z (13 days ago)
Last Synced: 2026-07-10T05:03:59.238Z (10 days ago)
Language: Java
Size: 3.89 MB
Stars: 19
Watchers: 18
Forks: 4
Open Issues: 52
Releases: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

GBIF Name Parser

A library and command-line tool that parses scientific names, including the
authorship, rank, hybrid markers and nomenclatural notes, into a structured
`ParsedName` model.

Modules

| Module | Purpose |
| --- | --- |
| `name-parser-api` | Pure model + interface module: `ParsedName`, `Authorship`, `Rank`, `NomCode`, `NameType`, the `NameParser` interface, plus formatter / Unicode utilities. Depend on this if you only need the data model. |
| `name-parser` | The parser implementation. Single public entry point: `org.gbif.nameparser.NameParserImpl`. |
| `name-parser-cli` | Command-line tools (`parse`, `compare`, `benchmark`) wrapping the parser, packaged as an executable shaded jar. |

Build everything with mvn install from the repo root.

Library use

<dependency>
  <groupId>org.gbif</groupId>
  <artifactId>name-parser</artifactId>
  <version>4.0.0-SNAPSHOT</version>
</dependency>

NameParser parser = new NameParserImpl();
ParsedName pn = parser.parse("Vulpes vulpes silaceus Miller, 1907", null, null, null);

Command-line interface

After mvn install, the executable jar is at
name-parser-cli/target/name-parser-cli-<version>-shaded.jar.

java -jar name-parser-cli-<version>-shaded.jar <command> [options]

| Command | What it does |
| --- | --- |
| `parse` | Stream a text file with one name per row through the parser and write a JSONL file (one JSON object per row). |
| `compare` | Stream two JSONL files in lockstep, report aggregate metrics and a per-row dump of every differing parsed value. |
| `benchmark` | Measure parser throughput against a name-per-line input file (count, total / avg / min / p50 / p95 / max). |

Run <command> --help for the full per-command option list.

All commands stream their input — memory use stays flat regardless of input size,
so multi-million-row inputs are fine.

Bundled sample corpora

Sample inputs ship in name-parser-cli/data/:

- benchmark-data.txt: ~8k mixed names (hand-picked + test-assertion inputs +
  random Catalogue of Life rows with authorship) used for throughput benchmarking.
  Top up with more random names anytime via:
  ```
  python3 name-parser-cli/scripts/append-colnames-sample.py [-n 2000] [--seed 17]
  ```
  The script reservoir-samples col-names.tsv in a single pass and appends rows
  as scientificName authorship; manual edits to the benchmark file are
  preserved.
- col-names.tsv: the full Catalogue of Life names dump (~6.3M rows, ~340 MB,
  not tracked in git; drop your own copy here).
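
The single-pass sampling mentioned for the top-up script can be illustrated with classic reservoir sampling (Algorithm R). The following is a minimal Java sketch of the general technique, not a port of append-colnames-sample.py: the class name, method name and parameters are hypothetical and only mirror the idea behind `-n` and `--seed`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch of reservoir sampling (Algorithm R); not the script's actual code. */
final class ReservoirSample {

  /** Picks up to n items uniformly at random from a stream of unknown length in one pass. */
  static <T> List<T> sample(Iterator<T> items, int n, long seed) {
    Random rnd = new Random(seed);
    List<T> reservoir = new ArrayList<>(n);
    long seen = 0;
    while (items.hasNext()) {
      T item = items.next();
      seen++;
      if (reservoir.size() < n) {
        // fill the reservoir with the first n items
        reservoir.add(item);
      } else {
        // keep the new item with probability n / seen, replacing a random slot
        long j = (long) (rnd.nextDouble() * seen);
        if (j < n) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<String> names = List.of(
        "Felis catus", "Vulpes vulpes", "Abies alba", "Quercus robur", "Larus fuscus");
    System.out.println(sample(names.iterator(), 2, 17L));
  }
}
```

Memory use depends only on the sample size, which is why a multi-million-row dump can be sampled without loading it.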

Each command's --input defaults assume you run it from the repo root.

`parse`

Usage: name-parser-cli parse [options]

Options:
  --input=PATH    source file (default: data/col-names.tsv; '-' = stdin)
  --output=PATH   target file (default: <input>.<format-ext>; '-' = stdout)
  --format=FMT    output format: jsonl (default), json, csv, tsv
                  csv / tsv produce a flat ColDP Name file with header
  --quiet         suppress progress output
  -h --help       print this message and exit

Use - as the input or output path to stream from stdin / to stdout — the
command is fully unix-pipe friendly. Progress messages and the final summary
are written to stderr so stdout stays a clean data stream:

cat names.txt | name-parser-cli parse --input=- --output=- --format=tsv | head
xz -dc col-names.tsv.xz | name-parser-cli parse --input=- --output=- --format=jsonl > col.jsonl

Input

The input format is auto-detected from the first non-blank, non-comment line:

- ColDP Name file (TSV or CSV): recognised when the header row contains
  any `ColdpTerm` property names (looked up via `ColdpTerm.find`). Only the
  columns the parser interface accepts are honoured: ID, scientificName,
  authorship, rank, code. Other columns are read but ignored.
- Plain text: one name per line. If a line contains a tab, only the
  substring before the first tab is treated as the name (so col-names.tsv is
  usable both as ColDP-style TSV and as bare plain text).

Lines starting with # and blank lines are skipped.
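
The plain-text rules above (skip blank and `#` lines, keep only the text before the first tab) can be sketched in a few lines of Java. `PlainTextNames` and its methods are hypothetical names used for illustration; the CLI's own reader may differ, for example in how it trims whitespace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper mirroring the plain-text input rules described above. */
final class PlainTextNames {

  /** Returns the name on a line, or null if the line is blank or a # comment. */
  static String extractName(String line) {
    if (line == null || line.isBlank() || line.startsWith("#")) {
      return null;
    }
    // only the substring before the first tab counts as the name
    int tab = line.indexOf('\t');
    return tab < 0 ? line : line.substring(0, tab);
  }

  /** Reads line by line and collects the names that survive the rules. */
  static List<String> readNames(BufferedReader reader) throws IOException {
    List<String> names = new ArrayList<>();
    String line;
    while ((line = reader.readLine()) != null) {
      String name = extractName(line);
      if (name != null) {
        names.add(name);
      }
    }
    return names;
  }

  public static void main(String[] args) throws IOException {
    String input = "# sample\n\nFelis catus\tLinnaeus, 1758\nAbies alba\n";
    // prints [Felis catus, Abies alba]
    System.out.println(readNames(new BufferedReader(new StringReader(input))));
  }
}
```

The sketch collects names into a list for brevity; a streaming consumer would hand each name to the parser as soon as it is read.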

Output formats

| Format | Description |
| --- | --- |
| `jsonl` (default) | One self-contained JSON object per line; consumed by `compare`. |
| `json` | Single document containing a JSON array of all rows (streamed; not held in memory). |
| `csv` / `tsv` | Flat ColDP Name file with header row. |

JSON / JSONL rows look like:

{"line":42,"id":"42","input":"Felis catus","parsed":{ ...full ParsedName... }}
{"line":99,"id":"99","input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}

The id field is populated from the ColDP ID column when present; otherwise
it is omitted.

ColDP CSV/TSV column mapping

Every structural ParsedName field maps to a ColDP column. Where the ColDP
Name entity lacks a column but the NameUsage entity defines one, that
NameUsage term is used (nameStatus, namePhrase, namePublishedInPage,
provisional, extinct). Parser-only fields without a ColDP equivalent are
written into custom columns prefixed with np: — strict ColDP readers ignore
unknown columns, so the file stays valid ColDP.

Multi-value rules: author lists join with | (the ColDP convention); notho
parts join with ,.

| `ParsedName` field | ColDP column |
| --- | --- |
| `id` (from input) | `ID` (falls back to verbatim scientificName when absent) |
| `canonicalNameWithoutAuthorship()` (`Candidatus` prefixed when applicable) | `scientificName` |
| `authorshipComplete()` | `authorship` |
| `rank`, `code` | `rank`, `code` (lower-cased) |
| `nomenclaturalNote` (or `manuscript` flag) | `nameStatus` |
| `uninomial`, `genus`, `infragenericEpithet`, `specificEpithet`, `infraspecificEpithet`, `cultivarEpithet` | same column names |
| `notho` (every flagged part, comma-joined) | `notho` |
| `originalSpelling` | `originalSpelling` |
| `combinationAuthorship.{authors,exAuthors,year}` | `combinationAuthorship`, `combinationExAuthorship`, `combinationAuthorshipYear` (authors joined with `\|`) |
| `basionymAuthorship.{authors,exAuthors,year}` | `basionymAuthorship`, `basionymExAuthorship`, `basionymAuthorshipYear` (authors joined with `\|`) |
| `publishedIn` (free text) | `namePublishedInPage` |
| `extinct` | `extinct` |
| `phrase` | `namePhrase` |
| `doubtful` | `provisional` |
| `type` (when not `SCIENTIFIC`) | `np:type` |
| `sanctioningAuthor` | `np:sanctioningAuthor` |
| `taxonomicNote` (sensu) | `np:taxonomicNote` |
| `unparsed` | `np:unparsed` |
| `warnings` (joined with `\|`) | `np:warnings` |
| (parser failure message) | `np:error` |

Unparsable rows are still written: ID, scientificName (the verbatim input)
and the np:type / np:error columns are populated.
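
As a small illustration of the multi-value rules stated earlier in this section (authors joined with `|`, notho parts joined with `,`), the sketch below builds one tab-separated row from already-extracted values. `ColdpCells` and its methods are hypothetical, the sample values are made up, and the real writer may escape or order cells differently.

```java
import java.util.List;

/** Hypothetical sketch of the multi-value joining rules; not the CLI's writer. */
final class ColdpCells {

  /** Author lists are joined with '|', the ColDP convention for multi-value cells. */
  static String joinAuthors(List<String> authors) {
    return String.join("|", authors);
  }

  /** Notho (hybrid-flagged) name parts are joined with ','. */
  static String joinNotho(List<String> parts) {
    return String.join(",", parts);
  }

  /** Joins already-formatted cells into one tab-separated row. */
  static String tsvRow(String... cells) {
    return String.join("\t", cells);
  }

  public static void main(String[] args) {
    String row = tsvRow(
        "Abies alba",
        joinAuthors(List.of("Mill.", "L.")),          // made-up author list
        joinNotho(List.of("generic", "specific")));   // part labels are illustrative
    System.out.println(row);
  }
}
```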

`compare`

Usage: name-parser-cli compare [options] <a.jsonl> <b.jsonl> [diffs.txt]

Options:
  --a=PATH              first JSONL file (alt. to first positional arg)
  --b=PATH              second JSONL file (alt. to second positional arg)
  --output=PATH         write per-row diffs here (default: stdout)
  --ignore-whitespace   strip whitespace from string leaves before compare
  --max-diffs=N         cap per-row diff dump at N rows (default: 100)
  -h --help             print this message and exit

Both inputs are expected to come from the same source file (matching line
numbers, same row order). The summary reports rows compared / identical /
differing, status transitions (PARSED→ERROR, ERROR→PARSED, …) and the top
differing field paths. Whitespace inside parsed string values is significant by
default — pass --ignore-whitespace to suppress whitespace-only differences in
parsed values (the JSON formatting itself is ignored either way).
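
A rough sketch of the lockstep comparison idea follows. It makes several assumptions that are not confirmed by the text above: each row is modelled as a flat map from field path to string value, "ignoring whitespace" is taken to mean removing every whitespace character, and status transitions are left out. All class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a lockstep row comparison; not the CLI's implementation. */
final class LockstepCompare {

  /** Aggregate counters collected while walking both inputs. */
  static final class Summary {
    int compared;
    int identical;
    int differing;
    /** Number of rows in which each field path differed. */
    final Map<String, Integer> fieldDiffs = new TreeMap<>();
  }

  /** Assumption: ignoring whitespace removes all whitespace characters from a value. */
  static String normalize(String value, boolean ignoreWhitespace) {
    if (value == null || !ignoreWhitespace) {
      return value;
    }
    return value.replaceAll("\\s+", "");
  }

  /** Compares row i of a with row i of b; rows map a field path to its string value. */
  static Summary compare(List<Map<String, String>> a,
                         List<Map<String, String>> b,
                         boolean ignoreWhitespace) {
    Summary s = new Summary();
    int rows = Math.min(a.size(), b.size());
    for (int i = 0; i < rows; i++) {
      Map<String, String> ra = a.get(i);
      Map<String, String> rb = b.get(i);
      // union of field paths, so a field missing on one side counts as a diff
      TreeSet<String> paths = new TreeSet<>(ra.keySet());
      paths.addAll(rb.keySet());
      boolean same = true;
      for (String path : paths) {
        String va = normalize(ra.get(path), ignoreWhitespace);
        String vb = normalize(rb.get(path), ignoreWhitespace);
        if (!Objects.equals(va, vb)) {
          same = false;
          s.fieldDiffs.merge(path, 1, Integer::sum);
        }
      }
      s.compared++;
      if (same) {
        s.identical++;
      } else {
        s.differing++;
      }
    }
    return s;
  }

  public static void main(String[] args) {
    List<Map<String, String>> a = List.of(Map.of("parsed.genus", "Felis"));
    List<Map<String, String>> b = List.of(Map.of("parsed.genus", "Felis "));
    Summary s = compare(a, b, false);
    System.out.println(s.compared + " compared, " + s.differing + " differing " + s.fieldDiffs);
  }
}
```

The sketch takes in-memory lists for brevity; reading one row from each file at a time would keep memory flat in the same way the commands are described as doing.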

`benchmark`

Usage: name-parser-cli benchmark [options]

Options:
  --input=PATH    source file (default: data/benchmark-data.txt)
  --warmup        do an extra untimed pass over the input first to warm the JIT
  -h --help       print this message and exit

Pure throughput measurement: every input row is parsed and timed. JIT warmup
is opt-in via --warmup, in which case the input is streamed through the
parser once without timing before the timed pass; the HotSpot-warmed timings
then tend to be ~10× lower than in a cold run. Nothing is written to disk; the
report goes to stdout.
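
To make the reported statistics concrete, here is a minimal timing sketch with an optional untimed warmup pass. It is an illustration under assumptions, not the benchmark command's code: the class and method names are hypothetical, the workload is a stand-in for the parser, and the percentile uses the nearest-rank definition, which may differ from what the CLI reports.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical timing harness illustrating the reported statistics; not the CLI's code. */
final class TimingSketch {

  /** Nearest-rank percentile over an ascending-sorted array. */
  static long percentile(long[] sorted, double p) {
    if (sorted.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(0, rank - 1)];
  }

  /** Times one call of work per input and returns the sorted durations in nanoseconds. */
  static long[] measure(List<String> inputs, Consumer<String> work, boolean warmup) {
    if (warmup) {
      // untimed pass so the JIT can compile the hot paths first
      inputs.forEach(work);
    }
    long[] nanos = new long[inputs.size()];
    for (int i = 0; i < nanos.length; i++) {
      long start = System.nanoTime();
      work.accept(inputs.get(i));
      nanos[i] = System.nanoTime() - start;
    }
    Arrays.sort(nanos);
    return nanos;
  }

  public static void main(String[] args) {
    List<String> names = List.of(
        "Felis catus", "Abies alba Mill.", "Vulpes vulpes silaceus Miller, 1907");
    // stand-in workload; a real benchmark would call the parser here
    long[] t = measure(names, s -> s.toUpperCase().split(" "), true);
    long total = Arrays.stream(t).sum();
    System.out.printf("count=%d total=%dns avg=%dns min=%dns p50=%dns p95=%dns max=%dns%n",
        t.length, total, total / t.length, t[0],
        percentile(t, 50), percentile(t, 95), t[t.length - 1]);
  }
}
```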

License

Apache 2.0.

Owner metadata

Name: Global Biodiversity Information Facility
Login: gbif
Email:
Kind: organization
Description:
Website: https://www.gbif.org
Location: Copenhagen, Denmark
Twitter:
Company:
Icon url: https://avatars.githubusercontent.com/u/1963797?v=4
Repositories: 288
Last synced at: 2024-04-14T06:45:04.085Z
Profile URL: https://github.com/gbif

GitHub Events

Total

Delete event: 3
Pull request event: 1
Fork event: 1
Issues event: 2
Watch event: 1
Issue comment event: 4
Push event: 71
Create event: 6

Last Year

Delete event: 1
Pull request event: 1
Issues event: 1
Watch event: 1
Issue comment event: 2
Push event: 52
Create event: 3

Committers metadata

Last synced: 4 days ago

Total Commits: 729
Total Committers: 14
Avg Commits per committer: 52.071
Development Distribution Score (DDS): 0.246

Commits in past year: 157
Committers in past year: 2
Avg Commits per committer in past year: 78.5
Development Distribution Score (DDS) in past year: 0.115

Name	Email	Commits
Markus Döring	m**g@g**g	550
gbif-jenkins	d**v@g**g	102
gbif-jenkins	j**s@r**g	30
gbif-jenkins	j**s@j**g	18
pal155	D**r@c**u	8
Federico Mendez	f**z@g**g	7
Oliver Meyn	o**r@m**m	3
Kyle Braak	k**k@g**g	3
dependabot[bot]	4****]	2
Matthew Blissett	m**t@g**g	2
Thomas Stjernegaard Jeppesen	t**n@g**g	1
Nikolay Volik	n**k@g**g	1
Christian Gendreau	c****u	1
Jorrit Poelen	j**n@g**m	1

Issue and Pull Request metadata

Last synced: 18 days ago

Total issues: 4
Total pull requests: 3
Average time to close issues: 1 day
Average time to close pull requests: almost 2 years
Total issue authors: 3
Total pull request authors: 2
Average comments per issue: 1.25
Average comments per pull request: 0.33
Merged pull request: 0
Bot issues: 0
Bot pull requests: 3

Past year issues: 3
Past year pull requests: 1
Past year average time to close issues: 2 days
Past year average time to close pull requests: N/A
Past year issue authors: 3
Past year pull request authors: 1
Past year average comments per issue: 1.33
Past year average comments per pull request: 0.0
Past year merged pull request: 0
Past year bot issues: 0
Past year bot pull requests: 1

More stats: https://issues.ecosyste.ms/repositories/lookup?url=https://github.com/gbif/name-parser

Top Issue Authors

djtfmartin (2)
CecSve (1)
mdoering (1)

Top Pull Request Authors

dependabot[bot] (2)
renovate[bot] (1)

Top Issue Labels

bug (1)

Top Pull Request Labels

dependencies (2)
java (1)

Dependencies

name-parser/pom.xml maven

com.google.guava:guava
commons-io:commons-io
org.apache.commons:commons-lang3
org.gbif:name-parser-api
org.slf4j:slf4j-api
ch.qos.logback:logback-classic test
junit:junit test
org.gbif:name-parser-api test

name-parser-api/pom.xml maven

com.google.code.findbugs:jsr305
com.google.guava:guava
org.apache.commons:commons-lang3
org.slf4j:slf4j-api
ch.qos.logback:logback-classic test
commons-io:commons-io test
junit:junit test

pom.xml maven

com.google.code.findbugs:jsr305 3.0.2
com.google.guava:guava 28.0-jre
commons-io:commons-io 2.8.0
org.apache.commons:commons-lang3 3.12.0
org.gbif:gbif-api 0.166
org.gbif:name-parser 3.7.3-SNAPSHOT
org.gbif:name-parser-api 3.7.3-SNAPSHOT
org.gbif:name-parser-gbif 3.7.3-SNAPSHOT
org.slf4j:slf4j-api 1.7.24
ch.qos.logback:logback-classic 1.2.3 test
junit:junit 4.12 test
org.gbif:name-parser-api 3.7.3-SNAPSHOT test

name-parser-cli/pom.xml maven

ch.qos.logback:logback-classic *
com.google.code.gson:gson *
org.catalogueoflife:coldp *
org.gbif:name-parser *
org.slf4j:slf4j-api *
junit:junit * test

Score: 6.901737206656573
