GBIF Name Parser
The core GBIF scientific name parser library.
https://github.com/gbif/name-parser
Category: Biosphere
Sub Category: Biodiversity Data Cleaning and Standardization
Keywords from Contributors
biodiversity-informatics darwin-core taxonomy gbif tdwg snapshot species biodiversity interest-group ontologies
Last synced: 20 minutes ago
Repository metadata
- Host: GitHub
- URL: https://github.com/gbif/name-parser
- Owner: gbif
- License: apache-2.0
- Created: 2014-01-24T10:44:23.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2026-05-16T07:49:28.000Z (18 days ago)
- Last Synced: 2026-05-16T09:37:37.585Z (18 days ago)
- Language: Java
- Size: 3.03 MB
- Stars: 19
- Watchers: 18
- Forks: 4
- Open Issues: 52
- Releases: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README.md
GBIF Name Parser(s)
The project contains various implementations of parsers for scientific names.
At the core there is an independent parser, mainly based on regular expressions, with minimal dependencies.
The modules provided by this project are:
- name-parser: The main GBIF Name Parser implementing the API natively
- name-parser-api: The minimal API to represent parsed names.
- name-parser-v1: The GBIF Name Parser wrapped to implement the GBIF API
The GBIF name parser has been tested with millions of GBIF names over many years.
An extensive body of unit tests, built up over the years, helps ensure high parsing quality.
A library and command-line tool that parses scientific names (including the
authorship, rank, hybrid markers and nomenclatural notes) into a structured
ParsedName model.
Modules
| Module | Purpose |
|---|---|
| name-parser-api | Pure model + interface module: ParsedName, Authorship, Rank, NomCode, NameType, the NameParser interface, plus formatter / Unicode utilities. Depend on this if you only need the data model. |
| name-parser | The parser implementation. Single public entry point: org.gbif.nameparser.NameParserImpl. |
| name-parser-cli | Command-line tools (parse, compare, benchmark) wrapping the parser, packaged as an executable shaded jar. |
Build everything with mvn install from the repo root.
Library use
<dependency>
<groupId>org.gbif</groupId>
<artifactId>name-parser</artifactId>
<version>4.0.0-SNAPSHOT</version>
</dependency>
NameParser parser = new NameParserImpl();
ParsedName pn = parser.parse("Vulpes vulpes silaceus Miller, 1907", null, null, null);
Command-line interface
After mvn install, the executable jar is at
name-parser-cli/target/name-parser-cli-<version>-shaded.jar.
java -jar name-parser-cli-<version>-shaded.jar <command> [options]
| Command | What it does |
|---|---|
| parse | Stream a text file with one name per row through the parser and write a JSONL file (one JSON object per row). |
| compare | Stream two JSONL files in lockstep, report aggregate metrics and a per-row dump of every differing parsed value. |
| benchmark | Measure parser throughput against a name-per-line input file (count, total / avg / min / p50 / p95 / max). |
Run <command> --help for the full per-command option list.
All commands stream their input — memory use stays flat regardless of input size,
so multi-million-row inputs are fine.
Bundled sample corpora
Sample inputs ship in name-parser-cli/data/:
- benchmark-data.txt: ~8k mixed names (hand-picked + test-assertion inputs + random Catalogue of Life rows with authorship) used for throughput benchmarking. Top up with more random names anytime via:
  python3 name-parser-cli/scripts/append-colnames-sample.py [-n 2000] [--seed 17]
  The script reservoir-samples col-names.tsv in a single pass and appends rows as scientificName authorship; manual edits to the benchmark file are preserved.
- col-names.tsv: the full Catalogue of Life names dump (~6.3M rows, ~340 MB, not tracked in git; drop your own copy here)
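The single-pass reservoir sampling mentioned above can be illustrated with a minimal Java sketch. This is not the code of append-colnames-sample.py (which is a Python script). The class name ReservoirSample and its sample method are hypothetical, and the sketch only shows the general technique (Algorithm R) with a fixed seed for repeatable picks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of reservoir sampling; not part of name-parser-cli. */
final class ReservoirSample {

  /** Picks up to n rows uniformly at random from a stream of unknown length, in one pass. */
  static <T> List<T> sample(Iterator<T> rows, int n, long seed) {
    Random rnd = new Random(seed);
    List<T> reservoir = new ArrayList<>();
    long seen = 0;
    while (rows.hasNext()) {
      T row = rows.next();
      seen++;
      if (reservoir.size() < n) {
        // Fill phase: the first n rows are always kept.
        reservoir.add(row);
      } else {
        // Draw a slot in [0, seen); the row replaces an entry only if the slot is below n,
        // which happens with probability n / seen.
        long slot = (long) (rnd.nextDouble() * seen);
        if (slot < n) {
          reservoir.set((int) slot, row);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<String> names = List.of("Felis catus", "Abies alba", "Vulpes vulpes", "Quercus robur", "Poa annua");
    // Same seed, same sample: useful when a benchmark file should be reproducible.
    System.out.println(sample(names.iterator(), 3, 17L));
  }
}
```

Memory use is bounded by n, not by the size of the input, which is why a multi-million-row dump can be sampled without loading it.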
Each command's --input defaults assume you run it from the repo root.
parse
Usage: name-parser-cli parse [options]
Options:
--input=PATH source file (default: data/col-names.tsv; '-' = stdin)
--output=PATH target file (default: <input>.<format-ext>; '-' = stdout)
--format=FMT output format: jsonl (default), json, csv, tsv
csv / tsv produce a flat ColDP Name file with header
--quiet suppress progress output
-h --help print this message and exit
Use - as the input or output path to stream from stdin / to stdout — the
command is fully unix-pipe friendly. Progress messages and the final summary
are written to stderr so stdout stays a clean data stream:
cat names.txt | name-parser-cli parse --input=- --output=- --format=tsv | head
xz -dc col-names.tsv.xz | name-parser-cli parse --input=- --output=- --format=jsonl > col.jsonl
Input
Plain text only — one name per line. Lines starting with # and blank lines
are skipped. If a line contains a tab, only the substring before the first
tab is treated as the name, so a bare TSV like col-names.tsv can be fed in
verbatim and the extra columns are silently ignored.
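The input rules above can be sketched in a few lines of Java. The class InputLines and the method extractName are hypothetical names for illustration, not the CLI's actual code. The sketch also assumes that no trimming is applied to the name, which the description above does not specify.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper mirroring the documented input rules; not part of name-parser-cli. */
final class InputLines {

  /** Returns the name to parse, or empty if the line should be skipped. */
  static Optional<String> extractName(String line) {
    // Blank lines and lines starting with '#' are skipped.
    if (line.isBlank() || line.startsWith("#")) {
      return Optional.empty();
    }
    // Only the substring before the first tab is treated as the name.
    int tab = line.indexOf('\t');
    return Optional.of(tab < 0 ? line : line.substring(0, tab));
  }

  public static void main(String[] args) {
    List<String> lines = List.of("# header", "", "Felis catus\tLinnaeus, 1758", "Abies alba Mill.");
    for (String line : lines) {
      extractName(line).ifPresent(System.out::println);
    }
  }
}
```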
Output formats
| Format | Description |
|---|---|
| jsonl (default) | One self-contained JSON object per line; consumed by compare. |
| json | Single document containing a JSON array of all rows (streamed; not held in memory). |
| csv / tsv | Flat ColDP Name file with header row. |
JSON / JSONL rows look like:
{"line":42,"input":"Felis catus","parsed":{ ...full ParsedName... }}
{"line":99,"input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}
ColDP CSV/TSV column mapping
Every structural ParsedName field maps to a ColDP column. Where the ColDP
Name entity lacks a column but the NameUsage entity defines one, that
NameUsage term is used (nameStatus, namePhrase, namePublishedInPage,
provisional, extinct). Parser-only fields without a ColDP equivalent are
written into custom columns prefixed with np: — strict ColDP readers ignore
unknown columns, so the file stays valid ColDP.
Multi-value rules: author lists join with | (the ColDP convention).
| ParsedName field | ColDP column |
|---|---|
| id (from input) | ID (falls back to verbatim scientificName when absent) |
| canonicalNameWithoutAuthorship() (Candidatus prefixed when applicable) | scientificName |
| authorshipComplete() | authorship |
| rank, code | rank, code (lower-cased) |
| nomenclaturalNote (or manuscript flag) | nameStatus |
| uninomial, genus, infragenericEpithet, specificEpithet, infraspecificEpithet, cultivarEpithet | same column names |
| notho (single hybrid-marker part, lower-cased) | notho |
| originalSpelling | originalSpelling |
| combinationAuthorship.{authors,exAuthors,year} | combinationAuthorship, combinationExAuthorship, combinationAuthorshipYear (authors joined with \|) |
| basionymAuthorship.{authors,exAuthors,year} | basionymAuthorship, basionymExAuthorship, basionymAuthorshipYear (authors joined with \|) |
| publishedIn (free text) | namePublishedInPage |
| extinct | extinct |
| phrase | namePhrase |
| doubtful | provisional |
| type (when not SCIENTIFIC) | np:type |
| sanctioningAuthor | np:sanctioningAuthor |
| taxonomicNote (sensu) | np:taxonomicNote |
| unparsed | np:unparsed |
| warnings (joined with \|) | np:warnings |
| (parser failure message) | np:error |
Unparsable rows are still written: ID, scientificName (the verbatim input)
and the np:type / np:error columns are populated.
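To make the multi-value rule concrete, here is a small Java sketch of how the three combination authorship columns from the mapping might be filled and written as one TSV line. The class ColdpRowSketch and its methods are hypothetical and do not reflect the CLI's writer; only the column names and the | separator come from the mapping above. Writing an empty string for a missing year is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the author-list joining rule; not the CLI's writer. */
final class ColdpRowSketch {

  /** Joins a multi-value field with '|', the convention named in the mapping. */
  static String joinAuthors(List<String> authors) {
    return String.join("|", authors);
  }

  /** Builds the three combination authorship columns in a stable order. */
  static Map<String, String> authorshipColumns(List<String> authors, List<String> exAuthors, String year) {
    Map<String, String> cols = new LinkedHashMap<>();
    cols.put("combinationAuthorship", joinAuthors(authors));
    cols.put("combinationExAuthorship", joinAuthors(exAuthors));
    // Assumption: a missing year is written as an empty cell.
    cols.put("combinationAuthorshipYear", year == null ? "" : year);
    return cols;
  }

  /** Renders the column values as one tab-separated line, in insertion order. */
  static String toTsvLine(Map<String, String> cols) {
    return String.join("\t", cols.values());
  }

  public static void main(String[] args) {
    Map<String, String> cols = authorshipColumns(List.of("Miller", "Smith"), List.of(), "1907");
    System.out.println(String.join("\t", cols.keySet()));
    System.out.println(toTsvLine(cols));
  }
}
```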
compare
Usage: name-parser-cli compare [options] <a.jsonl> <b.jsonl> [diffs.txt]
Options:
--a=PATH first JSONL file (alt. to first positional arg)
--b=PATH second JSONL file (alt. to second positional arg)
--output=PATH write per-row diffs here (default: stdout)
--ignore-whitespace strip whitespace from string leaves before compare
--max-diffs=N cap per-row diff dump at N rows (default: 100)
-h --help print this message and exit
Both inputs are expected to come from the same source file (matching line
numbers, same row order). The summary reports rows compared / identical /
differing, status transitions (PARSED→ERROR, ERROR→PARSED, …) and the top
differing field paths. Whitespace inside parsed string values is significant by
default — pass --ignore-whitespace to suppress whitespace-only differences in
parsed values (the JSON formatting itself is ignored either way).
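The lockstep comparison can be pictured with a short Java sketch. It assumes each row has already been flattened into a map from field path to string value; the class RowCompare and its methods are hypothetical and are not the compare command's implementation. It also assumes that ignoring whitespace means removing all whitespace characters from a value before comparing, which is one possible reading of the option.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of a per-row field comparison; not the compare command itself. */
final class RowCompare {

  /** Returns the field paths whose values differ between two rows. */
  static Set<String> diffPaths(Map<String, String> a, Map<String, String> b, boolean ignoreWhitespace) {
    Set<String> paths = new TreeSet<>(a.keySet());
    paths.addAll(b.keySet());
    Set<String> differing = new TreeSet<>();
    for (String path : paths) {
      String left = normalize(a.get(path), ignoreWhitespace);
      String right = normalize(b.get(path), ignoreWhitespace);
      // A path missing on one side counts as a difference.
      if (!Objects.equals(left, right)) {
        differing.add(path);
      }
    }
    return differing;
  }

  /** Walks two row lists in lockstep and counts how often each path differs. */
  static Map<String, Integer> tally(List<Map<String, String>> as, List<Map<String, String>> bs, boolean ignoreWhitespace) {
    Map<String, Integer> counts = new TreeMap<>();
    int rows = Math.min(as.size(), bs.size());
    for (int i = 0; i < rows; i++) {
      for (String path : diffPaths(as.get(i), bs.get(i), ignoreWhitespace)) {
        counts.merge(path, 1, Integer::sum);
      }
    }
    return counts;
  }

  private static String normalize(String value, boolean ignoreWhitespace) {
    // Assumption: "ignore whitespace" removes every whitespace character.
    return (value == null || !ignoreWhitespace) ? value : value.replaceAll("\\s+", "");
  }

  public static void main(String[] args) {
    Map<String, String> a = Map.of("genus", "Felis", "authorship", "L., 1758");
    Map<String, String> b = Map.of("genus", "Felis", "authorship", "L.,1758");
    System.out.println(diffPaths(a, b, false));
    System.out.println(diffPaths(a, b, true));
  }
}
```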
benchmark
Usage: name-parser-cli benchmark [options]
Options:
--input=PATH source file (default: data/benchmark-data.txt)
--warmup do an extra untimed pass over the input first to warm the JIT
-h --help print this message and exit
Pure throughput measurement — every input row is parsed and timed. JIT warmup
is opt-in via --warmup, in which case the input is streamed through the
parser once without timing before the timed pass; on subsequent runs the
HotSpot-warmed numbers tend to be ~10× lower. Nothing is written to disk; the
report goes to stdout.
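As an illustration of what such a measurement involves, the following Java sketch times a callback per input row and derives the reported statistics, with an optional untimed warmup pass. The class TimingStats, its measure method and the nearest-rank percentile rule are assumptions made for this sketch; the benchmark command may compute its numbers differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of per-row timing statistics; not the benchmark command itself. */
final class TimingStats {
  final long count, total, min, p50, p95, max;
  final double avg;

  TimingStats(long[] nanos) {
    long[] sorted = nanos.clone();
    Arrays.sort(sorted);
    count = sorted.length;
    total = Arrays.stream(sorted).sum();
    avg = count == 0 ? 0 : (double) total / count;
    min = count == 0 ? 0 : sorted[0];
    max = count == 0 ? 0 : sorted[sorted.length - 1];
    p50 = percentile(sorted, 50);
    p95 = percentile(sorted, 95);
  }

  /** Nearest-rank percentile on a sorted array (assumed rule), using integer arithmetic. */
  static long percentile(long[] sorted, int p) {
    if (sorted.length == 0) {
      return 0;
    }
    int rank = (int) (((long) p * sorted.length + 99) / 100); // ceil(p * n / 100)
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Times one call per name; with warmup, runs one extra untimed pass first. */
  static TimingStats measure(List<String> names, Consumer<String> parse, boolean warmup) {
    if (warmup) {
      names.forEach(parse); // untimed pass so the JIT can compile hot paths
    }
    long[] nanos = new long[names.size()];
    for (int i = 0; i < nanos.length; i++) {
      long start = System.nanoTime();
      parse.accept(names.get(i));
      nanos[i] = System.nanoTime() - start;
    }
    return new TimingStats(nanos);
  }

  public static void main(String[] args) {
    // The lambda stands in for a real parser call.
    TimingStats stats = measure(List.of("Felis catus", "Abies alba"), name -> name.toLowerCase(), true);
    System.out.println("count=" + stats.count + " p95(ns)=" + stats.p95);
  }
}
```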
License
Apache 2.0.
Owner metadata
- Name: Global Biodiversity Information Facility
- Login: gbif
- Email:
- Kind: organization
- Description:
- Website: https://www.gbif.org
- Location: Copenhagen, Denmark
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/1963797?v=4
- Repositories: 288
- Last synced at: 2024-04-14T06:45:04.085Z
- Profile URL: https://github.com/gbif
GitHub Events
Total
- Delete event: 3
- Pull request event: 1
- Fork event: 1
- Issues event: 2
- Watch event: 1
- Issue comment event: 4
- Push event: 54
- Create event: 6
Last Year
- Delete event: 1
- Pull request event: 1
- Issues event: 2
- Watch event: 1
- Issue comment event: 4
- Push event: 37
- Create event: 3
Committers metadata
Last synced: 2 days ago
Total Commits: 609
Total Committers: 14
Avg Commits per committer: 43.5
Development Distribution Score (DDS): 0.274
Commits in past year: 39
Committers in past year: 2
Avg Commits per committer in past year: 19.5
Development Distribution Score (DDS) in past year: 0.154
| Name | Email | Commits |
|---|---|---|
| Markus Döring | m****g@m****m | 442 |
| gbif-jenkins | d****v@g****g | 90 |
| gbif-jenkins | j****s@r****g | 30 |
| gbif-jenkins | j****s@j****g | 18 |
| pal155 | D****r@c****u | 8 |
| Federico Mendez | f****z@g****g | 7 |
| Oliver Meyn | o****r@m****m | 3 |
| Kyle Braak | k****k@g****g | 3 |
| dependabot[bot] | 4****] | 2 |
| Matthew Blissett | m****t@g****g | 2 |
| Thomas Stjernegaard Jeppesen | t****n@g****g | 1 |
| Nikolay Volik | n****k@g****g | 1 |
| Christian Gendreau | c****u | 1 |
| Jorrit Poelen | j****n@g****m | 1 |
Committer domains:
- gbif.org: 6
- mineallmeyn.com: 1
- csiro.au: 1
- jenkins-vh.gbif.org: 1
- rancor.gbif.org: 1
- mac.com: 1
Issue and Pull Request metadata
Last synced: 18 days ago
Total issues: 3
Total pull requests: 3
Average time to close issues: 2 days
Average time to close pull requests: almost 2 years
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 1.33
Average comments per pull request: 0.33
Merged pull request: 0
Bot issues: 0
Bot pull requests: 3
Past year issues: 2
Past year pull requests: 1
Past year average time to close issues: 4 days
Past year average time to close pull requests: N/A
Past year issue authors: 2
Past year pull request authors: 1
Past year average comments per issue: 1.5
Past year average comments per pull request: 0.0
Past year merged pull request: 0
Past year bot issues: 0
Past year bot pull requests: 1
Top Issue Authors
- djtfmartin (2)
- CecSve (1)
Top Pull Request Authors
- dependabot[bot] (2)
- renovate[bot] (1)
Top Issue Labels
Top Pull Request Labels
- dependencies (2)
- java (1)
Dependencies
- com.google.guava:guava
- commons-io:commons-io
- org.apache.commons:commons-lang3
- org.gbif:name-parser-api
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- junit:junit test
- org.gbif:name-parser-api test
- com.google.code.findbugs:jsr305
- com.google.guava:guava
- org.apache.commons:commons-lang3
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- commons-io:commons-io test
- junit:junit test
- org.gbif:gbif-api
- org.gbif:name-parser
- org.gbif:name-parser-api
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- junit:junit test
- com.google.code.findbugs:jsr305 3.0.2
- com.google.guava:guava 28.0-jre
- commons-io:commons-io 2.8.0
- org.apache.commons:commons-lang3 3.12.0
- org.gbif:gbif-api 0.166
- org.gbif:name-parser 3.7.3-SNAPSHOT
- org.gbif:name-parser-api 3.7.3-SNAPSHOT
- org.gbif:name-parser-gbif 3.7.3-SNAPSHOT
- org.slf4j:slf4j-api 1.7.24
- ch.qos.logback:logback-classic 1.2.3 test
- junit:junit 4.12 test
- org.gbif:name-parser-api 3.7.3-SNAPSHOT test
Score: 6.901737206656573