GBIF Name Parser
The core GBIF scientific name parser library.
https://github.com/gbif/name-parser
Category: Biosphere
Sub Category: Biodiversity Data Cleaning and Standardization
Keywords from Contributors
biodiversity-informatics darwin-core taxonomy gbif tdwg biodiversity species snapshot interest-group
Last synced: 1 day ago
Repository metadata
- Host: GitHub
- URL: https://github.com/gbif/name-parser
- Owner: gbif
- License: apache-2.0
- Created: 2014-01-24T10:44:23.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2026-06-15T09:50:30.000Z (8 days ago)
- Last Synced: 2026-06-17T18:04:41.007Z (6 days ago)
- Language: Java
- Size: 3.06 MB
- Stars: 19
- Watchers: 18
- Forks: 4
- Open Issues: 52
- Releases: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README.md
GBIF Name Parser
A library and command-line tool that parses scientific names, including the
authorship, rank, hybrid markers and nomenclatural notes, into a structured
ParsedName model.
Modules
| Module | Purpose |
|---|---|
| name-parser-api | Pure model + interface module: ParsedName, Authorship, Rank, NomCode, NameType, the NameParser interface, plus formatter / Unicode utilities. Depend on this if you only need the data model. |
| name-parser | The parser implementation. Single public entry point: org.gbif.nameparser.NameParserImpl. |
| name-parser-cli | Command-line tools (parse, compare, benchmark) wrapping the parser, packaged as an executable shaded jar. |
Build everything with mvn install from the repo root.
Library use
<dependency>
<groupId>org.gbif</groupId>
<artifactId>name-parser</artifactId>
<version>4.0.0-SNAPSHOT</version>
</dependency>
NameParser parser = new NameParserImpl();
ParsedName pn = parser.parse("Vulpes vulpes silaceus Miller, 1907", null, null, null);
Command-line interface
After mvn install, the executable jar is at
name-parser-cli/target/name-parser-cli-<version>-shaded.jar.
java -jar name-parser-cli-<version>-shaded.jar <command> [options]
| Command | What it does |
|---|---|
| parse | Stream a text file with one name per row through the parser and write a JSONL file (one JSON object per row). |
| compare | Stream two JSONL files in lockstep, report aggregate metrics and a per-row dump of every differing parsed value. |
| benchmark | Measure parser throughput against a name-per-line input file (count, total / avg / min / p50 / p95 / max). |
Run <command> --help for the full per-command option list.
All commands stream their input — memory use stays flat regardless of input size,
so multi-million-row inputs are fine.
Bundled sample corpora
Sample inputs ship in name-parser-cli/data/:
- benchmark-data.txt: ~8k mixed names (hand-picked + test-assertion inputs +
  random Catalogue of Life rows with authorship) used for throughput benchmarking.
  Top up with more random names anytime via:
  python3 name-parser-cli/scripts/append-colnames-sample.py [-n 2000] [--seed 17]
  The script reservoir-samples col-names.tsv in a single pass and appends rows
  as scientificName authorship; manual edits to the benchmark file are
  preserved.
- col-names.tsv: the full Catalogue of Life names dump (~6.3M rows, ~340 MB,
  not tracked in git; drop your own copy here).
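The single-pass reservoir sampling mentioned for the top-up script can be illustrated with a minimal Java sketch. This is not the bundled Python script: the class name ReservoirSample and the sample method are hypothetical, and the code only shows the classic Algorithm R, which keeps k uniformly chosen rows without knowing the total row count in advance.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of single-pass reservoir sampling (Algorithm R).
final class ReservoirSample {

  /** Returns up to k rows chosen uniformly at random, reading the rows exactly once. */
  static List<String> sample(Iterator<String> rows, int k, long seed) {
    List<String> reservoir = new ArrayList<>(k);
    Random rnd = new Random(seed);
    long seen = 0;
    while (rows.hasNext()) {
      String row = rows.next();
      seen++;
      if (reservoir.size() < k) {
        reservoir.add(row); // fill the reservoir with the first k rows
      } else {
        // keep the new row with probability k / seen by drawing a slot in [0, seen)
        long slot = (long) (rnd.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, row);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<String> rows = List.of(
        "Abies alba", "Felis catus", "Vulpes vulpes", "Quercus robur", "Puma concolor");
    System.out.println(sample(rows.iterator(), 2, 17L));
  }
}
```

Memory use depends only on k, not on the number of input rows, which is why a multi-million-row dump can be sampled this way.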
Each command's --input defaults assume you run it from the repo root.
parse
Usage: name-parser-cli parse [options]
Options:
--input=PATH source file (default: data/col-names.tsv; '-' = stdin)
--output=PATH target file (default: <input>.<format-ext>; '-' = stdout)
--format=FMT output format: jsonl (default), json, csv, tsv
csv / tsv produce a flat ColDP Name file with header
--quiet suppress progress output
-h --help print this message and exit
Use - as the input or output path to stream from stdin / to stdout — the
command is fully unix-pipe friendly. Progress messages and the final summary
are written to stderr so stdout stays a clean data stream:
cat names.txt | name-parser-cli parse --input=- --output=- --format=tsv | head
xz -dc col-names.tsv.xz | name-parser-cli parse --input=- --output=- --format=jsonl > col.jsonl
Input
The input format is auto-detected from the first non-blank, non-comment line:
- ColDP Name file (TSV or CSV): recognised when the header row contains
  any ColdpTerm property names (looked up via ColdpTerm.find). Only the columns
  the parser interface accepts are honoured: ID, scientificName, authorship,
  rank, code. Other columns are read but ignored.
- Plain text: one name per line. If a line contains a tab, only the
  substring before the first tab is treated as the name (so col-names.tsv is
  usable both as ColDP-style TSV and as bare plain text).
Lines starting with # and blank lines are skipped.
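The plain-text rules above (skip blank lines and # comments, keep only the text before the first tab) can be sketched in a few lines of Java. The class PlainTextInput and its extractName method are hypothetical names for illustration, not the CLI's actual code; whether the real tool also trims surrounding whitespace is not stated, so this sketch does not.

```java
import java.util.Optional;

// Hypothetical illustration of the plain-text input rules described above.
final class PlainTextInput {

  /** Returns the name carried by an input line, or empty if the line is skipped. */
  static Optional<String> extractName(String line) {
    // blank lines and lines starting with '#' are skipped
    if (line == null || line.isBlank() || line.startsWith("#")) {
      return Optional.empty();
    }
    // only the substring before the first tab counts as the name
    int tab = line.indexOf('\t');
    return Optional.of(tab >= 0 ? line.substring(0, tab) : line);
  }

  public static void main(String[] args) {
    System.out.println(extractName("Felis catus\tLinnaeus, 1758"));
    System.out.println(extractName("# a comment"));
  }
}
```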
Output formats
| Format | Description |
|---|---|
| jsonl (default) | One self-contained JSON object per line; consumed by compare. |
| json | Single document containing a JSON array of all rows (streamed; not held in memory). |
| csv / tsv | Flat ColDP Name file with header row. |
JSON / JSONL rows look like:
{"line":42,"id":"42","input":"Felis catus","parsed":{ ...full ParsedName... }}
{"line":99,"id":"99","input":"Iridoviridae","error":{"type":"VIRUS","message":"..."}}
The id field is populated from the ColDP ID column when present; otherwise
it is omitted.
ColDP CSV/TSV column mapping
Every structural ParsedName field maps to a ColDP column. Where the ColDP
Name entity lacks a column but the NameUsage entity defines one, that
NameUsage term is used (nameStatus, namePhrase, namePublishedInPage,
provisional, extinct). Parser-only fields without a ColDP equivalent are
written into custom columns prefixed with np: — strict ColDP readers ignore
unknown columns, so the file stays valid ColDP.
Multi-value rules: author lists join with | (the ColDP convention); notho
parts join with ,.
| ParsedName field | ColDP column |
|---|---|
| id (from input) | ID (falls back to verbatim scientificName when absent) |
| canonicalNameWithoutAuthorship() (Candidatus prefixed when applicable) | scientificName |
| authorshipComplete() | authorship |
| rank, code | rank, code (lower-cased) |
| nomenclaturalNote (or manuscript flag) | nameStatus |
| uninomial, genus, infragenericEpithet, specificEpithet, infraspecificEpithet, cultivarEpithet | same column names |
| notho (every flagged part, comma-joined) | notho |
| originalSpelling | originalSpelling |
| combinationAuthorship.{authors,exAuthors,year} | combinationAuthorship, combinationExAuthorship, combinationAuthorshipYear (authors joined with \|) |
| basionymAuthorship.{authors,exAuthors,year} | basionymAuthorship, basionymExAuthorship, basionymAuthorshipYear (authors joined with \|) |
| publishedIn (free text) | namePublishedInPage |
| extinct | extinct |
| phrase | namePhrase |
| doubtful | provisional |
| type (when not SCIENTIFIC) | np:type |
| sanctioningAuthor | np:sanctioningAuthor |
| taxonomicNote (sensu) | np:taxonomicNote |
| unparsed | np:unparsed |
| warnings (joined with \|) | np:warnings |
| (parser failure message) | np:error |
Unparsable rows are still written: ID, scientificName (the verbatim input)
and the np:type / np:error columns are populated.
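The multi-value rules from the column mapping (author lists joined with |, notho parts joined with ,) amount to simple string joins. The sketch below is only an illustration: MultiValueJoin and its methods are hypothetical, and the sample values are placeholders rather than output captured from the CLI.

```java
import java.util.List;

// Hypothetical illustration of the two multi-value join rules.
final class MultiValueJoin {

  /** Joins author names with the pipe separator used for author lists. */
  static String joinAuthors(List<String> authors) {
    return String.join("|", authors);
  }

  /** Joins the flagged notho parts with a comma. */
  static String joinNothoParts(List<String> parts) {
    return String.join(",", parts);
  }

  public static void main(String[] args) {
    // placeholder values, not taken from real parser output
    System.out.println(joinAuthors(List.of("Miller", "Smith")));
    System.out.println(joinNothoParts(List.of("generic", "specific")));
  }
}
```

A reader of the CSV/TSV output would split the same cells on the same separators to recover the individual values.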
compare
Usage: name-parser-cli compare [options] <a.jsonl> <b.jsonl> [diffs.txt]
Options:
--a=PATH first JSONL file (alt. to first positional arg)
--b=PATH second JSONL file (alt. to second positional arg)
--output=PATH write per-row diffs here (default: stdout)
--ignore-whitespace strip whitespace from string leaves before compare
--max-diffs=N cap per-row diff dump at N rows (default: 100)
-h --help print this message and exit
Both inputs are expected to come from the same source file (matching line
numbers, same row order). The summary reports rows compared / identical /
differing, status transitions (PARSED→ERROR, ERROR→PARSED, …) and the top
differing field paths. Whitespace inside parsed string values is significant by
default — pass --ignore-whitespace to suppress whitespace-only differences in
parsed values (the JSON formatting itself is ignored either way).
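The lockstep comparison and the effect of --ignore-whitespace can be sketched as follows. This is a deliberately simplified illustration, not the CLI's implementation: it assumes each row has already been reduced to a single parsed string value, whereas the real command compares full parsed JSON objects by field path. LockstepCompare and Summary are hypothetical names.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified illustration of a lockstep row comparison.
final class LockstepCompare {

  /** Aggregate counts for one comparison run. */
  static final class Summary {
    long compared;
    long identical;
    long differing;
  }

  /** Walks both inputs row by row and counts identical and differing pairs. */
  static Summary compare(Iterator<String> a, Iterator<String> b, boolean ignoreWhitespace) {
    Summary s = new Summary();
    while (a.hasNext() && b.hasNext()) {
      String x = a.next();
      String y = b.next();
      if (ignoreWhitespace) {
        // strip all whitespace so whitespace-only differences disappear
        x = x.replaceAll("\\s+", "");
        y = y.replaceAll("\\s+", "");
      }
      s.compared++;
      if (x.equals(y)) {
        s.identical++;
      } else {
        s.differing++;
      }
    }
    return s;
  }

  public static void main(String[] args) {
    List<String> first = List.of("Felis catus", "Vulpes  vulpes");
    List<String> second = List.of("Felis catus", "Vulpes vulpes");
    Summary strict = compare(first.iterator(), second.iterator(), false);
    Summary relaxed = compare(first.iterator(), second.iterator(), true);
    System.out.println(strict.differing + " differing, " + relaxed.differing + " with whitespace ignored");
  }
}
```

Because only one row from each side is held at a time, memory use stays flat, which matches the streaming behaviour described for all commands.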
benchmark
Usage: name-parser-cli benchmark [options]
Options:
--input=PATH source file (default: data/benchmark-data.txt)
--warmup do an extra untimed pass over the input first to warm the JIT
-h --help print this message and exit
Pure throughput measurement — every input row is parsed and timed. JIT warmup
is opt-in via --warmup, in which case the input is streamed through the
parser once without timing before the timed pass; on subsequent runs the
HotSpot-warmed numbers tend to be ~10× lower. Nothing is written to disk; the
report goes to stdout.
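The reported figures (count, total / avg / min / p50 / p95 / max) can be derived from per-row timings roughly as in the sketch below. It is an assumption-laden illustration, not the benchmark command's code: LatencyStats is a hypothetical class, percentiles use the nearest-rank method (the actual tool may use another definition), the input is assumed to be non-empty, and a String.split call stands in for the real parser.

```java
import java.util.Arrays;

// Hypothetical illustration of summarising per-row timings.
final class LatencyStats {

  /** Nearest-rank percentile of an ascending-sorted, non-empty array; p in (0, 100]. */
  static long percentile(long[] sorted, double p) {
    int rank = (int) Math.ceil(p / 100.0 * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Formats count, total, avg, min, p50, p95 and max for the given timings. */
  static String report(long[] nanos) {
    long[] sorted = nanos.clone();
    Arrays.sort(sorted);
    long total = Arrays.stream(sorted).sum();
    return String.format("count=%d total=%d avg=%d min=%d p50=%d p95=%d max=%d",
        sorted.length, total, total / sorted.length, sorted[0],
        percentile(sorted, 50), percentile(sorted, 95), sorted[sorted.length - 1]);
  }

  public static void main(String[] args) {
    String[] names = {"Felis catus", "Vulpes vulpes", "Abies alba Mill."};
    long[] nanos = new long[names.length];
    for (int i = 0; i < names.length; i++) {
      long start = System.nanoTime();
      names[i].split(" "); // stand-in for the real parser call
      nanos[i] = System.nanoTime() - start;
    }
    System.out.println(report(nanos));
  }
}
```

An untimed warmup pass, as enabled by --warmup, would simply run the same loop once beforehand and discard its timings.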
License
Apache 2.0.
Owner metadata
- Name: Global Biodiversity Information Facility
- Login: gbif
- Email:
- Kind: organization
- Description:
- Website: https://www.gbif.org
- Location: Copenhagen, Denmark
- Twitter:
- Company:
- Icon url: https://avatars.githubusercontent.com/u/1963797?v=4
- Repositories: 288
- Last synced at: 2024-04-14T06:45:04.085Z
- Profile URL: https://github.com/gbif
GitHub Events
Total
- Delete event: 3
- Pull request event: 1
- Fork event: 1
- Issues event: 2
- Watch event: 1
- Issue comment event: 4
- Push event: 61
- Create event: 6
Last Year
- Delete event: 1
- Pull request event: 1
- Issues event: 1
- Watch event: 1
- Issue comment event: 2
- Push event: 44
- Create event: 3
Committers metadata
Last synced: 4 days ago
Total Commits: 660
Total Committers: 14
Avg Commits per committer: 47.143
Development Distribution Score (DDS): 0.262
Commits in past year: 90
Committers in past year: 2
Avg Commits per committer in past year: 45.0
Development Distribution Score (DDS) in past year: 0.133
| Name | Email | Commits |
|---|---|---|
| Markus Döring | m****g@g****g | 487 |
| gbif-jenkins | d****v@g****g | 96 |
| gbif-jenkins | j****s@r****g | 30 |
| gbif-jenkins | j****s@j****g | 18 |
| pal155 | D****r@c****u | 8 |
| Federico Mendez | f****z@g****g | 7 |
| Oliver Meyn | o****r@m****m | 3 |
| Kyle Braak | k****k@g****g | 3 |
| dependabot[bot] | 4****] | 2 |
| Matthew Blissett | m****t@g****g | 2 |
| Thomas Stjernegaard Jeppesen | t****n@g****g | 1 |
| Nikolay Volik | n****k@g****g | 1 |
| Christian Gendreau | c****u | 1 |
| Jorrit Poelen | j****n@g****m | 1 |
Committer domains:
- gbif.org: 7
- mineallmeyn.com: 1
- csiro.au: 1
- jenkins-vh.gbif.org: 1
- rancor.gbif.org: 1
Issue and Pull Request metadata
Last synced: 8 days ago
Total issues: 4
Total pull requests: 3
Average time to close issues: 1 day
Average time to close pull requests: almost 2 years
Total issue authors: 3
Total pull request authors: 2
Average comments per issue: 1.25
Average comments per pull request: 0.33
Merged pull request: 0
Bot issues: 0
Bot pull requests: 3
Past year issues: 3
Past year pull requests: 1
Past year average time to close issues: 2 days
Past year average time to close pull requests: N/A
Past year issue authors: 3
Past year pull request authors: 1
Past year average comments per issue: 1.33
Past year average comments per pull request: 0.0
Past year merged pull request: 0
Past year bot issues: 0
Past year bot pull requests: 1
Top Issue Authors
- djtfmartin (2)
- CecSve (1)
- mdoering (1)
Top Pull Request Authors
- dependabot[bot] (2)
- renovate[bot] (1)
Top Issue Labels
- bug (1)
Top Pull Request Labels
- dependencies (2)
- java (1)
Dependencies
- com.google.guava:guava
- commons-io:commons-io
- org.apache.commons:commons-lang3
- org.gbif:name-parser-api
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- junit:junit test
- org.gbif:name-parser-api test
- com.google.code.findbugs:jsr305
- com.google.guava:guava
- org.apache.commons:commons-lang3
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- commons-io:commons-io test
- junit:junit test
- org.gbif:gbif-api
- org.gbif:name-parser
- org.gbif:name-parser-api
- org.slf4j:slf4j-api
- ch.qos.logback:logback-classic test
- junit:junit test
- com.google.code.findbugs:jsr305 3.0.2
- com.google.guava:guava 28.0-jre
- commons-io:commons-io 2.8.0
- org.apache.commons:commons-lang3 3.12.0
- org.gbif:gbif-api 0.166
- org.gbif:name-parser 3.7.3-SNAPSHOT
- org.gbif:name-parser-api 3.7.3-SNAPSHOT
- org.gbif:name-parser-gbif 3.7.3-SNAPSHOT
- org.slf4j:slf4j-api 1.7.24
- ch.qos.logback:logback-classic 1.2.3 test
- junit:junit 4.12 test
- org.gbif:name-parser-api 3.7.3-SNAPSHOT test
Score: 6.901737206656573