Gene nomenclature is the scientific naming of genes, the units of heredity in living organisms. It is closely associated with protein nomenclature, as genes and the proteins they code for usually have similar names. An international committee published recommendations for genetic symbols and nomenclature in 1957. The need to develop formal guidelines for human gene names and symbols was recognized in the 1960s, and full guidelines were issued in 1979 (Edinburgh Human Genome Meeting). Several other genus-specific research communities (e.g., Drosophila fruit flies, Mus mice) have also adopted nomenclature standards and have published them on the relevant model organism websites and in scientific journals, including the Trends in Genetics Genetic Nomenclature Guide. Scientists familiar with a particular gene family may work together to revise the nomenclature for the entire set of genes when new information becomes available. For many genes and their corresponding proteins, an assortment of alternate names is in use across the scientific literature and public biological databases, posing a challenge to effective organization and exchange of biological information. Standardization of nomenclature thus tries to achieve the benefits of vocabulary control and bibliographic control, although adherence is voluntary. The advent of the information age has brought gene ontology, which in some ways is a next step of gene nomenclature, because it aims to unify the representation of gene and gene product attributes across all species.
Gene nomenclature and protein nomenclature are not separate endeavors; they are aspects of the same whole. Any name or symbol used for a protein can potentially also be used for the gene that encodes it, and vice versa. But because scientific knowledge has been uncovered bit by bit over decades, proteins and their corresponding genes have not always been discovered simultaneously (and have not always been physiologically understood when discovered). This is the main reason why protein and gene names do not always match, or why scientists tend to favor one symbol or name for the protein and another for the gene. Another reason is that many of the mechanisms of life are the same or very similar across species, genera, orders, and phyla (through homology, analogy, or some of both), so that a given protein may be produced in many kinds of organisms; scientists therefore often use the same symbol and name for a given protein in one species (for example, mice) as in another species (for example, humans). Regarding the first duality (same symbol and name for gene or protein), the context usually makes the sense clear to scientific readers, and the nomenclatural systems also provide for some specificity by using italic type for a symbol when the gene is meant and plain (roman) type when the protein is meant. Regarding the second duality (a given protein is endogenous in many kinds of organisms), the nomenclatural systems also provide for at least human-versus-nonhuman specificity by using different capitalization, although scientists often ignore this distinction, given that it is often biologically irrelevant.
Also owing to the way scientific knowledge has unfolded, proteins and their corresponding genes often have several synonymous names and symbols. Some of the earlier ones may be deprecated in favor of newer ones, although such deprecation is voluntary. Some older names and symbols live on simply because they have been widely used in the scientific literature (including before the newer ones were coined) and are well established among users. For example, HER2 and ERBB2 are synonymous.
Lastly, the correlation between genes and proteins is not always one-to-one (in either direction); in some cases it is several-to-one or one-to-several, and the names and symbols may then be gene-specific or protein-specific to some degree, or overlapping in usage:
- Some proteins and protein complexes are built from the products of several genes (each gene contributing a polypeptide subunit), which means that the protein or complex will not have the same name or symbol as any one gene. For example, a particular protein called “example” (symbol “EXAMP”) may have 2 chains (subunits), which are encoded by 2 genes named “example alpha chain” and “example beta chain” (symbols EXAMPA and EXAMPB).
- Some genes encode multiple proteins, because post-translational modification (PTM) and alternative splicing provide several paths for expression. For example, glucagon and similar polypeptides (such as GLP1 and GLP2) all come (via PTM) from proglucagon, which comes from preproglucagon, which is the polypeptide that the GCG gene encodes. When one speaks of the various polypeptide products, the names and symbols refer to different things (i.e., preproglucagon, proglucagon, glucagon, GLP1, GLP2), but when one speaks of the gene, all of those names and symbols are aliases for the same gene. Another example is that the various μ-opioid receptor proteins (e.g., μ1, μ2, μ3) are all splice variants encoded by one gene, OPRM1. This is how one can speak of MORs (μ-opioid receptors) in the plural (proteins) even though there is only one MOR gene, which may be called OPRM1, MOR1, or MOR; all of those aliases validly refer to it, although one of them (OPRM1) is preferred nomenclature.
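
The alias and one-to-several relationships described above can be illustrated with a small lookup table. The following Python sketch is illustrative only: the table contents are limited to the examples named in this section, the function names (build_alias_index, preferred_symbol, products_of) are hypothetical, and the case-insensitive matching is an assumption that deliberately ignores the capitalization conventions used to separate human from nonhuman symbols. It is not the method of any nomenclature committee or database.

```python
# Preferred gene symbol -> aliases seen in the literature.
# Only the examples named in the text are included.
ALIASES = {
    "ERBB2": ["HER2"],
    "OPRM1": ["MOR1", "MOR"],
    # When speaking of the gene, the product names act as aliases for GCG.
    "GCG": ["preproglucagon", "proglucagon", "glucagon", "GLP1", "GLP2"],
}

# Gene -> polypeptide products (one gene, several products).
PRODUCTS = {
    "GCG": ["preproglucagon", "proglucagon", "glucagon", "GLP1", "GLP2"],
    "OPRM1": ["μ1", "μ2", "μ3"],  # splice variants of the μ-opioid receptor
}


def build_alias_index(aliases):
    """Map every known name (case-folded) to its preferred gene symbol."""
    index = {}
    for preferred, names in aliases.items():
        for name in [preferred, *names]:
            index[name.casefold()] = preferred
    return index


ALIAS_INDEX = build_alias_index(ALIASES)


def preferred_symbol(name):
    """Return the preferred gene symbol for a name, or None if unknown."""
    return ALIAS_INDEX.get(name.strip().casefold())


def products_of(name):
    """Return the polypeptide products of the gene that a name refers to."""
    return PRODUCTS.get(preferred_symbol(name), [])


if __name__ == "__main__":
    print(preferred_symbol("HER2"))
    print(preferred_symbol("MOR"))
    print(products_of("glucagon"))
```

Keeping the alias table separate from the product table mirrors the distinction drawn above: a name such as "glucagon" identifies one specific polypeptide when proteins are meant, but resolves to the single gene GCG when the gene is meant.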