Phylogenetic research investigates how species and higher taxa are related through common ancestry and how life has diversified over time. It provides the methods and tools to reconstruct evolutionary trees (phylogenies) rather than just listing organisms in groups. In this chapter, the focus is on how such evolutionary relationships are inferred and tested.
Phylogenetic Trees: What They Represent
A phylogenetic tree is a graphical hypothesis about evolutionary relationships:
- Branches represent lineages through time.
- Nodes (branching points) represent common ancestors.
- Root (if present) indicates the most recent common ancestor (MRCA) of all taxa in the tree.
- Tips (leaves) represent the organisms or taxa being studied (species, genera, etc.).
Important distinctions:
- Rooted vs. unrooted trees
- Rooted trees have a direction of time (from root to tips).
- Unrooted trees show relationships but not the time direction or ancestor–descendant order.
- Clade (monophyletic group)
A clade includes a common ancestor and all of its descendants. Clades are the central units in modern phylogenetic systematics.
- Paraphyletic and polyphyletic groups
- Paraphyletic: includes a common ancestor but not all descendants (e.g., “reptiles” without birds).
- Polyphyletic: includes organisms from separate lineages without their most recent common ancestor (e.g., grouping “flying animals” as bats, birds, insects).
Phylogenetic research aims to identify true clades and avoid paraphyletic or polyphyletic groupings in classification.
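The clade definition can be checked mechanically. The following is a minimal sketch, assuming a rooted tree written as nested Python tuples; the tree, the taxon names, and the helper names `leaves` and `is_monophyletic` are hypothetical and chosen only for illustration.

```python
def leaves(tree):
    """Return the set of tip names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node has exactly the taxa in `group` as its descendants."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree: ((lizard, (crocodile, bird)), mammal)
tree = (("lizard", ("crocodile", "bird")), "mammal")

print(is_monophyletic(tree, {"crocodile", "bird"}))            # True
print(is_monophyletic(tree, {"lizard", "crocodile"}))          # False
print(is_monophyletic(tree, {"lizard", "crocodile", "bird"}))  # True
```

In this toy tree, "lizard + crocodile" fails the test because the bird lineage descends from the same ancestor but is left out. This is the same pattern as "reptiles" without birds.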
Data Used in Phylogenetic Research
Modern phylogenetic research is strongly data-driven. Two main data types are used:
Morphological and Anatomical Characters
These include observable traits of organisms:
- External form (body plan, appendages, wings, leaves).
- Internal structure (skeletal features, organ systems, flower structure).
- Developmental patterns (e.g., larval vs. adult traits).
In phylogenetic analysis, such traits must be translated into discrete characters with defined character states. For example:
- Character: presence of hair
- State 0: absent
- State 1: present
Only traits that are considered homologous (shared due to common ancestry) are suitable as phylogenetic characters. Distinguishing homology from analogy is therefore crucial.
Molecular Characters
Molecular phylogenetics relies on DNA, RNA, or protein sequences:
- DNA sequences: e.g., mitochondrial genes, chloroplast genes, nuclear genes.
- Protein sequences: amino acid sequences of conserved proteins.
- Genomic data: entire genomes or large genomic regions.
Sequences are represented as series of characters (e.g., A, C, G, T for DNA) for each species. Positions (sites) in sequences serve as individual characters. For example:
Species A: A T G C A T
Species B: A C G C G T
Species C: G C G A G T
Each column represents a character; each letter is a character state.
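A short sketch of how the three example sequences can be read column by column. The variable names are illustrative assumptions; only the sequences come from the example above.

```python
alignment = {
    "Species A": "ATGCAT",
    "Species B": "ACGCGT",
    "Species C": "GCGAGT",
}

# zip(*...) transposes rows (sequences) into columns (sites).
columns = ["".join(col) for col in zip(*alignment.values())]

# A site is variable if more than one character state occurs in its column.
variable_sites = [i + 1 for i, col in enumerate(columns) if len(set(col)) > 1]

print(columns)         # ['AAG', 'TCC', 'GGG', 'CCA', 'AGG', 'TTT']
print(variable_sites)  # [1, 2, 4, 5]
```

Sites 3 and 6 are identical in all three species and therefore carry no information about how these species are related.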
Molecular data enable phylogenetic inference even for organisms with few morphological differences (e.g., bacteria, cryptic species) and allow comparisons over great evolutionary distances using conserved genes.
Choosing Appropriate Markers
Phylogenetic studies select markers that evolve at appropriate rates:
- Slowly evolving genes for deep divergences (e.g., between major animal groups).
- Rapidly evolving genes or genomic regions for recent divergences (e.g., among closely related species or populations).
Frequently used markers include:
- Ribosomal RNA genes (e.g., 16S rRNA in prokaryotes, 18S rRNA in eukaryotes).
- Mitochondrial genes (e.g., COI in animals).
- Chloroplast genes (in plants).
- Nuclear genes or noncoding regions (introns, spacers) for fine-scale relationships.
Character Coding and Alignment
Before analysis, data must be prepared in a standardized form.
Coding Morphological Characters
Key steps:
- Define characters clearly (e.g., “type of leaf attachment”).
- Establish discrete states (e.g., opposite, alternate, whorled).
- Avoid mixing multiple traits in one character (e.g., texture and color in a single state definition).
Characters can be:
- Binary (two states, e.g., present/absent).
- Multistate (three or more states, e.g., types of teeth).
Researchers must decide how to handle continuous measurements (e.g., body size): discretize into categories, use ratios, or exclude if not suitable.
Sequence Alignment
For molecular data, sequences must be aligned so that homologous positions are compared:
- Insert gaps (-) to account for insertions and deletions (indels).
- Ensure that nucleotides or amino acids in the same column are evolutionarily comparable.
Example:
Seq1: A T G C A T G A
Seq2: A T - C A T G A
Seq3: A T G C A T - A
Here, a dash represents an inferred insertion or deletion. Incorrect alignments lead to incorrect phylogenetic signals, so careful alignment (often with computer programs plus manual checking) is essential.
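As a small illustration of the manual checking mentioned above, the sketch below verifies that all rows of the example alignment have the same length and reports which columns contain gaps. The function names `equal_length` and `gap_columns` are hypothetical helpers, not part of any alignment program.

```python
alignment = {
    "Seq1": "ATGCATGA",
    "Seq2": "AT-CATGA",
    "Seq3": "ATGCAT-A",
}


def equal_length(aln):
    """An alignment is only usable if every row has the same number of columns."""
    return len({len(seq) for seq in aln.values()}) == 1


def gap_columns(aln):
    """Return 1-based positions of columns that contain at least one gap."""
    return [i + 1 for i, col in enumerate(zip(*aln.values())) if "-" in col]


print(equal_length(alignment))  # True
print(gap_columns(alignment))   # [3, 7]
```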
Methods for Reconstructing Phylogenies
Once characters are coded and aligned, various analytical methods can be used to infer phylogenetic trees.
Parsimony Methods
Maximum parsimony searches for the tree that requires the fewest evolutionary changes (steps) to explain the observed distribution of character states.
Key ideas:
- Each change (e.g., 0 → 1) is counted as a step.
- The “best” tree is that with minimal total steps.
- Parsimony prefers the explanation that requires the fewest changes. Real evolution can involve more changes than this minimum, so the most parsimonious tree is not guaranteed to be the true one.
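Step counting can be sketched with Fitch's algorithm, one common way to compute the minimum number of changes on a given binary tree. The character matrix, taxon names, and candidate trees below are hypothetical; a real analysis would also search over many more topologies.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one character on a binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees agree on at least one state: no extra step needed.
        return common, left_steps + right_steps
    # No shared state: one change must have occurred at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of steps over all characters (the parsimony score)."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        states = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch(tree, states)[1]
    return total


# Hypothetical binary character matrix (four characters, four taxa).
matrix = {"A": "1101", "B": "1100", "C": "0011", "D": "0010"}

candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

for name, tree in candidates.items():
    print(name, tree_length(tree, matrix))
# ((A,B),(C,D)) 5
# ((A,C),(B,D)) 7
# ((A,D),(B,C)) 8
```

The first tree is the most parsimonious. The fourth character still needs two steps on it, which is an instance of homoplasy.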
Pros and limitations:
- Easy to understand conceptually.
- Works for morphological and molecular data.
- Can be misled by homoplasy (independent evolution of similar traits) and by long-branch attraction (rapidly evolving lineages incorrectly grouped together).
Distance-Based Methods
These methods are based on pairwise measures of dissimilarity (distance) between taxa:
- Compute a distance matrix from raw data (e.g., percentage of sequence difference).
- Use clustering algorithms (e.g., UPGMA, Neighbor-Joining) to construct a tree that best represents these distances.
Features:
- Often computationally fast and can handle large datasets.
- Convert many characters into a single distance measure, which may lose information about individual character history.
These methods are commonly used as initial exploratory tools or when very large datasets need quick approximations.
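The two steps (distance matrix, then clustering) can be sketched as follows. This is a simplified UPGMA that recomputes average distances between all tip pairs of two clusters, and it ignores branch lengths. The sequences are hypothetical, and the uncorrected proportion of differing sites (p-distance) is used as the distance measure.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Repeatedly merge the two closest clusters; returns a nested tuple."""
    # Each cluster label maps to the list of tip names it contains.
    clusters = {name: [name] for name in seqs}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(p_distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merged = (c1, c2)
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


# Hypothetical aligned sequences.
seqs = {
    "A": "ATGCATGA",
    "B": "ATGCATGC",
    "C": "ATGAACGC",
    "D": "TTCAACGG",
}

print(p_distance(seqs["A"], seqs["B"]))  # 0.125
print(upgma(seqs))                       # ('D', ('C', ('A', 'B')))
```

UPGMA implicitly assumes that all lineages evolve at the same rate. Neighbor-Joining relaxes that assumption but is not shown here.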
Likelihood-Based Methods
Maximum likelihood methods evaluate, for each candidate tree, the probability of observing the data given:
- A specific tree topology.
- Branch lengths (amount of change).
- A model of sequence evolution (e.g., probabilities of different substitutions).
The best tree is the one that maximizes the likelihood:
$$
\text{choose tree } T \text{ such that } L(T \mid \text{data}) \text{ is maximal}
$$
These approaches:
- Are statistically well-founded.
- Allow different rates of evolution at different sites and along branches.
- Are computationally intensive but widely used with modern computing power.
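To make "maximizing the likelihood" concrete, the sketch below treats the simplest possible case: two sequences, one branch of length d, and the Jukes-Cantor (JC69) substitution model, which is one of several standard models and is chosen here as an assumption. The site counts are hypothetical. A full analysis would repeat this kind of optimization for every branch of every candidate tree.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site also contributes the equilibrium base frequency 1/4.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


n_sites, n_diff = 100, 10  # hypothetical counts

# Crude grid search over candidate branch lengths (substitutions per site).
grid = [i / 1000 for i in range(1, 1001)]
d_hat = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Known closed-form maximum-likelihood estimate for this two-sequence case.
d_closed = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)

print(d_hat, round(d_closed, 4))
```

The grid search and the closed-form estimate agree to within the grid spacing. Both are slightly larger than the raw proportion of differences (0.10) because the model accounts for multiple changes at the same site.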
Bayesian Methods
Bayesian phylogenetics uses Bayes’ theorem to calculate the posterior probability of trees:
$$
P(T \mid \text{data}) \propto P(\text{data} \mid T) \cdot P(T)
$$
Where:
- $P(T)$ is the prior probability of the tree (before seeing the data).
- $P(\text{data} \mid T)$ is the likelihood (as in maximum likelihood).
Using Markov Chain Monte Carlo (MCMC) algorithms, Bayesian methods:
- Sample many possible trees in proportion to their posterior probability.
- Provide posterior probabilities for clades (a measure of support).
- Can integrate over uncertainty in model parameters.
Bayesian methods are very powerful but require careful choice of priors and interpretation.
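The proportionality above can be illustrated for a case small enough to enumerate. The sketch assumes three candidate trees with made-up log-likelihoods and a uniform prior, then normalizes likelihood times prior. Real analyses cannot enumerate all trees, which is why MCMC sampling is used instead; the function name `posterior` and all numbers are hypothetical.

```python
import math


def posterior(log_likelihoods, priors):
    """Normalize likelihood * prior over a small, fully enumerated set of trees."""
    log_joint = {t: log_likelihoods[t] + math.log(priors[t]) for t in log_likelihoods}
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {t: math.exp(v - m) / total for t, v in log_joint.items()}


# Hypothetical log-likelihoods for three candidate topologies.
log_likelihoods = {"T1": -1000.0, "T2": -1002.0, "T3": -1005.0}
priors = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}

post = posterior(log_likelihoods, priors)
for tree, p in post.items():
    print(tree, round(p, 3))
# T1 0.876
# T2 0.119
# T3 0.006
```

A difference of only two log-likelihood units already shifts most of the posterior probability to T1. Changing `priors` shows how strongly the result can depend on prior choice when the data are weak.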
Outgroups and Tree Rooting
To determine the direction of evolution in a tree, researchers use an outgroup:
- An outgroup is a taxon or group known (from independent evidence) to be more distantly related to every member of the focal group (ingroup) than the ingroup members are to one another.
- The outgroup allows the tree to be rooted, i.e., to determine which node is the earliest divergence.
Example:
- In a study of mammals, a reptile species might serve as outgroup.
- Traits shared by the outgroup and some ingroup members are likely ancestral states.
Choosing an inappropriate outgroup can distort the inferred relationships, so selection is guided by prior knowledge from fossils, morphology, or other molecular analyses.
Homology, Analogy, and Homoplasy in Phylogenetic Research
Phylogenetic inference depends heavily on correctly interpreting similarities:
- Homology: similarity due to shared ancestry (useful for phylogeny).
- Analogy (convergence): similarity due to similar function or environment, not common ancestry.
- Homoplasy: any similarity not due to common ancestry, including convergence and evolutionary reversals.
Homoplasy complicates tree reconstruction because:
- It can make unrelated taxa appear closely related.
- It is more likely in traits under strong, similar selection (e.g., body shape in fast swimmers).
To reduce homoplasy effects:
- Emphasize multiple independent characters.
- Use traits less directly influenced by the same selective pressures.
- Apply models of evolution that can accommodate varying rates of change.
Testing and Evaluating Phylogenetic Trees
Phylogenetic trees are hypotheses and must be evaluated for robustness.
Support Values for Branches
Common measures:
- Bootstrap support (for parsimony, distance, or likelihood methods):
- Resample the dataset (with replacement) many times.
- Reconstruct a tree for each resampled dataset.
- Calculate the percentage of replicates in which a particular clade appears (bootstrap value in %).
- Posterior probabilities (for Bayesian methods):
- Estimate the probability that a clade is correct, conditional on the data, the model, and the priors.
High support values suggest that the clade is strongly supported by the data, but:
- Even high values do not guarantee correctness.
- Low values indicate uncertainty or conflicting signals.
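The bootstrap procedure can be sketched as follows. To keep the example short, tree reconstruction is replaced by a deliberately crude stand-in: two taxa count as "grouped" in a replicate if their p-distance is strictly smaller than that of every other pair. This stand-in, the function names, and the sequences are assumptions for illustration; a real analysis would rebuild a full tree for each replicate.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def groups_together(seqs, pair):
    """Stand-in for tree building: is `pair` strictly the closest pair of taxa?"""
    dist = {p: p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(sorted(seqs), 2)}
    return all(dist[pair] < d for p, d in dist.items() if p != pair)


def bootstrap_support(seqs, pair, replicates=1000, seed=0):
    """Percentage of column-resampled datasets in which `pair` is grouped."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    hits = 0
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}
        if groups_together(resampled, pair):
            hits += 1
    return 100.0 * hits / replicates


# Hypothetical aligned sequences.
seqs = {
    "A": "ATGCATGA",
    "B": "ATGCATGC",
    "C": "ATGAACGC",
    "D": "TTCAACGG",
}

support = bootstrap_support(seqs, ("A", "B"))
print(f"Bootstrap support for (A,B): {support:.1f}%")
```

With only eight sites, many replicates lose the few informative columns, so the support value for a short alignment like this one is expected to be well below 100%.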
Comparing Alternative Trees
Phylogenetic research may propose different tree topologies. To compare them:
- Statistical tests (e.g., likelihood-ratio tests, topology tests) can evaluate whether differences in fit to the data are significant.
- Different datasets (e.g., genes, morphology) can be analyzed separately and jointly to see if they support the same relationships.
Conflicts between datasets may indicate:
- Incomplete lineage sorting.
- Horizontal gene transfer.
- Hybridization.
- Measurement or alignment errors.
Molecular Clocks and Dating Divergences
Phylogenetic research can estimate the timing of divergences using molecular clocks:
- Basic idea: genetic differences accumulate roughly linearly with time at an average rate.
- Under a strict molecular clock:
$$
d = r \cdot t
$$
Where:
- $d$ = genetic distance (e.g., number of substitutions per site),
- $r$ = substitution rate per unit time,
- $t$ = time since divergence.
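Rearranging the strict-clock relation gives $t = d / r$. The sketch below applies it to hypothetical values. The `lineages` parameter is an added assumption: with the default of 1, $d$ is read as the change accumulated along a single lineage, as in the formula above; if $d$ is instead a pairwise distance between two living taxa, both lineages have accumulated changes since the split, and `lineages=2` is appropriate.

```python
def divergence_time(distance, rate, lineages=1):
    """Solve d = lineages * r * t for t under a strict molecular clock."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (lineages * rate)


d = 0.02   # hypothetical distance, substitutions per site
r = 1e-9   # hypothetical rate, substitutions per site per year

t_single = divergence_time(d, r)
t_pairwise = divergence_time(d, r, lineages=2)

print(f"{t_single / 1e6:.1f} million years")    # 20.0 million years
print(f"{t_pairwise / 1e6:.1f} million years")  # 10.0 million years
```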
In practice:
- Rates vary among lineages and genes, so relaxed clock models are used, allowing rate differences.
- Fossils or known geological events are used as calibration points:
- A fossil assigned to a clade provides a minimum age for that clade.
- These calibrations link molecular branch lengths to absolute time (millions of years).
Dating analyses combine:
- Phylogenetic tree structure.
- Sequence data.
- Fossil and geological information.
This allows estimates of when major groups originated and diversified.
Integrating Different Data Types
Phylogenetic research often combines information:
- Morphological + molecular data in a single analysis (total evidence approach).
- Extant (living) taxa + fossil taxa, where:
- Fossils contribute morphological characters and age constraints.
- Extant taxa provide extensive molecular data.
Benefits:
- More complete picture of evolutionary history.
- Better placement of fossils within the tree of life.
- More accurate timing and pattern of diversification.
Challenges:
- Fossil data are often incomplete or distorted by preservation.
- Morphological and molecular data may give conflicting signals that must be interpreted.
Applications of Phylogenetic Research
Phylogenetic methods are used widely across biology:
- Systematics and classification
- Designing classifications that reflect true evolutionary history (cladistic classification).
- Biogeography
- Reconstructing the historical spread and diversification of lineages across regions.
- Comparative biology
- Studying trait evolution (e.g., evolution of flight, parasitism, photosynthesis pathways).
- Epidemiology and public health
- Tracing the spread and origin of pathogens (e.g., viral outbreaks).
- Conservation biology
- Identifying evolutionarily distinct lineages and cryptic species for conservation priorities.
In all these areas, phylogenetic trees are essential tools for asking and answering evolutionary questions rather than ends in themselves.
Limitations and Future Directions
Phylogenetic research faces limitations:
- Incomplete sampling of taxa (extinct or unsampled species).
- Gene tree vs. species tree discordance:
- A single gene’s history can differ from the history of the species (due to incomplete lineage sorting, hybridization, horizontal gene transfer).
- Model assumptions:
- Evolutionary models are simplifications and may not capture real complexities.
Current and future directions include:
- Phylogenomics: using thousands of genes or whole genomes for tree reconstruction.
- Species tree methods that account for gene tree discordance.
- Network approaches to represent reticulate evolution (hybridization, introgression, horizontal gene transfer).
- Improved integration of fossil and molecular data to refine the tree of life and its timeline.
Phylogenetic research thus continually refines our understanding of how all organisms are related, providing the backbone for interpreting biological diversity in an evolutionary framework.