Starless bias and parameter-estimation bias in the likelihood-based phylogenetic method

I analyzed various site pattern combinations in a 4-OTU case to identify sources of starless bias and parameter-estimation bias in likelihood-based phylogenetic methods, and reported three significant contributions. First, the likelihood method is counterintuitive in that it may not generate a star tree with sequences that are equidistant from each other. This behaviour, dubbed starless bias, happens in a 4-OTU tree when there is an excess (i.e., more than expected from a star tree and a substitution model) of conflicting phylogenetic signals supporting the three resolved topologies equally. Special site pattern combinations leading to rejection of a star tree, when sequences are equidistant from each other, were identified. Second, fitting gamma distribution to model rate heterogeneity over sites is strongly confounded with tree topology, especially in conjunction with the starless bias. I present examples to show dramatic differences in the estimated shape parameter α between a star tree and a resolved tree. There may be no rate heterogeneity over sites (with the estimated α > 10000) when a star tree is imposed, but α < 1 (suggesting strong rate heterogeneity over sites) when an (incorrect) resolved tree is imposed. Thus, the dependence of “rate heterogeneity” on tree topology implies that “rate heterogeneity” is not a sequence-specific feature, cautioning against interpreting a small α to mean that some sites are under strong purifying selection and others not. Thirdly, because there is no existing (and working) likelihood method for evaluating a star tree with continuous gamma-distributed rate, I have implemented the method for JC69 in a self-contained R script for a four-OTU tree (star or resolved), in addition to another R script assuming a constant rate over sites. These R scripts should be useful for teaching and exploring likelihood methods in phylogenetics.


Introduction
I explore two phylogenetic issues here. The first is the starless bias. If a set of aligned sequences are equidistant from each other, i.e., the number of various types of substitutions between any two sequences is exactly the same, then we intuitively would expect a star tree. Distance-based methods such as neighbor-joining [1] or FastME [2] will indeed give us a star tree whenever pairwise distances are all equal. The starless bias refers to the inability of a phylogenetic method to generate a star tree with equidistant sequences. It was first alluded to in a study of potential bias in maximum likelihood method involving missing data and rate heterogeneity over sites [3], but its occurrence is more general that that. I will illustrate this bias here with four sequences, identify the source of the bias, and discuss alternative approaches relevant to the problem.
The second issue, related to the first, is the confounding effect of tree topology on phylogenetic parameter estimation, in particular the shape parameter α of gamma distribution used to model rate heterogeneity over sites. It may seem obvious that α depend on topology, but the issue needs to be studied for two reasons. First, how such dependence occurs is not well dissected. Second, one frequently encounters interpretation of α as if it is a sequence-specific feature, with a small alpha interpreted as indicating some sites strongly constrained by purifying selection and other sites not. It is therefore relevant to caution against such interpretation with real examples. For simplicity, I will work on 4-OTU trees only so that all site patterns can be conveniently considered.
I provide two self-contained R script files New2.R and NewGamma3.R (s001.R and s002.R within the Supplementary genetics-05-04-212.zip) implementing the likelihood method with JC69 model [4] with four OTUs. New2.R (s001) assumes a constant rate over sites and estimates branch lengths for the star tree (with four branches) and a resolved tree (with five branches). NewGamma3.R (s002) does the same but with a continuous gamma-distributed rate over sites. They are plain text files that can be copied and pasted into an R window to generate results presented in the paper, and should also be useful for teaching and exploring likelihood methods in phylogenetics.

Site patterns classification and sequence representation
With four sequences, there are 256 possible site patterns. For sequences to evolve according to the JC69 substitution model [4], the 256 site patterns would become equally frequently after sequences experienced full substitution saturation. Different combinations of these 256 site patterns support different substitution models and different topologies. We will focus on the JC69 substitution model and classify these 256 possible site patterns into 15 classes ( Figure 1A) so that we only need to calculate likelihood for these 15 site pattern classes. The transition probabilities needed for likelihood calculation for the JC69 model, together with other frequently used Markovian nucleotide substitution models such as F84 (used in DNAML since 1984) [5,6], {XE "substitution model: HKY85"} [7], TN93 {XE "substitution model: TN93"} [8], and GTR {XE "substitution model: GTR"} [9,10] have been numerically illustrated and re-derived with three different approaches [11,12]. These illustrations are sufficiently detailed for one to extend the JC69 model in the two R script files to other models.
The 15 site pattern classes ( Figure 1A) can be further lumped into five groups (G 1 to G 5 , Figure 1A). Sites in G 1 have all four OTUs (S1 to S4) with different nucleotides, with 24 unique site patterns represented as a single G 1 site in Figure 1A (because JC69 sees these 24 site patterns as identical).
Sites in G 2 each have three different nucleotides, with a total of 144 unique site patterns. Only six sites are used to represent them in Figure 1A because JC69 sees these 144 unique sites in this group to be identical to one of the six representative sites. Sites in G 3 feature two nucleotides, with three OTUs having the same nucleotide, and a total of 48 unique sites represented by four sites in Figure 1A. Sites in G 4 also feature two nucleotides, with two OTUs sharing one nucleotide and the other two OTUs sharing the other (i.e., they are the traditional informative sites in Fitch parsimony). There are 36 unique G 4 sites represented by three sites in Figure 1A. G 5 sites are monomorphic, with four unique site patterns represented by site 15 in Figure 1A. Given a JC69 model, the five groups of sites (G 1 to G 5 in Figure 1A) support the three unrooted topologies equally (i.e., they do not preferentially support any of the three). This is obvious for G 1 and G 5 sites. The six sites in G 2 , shown in Figure 1A, jointly also support the three topologies equally, so do the four sites in G 3 and three sites in G 4 . Note that, with a star tree and JC69, we cannot have G 1 , G 2 and G 4 sites without having G 3 sites first. With low sequence divergence, almost all site patterns from a star tree should be G 3 sites.
Subscripts n 1 to n 5 in Figure 1A mean multiples of the enclosed sites, e.g., an n 2 of 10 means that the six G 2 sites in Figure 1A is repeated 10 times (for a total of 60 sites). A combination of (n 1 , n 2 , n 3 , n 4 , n 5 ), where n i corresponds to those in Figure 1A, means a set of aligned sequences containing n 1 G 1 sites, n 2 G 2 sites (i.e., n 2 *6 sites), n 3 G 3 sites (for a total of n 3 *4 sites), n 4 G4 sites (for a total of n 4 *3 sites), and n 5 G5 sites ( Figure 1A). A set of four sequences of length 256 containing all 256 possible site patterns is specified as (24, 24, 12, 12, 4). Such a set of sequences will naturally have equal nucleotide frequencies and twice as many transversions as transitions, i.e., the equilibrium ratio of substitution saturation. A set of sequences with any combinations of (n 1 , n 2 , n 3 , n 4 , n 5 ) are equidistant from each other from a JC69 perspective, and we would desire to have a star tree ( Figure 1B) instead of one of the three resolved trees ( Figure 1C). Hereafter I specify a set of four aligned sequences equidistant form each other simply by (n 1 , n 2 , n 3 , n 4 , n 5 ) which guarantee that the four sequences are equidistant from each other.
Not all (n 1 , n 2 , n 3 , n 4 , n 5 ) combinations are equally likely under the JC69 model with a star tree. For example, the site pattern combination (24, 24, 12, 96, 32) has a near-zero chance to occur with a star tree and a strict JC69 model, because, given a star tree with x 5 = 0, G 4 sites can only emerge through independent substitutions along each of the four branches (i.e., all G 4 sites result from convergent substitutions). This means that we cannot have G 4 sites in a star tree without first having G 3 sites, so G 4 sites should not be more frequent than G 3 sites given a star tree and JC69. The ratio of G 3 /G 4 sites will decrease from ∞ (when the first substitution occurs) towards 4/3 (i.e., 48 possible G4 site patterns and 36 possible site patterns) when sequences have gradually experienced full substitution saturation. The site pattern combination above with a ratio of G 3 /G 4 equal to (12*4)/(96*3) cannot happen with a star tree because the ratio is far too small.
However, when different genes evolving under JC69 models with different rates (and potentially with different evolutionary histories and conflicting phylogenetic signals) are concatenated, strange site pattern combinations such as (24, 24, 12, 96, 32) may occur and cannot be dismissed as unreal. In fact, this site pattern combination results from a concatenation of site patterns from simulations with four different trees, i.e., the star tree and the three resolved trees (all with JC69 with no rate heterogeneity over sites), with slight modifications to ensure 1) that the nucleotide frequencies are exactly equal to 0.25, 2) that the number of transversions is exactly trice as many as the number of transitions, and 3) that the number of transitional and transversional differences between each pairwise comparison is exactly the same.

The starless bias
The site pattern combination (24, 24, 12, 96, 32) mentioned above represents a special deviation from any of the four topologies (the star tree plus the three resolved trees). While the four sequences will remain equidistant from each other with the site pattern combination of (24, 24, 12, 96, 32), the likelihood method will not favour a star tree in spite of equidistance among sequences. In general, as illustrated in Figure 2, we expect that increasing number of G 4 sites relative to G 3 sites would increase the likelihood of rejecting a star tree. As I mentioned before, G 3 sites are the first site pattern to appear in sequence evolution along a star tree. In contrast, a G 4 site requires a minimum of two substitutions when x 5 = 0 ( Figure 2C), but only a minimum of one with a resolved tree from a parsimony perspective ( Figure 2D). Tree lnL would be greater with x 5 > 0 (so that a single substitution can occur along the internal branch to be shared by two descendants) than with x 5 = 0 (which would force two independent substitutions). Thus, G 4 sites favor a resolved tree (x 5 > 0). In short, increasing number of G 4 sites relative to G 3 sites increases the likelihood of rejecting the star tree. This is true even with G 4 sites support all three resolved topologies equally because 1/3 of the G 4 sites will support a resolved topology.

Characterization of rate heterogeneity is confounded by topology
I will use G 4 sites to illustrate the effect of topology on rate heterogeneity over sites. With a star tree in Figure 2C, all G 4 sites require exactly the same number of changes (a minimum of two substitutions per site from a parsimony perspective). In other words, there is no rate heterogeneity among G 4 sites with a star tree. However, for a resolved tree in Figure 2D, 1/3 of the G 4 sites (with identical nucleotides between sister groups as in Figure 2D) would require only one substitution from the parsimony perspective. The other 2/3 of the G 4 sites (with different nucleotides between sister groups) would require two substitutions from the parsimony perspective. Thus, from a parsimony perspective, 1/3 of the G 4 sites evolve at a rate half as fast as the other 2/3 of the sites. Likewise with the likelihood perspective when multiple substitutions are corrected, 2/3 of the G 4 sites will have a substitution rate at least twice as large as 1/3 of the G 4 sites, giving rise to rate heterogeneity not present with a star tree. Thus, the relative numbers of G 4 and G 3 sites (denoted n 4 and n 3 , respectively, Figure 1A) affect not only the tendency to reject the star tree, but also the characterization of the rate heterogeneity over sites. In what follows, I assess the effect of (n 4 -n 3 ) on the tendency to reject the star tree and the confounding effect of topology on estimating the shape parameter α.

Sets of sequences to make different points
I provide five additional sets of sequences in fasta format as representative of various site pattern combinations. Sample1.fas and Sample2.fas (s003.fas and s004.fas within the Supplementary genetics-05-04-212.zip) have sequences equivalent to site pattern combinations (0, 14,43,4,40) and (1,11,41,5,42), respectively ( Figure 1A). They are from sequence simulation under JC69 model with a star tree using Evolver in PAML [13] to show that likelihood methods will indeed recover a star tree if sequences indeed evolve according to a star tree and a specific substitution model.
Sample3.fas (s005.fas within the Supplementary genetics-05-04-212.zip) includes all 256 possible site patterns, i.e., all 24 G 1 sites, 144 G 2 sites, 48 G 3 sites, 36 G 4 sites and 4 G 5 sites. It is represented by the combination of (24, 24, 12, 12, 4). Such site patterns are expected from sequence evolution under JC69 for an infinitely long time so that all sites have experienced full substitution saturation. It is used to highlight the point that the likelihood method correctly recovers the star tree as it should even with full substitution saturation.
Supplementary sequence files Sample4.fas and Sample5.fas (s006.fas and s007.fas within the Supplementary genetics-05-04-212.zip) are used to make the main points in the paper. S006 features a site pattern combination of (24, 24, 12, 96, 32) from which likelihood method will not recover a star tree, although the sequences are equidistant from each other with distances computed with any substitution models. As mentioned earlier, this data set is a concatenation of site patterns from simulations with four different trees, i.e., the star tree and the three resolved trees based on JC69 with a constant rate over sites, with slight modifications to ensure that 1) nucleotide frequencies are exactly equal to 0.25, 2) all pairwise comparisons lead to exactly the same number of transitional and transversional differences, and 3) the ratio of transitional and transversional differences between each pairwise comparison is exactly 1/2. Because of the concatenation of sequences simulated from four different topologies, the sequence set represents a special deviation from any of the four topologies (the star tree plus the three resolved trees). While the four sequences will remain equidistant from each other with the site pattern combination of (24, 24, 12, 96, 32), the likelihood method will strongly reject the star tree in spite of equidistance among sequences. The sequence set is also used to investigate the confounding effect of tree topology on the estimation of shape parameter α of gamma distribution.
Supplementary file Sample5.fas is used to assess the effect of increasing (n 4 -n 3 ) on decreasing support for the star tree. It contains 16 sets of sequences (with each set having fours sequences) of equal sequence length of 2528 derived as follows. A simulation of 100 sets of sequences with a star tree, JC69 model without rate heterogeneity, a sequence length of 2528, and a tree length of 4.8, leads to site pattern combinations that average to (168, 216, 156, 120, 80). This site pattern combination, analyzed by either New2.R or NewGamma3.R, leads to the same tree lnL. The continuous gamma model with a star tree yields lnL equal to −13965.18, shape parameter α equal to 5000 (maximum set in optimization), and branch lengths equal to 1.118709. Exactly the same results were obtained with a resolved tree with an additional internal branch length equal to 0. I changed n 3 and n 4 values to explore the effect of (n 4 -n 3 ) on the changing support for the star tree and on parameter estimation. In order to maintain the sequence length of 2528, n 1 , n 2 and n 5 were also adjusted.

Likelihood method implemented in R for star tree and gamma-distributed rate
For anyone to recreate results in this paper, I have implemented the likelihood method in R for the JC69 model and four OTUs to evaluate a star tree and a rooted tree, either with a constant rate over sites (Supplementary New2.R) or with a continuous gamma-distributed rate (NewGamma3.R). These are plain text files, self-contained, and well-annotated at the beginning of the file. One can copy and paste them into an R window to obtain results reported here. Computing likelihood with the pruning algorithm has been numerically illustrated in detail [14].
Because the continuous gamma requires integration over the rate, I used the R function 'integral' in the 'pracma' package which therefore needs to be installed before running the R scripts. The tree evaluation with a continuous gamma may take three minutes to complete on a PC with an i7-4770 CPU. I also used PhyML [15] to obtain lnL for the resolved tree. I set the tree improvement option '-s' to 'BEST' (best of NNI and SPR search), and the '-o' option to 'tlr' which optimizes the topology, the branch lengths and rate parameters. For JC69+Γ model, four categories of rates were used to estimate the shape parameter. PhyML does not generate lnL for a star tree, hence the need for the R scripts.

Likelihood method recovers the true star tree when sequences evolve under a substitution model
This part, while trivial, is needed for methodological validation. I have simulated sequence evolution of four OTUs under JC69 model and a star tree, by using the Evolver program in the PAML package [13], with different tree lengths and sequence lengths from 500 to 3000, each with 100 sets of sequences, with a constant rate over site. The sequences are then processed in DAMBE [16] so that the six site patterns in G 2 ( Figure 1A) occur exactly equally, so do the four site patterns in G 3 and three site patterns in G 4 ( Figure 1A). This ensures that the resulting sequences do not support any one of the three alternative topologies. I analyzed these site patterns by both PhyML 3.1 [15] and the likelihood methods implemented in R in the Supplementary New2.R file and NewGamma3.R. All these methods recover the star tree, which has the same tree lnL as any of the resolved tree. Furthermore, the branch lengths are equal and nearly identical to the tree length in the input tree used for simulation. Table 1 includes results for two such simulated sets of sequences without rate heterogeneity (A and B in the column headed by 'Data', Table 1). These two sets of sequences are in Supplementary Sample1.FAS and Sample2.Fas (s003 and s004). Their sequence patterns are represented by the combination of (0, 14, 43, 4, 40) and (1,11,41,5,42), respectively, for input to the R scripts. The resolve tree and the star tree consistently have the same tree lnL, whether evaluated by New2.R or PhyML (Table 1). Take Sample1.fas (Data A in Table 1) for example. The resolved tree and the star tree both have lnL of −1537.68, and the branch length for the internal branch of the resolved tree is b 5 = 0 (Table 1). PhyML outputs b 5 = 0.000001 which is likely the lower bound set for lnL maximization. Analyzing these two sets of sequences with JC69+ Γ by PhyML or my NewGamma3.R script generates the same branch lengths and tree lnL, except for a very large shape parameter α in the order of 10000 (which is expected because these two data sets were simulated with no rate heterogeneity over sites). Table 1. Results of likelihood-based phylogenetic estimation of branch lengths (b 1 to b 5 corresponding to x 1 to x 5 in Figure 1B and Figure 1C) without using gamma-distributed rates to accommodate rate heterogeneity over sites, from running the attached R script (New2.R) and PhyML 3.1 [15].  14,43,4,40), (1,11,41,5,42), and (24, 24, 12, 12, 4), respectively. (2) N/A indicates b 5 set to 0 for the star tree. Data set C (Table 1) is in Sample3.fas with the site pattern combination of (24, 24, 12,12,4). It represents sequences having experienced full substitution saturation under JC69 model and should have infinitely long branch lengths. However, for practical computation one always set an upper limit for branch lengths, e.g., 10, partly because 1) it is rare for branch lengths to be longer than 10 in practice, and 2) lnL changes little when a branch length increases beyond 10. Minimum and maximum branch lengths can be specified in my two R scripts. Because PhyML 3.1 does not seem to allow branch lengths to go above 10, for comparison I have presented branch lengths and tree lnL with maximum branch lengths set to 10 ( Table 1). Both the resolved tree and the star tree have the same lnL of −1419.57. One can change the maximum branch lengths to 100, 1000 or greater in my R scripts, and the star tree, which has one fewer branch to estimate, always has the same lnL as the resolved tree. Thus, the likelihood method does it job well as long as sequences evolve strictly according to substitution model, even with sequences having experienced full substitution saturation.

Likelihood method fails to recover a star tree with equidistant sequences
The likelihood method rejects a star tree for sequences in Supplementary file Sample4.fas (s006), with a site pattern combination of (24, 24, 12, 96, 32), in spite of equidistance among the sequences, either with or without gamma-distributed rate ( Table 2). When a constant rate is assumed, the tree lnL is −2943.148 for a resolved tree in contrast to −2947.793 for a star tree (Table 2). A likelihood ratio test would lead to 2∆lnL = 9.29, DF = 1, p = 0.0023 and a rejection of the star tree in favour of a resolved tree. The phylogenetic result with JC69+Γ rejects the star tree more strongly, with 2∆lnL = 58.03, DF = 1, p = 2.58*10 −14 . Note that the sequences length for this data set is only 536 nt. Longer alignment length would lead to even stronger rejection of the star tree. Table 2. Evaluate two alternative topologies and their branch lengths (b i ) in Figure 1B (with b 5 set to 0 and not evaluated) and Figure 1C. Data in Sample4.fas file (s006), with site pattern combination (24, 24, 12, 96, 32).  While a site pattern of (24, 24, 12, 96, 32) might be considered too extreme, one may take a milder site pattern combination such as (120, 192, 120, 204, 164), with a sequence length of 2528. The resulting lnL based on JC69+Γ is −13880.12 for a resolved tree, but −13889.4900 for a star tree, so 2∆lnL = 18.74. The star tree is rejected with p = 0.000015.
One might argue that there is nothing wrong with the likelihood method itself because the data set is pathological. As mentioned in the Materials and Methods section, this data set is from a concatenation of simulated sequences from four topologies (the star tree and the three resolved trees) with modifications so that nucleotide frequencies are equal and all pairwise comparisons lead to the same number of transition and transversional differences. Thus, although the sequences are equidistant from each other, the star tree is not an appropriate model for the concatenated sequences (neither is any of the three resolved topologies). However, the issue at hand is not on the validity of the likelihood approach, but on its robustness in phylogenetic reconstruction with conflicting phylogenetic signals. With sequences equidistant from each other, we do desire a star tree instead of having it conclusively rejected. Furthermore, we have the problem of inconsistent parameter estimation that I highlight below.
The PAML package [13] has a basemlg program which implements a continuous gamma-distributed rate. However, it does not produce correct results. Take

Inconsistent parameter estimation of likelihood method
The estimated parameter values in Table 2 are disconcerting in two ways. First, when the star tree is imposed, there is no rate heterogeneity over sites, with the shape parameter α in the order of 10000 (Table 2). However, the estimated α becomes 0.745 when b 5 (x 5 in Figure 1) is allowed to be greater than 0, indicating strong rate heterogeneity over sites. I have given reasons in Figure 2, illustrated with G 4 sites, that an excess of G4 sites will result in rate heterogeneity when a resolved tree is imposed but no rate heterogeneity when a star tree is imposed. That is, given a set of G 4 sites supporting the three resolved topologies equally, 1/3 of the G 4 sites will share a low rate of substitution whereas the other 2/3 will share a high rate of substitution when a resolved topology is imposed. In contrast, all G 4 sites will have the same rate of substitution when a star tree is imposed. This highlights the point that we do need a star tree instead of having three equally supported resolved trees because we need the star tree to perform proper parameter estimation in this case. As I mentioned before, the sequence data is a concatenation of sequence simulation on four different topologies (star and three resolved trees), all with JC69 with a constant rate. However, the maximum likelihood criterion does not allow us to choose the star tree.
Second, the substantial change of lnL with b 5 under JC69+Γ, associated with a change in α, is also unpleasant. For all practical consideration, the range of values for b 5 from 10 to 50 all just indicates a branch experiencing substitution saturation, and one would expect lnL to change little with b 5 in this range of values. However, lnL is −2930.116 with b 5 = 10 (and α = 1, Table 2), but −0.2918.782 with b 5 = 50 (and α = 0.745, Table 2). Given that the sequences are equidistant from each other, we do desire a tree with b 5 = 0, not a method that says that a tree with an unrealistically long b 5 is the best tree. Note that PhyML (version 3.1) appears to have set the maximum branch length to 10, so its generated lnL (= −2927.806, Table 2) has not yet reached its maximum.
While α is known to depend on topology, phylogenetic researchers often interpret α as if it is a sequence-specific property. A small α is often interpreted to mean strong purifying selection constraining certain sites but not others. I hope that the illustration above will caution against such interpretation.

The tendency to reject the star tree with increasing (n 4 -n 3 )
I used sequences in Sample5.fas (s007) to assess the effect of (n 4 -n 3 ). The file contains 16 sets of sequences of the same length (= 2528) but with different site patterns (n 1 , n 2 , n 3 , n 4 , n 5 ), in particular with different (n 4 −n 3 ) values. We may take likelihood chi-square statistic 2∆lnL as a measure of the tendency to reject the star tree, and expect it to increase with (n 4 −n 3 ) for reasons outlined in Figure 2. This expectation is consistent with the empirical evidence ( Figure 3). The relationship is stronger between 2∆lnL and n 4 /n 3 . Also, if we change on n 4 , but keep n 1 , n 2 , n 3 , and n 5 constant, then 2∆lnL increases smoothly with (n 4 −n 3 ) or with n 4 /n 3 . Figure 3 indicates that the relationship between 2∆lnL and (n 4 -n 3 ) is weaker when rate heterogeneity over sites is modelled by a gamma distribution. However, it is not always so, and one can easily find a counter example in which a constant-rate model leads to rejection of the star tree and a gamma-distributed rate model does not. For example, for a site pattern combination of (400,0,0,0,400) we will have lnL = −3984.64 for a resolved tree with JC69, lnL = −3987.998 for a star tree. This leads to 2∆lnL = 6.716 and a rejection of the star tree (DF = 1, p = 0.0096). With JC69+Γ, we will have lnL = −3546.648 for a resolved tree, lnL = −3547.94 for a star tree. This leads to 2∆lnL = 2.584 and the star tree is not rejected (DF = 1, p = 0.1079). Gamma-distributed rate Constant rate Figure 3. Support for a resolved tree against the star tree, measured by 2∆lnL = 2(lnL resolved tree −lnL star tree ), increases with increased G 4 sites relative to G 3 sites (measured by n 4 -n 3 ). The open circles indicates the mean (n 4 -n 3 ) value from 100 simulated sequences with a star tree and x 1 = x 2 = x 3 = x 4 = 1.2.
I should finally mention that the starless bias and parameter-estimation bias illustrated by the site pattern combination of (24, 24, 12, 96, 32) in Sample4.fas (s006) is also caused by model misspecification, in the sense that no tree model is appropriate for the data (because it is concatenation of sequences simulated from the star tree and three resolved trees). Existing model-testing methods do not address this type of model misspecification. The conventional selection of the best substitution model will choose JC69 because lnL is the same from JC69 to GTR. This would give us false confidence that we are using an appropriate substitution model for data analysis, without realizing that no tree model is appropriate (i.e., neither the star tree nor the resolved trees) for this set of concatenated sequences.

Conclusions
Two approaches are relevant to this problem of different topologies contributing conflicting phylogenetic signals caused by concatenating sequences of potentially different evolutionary history. The first is to use mixture model with different tree models each contributing a subset of sites. This is different from mixture models with the same topology but different nucleotide or amino acid substitution models. The second approach is as follows. For each phylogeny (T) obtained from a set of sequence data (S) under a substitution model (M), we should obtain the empirical site patterns from S and compare these empirical site patterns against their expectation from sequence simulation based on T and M. If the empirical site patterns deviate significantly from the expectation, then we conclude that the tree model is not appropriate. Applying this approach to sequences in Sample4.fas (s006), we will find the empirical site patterns from the sequence data differ highly significantly from the expectation, and we can conclude that no tree model is appropriate for this set of concatenated sequences or, more specifically, that the sequences are from a mixture of incompatible tree models.