Abstract
Peroxisomes are universal eukaryotic organelles hosting various metabolic pathways, with particularly diverse metabolic roles in plants (Pan et al., 2020). Severe peroxisomal dysfunction can lead to fatal genetic disorders in humans and embryonic lethality in plants (Hu et al., 2012; Honsho et al., 2020). The proteome and metabolism of peroxisomes vary significantly between species, organs, and developmental stages, and in response to different environmental conditions (Gabaldón, 2018; Pan & Hu, 2018; Yifrach et al., 2018; Corpas, 2019). Peroxisomes not only have shared but also species-specific functions (Sibirny, 2016; Kao et al., 2018; Pan & Hu, 2018; Pan et al., 2020). Mass spectrometry (MS)-based peroxisome proteomics studies in different Arabidopsis developmental stages and tissue types, such as etiolated seedlings, green leaves, senescent leaves, and cultured cells (Fukao et al., 2002, 2003; Reumann et al., 2007, 2009; Eubel et al., 2008; Quan et al., 2013; Pan et al., 2018; Pan & Hu, 2018), as well as in several other plant species, such as soybean, spinach, sweet pepper, and castor bean (Arai et al., 2008; Babujee et al., 2010; González-Gordo et al., 2022; Wrobel et al., 2023), have uncovered hundreds of peroxisomal proteins and significantly improved our understanding of these metabolically plastic organelles. Another powerful strategy in uncovering peroxisome functions is in silico identification of proteins containing peroxisome targeting signals (PTSs), which requires full defining of the PTSs (Lingner et al., 2011; Reumann, 2011; Nakanishi et al., 2024). This is of particular importance for plant peroxisomes, which possess a larger proteome and play more diverse metabolic and signaling roles than those in many other eukaryotes (Sibirny, 2016; Pan & Hu, 2018; Pan et al., 2020). 
Moreover, the peroxisome has emerged as a highly valuable compartment for organelle engineering, particularly in the fields of biological manufacturing and agriculture (Song et al., 2024). Thus, decoding PTSs will not only lay the foundation for comprehensively predicting lineage-specific peroxisomal proteins and metabolic pathways but also provide more accurate sequence information to localize enzymes and transporters to peroxisomes as desired. Peroxisomal membrane protein (PMP) targeting is believed to use two mechanisms: direct peroxisomal targeting and indirect targeting through the endoplasmic reticulum (ER; Cross et al., 2016; Mayerhofer, 2016). The targeting signal of PMPs has not been clearly defined yet. As one of the best-characterized plant PMPs in trafficking, the targeting signal for ascorbate peroxidase 3 (APX3) was found to lie within the C-terminal region including the transmembrane domain (Cross et al., 2016). By comparison, peroxisomal matrix proteins rely on two types of PTSs located at the C-terminus (PTS1) and N-terminus (PTS2). PTS1, carried by most peroxisomal matrix proteins at their extreme C-termini, was initially recognized as a simple tripeptide with a 'canonical' consensus of (S/A)-(K/R)-(L/M), and PTS2 is a nonapeptide with a loose consensus sequence (R/K)-(L/V/I)-(X)5-(H/Q)-(L/A) usually located within variable distances near the N-terminus (Kunze, 2020). However, more and more 'noncanonical' derivatives of PTS1 have been discovered, demonstrating the complexity of PTS1 and the likely existence of many unknown PTS1 tripeptides (Lametschwandtner et al., 1998; Brocard & Hartig, 2006; Lingner et al., 2011; Reumann & Chowdhary, 2018). Recently, using a large-scale statistical analysis followed by experimental validation, we determined that for weak PTS1 tripeptides, a c. 
12-amino acid auxiliary peptide (AuxP) upstream of PTS1 plays a supportive but pivotal role in peroxisome targeting, with RILVRTKRPRPR as the likely strongest AuxP in plants (Deng et al., 2022). In the same study, we identified 12 additional PTS1 peptides, increasing the number of validated plant PTS1s to 54 (Deng et al., 2022). Most of these known PTS1s are divergent in sequence and poorly conserved, indicating that PTS1 tripeptides may be far from completely identified.

Machine learning (ML) has proven to be powerful in predicting protein motifs (Savojardo et al., 2023). In ML-based motif prediction, constructing a sizable training set and properly encoding the input data are critical for the final output. However, poorly conserved motifs like PTS1 tend to contain divergent sequences that are under-represented in empirical experimental data based on known functional proteins (Brocard & Hartig, 2006; Deng et al., 2022). Moreover, creating a dataset from experimental data can be laborious and costly, which limits the amount of experimentally validated data available for the training set. To provide sufficient and accurate training information for ML-assisted prediction of the extremely diversified PTS1 sequences, we devised an ML-assisted method that uses two kinds of information about PTS1 tripeptides: their frequency of appearance in proteins and peroxisome targeting data from amino acid substitutions. This cost-effective strategy enabled us to identify PTS1 tripeptides in plants comprehensively. Apart from the 54 known PTS1 tripeptides, we were able to additionally validate 445 PTS1s and predict the existence of another 348. This study has significantly deepened our understanding of PTS1, the most important 'postal code' for a major metabolic hub in all eukaryotic cells, and thus will facilitate the discovery of proteins residing in peroxisomes of various plant species and likely nonplant species as well.
Our ML-assisted strategy may also be applicable to studies defining other protein motifs.

The 'mutual best-match' dataset of 20 712 proteins was generated as described previously (Deng et al., 2022). Briefly, 362 species covering all the main clades of angiosperms were selected from species with completely sequenced genomes (https://www.plabipd.de), including 88 monocots, 263 eudicots, and 11 others (Supporting Information Table S1). A protein was selected only if it was the 'mutual best-match' in the two-way Blast search between Arabidopsis and the plant species from which the protein was identified.

The targeting plasmid mVenus-AuxP-3aa was transformed into Agrobacterium tumefaciens strain GV3101 (pMP90) by the freeze-thaw method (Höfgen & Willmitzer, 1988). Transient protein expression in tobacco (Nicotiana tabacum) leaves followed by confocal microscopy was carried out as described previously (Deng et al., 2022) to analyze protein targeting. A previously generated 35S promoter-driven moxCerulean3-enhancer-PTS1(SKL) protein in the pGWB545 vector (Deng et al., 2022) was used as the peroxisome marker. A Fluoview FV3000 confocal laser-scanning microscope (Olympus, Tokyo, Japan) was used for image capturing, where mVenus was excited with 514-nm lasers and detected at 530–630 nm, and moxCerulean3 was excited with 445-nm lasers and detected at 460–500 nm. The primers used to construct vectors for expressing the mVenus-AuxP-3aa fusion proteins are shown in Table S2.

The 'ML-set', a 3aa dataset containing 224 PTS1 and 149 non-PTS1 peptide sequences, was generated as described in the Results section. Data storage and processing were conducted using Pandas (v.2.1.4; McKinney, 2010) and NumPy (v.1.24.3; Harris et al., 2020). Each PTS1 tripeptide had an eFrequency of ≥ 1, where the eFrequency is the frequency (number of times) of its appearance in the 3aa library.
The eFrequency was arbitrarily assigned as 0 for all the non-PTS1 peptides and as 1 for those PTS1 tripeptides not in the 3aa library. SUM indicates the summation operator, $\mathrm{eFrequency}_{[\mathrm{aa},\mathrm{position}]}$ denotes the eFrequency of the 3aa peptides containing the amino acid at a specific position, and Position indicates a specific position in a tripeptide. $\mathrm{Number}_{(\mathrm{aa},\mathrm{position})}^{\mathrm{Loss}}$ denotes the total number of times when the substitution of one amino acid in a specific position caused loss of peroxisome targeting function, and $\mathrm{Number}_{(\mathrm{aa},\mathrm{position})}^{\mathrm{Gain}}$ denotes the total number of times when the substitution of an amino acid at a specific position induced gain of peroxisome targeting function.

Given the limited size of our training dataset, we initially evaluated several models suitable for smaller datasets rather than more complex models like deep neural networks. To this end, linear regression (LR), random forest (RF), gradient boosting decision tree (GBDT), support vector machine (SVM), and linear discriminant analysis (LDA) were compared using fivefold cross-validation and evaluation metrics including accuracy and recall (Fig. S1). Overall, all five models demonstrated satisfactory performance. SVM was chosen for eValue-based evolutionary information due to its relatively higher accuracy and the fact that it offers robust regularization and generalization capabilities. LDA was chosen for sValue-based experimental information due to its relatively higher recall and the fact that it does not require hyperparameter tuning, making it a straightforward and efficient choice.
Both LDA and SVM are well suited to low-dimensional data (three-dimensional in our study), as they perform strongly with few hyperparameters to tune, which supports the efficiency and reliability of the analysis.

For the SVM model construction, each 3aa was converted to a three-dimensional vector, where each dimension represents a residue within the 3aa with an amino acid-specific eValue. The model was implemented in Scikit-learn (v.1.2.5; Pedregosa et al., 2011), using support vector classification (SVC) with a radial basis function (RBF) kernel to balance model complexity with generalization capability. To evaluate the model's performance, fivefold cross-validation was conducted. The ML-set was divided into five equal parts (folds), each containing the same ratio of PTS1 vs non-PTS1 peptides as the total ML-set. In each fold, 4/5 of the dataset (294 3aa sequences) served as the training set, while the remaining 1/5 (79 3aa sequences) served as the test set. All the hyperparameters, that is the kernel width and the slack-weight regularization parameter, were selected by fivefold cross-validation on the ML-set. All data were standardized by removing the mean and scaling to unit variance to ensure fair training conditions. The model was evaluated by a set of metrics, that is accuracy, recall, and the area under the curve (AUC) values of the receiver operating characteristic (ROC) curves and the precision-recall (PR) curves. After training, the model's predictive accuracy and recall were evaluated based on the number of test-set peptides labeled correctly. This process was repeated through five iterations. To further assess the model's performance, prediction probabilities from SVM were used to plot ROC and PR curves. After analyzing the model behavior and selecting hyperparameters on the fivefold cross-validation data, the model was retrained on the whole ML-set of 373 3aa peptides.
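The SVM workflow described above (standardization, RBF-kernel SVC, stratified fivefold cross-validation, hyperparameter selection, and retraining on the full set) can be sketched roughly as follows. This is a minimal illustration and not the code released with the study: the feature matrix is synthetic random data standing in for eValue vectors, and the hyperparameter grid, random seeds, and all variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ML-set (assumption): 224 "PTS1" and 149 "non-PTS1"
# three-dimensional vectors. Real inputs would be per-position eValues.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (224, 3)), rng.normal(0.0, 1.0, (149, 3))])
y = np.array([1] * 224 + [0] * 149)

# Stratified folds keep the PTS1 : non-PTS1 ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Standardize (zero mean, unit variance), then classify with an RBF-kernel SVC.
pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", probability=True, random_state=0),
)

# Hypothetical grid over the regularization parameter C and the kernel width gamma.
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

# Per-fold accuracy, recall, ROC AUC, and PR AUC (average precision).
scores = cross_validate(
    search.best_estimator_, X, y, cv=cv,
    scoring=["accuracy", "recall", "roc_auc", "average_precision"],
)

# GridSearchCV refits the best configuration on all 373 samples by default.
final_model = search.best_estimator_
print({k: round(float(v.mean()), 3) for k, v in scores.items() if k.startswith("test_")})
```

Selecting hyperparameters and reporting scores on the same folds mirrors the procedure as described, but it tends to give slightly optimistic estimates; nested cross-validation would be the stricter alternative.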
For LDA model construction, each 3aa was converted to a three-dimensional vector, where each dimension represents a residue position with an amino acid-specific sValue. These vectors were normalized to have zero mean and unit variance. The LDA model with default settings was employed to project the data from the three-dimensional space onto a single dimension. Similar to the evaluation of SVM, we performed fivefold cross-validation using the ML-set to evaluate the performance of the LDA model. The model was evaluated based on its classification accuracy, recall, and ROC and PR curve AUC scores, averaged over the five iterations. Given that LDA does not produce probabilistic predictions, logistic regression was further employed to classify features extracted by the LDA model. After evaluating the LDA model using the fivefold cross-validation, the model was retrained on the whole ML-set of 373 3aa peptides. The code for the models used in this study is available at https://github.com/Jenelolen/Xiao-Hong-Papers/tree/master/PTS1.

To assemble a sizable PTS1 training dataset that contains more sequences besides the 54 peptides previously identified, we selected 85 PTS1-containing and previously validated Arabidopsis peroxisomal proteins (Table S1; Pan et al., 2018). After Blast searches for homologs in the genomes of 362 angiosperms, we assembled a 'mutual best-match' dataset of 20 712 proteins, in which a protein was included only if it was identified from the two-way Blast search between Arabidopsis and the species from which the protein was identified (Table S1). The C-terminal tripeptide (3aa) of each protein was extracted to generate a plant '3aa library' that contains 20 712 tripeptides (Table S1; Fig. 1a). We then selected 299 3aa to experimentally validate their peroxisome targeting ability. These 299 3aa tripeptides included 288 from the 3aa library and 11 that contain SKL (the strongest canonical PTS1) derivatives but are absent from the library (Fig. 1a).
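Building the 3aa library amounts to taking the last three residues of every protein in the 'mutual best-match' dataset and counting how often each tripeptide occurs. A minimal sketch is given below, assuming the sequences are available as plain strings; the toy sequences and the function names are hypothetical and are not taken from Table S1.

```python
from collections import Counter


def c_terminal_tripeptide(sequence: str) -> str:
    """Return the last three residues, ignoring whitespace and a trailing stop symbol."""
    seq = sequence.strip().rstrip("*").upper()
    if len(seq) < 3:
        raise ValueError("sequence is shorter than three residues")
    return seq[-3:]


def build_3aa_library(sequences) -> Counter:
    """Count how many proteins end with each C-terminal tripeptide."""
    return Counter(c_terminal_tripeptide(s) for s in sequences)


# Hypothetical toy sequences (not real proteins from the dataset).
toy_proteins = ["MAAPLSKL", "MSTRVASKL*", "MKQLLARM", "MGGHSRL", "mvvtskl"]
library = build_3aa_library(toy_proteins)

for tripeptide, count in library.most_common():
    print(tripeptide, count)
```

The resulting counts correspond to what the text calls the frequency of appearance of a 3aa in the library.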
Most of these 3aa sequences contained amino acids found in the 54 previously established PTS1 peptides, that is (T/S/P/C/A/Q) (K/T/S/R/Q/N/M/L/H/G/F/E/D/A/Y/C) (L/M/I/F/V/Y). Specifically, 116 tripeptides contained previously established PTS1 amino acids at all three positions, and 150, 26, and 7 tripeptides contained established PTS1 amino acids at 2, 1, and 0 positions, respectively (Table S3). Each 3aa was fused to the C-terminus of the previously identified AuxP, 'RILVRTKRPRPR' (Deng et al., 2022), to support weak PTS1 tripeptides. The resulting AuxP-3aa was then fused to the C-terminus of the yellow fluorescent protein mVenus to generate the mVenus-AuxP-3aa fusion protein (Fig. S2). Using tobacco transient protein expression followed by confocal microscopy, we successfully validated 170 new PTS1 peptides whose mVenus-AuxP fusions showed co-localization with the peroxisome marker, moxCerulean3-enhancer-PTS1(SKL) (Figs 1c, S3, S4). The 129 3aa that did not target to peroxisomes (Fig. S5), together with 20 tripeptides with three identical residues (e.g. SSS), were designated as non-PTS1 (Fig. 1b). An 'ML-set' of 3aa, which included 224 functional PTS1 (54 + 170) and 149 non-PTS1 (129 + 20) sequences, was generated for ML model training (Fig. 1a,b).

We reasoned that PTS1 peptides with stronger targeting abilities would be more conserved in evolution, thus having higher frequencies of appearance in the 3aa library. To this end, we used 'Evolutionary Frequency' (eFrequency), the frequency (number of times) of appearance of a 3aa in the 3aa library, as a quantitative index of the functional strength of each PTS1 (Table S1). The eFrequency was arbitrarily designated as 1 for those PTS1-3aa sequences not in the 3aa library and 0 for all the non-PTS1 sequences. An eValue was then generated for an amino acid at a specific position of the tripeptide, using the eFrequency sums of all 3aa peptides containing this amino acid at this specific position.
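As a rough illustration of the eValue idea, the sketch below sums the eFrequency of every tripeptide that carries a given amino acid at a given position and then encodes a tripeptide as three such values. The text only states that eFrequency sums are used, so treating the eValue as the raw, unnormalized sum is an assumption here, and the eFrequency numbers and function names are invented for the example.

```python
from collections import defaultdict


def compute_evalues(efrequency):
    """Return one dict per position mapping amino acid -> summed eFrequency.

    Assumption: the eValue is taken as the plain sum, without normalization.
    """
    evalues = [defaultdict(int) for _ in range(3)]
    for tripeptide, freq in efrequency.items():
        for position, aa in enumerate(tripeptide):
            evalues[position][aa] += freq
    return evalues


def encode(tripeptide, evalues):
    """Encode a tripeptide as [eValue at pos -3, eValue at pos -2, eValue at pos -1]."""
    return [evalues[position].get(aa, 0) for position, aa in enumerate(tripeptide)]


# Hypothetical eFrequency table: PTS1 tripeptides absent from the library get 1,
# non-PTS1 tripeptides (here SSS) get 0, as described in the text.
toy_efrequency = {"SKL": 120, "ARL": 30, "SRM": 10, "PKL": 1, "SSS": 0}

evalues = compute_evalues(toy_efrequency)
print(encode("SKL", evalues))  # [130, 121, 151]
print(encode("AKM", evalues))  # [30, 121, 10]
```

A tripeptide that never appears in the table can still be encoded as long as each of its residues has been seen at the corresponding position, which is what allows the model to score unseen tripeptides.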
Then, each 3aa was converted to a three-dimensional vector, where each dimension represents a residue in the tripeptide with an amino acid-specific eValue (Fig. 2a). After preliminary trials with several ML models (see Materials and Methods section), the SVM model with an RBF kernel was chosen to classify each 3aa as either PTS1 or non-PTS1. The model was evaluated with fivefold cross-validation (Wong & Yeh, 2020) on the ML-set, which demonstrated an average accuracy of 90% and an average recall rate of 91% in predicting the peroxisome targeting function (Fig. S1). The ROC curves showed an average AUC value of 0.93, indicating high classification reliability, and the PR curves showed an average AUC of 0.96, demonstrating the robust classification performance of the model (Fig. 2a).

One caveat to the use of the eFrequency-derived eValue is its unbalanced nature, because only PTS1 sequences had quantitative eFrequency values while all non-PTS1 sequences were arbitrarily designated as having an eFrequency of 0. To provide a more balanced representation of the contribution of an amino acid at a specific position to the targeting ability of each 3aa, we employed an experiment-based 'sValue', which is calculated based on the number of times that the substitution of an amino acid at a specific position caused loss vs gain of peroxisome targeting ability (Fig. 2b). Each 3aa was then converted to a three-dimensional vector, where each dimension represents a residue of the tripeptide with an amino acid-specific sValue (Fig. 2b). After preliminary trials with several ML models (see Materials and Methods section), we chose the LDA model, which was able to reduce the three-dimensional sValue-induced feature space to a one-dimensional axis, thereby maximizing the distinction between PTS1 and non-PTS1 tripeptides.
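The LDA step described here, a projection from three dimensions onto one axis followed by logistic regression to obtain probabilities, might look like the sketch below in Scikit-learn. It is only an illustration under stated assumptions: the input is synthetic random data standing in for sValue vectors, the metrics are computed on pooled out-of-fold predictions rather than averaged per fold, and the variable names and seeds are arbitrary.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for sValue vectors (assumption): 224 positives, 149 negatives.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (224, 3)), rng.normal(-1.5, 1.0, (149, 3))])
y = np.array([1] * 224 + [0] * 149)

# Standardize, project onto the single LDA axis, then fit logistic regression
# on that one-dimensional feature to obtain class probabilities.
model = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(),  # default settings; two classes give one component
    LogisticRegression(),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

# Metrics on the pooled out-of-fold predictions.
metrics = {
    "accuracy": accuracy_score(y, pred),
    "recall": recall_score(y, pred),
    "roc_auc": roc_auc_score(y, proba),
    "pr_auc": average_precision_score(y, proba),
}
print({name: round(float(value), 3) for name, value in metrics.items()})

# Retrain on the whole set and inspect the one-dimensional projection.
model.fit(X, y)
projection = model[:-1].transform(X)
print(projection.shape)  # (373, 1)
```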
The efficacy of the LDA model was appraised using fivefold cross-validation across the ML-set, which achieved an average accuracy of 88% and a recall rate of 95% in predicting the peroxisome targeting ability (Fig. S1). Since LDA lacks probabilistic predictions, we used logistic regression to derive probability estimates for the ROC and PR curve analyses. This model demonstrated strong performance in classification, with average ROC and PR curve AUCs of 0.96 and 0.98, respectively (Fig. 2b).

After evaluation, the SVM and LDA models were retrained on the whole ML-set before being used to predict novel PTS1 peptides from all the possible tripeptide variations (20³, or 8000). SVM and LDA predicted 423 and 1158 putative novel PTS1 tripeptides, respectively (Tables S4, S5). Besides the 373 3aa peptides in the ML-set, the remaining 7627 tripeptide variations were grouped into four categories: 329 predicted by both models (Category 1), 94 only by SVM (Category 2), 829 only by LDA (Category 3), and 6375 rejected by both models (Category 4) (Fig. 3a). To assess the positive rate for each category and verify novel PTS1 tripeptides, we used confocal microscopy of mVenus-AuxP-3aa fusion proteins to test the peroxisome targeting ability of randomly selected 3aa peptides from each category. As expected, Category 1 showed the highest positive rate of 91% (Fig. 3a), with 253 out of the 278 tested tripeptides conferring peroxisome targeting (Figs 3b,c, S6–S8), followed by 80% (16 out of 20) for Category 2, 30% (6 out of 20) for Category 3, and 0% (0 out of 20) for Category 4 (Figs 3a,b, S9). The 0% positive rate of Category 4 supports our conclusion that the two models combined can predict virtually all functional PTS1 sequences (Fig. 3a). Based on the positive rates, we estimated the number of functional PTS1 tripeptides to be 299 in Category 1, 75 in Category 2, 249 in Category 3, and 0 in Category 4, which makes a total of 623 novel PTS1-3aa peptides proposed in this study (Fig. 3d).
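The arithmetic behind these estimates can be checked with a short script: enumerate all tripeptides over the 20 standard amino acids, then scale each category size by its observed positive rate. The category sizes and test counts are taken from the text; using the exact tested fractions instead of the rounded percentages, and rounding to the nearest integer, are assumptions of this sketch.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# All possible tripeptides: 20 ** 3 = 8000.
all_tripeptides = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]

# (category size, tested positives, number tested), as reported in the text.
categories = {
    "Category 1 (both models)": (329, 253, 278),
    "Category 2 (SVM only)": (94, 16, 20),
    "Category 3 (LDA only)": (829, 6, 20),
    "Category 4 (neither model)": (6375, 0, 20),
}

# Estimated functional PTS1 per category = size * observed positive rate.
estimates = {
    name: round(size * positives / tested)
    for name, (size, positives, tested) in categories.items()
}

outside_ml_set = sum(size for size, _, _ in categories.values())
novel_total = sum(estimates.values())
ML_SET_PTS1 = 224  # functional PTS1 tripeptides already in the ML-set

print(len(all_tripeptides))       # 8000
print(outside_ml_set)             # 7627
print(estimates)
print(novel_total)                # 623
print(novel_total + ML_SET_PTS1)  # 847
```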
After adding the 224 tripeptides from the ML-set, the total number of functional plant PTS1-3aa peptides can reach 847 (Fig. 3d).

To discover the highly diversified PTS1 sequences, we combined the results from two ML models, the SVM model utilizing the eValue-based evolutionary information and the LDA model utilizing the sValue-based experimental information. Based on our targeting results for all the possible tripeptide variations, the SVM model has high precision (a c. 86% positive rate) but lower recall (c. 60% coverage), whereas the LDA model has low precision (a c. 47% positive rate) but high recall (c. 88%). These two models complemented each other well and together enabled us to increase the number of validated functional plant PTS1 tripeptides from 54 to 499 and estimate the total number of plant PTS1 tripeptides to be 847 (Fig. 3d). We believe that the principle and framework of this approach can be applied to defining other highly diversified short protein motifs.

Although both SVM and LDA are suitable for our low-dimensional dataset in this study, they have limitations. LDA is a linear model and SVM is a nonlinear model, so the features each deduces from the training data tend to be linear for LDA and nonlinear for SVM. Additionally, the eValue used for SVM is imbalanced, as it is designated as 0 for all non-PTS1 tripeptides, which may lead to bias toward predicting the majority class, thereby reducing its effectiveness in identifying true positives among the minority class. Moreover, many tripeptides in the ML-set cannot be used when calculating the sValue, because they have no corresponding tripeptide in which a single amino acid substitution causes the loss or gain of peroxisome targeting ability. This makes it difficult to comprehensively assess the contribution of each amino acid at each position to peroxisome targeting and thereby constrains the prediction capability of the model. In addition, both LDA and SVM are tailored for low-dimensional data.
While advantageous for our current three-dimensional datasets, these models may not be applicable to higher-dimensional data, where feature space complexity increases. In higher dimensions, LDA might oversimplify the data structure due to its linearity assumption and SVM might require more extensive tuning and computational resources to maintain its performance. To overcome the limitations of SVM and LDA in motif prediction, future work should explore more sophisticated models and data augmentation techniques to ensure better generalization and robustness across different datasets.

It has long been known that PTS1 tripeptides in different kingdoms all have many varied forms (Neuberger et al., 2003; Brocard & Hartig, 2006; Deng et al., 2022), yet a comprehensive view of this heterogeneity has been lacking. Our study is a crucial step toward fully understanding plant PTS1 peptides by revealing the range of their diversity. Along with the expanded number of PTS1 tripeptides, the range of amino acids that can appear in each of the three positions has also increased significantly. Based on the previously established 54 PTS1 peptides, the potential amino acid composition can be summarized as (T/S/P/C/A/Q) (K/T/S/R/Q/N/M/L/H/G/F/E/D/A/Y/C) (L/M/I/F/V/Y). Our study expanded the possible amino acids at positions −3 and −2 to all 20 amino acids and those at position −1 to (L/I/M/V/F/Y/A/C/H/K/N/G/Q/S/W/T), indicating the remarkable tolerance of PEX5's binding cavity to different amino acids on the PTS1 tripeptides (Neuberger et al., 2003; Brocard & Hartig, 2006), as well as the significant contribution of the strong upstream auxiliary peptide (Deng et al., 2022). Previously developed ML-based PTS1 prediction algorithms in plants (Lingner et al., 2011; Wang et al., 2017) have very limited ability to predict PTS1 proteins with weak and rare tripeptides, because these PTS1 sequences are very rare in, and sometimes even absent from, the training data.
The high number of PTS1 tripeptides discovered in this study expands the number of candidate peroxisomal proteins and can therefore facilitate future efforts in identifying new peroxisomal functions. Among the 20 712 'mutual best-match' proteins of the 85 PTS1-containing and previously validated Arabidopsis peroxisomal proteins identified from 362 angiosperms (Table S1), we found 530 proteins with the newly validated PTS1-3aa in this study and 294 proteins with predicted putative PTS1 tripeptides (Table S6). As an example of how the findings of this study may help to identify new peroxisomal proteins, we selected two rice proteins, Os01g0689600 (uncharacterized protein) and Os03g0854400 (putative RNA-specific endonuclease), for targeting analysis. Both proteins were previously unknown to be peroxisomal and contain new PTS1 tripeptides with many basic and nonpolar residues in the upstream AuxP region, which are features known to assist PTS1 function (Deng et al., 2022). Both the full-length protein and the C-terminal 15aa conferred partial peroxisome targeting for these two proteins (Fig. S10), indicating that they are both new peroxisomal proteins in rice that may represent previously unknown peroxisomal functions. Furthermore, the comprehensive lists of validated PTS1 tripeptides and predicted putative PTS1 tripeptides revealed in this study can provide a reliable range of potential PTS1 tripeptides, which could be used as a reference to help remove false positives predicted by existing algorithms, thus refining their prediction accuracy.

It is important to note that in this study all the new PTS1 tripeptides were experimentally verified in the presence of a strong upstream AuxP. Proteins ending with these new PTS1 tripeptides but lacking an AuxP may not be imported into peroxisomes.
It is also worth noting that some PTS1 tripeptides, despite carrying a strong AuxP, only led to partial peroxisome targeting, indicating that these tripeptides have very weak capability in peroxisome targeting. Furthermore, the predicted PTS1 tripeptides in Categories 2 and 3, which were supported by only one ML model and mostly not yet experimentally verified, have a lower positive rate than that of Category 1. Therefore, some of these predicted putative PTS1 tripeptides may not have peroxisome targeting ability. In general, the presence of the new PTS1 tripeptides should only be considered as a good possibility but not solid evidence for peroxisome targeting.

It is also important to note that sometimes even the presence of a strong PTS1 is not sufficient for peroxisome targeting. For example, the PTS1 sequence may be shielded from the receptor PEX5 by other parts of the protein, or a strong N-terminal targeting signal may compete with PTS1. In rare cases, the PTS1 tripeptide may be located near, instead of at, the C-terminus. The Arabidopsis peroxisomal oxidative pentose-phosphate pathway (OPPP) enzymes are good examples of these so-called peculiar PTS1 proteins (Meyer et al., 2011; Baune et al., 2020; Doering et al., 2024). For instance, 6-phosphogluconate dehydrogenase 2 (PGD2) ends with an established PTS1 tripeptide SKI, but a phosphorylation event at its N-terminus is needed to prevent the formation of the PGD dimer that shields the PTS1 peptide, thereby enabling peroxisome import (Doering et al., 2024). Glucose-6-phosphate dehydrogenase 1 (G6PD1) contains SKY as an internal PTS1-like signal, which is normally overruled by an N-terminal plastid import signal. Only when temporary cytosolic oxidation activates G6PD4 to prime its interaction with the G6PD1 precursor can the internal PTS1-like signal on G6PD1 be exposed and confer peroxisome import (Meyer et al., 2011).
Therefore, when a fusion of a fluorescent protein and a protein containing a strong PTS1 does not localize to peroxisomes, it may be worth checking whether this PTS1 is well conserved in other species, which may be a good indication of its function in peroxisome targeting, at least under certain conditions.

This work was supported by the National Natural Science Foundation of China (32200231), the Zhejiang Provincial Natural Science Foundation of China (LZ23C020002), the National Key Research and Development Program (2022YFD1401600), and the Beijing Life Science Academy regular project (2023000CC0010).

Competing interests: none declared.

RP, XL and QZ designed the research. QD and YX conducted in vivo analysis. XH, ZG, JC and YF performed in silico analysis. HD, JH, NL, XS, JZ and XX contributed to data analysis. QD, XH, RP, XL and JH prepared the manuscript. QD, YX, XH and ZG contributed equally to this work.

Fig. S1 Initial evaluation of different ML models.
Fig. S2 Diagrams for construct generation to express the mVenus-AuxP-3aa fusion protein.
Fig. S3 Confocal images of peroxisome targeting analysis of PTS1 peptides for generating the ML-set (second subset).
Fig. S4 Confocal images of peroxisome targeting analysis of PTS1 peptides for generating the ML-set (third subset).
Fig. S5 Confocal images of peroxisome targeting analysis of non-PTS1 peptides for generating the ML-set.
Fig. S6 Peroxisome targeting analysis of 3aa predicted by both models (second subset).
Fig. S7 Peroxisome targeting analysis of 3aa predicted by both models (third subset).
Fig. S8 Peroxisome targeting analysis of 3aa predicted by both models (fourth subset).
Fig. S9 Peroxisome targeting analysis of 3aa predicted only by one or no model.
Fig. S10 Peroxisome targeting analysis of two rice proteins with newly identified PTS1.
Table S1 The 'mutual best-match' dataset of plant PTS1 proteins and the plant 3aa library.
Table S2 Primers used in this study.
Table S3 Selected 3aa to be validated for generating the ML-set.
Table S4 Putative PTS1-3aa predicted by the SVM model.
Table S5 Putative PTS1-3aa predicted by the LDA model.
Table S6 Proteins in Table S1 that contain either a newly validated or a predicted putative PTS1 tripeptide identified in this study.

Please note: Wiley is not responsible for the content or functionality of any Supporting Information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.