Manual for the Amberbio app
The Amberbio app allows scientists to analyze and visualize data sets, in particular biological data, on mobile devices. All calculations are performed directly on the device and all data is stored locally. The functionality should be evident from the user interface, and it is not necessary to read the manual to use the app. The manual gives a high-level overview of the app and explains important details that are not clear from the user interface, such as some mathematical details.
This manual is available in the app and on the web at www.amberbio.com/manual.
Version 8 of the Amberbio app.
All data is stored locally on the device. All calculations are performed locally. Amber Biosciences does not have any access to user data.
Projects and data sets
The app handles projects and data sets. A data set is a set of values organized in a table with rows and columns. A project is a collection of one or more data sets. A project is always created with a data set called "The original data set". Other data sets within a project are created by the app using various methods such as normalization or sample removal.
The columns of a data set are called samples and the rows are called molecules. The values in the table represent an intensity, or some other quantity, for a sample-molecule pair. Data sets could represent biological information such as gene expression, protein abundances, microRNA expressions, peptide abundances, or metabolite concentrations. Data sets do not need to be of biological nature even though the app uses terminology from biology.
A project also contains factor, or grouping, information about the samples. A factor has one or more levels. An example of a factor is "Gender" with levels "male" and "female". Naturally, factors are shared for all data sets within a project.
A project can also contain extra information about the molecules. This information is called molecule annotations. An example could be the molecule annotation "Chromosome" where each molecule has a value such as "chromosome 21".
The active data set
Analysis is performed on the active data set. Selection of the active data set is done on the page Data Set Selection. The active project is the project to which the active data set belongs. Editing and adding factors and molecule annotations are done in the active project.
There is almost always an active data set and project. The only exceptions are when the app starts the first time and when the active project is deleted. When a new data set is created, the app automatically makes the new data set active and jumps to the page Data Set Selection.
Import of data
Data import is used to create new projects and to add factor and molecule annotation information to projects. Import of data is a two-step process. First, a file is downloaded to the app. Second, the file is read and parsed, and the data is imported into a database kept by the app.
Import of data is handled by the page Import data. All downloaded files can be seen on this page. The app keeps all imported files until they are deleted by the user.
A project is created from a table of values. The values cannot be changed. New values must be imported in a new project. Factors and molecule annotations can be added at any time.
The app can import files with file extension ".txt", ".xlsx" and ".sqlite".
The ".txt" and ".xlsx" files contain tables, or spreadsheets, of data. They are prepared by the user. The tables are used to create a new project, to import factors to the active project, or to import molecule annotations to the active project. The precise format of these file types is explained below.
The ".sqlite" files are internal files. They are exported by the app and can later be imported in the app on any device. They are used to transfer and back up data.
Download of files
Files can be downloaded in two ways: either by "opening" the file in another app such as Mail or by import from a cloud storage.
Opening a file in the Amberbio app from another app is done by tapping the file and selecting the Amberbio app. A typical use case is to open an email attachment by tapping the attachment in the Mail app.
Files can be imported from a cloud storage such as iCloud Drive, Dropbox, Box, or Google Drive. The import is done by tapping "Download file" on the page Import data. The corresponding cloud storage app must itself be installed on the device. The first time a certain cloud storage provider is used in the Amberbio app, it might be necessary to tap "more" and enable that provider.
Importing downloaded files
Tap the button "Import" below the file name for a downloaded file to import it. ".sqlite" files are automatically processed. ".txt" and ".xlsx" files are processed by an import wizard that goes through several steps where the user can select a rectangular region by tapping. The use of the import wizard minimizes the need to edit the table in an external program before import.
Format of .txt files
The ".txt" files contain tables of data. The rows, or lines, are separated by "\n", "\r", or "\r\n", and the cells are separated by "\t" or ",". In other words, the ".txt" files can be both tab and comma separated and have all common line endings. The app finds the cell separator automatically by first searching for tabs. If there are tabs, the app reads the file as a tab separated file. If there are no tabs, the file is read as a comma separated file. Tabs are never allowed within the cells. If there are commas within the cells, the file must be tab separated.
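The separator rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the app's own parser; the function name `parse_txt_table` is hypothetical, and dropping empty lines is an assumption.

```python
import re


def parse_txt_table(text):
    """Split the text of a ".txt" file into rows of cells."""
    # Accept all common line endings: "\r\n", "\n", and "\r".
    # Assumption: empty lines are dropped.
    lines = [line for line in re.split(r"\r\n|\n|\r", text) if line != ""]
    # Tabs take priority: if the file contains any tab it is read as tab
    # separated, otherwise as comma separated.
    separator = "\t" if "\t" in text else ","
    return [line.split(separator) for line in lines]


rows = parse_txt_table("Name\tSample 1\r\nMol 1\t12,98\r\n")
print(rows)
```

Because tabs are searched for first, a tab separated file can safely contain commas inside its cells, as in the value "12,98" in the usage example.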
Format of .xlsx files
The ".xlsx" files are modern Excel files. They are distinct from the older ".xls" files. The ".xlsx" files can be exported from Excel and many other programs. The data must reside in the first sheet.
The ".xlsx" files are parsed by the library "XlsxReaderWriter" by René BIGOT. The library is available at https://github.com/renebigot/XlsxReaderWriter.
Copyright for XlsxReaderWriter
The MIT License (MIT)
Copyright (c) 2014 René BIGOT
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Specification of tables from .txt and .xlsx files
Numerical values can be specified with either decimal point as in "12.98" or decimal comma as in "12,98". If the decimal separator is comma, a ".txt" file must be tab separated. Where a number is expected, anything other than a number is taken as a missing value.
Because of the import wizard, any reasonable arrangement of rows and columns can be used for import. For instance, samples can be in either rows or columns with molecules in the other dimension. Examples are shown below.
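A minimal Python sketch of this rule for a single cell follows. The function name `parse_value` is hypothetical, and treating non-finite results such as "nan" as missing is an assumption rather than documented behavior of the app.

```python
import math


def parse_value(cell):
    """Return the cell as a float, or None when it is a missing value."""
    # Accept both "12.98" and "12,98" by normalizing the decimal comma.
    text = cell.strip().replace(",", ".")
    try:
        value = float(text)
    except ValueError:
        # Anything that is not a number counts as a missing value.
        return None
    # Assumption: "nan" and "inf" are treated as missing as well.
    return value if math.isfinite(value) else None


print([parse_value(cell) for cell in ["12.98", "12,98", "#NA", ""]])
```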
The first example shows a table that contains everything, i.e., sample names, molecule names, values, factors, and molecule annotations. Through the import wizard various rectangular regions can be imported.
|Anything||Sample 1||Sample 2||Unigene||Chromosome|
The second example is a transposed version of the first example. In addition, "," is used as decimal separator in some values. This file therefore cannot be a comma separated ".txt" file.
|Anything||Molecule 1||Molecule 2||Molecule 3||Gender||Status|
The third example shows a table that contains information about values and molecule annotations but no factors.
|Molecule name||Molecule id||Mouse A||Mouse B|
|Mol 1||Id 1||5.6||3.4|
|Mol 2||Id 2||-9.8||#NA|
The fourth example provides factors for the third example. The factors can be imported at any later time. Samples that are absent in the project, like Mouse C, are automatically ignored.
In project settings, the molecule identifier used in plots can be selected. The default molecule identifier is the molecule name. However, often another molecule annotation is more descriptive in a figure. The selected molecule identifier is stored in the local app database and will be remembered through restarts of the app.
Data sets can be imported directly from the NCBI Gene Expression Omnibus (GEO). The app downloads and imports a data set based on a user supplied id. The id must be of the form GDSxxxx or GSExxxx where xxxx is any number.
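The id format can be checked with a short regular expression. The sketch below is illustrative only; the name `is_geo_id` is hypothetical, and whether the app also accepts lower case letters or surrounding whitespace is not stated in this manual.

```python
import re

# "GDS" or "GSE" followed by one or more digits.
GEO_ID = re.compile(r"(GDS|GSE)\d+")


def is_geo_id(text):
    """Return True if text has the form GDSxxxx or GSExxxx."""
    return GEO_ID.fullmatch(text) is not None


print(is_geo_id("GDS1234"), is_geo_id("GPL570"))
```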
The imported data set becomes the original data set in a new project named after the id. The app extracts some, but not all, information from the downloaded data set. The factors are not standardized in GEO data sets. As with all data sets in the app, factors can be manually edited and added.
GEO data sets can be searched on the web page http://www.ncbi.nlm.nih.gov/sites/GDSbrowser.
Result files are created by the app. The result files are figures and tables and have the file extension ".txt", ".pdf", or ".png". The result files are kept on the page Result files. The result files can be sent by email, opened in another app, or exported to a cloud storage.
Backup and sharing of projects
Projects can be exported to a database file on the page Export projects. The database file has the extension ".sqlite". The file can be used for backup and transfer of projects to other devices. The file can be sent by email or exported to a cloud storage.
Name and emails
On the page User, a name and a list of emails can be typed. There are no user accounts in the app. The name is only used for comments in the result files and the project notes. The emails are used as suggested emails when a file is sent by email from the app. The actual destination emails can always be changed before sending the email.
The Anova test is performed on the selected levels of a factor. The selected levels are highlighted, and the Anova test is performed by tapping the blue factor name. For each molecule, a standard Anova test is performed by removing samples with missing values and calculating the F-statistic as the ratio of the between-groups variation to the within-groups variation. The p-value is the upper tail probability of the F-distribution.
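The F-statistic of a standard one-way Anova for one molecule can be sketched as follows. This is a textbook calculation written for illustration, not code taken from the app; the function name `anova_f` is hypothetical. The p-value is omitted because the Python standard library has no F-distribution; it would be the upper tail probability for the returned degrees of freedom.

```python
def anova_f(groups):
    """One-way Anova for one molecule.

    groups: one list of values per selected level, None marks a missing value.
    Returns (F, degrees of freedom between, degrees of freedom within).
    """
    # Remove samples with missing values.
    groups = [[v for v in group if v is not None] for group in groups]
    groups = [group for group in groups if group]
    n = sum(len(group) for group in groups)
    k = len(groups)
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]
    # Between-groups variation: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups variation: values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


print(anova_f([[1, 2, 3], [2, 3, 4, None], [5, 6, 7]]))
```

The sketch assumes at least two non-empty groups and a non-zero within-groups variation; a real implementation would need to handle those edge cases.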
After selecting a factor, any number of pairwise tests between two levels can be selected in the table. The pairwise test will be performed on all selected pairs and presented in a table. The t-test is a standard Student's t-test with equal variance. The p-value from the Wilcoxon test, which is the same as the Mann-Whitney test, is calculated as an exact value for small sample sizes and by the Normal distribution approximation for large sample sizes. The p-values are two-sided.
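As an illustration of the equal variance t-test, the statistic for one molecule and one pair of levels can be computed as below. The function name `pooled_t_statistic` is hypothetical and the code is a textbook sketch, not the app's implementation. The two-sided p-value would follow from the Student t-distribution with the returned degrees of freedom, which is not available in the Python standard library and is therefore left out.

```python
import math


def pooled_t_statistic(a, b):
    """Student's t-test with equal variance for two lists of values."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)
    df = n1 + n2 - 2
    # Equal variance: pool the sums of squares of both levels.
    pooled_variance = (ss1 + ss2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return t, df


print(pooled_t_statistic([1, 2, 3], [4, 5, 6]))
```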
A pairing factor must be selected. Samples with the same level for the pairing factor can be paired. An example of a pairing factor is "Patient" and the levels are the patient ids or names. For each patient, two or more samples are measured.
The comparison factor is the factor for which the levels will be compared. When two levels of the comparison factor are compared, the pairs of samples are defined as follows. For each pairing factor level, there must be exactly one sample for each of the two comparison levels with that particular pairing factor level. An example of a comparison factor is "Time" with the levels "morning" and "evening". In this case, a pairwise test will be performed between the evening and morning samples for each patient.
The t-test is performed by subtracting the values of the paired samples in the two selected comparison levels. The p-value is two-sided and tests the null hypothesis that the difference between the comparison levels is zero.
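The pairing rule and the paired t-statistic can be sketched together. The names `paired_differences` and `paired_t_statistic` are hypothetical. Two choices in the sketch are assumptions that the manual does not settle: pairing levels that do not have exactly one sample in each comparison level are skipped, and samples with missing values are ignored.

```python
import math


def paired_differences(samples, level_a, level_b):
    """samples: iterable of (pairing level, comparison level, value)."""
    by_pairing = {}
    for pairing, comparison, value in samples:
        # Assumption: samples with a missing value (None) are ignored.
        if comparison in (level_a, level_b) and value is not None:
            levels = by_pairing.setdefault(pairing, {level_a: [], level_b: []})
            levels[comparison].append(value)
    # Assumption: keep only pairing levels with exactly one sample per level.
    return [v[level_a][0] - v[level_b][0] for v in by_pairing.values()
            if len(v[level_a]) == 1 and len(v[level_b]) == 1]


def paired_t_statistic(differences):
    """t-statistic for the null hypothesis that the mean difference is zero."""
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1


samples = [("p1", "morning", 4), ("p1", "evening", 5),
           ("p2", "morning", 5), ("p2", "evening", 7),
           ("p3", "morning", 6), ("p3", "evening", 9),
           ("p4", "morning", 6), ("p4", "evening", 6),
           ("p5", "evening", 3)]  # p5 has no morning sample and is skipped
print(paired_t_statistic(paired_differences(samples, "evening", "morning")))
```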
For the selected factor, the levels with a numeric interpretation are chosen. A level can be converted to a number if it starts with, or is, a numeric value, such as "5 days" or "12.3". For each molecule, the samples with a numeric level and a non-missing value are used for the linear regression. The intercept and slope are calculated by least squares regression.
The p-value is an Anova p-value and tests the null hypothesis that the slope is zero, i.e. low p-values imply that there is a trend with a non-zero slope. The Anova test is performed by calculating an F-statistic as a ratio where the numerator is the variation between the line and a simple mean value, and the denominator is the residual variation from the fitted line. A large F-statistic, and correspondingly low p-value, implies that the line is a much better fit than a single mean value.
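The conversion of levels to numbers and the regression F-statistic can be sketched as follows. The names `numeric_prefix` and `regression_f` are hypothetical, and the exact number syntax accepted by the app (signs, exponents) is an assumption. The p-value would be the upper tail of the F-distribution with 1 and n - 2 degrees of freedom, which the Python standard library does not provide.

```python
import re


def numeric_prefix(level):
    """Return the number a level starts with, or None ("5 days" gives 5.0)."""
    match = re.match(r"\s*[-+]?(\d+\.?\d*|\.\d+)", level)
    return float(match.group(0)) if match else None


def regression_f(xs, ys):
    """Least squares line through (xs, ys). Returns (intercept, slope, F)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Numerator: variation between the fitted line and the simple mean value.
    ss_line = slope * sxy
    # Denominator: residual variation from the fitted line, n - 2 degrees of freedom.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, ss_line / (ss_residual / (n - 2))


levels = ["1 day", "2 days", "3 days", "4 days", "control"]
values = [2.0, 4.0, 5.0, 9.0, 7.0]
pairs = [(numeric_prefix(l), v) for l, v in zip(levels, values)]
pairs = [(x, y) for x, y in pairs if x is not None]  # "control" is dropped
print(regression_f([x for x, _ in pairs], [y for _, y in pairs]))
```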
Multiple hypothesis testing
Multiple hypothesis testing is performed using the Benjamini-Hochberg false discovery rate method. The columns named false discovery rate contain the q-values of the molecules.
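The standard Benjamini-Hochberg calculation of q-values can be sketched as below. The function name `benjamini_hochberg` is hypothetical and the code is a textbook version for illustration; how the app treats molecules without a p-value is not covered here.

```python
def benjamini_hochberg(p_values):
    """Return the Benjamini-Hochberg q-value for each p-value."""
    m = len(p_values)
    # Indices of the p-values in increasing order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each q-value is the smallest false discovery rate at which the molecule would still be called significant.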
A histogram of the p-values can be seen by tapping "histogram". An excess of low p-values, compared with the flat histogram expected when there is no difference, indicates that some molecules differ between the groups.
The supervised classifiers separate samples into two or more levels. A classifier is trained on a training set and afterwards tested on a test set.
There are three modes in which the classifiers in the app can be used: a fixed training set, leave-one-out cross validation, and k-fold cross validation.
When a fixed training set is chosen, the classifier is tested on all remaining samples, both those with actual levels matching the levels of the classifier and those with actual levels different from the levels known to the classifier. The latter type of samples cannot be used to estimate the predictive power of the classifier. However, it is often still useful to know the predicted levels of these samples. For instance, if the classifier has the levels "sick" and "healthy", it might be useful to classify "borderline" samples into either "healthy" or "sick".
For leave-one-out cross validation, each sample is left out and tested with the remaining samples as the training set. In that way, each sample obtains one classified level which can be compared to the real level. Leave-one-out classification only uses the samples with levels known to the classifier.
For k-fold cross validation, the samples are divided randomly into k subsets of almost equal size; if the number of samples is not divisible by k, some subsets will be one sample larger than other subsets. Each subset is tested on a classifier trained by the remaining samples. In that way, each sample obtains one classified level which can be compared to the real level. Cross validation only uses the samples with levels that are known to the classifier.
In the special case where k is equal to the number of samples, k-fold cross validation is equal to leave-one-out cross validation. Leave-one-out cross validation has its own selection in the app because of its importance.
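The random division into k subsets of almost equal size can be sketched as follows. The function name `k_fold_subsets` and the `seed` parameter are hypothetical; how the app draws its random numbers is not described in this manual.

```python
import random


def k_fold_subsets(samples, k, seed=0):
    """Divide samples randomly into k subsets whose sizes differ by at most one."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th sample gives subsets of almost equal size.
    return [shuffled[i::k] for i in range(k)]


subsets = k_fold_subsets(["s%d" % i for i in range(10)], 3)
for test_set in subsets:
    # Each subset is tested on a classifier trained on the remaining samples.
    training_set = [s for other in subsets if other is not test_set for s in other]
    print(len(training_set), len(test_set))
```

With k equal to the number of samples every subset holds a single sample, which is the leave-one-out case.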
k nearest neighbor classification
Given a training set and a test sample, the k training samples with the smallest distances to the test sample are found. The levels of these k nearest samples are considered, and if there is a single level that constitutes a majority of the k samples, that level is chosen as the predicted level for the test sample. If no level obtains a majority alone, the test sample is considered unclassified. If the classifier has only two levels and k is odd, there is always one level with a majority. In other cases, samples might be unclassified.
The distance measure is Euclidean distance. Only molecules without missing values in the combined training set and test set are included in the distance calculation.
In case of ambiguity in the k nearest neighbors due to ties in the distances, the classifier makes an arbitrary, but non-random, choice. This situation practically never occurs in real biological data.
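A minimal sketch of the k nearest neighbor rule for one test sample is shown below. The function name `knn_predict` is hypothetical. The sketch reads "majority" as more than half of the k neighbors, which is an interpretation of the text above, and it resolves distance ties by the order of the training samples, which is only one possible non-random choice.

```python
import math
from collections import Counter


def knn_predict(training, test_values, k):
    """training: list of (values, level); None marks a missing value.

    Returns the predicted level, or None when the sample is unclassified.
    """
    # Use only molecules without missing values in training and test samples.
    usable = [j for j, value in enumerate(test_values)
              if value is not None
              and all(values[j] is not None for values, _ in training)]

    def distance(values):
        return math.sqrt(sum((values[j] - test_values[j]) ** 2 for j in usable))

    # sorted() is stable, so ties keep the order of the training samples.
    nearest = sorted(training, key=lambda item: distance(item[0]))[:k]
    level, count = Counter(level for _, level in nearest).most_common(1)[0]
    # Assumption: a majority means more than half of the k neighbors.
    return level if 2 * count > k else None


training = [([1.0, 1.0, None], "healthy"), ([1.2, 0.9, 3.0], "healthy"),
            ([5.0, 5.1, 2.0], "sick"), ([4.8, 5.3, 1.0], "sick")]
print(knn_predict(training, [1.1, 1.0, 7.0], 3))
```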
Support vector machine classification
The support vector machine (SVM) classification uses the software library LIBSVM: "Chih-Chung Chang and Chih-Jen Lin, LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011". The LIBSVM software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. The Amberbio app uses the LIBSVM library to train an SVM classifier and calculate the decision values and predicted classes for test samples.
Copyright for LIBSVM
Copyright (c) 2000-2014 Chih-Chung Chang and Chih-Jen Lin. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The SVM classifier in the app can compare two or more levels. A standard SVM classifier can directly classify two levels. Comparisons between K levels are performed by the LIBSVM library by performing all K(K - 1)/2 pairwise comparisons and selecting the level that wins most pairwise comparisons.
It is recommended to use logarithmic values for the SVM classifier. No scaling or other pre-processing is performed by the app.
For the binary classifier (two levels), a decision value is calculated. Samples with positive decision values are classified in one level and those with negative decision values in the other level. Using a variable threshold instead of a fixed zero threshold, a whole curve of classifiers is obtained. Plotting the true positive rate versus the false positive rate for variable thresholds leads to the receiver operating characteristic (ROC) curve. The area under the curve is a measure of the success of the classifier. Good classifiers have areas close to 1, whereas bad classifiers have areas close to 0.5.
The ROC curve only applies to the binary classifier. Call the two levels A and B. The true positive rate is defined as the fraction of samples with an actual level of A that are predicted to have level A. The false positive rate is defined as the fraction of samples of level B that are classified as level A. The true positive rate is also called the sensitivity. The false positive rate is equal to 1 - specificity.
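The construction of the ROC curve from decision values can be sketched as below. The function name `roc_curve` is hypothetical, and the sketch assumes that level A is the level associated with positive decision values. The area is computed with the trapezoid rule, which is a common choice but not necessarily the one used by the app.

```python
def roc_curve(decision_values, is_level_a):
    """Return the ROC points (false positive rate, true positive rate) and the area.

    is_level_a[i] is True when sample i has the actual level A.
    """
    positives = sum(is_level_a)
    negatives = len(is_level_a) - positives
    # Lowering the threshold step by step classifies more samples as level A.
    ranked = sorted(zip(decision_values, is_level_a), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_positives = false_positives = 0
    for index, (value, actual_a) in enumerate(ranked):
        if actual_a:
            true_positives += 1
        else:
            false_positives += 1
        # Emit one point per distinct decision value so ties share a threshold.
        if index + 1 == len(ranked) or ranked[index + 1][0] != value:
            points.append((false_positives / negatives, true_positives / positives))
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, area


points, area = roc_curve([2.0, 1.0, 0.5, -0.5, -1.0],
                         [True, True, False, True, False])
print(points, area)
```

The sketch requires at least one sample of each level; otherwise one of the rates is undefined.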
The kernels are the linear kernel and the radial basis function (RBF) kernel of Gaussian functions. The parameter C is the coefficient of the error term in the optimizing function. The parameter gamma is the scale parameter in the exponent of the Gaussian function. A proper explanation can be found in any article or book about support vector machines. The app uses the same terminology as LIBSVM.
The linear kernel with default parameter is usually a good choice for biological data with many molecules and relatively few samples.
Only molecules without missing values in the training and test sets are used for the SVM.
k means clustering
K means clustering is an unsupervised clustering algorithm that divides the samples into k groups. The number of clusters, k, is user determined. The algorithm employed by the app is probabilistic and might give different results for several runs on the same data set. However, the results will be similar, and the differences are explained by movement of samples whose group membership is ambiguous. Since there is no unique biologically correct clustering in any case, the probabilistic nature of the algorithm is acceptable.
The algorithm is almost identical to the standard Lloyd algorithm. The steps are described below.
- Iterate over the entire cluster algorithm below, and take the clustering with the smallest sum of square deviation as the final result. The number of iterations is dynamic and depends on the duration for one iteration. The app should never become unresponsive by being stuck in a long computation.
  - Assign the samples to random clusters.
  - Resolve empty clusters by moving samples from the largest cluster to empty clusters.
  - Iterate the following steps until the clusters are constant or a maximum limit is reached. In practice, the maximum limit is almost never needed.
    - Calculate centroids as the average point in each cluster.
    - Reassign samples to clusters. A sample belongs to the cluster whose centroid is closest.
    - Resolve empty clusters by moving samples from the largest cluster to the empty clusters.
  - Calculate the sum of square deviation of samples from their cluster centroid.
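The steps above can be sketched in Python as follows. All function names are hypothetical. Several details are assumptions, since the manual does not specify them: the number of restarts is fixed instead of time based, the maximum limit is 100 rounds, and the sample moved out of the largest cluster is the last one in it. The sketch requires at least k samples.

```python
import random


def fix_empty(assignment, k):
    """Move one sample from the largest cluster into each empty cluster."""
    for cluster in range(k):
        if cluster not in assignment:
            sizes = [assignment.count(c) for c in range(k)]
            largest = sizes.index(max(sizes))
            # Assumption: the last sample of the largest cluster is moved.
            donor = max(i for i, c in enumerate(assignment) if c == largest)
            assignment[donor] = cluster


def centroids_of(points, assignment, k):
    """Average point of each cluster (no cluster is empty after fix_empty)."""
    result = []
    for cluster in range(k):
        members = [p for p, c in zip(points, assignment) if c == cluster]
        result.append([sum(column) / len(members) for column in zip(*members)])
    return result


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_once(points, k, rng, max_rounds=100):
    assignment = [rng.randrange(k) for _ in points]  # random clusters
    fix_empty(assignment, k)
    for _ in range(max_rounds):
        centroids = centroids_of(points, assignment, k)
        new_assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                          for p in points]
        fix_empty(new_assignment, k)
        if new_assignment == assignment:  # clusters are constant
            break
        assignment = new_assignment
    centroids = centroids_of(points, assignment, k)
    cost = sum(squared_distance(p, centroids[c]) for p, c in zip(points, assignment))
    return assignment, cost


def k_means(points, k, restarts=20, seed=0):
    """Repeat the whole algorithm and keep the clustering with the smallest cost."""
    rng = random.Random(seed)
    return min((k_means_once(points, k, rng) for _ in range(restarts)),
               key=lambda result: result[1])


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(k_means(points, 2))
```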
The Sammon map is a projection of the high dimensional samples to a low dimensional space. The map was invented by Sammon in 1969. (J. W. Sammon jr. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers, vol. C-18, no. 5, pp 401-409, May 1969). In this app the low dimensional space is two or three dimensional such that the samples can be visualized. The Sammon map attempts to preserve the pairwise distances between samples as much as possible.
The distances in the high dimensional space are Euclidean distances using only the molecules without missing values for the selected samples. The algorithm is described by Kohonen (T. Kohonen. Self Organizing Maps. Springer, 3rd edition, 2001) and is iterative. The number of iterations depends on the time of one iteration; the app will never go into a long computation.
At each iteration, the algorithm loops over all pairs of samples and updates the two samples. The update moves the two points directly away from each other if their distance is too small, and moves them towards each other if it is too large. The size of the movement depends on the current distances and a variable multiplier. The multiplier starts at 0.4 and ends at a small value. The multiplier is reduced at each iteration.
The purpose of the Sammon map is to obtain a visualization of the samples and hopefully gain some biological insight. The Sammon map is an alternative to the PCA map. The Sammon map is designed so that samples close to each other remain close after the projection, which is not guaranteed by the PCA map.
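The pairwise update can be illustrated with the sketch below. It is written from the description above only and is not the app's code: the exact update formula, the random starting positions, the fixed number of iterations, the geometric decay of the multiplier, and the end value 0.01 are all assumptions, and the names `sammon_like_map` and `stress` are hypothetical. Only the starting multiplier 0.4 is taken from the text.

```python
import math
import random


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress(high, low):
    """Sum of squared differences between original and projected distances."""
    n = len(high)
    return sum((distance(high[i], high[j]) - distance(low[i], low[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def sammon_like_map(high, dimension=2, iterations=200, seed=0):
    rng = random.Random(seed)
    # Assumption: random starting positions in the low dimensional space.
    low = [[rng.random() for _ in range(dimension)] for _ in high]
    start, end = 0.4, 0.01  # 0.4 is from the text, 0.01 is an assumption
    for iteration in range(iterations):
        # Assumption: the multiplier decays geometrically from start to end.
        multiplier = start * (end / start) ** (iteration / max(1, iterations - 1))
        for i in range(len(high)):
            for j in range(i + 1, len(high)):
                target = distance(high[i], high[j])
                current = distance(low[i], low[j])
                if current == 0.0:
                    continue  # no direction to move along
                # Positive factor: points too close, push apart; negative: pull together.
                factor = multiplier * (target - current) / current
                for axis in range(dimension):
                    shift = factor * (low[i][axis] - low[j][axis])
                    low[i][axis] += shift
                    low[j][axis] -= shift
    return low


high = [[0, 0, 0], [3, 0, 0], [0, 4, 0], [3, 4, 1]]
low = sammon_like_map(high)
print(stress(high, low))
```

The function `stress` is included only to check how well the projected distances match the original ones; it is not part of the update itself.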
Self organizing map
The self organizing map was invented by Kohonen (Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics 43 (1): 59–69).
The map works by mapping a grid of units into the space of the molecular values of the samples. In the Amberbio app the grid is made of hexagonal units with a user selected number of rows and columns. In the molecular space, only molecules without missing values are used.
The map is calculated using an iterative process. At initialization, the units are given an arbitrary starting position. In the Amberbio app, the starting positions are distinct linear combinations of the samples. In the iterative process, a random sample is chosen at each step. The random number generator is seeded such that the same map is generated on the same samples for repeat runs of the algorithm. The unit closest to the random sample is found. Call this unit the special unit. The positions of all units are updated such that they move closer to the random sample. Each unit is updated to a weighted average of its current position and the position of the random sample. The weight is largest for the special unit and falls off according to
where the constants K and C themselves decrease exponentially for each iteration.
After generating the map, the samples are mapped to their closest unit. The plot in the app shows the grid of units with the embedded samples. In this way, the self organizing map can be used for unsupervised clustering and for visualization of the samples.
The borders between the neighboring units are colored according to the distance between the units in the molecular sample space. The borders of proximate units are light blue and the borders of distant units are dark blue.
The self organizing map can be thought of as a sheet that attempts to wrap around the samples in the high dimensional space of molecular values. The samples are projected onto the sheet and the plot shows the two dimensional grid of units with the samples embedded.