https://www.ncbi.nlm.nih.gov/geo/ - Gene Expression Omnibus
https://www.ebi.ac.uk/arrayexpress/ - ArrayExpress
https://datamed.org/ - genomics database search engine
http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi - Cancer program datasets from Broad Institute.
http://bioinformatics.mdanderson.org/main/Public_Datasets - MD Anderson public datasets
https://github.com/ramhiser/datamicroarray - A collection of small-sample, high-dimensional microarray data sets to assess machine-learning algorithms and models.
Selected datasets used in Nguyen, Tin, Rebecca Tagett, Diana Diaz, and Sorin Draghici. “A Novel Approach for Data Integration and Disease Subtyping.” Genome Research, October 24, 2017, gr.215129.116. https://doi.org/10.1101/gr.215129.116. Available as RData files from “PINS: A novel method for data integration and disease subtyping”, http://www.cs.wayne.edu/tinnguyen/PINS/PINS.html
Gene array prediction of AML transformation in MDS, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15061. GSE15061 includes 366 leukemia-related samples (202 acute myeloid leukemias and 164 myelodysplastic syndromes).
AML2004, from Jean-Philippe Brunet, Pablo Tamayo, Todd Golub, and Jill Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proc. Natl. Acad. Sci. USA 2004 101: 4164-4169. Published: 2004.03.22. https://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=89. Subtype information for AML2004 is described in Brunet et al. and is available in the file “ALL AML samples.txt” on the website. AML2004 includes 38 leukemia samples (11 acute myeloid leukemia, 19 acute lymphoblastic leukemia B cell, and 8 T cell).
BRAIN2002, Gene Expression-Based Classification and Outcome Prediction of Central Nervous System Embryonal Tumors, https://archive.broadinstitute.org/mpr/CNS/. The subtype information for this dataset is described in Pomeroy et al. (data set A) and is available in the file “Brain samples clinical table.xls” on the website. This dataset consists of 42 samples (10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors, 4 normal cerebellums, and 8 primitive neuroectodermal tumors).
LUNG2001, Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Sub-classes, http://archive.broadinstitute.org/mpr/lung/. Subtype information for LUNG2001 is available in the file “datasetA scans.txt” on the website. This dataset consists of 237 lung cancer samples (190 adenocarcinomas, 21 squamous cell carcinomas, 20 carcinoids, and 6 small-cell lung carcinomas).
136 datasets used by Hung, Jui-Hung, Tun-Hsiang Yang, Zhenjun Hu, Zhiping Weng, and Charles DeLisi. “Gene Set Enrichment Analysis: Performance Evaluation and Usage Guidelines.” Briefings in Bioinformatics 2012, Supplementary Data 1 http://bib.oxfordjournals.org/content/13/3/281/suppl/DC1
GDS1020 GDS1023 GDS1036 GDS1048 GDS1094 GDS1209 GDS1212 GDS1220 GDS1221 GDS1235 GDS1257 GDS1258 GDS1282 GDS1284 GDS1329 GDS1331 GDS1344 GDS1362 GDS1375 GDS1388 GDS1390 GDS1412 GDS1413 GDS1424 GDS1449 GDS1454 GDS1479 GDS1480 GDS1481 GDS1495 GDS1496 GDS1497 GDS1498 GDS1559 GDS1562 GDS1615 GDS1627 GDS1650 GDS1663 GDS1673 GDS1688 GDS1714 GDS1726 GDS1733 GDS1815 GDS1816 GDS1852 GDS1875 GDS1917 GDS1956 GDS1960 GDS1962 GDS1972 GDS1974 GDS1975 GDS1976 GDS1993 GDS2055 GDS2056 GDS2118 GDS2171 GDS2190 GDS2191 GDS2206 GDS2250 GDS2255 GDS2297 GDS2341 GDS2362 GDS2373 GDS2384 GDS2415 GDS2486 GDS2489 GDS2495 GDS2513 GDS2518 GDS2519 GDS2520 GDS2535 GDS2545 GDS2546 GDS2547 GDS2601 GDS2609 GDS2611 GDS2615 GDS2626 GDS2642 GDS2643 GDS2655 GDS2656 GDS2678 GDS2735 GDS2736 GDS2737 GDS2740 GDS274 GDS2767 GDS2771 GDS2772 GDS2785 GDS2819 GDS2822 GDS2824 GDS2831 GDS2835 GDS2842 GDS2855 GDS2922 GDS2926 GDS2935 GDS2942 GDS2960 GDS360 GDS395 GDS449 GDS495 GDS532 GDS534 GDS609 GDS610 GDS611 GDS612 GDS651 GDS690 GDS711 GDS724 GDS737 GDS806 GDS807 GDS841 GDS843 GDS946 GDS963 GDS987
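The accessions above are GEO DataSets (GDS) identifiers, so a short script can turn them into download links. The sketch below is a minimal example, not part of the original list: the function names `parse_accessions` and `gds_soft_url` are hypothetical, and the URL layout (accessions grouped into `GDS<thousands>nnn` directories on the NCBI file server, with a `soft/<accession>.soft.gz` file inside) is an assumption based on how GEO organizes its DataSets downloads. Verify a URL or two by hand before relying on it.

```python
import re

# Assumed base URL for GEO DataSets on the NCBI file server.
GEO_DATASETS_BASE = "https://ftp.ncbi.nlm.nih.gov/geo/datasets"


def parse_accessions(text: str) -> list[str]:
    """Split a whitespace-separated accession list into individual IDs."""
    return text.split()


def gds_soft_url(accession: str) -> str:
    """Build the assumed SOFT-file URL for a GDS accession such as 'GDS1020'."""
    match = re.fullmatch(r"GDS(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GDS accession: {accession!r}")
    digits = match.group(1)
    # Assumed directory scheme: drop the last three digits and append 'nnn',
    # so GDS1020 falls under GDS1nnn and GDS274 falls under GDSnnn.
    bucket = "GDS" + digits[:-3] + "nnn"
    return f"{GEO_DATASETS_BASE}/{bucket}/{accession}/soft/{accession}.soft.gz"


if __name__ == "__main__":
    # A few accessions copied from the list above.
    sample = "GDS1020 GDS274 GDS2960 GDS987"
    for acc in parse_accessions(sample):
        print(acc, gds_soft_url(acc))
```

The script only builds strings and performs no network access, so the links can be inspected or passed to a downloader of your choice.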