- http://usgovxml.com
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://reddit.com/r/datasets
- http://thedatahub.org (also known as http://ckan.net)
- http://quandl.com
- Social Network Analysis Interactive Dataset Library (Social Network Datasets)
- kdnuggets.com: Datasets for Data Mining
- http://enigma.io
- http://www.ufindthem.com/
http://archive.ics.uci.edu/ml/
http://crawdad.org/
http://data.austintexas.gov
http://data.cityofchicago.org
http://data.govloop.com
http://data.gov.uk/
http://data.gov.in
http://data.medicare.gov
http://data.seattle.gov
http://data.sfgov.org
http://data.sunlightlabs.com
https://datamarket.azure.com/
http://developer.yahoo.com/geo/g...
http://econ.worldbank.org/datasets
http://en.wikipedia.org/wiki/Wik...
http://factfinder.census.gov/ser...
http://ftp.ncbi.nih.gov/
http://gettingpastgo.socrata.com
http://googleresearch.blogspot.c...
http://books.google.com/ngrams/
http://medihal.archives-ouvertes.fr
http://public.resource.org/
http://rechercheisidore.fr
http://snap.stanford.edu/data/in...
http://timetric.com/public-data/
https://wist.echo.nasa.gov/~wist...
http://www2.jpl.nasa.gov/srtm
http://www.archives.gov/research...
http://www.bls.gov/
http://www.crunchbase.com/
http://www.dartmouthatlas.org/
http://www.data.gov/
http://www.datakc.org
http://dbpedia.org
http://www.delicious.com/jbaldwi...
http://www.faa.gov/data_research/
http://www.factual.com/
http://research.stlouisfed.org/f...
http://www.freebase.com/
http://www.google.com/publicdata...
http://www.guardian.co.uk/news/d...
http://www.infochimps.com
http://www.kaggle.com/
http://build.kiva.org/
http://www.nationalarchives.gov....
http://www.nyc.gov/html/datamine...
http://www.ordnancesurvey.co.uk/...
http://www.philwhln.com/how-to-g...
http://www.imdb.com/interfaces
http://imat-relpred.yandex.ru/en...
http://www.dados.gov.pt/pt/catal...
http://knoema.com
http://daten.berlin.de/
http://www.qunb.com
http://data.reegle.info/
http://data.wien.gv.at/
http://data.gov.bc.ca
https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys, focused on the major ethnic groups, that study psychiatric illnesses and health services use)
http://www.dati.gov.it
http://dati.trentino.it
64. http://www.databagg.com/
65. http://networkrepository.com - Network/ML data repository with interactive visual analytics
66. United Nations Environment Programme GRID-Geneva (a lot of GIS datasets)
- The 1000 Genomes project makes 260 TB of human genome data available [13]
- The Internet Archive is making an 80 TB web crawl available for research [17]
- The TREC conference made the ClueWeb09 [3] dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
- ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
- CNetS at Indiana University makes a 2.5 TB click dataset available [19]
- ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You'll have to register (with a paper form, not an online one), but it's free. It's about 2.1 TB compressed.
- The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.
- The Reference Energy Disaggregation Data Set [12] has data on home energy use; it's about 500 GB compressed.
- The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
- The ImageNet dataset [18] is pretty big.
- The MOBIO dataset [14] is about 135 GB of video and audio data.
- The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
- Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
- Yandex has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download. It's about 5.6 GB compressed.
- Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
- The Open American National Corpus [8] is about 4.8 GB uncompressed.
- Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
- The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
- The wiki-links data made available by Google is about 1.75 GB total [20].
- [10] http://horatio.cs.nyu.edu/mit/ti...
- [11] https://proteomecommons.org/data...
- [13] http://www.1000genomes.org/ftpse...
- [15] http://www-nlp.stanford.edu/pubs...
- [16] http://stat-computing.org/dataex...
- [17] http://blog.archive.org/2012/10/...
- [19] http://cnets.indiana.edu/groups/...
- [20] wiki-links - Wikipedia Links Data - Google Project Hosting
- [21] The ClueWeb12 Dataset
- [22] ClueWeb12 Related Data: