dataset-collection

Cross-disciplinary data repositories, data collections and data search engines:

Single datasets and data repositories

http://archive.ics.uci.edu/ml/
http://crawdad.org/
http://data.austintexas.gov
http://data.cityofchicago.org
http://data.govloop.com
http://data.gov.uk/
http://data.gov.in
http://data.medicare.gov
http://data.seattle.gov
http://data.sfgov.org
http://data.sunlightlabs.com
https://datamarket.azure.com/
http://developer.yahoo.com/geo/g...
http://econ.worldbank.org/datasets
http://en.wikipedia.org/wiki/Wik...
http://factfinder.census.gov/ser...
http://ftp.ncbi.nih.gov/
http://gettingpastgo.socrata.com
http://googleresearch.blogspot.c...
http://books.google.com/ngrams/
http://medihal.archives-ouvertes.fr
http://public.resource.org/
http://rechercheisidore.fr
http://snap.stanford.edu/data/in...
http://timetric.com/public-data/
https://wist.echo.nasa.gov/~wist...
http://www2.jpl.nasa.gov/srtm
http://www.archives.gov/research...
http://www.bls.gov/
http://www.crunchbase.com/
http://www.dartmouthatlas.org/
http://www.data.gov/
http://www.datakc.org
http://dbpedia.org
http://www.delicious.com/jbaldwi...
http://www.faa.gov/data_research/
http://www.factual.com/
http://research.stlouisfed.org/f...
http://www.freebase.com/
http://www.google.com/publicdata...
http://www.guardian.co.uk/news/d...
http://www.infochimps.com
http://www.kaggle.com/
http://build.kiva.org/
http://www.nationalarchives.gov....
http://www.nyc.gov/html/datamine...
http://www.ordnancesurvey.co.uk/...
http://www.philwhln.com/how-to-g...
http://www.imdb.com/interfaces
http://imat-relpred.yandex.ru/en...
http://www.dados.gov.pt/pt/catal...
http://knoema.com
http://daten.berlin.de/
http://www.qunb.com
http://databib.org/
http://datacite.org/
http://data.reegle.info/
http://data.wien.gv.at/
http://data.gov.bc.ca
https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys, each focused on one of the major ethnic groups, studying psychiatric illnesses and health services use)
http://www.dati.gov.it
http://dati.trentino.it
http://www.databagg.com/
http://networkrepository.com - Network/ML data repository with interactive visual analytics
Home (United Nations Environment Programme GRID-Geneva; many GIS datasets)

More than 1 TB

  • The 1000 Genomes project makes 260 TB of human genome data available [13]
  • The Internet Archive is making an 80 TB web crawl available for research [17]
  • The TREC conference made the ClueWeb09 [3] dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22]
  • CNetS at Indiana University makes a 2.5 TB click dataset available [19]
  • ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You'll have to register (using a paper form, not an online one), but it's free. It's about 2.1 TB compressed.
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.

More than 1 GB