Public (large) data sets for machine learning

http://webscope.sandbox.yahoo.com/#datasets

http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75

https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

Cross-disciplinary data repositories, data collections and data search engines:

  1. http://usgovxml.com
  2. http://aws.amazon.com/datasets
  3. http://databib.org
  4. http://datacite.org
  5. http://figshare.com
  6. http://linkeddata.org
  7. http://reddit.com/r/datasets
  8. http://thewebminer.com/
  9. http://thedatahub.org alias http://ckan.net
  10. http://quandl.com
  11. Social Network Analysis Interactive Dataset Library (Social Network Datasets)
  12. Datasets for Data Mining
  13. http://enigma.io
  14. http://www.ufindthem.com/

Single datasets and data repositories

  1. http://archive.ics.uci.edu/ml/
  2. http://crawdad.org/
  3. http://data.austintexas.gov
  4. http://data.cityofchicago.org
  5. http://data.govloop.com
  6. http://data.gov.uk/
  7. data.gov.in
  8. http://data.medicare.gov
  9. http://data.seattle.gov
  10. http://data.sfgov.org
  11. http://data.sunlightlabs.com
  12. https://datamarket.azure.com/
  13. http://developer.yahoo.com/geo/g…
  14. http://econ.worldbank.org/datasets
  15. http://en.wikipedia.org/wiki/Wik…
  16. http://factfinder.census.gov/ser…
  17. http://ftp.ncbi.nih.gov/
  18. http://gettingpastgo.socrata.com
  19. http://googleresearch.blogspot.c…
  20. http://books.google.com/ngrams/
  21. http://medihal.archives-ouvertes.fr
  22. http://public.resource.org/
  23. http://rechercheisidore.fr
  24. http://snap.stanford.edu/data/in…
  25. http://timetric.com/public-data/
  26. https://wist.echo.nasa.gov/~wist…
  27. http://www2.jpl.nasa.gov/srtm
  28. http://www.archives.gov/research…
  29. http://www.bls.gov/
  30. http://www.crunchbase.com/
  31. http://www.dartmouthatlas.org/
  32. http://www.data.gov/
  33. http://www.datakc.org
  34. http://dbpedia.org
  35. http://www.delicious.com/jbaldwi…
  36. http://www.faa.gov/data_research/
  37. http://www.factual.com/
  38. http://research.stlouisfed.org/f…
  39. http://www.freebase.com/
  40. http://www.google.com/publicdata…
  41. http://www.guardian.co.uk/news/d…
  42. http://www.infochimps.com
  43. http://www.kaggle.com/
  44. http://build.kiva.org/
  45. http://www.nationalarchives.gov….
  46. http://www.nyc.gov/html/datamine…
  47. http://www.ordnancesurvey.co.uk/…
  48. http://www.philwhln.com/how-to-g…
  49. http://www.imdb.com/interfaces
  50. http://imat-relpred.yandex.ru/en…
  51. http://www.dados.gov.pt/pt/catal…
  52. http://knoema.com
  53. http://daten.berlin.de/
  54. http://www.qunb.com
  55. http://databib.org/
  56. http://datacite.org/
  57. http://data.reegle.info/
  58. http://data.wien.gv.at/
  59. http://data.gov.bc.ca
  60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
  61. http://www.icpsr.umich.edu/icpsrweb/CPES/ – Collaborative Psychiatric Epidemiology Surveys: (A collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
  62. http://www.dati.gov.it
  63. http://dati.trentino.it

  64. http://www.databagg.com/
  65. http://networkrepository.com – Network/ML data repository w/ visual interactive analytics
  66. Home (United Nations Environment Programme GRID-Geneva; a lot of GIS datasets)

More than 1 TB

  • The 1000 Genomes project makes 260 TB of human genome data available [13].
  • The Internet Archive is making an 80 TB web crawl available for research [17].
  • The TREC conference made the ClueWeb09 [3] dataset available a few years back. You’ll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22].
  • CNetS at Indiana University makes a 2.5 TB click dataset available [19].
  • ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You’ll have to register (on a paper form, not an online one), but it’s free. It’s about 2.1 TB compressed.
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.
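The sizes quoted above help explain why ClueWeb09 ships by sneakernet rather than by download. The sketch below estimates how long each dataset would take to transfer over a network link. The 100 Mbit/s link speed and the function name transfer_days are assumptions made for illustration; the sizes are the ones given in the list, read as decimal terabytes.

```python
def transfer_days(size_tb: float, link_mbit_per_s: float) -> float:
    """Days needed to move size_tb terabytes over a link of the given speed."""
    size_bits = size_tb * 1e12 * 8                 # decimal terabytes -> bits
    seconds = size_bits / (link_mbit_per_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 86_400                        # 86,400 seconds per day


# Sizes (in TB) as quoted in the list above.
datasets = {
    "1000 Genomes": 260,
    "Internet Archive crawl": 80,
    "ClueWeb09 (compressed)": 5,
    "CNetS click dataset": 2.5,
    "ICWSM 2011 blogs (compressed)": 2.1,
    "Personal Genome Project": 1.1,
}

# The 100 Mbit/s link speed is an assumption, not a figure from the list.
for name, size_tb in datasets.items():
    print(f"{name}: {transfer_days(size_tb, 100):.1f} days")
```

Under that assumed link speed, the 5 TB ClueWeb09 collection would take roughly 4.6 days of uninterrupted transfer, which is why shipping disks can be the more practical option.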


More than 1 GB

  • The Reference Energy Disaggregation Data Set [12] has data on home energy use; it’s about 500 GB compressed.
  • The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
  • The ImageNet dataset [18] is pretty big.
  • The MOBIO dataset [14] is about 135 GB of video and audio data.
  • The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
  • Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
  • Yandex has recently made a very large web search click dataset available [1]. You’ll have to register online for the contest to download. It’s about 5.6 GB compressed.
  • Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
  • The Open American National Corpus [8] is about 4.8 GB uncompressed.
  • Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
  • The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
  • The wiki-links data made available by Google is about 1.75 GB total [20].


[1] http://imat-relpred.yandex.ru/en…
[2] http://www.icwsm.org/2011/data.php
[3] http://lemurproject.org/clueweb0…
[4] http://wiki.freebase.com/wiki/Da…
[5] http://download.freebase.com/dat…
[6] http://www.kaggle.com/c/wikichal…
[7] http://webscope.sandbox.yahoo.co…
[8] http://americannationalcorpus.or…
[9] http://kddcup.yahoo.com/datasets…
[10] http://horatio.cs.nyu.edu/mit/ti…
[11] https://proteomecommons.org/data…
[12] http://redd.csail.mit.edu/
[13] http://www.1000genomes.org/ftpse…
[14] https://www.idiap.ch/dataset/mobio
[15] http://www-nlp.stanford.edu/pubs…
[16] http://stat-computing.org/dataex…
[17] http://blog.archive.org/2012/10/…
[18] http://www.image-net.org/index
[19] http://cnets.indiana.edu/groups/…
[20] wiki-links – Wikipedia Links Data – Google Project Hosting
[21] The ClueWeb12 Dataset
[22] ClueWeb12 Related Data:

Google BigQuery is an awesome place to share open datasets: Once data is loaded in BigQuery, you can make it public – allowing others to instantly analyze it using just SQL.

See a list of some of the amazing datasets shared on BigQuery: http://www.reddit.com/r/bigquery…
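To give a feel for what "analyze it using just SQL" means, here is a minimal sketch of an aggregate query. It does not use BigQuery: Python's built-in sqlite3 module stands in for it so the example runs locally, and the page_views table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a shared public table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (title TEXT, views INTEGER)")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("Alpha", 120), ("Beta", 45), ("Alpha", 30), ("Gamma", 75)],
)

# Aggregate query: total views per title, largest first.
query = """
    SELECT title, SUM(views) AS total_views
    FROM page_views
    GROUP BY title
    ORDER BY total_views DESC
"""
top_titles = conn.execute(query).fetchall()
print(top_titles)
conn.close()
```

A query of the same shape (SELECT, GROUP BY, ORDER BY) is the kind of analysis a public table invites; the exact table names and SQL dialect on BigQuery will differ from this local stand-in.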

http://www.inside-r.org/howto/finding-data-internet

The following list of data sources was last modified on 3/18/14. Most of the data sets listed below are free; however, some are not.

If an (R) appears after a source, the data are already in R format or R commands exist for importing the data directly into R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what’s out there.
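For the CSV-based sources in the list, the import step might look like the following sketch. It uses Python's standard csv module rather than R, and the column names and values are made up: a real file from any of the sources below will have its own layout, so treat load_rows as a hypothetical helper to adapt.

```python
import csv
import io

# Made-up CSV content standing in for a downloaded file.
sample_csv = """date,series,value
2014-01-01,example,1.5
2014-02-01,example,2.0
2014-03-01,example,2.5
"""


def load_rows(handle):
    """Parse CSV text into a list of dicts, converting 'value' to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["value"] = float(row["value"])
        rows.append(row)
    return rows


# io.StringIO lets the sample text be read like an open file.
rows = load_rows(io.StringIO(sample_csv))
mean_value = sum(r["value"] for r in rows) / len(rows)
print(len(rows), mean_value)
```

With a real download, the same function would be called on an open file, for example inside `with open(path, newline="") as handle:`, where path is wherever the CSV was saved.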

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
Gapminder: http://www.gapminder.org/data/
UMD: http://inforumweb.umd.edu/econdata/econdata.html
World Bank: http://data.worldbank.org/indicator

Data Science Practice

This section contains data sets used in the book “Doing Data Science” by Rachel Schutt and Cathy O’Neil (O’Reilly 2014)
Datasets on the book site: https://github.com/oreillymedia/doing_data_science
Enron Email Dataset: http://www.cs.cmu.edu/~enron/
GetGlue (time stamped events: users rating TV shows): http://bit.ly/1aL8XS0
Titanic Survival Data Set: http://bit.ly/1kJ4pkF
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance (R)
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/ (R)
Quandl: http://www.quandl.com/
Yahoo Finance: http://finance.yahoo.com/ (R)

Government

Archived national government statistics: http://www.archive-it.org/
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
DataMarket: http://datamarket.com/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London, U.K. data: http://data.london.gov.uk/catalogue
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by…
NYC data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
U.K. Government Data: http://data.gov.uk/data
United Nations: http://data.un.org/
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Federal Government Agencies: http://www.data.gov/metric
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
The World Bank: http://wdronline.worldbank.org/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/

Health Care

Gapminder: http://www.gapminder.org/data/

Machine Learning

Amazon Web Services Data: http://aws.amazon.com/datasets
Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
AppliedPredictiveModeling (R package): http://bit.ly/16wyvkG
Australian Weather: http://www.bom.gov.au/climate/dwo/
Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
Edge data for US domestic flights 1990 to 2009: http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
Infochimps (Tag = Bigdata): http://www.infochimps.com/tags/bigdata?page=1
Kaggle competition data: http://www.kaggle.com/
KDNuggets competition site: www.kdnuggets.com/datasets/
The Koblenz Network Collection: http://konect.uni-koblenz.de/
Machine Learning Data Set Repository: http://mldata.org/
Medicare Data File: http://go.cms.gov/19xxPN4
Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More song datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining.com R and Data Mining ebook data: http://www.rdatamining.com/data
The Revolution Analytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
53.5 billion clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset

Networks

Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/

Public Domain Collections

Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Factual: http://www.factual.com/topics/browse
Freebase: http://www.freebase.com/
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
Numbrary: http://numbrary.com/
Quora: http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-pu…
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R)
SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
StatSci.org: http://www.statsci.org/datasets.html
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Stats4Stem.org: R data sets: http://www.stats4stem.org/data-sets.html (R)
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html

Science

Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu//

Social Sciences

General Social Survey: http://www3.norc.org/GSS+Website/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
Pew Research: http://www.pewinternet.org/datasets/pages/2/
SNAP: http://snap.stanford.edu/data/index.html
UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UPJOHN INST: http://www.upjohn.org/erdc/erdc.html

Time Series

Time Series data Library: http://robjhyndman.com/TSDL/

Universities

Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
Carnegie Mellon University StatLab: http://lib.stat.cmu.edu/datasets/
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Carnegie Mellon University JASA data archive: http://lib.stat.cmu.edu/jasadata/
Ohio State University Financial data: http://fisher.osu.edu/fin/osudata.htm
UC Berkeley: http://ucdata.berkeley.edu/
UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html

About VC Ramesh

Artificial Intelligence entrepreneur.