A bot by any other name…

What’s a bot?

Understanding bots through 4 interface categories

Bot Landscape


Conversational Bots — AI Arms Race (FT.com)



Google uses Assistant to square up to Siri in AI arms race

New software designed to usher in a more natural and intelligent form of human-computer interaction

Public (large) data sets for machine learning




Cross-disciplinary data repositories, data collections and data search engines:

  1. http://usgovxml.com
  2. http://aws.amazon.com/datasets
  3. http://databib.org
  4. http://datacite.org
  5. http://figshare.com
  6. http://linkeddata.org
  7. http://reddit.com/r/datasets
  8. http://thewebminer.com/
  9. http://thedatahub.org alias http://ckan.net
  10. http://quandl.com
  11. Social Network Analysis Interactive Dataset Library (Social Network Datasets)
  12. Datasets for Data Mining
  13. http://enigma.io
  14. http://www.ufindthem.com/

Single datasets and data repositories

  1. http://archive.ics.uci.edu/ml/
  2. http://crawdad.org/
  3. http://data.austintexas.gov
  4. http://data.cityofchicago.org
  5. http://data.govloop.com
  6. http://data.gov.uk/
  7. http://data.gov.in
  8. http://data.medicare.gov
  9. http://data.seattle.gov
  10. http://data.sfgov.org
  11. http://data.sunlightlabs.com
  12. https://datamarket.azure.com/
  13. http://developer.yahoo.com/geo/g…
  14. http://econ.worldbank.org/datasets
  15. http://en.wikipedia.org/wiki/Wik…
  16. http://factfinder.census.gov/ser…
  17. http://ftp.ncbi.nih.gov/
  18. http://gettingpastgo.socrata.com
  19. http://googleresearch.blogspot.c…
  20. http://books.google.com/ngrams/
  21. http://medihal.archives-ouvertes.fr
  22. http://public.resource.org/
  23. http://rechercheisidore.fr
  24. http://snap.stanford.edu/data/in…
  25. http://timetric.com/public-data/
  26. https://wist.echo.nasa.gov/~wist…
  27. http://www2.jpl.nasa.gov/srtm
  28. http://www.archives.gov/research…
  29. http://www.bls.gov/
  30. http://www.crunchbase.com/
  31. http://www.dartmouthatlas.org/
  32. http://www.data.gov/
  33. http://www.datakc.org
  34. http://dbpedia.org
  35. http://www.delicious.com/jbaldwi…
  36. http://www.faa.gov/data_research/
  37. http://www.factual.com/
  38. http://research.stlouisfed.org/f…
  39. http://www.freebase.com/
  40. http://www.google.com/publicdata…
  41. http://www.guardian.co.uk/news/d…
  42. http://www.infochimps.com
  43. http://www.kaggle.com/
  44. http://build.kiva.org/
  45. http://www.nationalarchives.gov….
  46. http://www.nyc.gov/html/datamine…
  47. http://www.ordnancesurvey.co.uk/…
  48. http://www.philwhln.com/how-to-g…
  49. http://www.imdb.com/interfaces
  50. http://imat-relpred.yandex.ru/en…
  51. http://www.dados.gov.pt/pt/catal…
  52. http://knoema.com
  53. http://daten.berlin.de/
  54. http://www.qunb.com
  55. http://databib.org/
  56. http://datacite.org/
  57. http://data.reegle.info/
  58. http://data.wien.gv.at/
  59. http://data.gov.bc.ca
  60. https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
  61. http://www.icpsr.umich.edu/icpsrweb/CPES/ – Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys focused on the major ethnic groups, studying psychiatric illnesses and health services use)
  62. http://www.dati.gov.it
  63. http://dati.trentino.it

  64. http://www.databagg.com/
  65. http://networkrepository.com – Network/ML data repository with visual interactive analytics
  66. Home (United Nations Environment Programme GRID-Geneva; a lot of GIS datasets)

More than 1 TB

  • The 1000 Genomes project makes 260 TB of human genome data available [13]
  • The Internet Archive is making an 80 TB web crawl available for research [17]
  • The TREC conference made the ClueWeb09 [3] dataset available a few years back. You’ll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 [21] is now available, as are the Freebase annotations, FACC1 [22].
  • CNetS at Indiana University makes a 2.5 TB click dataset available [19]
  • ICWSM made a large corpus of blog posts available for their 2011 conference [2]. You’ll have to register (with an actual form, not an online one), but it’s free. It’s about 2.1 TB compressed.
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project [11], is 1.1 TB in size. There are several others over 100 GB in size.

More than 1 GB

  • The Reference Energy Disaggregation Data Set [12] has data on home energy use; it’s about 500 GB compressed.
  • The Tiny Images dataset [10] has 227 GB of image data and 57 GB of metadata.
  • The ImageNet dataset [18] is pretty big.
  • The MOBIO dataset [14] is about 135 GB of video and audio data.
  • The Yahoo! Webscope program [7] makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup [9], from Yahoo! Music, which is a bit over 1 GB.
  • Google made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
  • Yandex has recently made a very large web search click dataset available [1]. You’ll have to register online for the contest to download. It’s about 5.6 GB compressed.
  • Freebase makes regular data dumps available [5]. The largest is their Quad dump [4], which is about 3.6 GB compressed.
  • The Open American National Corpus [8] is about 4.8 GB uncompressed.
  • Wikipedia made a dataset containing information about edits available for a recent Kaggle competition [6]. The training dataset is about 2.0 GB uncompressed.
  • The Research and Innovative Technology Administration (RITA) has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download [16].
  • The wiki-links data made available by Google is about 1.75 GB total [20].

[1] http://imat-relpred.yandex.ru/en…
[2] http://www.icwsm.org/2011/data.php
[3] http://lemurproject.org/clueweb0…
[4] http://wiki.freebase.com/wiki/Da…
[5] http://download.freebase.com/dat…
[6] http://www.kaggle.com/c/wikichal…
[7] http://webscope.sandbox.yahoo.co…
[8] http://americannationalcorpus.or…
[9] http://kddcup.yahoo.com/datasets…
[10] http://horatio.cs.nyu.edu/mit/ti…
[11] https://proteomecommons.org/data…
[12] http://redd.csail.mit.edu/
[13] http://www.1000genomes.org/ftpse…
[14] https://www.idiap.ch/dataset/mobio
[15] http://www-nlp.stanford.edu/pubs…
[16] http://stat-computing.org/dataex…
[17] http://blog.archive.org/2012/10/…
[18] http://www.image-net.org/index
[19] http://cnets.indiana.edu/groups/…
[20] wiki-links – Wikipedia Links Data – Google Project Hosting
[21] The ClueWeb12 Dataset
[22] ClueWeb12 Related Data

Google BigQuery is an awesome place to share open datasets: Once data is loaded in BigQuery, you can make it public – allowing others to instantly analyze it using just SQL.

See a list of some of the amazing datasets shared on BigQuery: http://www.reddit.com/r/bigquery…
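To give a feel for what "using just SQL" means in practice, here is a minimal local sketch. It does not call BigQuery at all: it uses Python's built-in sqlite3 module as a stand-in, and the table name `rides`, its columns, and its rows are made up purely for illustration.

```python
import sqlite3

# In-memory SQLite database, used only as a local stand-in for a hosted SQL service.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows; not taken from any real public dataset.
conn.execute("CREATE TABLE rides (station TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("A", 10.0), ("A", 20.0), ("B", 5.0)],
)

# A single SQL statement does the whole analysis: count and average per station.
result = conn.execute(
    "SELECT station, COUNT(*), AVG(minutes) "
    "FROM rides GROUP BY station ORDER BY station"
).fetchall()
conn.close()

print(result)
```

On a hosted service the query text would look much the same; only the connection and the table reference would differ.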


The following list of data sources was last modified on 3/18/14. Most of the data sets listed below are free; however, some are not.

If an (R) appears after a source, the data are already in R format or there exist R commands for importing the data directly from within R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing CSV files. What follows is a list of data sources organized into categories that are not mutually exclusive but that reflect what’s out there.
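For sources that only offer CSV downloads, the import step is usually short. The following is a minimal sketch in Python using only the standard library. The inline sample data, the column names `date` and `value`, and the helper `load_rows` are assumptions made for illustration; a real file from any of the sources below will have its own columns.

```python
import csv
import io

# Hypothetical CSV content standing in for a file downloaded from one of the sources.
SAMPLE_CSV = """date,value
2014-01-01,1.5
2014-02-01,2.0
2014-03-01,2.5
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, converting the 'value' column to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"date": row["date"], "value": float(row["value"])} for row in reader]

rows = load_rows(SAMPLE_CSV)

# A quick sanity check on the imported data: row count and mean of the numeric column.
mean_value = sum(r["value"] for r in rows) / len(rows)
print(len(rows), mean_value)
```

To read from disk instead, the same `csv.DictReader` can be wrapped around a file opened with `open(path, newline="")`.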


Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
Gapminder: http://www.gapminder.org/data/
UMD: http://inforumweb.umd.edu/econdata/econdata.html
World Bank: http://data.worldbank.org/indicator

Data Science Practice

This section contains data sets used in the book “Doing Data Science” by Rachel Schutt and Cathy O’Neil (O’Reilly 2014)
Datasets on the book site: https://github.com/oreillymedia/doing_data_science
Enron Email Dataset: http://www.cs.cmu.edu/~enron/
GetGlue (time stamped events: users rating TV shows): http://bit.ly/1aL8XS0
Titanic Survival Data Set: http://bit.ly/1kJ4pkF
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/


Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance (R)
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
St Louis Fed: http://research.stlouisfed.org/fred2/ (R)
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/ (R)
Quandl: http://www.quandl.com/
Yahoo Finance: http://finance.yahoo.com/ (R)


Government

Archived national government statistics: http://www.archive-it.org/
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
DataMarket: http://datamarket.com/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London, U.K. data: http://data.london.gov.uk/catalogue
New Zealand: http://www.stats.govt.nz/tools_and_services/tools/TableBuilder/tables-by…
NYC data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
U.K. Government Data: http://data.gov.uk/data
United Nations: http://data.un.org/
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Federal Government Agencies: http://www.data.gov/metric
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
The World Bank: http://wdronline.worldbank.org/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/

Health Care

Gapminder: http://www.gapminder.org/data/

Machine Learning

Amazon Web Services Data: http://aws.amazon.com/datasets
Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
AppliedPredictiveModeling (R package): http://bit.ly/16wyvkG
Australian Weather: http://www.bom.gov.au/climate/dwo/
Causality Workbench: http://www.causality.inf.ethz.ch/repository.php
Edge data for US domestic flights 1990 to 2009: http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009
Infochimps (Tag = Bigdata): http://www.infochimps.com/tags/bigdata?page=1
Kaggle competition data: http://www.kaggle.com/
KDNuggets competition site: http://www.kdnuggets.com/datasets/
The Koblenz Network Collection: http://konect.uni-koblenz.de/
Machine Learning Data Set Repository: http://mldata.org/
Medicare Data File: http://go.cms.gov/19xxPN4
Microsoft Research: http://research.microsoft.com/apps/dp/dl/downloads.aspx
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More song datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining.com R and Data Mining ebook data: http://www.rdatamining.com/data
The Revolution Analytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
53.5 billion clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset


Networks

Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/

Public Domain Collections

Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Factual: http://www.factual.com/topics/browse
Freebase: http://www.freebase.com/
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
numbrary: http://numbrary.com/
Quora: http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-pu…
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html (R)
SourceForge Research Data: http://www.nd.edu/~oss/Data/data.html
StatSci.org: http://www.statsci.org/datasets.html
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Stats4Stem.org: R data sets: http://www.stats4stem.org/data-sets.html (R)
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html


Science

Agricultural Experiments: http://www.inside-r.org/packages/cran/agridat/docs/agridat (R)
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/ (R)
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu//

Social Sciences

General Social Survey: http://www3.norc.org/GSS+Website/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
Pew Research: http://www.pewinternet.org/datasets/pages/2/
SNAP: http://snap.stanford.edu/data/index.html
UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UPJOHN INST: http://www.upjohn.org/erdc/erdc.html

Time Series

Time Series data Library: http://robjhyndman.com/TSDL/


Universities

Carnegie Mellon University Enron email: http://www.cs.cmu.edu/~enron/
Carnegie Mellon University StatLab: http://lib.stat.cmu.edu/datasets/
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Carnegie Mellon University JASA data archive: http://lib.stat.cmu.edu/jasadata/
Ohio State University Financial data: http://fisher.osu.edu/fin/osudata.htm
UC Berkeley: http://ucdata.berkeley.edu/
UCLA: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
University of Toronto: http://www.cs.toronto.edu/~delve/data/datasets.html

Visualizing data scientist skills

A few cool visual depictions of the skills and knowledge a data scientist is expected to possess.

1. The original data science venn diagram by Drew Conway



2. Data Science Disciplines



3. Data Science Venn Diagram v2.0 by Steve Geringer




4. The long road to becoming a data scientist — a curriculum metromap by Swami Chandrasekaran



5. Machine Learning Skills Pyramid by Steve Geringer



6. The 8 skills of data scientists from Accenture



7. Visual re-design of the Accenture graphic by Aleksey Nozdryn-Plotnicki



8. The stars of data science from Columbia University



9. The 4 categories of data scientists from O’Reilly



