A Resource Provided by The Center for Global Data Visualization
The Pandemic Data Room is a comprehensive global COVID-19 data repository created by a consortium of partners and led by QED Group to improve understanding of the impact of physical distancing policies on social behavior, disease rates, hospital utilization, and local/national economies. This initiative will generate critical information needed to adjust policies to control the outbreak. We hope to bring amazing talent to work on the data and generate new tools that can be used to manage and understand this pandemic.
Contribute Research Questions and Analysis Ideas to the Pandemic Data Room
So that the Pandemic Data Room best reflects the questions being asked in the global health and international development communities, we have created a portal where people can pose the COVID-19 questions they want answered. The question portal is available to Data Challenge participants, who can use it to generate ideas for compelling visualization and analysis tools. Please visit the portal here.
Participate in the COVID-19 Data Challenge using this data resource. Both students and professionals are encouraged to participate. For each track, submissions are judged separately and prizes (1st Place $2000, 2nd Place $1500, 3rd Place $1000, Honorable Mentions $100) are awarded.
Contribute to the Pandemic Data Room by submitting a new data source request in this Google Form. We will evaluate your data source and get back to you soon!
Detailed Description: Flattening the curve depends on the population following policies on physical distancing (e.g., staying home, avoiding contact closer than 6 feet) and hygiene (e.g., washing hands, wearing masks). Getting back to work depends on measuring the effectiveness of policies and behaviors so that governments, institutions, businesses, and individuals can make better decisions on what can be done without triggering new outbreaks. IDS is a data and technology company working to create data collection and analysis tools to better measure compliance with, and the effectiveness of, pandemic safety behaviors.
On April 6th, 2020, IDS conducted an online survey of a nationally representative sample about Physical Distancing and Hygiene Behaviors. The survey was repeated on April 20. See and download the latest results of the Clear Outcomes and IDS nationwide poll, a summary of the survey results, and the survey questions here.
Data Resolution: US
File type: .csv
Fraym: Geospatial Data For Covid-19 Prevention and Crisis Response
Detailed Description: The risks posed by coronavirus are especially high for millions of people who live in low- and middle-income countries, where financial, medical equipment, and health personnel resources are highly constrained.
To rapidly identify countries, cities and communities that exhibit the greatest risk of emergency cases and rapid transmission, Fraym provides access to relevant data layers including Emergency Case Risk Factors (Smoking prevalence, Elderly households, Body health - obesity, child stunting, child wasting) and Transmission Risk Factors (Population density, Household size, Occupation, Transportation modes, Hand Washing Practices).
CGDV has requested the above data layers for countries including Guatemala, Kenya, Nigeria, Pakistan, Philippines, Rwanda, Senegal, and South Africa. Each folder should have a data dictionary and a citation guide for use. Download raster files with resolution down to 1 km² in CGDV Google Drive.
Data Resolution: Country
File type: TIF File
GeoPoll: Coronavirus In Sub-Saharan Africa -- Updated April 21
Detailed Description: As an organization that conducts remote research, GeoPoll is taking the initiative to assist the global response to coronavirus. From March 10th – 13th, 2020, GeoPoll administered a first-round survey on knowledge of and perceptions towards coronavirus in South Africa, Kenya, and Nigeria. This survey examined awareness levels, primary information sources, knowledge of how to prevent the virus, and levels of worry.
On April 15th, GeoPoll conducted a second-round survey, How Africans in 12 Nations are Responding to the COVID-19 Outbreak. The remote study examined the effects coronavirus is already having on people throughout the region.
Click links above to read full reports of both surveys. Download a copy of the survey data in CGDV Google Drive.
Data Resolution: County in African countries
File type: Excel
Exovera: COVID-19 Related Articles Published In US Newspapers
Detailed Description: Exovera provides COVID-19 social media data through its robust API platform. Download data files in CGDV Google Drive.
politics_coronavirus_rawdata_Jan012020-Apr072020.json: The US Politics dataset is a set of ~1m articles related to US politics, published since Jan 01 2020 by ~10k local and national US newspapers and online news sources (selected using an Exovera Classifier that tags politics-related content at a high level of recall).
coronavirus_english_topSources_04072020.json: Data from the top 500 largest publishers (in English, by reach) in Exovera's overall dataset. The data is collected via API from social media posts that contain URLs from the top publishers.
coronavirus_general_media_timeseries-04072020.csv: The time series cover coronavirus-related terms and content in all English-language online news and print media that Exovera has access to worldwide. The dataset encompasses 55k sources and uses an initial set of keywords to pull up content. The initial set of search terms has ~15m results with the keywords 'Coronavirus', 'covid-19', 'covid19', "2019-nCoV" and "Sars-COV-2". Data are based around tagging / subtopic detection with labels applied.
Detailed Description: Contains recovered, infected, and fatality case numbers for all countries, at the province level for many countries, and at the county level for the US. Data is sourced from a variety of health organizations around the world.
Data Resolution: Global (some province level), U.S. County
Detailed Description: There are a lot of files in the GitHub repo, but I think only 2 datasets are valuable (case-hosp-death.csv and tests-by-zcta.csv). The case-hosp-death dataset counts cases by date of diagnosis, hospitalizations, and deaths in NYC hospitals. The latter dataset gives cumulative positive cases per zip code.
New York Times Data: Two time-series datasets collected by the New York Times from various U.S. state and local agencies; the first record aligns with the first case in the United States on 21 January 2020.
Detailed Description: Two time-series datasets collected by the New York Times from various state and local government agencies; the first record is the first case in the United States on 21 January 2020. One dataset contains information aggregated at the state-level and the other is information broken down by county. Features contained are: date, county/state, fips, cases, and deaths. NOTE: This source only provides information about positive cases.
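Because the county file is a finer breakdown of the same counts, one quick sanity check is to roll county rows up to the state level. The following is a minimal sketch, not the New York Times' own tooling: the sample rows are made up, and the exact column layout (separate `county` and `state` columns alongside `date`, `fips`, `cases`, and `deaths`) is an assumption based on the features listed above.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column layout is an assumption based on the
# features listed above (date, county/state, fips, cases, deaths).
SAMPLE_COUNTY_CSV = """date,county,state,fips,cases,deaths
2020-03-01,Alpha,StateA,00001,3,0
2020-03-01,Beta,StateA,00002,2,0
2020-03-02,Alpha,StateA,00001,5,1
2020-03-02,Beta,StateA,00002,4,0
2020-03-02,Gamma,StateB,00003,1,0
"""


def state_totals(csv_text):
    """Sum county rows into {(date, state): {"cases": n, "deaths": n}}."""
    totals = defaultdict(lambda: {"cases": 0, "deaths": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["state"])
        totals[key]["cases"] += int(row["cases"])
        totals[key]["deaths"] += int(row["deaths"])
    return dict(totals)


totals = state_totals(SAMPLE_COUNTY_CSV)
print(totals[("2020-03-02", "StateA")])
```

With a real download, the rolled-up totals can be compared against the state-level file to spot rows that do not map to a county.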
INDIA COVID-19 TRACKER: Crowdsourced India COVID-19 data. Some interesting points because it takes data from anyone.
Detailed Description: This is a link to a GitHub repository that is used to crowdsource data about COVID-19 in India. The crowdsourced data has been used to make an HTML page (the link is in the GitHub repository). The data is crowdsourced through Telegram, a social media type application, and it is not thoroughly validated. It is really interesting data about India, but it needs to be used appropriately in analysis. Because it is submitted through a social media platform, some of it is likely incorrect, but it could make fantastic supplementary data.
Detailed Description: The COVID Tracking Project is a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States. Every day, they collect data on COVID-19 testing and patient outcomes from all 50 states, 5 territories, and the District of Columbia. Their dataset is currently in use by national and local news organizations across the US and by research projects and agencies worldwide.
Detailed Description: Testing is our window onto the pandemic and how it is spreading. Without testing we have no way of understanding the pandemic. The goal of Our World in Data is to provide testing data over time for many countries around the world. Alongside the data, they also provide a good understanding of the definitions used and any important limitations they might have. You will also find descriptions of the data for each country.
Detailed Description: 2 files. List of lockdown dates for each country. A lockdown is assumed to be complete when all schools and non-essential businesses are closed. References for each country are also listed, showing where the information was found. Some rows contain a blank province if the row pertains to the whole nation.
Detailed Description: Dates on which each state's or county's stay-at-home order becomes effective as a result of the COVID-19 pandemic. This dataset is updated daily as more states and counties issue stay-at-home orders. Some rows contain a blank county if the row pertains to the whole state.
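The blank-county convention matters when joining this data to county-level case counts: a county may be covered by its own order, by a statewide order, or by both. Here is a minimal sketch of one way to resolve that, assuming hypothetical `(state, county, effective_date)` rows and assuming the earliest applicable order is the one of interest. Neither the field names nor that rule come from the dataset itself.

```python
from datetime import date

# Hypothetical rows: (state, county, effective_date).
# A blank county means the order applies to the whole state.
ORDERS = [
    ("StateA", "", date(2020, 3, 25)),
    ("StateB", "Alpha", date(2020, 3, 20)),
    ("StateB", "", date(2020, 4, 1)),
]


def effective_date(orders, state, county):
    """Return the earliest order date covering the county, or None.

    Both county-specific rows and statewide (blank-county) rows count
    as covering the county.
    """
    matches = [d for s, c, d in orders if s == state and c in ("", county)]
    return min(matches) if matches else None


print(effective_date(ORDERS, "StateB", "Alpha"))  # county order came first
print(effective_date(ORDERS, "StateB", "Beta"))   # falls back to statewide
```

The same fallback idea applies to the lockdown file, where a blank province stands for the whole nation.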
IHME: Institute for Health Metrics and Evaluation COVID-19 Estimate Data
Detailed Description: IHME has produced forecasts which show hospital bed use, need for intensive care beds, and ventilator use due to COVID-19 based on projected deaths for the United States, at the country and subnational level, and countries in the European Economic Area (EEA). Forecasts at the subnational level are included for three EEA countries: Germany, Italy, and Spain. These projections are produced by models based on observed death rates from COVID-19, and include uncertainty intervals. They incorporate information about social distancing and other protective measures and are being updated daily with new data. These forecasts were developed in order to provide hospitals, policy makers, and the public with crucial information about how expected need aligns with existing resources, so that cities and countries can best prepare.
Data Resolution: US, Countries in the European Economic Area (EEA)
Frequency of update: Last updated at 1 p.m. Pacific, April 13, 2020 (as of April 15, 2020)
Detailed Description: The stated purpose for this data is "Does health spending levels (public or private), or hospital staff have any effect on the rate at which Covid-19 spreads in a country? Can we use this data to predict the rate at which Cases or Fatalities will grow?". It contains only data on healthcare expenditures and the amount of healthcare available in countries throughout the world. It contains no direct COVID-19 data, but it could make good supplementary data for a question similar to the one they posed as inspiration.
Detailed Description: GoogleTrends data is phenomenal: it is interesting, important, and can be very insightful, IF IT IS USED CORRECTLY. It can be a little confusing the first time you see it, and the instructions given will help you understand the graphs presented on the GoogleTrends page when you input a search term. However, figuring out how to go further and get more from it is not especially clear. All of the data is given as search intensity, scaled from 0 to 100, where 100 is the maximum search intensity. The maximum search intensity does not tell you the actual number of searches; it marks the point at which the search term peaked, and everything else is scaled to that value. A search intensity of 50 means the term was searched half as often as it was at the peak of 100.
Now, let's put that in context. Google Trends allows you to vary the time period, regional resolution, and the search term(s).
- You can specify a time period of any range dating back to 2004.
- Time periods of less than a week will return hourly data
- Time periods over a week, but less than 269 days (about 9 months, but using 8 is safe) returns daily data
- Time periods over 269 days return weekly data
- You can choose the whole world or a specific country
- The whole world will give you country level comparisons
- Different countries have different levels you can compare from, for example U.S. has a default of comparing states, but you can also choose to compare by metro region.
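The time-period rules above can be captured in a small helper, which is handy when planning queries so that every series comes back at the same granularity. This is only a sketch of the thresholds as described here, not an official Google Trends rule set. The function name is hypothetical, and the handling of the exact boundaries (a range of exactly 7 days or exactly 269 days) is an assumption, since the description leaves them open.

```python
def trends_granularity(days):
    """Return the granularity expected for a time period of `days` days.

    Thresholds follow the description above. Treating exactly 7 days as
    daily and exactly 269 days as weekly is an assumption.
    """
    if days <= 0:
        raise ValueError("time period must be at least one day")
    if days < 7:
        return "hourly"
    if days < 269:
        return "daily"
    return "weekly"


# About 8 months (the "safe" range mentioned above) stays daily.
print(trends_granularity(8 * 30))
```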
Let's start with relative search intensities (i.e. comparing different searches):
- You will specify a time period, and what is returned may be hourly, daily or weekly search intensities.
- Only one term is going to reach 100 over that time period. That point is the highest search intensity across all of the terms you are comparing.
- Every other search intensity is then scaled from that point. No matter which term you are looking at in a relative comparison on GoogleTrends, its search intensity = 100 × (# searches for that term) / (# searches at the peak search intensity).
- GoogleTrends allows you to compare up to five words or phrases at one time. There are ways to overlap time periods and search terms to get a pretty good estimate for comparison, but DO NOT DO THIS UNLESS IT IS ABSOLUTELY NECESSARY. It is very difficult, and a tiny mistake makes all of your data inaccurate.
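To make the scaling in the list above concrete, the sketch below applies it to made-up search counts. It follows the simplified description given here; Google does not publish raw counts, and its actual normalization also accounts for total search volume, so the function name and the numbers are purely illustrative.

```python
def scale_to_peak(counts_by_term):
    """Scale every count to 0-100 relative to the single largest count.

    `counts_by_term` maps each search term to a list of counts, one per
    time step. Only the overall peak across all terms becomes 100.
    """
    peak = max(v for counts in counts_by_term.values() for v in counts)
    if peak == 0:
        return {term: [0] * len(counts) for term, counts in counts_by_term.items()}
    return {
        term: [round(100 * v / peak) for v in counts]
        for term, counts in counts_by_term.items()
    }


# Made-up counts for two terms over three time steps.
raw = {"masks": [10, 40, 20], "gloves": [4, 10, 8]}
scaled = scale_to_peak(raw)
print(scaled)
```

Note that "gloves" never reaches 100 in this comparison, and that the scaled values would all change if a more popular third term were added, which is why numbers from separate queries cannot be compared directly.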
Regional Search Intensities (comparing a term's search intensity based on location):
- You enter a search term and you can specify whether it is the whole world, or one particular country.
- GoogleTrends gives you colored maps representing this data.
- What the actual data has for you is similar to the relative search intensities.
- Only one region within the area and time period you specified will reach 100.
- The rest of the regions are scaled to that region's search intensity, in the same way as for relative search intensity.
*** You can also do regional searches that compare multiple terms, and it is really interesting. However, manipulation of that data is even more difficult, and requires a lot of attention to unravel. It is very easy to make a small mistake, and that small mistake will echo throughout all of the data, again making it worthless.
This is just a brief summary of the data given and of what I have found to be the things to watch out for. Look at the Google Trends descriptions as well for details specific to their user interface. If you still want to dive deeper into this data, there is a library full of research articles using the data, and there are webpages dedicated to manipulating the data to get more out of it. I will just warn you to be careful: the manipulation, overlapping, and other methods of changing the data are always approximations, and not always correct, so read them thoroughly and check that the authors validated their method in some clear and accurate way.
Data Resolution: Global, Country Level, U.S. State Level, U.S. Metro Region Level, Other Countries Have Unique Regional Breakdowns
Postman: COVID-19 API Resource Center: Contains links and detailed information about accessing public feeds from 28 different organizations and topics via application programming interfaces (APIs). Organizations represented include the WHO, CDC, and Johns Hopkins University.
Detailed Description: Contains links and detailed information about accessing public feeds from 28 different organizations and topics via application programming interfaces (APIs). This site contains information to connect to feeds from the WHO, CDC, COVID Tracking Project, and Johns Hopkins University COVID Database, just to name a few. There are examples of how to access an organization's Twitter and YouTube feeds; however, individuals must have the requisite API keys / access tokens to access the information contained on those sites.
SafeGraph Dataset: Data on foot traffic throughout the US. It has the number of times people pass by over 6 million different points of interest in the US.
Detailed Description: This data is based on businesses and consumer hot spots. It uses over 6 million points throughout the US and tracks the amount of foot traffic at each of these points. They give data such as the number of visitors over a certain period, and also offer shapefiles for mapping or any locational visualizations.
Detailed Description: Interesting dataset of social media, including the daily top 1000 terms, bigrams, trigrams, etc. It also contains a cleaned version of the tweet text. Tweet languages include English, Spanish, and French.
COVID-19 Legislation: Interactive site for users to access statewide or nationwide data on all COVID-19 legislation
Detailed Description: Queryable and downloadable data pertaining to United States COVID-19 legislation. The data contains the name of the bill, the region it spans, a description of the legislation, a link to the source, status, last action, date of last action, type (house/senate/other), and the internal quorum link.
Detailed Description: These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
Detailed Description: 538 compiled surveys from pollsters on concern about the economy, concern about getting infected, and approval of President Trump's response. Users should use the files ending in '_toplines', as this is the data that is used on the site. The polls files show how various polls are weighted to get to the topline numbers.
Detailed Description: Survey data on American trends. The data consists of individual-level survey responses for the American Trends Panel by the Pew Research Center. Ensure that you weight measures appropriately when analyzing this data. The download includes multiple PDF files with the methodology and questionnaire, which should be closely reviewed.
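To illustrate why weighting matters with individual-level survey responses, here is a minimal sketch comparing an unweighted share with a weighted one. The responses and weights are made up, and the function name is hypothetical; the actual weight variable and recommended procedure are described in the methodology files included with the download.

```python
def weighted_share(responses, weights, target):
    """Return the weighted share of respondents whose answer equals `target`."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    matched = sum(w for r, w in zip(responses, weights) if r == target)
    return matched / total


# Made-up answers and survey weights for four respondents.
answers = ["yes", "no", "yes", "no"]
weights = [0.5, 1.5, 1.0, 1.0]

unweighted = answers.count("yes") / len(answers)
weighted = weighted_share(answers, weights, "yes")
print(unweighted, weighted)
```

In this toy example half of the respondents answered "yes", but the weighted share is lower because one of those respondents carries a small weight.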
Detailed Description: The survey from CMU Delphi Research Center asks people to self-report symptoms associated with COVID-19 or the flu that they or anyone in their household has experienced in the last 24 hours. Data are gathered nationwide with the help of Facebook and Google. High correlation between self-reported descriptions of COVID-19-related symptoms and test-confirmed cases of the disease suggests self-reports might soon help the researchers in forecasting COVID-19 activity.
The Delphi COVID-19 Response Team is developing an API for accessing Delphi's COVID-19 Surveillance Streams (covidcast), a data source within Delphi's epidemiological data. COVIDcast displays signals related to COVID-19 activity levels across the United States, derived from a variety of anonymized, aggregated data sources made available by multiple partners.
Each signal may reflect the prevalence of COVID-19 infection, mild symptoms, or more severe disease over time. Each signal can be presented at multiple geographic resolutions: state, county, and/or metropolitan area. All these signals taken together may suggest heightened or rising COVID-19 activity in specific locations.
Find the home of Delphi's epidemiological data API here.
Scholarly Article Database: Big database of scholarly article metadata with links and queryable json files for Natural Language Processing
Detailed Description: This dataset combines 44k+ scholarly articles/literature pertaining to the coronavirus. It can be used to analyze the main authors, sources, titles, journal and abstract for the analyst to look into. Each row provides a link to the article if Natural Language Processing should be a desired task.
AirNow Data: Real time air quality data for major world cities and US locations - Added April 30
Detailed Description: Link goes to the Embassies and Consulates section where you can click a specific city, then choose the historical tab to download csv files for specific years of air quality data. There is also a developer API located here https://docs.airnowapi.org/.
Data Resolution: Global, Major World Cities, US Cities
We encourage you to recognize both the limitations of the data and the limits of your ability to draw conclusions from it. The importance of COVID-19 means that any visuals created may be displayed in other contexts, and we ask you not to overreach in any conclusions you attempt to draw. See below for a list of articles around this topic.