COVID-19 SAS Studio Project No.2: Visualising global trends in Johns Hopkins University data

This is a project to read the daily Johns Hopkins COVID-19 data and visualise the national infection and fatality trends using Base SAS and SAS/STAT:

  1. Download the GitHub Desktop software from https://desktop.github.com/ and install it on the computer where you will be running SAS Studio or SAS University Edition. For instructions on how to install SAS University Edition on your own computer, please read my blog post “Are you learning about SAS?”.
  2. Clone the Johns Hopkins COVID-19 repository at https://github.com/CSSEGISandData/COVID-19 using GitHub Desktop, and then Pull the latest data. This reduces the time needed to get the latest data each time you run the SAS Studio project, as a quick Pull in GitHub Desktop is all that is required.
  3. Download my SAS Studio CPF project file (John-Hopkins-GitHub-data.cpf), which is supplied as a zipped CPF file and will be updated occasionally with accepted submissions. Please check for comments here when updates are added.
  4. Open the CPF project file in SAS Studio (requires Base SAS and SAS/STAT) or SAS University Edition (making certain that you have first created one or more Shared Folders pointing to where your GitHub files and CPF project file are stored).
  5. Update the “run first” program to include your GitHub file folder in the &_dir macro variable assignment. The CSV files we will be using can be found in the /csse_covid_19_data/csse_covid_19_daily_reports folder.
  6. Submit each program in the order given below (or submit all of the programs in the project’s flow together):
    • (1) “run first” assigns the location of the data to the &_dir macro variable.
    • (2) “Read CSV files” creates the SAS data sets in WORK by reading all of the CSV files in the csse_covid_19_daily_reports folder, and then summarises the records by Country_Region to remove the finer detail in the daily reports.
    • (3) “Calculate regression lines” generates the regression lines for confirmed cases between 100 and 10,000, and for deaths between 10 and 1,000, to include on the graphs. The regression lines appear straight in the semi-log plots, but are actually exponential curves that match the initial growth, so that any “flattening” of the curves can be identified more easily.
    • (4) “Semi-log plots of confirmed vs deaths” generates the graphs for countries that have had more than 1,000 confirmed cases or more than 100 deaths.
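
As a minimal sketch of the idea behind programs (3) and (4), and not the code in the CPF project itself, the following fits a straight line to log10(confirmed) and draws it on a semi-log plot. The data values, the data set names (work.cases, work.est, work.fitted) and the use of PROC REG are assumptions made purely for illustration; the numbers are invented and are not Johns Hopkins data.

```sas
/* Hypothetical example data: cumulative confirmed cases for one country. */
/* These values are made up for illustration only.                         */
data work.cases;
  input day confirmed;
  log10_confirmed = log10(confirmed);
  datalines;
1 120
2 160
3 210
4 290
5 380
6 510
7 680
8 900
9 1200
10 1600
11 2100
12 2700
13 3300
14 3900
15 4400
;
run;

/* Fit a straight line to log10(confirmed) within the 100 to 10,000 range. */
/* A straight line on the log scale is an exponential on the raw scale.    */
proc reg data=work.cases outest=work.est noprint;
  where 100 <= confirmed <= 10000;
  model log10_confirmed = day;
run;
quit;

/* Attach the fitted coefficients to every row and back-transform. */
data work.fitted;
  if _n_ = 1 then set work.est(keep=Intercept day rename=(day=slope));
  set work.cases;
  fitted = 10 ** (Intercept + slope * day);
run;

/* Semi-log plot: the exponential fit appears as a straight dashed line. */
proc sgplot data=work.fitted;
  series x=day y=confirmed / markers legendlabel="Confirmed cases";
  series x=day y=fitted / lineattrs=(pattern=dash) legendlabel="Exponential fit";
  yaxis type=log logbase=10 label="Confirmed cases (log scale)";
  xaxis label="Day";
run;
```

In a plot like this, observed points that fall below the dashed line in later days would suggest that growth is slowing relative to the early exponential trend.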

Some questions for you to answer:

    • (a) Where could my “Read CSV files” program be improved?
    • (b) Why is the US graph split at around 20Mar2020? Is this a problem with the data or my program?
    • (c) Are all of the cases being included?

This project is open to SAS programmers and to researchers. Follow the above instructions yourself, and then see if you can improve my SAS code by answering the questions.

Please send your saved SAS Studio flow containing your improved versions of the SAS programs to phil@hollandnumerics.org.uk. Anyone providing improvements that can be incorporated will be added to the credits for this project.

My first COVID-19 SAS project for SAS Studio/SAS University Edition can be found at “Can you help? Supporting Coronavirus Research by searching research papers with SAS”.

If you are still looking for SAS training, then please go to my blog post “SAS training for home-workers: Keeping your mind active and your skills current” for some more training options.

COVID-19 can be defeated, and, working together, we can make a difference!

Can you help? Supporting Coronavirus Research by searching research papers with SAS

Kaggle are running a competition to develop a Python or R application to filter the vast collection of medical research papers that are being published every day.


The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide. There is a growing urgency for these approaches because of the rapid increase in coronavirus literature, making it difficult for the medical community to keep up.

Many of these questions are suitable for text mining, and they are encouraging researchers to develop text mining tools to provide insights on these questions.

This dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine – National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.


I am a SAS programmer, not a Python or R programmer, so I decided to use the freely available dataset to develop a simple data mining application in SAS instead, which I would like to publish for the benefit of the fight against COVID-19. I have now created a basic framework, which I am opening up to the SAS-programming community to test, improve and enhance. It is available as a saved SAS Studio flow (*.cpf), which can be imported into a Single-User SAS Studio installation, or into SAS University Edition installed on a PC, as these installations can directly access files on your own computer. See my blog post “Are you learning about SAS?” for details about how to install SAS University Edition.

The SAS programs in the SAS Studio flow are as follows:

  • run first:
    • Assigns the location of the downloaded Kaggle dataset into &_dir.
      • This macro variable will need to be edited to match the location of the folder where you have downloaded and extracted the CORD-19 dataset. SAS University Edition users will also need to assign a shared folder pointing to this location.
    • Includes %create_json_extract_script used in the Read json xxx programs below to read the JSON files containing the selected research papers into SAS, and then print out the contents of the paper.
      • If the papers contain non-printable characters that are not catered for in this macro, then additional TRANWRD() statements will need to be added here.
      • If anyone can devise a more elegant solution than using multiple TRANWRD() statements to convert Unicode escape (\u9999) strings to printable 8-bit characters, then I will welcome tested suggestions.
  • Read metadata:
    • Reads the CSV file containing the metadata about the research papers, including the abstract, which can be searched, and the location(s) of the related paper(s). SAS data set created = work.metadata.
  • Filter metadata 31Dec19:
    • Filters the extracted metadata to only include papers published in or after December 2019. SAS data set created = work.Since_31Dec19.
  • Filter metadata xxx:
    • Filters the SAS data set (work.Since_31Dec19) created by Filter metadata 31Dec19 to select paper abstracts containing specific keywords. SAS data set created = work.Since_31Dec19_xxx:
      • xxx=infect: WHERE INDEX(lowcase(abstract), 'infect') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=cured: WHERE INDEX(lowcase(abstract), 'cured') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=fatal: WHERE INDEX(lowcase(abstract), 'fatal') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=recover: WHERE INDEX(lowcase(abstract), 'recover') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
  • Read json xxx:
    • Prints all of the papers in the filtered abstracts to HTML using the metadata in the SAS data set (work.Since_31Dec19_xxx) created by Filter metadata xxx.
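
The four Filter metadata xxx programs differ only in their first keyword, so one possible simplification is a single parameterised macro. This is a sketch only: the macro name %filter_metadata and its keyword parameter are hypothetical and are not part of the SAS Studio flow, and it assumes that work.Since_31Dec19 already exists with a character variable called abstract. It uses FIND() with the 'i' modifier as a case-insensitive alternative to INDEX(lowcase(...)).

```sas
/* Hypothetical macro: the name and parameter are illustrative only. */
/* Assumes work.Since_31Dec19 exists and contains ABSTRACT.          */
%macro filter_metadata(keyword);
  data work.Since_31Dec19_&keyword.;
    set work.Since_31Dec19;
    /* The 'i' modifier makes each FIND() search ignore case. */
    where find(abstract, "&keyword.", 'i')
      and find(abstract, 'rate', 'i')
      and find(abstract, 'age', 'i')
      and (find(abstract, 'hcov', 'i')
        or find(abstract, '-cov', 'i')
        or find(abstract, 'covid', 'i'));
  run;
%mend filter_metadata;

/* One call per keyword replaces one Filter metadata xxx program. */
%filter_metadata(infect)
%filter_metadata(cured)
%filter_metadata(fatal)
%filter_metadata(recover)
```

Adding a new search term would then need only one extra macro call, rather than a copy of the whole program.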

This project is open both to SAS programmers and to researchers. Please download the CORD-19 dataset and my SAS Studio flow. Try it out yourself, and then see if you can improve the performance, usability, flexibility or maintainability of my SAS code.
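
One candidate improvement to the multiple TRANWRD() statements in the run first program might be the UNICODE() function, which can convert \uXXXX escape strings to characters in the SAS session encoding. The sketch below is untested against the CORD-19 JSON files, and the sample string and the data set name work.unicode_demo are hypothetical. It also assumes that every escaped character can be represented in your session encoding (UTF-8 sessions will cope with more characters than single-byte sessions).

```sas
/* Sketch only: not tested against the CORD-19 JSON files.             */
/* The sample text is hypothetical; 'esc' tells UNICODE() to interpret */
/* \uXXXX escape sequences in the input string.                        */
data work.unicode_demo;
  length raw clean $200;
  raw = 'Caf\u00e9 \u00b5g';
  clean = unicode(raw, 'esc');
  put raw= clean=;  /* write both values to the SAS log for comparison */
run;
```

If this approach works for the papers you select, any characters that still do not convert could be handled with a much shorter list of TRANWRD() calls.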

Please send your saved SAS Studio flow containing your improved versions of the SAS programs to phil@hollandnumerics.org.uk. Anyone providing improvements that can be incorporated will be added to the credits for this project.

If you are still looking for SAS training, then please go to my blog post “SAS training for home-workers: Keeping your mind active and your skills current” for some more training options.

COVID-19 can be defeated, and, working together, we can make a difference!