Can you help? Supporting Coronavirus Research by searching research papers with SAS

Kaggle are running a competition to develop a Python or R application to filter the vast collection of medical research papers that are being published every day.


The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide. There is a growing urgency for these approaches because of the rapid increase in coronavirus literature, making it difficult for the medical community to keep up.

Many of the questions posed in the competition are suitable for text mining, and Kaggle is encouraging researchers to develop text mining tools that provide insights into them.

This dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine – National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.


I am a SAS programmer rather than a Python or R programmer, so I decided to use the freely available dataset to develop a simple data mining application in SAS instead, which I would like to publish for the benefit of the fight against COVID-19. I have now created a basic framework, which I am opening up to the SAS programming community to test, improve and enhance. It is supplied as a saved SAS Studio flow (*.cpf), which can be imported into a Single-User SAS Studio installation or into SAS University Edition installed on a PC, as both of these installations can directly access files on your own computer. See my blog post “Are you learning about SAS?” for details of how to install SAS University Edition.

The SAS programs in the SAS Studio flow are as follows:

  • run first:
    • Assigns the location of the downloaded Kaggle dataset into &_dir.
      • This macro variable will need to be edited to match the location of the folder where you have downloaded and extracted the CORD-19 dataset. SAS University Edition users will also need to assign a shared folder pointing to this location.
    • Includes the %create_json_extract_script macro, which the Read json xxx programs below use to read the JSON files containing the selected research papers into SAS and then print out the contents of each paper.
      • If the papers contain non-printable characters that this macro does not yet cater for, then additional TRANWRD() calls will need to be added here.
      • If anyone can devise a more elegant solution than multiple TRANWRD() calls for converting Unicode escape strings (\u9999) to printable 8-bit character values, then I will welcome tested suggestions.
  • Read metadata:
    • Reads the CSV file containing the metadata about the research papers, including the abstract, which can be searched, and the location(s) of the related paper(s). SAS data set created = work.metadata.
  • Filter metadata 31Dec19:
    • Filters the extracted metadata to only include papers published in or after December 2019. SAS data set created = work.Since_31Dec19.
  • Filter metadata xxx:
    • Filters the SAS data set (work.Since_31Dec19) created by Filter metadata 31Dec19 to select paper abstracts containing specific keywords. SAS data set created = work.Since_31Dec19_xxx:
      • xxx=infect: WHERE INDEX(lowcase(abstract), 'infect') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=cured: WHERE INDEX(lowcase(abstract), 'cured') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=fatal: WHERE INDEX(lowcase(abstract), 'fatal') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
      • xxx=recover: WHERE INDEX(lowcase(abstract), 'recover') AND INDEX(lowcase(abstract), 'rate') AND INDEX(lowcase(abstract), 'age') AND (INDEX(lowcase(abstract), 'hcov') OR INDEX(lowcase(abstract), '-cov') OR INDEX(lowcase(abstract), 'covid'));
  • Read json xxx:
    • Prints to HTML all of the papers whose abstracts passed the filter, using the metadata in the SAS data set (work.Since_31Dec19_xxx) created by Filter metadata xxx.
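
The four Filter metadata xxx programs differ only in their first keyword, so one way to sketch them is as a single macro called once per keyword. The macro name %filter_metadata and the three-row stand-in for work.Since_31Dec19 below are hypothetical, added only so that the example runs on its own; they are not part of the published flow.

```sas
/* Stand-in for work.Since_31Dec19, for illustration only. */
/* In the real flow this data set comes from Filter metadata 31Dec19. */
data work.Since_31Dec19;
  infile datalines truncover;
  input abstract $200.;
  datalines;
Infection rate by age group in COVID-19 patients
Case fatality rate and age in SARS-CoV-2 cohorts
Protein folding in yeast
;
run;

/* Hypothetical macro: one definition replaces the four WHERE clauses. */
/* The keyword is also used as the suffix of the output data set name. */
%macro filter_metadata(keyword);
  data work.Since_31Dec19_&keyword.;
    set work.Since_31Dec19;
    /* INDEX returns 0 when the substring is absent, which WHERE treats as false. */
    where index(lowcase(abstract), lowcase("&keyword."))
      and index(lowcase(abstract), 'rate')
      and index(lowcase(abstract), 'age')
      and (index(lowcase(abstract), 'hcov')
        or index(lowcase(abstract), '-cov')
        or index(lowcase(abstract), 'covid'));
  run;
%mend filter_metadata;

%filter_metadata(infect)
%filter_metadata(cured)
%filter_metadata(fatal)
%filter_metadata(recover)
```

Because INDEX searches for substrings rather than whole words, 'age' will also match inside words such as 'percentage', so the filters may select more abstracts than intended.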

This project is open both to SAS programmers and to researchers. Please download the CORD-19 dataset and my SAS Studio flow. Try it out yourself, and then see if you can improve the performance, usability, flexibility or maintenance of my SAS code.
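
As one possible direction for the Unicode conversion mentioned above, the SAS UNICODE function may be able to replace the chain of TRANWRD() calls. The sketch below is untested against the CORD-19 files. It assumes that the escapes appear in the \uXXXX form and that the characters they represent exist in the SAS session encoding. The variable names raw and clean and the sample text are hypothetical.

```sas
data _null_;
  length raw clean $200;
  /* Sample text containing JSON-style \uXXXX escapes (hypothetical). */
  raw = 'na\u00efve caf\u00e9';
  /* 'ESC' asks UNICODE to interpret \uXXXX escapes and convert them */
  /* to characters in the current SAS session encoding. */
  clean = unicode(raw, 'esc');
  /* Write both values to the log for comparison. */
  put raw= / clean=;
run;
```

The result depends on the session encoding: a UTF-8 session can represent far more characters than a single-byte session such as wlatin1, so any escapes that cannot be represented would still need separate handling.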

Please send your saved SAS Studio flow containing your improved versions of the SAS programs to phil@hollandnumerics.org.uk. Anyone providing improvements that can be incorporated will be added to the credits for this project.

If you are still looking for SAS training, then please go to my blog post “SAS training for home-workers: Keeping your mind active and your skills current” for some more training options.

COVID-19 can be defeated, and, working together, we can make a difference!