GSoC Proposal UCSC Xena

June 12, 2019

Project: Update GDC Data on Xena

Project Proposal

A high-level description of what you plan to work on

UCSC Xena is a functional genomics visualization and analysis platform. To support this, the Xena Browser ships with several data hubs whose datasets users can explore directly: UCSC Public Hub, TCGA Hub, Pan-Cancer Atlas Hub, ICGC Hub, UCSC Toil RNAseq Recompute, Treehouse Hub and GDC Hub. This proposal focuses on the GDC Hub.

The GDC (Genomic Data Commons) Hub serves data fetched from the GDC repository (https://portal.gdc.cancer.gov/repository) via the GDC API. However, the data on Xena is out of date: it was last updated for release 10, roughly 1.5 years ago. This project mainly revolves around updating the data on Xena to the current release (release 15). It will also simplify the process of adding data to Xena by removing the XML files as a source.
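
As a rough illustration of how the gap between the release on Xena and the release on GDC could be detected, the sketch below parses the release number out of the `data_release` string reported by the GDC API status endpoint (`https://api.gdc.cancer.gov/status`). The helper names and the sample string in the usage note are assumptions made for illustration; they are not part of xena-GDC-ETL.

```python
import json
import re
from urllib.request import urlopen

GDC_STATUS_URL = "https://api.gdc.cancer.gov/status"


def parse_release(data_release):
    """Extract the (major, minor) release number from a GDC release string."""
    match = re.search(r"(\d+)\.(\d+)", data_release)
    if match is None:
        raise ValueError("no release number in %r" % data_release)
    return int(match.group(1)), int(match.group(2))


def fetch_current_release(url=GDC_STATUS_URL):
    """Ask the GDC API which data release it is serving (needs network)."""
    with urlopen(url) as response:
        status = json.load(response)
    return parse_release(status["data_release"])


def is_outdated(xena_release, gdc_release):
    """Return True when the release on Xena is older than the one on GDC."""
    return xena_release < gdc_release
```

For example, `is_outdated((10, 0), parse_release("Data Release 15.0"))` evaluates to `True`, which is the situation this proposal sets out to fix.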
Specific details, key aspects and anticipated challenges

The process involves two high-level goals:
- Updating the GDC data wrangling pipeline: The pipeline code (https://github.com/yunhailuo/xena-GDC-ETL), which fetches data from the GDC repository via the GDC API, is outdated. This goal is to bring it up to date with the latest GDC release.
- Updating GDC data on Xena: Once the data wrangling pipeline is updated, this goal is to use it to update the GDC data on Xena.
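
Both goals depend on querying the GDC API for the files of a given project and data type. The sketch below shows one way such a query could be assembled against the `files` endpoint. The field names follow the GDC API's filter syntax as I understand it; the function names and the example project and data type are illustrative assumptions, not the existing pipeline code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def build_filters(project_id, data_type):
    """Build a GDC filter selecting open-access files of one data type."""

    def is_in(field, value):
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", project_id),
            is_in("data_type", data_type),
            is_in("access", "open"),
        ],
    }


def build_query_url(project_id, data_type, size=100):
    """Encode the filter and the requested fields into a GET URL."""
    params = {
        "filters": json.dumps(build_filters(project_id, data_type)),
        "fields": "file_id,file_name,cases.samples.submitter_id",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_URL + "?" + urlencode(params)


def fetch_file_hits(project_id, data_type):
    """Run the query (needs network access) and return the list of hits."""
    with urlopen(build_query_url(project_id, data_type)) as response:
        return json.load(response)["data"]["hits"]
```

Keeping the payload construction separate from the network call means the filter logic can be unit-tested without contacting GDC.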
Tell us what research into this project you have done
- I have analyzed the `gdc` module of the xena-GDC-ETL package and added tests for it. The module currently has 82% test coverage. My analysis found that the code and APIs used in this module are up to date with upstream.
- Through my contributions to the repository, I have gained a sound understanding of the modules and the task each of them performs.
Break the project down into smaller tasks

Phase 1 (Week 1-4):
- Improve the overall project: add tests, fix lint issues, improve the CLI, set up doctests for the existing code and release the project to PyPI.
- Fix broken APIs and code for fetching new data from the API.
- Start adding code for the new data provided by the GDC API. The package should be able to fetch and upload FM Simple Somatic Mutation data by the end of this period.
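
To make the doctest part of Phase 1 concrete, here is a minimal sketch of a docstring-embedded test and a runner for it. The helper `sample_barcode` is hypothetical and is not taken from xena-GDC-ETL; it only illustrates the style of test that CI would execute.

```python
import doctest


def sample_barcode(aliquot_barcode):
    """Return the sample-level part of a TCGA aliquot barcode.

    >>> sample_barcode("TCGA-02-0001-01C-01D-0182-01")
    'TCGA-02-0001-01C'
    >>> sample_barcode("TCGA-02-0001")
    Traceback (most recent call last):
        ...
    ValueError: expected at least 4 fields, got 3
    """
    fields = aliquot_barcode.split("-")
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields, got %d" % len(fields))
    return "-".join(fields[:4])


def run_doctests():
    """Run the docstring examples above and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    globs = {"sample_barcode": sample_barcode}
    for test in finder.find(sample_barcode, "sample_barcode", globs=globs):
        runner.run(test)
    return runner.summarize()
```

In a real package the same examples could be collected for every module with `python -m doctest` or pytest's `--doctest-modules` option instead of a hand-written runner.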

Phase 2 (Week 5-7):
- Add and upload `GISTIC - Copy Number Score`, `DNAcopy` and `STAR - Counts` data into Xena and document the process.
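
Wrangling per-sample data such as `STAR - Counts` for Xena broadly means combining one table per sample into a single tab-separated matrix with identifiers as rows and samples as columns. The sketch below shows that merging step using only the standard library. The function names, the `id` header, the `NA` placeholder and the input layout are assumptions for illustration; this is not the transformation implemented in xena-GDC-ETL.

```python
import csv
import io


def merge_to_matrix(per_sample_values):
    """Combine {sample: {gene: value}} into rows of a gene-by-sample matrix.

    Genes missing from a sample get the placeholder "NA" (an assumption).
    """
    samples = sorted(per_sample_values)
    genes = sorted({g for values in per_sample_values.values() for g in values})
    rows = [["id"] + samples]
    for gene in genes:
        rows.append(
            [gene] + [str(per_sample_values[s].get(gene, "NA")) for s in samples]
        )
    return rows


def write_tsv(rows):
    """Serialize the rows as tab-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Sorting both samples and identifiers keeps the output deterministic, which makes it easier to compare matrices produced from different GDC releases.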
Phase 3 (Week 8-11):
- Finish uploading the data into Xena, and add docs and doctests.

Timeline

This week-by-week timeline provides a rough guideline of how the project will be done.

Phase	Date	Tasks
Community Bonding	May 7 - May 26	Get to know the mentors and the codebase. I will dive deep into the code and learn its ins and outs with the help of the mentors. I will also check how the data is uploaded into Xena cohorts after it has been downloaded and transformed.
Phase 1	Week 1: May 27 - Jun 2	Fix all linting issues reported by flake8 and, once they are fixed, stop allowing lint failures in CI. Identify parts of the code that are incompatible with Python 3, especially Python 3.8, and fix them so that the code is future-proof. Add tests for the untested code in the smaller modules, i.e. the `panTCGA`, `merge_xena`, `make_metadata` and `TARGET-CCSK_phenotype_ETL` modules under scripts. Set up CI for doctests; an example of such a test can be found in the docstrings of the utils module. `xena-GDC-ETL 0.2` should be released on GitHub/PyPI by the end of this week.
	Week 2: Jun 3 - Jun 9	Currently, the package has only ~11% code coverage, and good coverage is needed to keep the project maintainable. I will identify the parts of the code base (other than the `xena_dataset` module) that are not covered and add tests for them. I will then investigate the `xena_dataset` module and add tests for the code that fetches public `TCGA` and `TARGET` data. This process will reveal which APIs work and which do not. Fix the broken APIs and remove the obsolete ones.
	Week 3: Jun 10 - Jun 16	Start working on adding missing data. Add and wrangle the data from the `FM` program by the end of this week.
	Week 4: Jun 17 - Jun 24	Modify the `gdc2xena` script so that it can load `FM` data into Xena. Document the whole process.
	Jun 24 - Jun 28	Phase 1 Evaluation
Phase 2	Week 5: Jul 1 - Jul 7	Add and wrangle `GISTIC - Copy Number Score` data by the end of this week. Modify the `gdc2xena` script so that it can load `GISTIC - Copy Number Score` data into Xena. Since most of the API work will already have been done in the preceding weeks, this should not take long. Document the whole process.
	Week 6: Jul 8 - Jul 14	Add `DNAcopy` data by the end of this week. Modify the `gdc2xena` script so that it can load `DNAcopy` data into Xena. Document the whole process.
	Week 7: Jul 15 - Jul 21	Add the `STAR - Counts` gene expression data by the end of this week. Modify the `gdc2xena` script so that it can load `STAR - Counts` data into Xena. Document the whole process.
	Jul 22 - Jul 26	Phase 2 Evaluation
Phase 3	Week 8: Jul 29 - Aug 4	Finish uploading data to Xena. Fix any broken code.
	Week 9: Aug 5 - Aug 12	Buffer week for any unfinished task.
	Week 10: Aug 13 - Aug 19	Add a cron job using Travis CI that runs the pipeline automatically, bi-weekly or monthly, and updates the data in Xena so that it never falls out of sync. If GDC changes its API, the job will fail and action can be taken. Check the robustness of the cron job.
	Week 11: Aug 20 - Aug 26	Finish up documentation and tests. Add tests for the code inside the docs and set up CI for them. Finish the setup for writing docs using Sphinx. Publish the existing docs on https://readthedocs.io, as the GitHub wiki sometimes does not play well with small screens.
	Aug 26 - Sep 2	Final Evaluation
Post GSoC	-	Keep an eye on the cron job and fix failures if the API structure changes. Help newcomers get started in the UCSCXena organization.
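
The Week 10 cron job relies on the pipeline failing loudly when the GDC API changes shape. One way to get that behaviour is to validate each response before using it, as in the sketch below. The function name and the required field list are hypothetical; the `data.hits` layout reflects the response structure of the GDC `files` endpoint as I understand it.

```python
def validate_files_response(payload, required_fields):
    """Raise ValueError if a GDC `files` response lacks the expected shape.

    Returns the number of hits when the response looks as expected.
    """
    try:
        hits = payload["data"]["hits"]
    except (KeyError, TypeError):
        raise ValueError("response has no data.hits entry")
    if not isinstance(hits, list):
        raise ValueError("data.hits is not a list")
    for index, hit in enumerate(hits):
        missing = [field for field in required_fields if field not in hit]
        if missing:
            raise ValueError(
                "hit %d is missing fields: %s" % (index, ", ".join(missing))
            )
    return len(hits)
```

An uncaught `ValueError` makes the Python process exit with a non-zero status, which a CI cron build would report as a failure.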

Personal Background

Name: Ayan Banerjee
Email: ayanbn7@gmail.com
Timezone: Indian Standard Time (UTC +05:30)
Links:
- GitHub: https://github.com/ayan-b
- GitLab: https://gitlab.com/ayan-b
- Gitter: https://gitter.im/ayan-b
- CodeChef: https://codechef.com/users/ayan_nitd
- LinkedIn: https://linkedin.com/in/ayanb
Education
- Bachelor of Technology
  - Department of Electronics and Communication Engineering
  - National Institute of Technology, Durgapur
  - July 2016 - Present
  - CGPA: 9.30/10 (till 5th semester)
- Higher Secondary (10+2)
  - February 2016
  - Sargachi Ramakrishna Mission High School, West Bengal
  - Mathematics & Biology
  - Score: 96%
- Secondary (10th)
  - January 2014
  - Sargachi Ramakrishna Mission High School, West Bengal
  - Mathematics, Biology, History & Geography
  - Score: 92.43%
Work Experience:
- Student mentor, Google Code-in
  
  Oct - Dec 2018
  - Mentored pre-university students to get started in open source
  - Mentored the students for the organization coala
- Summer Intern at Indian Institute of Technology, Bombay
  
  May - July 2018
  - Integrated a plagiarism checker for the Python programming language with Yaksh, a course-taking application written in Django.
  - Wrote an elaborate test suite for it.

Relevant Skills

What are your languages of choice and how do they relate to the project?

Python: Since the current data wrangler is written entirely in Python, it is the natural language of choice for this project. Some shell scripting may be required for automation (tests, linting, etc.). Other technology requirements are Git (for version control), Jinja2 (for templating), CI (continuous integration) and Sphinx (for writing docs).

Any prior experience with open source development?

I have been a core developer at the organization moremoban for the last 4 months. I am mainly involved with two of its projects: moban, a Jinja2 CLI command for static text generation (contributions), and pypi-mobans (contributions). Both are used as repository management tools and are written in Python and Jinja2. I developed two plugins for moban, moban-velocity and moban-haml, and released them on PyPI.
I have also contributed to the xena-GDC-ETL repository. My contributions can be found here. In addition, I have opened some issues aimed at the overall improvement of the project.
I have also developed some personal projects, mostly in Python and vanilla JavaScript, all of which are open source and can be found on my GitHub and GitLab profiles.

Please point us to a code sample you have written.

My contributions to this project, as well as to the projects mentioned above, can serve as code samples.

Any prior experience in bioinformatics or cancer genomics?

I have high-school level knowledge of genomics and am familiar with its basics and commonly used terms.

Your Availability

Do you have any school-related activities scheduled during the coding period?

Summer vacation at my university starts in the first week of May and ends in the second week of July. From the second week of July, I will be back at my university and will continue coding there.

Do you have a full- or part-time job or internship planned for this summer?

No.

Any other plans during the coding period that might impact what hours or days you can work?

No.

How many hours per week do you have for a summer project?

Since I can work on this project full-time, I will be able to devote 50-60 hours per week to it. Once my university reopens in mid-July, I will be able to give 3-4 hours per day on weekdays and 8-10 hours per day on weekends.
