Table of Contents
- Project: Update GDC Data on Xena
- Project Proposal
- Personal Background
- Relevant Skills
- Your Availability
- Do you have any school-related activities scheduled during the coding period?
- Do you have a full- or part-time job or internship planned for this summer?
- Any other plans during the coding period that might impact what hours or days you can work?
- How many hours per week do you have for a summer project?
Project: Update GDC Data on Xena
Project Proposal
-
A high-level description of what you plan to work on
UCSC Xena is a functional genomics visualization and analysis platform. Xena Browser comes with several data hubs whose datasets users can explore directly: UCSC Public Hub, TCGA Hub, Pan-Cancer Atlas Hub, ICGC Hub, UCSC Toil RNAseq Recompute, Treehouse Hub and GDC Hub. This proposal focuses on the GDC Hub.
The GDC (Genomic Data Commons) Hub hosts data fetched from the GDC repository (https://portal.gdc.cancer.gov/repository) via the GDC API. However, the data on Xena is out of date: it was last updated at release 10, roughly 1.5 years ago. This project revolves around updating the data on Xena to the current release (release 15). It will also simplify the process of adding data to Xena by removing the XML files as a source.
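One way to tell how far the hub lags behind is to compare release numbers. The sketch below is illustrative only: it assumes the GDC API status endpoint (https://api.gdc.cancer.gov/status) returns a JSON object with a `data_release` string shaped like the sample in the comments, and the helper names are hypothetical, not part of xena-GDC-ETL.

```python
import json
import re
from urllib.request import urlopen

# Assumed endpoint; the response is expected to contain a "data_release" key.
GDC_STATUS_ENDPOINT = "https://api.gdc.cancer.gov/status"


def parse_release_number(data_release):
    """Extract (major, minor) from a string like 'Data Release 15.0 - <date>'."""
    match = re.search(r"Data Release (\d+)(?:\.(\d+))?", data_release)
    if match is None:
        raise ValueError("unrecognized release string: %r" % data_release)
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def fetch_release_number(timeout=30):
    """Ask the GDC API which data release it is currently serving."""
    with urlopen(GDC_STATUS_ENDPOINT, timeout=timeout) as response:
        status = json.load(response)
    return parse_release_number(status["data_release"])


def needs_update(hub_release, gdc_release):
    """Return True when the hub was built from an older release than GDC serves."""
    return hub_release < gdc_release


if __name__ == "__main__":
    # The hub release (10.0) comes from the proposal text; the GDC release is
    # fetched live, so this part needs network access.
    print(needs_update((10, 0), fetch_release_number()))
```

Keeping the parsing separate from the network call means the comparison logic can be tested without contacting the API.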
-
Specific details, key aspects and anticipated challenges
The process involves two high-level goals:
- Updating the GDC data wrangling pipeline: The pipeline code (https://github.com/yunhailuo/xena-GDC-ETL), which currently fetches data from the GDC repository via the GDC API, is outdated. This goal is to bring it up to date with the latest release.
- Updating GDC data on Xena: Once the data wrangling pipeline is updated, this goal is to use it to update the GDC data on Xena.
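To make the "fetch via the GDC API" step concrete, here is a minimal sketch of how a query against the GDC `files` endpoint could be assembled. It is not the code used in xena-GDC-ETL: the function names, the chosen `fields` and the example project and data type values are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, workflow_type=None, size=100):
    """Build query parameters selecting files of one data type in one project.

    Hypothetical helper: the field names follow the GDC files schema, but the
    selection of fields returned is only an example.
    """
    conditions = [
        {"op": "in",
         "content": {"field": "cases.project.project_id", "value": [project_id]}},
        {"op": "in",
         "content": {"field": "data_type", "value": [data_type]}},
    ]
    if workflow_type is not None:
        conditions.append(
            {"op": "in",
             "content": {"field": "analysis.workflow_type",
                         "value": [workflow_type]}}
        )
    return {
        # The API expects the filter tree as a JSON-encoded string.
        "filters": json.dumps({"op": "and", "content": conditions}),
        "fields": "file_id,file_name,cases.samples.submitter_id",
        "format": "json",
        "size": str(size),
    }


def build_files_url(project_id, data_type, workflow_type=None, size=100):
    """Return a GET URL for the query; fetching it is left to the caller."""
    params = build_files_query(project_id, data_type, workflow_type, size)
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Example values only; any project ID and data type served by GDC would do.
    print(build_files_url("TCGA-BRCA", "Copy Number Segment", "DNAcopy"))
```

Building the filter as plain data keeps each new data type a matter of passing different arguments rather than writing a new request.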
-
Tell us what research into this project you have done
- I have analyzed the gdc module of the xena-GDC-ETL package and added tests for it. The module currently has 82% test coverage. In my analysis, I found that the code and APIs used in the module are up to date with upstream.
- Through my contributions to the repo, I have gained a sound knowledge of the modules and the task each of them performs.
-
Break the project down into smaller tasks
Phase 1 (Week 1-4):
- Improve the overall project in this phase, i.e. add tests, fix lint issues, improve the CLI, set up doctests for existing code and release the project to PyPI.
- Fix broken APIs and code for fetching new data from the API.
- Start adding code for the new data provided by the GDC API. The package should be able to fetch and upload FM Simple Somatic Mutation data by the end of this period.
Phase 2 (Week 5-7):
- Add and upload GISTIC - Copy Number Score, DNAcopy and STAR - Counts data into Xena, and document the process.
Phase 3 (Week 8-11):
- Finish uploading the data into Xena, and add docs and doctests.
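As an illustration of what the doctests mentioned in the phases could look like, here is a small sketch. The helper `sample_id_from_barcode` is hypothetical and is not taken from xena-GDC-ETL; it only assumes the usual hyphen-separated TCGA barcode layout, where the first four fields identify a sample.

```python
import doctest


def sample_id_from_barcode(barcode):
    """Return the sample-level part of a TCGA aliquot barcode.

    The examples below double as documentation and as tests.

    >>> sample_id_from_barcode("TCGA-02-0001-01C-01D-0182-01")
    'TCGA-02-0001-01C'
    >>> sample_id_from_barcode("TCGA-02")
    Traceback (most recent call last):
        ...
    ValueError: barcode has fewer than 4 fields: 'TCGA-02'
    """
    fields = barcode.split("-")
    if len(fields) < 4:
        raise ValueError("barcode has fewer than 4 fields: %r" % barcode)
    return "-".join(fields[:4])


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    print(doctest.testmod())
```

The same examples can also be collected by a test runner, so the documentation and the test suite stay in sync.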
Timeline
This week-by-week timeline provides a rough guideline of how the project will be done.
Phase | Date | Tasks |
Community Bonding | May 7 - May 26 | |
Phase 1 | Week 1: May 27 - Jun 2 | |
Phase 1 | Week 2: Jun 3 - Jun 9 | |
Phase 1 | Week 3: Jun 10 - Jun 16 | |
Phase 1 | Week 4: Jun 17 - Jun 24 | |
Phase 1 | Jun 24 - Jun 28 | Phase 1 Evaluation |
Phase 2 | Week 5: Jul 1 - Jul 7 | |
Phase 2 | Week 6: Jul 8 - Jul 14 | |
Phase 2 | Week 7: Jul 15 - Jul 21 | |
Phase 2 | Jul 22 - Jul 26 | Phase 2 Evaluation |
Phase 3 | Week 8: Jul 29 - Aug 4 | |
Phase 3 | Week 9: Aug 5 - Aug 12 | |
Phase 3 | Week 10: Aug 13 - Aug 19 | |
Phase 3 | Week 11: Aug 20 - Aug 26 | |
Phase 3 | Aug 26 - Sep 2 | Final Evaluation |
Post GSoC | - | |
Personal Background
- Name: Ayan Banerjee
- Email: ayanbn7@gmail.com
- Timezone: Indian Standard Time (UTC +05:30)
- Links:
- GitHub: https://github.com/ayan-b
- GitLab: https://gitlab.com/ayan-b
- Gitter: https://gitter.im/ayan-b
- CodeChef: https://codechef.com/users/ayan_nitd
- LinkedIn: https://linkedin.com/in/ayanb
- Education
- Bachelor of Technology
- Department of Electronics and Communication Engineering
- National Institute of Technology, Durgapur
- July 2016 - Present
- CGPA: 9.30/10 (till 5th semester)
- Higher Secondary (10+2)
- February 2016
- Sargachi Ramakrishna Mission High School, West Bengal
- Mathematics & Biology
- Score: 96%
- Secondary (10th)
- January 2014
- Sargachi Ramakrishna Mission High School, West Bengal
- Mathematics, Biology, History & Geography
- Score: 92.43%
- Work Experience:
-
Student mentor, Google Code-in
Oct - Dec 2018
- Mentored pre-university students to get started in open source
- Mentored the students for the organization coala
-
Summer Intern at Indian Institute of Technology, Bombay
May - July 2018
- Integrated a plagiarism checker for the Python programming language with Yaksh, a course-taking application written in Django.
- Wrote an elaborate test suite for it.
Relevant Skills
What are your languages of choice and how do they relate to the project?
Python: Since the current data wrangler is written entirely in Python, Python is the natural language of choice for this project. Some shell scripting may be required for automating tasks (tests, linting, etc.). Other technology requirements are Git (for version control), Jinja2 (for templating), CI (Continuous Integration) and Sphinx (for writing docs).
Any prior experience with open source development?
- I have been a core developer at the organization moremoban for the last 4 months. I mainly contribute to moban, a Jinja2 command-line tool for static text generation, and to pypi-mobans. These two projects are used as repository management tools and are written in Python and Jinja2. I developed two plugins for the project, moban-velocity and moban-haml, and released them on PyPI.
- I have also contributed to the xena-GDC-ETL repository. My commits, pull requests and reviews are linked in the code sample answer below. I have also opened some issues for the overall improvement of the project.
- I have also developed some personal projects, mostly in Python and vanilla JavaScript. All of them are open source and can be found on my GitHub and GitLab profiles.
Please point us to a code sample you have written.
My contributions to this project, along with the projects mentioned above, can serve as code samples:
- Commits: https://github.com/yunhailuo/xena-GDC-ETL/commits?author=ayan-b
- PRs: https://github.com/yunhailuo/xena-GDC-ETL/pulls?q=is%3Apr+author%3Aayan-b
- Reviews: https://github.com/yunhailuo/xena-GDC-ETL/pulls?q=is%3Apr+reviewed-by%3Aayan-b
Any prior experience in bioinformatics or cancer genomics?
I have high-school-level knowledge of genomics and am familiar with its basics and commonly used terms.
Your Availability
Do you have any school-related activities scheduled during the coding period?
Summer vacation at my university starts in the first week of May and ends in the second week of July. From the second week of July, I will be at my university and will continue coding there.
Do you have a full- or part-time job or internship planned for this summer?
No.
Any other plans during the coding period that might impact what hours or days you can work?
No.
How many hours per week do you have for a summer project?
Since I can work on this project full-time, I will be able to devote 50-60 hours per week to it. Once my university reopens in mid-July, I will be able to give 3-4 hours per day on weekdays and 8-10 hours per day on weekends.