The primary purpose of this Data Scientist role is to help build the data infrastructure of our flagship platform A3D3a: Adaptive, AI-augmented, Drug Discovery and Development. With expertise in data architecture, the Data Scientist will directly contribute to our mission to discover novel therapies for cancer patients. Led by Prof. Bissan Al-Lazikani, Director of Therapeutics Data Science, the intelligent and ever-learning A3D3a platform is part of the new initiative in Therapeutics Data Science and of our ambitious Institute for Data Science in Oncology at MD Anderson. By developing novel machine learning and AI technologies, A3D3a will accelerate the discovery and impact of novel therapies for cancer and open new opportunities for optimized therapies for patients, with a focus on rare and hard-to-treat cancers.
Central to this vision, the Data Scientist will build and maintain data infrastructure to enable the discovery of hidden therapeutic opportunities in integrated patient data and will work closely with data scientists, data engineers, bioinformaticians, and molecular modelers.
• Work with lead teammates to establish an architectural plan encompassing local, hybrid, and/or cloud infrastructure
• Utilize a variety of tools (e.g. Spark, KNIME, Airflow, SQL) to merge and extract data from multiple sources and environments
• Create data pipelines to validate and enrich data for use in ML models
• Generate and maintain metadata for all stages of the data pipeline
• Work with a multidisciplinary team and stakeholders to define data requirements
• Establish and maintain interfaces to the data (APIs)
• Utilize industry standards for creating, storing, and documenting code
• Strong Python programming experience, backed by demonstrated skills, is required
• Candidates having experience using Spark (PySpark) will be given preference
• Solid understanding of CI/CD practices
• Experience building and querying both relational and graph databases
• Familiarity with NoSQL databases
• Solid knowledge of metadata creation and management
• Experience with Airflow, Argo or equivalent workflow orchestration is required
• Must have demonstrated experience working with APIs
• Good understanding of container-based architectures (e.g. Docker/Kubernetes)
• Candidates must have demonstrated experience working on data engineering tasks using one of the major cloud vendors. Preference will be given to those with Microsoft Azure experience
• Prefer candidates with demonstrated skills in building/deploying ML models
Required: Bachelor's degree in Biomedical Engineering, Electrical Engineering, Computer Engineering, Physics, Applied Mathematics, Science, Engineering, Computer Science, Statistics, Computational Biology, or related field.
Preferred: PhD in Biomedical Engineering, Electrical Engineering, Computer Engineering, Physics, Applied Mathematics, Science, Engineering, Computer Science, Statistics, Computational Biology, or related field.
Required: Three years' experience in scientific software development/analysis. With a Master's degree, one year's experience required. With a PhD, no experience required.
It is the policy of The University of Texas MD Anderson Cancer Center to provide equal employment opportunity without regard to race, color, religion, age, national origin, sex, gender, sexual orientation, gender identity/expression, disability, protected veteran status, genetic information, or any other basis protected by institutional policy or by federal, state or local laws unless such distinction is required by law. http://www.mdanderson.org/about-us/legal-and-policy/legal-statements/eeo-affirmative-action.html