Description
HPC systems
Clusters
Linux systems
HPC HW knowledge especially in the server, GPU, networking, Storage, BIOS & BMC arenas
TCP/IP fundamentals
BE/BTech or MS degree in Computer Engineering, Electrical Engineering, or a related field
• Design, implementation & support of high-performance compute clusters
• Solid knowledge of HPC systems, including CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
• Apply their attention to detail to generate HW BOMs for the HPC clusters, provide vendor management, and oversee HW release activities.
• Use their strong skills with the Linux OS to configure appropriate operating systems for the HPC system
• Understand and assemble the project specifications and performance requirements at the subsystem and system levels. Adhere to and drive project timelines to ensure program milestones are completed on time.
• Support design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts and documentation to the manufacturing team and customer support team.
• BS or MS degree in Computer Engineering, Electrical Engineering, or a related field, plus 6 to 10 years of validated experience
• Validated in-depth and flavor-agnostic knowledge of Linux systems (SUSE, Red Hat, Rocky, Ubuntu)
• Experience crafting and maintaining robust storage
• Strong HPC HW knowledge, especially in the server, GPU, networking, storage, BIOS & BMC arenas.
• Experience with systemd, network boot/PXE, and Linux HA.
• Strong understanding of TCP/IP fundamentals and knowledge of protocols such as DNS, DHCP, HTTP, LDAP, and SMTP.
• Ability to code and develop Shell and Python scripts.
• Experience with one or more configuration management utilities (Salt, Chef, Puppet, etc.).
• Possess a strong DevOps focus: knowledge of setting up a continuous development pipeline (Jenkins), repository software (Git-based), and Singularity & Docker containers.
• Kubernetes, Prometheus & Grafana experience
• Knowledge of Apache/Nginx, setting up proxies and reverse proxies, application server routing, and load balancing (HAProxy)
• Team Orientation & Interpersonal – Highly motivated teammate with the ability to develop and maintain collaborative relationships at all levels within and external to the organization.
• Organization & Time Management – Able to plan, schedule, organize, and follow up on tasks related to the job to achieve goals within or ahead of established time frames.
• Multitasking – Able to organize, coordinate, manage, prioritize, and perform multiple tasks simultaneously, and to swiftly assess a situation, determine a logical course of action, and apply the appropriate response.
• Adaptability to Change – Able to be flexible and supportive, and to assimilate change positively and proactively in a rapid-growth environment.
• Outstanding teammate with excellent written and verbal communication skills.