building ML systems, optimizing GPU workloads, and experimenting across AI infrastructure, performance engineering, and scalable software systems.
currently focused on Triton/CUDA optimization, distributed training systems, efficient inference, and systems-aware deep learning.
I work across the intersection of:
- ML systems
- GPU programming
- AI infrastructure
- performance engineering
- scalable backend systems
- full-stack product development
- AI consulting + technical strategy
Alongside engineering-focused work, I’ve collaborated with startups and product teams on building practical, cost-efficient AI and software solutions.
My approach combines:
- deep technical understanding
- systems-level optimization
- product thinking
- business-aware engineering decisions
I enjoy bridging the gap between technical and non-technical teams — translating complex systems into scalable, usable, and commercially practical solutions.
This includes helping teams:
- optimize infrastructure costs
- choose efficient AI/ML architectures
- scale products pragmatically
- improve engineering workflows
- ship faster without sacrificing quality
- balance performance with maintainability

I’m comfortable working across different environments:
- early-stage startups
- hackathon teams
- fast-moving product groups
- research-oriented engineering teams
- enterprise-scale workflows
and across vastly different levels of technical complexity.
That can range from:
- building lightweight automations for small businesses
- designing AI agents for operational workflows
- creating internal productivity tools
- shipping full-stack MVPs quickly
- improving backend scalability
- optimizing cloud/resource usage
- designing efficient ML pipelines
- tuning GPU kernels for high-throughput inference
- optimizing Triton/CUDA workloads for LLM systems
- experimenting with systems-level performance engineering

I enjoy solving both ends of the spectrum: practical business problems that need clean execution, and deeply technical infrastructure problems that require low-level optimization and systems thinking.
My work ranges from low-level kernel optimization and distributed training experiments to building real-world applications, developer tools, and AI-powered products.
I enjoy understanding systems from the inside out — memory movement, scheduling, throughput, compiler behavior, kernel execution, and the engineering tradeoffs behind modern AI workloads.
This GitHub is essentially an active engineering workspace where I explore:
- GPU kernels + Triton/CUDA
- efficient deep learning systems
- training + inference infrastructure
- compiler-aware optimization
- AI-powered products
- distributed systems
- developer tooling
- experimental ML infrastructure

A lightweight deep learning framework built to understand tensor systems, autograd internals, and the foundations behind modern deep learning libraries.
focus areas:
- tensor abstractions
- automatic differentiation
- computational graphs
- backend execution mechanics
- educational systems design
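
To make the automatic differentiation and computational graph items concrete, here is a minimal sketch of scalar reverse-mode autodiff. It is not taken from the framework itself: the `Value` class and its attributes are hypothetical names chosen for illustration, and it only supports addition and multiplication.

```python
class Value:
    """Scalar node in a computational graph (hypothetical name, illustration only)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0                   # d(output)/d(this node), filled by backward()
        self._parents = parents           # nodes this value was computed from
        self._backward = lambda: None     # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph so each node is processed
        # only after every node that consumes it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
```

After `z.backward()`, `x.grad` holds `y + 1` and `y.grad` holds `x`, which matches differentiating `x * y + x` by hand. Accumulating with `+=` is what lets a value that is used more than once (here `x`) collect gradient from every path.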
Collection of kernel optimization experiments focused on maximizing GPU throughput and understanding low-level execution behavior.
includes work around:
- GEMM optimization
- memory coalescing
- occupancy tuning
- shared memory optimization
- tiling strategies
- warp-level execution
- benchmarking + profiling
recent work includes iteratively optimizing matrix multiplication kernels, raising measured GFLOPS through scheduling and memory-access improvements.
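
As a rough illustration of the tiling idea and of how GEMM throughput is usually reported, here is a pure-Python sketch. It is not code from these repositories: `matmul_tiled` and `gflops` are hypothetical helper names, and a real kernel would apply the same blocking in CUDA or Triton with tiles staged in shared memory.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the output in tile x tile blocks.

    Blocking keeps a small working set of a and b in use at a time, which is
    the same reuse pattern a GPU kernel exploits with shared memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate the partial product of one tile pair into c.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def gflops(m, n, k, seconds):
    """Throughput of an m x n x k GEMM: one multiply and one add per inner step."""
    return 2.0 * m * n * k / seconds / 1e9


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = matmul_tiled(a, b)
```

The `2 * m * n * k` operation count is the standard convention for GEMM benchmarks, so the same `gflops` formula applies to timings taken from a GPU kernel.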
An experimental AI assistant platform exploring real-time interactions, AI tooling, and product-scale system design.
focus areas:
- AI integration
- real-time workflows
- product engineering
- scalable architecture
- user-focused AI experiences
A collection of fast-built but ambitious projects exploring:
- AI applications
- full-stack systems
- developer tooling
- automation
- real-time platforms
- rapid product iteration
these projects helped shape my approach toward shipping quickly while maintaining strong engineering fundamentals.
Repositories exploring:
- distributed training
- TPU/GPU experimentation
- inference optimization
- scalable training workflows
- systems-oriented deep learning
- infrastructure-aware experimentation
built while experimenting with Kaggle TPUs, large-scale workloads, and efficient AI system design.
- Kaggle Grandmaster
- Harvard Edge Computing Lab
- IIT Guwahati (Class of 2028)
- compiler-aware ML optimization
- efficient transformer systems
- GPU kernel engineering
- distributed AI systems
- high-throughput inference
- scalable LLM infrastructure
- systems-level AI research
- performance benchmarking

languages → python, c++, cuda, javascript, typescript, c, R
ml/ai → pytorch, triton, tensorflow, jax
systems → cuda, distributed systems, gpu programming
backend → node.js, express.js, fastapi
frontend → react, next.js, svelte, sveltekit
infra → linux, docker, git, vercel, kubernetes
areas → ML systems, AI infra, performance engineering, AI engineering, ML engineering, frontier research

- portfolio → https://shashankt.vercel.app
- linkedin → https://linkedin.com/in/rocky0714
building systems that make AI workloads faster, scalable, and usable in the real world.