Rocky Shashank-Tripathi-07

Shashank Tripathi

building ML systems, optimizing GPU workloads, and experimenting across AI infrastructure, performance engineering, and scalable software systems.

currently focused on Triton/CUDA optimization, distributed training systems, efficient inference, and systems-aware deep learning.

About

I work across the intersection of:

ML systems
GPU programming
AI infrastructure
performance engineering
scalable backend systems
full-stack product development
AI consulting + technical strategy

Alongside engineering-focused work, I’ve collaborated with startups and product teams on building practical, cost-efficient AI and software solutions.

My approach combines:

deep technical understanding
systems-level optimization
product thinking
business-aware engineering decisions

I enjoy bridging the gap between technical and non-technical teams — translating complex systems into scalable, usable, and commercially practical solutions.

This includes helping teams:

- optimize infrastructure costs
- choose efficient AI/ML architectures
- scale products pragmatically
- improve engineering workflows
- ship faster without sacrificing quality
- balance performance with maintainability

I’m comfortable working across different environments:

early-stage startups
hackathon teams
fast-moving product groups
research-oriented engineering teams
enterprise-scale workflows

and across vastly different levels of technical complexity.

That can range from:

- building lightweight automations for small businesses
- designing AI agents for operational workflows
- creating internal productivity tools
- shipping full-stack MVPs quickly
- improving backend scalability
- optimizing cloud/resource usage
- designing efficient ML pipelines
- tuning GPU kernels for high-throughput inference
- optimizing Triton/CUDA workloads for LLM systems
- experimenting with systems-level performance engineering

I enjoy solving both ends of the spectrum: practical business problems that need clean execution, and deeply technical infrastructure problems that require low-level optimization and systems thinking.

My work ranges from low-level kernel optimization and distributed training experiments to building real-world applications, developer tools, and AI-powered products.

I enjoy understanding systems from the inside out — memory movement, scheduling, throughput, compiler behavior, kernel execution, and the engineering tradeoffs behind modern AI workloads.

This GitHub is essentially an active engineering workspace where I explore:

- GPU kernels + Triton/CUDA
- efficient deep learning systems
- training + inference infrastructure
- compiler-aware optimization
- AI-powered products
- distributed systems
- developer tooling
- experimental ML infrastructure

selected repositories

TinyTorch [Built in Harvard's CS249r repository]

A lightweight deep learning framework built to understand tensor systems, autograd internals, and the foundations behind modern deep learning libraries.

focus areas:

tensor abstractions
automatic differentiation
computational graphs
backend execution mechanics
educational systems design
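
To make the autograd and computational-graph items concrete, here is a minimal sketch of scalar reverse-mode automatic differentiation. It is not TinyTorch's actual API: the `Value` class and every method on it are hypothetical names chosen for illustration, and only addition and multiplication are supported.

```python
class Value:
    """Scalar node in a computational graph (hypothetical, not TinyTorch's API)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # replaced by the op that creates this node

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order so each node runs after all of its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


x = Value(3.0)
y = Value(4.0)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
z.backward()
print(z.data, x.grad, y.grad)  # 15.0 5.0 3.0
```

Gradients are accumulated with `+=` so that a value used more than once in the graph (here `x`) receives the sum of the contributions from every path.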

Triton + CUDA Optimization [Current Focus]

Collection of kernel optimization experiments focused on maximizing GPU throughput and understanding low-level execution behavior.

includes work around:

GEMM optimization
memory coalescing
occupancy tuning
shared memory optimization
tiling strategies
warp-level execution
benchmarking + profiling

recent work includes iteratively optimizing matrix multiplication kernels, raising measured GFLOPS through scheduling and memory-access improvements.
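
As a rough illustration of the tiling and benchmarking ideas above, the sketch below shows the loop structure of a blocked matrix multiply and the usual GFLOPS formula for GEMM. It is plain Python, not one of the Triton/CUDA kernels from the repositories; `matmul_tiled`, `gflops`, and the `tile` parameter are hypothetical names, and the numbers passed to `gflops` are made up for the example, not measurements.

```python
def matmul_tiled(a, b, tile=2):
    """Blocked matrix multiply over lists of lists.

    On a GPU each (i0, j0) tile would map to a thread block and the
    k0 loop would stage sub-tiles of A and B through shared memory;
    here the loops only show the access pattern.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate the contribution of one tile of A and one tile of B.
                for i in range(i0, min(i0 + tile, m)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = row_a[kk]
                        row_b = b[kk]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def gflops(m, n, k, seconds):
    """GEMM does 2*m*n*k floating-point operations (one multiply and one add per term)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (2.0 * m * n * k) / seconds / 1e9


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
c = matmul_tiled(a, b, tile=2)
print(c)  # [[58.0, 64.0], [139.0, 154.0]]

# Illustrative numbers only: a 4096 x 4096 x 4096 GEMM finishing in 0.1 s.
rate = gflops(4096, 4096, 4096, 0.1)
print(round(rate, 2))
```

Reporting GFLOPS alongside the problem size and data type makes kernel comparisons meaningful, since the same kernel can land at very different rates for small or oddly shaped matrices.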

Turing — AI Integrated Real-Time Assistant [Individual project on Agents and System Automation]

An experimental AI assistant platform exploring real-time interactions, AI tooling, and product-scale system design.

focus areas:

AI integration
real-time workflows
product engineering
scalable architecture
user-focused AI experiences

Hackathon + Product Projects

A collection of fast-built but ambitious projects exploring:

AI applications
full-stack systems
developer tooling
automation
real-time platforms
rapid product iteration

these projects helped shape my approach toward shipping quickly while maintaining strong engineering fundamentals.

ML Systems Experiments

Repositories exploring:

distributed training
TPU/GPU experimentation
inference optimization
scalable training workflows
systems-oriented deep learning
infrastructure-aware experimentation

built while experimenting with Kaggle TPUs, large-scale workloads, and efficient AI system design.

Experience

Kaggle Grandmaster
Harvard Edge Computing Lab
IIT Guwahati (Class of 2028)

Engineering interests

- compiler-aware ML optimization
- efficient transformer systems
- GPU kernel engineering
- distributed AI systems
- high-throughput inference
- scalable LLM infrastructure
- systems-level AI research
- performance benchmarking

Tech stack

languages     → python, c++, cuda, javascript, typescript, c, R
ml/ai         → pytorch, triton, tensorflow, jax
systems       → cuda, distributed systems, gpu programming
backend       → node.js, express, fastapi
frontend      → react, next.js, svelte, sveltekit
infra         → linux, docker, git, vercel, kubernetes
areas         → ML systems, AI infra, performance engineering, AI engineering, ML engineering, frontier research

links

portfolio → https://shashankt.vercel.app
linkedin → https://linkedin.com/in/rocky0714

building systems that make AI workloads faster, more scalable, and usable in the real world.
