Muhammad Hilman Beyri (mbeyri), Zixu Ding (zixud)

Final Report


We have developed a cloud-based hybrid CPU (SIMD)/GPU interactive ray-tracing demo, using an adaptive load-balancing algorithm that partitions work among nodes without knowing their computing power in advance.

Our goal is to combine multiple levels of parallelism: CPU multi-threading, SIMD, GPU (CUDA), and distributed computing across multiple nodes, to maximize each node's performance and make the ray-tracing application run in real time.


We started by hand-crafting a single-threaded pool-table ray-tracing demo, then parallelized it with CUDA and with CPU multi-threaded SIMD. The CUDA version achieves a 181.6x speedup; the SIMD version's speedup is still being measured. However, because the two speedups on the same node differ, and the ratio of GPU to CPU computing power also varies across nodes, we decided to implement a general heterogeneous load-balancing algorithm, which estimates a computing device's overall response time as a function of workload: y = f(x). Different choices of f yield different load-balancing algorithms. We give details of the algorithms we have tried in the following section.
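For the simplest choice of f, a linear model y = a + b*x, the per-device fit can be done by ordinary least squares over recent (workload, response time) samples. The sketch below illustrates that idea; the names and the sample-window policy are our own modeling choices, not a fixed part of the algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear response-time model y = a + b*x for one device:
// a is the fixed cost per frame, b is the time per unit workload.
struct LinearModel {
    double a = 0.0;
    double b = 0.0;
    double predict(double x) const { return a + b * x; }
};

// Fit y = a + b*x by ordinary least squares over the sampled
// (workload, response time) pairs from previous frames.
LinearModel fitLeastSquares(const std::vector<double>& x,
                            const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    LinearModel m;
    m.b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    m.a = (sy - m.b * sx) / n;
    return m;
}
```

A non-linear f (for example, a step at the point where a GPU saturates) would be fit differently, which is what distinguishes the load balancers described below.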

Load Balancer

We created several load-balancing algorithms.


In computer graphics, ray tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of its encounters with virtual objects. As such, ray tracing can simulate the effects necessary to create photorealistic images. However, high-quality ray tracing can take a very long time to render, and is not suitable for real-time applications.

Essential operations in ray tracing, such as intersection tests, transformations, vector math, and shading, are all data-parallel and therefore SIMD friendly. Making use of SIMD hardware, whether a GPU or CPU SIMD instructions, can immensely improve the speed of ray tracing.
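As an illustration of why these operations map well to SIMD, the sketch below tests one ray against an array of spheres laid out in structure-of-arrays form: every loop iteration is independent and touches contiguous memory, so a compiler's auto-vectorizer, explicit intrinsics, or one-thread-per-ray CUDA code can all exploit it. The layout and names here are illustrative, not our actual kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Structure-of-arrays sphere storage: each field is contiguous in memory,
// which keeps the intersection loop below data-parallel and SIMD friendly.
struct Spheres {
    std::vector<float> cx, cy, cz, r;
};

// For one ray (origin o, unit direction d), test every sphere and return
// the index of the nearest hit, or -1 if the ray misses everything.
int nearestHit(const Spheres& s,
               float ox, float oy, float oz,
               float dx, float dy, float dz) {
    int best = -1;
    float bestT = INFINITY;
    for (std::size_t i = 0; i < s.r.size(); ++i) {
        float lx = s.cx[i] - ox, ly = s.cy[i] - oy, lz = s.cz[i] - oz;
        float tca = lx * dx + ly * dy + lz * dz;  // projection onto ray
        float d2 = lx * lx + ly * ly + lz * lz - tca * tca;
        float r2 = s.r[i] * s.r[i];
        if (d2 > r2) continue;                    // ray misses this sphere
        float thc = std::sqrt(r2 - d2);
        float t = tca - thc;                      // near intersection
        if (t < 0) t = tca + thc;                 // origin inside the sphere
        if (t >= 0 && t < bestT) { bestT = t; best = static_cast<int>(i); }
    }
    return best;
}
```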

Many computers nowadays have both a CPU and a GPU installed, yet much of the time one of them is idle or poorly utilized. To extract maximum performance from a node, using both the CPU and the GPU is definitely better than using either alone.

With the advent of cloud computing, it is possible to access massive computing resources without actually purchasing them. Even GPU computing instances have become available in the past few years. Therefore, with enough cloud instances, real-time rendering is possible without sacrificing the quality.

Load balancing across heterogeneous nodes can be difficult, largely because network latency and GPU performance differ among nodes. To balance the overall latency, which we estimate as network latency + workload * rendering time per unit workload, we gather rendering times and latencies from worker nodes, use an adaptive linear-regression algorithm to predict each node's network latency and per-unit rendering time for the next frame, and then assign each node's workload accordingly.
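One way to turn those per-node predictions into an assignment, sketched below under the linear model from the estimate above, is to pick workloads x_i that sum to the total W and make every node's predicted latency a_i + b_i * x_i equal. Solving sum_i (L - a_i) / b_i = W gives the common latency L, and then each x_i. This is illustrative: it assumes L > a_i for every node (a node with a_i >= L should simply receive no work), and the names are our own.

```cpp
#include <cassert>
#include <vector>

// Predicted latency of node i for x rows of work: y = a + b * x.
struct NodeModel {
    double a; // predicted network latency (fixed cost)
    double b; // predicted rendering time per row
};

// Split W rows among the nodes so that every node's predicted latency is
// the same value L, where L solves sum_i (L - a_i) / b_i = W.
std::vector<double> balance(const std::vector<NodeModel>& nodes, double W) {
    assert(!nodes.empty());
    double invB = 0, aOverB = 0;
    for (const NodeModel& n : nodes) {
        invB += 1.0 / n.b;
        aOverB += n.a / n.b;
    }
    double L = (W + aOverB) / invB;        // equalized per-frame latency
    std::vector<double> x;
    for (const NodeModel& n : nodes)
        x.push_back((L - n.a) / n.b);      // rows assigned to this node
    return x;
}
```

In practice the real workloads are integer row counts, so the results would be rounded while keeping the total equal to W.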

The architecture of this project is simple. Worker nodes are responsible for rendering: they take scene data and a work assignment (e.g., rows A through B) as input, and send back their rendered portion of the image, compressed. The local client divides the screen into tiles, sends the scene data and work allocations to the worker nodes, collects the image parts from all workers, and displays the frames. It also adjusts the work assignment each frame using the adaptive algorithm described above.
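The per-frame messages in that exchange might look like the structs below. This is only a sketch of the protocol's payloads; the field names, the row-range encoding, and the compression choice are illustrative rather than our final wire format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Client -> worker: which scanlines of the next frame this worker renders.
struct WorkAssignment {
    std::uint32_t frameId;
    std::uint32_t rowBegin;  // first scanline, inclusive
    std::uint32_t rowEnd;    // one past the last scanline
};

// Worker -> client: the rendered strip, plus timing data that feeds the
// adaptive load balancer's regression for the next frame.
struct RenderedTile {
    std::uint32_t frameId;
    std::uint32_t rowBegin;
    std::uint32_t rowEnd;
    double renderMillis;                         // measured rendering time
    std::vector<std::uint8_t> compressedPixels;  // e.g. compressed RGB rows
};
```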

Checkpoint Report April 19

The Challenge




We have taken 15-662 Computer Graphics and developed a CPU ray tracer. We will write a single-threaded, hard-coded version of the pool scene, and parallelize it using both CUDA and multi-threaded SIMD. During development, we will test on GHC cluster machines with GPUs. We will also make sure it works on Amazon EC2, and will run the demo game there.

Goals and Deliverables

We plan to achieve:

We hope to achieve:

Platform Choice

We are using C++ and CUDA as our main programming languages. The client will be a thin application using SDL and OpenGL, and will communicate with the worker nodes over a network protocol that we are going to design. We assume the worker nodes have GPUs that support CUDA.