Researchers have developed a technique that can automatically accelerate certain types of computer programs while maintaining program accuracy. Their system accelerates programs that run in the Unix shell, a popular programming environment that was created 50 years ago and is still widely used today. Their method parallelizes these programs, which means it divides them into pieces that can be run concurrently on multiple computer processors. This allows programs to perform tasks like web indexing, natural language processing, and data analysis in a fraction of the time they previously took.
“These programs are used by a large number of people, including data scientists, biologists, engineers, and economists. They can now automatically accelerate their programs without fear of producing incorrect results,” says Nikos Vasilakis, research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
The system also makes it easier for programmers who create tools used by data scientists, biologists, engineers, and others. They don’t need to make any changes to their program commands to enable this automatic, error-free parallelization, adds Vasilakis, who chairs a committee of researchers from around the world who have been working on this system for nearly two years.
A decades-old problem
The PaSh system focuses on programs, or scripts, that run in the Unix shell. A script is a set of commands that tells a computer how to do something. Correct and automatic parallelization of shell scripts is a difficult problem that researchers have been attempting to solve for decades.
The Unix shell is still widely used because it is the only programming environment that allows a single script to contain functions written in multiple programming languages. Different programming languages are better suited for specific tasks or data types; if a developer uses the correct language, problem solving can be much easier.
“People also enjoy developing in different programming languages, so composing all these components into a single program is something that happens very frequently,” Vasilakis adds.
While the Unix shell allows for multilanguage scripts, its flexible and dynamic structure makes it difficult to parallelize these scripts using traditional methods. Parallelizing a program is typically difficult because some parts of the program rely on others. This determines the order in which components must be executed; if the order is incorrect, the program will fail.
When a program is written in a single language, developers have explicit information about the program's features and about the language itself, which allows them to determine which components can be parallelized. However, no comparable information is available for Unix shell scripts. Users can’t easily see what’s going on inside the components or extract data that would help with parallelization.
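As a rough illustration of what parallelizing a single component involves, the Python sketch below splits an input into chunks, processes the chunks concurrently, and merges the partial results. It is not taken from PaSh; the function names and the word-counting task are assumptions chosen for illustration. It uses threads to keep the example short, whereas a real multi-core speedup for this kind of work in Python would typically call for separate processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """One illustrative pipeline stage: count lowercase words in a list of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(lines, n_chunks):
    """Divide the input into roughly equal contiguous pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words_parallel(lines, n_workers=4):
    """Run count_words on each chunk concurrently, then merge the partial counts."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adding counts is safe in any order
    return total


lines = ["the quick brown fox", "The lazy dog", "the end"]
# The parallel version must agree with the original, sequential one.
assert count_words_parallel(lines) == count_words(lines)
```

Splitting works here because word counts from separate chunks can simply be added together. A component whose output depends on the whole input at once, or on the order in which earlier components ran, cannot be divided this way without extra care, which is the difficulty the article describes.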
A just-in-time solution
To overcome this problem, PaSh uses a preprocessing step that adds simple annotations to program components that it identifies as potentially parallelizable. Then PaSh attempts to parallelize those parts of the script while the program is running, at the exact moment it reaches each component. This avoids another problem in shell programming: it is impossible to predict the behavior of a program ahead of time.
By parallelizing program components “just in time,” the system avoids this issue. It is able to effectively speed up many more components than traditional methods that try to perform parallelization in advance.
Just-in-time parallelization also ensures the accelerated program still returns accurate results. If PaSh arrives at a program component that cannot be parallelized (perhaps it is dependent on a component that has not run yet), it simply runs the original version and avoids causing an error.
“No matter the performance benefits — if you promise to make something run in a second instead of a year — if there is any chance of returning incorrect results, no one is going to use your method,” Vasilakis says.
Users don’t need to make any modifications to use PaSh; they can just add the tool to their existing Unix shell and tell their scripts to use it.
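The run-or-fall-back decision described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not PaSh's implementation: the `Component` class, the `splittable` flag (standing in for an annotation added during preprocessing), and `run_just_in_time` are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Component:
    name: str
    run: Callable[[List[str]], List[str]]  # the original, sequential behavior
    splittable: bool = False               # hypothetical annotation from preprocessing


def run_just_in_time(component, lines, n_workers=4):
    """Decide at the moment of execution whether to parallelize this component."""
    if not component.splittable or len(lines) < n_workers:
        # Not known to be safe (or not worth splitting): run the original unchanged.
        return component.run(lines), "original"
    size = -(-len(lines) // n_workers)  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(component.run, chunks))  # map preserves chunk order
    return [line for part in parts for line in part], "parallel"


# A per-line transformation can be split; line numbering depends on the whole input.
to_upper = Component("to_upper", lambda ls: [l.upper() for l in ls], splittable=True)
number_lines = Component(
    "number_lines", lambda ls: [f"{i} {l}" for i, l in enumerate(ls, 1)]
)

data = ["alpha", "beta", "gamma", "delta", "epsilon"]
for comp in (to_upper, number_lines):
    result, mode = run_just_in_time(comp, data)
    assert result == comp.run(data)  # either path matches the original output
```

In this sketch the unannotated `number_lines` component always takes the original path, so its output is never put at risk; only `to_upper` is split across workers.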
Acceleration and accuracy
PaSh was tested on hundreds of scripts, ranging from classical to modern, and it did not break a single one. When compared to unparallelized scripts, the system was able to run programs six times faster on average, with a maximum speedup of nearly 34 times. It also increased the speed of scripts that other approaches could not parallelize.
“Our system is the first to demonstrate this type of completely correct transformation, but there is an additional benefit. Because of the way our system is designed, other researchers and users in industry can build on top of it,” Vasilakis explains.
He is eager to hear more from users and see how they improve the system. Last year, the open-source project became a member of the Linux Foundation, making it widely available to users in industry and academia.
Moving forward, Vasilakis intends to use PaSh to address the problem of distribution, which involves dividing a program to run on multiple computers rather than multiple processors within a single computer. He also wants to make the annotation scheme more user-friendly and capable of describing complex program components.