CHICAGO, June 18 (Xinhua) -- Scientists working in Xian-He Sun's group at the Illinois Institute of Technology (IIT) have recently developed a cross-platform Hadoop reader known as PortHadoop, which enables data to flow directly between the Hadoop Distributed File System (HDFS) and parallel file systems (PFS).
"PortHadoop, the system we developed, moves the data directly from the parallel file system to Hadoop's memory instead of copying from disk to disk," said Xian-He Sun, distinguished professor of computer science at the IIT.
Moving data between HDFS and PFS has long been difficult; scientists who wanted to make the best use of Hadoop analytics had to copy the data from PFS onto HDFS disks first.
According to Sun, PortHadoop uses the concept of "virtual blocks," which map data from PFS directly into Hadoop memory by creating a virtual HDFS environment.
These "virtual blocks" reside in the centralized namespace in HDFS NameNode. The HDFS MapReduce application cannot see the "virtual blocks"; a map task triggers the MPI file read procedure and fetches the data from the remote PFS before its Mapper function processes its data. In other words, a dexterous slight-of-hand from PortHadoop tricks the HDFS to skip the costly input/output operations and data replications it usually expects.
Sun sees PortHadoop as a consequence of scientists' strong desire to merge high-performance computing with cloud computing. As traditional scientific computing merges with big data analytics, "it creates a bigger class of scientific computing that is badly needed to solve today's problems," Sun said.