Big Data Networks Connect Higher Education Researchers
Mon, 07/30/2018 – 11:19
Academic research might be considered the quiet corner of campus, but these days it’s humming, powered by high-performance computing and robust networks.
At Harvard University, Research Computing has grown from a three-person life sciences team to a 23-person department supporting about 1,800 researchers. As the department has swelled, so has its supporting infrastructure, particularly the networks that make it possible for teams to share data-intensive workloads.
A decade ago, Harvard had an on-campus data center and a colocated backup facility in Boston. In 2012 it opened the Massachusetts Green High Performance Computing Center (MGHPCC) in partnership with the Massachusetts Institute of Technology, Northeastern University, Boston University and the University of Massachusetts, as well as the state, Cisco Systems and EMC.
“With three data centers, networking becomes a big focus,” says Scott Yockel, Harvard’s director of research computing. “Networking can be a limiting factor to creating good collaboration because you have to move the data. Otherwise, you end up with small islands of storage and compute without the economy and advantages of aggregating data.”
Big Networks Can Serve More Than an Academic Purpose
The benefits of a big network go beyond academic collaboration. Research networks can integrate with enterprise networks, attract grant dollars and serve as a bellwether for emerging IT trends.
Often, the need to accommodate a specific use case will spur innovations that ultimately benefit the campus as a whole.
“Research networking gives us a way to look into the future at tools and use cases that are going to become common on our business networks,” says Jerry Sheehan, vice president and CIO at Montana State University.
He cites software-defined networking as an example. The research community has been using open-source tools for a few years to replace networking hardware with a software layer for greater flexibility and lower costs.
Now, vendors such as Cisco are making SDN available in their hardware and software stacks, Sheehan says. In fact, MSU had one of the first deployments of Cisco’s Software-Defined Access, an intent-based networking solution.
Sheehan also points to the National Science Foundation Network created in the 1980s to serve research and education, which ultimately became a foundation for the internet. “The only reason there is an internet is because of NSFNET,” he says.
Teamwork Yields New Insights into Data Management
The benefits of research networks aren’t limited to providing access to advanced cyberinfrastructure. They can also facilitate the sharing of valuable data-intensive materials with distant colleagues.
A few years ago, a new MSU faculty member complained to IT about the “terrible” campus network — a surprising criticism since he had been using only a fraction of its 30-gigabit capacity.
A climate modeler working on the United Nations Intergovernmental Panel on Climate Change, he had connected an external hard drive to his MacBook Pro and set up an FTP server so that a colleague in England could access terabytes of data.
The IT department, however, blocked what appeared to be a denial-of-service attack. “We weren’t sure why the port was open and why that much data was moving,” says Sheehan.
Once IT staff understood the researcher’s mission, they used the Department of Energy’s Science DMZ design, a dedicated network environment, to allow research data to move freely past the firewall.
“His expectation was the network can do anything,” Sheehan says. “That view isn’t wrong, but it’s the minority. Most researchers don’t know what the network is capable of and, because of that, they don’t think about the network as enabling their research.”
Creating a central location to aggregate data is key to producing valuable research, says Scott Yockel. Photo: Ken Richardson
Sheehan set out to change that. When speaking with a physics professor about Einstein’s Gravity Playlist, an educational planetarium show the professor created with the Museum of the Rockies, Sheehan’s team learned that when other planetariums expressed interest, the physicist shared the show by mailing copies on hard drives. “He said that’s how everyone does it,” recalls Sheehan.
In search of a better way to facilitate the exchange, IT created a pilot project using Globus, middleware developed by the University of Chicago and Argonne National Laboratory for large file transfers. That let the physicist use the network to share Einstein’s Gravity Playlist with the University of Illinois and Adler Planetarium in Chicago.
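For readers curious about what such transfer middleware handles, here is a simplified sketch of one of its core tasks: copying a large file in chunks and verifying integrity with an end-to-end checksum. This is an illustrative stand-in, not the Globus protocol itself, which adds restartable, parallel, authenticated transfers between managed endpoints.

```python
import hashlib

def transfer_with_checksum(src_path: str, dst_path: str, chunk_mb: int = 64) -> bool:
    """Copy src to dst in chunks, then verify the destination's
    SHA-256 digest against the source's. Returns True on a match."""
    h_src = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in chunks so terabyte-scale files never sit in memory.
        while chunk := src.read(chunk_mb * 1024 * 1024):
            h_src.update(chunk)
            dst.write(chunk)
    # Re-read the destination to confirm what actually landed on disk.
    h_dst = hashlib.sha256()
    with open(dst_path, "rb") as f:
        while chunk := f.read(chunk_mb * 1024 * 1024):
            h_dst.update(chunk)
    return h_src.hexdigest() == h_dst.hexdigest()
```

Doing this by hand for every collaborator is exactly the chore that middleware automates across institutional boundaries.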
He told others about the pilot, and within weeks, almost a dozen institutions had expressed interest in getting on board. “So, we’re using the network to do something fundamentally different than what they thought had been possible before,” Sheehan says.
He got a similar reception from MSU faculty after creating a workflow to use the research network to back up their data at the University of Texas at Austin’s Texas Advanced Computing Center, an NSF-funded national HPC center.
“We could move it at no cost to them at very high speeds, protecting their research by creating archival copies offsite,” says Sheehan.
Update Workstations to Avoid Networking Challenges
At Harvard, the biggest hurdle in research networking isn’t transferring data from one facility to another. Rather, it’s moving data between instrument-attached workstations and the data center, a challenge rooted in the limited network backbones of the campus’s many historic buildings.
Another challenge is that researchers tend to have little control over changing or upgrading the workstations attached to the instruments, because the workstations are deployed and sometimes maintained by the instrument vendors, many of which don’t offer timely updates.
The instruments and workstations have an average 10-year lifecycle, and their outdated operating systems often limit them to older, slower transfer protocols.
“The data can originate from instruments that need to be close to the research lab, but that data needs to flow out and be aggregated into some centralized place to produce valuable research,” says Yockel. “The network is underpinning all of that.”
Upgrading the instruments and workstations can result in a tenfold increase in data, he explains. “So what worked on 1 gig just fine all of a sudden is producing more data than you can transfer on that type of network.”
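Yockel’s point is easy to see with back-of-envelope arithmetic. The figures below are illustrative (dataset sizes and a 70 percent link-efficiency assumption are hypothetical, not Harvard’s actual workloads):

```python
def transfer_hours(data_gb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move data_gb gigabytes over a link_gbps link,
    assuming only a fraction (efficiency) of line rate is achieved."""
    gigabits = data_gb * 8  # bytes to bits
    return gigabits / (link_gbps * efficiency) / 3600

# A 100 GB run on a 1 Gbps building link: well under an hour.
before = transfer_hours(100, 1.0)
# A tenfold jump in instrument output on the same link: hours per run.
after = transfer_hours(1000, 1.0)
# Upgrading the backbone to 10 Gbps restores the original timing.
upgraded = transfer_hours(1000, 10.0)

print(f"{before:.2f} h, {after:.2f} h, {upgraded:.2f} h")
```

The tenfold data increase translates directly into a tenfold transfer time until the link itself is upgraded.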
The solution might be to change the transfer protocol or upgrade the workstation or even an entire building’s network backbone, Yockel adds. Harvard isn’t alone. Institutions around the country struggle to accommodate aging instruments and computers with software that’s no longer vendor supported.
Traditionally, MSU required those machines be off the campus network because of their vulnerabilities, says Sheehan. That meant that researchers had to find new ways to move data.
Some institutions permit outdated instruments and workstations to remain on the network but install firewalls to segment the networks and contain outdated equipment. One way to simplify that process is software-defined networking.
Rather than install physical firewalls, administrators can use software to create a private network — lowering costs and increasing flexibility. It’s one of many illustrations of the fine line between research networking and IT.
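The idea can be modeled simply: a software controller holds a policy table mapping device groups to the destinations they may reach, and aging instrument workstations land in a restricted segment. The group names and rules below are hypothetical illustrations, not any vendor’s actual SDN API.

```python
# Hypothetical segmentation policy: each group maps to the set of
# destination groups its traffic is allowed to reach.
POLICY = {
    "campus":      {"campus", "datacenter", "internet"},
    "instruments": {"datacenter"},  # legacy workstations: data center only
    "datacenter":  {"campus", "datacenter", "instruments"},
}

def allowed(src_group: str, dst_group: str) -> bool:
    """Would the controller program a path from src_group to dst_group?"""
    return dst_group in POLICY.get(src_group, set())

# An unpatched instrument workstation can still push data to storage...
print(allowed("instruments", "datacenter"))  # True
# ...but cannot reach the open internet, containing its vulnerabilities.
print(allowed("instruments", "internet"))    # False
```

Changing a segment’s reach becomes a one-line policy edit rather than a firewall installation.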
“These efforts become a microcosm of us understanding the future of what’s going to happen across all of our networks, not just those that are purposed for research,” Sheehan says.
High-Performance Computing Brings in the Money
Although the rewards of HPC can be high, so can the costs. There are ways, however, to lower the price tag on research networking. For example, collaborating with partners on a colocated facility makes sense, says Yockel. As a nonprofit, MGHPCC operates more cost-effectively than a private cloud data center. The partners also saved money by locating the facility about 90 miles away in Holyoke, an old mill town that offers hydroelectric power at about half the price of energy in Boston.
State incentives to bring business to the economically depressed town further lowered the project cost. And, Yockel adds, “we’re doing this in Massachusetts, which is not a cheap place.”
Once established, MGHPCC became a catalyst for multi-university grants, including a $4 million project to create large-scale storage that all the partner institutions could use. “We would never be able to do that if we were a single institution,” says Yockel.
The collaboration extends to the network. Harvard, BU and MIT are also members of the Northern Crossroads (NoX) consortium, which operates a metro ring, a network that runs from Harvard Square to BU, past the Longwood Medical area, through downtown and MIT, then back to Harvard. It also has paths to New York City and Holyoke.
Researcher collaborations can even further the practice of HPC itself. For instance, the Society of HPC Professionals is building databases in medicine and in oil and gas to serve as training data sets.
The goal is to help users learn research computing and expose them to the fundamentals of machine learning so an organization doesn’t have to create an 800TB data set for beginners, says Executive Director Gary Crouse.
Around the country, efforts are underway to expand the reach of research networks and connect stakeholders. The Pacific Research Platform, for example, connects the research networks at institutions along the West Coast.
This summer, MSU will host the second annual workshop on the National Research Platform, a scalable research network that will connect institutions around the country to help advance data-intensive science.
“This is a social and cultural change that’s going to take efforts that build off our regional networks to move this forward,” says Sheehan.
How to Design for Optimal Performance
Who’s using the network? For what? How much bandwidth are projects consuming?
The answers to such questions can reveal how networks are used and uncover issues before they hamper performance.
The University at Buffalo, with funding from the National Science Foundation, developed software to monitor high-performance computing resources. XD Metrics on Demand (XDMoD) runs in all NSF-funded HPC centers, where it collects utilization and job-level performance data and reports it to NSF, explains Thomas Furlani, director of the university’s Center for Computational Research. There’s also an open-source version used by a few hundred academic and commercial HPC centers worldwide.
In addition to gathering data such as CPU capacity, disk I/O rates and cache, XDMoD runs application kernels daily to measure quality of service. “Since we run them every day, when something gets out of sync, we can tell that the performance has dropped and ask what happened between then and now to cause the system performance to be poor,” Furlani says. “We don’t want to wait for users to notice problems. This allows us to find them before the canaries in the coal mine.”
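A minimal sketch of that daily quality-of-service idea: run a benchmark kernel each day, keep a rolling baseline, and flag days whose runtimes drift well above it. The window, threshold and runtimes here are illustrative assumptions, not XDMoD’s actual internals.

```python
from statistics import mean, stdev

def flag_regressions(daily_runtimes, window=7, threshold=3.0):
    """Return indices of days whose runtime exceeds the rolling mean
    of the prior `window` days by `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(daily_runtimes)):
        history = daily_runtimes[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Only flag when there is real variance to measure against.
        if sigma > 0 and daily_runtimes[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged

# Ten days of stable kernel runtimes, then a slowdown on day 10.
runtimes = [101, 99, 100, 102, 98, 100, 101, 99, 100, 102, 160]
print(flag_regressions(runtimes))  # [10]
```

The point is catching the drop the day it happens, before any user files a ticket.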
Melissa Delaney is a freelance journalist who specializes in business technology. She is a frequent contributor to the CDW family of technology magazines.