Five years ago Clemson University named James Bottum chief information officer and gave him the mandate to overhaul the school's IT infrastructure and build out a high performance computing environment. The goal: catapult the school into a leading research university and help attract faculty and students.
Mission accomplished. The South Carolina school is now among the top five non-federally funded university supercomputing sites. But just as importantly, the environment Bottum helped create is driving creative funding efforts, everything from attracting partners that want to use the high-performance computing (HPC) system to the sale of commercial software and new grants that benefit both the school and IT.
"Last year the Clemson president told us our best years of public sector funding from the state were most likely behind us because of the financial crisis, and we needed to rethink our business model," Bottum says. "The encouragement was to become entrepreneurial."
Fortunately, many of the changes Bottum's team made properly positioned Clemson for the new normal. The university has seen 180% growth in revenue from external sources, which helps supplement the school's IT budget, and a 250% increase in federal grants, part of which helps offset IT costs.
"The main goal is to continue to run and support a robust set of services and infrastructure for Clemson University," Bottum says, "but do it in a way where we can grow and leverage what we're doing and create a stronger set of infrastructure and services that also contributes to the state economic development."
Bottum has unique qualifications that are helping get it all done. He spent 20-plus years in the research sector, including a stint at the National Science Foundation, then 15 years at the National Center for Supercomputing Applications, and for the last 10 years he has been a CIO (at Purdue before this).
Bottum's team at Clemson has a lot of recent achievements to be proud of, but they also get to investigate leading-edge stuff, everything from the huge HPC grid to new OpenFlow tools and the school's own Orange File System. It's a rich environment.
When Bottum arrived at Clemson the school had 48 IT groups, each of which had its own servers and storage and many of which ran their own networks.
"I saw a departmental IT person in a room with fans blowing on a server," he says. "All of the high-performance computing was in a little data center in the engineering science college. They had about six or seven clusters but didn't have enough juice to power them all up at the same time. It was a real belt and suspenders kind of operation, a cluster in the closet model."
A couple of other surprises: The university was buying commodity 100Mbps Internet service at a much-inflated price from local telecom companies, and the school had a large data center 10 miles off campus with expansion potential to 30,000 square feet. The former meant the university could make a big leap forward by joining Internet2, and the latter was going to make it easier to aggregate the IT operations and modernize.
While the initial funding for the overhaul would come from the school itself, the new HPC capabilities attracted new monies along the way and Clemson won many grants, including an NSF Research Infrastructure Improvement Award.
Job one was rehabbing the data center and the Information Technology Center, and aggregating most of the IT groups and resources. The building was 20-plus years old and was upgraded in two phases.
"We had 7,000 or 8,000 square feet of space, half a megawatt, and 20-something-year-old power and air conditioning when I got here," says CTO Jim Pepin, who came over from the University of Southern California (USC). "We went up to 2 megawatts and filled that up in less than two years as we consolidated operations and started to build our HPC cluster."
From left to right in front of the HPC cluster: Jay Harris, director of operations; Boyd Wilson, executive director of computing, systems and operations; Mike Cannon (front), data storage architect; Jim Pepin (back), CTO; Lanae Neild, HPC administrator; Becky Ligon, file system developer. (Photo by Zac Wilson)
The first phase ended in December 2007, and in the second phase, which was completed in December 2010, the data center space was built out to 16,000 square feet and split between two environments, one for enterprise gear -- everything from email and student systems to a mainframe to support the state's Medicaid system -- and the other for the HPC system, a 1,629-node Linux cluster. "So now we have two physically separate rooms with different air conditioning profiles and 4.5 megawatts," Pepin says.
Connectivity was increased from the 100Mbps connection serving the university to multiple 10G fiber wavelengths to Charlotte, N.C., and Atlanta, which are used to access Internet2 and link to partners and other universities. "We're also building out multiple 10G wavelengths around the state," Pepin says. Together these links -- and access to the National LambdaRail -- enable Clemson to connect to national infrastructure, allow other state institutions to access Internet2 through Clemson, and provide nationwide access to the Clemson HPC cluster and other collaborative resources.
The school also now has two gigabit connections on the National Higher Education Network to Pepin's former employer, USC, where Clemson has three racks of backup gear for disaster recovery. "No money changes hands, but I have rack space in California and they have rack space here and it makes their data center look like an extension of mine and vice versa," Pepin says. "That's the model we're looking at building, where the network is the basic building block of how we can connect these things together."
Demand for HPC
The cluster -- what the group sometimes refers to as a cloud -- is one of the crown jewels.
"We're not building some generic Joni Mitchell cloud," Pepin says. "Not some vanilla, virtualized, blah, blah, blah. There's all of that stuff inside, but it's much more comprehensive, it's a much richer texture than that. We're building a cloud that is really infrastructure and services so we can actually do science with national labs and other people in the state."
The massive 1,629-node cluster is a combination of Dell, IBM, HP and Sun gear (mostly four FLOPs Intel/AMD architecture). Each node is a physical server with two sockets holding quad-core processors, meaning eight cores per device, and the cluster has a total count of 14,304 server cores.
Nodes are interconnected using a combination of 88 10G Ethernet ports from Arista and Cisco, and 3,008 ports of low-latency 10G Myrinet network technology from Myricom. Four 16-port, 4Gbps QLogic Fibre Channel switches are used to support storage needs.
The servers aren't virtualized because the jobs supported are typically numerically intensive and very high performance. "So this is more of a grid than a cloud," Pepin says. "We call it a cloud because it's the shared resources model, but we run it like a grid you would see at one of the national labs."
All told, the cluster, with its latest nodes, will benchmark at above 100 trillion floating-point operations per second (100 teraflops), making it about 90th on the list of the fastest supercomputers in the world.
The open source Maui Cluster Scheduler is used to allocate cluster resources -- which are allotted by the cores required -- but some users are guaranteed access to specific resources at specific times in condominium fashion.
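The allocation model described above, with cores as the unit of allocation and some capacity held back for specific users, can be illustrated with a small sketch. This is not Maui's configuration syntax or its scheduling algorithm; the class name `CorePool` and the owner names are hypothetical, and time windows for the reservations are left out to keep the idea visible.

```python
from dataclasses import dataclass, field


@dataclass
class CorePool:
    """Toy model of a cluster whose unit of allocation is the core."""
    total_cores: int
    # Hypothetical condo reservations: owner -> cores held back for that owner.
    reserved: dict = field(default_factory=dict)
    # Cores currently in use, tracked per user.
    in_use: dict = field(default_factory=dict)

    def _used_by(self, user):
        return self.in_use.get(user, 0)

    def available_to(self, user):
        """Cores this user could start on right now."""
        free = self.total_cores - sum(self.in_use.values())
        # Cores that other owners have reserved but are not yet using stay off-limits.
        held_for_others = sum(
            max(cores - self._used_by(owner), 0)
            for owner, cores in self.reserved.items()
            if owner != user
        )
        return max(free - held_for_others, 0)

    def submit(self, user, cores):
        """Start a job if enough cores are available; return True on success."""
        if cores <= 0 or cores > self.available_to(user):
            return False
        self.in_use[user] = self._used_by(user) + cores
        return True

    def release(self, user, cores):
        """Return cores to the pool when a job finishes."""
        self.in_use[user] = max(self._used_by(user) - cores, 0)
```

For instance, with a hypothetical 100-core pool and 40 cores reserved for one owner, a guest could start at most 60 cores while the reservation sits idle, and the owner's 40 cores would still be there when needed.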
Cluster usage has been tremendous, but Bottum had some trepidation going in. "One of the things I was afraid of was, if we spent this money and put up these capabilities, that nobody would come and use it," Bottum says.
Turns out he didn't need to worry. "In a state like South Carolina where no public institutions were on Internet2, if you build something like this you start attracting attention," Bottum says. "The one thing I did that you could construe as marketing was speak at a South Carolina IT Directors meeting in Charleston. They wanted to know what we were doing, so I threw out the idea of building a South Carolina cloud, an environment for shared services, and told them if they were interested to sign up at the door."
A half a dozen signed up. "We then went and we got some capital from various sources, including private and federal, and tried to stand this HPC thing up under the rubric of what we call the Cyber Institute. And that allowed us to have a neutral ground for bringing in researchers and other parties and not run this out of the IT organization. We were bootstrapping it out of IT but it gave us a way to think about it and not just break the backs of people who had more than full-time jobs to do. We now have about a dozen universities -- and even a high school -- that have allocations on high-performance computing."
Since then Clemson has held high-performance computing workshops around the state, many of which attract 70 or more people. "There's this sort of pent-up demand," Bottum says.
Today cluster utilization rates run at 80%-85% and often peak above 90%. "In the cluster world, this is incredible," Bottum says.
Clemson NOC: Used to monitor and control the local and wide area networks and the research, education and business computing systems, including the cluster. (Photo by Zac Wilson)
OrangeFS and OpenFlow
Of course the cluster is also core to a lot of work the university is doing, including development of a parallel virtual file system and work on OpenFlow, one of the highest-level projects to come out of the Global Environment for Network Innovations (GENI).
After trying several popular file systems for Clemson's cluster, researchers determined they needed higher performance and greater reliability, says Boyd Wilson, executive director of computing, systems and operations. The result: revival of development work on the open source Parallel Virtual File System (PVFS), now developed under the name OrangeFS, with the original architect, Clemson faculty member Walt Ligon. Ligon is working with a Clemson spin-off company called Omnibond that is providing commercial services for the file system.
In the Clemson cluster, OrangeFS is used to virtualize 32 commodity Dell storage servers while providing a single name space for the cluster nodes, Wilson says. Directory and file metadata are distributed on 1.6TB of solid state drives across the 32 storage nodes and there is a total of 256TB of raw rotational disk storage.
Unlike other high-performance file systems such as Lustre, which can only have a single metadata server, OrangeFS' distributed metadata approach and unified name space enable the file system to scale nicely while also simplifying operations, Wilson says.
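The contrast Wilson draws between a single metadata server and distributed metadata can be sketched in a few lines. The placement rule below, hashing a path to choose one of 32 servers, is an illustrative assumption and not OrangeFS's actual algorithm; the function names and example paths are hypothetical.

```python
import hashlib
from collections import Counter

NUM_STORAGE_SERVERS = 32  # matches the 32 storage servers cited in the article


def metadata_server_for(path, num_servers=NUM_STORAGE_SERVERS):
    """Pick the server that would hold the metadata for `path`.

    Hypothetical rule: hash the path and take it modulo the server count.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def metadata_load(paths, num_servers=NUM_STORAGE_SERVERS):
    """Count how many metadata entries land on each server."""
    return Counter(metadata_server_for(p, num_servers) for p in paths)


# Hypothetical workload: 10,000 files in one shared name space.
paths = [f"/scratch/job{i % 100}/output{i}.dat" for i in range(10_000)]

# Hashed placement spreads the metadata lookups across many servers ...
load = metadata_load(paths)

# ... whereas a single metadata server has to answer every one of them.
single_server_load = metadata_load(paths, num_servers=1)
```

Plain modulo hashing like this would reshuffle entries whenever a server is added or removed, so a production file system needs a more careful placement scheme; the sketch only shows why spreading metadata removes the single-server bottleneck.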
These capabilities may ultimately benefit enterprise computing environments. "With a unified name space across potentially hundreds of storage nodes, you can add and remove nodes as needed and customers won't notice their files moving or ever have to be pointed to a new storage location," Wilson says. "Your unstructured data stores can grow and resize and be redundant and you won't have all of these different little silos of data. So it holds some potential to become an enterprise computing solution a couple of years down the road."
One Clemson researcher, Sebastien Goasguen, is using OrangeFS to develop a cloud-based infrastructure that can launch and work with tens of thousands of cluster-based virtual machines at once. "It leverages OrangeFS by enabling you to have a shared high-performing file system between all cluster nodes," Wilson says.
Goasguen is collaborating with KC (Kuang-Ching) Wang to build software-defined networks between VMs and client machines using OpenFlow, "which represents a nice convergence point with the university's work on OpenFlow," Wilson says.
Clemson is one of seven collaborators with Stanford on the initial OpenFlow deployment. OpenFlow started out as a tool to facilitate network research by adding an open, centralized, software-defined layer of network routing, and it now promises to "change the whole way we think about networking," Wilson says. "A lot of people are realizing they would like more software-based control over their network infrastructure. ... You can do some really neat stuff."
For example, while it isn't too painful for Clemson to shift IP addresses from its main data center to a smaller center on campus because they share subnets, when you start doing that over long distances and with multiple locations, it becomes extremely difficult, Wilson says. OpenFlow should vastly simplify the task by allowing dynamic networks to be created and changed at the infrastructure level, but also at the application level, opening up significant opportunities for improvement in network flexibility and security.
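The software-defined control Wilson describes rests on a simple idea: a central controller installs match-and-action rules in a switch's flow table, and the switch forwards packets by looking them up. A minimal sketch of that lookup follows. It models the concept only, not the OpenFlow wire protocol or any controller's API, and the field names, action strings and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowRule:
    """One flow-table entry: match fields, a priority, and an action."""
    priority: int
    action: str                    # illustrative strings such as "output:1" or "drop"
    dst_ip: Optional[str] = None   # None acts as a wildcard
    src_ip: Optional[str] = None

    def matches(self, packet):
        """True if every non-wildcard field equals the packet's value."""
        return all(
            wanted is None or packet.get(name) == wanted
            for name, wanted in (("dst_ip", self.dst_ip), ("src_ip", self.src_ip))
        )


class FlowTable:
    """Toy switch flow table programmed by a central controller."""

    def __init__(self):
        self.rules = []

    def install(self, rule):
        self.rules.append(rule)
        # Keep highest priority first so lookup returns the highest-priority match.
        self.rules.sort(key=lambda r: r.priority, reverse=True)

    def remove(self, rule):
        self.rules.remove(rule)

    def lookup(self, packet):
        for rule in self.rules:
            if rule.matches(packet):
                return rule.action
        # Table miss: hand the packet to the controller for a decision.
        return "send_to_controller"
```

In this model, shifting a service from one data center to another amounts to the controller removing one rule and installing another that sends the same destination address out a different port, rather than renumbering hosts.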
While it is unclear when and if Clemson will be able to profit from work on OpenFlow, it is already profiting from OrangeFS and other software that is licensed through Omnibond Systems, Wilson says. For example, companies interested in OrangeFS can purchase a 10-server bundle from Omnibond with support for $45,000.
Other Clemson work that Omnibond licenses includes identity management tools (including drivers for Novell's Identity Manager) and even traffic vision technology that state transportation departments can use to help turn roadside video feeds into sensors.
While the license fees help offset Clemson IT costs, the work also helps attract and keep really good people, Wilson says.
As important as the HPC cluster is, if it goes down, "researchers understand that's the way life goes," says CTO Pepin. "If the enterprise side goes down, we get fired. It's a smaller portion of the computer electrical power but 90% of the pain, so we care deeply about it."
The enterprise side of the data center includes a mainframe that supports two major systems, the main Medicaid system for the state and the university's student information system, which includes financial aid and registration. "We're on the front end of a transition to a new Medicaid system based on MITA (the Medicaid Information Technology Architecture) and a student information system replacement project, so the mainframe will be gone in about five years," CIO Bottum says. The new systems will be based on redundant commodity hardware and virtual machines.
The rest of the enterprise infrastructure -- some 700 x86 boxes, mostly Dell and Sun with a little bit of IBM mixed in -- supports about 155 applications, including everything from email and payroll to the school's Blackboard course management system. Most of the machines are running Linux but there is a modest amount of specific-purpose Windows and some Unix. "Our direction is to move toward Linux," Pepin says.
Enterprise computing row (Photo by Zac Wilson)
"This is where we're looking at doing some cloudy things in the Joni Mitchell model," he says. "It will be more of what you traditionally think of as a cloud because we probably will go down the virtualization path for a large portion of it."
Clemson has more than 200 systems virtualized today, mostly to support smaller applications. "We're virtualized where it makes sense," Wilson says. "One of the problems with virtualization is, once you go down a path you're kind of stuck."
The team hopes to avoid that elephant trap by using Dell's Advanced Infrastructure Manager (AIM), which Wilson describes as an abstraction layer between the hardware and the services supported.
"AIM lets you manage the hardware behind VMware, and manage the VMs on top of VMware as well, so you have this view of your whole enterprise and you can mix and match resources," Wilson says.
One of the primary benefits: the ability to move applications between virtual and hardware-based environments, regardless of which virtualization tools are used. "If we need three more Blackboard instances we can spin that up on hardware," Wilson says, "and when things slow down, with a single reboot, shift those to virtual machines and use the hardware for something else. This is a really good product to manage your whole infrastructure and it gives you an exit strategy if you want to switch virtualization vendors."
AIM also represents Clemson's first serious dip into iSCSI. With AIM, the school can boot a host from a remote instance over an iSCSI link, then move that machine around virtualization platforms. "AIM solves all the driver problems," Wilson says. "If an instance crashes you can restart, or try to boot it on another box based on policy. Hands free."
Mike Cannon, data storage architect, says Clemson just brought in two iSCSI arrays and two new QLogic 9200 Fibre Channel switches to grow out the university's Fibre Channel network to 1,024 ports. The Fibre Channel network is split into two fabrics (with diverse paths) and spans both Clemson data centers.
"The storage network really needs to converge at some point," he says, "but we're not ready yet. Today we have a Fibre Channel network, we have our enterprise Ethernet network and we have the Myrinet network, which ties all of the high-performance computing nodes together. We also have a little bit of InfiniBand for testing."
Cannon says Hitachi storage systems are becoming the basic infrastructure the school is using on the enterprise side for both directly attached and VM cloud-type environments.
Mission critical resources are supported by Hitachi HDS AMS-2100 series arrays, Cannon says. "Prior to that we were using a product from another vendor that required considerable time to figure out how to properly lay out the array and segment sizes. And once we delivered that to the application, if we found out we made a mistake it was real complicated to go back and retrofit another array and move the data. Now we use Hitachi Dynamic Provisioning. Hitachi configures those for us when they deliver the array and if we need more I/O, we can much more easily add spindles. We weren't able to do that with our former vendor."
Long term, does the enterprise side of the house end up as one big Joni Mitchell cloud? "I think you'd have to end up there," Wilson says. "There will be pockets that aren't, but as you abstract your computing layer from the personas that run on it you can dynamically allocate hardware for various things. It gives you that flexibility. Virtualization is just a component of this."
Changing finance mix
One of the ways that Bottum and his team are funding all of these initiatives is through grants. Five years ago "the grant money didn't really exist," Bottum says. "And we're running about $5.5 million this year."
The majority of the grants are for specific faculty. Wilson and Ligon, for example, have grants for parallel virtual file system work. "It's usually almost a 50-50 split between what goes to the faculty and their departments and what goes into IT's account, so it's a nice healthy IT/faculty partnership," Bottum says.
The goal, of course, is to cover as many costs as possible. "Recognizing that Clemson is a public institution and the future of state funding is not clear, we are encouraged to become entrepreneurial. So the goal is to bring in funding in a way that doesn't detract from what we are doing for Clemson."
But there is only so far that you can take this model, Bottum says. "We're at a point now where I don't think any of this happens without a public/private partnership, where we really poke holes in our respective walls and reach inside the other one and start to maximize ways we collaborate. The private schools had to re-engineer themselves in the '90s and into the 2000s, and now it is the public schools' turn. I think the future is figuring out how we fill the gaps, how we take advantage of some of the opportunities."
This story, "Clemson IT team embraces call to be entrepreneurial" was originally published by Network World.