Gordon Brebner (Xilinx Labs), email: firstname.lastname@example.org ; Stephen Ibanez (Stanford University), email: email@example.com
Programmable networking hardware allows both the enhancement of legacy communication protocols (e.g., adding monitoring to traditional L2/L3 switching) and the development of new protocols to proceed at a much faster pace. To support programmability for network devices, P4 (www.p4.org) has been developed as a new programming language for describing how network packets should be processed on a variety of targets, ranging from general-purpose CPUs to NPUs, FPGAs, and custom ASICs. P4 was designed with three goals in mind: (i) protocol independence: devices should not “bake in” specific protocols; (ii) field reconfigurability: programmers should be able to modify the behavior of devices after they have been deployed; and (iii) portability: programs should not be tied to specific hardware targets. P4 is the first widely adopted domain-specific language for packet processing. Several research groups have already developed FPGA-based P4 implementations, and Xilinx has added P4 support to its SDNet product. The P4 community has created, and continues to maintain and develop, the language specification; a set of open-source tools (compilers, debuggers, code analyzers, libraries, software P4 switches, etc.); and sample P4 programs, all with the goal of making it easy for P4 users to quickly and correctly author new data-plane behaviors, and so prototype new ideas for networking applications.

The aim of the tutorial is to discuss the basic operations in networking and how they have influenced the design of the P4 language. It will give an overview of the main features of the language and of how P4 components are deployed within packet-processing architectures. This will include showing how the design goals of the P4 language are met through program samples, including some examples of applications (e.g., in-band telemetry) that are enabled by the availability of programmable networking hardware.
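As a loose illustration of the parse-then-match-action model that P4 embodies, the following Python sketch (not P4 syntax; all function and table names are hypothetical) parses an Ethernet header, conditionally parses IPv4, and looks up an action in a table keyed on the EtherType:

```python
import struct

def parse_ethernet(pkt):
    """Extract destination MAC, source MAC, and EtherType (first 14 bytes)."""
    dst, src, ethertype = struct.unpack("!6s6sH", pkt[:14])
    return {"dst": dst, "src": src, "ethertype": ethertype}, pkt[14:]

def parse_ipv4(pkt):
    """Extract a few IPv4 fields; offsets follow the RFC 791 header layout."""
    version_ihl, tos, total_len = struct.unpack("!BBH", pkt[:4])
    ttl, proto = struct.unpack("!BB", pkt[8:10])
    return {"version": version_ihl >> 4, "ttl": ttl, "proto": proto}

def process(pkt, table):
    """Parse headers, then apply the action matched on the EtherType."""
    eth, payload = parse_ethernet(pkt)
    action = table.get(eth["ethertype"], lambda hdr: "drop")  # default action
    hdr = parse_ipv4(payload) if eth["ethertype"] == 0x0800 else None
    return action(hdr)

# Example "match-action table": act on IPv4, drop everything else.
table = {0x0800: lambda hdr: f"forward ttl={hdr['ttl']}"}
```

In an actual P4 program, the headers, the parser state machine, and the match-action tables are declared in P4 itself and compiled to the target; the sketch above only mirrors the control flow.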
The tutorial will discuss in depth the mapping of P4 to FPGAs as a target technology, illustrated by the Xilinx SDNet P4 compilation flow. In particular, the new P4-to-NetFPGA workflow, based on SDNet, will be presented and demonstrated as a route for networking researchers to easily map P4 applications to hardware implementations on the 4x10Gb/s NetFPGA SUME platform. The tutorial will conclude with an overview of ongoing and future research questions surrounding P4 and its FPGA implementation. The target audience for the tutorial is FPGA researchers with some familiarity with networking, or at least an enthusiasm for becoming active in that area. We expect that, by the end of the tutorial, attendees will be familiar with the application domain and the use of P4, and that researchers will be encouraged both to tackle interesting applications and to contribute to the development of the P4-based ecosystem.
Gordon Brebner is a Distinguished Engineer in Xilinx Labs, leading an international group researching issues surrounding networked processing systems of the future. This research has led to the Xilinx SDNet product for P4-programmed SDN, IBN and NFV at 100Gb/s rates. He is currently co-chair of the P4 Language Design working group in the P4.org consortium.
Stephen Ibanez is a Ph.D. Candidate at Stanford University, working with Professor Nick McKeown. His research focuses on finding new and exciting applications for high-speed programmable data planes. He has hosted numerous P4-related tutorials at venues such as SIGCOMM, and he is now leading the P4 to NetFPGA community of developers and users.
Parimal Patel, XUP Senior Systems Engineer
The increasing computational requirements of next-generation Cloud and High-Performance Computing (HPC) applications are pushing the adoption of accelerated computing based on heterogeneous architectures into the mainstream, as traditional CPU technology is unable to keep pace. FPGA accelerators complement CPU-based architectures and deliver significant improvements in performance and power efficiency. In this regard, Xilinx FPGAs are now available on Amazon Elastic Compute Cloud (EC2) F1 instances, which are designed to accelerate data center workloads including machine learning inference, data analytics, video processing, and genomics. F1 instances are available in two sizes, including up to eight Virtex® UltraScale+ VU9P FPGAs with a combined peak compute capability of over 170 TOP/sec (INT8). Furthermore, Amazon Web Services offers the SDAccel™ Development Environment for cloud acceleration, enabling users to easily and productively develop accelerated algorithms and then efficiently implement and deploy them onto the heterogeneous CPU-FPGA system. The high performance and high level of scalability offered by F1 instances, paired with the power and ease of use of Xilinx SDAccel, are very appealing for the development of high-performance FPGA-based accelerated solutions, and will be the focus of this tutorial.
Attendees will use their laptops to connect to the Wi-Fi network, access AWS, and work with SDAccel.
Paul Chow (University of Toronto) and Derek Chiou (University of Texas at Austin)
There has been rapid growth in interest in using FPGAs in the cloud, which also reveals many issues that need to be explored and solved, bringing with it many research opportunities. Research with FPGAs has always had the benefit of being able to actually demonstrate working concepts, but this becomes a real challenge when we move to a cloud environment, where scalability is an important criterion. A “cloud” is a significant infrastructure that few can fully access. Those who want to work on applications can often leverage platforms like Amazon F1 or FABRIC/Catapult at TACC, but even then, if you want to explore how the architecture of the platform affects the application, you have little ability to modify the platform. There are also interesting problems at higher levels of the stack that must be addressed to build a proper ecosystem, such as new shells, virtualization, provisioning, deployment, communication, scheduling, programming models, security, and privacy, where you might need to touch the whole system, i.e., all the layers in a platform. You cannot do this with a production cloud platform.
Following the F1 workshop, this workshop will begin with presentations from academics and industry on other platforms that are currently available, with a focus on how they could be used for research. We also want the audience to participate and describe the kinds of research they are doing or considering, with the goal of understanding what kind of infrastructure would be required to achieve meaningful results. A desired outcome for the workshop is to understand the requirements of researchers and begin planning ways to carry out effective research on FPGAs in the Cloud.
If you are doing or contemplating research related to FPGAs in the Data Center, please fill out the short survey at link before the workshop. We want to learn what people want to do and what infrastructure that requires. This could also help you identify possible collaborators. The results will be presented at the workshop.
Thomas Preusser (Xilinx), email: TPREUSSE@xilinx.com
Quantized Neural Networks (QNNs) have been demonstrated to relieve a significant share of the enormous memory and compute burden of neural network inference while maintaining competitive accuracy. Quantization is the key technique for extending the success of neural networks into small, resource-, compute-, and power-constrained embedded environments.
This tutorial first establishes a technical baseline: the standard flows for training and deploying convolutional neural networks. It then shows how the training flow itself must already be adjusted to account for a quantized inference target. The audience is walked through the quantization-aware training flow released as part of the public BNN-PYNQ framework. Finally, the enabling power of quantization is demonstrated by HW-accelerated QNN inference within Jupyter notebooks on the PYNQ-Z1 platform, which features an entry-level Zynq Z020 device.
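The core trick of quantization-aware training, used in far more elaborate form by frameworks such as BNN-PYNQ, can be sketched in a few lines of numpy. This is a hypothetical single-neuron illustration, not the BNN-PYNQ code: quantized (here, binarized) weights are used in the forward pass, while the gradient updates the underlying real-valued weights via the straight-through estimator.

```python
import numpy as np

def binarize(w):
    """Forward-pass quantizer: map real weights to +1/-1."""
    return np.where(w >= 0.0, 1.0, -1.0)

def train_step(w_real, x, y, lr=0.1):
    """One SGD step on a single linear neuron with squared loss."""
    w_q = binarize(w_real)             # forward pass uses quantized weights
    y_hat = x @ w_q                    # prediction
    grad_y = 2.0 * (y_hat - y)         # d(loss)/d(y_hat)
    grad_w = grad_y * x                # straight-through: treat binarize as identity
    w_real = w_real - lr * grad_w      # update the real-valued "shadow" weights
    return np.clip(w_real, -1.0, 1.0)  # keep shadow weights bounded
```

At inference time only the quantized weights are kept, which is what makes the network cheap to evaluate in hardware.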
Thomas Preusser is an EU-funded Marie Curie Fellow investigating dedicated compute optimizations for quantized neural networks in Michaela Blott’s group at Xilinx Research, Ireland. After studying at TU Dresden and UT Austin, he worked as a postdoctoral researcher and instructor at his alma mater in Dresden. His research expertise is in computer arithmetic and digital design. Thomas won the Michael Servit Award at FPL 2010 and holds the two latest records for computing the solution counts of the N-Queens Puzzle, using algebraic methods and a vast distributed FPGA computation.
Robert Green (Embedded/FPGA design engineer at ASIC Design Services), email: firstname.lastname@example.org
Multi-layer convolutional neural networks (CNNs) have led to state-of-the-art improvements in the accuracy of non-trivial recognition tasks such as large-category image classification and automatic speech recognition. The use of convolutional neural networks on embedded devices can be problematic, since these platforms are resource-, power-, and space-constrained. The resources available on field-programmable gate arrays (FPGAs), such as multiply-accumulate units, fabric memory, and logic elements, have increased significantly over the past few years. FPGAs combine flexible hardware configuration with high energy efficiency, and the parallel nature of these devices makes them an ideal candidate for accelerating CNNs on embedded devices.
Recently it has been shown that 8- or 16-bit fixed-point integers can be used to represent the weights and data within a CNN, reducing memory storage and bandwidth requirements. Even with this optimized data representation, CNN models are computationally expensive and resource-hungry. This talk will focus on how FPGAs can be used to accelerate quantized CNNs and how the numerous challenges involved in the design process can be addressed. First, an overview of the design challenges faced when implementing CNNs on an FPGA is provided. The different sources of parallelism within a CNN that can be exploited for an efficient hardware implementation are identified. Optimal multiply-accumulate unit configurations for convolution operations are shown. Techniques that maximize computational throughput by exploiting parallelism while minimizing memory accesses are discussed. Different optimizations for a low-power design are also considered. Lastly, a scalable and generative framework for accelerating CNNs on an FPGA, yielding a low-power solution for implementing intelligence at the node/edge, is shown. The talk concludes with two live demos of quantized CNNs running on an FPGA.
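The 8-bit representation mentioned above can be illustrated with a minimal symmetric linear quantization sketch in Python (a hypothetical per-tensor scheme for illustration, not the framework presented in the talk): float weights are mapped to int8 with a single scale factor, cutting storage and bandwidth by 4x relative to float32 at the cost of some precision.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of a float array to int8."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # one scale for the tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return q.astype(np.float32) * scale
```

In a hardware accelerator the int8 values feed the multiply-accumulate units directly; the scale factors are folded into a single rescaling step after each layer.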
Recent advances in machine learning have led to breakthroughs in many fields and enabled automation of many applications that were thought to require human cognition and to be unsuitable for computing machines. Deep learning has become the de facto standard for diverse applications including image classification, speech recognition, autonomous driving, and game strategy, and has even opened up new areas like machine music composition and artistic painting. A key to these advances has been ever more powerful compute platforms that enable larger training sets and larger models.
To continue improving the performance of deep learning and to meet the energy-efficiency demands of both embedded and datacenter-scale applications, still more efficient computational hardware is needed. Many compute models and hardware architectures are competing for this huge opportunity, including new CPUs, enhanced GPUs, FPGAs, and dedicated neural-net ASICs. Complicating the picture further, machine learning algorithms are varied, with precisions ranging from 32-bit floating point down to single bits, and are also changing quickly, with new sparsity techniques and mathematical functions being incorporated. Application needs are varied as well: models span orders of magnitude in size, and some areas like financial modeling need high throughput while others like autonomous driving require low latency. This presents a challenge for acceleration: while a very specific hardware accelerator with little programmability may yield the best performance on some of today’s algorithms, it may not support future ML algorithms and applications well.
In this panel we will hear from experts on the needs of current and emerging applications, the strengths of different architectural approaches, the gaps in current solutions and the opportunities they present. Expect a lively discussion about one of the most exciting frontiers in computing!