Barron’s Tech Trader Daily
A year ago, I had a great conversation with database legend Michael Stonebraker, a professor at MIT's Computer Science and Artificial Intelligence Lab.
Among the many trends that Stonebraker mentioned at the cutting edge of database technology is the use of graphics chips, so-called graphics processing units, or GPUs, to accelerate database operations. These include parts from, most prominently, Nvidia (NVDA) and AMD (AMD).
Today I had the pleasure of talking with a rising young database star who has come out of that same CSAIL lab and is making the research into GPU-driven databases a reality: Todd Mostak, CEO and founder of a startup called MapD.
MapD makes a relational database management system that runs its most compute-intensive operations on Nvidia GPUs.
Nvidia is an investor in a new, $10 million round of funding, along with Vanedge Capital, Verizon Ventures, and GV (formerly Google Ventures). The company has customers using the product, including Verizon.
The technology is particularly well suited to analytics, big data, and workloads generally classified as “data warehouse” and “online analytical processing,” or OLAP.
Mostak posted today about the announcement. The funding follows a $2 million round back in 2014.
Mostak started out at Harvard as a graduate student in the field of Middle Eastern Studies, but soon realized he couldn’t resist the call of programming. He started taking programming courses, eventually moved to computer science, and was “coaxed” to join the database group at CSAIL, working under the tutelage of another database pioneer, Sam Madden, who helped found database startup Vertica.
Running database operations on the GPU is a departure from traditional database design, where all the tasks are handled by the CPU.
As Mostak explains it, “We take SQL code, and we compile it into GPU code and CPU code. The logic for how to parse the query is still happening on the CPU, but the actual execution of queries, joins, and other operations is happening on the GPU, so you get these very nice speed-ups.”
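The compile-then-execute idea can be sketched in miniature. This is an illustrative toy, not MapD's actual compiler (MapD emits GPU and CPU machine code through LLVM; here the "compiled" form is just Python bytecode), but it shows why compiling a predicate once pays off when it is then applied to every row:

```python
# Toy sketch of query compilation: translate a SQL-style WHERE predicate
# into a reusable compiled filter, then execute it over the data.

def compile_where(predicate: str):
    """Compile a WHERE-style predicate string into a row-filter function."""
    code = compile(predicate, "<where>", "eval")  # one-time compilation
    def row_filter(row):
        # Each row is a dict of column name -> value; the compiled
        # predicate reads column names as variables.
        return eval(code, {}, row)
    return row_filter

rows = [{"price": 120, "qty": 3}, {"price": 80, "qty": 10}]
f = compile_where("price > 100 and qty < 5")
print([r for r in rows if f(r)])   # [{'price': 120, 'qty': 3}]
```

The design point is the same one Mostak describes: planning and compilation happen once, up front, and the hot loop then runs only the compiled form.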
The payoff is a two-orders-of-magnitude improvement in performance: query results come back in milliseconds even when the tables in a database are huge. “You can take a billion-row data set and there’s zero lag.”
Part of what made that possible was the release by Nvidia of its “CUDA” programming environment for GPUs in 2009. That made the GPU “generally programmable,” he notes.
The insight came as Mostak was trying to do graduate work on the Arab Spring and ran up against headaches trying to crunch things such as Twitter (TWTR) posts.
Sam [Madden] was pretty interested in hardware-accelerated databases. I had been doing research into the Arab Spring through Twitter, harvesting hundreds of millions of tweets. There weren’t the tools to crunch this interactively. You had to wait hours for C code or Postgres to run these queries. I thought to myself, There has to be a better way, and then I thought of GPUs.
The innovation in MapD’s case is to use many GPUs at once, which makes it possible to keep entire tables in GPU memory, and to keep scaling as these massively parallel collections of GPU cores grow:
There’s been a handful of researchers looking at GPUs, and one insight I had was that these people were using single GPUs to accelerate queries. They were losing any GPU advantages because they had to go over the PCI bus from the CPU. So, I thought, let’s use multiple GPUs, and cache data on the GPUs themselves, as a buffer pool of L1 cache memory. The memory available wasn’t very big at the time, maybe 6 gigs [gigabytes] per GPU, but I saw these memory sizes would get much bigger.
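The buffer-pool idea Mostak describes can be sketched as a toy: partition a column once across N "devices," keep each shard resident, and answer each query with parallel partial scans plus a cheap merge, so nothing crosses the bus per query. All names here are illustrative, and threads stand in for GPUs:

```python
# Toy sketch of a multi-device buffer pool (threads stand in for GPUs).
from concurrent.futures import ThreadPoolExecutor

N_DEVICES = 4
column = list(range(1_000_000))          # one column of a large table

# One-time load: shard the column across devices ("cache on the GPUs").
shards = [column[i::N_DEVICES] for i in range(N_DEVICES)]

def device_scan(shard, lo, hi):
    """Per-device partial aggregate: COUNT(*) WHERE lo <= x < hi."""
    return sum(lo <= x < hi for x in shard)

# Per-query work: each device scans only its resident shard in parallel.
with ThreadPoolExecutor(N_DEVICES) as pool:
    partials = list(pool.map(device_scan, shards,
                             [10] * N_DEVICES, [20] * N_DEVICES))

total = sum(partials)   # small merge step back on the "CPU" side
print(total)            # 10 rows match the predicate
```

The one-time shard load is the expensive step; every subsequent query touches only memory that is already resident, which is the advantage the single-GPU prototypes gave up by shuttling data over PCI for each query.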
“With this big push into neural networks today we can get several hundreds of gigs on a single server” between all the GPUs operating in parallel, he points out.
This is a continual evolution:
Off the shelf, you can get an [Nvidia] K80 [GPU] with 24 gigs of memory these days. Probably, there will be a significant increase in the coming generation of [Nvidia] Pascal and [AMD] Polaris GPUs. Pascal and Polaris should mean a lot more power for less energy all around, both on servers as well as on desktops/laptops (we can run on laptops, desktops/workstations, and servers, but our product focus is on servers currently). But even today, some people will get a workstation with 4 Titan GPUs, and have a minicomputer for under $12,000.
MapD’s promotional literature lays out the scope of its ambition: combining many such GPUs in parallel with tons of memory:
MapD’s database software is able to attain unprecedented speeds by leveraging up to eight GPU cards with 192GB of ultra-fast video RAM in a single server, querying data at rates approaching 3TB/second across almost 40,000 cores. MapD supports standard SQL queries, compiling them to GPU code to achieve maximum efficiency. MapD also supports a visualization API by which the results of a SQL query can be rendered using the native graphics pipeline of the GPUs. Depending on the use case, MapD can be used either as a standalone database, with third-party visualization tools such as Tableau, or with its own visualization frontend, MapD Immerse.
Mostak added to that literature his own top-of-mind thoughts on the flourishing multi-GPU prospects:
An Nvidia K80 is Nvidia’s flagship GPU model for GPGPU (General Purpose Computing on Graphics Processing Units). It has 24GB of Video RAM on board, and you can put 8 of these in a standard server or up to 16 in a PCI extension chassis, for either 192GB or 384GB total GPU RAM. Normally we’d back this by a terabyte of CPU RAM to serve as a second-tier cache for the hot data (keeping the rest on SSD). Each K80 has 4,992 cores, so this means that with our standard 8 GPU server we have almost 40,000 cores to compute the SQL and render the visualizations in parallel. It is expected that Nvidia will be introducing its Pascal architecture GPUs next week at its GTC conference, which are rumored to have even more onboard RAM and cores.
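The capacity figures Mostak cites follow directly from the per-card K80 specs; a few lines of arithmetic make the totals explicit:

```python
# The capacity math from the article, spelled out (K80 specs as quoted).
VRAM_PER_K80_GB = 24     # onboard video RAM per K80
CORES_PER_K80 = 4992     # CUDA cores per K80

for n_gpus in (8, 16):   # standard server vs. PCI extension chassis
    print(f"{n_gpus} GPUs -> {n_gpus * VRAM_PER_K80_GB} GB total GPU RAM")
    # 8 GPUs -> 192 GB; 16 GPUs -> 384 GB

print(8 * CORES_PER_K80)   # 39936, i.e. "almost 40,000 cores"
```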
Despite the funding from Nvidia, Mostak is not averse to considering AMD GPUs in the future:
While currently we only support Nvidia, our architecture is general-purpose enough that we could target AMD GPUs in a relatively straightforward manner (we can already execute on x86 and ARM on the CPU side). We use LLVM to generate machine code from SQL and could use the LLVM backend for AMD GPUs to target AMD’s architecture. Right now, however, we find that Nvidia owns the lion’s share of the GPGPU market, so we focus our efforts there, but at some point we will shoot to be (GPU) hardware agnostic.
Mostak is also open to other forms of computing hardware, including Intel’s (INTC) forthcoming “Xeon Phi” technology.
Today, the software is sold as a traditional license product, priced on a per-core basis. Mostak expects increasing integration with cloud computing services — for example, running the software in Amazon.com’s (AMZN) AWS as an “Amazon Machine Image,” or AMI.
The company already has customers using the software in a “bring-your-own-license” fashion. This year, the company will start moving in the direction of an expressly cloud version of the software, for customers who’d rather buy that way, says Mostak. “We are aiming to have native cloud images so customers can directly license our software from Amazon or from IBM Softlayer.”
He notes that the technology can be competitive with data analysis tools such as those of Splunk (SPLK) that run either on-premise or in the cloud.
To its relational database, MapD adds a graphical tool, or “dashboard”: software that lets you see graphical representations of the data, either in standard pie-and-bar-chart form or in more sophisticated, whiz-bang representations.
As the company describes its “Immerse” visualization tool, it too makes use of the parallel GPU configurations:
MapD’s Immerse visualization client leverages the MapD database to provide both complex data visualizations (such as detailed GIS representations, scatterplots, data animations, and more) and standard reporting charts (such as line, bar, and pie charts, among others) in the browser. Simpler charts are rendered in-browser while complex renderings of large datasets are fetched from the backend, where they can be generated in milliseconds. Multiple chart types can be laid out within a single dashboard, providing multi-dimensional insights. Immerse enables easy drill-down and correlation analysis with its use of the cross-filter model, where applying a filter to one chart instantly applies it to all other charts in the dashboard.
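The cross-filter model is easy to sketch: a filter registered from any one chart is applied to the rows behind every chart before each chart re-aggregates. The class and method names below are hypothetical illustrations, not the Immerse API:

```python
# Toy sketch of the cross-filter model (hypothetical names, not Immerse).

rows = [
    {"region": "east", "product": "A", "sales": 10},
    {"region": "west", "product": "A", "sales": 7},
    {"region": "east", "product": "B", "sales": 5},
]

class Dashboard:
    def __init__(self, rows):
        self.rows, self.filters = rows, []

    def add_filter(self, predicate):
        """Register a filter, e.g. from a click on one chart."""
        self.filters.append(predicate)

    def chart_data(self, group_key):
        """Every chart re-aggregates over the rows passing ALL filters."""
        filtered = [r for r in self.rows
                    if all(p(r) for p in self.filters)]
        out = {}
        for r in filtered:
            out[r[group_key]] = out.get(r[group_key], 0) + r["sales"]
        return out

dash = Dashboard(rows)
dash.add_filter(lambda r: r["region"] == "east")   # filter set on the region chart
print(dash.chart_data("product"))   # {'A': 10, 'B': 5} — product chart updates too
```

The point is that filters live on the dashboard, not on any single chart, which is why one selection instantly narrows every other view.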
Can MapD be a really big software company? Without ruling anything out, Mostak says “our goal is not to exit, we are getting a lot of traction,” adding that “this funding gives us a good amount to build the engineering and sales and marketing teams, and to pursue our go-to-market approach.”
At the same time, he says, there’s something of an analogy to data warehousing company Netezza, which was bought in 2010 by International Business Machines (IBM) for $1.78 billion. Netezza, notes Mostak, did some work with field-programmable gate arrays, or FPGAs, another family of specialized chips — and something that Stonebraker mentioned as another significant frontier in database exploration.