Friday, February 23, 2024

The Sinking of the Itanic: free ebook


Throughout my stint as an IT industry analyst during the 2000s, one of my significant interests was Intel's Itanium processor, a 64-bit design intended to succeed the ubiquitous 32-bit x86 family. I wrote my first big research report on Itanium, and at the time it seemed like something of an inevitability given Intel's dominance.

But there were storm clouds. Intel took an approach to Itanium's design that was not wholly novel, but one that had never been commercially successful. The dot-com era was also drawing to a close even as Itanium's schedule slipped. Furthermore, the initial implementation was not ready for prime time for a variety of reasons.

Especially with the benefit of hindsight, there were also problems with the way Intel and its partner, Hewlett-Packard, approached the market with Itanium. Itanium would ultimately fail, replaced by a much more straightforward extension of the existing x86 architecture.


This short book draws six lessons from Itanium's demise:

  • Lesson #1: It’s all about the timing, kid 
  • Lesson #2: Don’t unnecessarily proliferate risk
  • Lesson #3: Don’t fight the last war
  • Lesson #4: The road to hell is paved with critical dependencies
  • Lesson #5: Your brand may not give you a pass
  • Lesson #6: Some animals can’t be more equal than others

While Itanium is this book's case study, many of the lessons apply to other projects as well.

Download your free PDF ebook today.


Wednesday, May 24, 2023

AI is looking summer-y

 

The whispers about an impending AI winter never got especially commonplace, loud, or confident. However, as widespread commercialization of some of the most prominent AI applications—think autonomous vehicles—slipped well past earlier projections, doubts were inevitable. At the same time, the level of commercial investment relative to past AI winters made a wholesale bet against AI seem unwise.


It’s the technology in the moment’s spotlight. On May 23, it was foundational to products announced at Red Hat Summit in Boston, such as Ansible Lightspeed. Today, however, the surprise would be if AI did not have a prominent position at a technology vendor’s show.

But, as a way to get a perspective that’s less of a pure technologist take, consider the prior week’s MIT Sloan CIO Symposium, Driving Resilience in a Turbulent World, held in Cambridge, MA. This event tends to take a higher-level view of the world, albeit one flavored by technology. Panels this year about how the CIO has increasingly evolved into a chief regulation officer, chief resilience officer, and chief transformation officer are typical of the lenses this event uses to examine the trends that matter to IT decision makers. As most large organizations become technology companies—and software companies in particular—it’s up to the CIO to partner with the rest of the C-suite to help chart strategy in the face of changing technological forces. And that means considering tech in the context of other forces—and concerns. For example, supply chain optimization is a broad company business challenge even if it needs technology as part of the puzzle.


AI rears its head


But even if AI was a relatively modest part of the agenda on paper, mostly in the afternoon, everyone was talking about it to a greater or lesser degree.


For example, Tom Peck, Executive Vice President & Chief Information Officer and Digital Officer at Sysco, said that they were still “having trouble finding a SKU of AI in the store. We’re trying to figure out how to pluck AI and apply it to our business. Bullish on it but still trying to figure out build vs. buy.”


If I were to summarize the overall attitude towards AI at the event, it was something like: really interesting, really early, and we’re mostly just starting to figure out the best ways to get business value from it.


A discussion with Irving Wladawsky-Berger


I’ve known Irving Wladawsky-Berger since the early 2000s when he was running IBM’s Linux Initiative; he’s now a Research Affiliate at MIT’s Sloan School of Management, a Fellow of MIT’s Initiative on the Digital Economy and of MIT Connection Science, and an Adjunct Professor at the Imperial College Business School. He’s written a fair bit on AI; I encourage you to check out his long-running blog.


There were lots of things on the list to talk about. But we jumped straight to AI. It was that sort of day. To Irving, “There’s no question in my mind that what’s happening with AI now is the most exciting/transformative tech since the internet. But it takes a lot of additional investment, applications, and lots and lots of [other] stuff.” (Irving also led IBM’s internet strategy prior to Linux.)


At the same time, Irving warns that major effects will probably not be seen overnight. “It’s very important to realize that many things will take years of development if not decades. I’m really excited about the generative AI opportunity but [the technology is] only about 3 years old,” he told me. 


We also discussed The Economist’s How to Worry Wisely about AI issue, especially an excellent essay by Ludwig Siegele titled “How AI could change computing, culture and history.” One particularly thought-provoking statement from that essay is “For a sense of what may be on the way, consider three possible analogues, or precursors: the browser, the printing press and practice of psychoanalysis. One changed computers and the economy, one changed how people gained access and related to knowledge, and one changed how people understood themselves.”


Psychoanalysis? Freud? It’s easy to see the role the browser and the printing press have had as world-changing inventions. He goes on to write: “Freud takes as his starting point the idea that uncanniness stems from ‘doubts [as to] whether an apparently animate being is really alive; or conversely, whether a lifeless object might not be in fact animate’. They are the sort of doubts that those thinking about llms [Large Language Models] are hard put to avoid.”


This in turn led to more thought-provoking conversation about linguistic processing, how babies learn, and emergent behaviors (“a bad thing and a bug that has nothing to do with intelligence”). Irving concluded by saying “We shouldn’t stop research on this stuff because it’s the only way to make it better. It’s super complex engineering but it’s engineering. It’s wonderful. I think it will happen but stay tuned.”


The economics


“The Impact of AI on Jobs and the Economy” closed out the day with a keynote by David Autor, Professor of Economics, MIT. 


If you want to dive into an academic paper on the topic, here’s the paper by Autor and co-authors Levy and Murnane.


However, Autor’s basic argument is as follows. Expertise is what makes labor valuable in a market economy. That expertise must have market value and be scarce; non-expert work, in general, pays poorly.


With that context, Autor classifies three eras of demand for expertise. The industrial revolution first displaced artisanal expertise with mass production. But as industry advanced, it demanded mass expertise. Then came the computer revolution, which really goes back to the Jacquard loom. The computer is a symbolic processor and it carries out tasks efficiently—but only those that can be codified.


Which brings us to the AI revolution. Artificially intelligent computers can do things we can’t codify. And they know more than they can tell us. The question Autor posits is “Will AI complement or commodify expertise? The promise is enabling less expert workers to do more expert tasks”—though Autor has also argued that policy plays an important role. As he told NPR in early May: “[We need] the right policies to prepare and assist Americans to succeed in this new AI economy, we could make a wider array of workers much better at a whole range of jobs, lowering barriers to entry and creating new opportunities.”

Sunday, April 23, 2023

Kubecon: From contributors to AI

I find that large industry shows like KubeCon + CloudNativeCon (henceforth just KubeCon for short) are often at least as useful for plugging into the overall zeitgeist of the market landscape and observing the trajectory of various trends as they are for diving deep on any single technology. This event, held in late April in Amsterdam, was no exception. Here are a few things that I found particularly noteworthy; they may help inform your IT planning.


Contributors! Contributors! Contributors!


Consider first who attended. With about 10,000 in-person registrations, it was the largest KubeCon Europe ever. Another 2,000 never made it off the waiting list. Especially if you factor in tight travel budgets at many tech companies, it’s an impressive number by any measure. By comparison, last year’s edition in Valencia had 7,000 in-person attendees; hesitancy to attend physical events has clearly waned.


Some other numbers. There are now 159 projects within the Cloud Native Computing Foundation (CNCF) which puts on this event; the CNCF is under the broader Linux Foundation umbrella. It started with one, Kubernetes, and even as relatively recently as 2017 had just seven. This highlights how the cloud native ecosystem has become about so much more than Kubernetes. (It also indirectly suggests that a lot of complaints about Kubernetes complexity are really complaints about the complexity of trying to implement cloud-native platforms from scratch. Hence, the popularity of commercial Kubernetes-based platforms that do a lot of the heavy lifting with respect to curation and integration.)


Perhaps the most striking stat of all, though, was the percentage of first-timers at KubeCon: 58%. Even allowing for KubeCon’s growth, that’s a great indicator of new people coming into the cloud-native ecosystem. So all’s good, right?


Mostly. I’d note that the theme of the conference was “Communities in Bloom.” (The conference took place with tulips in bloom around Amsterdam.) VMware’s Dawn Foster and Apple’s Emily Fox also gave keynotes on building a sustainable contributor base and on saving knowledge as people transition out of a project, respectively. This all has a common theme: new faces are great, but a torrent of new faces can stress maintainers and various support systems. The torrent needs to be channeled.


Liz Rice, Chief Open Source Officer at Isovalent and emeritus chair of the Technical Oversight Committee, put it to me this way: the deliberate focus on community at this KubeCon doesn’t indicate a crisis by any means. But the growth of the CNCF ecosystem and the corresponding level of activity is something to be monitored, and perhaps some steps taken in response.


It’s about the platform


The platform engineer role, and the “platform” term generally, have really come into the spotlight over the past couple of years. Panelists on the media panel about platform engineering described platforms as being documentable, secure, able to connect to different systems like authentication, incorporating debuggability and observability, and, perhaps most of all, flexible.


From my perspective, platform engineering hasn’t replaced DevOps as a concept but it’s mostly a more appropriate term in the context of Kubernetes and the many products and projects surrounding it. DevOps started out as something that was as much about culture as technology; at least the popular shorthand was that of breaking down the wall between developers and operations. While communicating across silos is (mostly) a good thing, at scale, operations mostly provisions a platform for developers — perhaps incorporating domain-specific touches relevant to the business — and then largely gets out of the way. Site Reliability Engineers (SRE) shoulder much of the responsibility for keeping the platform running rather than sharing that responsibility with developers. The concept isn’t new but “DevOps” historically got used for both breaking down walls between the two groups and creating an abstraction that allowed the two groups to largely act autonomously. Platform engineering is essentially co-opting the latter meaning.


The latest abstraction that we’re just starting to see is the Internal Developer Platform (IDP) — such as the open source Backstage that came out of Spotify. “Freedom with guardrails” is how one panelist described the concept. An IDP provides developers with all the tools they need under an IT governance umbrella; this can create a better experience for developers by presenting them with an out-of-the-box environment that includes everything they need to start developing. It’s a win for IT too. It cuts onboarding time and means that development organizations across the company can use the same tools, have access to the same documentation, and adhere to the same standards.
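
To make “freedom with guardrails” concrete, here is a minimal sketch of the kind of self-service scaffolding an IDP might expose; the template names, guardrail defaults, and function are hypothetical illustrations, not Backstage’s actual API. The developer picks an approved golden path, and the platform layers the governance pieces underneath.

    # Hypothetical sketch of IDP-style self-service scaffolding; template
    # names and guardrail defaults are illustrative, not a real product's API.
    APPROVED_TEMPLATES = {
        "python-service": {"ci": "standard-pipeline", "base_image": "approved/python:3.11"},
        "go-service": {"ci": "standard-pipeline", "base_image": "approved/go:1.21"},
    }

    GUARDRAILS = {"logging": "central", "auth": "sso", "vuln_scan": True}

    def scaffold(service_name: str, template: str) -> dict:
        """Return a ready-to-onboard service definition: developer choice
        on top, organization-wide guardrails underneath."""
        if template not in APPROVED_TEMPLATES:
            raise ValueError(f"{template} is not an approved golden path")
        return {"name": service_name, **APPROVED_TEMPLATES[template], **GUARDRAILS}

    print(scaffold("inventory-api", "python-service"))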


Evolving security (a bit)


Last fall, security was pervasive at pretty much every IT industry event I attended, including KubeCon North America in Detroit. It featured in many keynotes. Security vendor booths were omnipresent on the show floor.


It’s hard to quantify the security presence at this KubeCon by comparison. To be clear, security was well-represented both in terms of booths and breakouts. And security is so part and parcel of both platforms and technology discussions generally that I’m not sure if it would even be possible to quantify how much security was present.


However, after making myself a nuisance with several security vendors on the show floor, I’ll offer the following assessment. Security is as hot a topic as ever but the DevSecOps and supply chain security messages are getting out there after a somewhat slow start. So there may be less need to bang the drum quite so loudly. One security vendor also suggested that there may be more of a focus on assessing overall application risk rather than making security quite so much about shifting certain specific security processes earlier in the life cycle. Continuous post-deployment monitoring and remediation of the application as a whole is at least as important. (They also observed that the biggest security focus remains in regulated industries such as financial services.)


An AI revolution?


What of the topic of the moment — Large Language Models (LLMs) and generative AI more broadly? These technologies were even the featured topic of the issue of The Economist that I read on my way back to the US from Europe.


The short answer is that they were an undercurrent but not a theme of the event. I had a number of hallway track discussions about the state of AI but the advances, which are hard to ignore or completely dismiss even for the most cynical, have happened so quickly that there simply hasn’t been time to plug into something like the cloud-native ecosystem. That will surely change.


It did crop up in some specific contexts. For example, in the What’s Next in Cloud Native panel, there was an observation that Day 2 operations (i.e. after deployment) are endlessly complex. AI could be a partial answer to having a more rapid response to the detection of anomalies. (To my earlier point about security not being an island relative to other technologies and processes.) AIOps is already an area of rapid research and product development, but there’s the potential for much more. And indeed, a necessity, as attackers will certainly make use of these technologies as well.

Monday, January 17, 2022

Hardware data security with John Shegerian

John Shegerian is the co-founder and executive chairman of recycling firm ERI and the author of The Insecurity of Everything. In this podcast, we talk about both the sustainability aspects of electronic waste and the increasing security risk associated with sensitive data stored on products that are no longer in use. Shegerian argues that CISOs and others responsible for security and risk mitigation at companies have historically focused mostly on application security.

However, today, hardware data security is very important as well. He cites one study where almost 42% of 159 hard drives purchased online still held sensitive data. And the problem extends beyond hard drives and USB sticks to a wide range of devices such as copiers that store data that has passed through them.

Shegerian details some of the steps that companies (and individuals) can take to reduce the waste they send to landfills and to prevent their data from falling into the wrong hands. 

Fill out this form for a free copy of The Insecurity of Everything: https://eridirect.com/insecurity-of-everything-book/

Listen to the podcast [MP3 - 26:55]

Tuesday, January 04, 2022

RackN CEO Rob Hirschfeld on managing operational complexity

There's a lot of complexity, both necessary and unnecessary, in the environments where we deploy our software. The open source development model has proven to be a powerful tool for software development. How can we help people better collaborate in the open around operations? How can we create a virtuous cycle for operations?

Rob and I talked about these and other topics before the holidays. We also covered related topics including the skills shortage, complexity of the software supply chain, and building infrastructure automation pipelines.


Listen to the podcast [MP3 - 30:39]

[TRANSCRIPT]

Gordon Haff:  I'm here today with Rob Hirschfeld, the co-founder and CEO of RackN, for our just-before-the-holidays discussion. Our focus today is going to be on managing complexity.

The reason this is an interesting question for me is, we seem to be getting to this stage where open source on the one hand gives you the ability to get under the hood and play with code, and to run it wherever you want to.

But we seem to be getting to the point where people are saying, that's really hard to do, so maybe we should just put everything on a software as a service or on Amazon Web Services (which I think is actually down at the moment), which purports to solve that complexity problem.

Welcome, Rob.

Rob Hirschfeld:  I'm excited to be here. This complexity tsunami that people are feeling is definitely top of mind to me because it feels like we're reaching a point where the complexity of the systems we've built is sort of unsustainable, even to the point where I've been describing it as the Jevons Paradox of Complexity. It's a big deal.

I do think it's worth saying up front, complexity is not bad in itself. We have a tendency to be like, "Simplify, simplify. Get rid of all the complexity." It's not that complexity is bad or avoidable. It's actually about management, like you stated right at the start. It's a managing-complexity problem, not an eliminating-complexity problem.

Gordon:  To a couple of your points on managing complexity: I mentioned we'll just use a software as a service. And using a software as a service may be just fine.

At Red Hat, we don't run our own email servers any longer. We used to. We use software as a service for email and documents, which of course causes a little tension: shouldn't we be doing everything in open source?

The reality is that with modern businesses, you have to decide where to focus your energy. Way back when, Nick Carr at Harvard Business Review wrote an article asking, basically, "Does IT Matter?" Nick, particularly given the way we view things today, perhaps deliberately overstated his case, but he was absolutely right that you have to pick and choose which IT you're going to focus on and where you differentiate.

Rob:  I think that that's critical and we've been talking a lot in the last two years about supply chains. It is very much a supply chain question. Red Hat distributes the operating system as a thing and there's a lot of complexity between the code and your use of it that Red Hat is taking care of for you.

That's useful stuff to take out of your workflow and your process. Then one of the challenges I've had with the SaaSification piece here, and I think we've seen it with outages lately, is that there is a huge degree of trust in how the SaaS is running that business for you, in their operational capability, and in what they're doing behind the scenes.

The Amazon outage, the really big one early in December, exposed that a lot of the SaaSes that depended on Amazon had outages because Amazon was out. So you can't just adopt the SaaS and delegate everything to the SaaS.

I've been asking a question of how much you need to pierce that veil: do you care about where the SaaS is running, how the SaaS is operating, and how the SaaS is protecting what you're doing? Because you might have exposure that you're happily ignoring by employing a SaaS.

That could come back to bite you or you could be operationally responsible for any work.

Gordon:  Of course, if you are a customer of that SaaS, you don't care if the SaaS is saying, "But, but...It's not our fault. It's Amazon's fault." That's not a very satisfactory answer for a customer.

Rob:  It could be if you've acknowledged the risk. People were talking about some of these outages as business snow days where everybody is down and you can't do anything. Some businesses have the luxury of that, but not very many want their email systems to be down or their internal communication systems to be down.

Those are business critical systems or their order entry, order taking or the delivery systems and those outages take a lot to recover from.

I think that, and if somebody is listening to this with a careful ear, they're like, "But if I was doing it myself, it could go down just as easily," and that's entirely true and this is a complexity balance thing.
It's not like your email system is... it's not like you're going to do a better job managing it than a service provider is doing. They arguably have better teams and they focus on doing it; it's the main thing they do. But they actually do it in a much more complex way than you might.

You might be able to run a service or a system in the backend for yourself in a much simpler way than Amazon would do it for you. They might have the capability to absorb that difference, but we're starting to see that they might not.

Gordon:  I want to touch on another element of supply chain that's very much in my ballpark, in open source, and that is the software supply chain. One of the things that we've been seeing recently, and in fact there was a [US Federal] executive order earlier this year that related to this, among other things.

For software out there, open source or non-open source, 90 percent of it came from somewhere else. That somewhere else might include somebody who does this as a hobby in their spare time in their basement. There was a lot of publicity around this with the Heartbleed exploit a few years ago.

I think some of those low hanging fruit have been cleared off, but at the same time they...

Rob:  We're talking about Log4j, which is dominating the news cycle right now, and that's maintained by a couple of volunteers, because we thought it was static, stable code. No. It is a challenge, no matter which way you go. I think there are two places that you and I both want to talk about with this.

Part of it is that open source aspect and how the community deals with it. Also, the assumption that we're going to keep finding vulnerabilities and patches and errors in everything, and the user's responsibility to be able to patch and update from that perspective, which is also part of open source.

Gordon:  Yeah. Actually, to the point of user responsibility, we ran a survey fairly recently, our annual Global Tech Outlook survey. A lot of the questions are around funding priorities.

As you would expect, security was... well, at least ostensibly, a funding priority. I'm sometimes a little bit uncertain about how to take these things: oh yes, we know security should be a funding priority, whether or not it actually is. But anyway, we asked what the security funding priorities were underneath that.

The top ones were things you'd expect: classic security stuff like network security, so presumably firewalls and things like that. The very bottom, though, and we were just talking about supply chains, was essentially the software supply chain. This is after Joe Biden put out the executive order and everything.

I don't know quite how to take that one. One interpretation says, the message hasn't really gotten out there yet. I don't know to what degree I believe that. The other way to take it is that, yeah, this is important, but Red Hat's taking care of that for us.

Even though we are using all of this open source code in our own software. Then I think the third area may be that, yes, this is a priority, but we don't think it's very expensive to fix.

Rob:  I think that the security budget is very high for magic wands and the availability of magic wands is very limited. [laughs] What you're describing at the bottom of the stack of the software supply chain, is part of what I see the complexity problem being.

We have to step back and say, "How do companies cope with complexity?" [laughs] The number one way they cope with complexity is to ignore it.

It's like, I'm going to send it to a SaaS and pretend they're going to be up all the time, or I'm going to use this library and pretend that it's perfect and flawless. Everything's great.

I agree with you. Red Hat with the distro effectively is doing some of that work for you, and you're buying the, "OK, somebody has eyes on this for me." That assumption is maybe marginally more tested. 

I think when we start looking at these systems, we need to think through: OK, software is composed of other components and those other components have supply chains. Those components have components.

Before we got into containers, we used to do RPM: install whatever it is, keep up whatever it is. We had to resolve all those dependency graphs dynamically in the moment, and it was incredibly hard to do. Software was very fragile from that perspective.

A lot of people avoided patching, changing, or updating because they couldn't afford to resolve that dependency graph. They ignored it.

Docker let us make that immutable, put it all into a container, and do a build-time resolution for it. Which I think is amazing, but it still means that you have to be thinking things through at least at some point when you're pulling all those things in.

I don't think people think of containers as solving the complexity problem of the application dependency graph, but I do. It's one of those ways that you can very consciously come in and say, "We're going to manage this very complex thing in the process." It's a complex thing if it's fragile.

Part of managing complexity is being able to say, "Where do I have hidden complexity lurking in what I do?" If you have something that's fragile and hard to repeat, or requires a lot of interventions to fix, you've identified a part of your process that is complex, or maybe dangerously complex, from that perspective.
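
As a toy illustration of the point Rob makes about build-time versus install-time resolution (the package names and versions here are made up), compare resolving against whatever a repository happens to serve today with resolving against a lockfile captured once at build time:

    # Toy illustration: dynamic resolution drifts as the repo changes;
    # build-time resolution against a recorded lockfile is repeatable.
    repo_today = {"openssl": "3.0.8", "zlib": "1.3"}   # changes over time
    lockfile = {"openssl": "3.0.7", "zlib": "1.2.13"}  # captured at image build

    def resolve(requested, pins=None):
        # With pins, every build sees identical inputs; without, results drift.
        source = pins if pins is not None else repo_today
        return {name: source[name] for name in requested}

    deps = ["openssl", "zlib"]
    print(resolve(deps))            # whatever the repo serves right now
    print(resolve(deps, lockfile))  # the same answer on every build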

Gordon:  Some things inherently have a degree of complexity. Again, there are probably some magic SaaS things out there for some very small problems, but by and large, you're still not eliminating complexity.

I think the other related problem that we're seeing right now, and again a big problem in our Global Tech Outlook survey, was training and people skills. There seems to be a real shortage; it's hard to hire people. You're the CEO of a company; I'm sure you're exposed to this all the time.

Rob:  We are, and our customers are. Our number one challenge in any go-to-market with our customers is the life cycle of the employees at the customer site. We have customers where they have a reorg and a management change, or they lose the lead on a solution, and we have to reset and retrain.

It sets schedules back, let alone hiring our own people and getting them up to speed. There's a huge cost to this. I don't think companies are particularly watching for how the work their people do adds to the complexity and increases the risk of what they do.

A lot of times, people inherently absorb complexity risk by doing the work. We do DevOps and automation work. Our job is to create standard repeatable automation, Infrastructure as Code.

The tools that are out there today, people use them. They use them in ways that they think are right and work for them. They don't have a lot of repeatable practice that each team follows or that can be replicated across the organization, and so you get into the training.

This is where I'm trying to go with the skills training piece. Skills training is partially, "How do I use the tools and the infrastructure and the automation?" Part of it is, "How do I figure out what the person before me did, or the person next to me is doing so that we can connect those pieces together?"

We spend a lot of time and add a lot of complexity when we don't take time to understand. This is a development practice, "How do I get code that I don't have to maintain, that I don't have to understand?" That actually is another way to reduce complexity with this. Does that make sense?

I think about Infrastructure as Code, which is a new way of thinking about DevOps and automation. Today, a lot of people write their own automation, and it's very hard to reuse or share or take advantage of standard practice. But we do in development. In development, we do a good job with that.

You can say, "All right, this library or this component is standard, and I'm going to bring it in. I'm not going to modify it, I don't need to. I'm going to use it the way it's written." That approach reduces the complexity of the systems that we're talking about quite a bit.

Gordon:  I think one of the things that we're seeing, right now I'm working on something called Operate First, and our idea here is to basically form a community around essentially an open source approach to large scale operations.

It's still pretty early on. Operations have been divorced from open source and, frankly, DevOps notwithstanding, from a lot of the development effort and practices that have gone on there.

Rob:  I strongly agree with you. It's one of those things because open source communities are amazing at building infrastructure, or building code, building practice.

It's been a struggle for me to try to figure out how to help people collaborate about operations: what I used to think of back in my OpenStack days as glass house operations. You can't expose everything because there are secrets and credentials and things like that.

In building this next generation of automation and infrastructure pipelines, we need to figure out: how do we have more collaboration? How do we have people share stuff that works? How do people build on each other's work, and, this is the heart of it, recognize that operations have a lot of complexity in them?

There's a difference. You can't turn around and say, "Hey, you're going to operate this exactly like I do and it's going to be simpler for you because you're following me."

We saw this in the early Kubernetes days. There were some people who wrote an Amazon installer called kOps; that's the one I'm thinking of specifically, K-O-P-S. Literally, if you set it up exactly as is, it would install Kubernetes on Amazon very reliably, because Amazon was consistent. [laughs]

From there, it fell apart when people were like, "Well, wait a second. My Amazon's different than your Amazon. I want to do... these other clouds are totally different than that. They don't have anything that works like that."

What we see is that collaborating around an operational strategy gets really hard as soon as the details of the operational system hit, to the point where people can't collaborate on the ops stuff at all.

Gordon:  Yeah, it's definitely challenging. That's one of the things we're still trying to work out as part of this. Speaking of OpenStack, we're working closely with OpenInfra Labs on this project. I think it's something that we need to get to, though. There are tools out there. I think you mentioned Terraform when we were discussing this, for example.

Rob:  This is, for us, trying to... I like where you're going, because open source to me is about collaboration. Fundamentally, we're trying to work together so that we're not repeating work, so that we're able to build things together.

When I look at something like Terraform, which is very focused on provisioning against a YAML file: the providers are reusable, but the Terraform plans aren't. There are two levels of collaboration here. There's collaboration in the industry, which we want to have, and there's also just collaboration inside of your organization, in your teams.

That's one of the things I always appreciated with open source and the open source revolution, where we really started focusing on Git, and code reviews, and pull requests. That process has really become standardized in industry, which is amazing.

People forget how much that's an open source creation: the need to do all these code reviews and merges. We need to figure out how to apply that to the Infrastructure as Code conversation.

I've seen a lot of people get excited about GitOps. To me, GitOps is not really the same thing. It's not Infrastructure as Code in the sense of building a CI/CD pipeline for your automation, or being able to have teams reuse parts of that automation stack, or even better, delegate parts of that stack so that the Ops team can focus on one thing and the Dev team can focus on another, just like a CI/CD pipeline would put those pieces together.

That sharing of concerns is really, I think, part of what we're talking about in the open source collaboration model.

Gordon:  Yeah, it's a very powerful thing. I don't think people really thought about open source in this way at the beginning, but it's really come to be, to a large degree, about the open source development model: creating, in many cases, enterprise software using an open source development model.

We don't have an open source operations model with that same virtuous circle of projects, products, and profits feeding back into the original community; that formulation is stolen from Jim Zemlin of the Linux Foundation.

I think you want to create that same virtuous cycle around operations. In fact, at the Linux Foundation member summit, there was a very elaborate version of that virtuous cycle for open source development, and operations was not mentioned once on the entire slide.

Rob:  This, to me, is why we focus on trying to create that virtuous cycle around operations. It really does come back to thinking through the complexity cycle: how do you prime the pump and have people, even within an organization, sharing operational components, let alone feeding them back into a community and having others be able to take advantage of them?

I should be specific. Ansible has this huge library, Galaxy, of all these playbooks, but it has copies, in some cases hundreds of copies, of the same playbook. [laughs] Because the differences, and this is where the complexity comes in, between one person's thing and another person's thing are enough to break it.

Sometimes, I think, people don't want to invest in solving that complexity problem to make sharing work. You have to be willing to say it's simpler for me to ignore all that stuff and write my own custom playbook or Terraform template or something like that. But in terms of building this virtuous cycle, you've already broken the cycle as soon as you do that.

You have to look at not eliminating the complexity but managing it. For us, that means we spend a lot of time with definable parameters.

When we put something through an infrastructure pipeline, the parameters are actually defined and immutable, part of that input, because that creates a system that lets things move forward and can be discovered. I think that this is where you sit back and you're like, "OK, the fastest, most expedient thing for me in Ops might be not managing the complexity that somebody else could pick up."
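
A minimal sketch of what defined, immutable pipeline parameters might look like; the field and stage names here are hypothetical illustrations, not RackN's actual API:

    from dataclasses import dataclass

    @dataclass(frozen=True)  # frozen: inputs are defined once and can't drift mid-pipeline
    class PipelineParams:
        region: str
        os_image: str
        node_count: int

    def provision(params: PipelineParams) -> dict:
        # A stage discovers the declared inputs and returns its own outputs.
        return {"nodes": [f"{params.region}-node-{i}" for i in range(params.node_count)]}

    params = PipelineParams(region="us-east", os_image="rhel9", node_count=3)
    print(provision(params))
    # params.node_count = 5  # would raise dataclasses.FrozenInstanceError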

Gordon:  Yeah, I think there's a certain aspect of tradition. The way things are often done in Ops is that you traditionally hacked together a script that solved a particular problem and went on with the rest of your day, rather than sitting down and generalizing and working with other people.

I'm sure there are exceptions to that at some of the large-scale companies out there, but it definitely is, historically, how things tended to happen.

Rob:  There are two things that we see driving this. One of them is not accepting that heterogeneity is part of life. When you look at a script, you're like, "OK, I can eliminate a whole bunch of code," which is generally seen as a good thing, by ignoring all these edge cases that somebody else injected in there. But those things exist for a reason. Even if you don't care about the server, the vendor, or the cloud that you're dealing with, it's smart to try to figure out how that works so that you can keep it in, because that keeps the code better from that perspective.

Then there's another piece to it, where, as we connect these pieces together, we actually need to be aware that they have to hand off and connect to things. The other mistake I've seen people make in a complexity management perspective is assuming their task is the only task.

I see this a lot when we look at end-to-end provisioning flows: configuration tasks are different than provisioning tasks, which are different than monitoring tasks, which are different than orchestration. You have to understand that they're usually intermixed in building a system well, but they are different operational components.

You might have to make something that's less efficient or harder, or has more edge cases, in one case, to do a better job interacting with the next thing down. I'm thinking through how this translates in open source more generally. Do you want to add something to that?

Gordon:  No, I think one of the things we've struggled with around Operate First and similar work is turning this into what we can concretely do going forward. I'm curious, maybe in closing, what some of your thoughts are. What are the main next steps for the industry, for operations communities, over the next year or so?

Rob:  We've gotten really excited about this generalized idea of an infrastructure pipeline, because it's letting us talk about Infrastructure as Code beyond the Git-and-YAML discussion and talk about how we connect together all of these operations.

When we think about collaboration here, what we're really looking at is getting people out of the individual step of that pipeline conversation and start thinking about how things connect together in the pipeline.

It's a nice analogy back to the CI/CD revolution from five or six years ago, where people would be like, "Oh, CI/CD pipelines are too hard for me to build, it's all this stuff, and I'm gonna deploy 100 times a day like they do in the..." At the end of the day, you don't have to look at it that way at first.

The goal is to be doing daily deployments, and every commit goes to production, and things like that. But the first thing you need to do is start connecting two adjacent pieces together in a repeatable way. A lot of times that means two teams collaborating together, or it means being able to abstract out the differences between cloud types or hardware types or operating system types.

That's what I hope people start thinking about: how do we connect together a couple of links in this infrastructure pipeline chain?

The reason why I'm so focused on that, is because if you can connect those links together, you've actually managed the complexity of the system. In some cases, what you've done is you've made it so that the tools that you're using focus on doing what they do well.

One of the things in ops that I find really trips people up is when they use a tool outside of its scope. [laughs] Using a tool for something that it's not designed for often causes a lot of complexity. Sure, you can do it.

Sometimes that adds a lot of complexity because the tool wasn't designed for that and you're going to run it into a fragile state or you're going to have to do weird things to it. This lets you say, "All right, this tool works really well. At this stage, I hand off to the next stage, I hand off to the next stage."

It's more complex, maybe, to build a pipeline that does that, but the individual components of the pipeline are actually then simpler. The connections between things, now that they're exposed, have also reduced the complexity budget in your system, because you've worked on those interconnects.

The way I think about that is from a coupling perspective: thinking deliberately about coupling.
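
To illustrate the idea of connecting two adjacent links with an explicit, repeatable handoff (the stage names and structures below are hypothetical), each stage does one job, and the only contract is the structure it hands to the next:

    # Hypothetical stages: each does one job; the pipeline is just the
    # composition of the links you have connected so far.
    def provision(spec: dict) -> dict:
        return {"hosts": [f"host-{n}" for n in range(spec["count"])]}

    def configure(provisioned: dict) -> dict:
        return {host: {"role": "worker", "monitored": True} for host in provisioned["hosts"]}

    state = configure(provision({"count": 2}))
    print(state)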

Gordon:  Great. That's probably a good point to end on. Anything you'd like to add?

Rob:  No, this was fantastic. It's exactly the topic that I've been hoping to have around managing complexity and thinking about it from a community perspective. I appreciate you opening up the mic so that we could talk about it. It's a really important topic.

Gordon:  Thanks, Rob. Enjoy the holidays.

Rob:  Thanks, Gordon. It's been a pleasure.

Monday, December 20, 2021

RISC-V with CTO Mark Himelstein


RISC-V is an open instruction set architecture that's growing rapidly in popularity. (An estimated two billion RISC-V cores have shipped for profit to date.) In this podcast, I sat down with Mark Himelstein, the CTO of RISC-V International, to talk about all things RISC-V including its adoption, how it's different from past open hardware projects, how to think about extensibility and compatibility, and what comes next.

Listen to the podcast [MP3 - 22:54]

[TRANSCRIPT]

Gordon Haff: I'm very pleased to have with me today Mark Himelstein, who's the CTO of RISC-V International, which just finished holding a summit in San Francisco that I was pleased to be able to attend in person.

Welcome, Mark. Maybe you could just introduce yourself and give a brief overview of what RISC-V is.

Mark Himelstein: I'm Mark Himelstein. I'm the CTO. I've been in the industry for a bit: I was an early employee of MIPS, and I ran Solaris for Sun. I've done a lot of high-tech stuff, and I've been with RISC-V for about a year and a half. Very excited. This was an incredible year for us, a very big change for us.

First of all, we believe that there have been well over 2 billion RISC-V cores deployed for profit this year, which is an important thing. Success begets success and adoption begets adoption.

A lot of people joined us early on and they're early adopters, and now, you're seeing people say, "Oh, they're successful now. I can be successful."

RISC-V is an instruction set architecture kind of halfway between a standard and open source Linux, kind of right in between there. We don't do implementations. We're totally implementation-independent. We work with other sister organizations that are nonprofit like lowRISC, and CHIPS Alliance, and Open Hardware who do specific things in hardware with RISC-V.

We just really work on the ISA -- the instruction set architecture -- and we work on fostering the software ecosystem. All compilers, runtimes, operating systems, hypervisors, boot loaders, etc., all the things that are necessary for members to be successful with RISC-V.

It's a community of probably about 300 institutions and corporations. There are probably over 2,000 individual members, somewhere around 55 groups doing technical work, about 300 active members in those groups, and about 50 leaders.

They just did an incredible job this year ratifying 16 specifications. In 2020, we did one, so that's very big growth for us. A lot of things that had been hanging out there for some time, four to six years: things like Vector and Scalar Crypto, very innovative things, as well as some basic stuff like hypervisor and bit manipulation.

We finally got the standard out, so everybody's grateful for that.

Gordon: I want to talk about standards a little bit more in a moment. You mentioned this open ISA. What was the thinking behind taking this approach? Because obviously, there have been earlier open hardware or semi-open hardware types of projects, which haven't necessarily had a big impact, or at least not as big an impact as maybe some people had hoped they would have at the time.

How is RISC-V different?

Mark: Yeah, it's a really good question. One of the problems when you hand something over whole cloth as open source is that it's hard for people to really feel ownership around it. The one thing that Linux did was that everybody felt a pride of ownership. That was really hard to do.

We are the biggest open source ISA, and unlike the other ones, we were actually born in open source. People are afraid that if one of the big corporations behind something goes away, then the open source will go away, the actual standard will go away. Rightfully so; we've seen that occur before in the past.

RISC-V comes along, and it's different. Krste Asanović at Berkeley wanted to do some stuff. The story was, he wanted to do some vector work, and Dave Patterson had done RISC I, II, III, IV. They came up with RISC-V, with a V that doubles as the Roman numeral five and as vector.

All of a sudden, there's this groundswell of people who are interested in it. It got so exciting for folks that in 2015, they started plotting how to make it an open source organization, and they did in 2016. It's just taken off from there. People have been dying for this.

It's very clear. There's flexibility with respect to pricing: it's free. More importantly, there's also flexibility with respect to customization. You can do anything you want with it; nobody's standing over your shoulder.

We provide places for people to do custom opcodes and encodings and stuff like that. It's set up for extensibility. We believe that it will last for a long time because you can extend it over and over and over again, as we did this year: we added Vector, we added these other things.

It's extensible. It's free. It's flexible to use any way you want to. We've also had a renaissance in EDA over the last 15 years.

It's a lot easier to pump down a bit of logic to go off and do, hey, some security module using a RISC-V core, where it may have been harder to do that around the year 2000. That's gotten easier. This combination of things has been incredible.

You see adoption and you see deployment of products more in the IoT and embedded space because the runway is shorter. It's not a general-purpose computer; you're running one application, and you get it working.

Wearables and industrial controllers and disk drives and accelerators that go into data center servers for AI and ML, graphics. All those things, you're seeing them first. Then, the general-purpose computers come out a little bit later.

Accepting there's always exceptions: Alibaba announced at the summit last year that they have a cloud server based on RISC-V, and they have their next generation coming out.

You see RISC-V in every single part of computer science, from embedded to IoT to Edge to desktop to data center to HPC. I even have a soldering iron made by pine64.org that has a RISC-V processor.

Gordon: To this point about extensibility, there was a fair bit of discussion at the RISC-V Summit about, essentially, fragmentation versus diversity. This idea that you have all these extensions out there, but if people use them willy-nilly, then you're breaking compatibility.

I know there are some specific things like profiles and platforms that are intended to address that potential issue to some degree. Could you discuss this whole thing?

Mark: Yeah. I have a bumper sticker statement that says, "Innovate. Don't duplicate." That's the only thing that keeps us together as a community. Why do you want to go ahead and implement addition and subtraction for the thousandth time? Why do you want to implement the optimizers for addition and subtraction the thousandth time? You don't.

That's the reason why so many people are coming to the table as part of the community, with a contributor culture that was built by Linux.

Why are they showing up? Why are they doing work? They're doing it because they realize they don't want to do it all. It's too expensive to do it all. There are many countries or companies or whatever that were doing unique processors themselves because the licenses or the flexibility weren't available in other architectures.

They don't want to do their own stuff. The same reason why people didn't want to get hooked into Solaris or AIX. All those things that are going to Linux have gone to Linux.

It's the same reason why the folks coding RISC-V don't want to be beholden to one company; they want the flexibility and the freedom to prosecute their business the best way that they see fit, and we allow them to do that.

Now, they want to share. How are we going to have them share? We have the same thing that shows up with something like Linux, in that we have to make sure that there are versions that work together.

We've done the same thing, in the same way that you have generational sets of instructions that work together, either by a version number or a new product name or a year.

We have the same thing with us, with profiles. RVA is the application profile, RVM is the microcontroller bare-metal profile. They'll both be coming out almost every year, initially, and probably slower as time goes on.

RVA20 is the stuff that was ratified in 2019. RVA22 is the stuff that was ratified in 2021. It works for all applications. We can tell the distros, we can tell the upstream projects like the compilers, GCC, LLVM: this is what you go after.

Everybody knows, all the members know. If they're going to do something unique and different, they have to support that themselves. If they want to negotiate with the upstream projects, we don't get in the way, they can go ahead and do that.

The upstream projects know the profiles that are most important. The platforms are very similar, but for operating systems. We want them to be able to create a single distro, a single set of bits, that people download and configure and it works. Things like ABIs, things like discovery, things like ACPI, all those things are found in the platform.

The same thing will happen; it will come out on a yearly basis. There's, again, an application-layer platform, and there's a microcontroller one for real-time OSes and bare-bones things. As you might imagine, the bare-bones versions of both the profiles and platforms are very sparse.

There's not much in there, because people don't want you to do a whole lot. To the point where we had the M extension previously, and that M extension had multiply and divide. They don't want divide. It's too expensive in IoT, so we're breaking it apart.

We're going to have a separate multiply extension that people can go ahead and use. Both of them are optional down at the bottom. We've provided a way that all the upstream things can go ahead and deal with it, all the distros can deal with it. Then, people can jump on board and use those things.

Ultimately, the goal is simple: be able to take an application that was compiled for one implementation and have it run on another implementation, and have them produce the same results within the bounds of things like timing.

Same thing with operating systems: one set of bits that you can download onto multiple implementations, configure, and have work. That's how we're working on constraining fragmentation and giving you a tool to be able to do it. Again, the only reason for people who want to fragment to converge is so that they can share.
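
As a rough illustration of how a profile constrains fragmentation (the extension sets below are simplified stand-ins, not the actual RVA20/RVA22 definitions), an implementation complies if it offers at least everything the profile names:

    # Simplified, illustrative extension sets; not the real profile contents.
    RVA20 = {"I", "M", "A", "F", "D", "C"}
    RVA22 = RVA20 | {"Zba", "Zbb"}

    def satisfies(implemented: set, profile: set) -> bool:
        # A binary built against a profile should run on any compliant core.
        return profile <= implemented

    my_core = {"I", "M", "A", "F", "D", "C", "Zba", "Zbb", "Zicsr"}
    print(satisfies(my_core, RVA20), satisfies(my_core, RVA22))  # True True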

Gordon: Dave Patterson made a comment in the "Meet the Board" session before the RISC-V Summit that, for a lot of uses (you alluded to this with IoT devices), the sort of microprocessor compatibility you've had with x86 is often not the right lens through which to look at RISC-V. It can be, of course, but it isn't, necessarily.

Mark: Even those guys want to share things. They're not going to want to do their compiler from scratch, but they're using the base 47 instructions instead of all the rest of the extensions. They don't care about those, for exactly the reason you said.

Again, the thing that brings people together is common things that they have to do over and over again. I'll give you one very simple example: we're working on something called Fast Interrupts right now. What does it mean? It's shortening the pathway to get into an interrupt handler, not having to save and restore all the registers, for embedded.

That's what it's for. Very simple. All the embedded guys are in there, even though they're doing their own thing. They want to agree on one set of calling conventions and make it easy for them to do that.

That's not something that they're using for interoperability between their parts. That's something they're using so they don't have to duplicate the work between companies.

Gordon: Let me ask you a couple of related questions. The first of them is, where were the initial wins for RISC-V? A related question is, have there been wins with RISC-V that you didn't expect?

Mark: First of all, remember, we don't collect any reporting information. We don't require that somebody tell us how many cores, what they're used for, or anything like that. Anything we get is anecdotal.

The other thing is we don't announce for anybody. It's not our job to do that. We'll help amplify. We have a place on the RISC-V website for everybody to advertise for free, if I remember right, called the RISC-V Exchange. All that's wonderful.

The stuff we hear is when we have side meetings at conferences, like the summit and stuff like that. We know of more design wins and deployments in the IoT and embedded space than anywhere else, again, because of the runway. It's not a general-purpose computer.

One that's exciting that people may not realize is that a lot of the earbud manufacturers, especially out of China, are using RISC-V as their core. One is called Bluetrum, now a member, shipping probably tens of millions of units per month with RISC-V cores. That's exciting to me.

I think that, again, it's one of those things where it shows off the ability to take a RISC-V core, do something with it quickly, and get it out there. I have in my house 85 WiFi-connected devices: switches and outlets and doorbells and gates and garages and all that stuff. 10 percent of them are Espressif.

Espressif, again, is a member. They have gone ahead and produced RISC-V chips; you can see the RISC-V module in home automation stuff. There are a lot of things showing up in a lot of places that we may not hear about right away.

We hear about them secondarily; they are, A, a surprise, but B, exciting, and C, they engender success. When people see other people being successful doing this, they go and say, "Hey, I can do this, too." I think that that's amazing.

Again, you're going to see this continue up the chain. There are exceptions, like Alibaba doing their cloud server; the servers are a little bit further out. The HPC guys are actively working in the European Processor Initiative and the Barcelona Supercomputing Center. All those guys are working on stuff. We know that the United States government in various places is working on things.

The gentleman who runs our technology sector committee is this guy named John Liddell from Tactical Labs in Texas. He works with various government organizations and has simple things like Jenkins rigs to do tests for RISC-V and stuff like that.

There's a lot of work that goes in various areas, but I don't think there's a single part of computer science that isn't looking at RISC-V for something or another, whether it be a specialized processor to help them do security or processing for ETL, or something like that, or something that's a general-purpose thing. It's everywhere.

You're going to see more and more products come out over time. We're not the only ones who are taking a look at how much is coming out. All the analysts have put out numbers, and they're predicting in the range of 50 billion to 150 billion cores out there in a very short period of time. It's going to grow as people see that it's an easy thing to do.

Gordon: What is your role at RISC-V? What do you see your primary mission as being?

Mark: I like to make things simple. The most important thing for me is the proliferation of RISC-V cores for profit. That has to be the thing that stays in your mind. In the short term, my goal is to get people over the goal line with the pieces they need.

In 2020, we produced one spec; in 2021, we did 16. That's through the effort of me and everybody else on the team to prioritize, put governance in place, get groups help where they needed it, and push things over the goal line: getting those specs out there that the members care about in order to make their customers successful.

Then, finally, the ecosystem. Look, without compilers, without optimizers, without libraries, without hypervisors, without operating systems, it just doesn't matter. It doesn't matter how good your ISA is. Having all those pieces there is really important.

I'm a software guy, and they hired a software guy to do this job because of that. I've worked in the NSA, but I understand software everywhere from boot loaders up to applications.

I've worked all those pieces. It's really critical, and you're going to see us place even more emphasis on that. That's been the greatest growth area in our groups over the last year, and you're going to see continued effort by the community.

Gordon: I think you've maybe just answered this, but if you look out a year, two years, what does success look like? Or, conversely, what would you consider to be flashing alarm lights or bells going off?

Mark: One of the things that we haven't done up until now is really put a concerted effort toward industries. A lot of it has been really bottoms-up: "Hey, we need an adder, right? We need multiply. We need vector." Those are things where we go, "Hey, other architectures have this."

Now, we're really starting to take a look, from the board to the steering committee down through the groups, at things like automotive, data center, finance, and oil and gas, and trying to look holistically at what those industries need to succeed.

Some of it's going to be ISA. Some of it's going to be partnering with some of these other entities out there. Some of it's going to be software ecosystem. The goal is to not peanut-butter-spread our efforts to the point where nobody can be successful in any industry, right?

It's important we say, "OK, you're doing automotive." All of a sudden, you have to look at ASIL and all these ISO standards, functional safety, blah, blah, blah, and we have to make sure that stuff occurs. We have a functional safety SIG by the way.

Success, to me, looks like continued deployment of cores that are sold for profit, and then starting to attack some of these industries holistically that need these pieces and make sure that all the pieces they need inside of RISC-V are there and working and completed.

Gordon: Well, thank you very much. Is there anything else you'd like to add?

Mark: Well, again, I think the biggest thing is just a big thank you to you and the rest of the community for being inquisitive and participating and joining the contributor culture, and helping make RISC-V a success. We're always looking for people to help us and join us, so look at riscv.org. If you have any questions, send mail to help@riscv.org. Thank you very much.

Gordon: Other than just going to riscv.org, are there any particular resources that somebody listening to this podcast might want to consider looking at?

Mark: If they're very technical: under riscv.org, there's a tech tab. Underneath there, there's a tech wiki. That has pointers to GitHub with all the specs, to the upstream projects, GCC, LLVM, our governance, all those things. It gives you a really good jumping-off point. There's a getting-started guide there as well for tech folks.

In general, if you're not a member, become a member. It's really easy. If you're an individual, you can become a member for free. If you're a small corporation just starting out, we have some breaks. There are different levels of membership: strategic, premier TSC, premier. Come join us. Help us change the world. This is really different.

I had no clue what this was when I joined it. I'm very grateful, and I'm very happy to see it really is making a very big difference in the world.