Jeremy Barnes, Vice President, Product, Platform AI, ServiceNow Rebecca Gorman, Co-founder and CEO, Aligned AI Raza Habib, Co-founder and CEO, Humanloop Moderator: Jeremy Kahn, FORTUNE
00:00 One of the biggest impediments businesses are encountering
00:02 in trying to implement generative AI
00:05 is the issue of trustworthiness.
00:07 These systems just do not seem reliable enough
00:09 to use in many business contexts.
00:13 And I'm curious-- businesses are very concerned
00:15 about reputational damage.
00:16 They're concerned about giving out inaccurate information.
00:19 But we're going to explore here maybe ways
00:21 we can try to solve some of those problems
00:23 and make AI more controllable.
00:24 And I think control is also a big issue,
00:26 particularly as we move from AI that can just generate content
00:31 to AI that's actually going to take actions, which
00:33 is where everyone says that this is probably going.
00:36 Jeremy, I'm going to start with you.
00:37 At ServiceNow, how have you tried
00:39 to address this problem of trustworthiness?
00:41 You're trying to implement systems
00:43 for a lot of customers at scale, one of the biggest
00:45 enterprise tech service companies out there.
00:49 How have you tried to deal with trustworthiness?
00:52 Yeah, and it's something that you
00:54 need to think about very carefully from the beginning.
00:56 It's not something that you can bolt on.
00:58 So I've been thinking about it for quite a number of years.
01:02 A big part of our approach there is
01:03 to make sure that we understand what we're
01:09 solving for, for the end user.
01:11 There's a little bit of a false idea
01:15 that if each of the components of the system
01:17 you build is trustworthy, then the entire system is.
01:21 That's not entirely true.
01:23 You have to work to analyze the end state of the whole system.
01:28 And that really comes into how we design,
01:31 test, and build our products.
01:34 So we have a fundamental research lab
01:36 that is working on issues of trustworthiness
01:39 and has published about 70 papers on the subject.
01:43 We have a set of what we call our human-centric AI guidelines.
01:47 So that's ensuring that there is transparency,
01:50 that it's human-centric, that it's accountable,
01:53 and that it's inclusive as well.
01:56 So each of our products goes through that kind of design
01:58 process to ensure that we have those built in.
02:02 And there's just a giant amount of testing,
02:04 testing of the end product.
02:07 We use it internally before we release it.
02:09 And we've trained hundreds of people to do this testing.
02:14 And it's important.
02:15 It's not just testing the model.
02:16 It's testing the system end to end
02:18 so that we can get some kind of guarantee that it's not
02:22 going to do certain things,
02:23 because we've tested very carefully around the edges
02:26 to make sure that it is--
02:27 I was going to ask about edge cases.
02:28 How do you identify those?
02:30 And how do you make sure that--
02:31 because you might have a system that seems like it's working.
02:33 And then you put it in deployment.
02:34 And suddenly, a user asks something
02:36 that you weren't expecting.
02:37 And you get a failure mode.
02:39 Right.
02:39 Yeah, so part of that is addressed
02:42 with our human-centric AI guidelines.
02:45 So we make sure that we have a human
02:46 in the loop for any consequential decision.
02:49 So there is the ability for the human
02:51 to understand what's going on and override it.
02:54 But a lot of it is the mentality of testing:
02:57 you need to figure out--
02:58 you can't just test that it works.
03:00 You have to know where it's going to break.
03:02 And then you can put guardrails in around that.
03:06 And so it's really how you equip that team
03:08 to be able to do that so you get the outcome you want.
03:12 Right.
03:12 Raza, I want to go to Humanloop.
03:14 You guys are creating systems for figuring out
03:17 reliable prompts for LLMs.
03:19 Can you talk about how you do that
03:20 and how you're able to implement that?
03:22 Because I think a lot of people struggle with that.
03:24 One particular subtle prompt works really well.
03:27 And then the next week, it doesn't work.
03:29 Or you change a few words in the prompt.
03:31 And suddenly, you get a completely different answer.
03:33 Yeah.
03:34 So I think Jeremy touched on some of this
03:36 already in terms of having good processes for testing
03:39 internally.
03:39 But maybe if we zoom out for a moment,
03:41 it's worth understanding what's changed
03:43 that means that this stuff is harder than before.
03:45 And I think fundamentally, if you
03:47 think about large language models or generative AI
03:49 compared to traditional software and machine learning,
03:51 we've gone from a paradigm with traditional software
03:53 where you run the software.
03:54 And every time, you get the same result.
03:56 So it's very easy to write a test for that.
03:58 And then with machine learning, OK, it was stochastic.
04:00 It was random.
04:01 But at least I could calculate a number.
04:02 I knew where the accuracy was.
04:03 I could calculate precision or recall.
04:05 We had ways of doing this.
04:06 We're now applying these technologies
04:08 to things that are much more subjective.
04:10 And so actually, even just saying what good looks like
04:12 can start to become difficult. And so the first thing we do
04:16 is give people tools to be able to measure performance
04:20 in a variety of different ways, so building
04:22 a suite of different evaluators that
04:24 are measuring different aspects of how well the system is
04:27 working, both during development and production,
04:29 and then iterating against those with some kind of test set.
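To make the evaluator idea concrete, here is a minimal sketch of what a suite of evaluators run over a test set can look like. The evaluator functions, checks, and toy examples are illustrative assumptions, not Humanloop's actual API or product.

```python
# Minimal sketch of a suite of evaluators run over a small test set.
# All names and checks here are hypothetical illustrations, not Humanloop's API.

from statistics import mean

def evaluator_length(output: str) -> float:
    """Penalize answers that are too short or far too long."""
    n = len(output.split())
    return 1.0 if 5 <= n <= 200 else 0.0

def evaluator_no_refusal(output: str) -> float:
    """Flag boilerplate refusals in contexts where an answer is expected."""
    refusals = ("i cannot", "i'm sorry", "as an ai")
    return 0.0 if any(r in output.lower() for r in refusals) else 1.0

def evaluator_has_citation(output: str) -> float:
    """Check that the answer cites a source, e.g. '[1]' or a URL."""
    return 1.0 if ("[" in output and "]" in output) or "http" in output else 0.0

EVALUATORS = {
    "length_ok": evaluator_length,
    "no_refusal": evaluator_no_refusal,
    "has_citation": evaluator_has_citation,
}

# A toy test set: in practice this would be real prompts plus logged model outputs.
test_set = [
    {"prompt": "Summarize our refund policy.",
     "output": "Refunds are issued within 30 days [1]."},
    {"prompt": "Draft a follow-up email.",
     "output": "I'm sorry, as an AI I cannot help with that."},
]

def run_suite(examples):
    """Score every example against every evaluator and aggregate per evaluator."""
    scores = {name: [] for name in EVALUATORS}
    for ex in examples:
        for name, fn in EVALUATORS.items():
            scores[name].append(fn(ex["output"]))
    return {name: mean(vals) for name, vals in scores.items()}

if __name__ == "__main__":
    print(run_suite(test_set))  # e.g. {'length_ok': 1.0, 'no_refusal': 0.5, 'has_citation': 0.5}
```

The same scores can be tracked in development and recomputed over production logs, which is what makes it possible to notice a regression rather than discover it from user complaints.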
04:33 And Jeremy asked Jeremy, how do you test for edge cases?
04:39 And I guess I'd say two things to that.
04:41 One, in some senses, because it's subjective,
04:44 the correct answer becomes what your users say
04:46 the correct answer is for some of these applications.
04:48 Not for factual ones, but if you're writing someone an email
04:51 or doing something that's creative in nature,
04:53 you're trying to optimize for end user experience,
04:56 not for some internal metric.
04:58 And you find the edge cases to a certain extent.
05:01 You can cover as much as you want pre-production.
05:03 But what you also need to be able to do
05:04 is if something goes wrong in production, know about it.
05:07 And if you don't have good evaluations in place,
05:09 you can't even catch those edge cases to fix them.
05:11 So I think what a lot of people worry about is
05:13 deploying systems that have edge cases
05:14 that they don't know about.
05:16 And we try to give people the tools to avoid that.
05:18 Interesting.
05:19 Rebecca, at Aligned AI, you're all
05:21 about trying to create AI that has
05:22 more conceptual understanding.
05:24 Can you talk about why that's important
05:26 and how you're trying to achieve that?
05:28 Yes.
05:29 So if you think about artificial intelligence,
05:32 it's made up of three pieces.
05:35 There's computer science.
05:37 There's statistical science.
05:40 And then there's data.
05:43 And sometimes an artificial intelligence
05:47 can go wrong because the computer science bit of it
05:53 isn't completely reliable, like in the Post Office scandal
05:59 we had here with Fujitsu in the UK.
06:03 And sometimes it can go wrong because
06:06 the statistical science
06:11 has inherent limitations.
06:13 So anyone here who works in artificial intelligence
06:16 knows about things like concept drift and data drift.
06:21 And all models degrade over time.
06:27 95% of models, AI models, have to be retrained within a year
06:31 because the world changes.
06:35 So you might not think this affects large language models,
06:38 but actually, GPT-2 came out before the pandemic,
06:43 and people were still using it during the pandemic.
06:46 And it made absolutely incorrect statements
06:50 about what was going on in 2020, which
06:52 were a bit odd when you were trying to use it in 2020.
06:57 So even language models need to adjust to the world
07:03 as it changes, as our language changes.
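One common way to catch the drift being described here is to compare the distribution a feature had at training time with what production traffic looks like now, for example with a two-sample Kolmogorov-Smirnov test. This is a generic sketch of that standard technique, not Aligned AI's method; the feature values and the threshold are made up.

```python
# Generic sketch of data-drift detection with a two-sample KS test.
# This is a standard technique, not Aligned AI's specific approach.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values observed at training time vs. in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)  # the world has shifted

result = ks_2samp(train_feature, prod_feature)

# A small p-value means the production distribution no longer matches training,
# which is the usual trigger for investigation or retraining.
if result.pvalue < 0.01:
    print(f"Drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.2e}) - consider retraining")
else:
    print("No significant drift detected")
```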
07:05 And how are you trying to solve that?
07:08 Well, Jeremy, I was going to mention the third part, which
07:11 is data.
07:12 And data is also another wild card
07:17 when you come to artificial intelligence,
07:19 because all artificial intelligence is built on data.
07:23 So we make artificial intelligence that--
07:26 our goal is to make artificial intelligence that
07:31 can understand human concepts, even in situations
07:34 where the data environment has changed.
07:37 The data environment is different from what
07:40 it was yesterday.
07:41 It's different from what humans expected.
07:43 Human blind spots are--
07:46 there's always blind spots when humans look at data,
07:49 because we're the ones who collect it.
07:52 We're the ones who label it.
07:54 It looks right to us.
07:56 Obviously, we've created a data set that looks right to us.
07:59 So the blind spots that artificial intelligence
08:02 learns from this data are our own blind spots.
08:05 And so they're very, very hard for humans
08:07 to identify and recognize.
08:09 So that's why we need to create artificial intelligence
08:11 systems that can become aware of their own blind spots
08:14 and extend that awareness to new environments.
08:16 And you've had some initial success with this.
08:18 I mean, I don't know if you want to talk about what you've
08:19 done with content moderation and some of the results
08:21 you've had there.
08:22 But I think that might be interesting,
08:23 that this is not complete science fiction.
08:25 People-- it's possible to actually use
08:27 some of these techniques to achieve some success.
08:30 Yeah, so one of the early open source chatbots--
08:39 open source language models that you would know the name of--
08:44 they released a chatbot that they had to shut down.
08:47 And it was because they were using OpenAI's content filter,
08:52 and a lot of toxicity was getting through.
08:56 So they asked us if we could do better.
08:57 So we looked at OpenAI's content filter.
09:00 They have an open source--
09:04 they present open source of data sets that they use, et cetera.
09:08 And it turned out that their content filter
09:11 worked great, 97% or something, on the data sets
09:18 they were trained on.
09:18 But they didn't work well on different data
09:20 sets of toxic content that the model hadn't been trained on.
09:24 So AlignedAI created a content filter
09:28 that beat the OpenAI content filter on data
09:33 sets of toxic content that the artificial intelligence hadn't
09:37 previously seen.
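The gap described here is easy to surface if you keep an out-of-distribution test set alongside the in-distribution one. A toy sketch of that comparison follows; the classifier and data below are placeholders, not OpenAI's or Aligned AI's actual filter or data sets.

```python
# Generic sketch: compare a content filter's accuracy on in-distribution
# vs. out-of-distribution toxic content. The classifier and data are placeholders.

def toy_filter(text: str) -> bool:
    """Stand-in classifier: flags text containing words it was 'trained' on."""
    known_toxic_words = {"insult_a", "slur_b"}
    return any(w in text.lower() for w in known_toxic_words)

def accuracy(examples):
    """examples: list of (text, is_toxic) pairs."""
    correct = sum(toy_filter(text) == is_toxic for text, is_toxic in examples)
    return correct / len(examples)

# In-distribution: toxic phrasing similar to the training data.
in_dist = [("that is an insult_a", True), ("have a nice day", False)]

# Out-of-distribution: toxic content phrased in ways the filter never saw.
out_of_dist = [("novel_slur_c you fool", True), ("thanks for the help", False)]

print("in-distribution accuracy:    ", accuracy(in_dist))       # looks great
print("out-of-distribution accuracy:", accuracy(out_of_dist))   # reveals the gap
```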
09:39 I want to do a quick sort of rapid fire set of questions
09:41 here.
09:42 One of the big issues is around hallucination
09:44 that people are concerned about.
09:45 I want to ask you all whether you think this is solvable,
09:49 and if so, on what time scale?
09:51 Do you have a prediction for how many years out
09:53 we are from solving this problem, if it's solvable?
09:56 Or maybe you think it's not.
09:57 So I'm curious.
09:57 We'll just go down the line.
09:58 Raza, what about you?
09:59 Do you think hallucination is a solvable technical challenge
10:02 for these LLMs?
10:03 Yeah, so I think it is solvable technically.
10:06 But I also think that it doesn't need
10:07 to be solved for this to be used productively.
10:10 So there are ways--
10:11 we're used to designing user experiences that
10:13 are fault tolerant in some way.
10:14 You go to Google search, you get a ranked list of links,
10:16 you don't get an answer.
10:17 And people using Perplexity get citations back now.
10:20 So I don't think it has to be solved to make it useful.
10:23 But the reason I'm optimistic that we can solve it
10:26 is if you look at the--
10:27 when you train a large language model,
10:29 it kind of goes through three stages.
10:30 There's kind of pre-training, and then there's
10:32 sort of fine tuning, and then reinforcement learning
10:35 from human feedback as the last stage.
10:37 And if you look at the models before they are fine
10:39 tuned on human preferences, they're
10:41 surprisingly well calibrated.
10:43 So if you ask the model for its confidence in an answer,
10:46 that confidence correlates really well
10:48 with whether or not the model is telling the truth.
10:50 We then train them on human preferences and undo this.
10:53 And so the knowledge is kind of there already.
10:56 And the thing we need to figure out is how to preserve it
10:58 once we make the models more steerable.
11:01 But the fact that the models are learning calibration
11:03 in the process already makes me very optimistic
11:05 that it should be much easier to solve than most people--
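Calibration in this sense can be checked directly: collect the confidence the model assigns to each answer, record which answers were actually correct, and compute something like an expected calibration error. A minimal sketch with made-up numbers; these are not measurements of any particular model.

```python
# Minimal sketch of measuring calibration: does the model's stated confidence
# track whether its answers are actually correct? Numbers below are made up.

import numpy as np

# (confidence the model assigned to its answer, whether the answer was correct)
confidences = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30])
correct     = np.array([1,    1,    1,    1,    0,    1,    0,    0])

def expected_calibration_error(conf, corr, n_bins=4):
    """Average gap between confidence and accuracy, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            ece += mask.mean() * gap
    return ece

print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
# A well-calibrated model keeps this gap small; the observation in the discussion
# is that base models tend to be well calibrated and preference tuning widens the gap.
```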
11:08 And what time frame would you guess?
11:10 Oh, like within a year.
11:12 OK.
11:12 Rebecca, what do you think?
11:15 Yeah, I think--
11:16 Solve or not?
11:18 Yeah, obviously solvable.
11:20 Right, and what time scale do you think?
11:22 I'll agree with Raza there.
11:27 In a year.
11:28 Jeremy?
11:29 Well, from just a model perspective,
11:31 you do have to bias them in some way.
11:33 You don't want them to only produce things
11:34 that are in the training data.
11:36 So they kind of need to--
11:37 it's part of their capability.
11:40 So from a model perspective, it's an unsolved question.
11:44 It's not going to be solvable.
11:45 From a systems perspective and user application perspective,
11:48 I agree with Raza.
11:50 It's already solved in many cases.
11:53 And when you can build a system around it,
11:56 you have other tools there.
11:57 So a model which will hallucinate
12:00 can still be incredibly useful in producing value.
12:02 Can I add one thing to that?
12:03 Yeah, go ahead, Raza.
12:04 It's not always a bug.
12:05 Sometimes it's a feature.
12:06 So if we want to have models that will one day
12:08 be able to create new knowledge for us,
12:10 then we need them to be able to act as conjecture machines.
12:13 We want them to propose things that are weird and novel
12:15 and then be able to filter that in some way.
12:17 And so in some senses-- and also if you're doing creative tasks,
12:21 actually having the model be able to fabricate things that
12:24 go off the data domain is not necessarily a terrible
12:26 thing.
12:27 Interesting.
12:27 I'd like to add to my answer as well.
12:29 Yeah, so as someone said up here yesterday,
12:33 foundation models as they're created today
12:35 are hallucination machines.
12:36 It's just sometimes the case that their hallucinations
12:40 correspond with reality.
12:43 So while I think it's solvable in a year,
12:46 I also think that, well, obviously,
12:49 making the models bigger is not going to solve that problem.
12:53 Interesting.
12:53 Would you agree that humans are hallucination machines?
12:55 That's a philosophical question.
12:59 I would say yes, that we are, but in a different way.
13:02 Interesting.
13:03 I want to go to the audience for questions.
13:05 Please raise your hand if you have a question,
13:07 and we'll get a mic to you.
13:08 And anybody have questions?
13:10 If not, I'll keep going.
13:13 There's a question there.
13:15 And if you could please identify yourself when you stand up,
13:17 that would be great.
13:19 Sim Simeonov, CTO of Real Chemistry.
13:21 So when new technology comes, there's adaptation.
13:24 We adapt technology to humans.
13:25 We also adapt humans to technology.
13:27 And we've seen that happening a lot with technology
13:29 over the years.
13:29 So I'm curious about your take on what's
13:31 the biggest adaptation of humans to generative AI
13:34 that you expect in the next few years?
13:36 Biggest adaptation of humans to generative AI.
13:37 Jeremy, you take that one first.
13:39 Well, we saw it happen when Google first
13:41 came onto the scene.
13:42 People just changed the way that they
13:44 searched in order to type in queries that
13:46 the system could actually handle.
13:48 So I think some of the challenges with those systems
13:52 will gradually improve
13:54 as people learn how to steer them.
14:00 On the flip side, we probably don't
14:03 want to expect people to adapt.
14:04 I think one of the great things about the promise
14:08 of all this new generative AI technology
14:10 is that it can meet humans where they are instead of humans
14:12 having to come to the technology.
14:14 So if you've used a chatbot from pre-gen AI,
14:17 you could get it to work, but you
14:19 had to really know exactly how to phrase things.
14:22 And so yes, I think it will happen.
14:24 But I hope that we don't count on it too much.
14:28 Interesting.
14:29 Other questions from the audience?
14:31 Please raise your hand if you have questions.
14:34 I wanted to ask about the Air Canada chatbot incident.
14:39 Because that's a case where it was
14:41 using some sort of rag mechanism.
14:42 It retrieves a document.
14:43 It thinks says something, but it has not
14:46 summarized the document correctly.
14:47 And then you have a user, a customer in that case,
14:49 getting incorrect information and ultimately suing
14:52 successfully.
14:53 I think that frightens a lot of companies
14:55 out there, that sort of incident.
14:56 I'm curious how you deal with that situation.
14:59 Since you're Canadian as well, Jeremy, I'll go to you first.
15:02 But what's your reaction to that incident?
15:04 Yeah, I'm on an Air Canada flight tomorrow.
15:08 It's close to my heart.
15:09 Listen, that's our core business.
15:11 And so it's a question of how you design these systems.
15:18 If you're designing it just
15:20 with a positive metric in mind of,
15:22 is it able to get to the end of the conversation,
15:25 not is it able to be accurate, and you don't
15:29 design the system right, that's kind of what you get.
15:33 With a lot of these systems, we talk about prompt engineering.
15:36 And I'd say the "engineering" is very much
15:38 in inverted commas here. We see that already there's
15:44 this kind of prompt debt, where there
15:46 are these prompts which don't actually represent what we
15:49 wanted to do, but have only been tested
15:51 on a small set of examples.
15:54 And so again, it's how you take something which seems
15:56 to work in a proof of concept.
15:57 You probably don't just want to put it straight
15:59 into production with real customers who
16:03 have expectations and terms and conditions and things
16:06 like that.
16:06 And so it's how you bring it into the enterprise context
16:09 so that you're able to knock off some of those rough edges.
16:13 That's where the tough work is.
16:15 And yeah, that needs to be done.
16:17 Prompt debt's an interesting concept.
16:18 I mean, is that something you've been considering or--
16:21 Yeah, absolutely.
16:22 I mean, in some senses, my answer--
16:24 this feels like a softball question for me
16:25 because how could I fix this?
16:27 Well, they should have used Humanloop.
16:30 But yeah, the management of prompts,
16:33 like versioning them, having history of them,
16:34 being able to test them rigorously,
16:36 being able to have the domain experts be
16:37 involved in that development.
16:39 Like, I do think that the Air Canada incident was
16:41 like completely avoidable.
16:42 Like, there's nothing fundamental
16:44 about the technology that meant that had to happen.
16:46 I just don't think that they had done enough
16:48 around testing of the system.
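One lightweight way to keep this kind of prompt debt in check is to version prompts and run every change through a regression test set before it ships. A generic sketch of that workflow follows; the prompts, test case, and stand-in model call are hypothetical, and this is not Humanloop's product API.

```python
# Generic sketch of prompt versioning plus regression testing before deploy.
# The prompt text, test case, and fake_llm below are hypothetical placeholders.

PROMPTS = {
    "refund_policy_v1": "Answer the customer using ONLY the policy text provided.",
    "refund_policy_v2": "Answer the customer using ONLY the policy text provided. "
                        "If the policy does not cover the question, say so and "
                        "offer to escalate to a human agent.",
}

REGRESSION_CASES = [
    {"question": "Can I get a bereavement fare refund after the flight?",
     "policy": "Bereavement fares must be requested before travel.",
     "must_contain": "before travel"},
]

def fake_llm(prompt: str, question: str, policy: str) -> str:
    """Placeholder for a real model call (swap in whatever client you already use)."""
    return f"{policy} Please contact support if you need more help."

def passes_regression(prompt_name: str) -> bool:
    prompt = PROMPTS[prompt_name]
    for case in REGRESSION_CASES:
        answer = fake_llm(prompt, case["question"], case["policy"])
        if case["must_contain"].lower() not in answer.lower():
            print(f"{prompt_name} FAILED on: {case['question']}")
            return False
    return True

# Only promote a prompt version that passes the whole regression suite.
for name in PROMPTS:
    print(name, "->", "deployable" if passes_regression(name) else "blocked")
```

Keeping the prompt versions and the test cases in source control is what gives you the history and rigorous testing described above, rather than a prompt that happened to work on a handful of examples.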
16:49 Yeah, I really disagree.
16:51 What we've seen in enterprises over the past several months
16:54 is that the really cool demos and the really cool prototypes
16:57 that the CIOs brought to their CEOs,
17:00 they just haven't been able to get
17:01 them reliable and robust enough to bring them into production,
17:04 at least--
17:05 I agree there's a perception of that.
17:06 But we've done it enough times now
17:08 that I think it's just a question of--
17:09 So how would--
17:09 I'm curious on the Air Canada--
17:11 what should they have done to solve that?
17:13 So I think if they'd had sufficient guardrails in place--
17:16 like, they gave the chatbot a much
17:20 wider range than what it should have been able to say.
17:22 So you can put these things on much narrower rails.
17:25 And we used to do this with chatbot systems,
17:26 where you do intent inference.
17:28 And then you kind of follow up with smaller pieces of usage.
17:31 And they more or less gave people almost raw access
17:34 to ChatGPT with a little bit of RAG attached.
17:36 And sure, that's a dangerous thing to do.
17:38 But most people shouldn't do that.
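The narrower-rails pattern described here, inferring the user's intent first and only then routing to a scoped handler rather than exposing a raw model, can be sketched roughly as follows. The intents, keywords, and canned handlers are illustrative assumptions, not ServiceNow's implementation.

```python
# Rough sketch of an intent-inference guardrail: classify the request first,
# then route to a narrow, scoped handler instead of a raw LLM with RAG.
# Intents, keywords, and handlers are illustrative, not ServiceNow's design.

INTENT_KEYWORDS = {
    "refund": ["refund", "money back", "bereavement fare"],
    "baggage": ["baggage", "luggage", "suitcase"],
}

def infer_intent(message: str) -> str:
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "unknown"

def handle_refund(message: str) -> str:
    # A scoped handler retrieves only vetted policy text and quotes it verbatim.
    return "Our refund policy: requests must be made before travel. (Quoted policy text.)"

def handle_baggage(message: str) -> str:
    return "Checked baggage allowance is listed on your ticket. (Quoted policy text.)"

HANDLERS = {"refund": handle_refund, "baggage": handle_baggage}

def respond(message: str) -> str:
    intent = infer_intent(message)
    if intent not in HANDLERS:
        # Out-of-scope requests go to a human instead of free-form generation.
        return "Let me connect you with a human agent."
    return HANDLERS[intent](message)

print(respond("Can I get a refund on a bereavement fare?"))
print(respond("Tell me a story about your CEO."))  # falls back to a human
```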
17:40 Right.
17:40 They should have used ServiceNow.
17:43 We've had hundreds of person years of work
17:45 to kind of make that work.
17:46 And a lot of the time, it's just a system an intern built,
17:50 and then it's like, we want to push it into production.
17:53 So it's about understanding what goes into that.
17:55 I could give a point of agreement, too.
17:56 So don't just give raw access to ChatGPT.
17:58 On that note, we're going to have
17:59 to end it because we're out of time.
18:00 But thank you very much to my panelists.
18:02 And thank you all for listening.
18:03 [APPLAUSE]