Jeremy Barnes, Vice President, Product, Platform AI, ServiceNow Rebecca Gorman, Co-founder and CEO, Aligned AI Raza Habib, Co-founder and CEO, Humanloop Moderator: Jeremy Kahn, FORTUNE
00:00 One of the biggest impediments businesses are encountering
00:02 in trying to implement generative AI
00:05 is the issue of trustworthiness.
00:07 These systems just do not seem reliable enough
00:09 to use in many business contexts.
00:13 And I'm curious-- businesses are very concerned
00:15 about reputational damage.
00:16 They're concerned about giving out inaccurate information.
00:19 But we're going to explore here maybe ways
00:21 we can try to solve some of those problems
00:23 and make AI more controllable.
00:24 And I think control is also a big issue,
00:26 particularly as we move from AI that can just generate content
00:31 to AI that's actually going to take actions, which
00:33 is where everyone says that this is probably going.
00:36 Jeremy, I'm going to start with you.
00:37 At ServiceNow, how have you tried
00:39 to address this problem of trustworthiness?
00:41 You're trying to implement systems
00:43 for a lot of customers at scale, one of the biggest
00:45 enterprise tech service companies out there.
00:49 How have you tried to deal with trustworthiness?
00:52 Yeah, and it's something that you
00:54 need to think about very carefully from the beginning.
00:56 It's not something that you can bolt on.
00:58 So I've been thinking about it for quite a number of years.
01:02 A big part of our approach there is
01:03 to make sure that we understand what we're
01:09 solving for, for the end user.
01:11 There's a little bit of a false idea
01:15 that if each of the components of the system
01:17 you build is trustworthy, then the entire system is.
01:21 That's not entirely true.
01:23 You have to work to analyze the end state of the whole system.
01:28 And that really comes into how we design,
01:31 test, and build our products.
01:34 So we have a fundamental research lab
01:36 that is working on issues of trustworthiness
01:39 and has published about 70 papers on the subject.
01:43 We have a set of what we call our human-centric AI guidelines.
01:47 So that's ensuring that there is transparency,
01:50 that it's human-centric, that it's accountable,
01:53 and that it's inclusive as well.
01:56 So each of our products goes through that kind of design
01:58 process to ensure that we have those built in.
02:02 And there's just a giant amount of testing,
02:04 testing of the end product.
02:07 We use it internally before we release it.
02:09 And we've trained hundreds of people to do this testing.
02:14 And it's important.
02:15 It's not just testing the model.
02:16 It's testing the system end to end
02:18 so that we can get some kind of guarantee that it's not
02:22 going to do certain things,
02:23 because we've tested very carefully around the edges
02:26 to make sure that it is--
02:27 I was going to ask about edge cases.
02:28 How do you identify those?
02:30 And how do you make sure that--
02:31 because you might have a system that seems like it's working.
02:33 And then you put it in deployment.
02:34 And suddenly, a user asks something
02:36 that you weren't expecting.
02:37 And you get a failure mode.
02:39 Right.
02:39 Yeah, so part of that is addressed
02:42 with our human-centric AI guidelines.
02:45 So we make sure that we have a human
02:46 in the loop for any consequential decision.
02:49 So there is the ability for the human
02:51 to understand what's going on and override it.
02:54 But a lot of it is the mentality of testing:
02:57 you need to figure out--
02:58 you can't just test that it works.
03:00 You have to know where it's going to break.
03:02 And then you can put guardrails in around that.
03:06 And so it's really how you equip that team
03:08 to be able to do that so you get the outcome you want.
03:12 Right.
03:12 Raza, I want to go to Humanloop.
03:14 You guys are creating systems for figuring out
03:17 reliable prompts for LLMs.
03:19 Can you talk about how you do that
03:20 and how you're able to implement that?
03:22 Because I think a lot of people struggle with that.
03:24 One particular subtle prompt works really well.
03:27 And then the next week, it doesn't work.
03:29 Or you change a few words in the prompt.
03:31 And suddenly, you get a completely different answer.
03:33 Yeah.
03:34 So I think Jeremy touched on some of this
03:36 already in terms of having good processes for testing
03:39 internally.
03:39 But maybe if we zoom out for a moment,
03:41 it's worth understanding what's changed
03:43 that means that this stuff is harder than before.
03:45 And I think fundamentally, if you
03:47 think about large language models or generative AI
03:49 compared to traditional software and machine learning,
03:51 we've gone from a paradigm with traditional software
03:53 where you run the software.
03:54 And every time, you get the same result.
03:56 So it's very easy to write a test for that.
03:58 And then with machine learning, OK, it was stochastic.
04:00 It was random.
04:01 But at least I could calculate a number.
04:02 I knew where the accuracy was.
04:03 I could calculate precision or recall.
04:05 We had ways of doing this.
04:06 We're now applying these technologies
04:08 to things that are much more subjective.
04:10 And so actually, even just saying what good looks like
04:12 can start to become difficult. And so the first thing we do
04:16 is give people tools to be able to measure performance
04:20 in a variety of different ways, so building
04:22 a suite of different evaluators that
04:24 are measuring different aspects of how well the system is
04:27 working, both during development and production,
04:29 and then iterating against those with some kind of test set.
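To make the evaluator idea concrete, here is a minimal sketch of what a suite of evaluators run over a test set can look like. The evaluator functions, checks, and toy examples are illustrative assumptions, not Humanloop's actual API or product.

```python
# Minimal sketch of a suite of evaluators run over a small test set.
# All names and checks here are hypothetical illustrations, not Humanloop's API.

from statistics import mean

def evaluator_length(output: str) -> float:
    """Penalize answers that are too short or far too long."""
    n = len(output.split())
    return 1.0 if 5 <= n <= 200 else 0.0

def evaluator_no_refusal(output: str) -> float:
    """Flag boilerplate refusals in contexts where an answer is expected."""
    refusals = ("i cannot", "i'm sorry", "as an ai")
    return 0.0 if any(r in output.lower() for r in refusals) else 1.0

def evaluator_has_citation(output: str) -> float:
    """Check that the answer cites a source, e.g. '[1]' or a URL."""
    return 1.0 if ("[" in output and "]" in output) or "http" in output else 0.0

EVALUATORS = {
    "length_ok": evaluator_length,
    "no_refusal": evaluator_no_refusal,
    "has_citation": evaluator_has_citation,
}

# A toy test set: in practice this would be real prompts plus logged model outputs.
test_set = [
    {"prompt": "Summarize our refund policy.",
     "output": "Refunds are issued within 30 days [1]."},
    {"prompt": "Draft a follow-up email.",
     "output": "I'm sorry, as an AI I cannot help with that."},
]

def run_suite(examples):
    """Score every example against every evaluator and aggregate per evaluator."""
    scores = {name: [] for name in EVALUATORS}
    for ex in examples:
        for name, fn in EVALUATORS.items():
            scores[name].append(fn(ex["output"]))
    return {name: mean(vals) for name, vals in scores.items()}

if __name__ == "__main__":
    print(run_suite(test_set))  # e.g. {'length_ok': 1.0, 'no_refusal': 0.5, 'has_citation': 0.5}
```

The same scores can be tracked in development and recomputed over production logs, which is what makes it possible to notice a regression rather than discover it from user complaints.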
04:33 And Jeremy asked Jeremy, how do you test for edge cases?
04:39 And I guess I'd say two things to that.
04:41 One, in some senses, because it's subjective,
04:44 the correct answer becomes what your users say
04:46 the correct answer is for some of these applications.
04:48 Not for factual ones, but if you're writing someone an email
04:51 or doing something that's creative in nature,
04:53 you're trying to optimize for end user experience,
04:56 not for some internal metric.
04:58 And you find the edge cases to a certain extent.
05:01 You can cover as much as you want pre-production.
05:03 But what you also need to be able to do
05:04 is if something goes wrong in production, know about it.
05:07 And if you don't have good evaluations in place,
05:09 you can't even catch those edge cases to fix them.
05:11 So I think what a lot of people worry about is
05:13 deploying systems that have edge cases
05:14 that they don't know about.
05:16 And we try to give people the tools to avoid that.
05:18 Interesting.
05:19 Rebecca, at Aligned AI, you're all
05:21 about trying to create AI that has
05:22 more conceptual understanding.
05:24 Can you talk about why that's important
05:26 and how you're trying to achieve that?
05:28 Yes.
05:29 So if you think about artificial intelligence,
05:32 it's made up of three pieces.
05:35 There's computer science.
05:37 There's statistical science.
05:40 And then there's data.
05:43 And sometimes an artificial intelligence
05:47 can go wrong because the computer science bit of it
05:53 isn't completely reliable, like in the Post Office scandal
05:59 we had here with Fujitsu in the UK.
06:03 And sometimes it can go wrong because
06:06 the statistical science
06:11 has inherent limitations.
06:13 So anyone here who works in artificial intelligence
06:16 knows about things like concept drift and data drift.
06:21 And all models degrade over time.
06:27 95% of models, AI models, have to be retrained within a year
06:31 because the world changes.
06:35 So you might not think this affects large language models,
06:38 but actually, GPT-2 came out before the pandemic,
06:43 and people were still using it during the pandemic.
06:46 And it made absolutely incorrect statements
06:50 about what was going on in 2020, which
06:52 were a bit odd when you were trying to use it in 2020.
06:57 So even language models need to adjust to the world
07:03 as it changes, as our language changes.
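One common way to catch the drift being described here is to compare the distribution a feature had at training time with what production traffic looks like now, for example with a two-sample Kolmogorov-Smirnov test. This is a generic sketch of that standard technique, not Aligned AI's method; the feature values and the threshold are made up.

```python
# Generic sketch of data-drift detection with a two-sample KS test.
# This is a standard technique, not Aligned AI's specific approach.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature values observed at training time vs. in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)  # the world has shifted

result = ks_2samp(train_feature, prod_feature)

# A small p-value means the production distribution no longer matches training,
# which is the usual trigger for investigation or retraining.
if result.pvalue < 0.01:
    print(f"Drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.2e}) - consider retraining")
else:
    print("No significant drift detected")
```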
07:05 And how are you trying to solve that?
07:08 Well, Jeremy, I was going to mention the third part, which
07:11 is data.
07:12 And data is also another wild card
07:17 when you come to artificial intelligence,
07:19 because all artificial intelligence is built on data.
07:23 So we make artificial intelligence that--
07:26 our goal is to make artificial intelligence that
07:31 can understand human concepts, even in situations
07:34 where the data environment has changed.
07:37 The data environment is different from what
07:40 it was yesterday.
07:41 It's different from what humans expected.
07:43 Human blind spots are--
07:46 there's always blind spots when humans look at data,
07:49 because we're the ones who collect it.
07:52 We're the ones who label it.
07:54 It looks right to us.
07:56 Obviously, we've created a data set that looks right to us.
07:59 So the blind spots that artificial intelligence
08:02 learns from this data are our own blind spots.
08:05 And so they're very, very hard for humans
08:07 to identify and recognize.
08:09 So that's why we need to create artificial intelligence
08:11 systems that can become aware of their own blind spots
08:14 and extend that awareness to new environments.
08:16 And you've had some initial success with this.
08:18 I mean, I don't know if you want to talk about what you've
08:19 done with content moderation and some of the results
08:21 you've had there.
08:22 But I think that might be interesting,
08:23 that this is not complete science fiction.
08:25 People-- it's possible to actually use
08:27 some of these techniques to achieve some success.
08:30 Yeah, so one of the early open source chatbots--
08:39 open source language models that you would know the name of--
08:44 they released a chatbot that they had to shut down.
08:47 And it was because they were using OpenAI's content filter,
08:52 and a lot of toxicity was getting through.
08:56 So they asked us if we could do better.
08:57 So we looked at OpenAI's content filter.
09:00 They have an open source--
09:04 they present open source of data sets that they use, et cetera.
09:08 And it turned out that their content filter
09:11 worked great, 97% or something, on the data sets
09:18 they were trained on.
09:18 But they didn't work well on different data
09:20 sets of toxic content that the model hadn't been trained on.
09:24 So AlignedAI created a content filter
09:28 that beat the OpenAI content filter on data
09:33 sets of toxic content that the artificial intelligence hadn't
09:37 previously seen.
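The gap described here is easy to surface if you keep an out-of-distribution test set alongside the in-distribution one. A toy sketch of that comparison follows; the classifier and data below are placeholders, not OpenAI's or Aligned AI's actual filter or data sets.

```python
# Generic sketch: compare a content filter's accuracy on in-distribution
# vs. out-of-distribution toxic content. The classifier and data are placeholders.

def toy_filter(text: str) -> bool:
    """Stand-in classifier: flags text containing words it was 'trained' on."""
    known_toxic_words = {"insult_a", "slur_b"}
    return any(w in text.lower() for w in known_toxic_words)

def accuracy(examples):
    """examples: list of (text, is_toxic) pairs."""
    correct = sum(toy_filter(text) == is_toxic for text, is_toxic in examples)
    return correct / len(examples)

# In-distribution: toxic phrasing similar to the training data.
in_dist = [("that is an insult_a", True), ("have a nice day", False)]

# Out-of-distribution: toxic content phrased in ways the filter never saw.
out_of_dist = [("novel_slur_c you fool", True), ("thanks for the help", False)]

print("in-distribution accuracy:    ", accuracy(in_dist))       # looks great
print("out-of-distribution accuracy:", accuracy(out_of_dist))   # reveals the gap
```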
09:39 I want to do a quick sort of rapid fire set of questions
09:41 here.
09:42 One of the big issues is around hallucination
09:44 that people are concerned about.
09:45 I want to ask you all whether you think this is solvable,
09:49 and if so, on what time scale?
09:51 Do you have a prediction for how many years out
09:53 we are from solving this problem, if it's solvable?
09:56 Or maybe you think it's not.
09:57 So I'm curious.
09:57 We'll just go down the line.
09:58 Raza, what about you?
09:59 Do you think hallucination is a solvable technical challenge
10:02 for these LLMs?
10:03 Yeah, so I think it is solvable technically.
10:06 But I also think that it doesn't need
10:07 to be solved for this to be used productively.
10:10 So there are ways--
10:11 we're used to designing user experiences that
10:13 are fault tolerant in some way.
10:14 You go to Google search, you get a ranked list of links,
10:16 you don't get an answer.
10:17 And people using Perplexity get citations back now.
10:20 So I don't think it has to be solved to make it useful.
10:23 But the reason I'm optimistic that we can solve it
10:26 is if you look at the--
10:27 when you train a large language model,
10:29 it kind of goes through three stages.
10:30 There's kind of pre-training, and then there's
10:32 sort of fine tuning, and then reinforcement learning
10:35 from human feedback as the last stage.
10:37 And if you look at the models before they are fine
10:39 tuned on human preferences, they're
10:41 surprisingly well calibrated.
10:43 So if you ask the model for its confidence in an answer,
10:46 that confidence correlates really well
10:48 with whether or not the model is telling the truth.
10:50 We then train them on human preferences and undo this.
10:53 And so the knowledge is kind of there already.
10:56 And the thing we need to figure out is how to preserve it
10:58 once we make the models more steerable.
11:01 But the fact that the models are learning calibration
11:03 in the process already makes me very optimistic
11:05 that it should be much easier to solve than most people--
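Calibration in this sense can be checked directly: collect the confidence the model assigns to each answer, record which answers were actually correct, and compute something like an expected calibration error. A minimal sketch with made-up numbers; these are not measurements of any particular model.

```python
# Minimal sketch of measuring calibration: does the model's stated confidence
# track whether its answers are actually correct? Numbers below are made up.

import numpy as np

# (confidence the model assigned to its answer, whether the answer was correct)
confidences = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30])
correct     = np.array([1,    1,    1,    1,    0,    1,    0,    0])

def expected_calibration_error(conf, corr, n_bins=4):
    """Average gap between confidence and accuracy, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            ece += mask.mean() * gap
    return ece

print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
# A well-calibrated model keeps this gap small; the observation in the discussion
# is that base models tend to be well calibrated and preference tuning widens the gap.
```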
11:08 And what time frame would you guess?
11:10 Oh, like within a year.
11:12 OK.
11:12 Rebecca, what do you think?
11:15 Yeah, I think--
11:16 Solve or not?
11:18 Yeah, obviously solvable.
11:20 Right, and what time scale do you think?
11:22 I'll agree with Raza there.
11:27 In a year.
11:28 Jeremy?
11:29 Well, from just a model perspective,
11:31 you do have to bias them in some way.
11:33 You don't want them to only produce things
11:34 that are in the training data.
11:36 So they kind of need to--
11:37 it's part of their capability.
11:40 So from a model perspective, it's an unsolved question.
11:44 It's not going to be solvable.
11:45 From a systems perspective and user application perspective,
11:48 I agree with Raza.
11:50 It's already solved in many cases.
11:53 And when you can build a system around it,
11:56 you have other tools there.
11:57 So a model which will hallucinate
12:00 can still be incredibly useful in producing value.
12:02 Can I add one thing to that?
12:03 Yeah, go ahead, Raza.
12:04 It's not always a bug.
12:05 Sometimes it's a feature.
12:06 So if we want to have models that will one day
12:08 be able to create new knowledge for us,
12:10 then we need them to be able to act as conjecture machines.
12:13 We want them to propose things that are weird and novel
12:15 and then be able to filter that in some way.
12:17 And so in some senses-- and also if you're doing creative tasks,
12:21 actually having the model be able to fabricate things that
12:24 go off the data domain is not necessarily a terrible
12:26 thing.
12:27 Interesting.
12:27 I'd like to add to my answer as well.
12:29 Yeah, so as someone said up here yesterday,
12:33 foundation models as they're created today
12:35 are hallucination machines.
12:36 It's just sometimes the case that their hallucinations
12:40 correspond with reality.
12:43 So while I think it's solvable in a year,
12:46 I also think that, well, obviously,
12:49 making the models bigger is not going to solve that problem.
12:53 Interesting.
12:53 Would you agree that humans are hallucination machines?
12:55 That's a philosophical question.
12:59 I would say yes, that we are, but in a different way.
13:02 Interesting.
13:03 I want to go to the audience for questions.
13:05 Please raise your hand if you have a question,
13:07 and we'll get a mic to you.
13:08 And anybody have questions?
13:10 If not, I'll keep going.
13:13 There's a question there.
13:15 And if you could please identify yourself when you stand up,
13:17 that would be great.
13:19 Sim Simeonov, CTO of Real Chemistry.
13:21 So when new technology comes, there's adaptation.
13:24 We adapt technology to humans.
13:25 We also adapt humans to technology.
13:27 And we've seen that happening a lot with technology
13:29 over the years.
13:29 So I'm curious about your take on what's
13:31 the biggest adaptation of humans to generative AI
13:34 that you expect in the next few years?
13:36 Biggest adaptation of humans to generative AI.
13:37 Jeremy, you take that one first.
13:39 Well, we saw it happen when Google first
13:41 came onto the scene.
13:42 People just changed the way that they
13:44 searched in order to type in queries that
13:46 the system could actually handle.
13:48 So I think some of the challenges with those systems
13:52 will gradually improve
13:54 as people learn how to steer them.
14:00 On the flip side, we probably don't
14:03 want to expect people to adapt.
14:04 I think one of the great things about the promise
14:08 of all this new generative AI technology
14:10 is that it can meet humans where they are instead of humans
14:12 having to come to the technology.
14:14 So if you've used a chatbot from pre-gen AI,
14:17 you could get it to work, but you
14:19 had to really know exactly how to phrase things.
14:22 And so yes, I think it will happen.
14:24 But I hope that we don't count on it too much.
14:28 Interesting.
14:29 Other questions from the audience?
14:31 Please raise your hand if you have questions.
14:34 I wanted to ask about the Air Canada chatbot incident.
14:39 Because that's a case where it was
14:41 using some sort of rag mechanism.
14:42 It retrieves a document.
14:43 It thinks says something, but it has not
14:46 summarized the document correctly.
14:47 And then you have a user, a customer in that case,
14:49 getting incorrect information and ultimately suing
14:52 successfully.
14:53 I think that frightens a lot of companies
14:55 out there, that sort of incident.
14:56 I'm curious how you deal with that situation.
14:59 Since you're Canadian as well, Jeremy, I'll go to you first.
15:02 But what's your reaction to that incident?
15:04 Yeah, I'm on an Air Canada flight tomorrow.
15:08 It's close to my heart.
15:09 Listen, that's our core business.
15:11 And so it's a question of how you design these systems.
15:18 If you're designing it just
15:20 with a positive metric in mind of,
15:22 is it able to get to the end of the conversation,
15:25 not is it able to be accurate, and you don't
15:29 design the system right, that's kind of what you get.
15:33 With a lot of these systems, we talk about prompt engineering.
15:36 And I'd say the "engineering" is very much
15:38 in inverted commas here. We see that already there's
15:44 this kind of prompt debt, where there
15:46 are these prompts which don't actually represent what we
15:49 wanted to do, but have only been tested
15:51 on a small set of examples.
15:54 And so again, it's how you take something which seems
15:56 to work in a proof of concept.
15:57 You probably don't just want to put it straight
15:59 into production with real customers who
16:03 have expectations and terms and conditions and things
16:06 like that.
16:06 And so it's how you bring it into the enterprise context
16:09 so that you're able to knock off some of those rough edges.
16:13 That's where the tough work is.
16:15 And yeah, that needs to be done.
16:17 Prompt debt's an interesting concept.
16:18 I mean, is that something you've been considering or--
16:21 Yeah, absolutely.
16:22 I mean, in some senses, my answer--
16:24 this feels like a softball question for me
16:25 because how could I fix this?
16:27 Well, they should have used Humanloop.
16:30 But yeah, the management of prompts,
16:33 like versioning them, having history of them,
16:34 being able to test them rigorously,
16:36 being able to have the domain experts be
16:37 involved in that development.
16:39 Like, I do think that the Air Canada incident was
16:41 like completely avoidable.
16:42 Like, there's nothing fundamental
16:44 about the technology that meant that had to happen.
16:46 I just don't think that they had done enough
16:48 around testing of the system.
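One lightweight way to keep this kind of prompt debt in check is to version prompts and run every change through a regression test set before it ships. A generic sketch of that workflow follows; the prompts, test case, and stand-in model call are hypothetical, and this is not Humanloop's product API.

```python
# Generic sketch of prompt versioning plus regression testing before deploy.
# The prompt text, test case, and fake_llm below are hypothetical placeholders.

PROMPTS = {
    "refund_policy_v1": "Answer the customer using ONLY the policy text provided.",
    "refund_policy_v2": "Answer the customer using ONLY the policy text provided. "
                        "If the policy does not cover the question, say so and "
                        "offer to escalate to a human agent.",
}

REGRESSION_CASES = [
    {"question": "Can I get a bereavement fare refund after the flight?",
     "policy": "Bereavement fares must be requested before travel.",
     "must_contain": "before travel"},
]

def fake_llm(prompt: str, question: str, policy: str) -> str:
    """Placeholder for a real model call (swap in whatever client you already use)."""
    return f"{policy} Please contact support if you need more help."

def passes_regression(prompt_name: str) -> bool:
    prompt = PROMPTS[prompt_name]
    for case in REGRESSION_CASES:
        answer = fake_llm(prompt, case["question"], case["policy"])
        if case["must_contain"].lower() not in answer.lower():
            print(f"{prompt_name} FAILED on: {case['question']}")
            return False
    return True

# Only promote a prompt version that passes the whole regression suite.
for name in PROMPTS:
    print(name, "->", "deployable" if passes_regression(name) else "blocked")
```

Keeping the prompt versions and the test cases in source control is what gives you the history and rigorous testing described above, rather than a prompt that happened to work on a handful of examples.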
16:49 Yeah, I really disagree.
16:51 What we've seen in enterprises over the past several months
16:54 is that the really cool demos and the really cool prototypes
16:57 that the CIOs brought to their CEOs,
17:00 they just haven't been able to get
17:01 them reliable and robust enough to bring them into production,
17:04 at least--
17:05 I agree there's a perception of that.
17:06 But we've done it enough times now
17:08 that I think it's just a question of--
17:09 So how would--
17:09 I'm curious on the Air Canada--
17:11 what should they have done to solve that?
17:13 So I think if they'd had sufficient guardrails in place--
17:16 like, they gave the chatbot a much
17:20 wider range than what it should have been able to say.
17:22 So you can put these things on much narrower rails.
17:25 And we used to do this with chatbot systems,
17:26 where you do intent inference.
17:28 And then you kind of follow up with smaller pieces of usage.
17:31 And they more or less gave people almost raw access
17:34 to ChatGPT with a little bit of RAG attached.
17:36 And sure, that's a dangerous thing to do.
17:38 But most people shouldn't do that.
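The narrower-rails pattern described here, inferring the user's intent first and only then routing to a scoped handler rather than exposing a raw model, can be sketched roughly as follows. The intents, keywords, and canned handlers are illustrative assumptions, not ServiceNow's implementation.

```python
# Rough sketch of an intent-inference guardrail: classify the request first,
# then route to a narrow, scoped handler instead of a raw LLM with RAG.
# Intents, keywords, and handlers are illustrative, not ServiceNow's design.

INTENT_KEYWORDS = {
    "refund": ["refund", "money back", "bereavement fare"],
    "baggage": ["baggage", "luggage", "suitcase"],
}

def infer_intent(message: str) -> str:
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in text for k in keywords):
            return intent
    return "unknown"

def handle_refund(message: str) -> str:
    # A scoped handler retrieves only vetted policy text and quotes it verbatim.
    return "Our refund policy: requests must be made before travel. (Quoted policy text.)"

def handle_baggage(message: str) -> str:
    return "Checked baggage allowance is listed on your ticket. (Quoted policy text.)"

HANDLERS = {"refund": handle_refund, "baggage": handle_baggage}

def respond(message: str) -> str:
    intent = infer_intent(message)
    if intent not in HANDLERS:
        # Out-of-scope requests go to a human instead of free-form generation.
        return "Let me connect you with a human agent."
    return HANDLERS[intent](message)

print(respond("Can I get a refund on a bereavement fare?"))
print(respond("Tell me a story about your CEO."))  # falls back to a human
```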
17:40 Right.
17:40 They should have used ServiceNow.
17:43 We've had hundreds of person years of work
17:45 to kind of make that work.
17:46 And a lot of the time, it's just a system an intern built,
17:50 and then it's like, we want to push it into production.
17:53 So it's about understanding what goes into that.
17:55 I could give a point of agreement, too.
17:56 So don't just give raw access to ChatGPT.
17:58 On that note, we're going to have
17:59 to end it because we're out of time.
18:00 But thank you very much to my panelists.
18:02 And thank you all for listening.
18:03 [APPLAUSE]