#13 "The New Frontier for Audio and Video Search"

April 03, 2013 00:29:42
#13 "The New Frontier for Audio and Video Search"
The Workflow Show
#13 "The New Frontier for Audio and Video Search"

Apr 03 2013 | 00:29:42

/

Show Notes

In this episode, Nick and Merrel discuss the jaw-dropping audio and video search capabilities offered by two companies, Nexidia and Nervve Technologies. Keep in mind that we are not talking about metadata search, but search through the actual audio and video content of a digital library of footage. We first learned about Nexidia Dialogue Search at last year's NAB, and we are now a reseller of the application. We just recently learned of Nervve, saw a demo, and once again were mightily impressed. In the podcast, the guys discuss various ways these applications could be utilized in digital video workflows. You will hear them ask for your ideas as well; just email us. By the way, Nexidia will be at NAB. If you would like to set up an appointment for a demo, email us as well. Show length: 29:42. Remember, you can listen and subscribe to The Workflow Show series in iTunes.

Show Notes: Avid PhraseFind and ScriptSync. Keep in mind that Nexidia Dialogue Search is not a speech-to-text application; its purpose is to search for phonetic matches of the query. While Nervve can search a library for a face, it should not be confused with facial recognition applications. We welcome your comments below, or please feel free to email us. See what we'll be doing at NAB. We'll also be exhibiting at DAM NY, May 2 & 3.

Episode Transcript

Speaker 0 00:00 Welcome to The Workflow Show, episode 13. I'm Merrel Davis, along with my cohost, Nick Gold, and today we're talking about search beyond search, search 2.0 or 3.0 or 8.0, there's a lot of point-oh. But okay, so search. What are we talking about? We have spent time on this show, and we spend a lot of time talking with our clientele, about the ability to power-search through metadata, right? Metadata being information about your information. Whether this is technical metadata, like the EXIF data that gets buried in your digital photographs about the camera settings you were using and the f-stop, or, you know, for video, the codec. The timecode is a type of metadata. The size of the frame or the frames per second are other types of technical metadata. Or qualitative metadata, metadata that a human had to kind of associate with a clip. Like, the sky is blue.
Speaker 0 01:06 The kid with the rail, the wide shot, it's a female versus a male, it's B-roll, it was used in this production. These are things that a human has to kind of identify. Well, today we're talking about two technology companies that we're involved with that have search technologies that actually search within the real data, both audio and visual data itself, not metadata. We're talking about being able to search within audio and within video for snippets of audio and video. That's pretty exciting, and that's like WarGames stuff right there. Right. And I will say, and we'll leave it at this, these companies did kind of come out of the space of developing these technologies for the government. But the fun thing about living in the year 2013 is that some of this stuff is trickling into the private sector now.
Speaker 0 01:58 These companies have stuff that they feel might be applicable to media and broadcast people with media archives, all of these types of things, and they are starting to market it, through partners like ourselves and themselves directly, to the more mainstream media applications and customers that are out there. So we're helping them think of interesting ways that these technologies can apply to folks inside of our customer base, because it's not necessarily the area these companies have based their existing operations around. I will say this: as we describe these technologies, we have some ideas for them, and we'd love to get feedback on how our listeners can imagine using them, either in their existing application environments, or if you just have some interesting ideas that you can imagine might be useful to other people in the media industry. So keep that in mind. Our email address is [email protected]; shoot us those ideas as we describe these. So Merrel, let's talk about the "A" in A/V first, audio.
Speaker 1 03:11 Sure. I think it's always been something that certainly reality and documentary producers have wanted for some time, a little bit of flexibility over the spoken word. Well, let's talk about what the process is without any automation, right? Logging, a lot of logging. You go, you shoot, then you have your timecode and you have somebody sit there with a pair of headphones, and they transcribe all the spoken audio that took place on all the clips and the associated timecode. And then based on that, you look at it and put it together as a script.
And from there, you have to say, okay, I want to pull such and such a soundbite between timecode here and there, but the soundbite that's been transcribed is a 30-second soundbite, and what you want is like 15 seconds in the middle of the clip. It's a manual, manual, manual process, to say the least. Manual on the front end, manual when you do the script construction, and manual when it comes time to pull the actual clip in the edit bay, because you have to key it in, you've got to search for it, then you've got to scrub to it, then you've got to paste it down.
Speaker 0 04:21 So it was actually just about one year ago exactly, because I met them and heard about them at NAB of last year, NAB 2012, a company called Nexidia. They're spelled N-E-X-I-D-I-A, and their website is nexidia.com. At NAB last year, through one of our distributor partners, I met a fellow associated with them named Drew Lanham. Drew was a really cool guy, and he had a technology that made my eyes open wide, or I should say my ears open wide, except my left ear doesn't really work very well anymore, I think from too many years of techno parties. But anyway, I was exposed to this, and here's what they do. They've been building this intellectual property and using it in a variety of applications for many years, but they now have a product called Dialogue Search. And what it is is an audio and phonetic dialogue searching technology. Now, you may be vaguely aware of something like this, because for a while in Avid there's been a feature called PhraseFind.
Speaker 1 05:29 And PhraseFind and ScriptSync there, you know, that's licensed technology from Nexidia, correct?
Speaker 0 05:35 PhraseFind and ScriptSync were licensed from Nexidia. But what Nexidia is now doing is they've built more of an enterprise slash workgroup version of this. So instead of it just being tied into your nonlinear editor, they now have a server that you can run, and you have a little web front end that you use to do your search queries. And here's how it works. You basically get a Google-like search bar, and this assumes that you've already indexed all of your media. If it's an audio file, it's just analyzing the audio. If it's a video file, it's extracting the audio and analyzing that, because again, this is an audio search technology. And what it does is it builds a database of all of the little phonetic sound segments of every portion of all of the audio in all of the media files that you've scanned, when it does the scanning process.
Speaker 0 06:28 Initially it does that on a pretty average single server at something like several dozen times as fast as real time; if memory serves, it might even be like 60 times as fast on a high-end eight-core machine or something. So the scanning process itself is very quick, and now it's got this database of sound fragments. So you go into your web browser and you type a query. It could be a phrase, a whole section of spoken word, or even just a single word. And what the Nexidia Dialogue Search product does is a phonetic interpretation of what that textual query sounds like. Then it does a hyper-fast search, as in, for thousands of hours of audio slash video, the searches take place in about a second, typically one to two seconds tops.
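To make that concrete, here is a minimal sketch of the general shape of a phonetic search like the one just described: index phoneme sequences once, then score a query's phonemes against them and keep hits above a sensitivity threshold (the "slider bar" that comes up a little later). This is only an illustration; Nexidia's actual engine, index format, and API are proprietary, and the toy phoneme table, clip names, and scoring below are stand-ins.

```python
# Minimal sketch of phonetic (not speech-to-text) search, for illustration only.
# Nexidia's real engine, index format, and API are proprietary; the toy phoneme
# table, the index layout, and the scoring here are all stand-ins.
from difflib import SequenceMatcher

# Pretend grapheme-to-phoneme step: a real system would use a proper G2P model.
TOY_G2P = {"my": "M AY", "dog": "D AO G", "jimmy": "JH IH M IY"}

def to_phonemes(text):
    """Turn a text query into a flat phoneme sequence (toy version)."""
    return " ".join(TOY_G2P.get(w, w) for w in text.lower().split()).split()

# Pretend index: a phoneme stream per clip, each phoneme tagged with a timecode.
# The real index is built once, ahead of time, from the audio itself.
INDEX = {
    "interview_003.mov": [
        ("M", "00:04:12"), ("AY", "00:04:12"), ("D", "00:04:13"), ("AO", "00:04:13"),
        ("G", "00:04:13"), ("JH", "00:04:14"), ("IH", "00:04:14"), ("M", "00:04:14"),
        ("IY", "00:04:15"),
    ],
}

def search(query, sensitivity=0.7):
    """Slide the query's phonemes over each clip and keep statistically likely hits."""
    q = to_phonemes(query)
    hits = []
    for clip, stream in INDEX.items():
        phones = [p for p, _ in stream]
        for start in range(len(phones) - len(q) + 1):
            score = SequenceMatcher(None, q, phones[start:start + len(q)]).ratio()
            if score >= sensitivity:  # the sensitivity "slider bar"
                hits.append((clip, stream[start][1], round(score, 2)))
    return sorted(hits, key=lambda h: -h[2])

print(search("my dog jimmy"))  # -> [('interview_003.mov', '00:04:12', 1.0)]
```

The point the hosts keep returning to is that the match is sound-to-sound: the query never has to be a correct transcript of the audio, just something that sounds like it.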
Speaker 0 07:26 And it basically compares what that textual query's resulting sound is. It gives you a statistical analysis of, out of all the stuff that's already sitting in its database of these sound fragments, which sections are statistically more likely to be what you just searched for. And it listens for the words? Yeah, that's right, words and phrases of words. And the neat thing is, it's actually pretty good with accents. If you want it to interpret something slightly differently, you can tweak the way you spell it when you do your textual query. Is this just English, or do they have modules for other languages? They do have different modules, but it's pretty good. And you may be wondering, well, what if it's really noisy, or what about all sorts of other aspects of the audio? It does a really good job.
Speaker 0 08:20 If you, as a human listener, could listen to the segment and hear that phrase in it, it's about that good. So if it's really noisy audio and you yourself would have a hard time finding that phrase in it, listening to it and cognitively being aware of it, it's probably not going to do a great job. And you get a slider bar, right? The slider bar adjusts the sensitivity. So it's going to show you more or fewer results based on how statistically likely it thinks it is that the thing it's popping up as a search result is what your original query was, that they've got a legitimate match.
Speaker 1 08:57 So potentially, and let me just ask you a couple of questions here, because it excites me to think about this idea: does something still have to be transcribed, or is there a way now for me to generate a transcription list?
Speaker 0 09:10 No. So here's the answer, right? This is not a speech-to-text technology in the form that a lot of others are. And what Nexidia will often say is, listen, it's a very different beast than speech to text. First of all, even the best speech-to-text software still flails about a third of the time; it's just not a terribly accurate science yet. This isn't for converting massive swaths of audio into a text document. It's not auto-transcription software. Those don't really exist, right? Well, no, there is plenty of speech-to-text software out there; it just doesn't work terribly well. That's what I mean, it's not the perfect robot that just knows how to turn spoken words into text, the way a hundred-dollar intern transcribing would. But this is all about being able to very quickly find fragments or phrases of spoken audio without having to have done a transcription in the first place, because it is comparing sound to sound. It's not even taking text and comparing it to a textual version of all the sounds. It's literally taking sound fragments and comparing them to sound fragments.
Speaker 1 10:26 So I'm going to suppose a scenario where this would probably be the best thing to use, right? I don't have time to pay even the best hundred-dollar intern to do my transcription. I'm out on a shoot. I know the questions I'm asking my talking head. I ask those questions, and while they're giving me the answers, I'm making key notes as to the phrases they're using. So, you know, "my dog Jimmy" is a phrase that he used when I asked him about whatever, right?
So I know, after everything gets ingested into the server, that I can search for the phrase "my dog Jimmy." I know there's three takes of that, and I find three "my dog Jimmy"s. Is that the value we're seeing here, effectively searching the assets without having to do that transcription?
Speaker 0 11:05 That is a good example, but I see other areas as well. One area I think this is interesting for is reality television, right? The reason I think reality TV is a neat non-DOD area where you could use this type of technology is that, as we know, the shooting ratios and ingest ratios, how much material is both shot and ingested onto storage per 20 to 24 minutes that ends up on television, are huge. It could be hundreds and hundreds to one; I've even heard a thousand to one in certain circumstances. Is all of that stuff transcribed? Oh God, no, there might be no searchable audio in any of that. Well, that's the thing. You have a lot of personalities in these shows. There are certainly some key phrases that an individual may use, certainly circumstances where you know they were talking about some aspect of something, and you could find that phrase, but you haven't necessarily logged everything yet. This is a way you could give producers, maybe story producers on a reality TV show, the ability to quickly start entering some interesting queries into their media, with very quick turnaround from the time it was ingested, just to start seeing what kind of material they have. I mean, if you've got like Jimmy on some weird, you know, hick fishing show...
Speaker 1 12:22 "Hee haw," or whatever, "them's my overalls," you know, when he wipes his slimy fish on them. I mean, you could search for that and you could find anybody who fishes or wears overalls named Jimmy. I'm sorry, yes, that's a fake Southern accent, I don't know where that came from. But the truth is, if you were a story producer on reality television and you walk into a room and they're like, you know what, we just need a montage of him saying "those are my overalls" eight times in a row, can you just bring it up? You could instantly find every circumstance. You know, two seasons of production, we've been in 14 different locations, and you want me to find every instance of "them are my overalls"? Are you serious? I think it's "dems muh overalls."
Speaker 0 13:08 What you would type in is D-E-M-S, M-U-H, overalls, but you probably wouldn't even need to tweak it that much, because it is pretty good at comparing properly written English and...
Speaker 1 13:21 Loosely interpreted. Let me ask you this question. You have accents and stuff, so this is all very good, and I could see myself wanting to utilize this because it's very flexible. But then the question becomes, how do I get direct access to that asset? So I do a search and I find the phrase that I want, yes. Now, if it's eight instances of "dems overalls," how am I taking each of those and bringing them into my timeline?
Speaker 0 13:46 So you use their web browser user interface, and they do make a version of the technology available that you can hit using its API, or application programming interface; they have a software development kit. So let's say you had a web tool that you already used for research purposes, or it's part of your MAM, your media asset management database system.
You could actually do direct software-level integration with their fundamental technology and not have to use their, quote unquote, off-the-shelf product. However, using the Dialogue Search product, you've got this search bar, and you will have had to generate proxy versions. You may already have proxies of your media, because using their search tool, when you are viewing these video clips in your web browser, you don't want it to be the edit-quality footage. So you will have had to have a proxy process.
Speaker 0 14:36 And those proxies go along with what the Nexidia server has done in its audio searching and indexing, that indexing phase that comes before you do your search. So you search for things, and it gives you this very YouTube-like result list, just a list of video files, which are the ones it has flagged as being likely to have that audio, and you click on them. It brings up this nice little player window with a little playback bar showing you the length of the file, and it flags the individual points in timecode where it sees those occurrences, where it thinks these correspond to what your phrase was. Now, the neat thing is you can play them back. You can click around on them and just listen through and make sure that they're accurate.
Speaker 0 15:21 And then they have an export function where you can basically export the metadata. It's an XML export; they do XML. They've already done some direct hooks for some media asset management systems like CatDV, and they're working on some others. I think they already have it, or are working on it, where you can export basically markers or subclips straight for your NLE itself. But it basically points back to the original media file and the timecode instances at which it sees those phrases. And then you could import those into your NLE or your media asset management system as true text metadata. It's no longer in the realm of Nexidia Dialogue Search, it's not phonetic anymore. You're exporting back out that phrase as metadata that was based on the textual query you made in the first place. Now you can bring that into your media asset management system as text metadata, and then generate new searches using Nexidia for new segments of actual phonetic audio.
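As a rough illustration of that round trip, here is a sketch of how an exported hit list could be folded back into a MAM or NLE as plain text metadata. The XML layout, field names, and file paths below are invented for the example; Nexidia's real export schema and the CatDV hooks are their own. The principle is just that each hit carries a media reference, a timecode, and the query that produced it.

```python
# Hypothetical shape of a Dialogue Search hit export -- NOT Nexidia's real XML
# schema. The idea: each hit points back at the original media file and a
# timecode, plus the query text, so a MAM or NLE can keep it as plain metadata.
import xml.etree.ElementTree as ET

EXPORT = """
<dialogue_search_results query="them's my overalls">
  <hit media="/vol/footage/S02E07_cam_a.mov" timecode="01:14:22:10" score="0.91"/>
  <hit media="/vol/footage/S02E07_cam_a.mov" timecode="01:31:05:02" score="0.84"/>
</dialogue_search_results>
"""

def hits_as_markers(xml_text):
    """Convert exported hits into generic marker records a MAM could ingest."""
    root = ET.fromstring(xml_text)
    query = root.get("query")
    return [
        {
            "media": hit.get("media"),
            "timecode": hit.get("timecode"),
            "comment": f'phonetic hit: "{query}" (score {hit.get("score")})',
        }
        for hit in root.findall("hit")
    ]

for marker in hits_as_markers(EXPORT):
    print(marker)
```

Each resulting marker is ordinary text metadata, so it can be searched, filtered, or attached to subclips in whatever system you already use.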
Speaker 1 16:25 It's pretty wicked stuff. I'm excited about it because it has application, I think, outside the obvious. When most people talk about a technology like this, they're thinking, okay, anything that's transcription-heavy, certainly when it comes to documentary production, reality production. But it also can really be an effective tool if you're shooting a feature film, where you know that you have a script, that script is your baseline document, and then when you're in the edit bay you're trying to cut to script in a whole different manner. I think that's very useful. So again, there are so many different ways that you can use this technology; those are just three examples, and we're curious to see what else...
Speaker 0 17:07 If you guys have ideas, please, we'd love to hear about them. I'm sure there's a lot of stuff we're not thinking of, because it's a pretty new capability for searching for stuff. And I think you're right, you can't directly compare it to the transcription workflow, because it's almost like turning that whole equation on its head and saying, what if you didn't really need to do a transcription, per se, in the first place, especially if it's already scripted material? If you know what the spoken words are, you now have a very automated way of finding in the timecode when those phrases occurred. So please, [email protected], we'd love to hear ideas. Again, the company is called Nexidia, nexidia.com. They're good friends of ours; we know them well.
Speaker 1 17:55 Oh, I've got another idea, a real quick one, right? Say you're a YouTube remix artist and you want to Auto-Tune a bunch of people saying a whole bunch of things. Are you going to sit there and scrub every single thing that was said on the news, or whatever clips you're trying to find, to put that together? For mashups and stuff like that, I think there's all sorts of funky artistic appetite. You could make anybody say anything that you ever wanted them to say. Oh yeah, that's a good idea, like those things where they cut it up and make somebody say whatever, right? That's a really good idea, man. Daily Show, if you're listening, keep that in mind, keep that in mind, after you purchase.
Speaker 0 18:41 So let's talk about the "V" in A/V, video. We talked about phonetic and auditory search for dialogue. We came across a company recently called Nervve, and they're spelled a little weird, it's N-E-R-V-V-E, Nervve with two Vs. Kind of think: what Nexidia does with audio, these guys do with imagery in video. So again, you can probably imagine that some of these guys' original customers were more Uncle Sam-oriented, but they too are looking for applications in the private sector, in the media and broadcast industries. So here's how their software works. It scans all of your content and creates this database of kind of what things look like. We're talking color, well, it's a lot of characteristics: size, shape, color, all the characteristics that make something visually stand out. Now, you can create what they call an object model.
Speaker 0 19:46 So let's say you've got an hour of video, right? Your hour of video is, you know, a landscape, and you know that there's a chance that a particular car, a green minivan, might drive by at some point in this hour of video. And you want to quickly identify when that is without having to watch the whole hour of video, even in fast-forward mode or whatever. You just want to, boom, have it. What's that two seconds when the green minivan drove by, out of all these other cars that may be going by? You create an object model. You find or construct several images that it kind of averages into one, or you can use a single image to say, this is what the green minivan should look like. You could find a generic shot of one online or something like that.
Speaker 0 20:35 And you say, use this as your object model, this one or these several shots, and now, boom, perform this kind of statistical visual analysis and comparison of it against every frame in this range of video. And again, show me the areas, and you also have this little sensitivity slider bar, show me the areas that you think are statistically areas of high match likelihood for that object model in those frames of video.
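To give a very rough sense of what "statistical visual match with a sensitivity slider" can mean, here is a toy sketch: average one or more reference images into an appearance model, reduce each frame to a crude color signature, and keep frames whose similarity clears a threshold. Nervve's actual features and matching are far more sophisticated and proprietary; the green-minivan example and every function below are illustrative assumptions only.

```python
# Toy sketch of "object model" matching: an averaged reference appearance
# compared against frames, with a sensitivity threshold. This is NOT Nervve's
# algorithm; their features and engine are their own. Illustration only.
import numpy as np

def build_object_model(reference_images):
    """Average one or more reference images into a single appearance model."""
    stack = np.stack([img.astype(np.float32) for img in reference_images])
    return stack.mean(axis=0)

def color_signature(img, bins=8):
    """Very crude appearance feature: a normalized per-channel color histogram."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    sig = np.concatenate(hists).astype(np.float32)
    return sig / sig.sum()

def find_matches(model, frames, sensitivity=0.9):
    """Score every frame against the model; keep frames above the slider value."""
    model_sig = color_signature(model)
    hits = []
    for idx, frame in enumerate(frames):
        # Histogram intersection: 1.0 means an identical color distribution.
        score = float(np.minimum(model_sig, color_signature(frame)).sum())
        if score >= sensitivity:
            hits.append((idx, round(score, 3)))
    return hits

# Fake data: a flat "green minivan" reference and a couple of noise frames.
rng = np.random.default_rng(0)
green = np.zeros((32, 32, 3), dtype=np.uint8)
green[..., 1] = 200
noise = rng.integers(0, 255, size=(32, 32, 3), dtype=np.uint8)
model = build_object_model([green])
print(find_matches(model, [noise, green.copy(), noise]))  # -> [(1, 1.0)]
```

In practice you would pull frames from real video (for example with OpenCV) and use much richer features than a color histogram, but the threshold-and-rank shape of the result list is the same idea the hosts describe.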
And I saw the demo from a fellow named Robert Robi, who's one of their reps; he's actually the senior vice president of sales there. And he, too, is very interested to hear what folks in the media and broadcast industries have to say about how they can imagine using this stuff in their contexts. Because again, yes, it is a little more surveillance-oriented, that kind of thing, but we think there's a lot you could do with this. And boom, it shows you results very quickly. It's funny, because it does its initial scan about as fast as Nexidia does on the audio, you know, several dozen times real time on an average server.
Speaker 0 21:43 And then, I forget the exact metrics, and it does depend on how fast your systems are, but it's many, many, many times real time, as fast as it does this search, the "show us the hits that you think are this thing." And it's really good. He gave me an example where a guy had a logo on his hat, and it found frames even though the guy's head was kind of at an angle and looking up; it could tell, based on the color and the general shape, that this thing at least statistically had a percentage likelihood of being that thing. And so, yeah, again, like all of these types of technologies, Nexidia and Nervve, they're not perfect. They're giving you guesses. It's not, you know, a human mind that can clearly define these things, but it's a great start.
Speaker 0 22:33 It is. It's a time saver, right? That's what it comes down to: saving time, speeding up processes, being more efficient. Now, certain things, like if you're searching for every frame of video that has your bug in the corner of the screen, or if you're searching for frames of video that have, like, color bars, those things are going to stick out like a sore thumb, and it would very easily identify them. But I'm really curious to hear what you guys, as our listeners, have for ideas, within your existing workflows or workflows you can imagine. What would you do with the ability to find...? You know, again, it can work pretty well with people's faces. Now, again, it's not doing facial recognition like a lot of facial recognition systems do, where they're performing these geometric analyses of distance between eyes and nose and other facial features.
Speaker 0 23:22 It is really breaking an image down to this kind of average sense of what the thing looks like, but it can identify faces. It can certainly identify vehicles, landmarks, you know, the Capitol building in Washington, DC. And if you've got, you know, 50,000 hours of video, you've probably got some shots of that over the years. You need to reuse it, some B-roll on something; you don't want to have to go send another crew out to DC and get the permissions to film in DC again. Boom, show it this shot, or a few shots, of the Capitol building. Boom, it'll find things like that. So I don't know, there's a lot of stuff I can imagine people doing with it.
Speaker 1 24:04 Well, I have a good one, just sort of a first thought on this, right? I like the idea of the logos.
So obviously there are instances, certainly on a lot of reality television, where they've got to fuzz it out, blur it out or whatever, and that's a time-consuming task in and of itself, but so is certainly identifying every single instance. I've been on set where you grab the gaffer's tape and you tape over as many logos as you can find, because it's just more time, you know, on the back end. But another thing that comes to mind: say you're doing multiple days of production and, for continuity's sake, there are things that you're shooting on both days that, as far as the movie or the segment is concerned, happened at the same time. So there's two days, two sets of costume changes, but you only want to look at when Jimmy and Andy are in their red and blue shirts, not when they're in their evening formal. So I could see an instance where I want to at least set up how I'm going to cut this thing, continuity-wise, by separating the video by their costume changes.
Speaker 0 25:06 Yeah, that's an idea. Again, I don't know if that would work. I'm more newly getting familiar with Nervve's technologies, and we ought to have a demo server set up here pretty soon, probably in the post-NAB timeframe, so we can start playing with these things. Again, if you guys have interesting ideas, we'll be able to throw some ideas at it, do some demos, and see just how well it does. It's pretty bleeding-edge stuff. You know, it was interesting talking to them. Maybe from years of watching Alias and being kind of a conspiracy freak, I kind of assumed that this stuff was just really prevalent out there. Apparently, among folks who do this well, and Nervve really seemed to be in that list of candidate companies who do this type of stuff, it's actually not nearly as advanced as I might have thought it was. Some of these things are really just starting to get to the point of sophistication, and also the speed of today's processors is fast enough that it really can improve the timeliness of performing some of these operations above and beyond what a human would be capable of. So, you know, this does represent bleeding-edge technology; it's pretty cutting-edge stuff. And yeah, we're definitely still trying to figure out use cases and just how effective it is with some of those use cases. So hit us up, [email protected].
Speaker 1 26:26 Would it work for, say, if I wanted evening time lapses, where it went from daylight...
Speaker 0 26:30 To dusk? It's built to find an object, right? So not necessarily, at this point, although I did ask him about this and I think it could be worked on. I used the example of a dunk shot: if you were doing sports production, could you kind of show it what a sequence of images looked like that constituted something it thought was, like, a dunk in basketball? But it's more about a freeze frame of an object being statistically likely to appear in a frame of video, where you gave it a still image in the first place. Now, you could feed it a sunset and see if it came up with shots that it thought were a sunset, but, you know, that's not necessarily an object. I think the more discrete the object is as a self-identifiable entity within a frame of video...
Speaker 0 27:21 The better it'll do. Find the banana, exactly. Find the banana, folks. Wow, that just kind of summarizes it. That's right. We are going to NAB next week.
I am; Merrel will be here holding down the fort. A few other folks are out there. I know that Nexidia will be there, and we can arrange demos in their meeting room with them. Drew Lanham, again, is the guy we're very friendly with who we can arrange meetings with. So please reach out to [email protected] if you are going to NAB, or sending some folks out there from your contingent, and want to hear more about Nexidia and maybe even lock down a demo in their meeting room. And as far as Nervve goes, we will be setting that up and doing more demos as time goes on. If you have ideas about that, again, shoot us a line.
Speaker 0 28:08 We'd be happy to talk about that. And, you know, I think that's pretty much it for this week, but we wanted to talk about some of these bleeding-edge audio and video search technologies, because it's stuff that's just hitting the market, and we're really, really curious to hear how this could impact people's organizations. We talk about metadata all the time; we think metadata is great. Metadata is those things that a human kind of has to define, but there are some things that computers are really good at, and a lot of that is doing statistical analysis, and doing statistical analysis on both audio and video. You show it what a banana looks like, then boom, boom, it finds that banana fast. Alright, so that'll do it for us this week. I think the next episode will be our back-from-Las-Vegas NAB review, right?
Speaker 0 28:58 And yes, we'll be going to DAM NY soon enough and we'll chat about that as well. Yeah, DAM New York, just bear it in mind, be aware of it. It's put on by Henry Stewart conferences, and it's a fantastic show for digital asset management. In fact, we'll be talking about these types of things, as well as more traditional metadata search, at our booth presence at DAM New York. But again, DAM New York, great show in New York. If you're interested, it's May 2nd and 3rd, and we're going to have a fully integrated storage, media asset management, and archive system right there at our tabletop. We can get you exhibit hall badges; you just have to reach out to us at [email protected] and we can make that happen as well. You guys have a good weekend. All right, we'll be back from Vegas, and hopefully alive. Okay. Take care. Bye bye.
