IoT Spotlight - EP062 - Advanced machine vision and deep learning systems - Iain Smith, Managing Director, Fisher Smith

Podcast Date	Apr 07, 2020
Podcast Description	In this episode, we discuss the use of deep learning mechanisms to accomplish tasks that are not possible with traditional rule-based systems. We use two cases to illustrate how deep learning can be used to solve non-traditional and recognition problems within hours. This is part 2 of 2 with Iain Smith on machine vision. Iain Smith is Managing Director and Co-Founder at Fisher Smith. Fisher Smith designs and supplies machine vision systems for automatic inspection and identification of manufactured parts on industrial production lines. https://fishersmith.co.uk

Contents

Erik: Welcome to the Industrial IoT Spotlight, your number one spot for insight from industrial IoT thought leaders who are transforming businesses today, with your host, Erik Walenza. Welcome back to the Industrial IoT Spotlight podcast. I'm your host, Erik Walenza, CEO of IoT ONE. Our guest today is Iain Smith, Managing Director and co-founder of Fisher Smith. Fisher Smith designs and supplies machine vision systems, and this is our second discussion with Iain on the topic of machine vision. In this talk, we focused on the use of deep learning algorithms to accomplish tasks that are challenging or impossible with traditional rules-based systems. We also walked through two case studies. The first illustrates how deep learning can be used to solve some recognition problems in a matter of hours, while the second illustrates a difficult problem that would have been impossible with rules-based systems and is pushing the bounds of deep learning capabilities. If you find these conversations valuable, please leave us a comment and a five-star review. And if you'd like to share your company's story or recommend a speaker, please email us at team@IoTone.com. Thank you. Iain, welcome back, and thanks for joining us again.
Iain: A pleasure to be speaking with you again, Erik.
Erik: Let's just do the 60-second background on who you are and what your company, Fisher Smith, does, and then we can dive into the topics we're going to focus on today.
Iain: So I've been working in the machine vision industry for about 20 years now. I did a degree in engineering maths in the UK and then went pretty much straight into working for a builder of vision inspection machines. For the last 15, nearly 16 years, I've been running Fisher Smith, very much focused on just industrial machine vision.
This tends to be predominantly inspection, quality control, and robot guidance tasks. We tend to work with a range of vision suppliers, and we add value to that equipment by integrating it, writing the software, doing the front ends, deploying it, and getting the systems actually working on the factory floor to solve the application. We're usually doing that through a chain of customers. Predominantly, those customers are automation companies: the people who are making the machines, doing the robots, the conveyors, the material handling, the moving of parts, the assembly. We come in as a specialist supplier to do the vision aspects of that production line. So we're quite a small, specialized team focused just on machine vision in industry.
Erik: Today, we want to do more of a deep dive on deep learning and its impact. In our last podcast, we covered more of what we could call the traditional solutions, even though some of those solutions are still on the cutting edge of machine vision technology. As a starting point, it would be great to understand your definitions, because we have the term deep learning, we have machine learning, and then we have AI as an umbrella concept. Are there technical differences between these terms? Or are they different levels in a hierarchy? How do you look at these? What does deep learning actually mean in the context of machine vision?
Iain: Yeah. So, I guess all of those terms are quite often misused, and they get swapped and interchanged quite a lot. AI really sits above all of these as a more general concept of computer-based intelligence, and the general perception of AI is often a sort of human level of awareness and intelligence. What we're looking at, and only ever looking at, is a specific, focused AI targeted at a particular function, and that's where the separation from AI to deep learning starts to happen.
So in an industrial vision context, we take deep learning to mean teaching a neural network on images in particular, to look for particular characteristics in those images. We're not using neural networks for general data processing, speech recognition, or any of the other data sources that can happily go into a deep neural network. We're really just focusing on the image processing aspects, and even within image processing, narrowing that down again to specifically industrial applications.
Erik: Let's go into how this differs from the traditional process. I suppose with a traditional process, you're checking that, say, a shape has a diameter of two millimeters, and if it's out of range by some fraction of a millimeter, then it's a fault? Or maybe we're looking for black, and if there's a white pigment, then it's a fault? How does deep learning differ from a programmed approach that you might have taken, or might still take in most cases today?
Iain: Yeah, so it's a very different concept, really. Traditional machine vision and deep learning complement each other in a lot of respects. There are some areas where they overlap, where you think, I could do that one way or I could do it the traditional way, but often the two are separate. And like you say, the traditional methods tend to be rules-based or logic-based, where you're saying, I'm going to count this many dark pixels or blue pixels in an image; or I'm going to make a measurement, which generally means finding an area of contrast or a feature in one part of an image and a different area of contrast or feature in another part, and then measuring between them. For some of those tasks, classic machine vision is still the right way to do it.
But where deep learning changes things is that it's taught in a different way to start with, and it can cope with varied scenarios far better. The challenge is finding where that separation lies, to work out which one is best to deploy. For instance, say you're doing fault detection against a bland background, such as a gray conveyor belt or a gray surface. You're looking for scratches on a piece of metal or painted material, the background is consistently one color, a gray or a blue, and you know that the defects are black marks. That's fairly easy to set up with a traditional machine vision approach: you say, okay, I'm going to ignore pixels that are blue or gray, and I'm going to look for anything that's different, I'm going to look for black pixels. Then you can start counting them and saying, this many black pixels is unacceptable to the customer; this is a fault, we reject the part. That's the traditional, logical, rules-based approach. But as soon as the background or the object is not a nice, consistent, even surface, if it has multiple colors or a surface texture, dark areas and light areas, if it's a fabric or some other complex shape and color like wood grain, and you're looking for scratches going across it, the traditional techniques just completely fall down. Say you're looking for a scratch on a bit of wood, and your piece of wood already has all these grains, these lines and contours, running through it. How do you quantify a scratch with a rules-based approach? You can't say it's long and thin and dark, because all the other grains in the wood are already long and thin and dark. You may be able to say, well, it's going horizontally and all the grains are going vertically, but that may only work in some instances.
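The uniform-background case described above (ignore the gray or blue pixels, count the black ones, reject above a limit) can be sketched in a few lines. This is a minimal illustration only: the image is assumed to be a grayscale list of rows with values 0 to 255, and DARK_LEVEL and MAX_DARK_PIXELS are hypothetical values, not figures from any real system.

```python
# Minimal rules-based inspection sketch for a uniform background.
# Assumption: images are grayscale lists of rows with values 0-255.
# DARK_LEVEL and MAX_DARK_PIXELS are hypothetical tuning values.
from typing import List, Tuple

Image = List[List[int]]

DARK_LEVEL = 50        # pixels darker than this count as "black" defect pixels
MAX_DARK_PIXELS = 20   # illustrative acceptance limit agreed with the customer


def count_dark_pixels(image: Image, dark_level: int = DARK_LEVEL) -> int:
    """Count pixels darker than dark_level; lighter background pixels are ignored."""
    return sum(1 for row in image for px in row if px < dark_level)


def inspect(image: Image, max_dark: int = MAX_DARK_PIXELS) -> Tuple[str, int]:
    """Reject the part if the dark-pixel count exceeds the limit."""
    dark = count_dark_pixels(image)
    return ("REJECT" if dark > max_dark else "PASS", dark)


if __name__ == "__main__":
    gray_belt = [[128] * 10 for _ in range(10)]   # uniform gray surface
    scratched = [row[:] for row in gray_belt]
    for r in range(3):                            # a 3 x 10 dark band
        scratched[r] = [10] * 10
    print(inspect(gray_belt))   # ('PASS', 0)
    print(inspect(scratched))   # ('REJECT', 30)
```

On a wood-grain or textured surface the same count breaks down, because the grain lines themselves fall below DARK_LEVEL, which is the limitation Iain describes.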
It may be that some of the blemishes are lighter, some are darker, or some are actually a very similar color. This is often the frustration with traditional vision techniques: a human with a bit of training can look at that object and say, no, we don't like that blemish on this piece of wood or this piece of fabric; it's wrong. But try to codify that, to put logical rules around it: Is it darker? Not all the time. Is it lighter? Sometimes, but no. Is it the same shape as, or a different shape from, what's already there in otherwise good products? It's almost impossible to use a traditional machine vision approach for that. And that's where deep learning really wins: you can give it lots and lots of samples of what is good, with all the variations that come through in good product, and then you can give it samples that are bad. And [inaudible 10:40] with suitable levels of training, separate those different classes and start to find faults that would have been almost impossible to find with traditional machine vision techniques. I've mostly touched on defects, because that tends to be the quality control side of what we do, and it's where a lot of our projects end up going. But really, deep learning wins in a few areas. One is defect finding. Another is object detection: if you don't have a well-defined shape or a strong contrast, you can use deep learning to match features and find and locate them in an image. And the core, classic use case for deep learning is classification: separating apples from oranges from pears from bananas, and saying, yes, this is definitely of this type, and this is definitely of that type.
We often look at this now as a layered approach. We may build up a deep learning application that looks for a defect type, and once we've found that defect, we use deep learning as a secondary operation on it to say: now that we've extracted the defects, we'll classify them. This is a scratch, this is an oil mark, this is a fingerprint, this is a piece of cardboard, all the different types. Then you can start to get proper separation and use that feedback to give the customer really valuable information about their process: where the defects are coming from and what their high-value problems are. With traditional techniques, some of that is possible. But if you've just found an area of dark pixels and it's bad, you're then trying to put lots of rules around it: well, if it's long and thin and in this direction, it's a scratch, but if it's long and thin but curved, then it's maybe a bit of fabric or a hair or something like that. Actually separating those can be very difficult in a rules-based way, whereas deep learning is really set up to do all of that.
Erik: So in a deep learning environment, it's easier to categorize, because the algorithm can kind of tell you that the shapes are somehow similar. But one of the complaints about deep learning, or one of the challenges, is that it's also somewhat of a black box: it can tell you these are similar, but it doesn't necessarily explain why they're similar. How do you get around this understanding barrier?
Iain: This is one of the key hurdles we have when explaining this, or even when getting to the point of signing off a machine or a system with the customer. You've got to say to them, well, we've put all of your good images in this side of the black box, some magic has happened in the middle, and then at the other end, it said this one's good and that one's bad.
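The layered approach described above (first locate defects, then classify each one by type, then tally the types for process feedback) can be sketched as a small pipeline. In this sketch detect and classify are hypothetical stand-ins for two trained models; the toy "elongated means scratch" rule in the example is only a placeholder, and it is exactly the kind of hand-written rule that a trained classifier would replace.

```python
# Sketch of a two-stage ("layered") inspection pipeline.
# detect() and classify() are hypothetical stand-ins for two trained models:
# stage 1 localizes defects, stage 2 assigns each defect a type.
from collections import Counter
from typing import Any, Callable, Iterable, List, Tuple

Region = Tuple[int, int, int, int]  # bounding box: (x, y, width, height)


def layered_inspection(
    images: Iterable[Any],
    detect: Callable[[Any], List[Region]],
    classify: Callable[[Any, Region], str],
) -> Counter:
    """Run detection, classify each found defect, and tally defect types."""
    tally: Counter = Counter()
    for image in images:
        for region in detect(image):
            tally[classify(image, region)] += 1
    return tally


if __name__ == "__main__":
    # Toy data: each "image" just carries pre-computed defect boxes.
    batch = [{"boxes": [(0, 0, 12, 2), (5, 5, 4, 4)]}, {"boxes": []}, {"boxes": [(3, 1, 15, 1)]}]

    def fake_detect(img):
        return img["boxes"]

    def fake_classify(img, box):
        # Placeholder rule standing in for the second model: elongated = scratch.
        return "scratch" if box[2] > 3 * box[3] else "oil mark"

    print(layered_inspection(batch, fake_detect, fake_classify))
    # Counter({'scratch': 2, 'oil mark': 1})
```

The per-type counts are what turn a pass/fail decision into feedback about where defects are coming from.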
And there's no real way of... well, some people can and do understand what's going on inside the neural network. But the reality is that you don't necessarily know which features the deep learning has chosen in an image to separate one class from another, good from bad, apples from oranges, and it might be something you don't expect. We've got to be quite careful with this, because you can think you've taught a particular fault type when what the deep learning has actually picked up is not the fault itself but, say, the proximity of that fault to the edge of the part. For instance, if in all of your training images the scratch is next to the edge of the part, the deep learning may have detected some aspects of the scratch, a little bit of contrast, a change of texture, whatever it may be, but it may also have picked up the edge of the part as a significant feature that always appears near it. Then when the same scratch, otherwise identical, appears in the middle of the part, the deep learning misses it completely. So we can't assume it has found the scratch or the blemish and understands what the blemish is. There are other factors in play: in the way neural networks are trained, the image is broken up, manipulated, changed, and transformed in various ways, which are all mathematically sound concepts but can be quite abstract. It may be finding some data in there to separate the classes that you can't visibly tell, or can't obviously see or attribute to what you've got in front of you.
Erik: There was a study out recently, I think it was Google, using deep learning to look at lung cancer, and they compared it against results from whatever species of surgeon typically assesses cancer in the lung.
I think the result was that there was a class of cases where the humans could easily identify the cancer and the machines were missing it, and others where the machine identified it with very high accuracy and the humans completely missed it. So in some cases they're using different processes for identification. I imagine that can be very frustrating for somebody who's trying to understand the quality control process.
Iain: Yes. Traditional vision techniques have been around for a good number of years now, so they're starting to become generally understood in industry. Purchasing people, or project people who have brought a number of projects involving machine vision into their business, understand what they want the system to achieve and how they want us to present our methodology, so that they can understand what's happening and sign off knowing it's a robust method and that they have control if they need to adjust something. They can see what's happening in there: yes, the features they've asked us to detect, we are detecting; when those features change like this, we can see the values changing; and we can prove that we are meeting their specification. With deep learning, it's very much a results-based sign-off, because you're asking the customer: give us the good ones, give us the bad ones, we'll put them through and give you the result, which says these ones are good and these ones are bad. But we can't tell you how it made that choice, because that has all happened inside the deep learning.
Erik: So in the example you gave earlier, with scratches on the edge versus in the middle, the challenge seems to be that there were not enough samples in the training data of situations where the scratch was in the middle. We didn't have sufficient data for that case to train the model.
In a lot of cases, let's say if the quality control is terrible, you have a ton of examples of faults as well as good training data. But where quality control is already fairly good, you might not have that many examples of actual faults. How do you go about collecting the data, and how do you know when there is sufficient or insufficient data to train a deep learning model?
Iain: This is one of the big separations between traditional machine vision and deep learning. With traditional machine vision, you can basically take a single image, set up your rules, your logic, and your tests on that image, and say: I'm measuring this, I want to find this; if this pattern changes more than 10% from what I've taught it, I'm going to reject it; if the number of pixels I'm counting over here exceeds a certain value, I'm going to reject it. You can set those rules up on one image if you need to, and then start challenging them with other images. With deep learning, you need a quantity of images to start with, and that quantity varies. Depending on how subtle the faults are, you may be able to get away with tens of images as a starting point. But if you want something really robust, then more is more. If you can provide more samples of all the different classes you're trying to put into the deep learning model, you're going to get better, more robust results. But you also need to spend the time to actually separate them. If you've only got one image of a single fault, then the chances of reliably finding that fault in multiple locations are limited. There are techniques, though. One of the bits of software we've used most for this is a Cognex product called ViDi, which has the option to introduce variations into your training images.
So you can basically take the images in your training set and say: I'm going to rotate all of them by ±90 degrees in two-degree steps, and I'm going to make them all slightly brighter and slightly darker, to artificially introduce some of the variations we might expect, as a way of multiplying up the training set. So if I've only got a handful of images of faults, I can add some of these manipulations, in position, size, scale, skew, brightness, and contrast, to try to make my training set more robust from day one. It can certainly help, but it's still no substitute: if you really don't have the data to start with, it's going to be hard to work with. This probably also leads into a separation, certainly in the commercial software we tend to use, between two different deep learning techniques. One is where you specifically teach the faults. With the Cognex product, that's referred to as supervised training: you draw onto each failure image exactly where the fault is and highlight it as a fault of this type, and you have to manually mark all those images. That allows you to be much more targeted about the fault type or feature you're trying to find and highlight. The downside is that you need lots of different variations. If you're looking at scratches, you want a training set where those scratches have different sizes, lengths, and curvatures, in the different positions where you'd expect to find them.
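To see how quickly the rotation-and-brightness scheme described above multiplies a training set, the sketch below builds the grid of (angle, brightness) variants and applies the brightness part to a toy image. It is an illustration under stated assumptions, not how ViDi implements it: images are grayscale lists of rows, the brightness factors 0.9, 1.0, and 1.1 are hypothetical, and the rotation itself would be done with an imaging library (for example Pillow's Image.rotate), which this sketch does not call.

```python
# Sketch of training-set augmentation: rotations of +/-90 degrees in
# 2-degree steps, combined with slightly darker/brighter copies.
# Assumptions: grayscale list-of-rows images; brightness factors are
# illustrative; rotation itself is left to an imaging library.
from itertools import product
from typing import List, Sequence, Tuple

Image = List[List[int]]


def augmentation_grid(
    max_angle: int = 90, step: int = 2, brightness: Sequence[float] = (0.9, 1.0, 1.1)
) -> List[Tuple[int, float]]:
    """Every (rotation angle, brightness factor) pair to apply to each training image."""
    angles = range(-max_angle, max_angle + 1, step)
    return list(product(angles, brightness))


def adjust_brightness(image: Image, factor: float) -> Image:
    """Scale pixel values by factor, clipping to the 0-255 range."""
    return [[min(255, max(0, round(px * factor))) for px in row] for row in image]


if __name__ == "__main__":
    grid = augmentation_grid()
    print(len(grid))                             # 273 variants per image (91 angles x 3)
    print(grid[0], grid[-1])                     # (-90, 0.9) (90, 1.1)
    print(adjust_brightness([[100, 250]], 1.1))  # [[110, 255]]
```

A handful of fault images becomes over a thousand training samples this way, but every variant still shows the same few faults, which is why synthetic variation does not replace real examples.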
With that method, what you're teaching takes into account all sorts of things, including the position, the size, the length, and the pixels surrounding the area you've marked. All of that goes into the mix when the neural network is trained to decide how to separate that class or fault from everything else. If you haven't got very many reject images, the other way, which we call unsupervised training or novelty detection, is to train only on good images. You say: all of these images are of good products; they're all acceptable. You train on those, and then you ask it to find things that are different. That's a good approach if your faults work with that technique; some do, some don't. If they do, you're selecting good ones, and it highlights areas on new images that don't conform to what the good ones look like. So it lets you detect faults you may not have seen before, or any differences or variations. Often we combine the two techniques, because you might want general defect detection that looks for differences from the good products, but there may also be some very specific, fine, or subtle features that technique doesn't reliably separate. It might find them when they're more obvious, but when they're more subtle, we need to go to a more direct, supervised method of training where you highlight the faults manually. Then, to combine the two, you have the two deep learning models running either sequentially or in parallel, to give you the best of both worlds, really.
Erik: And I guess a follow-up on this training: is there any off-the-shelf software where somebody can purchase it, plug it in, upload images, and output a functional model?
Or does pretty much all of this currently require a data scientist to get into the dataset in order to create a functional model?
Iain: You can go either way on this. On one level, anyone can go and register for an account with Google, Amazon, or Microsoft and get access to their virtual machines. You can use open-source deep learning frameworks, things like TensorFlow, that you can start off with for absolutely minimal cost, and you can go and train a deep learning network. There's obviously an overhead: you've got to understand roughly what you're doing, but there are tutorials out there. So at one level, you can do it at absolutely minimal cost. You don't need any hardware of your own; you can rely on uploading images to the cloud and using virtual machines to do all the training. What we're really focusing on is commercially available products. Although the data scientist side of things is very interesting, for us it's not very commercially viable. If we can deploy a solution in a relatively short space of time, that allows us to solve a customer's problem, get the solution working on the factory floor, and move on to the next project. We don't want to be tied up for hours, days, or months training up neural networks. So to that end, we focus on commercial products, and what I've talked about today is based on them. To mention a couple: I've already mentioned the Cognex ViDi software, and we also use the deep learning library from MVTec, the German company, [inaudible 27:28]. These are commercially available deep learning products with a software user interface that lets you do the training: you import images, and you mark on them the features you're interested in.
Then you can train the model locally, on local hardware, and deploy it within the framework of those software environments. For direct industrial deployments, we see this as the most sensible route to market for us. It may not be the most powerful or the most flexible. Obviously, companies like Google have their self-driving cars spotting street signs, shop fronts, other cars, and number plates, all within very complex and powerful neural networks. The things we're looking at usually can't use the cloud as a computing platform, for a couple of reasons. One is commercial: these are often specific, proprietary processes and products, and companies are generally cautious about exfiltrating all of that image data and quality control data up to the cloud to be stored somewhere else. Then there are practical aspects. Training is one element. But when we deploy a deep learning system, it's often on a production machine, and we might be checking multiple parts a second. Taking an image with a camera on a production line, uploading it to some cloud service, running the deep learning on a virtual machine, getting the results, and sending them back down to the machine to pass or fail might be quick enough, but we can't guarantee the reliability or the latency. It depends on local network conditions, local internet conditions, and other bottlenecks, which makes it too risky and too difficult. So we usually look at what's described as edge inference, where a local computer next to the camera runs the deep learning inference and gives the results back there and then. That requires the computer to have the capability to do it.
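The latency argument above comes down to whether the whole capture, inference, and result chain fits between consecutive parts in the worst case. The sketch below is a hypothetical budget check: every stage timing, including the "worst_case_jitter" allowance for the network, is a made-up illustration value, not a measurement of any real camera, PC, or cloud service.

```python
# Hypothetical latency budget check: does a per-part processing chain fit
# inside the time between parts? All stage timings below are made-up
# illustration values, not measurements of any real system or cloud service.
from typing import Dict, Tuple


def fits_cycle(parts_per_second: float, stage_ms: Dict[str, float]) -> Tuple[bool, float, float]:
    """Return (fits, budget_ms, total_ms) for one inspection cycle."""
    budget_ms = 1000.0 / parts_per_second
    total_ms = sum(stage_ms.values())
    return total_ms <= budget_ms, budget_ms, total_ms


if __name__ == "__main__":
    edge = {"capture": 20, "local_gpu_inference": 60, "result_to_machine": 5}
    cloud = {"capture": 20, "upload": 120, "cloud_inference": 40,
             "download": 30, "worst_case_jitter": 100}
    print(fits_cycle(4, edge))   # (True, 250.0, 85)
    print(fits_cycle(4, cloud))  # (False, 250.0, 310)
```

The edge chain has no network leg, so its total is set by the local hardware; the cloud total depends on network terms the integrator cannot guarantee, which is the risk Iain points to.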
This is where you generally start looking at GPU acceleration. For most of the commercially available products we work with, that GPU acceleration is Nvidia-based, because Nvidia's CUDA libraries are the ones the software leverages. Basically, the higher-spec the graphics card you work with, the faster your neural network and your deep learning will run. That's how we tend to do the deployment. On the training side, the same things apply: when we train, we generally use a local computer with graphics acceleration. For instance, we've got a gaming-spec laptop that can train these deep learning models within minutes for a small image set with small images; otherwise it might take hours, and we sometimes leave it running overnight for a more complicated set of training. Training is more of an offline process, so using a cloud-based service for it is more viable, but doing it locally on local hardware is also a reasonable overhead. What we are starting to see is companies like Cognex exploring whether a cloud-based service for this will work. But again, there are two issues. One is that we might have gigabytes of images to upload, and the more images you supply to the deep learning, generally the better and more robust your model is. Certainly here in the UK, we have asymmetric internet connections, so upload speeds might only be 10% or 20% of download speeds. Uploading tens or hundreds of gigabytes of images to a cloud service might take you a couple of days, so running things locally can be beneficial from that point of view. The other issue is that the software we're using is licensed.
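The "couple of days" upload estimate above follows from simple arithmetic. The sketch below assumes a hypothetical 100 Mbit/s download speed and uses the 10% to 20% upload ratio mentioned in the conversation; it treats a gigabyte as 10^9 bytes and ignores protocol overhead, so real transfers would be somewhat slower.

```python
# Back-of-the-envelope upload time for a training image set on an
# asymmetric connection. The 100 Mbit/s download figure is an assumption;
# the 10-20% upload ratio is the range mentioned in the conversation.


def upload_hours(dataset_gb: float, download_mbps: float, upload_ratio: float) -> float:
    """Hours to upload dataset_gb gigabytes (10^9 bytes) at download_mbps * upload_ratio."""
    upload_mbps = download_mbps * upload_ratio
    seconds = dataset_gb * 8e9 / (upload_mbps * 1e6)   # bytes -> bits, Mbit/s -> bit/s
    return seconds / 3600


if __name__ == "__main__":
    for gb in (50, 100, 200):
        for ratio in (0.1, 0.2):
            print(f"{gb} GB at {ratio:.0%} of 100 Mbit/s: {upload_hours(gb, 100, ratio):.1f} h")
```

At a 10% ratio (10 Mbit/s up), 100 GB takes about 22 hours and 200 GB about 44 hours, which is consistent with the estimate in the conversation.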
So unlike the Google and Amazon route, where the tooling is largely open source, we're using proprietary, licensed software. If you've got a physical USB dongle that you use with a local laptop or a local PC, how does that licensing work with a virtual machine in the cloud? That could be a little bit tricky. So we are seeing some movement, certainly with some of the dominant companies like Cognex, who are starting to explore whether that's a model we, as partners and integrators, could use as a benefit of our partnership: rather than maintaining our own local infrastructure, we'd rely on our supplier's infrastructure to do the training. Then, further on, would customers want to do that? Some customers absolutely wouldn't want their data stored in the cloud somewhere. They'd be much happier to keep it local, even if that meant investing in some decent PC hardware for training.
Erik: Yeah, I think that's more or less what we see across other IoT use cases as well. There's a general move towards the cloud, a trajectory in that direction, but it's slow and very cautious. The cost model is why people are moving that way; the hesitation isn't really about cost. Any use cases that are deemed to put IP at risk, I think those are going to move very slowly. Iain, can you walk us through a situation, maybe a customer you've worked with? You don't have to mention the customer's name, but just walk us through from the initial conversations, through the decision process about the right approach for that particular situation, to how you selected the right technology. What's your evaluation process? I think it's quite useful for people to have an end-to-end perspective on how this would be deployed.
Iain: I can probably talk you through a couple of use cases we've come across. One that we had a little while ago was for some plastic lenses.
These are solid lenses; I think they were used in some sort of smoke or fire detector systems. They were being inspected manually: a human looked through each lens at a grid underneath, a white piece of paper with black lines on it. If the lens is correctly formed, you see the grid distorted in the classic fishbowl or pincushion type of distortion. But if there's a blemish in the lens, you get an anomaly where the lines aren't evenly distorted; as you look through the lens, they're uneven and varying. We looked at that and tried to do it with a traditional machine vision approach. I wouldn't say it was impossible, but it was very difficult to do robustly, partly because the product had a little bit of acceptable variability. It wasn't a precision product. It had to be uniform, but that uniformity could vary a little; so long as the distortion was even, it could vary. If you tried to set up a rules-based system to find the edges, which were black and white, so nice and straightforward, you could do that. But then the spacing between them would change all the time, not by very much, but unevenly, depending on how the distortion occurred. Then you start doing that in two axes. And what happens when you get a bubble, a dark mark, or a blemish in there? Can you detect that, when you now need to look for black pixels in between the grid lines? It wasn't impossible, but it was very, very difficult to do, and to do robustly. Whereas we were able to basically reproduce that setup, take the images, and put them into Cognex ViDi in this instance.
With only a handful of samples, something like 10 samples of good product, we could then put in one with a bubble, one with uneven curvature on the lens, one with a scratch or a spot, and straightaway it would pick up those features. Regardless of the distortions in the grid, it could pick up those blemishes very quickly. And because these parts took quite a long while to make, we had 20 or 30 seconds per part for the inspection. Once the model had been trained, we worked out that we could run it without GPU acceleration, on a CPU. It wasn't very quick, 10 or 15 seconds per detection, but we had that amount of time and it was acceptable to the customer. That meant we could go with an almost standard machine vision hardware setup for the deployment, having done the deep learning training on graphics-accelerated hardware. One of the benefits of using something like ViDi or [inaudible 39:08] is that when we deploy to the customer, we need the ability to take images from the cameras, and those products have the traditional vision tools and the camera interfaces. So we could use the same software environment to grab the image, put it through the deep learning, and do a little post-processing with standard vision tools, for example putting user-editable thresholds on the size of the defects being found. Basically, we take the defects that are found and do standard blob and pixel-counting analysis on them, which is user-editable at runtime, so the customer can trim their quality levels without having to retrain the deep learning. Then we wrap all of that up in a user interface. So that was one area we looked at.
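The post-processing step described above (blob analysis on the defects the deep learning finds, with a size threshold the user can edit at runtime) might look like the sketch below. It assumes the model's output has already been turned into a binary mask (a list of rows of 0 and 1), which is an assumption about the interface; the flood-fill labeling and the min_defect_area parameter are illustrative, not the API of ViDi or any other product.

```python
# Sketch of runtime post-processing on a deep learning defect mask:
# group flagged pixels into 4-connected blobs, then apply a size threshold
# that operators can edit without retraining the model.
# Assumption: the model's output is a binary mask (list of rows of 0/1).
from collections import deque
from typing import List, Tuple

Mask = List[List[int]]


def blob_areas(mask: Mask) -> List[int]:
    """Return the pixel area of each 4-connected blob of non-zero pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    areas = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, area = deque([(r, c)]), 0
                while queue:                       # breadth-first flood fill
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                areas.append(area)
    return areas


def judge(mask: Mask, min_defect_area: int) -> Tuple[str, List[int]]:
    """Reject only if some blob reaches the operator-set minimum defect area."""
    big = [a for a in blob_areas(mask) if a >= min_defect_area]
    return ("REJECT" if big else "PASS", big)


if __name__ == "__main__":
    mask = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 1, 1, 1],
    ]
    print(blob_areas(mask))   # [4, 1, 3]
    print(judge(mask, 5))     # ('PASS', [])
    print(judge(mask, 3))     # ('REJECT', [4, 3])
```

Because min_defect_area is applied after inference, tightening or relaxing the quality level only changes one number, and the trained model is left untouched.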
Erik: Could you give me a ballpark for what the timeline would be for developing this, and also what the full budget, including hardware and software, might be for a deployment like this?

Iain: Yeah, sure. In terms of the timescale for us evaluating it, once we got samples from the customer, I think we had the deep learning model trained in an hour or so; once we had images, the training took probably less than an hour. We were then able to start testing and effectively go back and demonstrate to the customer within a few days. So that aspect of it was quick because it just worked: the training worked very quickly, we got a good result straightaway, and we were able to move to the deployment. The overall cost of that, ballpark, was I think somewhere around 30,000 euros. That covered the hardware, so a reasonably high-resolution camera, the lighting, the grid that we were using as part of the imaging solution, and an industrial computer to do the processing; the software license; and our time to put a little user interface together, do all the communication, and do the sign-off and installation aspects of the integration. That compares reasonably favorably to traditional machine vision. If we had done it with a smart camera of the same sort of resolution, for instance, we might have been at 75% or 80% of the cost. So it's in the same ballpark, which is not always the case, because as soon as you start looking at faster, more high-end systems, where we need more graphics cards, bigger hardware, and more involved image processing and deep learning time, those costs can obviously escalate quite a lot.

Erik: How would this scale?
Let's say it's 30,000 euros to develop the solution initially, and they say, we have five other factories, or five other production lines, that are exactly the same and we want to deploy this. Would it also be 30,000, or are you looking at 70%, 50%? What would the cost look like if you wanted to scale the exact same solution?

Iain: On that particular one, you'd be looking at 70-ish percent to do repeats, because a lot of the overhead of us writing the user interface, doing the development aspects and training the deep learning has all been done already. So then there are the hardware costs, the licensing for the software, and there are still some deployment charges to actually get the system on site, set up and working. But obviously we're then not having to create a user interface or anything like that, because that can be copied over from the previous one. On larger deployments, that difference could be greater; we could be down to 50-60% repeat costs. In a lot of what we do, the repeats very rarely go into the tens or hundreds; it tends to be either a single production line, or 5, maybe 10 at the most. If you went to really large scale, and I'm just picking names out of the air here, say Samsung or Sony, where they've got maybe hundreds of production lines making consumer electronics, then maybe you wouldn't necessarily look at some of these commercial off-the-shelf products. You might consider starting from scratch because, okay, the development overhead is much, much higher, but the deployment costs would potentially be much, much lower. Having said that, if we said to our suppliers, we've got a 100-off or 1,000-off deployment, they'd be very keen to discuss commercial terms on that basis. So I'm sure costs for various things would potentially come right down at those sorts of levels.
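The repeat-cost arithmetic in this exchange can be made concrete with a simple model: one full-price first system, then each identical line at a fixed fraction of that price. This is a minimal sketch using only the ballpark figures from the conversation (30,000 euros, roughly 70% repeats, 50-60% on larger roll-outs); `fleet_cost` is a hypothetical helper and ignores the volume discounts Iain mentions.

```python
def fleet_cost(first_system_eur, repeat_fraction, n_lines):
    """One full-price first system plus (n_lines - 1) repeats priced at a fraction of it."""
    if n_lines < 1:
        raise ValueError("n_lines must be at least 1")
    repeat_eur = first_system_eur * repeat_fraction
    return round(first_system_eur + (n_lines - 1) * repeat_eur)


# Erik's scenario: the original line plus five identical ones
print(fleet_cost(30_000, 0.70, 6))  # 135000
print(fleet_cost(30_000, 0.50, 6))  # 105000
```

At 70% repeats, six lines average 22,500 euros each rather than 30,000.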
Another example, and one that we're actively working on at the moment, is a manual assembly process where we're trying to detect a particular feature: the customer bolts 500 or 600 of these particular items onto the framework that they're building. What they're trying to do is reduce the amount of human inspection required to validate that every single one of those has been correctly placed and is in the right position. Rather than a fully automated production line, where we can put the camera directly over the object we're looking at and control the lighting and the testing of it, here we're talking about a large item with multiple people climbing ladders and moving around it to bolt stuff on, so we're restricted in what we can do. We haven't got the same control over our environment. We have to have the cameras set back from the object so that people can move around in front of them. We can't shine stupidly bright lights at the surface to illuminate it evenly; we've got to rely on the ambient lighting. All of which makes detecting these objects very tricky, because they appear at all sorts of different angles and heights. We've got a high-resolution color camera looking at the scene, or actually several of them looking all the way around, and we're trying to determine, for all of these different objects, are they there? The objects themselves, as they're bolted in, can twist and conform. They hold on to other parts of the build, and depending on how hard they're tightened and what they're gripping, they deform slightly and look a bit smaller. They come in basically one color and a couple of different sizes, but the color is not a controlled aspect of the build: as long as the part is functional and the color is approximately right, it's acceptable. So we've got all of these variations: lighting, shadowing.
We've got multiple positions, multiple poses and angles that these appear at, and variation in the actual color and size of the objects, and we're trying to locate all of them around this surface. We're using deep learning at the moment to teach it what this feature looks like. We're now up to, I think, training on 2,000 or 3,000 images, and each one of those images might contain multiple instances. The other thing I haven't really touched on is that all these deep learning techniques rely on a ground truth: a human to say, this is good, this is bad, or this is this type of feature, this is that type of feature. All those images need to be labeled correctly, as accurately and consistently as possible, to allow the deep learning to say, well, I know that these are this class, and these are that class. Without that human interaction at the start, in training, it doesn't know what it's classifying. So that has been quite a big overhead in terms of time. It's not necessarily particularly high-skilled, but it does require somebody to sit there and say, I'm going to draw a box around this one, and now I'm going to draw a box around this one. What we're finding at the moment is the question of what the best technique is. How do we best encapsulate these objects? Sometimes they're fully visible, sometimes they're partially visible, sometimes we see more of the side of one than the front of it. How do we go about teaching that? There's no right answer to this. It becomes a bit of trial and error, trying one technique: okay, we're going to train all of these where we only focus on certain aspects of the object, and if we can't see that aspect, we won't train on it. Then we train another version of the same model where we include much more of the surroundings and say, this is it, but it's surrounded by this, so we have bigger areas to define it.
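One way to put a number on the labeling consistency Iain stresses is to compare two labeling passes over the same object instances (two people, or before and after a relabeling round) using intersection over union (IoU), a standard box-overlap measure. This is a sketch under stated assumptions, not a feature of any particular tool: boxes are `(x_min, y_min, x_max, y_max)` in pixels, each instance has a hypothetical id, and the 0.7 IoU threshold is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def inconsistent_labels(pass_a, pass_b, min_iou=0.7):
    """Instance ids where two labeling passes disagree on class or on box placement."""
    flagged = []
    for instance_id, (cls_a, box_a) in pass_a.items():
        cls_b, box_b = pass_b[instance_id]
        if cls_a != cls_b or iou(box_a, box_b) < min_iou:
            flagged.append(instance_id)
    return flagged


# Hypothetical labels: (class, box) per object instance, from two labeling passes
pass_a = {"img001#0": ("front", (0, 0, 10, 10)), "img001#1": ("side", (20, 0, 30, 10))}
pass_b = {"img001#0": ("front", (1, 0, 11, 10)), "img001#1": ("front", (20, 0, 30, 10))}
print(inconsistent_labels(pass_a, pass_b))  # ['img001#1']
```

The flagged instance is exactly the front-versus-side ambiguity Iain describes: the box agrees, but the two passes put it in different classes.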
We're having to spend quite a lot of time training a deep learning model and testing it, and then: okay, we're not getting very good results with this. Now, to retrain, we've got to go over all of those 2,000 images, redraw every single box on all of them, retrain, and see whether that gives us a different, better result. Is it more consistent? Is it more reliable? Do we need to start separating these into different classes? This is the object looking straight onto it, and this is the object from its side, so do we treat the two separately, rather than putting them all into one class and saying these are all the same? They do look quite different from the side and from the front. This is where we're finding that deep learning, although it's very powerful, and although it's enabling us to do something we really would have struggled with before using traditional techniques, is not free: there is a significant overhead cost. We're even seeing that some of the manufacturers and suppliers of this software, I won't say they're struggling, but they're finding that they're getting lots of inquiries: "We'd really like you to evaluate whether your products can work with our issue on our project; can you evaluate it for us?" But that evaluation is time-consuming and expensive, and uses up a lot of resources at the software manufacturer's end. So it's very powerful, but there's a considerable overhead in the time that goes in at the training level and what humans need to add to it.

Erik: In this situation, do you think the end result is going to be that out of these 500-odd instances, you'll say, well, these 300 are definitely a pass and these 20 are definitely a fail, but for the remainder we need to have a human go and look at them as a second checker?
Do you think you'll actually be able to reach sufficient accuracy that a human doesn't have to be involved at the end?

Iain: I think on this particular application, if we get to 80-90% success rates, with a human asked to intervene on the remaining few, then that will be acceptable for this customer in this particular use case. Clearly, if you're looking at 100% quality inspection on a production line, those sorts of numbers do not inspire confidence. But here, because it's a manual operation anyway, they're used to spending a lot of time inspecting and rechecking it, because it's a high-quality component that comes out at the end and it's not a fast process, so some amount of human intervention is acceptable. There's also the fact that compromises have been made on that system: we know that we're going to look at certain stages of the build, and some of these objects will naturally be obscured by a later part of the build, or even by something at the same stage. So there may be bits that we physically cannot see, and as good as deep learning is, it can't see something the image doesn't present to it. So there's an understanding about that for this particular project.

Erik: So far, we've always been talking about using machine vision with camera systems, right? But I guess you could use infrared, or some other sensor, in quite a similar way. Have you ever found that to be particularly useful? Or do you find that a standard camera system is generally the most effective solution?

Iain: So far, we have only looked at this with standard cameras. But certainly the Cognex products are capable of working with multichannel images, so color is fine, black and white is fine, and there's no reason why that data couldn't be, for instance, 3D data. With infrared or thermal imaging, when you get the image, it's basically in the same format as a standard color or black and white camera image anyway.
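The "definite pass, definite fail, human checks the rest" scenario Erik raises, and the 80-90% target Iain gives, can be sketched as a confidence-based triage. This is an illustration only, not the system Iain is building: it assumes the detector returns one confidence score per expected item, and the `triage`/`automation_rate` names and the 0.9/0.1 thresholds are hypothetical.

```python
def triage(scores, pass_above=0.9, fail_below=0.1):
    """Split per-item 'correctly fitted' confidences into auto-pass, auto-fail and human review."""
    buckets = {"pass": [], "fail": [], "review": []}
    for item_id, score in scores.items():
        if score >= pass_above:
            buckets["pass"].append(item_id)
        elif score <= fail_below:
            buckets["fail"].append(item_id)
        else:
            buckets["review"].append(item_id)
    return buckets


def automation_rate(buckets):
    """Fraction of items decided without a human looking at them."""
    total = sum(len(ids) for ids in buckets.values())
    decided = len(buckets["pass"]) + len(buckets["fail"])
    return decided / total if total else 0.0


scores = {"item_001": 0.97, "item_002": 0.04, "item_003": 0.55, "item_004": 0.92}
buckets = triage(scores)
print(buckets)                   # {'pass': ['item_001', 'item_004'], 'fail': ['item_002'], 'review': ['item_003']}
print(automation_rate(buckets))  # 0.75
```

Widening the gap between the two thresholds sends more items to review but makes the automatic decisions safer; tuning them against a labeled test set is how one would check whether an 80-90% automation rate is reachable.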
For 3D, again, we'd be very happy working with it, but it might need to be what we call a range image, a 2.5D image where you've got an XY image with the height as the pixel value, if you like, rather than a full point cloud. But the techniques work with any image format. What I haven't really touched on is that in these commercial products we're using, the neural network has been pre-trained and is already biased towards industrial types of images. So if we supply standard images, black and white or color, of industrial types of inspection, the deep learning is already preset to some degree. It doesn't have to work as hard to home in on the features that you're looking at. There's a bit of flexibility around this, but if you went to Amazon or somebody to take a fresh, off-the-peg neural network and train it from scratch on images, then you're not narrowing that down at all. So with the software products we're using, if you want to try and say, is it a dog or a monkey or a cat, well, that's probably an unlikely thing for an industrial inspection to be doing. But if we're looking at a printed circuit board and we want to find the ends of a chip, or check that the solder on each of the pins is correctly formed, then those sorts of images go through the deep learning networks we're looking at much faster, because the network has already been predisposed to work well with that image type.

Erik: I really appreciate the deep dive here. I think we covered a lot of territory today. Are there any last thoughts?

Iain: I think we have covered quite a lot. There are lots of other use cases. Deep learning, even in machine vision, is not new. But what we're now getting is commercially viable, off-the-shelf products that we can deploy in a reasonably short period of time. So this is now becoming really commercially viable and more and more accepted, and it's opening up avenues to us that previously were really shut.
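The 2.5D range image Iain mentions (an XY grid where each pixel holds a height instead of a brightness) can be built from a point cloud by binning points into grid cells. This is a minimal sketch assuming a sensor looking straight down and a point cloud in the same units as `cell_size`; the function and its parameters are hypothetical, and a real 3D sensor's software would normally provide this conversion, often with gap filling.

```python
import math


def to_range_image(points, x_min, y_min, cell_size, width, height):
    """Project (x, y, z) points onto a height x width grid, keeping the highest z per cell.

    Cells that receive no points stay NaN, which is how this sketch marks missing data.
    """
    image = [[math.nan] * width for _ in range(height)]
    for x, y, z in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        if 0 <= col < width and 0 <= row < height:
            current = image[row][col]
            # Keep the topmost surface, as seen by a sensor looking straight down
            if math.isnan(current) or z > current:
                image[row][col] = z
    return image


cloud = [(0.5, 0.5, 1.0), (0.6, 0.4, 2.0), (1.5, 0.5, 3.0), (5.0, 5.0, 9.0)]
print(to_range_image(cloud, x_min=0.0, y_min=0.0, cell_size=1.0, width=3, height=1))
# [[2.0, 3.0, nan]]
```

The result is an ordinary single-channel image, which is why, as Iain says, the same deep learning techniques can be applied to it.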
And it has been a bit of a wow factor, in particular when we started seeing Cognex ViDi, because the training user interface on that is so nicely presented that you're able to get somewhere with it very quickly. It really sent us looking back through our catalogue of applications where previously we'd said either we can't do it, or we really don't think this is going to be robust if we deploy it; it's right on the borderline, we think we can find this type of fault but not that one. You start looking back at those and thinking, actually, this would be possible; this could be really viable now. And we're seeing some of the software companies focusing in on certain use cases. One of the big ones that certainly Cognex and MVTec have picked up on is text reading, and [inaudible 58:02] has had a ready-trained font for industrial text recognition, OCR and OCV, that has been trained as a neural network with a deep learning technique and then given back to you as a runtime to just use. It's generally aimed at industrial markings, so we're not necessarily talking about handwritten stuff, but almost any industrial font that you get on a label, anything that's printed or marked on something for traceability, or even just food and pharma packaging where you've got date codes, lot codes, things like that. These fonts, which have been trained with a massive set of images of text behind them, are really robust. Without any teaching, you just say, read this line of text, and it's very robust at reading it back to you. We're seeing that as one of the key benefits: this can be pre-trained, so the end user doesn't have to do the training. They can just benefit straightaway from the fact that they've got a ready-made deep learning model that reads text.
And certainly, I know that some of the companies are looking at other use cases, other things they could focus in on. Maybe for logistics: this is a box; if you can then say, okay, that's the box, where do we look for labels on it? Then it speeds up everything. Or number plate recognition: finding a white rectangle or a yellow rectangle and reading the codes in there. Certain, maybe slightly niche, use cases, but areas where the training could be done before the product is sold, basically, and then you're ready to go with a pre-trained deep learning model that you can use straight out of the box to solve certain tasks. So I think we'll see more of that coming through as we go forward.

Erik: I guess for you, maybe it hasn't happened so quickly, but for a lot of people it seems like this has come from nowhere, and all of a sudden we're moving towards pretty cost-effective solutions. So thanks for walking us through it and giving us an update on where we are. I really appreciate your time, Iain.

Iain: No problem, pleasure.

Erik: Thanks for tuning in to another edition of the Industrial IoT Spotlight. Don't forget to follow us on Twitter at IotoneHQ, and to check out our database of case studies on IoTONE.com. If you have unique insight or a project deployment story to share, we'd love to feature you on a future edition. Write us at erik.walenza@IoTone.com.

Erik: Welcome to the Industrial IoT Spotlight, your number one spot for insight from industrial IoT thought leaders who are transforming businesses today with your host, Erik Walenza.

Welcome back to the Industrial IoT Spotlight podcast. I'm your host, Erik Walenza, CEO of IoT ONE. And our guest today will be Iain Smith, Managing Director and cofounder of Fisher Smith. Fisher Smith designs and supplies machine vision systems. And this is our second discussion with Iain on the topic of machine vision. In this talk, we focused on the use of deep learning algorithms to accomplish tasks that are challenging or impossible with traditional rules-based systems. We also walked through two case studies. The first case illustrates how deep learning can be used to solve some recognition problems in a matter of hours, while the second illustrates a difficult problem that would have been impossible with rules-based systems and is pushing the bounds of deep learning capabilities.

If you find these conversations valuable, please leave us a comment and a five-star review. And if you'd like to share your company's story or recommend a speaker, please email us at team@IoTone.com. Thank you. Iain, welcome back. And thanks for joining us again.

Iain: A pleasure to be speaking with you again, Erik.

Erik: Let's just do the 60-second background on who you are and what your company, Fisher Smith, does, and then we can dive into the topics that we're going to be focusing on today.

Iain: So I've been working in the machine vision industry for about 20 years now. I did a degree in engineering maths in the UK and then went pretty much straight into working for a builder of vision inspection machines. For the last 15, nearly 16 years, I've been running Fisher Smith, very much focused on just the vision aspects of industrial machine vision.

And this tends to be predominantly inspection, quality control and robot guidance tasks. We tend to work with a range of vision suppliers, and we add value to that equipment by integrating it, writing the software, doing the front ends, and deploying it: getting the systems actually working on the factory floor to solve the application. We're usually doing that through a chain of customers. Predominantly, those customers are automation companies, the people who are making the machines: they're doing the robots, the conveyors, the material handling, the moving of parts, the assembly, and we come in as a specialist supplier to do the vision aspects of that production line. So we're quite a small, specialized team focused just on machine vision for industry.

Erik: Today, we want to do more of a deep dive on deep learning and its impact. I think in our last podcast we covered more of what we could call the traditional solutions, even though some of those solutions are still on the cutting edge of machine vision technology. As a starting point, it would be great to understand your definitions, because we have this term deep learning, we have machine learning, and then we have AI as an umbrella concept. Are there technical differences between these terms? Or are they different categories in a hierarchy? How do you look at these? What does deep learning actually mean in the context of machine vision?

Iain: Yeah. So, I guess all of those terms are probably quite misused, and they get swapped and interchanged quite a lot. AI really sits above all of these as a more general concept of computer-based intelligence, and the general consensus is often that AI means a sort of human level of awareness and intelligence. What we're really only ever looking at is a specific, focused AI targeted at a particular function, and that's where the separation between AI and deep learning starts to happen.

So in an industrial vision context, we take deep learning to mean teaching a neural network on images in particular, and looking for particular characteristics in those images. We're not using neural networks for general data processing, or speech recognition, or any of the other data sources that can happily go into a deep learning neural network. We're focusing just on the image processing aspects of that, and even within image processing, we narrow it down again to specifically industrial applications.

Erik: Let's go into a little bit of how this differs from the traditional process. I suppose with a traditional process, you're looking at, say, a shape with a diameter of two millimeters, so if it's out of range by more than a set fraction of a millimeter, then it's a fault? Or maybe we're looking for black, and if there's a white pigment, then it's a fault? How does deep learning differ from a programmed approach that you might have taken, or that might still be taken in most cases today?

Iain: Yeah, so it's a very different concept really. Traditional machine vision and deep learning really complement each other in a lot of aspects. There are some areas where they overlap, where you think, I could do that one way, or I could do it the traditional way. But often, the two are separate. And like you say, the traditional methods tend to be rules-based or logic-based: you're saying, I'm going to count this many dark pixels or blue pixels in an image, or I'm going to make a measurement, which generally means finding an area of contrast or a feature in one part of the image, and a different area of contrast or a different feature in another part of the image, and then measuring between them.

For some of those techniques, classic machine vision is still the right way to do it. But where deep learning changes things is that it can cope with different ways of teaching it to start with, and then it can cope with different scenarios far better. The trick is finding where that separation is, to work out which one is best to deploy. For instance, say you're doing fault detection on a bland background, such as a gray conveyor belt or a gray surface: you're looking for scratches on a piece of metal or painted material, the background is consistently one color, a gray or a blue, and you know that the defects on it are black marks. That's fairly easy to set up with a traditional machine vision approach. You can say, okay, I'm going to ignore pixels that are blue or gray, and I'm going to look for anything that's different to that: I'm going to look for black pixels.

And then you can start counting them and saying, okay, this many black pixels is unacceptable to the customer; this is a fault, we reject the part. So that's the traditional logical, rules-based approach. But as soon as the background or the object is not a nice, consistent, even surface, if it has multiple colors or a surface texture, dark areas and light areas, if it's a fabric or some other complex shape and color like wood grain, all sorts of different surface textures, and you're looking for scratches going across them, the traditional techniques just completely fall down.
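The rules-based approach Iain describes for a uniform surface (ignore the background color, count dark pixels, reject above a limit) fits in a few lines. This sketch assumes an 8-bit grayscale image stored as a list of rows; the function name and both thresholds are hypothetical values chosen for illustration.

```python
def inspect_uniform_surface(gray, dark_threshold=60, max_dark_pixels=25):
    """Rules-based check for dark marks on an evenly colored surface (8-bit grayscale)."""
    dark = sum(1 for row in gray for value in row if value < dark_threshold)
    verdict = "REJECT" if dark > max_dark_pixels else "PASS"
    return verdict, dark


# Hypothetical 10x10 gray surface (level 128) with a 3x3 black mark
surface = [[128] * 10 for _ in range(10)]
for r in range(4, 7):
    for c in range(4, 7):
        surface[r][c] = 20

print(inspect_uniform_surface(surface))                     # ('PASS', 9)
print(inspect_uniform_surface(surface, max_dark_pixels=5))  # ('REJECT', 9)
```

On wood grain or fabric, the grain lines themselves fall below `dark_threshold`, so the count no longer separates good parts from scratched ones; that is the breakdown Iain describes for textured surfaces.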

Say you're looking for a scratch on a bit of wood, where your piece of wood already has all these grains, lines and contours running through it. How do you quantify a scratch with a rules-based approach? You can't say it's long and thin and dark, because all the other grains in the wood are already long and thin and dark. You may be able to say, well, it's running horizontally and all the grains are running vertically, but that may only work in some instances. It may be that some of the blemishes are lighter, some are darker, and some are actually a very similar color.

And this is often the frustration with traditional vision techniques. A human with a bit of training can look at that object and say, no, we don't like that blemish on this piece of wood or this piece of fabric; it's wrong. But try to codify that, to put logical rules around it: is it darker? Well, not all the time. Is it lighter? Sometimes, but no. Is it the same shape as, or a different shape to, what's already there in otherwise good product? It's almost impossible to use a traditional machine vision approach for that.

That's where deep learning really wins: you can give it lots and lots of samples of what is good, with all the variations that come through in good product, and then you can give it samples that are bad. And [inaudible 10:40] with suitable levels of training, separate those different classes and start to find faults that would have been almost impossible to find with traditional machine vision techniques.

I've touched on defects because that tends to be the quality control side of what we do, and that's where a lot of our projects end up going. But deep learning really wins in a few areas. One is defect finding. Another is object detection: if you don't have a well-defined shape or a strong contrast, you can use deep learning to match, find and locate features in an image.

The core, classic use case for deep learning is classification: separating apples from oranges from pears from bananas, and saying, yes, this is definitely of this type, and this is definitely of that type. We now often look at this as a layered approach. We may build up a deep learning application that looks for a defect type, and then, once we've found that defect, use deep learning as a secondary operation on it. Having extracted the defects, we then classify them: this is a scratch, this is an oil mark, this is a fingerprint, this is a piece of cardboard, all the different types. Then you can start to get proper separation and use that feedback to give the customer really valuable information about their process: where the defects are coming from, and what their high-value problems are.
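The layered approach above (one model finds defect regions, a second names each one, and the names are tallied as process feedback) can be sketched as a small pipeline. Here `detect` and `classify` are hypothetical stand-ins for the two trained models, passed in as plain functions; the sketch shows only how the stages chain together, not how any vendor implements them.

```python
from collections import Counter


def layered_inspection(images, detect, classify):
    """Stage 1 finds candidate defect regions; stage 2 names each one; tally types for feedback."""
    verdicts = {}
    tally = Counter()
    for part_id, image in images.items():
        labels = [classify(region) for region in detect(image)]
        tally.update(labels)
        verdicts[part_id] = "REJECT" if labels else "PASS"
    return verdicts, tally


# Stand-ins for the two trained models: here each "image" is already a list of regions,
# and each region is already its own label.
images = {"part_1": ["scratch"], "part_2": [], "part_3": ["oil mark", "scratch"]}
verdicts, tally = layered_inspection(images, detect=lambda img: img, classify=lambda region: region)
print(verdicts)             # {'part_1': 'REJECT', 'part_2': 'PASS', 'part_3': 'REJECT'}
print(tally.most_common())  # [('scratch', 2), ('oil mark', 1)]
```

The `tally` is what turns a pass/fail inspection into the process information Iain mentions: which defect types dominate and are therefore worth chasing upstream.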

Whereas with traditional techniques, some of that is possible. But certainly, if you've just found an area of dark pixels and flagged it as bad, then trying to put lots of rules around it, saying, well, if it's long and thin and in this direction, it's a scratch, but if it's long and thin and curved, then maybe it's a bit of fabric or a hair or something like that, and actually separating those, might be very difficult in a rules-based way. Deep learning, on the other hand, is really set up to do all of that.

Erik: So, in a deep learning environment, it's easier then to categorize, because the algorithm can tell you that the shapes are somehow similar. But one of the complaints about deep learning, or one of the challenges, is that it's also somewhat of a black box: it can tell you these are similar, but it doesn't necessarily explain why they're similar. How do you get around this understanding barrier?

Iain: This is one of the key hurdles we have when explaining this, or even when getting to the point of signing a machine or a system off with the customer. You've got to say to them, well, we've put all of your good images in this side of the black box, some magic has happened in the middle, and at the other end it said this one's good and that one's bad. And although some people can and do understand what's going on inside that neural network, for most practical purposes there's no real way of seeing it.

But the reality is that you don't necessarily know what features the deep learning has chosen in that image to separate one class from another, good from bad, apples from oranges. And it might be something that you don't expect. This is something we've got to be quite careful with, because you can inadvertently end up thinking you've taught a particular fault type when what the deep learning has actually picked up on is not the fault itself, but maybe the proximity of that fault to the edge of the part.

So if, for instance, in all of your training images the scratch is next to the edge of the part, then the deep learning may have detected some aspects of the scratch, a little bit of contrast, a change of texture, whatever it may be, but it may also have picked up the edge of the part as a significant feature that always appears near it. Then, when the same scratch, which might be otherwise identical, appears in the middle of the part, the deep learning misses it completely. So we have to understand that we can't assume it has found the scratch or the blemish, or that it understands what the blemish is.
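One cheap sanity check against the edge-proximity trap Iain describes is to look at where the labeled defects sit before trusting the model. This is a sketch under assumptions: the part is treated as an axis-aligned rectangle from (0, 0) to (`part_width`, `part_height`) in image pixels, defects are represented by their center points, and the function name and `margin` are hypothetical.

```python
def edge_fraction(defect_centers, part_width, part_height, margin):
    """Fraction of labeled defects whose center lies within `margin` pixels of the part boundary."""
    if not defect_centers:
        return 0.0
    near_edge = 0
    for x, y in defect_centers:
        # Distance to the closest of the four sides of an axis-aligned part
        distance = min(x, y, part_width - x, part_height - y)
        if distance <= margin:
            near_edge += 1
    return near_edge / len(defect_centers)


training_scratches = [(2, 50), (98, 40), (50, 3), (4, 96)]
print(edge_fraction(training_scratches, 100, 100, margin=5))  # 1.0: every example hugs an edge
```

A value close to 1.0 means the model has never seen a mid-part scratch, so it may have learned "near the edge" rather than "scratch"; the remedy is to collect or synthesize examples elsewhere on the part.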

There are other factors in play. The way the neural networks are trained, the image is broken up, manipulated, changed and transformed in various ways that are mathematically sound but can be quite abstract. And it may be finding some data in there to separate the classes that you can't visibly tell, or can't obviously see or attribute to what you've got in front of you.

Erik: There was a study out recently, I think it was Google, using deep learning to look at lung cancer, and they compared it against results from whatever kind of specialist typically assesses cancer in the lung. I think the result was that there was a class of cases where the humans could easily identify that this was a cancer and the machines were missing it, and there were others where the machine would identify it with very high accuracy and the humans completely missed it. So in some cases they're using different processes for identification. This, I imagine, can be very frustrating for somebody who's trying to understand the quality control process.

Iain: Yes. Traditional vision techniques have been around for a good number of years now, so they're starting to become generally understood in industry. The purchasing or project people who have brought a number of projects with machine vision into their business have an understanding of what they want the system to achieve, and of how they want us to present our methodology, so that they can understand what's happening and sign it off, knowing that it's a robust method and that they've got control if they need to adjust something.

They can see what's happening in there: they can see that the features they've asked us to detect are being detected, that when those features change, the values change, and we can prove that we're meeting their specification. With deep learning, it's very much a results-based sign-off, because you're asking the customer to give us the good ones and the bad ones; we put them through and give them the result, which says these ones are good and these ones are bad. But we can't tell them how it made that choice, because that has all happened in the deep learning.

Erik: In the example you gave earlier, with scratches on the edge or on the inside, the challenge seems to be that there were not enough samples in the training data of situations where the scratch was on the inside, so we didn't have sufficient data on that case to train the model. Let's say the quality control is terrible: well, then you have a ton of examples of faults as well as some good training data. But where quality control is already fairly good, you might not have that many examples of actual faults. How do you go about collecting the data, and how do you know when there is sufficient or insufficient data to train a deep learning model?

Iain: This is one of the big separations between traditional machine vision and deep learning. With traditional machine vision, you can basically take a single image and set your rules, your logic and your tests up on that image, and say, well, I'm measuring this, I want to find this. If the pattern changes more than 10% from what I've taught it, I'm going to reject it. If the number of pixels I'm counting over here exceeds a certain value, I'm going to reject it. You can set those rules up on one image if you need to, and then you can start challenging it with other images.
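
To make the contrast concrete, here is a minimal Python sketch of the kind of explicit rules described above. The threshold values, the `InspectionRules` class and the `inspect` function are hypothetical illustrations, not any vendor's API; a real system would get the pattern score and pixel count from its vision library's own tools.

```python
"""Minimal sketch of rule-based inspection logic.

Assumptions (hypothetical, for illustration): the pattern match score is a
similarity in [0, 1] (1.0 = identical to the taught pattern), and the defect
measurement is a count of dark pixels in a region of interest.
"""
from dataclasses import dataclass


@dataclass
class InspectionRules:
    max_pattern_change: float = 0.10  # reject if pattern differs >10% from taught model
    max_dark_pixels: int = 50         # hypothetical pixel-count limit for the ROI


def inspect(pattern_score: float, dark_pixel_count: int,
            rules: InspectionRules) -> tuple[bool, list[str]]:
    """Return (passed, reasons) using explicit, human-readable rules."""
    reasons = []
    change = 1.0 - pattern_score
    if change > rules.max_pattern_change:
        reasons.append(f"pattern changed by {change:.0%}")
    if dark_pixel_count > rules.max_dark_pixels:
        reasons.append(f"{dark_pixel_count} dark pixels > limit {rules.max_dark_pixels}")
    # Every reject comes with a reason the customer can read and sign off on.
    return (not reasons, reasons)


rules = InspectionRules()
print(inspect(0.95, 12, rules))   # within both limits
print(inspect(0.80, 120, rules))  # breaks both rules, each with a reason
```

Because each limit is a named, editable number, the customer can see exactly why a part failed, which is the transparency that a deep learning sign-off lacks.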

With deep learning, you need a quantity of images to start with, and the quantity is variable. Depending on how subtle the faults are, you may be able to get away with tens of images as a starting point. But if you want something to get really robust, then more is more. If you can provide more samples of all the different classes you're trying to put into that deep learning model, then you're going to get better, more robust results. But you also need to spend the time to actually separate them as well. If you've only got one image of a single fault, then the chances of reliably finding that fault in multiple locations are limited. There are techniques to help, though.

One of the bits of software we've used most for this is a Cognex product called ViDi, which has the option to introduce variations into your training images. So you can basically take the images that are part of your training and say, well, I'm going to rotate all of the trained images through ±90 degrees in two-degree steps, and I'm going to make them all slightly brighter and slightly darker, to artificially introduce some variations which we may expect, as a way of multiplying up the training set.

So if I've only got a handful of images of faults, I can add some of these manipulations, in position, size, scale, skew, brightness and contrast, to try and make my training set more robust from day one. It can certainly help, but it's still no substitute: if you really don't have the data to start with, then it's going to be hard to work with.
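
A rough Python sketch of how such manipulations multiply up a training set, assuming the ±90 degree, two-degree-step rotations mentioned above plus three hypothetical brightness offsets. It only enumerates the settings and applies a brightness/contrast change to a tiny grayscale image; the rotation itself would be left to an imaging library, and the names here are illustrative, not ViDi's interface.

```python
"""Sketch of multiplying up a small training set with synthetic variations.

Enumerates augmentation settings and applies a brightness/contrast change to
an 8-bit grayscale image stored as nested lists. The offsets and gains are
assumptions; rotation is only enumerated, not applied, in this sketch.
"""
from itertools import product

ANGLES = range(-90, 91, 2)    # ±90 degrees in 2-degree steps -> 91 angles
BRIGHTNESS = (-20, 0, 20)     # hypothetical offsets: darker, unchanged, brighter
CONTRAST = (1.0,)             # contrast gain kept fixed in this sketch


def augmentation_plan():
    """Every (angle, brightness_offset, contrast_gain) combination to generate."""
    return list(product(ANGLES, BRIGHTNESS, CONTRAST))


def adjust(image, offset, gain):
    """Apply gain and offset to 8-bit pixels, clamping to the 0..255 range."""
    return [[max(0, min(255, round(p * gain + offset))) for p in row] for row in image]


plan = augmentation_plan()
print(len(plan), "variants per source image")   # 91 angles x 3 brightness levels
print(adjust([[0, 128, 250]], 20, 1.0))          # brighter, with clamping at 255
```

A handful of fault images becomes a few hundred variants each, but they are all derived from the same few defects, which is why this helps without replacing real data.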

This probably also leads into a separation, certainly in the sort of commercial software that we tend to use, between two different techniques in deep learning. One is where you specifically teach the faults. Certainly with the Cognex product, we refer to that as supervised training, where you are specifically drawing onto each one of those faults: this is where the fault is. You're highlighting it on each failure image, saying this is a fault of this type, and you have to manually mark on all those images specifically where the fault is. That then allows you to be much more targeted about the fault type or the feature that you're trying to find. But the downside is that you do need to have lots of different variations.

So if you're looking at scratches, you want a good training set where those scratches are different sizes, different lengths and different curvatures, in the different positions where you'd expect to find them. Because with that method, everything you're teaching in, including the position, the size, the length and the pixels that surround the area you've marked, goes into the mix when the neural network is trained to decide how to separate that class, or that fault, from everything else.

If you haven't got very many reject images, the other way, which we call unsupervised training or novelty detection, is where you just train it on good images. You say, all of these images are of good products, these are all acceptable. You train those up, and then you ask it to find things that are different. That's a good way to go if your faults work with that technique; some do, some don't. If they do, then you're only selecting good ones for training, and it highlights areas on the images being inspected which do not conform to what the good ones look like.
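
The idea of training only on good parts can be illustrated with a deliberately simple stand-in. The Python sketch below is not deep learning and not how any commercial product works internally: it learns a per-pixel mean and standard deviation from good images and flags pixels in a new image that deviate by more than `k` standard deviations. The threshold `k`, the noise floor `min_std` and the toy images are assumptions chosen for illustration.

```python
"""Toy novelty detection: learn only from good images, flag what differs.

Stand-in statistical technique (not a neural network): per-pixel mean and
standard deviation over good images, then a k-sigma test on new images.
Images are equal-sized grayscale grids stored as nested lists.
"""
from statistics import mean, pstdev


def fit_good(images):
    """Return per-pixel (mean, std) computed from good images only."""
    rows, cols = len(images[0]), len(images[0][0])
    model = []
    for r in range(rows):
        model.append([])
        for c in range(cols):
            values = [img[r][c] for img in images]
            model[r].append((mean(values), pstdev(values)))
    return model


def anomalies(model, image, k=3.0, min_std=2.0):
    """Return (row, col) positions that do not conform to the good images."""
    flagged = []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            mu, sigma = model[r][c]
            # min_std stops near-identical training pixels flagging tiny noise.
            if abs(value - mu) > k * max(sigma, min_std):
                flagged.append((r, c))
    return flagged


good = [[[100, 100], [100, 100]], [[102, 98], [101, 99]], [[99, 101], [100, 100]]]
model = fit_good(good)
print(anomalies(model, [[100, 100], [100, 100]]))   # conforms: nothing flagged
print(anomalies(model, [[100, 100], [100, 30]]))    # dark spot at (1, 1)
```

No fault images were needed to catch the dark spot, which mirrors the appeal of novelty detection; the trade-off is that subtle faults inside the normal variation of good parts can slip through.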

So it allows you to detect faults, differences or variations that you may not have seen before. Often we combine the two techniques as well, because you might want general defect detection, which looks for differences from the good products, but there may also be some very specific, quite fine or subtle features that that technique doesn't reliably separate. It might find them when they're more obvious, but when they're more subtle, we need to go to a more direct, supervised method of training, where you're highlighting the faults manually as part of your training. To combine the two, you have the two deep learning models running either sequentially or in parallel, to give you the best of both worlds really.
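
A minimal Python sketch of running two models sequentially or in parallel. Both detector functions are hypothetical placeholders standing in for trained model runtimes; the combination logic (reject if either reports a defect) is the part being illustrated. The sequential version runs the general check first and only runs the targeted check if nothing was found, while the parallel version runs both and merges all findings.

```python
"""Sketch of combining a general novelty detector with a targeted
supervised detector, either sequentially or in parallel.

The detectors are hypothetical placeholders; in practice each would call a
trained model's runtime. A part is rejected if either reports a defect.
"""
from concurrent.futures import ThreadPoolExecutor


def novelty_detector(image):
    """Placeholder: treat any pixel below 50 as a gross anomaly."""
    return [("anomaly", p) for p in image if p < 50]


def scratch_detector(image):
    """Placeholder: treat values in 90..110 as subtle scratches."""
    return [("scratch", p) for p in image if 90 <= p <= 110]


def inspect_sequential(image):
    # General check first; skip the targeted check if the part already fails.
    found = novelty_detector(image)
    if not found:
        found = scratch_detector(image)
    return (not found, found)


def inspect_parallel(image):
    # Run both models concurrently and merge every finding (order preserved).
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(lambda detect: detect(image), (novelty_detector, scratch_detector))
        found = [f for findings in results for f in findings]
    return (not found, found)


part = [200, 30, 100]   # toy "image" as a flat list of pixel values
print(inspect_sequential(part))
print(inspect_parallel(part))
```

Sequential running saves time on parts that fail early; parallel running reports every defect type at once, which can be more useful for process feedback.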

Erik: And then I guess a follow-up on this training: is there any off-the-shelf software where somebody can purchase it, plug it in, upload images, and output a functional model? Or does pretty much all of this require a data scientist right now to get into the dataset in order to create a functional model?

Iain: You can go either way on this. On one level, anyone can go and register for an account with Google, Amazon or Microsoft and get access to their virtual machines. You can use open-source deep learning libraries, things like TensorFlow, which you can start using for absolutely minimal cost, and you can go and train a deep learning network. There's obviously an overhead to that: you've got to understand approximately what you're doing. But there are tutorials out there.

So at one level, you can go off and do it for absolutely minimal cost. You don't need to have any of your own hardware; you can rely on uploading images to the cloud and using virtual machines to do all that training. What we're really focusing on is commercially available products. Although the data-scientist side of things is very interesting, for us it's not very commercially viable. If we can deploy a solution in a relatively short space of time, then that allows us to solve a customer's problem, get it working on the factory floor, and then move on to the next project.

We don't want to be tied up for hours or days or months training up neural networks. So to that end, we certainly focus on commercially available products, and really what I've talked about today is based on these. To mention a couple, I've already mentioned one, which was the Cognex ViDi software. We also use the [inaudible 27:28] deep learning library from MVTec, the German company.

These are commercially available deep learning products, where you have a software user interface that allows you to do the training. You can import images and mark on them the features that you're interested in. Then you can train the model locally on local hardware and deploy it, again within the framework of those software environments. This is really where we see, for direct industrial deployments, the most sensible route to market for us.

So it may not be the most powerful or the most flexible. Obviously, companies like Google have got their self-driving cars spotting street signs, shop fronts, other cars and number plates, all within very, very complex and powerful neural networks. With the deep learning and AI things we're looking at, we usually don't have the benefit of using the cloud as a computing platform, for a couple of reasons.

One is a commercial reason: these are often specific, proprietary processes and products that companies are using, and they're generally cautious about exfiltrating all of that image data and their quality control data up to the cloud to be stored somewhere else. But then there are also practical aspects. The training is one element of it, but when we're talking about deploying a deep learning system, this is often on a production machine.

We might be checking multiple parts a second, in which case consider the practicality of taking an image with a camera on a production line, uploading that image to some cloud service, doing the deep learning on a virtual machine, getting the results and sending them back down to the machine to pass or fail. It might be quick enough, but we might not be able to guarantee the reliability or the latency. It's going to be dependent on local network conditions, local internet conditions and other bottlenecks. It would be too risky and too difficult. So we're usually looking at what's described as edge inference, where you're using a local computer to do the deep learning runtime, the inference, locally to the camera, and give the results back there and then. That then requires that the computer has the capability to do that.

And this is where you generally start looking at having GPU acceleration. For most of the commercially available products that we're working with, that GPU acceleration is Nvidia-based, because Nvidia's CUDA libraries are the ones that are leveraged in the software. Basically, the higher the spec of graphics card you work with, the faster your network, your deep learning, will run. And that's how we tend to approach the deployment.

On the training side, all of the same things really apply: what we're generally doing when we're training is using a local computer with graphics acceleration. For instance, we've got a gaming-spec laptop, and that is capable of training these deep learning models within minutes if it's a small image set with a small image size, or maybe hours; we might sometimes leave it running overnight to do a more complicated set of training. The training is generally more of an offline process, so using a cloud-based service for that is more viable. But it's also a reasonable overhead to do it locally on local hardware.

What we are starting to see is that companies like Cognex are exploring whether or not a cloud-based service for this will work. But again, we've got a couple of issues. One is that we might have gigabytes of images to upload: the more images you can take and supply to the deep learning, generally, the better and more robust your model is. And certainly here in the UK, we have asymmetric internet connections, so the upload speeds might only be 10% or 20% of your download speeds.

So to upload tens or hundreds of gigabytes of images to a cloud service might take you a couple of days, and running stuff on a local computer can be beneficial from that point of view. The other issue is that the software we're using is licensed. Rather than the sort of Google and Amazon approach, where it's largely open source, we're using proprietary, licensed software. If you've got a physical USB dongle that you can use with a local laptop or a local PC, how does that licensing work if you have a virtual machine in the cloud? That could be a little bit tricky.
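
The upload arithmetic is easy to check. This Python sketch assumes a hypothetical 50 Mbit/s download line with upload at 10% or 20% of that, decimal units, and no protocol overhead; under those assumptions 100 GB takes roughly one to two days to upload.

```python
"""Back-of-envelope upload time for a training image set.

Assumptions (hypothetical figures, decimal units): a 50 Mbit/s download line
whose upload speed is 10% or 20% of that; protocol overhead is ignored.
"""


def upload_hours(dataset_gb: float, download_mbps: float, upload_fraction: float) -> float:
    """Hours to upload dataset_gb gigabytes at upload_fraction * download_mbps."""
    upload_mbps = download_mbps * upload_fraction
    megabits = dataset_gb * 1000 * 8          # GB -> MB -> Mbit
    return megabits / upload_mbps / 3600      # seconds -> hours


for gb in (10, 100):
    for frac in (0.10, 0.20):
        h = upload_hours(gb, 50, frac)
        print(f"{gb:>4} GB at {frac:.0%} of 50 Mbit/s: {h:5.1f} h ({h / 24:.1f} days)")
```

Changing the assumed line speed scales the result linearly, so the same function can be used to sanity-check a real site's connection before choosing cloud or local training.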

So we are seeing some movement, certainly with some of the dominant companies like Cognex, who are starting to explore whether that's a model that we, as partners and integrators, could use as a benefit of our partnership: rather than maintaining our own local infrastructure, we rely on our supplier's infrastructure to do that training. And then further on, would customers want to do that? Some customers absolutely wouldn't want their data to be stored in the cloud somewhere. They'd be much happier to have it locally, even if that meant investing in some decent PC hardware to train on.

Erik: Yeah, and that's, I think, more or less what we see across other IoT use cases as well. There's a general move towards the cloud, so there's a trajectory in that direction, but it's slow and very cautious. And it's not so much the cost model that's the concern; that's why people are moving in that direction. But any use cases that are deemed to put IP at risk, I think those are going to move very slowly.

Iain, can you walk us through a situation, maybe with a customer you have in mind that you've worked with? You don't have to mention the customer's name, but just walk us through from the initial conversations, through the decision process of what is the right approach for this particular situation, and then how you select the right technology. What's your evaluation process? I think it's quite useful for people to have an end-to-end perspective of how this would be deployed.

Iain: I can probably talk you through a couple of use cases that we've come up with. One that we had a little while ago was for some plastic lenses. These are solid lenses; I think they were used in some sort of smoke or fire detector systems. They were being inspected manually, and the way they inspected them was to have a human look through them at a grid underneath, a white piece of paper with black lines on it. If the lens is correctly formed, you see that grid distorted in the classic fishbowl or pincushion type of distortion.

But if there was a blemish in that lens, then you'd get an anomaly where the lines wouldn't be evenly distorted; as you look through the lens, they would be uneven and varying. We looked at that and tried to do it with a traditional machine vision approach. I wouldn't say it was impossible, but it was very difficult to do it robustly, partly because the product had a little bit of variability, which was acceptable. It wasn't a precision product. It had to be uniform, but that uniformity could vary a little bit: so long as it was an even distortion, it could vary.

So if you tried to set up a rules-based system to find the edges, which were black and white, so nice and straightforward, then you could do that. But then the spacing between them would change all the time, not by very much, but it could be uneven depending on how that distortion occurred. And then you start doing that in two axes. And then what happens when you get a bubble or a dark mark or a blemish in there? Can you detect that, because now you need to be looking for black pixels in between the grid lines? It wasn't impossible, but it was very, very difficult to do robustly.

Whereas we were able to basically reproduce that setup, take the images and put them into Cognex ViDi in this instance. And with actually only a few samples, we're talking about 10 samples of good product, we could then put in one with a bubble in it, one with uneven curvature on the lens, one with a scratch or a spot in it, and straightaway it would pick up those features. Regardless of the distortions in the grid that you were seeing, it could pick up those blemishes very quickly.

And actually, because these parts took quite a long while to make, we had 20 or 30 seconds per part to do the inspection. Once it had been trained, we worked out that we could actually run it without GPU acceleration. We could run it on a CPU; it wasn't very quick, we were talking 10 or 15 seconds per inspection, but we had that amount of time, and it was acceptable to the customer. It meant that we could go with almost a standard machine vision hardware setup for the deployment, and just do the deep learning training on graphics-accelerated hardware.
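
The timing decision can be written as a simple worst-case check. This Python sketch uses the figures quoted above (20 to 30 s per part, 10 to 15 s CPU inference) and adds a hypothetical 2 s allowance for capture, post-processing and I/O; comparing worst-case processing with the shortest cycle is a conservative assumption, not a described procedure.

```python
"""Does CPU-only inference fit the production cycle?

Ranges are (min, max) in seconds. The 2 s overhead for image capture,
post-processing and I/O is a hypothetical assumption.
"""


def fits_cycle(cycle_s: tuple[float, float], inference_s: tuple[float, float],
               overhead_s: float = 2.0) -> tuple[bool, float]:
    """Return (fits, worst-case margin in seconds)."""
    shortest_cycle = cycle_s[0]
    worst_processing = inference_s[1] + overhead_s
    margin = shortest_cycle - worst_processing
    return margin >= 0, margin


print(fits_cycle((20, 30), (10, 15)))    # CPU figures from the lens project
print(fits_cycle((0.5, 0.5), (10, 15)))  # same model on a line checking 2 parts/s
```

The same model that fits comfortably on a slow line fails badly on a fast one, which is why GPU acceleration becomes necessary for higher-throughput inspections.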

But one of the benefits for us of using something like ViDi or [inaudible 39:08] to do this is that when you're deploying it to the customer, you need the ability to take images from the cameras, and those products have the traditional vision tools and the camera interfaces as well. So we could use the same software environment to grab the image, put it through the deep learning, and maybe do a little bit of post-processing on the result with standard vision tools.

For example, we could put some user-editable thresholds on the size of the defects being found: basically taking the defects that are found and doing standard blob and pixel-counting analysis on them, which then becomes runtime-editable by the user, so they can trim their quality levels without having to retrain the deep learning. Then we wrap all of that up in a user interface. So that was one application we've looked at.
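
The post-processing step can be sketched in Python as follows, assuming the deep learning runtime returns a binary defect mask. Connected defect pixels are grouped into blobs (4-connectivity here), and only blobs at or above a user-editable area threshold cause a reject, so the threshold can change at runtime without retraining. The function names and the mask are illustrative, not a vendor API.

```python
"""Post-processing a deep learning defect mask with standard blob analysis.

Assumption: mask[r][c] == 1 marks a pixel the model flagged as defect.
"""
from collections import deque


def blobs(mask):
    """Return the pixel area of each 4-connected blob of 1s in the mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, area = deque([(r, c)]), 0
                while queue:  # breadth-first flood fill of one blob
                    y, x = queue.popleft()
                    area += 1
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                areas.append(area)
    return areas


def judge(mask, min_reject_area):
    """Fail the part if any blob is at least min_reject_area pixels (user-editable)."""
    big = [a for a in blobs(mask) if a >= min_reject_area]
    return (not big, big)


mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(blobs(mask))                      # one 3-pixel blob, one 1-pixel blob
print(judge(mask, min_reject_area=5))   # tolerant setting: passes
print(judge(mask, min_reject_area=2))   # stricter setting: fails
```

Only `min_reject_area` changes between the two runs; the model output is identical, which is what lets the customer tune quality levels without a retraining cycle.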

Erik: Could you give me a ballpark for what the timeline would be for the development of this, and also what the full budget, including hardware and software, might be for a deployment like this?

Iain: Yeah, sure. In terms of the timescale for us evaluating it, once we got samples from the customer, I think we had the deep learning model trained in an hour or so. Once we had images, the training took probably less than an hour, and we were able to start testing and effectively go back and demonstrate to the customer within a few days. So that aspect of it was quick, because it just worked.

The training worked very quickly, we got a good result straightaway, and we were able to move to the deployment. The overall cost of that, I think, was somewhere, if I say ballpark, around 30,000 euros. That covered the hardware, so a reasonably high-resolution camera, the lighting, the grid that we were using as part of the imaging solution, and an industrial computer to do the processing; the software license; and our time to put a little user interface together, do all the communication, and do the sign-off and the installation aspects of the integration.

And that compares reasonably favorably to traditional machine vision. If we had done it with a smart camera of the same sort of resolution, for instance, we might have been at 75% or 80% of the cost, so in the same ballpark. That's not always the case, because as soon as you start looking at faster, more high-end systems, where we need more graphics cards, bigger hardware, and more involved image processing and deep learning time, those costs can obviously escalate quite a lot.

Erik: How would this scale? Let's say it's 30,000 euros to develop the solution initially, and they say, we have five other factories, or five other production lines, that are exactly the same and we want to deploy this. Would it also be 30,000, or are you looking at 70%, 50%? What would the cost look like if you wanted to scale the exact same solution?

Iain: On that particular one, you'd be looking at 70-ish percent to do repeats, because a lot of the overhead, writing the user interface, doing the development and doing the training of the deep learning, has all been done already. So then there are the hardware costs, the licensing for the software, and still some deployment charges to actually get the kit on site, set up and working. But obviously, we're then not having to create a user interface or anything like that, because that can be copied over from the previous one.

On larger deployments, that difference could be greater; we could be down to 50-60% repeat costs. In a lot of what we do, the repeats very rarely go into the tens or hundreds. It tends to be either a single production line, or 5, maybe 10 at the most. And if you went to a really large scale, say, just picking names out of the air, Samsung or Sony, where they've got maybe hundreds of production lines making consumer electronics, then maybe you wouldn't necessarily look at some of these commercial off-the-shelf products.

You might be considering starting from scratch, because, okay, the development overhead is much, much higher, but then the deployment costs could potentially be much, much lower. Having said that, if we said to our suppliers, we've got a 100-off or a 1,000-off deployment, they'd be very keen to discuss commercial terms on that basis. So I'm sure costs for various things would potentially come right down at those sorts of levels.
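
Using the ballpark figures above (around 30,000 euros for the first system, repeats at roughly 70%, or 50-60% on larger deployments), the rollout arithmetic for one initial line plus five repeats can be sketched in Python. These are indicative numbers from the conversation, not a quotation.

```python
"""Rough cost of rolling the same solution out to further lines.

Ballpark figures from the discussion: about 30,000 EUR for the first system,
repeats at roughly 70% (down to 50-60% on larger deployments).
"""


def rollout_cost(first_eur: float, repeats: int, repeat_fraction: float) -> float:
    """Total cost for one initial system plus `repeats` copies."""
    return first_eur + repeats * first_eur * repeat_fraction


for frac in (0.70, 0.60, 0.50):
    total = rollout_cost(30_000, 5, frac)
    print(f"first line + 5 repeats at {frac:.0%}: {total:,.0f} EUR")
```

The saving per repeat comes from the one-off work (user interface, development, deep learning training) not being redone, while hardware, licences and on-site deployment recur each time.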

Another example, and one that we're actively working on at the moment, is a manual assembly process where we're trying to detect a particular feature: the customer bolts 500 or 600 of these particular items onto the framework that they're building. What they're trying to do is to reduce the amount of human inspection required to validate that every single one of those has been correctly placed and is in the right position.

Rather than a fully automated production line, where we can put the camera directly over the object we're looking at and control the lighting and the testing of it, this is a large item with multiple people climbing ladders, moving around it and bolting stuff on, so we're restricted in what we can do. We haven't got the same control over our environment. We're having to set the cameras back from the objects so that people can move around in front of them. We can't shine stupidly bright lights at the surface to evenly illuminate it; we've got to rely on the ambient lighting.

All of which makes detecting these objects very tricky, because they appear at all sorts of different angles and heights. We've got a high-resolution color camera looking at the scene, or actually several of them looking all the way around, and we're trying to determine, for all of these different objects, are they there? And as the objects themselves are bolted in, they can twist and conform. They hold on to other parts of the build, and depending on how hard they're tightened and what they're gripping, they deform slightly and look a bit smaller.

They come basically in one color and a couple of different sizes, but the color is not a controlled aspect of the build: as long as it's functional and the color is approximately right, it's acceptable. So we've got all of these variations: lighting and shadowing, the multiple positions, poses and angles that these appear at, and variation in the actual color and size of the objects. We're trying to locate all of them around this surface, and we're using deep learning at the moment to teach it what this feature looks like.

We're now up to, I think, training on 2,000 or 3,000 images, and each one of those images might contain multiple instances. The other thing I haven't really touched on is that all these deep learning techniques rely on a ground truth: a human to say, this is good, this is bad, or this is this type of feature, this is that type of feature. All those images need to be labeled correctly, as accurately and consistently as possible, to allow the deep learning to say, well, I know that these are this class and these are that class. Without that human input at the start, in training, it doesn't know what it's classifying.

So that has been quite a big overhead in terms of time. It's not necessarily particularly high-skilled, but it does require somebody to sit there and say, I'm going to draw a box around this one, now I'm going to draw a box around this one. And what we're trying to find out at the moment is: what is the best technique? How do we best encapsulate these objects?

Because sometimes they're fully visible, sometimes they're partially visible, sometimes we see more of the side of it than the front of it. How do we go about teaching that? And there is sort of no right answer to this. It becomes a bit of trial and error, bit of trying one technique. Okay, we're going to train all of these, where we're only going to focus on certain aspects of it. And if we can't see that aspect we won't train them. And then we train another version of the same model, where we include the surroundings of it much more and say, this is it, but it's surrounded by this so we have bigger areas to define it.

And we're having to spend quite a lot of time training a deep learning model, testing it. We're not getting very good results with this. Okay. Now, to retrain that, we've got to go over all of those 2,000 images, and we've got to redraw every single box on all of those and retrain it and see does that give us a different better result. Is that more consistent? Is that more reliable? Do we need to start to separate these into different classes? This is the object looking straight onto it.

And this is the object from its side. So do we treat the two separately, rather than putting them all into one class and saying these are all the same? They do look quite different from the side or from the front. Do we start to separate them and say, this is the object from the front and this is the object from the side? And this is where we're finding that deep learning, although it's very powerful, and although it's enabling us to do something that we really would have struggled with before with traditional techniques, is not free: there is a significant overhead cost.

And we're even seeing that some of the manufacturers and suppliers of this software, I wouldn't say they're struggling, but they're finding that they're getting lots of inquiries: we'd really like you to evaluate whether your products can work with the issue on our project; can you evaluate it for us? But that evaluation is time-consuming and expensive, and it uses up a lot of resources at the software manufacturer's end. So it's very powerful, but there is a considerable overhead to it in terms of the time that goes in at the training level and what humans need to add to that.

Erik: In this situation, do you think the end result is going to be that out of these 500-odd instances, you'll say, well, these 300 are definitely a pass, these 20 are definitely a fail, but for the remainder we need a human to go and look at them as a second check? Or do you think that you'll actually be able to reach sufficient accuracy that a human doesn't have to be involved at the end?

Iain: I think on this particular application, if we get to 80-90% success rates, with a human asked to intervene on the remaining few, then that will be acceptable for this customer in this particular use case. Clearly, if you're looking at 100% quality inspection on a production line, those sorts of numbers do not inspire confidence. But with this, because it's a manual operation anyway, they're used to spending a lot of time inspecting and rechecking it, because it's a high-quality component that comes out at the end. It's not a fast process, so some amount of human intervention is acceptable. There's also the fact that compromises on that system have been made, where we know that we're only going to look at some aspects of the build. Some of these objects will naturally be obscured by a later bit of the build, or even by something at the same stage, so there may be bits that we physically cannot see. And as good as deep learning is, it can't see what the image doesn't present to it. So there's an understanding on that for this particular project.
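
Erik's pass / fail / human-review split can be sketched as confidence-based triage. In this Python sketch the per-instance scores, the thresholds `pass_at` and `fail_at`, and the toy score list are all hypothetical; in practice thresholds would be tuned on validation data so that the automatically decided fraction reaches something like the 80-90% mentioned above.

```python
"""Sketch of confidence-based triage: auto-decide confident results and
send the uncertain remainder to a human checker.

Assumption (hypothetical): each detected instance has a score in [0, 1]
that the object is present and correctly fitted.
"""


def triage(scores, pass_at=0.9, fail_at=0.2):
    """Split instance indices into auto-pass, auto-fail and human-review lists."""
    result = {"pass": [], "fail": [], "review": []}
    for idx, s in enumerate(scores):
        if s >= pass_at:
            result["pass"].append(idx)
        elif s <= fail_at:
            result["fail"].append(idx)
        else:
            result["review"].append(idx)  # uncertain: a human takes a look
    return result


def automation_rate(result):
    """Fraction of instances decided without human intervention."""
    total = sum(len(v) for v in result.values())
    if total == 0:
        return 0.0
    return (len(result["pass"]) + len(result["fail"])) / total


scores = [0.97, 0.95, 0.92, 0.55, 0.99, 0.10, 0.93, 0.40, 0.96, 0.98]  # toy values
r = triage(scores)
print(r)
print(f"{automation_rate(r):.0%} decided automatically")
```

Widening the gap between `fail_at` and `pass_at` sends more instances to a human but makes the automatic decisions safer; narrowing it does the opposite.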

Erik: So far, we've been talking about using machine vision with camera systems, right? But I guess you could use infrared, or some other sensor, in quite a similar way. Have you ever found that to be particularly useful? Or do you find that a more standard camera system is generally the most effective solution?

Iain: So far, we have only looked at this with standard cameras. But certainly the Cognex products are capable of working with multichannel images. So color is fine, black and white is fine, and there's no reason why that data couldn't be, for instance, 3D data. Infrared or thermal data, when you get the image, is basically the same format as a standard color camera or black and white camera image anyway.

With 3D, again, we would be very happy to work with it, though it might need to be what we call a range image, a 2.5D image, where you've got an XY image with height as the pixel value, the color if you like, rather than a full point cloud. But the techniques work with any image format. What I haven't really touched on is the fact that in these commercial products that we're using, the neural network has been pre-trained; it's already biased towards industrial types of images.
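
A range image of this kind can be produced from a point cloud by binning points onto an XY grid and storing a height per cell. This Python sketch assumes the highest z value wins in each cell, empty cells stay as None, and a 1-unit cell size; real 3D sensors and vision libraries provide their own conversions.

```python
"""Converting a 3D point cloud into a 2.5D range image.

Each (x, y) grid cell stores a height, so 2D image tools can consume it.
Assumptions: keep the highest z per cell, leave empty cells as None.
"""
import math


def range_image(points, x0, y0, width, height, cell=1.0):
    """Project (x, y, z) points onto a height x width grid of max z values."""
    grid = [[None] * width for _ in range(height)]
    for x, y, z in points:
        col = math.floor((x - x0) / cell)
        row = math.floor((y - y0) / cell)
        if 0 <= row < height and 0 <= col < width:  # drop points outside the grid
            current = grid[row][col]
            grid[row][col] = z if current is None else max(current, z)
    return grid


cloud = [(0.2, 0.3, 5.0), (0.7, 0.1, 6.5), (1.5, 0.4, 2.0), (2.2, 1.8, 9.0)]
for row in range_image(cloud, x0=0.0, y0=0.0, width=3, height=2):
    print(row)
```

Keeping the maximum height per cell represents the top surface seen from above; occluded points underneath are lost, which is the "2.5D" compromise compared with a full point cloud.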

So if we supply standard images, black and white or color, of industrial types of inspection, then the deep learning is already sort of preset. It doesn't have to work as hard to home in on the features that you're looking at. To some degree, there's a bit of flexibility around this. But if you went to Amazon or somebody and took a fresh, off-the-peg neural network and trained it from scratch on your images, then you're not narrowing that down at all.

So with the software products we're using, if you want to try and say, is it a dog or a monkey or a cat, that's probably an unlikely thing for an industrial inspection to be doing. But if we're looking at a printed circuit board, say, let's find the ends of a chip or check that the solder on each one of the pins is correctly formed, then those sorts of images go through the deep learning networks that we're looking at here much faster, because the network has already been predisposed to work well with that image type.

Erik: I really appreciate the deep dive here. Are there any last thoughts? I think we covered a lot of territory here today.

Iain: I think we have covered quite a lot. There are lots of other use cases. Deep learning, even deep learning in machine vision, is not new. But what we're now getting is commercially viable, off-the-shelf products that we can deploy in a reasonably short period of time. So this is now becoming really, really commercially viable and more and more acceptable. It's opening up avenues to us that previously were really shut, and it has been a bit of a wow factor.

When we started seeing the Cognex ViDi in particular, because the training user interface is so nicely presented, you're able to get somewhere with it very quickly. It really sent us looking back through our back catalogue of applications where previously we said either we can't do it, or we really don't think this is going to be robust if we deployed it; it's right on the borderline, we think we can find this type of fault but not that. You start looking back at those and thinking, actually, this would be possible. This could be really viable now.

And we're seeing some of the software companies focusing in on certain use cases. One of the big ones that certainly Cognex and MVTec have picked up on is text reading, and [inaudible 58:02] has a ready-trained font for general industrial text reading, OCR and OCV. It has been trained with a neural network, with a deep learning technique, and then given back to you as a runtime to just use. Its capability is aimed at industrial markings generally, so we're not necessarily talking about handwritten stuff.

But almost any industrial font that you get on a label, or anything that's printed or marked on something for traceability, or even just stuff like food and pharma packaging where you've got date codes, lot codes, things like that: these fonts have been trained with a massive network of images of text behind them, and they're really robust. Without any teaching, you just say, read this line of text, and it's very robust about reading it back to you.

We're seeing that as one of the key benefits of this: it can be pre-trained so the end user doesn't have to do the training. They can just benefit straightaway from the fact that they've got a ready-made deep learning model that reads text. And certainly, I know that some of the companies are looking at other use cases, other things that we could focus in on.

So maybe for logistics: this is a box. If you can then say, okay, that's the box, where do we look for labels on it? Then it speeds up everything. Or number plate recognition: finding a white rectangle or a yellow rectangle and reading the codes in there. Certain, maybe slightly niche, use cases, but areas where the training aspect could be done before the product is sold, and then you're just ready to go with a pre-trained deep learning model that you can use straight out of the box to solve certain tasks. So I think we'll see more of that coming through as we go forwards, really.

Erik: I guess for you, maybe it hasn't happened so quickly. But for a lot of people, I think it seems like this has come from nowhere, and all of a sudden we're moving towards pretty cost-effective solutions. So thanks for walking us through it and giving us an update on where we are. And yeah, I really appreciate your time, Iain.

Iain: No problem, pleasure.

Erik: Thanks for tuning in to another edition of the Industrial IoT Spotlight. Don't forget to follow us on Twitter at IotoneHQ, and to check out our database of case studies on IoTONE.com. If you have unique insight or a project deployment story to share, we'd love to feature you on a future edition. Write us at erik.walenza@IoTone.com.
