Smart devices, Machine learning, and why you should care
Recently there has been some drama around the ‘revelation’ that some Amazon Echo voice recordings are sent to human reviewers and how this is somehow a massive scandal and gross invasion of privacy…or something.
To those who follow the machine learning space, this was hardly surprising; it was more a case of “Yes, of course, so what?”. What this reveals, however, is the fundamental lack of awareness that consumers in general have around this technology. It’s not really fair to blame consumers for not making this connection; the field of artificial intelligence and big data is often mentioned, but is eye-glazingly complex and abstract to most people.
It could be lamented that this is all part of a growing disconnect between the users of technology and how things actually work, but that’s a whole other discussion. What matters here is that tech companies are not very forthcoming with the basic information about the services they provide and the ways the data you provide them is used. Even with legislation mandating comprehensive privacy policies and disclosure, it’s still very easy to obfuscate what is actually happening in a block of legalese no one really wants to read.
That being said, it’s important that smart home enthusiasts are aware of how this stuff works, and why it’s not surprising that human reviewers are involved at all.
Machine Learning for Smart Devices
Let’s take a brief look at what this field of AI study actually is, and how it’s used in our devices right now. Artificial intelligence has long focused on the concept of neural nets: abstract data structures intended to mimic the connections of the human brain and enable computers to ‘learn’ the way we do. What we call artificial intelligence is still a long way from what science fiction has portrayed, but the advent of massive compute clusters and the ever forward march of processor power has enabled big strides. The availability of massive data sources has pushed this area of research forward in several ways, with machine learning being key.
Essentially, machine learning is a process by which large data sets can be analyzed for patterns of meaning without any underlying assumptions about what those patterns may be. This is done by running many passes over the data and refining the output each time, hence a kind of learning. If we take that process and scale it up, using neural networks with many layers, we move into what is called Deep Learning.
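To make “many passes refining the output” concrete, here’s a toy sketch in Python: a tiny one-dimensional k-means clustering that discovers two groups in a list of numbers without ever being told what those groups are. The names and data here are invented purely for illustration, not taken from any real product.

```python
import random

def kmeans_1d(points, k=2, iterations=20, seed=0):
    """Repeatedly assign points to the nearest centre, then move each
    centre to the mean of its points. Each pass refines the output."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Two obvious groups in the data; no labels are ever supplied.
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans_1d(data))  # the two centres settle near 1.0 and 9.0
```

No one told the program “these are small numbers and those are big ones”; the structure emerged from the repeated passes, which is the core of unsupervised machine learning.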
Deep Learning allows us to leverage vast amounts of data and specialized neural networks to focus on specific problems to be solved. In the case of smart devices this is typically either language processing or image recognition. To do this requires many layers of analysis (hence ‘deep’) which allows the system to recognize relationships from the regularities occurring in those layers. This approach makes for more dynamic systems that can adapt and continuously improve themselves as opposed to traditional processing systems that rely on programmed rules and behaviors.
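The “layers of analysis” idea can be sketched in a few lines: each layer combines the outputs of the previous one, so later layers can respond to relationships between features rather than raw inputs. The weights below are made up for illustration; a real deep learning system has millions of them, learned from data rather than written by hand, and many more layers.

```python
def relu(values):
    # A common activation: pass positives through, zero out negatives.
    return [max(0.0, v) for v in values]

def layer(inputs, weights, biases):
    # One layer of analysis: each output is a weighted sum of all inputs.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x):
    # Layer 1: raw inputs -> intermediate features
    hidden = relu(layer(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]))
    # Layer 2: features -> a single score
    return layer(hidden, [[1.0, 2.0]], [0.0])[0]

print(forward([0.3, 0.9]))  # a single score built up through two layers
```

Training a deep network amounts to adjusting those weights, across every layer, until the final scores come out right, which is where the data (and, as we’ll see, the humans) come in.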
It is really only by using deep learning that we can perform effective natural language processing. Human language is extremely diverse and complex, with enormous variation and implied context. Just think of all the different ways we can convey an idea or even a simple request. Throw in regional dialects and accents, slang terms and colloquialisms and it’s a huge challenge. And that’s just for one language.
In many deep learning applications the system is looking for patterns and relationships in the data itself, such as in financial or industrial applications. In the case of language and image recognition there is a correct interpretation, but not one that can be easily defined by hard rules. As such, the neural networks require some supervision to ensure the outcome being determined is correct. By guiding the learning, the system can be trained to produce correct answers more often, and from more diverse input.
This is very similar to teaching a child to read, for example. You present letters repeatedly to build recognition and correct mistakes until the results are consistent. Then you move on to words, repeating different combinations visually and breaking them down into phonetic components so new words can be correctly pronounced from known ones. Each pass builds up vocabulary and accuracy. Simply presenting words in isolation wouldn’t produce very good results, and would take much longer to reach the desired outcome, if it got there at all.
There have been a number of high profile failures where machine learning has been allowed to run unsupervised. Microsoft’s millennial chat bot experiment, Tay, is a case in point: a happy, fun-loving personality quickly turned racist and pro-Hitler in under 24 hours. In another case, Facebook had two chat bots that went off the rails. The headlines claimed they ‘created their own language’, but it was more a case of the bots feeding each other increasingly degraded data without supervision.
So, if we want the systems to produce results that we actually expect, we need supervision. As with teaching humans, ‘supervised learning’ is a training approach in machine learning that provides both inputs and the correct outputs as part of its analysis model. The correct outputs are then compared to the model’s own outputs to refine it, hence training the system to produce correct outputs from more diverse inputs over time. This is where humans come in.
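As a toy sketch of that idea, here is a classic single-neuron perceptron with made-up training data: the model produces its own output, that output is compared against the supplied correct answer, and the weights are nudged whenever the two disagree.

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    # samples are (inputs, correct_output) pairs: supervised learning.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, correct in samples:
            predicted = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            error = correct - predicted  # compare model output to the answer
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Learn a simple AND rule from labelled examples.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
       for x, _ in data])  # → [0, 0, 0, 1]
```

The labels are the supervision. For a voice assistant, producing those labels means a human listening to a recording and writing down what was actually said.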
In a typical smart device scenario, a digital assistant, there are a couple of key steps that occur after the wake word is received. The smart device starts sending the recorded audio to the deep learning system. The first use of machine learning is to transcribe the sounds into text. Once done, the second kicks in on another system: determining the action to be taken in response to the request. In some cases this second phase can be done locally, such as on a smartphone with sufficient processing power, as it’s less complex than understanding the spoken word.
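That sequence can be sketched as a simple pipeline. The function names and responses below are hypothetical stand-ins for the vendor’s services; in reality the transcription stage is a trained deep learning model running in the cloud, not a stub.

```python
def transcribe(audio: bytes) -> str:
    # Stage 1: speech-to-text, normally a cloud deep learning model.
    # Stubbed with a fixed answer here for illustration.
    return "turn on the kitchen light"

def resolve_intent(text: str) -> dict:
    # Stage 2: map the transcript to an action. On a sufficiently
    # powerful phone this simpler step can sometimes run locally.
    if text.startswith("turn on "):
        return {"action": "switch_on", "target": text[len("turn on "):]}
    return {"action": "unknown"}

def handle_utterance(audio: bytes) -> dict:
    # Runs only after the device has heard its wake word.
    return resolve_intent(transcribe(audio))

print(handle_utterance(b"\x00\x01"))
# → {'action': 'switch_on', 'target': 'the kitchen light'}
```

The human reviewers in the news stories sit behind stage 1: checking whether transcriptions like this one actually match what was said, so the model can be corrected.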
In both of these cases, we expect a ‘correct’ result: both that our command is correctly transcribed and that the intended action is performed. These are not things we can leave to the machine to figure out just from data, so the systems need to be trained to produce the best possible result in a wide variety of circumstances. To achieve this, a human must be involved. Not in each request, but in the training process. As such, every tech company that provides a service reliant on voice or image processing must use supervised learning.
Human Visibility of your data
Generally, all tech companies claim to only use anonymized data for training their systems, but what that actually means varies. Amazon uses third party contractors to review voice data, and states that the voice recordings only have your first name, account number and device serial number associated with them. Having your account number in there doesn’t seem very anonymous, but there it is.
Apple, as one might expect, takes the high road and claims robust privacy protections. Their review process is detailed on their privacy page, and the samples they take don’t have any user identifiers (even the temporary ones which automatically update every 15 minutes).
Google is the big data behemoth, and says it uses human review for some text transcriptions only. They are actively pursuing research into eliminating the need for human training, but do store voice recordings indefinitely.
Microsoft and Samsung were less clear on what they do with their voice data, or what information may be associated with it. Samsung seems to use a third party service for speech to text translation, which is perhaps to be expected as they don’t have a big data presence.
Along similar lines, Ring uses an army of Ukrainian reviewers to tag objects in camera videos in an effort to improve their motion notifications. Other video camera makers likely have similar approaches, but some of them leverage the users themselves to provide this data. Canary in particular has a tagging system to help make motion sensing more intelligent. The device will identify and highlight the portion of the image which triggered the motion notification, and users can optionally select from a set of predefined tags such as “Blind”, “Fan”, “Shadow”, or “Dog”, or specify their own.
What can you do
In reality, not a lot. While there are usually options to opt out of sending telemetry or other data to “improve features”, it’s often unclear what this will actually do. It’s reasonable to assume it would opt you out of data reviews such as this, but it’s entirely up to the service provider what that actually turns off.
Some devices give you the option to delete your recorded data, but again, it’s unclear whether this is just from the device, your account, or from their archives.
You can of course opt to not use the service, or these devices at all, and that’s a personal choice. However, using these services isn’t all bad. Generally having voice requests reviewed for accuracy isn’t going to be an issue. The bigger concern would be having a smart speaker trigger unintentionally and record conversations you might prefer be kept private. In that case, you can sometimes turn off the microphone (such as on Amazon Echo devices) if you don’t want to be listened to.
It can also be a matter of prudent placement of smart speakers where they aren’t going to be exposed to private conversations. Maybe you don’t really need one in the bedroom. This applies even more so to smart cameras. If your camera maker doesn’t offer some sort of privacy controls, you may want to consider not placing them where they could record private moments. Perimeter security is going to be fine, but maybe inside the house is worth a second thought.
Perhaps the more important consideration is for us to recognize that privacy is important, and to hold our tech companies and legislators to higher standards of transparency and user control over our data.