The cryptography behind this is done very well so that Google does not ever see even your hashed password, or know if there was a breached password detected. What Google does see is a hash-prefix of your username, to narrow down the encrypted data set of compromised credentials being returned to your Chrome instance.
Here’s what happens. Your instance of Chrome hashes your username and password and encrypts that value with an ephemeral key, Kc, which never leaves Chrome.
Google gets the encrypted cred-hash and applies a second round of encryption with a key only known to Google, Kg. They return the doubly-encrypted cred-hash, along with 256 candidates encrypted with just Kg for you to compare against. Those candidates are selected based on a clear-text 3-byte prefix that Chrome also sends them.
When Chrome gets the results back, it decrypts the cred-hash using Kc, which leaves the cred-hash encrypted only with Kg. That is, by design, the exact key used to encrypt the candidates, allowing Chrome to do a simple byte-compare to see if there’s a match between your cred-hash and the encrypted breach candidates.
The cool part is that the encryption operation is commutative. You can add your layer of encryption first, send the value to Google, they add their encryption and you get the double-encrypted blob back, and then you remove your encryption to end up with your value encrypted just with Google’s key, even though Google never saw the plaintext, and you never saw their key.
This trick allows your browser, and only your browser, to compare your cred-hash against the breached cred-hashes while all the data you are comparing is actually encrypted with a secret key held by Google that never leaves Google’s server.
The only information leak is the 3-byte hash prefix to identify the slice of the comparison set. This is definitely not nothing, but perhaps it is a reasonable trade-off.
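The whole dance can be sketched with a toy commutative cipher: exponentiation mod a prime. This is only a sketch of the protocol's shape; the real implementation uses elliptic-curve operations (secp224r1) and a memory-hard hash rather than SHA-256, and every name and input below is made up.

```python
import hashlib
import math
import secrets

# Toy commutative encryption: raising to a secret exponent mod a prime.
# Since (m^a)^b == (m^b)^a, the two layers can be added and removed in
# any order.
P = (1 << 127) - 1  # a Mersenne prime, large enough for a demo

def keygen() -> int:
    # Keys must be invertible mod P-1 so the layer can be stripped later.
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def add_layer(m: int, k: int) -> int:
    return pow(m, k, P)

def strip_layer(c: int, k: int) -> int:
    return pow(c, pow(k, -1, P - 1), P)

def cred_hash(username: str, password: str) -> int:
    digest = hashlib.sha256(f"{username}:{password}".encode()).digest()
    return int.from_bytes(digest, "big") % P

# Chrome: blind the cred-hash with an ephemeral key Kc.
kc = keygen()
blinded = add_layer(cred_hash("alice@example.com", "hunter2"), kc)

# Google: add its own layer Kg, and encrypt the breach bucket with Kg too.
kg = keygen()
double_blinded = add_layer(blinded, kg)
bucket = {add_layer(cred_hash("alice@example.com", "hunter2"), kg),
          add_layer(cred_hash("bob@example.com", "letmein"), kg)}

# Chrome: strip Kc, leaving the cred-hash under Kg only, then compare.
under_kg_only = strip_layer(double_blinded, kc)
print(under_kg_only in bucket)  # True: the credential appears in the breach set
```

Note that Google only ever sees `blinded` and Chrome only ever sees Kg-encrypted values, yet the final byte-compare works.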
• I want to send my password (44) to Google but don't want them to know it's 44. So I multiply it by 37 (Kc) and send 1628.
• I also tell Google that the password is between 40 and 50. (Equivalent to sending three characters of cleartext.)
• Google multiplies 1628 by 78 (Kg) and sends me the result, 126984.
• Google also sends me every number between 40 and 50, multiplied by 78.
• I divide 126984 by Kc and get 3432. I see that 3432 is in the set of other numbers Google sent me (44x78). So I can infer that my password is in Google's set of breached passwords.
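For anyone who wants to poke at it, the toy arithmetic above runs exactly as written. Multiplication over the integers is commutative, which is all the trick needs; the real scheme uses group exponentiation so the "keys" can't be recovered by simple division.

```python
# The toy example, end to end.
password, kc, kg = 44, 37, 78

blinded = password * kc                       # I send 1628 to Google
double_blinded = blinded * kg                 # Google returns 126984
candidates = {n * kg for n in range(40, 51)}  # Google's bucket, each * Kg

unblinded = double_blinded // kc              # 3432 == 44 * 78
print(unblinded in candidates)                # True -> my password was breached
```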
> I also tell Google that the password is between 40 and 50. (Equivalent to sending three characters of cleartext.)
I believe the (first 3 bytes of the) hash of your username is used, as opposed to your plaintext password. I realize you were simplifying, but the difference between "plaintext password" and "hashed username" is rather essential in this case.
IIUC, their breach data consists of pairs of values - a username hash and the matching username-password hash (all encrypted ofc). This way they aren't storing a bunch of breached plain text credentials on their servers. This allows for a range search based on your username hash prefix, while the encrypted packages you send and receive are username-password hashes.
Edit: I believe I was incorrect - they only retain the _prefix_ of the username hash (as opposed to the entire thing). This has the benefit of effectively rendering the username unrecoverable - you'd be forced to attack the combined username-password hash. So it's not really a range-based search, but instead a hash table. They just return the entire (24-bit addressed) bucket to you, which for 4 billion (!) sets of credentials amounts to ~240 entries per bucket on average.
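A sketch of that hash-table addressing, with made-up inputs and plain SHA-256 standing in for the expensive memory-hard hash the paper actually specifies:

```python
import hashlib

# The first 3 bytes of the username hash act as a 24-bit bucket address;
# the server returns the whole bucket they select. SHA-256 here is a
# stand-in for the real scheme's expensive hash.
def bucket_for(username: str) -> int:
    digest = hashlib.sha256(username.lower().encode()).digest()
    return int.from_bytes(digest[:3], "big")  # 0 .. 2**24 - 1

print(bucket_for("alice@example.com"))  # some bucket id in [0, 16777215]
```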
This scheme sounds (unsurprisingly) similar to a well known crypto scenario. Here is a page that describes it[0]. Copying from there:
You and I need to communicate via the National Postal Service. What you want to communicate is secret, so you don't want a postal worker or some random schmoe who finds your mail in the mailbox or a package on your stoop to be able to open it up and read what's inside, and you know from experience that any time you send something in the mail without securing it, it either gets read or (if it's not just text) stolen. We must live in a country with a really corrupt postal service, I guess -- kind of like the Internet.
We each have access to an arbitrary number of indestructible boxes, and an arbitrary number of indestructible key locks. Each box can have as many locks on it as you like. Unfortunately, each lock has only one key, and only the person who possesses the lock has the key. We can send things to each other in locked boxes, but of course the recipient doesn't have the key to the lock because the sender has it. Sending the key unsecured will get it stolen or, worse, copied. We also have no way to meet each other in person to exchange keys securely (and if we did, we could just skip the postal service altogether, anyway).
How can we arrange, with these resources, to communicate securely with each other?
It’s actually fundamentally different, because instead of trying to agree on a key we can use to securely communicate cleartext over a hostile network, this is a scenario where neither side wants to know anything about what the other side knows, except in the very special case where we both have the same exact value, in which case only one side should discover that.
The solution to the scenario is (spoiler alert) to attach multiple locks.
Alice attaches her lock.
Alice sends the locked package to Bob.
Bob attaches his lock.
Bob sends the twice locked package to Alice.
Alice removes her lock.
Alice sends the package, locked with only Bob's lock, to Bob.
Bob removes his lock and opens the package.
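Those steps are the Shamir three-pass protocol. Here's a minimal sketch using exponentiation mod a prime as the "lock" (a naive XOR lock would not work, by the way: XORing the three intercepted transmissions together reveals the message):

```python
import math
import secrets

# "Locking" is raising to a secret exponent mod P; "unlocking" is raising
# to its inverse mod P-1. The two locks commute, so they can be removed
# in either order.
P = (1 << 127) - 1  # a Mersenne prime, fine for a toy demo

def make_lock() -> int:
    # A lock is a random exponent invertible mod P-1, so it can be removed.
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def lock(m: int, k: int) -> int:
    return pow(m, k, P)

def unlock(c: int, k: int) -> int:
    return pow(c, pow(k, -1, P - 1), P)

message = int.from_bytes(b"meet at noon", "big")  # must be < P
a, b = make_lock(), make_lock()      # Alice's and Bob's locks

box = lock(message, a)               # 1-2. Alice locks, sends to Bob
box = lock(box, b)                   # 3-4. Bob adds his lock, sends back
box = unlock(box, a)                 # 5-6. Alice removes hers, sends again
print(unlock(box, b) == message)     # 7.   Bob removes his lock: True
```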
----
The scheme you discussed is really similar to this, but I guess with one key difference; in the scheme you describe the locks themselves contain information and are hidden by a second lock.
I don't think there is a good extension to the solution I posted above, but it's as if when Bob attaches his lock it is hidden until Alice removes her lock, and at that point Alice only cares about what Bob's lock looks like.
What is stopping me from doing a MITM on that, especially when it's digital? Couldn't I copy the locked package that Alice sent, forward the original to Bob, and then, when Bob sends his locked copy, take my copy, attach my OWN lock to it, and send that to Alice instead, who then removes her lock from it, leaving only a package that is locked with my lock?
Sure, you need a way to verify that the package has been locked by who you thought you were sending it to.
Your attack would work by removing Bob completely, you just play his part in the protocol.
This is true for any protocol though. If you don't know who you're talking to then you could be talking to anyone.
In practice most protocols get around this by either pre-sharing a private key, or using a public/private key pair with a challenge round to establish identities.
Not really, or at least not for anyone at all tech savvy -- and the reasons why are important.
The key trick to what Google is doing is that Google never ends up knowing anything about what their users' passwords are AND their users never end up knowing which passwords Google has discovered, YET between them they arrange that the user knows if any of their passwords are known to Google. Magic!
With Pwned Passwords users can learn any arbitrary subset of the Pwned Passwords full password hashes they please for just one API call per prefix - with Password Checkup you have to start by guessing a username plus password combination, then do a bunch of fairly expensive operations, and you only get back a boolean. It's essentially never worth it for bad guys who are guessing.
This has two advantages, and doubtless both are attractive to Google, although one advantage is better PR than the other.
1. Bad Guys can't sieve the Google system for valuable data. For Pwned Passwords this isn't a big concern because Troy is using readily available password lists. Spending say $5M to get the raw passwords Troy used as input back by processing Troy's data makes no sense, just download a Torrent of it. But Google's Password Checkup doesn't just use public sources. Spending $5M (again just an example figure) to steal data from Google that may not be available to your criminal gang otherwise might be worth it for a big enough score. And it'd be bad PR for Google if a customer gets attacked by such crooks using data which those crooks couldn't have obtained at all without Google. Troy can always say "This data already existed, I didn't make the problem worse" but Google's non-public data can't make that claim.
2. Good guys can't either. Google has a valuable service here, locked into Google's infrastructure. They can choose to give it away, but they could also (as they have with some other products) later decide they'd prefer to monetize it. Either way, nobody else can duplicate it even though millions of people are using it, thanks to cryptography.
The costs are pretty enormous too. Pwned Passwords is mostly just a CDN, and clients are trivial. The upfront maths to make it work was hard, but that's a one-time thing. But Password Checkup incurs a considerable ongoing cost for Google to do hard maths AND each client needs lots of RAM (this is a memory-hard problem on purpose, to exhaust bad guys who might try to exploit it).
I'd like to add that (IIUC) this scheme doesn't require Google to retain plaintext credentials or even password hashes. (They still might of course, but they don't have to.) If implemented as described in the linked paper, it only requires storing the hash of the combined username-password (in encrypted form). It's difficult to imagine a less useful form from the perspective of an adversary.
Regarding costs, the one-time hash performed on intake is indeed computationally expensive. However, for queries the only cryptographic operation required is a single elliptic curve computation (using secp224r1) on the hashed record sent by the client. Computationally, that should be in the same ballpark as establishing a single TLS connection.
The primary expense (detailed in the paper) is bandwidth, on account of sending out an entire hash bucket with each response. The analysis in the paper assumes a 2 byte prefix, but they're using 3 bytes here to reduce bandwidth at the expense of privacy; for every query made, they have to send back ~250 values instead of just one.
This "But Google's Password Checkup doesn't just use public sources" is infering that Google's db is bigger and better. You don't know and this is example of magic lotion marketing.
"Our magic lotion for hair grow is muuuuch better, as it have secret formula".
Actually I suspect that Troy's db as for now is better/bigger.
See my other comment. Commercial outfits sell this data on a daily volume basis. Remember it has been (though admittedly not for over a year) literally my job to process the piles of new stolen password data or rather write and oversee programs which did that processing.
Didn't you ever wonder when you see viable SQL injections and other attacks announced so frequently why Troy's data set is so small? Troy is taking the ethical high ground by not buying data. So his data is the tip of the iceberg.
I guess I assumed the logic was that you take a sha256 of, say, 32501257be4512a04a184fed989fd5e46cd4d30f6ef7c8d20d07d90f82e83aaf and ask for all entries from the 4B that begin with 325 (or end with aaf), which would divide your 4B by 16 three times, leaving about 1M?
I don't like the direction of "popular websites are exempt, small websites are restricted".
Chrome has blocked autoplaying videos with sound for most sites except a small, hardcoded list that includes YouTube.
This means that if you create a YouTube competitor today, you are playing at a technical disadvantage. Or if you're just hosting videos on your own personal website!
What if Google's algorithms classify your new startup as "potential phishing", because users are re-using their own passwords on your site? How can you appeal? What recourse do you have against Big G's algorithm?
Disclosure: I'm the TL of this project on Chrome and I work very closely with the Safe Browsing engineers regularly.
> What if Google's algorithms classify your new startup as "potential phishing", because users are re-using their own passwords on your site?
That's not how our phishing detection works. In fact, our internal studies show that a lot of users reuse their passwords often and while that's not the best password hygiene, it's the user's choice to make and we have to respect that and build protections with this in mind.
> How can you appeal?
Right from your search console.
> What recourse do you have against Big G's algorithm?
Ultimately, Google/Safe Browsing has a lot more to lose if their users stop trusting their product(s). I can tell you that we take false positives very seriously and try hard to provide a fair and speedy resolution.
This assumes that someone from every site has a Google account and consents to their Privacy Policy and Terms. This is not a safe assumption, nor is it fair to make this a requirement.
After the YouTube banning debacle, can anybody really trust that there are humans working in Google's tech support, or that, given the amount of traffic they receive, a ticket will be handled in an acceptable time?
It is based on a global population of browser usage, which is effectively a (dynamic) hardcoded list, that is also specifically filtered for a certain level of popularity and above.
Your interactions influence the default list, minutely, but also only if it’s a popular site. The distortions remain.
Apologies if we have different definitions of hardcoding. I’m not suggesting this list is a compiler argument. Just saying that a Google list decides whether a domain can autoplay video, instead of a more open web approach like “HTTPS can do this”, which is used for many features like WebRTC.
So, now you've gone from your original claim about a hardcoded list of sites with autoplay to a hardcoded list of sites which get MEI "preloaded" and an "explicit popularity filter".
"Chrome does this by learning your preferences. If you don’t have browsing history, Chrome allows autoplay for over 1,000 sites where we see that the highest percentage of visitors play media with sound. As you browse the web, that list changes as Chrome learns and enables autoplay on sites where you play media with sound during most of your visits, and disables it on sites where you don’t. This way, Chrome gives you a personalized, predictable browsing experience."
The inclusion of the "over 1,000 sites" list is what creates a distortion.
But that's not a hard coded list but instead "sites where we see that the highest percentage of visitors play media with sound" exactly as you supposedly wished they should do.
But only popular sites are even in the running for consideration. That benefits incumbents and harms new entrants. If you want to go out and make your own video platform, you'll have to do it without autoplay, even if 99% of your visitors play media with sound.
Let's take a very popular site indeed. Google.com. Does it get autoplay? Nope. Why not? Because people keep visiting without clicking on any videos.
If I create a new video site tomorrow, let's say I set up a new music label and the whole site is just music videos for our label's artists. That site will start with zero index value, and visitors will need intent to watch the videos. But visitors who come back dozens of times will be eligible for autoplay, because the MEI quickly elevates for them.
But if Google added that same feature to a new tab of the main search site it'd never get MEI high enough to autoplay because most visitors to this popular site of Google.com don't play videos. That's not what they expect this site to do so it doesn't get autoplay.
Most sites I visit which have high MEI don't even use autoplay, but a lot of sites I visit which have very low MEI do use it (and of course Chrome blocks it on those sites). Both causes are the same - the people giving me quality content are also polite enough not to try to shove it down my throat, and those shovelling everything they can down people's throats don't care about quality.
Chrome is weak about providing options, generally (and to the extent they do occasionally provide useful options, weak about keeping them from release to release).
So the major competitors who can sue Google are on the list but small independent sites are not? Sounds completely fair. You must have missed when they broke the web with the webaudio API.
It's just another variant on how right now, gmail can decide to spambin all the email you send, and it really doesn't matter why it decided to do it or who you are, because they can do whatever they want. An ever-increasing number of barriers stopping new companies from entering the market unless they're sitting on a big sack of venture capital.
To be fair, while Gmail is notorious for marking everything as spam, at least there are things you can do about it as a mail server admin. I have yet to find a way to get Outlook.com to stop blocking my entire mail server.
That said, I wish Mozilla would allow other forms of income and just be honest about it (unlike Pocket, which I think is a good idea destroyed by them being sneaky about it).
Shockingly, the browser made by the operator of several of the world's most popular websites applies its security policies less strictly to popular websites.
I don’t like how it’ll become a new tracking datapoint. Google will learn about your passwords, and of course that will become something that can be targeted through ads at some point.
Even worse if you can use such targeting to figure out who is possibly vulnerable.
Or maybe that’s just paranoid, but I don’t trust an advertising firm (which is all google is at heart) to not do this.
Seems to me Troy Hunt (the independent developer of https://HaveIbeenPwned.com/) and Cloudflare's Junade Ali deserve at least a mention in the Google announcement. They're pioneers of this database of pwned passwords and the ability to look up candidate passwords securely.
I suppose you could give Troy Hunt credit for popularizing the idea of anonymously checking for breached credentials, but Google doesn't use HaveIBeenPwned's database, nor is their method for checking for exposed passwords quite the same.
I’m happy to see this comment. I guess I don’t really care if Google wants to succumb to NIH syndrome and build this themselves, but they should at least acknowledge the person whose idea they are repackaging.
Why? Did Apple acknowledge all the PDAs that came before iPhone? Did Spotify acknowledge subscription music services like Rhapsody from 2003? Did Flickr mention Smugmug?
Some of these changes, such as realtime sending of all non-popular URLs, concern me a little.
I am happy for Google to use this information for aggregated security purposes (e.g. bots analyse suspicious new pages).
I am not happy with this being associated to the Google account. However, AFAIK, I have no way of knowing this, and in the absence of an explicit statement, I have to assume the worst: that Google will use this to target me ads.
If any Chrome team members are reading this comment, could we get some sort of confirmation that these security features will not be used to collect more data to target me ads?
> If any Chrome team members are reading this comment, could we get some sort of confirmation that these security features will not be used to collect more data to target me ads
Any statement they make is open to change at a later date if management changes their mind. You’d be better off moving to a browser made by a company whose privacy protections don’t exist in perpetual conflict of interest with the company’s dominant revenue source.
Do you see any other alternative than trust for such a system?
- Building a fully local system will not provide the same coverage and doesn't allow sharing findings across users (which means that when a new phishing page appears, it will appear as "new" for every single user, instead of being new for N users and then known bad for the rest of users). It also heavily reduces what kind of analysis can be done since you can't just store large datasets on every single device and/or run expensive algorithms on every single web page load on a mobile phone.
- Making it open source would not help. You can't know whether the code you can see is what is deployed remotely, so in the end you just end up trusting a different assertion instead (if you can't trust a privacy policy, why could you trust that the deployment is not backdoored?). It also has some significant cons: malware / phishing / abuse detection is in essence a cat-and-mouse game, and secrecy is unfortunately a key requirement in how everyone is building anti-abuse systems across the industry (not necessarily because they want to, but because nobody knows how it could work otherwise).
- You could even go all the way and have e.g. remote attestations, reproducible builds, etc. that allow proving that indeed the code running remotely is the open source code you want and can audit. This is barely doable with available technology these days, and even if someone was to do it there would maybe be 1K people on this planet able to understand why this is trustworthy. A prime example of this is looking at people in this very thread not understanding the differential privacy scheme for detecting compromised passwords.
Not trusting Google is a personal opinion, and I completely respect that. But implying that there is an alternative to trust for this kind of system is IMO misleading. Using DDG or Protonmail or any other service doesn't change the fact that you have to trust someone, it's just a different someone. You might personally believe their word more than Google's word, but if e.g. DDG started logging your identity and log requests and sell that to ad companies you would have very little way of learning about it either.
Disclaimer: I work for Google, not on Chrome, but I have worked on anti-abuse systems in the past.
AFAIK, Chrome was already sending everything for ad purposes (it's basically enhanced Google Analytics that happens to be the user agent too). Now it will just send data for another purpose as well.
Reading through the article and the other sources linked, this seems to be a different dataset and implementation than the "Have I been pwned" dataset that Firefox and other password managers reference. However, I don't see any information about where their data comes from, only information about how the feature/extension works to not leak your password during verification.
To be fair, if you have money (which Google does) you can buy a LOT of this data, every single day, forever.
There are at least two distinct outfits which pay grey hats to steal data from black hats who in turn obtained it typically through cheesy script kiddie attacks on web sites or phishing.
If you give them not very much money they'll "monitor" their stream of supposedly fresh stolen data (a lot of it isn't fresh because crooks are also often liars) for specific data items you tell them about. This is a bad deal, but apparently some pretty big US companies pay for that service. It's fine though because everybody reading this has unique strong passwords for every account right? Right?
If you give them a LOT of money (or if you were, say, one of my former employers and owned the company that does this outright) you just get the raw data. Delivered as UTF-16 XML or CSV files with a different style of quoting on each line, or whatever other crazy and inadequately documented nonsense came to mind for each such type of file.
At Google’s scale, I HOPE they have an internal team dedicated to building out their own HIBP, and at the very least scanning employee credentials on a near real-time basis.
I like the secured hash check method. I think this is good.
Most of my current logins are unique-per-site with 2FA, but the burden of remembering which one(s) still use a shared password was greater than my motivation to fix them. This mechanism may get me to fix them faster.
I guess someone somewhere maintains a database of all leaked username passwords and this feature compares a credential with this database? Is that how this works?
When you’re Google-sized and investing engineering and cryptography resources into building security features, applied at scale to a billion* users, you probably want assurance that this list will be maintained for as long as you want it to be maintained, AND that no third parties like HIBP can enumerate over all Google emails.
Publicly, yeah. But do you really think Google would deprecate some core internal service actively used for account security, as well as internal employee security?
This is a solution looking for a problem, enabled by default. I think most people don't care that homomorphic encryption is being used; they don't want their usernames/passwords being sent across the wire in any way, shape, or form, especially enabled by default.
So let me get this straight. We currently consider it poor practice if passwords are stored for authentication purposes in anything but heavily salted PBKDF2, bcrypt, or sometimes scrypt. Yet it's okay to just send SHA256 unsalted hashes to web services for the purposes of verifying that those same SHA256 hashes aren't already stored?
It's good that Google is encrypting these hashes, but come on, really? Surely there has to be a better way than just shipping off unsalted password hashes to a centralized location.
Edit: Okay, unsalted scrypt for the password. Thanks for the clarification :)
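For concreteness, "unsalted scrypt" over the combined username and password looks something like this. The parameters and the `username:password` concatenation format here are illustrative, not Google's actual ones.

```python
import hashlib

# Memory-hard hash of the combined username+password pair, with no
# per-record salt (salt=b""). Cost parameters are illustrative only.
def cred_hash(username: str, password: str) -> bytes:
    material = f"{username}:{password}".encode()
    return hashlib.scrypt(material, salt=b"", n=2**14, r=8, p=1, dklen=32)

h = cred_hash("alice@example.com", "hunter2")
print(h[:3].hex())  # the kind of 3-byte prefix that would select a bucket
```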
It...appears to be so? The infographic states that they "send a strongly hashed and encrypted copy of your username and password to Google." They explicitly mention that the username, not the password, is sent with a 3-byte hash prefix. They make no mention of prefixing the password in the infographic.
In the linked paper they discuss the tradeoffs and take blinding into consideration. They note that the protocol has the client call CreateRequest(u, p), which creates a Req that is then sent explicitly to Google. It appears to me that they consider the merits of sending a hash-prefixed password, but do not make it explicitly clear that the final solution sends a hash-prefixed password. I think they would want to explicitly say that only a partial hash is sent to Google, if that were the case?
> ... do not make it explicitly clear that the final solution sends a hash-prefixed password
I'm not sure if you're actually talking about something else, but the paper says: "Post-canonicalization, the server calculates a computationally expensive hash of both the canonical username and credential password... This 2-byte prefix—while leaking some bits of password material—provides the client with k-anonymity over the universe of all username and password pairs."
IOW, the 3-byte hash prefix sent is of the username and password concatenated. (Note that Google seems to have added another byte to the prefix versus the paper).
To add to this, hashed username-password material is leaked only by the first variant described in the paper. The second variant described only leaks hashed username material. They reportedly used the first variant during testing but have now switched to the second variant.
They indeed appear to have increased the prefix from 2 to 3 bytes. This makes logistical sense though - with 4 billion items, a 2 byte address yields ~61k items per bucket (and thus sent to the client per request) while a 3 byte address yields only ~240 on average.
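The bucket-size arithmetic above checks out:

```python
# Average bucket size for ~4 billion breached credentials, for a 2-byte
# versus 3-byte prefix used as the bucket address.
total = 4_000_000_000

per_bucket_2byte = total / 2 ** 16  # 65,536 buckets
per_bucket_3byte = total / 2 ** 24  # 16,777,216 buckets

print(round(per_bucket_2byte))  # 61035 -> the "~61k" figure
print(round(per_bucket_3byte))  # 238   -> the "~240" figure
```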