The other day I watched Gemini's reasoning go in circles for ~30 seconds trying to determine whether 9.11 was greater than 9.9.
----
"Realize a direct comparison of tenths is not enough: Wait, this is not right. In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. This initial comparison is incorrectly stated. Let's re-examine.
Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.
Re-examine the numbers digit by digit from left to right, after the decimal:
9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).
9.11: 1 in the tenths place, 1 in the hundredths place.
Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9. In 9.11, the tenths digit is 1. Wait... something is still wrong with my comparison logic. I'm getting confused. Let's re-read the numbers again.
Correctly read and compare tenths digits:
9.9 - Tenths digit is 9.
9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."
----
This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. This was after it decided to convert to money, $9.11 vs $9.90.
Training a model to reason that it should write and execute code is the solution here, for example with synthetic datasets showing lots of math questions followed by reasoning that writes and executes a program to get the answer.
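For a sense of how small the program would be, here is a minimal sketch of what the model could write and run instead of reasoning digit by digit. The function name `larger` is made up for illustration, and `Decimal` is just one reasonable choice for parsing the inputs exactly.

```python
from decimal import Decimal


def larger(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger (a on ties)."""
    # Decimal parses the text exactly, so no binary floating-point rounding is involved.
    return a if Decimal(a) >= Decimal(b) else b


print(larger("9.11", "9.9"))  # prints 9.9
```

The point is not the three lines themselves but that the comparison is delegated to an interpreter rather than carried out token by token.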
Obviously very stupid reasoning going on, but reasoning nonetheless? It makes me think we're on the right track, in that it basically seems to know what steps should be taken and how to step through them. I don't know why it is getting so incredibly tripped up; maybe it's extremely uncertain about whether it can map "9.9"["tenths place"] => "9". But it is still impressive to me that a machine is doing this.
This and its struggles with spelling questions are both artifacts of tokenization and not really a failure of reasoning. I think there's probably a simple solution that solves both this and the "how many r's are there in strawberry" problem, though I don't know what it would be.
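To make the tokenization point concrete, here is a rough sketch. At the character level the question is trivial, but a model sees tokens, not characters. The split in `tokens` is invented purely for illustration and is not the output of any real tokenizer.

```python
word = "strawberry"

# With direct access to characters, counting is a one-liner.
print(word.count("r"))  # prints 3

# Hypothetical token split (an assumption for illustration, not a real tokenizer's output).
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == word

# The model would have to know the spelling inside each opaque token to get the count.
per_token = [t.count("r") for t in tokens]
print(per_token, sum(per_token))
```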
This is hilarious, and makes me wonder whether there’s some main place where people are archiving examples of AI fails now. It would be amusing, but also seems like a public service and might help to avoid billions of dollars getting burnt at the altar of hype.
I don't think there is a way to share the chat, but here is step 28. Almost 20 steps after what I posted and 7 steps before it got it:
"Final Correct Comparison (using place value correctly now - hopefully!):
9.90
9.11
Whole number parts are the same (9).
Compare tenths digits:
9.90: Tenths digit is 9.
9.11: Tenths digit is 1.
No, wait! I'm still making the same error. I'm consistently looking at the wrong digit. Let's re-read the numbers and their tenths digits one last time, very slowly and deliberately.
9.9: The digit immediately after the decimal point is 9. This is the tenths digit.
9.11: The digit immediately after the decimal point is 1. This is the tenths digit.
Okay, I got it now! I was somehow flipping the digits in my comparison. Let's try again.
9.9 and 9.11.
Whole number parts are both 9 (same).
Compare tenths digits:
9.9: Tenths digit is 9.
9.11: Tenths digit is 1.
Wait! This is STILL wrong! I am consistently mis-stating the tenths digits. Let's rewrite the numbers again, very clearly:
Number: 9.9
Tenths digit: 9
Number: 9.11
Tenths digit: 1
No! This is still incorrect. I am fundamentally mis-identifying the tenths digit of 9.11. Let's re-examine place values again.
In 9.11:
9 is in the ones place.
1 is in the tenths place.
1 is in the hundredths place.
In 9.9:
9 is in the ones place.
9 is in the tenths place.
(implicitly) 0 is in the hundredths place.
Correct comparison, finally!
9.9 (or 9.90)
9.11
Whole number parts are the same (9).
Compare tenths digits:
9.90: Tenths digit is 9.
9.11: Tenths digit is 1.
Wait! No! This is still wrong! I am consistently making a mistake. Let's start from the most basic comparison rule."
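For reference, the place-value procedure the transcript keeps restarting (same whole part, pad to 9.90, compare digit by digit) can be sketched in a few lines. This is only an illustration of that rule, with a made-up function name, and it assumes plain non-negative inputs like "9.9" with no sign or exponent.

```python
def compare_decimal_strings(a: str, b: str) -> int:
    """Return 1 if a > b, -1 if a < b, 0 if equal (non-negative decimal strings only)."""
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")

    # Pad the shorter fractional part with zeros, e.g. "9" becomes "90".
    width = max(len(a_frac), len(b_frac))
    a_frac = a_frac.ljust(width, "0")
    b_frac = b_frac.ljust(width, "0")

    # Compare whole parts first, then the equal-length fractional digit strings.
    a_key = (int(a_int or "0"), a_frac)
    b_key = (int(b_int or "0"), b_frac)
    return (a_key > b_key) - (a_key < b_key)


print(compare_decimal_strings("9.9", "9.11"))  # prints 1
```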
Only Gemini 2.0-flash and Claude 3.5-sonnet got this right for me. And only Claude seems to get such things right with good consistency. It seems to have a strategy for it and applies it well, while other models are basically guessing.
Also DeepSeek gets it right similarly to Claude, but it's more verbose in standard mode and much more verbose in R1 deep thinking mode. The reasoning is verbose but nearly 100% sensible, especially when given the follow-up question "how about 9.8?"
Gemini-2.0-Flash is also consistently correct when "Think step by step." is appended to the prompt.