The other day I watched Gemini reasoning go in circles for ~30 seconds trying to...

hto2i4j234234 · on Feb 10, 2025

I wonder how much of this arithmetic confusion comes from software versioning numbers where the decimal character is actually a tuple-delimiter.
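To make the two readings concrete, here is a minimal sketch that parses the same strings once as decimal numbers and once as dot-separated version tuples. The helper names `as_decimal` and `as_version` are made up for illustration.

```python
from decimal import Decimal


def as_decimal(text):
    # Read "9.11" as the number nine and eleven hundredths.
    return Decimal(text)


def as_version(text):
    # Read "9.11" as the tuple (9, 11), the way version numbers work.
    return tuple(int(part) for part in text.split("."))


print(as_decimal("9.9") > as_decimal("9.11"))  # True: 9.90 > 9.11
print(as_version("9.9") > as_version("9.11"))  # False: (9, 9) < (9, 11)
```

Same characters, opposite answers, so a model that has seen a lot of changelogs has seen a lot of text where "9.11 comes after 9.9" is the correct ordering.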

fzzzy · on Feb 10, 2025

Training a model to reason that it should write and execute code is the solution here. Synthetic datasets showing lots of math questions followed by reasoning that writes and executes the program for the answer.
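A rough sketch of what one such "reason, then write and run a program" step could look like. The `generated_program` string stands in for model output and `run_generated` is a hypothetical harness function; neither comes from any real training setup, and there is no sandboxing here.

```python
import subprocess
import sys

# Stand-in for code the model would emit in response to
# "What is bigger, 9.9 or 9.11?" (an assumption, not real model output).
generated_program = (
    "from decimal import Decimal\n"
    "print(max(Decimal('9.9'), Decimal('9.11')))\n"
)


def run_generated(code, timeout=5.0):
    # Run the generated code in a separate interpreter and capture stdout.
    # A real system would need proper isolation; this is only a sketch.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.strip()


print(run_generated(generated_program))  # 9.9
```

The point is that the arithmetic is delegated to the interpreter, so the model only has to get the program right, not the digits.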

tucnak · on Feb 10, 2025

Psst, don't give them ideas. Keep them to yourself and get rich!

Validark · on Feb 10, 2025

Obviously very stupid reasoning going on, but reasoning nonetheless? It makes me think we're on the right track that it basically seems to know what steps should be taken and how to step through the steps. I don't know why it is getting so incredibly tripped up, maybe it's extremely uncertain about whether it can map "9.9"["tenths place"] => "9". But this is still impressive to me that a machine is doing this.

empath75 · on Feb 10, 2025

This and its struggles with spelling questions are both artifacts of tokenization and not really a failure of reasoning. I think there's probably a simple solution that solves both this and the "how many r's are there in strawberry" problem, though I don't know what it would be.
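A small illustration of the tokenization point. The split in `hypothetical_tokens` is invented for the example and is not the output of any real tokenizer; actual splits vary by model.

```python
word = "strawberry"

# At the character level the question is trivial.
print(word.count("r"))  # 3

# Hypothetical subword split (an assumption for illustration only).
hypothetical_tokens = ["str", "aw", "berry"]
assert "".join(hypothetical_tokens) == word

# The model receives opaque token IDs, not letters, so the per-token
# letter counts below are not directly visible to it.
per_token_r = [token.count("r") for token in hypothetical_tokens]
print(per_token_r)  # [1, 0, 2]
```

Under a split like this, answering the question means recalling the spelling of each token rather than reading it off the input.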

photonthug · on Feb 10, 2025

This is hilarious, and makes me wonder whether there’s some main place where people are archiving examples of AI fails now. It would be amusing, but also seems like a public service and might help to avoid billions of dollars getting burnt at the altar of hype.

Workaccount2 · on Feb 10, 2025

I don't think there is a way to share the chat, but here is step 28. Almost 20 steps after what I posted and 7 steps before it got it:

"Final Correct Comparison (using place value correctly now - hopefully!):

9.90

9.11

Whole number parts are the same (9).

Compare tenths digits:

9.90: Tenths digit is 9.

9.11: Tenths digit is 1.

No, wait! I'm still making the same error. I'm consistently looking at the wrong digit. Let's re-read the numbers and their tenths digits one last time, very slowly and deliberately.

9.9: The digit immediately after the decimal point is 9. This is the tenths digit.

9.11: The digit immediately after the decimal point is 1. This is the tenths digit.

Okay, I got it now! I was somehow flipping the digits in my comparison. Let's try again.

9.9 and 9.11.

Whole number parts are both 9 (same).

Compare tenths digits:

9.9: Tenths digit is 9.

9.11: Tenths digit is 1.

Wait! This is STILL wrong! I am consistently mis-stating the tenths digits. Let's rewrite the numbers again, very clearly:

Number: 9.9 Tenths digit: 9

Number: 9.11 Tenths digit: 1

No! This is still incorrect. I am fundamentally mis-identifying the tenths digit of 9.11. Let's re-examine place values again.

In 9.11:

9 is in the ones place.

1 is in the tenths place.

1 is in the hundredths place.

In 9.9:

9 is in the ones place.

9 is in the tenths place.

(implicitly) 0 is in the hundredths place.

Correct comparison, finally!

9.9 (or 9.90)

9.11

Whole number parts are the same (9).

Compare tenths digits:

9.90: Tenths digit is 9.

9.11: Tenths digit is 1.

Wait! No! This is still wrong! I am consistently making a mistake. Let's start from the most basic comparison rule."
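For reference, the place-value rule the model keeps restating (compare whole parts, pad the fractional parts with zeros, then compare digit by digit) fits in a few lines. This is a minimal sketch, assuming non-negative inputs written as plain digit strings with a whole part; `compare_decimals` is a made-up name.

```python
def compare_decimals(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if equal, by place value."""
    a_whole, _, a_frac = a.partition(".")
    b_whole, _, b_frac = b.partition(".")

    # Step 1: compare the whole-number parts.
    if int(a_whole) != int(b_whole):
        return 1 if int(a_whole) > int(b_whole) else -1

    # Step 2: pad the shorter fraction with zeros, so 9.9 becomes 9.90.
    width = max(len(a_frac), len(b_frac))
    a_frac = a_frac.ljust(width, "0")
    b_frac = b_frac.ljust(width, "0")

    # Step 3: compare tenths, then hundredths, and so on.
    for digit_a, digit_b in zip(a_frac, b_frac):
        if digit_a != digit_b:
            return 1 if digit_a > digit_b else -1
    return 0


print(compare_decimals("9.9", "9.11"))  # 1, because tenths digit 9 > 1
```

The comparison is already decided at the tenths digit, which is exactly the step the transcript states correctly and then rejects.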

sd9 · on Feb 10, 2025

I wonder if RLHF interfered with 9.11 (which could be interpreted as a date), preventing the model from considering it naturally.

Wonder if the same thing would have happened with 9.12.

What was your original prompt?

Workaccount2 · on Feb 10, 2025

"What is bigger, 9.9 or 9.11?"

m_kos · on Feb 10, 2025

Gemini 2 Pro and Mistral Chat both think 9.11 is bigger.

scotty79 · on Feb 10, 2025

Only Gemini 2.0-flash and Claude 3.5-sonnet got this right for me, and only Claude seems to get such things right with good consistency. It seems to have a strategy for it and applies it well, while the other models are basically guessing.

DeepSeek also gets it right, much like Claude, but it's more verbose in standard mode and much more verbose in R1 deep-thinking mode. The reasoning is verbose but nearly 100% sensible.

That holds especially when it is given the follow-up question "how about 9.8?"

Gemini-2.0-Flash is also consistently correct when "Think step by step." is appended to the prompt.