
I just found an O(M + N) diffing algorithm:

  $ my-diff yours mine
  --- yours 2019-10-28 15:11:07.150118090 -0700
  +++ mine  2019-10-28 16:46:42.093923066 -0700
  @@ -1,50 +1,75 @@
  - yours
  - yours
  [...]
  - yours
  + mine
  + mine
  [...]
  + mine
Spit out the diff and hunk header, then dump yours preceded by "- " and then mine preceded by "+ ".
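The joke algorithm above can be sketched in a few lines. This is a minimal, hypothetical illustration (the `my-diff` tool itself isn't real): read both files once, emit one hunk that deletes all of "yours" and inserts all of "mine".

```python
# A minimal sketch of the "O(M + N)" diff described above: one pass over
# each input, one hunk, no attempt to find common lines.
def trivial_diff(yours_lines, mine_lines):
    m, n = len(yours_lines), len(mine_lines)
    out = ["--- yours", "+++ mine", f"@@ -1,{m} +1,{n} @@"]
    out += ["-" + line for line in yours_lines]  # delete left
    out += ["+" + line for line in mine_lines]   # insert right
    return "\n".join(out)

print(trivial_diff(["old line 1", "old line 2"], ["new line 1"]))
```

(A real unified diff prefixes lines with "-"/"+" and no space; the transcript above adds a space for readability.)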


This is pretty great, the edit script is always two steps.

1. delete left

2. insert right


There's a non-trivial sense in which you'd be justified in calling this O(1), inasmuch as the edit script "delete yours, insert mine" can be yielded up without even reading "yours" or "mine", simply referring to them by reference. That's a more abstract sense than something "diff" could produce. However, with at most a slight change to "diff" (and I'm not sure it's even that; it may have this already, I just haven't checked for an HN post), you can yield an O(1) "delete yours" and then insert mine, so it's O(N) without the M [1].

In UNIX terms you can just spec out "cp mine yours" as the edit script, which is O(1) from the perspective of the diff algorithm. Applying that script is O(N), of course, but generating it is constant.

1: Super pedants will note that O(N + M) isn't terribly meaningful here, since adding two linear terms means you can reduce it to O(N) under the big-O rules. However, I've always liked keeping information like that in the O notation, in some sense precisely because O(N + M) is-equivalent-to O(N) in big-O terms, but if you have additional knowledge, like that M is for some reason exponential in a term you care about, you can generate the new big-O term that you care about. In graph theory, my professor tended to want to express O(edges) as O(nodes^2), but I always liked retaining O(edges), because O(edges) yields useful information if you know you have a sparse graph, for instance, whereas O(nodes^2) removes that. To his credit, he never marked me down for it, which puts him ahead of a lot of professors and their personal preferences...


O(N + M) is not equivalent to O(N) if you make no assumptions about the relative growths of N and M; that's why it's actually meaningful to keep both terms around. You can only reduce it to O(N) when N dominates M (M = O(N)).


In the case of file size inputs to a diff algorithm, it would not normally make sense to discuss how one file is going to be "exponentially" larger than another or something. They just are what they are, the bare inputs to the function in question.

By contrast, graphs can have significantly different relationships between nodes and edges, for instance, with sparse vs. dense graphs.

One could construct a situation in which the file sizes vary relative to each other in some O(...)-significant way, but it would be very strained, especially in the context of a diff algorithm (why are you trying to diff things significantly different in size at all?). You can fiddle with almost any O(...) pronouncement that way, e.g., "adding two integers isn't O(1)". True enough, but generally, more detail than we are looking for.


You’re wrong: the sizes of both input files are commonly considered when analysing algorithms that work on two files. Consider a problem that could be solved by sorting one of the files and then stepping through both files item by item (not saying that any meaningful problem with this property exists, it’s just an example). Then you would have a complexity of O(n log n + m), assuming n is the size of the file being sorted.
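One concrete (if contrived, matching the hypothetical spirit of the example above) problem with that shape: merge two files' lines into one sorted stream when "mine" is already sorted but "yours" is not. Sorting "yours" costs O(n log n), and the merge itself walks each list once, O(n + m), for O(n log n + m) total. A sketch using Python's standard library:

```python
# Hypothetical example: sort one input, then step through both in a
# single merge pass. Total cost: O(n log n) for the sort + O(n + m)
# for the merge = O(n log n + m).
import heapq

def merge_files(yours_lines, mine_sorted):
    # heapq.merge does a linear-time merge of already-sorted iterables.
    return list(heapq.merge(sorted(yours_lines), mine_sorted))

print(merge_files(["pear", "apple"], ["banana", "cherry"]))
# -> ['apple', 'banana', 'cherry', 'pear']
```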

There is valuable information in running time being linear in the size of both files. You could define n as the combined size if you wanted to, but that could be confusing.



