In this project you will demonstrate your understanding of arrays, strings, and functions. You may also use typedefs and structs if you wish (see Chapter 8) – and will probably find the program easier to assemble if you do – but you are not required to use them in order to obtain full marks. You should not make any use of malloc() (Chapter 10) or file operations (Chapter 11) in this project.
Superstring Assembly
Suppose that multiple copies of a long string are cut up into much smaller pieces. If you are given the available fragments, is it possible to rebuild the larger string? For ple, if the original string was "algorithmsarefunfunfun", and you got given a list of fragments like "refun" and "unfun" "rith" and "lgo" and so on, plus lots more small fragments, could you rebuild the original string? Probably you could, right? But what about if the original string was millions of characters long, like War and Peace? Or billions of characters long, like your genetic sequence?
Let’s ask a slightly different question now: if you got given a set of small string fragments each from a larger string that you didn’t know anything about, could you at least find the shortest string in which every supplied fragment occurs at least once? This problem is known as the shortest superstring problem, and can be thought of as having parameters n, the number of string fragments provided, and m, the total length of those string fragments. The shortest superstring problem is very important in bioinformatics, where genome reassembly from short fragments called reads helps contribute to our understanding of the behavior of organisms.
Unfortunately, finding optimal solutions to the shortest superstring problem has been shown to be very challenging, and falls into a family of intractable problems that we’ll talk about later in the semester. In this project you will implement some approximation techniques that generate non-optimal superstrings from a collection of string fragments.
Input Data
Input to your program will (always) come from stdin, and will (always) consist of strictly lower-case alphabetic strings, one per input line, each line terminated by a single newline ’\n’ character. Be sure to read the message on the FAQ page about newlines. For ple, the following file test0.txt containing six fragments is a valid input:
accgtcgatg gcctag gtacctacta cgatgcc tcgatgccgca atgagaccgtc
As you develop your program according to the stages listed below, the output will evolve. Full output ples for this and two other test files are linked from the FAQ page. You should also check your program against other inputs too; and can easily generate your own tests using a text editor. Testing and debugging is your responsibility.
In terms of developing your program for this project, you may assume that no string fragment will be longer than 20 characters, and that there will be at most 1,000 such fragments supplied. You may also use brute-force programming rather than try and use more elegant data structures and algorithms
– this activity is primarily about C programming rather than algorithm design.
Stage 0 – Reading Strings (0/15 marks)
Before tackling the more interesting parts of the project, your very first program should read the set of input fragments into an array of strings, and then print them out again so that you can be sure you have read them correctly. You can do that as part of your DEBUG mode code, see the suggestion below. Each of the following stages will then require a function that carries out the required processing, in each case working through the strings that were read, and then building (and selectively writing) the required superstring.
Stage 1 – Overlapping Suffixes (8/15 marks)
In this first stage you are to build a superstring according to the following process: start with the first of the input fragments and use it to initialize a partial superstring; then process every other fragment in input