Assembler in C

First of all, what is an assembler? [1][2]

“Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications.”

Types of assembler:

There are two types of assemblers based on how many passes through the source are needed to produce the executable program.

One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code. The assembler must at least be able to determine the length of each instruction on the first pass so that the addresses of symbols can be calculated.

The advantage of the two-pass assembler is that symbols can be defined anywhere in program source code, allowing programs to be defined in more logical and meaningful ways, making two-pass assembler programs easier to read and maintain.
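The two-pass idea can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: every statement is one word long, and the names `stmt`, `entry`, `first_pass` and `resolve` are hypothetical, not part of the assembler built below.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy program: each statement is one word long and may
   define a label and/or reference one. Not the article's data layout. */
struct stmt { const char *label; const char *ref; };

struct entry { const char *name; int addr; };

/* Pass 1: assign an address to every label. Returns the number of labels. */
static int first_pass(const struct stmt *prog, int n, int start, struct entry *tab)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (prog[i].label != NULL) {
            tab[count].name = prog[i].label;
            tab[count].addr = start + i;   /* every statement has length 1 */
            count++;
        }
    }
    return count;
}

/* Pass 2: look up a referenced name; -1 means "undeclared". */
static int resolve(const struct entry *tab, int count, const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < count; i++)
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].addr;
    return -1;
}
```

Because `first_pass` has seen the whole program before `resolve` is called, a statement may reference a label that is defined later in the source.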

To build an assembler, one first has to define its input and output formats: what the assembler takes as input and what kind of output it produces. Here I am building an assembler based on the book “System Programming and Operating System” by D M Dhamdhere, so I assume the following input form:

            start 101
            read n
            mover breg,one
            movem breg,term
again:   mult breg,term
            add creg,one
            movem creg,term
            comp creg,four
            bc le,again
            div breg,two
            movem breg,result
            print result
            stop
n ds 2
result ds 1
one dc '1'
term ds 1
two dc '2'
four dc '4'
end

And output form:

101) 90113
102) 42116
103) 52117
104) 32117
105) 13116
106) 53117
107) 63119
108) 72104
109) 82118
110) 52115
111) 100115
112) 0 000
113)
115)
116)00 0 001
117)
118)00 0 001
119)00 0 001
120)
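Each instruction line in the listing above appears to consist of the location counter, then the opcode, the register code and a three-digit operand address written back to back. A minimal sketch of that layout follows. The helper name `format_word` and the zero-padded `%03d` address are assumptions for illustration; the assembler below writes the address separately, in its second pass.

```c
#include <assert.h>
#include <stdio.h>

/* Formats one output line as "<location>) <opcode><register><address>".
   Hypothetical helper; returns the number of characters written. */
static int format_word(char *out, size_t size, int lc, int opcode, int reg, int addr)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%d) %d%d%03d", lc, opcode, reg, addr);
}
```

For example, `read n` at location 101 has opcode 9, no register (code 0) and operand address 113, which gives the first line of the listing.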

First, declare the fixed structures that hold the assembler's default information.

struct directive{
     char symbol[10];
     int code;
}dir[5]={"start",1,"end",0,"origin",1,"ltorg",2};

struct mnemtab{
     char mnem[20];
     int opcode;
     int len;
}l[12]={"stop",0,1,"add",1,1,"sub",2,1,"mult",3,1,"mover",4,1,"movem",5,1,
        "comp",6,1,"bc",7,1,"div",8,1,"read",9,1,"print",10,1};

struct regtab{
     char regsym[10];
     int code;
}reg[4]={"areg",1,"breg",2,"creg",3,"dreg",4};

struct dclcode{
     char mne[5];
     int address;
}dcl[2]={"ds",1,"dc",2};

struct compcode{
     char sym[5];
     int code;
}ccode[7]={"lt",1,"le",2,"eq",3,"gt",4,"ge",5,"any",6};

The code above declares the structs that hold the default information used by the program. The first, “directive”, holds the predefined directives and their codes. The second, “mnemtab”, holds each mnemonic with its opcode and length. The third, “regtab”, maps register names to their codes. The fourth, “dclcode”, holds the declaration statements and their codes. The last, “compcode”, holds the comparison conditions and their codes. Besides these, create some more structures that store information as the input file is processed:

struct buffer{
     char lbl[10];
     char m[10];
     char op1[10];
     char op2[10];
}buf;

struct symtab{
     char symbol[10];
     int add;
}sym[50];

struct littab{
     char lit[10];
     int address;
}literal[20];

struct tab_inc{
     int index;
     long fpos;
     int type;
}t_inc[10];

Of the four structs above, the first is a buffer that holds the fields of the line currently being processed. The second is the symbol table, which stores every symbol with its address. The third, “littab”, stores literals and their addresses. The fourth, “tab_inc”, records each operand reference written in pass 1: its index in the symbol or literal table, the position in the output file where the address belongs, and its type (0 for a literal, 1 for a symbol). Now create some getter, setter and update functions for these structs. First, a function that returns the code of a directive:

int get_dir(char a[10]){
      int i;
      for(i=0;i<4;i++){
            if(!strcmp(a,dir[i].symbol))
                 return dir[i].code;
      }
      return -1;
}

Next, update_symtab fills in the address of a symbol that is already in the symbol table when its declaration is reached. It also checks for duplicate declarations: if the symbol already has an address, it reports “Multiple declaration of [symbolName]”.

void update_symtab(FILE *fe){
      int i,j;
      for(i=0;i<sc;i++){
            if(!strcmp(sym[i].symbol,buf.m)){
                   if(sym[i].add==0)
                         sym[i].add=lc;
                   else{
                         printf("\nMultiple declaration of %s",buf.m);
                         fprintf(fe,"\nMultiple declaration of %s",buf.m);
                   }
                   return;
            }
      }
}

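Get the index of a symbol or literal, adding it to the symbol table or literal table if it is not there yet. A label (a name ending in ':') is added with the current location counter as its address.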

int get_sym(char s[10]){
      int i,len=strlen(s);
      if(s[0]=='=') {
            for(i=0;i<ltc;i++){
                  if(!strcmp(literal[i].lit,s)) {
                        t_inc[tc].type=0;
                        return i;
                  }
            }
            strcpy(literal[ltc].lit,s);
            ltc++;
            t_inc[tc].type=0;
            return (ltc-1);
      }
      else if(s[len-1]==':'){
            s[len-1]='\0';
            strcpy(sym[sc].symbol,s);
            sym[sc].add=lc;
            sc++;
            return (sc-1);
      }
      else{
           for(i=0;i<sc;i++){
                 if(!strcmp(s,sym[i].symbol)) {
                        t_inc[tc].type=1;
                        return i;
                 }
           }
           strcpy(sym[sc].symbol,s);
           t_inc[tc].type=1;
           sym[sc].add=0;
           sc++;
           return (sc-1);
      }
}
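The lookup-or-insert pattern used for symbols can also be written so that a label referenced before its definition reuses its existing entry instead of getting a second one. The sketch below is a hypothetical variant, not the author's code: `intern`, `define`, `table` and `nsyms` are illustrative names, and an address of 0 is assumed to mean "not yet defined", as in the symbol table above.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 50

struct symbol { char name[10]; int addr; };   /* addr 0 = not yet defined */

static struct symbol table[MAX_SYMS];
static int nsyms = 0;

/* Returns the index of name, adding an undefined entry if it is new.
   Returns -1 when the table is full. */
static int intern(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    if (nsyms == MAX_SYMS)
        return -1;
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    table[nsyms].name[sizeof table[nsyms].name - 1] = '\0';
    table[nsyms].addr = 0;
    return nsyms++;
}

/* Records the address of a label. Returns 0 on success,
   -1 if the label was already defined or the table is full. */
static int define(const char *name, int lc)
{
    int i = intern(name);
    if (i < 0 || table[i].addr != 0)
        return -1;
    table[i].addr = lc;
    return 0;
}
```

With this split, an operand reference calls `intern` and a label or declaration calls `define`, so a forward reference and its later definition end up sharing one table entry.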

Next is getting register code based on its name.

int get_reg(char a[]) {
      int i;
      for(i=0;i<4;i++) {
            if(!strcmp(a,reg[i].regsym))
                  return reg[i].code;
      }
      return 0;
}

Get the opcode of a mnemonic based on its name.

int get_btcode(char a[20]) {
      int i;
      for(i=0;i<11;i++) {
            if(strcmp(l[i].mnem,a)==0)
                  return l[i].opcode;
      }
      return -1;
}

Get declaration statement code.

int get_dcl(char a[10]) {
      int i;
      for(i=0;i<2;i++) {
            if(!strcmp(dcl[i].mne,a))
                  return dcl[i].address;
      }
      return -1;
}

Clean buffer after processing on it.

void clean_buf() {
      strcpy(buf.lbl,"");
      strcpy(buf.m,"");
      strcpy(buf.op1,"");
      strcpy(buf.op2,"");
}

Get comparison statement code based on its name.

int get_ccode(char a[5]){
      int i;
      for(i=0;i<6;i++) {
            if(!strcmp(ccode[i].sym,a))
                   return ccode[i].code;
      }
      return -1;
}
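Each getter above walks a fixed table with a hand-written loop bound, which is easy to get out of step with the number of entries in the table. One way to avoid that, sketched here for the register table only, is to let the compiler compute the element count. The macro `COUNT_OF` and the function name `lookup_reg` are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

struct regtab { char regsym[10]; int code; };

static const struct regtab reg[] = {
    {"areg", 1}, {"breg", 2}, {"creg", 3}, {"dreg", 4}
};

/* Number of elements, computed by the compiler rather than typed by hand. */
#define COUNT_OF(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Returns the register code, or 0 if the name is not a register. */
static int lookup_reg(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < COUNT_OF(reg); i++)
        if (strcmp(name, reg[i].regsym) == 0)
            return reg[i].code;
    return 0;
}
```

Adding or removing a table entry then needs no change to the loop.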

So far I have declared the data structures for the assembler; now let’s build the processing. First, the input file has to be parsed. The following function tokenizes the file one line at a time and stores each token in the buffer.

void parser(FILE *fi) {
      char a[20];
      int i;
      int ctr=0;
      while((c=getc(fi))!=EOF && (c!='\n')) {
            if((c!=' ') && (c!='\t') && (c!='\n')) {
                  i=0;
                  a[i++]=c;
                  while(((c=getc(fi))!=' ') && (c!='\t') && (c!='\n') && (c!=EOF) && (c!=',')) {
                         a[i++]=c;
                  }
                  a[i]='\0';
                  if(a[0]!='\0'){
                        if(ctr==0){
                             if(a[i-1]==':'){
                                    strcpy(buf.lbl,a);
                                    printf("\nlbl:%s",buf.lbl);
                                    continue;
                              }
                              else{
                                    strcpy(buf.m,a);
                                    printf("\nm:%s",buf.m);
                                    ctr++;
                              }
                        }
                        else if(ctr==1){
                              strcpy(buf.op1,a);
                              printf("\nop1:%s",buf.op1);
                              ctr++;
                        }
                        else if(ctr==2){
                              strcpy(buf.op2,a);
                              printf("\nop2:%s",buf.op2);
                              ctr=0;
                       }
                 }
                 if(c=='\n'){
                       ctr=0;
                       lnctr++;
                       return;
                 }
           }
      }
}
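The same line splitting can be sketched with the standard `strtok` function, which also makes blank lines easy to detect. This is an alternative illustration, not the parser used in the rest of this post: `fields`, `set_field` and `split_line` are hypothetical names, and the field size of 10 mirrors the buffer struct above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct fields { char lbl[10], m[10], op1[10], op2[10]; };

/* Copies src into a 10-byte field, truncating if necessary. */
static void set_field(char dst[10], const char *src)
{
    snprintf(dst, 10, "%s", src);
}

/* Splits one source line into label, mnemonic and up to two operands.
   Returns the number of non-label tokens stored (0 for a blank line).
   The line is modified in place by strtok. */
static int split_line(char *line, struct fields *f)
{
    assert(line != NULL && f != NULL);
    int count = 0;
    memset(f, 0, sizeof *f);
    for (char *tok = strtok(line, " \t\r\n,"); tok != NULL;
         tok = strtok(NULL, " \t\r\n,")) {
        size_t len = strlen(tok);
        if (count == 0 && f->lbl[0] == '\0' && tok[len - 1] == ':')
            set_field(f->lbl, tok);          /* e.g. "again:" */
        else if (count == 0) { set_field(f->m, tok); count++; }
        else if (count == 1) { set_field(f->op1, tok); count++; }
        else if (count == 2) { set_field(f->op2, tok); count++; }
    }
    return count;
}
```

A caller could read each line with `fgets` and skip the line when `split_line` returns 0, so that empty lines in the input are never handed to pass 1.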

Now that a line has been parsed, let’s process it. In pass 1 the assembler builds a table of all symbols and their values. It must at least be able to determine the length of each instruction on this pass so that the addresses of symbols can be calculated. I have created the following function for pass 1 of the assembler.

void pass1(FILE *fe,FILE *fo){
      int code,opcode,dcode;
      static int fstop=1,fend=1;
      long fpos;
      if(fstop){
            if(strcmp(buf.lbl,""))
                  get_sym(buf.lbl);
            opcode=get_btcode(buf.m);
            if(opcode==0){
                  fstop=0;
                  fprintf(fo,"%d) 0 000 \n",lc);
                  lc++;
                  return;
            }
            if(opcode==7){
                  fprintf(fo,"%d) %d",lc,opcode);
                  code=get_ccode(buf.op1);
                  fprintf(fo,"%d",code);
                  code=get_sym(buf.op2);
                  fpos=ftell(fo);
                  t_inc[tc].index=code;
                  t_inc[tc].fpos=fpos;
                  tc++;
                  fprintf(fo,"   \n");
                  lc++;
                  return;
            }
            if(opcode!=-1){
                  fprintf(fo,"%d) %d",lc,opcode);
                  code=get_reg(buf.op1);
                  if(code==0){
                        strcpy(buf.op2,buf.op1);
                  }
                  fprintf(fo,"%d",code);
                  code=get_sym(buf.op2);
                  fpos=ftell(fo);
                  t_inc[tc].index=code;
                  t_inc[tc].fpos=fpos;
                  tc++;
                  lc++;
                  fprintf(fo,"   \n");
            }
            else{
                  dcode=get_dir(buf.m);
                  if(dcode!=-1) {
                        switch(dcode){
                              case 1:
                                    lc=atoi(buf.op1);
                                    break;
                        }
                  }
                  else{
                        fprintf(fo,"%d) ***",lc);
                        code=get_reg(buf.op1);
                        if(code==0) {
                              if(!strcmp(buf.op2,""))
                                    strcpy(buf.op2,buf.op1);
                              else{
                                    fprintf(fo," *** ");
                                    fprintf(fe," Error at line %d :: Illegal register %s",lnctr-1,buf.op1);
                                    code=get_sym(buf.op2);
                                    fpos=ftell(fo);
                                    t_inc[tc].index=code;
                                    t_inc[tc].fpos=fpos;
                                    tc++;
                                    lc++;
                              }
                        }
                        fprintf(fo," %d",code);
                        code=get_sym(buf.op2);
                        fpos=ftell(fo);
                        t_inc[tc].index=code;
                        t_inc[tc].fpos=fpos;
                        tc++;
                        lc++;
                        fprintf(fo,"   \n");
                        printf("\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
                        fprintf(fe,"\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
                  }
            }
      }
      else{
            if(fend) {
                  if(!strcmp(buf.m,"end")){
                        fprintf(fo,"%d)   \n",lc);
                        lc++;
                        fend=0;
                  }
                  else{
                        code=get_dcl(buf.op1);
                        switch(code) {
                              case 1:
                                    fprintf(fo,"%d)  \n",lc);
                                    update_symtab(fe);
                                    lc+=atoi(buf.op2);
                                    break;
                              case 2:
                                    fprintf(fo,"%d)00 0 001  \n",lc);
                                    update_symtab(fe);
                                    lc++;
                                    break;
                        }
                  }
            }
            else if(strcmp(buf.m,"")) {
                  code=decode_lit(buf.m,fe);
                  fprintf(fo,"%d) 00 0 %d  \n",lc,code);
                  lc++;
            }
      }
      clean_buf();
      return;
}

The final piece is pass 2 of the assembler. In pass 2 the assembler uses the tables built in the first pass to fill in the addresses left blank in the output. For that I have created the following function:

void pass2(FILE *fop){
      int i,adr,x;
      long pos;
      for(i=0;i<tc;i++){
            pos=t_inc[i].fpos;
            adr=t_inc[i].index;
            fseek(fop,pos,SEEK_SET);
            if(t_inc[i].type==0) {
                  if(literal[adr].address!=0)
                        /* exactly 3 characters, so the newline after the placeholder survives */
                        fprintf(fop,"%3d",literal[adr].address);
                  else{
                        fprintf(fop,"***");
                        printf("\n Undeclared literal: %s",literal[adr].lit);
                        fprintf(fop,"\n Undeclared literal: %s",literal[adr].lit);
                  }
            }
            else{
                  if(sym[adr].add!=0)
                        fprintf(fop,"%3d",sym[adr].add);
                  else{
                        fprintf(fop,"***");
                        printf("\n Undeclared symbol: %s",sym[adr].symbol);
                        fprintf(fop,"\n Undeclared symbol: %s",sym[adr].symbol);
                  }
            }
            }
      }
}
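The technique behind pass 2 is often called backpatching: pass 1 leaves a fixed-width hole in the output and remembers its file offset, and pass 2 seeks back and overwrites it. The self-contained sketch below shows the idea on a temporary file. The function names `emit_with_hole` and `patch_hole` are hypothetical, and it assumes addresses never need more than three digits.

```c
#include <assert.h>
#include <stdio.h>

/* Writes "<lc>) <opcode><reg>" followed by a three-character placeholder
   and returns the file offset of that placeholder (or -1 on error). */
static long emit_with_hole(FILE *out, int lc, int opcode, int reg)
{
    assert(out != NULL);
    if (fprintf(out, "%d) %d%d", lc, opcode, reg) < 0)
        return -1;
    long hole = ftell(out);
    if (fputs("   \n", out) == EOF)
        return -1;
    return hole;
}

/* Overwrites the placeholder at offset hole with a three-digit address.
   Writing exactly three characters leaves the newline after it intact. */
static int patch_hole(FILE *out, long hole, int addr)
{
    assert(out != NULL && addr >= 0 && addr <= 999);
    if (fseek(out, hole, SEEK_SET) != 0)
        return -1;
    return fprintf(out, "%03d", addr) == 3 ? 0 : -1;
}
```

The width matters: if the patch writes more characters than the placeholder holds, it overwrites the line ending or the start of the next line.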

And it’s done: the assembler is ready to run. Go ahead, try it, and share your feedback.

References:-

Note: This is not a perfect assembler and may have some bugs. It is meant only for understanding basic assembler concepts. The code above is also not optimized.


25 thoughts on “Assembler in C”

James Grenning says:

August 22, 2011 at 2:36 pm

Let’s see your automated tests.

1. harryjoy says:
  
  August 27, 2011 at 3:07 pm
  
  I have not done automated tests. I have tested this code manually by providing some different input files and analyzing the created output file.
  
Chirag says:

August 27, 2011 at 1:38 pm

what about the main() function ?

1. harryjoy says:
  
  August 27, 2011 at 3:05 pm
  In the main function you have to call the three main functions, parser, pass1 and pass2, in order. Open the input file and loop over its contents, calling parser and pass1 for each line. Once the whole file has been processed, call pass2. A sample main function looks like this:
```
int main(void) {
	FILE *fi,*fe,*fo;
	fi=fopen("asm1.txt","r");
	fe=fopen("asm1.txt","a+");
	fo=fopen("asm.txt","w");
	if(fi==NULL || fe==NULL || fo==NULL){
		printf("\nCannot open input or output file");
		return 1;
	}
	fprintf(fe,"\n\n");
	while(c!=EOF){
		parser(fi);
		printf("\n");
		pass1(fe,fo);
	}
	pass2(fo);
	fclose(fi);
	fclose(fe);
	fclose(fo);
	return 0;
}
```
  The global variables used above are declared as follows:
```
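/* sc: symbol count, lc: location counter, tc: entries in t_inc, ltc: literal count, lnctr: input line counter */
static int sc=0,lc=0,tc=0,ltc=0,lnctr=1;
int c=0;   /* int, not char, so that it can hold the EOF value returned by getc() */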
```
Ankit says:

November 5, 2011 at 1:22 pm

was searching for main() only thnx…

In pass1() it is giving error to decode_lit()….. Please help soon…..

Here it is:

int decode_lit(char a[10],FILE *fe)
{
	int i,length,j=0;
	char val[5];
	length=strlen(a);
	for(i=0;i<ltc;i++)
	{
		if(!strcmp(literal[i].lit,buf.m))
		{
			if(literal[i].address==0)
				literal[i].address=lc;
			else
			{
				printf("\n Multiple declaration of %s",buf.m);
				fprintf(fe,"\n Multiple declaration of %s",buf.m);
			}
		}
	}

	for(i=2;i<length && j<4;i++,j++)
		val[j]=a[i];
	val[j]='\0';	/* terminate the string before calling atoi */

	return atoi(val);
}

Hope this helps.

Ankit says:

November 27, 2011 at 7:57 pm

Lets see….. I will try this and hope it helps me in solving my problem. I have to develop a generalized assembler which will catch all errors and if there are no errors then creates target code. Hope this helps me. And does this program checks for mnemonics?

1. harryjoy says:
  
  November 27, 2011 at 9:33 pm
  
  You are at the right place: this code does the same, and yes, it stores mnemonics in “mnemtab”.
  
Ankit says:

November 27, 2011 at 7:59 pm

And thanks a lot for decode_lit().

Ankit says:

December 20, 2011 at 7:59 pm

sorry fr late reply bt thnx alot for d code……it helped alot

1. harryjoy says:
  
  December 21, 2011 at 9:11 am
  
  You are welcome. You can appreciate my work by including my name and this website address, in the comments/documents/credits, where you use this code.
  
Rockstar says:

April 2, 2013 at 2:00 am

does this assembler work??? i havent seen yet…
…..reply plzz

1. harryjoy says:
  
  April 2, 2013 at 9:34 am
  
  Yes, it works. Have you tried running it? Did you face any issues running it?
  
Rockstar says:

April 2, 2013 at 3:48 pm

ohh….no..i didnt check it……….. can u tell me how to run it in ubuntu….i mean what is the process to run……and also i need assembler for 8085 processors…will it work for 8085 processor also…
??

1. harryjoy says:
  
  April 4, 2013 at 2:31 pm
  
  Hi,
  
  you can run it same as you run any C/C++ program on Ubuntu. You can easily get steps to run a C program in Ubuntu on internet. I will get back to you soon on the 8085 processor question.
  
  Thanks.
  
Rockstar says:

April 6, 2013 at 3:32 pm

can i know where “location counter” is used?? reply plzz…urgent

Rockstar says:

April 6, 2013 at 6:28 pm

and another thing is that what are these “sc”,”lc” , “ltc”..and many more things that are used in this code???..like i<sc…what does this sc and other things refer to??

Maria Violeta Salutan says:

October 6, 2013 at 8:50 pm

mr. harryjoy when i run your code, my result is kinda bit different from your output. I was wondering why does all the number 5 at the beginning turn to *** in my output ?

101) 90113
102) 42116
103) *** 2117
104) 32117
105) 13116
106) *** 3117
107) 63119
108) 72104
109) 82118
110) *** 2115
111) 100115
112) 0 000
113)
115)
11600 0 001
117)
11800 0 001
11900 0 001
120)

1. harryjoy says:
  
  October 9, 2013 at 11:59 pm
  
  Hi Maria,
  
  In normal scenarios, it should print the stars if the symbol that it is trying to parse is invalid or unknown. So have a look at your configuration and input file; you might have a symbol or variable that is unknown to your assembler.
  
  Thanks.
  
saugata says:

October 28, 2013 at 1:44 pm

Dear,
Thanx for such a step by step analysis, yet after running, I have seen a error:
“Error at line 5 :: Unknown mnemonics: again:”. How can I resolve this issue?
Can u pls tell me, how can i generate internediate code from your pass 1

saugata28 says:

October 28, 2013 at 1:48 pm

Dear,
Thank you very much for ur such a wealthy post and analysis. Yet, I found error after running the code which is :” Error at line 5 :: Unknown mnemonics: again:” and object code is not generated. Can you pls tell me how to resolve this? MOreover, I willbe glad if u advice me how to generate intermediate code from your pass1.
Thank You

1. sourav says:
  
  January 29, 2014 at 11:43 pm
  
  same error with me reply if you have corrected it.
  
2. harryjoy says:
  
  January 29, 2014 at 11:52 pm
  
  The basic reason could be that the mnemonic used in the input file is missing from, or misspelled in, the mnemonics struct in the code. The first step would be to check for that.
  
  To generate intermediate output of pass1, drop calling pass2 and shut the file read/write and check the output file.
  
  Regards
