Assembler in C

First of all, what is an assembler? [1] [2]

Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications.

Types of assembler:

There are two types of assemblers based on how many passes through the source are needed to produce the executable program.

  • One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them.
  • Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code. The assembler must at least be able to determine the length of each instruction on the first pass so that the addresses of symbols can be calculated.

The advantage of the two-pass assembler is that symbols can be defined anywhere in program source code, allowing programs to be defined in more logical and meaningful ways, making two-pass assembler programs easier to read and maintain.
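The two-pass idea can be sketched in a few lines. This is only an illustration, assuming every source line occupies one word; the names `line`, `symbol`, `collect_symbols` and `resolve` are hypothetical and do not appear in the assembler built below.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature program: each line has an optional label and an
   optional symbolic operand. */
struct line { const char *label; const char *operand; };

struct symbol { const char *name; int address; };

/* Pass 1: assign an address to every label. Returns the symbol count.
   Assumes one word per line, so line i lives at address start + i. */
static int collect_symbols(const struct line *prog, int n, int start,
                           struct symbol *table)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (prog[i].label != NULL) {
            table[count].name = prog[i].label;
            table[count].address = start + i;
            count++;
        }
    }
    return count;
}

/* Pass 2: look up the address of an operand; -1 if it was never defined. */
static int resolve(const struct symbol *table, int count, const char *name)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].address;
    return -1;
}
```

Because the table is complete before any operand is resolved, a line may refer to a symbol that is defined further down in the source.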

To build an assembler, one first has to define its input and output forms, that is, what the assembler takes as input and what kind of output it produces. Here I am building an assembler based on the book “System Programming and Operating System” by D M Dhamdhere, so I am assuming the following input form:

         start 101
         read n
         mover breg,one
         movem breg,term
again:   mult breg,term
         add creg,one
         movem creg,term
         comp creg,four
         bc le,again
         div breg,two
         movem breg,result
         print result
         stop
n ds 2
result ds 1
one dc '1'
term ds 1
two dc '2'
four dc '4'
end

And the following output form:

101) 90113
102) 42116
103) 52117
104) 32117
105) 13116
106) 53117
107) 63119
108) 72104
109) 82118
110) 52115
111) 100115
112) 0 000
113)
115)
116)00 0 001
117)
118)00 0 001
119)00 0 001
120)
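Each instruction line of this output is the address, then the opcode, then the register code (or the condition code for `bc`), then the address of the operand. A minimal sketch of that layout follows; `format_word` is a hypothetical helper used only for illustration.

```c
#include <stdio.h>

/* Formats one output line as "<address>) <opcode><register><operand address>".
   format_word is a hypothetical helper, not part of the original program.
   Returns the value of snprintf (the length the full text needs). */
static int format_word(char *out, size_t size, int lc, int opcode,
                       int reg, int operand_addr)
{
    return snprintf(out, size, "%d) %d%d%d", lc, opcode, reg, operand_addr);
}
```

For example, `mover breg,one` at address 102 has opcode 4, register code 2 and operand address 116, which gives `102) 42116`.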

First, declare the fixed structures that hold the default information the assembler uses.

struct directive{
     char symbol[10];
     int code;
}dir[5]={"start",1,"end",0,"origin",1,"ltorg",2};

struct mnemtab{
     char mnem[20];
     int opcode;
     int len;
}l[12]={"stop",0,1,"add",1,1,"sub",2,1,"mult",3,1,"mover",4,1,"movem",5,1,
        "comp",6,1,"bc",7,1,"div",8,1,"read",9,1,"print",10,1};

struct regtab{
     char regsym[10];
     int code;
}reg[4]={"areg",1,"breg",2,"creg",3,"dreg",4};

struct dclcode{
     char mne[5];
     int address;
}dcl[2]={"ds",1,"dc",2};

struct compcode{
     char sym[5];
     int code;
}ccode[7]={"lt",1,"le",2,"eq",3,"gt",4,"ge",5,"any",6};

The code above declares all the structs that hold the default information used by the program. The first, “directive”, holds the predefined directives and their codes. The second, “mnemtab”, holds each mnemonic with its opcode and length. The third, “regtab”, holds register names and their codes. The fourth, “dclcode”, holds the declaration statements and their codes. The last, “compcode”, holds the comparison (condition) symbols and their codes. Besides these, create some more structures that store information while the input file is processed:

struct buffer{
     char lbl[10];
     char m[10];
     char op1[10];
     char op2[10];
}buf;

struct symtab{
     char symbol[10];
     int add;
}sym[50];

struct littab{
     char lit[10];
     int address;
}literal[20];

struct tab_inc{
     int index;
     long fpos;
     int type;
}t_inc[10];

Of the four structs above, the first is a buffer that stores the fields of the line currently being processed. The second is the symbol table, which stores every symbol and its address. The third, “littab”, stores literals and their addresses. The fourth, “tab_inc”, records each operand reference that pass 1 leaves unresolved: the index of the symbol or literal, the position in the output file where its address belongs, and its type (0 for a literal, 1 for a symbol). Now create some getter, setter and update functions for these structs. The first returns the directive code for a given directive:

int get_dir(char a[10]){
      int i;
      for(i=0;i<4;i++){
            if(!strcmp(a,dir[i].symbol))
                 return dir[i].code;
      }
      return -1;
}

update_symtab fills in the address of a symbol that is already in the symbol table. It also checks for duplicate declarations: if the symbol already has an address, it shows the warning “Multiple declaration of [symbolName]”.

void update_symtab(FILE *fe){
      int i,j;
      for(i=0;i<sc;i++){
            if(!strcmp(sym[i].symbol,buf.m)){
                   if(sym[i].add==0)
                         sym[i].add=lc;
                   else{
                         printf("\nMultiple declaration of %s",buf.m);
                         fprintf(fe,"\nMultiple declaration of %s",buf.m);
                   }
                   return;
            }
      }
}

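get_sym returns the index of a symbol or literal, adding it to the symbol table or the literal table if it is not there yet. A name that ends with “:” is a label and is entered with the current location counter as its address.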

int get_sym(char s[10]){
      int i,len=strlen(s);
      if(s[0]=='=') {
            for(i=0;i<ltc;i++){
                  if(!strcmp(literal[i].lit,s)) {
                        t_inc[tc].type=0;
                        return i;
                  }
            }
            strcpy(literal[ltc].lit,s);
            ltc++;
            t_inc[tc].type=0;
            return (ltc-1);
      }
      else if(s[len-1]==':'){
            s[len-1]='\0';
            strcpy(sym[sc].symbol,s);
            sym[sc].add=lc;
            sc++;
            return (sc-1);
      }
      else{
           for(i=0;i<sc;i++){
                 if(!strcmp(s,sym[i].symbol)) {
                        t_inc[tc].type=1;
                        return i;
                 }
           }
           strcpy(sym[sc].symbol,s);
           t_inc[tc].type=1;
           sym[sc].add=0;
           sc++;
           return (sc-1);
      }
}
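The find-or-insert behaviour of get_sym and the duplicate check of update_symtab can also be written against an explicit table with bounds checks. The following is a sketch under that assumption, not the code used in this post; `sym_table`, `sym_find_or_add` and `sym_define` are hypothetical names, and address 0 means "not defined yet", as in the code above.

```c
#include <string.h>

#define MAX_SYMBOLS 50
#define NAME_LEN 10

struct sym_entry { char name[NAME_LEN]; int address; };

struct sym_table { struct sym_entry entries[MAX_SYMBOLS]; int count; };

/* Returns the index of name, adding it with address 0 (undefined) if absent.
   Returns -1 when the table is full or the name does not fit. */
static int sym_find_or_add(struct sym_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return i;
    if (t->count == MAX_SYMBOLS || strlen(name) >= NAME_LEN)
        return -1;
    strcpy(t->entries[t->count].name, name);
    t->entries[t->count].address = 0;
    return t->count++;
}

/* Records the address of a symbol. Returns 0 on success, -1 if the symbol
   already has an address (a duplicate definition) or cannot be stored. */
static int sym_define(struct sym_table *t, const char *name, int address)
{
    int i = sym_find_or_add(t, name);
    if (i < 0 || t->entries[i].address != 0)
        return -1;
    t->entries[i].address = address;
    return 0;
}
```

Returning an error code instead of printing lets the caller decide where a "Multiple declaration" message should go.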

Next is getting register code based on its name.

int get_reg(char a[]) {
      int i;
      for(i=0;i<4;i++) {
            if(!strcmp(a,reg[i].regsym))
                  return reg[i].code;
      }
      return 0;
}

Get a mnemonic's opcode from its name:

int get_btcode(char a[20]) {
      int i;
      for(i=0;i<11;i++) {
            if(strcmp(l[i].mnem,a)==0)
                  return l[i].opcode;
      }
      return -1;
}

Get declaration statement code.

int get_dcl(char a[10]) {
      int i;
      for(i=0;i<2;i++) {
            if(!strcmp(dcl[i].mne,a))
                  return dcl[i].address;
      }
      return -1;
}

Clear the buffer after processing it.

void clean_buf() {
      strcpy(buf.lbl,"");
      strcpy(buf.m,"");
      strcpy(buf.op1,"");
      strcpy(buf.op2,"");
}

Get comparison statement code based on its name.

int get_ccode(char a[5]){
      int i;
      for(i=0;i<7;i++) {
            if(!strcmp(ccode[i].sym,a))
                   return ccode[i].code;
      }
      return -1;
}
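The lookup functions above each hard-code their loop bound, which is easy to get out of step with the table size. One possible alternative, shown here only as a sketch, derives the bound from the array itself. `ARRAY_LEN`, `name_code`, `lookup_code` and `register_code` are hypothetical names; the register names and codes are the ones from “regtab”.

```c
#include <stddef.h>
#include <string.h>

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

struct name_code { const char *name; int code; };

/* Same register names and codes as the regtab table. */
static const struct name_code registers[] = {
    {"areg", 1}, {"breg", 2}, {"creg", 3}, {"dreg", 4}
};

/* Generic lookup; returns fallback when name is not in the table. */
static int lookup_code(const struct name_code *table, size_t len,
                       const char *name, int fallback)
{
    for (size_t i = 0; i < len; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].code;
    return fallback;
}

/* Register lookup that returns 0 for unknown names, like get_reg. */
static int register_code(const char *name)
{
    return lookup_code(registers, ARRAY_LEN(registers), name, 0);
}
```

With this layout, adding an entry to a table needs no change to any loop.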

So far I have declared the data structures for the assembler; now let’s build the processing part. First the input file has to be parsed to extract the required information. For this I use the following function, which tokenizes the file one line at a time and stores each token in the buffer.

void parser(FILE *fi) {
      char a[20];
      int i;
      int ctr=0;
      while((c=getc(fi))!=EOF && (c!='\n')) {
            if((c!=' ') && (c!='\t') && (c!='\n')) {
                  i=0;
                  a[i++]=c;
                  while(((c=getc(fi))!=' ') && (c!='\t') && (c!='\n') && (c!=EOF) && (c!=',')) {
                         a[i++]=c;
                  }
                  a[i]='\0';
                  if(a[0]!='\0'){
                        if(ctr==0){
                             if(a[i-1]==':'){
                                    strcpy(buf.lbl,a);
                                    printf("\nlbl:%s",buf.lbl);
                                    continue;
                              }
                              else{
                                    strcpy(buf.m,a);
                                    printf("\nm:%s",buf.m);
                                    ctr++;
                              }
                        }
                        else if(ctr==1){
                              strcpy(buf.op1,a);
                              printf("\nop1:%s",buf.op1);
                              ctr++;
                        }
                        else if(ctr==2){
                              strcpy(buf.op2,a);
                              printf("\nop2:%s",buf.op2);
                              ctr=0;
                       }
                 }
                 if(c=='\n'){
                       ctr=0;
                       lnctr++;
                       return;
                 }
           }
      }
}
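The same splitting into label, mnemonic and operands can be sketched on a line that is already in memory, using `strtok` instead of reading character by character. This is an illustrative alternative, not the parser used above; `fields`, `copy_field` and `parse_line` are hypothetical names, and the field names mirror the `buffer` struct.

```c
#include <string.h>

#define FIELD_LEN 10

struct fields {
    char lbl[FIELD_LEN];
    char m[FIELD_LEN];
    char op1[FIELD_LEN];
    char op2[FIELD_LEN];
};

/* Copies src into dst, truncating so that dst is always terminated. */
static void copy_field(char *dst, const char *src)
{
    strncpy(dst, src, FIELD_LEN - 1);
    dst[FIELD_LEN - 1] = '\0';
}

/* Splits one source line into label, mnemonic and up to two operands.
   Returns the number of fields filled (0 for a blank line). */
static int parse_line(const char *line, struct fields *out)
{
    static const char delims[] = " \t,\r\n";
    char copy[128];
    char *tok;
    int filled = 0;

    memset(out, 0, sizeof *out);
    strncpy(copy, line, sizeof copy - 1);   /* strtok modifies its input */
    copy[sizeof copy - 1] = '\0';

    tok = strtok(copy, delims);
    if (tok != NULL && tok[strlen(tok) - 1] == ':') {   /* leading label */
        copy_field(out->lbl, tok);
        filled++;
        tok = strtok(NULL, delims);
    }
    if (tok != NULL) { copy_field(out->m, tok);   filled++; tok = strtok(NULL, delims); }
    if (tok != NULL) { copy_field(out->op1, tok); filled++; tok = strtok(NULL, delims); }
    if (tok != NULL) { copy_field(out->op2, tok); filled++; }
    return filled;
}
```

A declaration such as `n ds 2` ends up with `n` in `m`, `ds` in `op1` and `2` in `op2`, which is the arrangement pass 1 expects.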

Now that a line has been parsed, let’s process it. In pass 1 the assembler creates a table of all symbols and their values. It must at least be able to determine the length of each instruction in this pass so that the addresses of symbols can be calculated. I have created the following function for pass 1 of the assembler.

void pass1(FILE *fe,FILE *fo){
      int code,opcode,dcode;
      static int fstop=1,fend=1;
      long fpos;
      if(fstop){
            if(strcmp(buf.lbl,""))
                  get_sym(buf.lbl);
            opcode=get_btcode(buf.m);
            if(opcode==0){
                  fstop=0;
                  fprintf(fo,"%d) 0 000 \n",lc);
                  lc++;
                  return;
            }
            if(opcode==7){
                  fprintf(fo,"%d) %d",lc,opcode);
                  code=get_ccode(buf.op1);
                  fprintf(fo,"%d",code);
                  code=get_sym(buf.op2);
                  fpos=ftell(fo);
                  t_inc[tc].index=code;
                  t_inc[tc].fpos=fpos;
                  tc++;
                  fprintf(fo,"   \n");
                  lc++;
                  return;
            }
            if(opcode!=-1){
                  fprintf(fo,"%d) %d",lc,opcode);
                  code=get_reg(buf.op1);
                  if(code==0){
                        strcpy(buf.op2,buf.op1);
                  }
                  fprintf(fo,"%d",code);
                  code=get_sym(buf.op2);
                  fpos=ftell(fo);
                  t_inc[tc].index=code;
                  t_inc[tc].fpos=fpos;
                  tc++;
                  lc++;
                  fprintf(fo,"   \n");
            }
            else{
                  dcode=get_dir(buf.m);
                  if(dcode!=-1) {
                        switch(dcode){
                              case 1:
                                    lc=atoi(buf.op1);
                                    break;
                        }
                  }
                  else{
                        fprintf(fo,"%d) ***",lc);
                        code=get_reg(buf.op1);
                        if(code==0) {
                              if(!strcmp(buf.op2,""))
                                    strcpy(buf.op2,buf.op1);
                              else{
                                    fprintf(fo," *** ");
                                    fprintf(fe," Error at line %d :: Illegal register %s",lnctr-1,buf.op1);
                                    code=get_sym(buf.op2);
                                    fpos=ftell(fo);
                                    t_inc[tc].index=code;
                                    t_inc[tc].fpos=fpos;
                                    tc++;
                                    lc++;
                              }
                        }
                        fprintf(fo," %d",code);
                        code=get_sym(buf.op2);
                        fpos=ftell(fo);
                        t_inc[tc].index=code;
                        t_inc[tc].fpos=fpos;
                        tc++;
                        lc++;
                        fprintf(fo,"   \n");
                        printf("\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
                        fprintf(fe,"\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
                  }
            }
      }
      else{
            if(fend) {
                  if(!strcmp(buf.m,"end")){
                        fprintf(fo,"%d)   \n",lc);
                        lc++;
                        fend=0;
                  }
                  else{
                        code=get_dcl(buf.op1);
                        switch(code) {
                              case 1:
                                    fprintf(fo,"%d)  \n",lc);
                                    update_symtab(fe);
                                    lc+=atoi(buf.op2);
                                    break;
                              case 2:
                                    fprintf(fo,"%d)00 0 001  \n",lc);
                                    update_symtab(fe);
                                    lc++;
                                    break;
                        }
                  }
            }
            else if(strcmp(buf.m,"")) {
                  code=decode_lit(buf.m,fe);
                  fprintf(fo,"%d) 00 0 %d  \n",lc,code);
                  lc++;
            }
      }
      clean_buf();
      return;
}
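The way pass 1 advances the location counter can be isolated in a small function: every instruction and every `dc` takes one word, while `ds n` reserves n words. The sketch below assumes exactly that rule; `next_lc` is a hypothetical helper, and treating a missing or non-positive `ds` count as one word is an assumption added here, not something pass1 does.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the location counter after one line.
   kind is "ds", "dc" or an instruction mnemonic; operand is the text that
   follows it. Only "ds" reserves more than one word. */
static int next_lc(int lc, const char *kind, const char *operand)
{
    if (strcmp(kind, "ds") == 0) {
        int words = atoi(operand);
        /* Assumption: fall back to one word for a bad or missing count. */
        return lc + (words > 0 ? words : 1);
    }
    return lc + 1;
}
```

Applied to the sample input, `n ds 2` at 113 moves the counter to 115, which is why the output form above has no line 114.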

The final piece is pass 2 of the assembler. In pass 2 the assembler uses the tables created in the first pass to fill in the operand addresses in the generated code. For that I have created the following function:

void pass2(FILE *fop){
      int i,adr,x;
      long pos;
      for(i=0;i<tc;i++){
            pos=t_inc[i].fpos;
            adr=t_inc[i].index;
            fseek(fop,pos,0);
            if(t_inc[i].type==0) {
                  if(literal[adr].address!=0)
                        fprintf(fop,"%d \t",literal[adr].address);
                  else{
                        fprintf(fop,"***");
                        printf("\n Undeclared literal: %s",literal[adr].lit);
                        fprintf(fop,"\n Undeclared literal: %s",literal[adr].lit);
                  }
            }
            else{
                  if(sym[adr].add!=0)
                        fprintf(fop,"%d\t",sym[adr].add);
                  else{
                        fprintf(fop,"***");
                        printf("\n Undeclared symbol: %s",sym[adr].symbol);
                        fprintf(fop,"\n Undeclared symbol: %s",sym[adr].symbol);
                  }
            }
      }
}
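The ftell/fseek technique that links pass 1 and pass 2 can be tried in isolation. The sketch below writes one line with a three-character placeholder into a temporary file, remembers the position, and later overwrites the placeholder with the address. `backpatch_demo` is a hypothetical function, and the restriction to addresses from 0 to 999 is an assumption made so that the patch is never wider than the placeholder.

```c
#include <stdio.h>

/* Writes "101) 90" plus a three-character placeholder, then patches the
   placeholder with address. Stores the final line in out.
   Returns 0 on success, -1 on failure or if address is not in 0..999. */
static int backpatch_demo(char *out, size_t size, int address)
{
    FILE *f;
    long pos;
    int ok = -1;

    if (address < 0 || address > 999)
        return -1;
    f = tmpfile();
    if (f == NULL)
        return -1;

    fprintf(f, "101) 90");
    pos = ftell(f);              /* where the address belongs */
    fprintf(f, "   \n");         /* reserve exactly three characters */

    if (pos >= 0 && fseek(f, pos, SEEK_SET) == 0) {
        fprintf(f, "%3d", address);   /* same width as the placeholder */
        rewind(f);
        if (fgets(out, (int)size, f) != NULL)
            ok = 0;
    }
    fclose(f);
    return ok;
}
```

Writing exactly as many characters as were reserved keeps the newline after the placeholder intact, which matters on systems where a line ends in a single `\n`.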

And it’s done. The assembler is ready to run. Go on, try it, and share your feedback.

References:-

  1. http://en.wikipedia.org/wiki/Assembly_language#Assembler
  2. http://searchdatacenter.techtarget.com/definition/assembler
  3. http://www.cse.iitb.ac.in/~dmd/

Note: This is not a perfect assembler and may have some bugs. It is meant only to illustrate basic assembler concepts. The code above is also not optimized.

25 thoughts on “Assembler in C”

    1. In the main function you have to call the three main methods, parser, pass1 and pass2, one after another. First open the input file and loop over its contents, calling parser and pass1 for each line. After the loop completes, call pass2. A sample main method looks like this:

      void main() {
      	FILE *fi,*fe,*fo;
      	clrscr();
      	fi=fopen("asm1.txt","r");
      	fe=fopen("asm1.txt","a+");
      	fo=fopen("asm.txt","w");
      	fprintf(fe,"\n\n");
      	while(c!=EOF){
      		parser(fi);
      		printf("\n");
      		pass1(fe,fo);
      	}
      	pass2(fo);
      	fclose(fi);
      	fclose(fe);
      	fclose(fo);
      	getch();
      }
      

      The global variables used above are declared as follows:

      static int sc=0,lc=0,tc=0,ltc=0,lnctr=1;
      int c='\0';   /* int rather than char, so that it can hold EOF */
      
    1. Here it is:

      int decode_lit(char a[10],FILE *fe)
      {
      	int i,length,j=0;
      	char val[5];
      	length=strlen(a);
      	for(i=0;i<ltc;i++)
      	{
      		if(!strcmp(literal[i].lit,buf.m))
      		{
      			if(literal[i].address==0)
      				literal[i].address=lc;
      			else
      			{
      				printf("\n Multiple declaration of %s",buf.m);
      				fprintf(fe,"\n Multiple declaration of %s",buf.m);
      			}
      		}
      	}
      
      	/* copy the digits that follow the leading =' and terminate the string */
      	for(i=2;i<length && j<4;i++,j++)
      		val[j]=a[i];
      	val[j]='\0';
      
      	return atoi(val);
      }
      

      Hope this helps.

  1. Lets see….. I will try this and hope it helps me in solving my problem. I have to develop a generalized assembler which will catch all errors and if there are no errors then creates target code. Hope this helps me. And does this program checks for mnemonics?

  3. ohh….no..i didnt check it……….. can u tell me how to run it in ubuntu….i mean what is the process to run……and also i need assembler for 8085 processors…will it work for 8085 processor also…
    ??

    1. Hi,

      you can run it same as you run any C/C++ program on Ubuntu. You can easily get steps to run a C program in Ubuntu on internet. I will get back to you soon on the 8085 processor question.

      Thanks.

  4. and another thing is that what are these “sc”,”lc” , “ltc”..and many more things that are used in this code???..like i<sc…what does this sc and other things refer to??

  5. mr. harryjoy when i run your code, my result is kinda bit different from your output. I was wondering why does all the number 5 at the beginning turn to *** in my output ?

    101) 90113
    102) 42116
    103) *** 2117
    104) 32117
    105) 13116
    106) *** 3117
    107) 63119
    108) 72104
    109) 82118
    110) *** 2115
    111) 100115
    112) 0 000
    113)
    115)
    11600 0 001
    117)
    11800 0 001
    11900 0 001
    120)

    1. Hi Maria,

      In normal scenarios, it should print the stars if the symbol that it is trying to parse is invalid/unknown. So have a look into your configs and input file, you might be having a symbol or variable that is unknown to your assembler.

      Thanks.

  6. Dear,
    Thanx for such a step by step analysis, yet after running, I have seen a error:
    “Error at line 5 :: Unknown mnemonics: again:”. How can I resolve this issue?
    Can u pls tell me, how can i generate internediate code from your pass 1

  7. Dear,
    Thank you very much for ur such a wealthy post and analysis. Yet, I found error after running the code which is :” Error at line 5 :: Unknown mnemonics: again:” and object code is not generated. Can you pls tell me how to resolve this? MOreover, I willbe glad if u advice me how to generate intermediate code from your pass1.
    Thank You

    1. Basic reason could be the mnemonics you have used in input file is not available or misspelled in the code in mnemonics struct. First step would be check for it.

      To generate intermediate output of pass1, drop calling pass2 and shut the file read/write and check the output file.

      Regards
