I have been using the FUNSD dataset to predict sequence labels in unstructured documents, following this paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding. After cleaning and converting from a dict to a dataframe, the data looks like this:
The dataset is laid out as follows:
- The column `id` is the unique identifier for each word group inside a document; the group's text is shown in column `text` (like nodes).
- The column `label` identifies whether the word group is classified as a 'question' or an 'answer'.
- The column `linking` denotes the word groups which are 'linked' (like edges), linking corresponding 'questions' to 'answers'.
- The column `box` denotes the location coordinates (x, y top left; x, y bottom right) of the word group relative to the top left corner (0, 0).
- The column `words` holds each individual word inside the word group, and its location (`box`).
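To make the layout concrete, here is a minimal sketch of two word groups as rows of a dataframe. The values are invented for illustration and are not real FUNSD rows; I am also assuming that `linking` holds `[from_id, to_id]` pairs and that each entry of `words` is a dict with `box` and `text` keys.

```python
import pandas as pd

# Two made-up word groups from a single document (illustrative values only).
rows = [
    {
        "id": 0,
        "text": "Date:",
        "label": "question",
        "box": [10, 20, 60, 35],          # x0, y0 (top left), x1, y1 (bottom right)
        "linking": [[0, 1]],              # question 0 is linked to answer 1
        "words": [{"box": [10, 20, 60, 35], "text": "Date:"}],
    },
    {
        "id": 1,
        "text": "March 3",
        "label": "answer",
        "box": [70, 20, 150, 35],
        "linking": [[0, 1]],
        "words": [
            {"box": [70, 20, 110, 35], "text": "March"},
            {"box": [115, 20, 150, 35], "text": "3"},
        ],
    },
]

df = pd.DataFrame(rows)
print(df[["id", "label", "text", "linking"]])
```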
I aim to train a Graph Neural Net classifier to identify which words in the `words` column are linked together. The first step is to transform my current dataset into a network graph. My questions are as follows:
- Is there a way to break each row in the column `words` into two columns `[box_word, text_word]`, each holding only one word, while replicating the other columns, which remain the same: `[id, label, text, box]`? The result would be a final dataframe with these columns: `[box, text, label, box_word, text_word]`.
- I can tokenize the columns `text` and `text_word`, one-hot encode the column `label`, and split the multi-valued numeric columns `box` and `box_word` into individual columns, but how do I split up or rearrange the column `linking` to define the edges of my network graph?
- Am I taking the correct route in using the dataframe to generate a network, and then using it to train a GNN?
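For reference, here is a rough sketch of how I imagine the first two steps in pandas. It rests on assumptions about my cleaned frame: that `words` holds a non-empty list of dicts with `box` and `text` keys, and that `linking` holds a list of `[from_id, to_id]` pairs (possibly empty). The toy rows and the column names `source` and `target` are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; real rows would come from the cleaned FUNSD data.
df = pd.DataFrame([
    {"id": 0, "text": "Date:", "label": "question", "box": [10, 20, 60, 35],
     "linking": [[0, 1]],
     "words": [{"box": [10, 20, 60, 35], "text": "Date:"}]},
    {"id": 1, "text": "March 3", "label": "answer", "box": [70, 20, 150, 35],
     "linking": [[0, 1]],
     "words": [{"box": [70, 20, 110, 35], "text": "March"},
               {"box": [115, 20, 150, 35], "text": "3"}]},
    {"id": 2, "text": "Confidential", "label": "other", "box": [10, 50, 120, 65],
     "linking": [],
     "words": [{"box": [10, 50, 120, 65], "text": "Confidential"}]},
])

# Step 1: one row per word, with the group-level columns repeated.
words = df.explode("words", ignore_index=True)
words["box_word"] = words["words"].map(lambda w: w["box"])
words["text_word"] = words["words"].map(lambda w: w["text"])
words = words.drop(columns=["words"])

# Step 2: flatten `linking` into an edge list of (source, target) group ids.
# Exploding yields one [from_id, to_id] pair per row; groups with no links become NaN.
pairs = df["linking"].explode().dropna()
edges = (
    pd.DataFrame(pairs.tolist(), columns=["source", "target"])
    .drop_duplicates()          # each link appears on both of its endpoints
    .reset_index(drop=True)
)

print(words[["id", "label", "text_word", "box_word"]])
print(edges)
```

Since the `id` values are only unique within a document, I assume a document identifier would have to be carried along in both frames once more than one form is loaded.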
Any and all help or tips are appreciated.