Reading tabbed text in SVG

Discuss SVG code, accessible via the XML Editor.
UncleJosh
Posts: 2
Joined: Sat Jul 11, 2015 4:26 am

Postby UncleJosh » Sat Jul 11, 2015 6:17 am

I have a series of PDFs that include rows of text that are clearly formatted into a table:

5/31/2015  00  Adventure       3     150   0.001  1.50
5/31/2015  00  Excitement      1      65   0.10   6.50


I am using Inkscape to convert these to plain SVG. When I explore the SVG, I find svg:text nodes with svg:tspan nodes inside them.
Those nodes have text values of 5/31/201511Adventure31500.0011.50 and 5/31/2015Excitement1650.1065.0

I am familiar with XML, but not so much with SVG. I can't detect any tab characters in these text values, or any other characters that would indicate the column format, such as a field break.

Is there a different tag I should be searching for to get this kind of formatting information?

~suv
Posts: 2272
Joined: Sun May 10, 2009 2:07 am

Re: Reading tabbed text in SVG

Postby ~suv » Sat Jul 11, 2015 8:16 am

There is no "tabbed" text (or tabular data) in SVG 1.1 (or Inkscape for that matter).

In Inkscape, text from imported PDF files (i.e. PDF content converted to SVG structure) has each letter positioned with absolute coordinates (also explained here under 'Text editing tips'). You can, for example, find the list of x coordinates in the 'x' attribute of the <tspan> element (the relevant attributes are explained in the SVG 1.1 specification). Keep in mind that those coordinates are affected by the transformations stored in the 'transform' attribute of parent objects (e.g. on the <text> element as well as the parent layer group): the content of PDFs is scaled (by 1.25) and flipped vertically relative to the SVG coordinate system. If you plan to reassemble the individually positioned letters into meaningful words and tabular data, you will probably have to apply the parent transforms to the coordinates of the individual letters.
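As a rough illustration of the point about parent transforms, here is a minimal Python sketch that reads the per-letter 'x' list from each <tspan> and pushes it through the 'transform' attributes of its ancestors. The sample SVG is made up (it only imitates the shape of an Inkscape PDF import), the helper names parse_matrix, apply_matrix and letter_positions are hypothetical, and only the matrix(a,b,c,d,e,f) form of 'transform' is handled; translate(), scale() and the like would need extra code.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Made-up sample: a layer group that scales by 1.25 and flips vertically,
# and a <text> element with its own matrix, as a PDF import might produce.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g transform="matrix(1.25,0,0,-1.25,0,1000)">
    <text transform="matrix(1,0,0,-1,50,700)">
      <tspan x="0 6 12 40 46" y="0" style="font-size:10px">abcde</tspan>
    </text>
  </g>
</svg>"""


def parse_matrix(value):
    """Parse 'matrix(a,b,c,d,e,f)'; a missing attribute means identity."""
    if not value:
        return (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    match = re.fullmatch(r"\s*matrix\(([^)]*)\)\s*", value)
    if match is None:
        # Assumption of this sketch: other transform forms are not supported.
        raise ValueError("only matrix(...) is handled here: %r" % value)
    numbers = [float(n) for n in re.split(r"[\s,]+", match.group(1).strip())]
    if len(numbers) != 6:
        raise ValueError("matrix() needs six numbers: %r" % value)
    return tuple(numbers)


def apply_matrix(m, x, y):
    """Apply an SVG matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def letter_positions(node, ancestors=()):
    """Yield (letter, x, y) in root coordinates for every <tspan> below node."""
    chain = ancestors + (parse_matrix(node.get("transform")),)
    if node.tag == SVG_NS + "tspan":
        xs = [float(v) for v in node.get("x", "").replace(",", " ").split()]
        # Use the first 'y' value for the whole span (0 if 'y' is absent).
        y = float((node.get("y") or "0").replace(",", " ").split()[0])
        for letter, x in zip(node.text or "", xs):
            px, py = x, y
            for m in reversed(chain):  # innermost transform is applied first
                px, py = apply_matrix(m, px, py)
            yield letter, px, py
    for child in node:
        yield from letter_positions(child, chain)


for letter, x, y in letter_positions(ET.fromstring(SAMPLE)):
    print(letter, x, y)
```

The transforms are carried down the recursion because ElementTree elements do not keep a reference to their parent. If only the gaps between letters within one <tspan> matter, a uniform scale changes all gaps by the same factor, so the raw 'x' values may already be enough for splitting a single line.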

<opinion>To the best of my knowledge, PDF was originally designed as a display format (same visual output on different output devices), never as a data exchange format. Extracting tabular data from PDF files will likely pose similar issues even if no conversion to a file format like SVG is involved.</opinion>

UncleJosh
Posts: 2
Joined: Sat Jul 11, 2015 4:26 am

Re: Reading tabbed text in SVG

Postby UncleJosh » Sat Jul 11, 2015 9:22 am

Thank you.

I pieced together a simple function that works to separate each line as I need it:

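def svg_tspan_to_wordlist(span_node):
    """Return a list of word groups broken by apparent columns."""
    text = span_node.text or ''

    # horizontals is the position list (one x value per letter) of span_node.text
    horizontals = [float(item) for item in span_node.get('x', '').replace(',', ' ').split()]

    # parse the inline style (e.g. "font-size:10px;fill:#000000") into a dict,
    # skipping the empty item left behind by a trailing ';'
    style_items = span_node.get('style', '').split(';')
    style_dict = dict(item.split(':', 1) for item in style_items if ':' in item)
    style_dict = {key.strip(): value.strip() for key, value in style_dict.items()}
    font_size = float(style_dict['font-size'].replace('px', ''))

    # distance from each letter to the previous one (0 for the first letter)
    hdiffs = [0.0] + [b - a for a, b in zip(horizontals, horizontals[1:])]

    words = []
    letters = []
    # stop at the shorter of the two lists so a mismatch cannot raise IndexError
    for idx in range(min(len(horizontals), len(text))):
        # a gap wider than the font size is treated as a column break
        if hdiffs[idx] > font_size:
            words.append(''.join(letters))
            letters = []
        letters.append(text[idx])
    words.append(''.join(letters))

    return words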


This will save me a lot of time and disc space.

~suv
Posts: 2272
Joined: Sun May 10, 2009 2:07 am

Re: Reading tabbed text in SVG

Postby ~suv » Sat Jul 11, 2015 4:23 pm

Thank you for sharing the snippet with your solution - splitting words based on dx > font-size seems to work reasonably well and returns the expected result (at least it did so for me with a fairly randomly picked test case: a PDF file with a timetable, whose text Inkscape imports in much the same way as you described earlier).
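To make the dx > font-size test concrete, here is a small self-contained sketch of the same idea on a made-up row. The helper name split_on_gaps, the sample text and the x positions are all hypothetical; it is only meant to show where the threshold triggers, not to replace the function posted above.

```python
def split_on_gaps(text, xs, threshold):
    """Split text wherever the step between neighbouring x positions exceeds threshold."""
    words, letters = [], []
    previous = None
    for letter, x in zip(text, xs):
        # A step larger than the threshold is treated as a column break.
        if previous is not None and x - previous > threshold:
            words.append("".join(letters))
            letters = []
        letters.append(letter)
        previous = x
    words.append("".join(letters))
    return words


# Made-up row: letters advance by 6 units inside a column,
# and jump by 34 and 48 units between columns.
row_text = "00Adv3"
row_x = [0, 6, 40, 46, 52, 100]
font_size = 10.0  # assumed value, as if read from style="font-size:10px"

print(split_on_gaps(row_text, row_x, font_size))
```

With these numbers the steps are 6, 34, 6, 6 and 48, so the row splits after "00" and after "Adv". Two columns that sit closer together than one font size would stay merged, so the threshold may need tuning for a particular document.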

