Opened 17 years ago

Closed 14 years ago

#484 closed Feature Requests (fixed)

Boost.Iostreams and newline translation

Reported by: jpstanley
Owned by: Jonathan Turkanis
Milestone: Boost 1.36.0
Component: iostreams
Version: None
Severity: Optimization
Keywords:
Cc:

Description

I like the Boost.Iostreams framework in Boost 1.33. 
But I haven't found a way to make it do exactly what I
want.

I maintain software that parses files out of disk
images instead of the local file system.  I'd like to
be able to wrap a C++ istream around my raw binary
read/seek APIs to make it easier to parse data from
disk images and interface with existing code. 
Boost.Iostreams makes this job a piece of cake--except
that I haven't found a way to make the stream act like
an ifstream opened in text mode (I'm on Win32, btw). 
In other words, I'd like to have CR, LF, and CRLF line
endings translated into \n when I read the data.  There
is a newline_filter class provided which does nearly
exactly what I want--but it doesn't support seeking.  I
still need seekg() and tellg() to work.

I tried writing my own filter, but I ran into some
problems.  It seems the Boost stream buffer class is
built on the assumption that this will hold:

std::streamsize P0 = str.tellg();
str.read(buf, 100);
std::streamsize P1 = str.tellg();
assert(P1 - P0 == 100);

But with an ifstream opened in text mode, this
assertion can fail (P1 - P0 > 100) if two-character
CRLF combinations were translated into a one-character
newline in the intervening data.

Is it possible to implement an input-seekable filter
that will let a Boost.Iostream behave this way?

Change History (5)

comment:1 by Jonathan Turkanis, 17 years ago

Thanks for your interest in Boost.Iostreams.

SourceForge.net wrote:
> Submitted By: jpstanley (jpstanley)

> I maintain software that parses files out of disk
> images instead of the local file system.  I'd like to
> be able to wrap a C++ istream around my raw binary
> read/seek APIs to make it easier to parse data from
> disk images and interface with existing code.
> Boost.Iostreams makes this job a piece of cake--except
> that I haven't found a way to make the stream act like
> an ifstream opened in text mode (I'm on Win32, btw).
> In other words, I'd like to have CR, LF, and CRLF line
> endings translated into \n when I read the data.  There
> is a newline_filter class provided which does nearly
> exactly what I want--but it doesn't support seeking.  I
> still need seekg() and tellg() to work.
> 
> I tried writing my own filter, but I ran into some
> problems.  It seems the Boost stream buffer class is
> built on the assumption that this will hold:
> 
> std::streamsize P0 = str.tellg();
> str.read(buf, 100);
> std::streamsize P1 = str.tellg();
> assert(P1 - P0 == 100);

Not quite: you could reach EOF before reading 100
characters ;-) But I know what you mean.

When you are reading a filtered sequence, there are two file
pointers to worry about: there's the current position in
the filtered sequence, and the current position in the
unfiltered sequence. When you successfully read 100
characters from the filtered sequence, as here:

    str.read(buf, 100);

the current position in the *filtered* sequence is advanced
by exactly 100 characters. This may correspond to more than
100 or fewer than 100 characters in the unfiltered sequence. 
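
To make the two positions concrete, here is a minimal sketch in plain
standard C++ (it does not use Boost.Iostreams). The names Positions and
read_translated are hypothetical helpers invented for this illustration.
The function collapses CR, LF, and CRLF into '\n' and reports both how many
filtered characters were delivered and how many unfiltered bytes were
consumed to produce them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type (not part of any library): the two positions
// that exist when reading through a newline-translating filter.
struct Positions {
    std::size_t filtered;    // characters handed to the caller
    std::size_t unfiltered;  // bytes consumed from the raw data
};

// Reads up to `count` translated characters from `raw`, appending them to
// `out`. CR, LF, and CRLF each become a single '\n'.
Positions read_translated(const std::string& raw, std::size_t count,
                          std::string& out) {
    Positions pos{0, 0};
    while (pos.filtered < count && pos.unfiltered < raw.size()) {
        char c = raw[pos.unfiltered++];
        if (c == '\r') {
            // A CR followed by LF is one line ending: swallow the LF too.
            if (pos.unfiltered < raw.size() && raw[pos.unfiltered] == '\n')
                ++pos.unfiltered;
            c = '\n';
        }
        out.push_back(c);
        ++pos.filtered;
    }
    // The filtered position can never run ahead of the unfiltered one here.
    assert(pos.filtered <= pos.unfiltered);
    return pos;
}
```

Reading 6 characters from "Hello\r\nWorld!" with this sketch delivers
"Hello\n": the filtered position advances by 6 while the unfiltered
position advances by 7.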
 
> But with an ifstream opened in text mode, this
> assertion can fail (P1 - P0 > 100) if two-character
> CRLF combinations were translated into a one-character
> newline in the intervening data.

Here you're talking about the current position in the
unfiltered sequence. To query this value, you can't use
str.tellg(); you have to call seek() on the underlying filter.

> Is it possible to implement an input-seekable filter
> that will let a Boost.Iostream behave this way?

If you want to be able to seek within the filtered sequence,
it's possible, but it will be inefficient. If you want to be
able to seek with offsets interpreted relative to the
unfiltered sequence, I'm not sure it can be done. I'd like
you to describe more exactly what you want to do.

-- 
Jonathan Turkanis
www.kangaroologic.com

comment:2 by jpstanley, 17 years ago

Thank you for your timely response.  

The main reason I want seekg() and tellg() to work is so
that I can remember where in the file I found a piece of
data and come back to it later.  For example, suppose I want
to build an index of an mbox file (essentially a large text
file with one email message following another).  When I have
the read head positioned at the start of a message, I use
tellg() to retrieve a pointer to it.  Then I can use seekg()
later on when I want to retrieve the message.
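
As a rough sketch of that indexing pattern, assuming mbox-style data where
each message starts with a line beginning "From ", something like the
following would do. The function name index_messages is made up for this
example, and it works on any std::istream, so it says nothing about how a
Boost.Iostreams filter would behave.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // only needed by callers that test with a string stream
#include <string>
#include <vector>

// Hypothetical helper: records the stream position of every line that
// starts with "From ", so each message can be revisited with seekg().
std::vector<std::streampos> index_messages(std::istream& in) {
    std::vector<std::streampos> starts;
    std::string line;
    while (true) {
        // Save the position before reading the line it belongs to.
        std::streampos here = in.tellg();
        if (!std::getline(in, line)) break;
        if (line.compare(0, 5, "From ") == 0) starts.push_back(here);
    }
    in.clear();  // reset eof/fail so the caller can seek afterwards
    return starts;
}
```

The saved positions are only ever passed back to seekg() unchanged; the
code never does arithmetic on them, which is the property that matters for
the rest of this request.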

A byte offset in the unfiltered sequence would be the most
natural form of pointer--I could retrieve a single message
quickly, without filtering everything in front of it.  Of 
course I couldn't compare pointers to learn the size of a
message in filtered characters, but that's not my primary
objective.

To try and make the behavior I expect a little clearer,
here's an example program:

// test.txt contains the sequence "Hello\x0D\x0AWorld!"
// Compile on MSVC++ with CL /GX newline_test.cpp

#include <iostream>
#include <fstream>

int main(int argc, char **argv)
{
	std::ifstream infile("test.txt");

	infile.seekg(5, std::ios::beg);
	std::cout << infile.tellg() << std::endl;

	infile.get();
	std::cout << infile.tellg() << std::endl;

	infile.close();
	return 0;
}

The seekg() statement positions the read head at the CR
character.  The first tellg() returns, unsurprisingly, 5. 
Then get() extracts both the CR and the LF from the
unfiltered sequence and returns '\n'.  The second tellg()
call returns 7, even though there is only one character in
the filtered sequence between the two points.

comment:3 by Daryle Walker, 15 years ago

Component: None → iostreams
Severity: Problem

comment:4 by Jonathan Turkanis, 15 years ago

Milestone: Boost 1.36.0
Severity: Problem → Optimization
Type: Support Requests → Feature Requests

I'm sorry I let this go so long.

I will consider it as a feature request for a form of seekability weaker than that currently supported, in which you can only request that the file pointer be restored to a location that was previously saved. A weak-seekable newline filter would then not have to worry about seeking relative to the current location, but would just have to remember the file offsets of the downstream device at various previously queried locations.

It may already be possible to implement this in the current library, but introducing a new concept might clarify the situation.

I will consider implementing this in 1.36.

comment:5 by Jonathan Turkanis, 14 years ago

Resolution: None → fixed
Status: assigned → closed

I've decided the correct way to solve this problem is to write a seekable filter adapter that provides an implementation of seek in terms of user-supplied implementations of read and write. I have added this idea to the list of possible new filters and devices in the Iostreams Roadmap.

In case the wiki entry changes, here is the current description: "A seekable_filter_adapter that provides an implementation of seek() when the user has defined read() and write(). seek() would work by checking whether the offset is relative to the beginning of the stream, and if so, whether it corresponds to a previously saved offset. If so, it fetches the stored offset in the unfiltered stream and performs a seek on the downstream device. This would solve the problem raised by #484"
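
The saved-offset lookup described there could be sketched roughly as follows. This is not the proposed seekable_filter_adapter and does not use the Boost.Iostreams API: the class BookmarkingNewlineReader and its get(), tell(), and seek() members are hypothetical, and a std::string stands in for the downstream device. It only illustrates the idea that a seek from the beginning succeeds when the target was previously saved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration only; a std::string plays the downstream device.
class BookmarkingNewlineReader {
public:
    explicit BookmarkingNewlineReader(std::string raw) : raw_(std::move(raw)) {}

    // Returns the next translated character, or -1 at end of data.
    // CR, LF, and CRLF are each delivered as a single '\n'.
    int get() {
        if (raw_pos_ >= raw_.size()) return -1;
        char c = raw_[raw_pos_++];
        if (c == '\r') {
            if (raw_pos_ < raw_.size() && raw_[raw_pos_] == '\n') ++raw_pos_;
            c = '\n';
        }
        ++filtered_pos_;
        return static_cast<unsigned char>(c);
    }

    // Reports the filtered position and remembers the matching raw offset.
    std::size_t tell() {
        saved_[filtered_pos_] = raw_pos_;
        return filtered_pos_;
    }

    // Seeks to an offset measured from the beginning of the filtered
    // sequence. Succeeds only if tell() previously returned that offset.
    bool seek(std::size_t filtered_offset) {
        auto it = saved_.find(filtered_offset);
        if (it == saved_.end()) return false;
        filtered_pos_ = it->first;
        raw_pos_ = it->second;  // the "seek on the downstream device"
        return true;
    }

private:
    std::string raw_;
    std::size_t raw_pos_ = 0;       // position in the unfiltered data
    std::size_t filtered_pos_ = 0;  // position in the filtered sequence
    std::map<std::size_t, std::size_t> saved_;  // filtered -> unfiltered
};
```

With "Hello\r\nWorld!" as input, calling tell() after six get() calls saves the pair (6, 7); a later seek(6) restores the raw offset to 7 without re-filtering, while seek(3) fails because that position was never saved.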
